best speedup and efficiency achieved

[ue-pp-squarematrixparallelisation.git] / MatrixParallelPrograming.tex
diff --git a/MatrixParallelPrograming.tex b/MatrixParallelPrograming.tex

index 7dd727639e5e737072fa341b10c489a4633ea3de..aca5c5ea3347a6ff7137a7bf2fd85771b7b8184e 100644 (file)
--- a/MatrixParallelPrograming.tex
+++ b/MatrixParallelPrograming.tex
@@ -95,29 +95,31 @@ The program was executed three times using the time command, and the average exe
  m\_rows & no HPC (sec) & with HPC (sec)\\
  \hline
  30k & 5,482 & 5,395 \\
  m\_rows & no HPC (sec) & with HPC (sec)\\
  \hline
  30k & 5,482 & 5,395 \\
-40k & error & 8.886 \\
+40k & error & 9.286 \\
  \end{tabular}
  \end{center}
  \caption{program's total time execution per amount of matrix rows.}
  \label{sequencial program time results}
  \end{table}%
  
  \end{tabular}
  \end{center}
  \caption{program's total time execution per amount of matrix rows.}
  \label{sequencial program time results}
  \end{table}%
  
-We could now use the \textbf{time.h} library to evaluate the time spent on each operation, 1st allocate memory, 2nd building the matrix and 3rd search for the biggest value, but we save this exercise for later. 
+We could now use the \textbf{time.h} library to evaluate the time spent on each operation, 1st allocate memory, 2nd building the matrix and 3rd search for the biggest value, we perform this exercise in next section. 
  
  \section*{Parallel Strategies}
  
  \section*{Parallel Strategies}
-From the three tasks presented previously, the search for the biggest value is the one that we should parallelise, therefore, a certain number of threads are spawned to split the task among them with the usage of pthread\_create() and pthread\_join() functions from \textbf{pthread.h} library.
+From the three tasks presented previously, the search for the biggest value is the one that we will parallelise first, therefore, a certain number of threads are spawned to split the task among them with the usage of pthread\_create() and pthread\_join() functions from \textbf{pthread.h} library.
  
  With the current implementation of our program, the global variable hv, used to store the highest value found during the search operation, will be shared between threads by the time we execute a parallelised version of it. We need to take into account this \textit{race condition} and decide on a strategy to overcome it. The usage of a pre variable lock() and after unlock() \textbf{muttex} pattern will stop the execution of threads waiting for the access to that variable, due to the nature of our problem, we update this variable to be \textbf{atomic int} instead, but this change alone actually increases the computation time up to 18,35 seconds with $m\_rows = 40k$, which again, suggests the use of \textbf{time.h} library to better understand the contribution of each parallel strategy presented in the following sections of this document. The next picture displays the time our program toke on each task after applying the implementations mentioned before:
  
   \begin{figure}[h]
         \centering
  
  With the current implementation of our program, the global variable hv, used to store the highest value found during the search operation, will be shared between threads by the time we execute a parallelised version of it. We need to take into account this \textit{race condition} and decide on a strategy to overcome it. The usage of a pre variable lock() and after unlock() \textbf{muttex} pattern will stop the execution of threads waiting for the access to that variable, due to the nature of our problem, we update this variable to be \textbf{atomic int} instead, but this change alone actually increases the computation time up to 18,35 seconds with $m\_rows = 40k$, which again, suggests the use of \textbf{time.h} library to better understand the contribution of each parallel strategy presented in the following sections of this document. The next picture displays the time our program toke on each task after applying the implementations mentioned before:
  
   \begin{figure}[h]
         \centering
-       \includegraphics[width=0.85\textwidth]{imagens/matrixsearchsequencial-02.png}  
+       \includegraphics[width=0.6\textwidth]{imagens/matrixsearchsequencial-02.png}  
  \end{figure}
  
  \end{figure}
  
-From the previous image we can be sure the 3rd task is the one to parallelise. We should also notice that the changes performed to the program actually increase it's time of execution, the view operation performed on this type of variable is more resource consuming than when performed on a variable of type integer. This can be proved by analysing the time for each task on the first version of our program, populating the matrix toke around 4.5 seconds while the search matrix task toke less than 4 seconds for 40k rows. The wall-clock time of the above figure is 20.460 seconds, this representes the real time a users waits for a program to finish,
+From the previous image we can be sure the 3rd task is the one to parallelise. We should also notice that the changes performed to the program actually increase it's time of execution, the view operation performed on this type of variable is more resource consuming than when performed on a variable of type integer. This can be proved by analysing the time for each task on the first version of our program, populating the matrix toke around 4.5 seconds while the search matrix task toke less than 4 seconds for 40k rows. The wall-clock time of the above figure is 18,931 seconds, this representes the real time a users waits for a program to finish,
  
  \subsection*{Columns Split}
  
  \subsection*{Columns Split}
-Since the matrix in the spotlight is a square one it doesn't really matter if we perform a split by columns or rows. What matters is the speed one of the about to be spawned threads takes to find the highest value present in the matrix. This way, threads will only perform read on hv global var and not write, which is a more time consuming computer operation.
+Because the matrix in the spotlight is a square one, at first approach, one might think it doesn't really matter if a split is performed by columns or rows, but it does. Since C approaches 2D arrays per row, by the time the cpu access the 32 or 64 bytes of the matrix memory stored in ram to make it available closer, at LP2 or LP3, it brings the closest row values also, making it faster to loop though rows instead of columns. 
+
+Again, one might thinks that what matters to the performance of such program, is the speed one of the about to be spawned threads takes to find the highest value present in the matrix. But the amount of memory access during a program execution will definitely affect it's performance. For such big matrix, with this many elements, for sure the way we loop through it will affect performance. %This way, threads will only perform read on hv global var and not write, which is a more time consuming computer operation.
  
  A new function, parallel\_search\_cols(void *args), was defined in order to share the work between threads, it expects the start and end column indices as arguments in order to split the matrix in as many slots as needed. A second function, run\_parallel\_search(int num\_threads), was also defined to make the program more useful by allowing to define the amount of threads to be created. 
  
  
  A new function, parallel\_search\_cols(void *args), was defined in order to share the work between threads, it expects the start and end column indices as arguments in order to split the matrix in as many slots as needed. A second function, run\_parallel\_search(int num\_threads), was also defined to make the program more useful by allowing to define the amount of threads to be created. 
  
@@ -125,20 +127,33 @@ The best execution times obtained on this scenario were with the creation of 600
  
  \begin{figure}[h]
         \centering
  
  \begin{figure}[h]
         \centering
-       \includegraphics[width=0.8\textwidth]{imagens/matrixsearchsequencial-03.png}
-       \caption{Time taken by all threads on the search matrix task.}
+       \includegraphics[width=0.6\textwidth]{imagens/matrixsearchsequencial-03.png}
+       \caption{Time taken by all threads on search matrix task.}
  \end{figure}
  
  \end{figure}
  
-A performance analysis is computed later this document, but we shall advance that the wall-clock time for 800 threads is around 8,21 seconds (around 1 millisecond per thread), which is better than the sequencial program in table 1. Next we present the strategy to split the matrix by blocks.
+A performance analysis is computed later this document, but we shall advance that the wall-clock time for 800 threads is around 8,21 seconds (around 1 millisecond per thread), which is better than the sequencial program in table 1. The next figure plots the amount of user time taken by the columns parallel solution implemented, it's more comum to observe then the total cpu time.
+
+\begin{figure}[h]
+       \centering
+       \includegraphics[width=0.6\textwidth]{imagens/matrixsearchsequencial-04.png}
+       \caption{Total time taken by program with columns parallelisation.}
+\end{figure}
  
  
  \subsection*{Blocks Split}
  
  
  \subsection*{Blocks Split}
-We updated the function run\_parallel\_search (int num\_threads), in order to divide the matrix in blocks of the same size. First it splits the matrix in two and calculate the amount of blocks for each side by the square root of the amount of blocks to be created. If we pick 400 threads for example, means we have 400 blocks to be worked out during the 3rd task. Each of this blocks has 2000 rows and columns, $\sqrt{400} = 20$,which means 20 blocks for each side of the first split, and then we can validate the amount of items in the blocks with: $20*(2000*2000)*2=160k$.
+We updated the function run\_parallel\_search (int num\_threads), in order to divide the matrix in blocks of the same size. First it splits the matrix in two and calculate the amount of blocks for each side by the square root of the amount of blocks to be created. If we pick 400 threads for example, means we have 40 blocks to be worked out during the 3rd task. Each of this blocks has 2000 rows and columns, $\sqrt{400} = 20$,which means 20 blocks for each side of the first split, and then we can validate the amount of items in the blocks with: $20*(2000*2000)*2=160k$. 
  
  We also defined a new struct object to carry the start/end row and column for each block assigned to a created thread, this helps to define the block's rows and columns boundaries. The function parallel\_search\_block() receives this list of arguments by the time is invoked by pthread\_create() and performs the comparison of matrix index with the atomic int hv variable.
  
  From figure 1 we can advance that, in general, both strategies have similar performances, but the second solution appears to scale better than the first one. Some of the spikes present in the graph for this solution could be explained by the size of sub-matrixes to assign on threads. The amount of threads used in those cases do not divide the matrix in perfectly square sub-matrixes, all threads access the atomic hv var on each iteration, therefor some weird behaviours could happen when trying to access it's address.
  
  
  We also defined a new struct object to carry the start/end row and column for each block assigned to a created thread, this helps to define the block's rows and columns boundaries. The function parallel\_search\_block() receives this list of arguments by the time is invoked by pthread\_create() and performs the comparison of matrix index with the atomic int hv variable.
  
  From figure 1 we can advance that, in general, both strategies have similar performances, but the second solution appears to scale better than the first one. Some of the spikes present in the graph for this solution could be explained by the size of sub-matrixes to assign on threads. The amount of threads used in those cases do not divide the matrix in perfectly square sub-matrixes, all threads access the atomic hv var on each iteration, therefor some weird behaviours could happen when trying to access it's address.
  
+Next we plot the total time taken by the program with this parallel solution.
+
+\begin{figure}[h]
+       \centering
+       \includegraphics[width=0.6\textwidth]{imagens/matrixsearchsequencial-05.png}
+       \caption{Total time taken by program with blocks parallelisation.}
+\end{figure}
  
  \subsubsection*{Performance Analysis}
  In order to evaluate the performance of our program when parallelised with both of previous strategies we compute the wall-clock cpu time obtained per threads, available in table 2. 
  
  \subsubsection*{Performance Analysis}
  In order to evaluate the performance of our program when parallelised with both of previous strategies we compute the wall-clock cpu time obtained per threads, available in table 2. 
@@ -165,11 +180,11 @@ Finally, the \textbf{scalability} of a program measures how well a parallel prog
  1000 & 8.126 & 10.560 
  \end{tabular}
  \end{center}
  1000 & 8.126 & 10.560 
  \end{tabular}
  \end{center}
-\caption{Wall-CPU Time (in seconds) taken for each strategy per threads used.}
+\caption{Program user time for each strategy per threads, for matrix with 40k rows and columns.}
  \label{wall-cpu time of parallel strategies}
  \end{table}%
  
  \label{wall-cpu time of parallel strategies}
  \end{table}%
  
-Combining values from tables 1 and 2, and using the formulas presented before we can compute the performance metrics for both strategies:
+Combining values from tables 1 and 2, and using the formulas presented before, we can compute the performance metrics for both strategies:
  
  \begin{table}[htp]
  \begin{center}
  
  \begin{table}[htp]
  \begin{center}
@@ -186,17 +201,41 @@ Blocks & 49 & 1.0526 & 2.1
  \label{performance of parallel strategies}
  \end{table}%
  
  \label{performance of parallel strategies}
  \end{table}%
  
-We can conclude that the parallelisation of our initial program didn't got a significant performance gain. Probably due to the nature of our problem, since the biggest value is always in the last entry of the matrix, threads will have to loop through the entire batch of items to compare values. But also because of the implementation applied, as mentioned before, the function run\_parallel\_search (int num\_threads) could be improved in order to access the atomic variable only once per thread.
+We can conclude that the parallelisation of our initial program didn't got a significant performance gain. Probably due to the nature of our problem, since the biggest value is always in the last entry of the matrix, threads will have to loop through the entire batch of items to compare values. But also because of the implementation applied, as mentioned before, the function run\_parallel\_search (int num\_threads) could be improved in order to access the atomic variable only once per thread. 
  
  
-In order to evaluate scalability we have to compute the previous metrics for both strategies when the matrix size increases from 29k to 40k rows and columns. We choose to use 64 threads for this experience. 
- 
+The metric scalability can be divided in two: \underline{strong scaling}, when the objective is to evaluate how faster a program gets as resources (or threads) are added to a fixed problem size - we already computed this analysis in table 2. There, we can see that the strategy to parallelise the search matrix task by columns has a better performance with the increase of resources. The blocks strategy algorithm loses performance sooner, and if we look at figure 1, we can notice that it needs much more resources to achieve similar total cpu time as the columns split. The \underline{weak scalability} is measured to evaluate if a program can handle larger problems in the same amount of time by adding more resources, ideally, the execution time should remain constant as both work and threads are scaled. From the next plot we can conclude that the columns split is a better strategy for the peculiar situation in study.
  
  
-\section*{Conclusions}
-Modern CPUs can spawn thousands of virtual threads to perform calculations.\\
+\begin{figure}[h]
+       \centering
+       \includegraphics[width=0.8\textwidth]{imagens/matrixsearchsequencial-06.png}
+       \caption{Weak scalability evaluation from 1 to 8 threads.}
+\end{figure}
  
  
-The action of populating the matrix could also update the value of global variable hv instead of the 3rd task. Search the matrix could have a local variable to store the highest value found and compare with the global var hv at the end.\\
+Since there wasn't a big gain when parallelising the search task we could either try to develop a new parallel strategy for the search task or parallelise the matrix initialisation instead. 
+
+\section*{Improving performance}
+We start by updating the function parallel\_search\_cols() to have a local variable to store the highest value found by the thread and perform only one comparison to the atomic variable. This change alone makes the program finishes with a mean value of 8.192 seconds (using 8 threads for a 40k rows and columns matrix), comparing with the values on tables 1 and 2 we can validate the benefit of this change by computing $efficiency = \frac{1.1335}{8} = 14.16 \%$.
+
+The second task, matrix initialisation, simulates the sequential section of a program in this study, but we could try to parallelise this operation has an extra exercise. The function init\_parallel\_matrix() was defined in order to populate segments of the matrix by different threads with a similar algorithm used for columns search. With all configurations out of the way,  the program now takes 4,821 seconds of wall-clock time with 4 threads. The following metrics were obtained:
+
+
+\begin{figure}[h]
+\begin{minipage}[c]{0.45\textwidth}
+       \begin{itemize}
+               \item speedup = 1,9262
+               \item efficiency = 48,15\%
+       \end{itemize}   
+\end{minipage}
+\begin{minipage}[c]{0.5\textwidth}
+\includegraphics[width=0.925\textwidth]{imagens/matrixsearchsequencial-07.png}
+\end{minipage}
+\end{figure}
+
+
+\section*{Conclusions}
+This study revealed a few characteristics of modern cpu behaviours. Apart from being able to spawn thousands of virtual threads to perform calculations, way more than the expected, the framework they use can also predict and make computations easy. On the last experiment of parallelising both tasks, the program takes very similar total cpu time executing them, less than 9 seconds, or less than 2,25 seconds each thread, if we compare with values presented previously for the 2nd and 3rd task before parallelisation, considerably away from each other, we can almost argue that the cpu still "remembers" the matrix while searching it.
  
  
-The clock() function from time.h library is useless to compute wall-clock time and therefore should be ignored on metrics evaluations.
+Parallelising a simple search algorithm is heavily influenced by memory access patterns and synchronization overhead. While modern cpus can spawn multiple threads, the transition from sequential to parallel is not always linear. Our initial attempt using atomic variables showed that high contention can actually degrade performance. By implementing local thread variables, we achieved a speedup of 1.13x, and pre-parallelisation of matrix initialisation improved it to almost 2x using less resources.