X-Git-Url: https://vgcfreebox.myrthtech.pt/gitweb/ue-pp-squarematrixparallelisation.git/blobdiff_plain/cbeee37d9603c7a8564cb1710842159b06785c43..b0e2518d9987a533cb5c7dd24d8de46f5f506186:/MatrixParallelPrograming.tex?ds=inline

diff --git a/MatrixParallelPrograming.tex b/MatrixParallelPrograming.tex
index cc1d5cb..7dd7276 100644
--- a/MatrixParallelPrograming.tex
+++ b/MatrixParallelPrograming.tex
@@ -95,7 +95,7 @@ The program was executed three times using the time command, and the average exe
 m\_rows & no HPC (sec) & with HPC (sec)\\
 \hline
 30k & 5,482 & 5,395 \\
-40k & error & 10,081 \\
+40k & error & 8.886 \\
 \end{tabular}
 \end{center}
 \caption{program's total time execution per amount of matrix rows.}
@@ -137,14 +137,58 @@ We updated the function run\_parallel\_search (int num\_threads), in order to di
 
 We also defined a new struct object to carry the start/end row and column for each block assigned to a created thread, this helps to define the block's rows and columns boundaries. The function parallel\_search\_block() receives this list of arguments by the time is invoked by pthread\_create() and performs the comparison of matrix index with the atomic int hv variable.
 
+From figure 1 we can advance that, in general, both strategies have similar performances, but the second solution appears to scale better than the first one. Some of the spikes present in the graph for this solution could be explained by the size of sub-matrixes to assign on threads. The amount of threads used in those cases do not divide the matrix in perfectly square sub-matrixes, all threads access the atomic hv var on each iteration, therefor some weird behaviours could happen when trying to access it's address.
+
+
 \subsubsection*{Performance Analysis}
+In order to evaluate the performance of our program when parallelised with both of previous strategies we compute the wall-clock cpu time obtained per threads, available in table 2. 
+
+The \textbf{speedup} metric is determined by the quotient between the best time obtained by a sequential program and a parallelised version of it: $Speedup = \frac{t^*}{t_p} $. 
 
-$Speedup = \frac{t^*}{t_p} $, 
+The \textbf{efficiency} of a parallel version of a program can then be calculated by the fraction $efficiency = \frac{speedup}{p}$.
 
-scalability measures how well a parallel program handles increasing levels of parallelism $E = \frac{S_p}{p}$
+Finally, the \textbf{scalability} of a program measures how well a parallel program handles increasing levels of parallelism. %It is calculated by the formula: $E = \frac{S_p}{p}$.
+
+\begin{table}[htp]
+\begin{center}
+\begin{tabular}{ccc}
+\# Threads & Columns & Blocks \\
+\hline
+4 & 16.627 & 16.626\\
+8 & 12.812 & 17.125\\
+16 & 10.572 & 10.372\\
+32 & 9.265 & 9.466 \\
+49 & 8.664 & 8.422 \\
+64 & 8.28 &  8.520 \\
+400 & 7.749 & 10.884 \\
+800 & 7.916 & 10.613 \\
+1000 & 8.126 & 10.560 
+\end{tabular}
+\end{center}
+\caption{Wall-CPU Time (in seconds) taken for each strategy per threads used.}
+\label{wall-cpu time of parallel strategies}
+\end{table}%
+
+Combining values from tables 1 and 2, and using the formulas presented before we can compute the performance metrics for both strategies:
+
+\begin{table}[htp]
+\begin{center}
+\begin{tabular}{cccc}
+Strategy & \#Threads & Speedup & Efficiency (\%)\\
+\hline
+Columns & 400 & 1.1467 & 0.007\\
+Blocks & 400 & 0.8164 & 0.2 \\
+Columns & 49 & 1.026 & 2.1 \\
+Blocks & 49 & 1.0526 & 2.1
+\end{tabular}
+\end{center}
+\caption{Performance of parallel strategies.}
+\label{performance of parallel strategies}
+\end{table}%
 
-$efficiency = \frac{speedup}{p}$
+We can conclude that the parallelisation of our initial program didn't got a significant performance gain. Probably due to the nature of our problem, since the biggest value is always in the last entry of the matrix, threads will have to loop through the entire batch of items to compare values. But also because of the implementation applied, as mentioned before, the function run\_parallel\_search (int num\_threads) could be improved in order to access the atomic variable only once per thread.
 
+In order to evaluate scalability we have to compute the previous metrics for both strategies when the matrix size increases from 29k to 40k rows and columns. We choose to use 64 threads for this experience. 
  
 
 \section*{Conclusions}