X-Git-Url: https://vgcfreebox.myrthtech.pt/gitweb/ue-pp-squarematrixparallelisation.git/blobdiff_plain/03e6d97645ffb44eb652a00b7f87c327a855faaf..3ff75954ac217e89e83b667316e789a7422ac4f4:/MatrixParallelPrograming.tex?ds=inline diff --git a/MatrixParallelPrograming.tex b/MatrixParallelPrograming.tex index 55dd9f5..aca5c5e 100644 --- a/MatrixParallelPrograming.tex +++ b/MatrixParallelPrograming.tex @@ -184,7 +184,7 @@ Finally, the \textbf{scalability} of a program measures how well a parallel prog \label{wall-cpu time of parallel strategies} \end{table}% -Combining values from tables 1 and 2, and using the formulas presented before we can compute the performance metrics for both strategies: +Combining values from tables 1 and 2, and using the formulas presented before, we can compute the performance metrics for both strategies: \begin{table}[htp] \begin{center} @@ -207,26 +207,36 @@ The metric scalability can be divided in two: \underline{strong scaling}, when t \begin{figure}[h] \centering - \includegraphics[width=0.6\textwidth]{imagens/matrixsearchsequencial-06.png} + \includegraphics[width=0.8\textwidth]{imagens/matrixsearchsequencial-06.png} \caption{Weak scalability evaluation from 1 to 8 threads.} \end{figure} Since there wasn't a big gain when parallelising the search task we could either try to develop a new parallel strategy for the search task or parallelise the matrix initialisation instead. \section*{Improving performance} -tbd +We start by updating the function parallel\_search\_cols() to have a local variable to store the highest value found by the thread and perform only one comparison to the atomic variable. This change alone makes the program finishes with a mean value of 8.192 seconds (using 8 threads for a 40k rows and columns matrix), comparing with the values on tables 1 and 2 we can validate the benefit of this change by computing $efficiency = \frac{1.1335}{8} = 14.16 \%$. -\section*{Conclusions} -Modern CPUs can spawn thousands of virtual threads to perform calculations.\\ +The second task, matrix initialisation, simulates the sequential section of a program in this study, but we could try to parallelise this operation has an extra exercise. The function init\_parallel\_matrix() was defined in order to populate segments of the matrix by different threads with a similar algorithm used for columns search. With all configurations out of the way, the program now takes 4,821 seconds of wall-clock time with 4 threads. The following metrics were obtained: + + +\begin{figure}[h] +\begin{minipage}[c]{0.45\textwidth} + \begin{itemize} + \item speedup = 1,9262 + \item efficiency = 48,15\% + \end{itemize} +\end{minipage} +\begin{minipage}[c]{0.5\textwidth} +\includegraphics[width=0.925\textwidth]{imagens/matrixsearchsequencial-07.png} +\end{minipage} +\end{figure} -The action of populating the matrix could also update the value of global variable hv instead of the 3rd task. Search the matrix could have a local variable to store the highest value found and compare with the global var hv at the end.\\ -The clock() function from time.h library is useless to compute wall-clock time and therefore should be ignored on metrics evaluations. +\section*{Conclusions} +This study revealed a few characteristics of modern cpu behaviours. Apart from being able to spawn thousands of virtual threads to perform calculations, way more than the expected, the framework they use can also predict and make computations easy. On the last experiment of parallelising both tasks, the program takes very similar total cpu time executing them, less than 9 seconds, or less than 2,25 seconds each thread, if we compare with values presented previously for the 2nd and 3rd task before parallelisation, considerably away from each other, we can almost argue that the cpu still "remembers" the matrix while searching it. +Parallelising a simple search algorithm is heavily influenced by memory access patterns and synchronization overhead. While modern cpus can spawn multiple threads, the transition from sequential to parallel is not always linear. Our initial attempt using atomic variables showed that high contention can actually degrade performance. By implementing local thread variables, we achieved a speedup of 1.13x, and pre-parallelisation of matrix initialisation improved it to almost 2x using less resources. -TODO: -- strategy to split and continue threads; -- evaluate strong/weak scalability with plot per number of threads \end{document} \ No newline at end of file