From: VGoncalo <vitor.goncalo.costa@gmail.com>
Date: Fri, 27 Mar 2026 17:01:18 +0000 (+0000)
Subject: last comments
X-Git-Url: https://vgcfreebox.myrthtech.pt/gitweb/ue-pp-squarematrixparallelisation.git/commitdiff_plain/refs/heads/main?hp=3ff75954ac217e89e83b667316e789a7422ac4f4

last comments
---

diff --git a/MatrixParallelPrograming.pdf b/MatrixParallelPrograming.pdf
index ba56ecb..efa51cd 100644
Binary files a/MatrixParallelPrograming.pdf and b/MatrixParallelPrograming.pdf differ
diff --git a/MatrixParallelPrograming.tex b/MatrixParallelPrograming.tex
index aca5c5e..ee3713d 100644
--- a/MatrixParallelPrograming.tex
+++ b/MatrixParallelPrograming.tex
@@ -44,7 +44,7 @@
 \fancyhf{} 
 
 \lhead{ProgramaÃ§Ã£o Paralela}
-\rhead{Matrix Search Parallelization}
+\rhead{Matrix Search Parallelisation}
 \fancypagestyle{noheadrule}{
   \fancyhf{}
   \renewcommand{\headrulewidth}{0pt}
@@ -53,7 +53,7 @@
 \rfoot{\thepage}
 
 
-\title{Matrix Search Parallelization}
+\title{Matrix Search Parallelisation}
 \author{Vitor GonÃ§alo Costa, m70323}
 \date{Universidade de Ãvora\\ \today}
 \begin{document}
@@ -114,7 +114,7 @@ With the current implementation of our program, the global variable hv, used to
 	\includegraphics[width=0.6\textwidth]{imagens/matrixsearchsequencial-02.png}  
 \end{figure}
 
-From the previous image we can be sure the 3rd task is the one to parallelise. We should also notice that the changes performed to the program actually increase it's time of execution, the view operation performed on this type of variable is more resource consuming than when performed on a variable of type integer. This can be proved by analysing the time for each task on the first version of our program, populating the matrix toke around 4.5 seconds while the search matrix task toke less than 4 seconds for 40k rows. The wall-clock time of the above figure is 18,931 seconds, this representes the real time a users waits for a program to finish,
+From the previous image we can be sure the 3rd task is the one to parallelise. We should also notice that the changes performed to the program actually increase it's time of execution, the view operation performed on this type of variable is more resource consuming than when performed on a variable of type integer. This can be proved by analysing the time for each task on the first version of our program, initialising the matrix toke around 4.5 seconds while the search matrix task toke less than 4 seconds for 40k rows. The wall-clock time of the above figure is 18,931 seconds, this representes the real time a users waits for a program to finish,
 
 \subsection*{Columns Split}
 Because the matrix in the spotlight is a square one, at first approach, one might think it doesn't really matter if a split is performed by columns or rows, but it does. Since C approaches 2D arrays per row, by the time the cpu access the 32 or 64 bytes of the matrix memory stored in ram to make it available closer, at LP2 or LP3, it brings the closest row values also, making it faster to loop though rows instead of columns. 
@@ -216,7 +216,7 @@ Since there wasn't a big gain when parallelising the search task we could either
 \section*{Improving performance}
 We start by updating the function parallel\_search\_cols() to have a local variable to store the highest value found by the thread and perform only one comparison to the atomic variable. This change alone makes the program finishes with a mean value of 8.192 seconds (using 8 threads for a 40k rows and columns matrix), comparing with the values on tables 1 and 2 we can validate the benefit of this change by computing $efficiency = \frac{1.1335}{8} = 14.16 \%$.
 
-The second task, matrix initialisation, simulates the sequential section of a program in this study, but we could try to parallelise this operation has an extra exercise. The function init\_parallel\_matrix() was defined in order to populate segments of the matrix by different threads with a similar algorithm used for columns search. With all configurations out of the way,  the program now takes 4,821 seconds of wall-clock time with 4 threads. The following metrics were obtained:
+The second task, matrix initialisation, simulates the sequential section of a program in this study, but we could try to parallelise this operation has an extra exercise. The function init\_parallel\_matrix() was defined in order to populate segments of the matrix by different threads with a similar algorithm used for columns search. With this implementations,  the program now takes 4,821 seconds of wall-clock time with 4 threads. The following metrics were obtained:
 
 
 \begin{figure}[h]
@@ -231,12 +231,14 @@ The second task, matrix initialisation, simulates the sequential section of a pr
 \end{minipage}
 \end{figure}
 
+With the previous parallel implementations we finish this study with a parallel version of our initial sequential program that takes almost half of the time to be executed with four threads on the introduced machine. 
 
 \section*{Conclusions}
-This study revealed a few characteristics of modern cpu behaviours. Apart from being able to spawn thousands of virtual threads to perform calculations, way more than the expected, the framework they use can also predict and make computations easy. On the last experiment of parallelising both tasks, the program takes very similar total cpu time executing them, less than 9 seconds, or less than 2,25 seconds each thread, if we compare with values presented previously for the 2nd and 3rd task before parallelisation, considerably away from each other, we can almost argue that the cpu still "remembers" the matrix while searching it.
+This study revealed a few characteristics of modern cpu behaviours. Apart from being able to spawn thousands of virtual threads to perform calculations, way more than the expected, the framework they use can also predict and make computations easy. On the last experiment of parallelising both tasks, the program takes very similar total cpu time executing them, less than 9 seconds, or less than 2,25 seconds each thread, if we compare with values presented previously for the 2nd and 3rd task before parallelisation, considerably away from each other, we can almost argue that the search task benefits from the cache being 'warmed' by the preceding initialisation task.
 
-Parallelising a simple search algorithm is heavily influenced by memory access patterns and synchronization overhead. While modern cpus can spawn multiple threads, the transition from sequential to parallel is not always linear. Our initial attempt using atomic variables showed that high contention can actually degrade performance. By implementing local thread variables, we achieved a speedup of 1.13x, and pre-parallelisation of matrix initialisation improved it to almost 2x using less resources.
+Parallelising a simple search algorithm is heavily influenced by memory access patterns and synchronisation overhead. While modern cpus can spawn multiple threads, the transition from sequential to parallel is not always linear. Our initial attempt using atomic variables showed that high contention can actually degrade performance. By implementing local thread variables, we achieved a speedup of 1.13x, and pre-parallelisation of matrix initialisation improved it to almost 2x using less resources.
 
+As a last conclusion from the study performed, although we focused on a column-based decomposition, performance data suggests that a row-wise partitioning strategy (horizontal striping) would likely yield superior results. Since C stores multi-dimensional arrays in row-major order, a thread processing a horizontal stripe accesses contiguous memory addresses. This maximizes spatial locality, allowing the CPU to effectively utilize its pre-fetchers and L1/L2 caches. By contrast, column-wise access frequently triggers 'cache misses,' as the CPU must jump across large memory strides to reach the next element in a column. Future iterations of this program should prioritize row-based decomposition to minimize memory latency and further improve the Speedup and Efficiency metrics
 
 
 \end{document}
\ No newline at end of file