MatrixParallelPrograming.tex

   1 %----------------------------------------------------------------------------------------------
   2 %                       UE - PP - Matrix Parallel Programming
   3 %----------------------------------------------------------------------------------------------
   4 \documentclass[a4paper,11pt]{article}
   5
   6 %\usepackage[portuguese]{babel}
   7 \usepackage[utf8]{inputenc}
   8 \usepackage[T1]{fontenc}
   9 \usepackage{graphicx}
  10 \usepackage{float}
  11 \usepackage{amsmath}
  12 \usepackage{geometry}
  13 \usepackage{listings}
  14 \usepackage{xcolor}
  15
  16 \definecolor{codegreen}{rgb}{0,0.6,0}
  17 \definecolor{codegray}{rgb}{0.5,0.5,0.5}
  18 \definecolor{codepurple}{rgb}{0.58,0,0.82}
  19 \definecolor{backcolour}{rgb}{0.95,0.95,0.92}
  20
  21 \lstdefinestyle{mystyle}{
  22     backgroundcolor=\color{backcolour},
  23     commentstyle=\color{codegreen},
  24     keywordstyle=\color{magenta},
  25     numberstyle=\tiny\color{codegray},
  26     stringstyle=\color{codepurple},
  27     basicstyle=\ttfamily\footnotesize,
  28     breakatwhitespace=false,
  29     breaklines=true,
  30     captionpos=b,
  31     keepspaces=true,
  32     numbers=left,
  33     numbersep=2pt,
  34     showspaces=false,
  35     showstringspaces=false,
  36     showtabs=false,
  37     tabsize=1
  38 }
  39
  40 \lstset{style=mystyle}
  41 \geometry{top=2.5cm, bottom=2.5cm, left=2.5cm, right=2.5cm}
  42 \usepackage{fancyhdr}
  43 \pagestyle{fancy}
  44 \fancyhf{}
  45
  46 \lhead{Programação Paralela}
  47 \rhead{Matrix Search Parallelization}
  48 \fancypagestyle{noheadrule}{
  49   \fancyhf{}
  50   \renewcommand{\headrulewidth}{0pt}
  51 }
  52 \lfoot{Universidade de Évora}
  53 \rfoot{\thepage}
  54
  55
  56 \title{Matrix Search Parallelization}
  57 \author{Vitor Gonçalo Costa, m70323}
  58 \date{Universidade de Évora\\ \today}
  59 \begin{document}
  60 \maketitle
  61
  62 \abstract{This study aims to investigate the performance gain of a program when applying parallel programming techniques to divide operations between several threads using POSIX framework. }
  63
  64 \section*{Presenting the Problem}
  65 We want to write a program that finds the biggest element in a \textbf{square matrix} with thousands of rows and columns, built during run time. We want to evaluate the metrics \underline{speedup}, \underline{efficiency}, and \underline{scalability} when converting the program from sequencial to parallel computation and we should analyse how does the amount of processors and threads influences the program execution time.\\\\The Matrix is built with the logic: $matrix[r][c] = r*m\_cols+c$, inside a nested for loop for rows and columns, where r and c represent the iteration variable for row and column respectively. This way, the output of $matrix[nrows][ncolumns]$ would always be the highest element in the matrix.
  66
  67 \subsection*{Resources Available}
  68 The machine where we will be compiling and executing the program is a MacBook Pro with an i7 dual-core processor of 2,5 GHz - This cpu has available 2 physical cores with 4 virtual cores each, which means we might be able to invoke up to 8 threads of computation.
  69
  70 \subsection*{The Sequencial Program}
  71 \begin{figure}[h]
  72 \begin{minipage}[c]{0.45\textwidth}
  73         \centering
  74         \includegraphics[width=0.925\textwidth]{imagens/matrixsearchsequencial.png}
  75         %\lstinputlisting[language=C]{SequencialMatrixSearch2.c}
  76 \end{minipage}
  77 \begin{minipage}[c]{0.5\textwidth}
  78 \small
  79
  80 The execution of the program on the right relies on a lot of memory available in order to compute such big matrix. If we pick the value of 40000 for the amount of rows and columns, and knowing that an integer requires 4 bytes of memory, we need around 6,4 GB to compute such matrix. This machine has 16 GB of LPDDR3 installed.\\
  81
  82 The current implementation takes around 5,45 seconds to be executed in this machine, and throws a bus error if m\_rows is updated to 40000, which is not expected at first glance, since the system has 16GB of memory available. \\
  83
  84 This might happen due to OS imposed limits, so we allocate resources with POSIX library.
  85 \end{minipage}
  86 \end{figure}
  87
  88 We update the above program to make use of posix\_memalign() function in order to align the matrix memory with the cpu's 64 bytes cache lines. This technique is called \textit{dynamic memory assignment}, is usually applied to very large matrices and a requirement for \textit{High-Performance Computation} programs.
  89
  90 The program was executed three times using the time command, and the average execution time was computed:
  91
  92 \begin{table}[htp]
  93 \begin{center}
  94 \begin{tabular}{ccc}
  95 m\_rows & no HPC (sec) & with HPC (sec)\\
  96 \hline
  97 30k & 5,482 & 5,395 \\
  98 40k & error & 8.886 \\
  99 \end{tabular}
 100 \end{center}
 101 \caption{program's total time execution per amount of matrix rows.}
 102 \label{sequencial program time results}
 103 \end{table}%
 104
 105 We could now use the \textbf{time.h} library to evaluate the time spent on each operation, 1st allocate memory, 2nd building the matrix and 3rd search for the biggest value, but we save this exercise for later.
 106
 107 \section*{Parallel Strategies}
 108 From the three tasks presented previously, the search for the biggest value is the one that we should parallelise, therefore, a certain number of threads are spawned to split the task among them with the usage of pthread\_create() and pthread\_join() functions from \textbf{pthread.h} library.
 109
 110 With the current implementation of our program, the global variable hv, used to store the highest value found during the search operation, will be shared between threads by the time we execute a parallelised version of it. We need to take into account this \textit{race condition} and decide on a strategy to overcome it. The usage of a pre variable lock() and after unlock() \textbf{muttex} pattern will stop the execution of threads waiting for the access to that variable, due to the nature of our problem, we update this variable to be \textbf{atomic int} instead, but this change alone actually increases the computation time up to 18,35 seconds with $m\_rows = 40k$, which again, suggests the use of \textbf{time.h} library to better understand the contribution of each parallel strategy presented in the following sections of this document. The next picture displays the time our program toke on each task after applying the implementations mentioned before:
 111
 112  \begin{figure}[h]
 113         \centering
 114         \includegraphics[width=0.85\textwidth]{imagens/matrixsearchsequencial-02.png}
 115 \end{figure}
 116
 117 From the previous image we can be sure the 3rd task is the one to parallelise. We should also notice that the changes performed to the program actually increase it's time of execution, the view operation performed on this type of variable is more resource consuming than when performed on a variable of type integer. This can be proved by analysing the time for each task on the first version of our program, populating the matrix toke around 4.5 seconds while the search matrix task toke less than 4 seconds for 40k rows. The wall-clock time of the above figure is 20.460 seconds, this representes the real time a users waits for a program to finish,
 118
 119 \subsection*{Columns Split}
 120 Since the matrix in the spotlight is a square one it doesn't really matter if we perform a split by columns or rows. What matters is the speed one of the about to be spawned threads takes to find the highest value present in the matrix. This way, threads will only perform read on hv global var and not write, which is a more time consuming computer operation.
 121
 122 A new function, parallel\_search\_cols(void *args), was defined in order to share the work between threads, it expects the start and end column indices as arguments in order to split the matrix in as many slots as needed. A second function, run\_parallel\_search(int num\_threads), was also defined to make the program more useful by allowing to define the amount of threads to be created.
 123
 124 The best execution times obtained on this scenario were with the creation of 600 to 800 threads, presenting a total cpu computing time of around 8,5 seconds on the 3rd task. The next figure computes the \underline{sum up time taken by all threads} per its cardinality for both strategies presented in this document.
 125
 126 \begin{figure}[h]
 127         \centering
 128         \includegraphics[width=0.8\textwidth]{imagens/matrixsearchsequencial-03.png}
 129         \caption{Time taken by all threads on the search matrix task.}
 130 \end{figure}
 131
 132 A performance analysis is computed later this document, but we shall advance that the wall-clock time for 800 threads is around 8,21 seconds (around 1 millisecond per thread), which is better than the sequencial program in table 1. Next we present the strategy to split the matrix by blocks.
 133
 134
 135 \subsection*{Blocks Split}
 136 We updated the function run\_parallel\_search (int num\_threads), in order to divide the matrix in blocks of the same size. First it splits the matrix in two and calculate the amount of blocks for each side by the square root of the amount of blocks to be created. If we pick 400 threads for example, means we have 400 blocks to be worked out during the 3rd task. Each of this blocks has 2000 rows and columns, $\sqrt{400} = 20$,which means 20 blocks for each side of the first split, and then we can validate the amount of items in the blocks with: $20*(2000*2000)*2=160k$.
 137
 138 We also defined a new struct object to carry the start/end row and column for each block assigned to a created thread, this helps to define the block's rows and columns boundaries. The function parallel\_search\_block() receives this list of arguments by the time is invoked by pthread\_create() and performs the comparison of matrix index with the atomic int hv variable.
 139
 140 From figure 1 we can advance that, in general, both strategies have similar performances, but the second solution appears to scale better than the first one. Some of the spikes present in the graph for this solution could be explained by the size of sub-matrixes to assign on threads. The amount of threads used in those cases do not divide the matrix in perfectly square sub-matrixes, all threads access the atomic hv var on each iteration, therefor some weird behaviours could happen when trying to access it's address.
 141
 142
 143 \subsubsection*{Performance Analysis}
 144 In order to evaluate the performance of our program when parallelised with both of previous strategies we compute the wall-clock cpu time obtained per threads, available in table 2.
 145
 146 The \textbf{speedup} metric is determined by the quotient between the best time obtained by a sequential program and a parallelised version of it: $Speedup = \frac{t^*}{t_p} $.
 147
 148 The \textbf{efficiency} of a parallel version of a program can then be calculated by the fraction $efficiency = \frac{speedup}{p}$.
 149
 150 Finally, the \textbf{scalability} of a program measures how well a parallel program handles increasing levels of parallelism. %It is calculated by the formula: $E = \frac{S_p}{p}$.
 151
 152 \begin{table}[htp]
 153 \begin{center}
 154 \begin{tabular}{ccc}
 155 \# Threads & Columns & Blocks \\
 156 \hline
 157 4 & 16.627 & 16.626\\
 158 8 & 12.812 & 17.125\\
 159 16 & 10.572 & 10.372\\
 160 32 & 9.265 & 9.466 \\
 161 49 & 8.664 & 8.422 \\
 162 64 & 8.28 &  8.520 \\
 163 400 & 7.749 & 10.884 \\
 164 800 & 7.916 & 10.613 \\
 165 1000 & 8.126 & 10.560
 166 \end{tabular}
 167 \end{center}
 168 \caption{Wall-CPU Time (in seconds) taken for each strategy per threads used.}
 169 \label{wall-cpu time of parallel strategies}
 170 \end{table}%
 171
 172 Combining values from tables 1 and 2, and using the formulas presented before we can compute the performance metrics for both strategies:
 173
 174 \begin{table}[htp]
 175 \begin{center}
 176 \begin{tabular}{cccc}
 177 Strategy & \#Threads & Speedup & Efficiency (\%)\\
 178 \hline
 179 Columns & 400 & 1.1467 & 0.007\\
 180 Blocks & 400 & 0.8164 & 0.2 \\
 181 Columns & 49 & 1.026 & 2.1 \\
 182 Blocks & 49 & 1.0526 & 2.1
 183 \end{tabular}
 184 \end{center}
 185 \caption{Performance of parallel strategies.}
 186 \label{performance of parallel strategies}
 187 \end{table}%
 188
 189 We can conclude that the parallelisation of our initial program didn't got a significant performance gain. Probably due to the nature of our problem, since the biggest value is always in the last entry of the matrix, threads will have to loop through the entire batch of items to compare values. But also because of the implementation applied, as mentioned before, the function run\_parallel\_search (int num\_threads) could be improved in order to access the atomic variable only once per thread.
 190
 191 In order to evaluate scalability we have to compute the previous metrics for both strategies when the matrix size increases from 29k to 40k rows and columns. We choose to use 64 threads for this experience.
 192
 193
 194 \section*{Conclusions}
 195 Modern CPUs can spawn thousands of virtual threads to perform calculations.\\
 196
 197 The action of populating the matrix could also update the value of global variable hv instead of the 3rd task. Search the matrix could have a local variable to store the highest value found and compare with the global var hv at the end.\\
 198
 199 The clock() function from time.h library is useless to compute wall-clock time and therefore should be ignored on metrics evaluations.
 200
 201
 202
 203 \end{document}