![]() |
![]() |
|
02158 Concurrent Programming Fall 2024 |
Mandatory Assignment 1: Task-based Parallel Computing in Java |
Home | Plan | Material |
Read about Java thread pools in [Proc 4.6]. You may also look at this basic introduction to the Java Executor Framework even study its documentation.
The algorithm of this assignment represents a case of iterative parallelism described in [Andrews 1.4]. and the similar concept data parallel algorithm described in [Andrews 3.5].
You are supposed to visualize your results with graphs. For this, you could use Gnuplot. See http://gnuplot.info for installation instructions. You may also use any other visualization tool (Maple, Matlab, R, ...) that you might be familiar with or the LaTeX package pgfplot. See e.g. this description.
java Search <file> <pattern>
where <file> is the name of some text file and <pattern> is the string to be located.
The program performs a sequential search using the Java executor framework. This is accomplished by creating a single task searching the whole file and letting this be executed by a single thread executor. The result of the single task is represented by a future which (when ready) can be used directly as the overall result.
The program measures the elapsed search time (loading of file not included) and prints the number of occurrences.
You may e.g. try to find DNA-patterns in part of the human cromosome 2 or you may search the complete works of William Shakespeare (e.g. for "Something is rotten in the state of Denmark.").
Run the program on a very large file and find a pattern which makes the search take in the order of 1 second (e.g. 0.3-3 s). Also, the search pattern should only occur a small number of times (otherwise, the list handling will dominate).
The program is able to run the search multiple times. This may be set with the -R option:
java Search -R <r> <file> <pattern>
where <r> is the desired number of runs. Try to run your search with, say, 10 runs.
Plot the resulting execution times in a graph. The y-axis should start at 0 in order to judge the variations relatively.
How does the execution time vary?
If you see that the execution times stabilize after some runs, these first runs may be seen as warm-up runs not to be considered for a proper execution time comparison.
Discuss what could be the reasons for the warm-up phenomenon.
You may let the first <w> runs be warm-up runs to be discarded by using the option -W <w>.
Adjust the -W and -R options so that you get a sequence of reasonable identical measurements and state your results.
State the search problem and the values found for -W and -R. Use this problem and settings for all time/speedup measurements in the following problems.
Now modify the program to allow for the search to be performed as a number of independent tasks whose results can be combined to the overall results. The number of tasks to use should be given by a third parameter to the program and is available in the global variable ntasks.
The given Search class already has the machinery to submit muliple tasks to the thread pool. Start by uncommenting the multi-task search part of the main method.
For now, keep the single thread executor.
The given SearchTask class should not be modified.
You should, in particular, be careful to analyze which part of the character array each task should work on and how the position results should be combined. Be sure that your splitting works for any kind of text and for any (reasonably sized) search string.
Test that the multi-task search works correctly. For this, you may use a simpler seach problem. E.g. you may try to find the number of occurrences of xxxx (four x-es) in the file xtest.txt [It should be 2605]. Verify for various tasks numbers.
Describe your deliberations on the splitting and discuss whether the workload has been evenly distributed and whether there is an upper limit to the number of tasks that may be requested.
The program measures the elapsed execution time of the multi-task search as for the single task and calculates an average speedup obtained by using multiple tasks.
Explain the speedup notion.
Execute the program (using the search problem and settings fournd in Problem 1) and observe the speedup for various task numbers. Show the results graphically. Is the speedup as expected?
Once the program is running correctly, you may replace the execution engine with the Executor.newCachedThreadPool() which will run each tasks on its own thread. This is done by passing the -Ec option to the program.
Now you should conduct a series of experiments where, for the same file and pattern, you vary the number of tasks (=threads) and record the speedups obtained.
The results may be recorded by supplying the name of a data file using the option -d <datafile>. You may then add a data line using the method writeData(String).
Your results should be illustrated by graphs showing the speedup as a function of the number of tasks. Both axes should start at 0.
Remember to state your hardware specifications, i.e. the processor type and its number of cores and hyper-threads. Are the cores uniform or are some of more powerful than others (check the machine specifications)?
How does the speedup vary as the number of tasks goes from 1 to say 20 (in steps of 1)? Is that as expected?
Try to let the number of tasks grow even further (using fairly large steps). What do you observe?
[You might want to automate the process of varying the parameters. For our testing, however, the program structure and interface should not be further modified. Instead you may run the program from using a scripting language or from another Java program calling the Search.main method.]
Finally, you should try to decouple tasks and threads by using Executor.newFixedThreadPool(nthreads) where the global variable nthreads may be set by a fourth program parameter. To apply that executor, use the -Ef option.
For a fixed search problem, try to see how the performance is related to the number of tasks and threads by varying these over relevant intervals (determined from your results in Problem 3).
Illustrate the results using graphs. For this you coould use overlaid 2D graphs each showing the speedup as a function of the task number for a fixed number of threads. Alternatively you may use 3D graphs.
How do your explain the results?
The assignment may be done on your own computer (desktop or laptop), but the effect of the parallel execution will be limited by the number of cores you have. The DTU High Performance Computing (HPC) center has machines (called nodes) with many more cores.
Here you are going to run on a so-called application node which is a many-core machine, shared by many users.
To run on an HPC application node you may either install the ThinLinc client application (see http://gbar.dtu.dk/faq/43-thinlinc or just use the web-based client on https://thinlinc.gbar.dtu.dk)
Log in with your usual DTU credentials (it is secure), right-click on the desktop and chose "Open terminal here" to start a terminal (the prompt should start with n-62-...). You are now in a standard Linux environment, from where you may download the code, start various editors/IDEs and run the code. [To leave, rightclick >> Applications >> Log out]
You can transfer files from your laptop using scp (Linux, Mac) or Putty/WinSCP (Windows) or just copy/paste into an editor. Alternatively you may transfer via a git repository.
More information about using the home file system associated with the Gbar/HPC machines are available at: this site.
If you are not on the DTU Campus Network, you may access the HPC machines via the DTU VPN (recommended) or using ssh keys.
Try to redo some of the experiments from Problem 3 or 4 and compare the results. What do you observe? Can you explain this?
You may use the command lscpu to learn how many cores the node has. Is this recognized in your results?
Since an application node is shared among many users, the processors are distributed fairly among the threads of the users. Thus, the more threads, the more processing you get. Therefore, the machine may appear to have an indefinite number of cores.
Caveat: Always beware that measuring computing times on a timeshared multi-user system is like measuring length with an elastic ruler.
The assignment should be made in groups of 2-3 students. If you for some good reason would prefer to do the assignment on your own, you must get prior permission from the teacher (HHL). Groups must be registred in DTU Learn no later than Thursday September 19 in order to hand in.
All groups should make Problems 1 to 4. Groups of 3 students are expected to do Problem 5 as well.
Of course, each group must work on its own. In particular you must not exchange any form of code or texts with other groups. Remember that any use of generative AI must be declared (for what and how).
The asssignment must be handed in on DTU Learn no later than
The hand-in should comprise the following:
The report should include
Your experiments must be properly documented, stating which programs
were run on which hardware with references to the documentation files.
Assistance for the assignment will be available on September 19, during the lab session as indicated on the course plan.
In case of amendments or updates, these will appear here. Major updates will be announced on DTU Learn.
Enjoy!
Hans Henrik Løvengreen, Sep 12, 2024 |