Official Course Description:
In recent years, the development of parallel and cloud computing has opened the door for Big Data analysis and processing. This module provides an overview of and introduction to the vast field of parallel and cloud computing. In traditional parallel computing, we aim to develop an understanding of different parallelization models (shared-memory, distributed-memory, SIMD, SIMT), get to know appropriate programming methodologies for high-performance data analysis (OpenMP / MPI), and understand performance and scalability in this field (weak vs. strong scaling, Amdahl's law). This fundamental knowledge will then be carried over to recent developments in cloud computing, where distributed processing frameworks (Spark / Hadoop MapReduce / Dask), based on appropriate deployment infrastructures, are in the process of becoming de-facto standards for Big Data processing and analysis. We will approach these technologies from a practical point of view and aim to develop the necessary knowledge to carry out scalable machine learning and data processing on Big Data.
Literature:
- G. Zaccone, Python Parallel Programming Cookbook, Packt
- T. Mattson, B. Sanders, B. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005
- HLRS Courses 2009-H course material
- OpenMP Application Programming Interface, Version 5.0
- Message Passing Interface Standard, Version 3.0
- NVIDIA CUDA Programming Guide, Version 11.0
- T. White, Hadoop: The Definitive Guide, O'Reilly
- Z. Radtka, D. Miner, Hadoop with Python, O'Reilly
- J.C. Daniel, Data Science with Python and Dask, Manning Publications
Grading & Final exam
The grade for this lecture will be determined by a single final written exam. It will cover all content of the lectures (pre-recorded content and live content). You are therefore strongly advised to attend the live (online) lecture part.
Recommendations for preparation
Students without prior knowledge of C/C++ are encouraged to acquire a basic understanding of the language (via online material) in order to better understand some of the discussed concepts.
All official announcements for this course will be posted via the below announcement forum, to which all students have a mandatory subscription. Additionally, there is a forum for interaction with the instructor and the TA regarding general questions. All course-related (non-private) communication should take place in that general discussion forum.
This is a newly introduced lecture, so please be patient with the instructor, especially regarding potential technical hiccups.
Culture of interaction
- Please feel free to ask questions at any time!
- Please tell the instructor if he is too slow / fast / boring / excited / . . . by giving feedback via the lecture forum.
- Please use the “virtual” open-door policy, i.e. send a brief message and we can have a meeting.
Officially, this course has two lecture slots per week, namely Mondays, 11:15-12:30, and Thursdays, 14:15-15:30. The initial (online!) meeting of the course will take place in the Thursday slot, 14:15-15:30. You can access the meeting via the link further down this page.
Afterwards, however, we will choose a more flexible blended learning approach. First important message: for now, the course will be taught online in a partially synchronous (i.e. live) and partially asynchronous (pre-recorded) way. What this means exactly is discussed below.
Pre-recorded – “Delivery” of factual content
Every week, a few pre-recorded lecture videos will be put online, amounting to about 75 minutes of content in total. Each video is accompanied by a very short quiz that asks a few small questions. The quiz can be repeated as often as needed; however, it has to be passed successfully in order to gain access to the next video. Students are required to go through the videos before the next interactive session.
Live – Exploring and practicing content (synchronous)
A live (for now online, on Zoom) meeting takes place within the Monday lecture slot, i.e. Mondays, 11:15-12:30, using this link.
In that meeting, I will start by giving students an opportunity to ask questions on the theoretical part. Then we will apply the knowledge learned in the theoretical part, mostly through hands-on work with existing parallel programs, practical programming in the field, quizzes, or paper-and-pencil exercises.
The actual content will depend a bit on the ongoing topic.
What is mandatory content? Can I leave out the live / pre-recorded part?
Note that the content of both parts – live meetings and pre-recorded videos – is mandatory knowledge for the final exam. Moreover, the two parts are interwoven and depend on each other. To make it even more explicit: in the live lecture you will, for example, need content from Moodle that is only made available to you once you have successfully passed the quizzes of the pre-recorded content. The live lectures will not be recorded, so if you are sick, ask one of the other students for information about the live part.
The actual contents are available via the Moodle course page.
However, to get an impression of the codes that we will work with, you can find below a few examples.
Note that the examples below can easily be duplicated for your own first steps into parallel and distributed programming. The nice part is that you won't need to install any special software, as everything is browser-based and pre-configured. In times of remote teaching, the real-time collaboration features will also allow us to jointly work on code in class.
Introduction to MPI
Duplicate the project and get your first MPI parallel program running by entering the following command in the terminal:
mpiexec -n 2 python hello_world.py
Parallel dot product in OpenMP
Duplicate the project and run a shared-memory parallel dot product via OpenMP by compiling and executing the code via:
gcc -fopenmp -o dot_product_parallel dot_product_parallel.c
./dot_product_parallel
Analysis of stock data via Hadoop MapReduce
Please go to this page to find a published notebook which allows you to carry out a data analysis on stock exchange data via Hadoop MapReduce. Just duplicate the project via the corresponding button, and you can do your own data analysis project in a full (pseudo-distributed) Hadoop environment.
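To give a first flavor of what MapReduce-style code looks like, here is a hypothetical Hadoop Streaming style mapper/reducer pair in Python that finds the maximum closing price per stock symbol. The input format (CSV lines of `symbol,date,close`) and all names are assumptions for illustration, not the actual notebook's data layout:

```python
# Hypothetical MapReduce job in the Hadoop Streaming style: the mapper
# emits (symbol, close) pairs; after the framework's sort phase, the
# reducer keeps the maximum close per symbol. The CSV input format
# (symbol,date,close) is an assumption for illustration.
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'symbol\tclose' for every CSV input line."""
    for line in lines:
        symbol, _date, close = line.strip().split(",")
        yield f"{symbol}\t{close}"

def reducer(pairs):
    """Given key-sorted 'symbol\tclose' pairs, emit the max close per symbol."""
    keyed = (p.split("\t") for p in pairs)
    for symbol, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{symbol}\t{max(float(close) for _, close in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hadoop Streaming pipes stdin through: mapper | sort | reducer.
    # Here one stage is selected via a command-line argument.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The same map/sort/reduce pattern underlies the notebook's analysis; Hadoop merely distributes the three phases across the cluster.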