Official Course Description:
In recent years, the development of parallel and cloud computing has opened the door for Big Data analysis and processing. This module aims at providing an overview of and introduction to the vast field of parallel and cloud computing. In traditional parallel computing, we aim to develop notions of different parallelization models (shared-memory, distributed-memory, SIMD, SIMT), get to know appropriate programming methodologies for high-performance data analysis (OpenMP / MPI), and aim at understanding performance and scalability in this field (weak vs. strong scaling, Amdahl’s law). This fundamental knowledge will then be carried over to recent developments in cloud computing, where distributed processing frameworks (Spark / Hadoop MapReduce / Dask), based on appropriate deployment infrastructures, are in the process of becoming de facto standards for Big Data processing and analysis. We will approach these technologies from a practical point of view and aim at developing the necessary knowledge to carry out scalable machine learning and data processing on Big Data.
Intended Learning Outcomes (via Moodle)
Parallel computing part:
- Zaccone, Python Parallel Programming Cookbook, Second Edition, Packt Publishing, 2015.
- T. Mattson, B. Sanders, B. Massingill. Patterns for Parallel Programming, Addison-Wesley, 2005
- HLRS Courses 2009-H course material
- OpenMP Application Programming Interface, Version 5.0
- Message Passing Interface Standard Version 3.0
- Nvidia CUDA Programming Guide Version 11.0
Distributed computing part:
- T. White, Hadoop – The Definitive Guide, Fourth Edition, O’Reilly, 2015
- Z. Radtka, D. Miner, Hadoop with Python, O’Reilly, 2016
- J.C. Daniel, Data Science with Python and Dask, Manning Publications
Grading & Final exam
The grades for this lecture will be determined by a final written exam. It will cover all content of the lectures (pre-recorded content and live content). Therefore, you will want to attend the live (online) lecture part.
(Details on the final exam will be published later.)
Expected programming knowledge
Ideally, students have background knowledge of Python and C/C++, although it is generally possible to follow the course with knowledge of Python only.
Students without knowledge of C/C++ are encouraged to get a basic understanding of it (via online material) in order to better understand some of the discussed concepts.
All official announcements for this course will be posted in the announcement forum below, to which all students are subscribed by default. Additionally, there is a forum for interaction with the instructor regarding general questions. All course-related (non-private) communication should be posted in that general discussion forum. Please use the Moodle messaging system, not e-mail, for private communication. This ensures that your messages do not get lost in my constantly exploding e-mail inbox.
Announcements (via Moodle)
Private communication: Moodle messaging system (via Moodle)
This is a quite technical lecture
Please be patient with the instructor in case of technical hiccups.
Culture of interaction among students and instructor
- Please feel free to ask questions at any time!
- Please tell the instructor if he is too slow / fast / boring / excited / ... by giving feedback via the lecture forum.
- Please use the bookable slots or just ask for a meeting via the Moodle communication system if you need further interaction.
Online class netiquette
- Everyone should interact with everyone else in a respectful way. Just think of how you would like to be treated. This gives you a first indication of how to treat others.
- The online classes won’t be recorded, hence everyone should feel safe to show her/his face. Private recordings of any kind are not allowed.
- No harassment-like behavior will be accepted. This is irrespective of the form of the behavior / communication, i.e. text messages, voice, video. Only students with valid Jacobs accounts are allowed to join starting from the second session. Thereby, it will be possible to identify and prosecute any case of harassment following the Code of Academic Integrity or State Laws.
Officially, this course has two lecture slots per week, namely Tuesdays, 14:15-15:30 and Thursdays, 14:15-15:30. The initial (online!) meeting of the course will take place in the Thursday slot, 14:15-15:30.
You can access the meeting via the link further down this page. Afterwards, however, we will choose a more flexible blended learning approach. The course will be taught partially online and partially in-presence, in a partially synchronous (i.e. live) and partially asynchronous (pre-recorded) way. What exactly this means is discussed below.
Pre-recorded – “Delivery” of factual content
Every week, a few pre-recorded lecture videos will be put online. In total, they will have about 75 minutes of content. Each video is accompanied by a very short quiz that asks a few small questions. The quiz can be repeated as often as needed, but it has to be passed in order to get access to the next video. Students are required to go through the videos before the next interactive session.
Live – Exploring and practicing content (synchronous, online)
In the Thursday lecture slot, i.e. on Thursdays, 14:15-15:30, a mandatory live online meeting (on Teams) takes place, which you can join using this link. In that meeting, I will start by giving students an opportunity to ask questions on the theoretical part. Then, we will apply the knowledge that we learned in the theoretical part and do
- mostly hands-on work, using existing parallel programs or doing practical programming in the field, or
- quizzes or paper & pencil style work.
The actual content will depend a bit on the ongoing topic.
Open forum – Further content support (in-presence)
During the Tuesday slot, i.e. on Tuesdays, 14:15-15:30 in ICC-West Wing Conference Hall, students are invited to join an in-presence meeting for further support with the lecture content, i.e. Q&A. With this additional offer, the instructor wants to address the sometimes strongly varying learning habits of students, which are particularly pronounced in larger groups. Please recall for these sessions that the 3G rules apply. Also, social distancing and ventilation will be strictly enforced by the instructor.
What is mandatory content? Can I leave out the live / pre-recorded part?
Note that the content of both parts, i.e. the live meeting (not the open forum) and the pre-recorded videos, is mandatory knowledge for the final exam. Moreover, the two parts are interwoven and depend on each other. For example, the live lecture requires content from Moodle that is only made available to you once you have passed the quizzes for the pre-recorded content. The live lectures will not be recorded. So if you are sick, ask one of the other students for information about the live part.
All code development in this course will be carried out in a cloud-based collaborative environment (Deepnote). The platform is introduced in the first lecture. You will receive an e-mail invitation to join a “Team” in Deepnote at the beginning of the course. Everything that you need to do, from receiving the e-mail up to the point where you are ready to work in Deepnote, is described in the video below.
Instructions to get started with Deepnote (via Moodle)
Within the explanation video, we are using the sample project given below:
Sample Deepnote project (via Moodle)
The actual contents are available via the Moodle course page.
However, to get an impression of the code that we will work with, you can find a few examples below.
Note that the examples below can easily be duplicated for your own first steps into parallel and distributed programming. The nice part is that you won’t need to install any special software, as everything is browser-based and pre-configured. In times of remote teaching, the real-time collaboration features will further allow us to work on code jointly in class.
Introduction to MPI
Duplicate the project and get your first MPI parallel program running by entering the following command in the terminal:
mpiexec -n 2 python hello_world.py
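The project's own hello_world.py is not reproduced on this page. As a rough idea of what such a script can look like, here is a minimal sketch that assumes the mpi4py package is the Python MPI binding in use (an assumption, the page does not say so). The helper name greeting and the single-process fallback are hypothetical additions for illustration only.

```python
def greeting(rank, size):
    """Build the message that one MPI process prints (hypothetical helper)."""
    return f"Hello world from rank {rank} of {size}"


def main():
    try:
        # Assumption: mpi4py provides the MPI bindings in the environment.
        from mpi4py import MPI
    except ImportError:
        # Fallback so the sketch still runs without MPI: act as a single process.
        print(greeting(0, 1))
        return
    comm = MPI.COMM_WORLD      # communicator containing all started processes
    rank = comm.Get_rank()     # id of this process, 0 .. size-1
    size = comm.Get_size()     # total number of processes
    print(greeting(rank, size))


if __name__ == "__main__":
    main()
```

Started with mpiexec -n 2, every process runs the same script and prints its own line; the order in which the two lines appear is not guaranteed.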
Parallel dot product in OpenMP
Duplicate the project and run a shared-memory parallel dot product via OpenMP by compiling and executing the code via:
gcc -fopenmp -o dot_product_parallel dot_product_parallel.c
./dot_product_parallel
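The course version of this example is written in C with OpenMP and is not shown here. To illustrate the underlying idea (split the index range into chunks, compute partial sums in parallel, then reduce them to one result), here is a Python analog that uses threads from the standard library instead of OpenMP. The function names are hypothetical. Note that with CPython's global interpreter lock, this sketch shows the pattern but will not give the speedup that the compiled OpenMP code can.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(a, b, start, stop):
    """Dot product restricted to the index range [start, stop)."""
    return sum(a[i] * b[i] for i in range(start, stop))


def parallel_dot(a, b, num_workers=4):
    """Chunk the index range, sum each chunk in a thread, then reduce."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n = len(a)
    if n == 0:
        return 0
    chunk = -(-n // num_workers)  # ceiling division: chunk size per worker
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # All threads read the same lists a and b (shared memory).
        partials = pool.map(lambda se: partial_dot(a, b, se[0], se[1]), bounds)
        # Reduction step: combine the per-chunk partial sums.
        return sum(partials)


if __name__ == "__main__":
    print(parallel_dot([1, 2, 3], [4, 5, 6]))
```

The final sum over the partial results plays the role that a reduction clause plays in an OpenMP loop.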
Analysis of stock data via Hadoop MapReduce
Please go to this page to find a published notebook which allows you to carry out a data analysis on stock exchange data via Hadoop MapReduce. Just duplicate the project via the corresponding button, and you can do your own data analysis project in a full (pseudo-distributed) Hadoop environment.
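To give a feeling for the MapReduce programming model before opening the notebook, here is a small in-memory sketch in plain Python. It is not the notebook's code and does not use Hadoop. The record format (symbol, date, closing price), the sample values, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical input lines in the assumed format "symbol,date,closing_price".
RECORDS = [
    "AAA,2020-01-02,10.0",
    "BBB,2020-01-02,20.5",
    "AAA,2020-01-03,12.5",
    "BBB,2020-01-03,19.0",
]


def mapper(line):
    """Map step: emit one (symbol, price) pair per input line."""
    symbol, _date, price = line.split(",")
    yield symbol, float(price)


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(symbol, prices):
    """Reduce step: collapse all prices of one symbol into their maximum."""
    return symbol, max(prices)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a real Hadoop job, the map and reduce steps run on different machines and the framework performs the shuffle; here all three steps run in one process to keep the data flow visible.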