# Data Analysis and Plotting in Python with Pandas

_Andreas Herten, Jülich Supercomputing Centre, Forschungszentrum Jülich, 4 September 2023_

**Version: Tasks**

## Task Outline

* [Task 1](#task1)
* [Task 2](#task2)
* [Task 3](#task3)
* [Task 4](#task4)
* [Task 5](#task5)
* [Task 6](#task6)
* [Task 7](#task7)
* [Task 7B](#task7b)
* [Task 8](#task8)
* [Task 8B](#task8b)

```python
import pandas as pd
```

## Task 1

TASK

* Create a data frame with
 - 6 names of dinosaurs,
 - their favourite prime number,
 - and their favourite color.
* Play around with the frame
* Tell me when you're done with status icon in BigBlueButton: 👍

Jupyter Notebook 101:

* Execute cell: `shift+enter`
* New cell in front of current cell: `a`
* New cell after current cell: `b`

```python
happy_dinos = {
    "Dinosaur Name": [],
    "Favourite Prime": [],
    "Favourite Color": []
}
#df_dinos = 
```
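As a hint for the construction step, here is a minimal sketch of how a dictionary of equal-length lists becomes a data frame. The pet names and values are made up for illustration and are not the task's solution.

```python
import pandas as pd

# Hypothetical example data: each key becomes a column, each list its values
pets = {
    "Name": ["Rex", "Mia", "Bo"],
    "Legs": [4, 4, 2],
}
df_pets = pd.DataFrame(pets)

print(df_pets.head())          # first rows of the frame
print(df_pets.dtypes)          # column types inferred by Pandas
print(df_pets["Legs"].sum())   # simple aggregation over one column
```

All lists must have the same length, otherwise `pd.DataFrame` raises a `ValueError`.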

## Task 2

TASK

* Read `data-nest.csv` into a `DataFrame`; call it `df`
 *(Data was produced with [JUBE](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html))*
* Get to know it and play a bit with it
* Tell me when you're done with status icon in BigBlueButton: 👍

```python
!head data-nest.csv
```

```
id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay
5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5
5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5
5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5
5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5
5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265
```
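For getting to know a freshly loaded frame, the following sketch may help. It parses a shortened stand-in for the CSV from an in-memory string (an assumption made so the example runs without the file); with the real file, the path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Shortened stand-in for the file; only a few columns and rows are kept
csv_text = """id,Nodes,Tasks/Node,Threads/Task,Sim. Time / s
5,1,2,4,311.52
5,1,4,4,142.97
5,2,2,4,114.88
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())       # first rows
df_demo.info()              # column names, dtypes, non-null counts
print(df_demo.describe())   # summary statistics of the numeric columns
```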

## Task 3

TASK

* Add a column called `Threads` to the Nest data frame from Task 2; it holds the total number of threads across all nodes (i.e. the product of threads per task, tasks per node, and nodes)
* Tell me when you're done with status icon in BigBlueButton: 👍
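Column arithmetic in Pandas works element-wise, row by row. A minimal sketch on a small made-up frame (the values are illustrative, the column names mirror the Nest data):

```python
import pandas as pd

# Small made-up frame with the same column names as the Nest data
df_demo = pd.DataFrame({
    "Nodes": [1, 1, 2],
    "Tasks/Node": [2, 4, 2],
    "Threads/Task": [4, 4, 4],
})

# Multiplying columns yields a new column of per-row products
df_demo["Threads"] = df_demo["Nodes"] * df_demo["Tasks/Node"] * df_demo["Threads/Task"]
print(df_demo)
```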

```python
import matplotlib.pyplot as plt
%matplotlib inline
```

## Task 4

TASK


* Sort the Nest data frame by threads
* Plot `"Presim. Time / s"` and `"Sim. Time / s"` of our data frame `df` as a function of threads
* Use a dashed, red line for `"Presim. Time / s"`, a blue line for `"Sim. Time / s"` (see [API description](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot))
* Don't forget to label your axes and to add a legend _(1st rule of plotting)_
* Tell me when you're done with status icon in BigBlueButton: 👍
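The sorting and styling steps can be sketched as follows, assuming a small stand-in frame with three rows. The `"r--"` and `"b-"` format strings are Matplotlib shorthand for a dashed red and a solid blue line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in frame, deliberately unsorted
df_demo = pd.DataFrame({
    "Threads": [16, 8, 32],
    "Presim. Time / s": [7.87, 17.26, 3.19],
    "Sim. Time / s": [142.97, 311.52, 60.31],
})
df_sorted = df_demo.sort_values("Threads")  # returns a new, sorted frame

fig, ax = plt.subplots()
ax.plot(df_sorted["Threads"], df_sorted["Presim. Time / s"], "r--", label="Presim. Time / s")
ax.plot(df_sorted["Threads"], df_sorted["Sim. Time / s"], "b-", label="Sim. Time / s")
ax.set_xlabel("Threads")
ax.set_ylabel("Time / s")
ax.legend()
```

Without sorting first, the line would connect the points in row order and zigzag across the plot.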

## Task 5

TASK

Use the Nest data frame `df` to:

1. Make threads the index of the data frame (`.set_index()`)
2. Plot `"Presim. Time / s"` and `"Sim. Time / s"` individually
3. Plot them onto one common canvas!
4. Make them have the same line colors and styles as before
5. Add a legend, add missing axes labels
6. Tell me when you're done with status icon in BigBlueButton: 👍
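One possible way to draw several columns onto a common canvas is to create the axes once and pass them to each Pandas `.plot()` call via `ax=`. The sketch below assumes a small stand-in frame rather than the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in frame with the thread count as index
df_demo = pd.DataFrame({
    "Threads": [8, 16, 32],
    "Presim. Time / s": [17.26, 7.87, 3.19],
    "Sim. Time / s": [311.52, 142.97, 60.31],
}).set_index("Threads")

fig, ax = plt.subplots()
# Both calls draw into the same axes because ax= is passed explicitly
df_demo["Presim. Time / s"].plot(ax=ax, style="r--")
df_demo["Sim. Time / s"].plot(ax=ax, style="b-")
ax.set_xlabel("Threads")
ax.set_ylabel("Time / s")
ax.legend()
```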

## Task 6

TASK

* To your `df` Nest data frame, add a column with the unaccounted time (`Unaccounted Time / s`), which is the program runtime minus the sum of average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.
(*I know this is technically not super correct, but it will do for our example.*)
* Plot a stacked bar plot of all these columns (except for program runtime) over the threads
* Tell me when you're done with status icon in BigBlueButton: 👍
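The pattern of subtracting a row-wise sum and then stacking bars can be sketched on a toy frame. The column names `Total / s`, `Phase A / s`, `Phase B / s`, and `Rest / s` are hypothetical placeholders, not columns of the Nest data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import pandas as pd

# Toy frame: a total and two parts that do not quite add up to it
df_demo = pd.DataFrame({
    "Threads": [8, 16],
    "Total / s": [10.0, 6.0],
    "Phase A / s": [6.0, 3.0],
    "Phase B / s": [3.0, 2.0],
}).set_index("Threads")

parts = ["Phase A / s", "Phase B / s"]
# sum(axis=1) adds up the selected columns within each row
df_demo["Rest / s"] = df_demo["Total / s"] - df_demo[parts].sum(axis=1)

# stacked=True piles the columns on top of each other per index value
ax = df_demo[parts + ["Rest / s"]].plot(kind="bar", stacked=True)
ax.set_ylabel("Time / s")
```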

## Task 7

TASK

* Create a pivot table based on the Nest `df` data frame
* Let the `x` axis show the number of nodes; display the values of the simulation time `"Sim. Time / s"` for the tasks per node and threads per task configurations
* Please plot a bar plot
* Tell me when you're done with status icon in BigBlueButton: 👍
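As a sketch of the pivoting mechanics, the example below builds a pivot table from a four-row stand-in frame. Passing a list to `columns` produces one column per combination of the listed keys.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import pandas as pd

# Stand-in frame: two node counts, two task configurations
df_demo = pd.DataFrame({
    "Nodes": [1, 1, 2, 2],
    "Tasks/Node": [2, 4, 2, 4],
    "Threads/Task": [4, 4, 4, 4],
    "Sim. Time / s": [311.52, 142.97, 114.88, 52.05],
})

pivot = df_demo.pivot_table(
    index="Nodes",                            # rows, later the x axis
    columns=["Tasks/Node", "Threads/Task"],   # one column per configuration
    values="Sim. Time / s",                   # cell content (mean by default)
)
ax = pivot.plot(kind="bar")
ax.set_ylabel("Sim. Time / s")
```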

## Task 7B (like Bonus)

TASK

- Same pivot table as before (that is, `x` with nodes, and columns for Tasks/Node and Threads/Task)
- But now, use `Sim. Time / s` and `Presim. Time / s` as values to show
- Show them as a **stack** of those two values inside the pivot table
- Use Pandas' functionality as much as possible!
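One possible reading of this task, sketched on stand-in data: pivot with two value columns, move the configuration levels from the columns into the row index with `.stack()`, and draw a stacked bar plot. This is an assumption about the intended approach, not the reference solution.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit inside a notebook
import pandas as pd

df_demo = pd.DataFrame({
    "Nodes": [1, 1, 2, 2],
    "Tasks/Node": [2, 4, 2, 4],
    "Threads/Task": [4, 4, 4, 4],
    "Presim. Time / s": [17.26, 7.87, 6.08, 3.12],
    "Sim. Time / s": [311.52, 142.97, 114.88, 52.05],
})

# A list of values adds a top column level holding the value name
pivot = df_demo.pivot_table(
    index="Nodes",
    columns=["Tasks/Node", "Threads/Task"],
    values=["Presim. Time / s", "Sim. Time / s"],
)

# Move the configuration levels into the row index; the two times stay as columns
stacked = pivot.stack(["Tasks/Node", "Threads/Task"])
ax = stacked.plot(kind="bar", stacked=True)
ax.set_ylabel("Time / s")
```

Depending on the installed Pandas version, `.stack()` may emit a `FutureWarning` about its new implementation; the result is the same for this complete toy table.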

## Task 8 (Super Bonus)

TASK

* Create a bar chart of the top 10 actors (on `x`) and the average ratings of their top movies (on `y`) based on IMDb data (only if they play in at least two movies)

* IMDb provides data sets at [datasets.imdbws.com](https://datasets.imdbws.com)
* The data can be loaded directly, for example:
```python
pd.read_table('https://datasets.imdbws.com/dataset.tsv.gz', sep="\t", low_memory=False, na_values=["\\N","nan"])
```
* Needed:
 * `name.basics.tsv.gz` (for names of actors and movies they are known for)
 * `title.ratings.tsv.gz` (for ratings of titles)
* Strategy _suggestions_:
 * Use `df.apply()` with custom function
 * Custom function: Compute average rating and determine if this entry is eligible for plotting (this _can_ be done at once, but does not need to be)
 * Average rating: Look up title IDs as listed in `knownForTitles` in titles dataframe
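The suggested strategy can be sketched on tiny made-up stand-ins for the two tables. Column names other than `knownForTitles` are assumed to match the IMDb files (`primaryName`, `tconst`, `averageRating`), and `mean_rating` is a hypothetical helper name.

```python
import pandas as pd

# Tiny made-up stand-ins for name.basics and title.ratings
names = pd.DataFrame({
    "primaryName": ["Actor A", "Actor B", "Actor C"],
    "knownForTitles": ["tt1,tt2", "tt3", None],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [8.0, 6.0, 9.0],
}).set_index("tconst")  # index by title ID for fast lookup

def mean_rating(title_ids, min_titles=2):
    """Average rating of the listed titles; NaN if fewer than min_titles are rated."""
    if not isinstance(title_ids, str):  # missing entry
        return float("nan")
    ids = [t for t in title_ids.split(",") if t in ratings.index]
    if len(ids) < min_titles:           # not eligible for the plot
        return float("nan")
    return ratings.loc[ids, "averageRating"].mean()

names["meanRating"] = names["knownForTitles"].apply(mean_rating)
top = names.dropna(subset=["meanRating"]).nlargest(10, "meanRating")
print(top)
```

On the full data set, a row-wise `apply` can be slow; it is used here because it follows the suggestion above and keeps the logic easy to read.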

## Task 8B (Bonuseption)

TASK

All of the following are ideas for separate sub-tasks, which can be done individually:
* In addition to Task 8, restrict the top titles to those with more than 10000 votes
* For 30 top-rated actors, plot rating vs. age
* For 30 top-rated actors, plot rating vs. average runtime of the known-for-titles (using `title.basics.tsv.gz`)

Feedback to a.herten@fz-juelich.de
