Python for data science: Parsing "aTimeLogger" Android app data to graphs using pandas and matplotlib

enoren5 · June 12, 2024, 7:32pm

Greetings Pythonistas!

I’ve got a data set in CSV format (available for download here) which spans from July 2019 up to December 2023. It is a collection of activities (and their time intervals) that I’ve meticulously recorded and tracked while doing social science research like Philosophy as well as learning how to program Python and build websites with Django, among other research activities.

To give you a specific example of how I use this tool: if I spend 25 minutes of my lunch hour on my tablet watching Udemy course content on the basics of Binary Search Trees, then at the end of my lunch break I open an app called aTimeLogger on my Android phone and enter the “Activity” type, such as “algorithms”. Next I enter metadata such as the date, the start time, and the end time. Then I write an annotation (1-2 sentences), which is sort of a mental note for my future reference. Other discrete research activity categories could be “Python”, “Django”, “sysadmin”, or even “Magick” / “writing”.

For additional context, over the range of the data set, I’ve spent a total of ~378 hours doing something “Python” related and ~579 hours working on Django course content (or a Django based web project). Those are the two largest categories. The rest of the Activities aren’t as data dense.

My two latest code snippets and graphs can be found here.

Here are my questions:

In the first snippet and graph below, in my Jupyter Notebook pandas and matplotlib show two categories - - but only partially so. I noticed that when I change the alpha (translucency) variable, the time spent on the different categories overlap each other. How do I stack the data instead? That’s my first question.

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt
  
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600
 
# Copy so changes made to python_df do not affect bulk_df and vice versa
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)
 
# Calculate rolling means using the index now
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()
 
# Copy so changes made to django_df do not affect bulk_df and vice versa
django_df = bulk_df[bulk_df["Activity"] == "Django"].copy()
django_df.set_index('From', inplace=True)
# Calculate rolling means using the index now
django_df['Rolling_Mean_90'] = django_df['Duration_hours'].rolling('90D').mean()
django_df['Rolling_Mean_182'] = django_df['Duration_hours'].rolling('182D').mean()
 
python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()
# py_dj_Month_combined = python_df_Month.add(django_df_Month, fill_value=0)
 
plt.figure(figsize=(14, 8))
plt.bar(python_df_Month.index, python_df_Month, label='Python 90-Day Rolling Mean',width=20, alpha=0.5) # color='red')
plt.bar(django_df_Month.index, django_df_Month, label='Django 90-Day Rolling Mean', width=20, alpha=0.5) #, color='blue')
plt.legend()
plt.title('Stacked Bar Chart for Python and Django Activities')
plt.xlabel('Date')
plt.ylabel('Hours Spent')
plt.show()

That shows as:

In the second snippet and graph below, only one category shows up (“Magick”). How do I get the other “Research” category to show? As far as I can tell, the way I handle the data and cast function calls and methods against the two dataframes should work. I’ve been swapping out variable names, tried refactoring, as well as making large and small other changes without success. Who here can identify what I might be missing to get both categories to show (instead of one)? I feel like what I am missing is trivial and obvious. I am hoping another forum member here can point out the blatant error I am making. (My additional intent here is to make the bar graph stack the data rather than overlap it, as I have set out to do with the first graph.)

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt
 
# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600
 
# Copy and filter data for "Magick" activity and calculate rolling means
magick_df = bulk_df[bulk_df["Activity"] == "Magick"].copy()
magick_df.set_index('From', inplace=True)
magick_df['Rolling_Mean_90'] = magick_df['Duration_hours'].rolling('90D').mean()
magick_df['Rolling_Mean_182'] = magick_df['Duration_hours'].rolling('182D').mean()
 
# Copy and filter data for "Research (general)" activity and calculate rolling means
research_df = bulk_df[bulk_df["Activity"] == "Research (general)"].copy()
research_df.set_index('From', inplace=True)
research_df['Rolling_Mean_90'] = research_df['Duration_hours'].rolling('90D').mean()
research_df['Rolling_Mean_182'] = research_df['Duration_hours'].rolling('182D').mean()
 
# Resample data
magick_df_Month = magick_df['Rolling_Mean_90'].resample('MS').sum()
research_df_Month = research_df['Rolling_Mean_90'].resample('MS').sum()
 
# Plot the combined data with wider bars
plt.figure(figsize=(12, 6))
plt.bar(research_df_Month.index, research_df_Month, label='"Research" 90-Day Rolling Mean', width=20, alpha=0.5, color='blue')
plt.bar(magick_df_Month.index, magick_df_Month, label='"Magick" ("Philosophy") 90-Day Rolling Mean',width=20, alpha=0.5, color='red')
 
plt.legend()
plt.title('Stacked Bar Chart for Magick and Research Activities')
plt.xlabel('Date')
plt.ylabel('Hours Spent')
plt.show()

cameron · June 12, 2024, 11:10pm

If you get no good answer here, try asking over in the matplotlib
Discourse: Community - Matplotlib

In the first snippet and graph below, in my Jupyter Notebook pandas
and matplotlib show two categories - - but only partially so. I noticed
that when I change the alpha (translucency) variable, the time spent on
the different categories overlap each other. How do I stack the data
instead? That’s my first question.

Have a look at this example:
https://matplotlib.org/stable/gallery/lines_bars_and_markers/bar_stacked.html

I found this in the docs for pyplot.bar where it says:
“Stacked bars can be achieved by passing individual bottom values per
bar. See Stacked bar chart.”
and links to matplotlib.pyplot.bar — Matplotlib 3.9.0 documentation
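
To make the "bottom values" idea concrete, here is a minimal sketch with two categories. The month labels and hour values are made up for illustration and are not taken from your CSV; only `bar()` and its `bottom=` parameter come from the docs quoted above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up monthly hours for two categories (illustrative values only)
months = ["Jan", "Feb", "Mar"]
python_hours = np.array([10.0, 12.0, 8.0])
django_hours = np.array([5.0, 7.0, 9.0])

fig, ax = plt.subplots()
# First layer sits on the x-axis
ax.bar(months, python_hours, label="Python")
# Second layer starts where the first layer ends
ax.bar(months, django_hours, bottom=python_hours, label="Django")
ax.legend()
```

In a notebook the figure renders on its own; in a script you would finish with `plt.show()`. With `bottom=` in place the bars no longer overlap, so `alpha` is not needed to see both categories.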

In the second snippet and graph below, only one category shows up
(“Magick”). How do I get the other “Research” category to show? As far
as I can tell, the way I handle the data and cast function calls and
methods against the two dataframes should work. I’ve been swapping out
variable names, tried refactoring, as well as making large and small
other changes without success. Who here can identify what I might be
missing to get both categories to show (instead of one)?

Is it possible that the Research category is (a) overlaid by the Magick
category, hiding it or (b) entirely of size 0 or close to 0, making the
bars invisibly small?

I’d try commenting out the Magick .bar() call and seeing what the
graph looks like. If you get Research bars, pay particular attention to
the scales on the axes.
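
Another quick check, before plotting at all, is to print how much data each monthly Series actually holds. This is only a sketch: `summarize` is a hypothetical helper, and the two Series below are made-up stand-ins for `magick_df_Month` and `research_df_Month`.

```python
import pandas as pd

def summarize(name, series):
    """Report whether a Series has anything visible to plot."""
    if series.empty:
        return f"{name}: empty, nothing to plot"
    return f"{name}: {len(series)} points, max {series.max():.2f}"

# Made-up monthly values (illustrative only)
idx = pd.date_range("2023-01-01", periods=3, freq="MS")
magick_month = pd.Series([4.0, 6.5, 3.0], index=idx)
research_month = pd.Series([], index=pd.DatetimeIndex([]), dtype="float64")

print(summarize("Magick", magick_month))
print(summarize("Research", research_month))
```

An empty Series points at the filtering step, while a tiny maximum next to a large one points at a scale problem.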

enoren5 · June 16, 2024, 9:00am

There is data:

It’s not 0. Both the bar graphs have alpha set to 0.50 so I would be able to see both layered on top of each other.

I identified the problem: in my code snippet above, I was filtering the “Activity” column on the label “Research (general)”, which matches no rows. When I changed it to just “Research”, the data began plotting.
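
For anyone who hits the same thing, a label mismatch like this is easy to catch by listing the labels that actually exist and failing loudly when a filter matches nothing. A rough sketch follows; `filter_activity` is a hypothetical helper of my own naming, and the rows are made up rather than taken from my CSV.

```python
import pandas as pd

# Made-up rows standing in for the real CSV (illustrative only)
sample_df = pd.DataFrame({
    "Activity": ["Magick", "Research", "Python", "Research"],
    "Duration_hours": [1.5, 0.5, 2.0, 1.0],
})

def filter_activity(df, label):
    """Return rows for one activity, raising if the label is absent."""
    subset = df[df["Activity"] == label]
    if subset.empty:
        known = sorted(df["Activity"].unique())
        raise ValueError(f"No rows for {label!r}; known activities: {known}")
    return subset.copy()

# Show every label with its row count
print(sample_df["Activity"].value_counts())
research_df = filter_activity(sample_df, "Research")
```

Calling `filter_activity(sample_df, "Research (general)")` raises a `ValueError` that lists the known labels, instead of silently producing an empty plot.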

Here is my end product:

So that resolves my second question.

Thank you for sharing these doc links. At first glance, the variables and arguments used in the sample demos are very different from what I am working with. I’ll continue to experiment with refactoring my code to match what is described in these docs until I manage to successfully stack my bar charts.

If I encounter any more issues I will either report back here or reach out for additional support on the official matplotlib Discourse forum you suggested.

Thanks @cameron

cameron · June 18, 2024, 4:47am

[quote=“Cameron Simpson, post:2, topic:55612, username:cameron”]
Have a look at this example:
Stacked bar chart — Matplotlib 3.9.0 documentation
[…]
[/quote]

Thank you for sharing these doc links. At first glance, the variables and arguments used in the sample demos are very different from what I am working with. I’ll continue to experiment with refactoring my code to match what is described in these docs until I manage to successfully stack my bar charts.

The important bit is this code:

 bottom = np.zeros(3)

 for boolean, weight_count in weight_counts.items():
     p = ax.bar(species, weight_count, width, label=boolean, bottom=bottom)
     bottom += weight_count

The idea is to specify where the bottoms of the bars begin. It starts
with the array bottom all zeroes (the example only uses 3 bars). After
plotting one set of bars it raises all the values in bottom by the size
of the bars it just plotted. You can just add one array/series to
another, eg:

 bottom += research_df_Month

and so forth.

You’d just need to add a bottom= parameter to your own plt.bar()
calls and update it before the next bar plot call.
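
Put together with pandas Series, that might look like the sketch below. The values are made up, and I'm assuming both Series share the same monthly index. That matters because adding two Series aligns them by index, and any month present in only one of them becomes NaN.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly Series on a shared index (illustrative only)
idx = pd.date_range("2023-01-01", periods=3, freq="MS")
monthly = {
    "Research": pd.Series([2.0, 3.0, 1.0], index=idx),
    "Magick": pd.Series([4.0, 1.0, 5.0], index=idx),
}

fig, ax = plt.subplots()
bottom = pd.Series(0.0, index=idx)  # running top of the stack
for label, series in monthly.items():
    ax.bar(idx, series, width=20, bottom=bottom, label=label)
    bottom = bottom + series  # raise the floor for the next layer
ax.legend()
```

If your Series do not cover the same months, reindex them to a common index with `fill_value=0` before the loop.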

enoren5 · June 23, 2024, 7:09am

Hi @cameron,

Thank you for your follow up. Based on the matplotlib doc, but especially the part you highlighted, I partially understand. Even though my end product didn’t resemble yours exactly, I still got it working.

For example, you suggested I use something like: bottom += research_df_Month. I didn’t use the addition assignment operator, but I did end up adding one Series on top of the other for the bottom argument.

Here is my final graph. You’ll notice that I parsed 4 activity categories instead of just the 2 that I was working with previously. Below the image of my final graph, you can see my full working code snippet:

import pandas as pd
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt

# Load the data
bulk_df = pd.read_csv('data/all-comments-removed.csv', parse_dates=["From", "To"])
bulk_df['Duration'] = pd.to_timedelta(bulk_df['Duration'])
bulk_df['Duration_hours'] = bulk_df['Duration'].dt.total_seconds() / 3600

# Copy and filter data for "Magick" activity and calculate rolling means
magick_df = bulk_df[bulk_df["Activity"] == "Magick"].copy()
magick_df.set_index('From', inplace=True)
magick_df['Rolling_Mean_90'] = magick_df['Duration_hours'].rolling('90D').mean()
magick_df['Rolling_Mean_182'] = magick_df['Duration_hours'].rolling('182D').mean()

# Copy and filter data for "Research" activity and calculate rolling means
research_df = bulk_df[bulk_df["Activity"] == "Research"].copy()
research_df.set_index('From', inplace=True)
research_df['Rolling_Mean_90'] = research_df['Duration_hours'].rolling('90D').mean()
research_df['Rolling_Mean_182'] = research_df['Duration_hours'].rolling('182D').mean()

# Resample data
magick_df_Month = magick_df['Rolling_Mean_90'].resample('MS').sum()
research_df_Month = research_df['Rolling_Mean_90'].resample('MS').sum()

# Copy and filter data for "Python" activity and calculate rolling means
python_df = bulk_df[bulk_df["Activity"] == "Python"].copy()
python_df.set_index('From', inplace=True)
python_df['Rolling_Mean_90'] = python_df['Duration_hours'].rolling('90D').mean()
python_df['Rolling_Mean_182'] = python_df['Duration_hours'].rolling('182D').mean()

# Copy and filter data for "Django" activity and calculate rolling means
django_df = bulk_df[bulk_df["Activity"] == "Django"].copy()
django_df.set_index('From', inplace=True)
django_df['Rolling_Mean_90'] = django_df['Duration_hours'].rolling('90D').mean()
django_df['Rolling_Mean_182'] = django_df['Duration_hours'].rolling('182D').mean()

# Resample data
python_df_Month = python_df['Rolling_Mean_90'].resample('MS').sum()
django_df_Month = django_df['Rolling_Mean_90'].resample('MS').sum()

# Create a common index
common_index = magick_df_Month.index.union(research_df_Month.index).union(python_df_Month.index).union(django_df_Month.index)

# Reindex all Series to the common index
magick_df_Month = magick_df_Month.reindex(common_index, fill_value=0)
research_df_Month = research_df_Month.reindex(common_index, fill_value=0)
python_df_Month = python_df_Month.reindex(common_index, fill_value=0)
django_df_Month = django_df_Month.reindex(common_index, fill_value=0)

# Plot the combined data with stacked bars
plt.figure(figsize=(14, 8))

# Plot first activity
plt.bar(magick_df_Month.index, magick_df_Month, label='"Magick" ("Philosophy")', width=20, alpha=1, color='#3D9E60')

# Plot second activity, stacked on the first
plt.bar(research_df_Month.index, research_df_Month, label='"Research"', width=20, alpha=1, color='#2A6D20', bottom=magick_df_Month)

# Plot third activity, stacked on the first and second
plt.bar(django_df_Month.index, django_df_Month, label='"Django"', width=20, alpha=1, color='#FB947E', bottom=magick_df_Month + research_df_Month)

# Plot fourth activity, stacked on the first, second, and third
plt.bar(python_df_Month.index, python_df_Month, label='"Python"', width=20, alpha=1, color='#D95134', bottom=magick_df_Month + research_df_Month + django_df_Month)

plt.legend()
plt.title('Stacked Bar Chart for Python, Django, Magick, Research Activities 90-Day Rolling Mean')
plt.xlabel('Month')
plt.ylabel('Hours Spent')
plt.show()
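
The four near-identical blocks above could probably be folded into one helper and a loop. The following is an untested-against-my-real-data sketch, not a drop-in replacement: `monthly_rolling` is a hypothetical helper, the rows are made up, and I added `sort_index()` on the assumption that time-based rolling windows need a sorted index.

```python
import matplotlib.pyplot as plt
import pandas as pd

def monthly_rolling(df, activity, window="90D"):
    """Monthly sum of the rolling mean for one activity, as in the code above."""
    sub = df[df["Activity"] == activity].set_index("From").sort_index()
    return sub["Duration_hours"].rolling(window).mean().resample("MS").sum()

# Made-up rows standing in for the real CSV (illustrative only)
sample_df = pd.DataFrame({
    "Activity": ["Python", "Django", "Python", "Django"],
    "From": pd.to_datetime(["2023-01-05", "2023-01-10", "2023-02-03", "2023-03-07"]),
    "Duration_hours": [1.0, 2.0, 3.0, 4.0],
})

activities = ["Python", "Django"]
# concat aligns every Series on the union of their indexes; gaps become 0
monthly = pd.concat(
    {a: monthly_rolling(sample_df, a) for a in activities}, axis=1
).fillna(0)

fig, ax = plt.subplots(figsize=(14, 8))
bottom = pd.Series(0.0, index=monthly.index)  # running top of the stack
for name in activities:
    ax.bar(monthly.index, monthly[name], width=20, bottom=bottom, label=name)
    bottom = bottom + monthly[name]
ax.legend()
```

Adding a fifth activity would then only mean appending its label to `activities`. `pd.concat(..., axis=1)` takes over the job of the manual common index and `reindex` calls.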