How to shorten assigning many variables in Python code?

Vanphu_sdv · June 1, 2023, 3:11pm

Hi all,
I’m a new member using Python.
I have an example that plots many boxplots to compare each data column across many testers.
My code already works, but there is a problem.
We have a lot of columns to check.
If we assign each variable manually, line by line, this example needs 10 data columns * 5 testers = 50 lines of manual assignments.
How can we assign variables for 100 data columns with 10 testers?

import matplotlib.pyplot as plt
import pandas as pd
 
file = 'E:/Document/python/Book1.xlsx'
df = pd.read_excel(file)
data1_tester1 = df[df['Tester']=='Tester 1']['Data1']
data1_tester2 = df[df['Tester']=='Tester 2']['Data1']
data1_tester3 = df[df['Tester']=='Tester 3']['Data1']
data1_tester4 = df[df['Tester']=='Tester 4']['Data1']
data1_tester5 = df[df['Tester']=='Tester 5']['Data1']
data2_tester1 = df[df['Tester']=='Tester 1']['Data2']
data2_tester2 = df[df['Tester']=='Tester 2']['Data2']
data2_tester3 = df[df['Tester']=='Tester 3']['Data2']
data2_tester4 = df[df['Tester']=='Tester 4']['Data2']
data2_tester5 = df[df['Tester']=='Tester 5']['Data2']
ax1 = plt.subplot(2,1,1)
ax1.boxplot([data1_tester1,data1_tester2,data1_tester3,data1_tester4,data1_tester5])
ax1.set_xticklabels('')
ax2 = plt.subplot(2,1,2)
ax2.boxplot([data2_tester1,data2_tester2,data2_tester3,data2_tester4,data2_tester5],labels=['Tester 1','Tester 2','Tester 3','Tester 4','Tester 5'])
plt.show()

Here is my code result: [image: Untitled]
My file: [attachment not included]

MRAB · June 1, 2023, 4:54pm

Have you considered using lists and loops?

data1_testers = []

for i in range(5):
    data1_testers.append(df[df['Tester']==f'Tester {i + 1}']['Data1'])

data2_testers = []

for i in range(5):
    data2_testers.append(df[df['Tester']==f'Tester {i + 1}']['Data2'])

Or something similar.

Vanphu_sdv · June 1, 2023, 5:52pm

It worked. Thanks so much!
But how can I use a for loop over the columns?
In this example I have 10 data columns, so I could write the for loop 10 times.
If the DataFrame is bigger (about 100 columns), how can I do it?

MRAB · June 1, 2023, 6:32pm

Here’s a more general solution:

import matplotlib.pyplot as plt
import pandas as pd

num_testers = 5
num_data = 2

file = 'E:/Document/python/Book1.xlsx'
df = pd.read_excel(file)

data_tester = []

for data_index in range(num_data):
    data_tester.append([])

    for tester_index in range(num_testers):
        data_tester[-1].append(df[df['Tester'] == f'Tester {tester_index + 1}'][f'Data{data_index + 1}'])

for data_index in range(num_data):
    ax = plt.subplot(num_data, 1, data_index + 1)

    if data_index == num_data - 1:
        ax.boxplot(data_tester[data_index], labels=[f'Tester {tester_index + 1}' for tester_index in range(num_testers)])
    else:
        ax.boxplot(data_tester[data_index])
        ax.set_xticklabels('')

plt.show()

Vanphu_sdv · June 4, 2023, 1:55am

Thanks so much.
This is the result I wanted.

import matplotlib.pyplot as plt
import pandas as pd

file = 'E:/Document/python/Book1.xlsx'
df = pd.read_excel(file)
cols = df.columns[2:12]
testers = ['Tester 1', 'Tester 2','Tester 3', 'Tester 4', 'Tester 5']
# len() gives the number of items directly; NumPy is not needed for this
num_data = len(cols)
num_testers = len(testers)
data_tester = []

for data_index in cols:
    data_tester.append([])
    
    for tester_index in testers:
        data_tester[-1].append(df[df['Tester']==tester_index][data_index])

fig = plt.figure(figsize= (10,30))

for data_index in range (num_data):
    ax = plt.subplot(num_data, 1, data_index + 1)
    ax.set_title(cols[data_index],loc='left')
    if data_index == num_data - 1:
        ax.boxplot(data_tester[data_index], labels=testers)
    else:
        ax.boxplot(data_tester[data_index])
        ax.set_xticklabels('')
plt.tight_layout(h_pad=2)
plt.show()

MRAB · June 4, 2023, 2:11am

You could determine the number of testers from the spreadsheet.
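
As a minimal sketch of that suggestion, the tester names and the data columns can both be read from the DataFrame instead of being typed by hand. The small in-memory frame below is a made-up stand-in for Book1.xlsx, and the sketch assumes that every data column name starts with "Data".

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet; the real data comes from pd.read_excel.
df = pd.DataFrame({
    "No": [1, 2, 3, 4],
    "Tester": ["Tester 1", "Tester 1", "Tester 2", "Tester 2"],
    "Data1": [1.0, 2.0, 3.0, 4.0],
    "Data2": [5.0, 6.0, 7.0, 8.0],
})

# Unique tester names, in order of first appearance.
testers = list(df["Tester"].unique())

# Assumption: every data column name starts with "Data".
cols = [c for c in df.columns if c.startswith("Data")]

num_testers = len(testers)
num_data = len(cols)
```

With these two lists, the loops above no longer depend on a hard-coded number of testers or columns.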

CAM-Gerlach · June 5, 2023, 4:34am

Generally speaking, reaching for a for loop with vectorized data structures (Numpy/Pandas) in cases like this is a serious and well-known beginner antipattern, as it will generally be much less efficient (sometimes by millions of times or more), concise and idiomatic than just using the native Numpy/Pandas operators, and throws away most of the benefits of using Numpy/Pandas in the first place.

Using some basic Pandas operations, we can massage our data into the format needed for the base Matplotlib boxplot, without the need for a tortuous and inefficient for loop. Specifically, we convert the data from wide to long format (i.e., making each “DataN” a separate row rather than a column) using pd.wide_to_long, group on the “DataN” number using df.groupby, create our subplots up front per the number of groups, and finally iterate over the group subplots, plotting a boxplot of “Data” for each, grouped by “Tester”. This is much simpler, more concise and more idiomatic than the previous approach, and will also likely be much faster on larger dataframes:

import matplotlib.pyplot as plt
import pandas as pd

file = 'Book1_data.xlsx'
df = pd.read_excel(file)

groups = pd.wide_to_long(df, stubnames=["Data"], i=["No", "Tester"], j="N").groupby("N")
figure, axes = plt.subplots(len(groups), 1, figsize=(10, 40))
for (group_n, group_df), ax in zip(groups, axes):
    ax.boxplot(group_df["Data"].groupby("Tester").apply(list))
    ax.set_title(f"Data {group_n}")
plt.show()

Plot image
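
To make the reshaping step easier to follow, here is a small sketch of what pd.wide_to_long and the two groupby calls produce, without any plotting. The four-row frame is made up for illustration and only assumes the same column names as above ("No", "Tester", "Data1", "Data2").

```python
import pandas as pd

# Made-up sample with the same column layout as the spreadsheet.
df = pd.DataFrame({
    "No": [1, 2, 3, 4],
    "Tester": ["Tester 1", "Tester 1", "Tester 2", "Tester 2"],
    "Data1": [1.0, 2.0, 3.0, 4.0],
    "Data2": [5.0, 6.0, 7.0, 8.0],
})

# Wide -> long: the "Data1"/"Data2" columns collapse into a single "Data"
# column, and the numeric suffix becomes a new index level called "N".
long_df = pd.wide_to_long(df, stubnames=["Data"], i=["No", "Tester"], j="N")

# For each N, collect one list of values per tester; this is the
# list-of-lists shape that ax.boxplot is given in the code above.
per_tester = {
    n: group["Data"].groupby("Tester").apply(list)
    for n, group in long_df.groupby("N")
}

print(long_df)
print(per_tester[1])
```

Each entry of per_tester is a Series indexed by tester name, so the number of testers and data columns never has to be written down.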

However, this itself is really another XY problem, as you’ve helpfully made clear in your complete and detailed explanation (thanks!). Since your actual goal is just a boxplot by tester and “DataN”, you can achieve the same or better result with one line of code using Pandas’ built-in DataFrame.plot.box method:

import pandas as pd

file = 'Book1.xlsx'
df = pd.read_excel(file)

df.plot.box(column=[c for c in df.columns if "Data" in c], by="Tester", layout=(-1, 1), figsize=(10, 40))

Plot image

This method can be passed a Matplotlib object (for example an existing Axes via its ax parameter) and returns one, so the result is just like your existing Matplotlib plot, just with far less work.
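
As a rough illustration of that last point, the sketch below draws onto an Axes created beforehand and then customizes the returned object with ordinary Matplotlib calls. It uses made-up data and the simple case without by; what is returned when by is given is not checked here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"Data1": [1.0, 2.0, 3.0, 4.0], "Data2": [5.0, 6.0, 7.0, 8.0]})

# Create the Axes ourselves and hand it to pandas.
fig, ax = plt.subplots(figsize=(6, 4))
returned = df.plot.box(ax=ax)

# The returned object is a regular Matplotlib Axes, so the usual methods apply.
returned.set_title("Data1 and Data2")
returned.set_ylabel("Value")
```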

Vanphu_sdv · June 6, 2023, 12:04am

I’m using Pandas’ built-in df.boxplot().
How can I fill each tester’s box with a different color?
E.g. Tester 1 filled red, Tester 2 filled blue, Tester 3 filled green, …

CAM-Gerlach · June 6, 2023, 12:41am

I’m not aware of a simple way to do that natively with either the Pandas boxplot methods or the Matplotlib plotting functions that they wrap; e.g. this approach on SO might work but involves a lot of complexity.
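
For reference, one Matplotlib-only route (not necessarily the one in the linked answer) is to pass patch_artist=True to ax.boxplot and set the face color of each returned box. This is a minimal sketch with made-up data and an arbitrary color list, and it applies to a plain ax.boxplot call rather than to df.boxplot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data: one list of values per tester.
data = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
tester_names = ["Tester 1", "Tester 2", "Tester 3"]
colors = ["red", "blue", "green"]  # arbitrary choice, one color per tester

fig, ax = plt.subplots()

# patch_artist=True draws each box as a fillable patch instead of plain lines.
result = ax.boxplot(data, patch_artist=True)

# result["boxes"] holds one patch per box, in the same order as the data.
for box, color in zip(result["boxes"], colors):
    box.set_facecolor(color)

ax.set_xticklabels(tester_names)
```

Call plt.show() with an interactive backend to display the figure.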

However, this is easy to do with Seaborn, a higher-level interface to Matplotlib, which also produces nicer-looking plots overall by default. Just swap out df.boxplot for seaborn.catplot, converting your columns from wide to long first, and each tester will be given its own color automatically:

import pandas as pd
import seaborn

file = 'Book1_data.xlsx'
df = pd.read_excel(file)
df_plot = pd.wide_to_long(df, stubnames="Data", i="No", j="N").reset_index()
seaborn.catplot(data=df_plot, x="Tester", y="Data", row="N", kind="box", aspect=1.5, sharey=False)

Resulting plot

See the catplot documentation for more details and examples, including how to set which colors are used for which tester.