Custom Discriptive Analysis

Hi, I am Barath

I am new to Python, I have experience with R and SPSS but I am trying to automate the Basic descriptive analysis like Custom Table in SPSS where we get 2 * 2 table for categorical variables(n & %) and we can add the number of Independent variables in Row as given below.

Screenshot 2024-03-11 160746

My Code

import pandas as p
import numpy as np

d = {“Gender”: [1,1,1,2,2,1,1,2,1,2,1,2,2,1,1], “Death”: [2,2,1,1,2,1,2,1,1,2,2,1,1,1,2], “Income” : [1,1,3,3,1,2,3,1,1,2,2,3,1,3,2]}

my_df=pd.DataFrame(d)
print(my_df)

########### Final Two-way Custom table
‘’‘’‘’’
def frequency_table(my_df ,column):

freq=my_df[column].value_counts(dropna=True).sort_index()              # to get the frequncy
# to get the percentage
percent= freq/freq.sum() * 100
# all values converting to integer data type


percent = [int(i) for i in percent]
if 'Death' in my_df:
  for sur in my_df['Death']:

    if (sur ==1):
      frq_sur_0=my_df['Death'].value_counts(dropna=True).sort_index()
      per_sur_0= frq_sur_0/frq_sur_0.sum() * 100
      # all values converting to integer data type
      per_sur_0 = [int(i) for i in per_sur_0]


    if (sur ==2):
      frq_sur_1=my_df['Death'].value_counts(dropna=True).sort_index()
      per_sur_1= frq_sur_1/frq_sur_1.sum() * 100
      # all values converting to integer data type
      per_sur_1 = [int(i) for i in per_sur_1]

freq_table = pd.DataFrame({
'n':freq,
'%':percent,
'Case(n)' : frq_sur_0,
'Case(%)' : per_sur_0,
'Control(n)' : frq_sur_1,
'Control(%)' : per_sur_1
})
return freq_table

#freq_table = freq_table.apply(f, axis=0)
cols=[‘Gender’, ‘Income’] #‘sym’,
for column in cols:
print(f"Frequency Table for {column}:\n", frequency_table(my_df, column),“\n”)
‘’‘’‘’‘’

when I run the code I get the error says : ValueError: All arrays must be of the same length

  1. I did not able to give the correct Frequency and Percentage for the Column variable.
    I am trying to fix it but finding little difficult, It would be helpful If I could get any help.

Hi,

when uploading your code, can you please make sure that the ENTIRE code is pasted within the three back ticks as shown here:

format_code

By the way, in Python, parentheses are not required in conditional statements. For example:

if (sur ==2):

can just be:

if sur == 2:

This way, if volunteers wish to help you by test running your code, they can just click copy. Plus, it helps to verify that the code was entered with correct indentations.

2 Likes

Hi,

I modified the program a bit. Here, I have broken down your function into three distinct functions.
One for calculating the data, one for calculating the percentages, and one for displaying the results. In this approach, there is no need for importing external modules/packages. From your table, there was a correlation between deaths and gender and deaths and income levels. I made the assumption that correlation was a function of position. From this, I obtained two correlation lists:

death_income = list(zip(income_shift, data['Death']))
death_gender = list(zip(data['Gender'], data['Death']))

Because I used a match case statement for simplification, I needed unique row/column pairs. Therefore, I shifted the values of the original Income data points by two.

Here is the code:

# Assumption: Value column positions are correlated
data = {'Gender' : [1,1,1,2,2,1,1,2,1,2,1,2,2,1,1], # Male = 1, Female = 2
        'Death'  : [2,2,1,1,2,1,2,1,1,2,2,1,1,1,2], # Death Yes = 1, Death No = 2
        'Income' : [1,1,3,3,1,2,3,1,1,2,2,3,1,3,2]} # Low = 1, Medium = 2, High = 3

# Shifted up by two so that it conforms to the row/column approach for unique cells
income_shift = [x + 2 for x in data['Income']] # Shift values up by two

# Correlate deaths with both gender and income levels
death_income = list(zip(income_shift, data['Death']))
death_gender = list(zip(data['Gender'], data['Death']))

total_data = death_income + death_gender

def calculate_data(data):

    male_death_yes = 0
    male_death_no = 0
    female_death_yes = 0
    female_death_no = 0

    income_high_death_yes = 0
    income_high_death_no = 0
    income_medium_death_yes = 0
    income_medium_death_no = 0
    income_low_death_yes = 0
    income_low_death_no = 0   

    for r, c in data:

        match r, c:

            case 1, 1:
                male_death_yes += 1  
                
            case 1, 2:
                male_death_no += 1

            case 2, 1:
                female_death_yes += 1
                
            case 2, 2:
                female_death_no += 1 

            case 3, 1:
                income_high_death_yes += 1

            case 3, 2:
                income_high_death_no += 1
                
            case 4, 1:
                income_medium_death_yes += 1

            case 4, 2:
                income_medium_death_no += 1

            case 5, 1:
                income_low_death_yes += 1

            case 5, 2:
                income_low_death_no += 1            

            case _: raise ValueError("Not a valid datapoint pair!")

    return [male_death_yes, male_death_no, female_death_yes, female_death_no,
            income_high_death_yes, income_high_death_no, income_medium_death_yes, income_medium_death_no,
            income_low_death_yes, income_low_death_no]
   
def display_data_table(data, percentages):

    print('\nVariable            Death')
    print('--------            -----')
    print('\nGender       Yes n(%)     No n(%)')
    print('-'*36)
    print('Male             {}({:3.1f})     {}({:3.1f})'.format(data[0], percentages[0], data[1], percentages[1]))
    print('Female           {}({:3.1f})     {}({:3.1f})'.format(data[2], percentages[2], data[3], percentages[3]))
    print('\nIncome')
    print('------')    
    print('High             {}({:3.1f})     {}({:3.1f})'.format(data[4], percentages[4], data[5], percentages[5]))
    print('Medium           {}({:3.1f})     {}({:3.1f})'.format(data[6], percentages[6], data[7], percentages[7]))
    print('Low              {}({:3.1f})     {}({:3.1f})'.format(data[8], percentages[8], data[9], percentages[9]))

def calculate_percentages(data):

    male_yes = 100*data[0]/(data[0] + data[2])
    male_no = 100*data[1]/(data[1] + data[3])
    female_yes = 100*data[2]/(data[2] + data[0])
    female_no = 100*data[3]/(data[3] + data[1])

    high_yes = 100*data[4]/(data[4] + data[6] + data[8])
    high_no = 100*data[5]/(data[5] + data[7] + data[9])
    medium_yes = 100*data[6]/(data[6] + data[4] + data[8])
    medium_no = 100*data[7]/(data[7] + data[5] + data[9])
    low_yes = 100*data[8]/(data[8] + data[4] + data[6])
    low_no = 100*data[9]/(data[9] + data[5] + data[7])    

    return [male_yes, male_no, female_yes, female_no, high_yes, high_no,
            medium_yes, medium_no, low_yes, low_no]
    
# Call functions
data_calc_results = calculate_data(total_data)      
data_percentages = calculate_percentages(data_calc_results)
display_data_table(data_calc_results, data_percentages)

Hope this helps you.

Cheers!

1 Like

Thank you for the valuable inputs @onePythonUser

Do we always have to State columns and row specifically, or is there any way to just can we give number of columns or rows dynamically, because if we want to do descriptive analysis for a data with more numbers of variables, stating each and every variable details will be a time consuming, is there any specific way or topic that you can suggest to help me, thank you again for the input.

Hi,

Do you mean more potential data combination outcomes? For example, right now there are;

  1. five row categories:
    a. male
    b. female
    c. low
    d. medium
    e. high
  2. four column categories:
    a. death yes
    b. death no
    c. a percentage value for each of the death values (two).

You would like to add more categories? More rows and column categories on the fly?

Hi @onePythonUser,

For simplicity, I want the table of Death in a Column and In a row, I have other 5 or 6 Variables to look, at (Gender, Income, Age_Category(<=18,19-35,36-40,>=41), Educated(yes/no)), I am just planning to look like what if in future I have a data set with more than 20-Variables and I want the Custom table for all of it with fixed Column variable(like Death), How can able to give all the variables in row aginst Death variable.

Hello,

if we decide to approach this problem in a more general way, I suppose we can make use of nested for loops along with enumeration (so that we have explicit (row, column) pairs) as shown by the following script:

row_categories = ['Male', 'Female', 'High', 'Medium', 'Low']
column_categories = ['Death_Yes', 'Death_No', 'Percentage_Yes', 'Percentage_No']

for row_num, row_category in enumerate(row_categories):
    print('\n---- Begin combination ----')
    print('        row,     column')
    

    for col_num, col_category in enumerate(column_categories):

        print('({}, {}), {} - {}'.format(row_num, col_num, row_category, col_category))

Note that for every combination (row, column), you will have to keep track of its corresponding quantity, because each quantity will be a unique value on the table. To avoid creating too many variables, you can potentially keep track of the values in a list or in a dictionary. You can potentially also have nested lists or nested dictionaries to keep track of this data. You will have to make use of indexes if using lists, and key-value pairs if using dictionaries.

Note that if you run the code script above, you will see that it matches the original code above in terms of the possible combinations. This of course can grow exponentially depending on how many more categories that you wish to add.

Regarding printing the table, you can print the header information once, and perform iterations for displaying the corresponding row - column combinations.

I hope that the script above and the aforementioned response is enough to get you thinking of how to tackle this problem.

Good luck!

Thank You @onePythonUser,

I think now it is clear how I should approach this problem, Thank you for your valuable time.