How to create a dataframe from 3 columns of a dataset?

Linda · December 25, 2024, 9:19pm

I need to create a data frame from the values in the columns of a dataset.

From

I want to create a data frame:

column ‘Year’ - row ‘Normal weight’ with the value ‘Estimate’

I tried many ways, but each one fails with the error message ValueError: Index contains duplicate entries, cannot reshape.

I want to visualize the line graph over the years.

Thanks in advance for your comments.

onePythonUser · December 27, 2024, 4:41am

Hello,

Have you tried using the pandas library?

Which ways have you tried?

petercordia · December 27, 2024, 9:34am

This looks like you’re using pandas in a jupyter environment, but I can’t be certain unless you show us some of the code.

If you’re using polars, that’s great and please keep using it, but pandas is fine too.

Assuming you’re using pandas, can you show us the output of df.dtypes?

Linda · December 27, 2024, 11:21am

I used a for loop, but I can’t iterate through the column Panel.

Linda · December 27, 2024, 11:25am

Yes, I am using pandas. The column types are below.

petercordia · December 27, 2024, 2:59pm

OK, I think I know approximately what the problem is,
but it would be helpful if you posted the actual code you tried. (You need to post the code between ``` to make sure it formats well on the forum.)

Here’s how I would troubleshoot this:

First, filter down your dataframe to the minimal data you need to get this error.
This should only be about 5 rows and 3 columns.

Then you can use the string that is returned by filtered_down_df.to_dict() to instantiate the dataframe instead of loading it from wherever you’re loading it from.

EDIT (I forgot dataframes have tricky repr):
You can use the fact that pd.DataFrame(df.to_dict()) is the same as df to create minimal self-contained code.

At this point you should be able to fit the (self-contained) code that you need to reproduce your error into about 15 lines.
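The filter-then-to_dict steps above could look roughly like the sketch below. The dataframe here is a hypothetical stand-in (the column names and the UNUSED column are assumptions, not the real dataset); the point is only the round trip through to_dict().

```python
import pandas as pd

# Hypothetical stand-in for the full dataset; column names are assumed.
df = pd.DataFrame({
    "PANEL": ["Normal weight", "Normal weight", "Obesity", "Obesity", "Obesity"],
    "YEAR": ["1988-1994", "1999-2002", "1988-1994", "1999-2002", "1999-2002"],
    "ESTIMATE": [41.6, 33.0, 22.9, 30.4, 30.4],
    "UNUSED": [0, 1, 2, 3, 4],
})

# Step 1: keep only the columns (and a handful of rows) needed to show the problem.
small = df[["PANEL", "YEAR", "ESTIMATE"]].head(5)

# Step 2: print a dict literal that can be pasted into a forum post.
as_dict = small.to_dict()
print(as_dict)

# Step 3: anyone can rebuild the same dataframe from that literal.
rebuilt = pd.DataFrame(as_dict)
print(rebuilt)
```

Pasting the printed dict into `pd.DataFrame(...)` gives readers the same five rows without needing the original file.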

Now go to perplexity.ai and tell the chatbot what code you’re trying to run, and what error message results. There is a good chance the chatbot will be able to tell you how to fix the problem.

If the chatbot doesn’t know, post those 15 lines of code on this forum, in this thread.

Linda · December 27, 2024, 3:16pm

Thanks Peter.
Actually, I wanted to post 3 screenshots in my question but the forum didn’t allow me as a new joiner.

Based on the dataset, I want to create the data frame as shown below. While I was waiting for any responses, I manually created a dictionary. But now I will try your way.

petercordia · December 27, 2024, 3:45pm

I understand the restrictions of the forum can be frustrating.

When you post code, could you post it like

this

?

This_is_a_dict_one_can_easilty_copy = {'a': 1, 'b':2}

Select my text and then ‘quote’ and you should be able to see how to format code to make it readable.
You write a newline, then three ` characters, then a newline, then your code, then again three ` characters.

Please don’t use printscreens for code.

I expect you already have code something like this, and it’s throwing an error. But this is something perplexity.ai generated for me:

import pandas as pd

# Create a DataFrame in long format
data = {
    'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'player': [1, 2, 3, 4, 1, 2, 3, 4],
    'points': [11, 8, 10, 6, 12, 5, 9, 4]
}

df_long = pd.DataFrame(data)

# View the long DataFrame
print("Long DataFrame:")
print(df_long)

# Reshape from long to wide format
df_wide = df_long.pivot(index='team', columns='player', values='points')

# View the updated wide DataFrame
print("\nWide DataFrame:")
print(df_wide)

Note you can easily copy and paste this into your jupyter notebook to test it, which would not be possible with a printscreen.

(I did test the code)
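The example above works because every (team, player) pair occurs once. As a rough illustration of how the "Index contains duplicate entries" error can arise, here is a sketch with made-up data in which one pair is repeated. Using pivot_table with a mean is just one possible way to handle duplicates, not necessarily the right one for your dataset.

```python
import pandas as pd

# Made-up data: team A / player 1 appears twice, so the
# (index, columns) combination is not unique.
df_long = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "player": [1, 1, 2, 1, 2],
    "points": [11, 8, 10, 12, 5],
})

# pivot() refuses to guess which of the duplicate values to keep.
try:
    df_long.pivot(index="team", columns="player", values="points")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# List the rows whose (team, player) pair occurs more than once.
dupes = df_long[df_long.duplicated(subset=["team", "player"], keep=False)]
print(dupes)

# pivot_table() aggregates duplicates instead of raising; "mean" is one choice.
df_wide = df_long.pivot_table(
    index="team", columns="player", values="points", aggfunc="mean"
)
print(df_wide)
```

Looking at `dupes` first is usually worthwhile: it shows whether the repeated rows are true duplicates or belong to different subgroups that should be filtered rather than averaged.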

onePythonUser · December 27, 2024, 3:56pm

Hello,

here is a sample test script that you can use for reference. I have populated a partial listing but you can expand upon it to get the table you’re looking for.

import os
import pandas as pd

# Create folder on Desktop for testing purposes, then switch into it
# (os.chdir fails if the folder does not exist yet)
newpath = r'C:\Desktop\Temp_Directory'
os.makedirs(newpath, exist_ok=True)
os.chdir(newpath)

# Define column titles
col_titles = ['YEAR', 'Normal_weight', 'Overweight_or_obese']

# Define data to be displayed
data = [[' 1988-1994',  41.5, 56.0], 
        [' 1999-2002',  33.0, 65.4],
        [' 2001-2004',  32.3, 66.0],
        [' 2003-2006',  31.6, 66.7],
        [' 2005-2008',  30.8, 67.5],
        [' 2007-2010',  29.8, 68.5]]   

# Create pandas dataframe
df = pd.DataFrame(data, columns = col_titles)

# Print the dataframe
print('\n', df)

# Write info to a file (passing the path lets pandas handle newlines itself):
csv_path = 'C:/Desktop/Temp_Directory/obesity_study_results.csv'

df.to_csv(csv_path, encoding='UTF-8')

Linda · December 27, 2024, 4:10pm

Hi Peter,

Thanks again.
I wrote the code below by copying each value from the csv file by hand. Is there any way to create the data frame I want from the existing dataset?

Additionally, I tried the method data.to_dict() but it didn’t work.

data_3 = {
    'YEAR':[ '1988-1994', '1999-2002', '2001-2004', '2003-2006', '2005-2008','2007-2010', '2009-2012', '2011-2014', '2013-2016', '2015-2018'],
    'Normal_weight': [41.6, 33, 32.3, 31.6, 30.8, 29.8, 29.6, 28.9, 27.7, 26],
    'Overweight_or_obese': [56, 65.1, 66, 66.7, 67.5, 68.5, 68.7, 69.5, 70.9, 72.5], 
    'Obesity': [22.9, 30.4, 31.4, 33.4, 34, 34.7, 35.3, 36.4, 38.8, 41.1], 
    'Grade_1_obesity': [14.8, 17.9, 19.3, 19.8, 19.5, 19.9, 20.4, 20.6, 21.2, 22],  
    'Grade_2_obesity': [5.2, 7.6, 7.2, 8.2, 8.7, 8.9, 8.6, 8.8, 9.9, 10.5], 
    'Grade_3_obesity': [2.9, 4.9, 5, 5.4, 5.8, 6, 6.3, 6.9, 7.7, 8.6]
}

Linda · December 27, 2024, 4:14pm

Hi Paul,

Thanks for your suggestion. Is there any way to create the data frame I want from the existing dataset, so that I can make a plot later on? I was stuck at this step, so I just copied each value from the csv file to create the dictionary as you saw.

thanks.

onePythonUser · December 27, 2024, 4:24pm

Note that you can plot anything that you want so long as there is a one-to-one correspondence between the x and y values. In the two graphs below, note that x1 and y1 are grouped together for plotting, and they have the same number of elements in their arrays. The same can be stated for the x2 and y2 pairing. Think of it as a zipper relationship. So, if you can arrange your data such that there is a one-to-one correspondence as shown here, you can plot it.

Here is an example:

from matplotlib import pyplot as pl

# Make arrays of x and y data values - Red line graph
x1 = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]

# Make arrays of x and y data value - Green dot graph
x2 = [1, 2, 4, 6, 8]
y2 = [2, 4, 8, 12, 16]

# Define plotting attributes
pl.plot(x1, y1, 'r')
pl.plot(x2, y2, 'go')

# Plot titles
pl.xlabel('X-Value')
pl.ylabel('Y-Value')
pl.title('Plot of X vs. Y')

# Plot boundaries / limits
pl.xlim(0.0, 9.0)
pl.ylim(0.0, 30.)

pl.show()

Here is an example of how you can obtain the data for plotting from the existing dataset that you currently have (note that you have to pay special attention to the indices):

from matplotlib import pyplot as pl

# Create two empty lists - they will become your `x` and `y` values for plotting
x_value = []
y_value = []

# This is your original data table
data = [[' 1988-1994',  41.5, 56.0],
        [' 1999-2002',  33.0, 65.4],
        [' 2001-2004',  32.3, 66.0],
        [' 2003-2006',  31.6, 66.7],
        [' 2005-2008',  30.8, 67.5],
        [' 2007-2010',  29.8, 68.5]]

# Loop through the data to obtain plotting values
# In this example, we will plot years vs. normal weight values
for index in range(len(data)):
    x_value.append(data[index][0])
    y_value.append(data[index][1])

# Print values to verify values in arrays
print(x_value)
print(y_value)

# Define plotting attributes
pl.plot(x_value, y_value, 'go')

pl.xlabel('Year')
pl.ylabel('Normal Weight')
pl.title('Plot of Year vs. Normal Weight')

pl.show()
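If the data is already in a pandas dataframe, the loop can be skipped, because dataframe columns can be passed to plot directly. The sketch below assumes a dataframe shaped like the hand-built dictionary earlier in this thread (only three rows and two of its columns are used here), and draws one line per column.

```python
from matplotlib import pyplot as pl
import pandas as pd

# A few rows in the same layout as the data_3 dictionary from this thread
df = pd.DataFrame({
    "YEAR": ["1988-1994", "1999-2002", "2001-2004"],
    "Normal_weight": [41.6, 33, 32.3],
    "Obesity": [22.9, 30.4, 31.4],
})

fig, ax = pl.subplots()

# One line per column; YEAR supplies the x values for each line
for column in ["Normal_weight", "Obesity"]:
    ax.plot(df["YEAR"], df[column], marker="o", label=column)

ax.set_xlabel("Year")
ax.set_ylabel("Estimate")
ax.set_title("Estimates by survey period")
ax.legend()
```

In a Jupyter notebook the figure is displayed when the cell finishes; in a plain script, add pl.show() at the end.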

petercordia · December 27, 2024, 4:56pm

If data is a dataframe, data.to_dict() should work, as demonstrated below:

import pandas as pd
data_3 = {
    'YEAR':[ '1988-1994', '1999-2002', '2001-2004', '2003-2006', '2005-2008','2007-2010', '2009-2012', '2011-2014', '2013-2016', '2015-2018'],
    'Normal_weight': [41.6, 33, 32.3, 31.6, 30.8, 29.8, 29.6, 28.9, 27.7, 26],
    'Overweight_or_obese': [56, 65.1, 66, 66.7, 67.5, 68.5, 68.7, 69.5, 70.9, 72.5], 
    'Obesity': [22.9, 30.4, 31.4, 33.4, 34, 34.7, 35.3, 36.4, 38.8, 41.1], 
    'Grade_1_obesity': [14.8, 17.9, 19.3, 19.8, 19.5, 19.9, 20.4, 20.6, 21.2, 22],  
    'Grade_2_obesity': [5.2, 7.6, 7.2, 8.2, 8.7, 8.9, 8.6, 8.8, 9.9, 10.5], 
    'Grade_3_obesity': [2.9, 4.9, 5, 5.4, 5.8, 6, 6.3, 6.9, 7.7, 8.6]
}
data = pd.DataFrame(data_3)

print(data.to_dict())

It is possible to create the dataframe you want from the existing data, but it’s hard to demonstrate without having access to an example dataframe.

Linda · December 27, 2024, 5:16pm

Hi Peter,

I don’t know how to share a part of the dataset, so I copied some rows from the csv file below to show you how the data is laid out. With the data arranged this way in the csv, I have no idea how to create the data frame. Please help.

Thanks.

PANEL	YEAR	ESTIMATE
Normal weight (BMI from 18.5 to 24.9)	1988-1994	41.6
Normal weight (BMI from 18.5 to 24.9)	1999-2002	33
Normal weight (BMI from 18.5 to 24.9)	2001-2004	32.3
Normal weight (BMI from 18.5 to 24.9)	2003-2006	31.6
Normal weight (BMI from 18.5 to 24.9)	2005-2008	30.8
Normal weight (BMI from 18.5 to 24.9)	2007-2010	29.8
Normal weight (BMI from 18.5 to 24.9)	2009-2012	29.6
Normal weight (BMI from 18.5 to 24.9)	2011-2014	28.9
Normal weight (BMI from 18.5 to 24.9)	2013-2016	27.7

petercordia · December 27, 2024, 5:32pm

How did you create the printscreen in the OP if you didn’t load the csv into a dataframe?

You can load a csv file into a pandas dataframe with df = pd.read_csv(csv_file_path)
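Once the file is loaded, a year-by-category table can probably be built with DataFrame.pivot. The sketch below is based on guesses about your file: PANEL, YEAR and ESTIMATE are the columns from your excerpt, but the GROUP column, the "Men" rows and their values, and the shortened "Obesity" label are hypothetical. They stand in for whatever extra column makes the same PANEL/YEAR pair appear more than once, which is one likely cause of the duplicate-entries error.

```python
import io
import pandas as pd

# Stand-in for the real file. The GROUP column and the "Men" rows
# (with placeholder values) are hypothetical.
csv_text = """PANEL,YEAR,ESTIMATE,GROUP
Normal weight (BMI from 18.5 to 24.9),1988-1994,41.6,Total
Normal weight (BMI from 18.5 to 24.9),1999-2002,33,Total
Normal weight (BMI from 18.5 to 24.9),2001-2004,32.3,Total
Obesity,1988-1994,22.9,Total
Obesity,1999-2002,30.4,Total
Obesity,2001-2004,31.4,Total
Normal weight (BMI from 18.5 to 24.9),1988-1994,40.0,Men
Normal weight (BMI from 18.5 to 24.9),1999-2002,35.0,Men
"""

# With a real file this would be pd.read_csv(csv_file_path)
df = pd.read_csv(io.StringIO(csv_text))

# Keep a single row per (YEAR, PANEL) pair before reshaping
total = df[df["GROUP"] == "Total"]

# Years become the rows, each PANEL value becomes its own column
wide = total.pivot(index="YEAR", columns="PANEL", values="ESTIMATE")
print(wide)
```

With the years as the index, `wide.plot()` would draw one line per category; `wide.T` gives the transposed layout (categories as rows, years as columns) if that is what you need.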

onePythonUser · December 27, 2024, 5:41pm

As an aside … do you have your data upside down regarding the BMI estimated values? I am almost certain that obesity rates (BMI) have been going up, not down.

Linda · December 27, 2024, 5:50pm

Hi Paul,
What does ‘go’ mean? I copied your code and it shows the green dot.
Thanks.

onePythonUser · December 27, 2024, 5:52pm

go = green “o” (or green circle dots)

replace it with ro, and see what you get

Linda · December 27, 2024, 5:52pm

Hi Peter,

I know how to load the csv file into pandas, but I don’t know how to give you access to the csv file. That’s why I copied a part of it and pasted it here.

petercordia · December 27, 2024, 6:27pm

Yes, what I meant was doing something like

df = pd.read_csv(...)
filtered_df = remove_some_columns_and_rows(df)
print(filtered_df.to_dict())

and then posting the result of the print here.

Though you said to_dict() was throwing an error?