Why does pandas concat give me an incorrect dataframe even though its type and shape are correct?

obyilmaz · September 28, 2021, 2:34pm

I want to do some language processing. My code works properly when I use a single dataframe such as “yor1” or “yor2”. Unfortunately, when I merge these dataframes into one dataframe, “yorumlar”, my code gives this error:


  File "D:\BELGELERİM\programing\4_Datascience\havlu dil işleme\nlp.py", line 44, in <module>
    yorum = re.sub("[^a-zA-Z]"," ", yorumlar["Body"] [i])

  File "C:\Users\oby_pc\anaconda3\lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

Moreover, the shape and type of “yorumlar” are correct. Something is wrong with the pd.concat operation, but I couldn’t solve it. All the files that I am trying to concatenate have the same format. My code is below; at the end you can see the types and sizes of my variables.

import numpy as np
import pandas as pd
import re
import nltk
from nltk import FreqDist



yor1=pd.read_csv("Amazon Brand.csv")
yor2=pd.read_csv("American Soft Linen.csv")
yor3=pd.read_csv("GLAMBURG.csv")
yor4=pd.read_csv("Hammam.csv")
yor5=pd.read_csv("Hotel.csv")
yor6=pd.read_csv("Luxury Hotel.csv")
yor7=pd.read_csv("Luxury White.csv")
yor8=pd.read_csv("Qute.csv")



yorumlar = pd.concat([yor1,yor2,yor3,yor4,yor5,yor6,yor7,yor8], axis=0)
print(yorumlar)
from nltk.stem.porter import PorterStemmer

ps=PorterStemmer()

from nltk.corpus import stopwords

#Preprocessing
derlem = []
allwords=[]

for i in range(yorumlar.shape[0]):
    yorum = re.sub("[^a-zA-Z]"," ", yorumlar["Body"] [i])
  
    yorum=yorum.lower()

    yorum= yorum.split()
    yorum=[ps.stem(kelime) for kelime in yorum if not kelime in set(stopwords.words("english"))]
    
    for kelime in yorum:
        allwords.append(kelime)
    
    yorum= " ".join(yorum)
    derlem.append(yorum)

My variables:

Processing: New Bitmap Image.bmp…

I am probably making a simple mistake. Could you help me? Thank you.

steven.daprano · September 29, 2021, 7:52am

Can you please copy and paste the FULL traceback, starting with the
line “Traceback…”, so that we can see exactly what error you are
getting, what line of code is giving the error, and the context of why
the error is happening?

It looks to me like pandas.read_csv can return either a DataFrame or a
TextParser object, but the documentation doesn’t make it clear why it
would return one or the other.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

So the first thing I would do is run this:

for obj in [yor1,yor2,yor3,yor4,yor5,yor6,yor7,yor8]:
    print(type(obj))

and see which ones are different from the rest. If some are data frames
and some are not, that might explain the error.
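
For reference, read_csv returns a reader object rather than a DataFrame when it is called with chunksize (or iterator=True). The following is only a minimal sketch of that behaviour; the CSV text and the "Body" column are made-up stand-ins, not the files from this thread.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for one of the review files.
csv_text = "Body\nGreat towel\nVery soft\nToo thin\n"

# A plain call returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))
print(type(df))

# Passing chunksize returns an iterable reader instead of a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
print(type(reader))

# The reader yields DataFrame chunks, which can be stitched back together.
chunks = list(reader)
rebuilt = pd.concat(chunks, ignore_index=True)
print(rebuilt.shape)
```

If every object in the list prints as a DataFrame, this is not the cause and the problem lies elsewhere.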

obyilmaz · September 29, 2021, 1:00pm

I tried it, and I have added a picture showing the sizes and types of my variables. The full traceback has been added as well.

obyilmaz · September 29, 2021, 1:56pm

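SOLVED: I solved the problem. Thanks for helping. I made 2 mistakes:

1- I need to use ( ignore_index=True ) in my concat operation, so that the merged dataframe gets a new index from 0 to n-1 instead of repeating the row labels of each file.
2- I need to take care of “nan” values. The error was caused by these two mistakes.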
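
For anyone who hits the same error, here is a small sketch of both problems and one possible way to handle them. The two tiny dataframes are made-up stand-ins for the CSV files, and dropping the missing rows with dropna is only one option (an assumption on my part; filling them with an empty string would also work).

```python
import re
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the CSV files.
yor1 = pd.DataFrame({"Body": ["Great towel!", "Very soft."]})
yor2 = pd.DataFrame({"Body": ["Too thin", np.nan]})

# Without ignore_index the row labels repeat (0, 1, 0, 1), so looking up
# label 0 returns a Series with two rows instead of a single string.
dup = pd.concat([yor1, yor2], axis=0)
print(type(dup["Body"][0]))

# Mistake 1: build a fresh 0..n-1 index while concatenating.
yorumlar = pd.concat([yor1, yor2], axis=0, ignore_index=True)

# Mistake 2: NaN is a float, which re.sub rejects, so drop those rows
# and renumber the index so that range(len(...)) lookups still work.
yorumlar = yorumlar.dropna(subset=["Body"]).reset_index(drop=True)

# The same regex step as in the original loop now receives only strings.
cleaned = [re.sub("[^a-zA-Z]", " ", text).lower() for text in yorumlar["Body"]]
print(cleaned)
```

Both problems give the same "expected string or bytes-like object" message, because re.sub receives either a Series or a float instead of a string.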