I'm trying to train a Random Forest classifier to predict movie success based on various features.
I'm using the tmdb_5000_movies.csv dataset. My code is below.
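import ast
import json
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df_movies = pd.read_csv('tmdb_5000_movies.csv')
df_credits = pd.read_csv('tmdb_5000_credits.csv')
# Rename the 'id' column to 'movie_id'
df_movies.rename(columns={'id': 'movie_id'}, inplace=True)
# Merge the DataFrames on 'movie_id' with specified suffixes
df_merged = pd.merge(df_movies, df_credits, on='movie_id', suffixes=('_movie', '_credit'))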
def extract_genres(json_str):
    try:
        # Replace single quotes with double quotes for valid JSON
        genres = json.loads(json_str.replace("'", '"'))
        genre_names = [genre['name'] for genre in genres]
        return genre_names
    except (json.JSONDecodeError, TypeError):
        return []

# Apply the function to the 'genres' column
df_merged['genres'] = df_merged['genres'].apply(extract_genres)
# Handle keywords
df_merged.iloc[0]['keywords']

def extract_keywords(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name'])
    return L

# Apply the function to the 'keywords' column
df_merged['keywords'] = df_merged['keywords'].apply(extract_keywords)
# Handle cast
df_merged.iloc[0]['cast']

# Convert the string to a list of cast names, keeping the top 4
def convert_cast(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 4:
            L.append(i['name'])
            counter += 1
    return L

df_merged['cast'] = df_merged['cast'].apply(convert_cast)
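For reference, the genres, keywords, and cast helpers above all follow one pattern: parse the cell, then collect each entry's `name`. Below is a minimal sketch of that pattern, assuming each cell is a JSON-formatted string shaped like the made-up `sample`. The names `extract_names`, `limit`, and `sample` are hypothetical and are not part of the code in this post.

```python
import json

def extract_names(json_str, limit=None):
    """Return the 'name' field of each entry in a JSON-encoded list of dicts."""
    try:
        entries = json.loads(json_str)
    except (json.JSONDecodeError, TypeError):
        return []  # Missing or malformed cells become an empty list
    names = [entry["name"] for entry in entries]
    # Optionally keep only the first `limit` names (e.g. the top 4 cast members)
    return names if limit is None else names[:limit]

# Made-up sample in the same shape as a 'genres' or 'cast' cell (assumption)
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}]'

print(extract_names(sample))
print(extract_names(sample, limit=2))
print(extract_names(None))
```

Parsing with `json.loads` directly avoids replacing quote characters first, which matters if a name itself contains an apostrophe. This only holds if the column really is valid JSON.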
# Handle crew
df_merged.iloc[0]['crew']

# Function to extract the director's name
def get_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L

df_merged['crew'] = df_merged['crew'].apply(get_director)
# Converting overview to list
df_merged.iloc[0]['overview']

# Remove spaces from strings
def remove_spaces(text):
    if isinstance(text, list):
        return [t.replace(" ", "") for t in text]
    elif isinstance(text, str):
        return text.replace(" ", "")
    else:
        return text

# Apply the function to remove spaces
df_merged['overview'] = df_merged['overview'].apply(remove_spaces)
df_merged['genres'] = df_merged['genres'].apply(remove_spaces)
df_merged['keywords'] = df_merged['keywords'].apply(remove_spaces)
df_merged['cast'] = df_merged['cast'].apply(remove_spaces)
df_merged['crew'] = df_merged['crew'].apply(remove_spaces)
# Drop the 'homepage' column
df_merged.drop(columns=['homepage'], inplace=True)
# Preprocess the data
categorical_features = ['genres', 'original_language', 'production_countries', 'spoken_languages', 'status']
label_encoder = LabelEncoder()
for feature in categorical_features:
    df_merged[feature] = label_encoder.fit_transform(df_merged[feature].astype(str))
# Prepare the features and target variable
X = df_merged.drop(['title_movie', 'vote_average', 'vote_count'], axis=1)
# Assume a vote_average >= 6 counts as a success
y = df_merged['vote_average'].apply(lambda x: 1 if x >= 6 else 0)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
I'm getting the following error:
ValueError: setting an array element with a sequence.
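
NumPy raises this message when a cell that should hold a single number holds a sequence such as a list. In the code above, the keywords, cast, and crew columns still contain Python lists when they reach X. One way to narrow this down is to list the columns that hold anything other than plain numbers. The sketch below works on a plain dict of made-up values rather than the real DataFrame. The names `non_numeric_columns` and `features` are hypothetical.

```python
from numbers import Number

def non_numeric_columns(columns):
    """Map each column name to the type name of its first non-numeric value."""
    offenders = {}
    for name, values in columns.items():
        for value in values:
            if not isinstance(value, Number):
                offenders[name] = type(value).__name__
                break  # One offending value is enough to flag the column
    return offenders

# Made-up rows shaped like the feature matrix after preprocessing (assumption)
features = {
    "budget": [237000000, 300000000],
    "genres": [3, 7],                                     # label-encoded integers
    "keywords": [["cultureclash", "future"], ["ocean"]],  # still Python lists
    "overview": ["Inthe22ndcentury", "CaptainBarbossa"],  # still strings
}

print(non_numeric_columns(features))
# → {'keywords': 'list', 'overview': 'str'}
```

With a DataFrame, a mapping such as `{col: X_train[col] for col in X_train.columns}` could be passed in. Any column this reports would need to be encoded as numbers or dropped before fitting.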

