EMPATH PACKAGE - inclusion of seed terms

Hi there!

I am new to Python and data science in general.
I want to use the Empath package by Fast, Chen & Bernstein.
From what I gather, it uses a couple of seed terms to build a dictionary of categories through unsupervised learning. Now I am trying to find out which categories are built from the seed terms I provide, but I don't know which function to use for that.
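For context, the only call I have found in the README so far is analyze, which (if I am reading it right) only returns a score per category, not the terms each category was built from:

from empath import Empath

lexicon = Empath()
# Returns a dict of category -> normalized score for this text, but I see
# no way here to inspect which terms a category consists of
scores = lexicon.analyze("The government announced new subsidies.", normalize=True)
# Show the few categories with the highest scores
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5])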

For now I am getting results with the following approach instead, but I am not sure whether it is the right one and cannot find much information on the matter:

import pandas as pd
import re
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'C:/Users/katja/OneDrive/Dokumente/R_Earnings_Calls/filtered_earningscalls_semiconductor.xlsx'
df = pd.read_excel(file_path)

# Ensure the date column is in datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Define the custom dictionary with seed words
custom_dict = {
    "government": ["regulation", "federal", "law", "policy", "government", "Government"],
    "incentive": ["subsidy", "tax break", "reward", "grant", "incentive", "incentives"],
    "negotiate": ["bargain", "deal", "discuss", "arrange", "negotiate", "negotiating",
                  "negotiated", "negotiation", "negotiations", "compromise"],
    # Note: the underscore forms only match if bigrams were pre-joined with "_" in the transcripts
    "pressure": ["force", "stress", "coerce", "compel", "pressure", "enormous_pressure",
                 "intense_pressure", "great_pressure", "political_pressure"]
}

# Compile a case-insensitive, word-boundary regular expression for each topic
compiled_keywords = {
    topic: re.compile(r'\b(' + '|'.join(words) + r')\b', re.IGNORECASE)
    for topic, words in custom_dict.items()
}

# Check whether a text contains any keyword of a topic
def detect_keywords(text, regex):
    return bool(regex.search(text)) if pd.notna(text) else False

# Detect keywords in the dataset
for topic, regex in compiled_keywords.items():
    df[topic] = df['component_text'].apply(lambda x: detect_keywords(x, regex))

# Extract the year from the date column
df['year'] = df['date'].dt.year

# Filter the dataset for the years 2017 to 2023
df_filtered = df[(df['year'] >= 2017) & (df['year'] <= 2023)]

# Count the number of distinct calls (transcript_ids) per year
summary_df = df_filtered.groupby('year')['transcript_id'].nunique().reset_index(name='total_calls')

# Count the number of distinct calls per year that mention each topic
for topic in custom_dict.keys():
    topic_df = df_filtered[df_filtered[topic]].groupby('year')['transcript_id'].nunique().reset_index(name=f'{topic}_calls')
    summary_df = summary_df.merge(topic_df, on='year', how='left')

# Fill NaN values (years with no mentions) with 0
summary_df.fillna(0, inplace=True)

# Calculate the share of calls mentioning each topic, in percent
for topic in custom_dict.keys():
    summary_df[f'{topic}_percentage'] = (summary_df[f'{topic}_calls'] / summary_df['total_calls']) * 100

# Plot the results
plt.figure(figsize=(12, 8))
for topic in custom_dict.keys():
    plt.plot(summary_df['year'], summary_df[f'{topic}_percentage'], marker='o', label=topic)

plt.title('Share of Calls Mentioning Specific Keywords Over Time (2017-2023)')
plt.xlabel('Year')
plt.ylabel('Share of Calls (%)')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Welcome to the community. I recommend you follow the instructions for posting code in "About the Python Help category".

It’s a lot harder to read what is going on without proper formatting.


Looking over the Empath paper, it uses a neural network embedding to expand each seed set and then filters the generated terms with crowdsourced labels. Your regular expression is simply counting literal words. While that may give some approximation, it is not the same algorithm. An improvement to your method could be to use spaCy's fuzzy matcher or LLaMA embeddings, but each embedding maps words differently and will produce different counts.
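In case the spaCy route is of interest: its Matcher supports a FUZZY operator (spaCy 3.5+, if I remember right) that tolerates small spelling variations. A minimal sketch, reusing the "government" seed words from your dictionary purely as an example:

import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here; the Matcher only needs the tokenizer
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

seed_words = ["regulation", "federal", "law", "policy", "government"]
# One single-token pattern per seed word, with fuzzy matching on the lowercase form
patterns = [[{"LOWER": {"FUZZY": w}}] for w in seed_words]
matcher.add("government", patterns)

# Misspellings like "regulatons" or "goverment" still match
doc = nlp("New federal regulatons tightened goverment policy.")
print([doc[start:end].text for _, start, end in matcher(doc)])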

The empath package appears to be a thin wrapper around a web service that no longer seems to be functioning, so it is hard to run its code and see how close your approximation would get.
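For reference, the call you were probably looking for is create_category: it sends your seed terms to that web service and is supposed to give back the expanded category. An untested sketch, with the argument names taken from the empath-client README, so treat them as assumptions:

# Untested -- the Empath backend appears to be down, so this may simply
# raise a connection error
from empath import Empath

lexicon = Empath()

# Ask the remote model to expand the seed terms into a full category.
# If it succeeds, the generated terms should be stored locally so that
# analyze() can score documents against the new category.
lexicon.create_category("gov_policy",
                        ["regulation", "federal", "law", "policy"],
                        model="nytimes")

# Score a document against the newly created category
print(lexicon.analyze("New federal regulations put pressure on chip subsidies.",
                      categories=["gov_policy"], normalize=True))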

This forum is mostly focused on Python software development; you may find more in-depth analysis over at the Data Science Stack Exchange.
