I clustered my text data and then merged all documents that share a cluster label into a single document. The code that merges them is:
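import pandas as pd
# data is the list of raw documents; cluster is the fitted clustering model
docs_df = pd.DataFrame(data, columns=["Doc"])
docs_df['Topic'] = cluster.labels_
docs_df['Doc_ID'] = range(len(docs_df))
# join every document that shares a topic label into one long string
docs_per_topic = docs_df.dropna(subset=['Doc']).groupby(['Topic'], as_index=False).agg({'Doc': ' '.join})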
Next I want to compute tf-idf over the merged documents, and for that I am using CountVectorizer with the code below:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
def c_tf_idf(documents, m, ngram_range=(1, 1)):
    count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(documents)
    t = count.transform(documents).toarray()
    w = t.sum(axis=1)
    tf = np.divide(t.T, w)
    sum_t = t.sum(axis=0)
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)
    return tf_idf, count
tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(data))
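For reference, the arithmetic inside c_tf_idf can be traced on a tiny hand-made count matrix. This is only a sketch: the counts in `t` and the value of `m` are made-up assumptions, not taken from my data, and it skips CountVectorizer entirely so that only the NumPy steps are shown.

```python
import numpy as np

# Hypothetical toy counts: 2 merged topic documents (rows) x 3 vocabulary terms (columns).
t = np.array([[2, 0, 2],
              [1, 3, 0]])
m = 4  # assumed number of original (unmerged) documents

w = t.sum(axis=1)                                  # total term count per topic document
tf = np.divide(t.T, w)                             # shape (3, 2): term frequency within each topic
sum_t = t.sum(axis=0)                              # total count of each term across all topics
idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)   # shape (3, 1): one weight per term
tf_idf = np.multiply(tf, idf)                      # shape (3, 2): terms x topics

print(tf_idf.shape)
```

Note that the result has one row per term and one column per topic, because `t` is transposed before the division.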
However, the call on the last line raises the following error:
ValueError Traceback (most recent call last)
in
     13     return tf_idf, count
     14
---> 15 tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(data))

in c_tf_idf(documents, m, ngram_range)
      3
      4 def c_tf_idf(documents, m, ngram_range=(1, 1)):
----> 5     count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(documents)
      6     t = count.transform(documents).toarray()
      7     w = t.sum(axis=1)

~/opt/anaconda3/envs/tensorflow/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, raw_documents, y)
   1184         """
   1185         self._warn_for_unused_params()
-> 1186         self.fit_transform(raw_documents)
   1187         return self
   1188

~/opt/anaconda3/envs/tensorflow/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1218
   1219         vocabulary, X = self._count_vocab(raw_documents,
-> 1220                                           self.fixed_vocabulary_)
   1221
   1222         if self.binary:

~/opt/anaconda3/envs/tensorflow/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1148             vocabulary = dict(vocabulary)
   1149             if not vocabulary:
-> 1150                 raise ValueError("empty vocabulary; perhaps the documents only"
   1151                                  " contain stop words")
   1152

ValueError: empty vocabulary; perhaps the documents only contain stop words
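Since the message says the vocabulary is empty, one check that might narrow this down is counting how many tokens each merged document would yield. The sketch below uses only the standard library and assumes CountVectorizer's default token_pattern (words of two or more characters). The helper `summarize_docs` and the sample `docs` list are hypothetical names made up for illustration, and the check does not apply the English stop-word list, so it can only overestimate the token counts.

```python
import re

# Same regex as CountVectorizer's default token_pattern: tokens of 2+ word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def summarize_docs(documents):
    """Return (index, token_count) per document. Hypothetical helper, not part of sklearn."""
    summary = []
    for i, doc in enumerate(documents):
        # Non-string entries (e.g. NaN) are counted as empty here.
        text = doc if isinstance(doc, str) else ""
        summary.append((i, len(TOKEN_RE.findall(text.lower()))))
    return summary

# Made-up sample input: an empty string, single-character words only, and normal text.
docs = ["", "a I", "clustered text data"]
result = summarize_docs(docs)
print(result)
```

Calling `summarize_docs(docs_per_topic.Doc.values)` on the real input should show whether every merged document comes out with zero tokens, which would explain the empty vocabulary.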
I tried using split, as suggested on a forum, but that produces a different error.
I would appreciate suggestions or solutions to my problem.
Thank you.