Unable to correctly train utf-8 text using sklearn

When I train on ASCII data, I get the correct output in my test cases. But with UTF-8 data, the output is wrong.

Can anyone please give me a solution?

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

-----

# Create a pipeline that vectorizes the data and then applies the classifier
pipeline = make_pipeline(CountVectorizer(binary=True, encoding='utf-8'), BernoulliNB())
X_train1 = ["राया", "मज्या", "अजूनि", "आयलो", "ना", "वाट", "रे"]

# Train the pipeline
Y_train1 = [1, 1, 2, 2, 2, 0, 0]
pipeline.fit(X_train1, Y_train1)

# Predict the classes of the test data (here, the same words used for training)
test_data = ["राया", "मज्या", "अजूनि", "आयलो", "ना", "वाट", "रे"]
predicted_class = pipeline.predict(test_data)
print("Predicted class:", predicted_class)

Output

Predicted class: [2 1 2 2 2 2 2]

Do you also get these results if the data is raw binary, which happens to be the UTF-8 encoding of these test strings?

… For those of us who don’t do this AI stuff, what actually is the expected result, and why?


@kknechtel I will check the case you suggested.

I found a workaround for my problem, but I don’t think it is the ideal way to handle Unicode data in sklearn.

---
X_train1a = []
for i in X_train1:
    X_train1a.append(i.encode('ascii', 'backslashreplace').decode('ascii'))
---
pipeline.fit(X_train1a, Y_train1)
---
test_data_a = []
for i in test_data:
    test_data_a.append(i.encode('ascii', 'backslashreplace').decode('ascii'))
predicted_class = pipeline.predict(test_data_a)

Output

Predicted class: [1 1 2 2 2 0 0]
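
A guess at why this workaround helps, assuming `CountVectorizer` is running with its default `token_pattern` of `(?u)\b\w\w+\b`: `backslashreplace` turns every Devanagari code point into an ASCII escape such as `\u0928`, and each escape then counts as a token of its own. The sketch below mimics that tokenization with the standard `re` module only; `escape_word` and `default_tokens` are illustrative helper names, not sklearn functions.

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more "word" characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def escape_word(word):
    """Apply the same transformation as the workaround above."""
    return word.encode("ascii", "backslashreplace").decode("ascii")


def default_tokens(text):
    """Return the tokens the default pattern extracts (after lowercasing)."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


escaped = escape_word("ना")
print(escaped)                  # \u0928\u093e
print(default_tokens(escaped))  # ['u0928', 'u093e']
```

If that assumption holds, the classifier ends up with one feature per code point rather than one per word, which happens to be enough to tell these seven words apart.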

@kknechtel, I tried your suggestion by converting the strings to binary and got the expected output.
But I would still like to know whether sklearn offers a built-in way to handle Unicode strings.

Here is my attempt at implementing your suggestion.

X_train1b = []
for word in X_train1:
    word_e = word.encode('utf-8')
    bin1 = ''.join(format(byte, '08b') for byte in word_e)
    X_train1b.append(bin1)

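test_data_b = []
for word in test_data:
    word_e = word.encode('utf-8')
    bin1 = ''.join(format(byte, '08b') for byte in word_e)
    test_data_b.append(bin1)

# Train and predict on the bit-string versions of the words
pipeline.fit(X_train1b, Y_train1)
predicted_class = pipeline.predict(test_data_b)
print("Predicted class:", predicted_class)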

Output

Predicted class: [1 1 2 2 2 0 0]
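
On the question of an sklearn-native way: one likely cause of the original result, offered as a hypothesis rather than a confirmed diagnosis, is the default `token_pattern` of `CountVectorizer`, `(?u)\b\w\w+\b`. In Python's `re`, `\w` does not match combining marks such as Devanagari vowel signs or the virama, so words get cut at those characters and any piece shorter than two characters is dropped. The sketch below reproduces that with the standard `re` module only; `default_tokens` and `whitespace_tokens` are illustrative names, and `\S+` is just one possible replacement pattern.

```python
import re

words = ["राया", "मज्या", "अजूनि", "आयलो", "ना", "वाट", "रे"]

# Default token_pattern documented for CountVectorizer.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Assumed alternative: treat every run of non-whitespace as one token.
WHITESPACE_TOKEN_PATTERN = re.compile(r"\S+")


def default_tokens(text):
    """Tokens produced by the default word pattern."""
    return DEFAULT_TOKEN_PATTERN.findall(text)


def whitespace_tokens(text):
    """Tokens produced by splitting on whitespace only."""
    return WHITESPACE_TOKEN_PATTERN.findall(text)


for word in words:
    # Show what each pattern would hand to the vectorizer as features.
    print(word, default_tokens(word), whitespace_tokens(word))
```

If the first list comes out empty or holds only a fragment of the word, passing the alternative pattern to the vectorizer, for example `CountVectorizer(binary=True, token_pattern=r"\S+")`, should keep each word whole without any re-encoding. `analyzer="char"` is another option to try if character-level features are acceptable. Note that the `encoding` argument only affects how bytes or file input is decoded, so it does not change how `str` input is tokenized.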