Question about reading books with Python

Hi, I have a question about reading books with Python.
I have an assignment that wants me to count the number of unique words in the English translation of Hamlet. The file I am using is taken from the Gutenberg library: Index of /files/1787

I am seriously lost because this code returns 4797 for me, but the solution tells me 3348.
I am not sure where my mistake could be.
I would be immensely grateful for any help.

import os 
import pandas as pd 
import numpy as np 
from collections import Counter

def count_words_fast(text): 
    text = text.lower() 
    skips = [".", ",", ";", ":", "'", '"', "\n", "!", "?", "(", ")"] 
    for ch in skips: 
        text = text.replace(ch, "") 
    word_counts = Counter(text.split(" ")) 
    return word_counts

def word_stats(word_counts):
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

def word_count_distribution(text):
    word_counts = count_words_fast(text)
    count_distribution = Counter(word_counts.values())
    return count_distribution

def more_frequent(distribution):
    counts = list(distribution.keys())
    frequency_of_counts = list(distribution.values())
    cumulative_frequencies = np.cumsum(frequency_of_counts)
    more_frequent = 1 - cumulative_frequencies / cumulative_frequencies[-1]
    return dict(zip(counts, more_frequent))

def read_book(title_path):
    """Read a book and return it as a string."""
    with open(title_path, "r", encoding="utf8") as current_file:
        # Read the whole file first; the replace calls need a string to work on.
        text = current_file.read()
        text = text.replace("\n", " ").replace("\r", " ")
    return text

hamlets = pd.DataFrame(columns = ["language","text"])
book_dir = "Books1"
title_num = 1
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            if title == "Hamlet":
                inputfile = "Books1/"+language+"/"+author+"/"+title+".txt"
                text = read_book(inputfile)
                hamlets.loc[title_num] = language, text
                title_num += 1
language, text = hamlets.iloc[0]

counted_text = count_words_fast(text)

data = pd.DataFrame({
    "word": list(counted_text.keys()),
    "count": list(counted_text.values())
})
data["length"] = data["word"].apply(len)

#data.loc[data["count"] > 10,  "frequency"] = "frequent"
#data.loc[data["count"] <= 10, "frequency"] = "infrequent"
data.loc[data["count"] == 1,  "frequency"] = "unique"


It’s probably down to your definition of “word”. When you strip out apostrophes, for example, you’re making “god’s” and “gods” the same. It’s better to identify what is part of a word rather than stripping out what’s not.

You could start by splitting on whitespace, listing the resulting “words”, and then strip any punctuation off any words that have any and merge those results with the proper words.
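
A minimal sketch of that idea, assuming the ASCII characters in `string.punctuation` are the ones that need trimming. The helper name `split_and_strip` is made up for illustration. It trims punctuation only from the ends of each whitespace-separated token, so an apostrophe inside a word survives:

```python
import string
from collections import Counter

def split_and_strip(text):
    """Split on whitespace, then trim punctuation from both ends of each token."""
    words = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        if word:  # tokens made only of punctuation become empty; drop them
            words.append(word)
    return words

sample = "God's will, not the gods' will! (Gods, he said.)"
counts = Counter(split_and_strip(sample))
print(counts["god's"], counts["gods"])  # "god's" and "gods" are counted separately
```

Note that this also trims a leading or trailing apostrophe (as in 'A or gods'), and that typographic quotes such as ’ are not in `string.punctuation`, so a file that uses them would need those characters added.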

Also, should you be including the leading, trailing, and other extraneous text that’s not part of Shakespeare’s work?
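
One way to drop that material is to keep only the lines between a start marker and an end marker. The sketch below assumes marker lines that begin with `*** START OF` and `*** END OF`; those strings are a guess and must be checked against the actual file, and the function name `strip_boilerplate` is hypothetical:

```python
def strip_boilerplate(text, start_marker="*** START OF", end_marker="*** END OF"):
    """Keep the lines after the start marker and before the end marker.

    The marker strings are assumptions; check them against your own file.
    If a marker is missing, that side of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            start = i + 1   # the body begins on the line after the marker
        elif line.startswith(end_marker):
            end = i         # the body stops just before this marker
            break
    return "\n".join(lines[start:end])

sample = (
    "License notes\n"
    "*** START OF THE EBOOK ***\n"
    "To be, or not to be\n"
    "*** END OF THE EBOOK ***\n"
    "More license text"
)
body = strip_boilerplate(sample)
print(body)
```

This works on text that still has its line breaks, so it would have to run before newlines are replaced with spaces.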

Incidentally, the text won’t contain any ‘\r’ because the line endings, whether ‘\r\n’, ‘\r’ or ‘\n’, are automatically converted to ‘\n’ when it’s read.
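
A small demonstration of that newline translation, using a throwaway temporary file. In text mode the default `newline=None` converts every line ending to "\n", while `newline=""` leaves the endings as they are:

```python
import os
import tempfile

# Write raw bytes so the file really contains all three line-ending styles.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"one\r\ntwo\rthree\nfour")
    path = tmp.name

# Text mode with the default newline=None translates every ending to "\n".
with open(path, "r", encoding="utf8") as f:
    translated = f.read()

# Passing newline="" turns the translation off and keeps the original endings.
with open(path, "r", encoding="utf8", newline="") as f:
    untranslated = f.read()

os.remove(path)

print(repr(translated))    # 'one\ntwo\nthree\nfour'
print(repr(untranslated))  # 'one\r\ntwo\rthree\nfour'
```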

Yes, whether a given sequence of characters is to be or not to be a word, really is the question. (Sorry, couldn’t resist.)

MRAB’s is all good advice. I would add: don’t forget actually to look at what words it is choosing, especially the least frequent words, to see if you agree or something does not belong.
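
A quick way to do that inspection, sketched with toy counts standing in for the output of count_words_fast. The helper name `least_frequent` is made up for illustration:

```python
from collections import Counter

def least_frequent(word_counts, n=10):
    """Return the n rarest (word, count) pairs, rarest first, ties in alphabetical order."""
    return sorted(word_counts.items(), key=lambda item: (item[1], item[0]))[:n]

# Toy counts; in practice pass the Counter returned by count_words_fast(text).
word_counts = Counter({"the": 5, "king": 2, "[takes": 1, "skull]": 1, "ham": 1})
for word, count in least_frequent(word_counts, n=3):
    print(f"{count:3d}  {word!r}")
```

Printing with `!r` makes stray brackets, empty strings, and leftover whitespace easy to spot.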

After thinking about this a bit more, and looking at the text because it is fun to read, it seems the exercise is mainly about data cleaning, which is difficult to do well.

  Clown. A pestilence on him for a mad rogue! 'A pour'd a flagon
    Rhenish on my head once. This same skull, sir, was Yorick's
    skull, the King's jester.
  Ham. This?
  Clown. E'en that.
  Ham. Let me see. [Takes the skull.] Alas, poor Yorick! I knew him,
    Horatio. A fellow of infinite jest, of most excellent fancy.

Just here we can see character name abbreviations (“Ham.”, “Hor.”), case (“King” vs “king”), stage directions, and dialect transcription or contractions.

As well as the header/footer, there are also inserts every so often that look like this:


I think the approach to reading in count_words_fast is trying too much at once by lower-casing and string-replacing en masse and counting as well.

You should probably read line by line, spotting when you have entered or left header material. When you have a real line of the play, split it into words on whitespace as MRAB suggests, maybe treating the first line of a speech specially (it starts with the character name), then deal with punctuation, case, and special gotchas, and only then count. Regexes (the re module) can be good at this stage. It takes multiple functions to express your “word policy”.
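
A rough sketch of such a pipeline, with one small function per decision. The patterns are assumptions based only on the layout of the excerpt quoted above (a speech starts with two spaces and an abbreviated name, stage directions sit in square brackets on a single line); all names here are made up for illustration:

```python
import re
from collections import Counter

STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")          # e.g. "[Takes the skull.]"
SPEAKER_PREFIX = re.compile(r"^ {2}[A-Z][a-z]+\. ")  # e.g. "  Ham. " (assumed layout)
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")             # letters, internal apostrophes kept

def clean_line(line):
    """Remove single-line stage directions and a leading speaker abbreviation."""
    line = STAGE_DIRECTION.sub(" ", line)
    return SPEAKER_PREFIX.sub("", line)

def tokenize(line):
    """Lower-case the line and pull out the words."""
    return WORD.findall(line.lower())

def count_words(lines):
    """Clean and tokenize each line, then count."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(clean_line(line)))
    return counts

# A few lines adapted from the excerpt above.
excerpt = [
    "  Ham. This?",
    "  Clown. E'en that.",
    "  Ham. Let me see. [Takes the skull.] Alas, poor Yorick!",
]
counts = count_words(excerpt)
print(sorted(counts))
```

Each function encodes one policy choice, so changing the definition of a word means editing one pattern rather than the whole routine. Stage directions that run over several lines would need extra handling.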

Special gotchas include apostrophe and plural processing. You can’t just delete apostrophes: “king’s” and “kings” are not the same, but is a plural a different word? Nor are “I’ll”, “ill” and “ills” the same. And how do you deal with “drown’d” or “You shall do marvell’s wisely”? (It’s a contraction of “marvellous”.)
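
To see how much damage deleting apostrophes does, one option is to list the distinct words that collapse into the same string once the apostrophes are gone. This is only an illustrative check on a toy vocabulary, and the helper name `apostrophe_collisions` is hypothetical:

```python
from collections import defaultdict

def apostrophe_collisions(words):
    """Group distinct words that become identical once apostrophes are deleted."""
    groups = defaultdict(set)
    for word in words:
        groups[word.replace("'", "")].add(word)
    # Only groups with more than one member are real collisions.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

vocabulary = ["king's", "kings", "i'll", "ill", "ills", "drown'd", "skull"]
collisions = apostrophe_collisions(vocabulary)
print(collisions)  # {'kings': ["king's", 'kings'], 'ill': ["i'll", 'ill']}
```

Run on the real vocabulary, each group found this way is a set of distinct spellings that would be merged into one entry if apostrophes were simply deleted.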

In the end, what counts as a distinct word is a matter of choice, but an answer a long way from the reference answer is a bad sign.

Just on defining a “word”: does your assignment question itself specify
what it considers a word to be? If they expect a particular numeric
result (or provide one for an example text as a sanity check) they will
usually be pretty specific about what you need to count.

Cameron Simpson