How to create a dictionary from scanning text

Hi,
I am trying to write a program that reads the text in a file and lists each unique word, along with its frequency. I can open the file, read it, remove the punctuation, make it all uppercase (so that words like help & Help are still classified as the same), and then split out each word to remove the whitespace - see code below.

However, I don’t know how to perform the next step: creating a dictionary, adding each unique word to it as a key, and then incrementing the value each time that word appears in the text.

Help appreciated,
Thanks
Dan

#import string library
import string

#open file as read only, assign to text variable
with open('mytext.txt', 'r') as file:
    text = file.read()
    
    #remove punctuation from text
    for i in text:
        if i in string.punctuation:
            text = text.replace(i,'')
    #print(text)   

    #make all text uppercase 
    text_after_strip = text.upper()
    print(text_after_strip) #test print to see that it works

    #split each word from text - creates a list of words
    words = text_after_strip.split()
    print(words) # test print to see output
    
    #this is where I add unique words to dictionary and 
    #increase value as they appear more than once
    for word in words:

I am trying to write a program that reads the text in a file and lists
each unique word, along with its frequency. I can open the file, read
it, remove the punctuation, make it all uppercase (so that words like
help & Help are still classified as the same), and then strip out each
word to remove the white spaces - see code below.
[… snip, dict suggestions at the bottom …]

A few remarks:

#remove punctuation from text
for i in text:

I’d use the name “c” or “ch” here - “i” is conventionally an integer,
and often an index.

   if i in string.punctuation:
       text = text.replace(i,'')

You might want to replace with a space, as otherwise “this,that” will
become a single word.
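
One possible way to do the replace-with-a-space variant in a single pass is str.translate. This is only a sketch: the sample text is made up, and note that it also splits words at apostrophes and hyphens, which may or may not be what you want.

```python
import string

# Hypothetical sample text, not taken from the original file.
text = "this,that and the-other."

# Build a table mapping every punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Translate, then split on whitespace to get the words.
words = text.translate(table).split()
print(words)  # ['this', 'that', 'and', 'the', 'other']
```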

You might also want to replace characters not matching “word”
characters (letters maybe) rather than having a specific list of
rejected characters, even the presupplied string.punctuation.

While I’m usually slow to recommend regular expressions, this is a great
situation for them. You could match \w+ or some character class and skip
all this tedious per-character string mangling. If you step into
regexps, be aware that they can be shoehorned into doing many things,
often inappropriately. Try to keep some reluctance to reach for them as
the first choice of tool.
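
As a rough sketch of the \w+ idea (the sample text is invented; be aware that \w also matches digits and underscores, so a narrower character class may suit you better):

```python
import re

# Hypothetical sample text standing in for the contents of the file.
text = "Help! I need help, this,that..."

# Find every run of "word" characters; punctuation and spaces are skipped.
words = re.findall(r"\w+", text.upper())
print(words)  # ['HELP', 'I', 'NEED', 'HELP', 'THIS', 'THAT']
```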

[…]

#this is where I add unique words to dictionary and
#increase value as they appear more than once
for word in words:

Well, making a dictionary is easy:

word_counts = {}

Then you just need to bump the counters for each word as you find them.
So:

word_counts[word] += 1

The fiddly bit is that this will raise a KeyError the first time a word
is seen - there’s no entry for it so nothing to bump. You could do this:

word_counts[word] = word_counts.get(word, 0) + 1

(See the dict.get method.) Or… you could grab a defaultdict from the
collections module:

from collections import defaultdict
word_counts = defaultdict(int)

and just use the += version. defaultdict takes a callable and returns a
dict which calls that callable to set up the entry for anything missing.
So:

word_counts[word]

for a new word would call int() and store the result in the dict
immediately, ready to be incremented - no need to do that explicitly, it
will happen as soon as you try to increment that entry.

Often defaultdict is given a class name (for classes which can be
instantiated with no arguments - int() returns 0), but any factory
function accepting no arguments will do.
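
Putting those pieces together, the two approaches might look something like this. The word list is a made-up stand-in for the output of your split() call.

```python
from collections import defaultdict

# Hypothetical word list, as if produced by text.split().
words = ["HELP", "ME", "HELP", "PLEASE", "HELP"]

# Version 1: plain dict, using dict.get to supply 0 for unseen words.
plain_counts = {}
for word in words:
    plain_counts[word] = plain_counts.get(word, 0) + 1

# Version 2: defaultdict(int) creates the missing entry as int() == 0.
word_counts = defaultdict(int)
for word in words:
    word_counts[word] += 1

print(plain_counts)       # {'HELP': 3, 'ME': 1, 'PLEASE': 1}
print(dict(word_counts))  # {'HELP': 3, 'ME': 1, 'PLEASE': 1}
```

Both loops produce the same counts; the defaultdict version just moves the "missing key" handling out of your loop.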

There’s also a Counter class in the standard library’s collections
module (look it up in the index of the Python docs) - I don’t seem to
use it much, but I think it can also be used for this purpose.

But this kind of exercise is almost always to get you to do the
mechanics yourself so that you know what’s going on - calling a magic
prebuilt tool that solves the problem for you isn’t the objective.

Cheers,
Cameron Simpson cs@cskk.id.au

Hi Dan,

It may be better to use casefold rather than uppercase to normalise the
text, especially if there is any chance of non-English words appearing
in your text.
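
A small illustration of what casefold does, using the German “ß” (the sample words are my own, and the comparison here is against lower() because that is where the difference is easiest to see):

```python
# Hypothetical samples: three spellings of the same German word.
samples = ["Straße", "STRASSE", "strasse"]

# lower() leaves "ß" alone, so two distinct forms survive.
lowered = {w.lower() for w in samples}

# casefold() expands "ß" to "ss", so all three collapse to one form.
folded = {w.casefold() for w in samples}

print(sorted(lowered))
print(folded)  # {'strasse'}
```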

If you want to count the unique words, you can use a Counter:

from collections import Counter
frequencies = Counter(words)

If you import Counter into the interactive interpreter, you can use
help(Counter) to view some useful examples.
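
For instance, a short session along these lines (the word list is invented for the example) shows the lookups and the most_common method:

```python
from collections import Counter

# Hypothetical word list, as if produced by text.split().
words = ["THE", "CAT", "SAW", "THE", "DOG", "THE", "CAT"]

frequencies = Counter(words)

print(frequencies["THE"])          # 3
print(frequencies["BIRD"])         # 0 - missing words count as zero
print(frequencies.most_common(2))  # [('THE', 3), ('CAT', 2)]
```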

Under the hood, Counter works very roughly like this:

frequencies = {}
for word in words:
    try:
        count = frequencies[word]
    except KeyError:
        count = 0
    count += 1
    frequencies[word] = count

That can be re-written like this:

frequencies = {}
for word in words:
    count = frequencies.get(word, 0)
    frequencies[word] = count + 1

Hi Cameron,

I have taken your advice and changed the i variable to c to bring my code more in line with standard practice … My code now works for inserting the text into a dictionary …
I have tried for the last couple of days to get over my last hurdle, which is to print these words in either of 2 ways: print the dictionary with the keys and values in alphabetical order, OR ordered by the highest value in the dictionary.

Dictionaries don’t have a sort method. I can use the sorted function, but it shows only the keys, not the values … How can I achieve this?

Appreciate any help,

Thanks very much !
Regards
Dan

Hi Steven,

Great advice and much appreciated.
I have created the dictionary and then added a couple of lines of code to show how many items are in it. I now just need to print these words in either of 2 ways: the keys of the dictionary in alphabetical order, OR in order from the highest value.

I know dictionaries don’t have a sort method, and that sorted will put the keys in alphabetical order, HOWEVER it will not show their values…
So how do I sort and print the keys with their values, 1 per line?

Cheers
Dan

Sort the keys. Get each value from the key.
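
Something like the following, as a sketch only - the counts are made up, and the second loop (ordering by value via key=word_counts.get) is one possible way to get the highest counts first:

```python
# Hypothetical counts standing in for your dictionary.
word_counts = {"THE": 3, "CAT": 2, "SAW": 1, "DOG": 1}

# Alphabetical: sort the keys, then look up each value.
alphabetical = [(word, word_counts[word]) for word in sorted(word_counts)]
for word, count in alphabetical:
    print(word, count)

# Highest count first: sort the keys by their value, descending.
by_count = [
    (word, word_counts[word])
    for word in sorted(word_counts, key=word_counts.get, reverse=True)
]
for word, count in by_count:
    print(word, count)
```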

Cheers,
Cameron Simpson cs@cskk.id.au

Thanks Cameron, great advice … it worked…

Cheers
Dave