How to create a dictionary from scanning text

I am trying to write a program that reads the text in a file and lists
each unique word, along with its frequency. I can open the file, read
it, remove the punctuation, make it all uppercase (so that words like
help & Help are still classified as the same), and then strip out each
word to remove the white spaces - see code below.
[… snip, dict suggests at the bottom …]

A few remarks:

#remove punctuation from text
for i in text:

I’d use the name “c” or “ch” here - by convention “i” is usually an
integer, and often an index.

   if i in string.punctuation:
       text = text.replace(i,'')

You might want to replace with a space, as otherwise “this,that” will
become a single word.
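
A small sketch of the difference, using a made-up sample string (the
names "glued" and "spaced" are just for illustration):

```python
import string

text = "this,that and the other."  # made-up sample text

# Replacing each punctuation character with '' glues neighbours together.
glued = text
for ch in string.punctuation:
    glued = glued.replace(ch, '')

# Replacing with a space keeps the words apart; split() then
# discards the extra whitespace.
spaced = text
for ch in string.punctuation:
    spaced = spaced.replace(ch, ' ')

print(glued.split())   # ['thisthat', 'and', 'the', 'other']
print(spaced.split())  # ['this', 'that', 'and', 'the', 'other']
```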

You might also want to replace every character which is not a “word”
character (letters, maybe) rather than keeping a specific list of
rejected characters, even the presupplied string.punctuation.

While I’m usually slow to recommend regular expressions, this is a great
situation for them. You could match \w+ or some character class and skip
all this tedious per-character string mangling. If you step into
regexps, be aware that they can be shoehorned into doing many things,
often inappropriately. Try to keep some reluctance to reach for them as
the first choice of tool.
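
As a rough sketch of the regexp approach (the sample text is made up,
and which character class suits your data is something you would have
to decide):

```python
import re

text = "Help! I need help, this,that... don't panic."  # made-up sample

# \w+ matches runs of "word" characters: letters, digits and underscore.
# Note that it splits "DON'T" at the apostrophe.
words = re.findall(r"\w+", text.upper())
print(words)
# ['HELP', 'I', 'NEED', 'HELP', 'THIS', 'THAT', 'DON', 'T', 'PANIC']

# An explicit character class, here uppercase letters plus apostrophe,
# keeps "DON'T" in one piece.
words2 = re.findall(r"[A-Z']+", text.upper())
print(words2)
# ['HELP', 'I', 'NEED', 'HELP', 'THIS', 'THAT', "DON'T", 'PANIC']
```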


#this is where I add unique words to dictionary and
#increase value as they appear more than once
for word in words:

Well, making a dictionary is easy:

word_counts = {}

Then you just need to bump the counters for each word as you find them.

word_counts[word] += 1

The fiddly bit is that this will raise a KeyError the first time a word
is seen - there’s no entry for it yet, so nothing to bump. You could do this:

word_counts[word] = word_counts.get(word, 0) + 1
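
Put into a loop, that might look like this (the word list is a made-up
sample standing in for your cleaned-up words):

```python
words = ["HELP", "ME", "HELP", "YOU", "HELP"]  # made-up sample words

word_counts = {}
for word in words:
    # get() returns 0 for a word not seen before, so there is no KeyError
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)  # {'HELP': 3, 'ME': 1, 'YOU': 1}
```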

(See the dict.get method.) Or… you could grab a defaultdict from the
collections module:

from collections import defaultdict
word_counts = defaultdict(int)

and just use the += version. defaultdict() takes a callable and returns a
dict which calls that callable to set up the entry for anything missing.
So:

word_counts[word] += 1

for a new word would call int() and store the result in the dict
immediately, ready to be incremented - no need to do that explicitly, it
will happen as soon as you try to increment that entry.
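
A minimal sketch of the defaultdict version, again with a made-up word
list:

```python
from collections import defaultdict

words = ["HELP", "ME", "HELP", "YOU", "HELP"]  # made-up sample words

word_counts = defaultdict(int)
for word in words:
    # a missing word is first set to int(), i.e. 0, and then incremented
    word_counts[word] += 1

print(dict(word_counts))  # {'HELP': 3, 'ME': 1, 'YOU': 1}
```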

Often defaultdict is given a class name (for classes which can be
instantiated with no arguments - int() returns 0), but any factory
function accepting no arguments will do.
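
To illustrate the "any factory function" point, here is a sketch with a
hypothetical factory function (start_at_zero, invented for this example)
and with the list class, which records where each word was seen:

```python
from collections import defaultdict

def start_at_zero():
    # hypothetical factory: takes no arguments, returns the initial value
    return 0

counts = defaultdict(start_at_zero)
positions = defaultdict(list)  # list() returns a new empty list

for index, word in enumerate(["HELP", "ME", "HELP"]):  # made-up sample
    counts[word] += 1
    positions[word].append(index)

print(dict(counts))     # {'HELP': 2, 'ME': 1}
print(dict(positions))  # {'HELP': [0, 2], 'ME': [1]}
```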

There’s also a Counter class in the collections module of the standard
library (look it up in the index of the Python docs) - I don’t seem to
use it much, but it can also be used for this purpose.
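
For comparison, a short sketch with collections.Counter on the same kind
of made-up word list:

```python
from collections import Counter

words = ["HELP", "ME", "HELP", "YOU", "HELP"]  # made-up sample words

# Counter counts the items of any iterable of hashable values
word_counts = Counter(words)

print(word_counts["HELP"])         # 3
print(word_counts["ABSENT"])       # 0, missing words count as zero
print(word_counts.most_common(1))  # [('HELP', 3)]
```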

But this kind of exercise is almost always to get you to do the
mechanics yourself so that you know what’s going on - calling a magic
prebuilt tool that solves the problem for you isn’t the objective.

Cameron Simpson