How to count the number of times specific words appear in a text file

I’ve been working on this one question for hours. I thought this would be simple but it has turned out to be more complicated than it should be. I’ve been asked “How many events does each “node” have in the applog.txt file? Display the node and number of events for the node in the output.” I’ve been told beforehand that it must be solved using at least one function and a dictionary is recommended. I’m unable to attach the txt file I was assigned but it pretty much is ten thousand lines full of this:
[node1] - 190.223.252.106 - User Successful Login
[node2] - 239.84.157.20 - User Successful Profile Picture Upload
[node2] - 87.130.185.37 - User Successful Login
[node6] - 210.155.211.219 - User Successful Payment
[node5] - 64.103.198.103 - User Successful Login
[node2] - 136.4.60.67 - User Successful Login
[node7] - 191.161.59.79 - User Failed Login
[node1] - 128.215.207.129 - User Failed Payment
[node6] - 203.141.97.180 - User Successful Profile Picture Upload
[node6] - 172.218.65.224 - User Successful Profile Picture Upload
[node3] - 14.45.232.197 - User Successful Login
[node2] - 87.130.185.37 - User Failed Login
I really want to figure this problem out so I can move on.

It’s not altogether clear: are you asking for a complete solution or do you have some code that you’ve written and need help with?

to add…

Do you need the output to include the ip address and event, or would this output suffice?

Output based on what you’ve posted:

Node	Events
 1		  2
 2		  4
 6		  3
 5		  1
 7		  1
 3		  1

This code is the best I got so far

# Open the text file in read mode
text = open("Applog.txt", "r")

# Creating an empty dictionary
d = dict()

# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()

    # Convert the characters in line to lowercase to avoid case mismatch
    line = line.lower()

    # Split the line into words
    line = line.split(" ")

    # The list of nodes to be analyzed
    words = ["node1", "node2", "node3", "node4", "node5", "node6", "node7"]

    # Iterate over each word in line
    for word in words:
        # Check if the word is already in the dictionary
        if word in d:
            # Increase the word count by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary, counting 1
            d[word] = 1

# Print the contents of the dictionary
for key in list(d.keys()):
    print(key, ":", d[key])

The output should be something like
node1: 12354
node2: 34325
node3: 3863

Maybe you could explain what this means?

So far as I can see, it has no relationship with your sample log file.

e.g: Node 1 has 2 events: log entry #1 and log entry #8


Based on your first post, this is what I’ve come up with.

entry = 'start'
events = {}

def process_log(entry):
    """returns a tuple: (node number, ip address, log message)"""
    if entry:
        node_number = ''
        entry = entry.split('-')
        node = entry[0].strip()
        for char in node:
            if char.isdigit():
                node_number += char
        ip_address = entry[1].strip()
        msg = entry[2].strip()
        return node_number, ip_address, msg
    else:
        return False
    
with open('logfile','r') as log:
    while entry:
        entry = process_log(log.readline())
        if entry:
            print(entry) # this simply displays the tuple; it can be removed.
            node = entry[0]
            if node in events:
                event = events.get(node)
                event += 1
                events[node] = event
            else:
                events.update({node:1})
print()
print("Node\tEvents")
for node in events:
    print(f" {node}\t\t  {events.get(node)}")

The custom function extracts all of the data, so that you have the option to use said data in the events dictionary and include it in a report.

Suggested development:

  • Have an entry number for each event
  • Include the ip address and the log message
  • List the events for each node, grouped.
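One possible way to approach these three suggestions, as a rough sketch only: keep a list of (entry number, ip address, message) tuples per node instead of a bare count. The helper name `parse_entry`, the `grouped` dictionary and the five sample lines are assumptions made for illustration; they are not part of the script above.

```python
def parse_entry(line):
    """Hypothetical helper: return (node, ip address, log message) for one log line."""
    node, ip_address, msg = (part.strip() for part in line.split(" - ", 2))
    return node.strip("[]"), ip_address, msg

# First five lines of the sample log from the original post.
sample = [
    "[node1] - 190.223.252.106 - User Successful Login",
    "[node2] - 239.84.157.20 - User Successful Profile Picture Upload",
    "[node2] - 87.130.185.37 - User Successful Login",
    "[node6] - 210.155.211.219 - User Successful Payment",
    "[node5] - 64.103.198.103 - User Successful Login",
]

grouped = {}
# enumerate(..., start=1) supplies the entry number for each log line.
for number, line in enumerate(sample, start=1):
    node, ip_address, msg = parse_entry(line)
    # setdefault creates the empty list the first time a node is seen.
    grouped.setdefault(node, []).append((number, ip_address, msg))

for node in sorted(grouped):
    entries = grouped[node]
    print(node, len(entries))
    for number, ip_address, msg in entries:
        print(f"\tEntry {number:4}: {msg} from {ip_address}")
```

The event count for a node is then just the length of its list, so no separate counter is needed.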

This code is the best I got so far

Thank you. We like to see your code, it gives us something to comment
on. We don’t write full solutions because you won’t learn much - better
to see your attempt and discuss what might work better.

BTW, paste code between triple backticks, it preserves the indenting and
punctuation. Example:

 ```
 your code
 goes here
 ```

I’ll make some remarks about the code inline below.

 # Open the text file in read mode
 text = open("Applog.txt", "r")

Usually we open files just while they’re needed (so tightly around your
for-loop below), and use this idiom:

 with open("Applog.txt", "r") as text:
     .... for-loop etc goes here ....

This automatically closes the file at the unindent after the “with”.
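A small sketch to see that behaviour for yourself. It writes a throwaway temporary file so that it does not depend on Applog.txt existing; the variable names are just for illustration.

```python
import os
import tempfile

# Create a throwaway file holding one sample log line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("[node1] - 190.223.252.106 - User Successful Login\n")
    path = tmp.name

with open(path, "r") as text:
    first_line = text.readline()
    open_inside = not text.closed   # the file is still open inside the block

closed_after = text.closed          # the file was closed at the unindent

os.remove(path)                     # tidy up the throwaway file
print(open_inside, closed_after)    # True True
```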

Anyway, what you have is not a bug.

 # Creating an empty dictionary
 d = dict()

You can write an empty dict directly:

 d = {}

Also not a bug.

 # Loop through each line of the file
 for line in text:
     # Remove the leading spaces and newline character
     line = line.strip()
     # Convert the characters in line to lowercase to avoid case mismatch
     line = line.lower()
     # Split the line into words
     line = line.split(" ")

At this point I’d be printing line to see what was in it, to be sure.
I’d also use a different variable name than line here, maybe fields
or something. Up to this point line is a string, the current line of
text from the file. Suddenly it is a list. Hence the name change. Again,
not a bug.
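For example, running those three steps on one line of the sample log (the name `fields` is only a suggestion) shows what the list holds:

```python
line = "[node1] - 190.223.252.106 - User Successful Login\n"

# Same three steps as in the loop, but the result gets its own name.
fields = line.strip().lower().split(" ")

print(fields)
# ['[node1]', '-', '190.223.252.106', '-', 'user', 'successful', 'login']
```

Note that the node is still wrapped in square brackets and the dashes are list items of their own.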

 #The list of nodes to be analyzed
 words = ["node1", "node2", "node3", "node4", "node5", "node6", "node7"]

I’d define this just once, at the top of the programme. Not a bug.

 # Iterate over each word in line
 for word in words:
     # Check if the word is already in the dictionary
     if word in d:
         # Increase the word count by 1
         d[word] = d[word] + 1
     else:
         # Add the word to dictionary, counting 1
         d[word] = 1

You’re bumping a counter in the dict for each word. But you’re doing it
uniformly for all the words in words, once for every line. This has
nothing to do with the contents of the line. I would expect all the
counts in your dict to be the same, and to be the same as the number
of lines in the file.
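A quick way to check that, as a sketch which uses three lines of the sample log in place of the real file:

```python
sample = [
    "[node1] - 190.223.252.106 - User Successful Login",
    "[node2] - 239.84.157.20 - User Successful Profile Picture Upload",
    "[node2] - 87.130.185.37 - User Successful Login",
]

words = ["node1", "node2", "node3", "node4", "node5", "node6", "node7"]
d = {}

for line in sample:
    fields = line.strip().lower().split(" ")  # computed, but never consulted below
    for word in words:
        if word in d:
            d[word] = d[word] + 1
        else:
            d[word] = 1

print(d)  # every count is 3, the number of lines, whatever the lines contain
```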

Instead, I think you want to extract eg “node3” from the first field in
your file, which looked like [node3] IIRC. You could do something
like:

 node = line[0].lstrip('[').rstrip(']')

Then print("node =", node) to see how that went.

Then look up node in your dict and only bump the counter for node.
No for-loop.
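Put together, that might look something like the sketch below. The function name `count_nodes` is made up for illustration, and the twelve sample lines from the original post stand in for the real file; with a file you would pass the open file object instead of the list.

```python
def count_nodes(lines):
    """Return a dict mapping each node name to its number of log entries."""
    counts = {}
    for line in lines:
        fields = line.strip().lower().split(" ")
        if not fields[0]:
            continue  # skip blank lines
        # "[node3]" -> "node3"
        node = fields[0].lstrip("[").rstrip("]")
        # Bump the counter for this one node only.
        if node in counts:
            counts[node] = counts[node] + 1
        else:
            counts[node] = 1
    return counts

sample = [
    "[node1] - 190.223.252.106 - User Successful Login",
    "[node2] - 239.84.157.20 - User Successful Profile Picture Upload",
    "[node2] - 87.130.185.37 - User Successful Login",
    "[node6] - 210.155.211.219 - User Successful Payment",
    "[node5] - 64.103.198.103 - User Successful Login",
    "[node2] - 136.4.60.67 - User Successful Login",
    "[node7] - 191.161.59.79 - User Failed Login",
    "[node1] - 128.215.207.129 - User Failed Payment",
    "[node6] - 203.141.97.180 - User Successful Profile Picture Upload",
    "[node6] - 172.218.65.224 - User Successful Profile Picture Upload",
    "[node3] - 14.45.232.197 - User Successful Login",
    "[node2] - 87.130.185.37 - User Failed Login",
]

counts = count_nodes(sample)
for key, value in counts.items():
    print(key, ":", value)
```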

 # Print the contents of the dictionary
 for key in list(d.keys()):
     print(key, ":", d[key])

You don’t need list(d.keys()) here, d.keys() would do. Not a bug. You
may see code like list(d.keys()) in practice for code which expects to
modify d, because that might imply that the keys change during the
loop and that brings unexpected behaviour. So the list(.....)
effectively takes a copy of the keys before the loop runs. You don’t
need that here.
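To illustrate the difference, here is a small sketch with a made-up dict in which one node has a zero count that we want to delete:

```python
counts = {"node4": 0, "node1": 2, "node2": 4}

# Deleting while iterating the dict directly: Python notices the size change.
raised = False
try:
    for key in counts:
        if counts[key] == 0:
            del counts[key]
except RuntimeError as exc:
    raised = True
    print("direct iteration failed:", exc)

# Start again, this time iterating over a copy of the keys.
counts = {"node4": 0, "node1": 2, "node2": 4}
for key in list(counts.keys()):
    if counts[key] == 0:
        del counts[key]

print(counts)  # {'node1': 2, 'node2': 4}
```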

Also, iterating a dict iterates its keys anyway, so you can in fact
write:

 for key in d:

and since wanting the key and the value at the same time is very common
you can use the dict.items() method instead:

 for key, value in d.items():
     print(key, ":", value)

for added convenience.

The output should be something like
node1: 12354
node2: 34325
node3: 3863

Indeed. But you don’t say what you’re getting instead.

Cheers,
Cameron Simpson cs@cskk.id.au

@beetlebat

I’m not sure if what I’ve done is of any help, but if it is and you want to move this forward, there’s a very easy way to modify the script that I’ve posted, so that the output is sorted by node number, like this:

Node	Events
 1		  2
 2		  4
 3		  1
 5		  1
 6		  3
 7		  1

… which is the first step for a grouped (by node number) output.
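The modification itself was not posted; one way such a sort might be done, purely as a sketch, is to loop over sorted(events, key=int). The `events` dictionary below is filled in by hand with the counts from the sample output, and `sorted_nodes` is a name chosen for illustration.

```python
# Counts as produced from the twelve sample lines; keys are node numbers as strings.
events = {"1": 2, "2": 4, "6": 3, "5": 1, "7": 1, "3": 1}

# key=int sorts numerically, so a node "10" would come after "9", not before "2".
sorted_nodes = sorted(events, key=int)

print("Node\tEvents")
for node in sorted_nodes:
    print(f" {node}\t\t  {events.get(node)}")
```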

If you’re interested in moving this forward and implementing my suggestions, then post a reply and I’ll walk you through it.

Or, reach out to @cameron (by way of a reply) and I’m sure that he’ll also guide you in that direction, but tbh, if we both try to guide you in different directions, it’s going to get confusing for you, so it’s your choice.

For me it seems like simple counting. Get node part out from every line and count every occurrence. So there are two tasks: (1) get node part out from every row (2) count these nodes

from collections import Counter

with open('nodes_log.txt', 'r') as f:
    events = Counter(row.split(' - ')[0].strip('[]') for row in f)

print(*(f'{k}:{v}' for k, v in events.items()), sep='\n')

# node1:2
# node2:4
# node6:3
# node5:1
# node7:1
# node3:1

I’m sure the OP knows they’re trying to count things. They probably do
not know about the Counter class. But if their task is to implement
a counter as a learning exercise then the above doesn’t help them.

Rob and I are trying to help them understand where their code is not
working and how to fix it.

Cheers,
Cameron Simpson cs@cskk.id.au

Learning means different things for different people.

After finding out about Counter (or clearly defining that counting is one of the objectives) one could be curious enough to find out how counting is implemented and use it in their own implementation. But then, maybe not. Fish is usually more tempting than fishing.
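As a rough idea of what that counting amounts to (a sketch of the general technique, not Counter’s actual implementation), a plain dict with dict.get gives the same tallies. The list of nodes is a made-up sample.

```python
from collections import Counter

nodes = ["node1", "node2", "node2", "node6", "node5", "node2"]

manual = {}
for node in nodes:
    # get() returns 0 for a node that has not been seen yet.
    manual[node] = manual.get(node, 0) + 1

print(manual)                          # {'node1': 1, 'node2': 3, 'node6': 1, 'node5': 1}
print(manual == dict(Counter(nodes)))  # True
```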

This completely solved my problem, thank you for your help.

You’re very welcome and I hope you learned something from that. I know how frustrating it can be and well remember that feeling, when I first started with Python.

The script is not too difficult to build upon (as per my suggested development) and is that way by design.

The output from that script, with very little code added, and only two (I think) minor mods is now:

Node	Events
 1	      2
		       Entry    1: User Successful Login                   from 190.223.252.106
		       Entry    8: User Failed Payment                     from 128.215.207.129
 2	      4
		       Entry    2: User Successful Profile Picture Upload  from 239.84.157.20
		       Entry    3: User Successful Login                   from 87.130.185.37
		       Entry    6: User Successful Login                   from 136.4.60.67
		       Entry   12: User Failed Login                       from 87.130.185.37
 3	      1
		       Entry   11: User Successful Login                   from 14.45.232.197
 5	      1
		       Entry    5: User Successful Login                   from 64.103.198.103
 6	      3
		       Entry    4: User Successful Payment                 from 210.155.211.219
		       Entry    9: User Successful Profile Picture Upload  from 203.141.97.180
		       Entry   10: User Successful Profile Picture Upload  from 172.218.65.224
 7	      1
		       Entry    7: User Failed Login                       from 191.161.59.79

So, like I said in my last post, if you’re interested in moving it forward, just post back.