Sorting a column of numbers

Hello,

I have the following assignment:

Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
The data can be found in the following link: https://www.py4e.com/code3/mbox-short.txt?PHPSESSID=3a64fe134f5f073f3911c47546619bcc

04 3
06 1
07 1
09 2
10 3
11 6
14 1
15 2
16 4
17 2
18 1
19 1

What I have tried:

1 name = input("Enter file:")
2 if len(name) < 1:
3     name = "mbox-short.txt"
4 handle = open(name)
5 counts = dict();

6 for line in handle:
7     if line.startswith('From:'):
8         pass
9     elif line.startswith('From'):
10       x = line.split();
11       time = x[5];
12       t = time.split(':');
13       hour = t[0];
14      for line in hour:
15          hoursorted = sorted(hour);
16         counts[line] = counts.get(line,0) + 1;
17
18        print(hoursorted, counts[line]);

The output I’m getting is:

['0', '9'] 1
['0', '9'] 1
['1', '8'] 1
['1', '8'] 1
['1', '6'] 2
['1', '6'] 1
['1', '5'] 3
['1', '5'] 1
['1', '5'] 4
['1', '5'] 2
['1', '4'] 5
['1', '4'] 1
['1', '1'] 6
['1', '1'] 7
['1', '1'] 8
['1', '1'] 9
['1', '1'] 10
['1', '1'] 11
['1', '1'] 12
['1', '1'] 13
['1', '1'] 14
['1', '1'] 15
['1', '1'] 16
['1', '1'] 17
['0', '1'] 18
['0', '1'] 2
['0', '1'] 19
['0', '1'] 3
['0', '1'] 20
['0', '1'] 4
['0', '9'] 5
['0', '9'] 2
['0', '7'] 6
['0', '7'] 1
['0', '6'] 7
['0', '6'] 2
['0', '4'] 8
['0', '4'] 2
['0', '4'] 9
['0', '4'] 3
['0', '4'] 10
['0', '4'] 4
['1', '9'] 21
['1', '9'] 3
['1', '7'] 22
['1', '7'] 2
['1', '7'] 23
['1', '7'] 3
['1', '6'] 24
['1', '6'] 3
['1', '6'] 25
['1', '6'] 4
['1', '6'] 26
['1', '6'] 5

As you can see, the two digits of each hour are being separated. If I only print(hour), I get a column of unsorted numbers, but they are not separated by commas or surrounded by brackets. I’m trying to sort them as a column and put the total number of times each number appears (from “counts”) on the right, as in the expected output above.

I think my problem is with lines 14 and 15; it’s clear that this is not the right way to sort a column. I searched the web and found that it is possible to do it with sort_values(), using pandas, but the environment I’m using doesn’t allow me to install pandas.

Could someone please clarify how I could sort this list without splitting each number into its two digits and without the brackets?

Thank you.

Your file handler is not right and I’ve no idea why you have the if: elif: test when (it seems to me) that a single if line.startswith('From'): would do.

The rest of your script, for the most part, I can figure out, but as the formatting is off, I’m having to guess at the last part.

From what I can see, you’re trying to sort a list object (holding the hour of the email), but that object holds strings, not numbers; the values should be integers for sorting.

I’d suggest that you have a list object (maybe named hour) and append the hour value to that, as an int, then simply sort that when all the data have been collected.

hour = []

...

# grab the first two characters of the element at index 5 and append as an int value
hour.append(int(x[5][0:2]))

...

# sort the list
hour.sort()

This may get you back on track, but I don’t have enough data to know for sure. I’m also unsure what you’re doing with the dictionary; so far as I can tell, you may not need that object, given that what you’re trying to do is count how many times a particular hour has been logged.
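To illustrate the strings-versus-integers point, here is a minimal sketch. The values in hours_as_text are made-up samples, not taken from the real file, and the variable names are only for illustration.

```python
# Hypothetical sample values (not from mbox-short.txt)
hours_as_text = ["9", "10", "04", "19"]

# Strings are compared character by character, so "10" sorts before "9"
print(sorted(hours_as_text))

# Converting to int first gives numeric order
hours_as_int = sorted(int(h) for h in hours_as_text)
print(hours_as_int)
```

When every hour is a zero-padded two-character string, as in the 'From ' lines here, both orderings should agree; the difference only shows up when the strings have different lengths.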

We can also find a better name than x, possibly hr or time_stamp.

Edit to add:

From what you’ve presented here and from what I can figure, I think your output should look something like this:

Output:
01 6
04 6
06 2
07 2
09 4
11 12
14 2
15 4
16 8
17 4
18 2
19 2

As @rob42 mentioned, your code came through with the formatting lost – that is a killer for Python with its required indentation. Try wrapping your code in triple back-ticks (or click the </> button):

if something:
    do_something

Also, it’s helpful to show us a sample of your input file, too.

Back to your issues – Rob’s suggestions are good, but there’s quite a bit going on here:

  • not sure why you are making the distinction between From with and without a colon – but I’ll trust there’s a reason :)
  • I’d be tempted to pull the time stamp by getting the second to last item after splitting, just in case there are some extra spaces in there: time = x[-2] (and as Rob said, x isn’t the best name – you can actually re-use line).
  • not that one liners are a goal, but: hour = time.split(':')[0]
  • get used to not using semi-colons …
  • what is this? for line in hour: – hour is a two-character string, looping through it will get you the two characters – that’s why they are getting split up.
  • if I understand the problem correctly, you want to sort after counting.
  • you’ve got the right idea with the counts dict – almost. you want to be using the hour as the key. and it won’t exist the first time, so take a look at dict.setdefault() – or the collections.Counter class.
  • I can’t see how your code is indented, but you want to make sure you first count everything, then sort, then print the output.
  • Rob pointed out that your “hour” is a string, so it may not sort correctly. In this case, with the leading zeros, the strings should sort correctly, but it’s good to know about the key parameter to the sort functions – key=int compares the values as integers when sorting.
  • once you have a dict with the hours and counts, there are a number of ways to print them sorted. I’m not going to write that code for you, but:
    • loop through the sorted keys, and then print each key and value
    • convert to a list of (key, value) ((hour, count) in this case) tuples, then sort them, then print them. Or loop through a sorted version and print each one.
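
The points in the list above can be pulled together roughly as follows. This is only a sketch under assumptions: sample_lines is a made-up stand-in for the file (the addresses are hypothetical), and it shows just the first of the two printing options, looping over the sorted keys.

```python
# Hypothetical sample lines in the same shape as the assignment's example
sample_lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: alice@example.com",
    "From bob@example.com Fri Jan  4 18:10:48 2008",
    "From carol@example.com Fri Jan  4 16:10:39 2008",
    "From alice@example.com Fri Jan  4 09:05:31 2008",
]

counts = {}
for line in sample_lines:
    if not line.startswith("From "):  # the trailing space skips "From:" lines
        continue
    time = line.split()[-2]           # second-to-last field, e.g. "09:14:16"
    hour = time.split(":")[0]         # the hour, kept as a two-character string
    counts[hour] = counts.get(hour, 0) + 1  # the hour is the dict key

# Count everything first, then sort, then print
for hour in sorted(counts, key=int):
    print(hour, counts[hour])
```

Here counts.get(hour, 0) covers the case where the key does not exist yet; dict.setdefault() or collections.Counter would be alternative ways to do the same thing.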

Good luck – you are close!

PS: think about your development process – you want to try out each piece of code before putting it all together – for instance, try running just these lines:

for line in hour:
    hoursorted = sorted(hour);

You’ll immediately see what’s up.
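
For anyone reading along without an interpreter to hand, a small sketch of that experiment, assuming hour holds a single value such as "09" (the loop variable is renamed to ch here for clarity):

```python
hour = "09"  # one hour value, as the original script extracts it

# Looping over a string visits its characters one at a time
for ch in hour:
    print(ch)

# sorted() on a string likewise returns a list of its characters
print(sorted(hour))
```

Both the loop and sorted() treat the string as a sequence of characters, which is where the bracketed pairs in the output come from.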

Also, no harm in scattering in some print()s while you are debugging.

I find iPython to be really helpful for this – though most IDEs provide a way to run a little bit of code, too.


Hello,

The reason I’m using “elif” is because there are lines that start with “From:” (notice the colon) and they shouldn’t be included.

I tried to use…hour.append(int(x[5][0:2])…but I get the error of “bad input”.

what’s the value of x at that point?

Apparently the first two characters of x[5] (the sixth element of x) aren’t an integer. You really need to do some debugging – put some prints in – print x before that line, and it will probably be clear what’s going on.


OK, thanks.

In your post, you have the code x = line.split(). Given that line is a string object…
'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n'
… at the first run which makes the list object x:
['From', 'stephen.marquard@uct.ac.za', 'Sat', 'Jan', '5', '09:14:16', '2008']

So, hour.append(int(x[5][0:2])) means append to the hour list, as an integer value, the first two characters ([0:2]) of the 6th element. There are 7 elements, indexed 0 – 6, of which index 5 is '09:14:16', and the string part 09 will be type converted to the integer 9 before being appended to the hour list.
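
The same walk-through as a few lines that can be run as-is, using the example line from the assignment; this is just a sketch for checking each step by printing it.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n"

x = line.split()       # split on whitespace; the trailing newline is dropped
print(len(x))          # number of elements
print(x[5])            # the time stamp
print(x[5][0:2])       # its first two characters, still a string

hour = []
hour.append(int(x[5][0:2]))  # type converted to an int before appending
print(hour)
```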

To add: as a general comment; when you edit a post, based on something further down the thread, it would be a courtesy to mention, in the edit, the reason for such, especially if in doing so, you invalidate some part of the thread, as you have indeed done: your edit invalidates a part of my first post.

[Corrected a part of the explainer]


I have a working solution, but I don’t know if you want to work this out for yourself, or if this is now (as can happen) more of a frustration than a challenge.

To be fair, you’re not too far off, but you seem to be missing some point of knowledge regarding some of the Python objects, as well as how a file handler should be coded.

To add:

In case you’re still having some issues with coding a file handler and as a reply to the above, this should get you back on track:

hour = []

with open("data") as fh:  # 'fh' is the file handler
    for line in fh:
        if line.startswith("From "):  # include the space, then "From:" will be dropped
            print(f"Reading: {line}")  # this is a debugging output and can be omitted
            line_list = line.split()
...

If you want my full solution, then simply ask for it: I don’t want to post that if you’d rather do this for yourself, but I fully understand that if you’re stuck, then it could be of more help to you to see the full code so that you learn from that.

As a hint: you don’t need a dictionary and you don’t need to sort the hour list in order to have the output look like this:

Output:
04 3
06 1
07 1
09 2
10 3
11 6
14 1
15 2
16 4
17 2
18 1
19 1

No, but a dict is a good way to do it – there’s always more than one way to skin a cat. I’m not sure what Rob has in mind, but yes, if you know ahead of time that there are 24 hours in the day, as you do this time, doing without one is easier. In a more general sense, a dict is useful when you don’t know ahead of time how many “bins” you have.
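
A rough sketch of the two situations, assuming a made-up list of hour values (hours is hypothetical sample data): a fixed list of 24 bins when the range is known in advance, and collections.Counter, a dict subclass, when it isn’t.

```python
from collections import Counter

hours = [9, 18, 16, 9, 4]  # hypothetical hour values, already converted to int

# Known number of bins: one slot per hour of the day
bins = [0] * 24
for hr in hours:
    bins[hr] += 1

for hr, found in enumerate(bins):
    if found:                      # skip hours with no messages
        print(f"{hr:02} {found}")

# Unknown number of bins: let a dict-like Counter create keys as needed
tally = Counter(hours)
for hr in sorted(tally):
    print(f"{hr:02} {tally[hr]}")
```

Both loops print the same lines for this sample; the Counter version would carry over unchanged to keys such as email addresses, where the number of bins isn’t known in advance.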

Indeed there is. In fact I have two ways coded right now (one based on what I have already posted and one based on a generator) and I’m sure there are a few more, such as the one you have in mind.

It could very well be that the OP has abandoned this thread, which would be a bit of a shame. I’ll give it a few more days then post up my code so that the thread is not a dead loss as it may be of use to someone in the future; even more so if there are three or four different solutions.


Added content:

As it’s looking as if this thread has indeed been abandoned, I’d like to tie up some loose ends by posting my solutions to the topic.

I’ll not write any explainers right now (there are comment lines), as I think that the code is not too hard to follow, but if you have any questions, then go right ahead and ask.

My V1

hour = []

with open("data", mode="r", encoding="UTF-8") as fh:  # 'fh' is the file handler
    for line in fh:
        if line.startswith("From "):  # include the space, then "From:" will be dropped
            print(f"Reading: {line}")  # this is a debugging output and can be omitted
            line_list = line.split()
            # grab the first two characters of the element at index 5 and append as an int value
            hour.append(int(line_list[5][0:2]))
# the 'with' block closes the file once the 'for' loop has finished

for hr in range(24):  # create a loop for the 24 hour values (zero to 23)
    if hr in hour:  # execute if any of the hr values are in the hour list
        found = hour.count(hr)  # count how many are in the hour list
        print(f"{hr:02} {found}")  # present the findings, formatted
# the formatting code :02 is to present the hour as two digits

Almost the same, but using a generator as a file handler

hour = []

lines = (line for line in open("data", mode="r", encoding="UTF-8"))

for line in lines:
    line_list = line.split()
    if line_list and line_list[0] == "From":
        hour.append(int(line_list[5][0:2]))


for hr in range(24):
    if hr in hour:
        found = hour.count(hr)
        print(f"{hr:02} {found}")

And as above, but using a dictionary and an added feature.

hours = []
headers = {}
lines = (line for line in open("data", mode="r", encoding="UTF-8"))

for line in lines:
    line_list = line.split()
    if line_list and line_list[0] == "From":
        FROM = line_list[1]
        TD_STAMP = list(line_list[2:7])
        if FROM in headers:
            headers[FROM].append(TD_STAMP)
        else:
            headers[FROM] = [TD_STAMP]

# output the details
for item in headers.items():
    print(f"emails from {item[0]}")
    td_stamps = item[1]
    for td_stamp in td_stamps:
        hours.append(int(td_stamp[3][0:2]))
        print(td_stamp)
    print()

# required output
for hr in range(24):
    if hr in hours:
        found = hours.count(hr)
        print(f"{hr:02} {found}")

In posting this, I trust that I’ve not introduced any of my bad habits, habits that I’m doing my best to drop.
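
One possible refinement to the generator versions, offered only as a sketch: a generator expression wrapped around open() leaves closing the file to the interpreter, whereas a generator function can keep the with block. The function name from_lines and the temporary sample file below are hypothetical, introduced just for this demonstration.

```python
import os
import tempfile


def from_lines(path):
    """Yield only the 'From ' lines; the with block closes the file."""
    with open(path, mode="r", encoding="UTF-8") as fh:
        for line in fh:
            if line.startswith("From "):
                yield line


# Hypothetical two-line sample written to a temporary file for the demo
sample = (
    "From someone@example.com Sat Jan 5 09:14:16 2008\n"
    "From: someone@example.com\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="UTF-8"
) as tmp:
    tmp.write(sample)

# Exhausting the generator lets the with block inside it close the file
hour = [int(line.split()[5][0:2]) for line in from_lines(tmp.name)]
os.remove(tmp.name)
print(hour)
```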
