# Sorting a column of numbers

Hello,

I have the following assignment:

Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
The data can be found in the following link: https://www.py4e.com/code3/mbox-short.txt?PHPSESSID=3a64fe134f5f073f3911c47546619bcc

04 3
06 1
07 1
09 2
10 3
11 6
14 1
15 2
16 4
17 2
18 1
19 1

What I have tried:

``````
1 name = input("Enter file:")
2 if len(name) < 1:
3     name = "mbox-short.txt"
4 handle = open(name)
5 counts = dict();

6 for line in handle:
7     if line.startswith('From:'):
8         pass
9     elif line.startswith('From'):
10       x = line.split();
11       time = x[5];
12       t = time.split(':');
13       hour = t[0];
14      for line in hour:
15          hoursorted = sorted(hour);
16         counts[line] = counts.get(line,0) + 1;
17
18        print(hoursorted, counts[line]);
``````

The output I'm getting is:

['0', '9'] 1
['0', '9'] 1
['1', '8'] 1
['1', '8'] 1
['1', '6'] 2
['1', '6'] 1
['1', '5'] 3
['1', '5'] 1
['1', '5'] 4
['1', '5'] 2
['1', '4'] 5
['1', '4'] 1
['1', '1'] 6
['1', '1'] 7
['1', '1'] 8
['1', '1'] 9
['1', '1'] 10
['1', '1'] 11
['1', '1'] 12
['1', '1'] 13
['1', '1'] 14
['1', '1'] 15
['1', '1'] 16
['1', '1'] 17
['0', '1'] 18
['0', '1'] 2
['0', '1'] 19
['0', '1'] 3
['0', '1'] 20
['0', '1'] 4
['0', '9'] 5
['0', '9'] 2
['0', '7'] 6
['0', '7'] 1
['0', '6'] 7
['0', '6'] 2
['0', '4'] 8
['0', '4'] 2
['0', '4'] 9
['0', '4'] 3
['0', '4'] 10
['0', '4'] 4
['1', '9'] 21
['1', '9'] 3
['1', '7'] 22
['1', '7'] 2
['1', '7'] 23
['1', '7'] 3
['1', '6'] 24
['1', '6'] 3
['1', '6'] 25
['1', '6'] 4
['1', '6'] 26
['1', '6'] 5

As you can see, the two digits of each hour are being separated. If I only print(hour), I get a column of unsorted numbers; however, they don't get separated by commas, nor do they get surrounded by brackets. I'm trying to sort them as a column and put the total number of times each number appears, from 'counts', on the right, as in the answer above.

I think my problem is with lines 14 and 15; it's clear that this is not the right way to sort a column. I searched the web and found that it is possible to do it with sort_values(), using pandas, but the compiler I'm using doesn't allow me to download pandas.

Could someone please clarify how I could sort this list without splitting each hour into its two digits and without the brackets?

Thank you.

Your file handler is not right, and I've no idea why you have the `if:` `elif:` test when (it seems to me) a single `if line.startswith('From'):` would do.

The rest of your script, for the most part, I can figure out, but as the formatting is off, I'm having to guess at the last part.

From what I can see, you're trying to sort a list object (holding the hour of the email), but that object holds strings, not numbers; the values should be integers for sorting.

I'd suggest that you have a list object (maybe named `hour`) and append the hour value to that, as an `int`, then simply sort that when all the data have been collected.

``````
hour = []

...

# grab the first two characters of the item at index 5 and append as an int value
hour.append(int(x[5][0:2]))

...

# sort the list
hour.sort()
``````

This may get you back on track, but I don't have enough data to know for sure. I'm also unsure what you're doing with the dictionary; so far as I can tell, you may not need that object, given that what you're trying to do is count how many times a particular hour has been logged.

We can also find a better name than `x`, possibly `hr` or `time_stamp`.

From what you've presented here and from what I can figure, I think your output should look something like this:

``````Output:
01 6
04 6
06 2
07 2
09 4
11 12
14 2
15 4
16 8
17 4
18 2
19 2
``````

As @rob42 mentioned, your code came through with its formatting lost, and that is a killer for Python with its required indentation. Try wrapping your code in triple back-ticks (or click the </> button):

``````
if something:
    do_something()
``````

Also, it's helpful to show us a sample of your input file, too.

Back to your issues: Rob's suggestions are good, but there's quite a bit going on here:

• not sure why you are making the distinction between From with and without a colon, but I'll trust there's a reason
• I'd be tempted to pull the time stamp by getting the second-to-last item after splitting, just in case there are some extra spaces in there: `time = x[-2]` (and as Rob said, x isn't the best name; you can actually re-use `line`).
• not that one-liners are a goal, but: `hour = time.split(':')[0]`
• get used to not using semi-colons …
• what is this? `for line in hour:` Here hour is a two-character string, so looping through it will get you the two characters; that's why they are getting split up.
• if I understand the problem correctly, you want to sort after counting.
• you've got the right idea with the counts dict, almost. You want to be using the hour as the key, and it won't exist the first time, so take a look at `dict.setdefault()` or the `collections.Counter` class.
• I can't see how your code is indented, but you want to make sure you first count everything, then sort, then print the output.
• Rob pointed out that your 'hour' is a string, so it may not sort correctly. In this case, with the leading zeros, it should, but it's good to know about the `key` parameter to the sort functions: `key=int` will convert to integers before sorting.
• once you have a dict with the hours and counts, there are a number of ways to print them sorted. I'm not going to write that code for you, but:
  • loop through the sorted keys, and then print each key and value
  • convert to a list of `(key, value)` (`(hour, count)` in this case) tuples, then sort them, then print them. Or loop through a sorted version and print each one.
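The count-then-sort-then-print order from the list above can be sketched roughly as follows. This is only a minimal sketch, not the full assignment: `sample_lines` is made-up stand-in data in the same layout as the mbox file, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical stand-in for the file contents (same layout as the mbox file).
sample_lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: alice@example.com",
    "From bob@example.com Fri Jan  4 18:10:48 2008",
    "From carol@example.com Fri Jan  4 16:10:39 2008",
    "From dave@example.com Fri Jan  4 09:05:31 2008",
]

counts = Counter()
for line in sample_lines:
    if not line.startswith("From "):  # the trailing space skips "From:" lines
        continue
    time = line.split()[-2]           # second-to-last field, e.g. "09:14:16"
    hour = time.split(":")[0]         # "09"
    counts[hour] += 1                 # the hour string is the key

# Count everything first, then sort the keys, then print.
report = [f"{hour} {counts[hour]}" for hour in sorted(counts, key=int)]
for row in report:
    print(row)
```

A plain dict with `counts[hour] = counts.get(hour, 0) + 1` would behave the same way here; `Counter` just removes the need to handle the first occurrence of a key.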

Good luck, you are close!

PS: think about your development process. You want to try out each piece of code before putting it all together; for instance, try running just these lines:

``````
for line in hour:
    hoursorted = sorted(hour);
``````

You'll immediately see what's up.
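As a small illustration of that experiment, here is a sketch that assumes `hour` holds a single two-character string such as `"09"` (the names `chars` and `sorted_chars` are only for the example):

```python
hour = "09"  # one hour value, as a two-character string

# Looping over a string yields its individual characters.
chars = [ch for ch in hour]
print(chars)

# sorted() on a string also works character by character and returns a list.
sorted_chars = sorted("10")
print(sorted_chars)
```

The second call returns a list of the two characters in sorted order, which is where bracketed pairs like the ones in the posted output come from.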

Also, no harm in scattering in some `print()`s while you are debugging.

I find IPython to be really helpful for this, though most IDEs provide a way to run a little bit of code, too.


Hello,

The reason I'm using 'elif' is because there are lines that start with 'From:' (notice the colon) and they shouldn't be included.

I tried to use … `hour.append(int(x[5][0:2])` … but I get the error of 'bad input'.

What's the value of x at that point?

Apparently the first two characters of the element at index 5 of x aren't an integer. You really need to do some debugging: put some prints in. Print x before that line, and it will probably be clear what's going on.


OK, thanks.

In your post, you have the code `x = line.split()`. Given that `line` is a string object…
`'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n'`
… at the first run, which makes the `list` object `x`:
`['From', 'stephen.marquard@uct.ac.za', 'Sat', 'Jan', '5', '09:14:16', '2008']`

So, `hour.append(int(x[5][0:2]))` means: append to the `hour` list, as an integer value, the first two characters (`[0:2]`) of the 6th element. There are 7 elements, indexed 0 to 6, of which index 5 is `'09:14:16'`, and the string part `09` will be type-converted to `9` before being appended to the `hour` list.
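The same walk-through, written out step by step as a sketch (the intermediate names `time_field`, `hour_text` and `hour_value` are only for illustration):

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n"

x = line.split()             # split on whitespace into 7 fields
time_field = x[5]            # the 6th element: '09:14:16'
hour_text = time_field[0:2]  # its first two characters: '09'
hour_value = int(hour_text)  # converted to the integer 9

hour = []
hour.append(hour_value)

print(x)
print(time_field, hour_text, hour_value)
```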

To add: as a general comment; when you edit a post, based on something further down the thread, it would be a courtesy to mention, in the edit, the reason for such, especially if in doing so, you invalidate some part of the thread, as you have indeed done: your edit invalidates a part of my first post.

[Corrected a part of the explainer]


I have a working solution, but I don't know if you want to work this out for yourself, or if this is now (as can happen) more of a frustration than a challenge.

To be fair, you're not too far off, but you seem to be missing some points of knowledge regarding some of the Python objects, as well as how a file handler should be coded.

In case you're still having some issues with coding a file handler and as a reply to the above, this should get you back on track:

``````
hour = []

with open("data") as fh:  # 'fh' is the file handler
    for line in fh:
        if line.startswith("From "):  # include the space, then "From:" will be dropped
            print(f"Reading: {line}")  # this is a debugging output and can be omitted
            line_list = line.split()
            ...
``````
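To see why the trailing space in `"From "` matters, here is a quick check on two sample strings. The `header` string is an assumed example of a `From:` line; only the first one is quoted from the assignment.

```python
envelope = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"
header = "From: stephen.marquard@uct.ac.za"  # assumed example of a "From:" line

for text in (envelope, header):
    # Compare the test with and without the trailing space.
    print(repr(text[:6]), text.startswith("From "), text.startswith("From"))
```

Both strings pass `startswith("From")`, but only the first passes `startswith("From ")`, so no separate `elif` branch is needed to skip the header lines.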

If you want my full solution, then simply ask for it: I don't want to post that if you'd rather do this for yourself, but I fully understand that if you're stuck, then it could be of more help to you to see the full code so that you learn from that.

As a hint: you don't need a dictionary and you don't need to sort the hour list in order to have the output look like this:

``````Output:
04 3
06 1
07 1
09 2
10 3
11 6
14 1
15 2
16 4
17 2
18 1
19 1
``````

No, but a dict is a good way to do it; there's always more than one way to skin a cat. I'm not sure what Rob has in mind, but yes, if you know ahead of time that there are 24 hours in the day, as you do this time, this is easier. In a more general sense, a dict is useful when you don't know ahead of time how many 'bins' you have.
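A rough sketch of the two situations, using made-up values (`hours_seen` and `senders` are hypothetical data, not taken from the file):

```python
hours_seen = [9, 18, 16, 9, 4]  # hypothetical hour values, already ints

# Known bins: one slot per hour of the day, so a fixed-size list is enough.
by_hour = [0] * 24
for hr in hours_seen:
    by_hour[hr] += 1

# Unknown bins: a dict gains a key the first time a value appears.
senders = ["a@example.com", "b@example.com", "a@example.com"]  # hypothetical
by_sender = {}
for sender in senders:
    by_sender[sender] = by_sender.get(sender, 0) + 1

for hr, n in enumerate(by_hour):
    if n:  # only print the hours that actually occurred
        print(f"{hr:02} {n}")
print(by_sender)
```

With the list, the hours come out in order without any sorting step; with the dict, the keys would still need sorting if the output order matters.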

Indeed there is. In fact I have two ways coded right now (one based on what I have already posted and one based on a generator) and I'm sure there are a few more, such as the one you have in mind.

It could very well be that the OP has abandoned this thread, which would be a bit of a shame. I'll give it a few more days then post up my code so that the thread is not a dead loss, as it may be of use to someone in the future; even more so if there are three or four different solutions.

As it's looking as if this thread has indeed been abandoned, I'd like to tie up some loose ends by posting my solutions to the topic.

I'll not write any explainers right now (there are comment lines), as I think that the code is not too hard to follow, but if you have any questions, then go right ahead and ask.

My V1

``````
hour = []

with open("data", mode="r", encoding="UTF-8") as fh:  # 'fh' is the file handler
    for line in fh:
        if line.startswith("From "):  # include the space, then "From:" will be dropped
            print(f"Reading: {line}")  # this is a debugging output and can be omitted
            line_list = line.split()
            # grab the first two characters of the item at index 5 and append as an int value
            hour.append(int(line_list[5][0:2]))
# the 'with' block closes the file once the 'for' loop has finished

for hr in range(24):  # create a loop for the 24 hour values (zero to 23)
    if hr in hour:  # execute if any of the hr values are in the hour list
        found = hour.count(hr)  # count how many are in the hour list
        print(f"{hr:02} {found}")  # present the findings, formatted
        # the formatting code :02 is to present the hour as two digits
``````

Almost the same, but using a generator as a file handler

``````
hour = []

lines = (line for line in open("data", mode="r", encoding="UTF-8"))

for line in lines:
    line_list = line.split()
    if line_list and line_list[0] == "From":
        hour.append(int(line_list[5][0:2]))

for hr in range(24):
    if hr in hour:
        found = hour.count(hr)
        print(f"{hr:02} {found}")
``````

And as above, but using a dictionary and an added feature.

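``````
hours = []
emails = {}  # assumed reconstruction: sender -> list of date/time stamps
lines = (line for line in open("data", mode="r", encoding="UTF-8"))

for line in lines:
    line_list = line.split()
    if line_list and line_list[0] == "From":
        FROM = line_list[1]
        TD_STAMP = list(line_list[2:7])
        # assumed reconstruction: the lines around 'else:' were lost from the post
        if FROM in emails:
            emails[FROM].append(TD_STAMP)
        else:
            emails[FROM] = [TD_STAMP]

# output the details
for item in emails.items():  # assumed reconstruction of the missing loop header
    print(f"emails from {item[0]}")
    td_stamps = item[1]
    for td_stamp in td_stamps:
        hours.append(int(td_stamp[3][0:2]))
        print(td_stamp)
    print()

# required output
for hr in range(24):
    if hr in hours:
        found = hours.count(hr)
        print(f"{hr:02} {found}")
``````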

In posting this, I trust that I've not introduced any of my bad habits, habits that I'm doing my best to drop.
