Help needed with split and sort

shomikc · July 2, 2023, 10:03am

The question given is: "Write a code that accepts a sequence of words as input and prints the words in a sequence after sorting them alphabetically."

The answer given is

my_str = "Hello this Is an Example With cased letters"

# To take input from the user
#my_str = input("Enter a string: ")

# breakdown the string into a list of words
words = [word.lower() for word in my_str.split()]

# sort the list
words.sort()

# display the sorted words

print("The sorted words are:")
for word in words:
   print(word)

My question is: why do I need lower() or upper()? Why can't I have the string sorted just the way it is written? Thank you.

rob42 · July 2, 2023, 10:46am

The answer is that you can have the string sorted just the way that it is written, but when it comes to sorting, you’ll discover that uppercase comes before lowercase. This is because Python compares strings by Unicode code point, and for these letters the code points are the same as in ASCII: A to Z are 65 to 90 and a to z are 97 to 122, so the output would be:

Example
Hello
Is
With
an
cased
letters
this
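
To check this yourself, here is a minimal sketch using only the built-in ord() and sorted(). The variable names upper_range, lower_range and plain_sorted are just illustrative, not part of the original exercise:

```python
# Code points of the first and last letter in each range.
upper_range = (ord("A"), ord("Z"))
lower_range = (ord("a"), ord("z"))
print(upper_range, lower_range)  # (65, 90) (97, 122)

# Sort the words exactly as written, with no case conversion.
my_str = "Hello this Is an Example With cased letters"
plain_sorted = sorted(my_str.split())
for word in plain_sorted:
    print(word)
```

Every capitalised word lands before every lowercase word, because 90 (Z) is still smaller than 97 (a).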

sr-murthy · July 2, 2023, 10:50am

I guess it is possible that the question was worded incorrectly, and that casing was left out of the description.

But the casing makes all the difference to the sorting of the words, because in Python strings are sorted by comparing the sequence of ordinal values (Unicode code points) of the characters in the strings, e.g. if we compare the strings "abc" and "abC" we see that:

In [28]: "abc" > "abC"
Out[28]: True

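because the strings are compared character by character, and the first position where they differ decides the result. Here that is the last character, where the ordinal value of "c" (99) is greater than that of "C" (67):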

In [30]: list(map(ord, "abc"))
Out[30]: [97, 98, 99]

In [31]: list(map(ord, "abC"))
Out[31]: [97, 98, 67]
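
As a rough illustration of that character-by-character rule, the sketch below finds the first position where two strings differ. The helper name first_difference is made up for this example; Python does this comparison internally and does not expose such a function:

```python
def first_difference(a, b):
    """Return (index, ord(a[i]), ord(b[i])) at the first differing position, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, ord(x), ord(y)
    # No difference within the shared length: one string is a prefix of the other.
    return None

print(first_difference("abc", "abC"))  # (2, 99, 67)
print("abc" > "abC")                   # True, because 99 > 67 at index 2
print("Zebra" < "apple")               # True, decided at index 0: 90 < 97
print("ab" < "abc")                    # True, a prefix sorts before the longer string
```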

sr-murthy · July 2, 2023, 10:56am

A bit on your test string: we can first consider a version where we don’t change the case, just split the string and print out the sorted words, as well as the first character, and the ord value:

In [26]: for w in sorted("Hello this Is an Example With cased letters".split()):
    ...:     print(f'{w}, {w[0]}, {ord(w[0])}')
    ...: 
Example, E, 69
Hello, H, 72
Is, I, 73
With, W, 87
an, a, 97
cased, c, 99
letters, l, 108
this, t, 116

If you now compare this with the same where we lowercase the string, before (or after) splitting, then we get:

In [27]: for w in sorted("Hello this Is an Example With cased letters".lower().split()):
    ...:     print(f'{w}, {w[0]}, {ord(w[0])}')
    ...: 
an, a, 97
cased, c, 99
example, e, 101
hello, h, 104
is, i, 105
letters, l, 108
this, t, 116
with, w, 119

kknechtel · July 2, 2023, 5:46pm

Did you try doing that? What happens if you try it? Do you understand that result?

cameron · July 2, 2023, 10:33pm

It has been mentioned that the .lower() or .upper() is to normalise
the words so that e.g. “AGE” and “age” sort together, because the default
sort compares the strings directly, and strings compare by the ordinal
values of their letters, and the entire UPPERCASE range occurs before
the lowercase range.

But you can sort using the original words:

 words = my_str.split()
 sorted_words = sorted(words, key=str.lower)
 print(sorted_words)

The sorted() and list.sort() functions accept an optional key=
parameter which is a function to compute the key for the sort
comparison. See the documentation for sorted() and list.sort().

Without the key= parameter, the comparison key is the value itself.

With the parameter, the key function is called on each value to get what
to compare. In the example above we’re using str’s .lower function,
so that the sort compares the lowercase forms of the words. The words
themselves are unchanged.
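
A small sketch of that, reusing the test string from the original question. The names sorted_words and in_place are only for illustration:

```python
my_str = "Hello this Is an Example With cased letters"
words = my_str.split()

# sorted() returns a new list; the key is only used for comparing.
sorted_words = sorted(words, key=str.lower)
print(sorted_words)
# ['an', 'cased', 'Example', 'Hello', 'Is', 'letters', 'this', 'With']

# list.sort() takes the same key= parameter but sorts the list in place.
in_place = list(words)
in_place.sort(key=str.lower)
print(in_place == sorted_words)  # True
```

Note that the output keeps the original capitalisation: the lowercase forms exist only as comparison keys.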

As another example, when two words differ only in case we often (in
English) sort the uppercase one before the lowercase one, e.g. in document
indices etc. You could invoke the sort like this:

 sorted_words = sorted(
     words,
     key=lambda word: (word.lower(), word),
 )

Here we’re doing something more sophisticated for the sort comparison
key: we’re computing a 2-tuple of the lowercase form of the word, and
its original, eg:

 ('age', 'Age')

When you compare tuples (or any sequence), the comparison compares the
first member, then if they’re the same the second member, and so
forth. In this way, if the lowercase forms are different, that controls
the sort. But if they’re the same eg 'Age' and 'age', their lower
case forms will be the same, and we fall back to their original forms,
which would sort 'Age' before 'age'.
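
To see the tie-breaking at work, here is a sketch with a made-up word list (not from the original question) that contains several spellings differing only in case:

```python
words = ["age", "bob", "Age", "Bob", "AGE"]

# Key is only the lowercase form: ties keep their original relative order.
by_lower = sorted(words, key=str.lower)
print(by_lower)  # ['age', 'Age', 'AGE', 'bob', 'Bob']

# Key is a 2-tuple: ties on the lowercase form fall back to the original word.
by_pair = sorted(words, key=lambda word: (word.lower(), word))
print(by_pair)   # ['AGE', 'Age', 'age', 'Bob', 'bob']

# The tuple comparisons behind that ordering:
print(("age", "Age") < ("age", "age"))  # True, first members equal, 'Age' < 'age'
print(("age", "zzz") < ("bob", "Aaa"))  # True, decided by the first members alone
```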

The lambda in the example is Python syntax for an anonymous function.
In this case it accepts one argument (word) and returns the 2-tuple.

We could write the example like this:

 def word_sort_key(word):
     return (word.lower(), word)

 sorted_words = sorted(words, key=word_sort_key)

which works exactly the same, but for small things like this it is
common to write a lambda directly in the call.

Cheers,
Cameron Simpson cs@cskk.id.au

Rosuav · July 2, 2023, 11:46pm

Small side point: If the purpose of your transformation is to normalize case for sorting, there’s a better choice: casefold(). It’s specifically designed so that everything that would upper or lower case to the same thing will come out as the same thing, which isn’t guaranteed with either upper() or lower():

>>> def equal(a, b):
...     if a == b: print("%r == %r already" % (a, b))
...     elif a.upper() == b.upper(): print("%r.upper() == %r.upper()" % (a, b))
...     elif a.lower() == b.lower(): print("%r.lower() == %r.lower()" % (a, b))
...     elif a.casefold() == b.casefold(): print("%r.casefold() == %r.casefold()" % (a, b))
...     else: print("%r and %r are not equal." % (a, b))
... 
>>> equal("ß", "ss")
'ß'.upper() == 'ss'.upper()
>>> equal("ẞ", "ß")
'ẞ'.lower() == 'ß'.lower()
>>> equal("ẞ", "ss")
'ẞ'.casefold() == 'ss'.casefold()
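
As a sketch of how this affects sorting, str.casefold can be passed as the key in place of str.lower. The word list below is invented for the example:

```python
words = ["Straße", "strasse", "STRASSE", "Strand"]

# With lower(), 'ß' stays 'ß', so "Straße" gets a different key from "strasse".
by_lower = sorted(words, key=str.lower)
print(by_lower)     # ['Strand', 'strasse', 'STRASSE', 'Straße']

# With casefold(), 'ß' becomes 'ss', so all three spellings share one key
# and keep their original relative order.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)  # ['Strand', 'Straße', 'strasse', 'STRASSE']
```

For plain ASCII words like the ones in the original question, lower() and casefold() give the same ordering.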

shomikc · July 3, 2023, 10:36am

Yes, sir. I did.
I replaced words = [word.lower() for word in my_str.split()] with
words = [ my_str.split()]

The result was the same string as the input. Thanks for asking this question.

shomikc · July 3, 2023, 10:26pm

No, Sir. I did not understand the answer.

cameron · July 3, 2023, 11:36pm

That seems slightly odd, though we would need to see the new code.

This:

 [ word.lower() for word in my_str.split() ]

is a “list comprehension”, which makes a new list whose values are
word.lower() for each value from my_str.split(). So, a list of
strings.

However, this:

 [ my_str.split() ]

is not a list comprehension. It is a list containing one value, and
that value is the result of my_str.split(), which is a list of
strings. So the result of this expression is something like this:

 [ [ "Words", "from", "the", "my_str", "variable" ] ]

i.e. a list-of-strings inside a list.
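
A short sketch of the difference, using the string from the original question. This is a guess at what the modified program did, since the new code was not posted; the names nested, flat and fixed are only illustrative:

```python
my_str = "Hello this Is an Example With cased letters"

nested = [my_str.split()]                         # a list holding ONE item: a list of words
flat = [word.lower() for word in my_str.split()]  # a list of eight lowercase strings
print(len(nested), len(flat))  # 1 8

# Sorting a one-item list has nothing to reorder, so the loop prints
# the inner list once, with the words still in their input order.
nested.sort()
for item in nested:
    print(item)

# To sort the words as written, drop the extra brackets.
fixed = my_str.split()
fixed.sort()
print(fixed)
```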

It is perhaps unfortunate that the two expressions are visually so
similar, though the syntax for a list comprehension was undoubtedly
chosen to be similar to that for a plain list.

Cheers,
Cameron Simpson cs@cskk.id.au