How does Python 3.12 handle extended ASCII letters?

c-rob · February 16, 2024, 1:07pm

I’m pretty new to Python. How does Python 3.12 on Windows 11 handle extended ASCII characters like accented letters? I will be reading and writing Excel and CSV files. I will also be running Python programs in a web app on the MS Azure ecosystem via a web page.

In Perl it was pretty painful and difficult to do. We sometimes have European names and words with accented characters in spreadsheets or other files. We also have some spreadsheets with Japanese in them as well.

My job is to read the original spreadsheet and output the data in different groupings or sortings.

elis.byberi · February 16, 2024, 1:14pm

Accented Western European characters are found in the Latin-1 encoding (ISO/IEC 8859-1; see the Wikipedia article of that name).

MegaIng · February 16, 2024, 1:18pm

In Python it is generally pretty easy; just make sure you know the encoding of the input files. Otherwise you probably don’t need to think about it.

kknechtel · February 16, 2024, 1:47pm

The same way as any Python 3 on any operating system.

“Extended ASCII” hasn’t been a particularly useful or meaningful term for quite some time. Unicode is the standard.

There are two separate concepts needed to understand Unicode: the encoding (rules that say how to re-interpret the bytes as a sequence of “code point” numbers) and the Unicode mapping (rules that assign an actual character to each code point - i.e., explain which character is which, what kind of script it comes from, how to position it graphically, and other such character properties). But with Unicode, the latter is always the same. There is a common Unicode database, which is versioned and maintained by the Unicode Consortium. So the interesting part is the encoding.
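To make the split between "encoding" and "Unicode mapping" concrete, here is a minimal sketch using only the standard library. The sample character and the variable names are my own choices for illustration, not anything from the original question.

```python
import unicodedata

text = "\u00e9"  # the single character "é"

# The Unicode mapping: one code point with one official name,
# no matter how the character is stored on disk.
code_point = ord(text)
name = unicodedata.name(text)

# The encoding: the same character becomes different bytes
# depending on which rules are used.
utf8_bytes = text.encode("utf-8")      # two bytes
latin1_bytes = text.encode("latin-1")  # one byte

# Decoding with the wrong rules gives garbage ("mojibake"), not an error.
mojibake = utf8_bytes.decode("latin-1")

print(code_point, name)
print(utf8_bytes, latin1_bytes)
print(mojibake)
```

The character itself never changes; only the byte representation does, which is why knowing the file's encoding is the part that matters.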

Dealing with all the things that can be done with text, really properly, in a way that works for every language, etc. etc. is incredibly complex. But fortunately for you, the operating system and other built-in libraries are responsible for most of that, and almost all of the rest is stuff that you won’t have to worry about if you just want to read and write text that someone else prepared for you. The hard part comes when you need to worry about language-specific rules for upper/lowercasing, sorting etc. Python has a little of this built in (see e.g. str.casefold).
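As a rough illustration of the case-folding and sorting point, the sketch below compares a plain sort with one that uses a hypothetical `sort_key` helper (my own name, not a built-in). It strips accents and casefolds, which is only an approximation: it is not a substitute for real language-specific collation rules.

```python
import unicodedata

names = ["Zoë", "Émile", "Zoe", "emile", "Straße"]


def sort_key(s):
    """Hypothetical helper: accent-insensitive, case-insensitive key."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", s)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(), e.g. "ß" becomes "ss".
    return stripped.casefold()


naive = sorted(names)                 # orders by raw code point
folded = sorted(names, key=sort_key)  # groups accented and plain forms together

print(naive)
print(folded)
```

The plain sort puts every capital letter before every lowercase one and pushes "Émile" to the end, because it compares code point numbers. Whether dropping accents is acceptable depends on the language and on what the groupings are for.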

When you open the CSV file, simply specify the encoding of the file, according to how the file is encoded. This is something you have to figure out, either by knowing (because you wrote the file), being told (someone else left you some metadata somewhere), or guessing (trying some options until it works, or looking at the raw bytes first and doing some analysis, or using a third-party library to do that analysis). After that, Python takes care of the rest.
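A minimal sketch of both cases, using only the standard library. The file name, the sample rows, the candidate encoding list, and the `decode_guessing` helper are all assumptions for illustration; real files may need a different list or a proper detection library.

```python
import csv
import tempfile
from pathlib import Path

rows = [["name", "city"], ["José", "Málaga"], ["田中", "東京"]]

# Case 1: the encoding is known, so just say so when opening the file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "people.csv"  # hypothetical file name
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)
    with open(path, encoding="utf-8", newline="") as f:
        read_back = list(csv.reader(f))


# Case 2: the encoding is unknown, so try candidates in order.
def decode_guessing(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first fit."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


guessed_text, guessed_encoding = decode_guessing("José".encode("cp1252"))
print(read_back)
print(guessed_text, guessed_encoding)
```

Note that `latin-1` accepts any byte sequence, so it only makes sense as the last candidate, and a successful decode does not prove the guess was right. Checking the result by eye is still worthwhile.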

Excel files are much more complex than CSV - they store a lot more information than just the text of each “cell”, and they have a more complex internal structure. You should use a third-party library to process them.

But as long as you know the encoding of the input file, “European” (there are separate sets of characters for several major alphabets, even when they superficially look the same) or “Japanese” (there is a large block of “unified CJK” characters that include kanji, and then separate blocks for katakana and hiragana) characters won’t cause a problem.
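For the curious, the standard `unicodedata` module shows how these characters are distinguished. This is only a small sketch with sample characters I picked myself.

```python
import unicodedata

# Three capital letters that look alike but are separate characters:
# Latin A, Cyrillic A (U+0410), Greek Alpha (U+0391).
lookalikes = ["A", "\u0410", "\u0391"]
lookalike_names = [unicodedata.name(ch) for ch in lookalikes]

# One kanji, one hiragana, and one katakana character.
japanese = ["漢", "あ", "ア"]
japanese_names = [unicodedata.name(ch) for ch in japanese]

# All of them survive a round trip through UTF-8 unchanged.
mixed = "".join(lookalikes + japanese)
round_trip = mixed.encode("utf-8").decode("utf-8")

for ch, name in zip(lookalikes + japanese, lookalike_names + japanese_names):
    print(f"U+{ord(ch):04X} {name}")
```

Once the text has been decoded correctly, Python treats all of these as ordinary characters in a `str`, with no special handling per script.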

bschubert · February 16, 2024, 4:15pm

For more on some of the topics Karl mentioned, there’s this popular article: