UTF-8 BOM not being consumed when opening file

smontanaro · December 21, 2024, 1:10pm

My broker (Fidelity) has seen fit to change the data they deliver to me as “CSV”. In addition to crap at the start and end of the file, they’ve added a byte order mark U+EFBBBF, which this Wikipedia page tells me is the UTF-8 BOM.

When I open the file while declaring the encoding to be UTF-8, not all of the BOM is stripped:

>>> import csv
>>> x = open("Investment_income_balance_detail (1).csv", "rb")
>>> x.read(12)
b'\xef\xbb\xbf"Monthly"'
>>> rdr = csv.DictReader(open("Investment_income_balance_detail (1).csv", encoding="utf-8"))
>>> rdr.fieldnames
['\ufeff"Monthly"', 'Beginning balance', 'Market change', 'Dividends', 'Interest', 'Deposits', 'Withdrawals', 'Net advisory fees', 'Ending balance']

Shouldn’t the entire BOM be eaten in the process of opening the file? What have I missed/forgotten?

JamesParrott · December 21, 2024, 1:29pm

Yeah, that looks like a bug. open.read only strips the first byte, it’s nothing to do with csv.DictReader.

kpfleming · December 21, 2024, 1:29pm

From the Python 3.13.1 documentation for the codecs module (Codec registry and base classes):

Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to LATIN SMALL LETTER I WITH DIAERESIS, RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, INVERTED QUESTION MARK in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

Given that, if you use the utf-8-sig encoding instead of utf-8, you should be fine.
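
A minimal sketch of the difference, assuming a made-up two-column sample in place of the real Fidelity file (the `raw` bytes and the `read_fieldnames` helper are hypothetical, only there for illustration):

```python
import csv
import io

# Hypothetical stand-in for the broker's file: UTF-8 BOM, then a quoted header.
raw = b'\xef\xbb\xbf"Monthly",Beginning balance\r\n"Jan",100\r\n'


def read_fieldnames(data: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given encoding and return the CSV header."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return csv.DictReader(text).fieldnames


# "utf-8" decodes the BOM to U+FEFF and leaves it in the first field name.
with_bom = read_fieldnames(raw, "utf-8")

# "utf-8-sig" skips the BOM, so the quoted header parses normally.
without_bom = read_fieldnames(raw, "utf-8-sig")

print(with_bom)
print(without_bom)
```

With a real file the same thing should apply to `open(path, encoding="utf-8-sig", newline="")`.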

smontanaro · December 21, 2024, 1:33pm

Thanks. So if Fidelity “fixes” this in the future, I revert back, I guess…

I managed to almost completely avoid Microsoft stuff during my career (occasionally required to use Outlook), but it still pops up from time to time and annoys me.

kpfleming · December 21, 2024, 1:34pm

No need: utf-8-sig only consumes the first three bytes if they are those specific values; otherwise it’s identical to utf-8.
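
A quick way to check this on plain byte strings (the sample values are made up for illustration):

```python
# Hypothetical sample data: the same text with and without a leading UTF-8 BOM.
data_with_bom = b"\xef\xbb\xbfMonthly"
data_without_bom = b"Monthly"

# utf-8-sig drops the BOM when it is present...
sig_with = data_with_bom.decode("utf-8-sig")

# ...and behaves like plain utf-8 when it is absent.
sig_without = data_without_bom.decode("utf-8-sig")
plain_without = data_without_bom.decode("utf-8")

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
plain_with = data_with_bom.decode("utf-8")

print(repr(sig_with), repr(sig_without), repr(plain_without), repr(plain_with))
```

So it should be safe to leave utf-8-sig in place whether or not the BOM is still there later.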

barry-scott · December 21, 2024, 2:19pm

At least it was UTF-8 and not UTF-16, which I have had to handle…

Rosuav · December 21, 2024, 2:30pm

To clarify: That’s a sequence of byte values, not a Unicode codepoint. The “U+” designation refers to a character, and in this case that would be U+FEFF; what you are seeing at the start of the file is the three-byte sequence EF BB BF which is how U+FEFF is represented in UTF-8.
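
A short sketch of that relationship, using only the standard library (the variable names are just for illustration):

```python
import codecs
import unicodedata

# The character (code point) is U+FEFF.
bom_char = "\ufeff"

# Encoding that one character as UTF-8 gives the three bytes EF BB BF.
bom_bytes = bom_char.encode("utf-8")

# Decoding the three bytes with plain utf-8 gives back the single character.
round_trip = b"\xef\xbb\xbf".decode("utf-8")

print(bom_bytes.hex(" "))              # the byte sequence, in hex
print(f"U+{ord(round_trip):04X}")      # the code point
print(unicodedata.name(bom_char))      # the character's Unicode name
print(bom_bytes == codecs.BOM_UTF8)    # the stdlib constant for the same bytes
```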