My broker (Fidelity) has seen fit to change the data they deliver to me as “CSV”. In addition to crap at the start and end of the file, they’ve added a byte order mark U+EFBBBF, which this Wikipedia page tells me is the UTF-8 BOM.
When I open the file while declaring the encoding to be UTF-8, not all of the BOM is stripped:
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
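The codec behaviour described in that passage can be checked directly in the interpreter. This is only a minimal sketch: the string "abc" and the variable names are made up for illustration, and it uses nothing beyond the standard `codecs` and `unicodedata` modules.

```python
import codecs
import unicodedata

# Encoding with utf-8-sig puts the three signature bytes in front of the data.
data = "abc".encode("utf-8-sig")
starts_with_bom = data.startswith(codecs.BOM_UTF8)  # BOM_UTF8 is b"\xef\xbb\xbf"

# Decoding with utf-8-sig skips the signature; plain utf-8 keeps it as U+FEFF.
decoded_sig = data.decode("utf-8-sig")
decoded_plain = data.decode("utf-8")

# Input without a signature is still accepted by utf-8-sig.
decoded_no_bom = b"abc".decode("utf-8-sig")

# The same three bytes read as iso-8859-1 give the characters named above.
latin1_names = [unicodedata.name(ch) for ch in codecs.BOM_UTF8.decode("iso-8859-1")]
```

With plain utf-8 the decoded text starts with an invisible U+FEFF character, which is the leftover that usually shows up glued to the first column name of a CSV file.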
Given that, if you use the utf-8-sig encoding instead of utf-8, you should be fine.
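Something like the following might do for reading the file. It is a sketch under assumptions: the column names and sample rows are invented (not Fidelity's actual layout), `read_rows` and `write_sample` are hypothetical helpers, and the extra junk at the start and end of the real file is not handled here.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV file, silently dropping a leading UTF-8 BOM if one is present."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.reader(f))


def write_sample(raw_bytes):
    """Write raw bytes to a temporary .csv file and return its path (demo only)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as f:
        f.write(raw_bytes)
    return path


# Made-up sample data standing in for the broker's download.
payload = b"Symbol,Quantity\r\nABC,10\r\n"
with_bom = write_sample(b"\xef\xbb\xbf" + payload)
without_bom = write_sample(payload)

rows_with_bom = read_rows(with_bom)
rows_without_bom = read_rows(without_bom)

os.remove(with_bom)
os.remove(without_bom)
```

Because utf-8-sig only skips the signature when it is actually there, the same code should keep working if the BOM later disappears from the download.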
Thanks. So if Fidelity “fixes” this in the future, I revert, I guess…
I managed to almost completely avoid Microsoft stuff during my career (occasionally required to use Outlook), but it still pops up from time to time and annoys me.
To clarify: That’s a sequence of byte values, not a Unicode codepoint. The “U+” designation refers to a character, and in this case that would be U+FEFF; what you are seeing at the start of the file is the three-byte sequence EF BB BF which is how U+FEFF is represented in UTF-8.
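A quick way to see the difference between the code point and its encoded bytes, as a small sketch using only built-ins and the standard `codecs` module (the variable names are just for illustration):

```python
import codecs

bom_char = "\ufeff"  # the character U+FEFF

# The same code point produces different byte sequences in different encodings.
utf8_bytes = bom_char.encode("utf-8")         # three bytes: EF BB BF
utf16_be_bytes = bom_char.encode("utf-16-be")  # two bytes: FE FF

matches_constant = utf8_bytes == codecs.BOM_UTF8

# 0xEFBBBF is larger than the highest Unicode code point (0x10FFFF),
# so it cannot be a "U+" value at all.
try:
    chr(0xEFBBBF)
    is_valid_codepoint = True
except ValueError:
    is_valid_codepoint = False
```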