When writing a file, many characters become "?"

Hi all,

When writing an output file, many characters become "?", such as 'à', '°', etc.
Can someone help me solve this problem?
Thank you very much

How are you writing the file? Can you show the simplest possible piece of code that demonstrates the problem?

How are you viewing the file that you think contains question marks?

Example code:

with open('test.txt', 'w') as f:
    f.write('\u00C1')  # Á = capital A with acute

with open('test.txt', 'r') as f:
    s = f.read()

print(s)

That should correctly print A with acute accent.

Note: the only reason I used f.write('\u00C1') in my code above is that I don't trust the Discuss software in this forum to keep the A with accent correctly. There seem to be bugs in the forum software that sometimes mangle accents. But in your code, you should be able to just write f.write('Á') and it should work fine.

This is unlikely to be a problem with Python; instead it will probably be an encoding problem. Probably the program you are using to read the file thinks you are using a different encoding than you actually are.

If you don't know about encodings, please read this:

Also, can you tell us the version of Python you are using, and your operating system? (Linux, Mac OSX, Windows, something else?)

If you can, it would also be good if you could tell us the default encoding your OS is using. If you don't know how to find that out, on Linux you can try running these in an xterm or other console:

echo $LC_CTYPE
echo $LANG

That might also work on Mac.
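If those variables are empty, or you are not on Linux, Python itself can report the encoding it will use by default. A quick check using only the standard library:

```python
import locale
import sys

# The encoding open() uses when no encoding= argument is given
print(locale.getpreferredencoding(False))

# The encoding of stdout, which affects what print() can display
print(sys.stdout.encoding)
```

Paste the output of both lines into your reply.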

If you are using an xterm, it should have a menu command that lets you set the encoding. Please tell us what that is set to.

Unfortunately, when dealing with non-ASCII characters, there are a million things that can go wrong. Fortunately Python is probably not one of them, so we should be able to work out what is happening if you give us some more information.

To ensure UTF-8 is used, it has to be set explicitly via open(..., encoding='utf-8'). The default I/O encoding may only support a legacy character set, such as Windows ANSI code pages 1250-1258.
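A minimal sketch of writing and reading with an explicit encoding, so the result no longer depends on the OS default (the filename is just an example):

```python
# Passing encoding='utf-8' explicitly removes any dependence on the
# platform's default locale encoding.
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('à°Á')

with open('test.txt', 'r', encoding='utf-8') as f:
    print(f.read())  # à°Á
```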

I don’t have a Windows system to test, but I don’t think it really
matters what encoding is used to write to the file, so long as both the
writer and the reader agree on the encoding.

(And of course the characters being written must exist in the encoding, which they do; otherwise it would raise a UnicodeEncodeError or UnicodeDecodeError exception.)

So if D writes 'à' (U+00E0) out to a file, on Linux it will be written in UTF-8 and the file will contain b'\xc3\xa0', but on Windows it might be written in some other encoding like (let's say) Latin-1. That would give you a file containing a single byte b'\xe0'. If you read it back in using the default Latin-1 encoding, you would still get 'à', not a question mark.
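You can check those byte sequences directly by encoding the character yourself:

```python
s = '\u00e0'  # à

# The same character, two different byte representations
print(s.encode('utf-8'))    # b'\xc3\xa0'
print(s.encode('latin-1'))  # b'\xe0'

# A round trip in the same encoding recovers the character
assert s.encode('latin-1').decode('latin-1') == s
```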

The problem comes if you try to view the file in another application,
which is expecting UTF-8. But on Windows, is that very likely?

I don’t know.

It would really help if D writes back with more information, otherwise I
don’t think we can solve this problem.

My reply was specifically with regard to the unqualified statements that the given code "should correctly print A with acute accent" and "should work fine". If the default file encoding doesn't support U+00C1, the code will fail with a UnicodeEncodeError. It only "should" work, in an unqualified sense, if a Unicode encoding is used, such as UTF-8 or UTF-16.
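That failure is easy to demonstrate, using the ascii codec to stand in for a limited default encoding:

```python
try:
    # Á (U+00C1) does not exist in ASCII, so this raises
    '\u00C1'.encode('ascii')
except UnicodeEncodeError as e:
    print(e)
```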

Depends on the error mode. The OP would get the behaviour they describe if they write a file with an incomplete non-Western encoding (or us-ascii) and errors="replace". This isn’t the default for Python’s open(), but it may well be the default behaviour of some modules on Windows.
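That replacement behaviour can be reproduced directly (again with the ascii codec standing in for an incomplete legacy encoding):

```python
# With errors='replace', unencodable characters silently become '?',
# which matches the symptom the OP describes.
print('à°Á'.encode('ascii', errors='replace'))  # b'???'
```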

Not likely, but more plausible than it once was. The default behaviour of Visual Studio Code on Windows is to open files as UTF-8 and silently replace non-UTF-8 bytes with ‘�’ (not ‘?’)…
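The difference between the two replacement characters is visible if you decode a Latin-1 byte as UTF-8 with replacement enabled:

```python
data = b'\xe0'  # 'à' encoded in Latin-1

# As UTF-8 this byte is invalid, so it becomes U+FFFD ('�'), not '?'
print(data.decode('utf-8', errors='replace'))
```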