Bytes data being converted to string before reaching the encoding's decode method

In Python 3 (3.9 and 3.11), I am using codecs.EncodedFile inside a DictWriter to write data, which may contain non-ASCII characters, to a CSV file.
I am using the code below:

import codecs, csv, tempfile, os

s=b'c1318'
print(s, type(s))
temp_file_dir = "/Users/chaturvedi/Documents"
file_fd, tmp_out_file_url = tempfile.mkstemp(dir=temp_file_dir, text=True)
print("file_fd=", file_fd, "tmp_out_file_url=", tmp_out_file_url)
out_file_descriptor = os.fdopen(file_fd, "wb")
print("out_file_descriptor=", out_file_descriptor)
csv_writer = csv.DictWriter(codecs.EncodedFile(out_file_descriptor, 'utf-8', 'utf-16'), [b'head'], extrasaction='ignore', dialect='excel-tab')
print("csv_writer=", csv_writer.__dict__)
csv_writer.writerow({b'head':s})

But when I run this, I get this error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/csv.py", line 162, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 836, in write
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/encodings/utf_8.py", line 24, in decode
    return codecs.utf_8_decode(input, errors, True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'str'

After some investigation, I found that the bytes data is converted to a string before it reaches the decode method of encodings/utf_8.py. That method in turn calls codecs.utf_8_decode, which expects bytes data.
Below are some logs:

(python3_env) chaturvedi@Abhisheks-MacBook-Pro scripts % python encoding_test.py

b'c1318' <class 'bytes'>

file_fd= 3 tmp_out_file_url= /Users/chaturvedi/Documents/tmpvh76vaji

out_file_descriptor= <_io.BufferedWriter name=3>

csv_writer= {'fieldnames': [b'head'], 'restval': '', 'extrasaction': 'ignore', 'writer': <_csv.writer object at 0x1003f9480>}

reached writerow

rowdict= {b'head': b'c1318'}

self.writer= <_csv.writer object at 0x1003f9480> <built-in method __dir__ of _csv.writer object at 0x1003f9480>

k= b'head' <class 'bytes'>

v= b'c1318' <class 'bytes'>

self.writer= <_csv.writer object at 0x1003f9480>

reached decode

input= b'c1318'

type= <class 'str'>

I tried several tweaks, such as changing binary mode to text and passing str data, but nothing worked.

Some conversion is taking place in between that I cannot see, since it seems to happen in frozen code (either _csv or _codecs). If I explicitly convert the input to bytes in the decode method, it works.

Also, in Python 2 the same code works fine without any issues.

Please help me find a solution to this, or confirm whether this is a bug in Python 3.

csv.DictWriter expects the file to be a text file, not a binary file. When you use codecs.EncodedFile, you’re giving csv.DictWriter a binary file.
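
A small sketch of what is happening, using only the standard library: in Python 3 the csv writer turns any non-string field into text with str() before calling the file object's write method, so a bytes value reaches the file as the text of its repr. The RecordingFile class is a hypothetical helper written for this illustration, not part of any library.

```python
import csv


class RecordingFile:
    """Hypothetical stand-in for a file: it only records what csv passes to write()."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)
        return len(data)


recorder = RecordingFile()
writer = csv.DictWriter(recorder, [b'head'], dialect='excel-tab')
writer.writerow({b'head': b'c1318'})

for chunk in recorder.chunks:
    # Each chunk is a str, never bytes, even though the value passed in was bytes.
    print(type(chunk).__name__, repr(chunk))
```

This matches the log in the question, where the input printed as b'c1318' but its type was str: the decode method of a binary wrapper such as EncodedFile is then handed text it cannot decode.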

Hi @MRAB, can you please elaborate on the solution? I did not completely get it. The same snippet works fine with Python 2, so what change do I need to make for it to work with Python 3 as well? I tried changing the file mode to wt, w, etc., and changing the data to binary, string, etc., but it did not work.

A couple of things:

  • Please have a look at the documentation for DictWriter: csv — CSV File Reading and Writing — Python 3.11.5 documentation. The example there provides good guidance.
  • You should pass strings to the DictWriter, not bytes (using the b"" literal).
  • The writer will only need a text file open for writing. open() will do this for you and you can pass in the encoding to use in the file with the encoding parameter.
  • codecs.EncodedFile() is meant for writing binary data to a file, not for text data. It uses Unicode strings as an intermediate format.
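
To illustrate the last point, here is a minimal sketch of how codecs.EncodedFile behaves in Python 3, with an in-memory io.BytesIO standing in for a file opened in binary mode (that substitution is an assumption made to keep the example self-contained). Bytes written to the wrapper are decoded as UTF-8 and stored as UTF-16, while writing a str fails with the same TypeError as in the traceback.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for a file opened in binary mode
# Data written to the wrapper is treated as UTF-8 bytes;
# the underlying file receives UTF-16 bytes.
recoder = codecs.EncodedFile(raw, 'utf-8', 'utf-16')

recoder.write('日本語'.encode('utf-8'))  # bytes in: accepted and recoded
stored = raw.getvalue()
print(stored.decode('utf-16'))

str_error = None
try:
    recoder.write('日本語')  # str in: this is what csv passes in Python 3
except TypeError as exc:
    str_error = exc
    print('TypeError:', exc)
```

Since the csv writer always hands str to write(), the two do not fit together in Python 3.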

It seems like you are creating the file from scratch, and the problem is really just “I want to make sure that I can write a file that has non-ASCII characters in it”. In this case, you are making the problem much too complicated. Just open a file in the normal way, in text mode, and specify an encoding that can handle the characters you want to write (for example, encoding='utf-8'). Then give that file to the DictWriter, and use strings (not bytes objects) for both the keys and values of the dict that you are writing.

You should also not use the tempfile interface for files that are actually supposed to remain on the disk after the program has finished. The “temp” part means temporary. Files like that should go in a special directory for temp files (such as C:\Windows\Temp), not in the user’s Documents folder.

This is how such a task normally looks:

import csv

s = '日本語' # some text with non-ascii characters - it's *text*, not a bytes object
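# newline='' lets the csv module control the line endings itself
with open('my_data.csv', 'w', encoding='utf-8', newline='') as f:
    # DictWriter requires the list of field names
    writer = csv.DictWriter(f, fieldnames=['language'])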
    writer.writerow({'language': s})

CSV is fundamentally a text format. If you have binary data, you should convert it to text as part of building up the data that you want to store in the file:

import csv

# the same string, pre-encoded as UTF-8 and stored in a bytes literal
data = b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
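# newline='' lets the csv module control the line endings itself
with open('my_data.csv', 'w', encoding='utf-8', newline='') as f:
    # DictWriter requires the list of field names
    writer = csv.DictWriter(f, fieldnames=['language'])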
    writer.writerow({'language': data.decode('utf-8')})

Do not try to write raw bytes to the file even if they are already in the right encoding. This just complicates matters.

If you need to append to an existing file, then open it specifying the encoding that it already uses, as text, in append mode ('a' rather than 'w'). You are completely free to interpret binary data coming into the program with a different encoding, if that makes sense for the data.
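
A sketch of that append case, under some assumptions made for illustration: the file lives in a throwaway temporary directory, the column is called 'word', and the incoming bytes happen to be Latin-1 rather than UTF-8.

```python
import csv
import os
import tempfile

# Hypothetical input: text arriving as Latin-1 bytes from somewhere else.
incoming = 'café'.encode('latin-1')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_data.csv')

    # Create the file with a header row and one data row.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['word'])
        writer.writeheader()
        writer.writerow({'word': '日本語'})

    # Append later: same encoding as the existing file, mode 'a'.
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['word'])
        # Decode with the encoding the *input* uses, not the file's encoding.
        writer.writerow({'word': incoming.decode('latin-1')})

    # Read everything back to check both rows survived.
    with open(path, encoding='utf-8', newline='') as f:
        rows = list(csv.DictReader(f))

print(rows)
```

The encoding passed to open() describes the file on disk; the encoding passed to decode() describes the incoming bytes. The two are independent.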

Hi, I will rephrase my question, since it seems to have become too complicated because of the temp files and bytes data.

  1. Opening a csv file for writing
  2. Creating EncodedFile out of that file object
  3. Creating DictWriter with that EncodedFile(StreamRecoder object)
  4. Writing string data into the csv file.

Below is the code:

import codecs, csv, tempfile, os, sys

s='abhishek'
print(s, type(s))
test_file = "/Users/chaturvedi/Documents/test.csv"
fd = open( test_file, "wb" )
encoded_file_pointer = codecs.EncodedFile(fd, 'utf-8', 'utf-16')
csv_writer = csv.DictWriter(encoded_file_pointer, ['head'], extrasaction='ignore', dialect='excel-tab')
csv_writer.writerow({'head':s})

Or, using a file descriptor:

import codecs, csv, tempfile, os, sys

s='abhishek'
print(s, type(s))
test_file = "/Users/chaturvedi/Documents/test.csv"
fd = os.open( test_file, os.O_RDWR|os.O_CREAT )
out_file_descriptor = os.fdopen(fd, "wb")
encoded_file_pointer = codecs.EncodedFile(out_file_descriptor, 'utf-8', 'utf-16')
csv_writer = csv.DictWriter(encoded_file_pointer, ['head'], extrasaction='ignore', dialect='excel-tab')
csv_writer.writerow({'head':s})

The above two snippets work fine when run in Python 2 and write data into the CSV file.
But when I run the same code in Python 3, it gives the error below:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/csv.py", line 162, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 836, in write
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/encodings/utf_8.py", line 24, in decode
    return codecs.utf_8_decode(input, errors, True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'str'

Observations:

  1. When we use EncodedFile, the flow goes through the decode method of the corresponding encoding module in the encodings package. Whatever data we provide (int, bytes, str, etc.) is converted to str just before reaching this decode method. That method in turn calls codecs.utf_8_decode, which expects bytes data.

     In Python 2:
     bytes → str
     str → unicode

     That is why it does not fail in Python 2: the decode method is already getting bytes data (str).

     But in Python 3:
     bytes → bytes
     str → str

     Hence it fails in Python 3.

  2. Even changing the file mode or the data type does not work in Python 3.

This piece of code has been running in production for a long time under Python 2 (this is a small simulated example of it), and we now want to make it compatible with both Python 2 and Python 3. That is why I am seeking a solution.

Why are you using os.open? Why are you creating an EncodedFile?

Karl has already said how it’s normally done.

If you’re still using Python 2(!), use io.open, which has the same kind of functionality as Python 3’s open. (In Python 3, io.open and open are the same thing.)
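
For the Python 3 side, the intent of the snippet in the question (a tab-delimited file stored as UTF-16) might be sketched as follows. The temporary directory and file name are assumptions made so the example cleans up after itself. This targets Python 3 only; Python 2's csv module works with byte strings, so a code base supporting both versions would probably need a version check around the file-opening step.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    test_file = os.path.join(tmp_dir, 'test.csv')  # hypothetical path for the demo

    # Text mode with encoding='utf-16' does the recoding that EncodedFile was used for.
    with open(test_file, 'w', encoding='utf-16', newline='') as f:
        csv_writer = csv.DictWriter(f, ['head'], extrasaction='ignore', dialect='excel-tab')
        csv_writer.writerow({'head': 'abhishek'})

    # Read the raw bytes back to confirm what ended up on disk.
    with open(test_file, 'rb') as f:
        raw_bytes = f.read()

print(raw_bytes.decode('utf-16'))
```

The file object handles the encoding, so the DictWriter only ever sees and produces text.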

In your own words, what do you think this part of the code means? Specifically, what do you think the b means, and what problem do you hope to solve by doing it this way?