Add `DataclassReader` and `DataclassWriter` to the `csv` module

Hi,

I am interested in expanding the `csv` module to include `DataclassReader` and `DataclassWriter` classes.

These classes would complement the existing `DictReader` and `DictWriter` classes by providing structured, type-safe reading and writing. (The expected schema of the input/output CSV would be defined as a dataclass.)
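To illustrate the idea, here is a minimal, runnable sketch built on top of the existing `DictReader` (a naive approximation only, not the proposed implementation or the dataclass_io code; the coercion strategy here is deliberately simplistic):

import csv
import dataclasses
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
    height: float


def iter_dataclass_rows(lines, cls):
    # Naive sketch: coerce each cell by calling the field's annotated type.
    # This handles simple types like int/float/str; a real implementation
    # would need to deal with Optional, bool, nested types, missing columns, etc.
    fields = dataclasses.fields(cls)
    for row in csv.DictReader(lines):
        yield cls(**{f.name: f.type(row[f.name]) for f in fields})


lines = ["name,age,height", "Alice,11,1.11", "Bob,22,2.22"]
for person in iter_dataclass_rows(lines, Person):
    print(person)  # Person(name='Alice', age=11, height=1.11), ...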

Some existing dataframe packages support type-safe schemas (e.g. polars), but require loading the entire CSV into memory. There are contexts (e.g. bioinformatics) where having a Reader class to stream the contents of a CSV is preferable to batch processing.

Is this a feature that would be generally useful and welcomed into the standard library? And if so, is it substantial enough to warrant the submission of a PEP, or could it be introduced via the issue tracker?

I’ve implemented a minimal proof-of-concept to demonstrate how such classes might be used in practice: GitHub - msto/dataclass_io: Read and write dataclasses.

Thank you!


How do you intend to address the discrepancy between type annotations and runtime types? I ask this specifically because you mention “type-safe” reading of data.

The following usage is valid at runtime (but not at type-checking time):

import dataclasses

@dataclasses.dataclass
class MyClass:
    attr1: int

MyClass(attr1="foo")  # accepted at runtime; flagged by a type checker

In the general case, checking a runtime value against a type annotation is difficult.
If type safety is important to you, this is likely to be a significant hurdle to clear.
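A quick demonstration of that gap (plain Python, no validation library):

import dataclasses


@dataclasses.dataclass
class MyClass:
    attr1: int


obj = MyClass(attr1="foo")
print(obj)                         # MyClass(attr1='foo') -- no error at runtime
print(isinstance(obj.attr1, int))  # False -- the mismatch only shows up if you
                                   # check explicitly, and isinstance only covers
                                   # the simple annotations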

Is this a feature that would be generally useful and welcomed into the standard library?

Most likely, the general reception you will receive is “sounds like a great idea for a PyPI package!”
And I think it does sound like a good package idea. :smile: Many users already benefit from being able to read CSV or other simply structured data into higher-level constructs like dataframes.

I would be slightly surprised, but not shocked, if none of the mainstream data-science packages like pandas implemented something like this. You should research the existing options and consider publishing your implementation. If it gains traction, then you will have a strong case that “X many users find this feature beneficial, and I believe it will have a broader positive impact in the stdlib”.[1]

And if so, is it substantial enough to warrant the submission of a PEP, or could it be introduced via the issue tracker?

I don’t believe this would need a PEP, but don’t try to push to add it without a core developer expressing interest in the idea. The stdlib is slow to add new features, and until a core dev shows support, it’s unlikely to be accepted.


  1. This is similar to the situation with itertools and more-itertools. Sometimes implementations from more-itertools, like batched, are promoted into the stdlib. ↩︎

cattrs and pydantic both support something close to this, though not specifically for CSV; it’s the same idea of serializing/parsing dataclass-like objects. I have implemented a similar library. There’s a lot of runtime introspection needed, and you’ll need to decide exactly which types you support. Only a small set of primitives like int/str/bool? What about list[int]? You can’t just use isinstance/type to identify that something is a list[int]. What about more complex types? Would you allow a dataclass with a member annotated as another dataclass? If you allow cases like that, you quickly get into how subclasses and polymorphism fit in. Even the simple case of int already runs into this: how should True be serialized, and should it depend on whether the annotation was int or bool?
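To make the int/bool point concrete (bool is a subclass of int, so the runtime value alone can’t tell you how to write it out):

print(isinstance(True, int))  # True
print(str(True))              # 'True' -- what a bool-annotated field might write
print(str(int(True)))         # '1'    -- what an int-annotated field might write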

Then there are fun things like forward references, `from __future__ import annotations`, type aliases, etc. to worry about. If you allow list[int] and other container types, do you want to allow recursive type aliases too?
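For a sense of the introspection involved, here is a rough sketch using the typing helpers (the Row class is just a made-up example):

from __future__ import annotations  # annotations are stored as strings

import dataclasses
import typing


@dataclasses.dataclass
class Row:
    name: str
    scores: list[int]


# get_type_hints resolves string/forward-reference annotations to real objects.
hints = typing.get_type_hints(Row)
print(hints)  # {'name': <class 'str'>, 'scores': list[int]}

# isinstance can't handle parameterized generics...
try:
    isinstance([1, 2], hints["scores"])
except TypeError as exc:
    print(exc)  # isinstance() argument 2 cannot be a parameterized generic

# ...so you have to take them apart yourself.
print(typing.get_origin(hints["scores"]))  # <class 'list'>
print(typing.get_args(hints["scores"]))    # (<class 'int'>,)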

All of these questions have multiple possible answers, and there’s enough of a rabbit hole here that, while I find this valuable, I don’t think it’s a good fit for the standard library. If you don’t want to reinvent this logic (it is fun to explore), I’d recommend building on top of a library like cattrs or pydantic and letting it take care of the various runtime type issues.


For what it’s worth, many such dataframe packages support some form of lazy / on-disk dataframes, which allow you to work with a dataframe without loading the entire thing into memory.

In polars, there’s polars.scan_csv, which opens a CSV file as a LazyFrame without reading it eagerly. Depending on how you process the dataframe, only a small part of it will be in memory at any given time.
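A rough sketch of what that looks like (the file name and column names here are placeholders):

import polars as pl

# Nothing is read yet; scan_csv just builds a lazy query plan.
lf = pl.scan_csv("people.csv")

# Work is deferred until .collect(); only the columns/rows the query
# actually needs are materialized.
adults = (
    lf.filter(pl.col("age") >= 18)
      .select("name", "age")
      .collect()
)
print(adults)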

Like Mehdi said, this is a complex problem (and probably well out of scope for the stdlib), but fortunately it’s well-trodden ground. There are a bunch of existing libraries that you can draw from. You may be particularly interested in pydantic, which lets you validate data against an existing stdlib dataclass via a TypeAdapter[1]:

import csv
from dataclasses import dataclass

import pydantic


@dataclass
class Person:
    name: str
    age: int
    height: float


# A TypeAdapter lets pydantic validate (and coerce) plain dicts into the dataclass.
adapter = pydantic.TypeAdapter(Person)

csv_file = """\
name,age,height
Alice,11,1.11
Bob,22,2.22
"""

reader = csv.DictReader(csv_file.splitlines())
for row in reader:
    # DictReader yields dicts of strings; pydantic coerces them to the
    # annotated types (or raises a ValidationError).
    row = adapter.validate_python(row)
    print(row)

output:

Person(name='Alice', age=11, height=1.11)
Person(name='Bob', age=22, height=2.22)

  1. though you’ll generally want to prefer using proper pydantic models or pydantic.dataclasses for better feature integration ↩︎
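For example, a pydantic dataclass validates (and coerces) at construction time; a minimal sketch:

import pydantic.dataclasses


@pydantic.dataclasses.dataclass
class Person:
    name: str
    age: int


print(Person(name="Alice", age="11"))  # Person(name='Alice', age=11)
# Person(name="Bob", age="not a number") would raise a ValidationError.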
