Resources to validate CSV files using complex rules

Patman · December 3, 2025, 3:46am

I’ve re-written this to be clearer and avoid confusion.

I work with systems that process data files, mostly CSV/TSV or something similar, multiple times a day. Each system needs the files and values they contain formatted in very specific ways.

While the rules may seem arbitrary and inconsistent, they are oftentimes well documented. The goal I’m trying to achieve is to create a script, likely packaged as an exe, that can be pointed to a JSON schema file or something similar that contains a set of rules and to a CSV or set of CSV files. The script will make determinations about the validity of a CSV file using the rules defined in the JSON file and give helpful feedback to the user if CSV file(s) doesn’t pass the check. Since the rules change so frequently, I’m trying to make it so that the rules are modular and can be updated without having to modify any python code.

These are the types of rules/checks/validations (off the top of my head) that I’d like to be able to define in the JSON schema file, using header values to reference columns if they are present:

Overall file/formatting rules (characer encoding, delimiter, quote usage, etc)
The presence of required field names in the column headers
Min/max value
Min/max length
Data types (text, date, integer, etc)
Data formats (date format is YYYYMMDD, number format is #.###, etc)
Confirm that value is in/not in a list of values
Match against more complex regex patterns
Confirm relationships between fields such as
- Field 1 must be greater than Field 2
- Field 4 must start with the first three characters of Filed 3
- Support for combining rules with IF/THEN/ELSE as well as AND/OR
Eventually, I’d like to also be able to treat multiple files as if they were related tables and ensure that, for example, the values in a column from one file have a corresponding key column in another file for a valid many-to-one relationship.

To make this easier, I’m looking for any resources that might help me with this project. I don’t want to reinvent the wheel or waste time if there are already similar open source projects or existing modules I can use or incorporate in order to achieve my goal. I have used pandas on what is likely a fairly shallow level to manipulate dataframes for this purpose and I can likely figure out ways to validate and give feedback to a user using what I know of that module. The biggest challenge for me is being able to read a rules file instead of hard-coding the rules into the script. If there are better terms that I can use to describe what I’m look for that might also be helpful.

I’ve done some searching and I’ve found:

csv-schema/csv-schema - Seems capable of doing basics
kenfar/DataGristle - Part of this package includes a validator but I’m not quite understanding the language.
datahappy1/csv_file_validator - I have only ever considered and read about parsers, I’ve never tried making/using one
frictionlessdata/frictionless-py - There is a lot here and I’m not quite understanding the language but if I read more into it, I may find it helpful.

None of what I’ve found obviously support the ability for the schema to define the comparison of one field to another. I’ll have to look into the bigger projects later to see if they do and it just isn’t obvious. If anyone is familiar with them and know their capabilities, please let me know.

JamesParrott · December 3, 2025, 1:01pm

This sounds like the type of logic that naturally lives in a dataclass’s __post_init__. Pydantic is very good for simple checks like min/max values. It does a lot more besides, but I’ve not used that yet, and I’ll be very surprised if it doesn’t support custom validators, and some equivalent to post init.

onePythonUser · December 4, 2025, 9:36pm

As @JamesParrott mentioned, the pydantic library package might be suited for this type of application. I am by no means an authority with this package. However, coincidently, I was watching the following youtube video yesterday as I was interested as it compared to the dataclass library package.

If you watch the video, please note that the @pydantic.validator('attribute_name_here') decorator has been deprecated therefore the correct version to use would be @pydantic.field_validator('attribute_name_here').

In the body of the assigned attribute methods, you may customize the code to check and verify if the information is valid prior to writing to the .csv file(s).

tjreedy · December 5, 2025, 10:11pm

You might also look at pandas · PyPI.

Patman · December 9, 2025, 8:33pm

Thank you for the recommendation.

Looking further into pydantic, it appears to be capable of the validation I’m looking for but it’s missing what I consider the biggest piece - importing the validators from an external file. Like most of the validation tools I’ve found, it must be defined within the code as part of the class definition. Models can be defined in a schema, but I don’t see any way to do this for the validators.

If I’m wrong, someone please let me know.

Patman · December 9, 2025, 8:38pm

I mentioned pandas and that I have at least a shallow familiarity with it. If I try to make something from scratch, my familiarity with it might lead me to use it for the data storage and comparison side of things.

My struggle is creating a system to import complex validator definitions from a file. All of the projects I’ve found seem to either have simple validation that can be defined in a file, or complex validation that must be part of the class definition. I’m looking to have my cake and eat it to.

At this point I’ve half given up on finding something mostly complete that will meet my needs.

onePythonUser · December 9, 2025, 9:46pm

I peformed a quick google search. The following was returned:

Yes, Pydantic fully supports importing and reusing validators from external files or modules. This approach is encouraged for code organization, modularity, and reusability.

You can achieve this in several ways, primarily through standard Python imports:

Using @field_validator or @model_validator

You can define validator functions in a separate module and import them into your Pydantic model file.

a. validators.py (external file):

from pydantic import ValidationError

def validate_name_not_empty(v: str) -> str:
    if not v.strip():
        raise ValidationError('Name cannot be empty', model=None) # model argument just for example
    return v

def validate_age_range(v: int) -> int:
    if not (18 <= v <= 99):
        raise ValidationError('Age must be between 18 and 99', model=None)
    return v

b. models.py (your Pydantic model file):

from pydantic import BaseModel, field_validator
from validators import validate_name_not_empty, validate_age_range

class User(BaseModel):
    name: str
    age: int

    @field_validator('name')
    def check_name(cls, v: str) -> str:
        return validate_name_not_empty(v)

    @field_validator('age')
    def check_age(cls, v: int) -> int:
        return validate_age_range(v)

Using Annotated with Custom Types

For an even cleaner and more reusable approach (especially in Pydantic v2+), you can create custom, reusable types using typing.Annotated in your external file and then import those types directly.

a. custom_types.py (external file)

from typing import Annotated
from pydantic import Field

# Reusable custom type for a non-empty string
NonEmptyString = Annotated[str, Field(min_length=1)]

# Reusable custom type for an age between 18 and 99
AgeRange = Annotated[int, Field(ge=18, le=99)]

b. models.py (your Pydantic model file):

from pydantic import BaseModel
from custom_types import NonEmptyString, AgeRange

class User(BaseModel):
    name: NonEmptyString
    age: AgeRange

Both methods leverage standard Python module importing mechanisms, making it simple to organize your validation logic across multiple files and maintain a clean, readable codebase. The Pydantic documentation provides more details on Annotated Validators and Custom Types.

It appears that pydantic may be worth looking into.

BrenBarn · December 11, 2025, 4:08am

When you say “import from a file” do you mean they need to be imported from some kind of static data file (like a JSON file), or just a separate file from the class def? Are you saying that you want the rules to be specified declaratively somehow, or are you okay with the rules themselves having to be implemented as code (in Python, presumably, but maybe some other language)?

Most of the solutions others have given you are ways to validate things, but I think the devil would be in the details if you’re trying to specify complex rules in a non-Turing-complete data format. In particular items like your “Field 4 must start with the first three characters of Field 3” and “IF/THEN/ELSE” are detailed enough that a schema-specification language complex enough to encode them would likely be more unwieldy than Python while also being less powerful.

If you are willing to use Python code, then you can just write your validations as scripts that accept the CSV as input and use pandas (and/or other tools) to process it.

c-rob · December 12, 2025, 10:31am

I work with TSV files every week and sometimes CSV files. I find TSV files easier to work with. I often get an Excel file from a customer but even that Excel file can have embedded tab characters in the actual cell which throws things off. These embedded tabs or CRLF need to be removed before I export the file to TSV. The same goes for CSV files.

I don’t write rules in a JSON file because that adds another layer of complexity. If you want to reuse these rules in many programs perhaps you can put the rule as Python code in a custom library file. Just pass in the whole file path to the function, inside the function read the TSV file and check each row for each rule.

I have a custom library file I use in all my programs because there are things I do that Python does not have built-in.

Patman · December 20, 2025, 4:46pm

Sorry, life happened and this was put on the back burner so I haven’t thought to check back.

Thank you for the detailed reply! I never considered using an actual python file as the “file” source. I’ve created modules but always considered them to be part of the whole, not a separate file.

As I mentioned in my original attempt to ask for help but, naturally, may have left out when I completely rewrote it, part of the end result would be an executable (via pyinstaller or something similar). Looking into it, I think it is still possible to import an external python file, as you recommend, even if the application has been packaged into an exe format. Though I know that python is not pre-compiled, I hadn’t considered that a possibility and assumed it would have to be 100% defined internally within the exe. I’ll look deeper into it again. Thanks again!

trlandet · December 20, 2025, 7:43pm

It may be that you can get one of the “Great Expectations”, “Deequ”, or “Soda” libraries to do what you want. I have used Great Expectations some years ago, but it may be that Soda is closer to what you want. I think they define the data rules in YAML files.

When I looked at these now after not using any of them for some years I see that all of them are pushing services and cloud offerings. I believe it is still possible to use them as normal Python libraries for validating data tables, but you may have to hunt for the docs on their web pages…

Also, if your data is made up of independent rows you can turn them into dictionaries with the column name as key and use JSON schema to validate each row. It may be slow for large datasets and inter-row relationships cannot be checked that way

petersuter · December 21, 2025, 12:51am

There are at least two different CSV schema specs.

csv-schema/csv-schema (JSON)
digital-preservation/csv-schema (DSL by nationalarchives.gov.uk)
kenfar/DataGristle (JSON Schema in YAML)
datahappy1/csv_file_validator (JSON)
frictionlessdata/frictionless-py (JSON)

petersuter · December 21, 2025, 12:55am

Have you considered importing the CSV files in e.g. SQLite virtual tables or DuckDB?

gwerbin · December 21, 2025, 1:39am

I don’t know that “complex” is ever going to be achievable by reading off of some kind of static specification.

However you might be interested in the Pandera library, which strives to be a highly declarative framework for validating tabular data. It originally worked with Pandas but it also now supports other “data frame” types. It has some annoyances, but it produces nice error messages and provides a lot more structure than manually checking everything. It also has an optional class-based syntax that is similar to frameworks like Attrs, dataclasses, Marshmallow, and Pydantic. It even has some basic Mypy integration.

petersuter · December 21, 2025, 2:41am

This one https://csvw.org/ (recommended by the UK Government Digital Service) has support for relations of multiple linked tables.

petersuter · December 21, 2025, 2:42am

Under “Tools” it has at least one Python library to validate it: GitHub - cldf/csvw: CSV on the web