Resources to validate CSV files using complex rules

I’ve re-written this to be clearer and avoid confusion.

I work with systems that process data files, mostly CSV/TSV or similar, multiple times a day. Each system needs the files, and the values they contain, formatted in very specific ways.

While the rules may seem arbitrary and inconsistent, they are oftentimes well documented. The goal I’m trying to achieve is to create a script, likely packaged as an exe, that can be pointed to a JSON schema file (or something similar) containing a set of rules, and to a CSV file or set of CSV files. The script will make determinations about the validity of each CSV file using the rules defined in the JSON file and give helpful feedback to the user if a file doesn’t pass the checks. Since the rules change so frequently, I’m trying to make them modular so they can be updated without having to modify any Python code.

These are the types of rules/checks/validations (off the top of my head) that I’d like to be able to define in the JSON schema file, using header values to reference columns if they are present:

  • Overall file/formatting rules (character encoding, delimiter, quote usage, etc.)
  • The presence of required field names in the column headers
  • Min/max value
  • Min/max length
  • Data types (text, date, integer, etc)
  • Data formats (date format is YYYYMMDD, number format is #.###, etc)
  • Confirm that value is in/not in a list of values
  • Match against more complex regex patterns
  • Confirm relationships between fields such as
    • Field 1 must be greater than Field 2
    • Field 4 must start with the first three characters of Field 3
    • Support for combining rules with IF/THEN/ELSE as well as AND/OR
  • Eventually, I’d like to also be able to treat multiple files as if they were related tables and ensure that, for example, the values in a column from one file have a corresponding key column in another file for a valid many-to-one relationship.
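To make this concrete, here is a rough sketch of the kind of rules file I have in mind. All of the field names and rule keywords here are invented for illustration; the real vocabulary is exactly what I’m trying to design:

```json
{
  "file": { "encoding": "utf-8", "delimiter": ",", "quotechar": "\"" },
  "required_columns": ["order_id", "order_date", "amount"],
  "columns": {
    "order_date": { "type": "date", "format": "%Y%m%d" },
    "amount": { "type": "number", "min": 0, "max": 100000 },
    "status": { "in": ["OPEN", "CLOSED", "PENDING"] }
  },
  "cross_field": [
    { "rule": "greater_than", "left": "ship_date", "right": "order_date" },
    { "if": { "field": "status", "equals": "CLOSED" },
      "then": { "field": "close_date", "not_empty": true } }
  ]
}
```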

To make this easier, I’m looking for any resources that might help me with this project. I don’t want to reinvent the wheel or waste time if there are already similar open source projects or existing modules I can use or incorporate in order to achieve my goal. I have used pandas on what is likely a fairly shallow level to manipulate dataframes for this purpose, and I can likely figure out ways to validate and give feedback to a user using what I know of that module. The biggest challenge for me is being able to read a rules file instead of hard-coding the rules into the script. If there are better terms I can use to describe what I’m looking for, that might also be helpful.

I’ve done some searching and I’ve found:

  • csv-schema/csv-schema - Seems capable of doing basics
  • kenfar/DataGristle - Part of this package includes a validator but I’m not quite understanding the language.
  • datahappy1/csv_file_validator - I have only ever considered and read about parsers, I’ve never tried making/using one
  • frictionlessdata/frictionless-py - There is a lot here and I’m not quite understanding the language but if I read more into it, I may find it helpful.

None of what I’ve found obviously supports the ability for the schema to define the comparison of one field to another. I’ll have to look into the bigger projects later to see if they do and it just isn’t obvious. If anyone is familiar with them and knows their capabilities, please let me know.
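To illustrate what I mean by schema-defined field comparisons, here is a rough sketch (the rule vocabulary and column names are invented) of how a couple of cross-field rules could be read from JSON and applied to a row:

```python
import json
import operator

# Hypothetical rule vocabulary: each rule names an operation and two columns.
RULES_JSON = """
[
  {"op": "gt", "left": "ship_date", "right": "order_date"},
  {"op": "startswith_first3", "left": "field4", "right": "field3"}
]
"""

OPS = {
    "gt": operator.gt,
    # "Field 4 must start with the first three characters of Field 3"
    "startswith_first3": lambda a, b: a.startswith(b[:3]),
}

def check_row(row, rules):
    """Return a list of human-readable failures for one CSV row (a dict)."""
    failures = []
    for rule in rules:
        left, right = row[rule["left"]], row[rule["right"]]
        if not OPS[rule["op"]](left, right):
            failures.append(f"{rule['left']} {rule['op']} {rule['right']} failed")
    return failures

rules = json.loads(RULES_JSON)
row = {"ship_date": "20240105", "order_date": "20240101",
       "field3": "ABC123", "field4": "ABX999"}
print(check_row(row, rules))  # -> ['field4 startswith_first3 field3 failed']
```

The point is that the rules live in data, and the Python side only has to know a fixed vocabulary of operations.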

This sounds like the type of logic that naturally lives in a dataclass’s __post_init__. Pydantic is very good for simple checks like min/max values. It does a lot more besides, but I’ve not used that yet, and I’ll be very surprised if it doesn’t support custom validators, and some equivalent to post init.

As @JamesParrott mentioned, the pydantic package might be suited for this type of application. I am by no means an authority with this package. However, coincidentally, I happened to watch the following YouTube video yesterday, as I was interested in how pydantic compares to the dataclasses module.

If you watch the video, please note that the @pydantic.validator('attribute_name_here') decorator has been deprecated, so the correct version to use is @pydantic.field_validator('attribute_name_here').

In the body of each validator method, you can customize the code to verify that the information is valid prior to writing it to the .csv file(s).

You might also look at pandas · PyPI.

Thank you for the recommendation.

Looking further into pydantic, it appears to be capable of the validation I’m looking for but it’s missing what I consider the biggest piece - importing the validators from an external file. Like most of the validation tools I’ve found, it must be defined within the code as part of the class definition. Models can be defined in a schema, but I don’t see any way to do this for the validators.

If I’m wrong, someone please let me know.

I mentioned pandas and that I have at least a shallow familiarity with it. If I try to make something from scratch, my familiarity with it might lead me to use it for the data storage and comparison side of things.

My struggle is creating a system to import complex validator definitions from a file. All of the projects I’ve found seem to have either simple validation that can be defined in a file, or complex validation that must be part of the class definition. I’m looking to have my cake and eat it too.
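The shape I keep picturing for the complex side is a small recursive interpreter, where AND/OR/IF nodes in the JSON dispatch to named Python functions. This is only a sketch, and the check names and JSON layout are invented:

```python
import json

# Hypothetical leaf checks, keyed by name so the JSON file can reference them.
CHECKS = {
    "not_empty": lambda row, f: bool(row[f].strip()),
    "equals":    lambda row, f, value: row[f] == value,
}

def evaluate(node, row):
    """Recursively evaluate an AND/OR/IF rule tree against one row (a dict)."""
    if "and" in node:
        return all(evaluate(child, row) for child in node["and"])
    if "or" in node:
        return any(evaluate(child, row) for child in node["or"])
    if "if" in node:
        branch = "then" if evaluate(node["if"], row) else "else"
        return evaluate(node[branch], row) if branch in node else True
    # Leaf node, e.g. {"check": "equals", "field": "status", "value": "CLOSED"}
    args = {k: v for k, v in node.items() if k not in ("check", "field")}
    return CHECKS[node["check"]](row, node["field"], **args)

rule = json.loads("""
{"if": {"check": "equals", "field": "status", "value": "CLOSED"},
 "then": {"check": "not_empty", "field": "close_date"}}
""")
print(evaluate(rule, {"status": "CLOSED", "close_date": ""}))  # False: closed but no date
print(evaluate(rule, {"status": "OPEN", "close_date": ""}))    # True: rule doesn't apply
```

The simple checks live in the data file; only the fixed vocabulary of leaf checks is Python.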

At this point I’ve half given up on finding something mostly complete that will meet my needs.

I performed a quick Google search. The following was returned:

Yes, Pydantic fully supports importing and reusing validators from external files or modules. This approach is encouraged for code organization, modularity, and reusability.

You can achieve this in several ways, primarily through standard Python imports:

  1. Using @field_validator or @model_validator

You can define validator functions in a separate module and import them into your Pydantic model file.

a. validators.py (external file):

def validate_name_not_empty(v: str) -> str:
    if not v.strip():
        # Raise ValueError, not ValidationError: pydantic catches ValueError
        # from validators and wraps it in a ValidationError for you.
        raise ValueError('Name cannot be empty')
    return v

def validate_age_range(v: int) -> int:
    if not (18 <= v <= 99):
        raise ValueError('Age must be between 18 and 99')
    return v

b. models.py (your Pydantic model file):

from pydantic import BaseModel, field_validator
from validators import validate_name_not_empty, validate_age_range

class User(BaseModel):
    name: str
    age: int

    @field_validator('name')
    def check_name(cls, v: str) -> str:
        return validate_name_not_empty(v)

    @field_validator('age')
    def check_age(cls, v: int) -> int:
        return validate_age_range(v)
  2. Using Annotated with Custom Types

For an even cleaner and more reusable approach (especially in Pydantic v2+), you can create custom, reusable types using typing.Annotated in your external file and then import those types directly.

a. custom_types.py (external file)

from typing import Annotated
from pydantic import Field

# Reusable custom type for a non-empty string
NonEmptyString = Annotated[str, Field(min_length=1)]

# Reusable custom type for an age between 18 and 99
AgeRange = Annotated[int, Field(ge=18, le=99)]

b. models.py (your Pydantic model file):

from pydantic import BaseModel
from custom_types import NonEmptyString, AgeRange

class User(BaseModel):
    name: NonEmptyString
    age: AgeRange

Both methods leverage standard Python module importing mechanisms, making it simple to organize your validation logic across multiple files and maintain a clean, readable codebase. The Pydantic documentation provides more details on Annotated Validators and Custom Types.

It appears that pydantic may be worth looking into.

When you say “import from a file” do you mean they need to be imported from some kind of static data file (like a JSON file), or just a separate file from the class def? Are you saying that you want the rules to be specified declaratively somehow, or are you okay with the rules themselves having to be implemented as code (in Python, presumably, but maybe some other language)?

Most of the solutions others have given you are ways to validate things, but I think the devil would be in the details if you’re trying to specify complex rules in a non-Turing-complete data format. In particular, items like your “Field 4 must start with the first three characters of Field 3” and “IF/THEN/ELSE” are detailed enough that a schema-specification language complex enough to encode them would likely be more unwieldy than Python while also being less powerful.

If you are willing to use Python code, then you can just write your validations as scripts that accept the CSV as input and use pandas (and/or other tools) to process it.
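For example, a minimal version of such a script, using only the standard-library csv module for illustration (pandas would work the same way on a DataFrame; the check definitions here are made up), might look like:

```python
import csv
import io

def validate_csv(fileobj, required_columns, checks):
    """Yield (line_number, message) for every failed check.

    checks maps a column name to a (predicate, description) pair.
    """
    reader = csv.DictReader(fileobj)
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        yield 1, f"missing required columns: {missing}"
        return
    # Data rows start at line 2, after the header line.
    for lineno, row in enumerate(reader, start=2):
        for col, (predicate, description) in checks.items():
            if not predicate(row[col]):
                yield lineno, f"{col}: {description} (got {row[col]!r})"

# Example: amount must parse as a non-negative number.
checks = {
    "amount": (lambda v: v.replace(".", "", 1).lstrip("-").isdigit() and float(v) >= 0,
               "must be a non-negative number"),
}
data = io.StringIO("order_id,amount\nA1,10.5\nA2,-3\n")
for lineno, msg in validate_csv(data, ["order_id", "amount"], checks):
    print(lineno, msg)
```

Because the checks are plain Python callables, there is no limit on how complex a rule can be, at the cost of the rules living in code rather than in a data file.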

I work with TSV files every week and sometimes CSV files. I find TSV files easier to work with. I often get an Excel file from a customer but even that Excel file can have embedded tab characters in the actual cell which throws things off. These embedded tabs or CRLF need to be removed before I export the file to TSV. The same goes for CSV files.

I don’t write rules in a JSON file because that adds another layer of complexity. If you want to reuse these rules in many programs, perhaps you can put the rules as Python code in a custom library file. Just pass the whole file path to the function; inside the function, read the TSV file and check each row against each rule.

I have a custom library file I use in all my programs because there are things I do that Python does not have built-in.