I’ve re-written this to be clearer and avoid confusion.
I work with systems that process data files, mostly CSV/TSV or something similar, multiple times a day. Each system needs the files and values they contain formatted in very specific ways.
While the rules may seem arbitrary and inconsistent, they are oftentimes well documented. The goal I’m trying to achieve is to create a script, likely packaged as an exe, that can be pointed to a JSON schema file or something similar that contains a set of rules and to a CSV or set of CSV files. The script will make determinations about the validity of a CSV file using the rules defined in the JSON file and give helpful feedback to the user if CSV file(s) doesn’t pass the check. Since the rules change so frequently, I’m trying to make it so that the rules are modular and can be updated without having to modify any python code.
These are the types of rules/checks/validations (off the top of my head) that I’d like to be able to define in the JSON schema file, using header values to reference columns if they are present:
- Overall file/formatting rules (characer encoding, delimiter, quote usage, etc)
- The presence of required field names in the column headers
- Min/max value
- Min/max length
- Data types (text, date, integer, etc)
- Data formats (date format is YYYYMMDD, number format is #.###, etc)
- Confirm that value is in/not in a list of values
- Match against more complex regex patterns
- Confirm relationships between fields such as
- Field 1 must be greater than Field 2
- Field 4 must start with the first three characters of Filed 3
- Support for combining rules with IF/THEN/ELSE as well as AND/OR
- Eventually, I’d like to also be able to treat multiple files as if they were related tables and ensure that, for example, the values in a column from one file have a corresponding key column in another file for a valid many-to-one relationship.
To make this easier, I’m looking for any resources that might help me with this project. I don’t want to reinvent the wheel or waste time if there are already similar open source projects or existing modules I can use or incorporate in order to achieve my goal. I have used pandas on what is likely a fairly shallow level to manipulate dataframes for this purpose and I can likely figure out ways to validate and give feedback to a user using what I know of that module. The biggest challenge for me is being able to read a rules file instead of hard-coding the rules into the script. If there are better terms that I can use to describe what I’m look for that might also be helpful.
I’ve done some searching and I’ve found:
- csv-schema/csv-schema - Seems capable of doing basics
- kenfar/DataGristle - Part of this package includes a validator but I’m not quite understanding the language.
- datahappy1/csv_file_validator - I have only ever considered and read about parsers, I’ve never tried making/using one
- frictionlessdata/frictionless-py - There is a lot here and I’m not quite understanding the language but if I read more into it, I may find it helpful.
None of what I’ve found obviously support the ability for the schema to define the comparison of one field to another. I’ll have to look into the bigger projects later to see if they do and it just isn’t obvious. If anyone is familiar with them and know their capabilities, please let me know.