Data modelling in Python - what's the best way?

p1x44r · July 30, 2022, 2:14am

Hello,

I am relatively new to Python (3 months working with it) and coming from TS programming I am used to strong models for everything.

With Python I have tried JSON schemas and pydantic to accomplish this, with mixed results.

I have to add that I work within the AWS SAM realm and sometimes wonder if it’s even a good idea to implement strong models for things like configuration, events and responses.

Does anyone have any opinion on how to approach this? Models or not and if yes, how far to go with it.

Thank you.

steven.daprano · July 30, 2022, 3:15am

So many TLAs… cries

What is TS programming? What do you mean by “strong models”? Remember that this is a Python forum, and not everyone here will have the same background you have.

Can you give a concrete example of exactly what you are trying to do, and how your question relates to it?

It is not clear what your question is precisely, or how it relates to Python. It is only because I recognise the third-party library pydantic that I can hazard a wild guess that maybe you are thinking of including runtime type checking in your code.

Personally I don’t see the point of pydantic. Python already does runtime type-checking of code, it just occurs at the point that values are used, not where they are defined.
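
To illustrate the point about checks happening where values are used, here is a minimal sketch. The function name `total_length` is made up for this example; nothing here comes from pydantic or any other library.

```python
def total_length(items):
    # No type declarations: any argument is accepted when the function is called.
    return sum(len(item) for item in items)


# Works for any iterable of objects that support len().
print(total_length(["ab", "cde"]))

# The bad value (42) is only detected when len() is actually applied to it.
try:
    total_length(["ab", 42])
except TypeError as exc:
    print(f"caught at point of use: {exc}")
```

The trade-off is that the error surfaces inside the function body rather than at the boundary where the bad data entered.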

For anyone wondering, “AWS SAM” is “Amazon Web Services Serverless Application Model”, an over-complicated (in my opinion) way of running Python “in the cloud” so that instead of writing a Python script and then just running it, you get to write the same script plus hundreds of lines of config to deploy and run it on AWS SAM.

But the benefit is that you are now dependent on a horrible abusive company for your critical business infrastructure, so there is that.

p1x44r · July 30, 2022, 3:24am

You’re right, I should’ve been clearer. Was nervous making first post here.

TS is TypeScript. Strong models means you validate your structs very strictly. Down to field/value level.

I am wondering if data models/validation are used in Python and if those are worth the overhead of creating them and checking against them which of course affects performance.

Simple scenario/question for you. Say you have dozens of independent functions in your project. Do you ensure that all of them use identical response structures? say a list of dicts with mandatory fields in those dicts. If responses aren’t compliant you fail the execution.
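
A rough sketch of what such a check could look like using only the standard library. The field names in `REQUIRED_FIELDS` and the function `validate_response` are hypothetical, chosen only to show the idea of failing on a non-compliant response.

```python
# Hypothetical mandatory fields and their expected types.
REQUIRED_FIELDS = {"id": int, "name": str}


def validate_response(response):
    """Raise ValueError unless response is a list of dicts with the required fields."""
    if not isinstance(response, list):
        raise ValueError(f"expected a list, got {type(response).__name__}")
    for index, record in enumerate(response):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not a dict")
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                raise ValueError(f"record {index} is missing field {field!r}")
            if not isinstance(record[field], expected_type):
                raise ValueError(
                    f"record {index}: field {field!r} must be {expected_type.__name__}"
                )
    return response


# A compliant response passes through unchanged.
validate_response([{"id": 1, "name": "first"}])

# A non-compliant one fails the execution.
try:
    validate_response([{"id": 1}])
except ValueError as exc:
    print(exc)
```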

steven.daprano · July 30, 2022, 8:06am

If the functions are independent, they won’t necessarily return the same type. E.g. len() and chr() are independent.

If your collection of functions is constrained to return the same type, then they are no longer independent.

It sounds like you are considering having your functions follow a protocol. Whether that is a good idea depends on the details. What do the functions do? What does the return result represent, what do you do with it, what constraints are you working under?

Should you use runtime checks to enforce the protocol? The question is too general to give a good answer. Here are some scenarios:

You are passing the results of your functions to a third-party library which will silently misbehave if given invalid data (say, missing fields) – in that case, your functions should perform a runtime check on their output.
You are passing the results of your functions to a library that does its own input validation, or otherwise can handle missing fields (say, it raises an exception, or can use a meaningful default). In this case, it seems like a lot of duplicated effort to validate the output if the library is doing the same validation.
You are designing your own protocol. In that case, Postel’s Law should apply: your protocol should be resilient when given data with missing fields, provided the receiver can sensibly ignore them or infer some useful value. Your functions may choose to leave unused fields missing rather than fill them in with some sentinel value.
But note that resilience to bad data can include raising an exception. Type-checking and data validation can occur at the point of use. Advantages: if you never use a value, who cares if it is wrong? Many errors are automatically resolved by the interpreter, at no cost to you. E.g. None.method() will fail safe with an exception, it won’t cause a segfault.
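
The last two points can be sketched as follows. This is only an illustration; the function `describe_user` and the field names are invented for the example.

```python
def describe_user(record):
    # Required field: a missing "name" raises KeyError here, at the point of use.
    name = record["name"]
    # Optional field: fall back to a meaningful default instead of rejecting the record.
    email = record.get("email", "unknown")
    return f"{name} <{email}>"


print(describe_user({"name": "Ada"}))

# Missing required data fails safe with an exception, with no explicit check written.
try:
    describe_user({})
except KeyError as exc:
    print(f"missing field: {exc}")

# The same holds for calling a method on None: an exception, not a crash.
try:
    None.upper()
except AttributeError as exc:
    print(f"interpreter check: {exc}")
```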

Without knowing more about your project, it is impossible to say whether runtime checks – or rather, additional runtime checks beyond what the interpreter already does – are of any value to you.

p1x44r · July 30, 2022, 11:38pm

Thanks for your replies both.

My situation is such - this is a serverless AWS application and the functions I’m referring to are Lambdas. They do not share anything in common outside of some libraries such as logging.

All Lambdas have a decorator that implements a bunch of stuff enforced across all of them: standard logging, plus validation of event and response models (you will get a 500 if you try to return a non-compliant result set).
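
For readers unfamiliar with the pattern, a decorator of this kind might look roughly like the sketch below. It is not the actual code from this project: the name `enforce_response_model` and the assumed response shape (a dict with `statusCode` and `body`, as used by API Gateway proxy integrations) are assumptions made for illustration.

```python
import functools
import json
import logging

logger = logging.getLogger(__name__)


def enforce_response_model(handler):
    """Hypothetical decorator: log each invocation and validate the handler's response."""

    @functools.wraps(handler)
    def wrapper(event, context):
        logger.info("invoking %s", handler.__name__)
        try:
            response = handler(event, context)
            # Assumed response model: a dict with "statusCode" and "body" keys.
            if (
                not isinstance(response, dict)
                or "statusCode" not in response
                or "body" not in response
            ):
                raise ValueError("non-compliant response")
            return response
        except Exception:
            logger.exception("handler %s failed", handler.__name__)
            return {"statusCode": 500, "body": json.dumps({"error": "internal error"})}

    return wrapper


@enforce_response_model
def good_handler(event, context):
    return {"statusCode": 200, "body": json.dumps({"ok": True})}


@enforce_response_model
def bad_handler(event, context):
    return ["not", "a", "valid", "response"]


print(good_handler({}, None)["statusCode"])
print(bad_handler({}, None)["statusCode"])
```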

There’s more: the decorator also loads configuration from SSM, Secrets Manager and AWS AppConfig.
All that config is JSON (or YAML) and all of it is validated upon retrieval. Sometimes via pydantic, sometimes via jmespath.

There’s no changing this now. I am not sure how I feel about it, but with this being my first Python AWS project I think I went too far with all the validation and restrictions. I got very excited playing with the AWS Lambda Powertools package and lost my mind a little bit.

The question is what would you recommend to try and unify all these models. As I mentioned, right now the application uses both pydantic’s BaseModel and straight-up JSON schemas.

I know either is a performance penalty obviously, but the application is running very well, so that’s not a concern at the moment.

Would you rather maintain a huge set of JSON schema strings or a comparable amount of pydantic model classes? Is there a preferred way of dealing with this?
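
To make the comparison concrete, here is roughly what the two options look like for the same small shape. The `Item` model and its fields are invented for this example, and the two definitions are only approximately equivalent (the pydantic version also accepts `None` for `note`).

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

# Option 1: a JSON Schema kept as plain data, to be checked by a JSON Schema validator.
ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "note": {"type": "string"},
    },
    "required": ["id", "name"],
}


# Option 2: roughly the same shape as a pydantic model class.
class Item(BaseModel):
    id: int
    name: str
    note: Optional[str] = None


item = Item(id=1, name="widget")
print(item.name)

# A missing mandatory field is rejected when the model is constructed.
try:
    Item(name="widget")
except ValidationError as exc:
    print(exc)
```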

That was all I wanted to clarify last night, but wasn’t having a good day.

Thanks in advance.