Abstracting data access

kiuhnm · July 3, 2024, 1:47am

I read something about “Data Oriented Programming” (DOP), which boils down to using JSON-like objects to represent data and keeping data and code separate.

I think this was popularized by Clojure. Indeed, it’s very lispy:

Usually, one wants syntactically rich code. The compiler will then convert it to an AST (Abstract Syntax Tree).
Lisp’s idea: Why don’t we program directly with the AST?
In OOP, we want to hide implementation details regarding both code and data. When needed, we can serialize the objects by converting their data into a canonical format.
Clojure’s idea: Why don’t we just use serialized data from the start?

I’m half-joking, of course.

Consider this:

{
    "books": [
        {
            "title": "...",
            "author": "...",
            "publisher": "..."
        },
        {
            "title": "...",
            "author": "...",
            "publisher": "..."
        },
        ...
    ]
}

The main advantage is that “plain data” is fully manipulable using standard functions.

The main disadvantage is the loss of encapsulation, validation, and static types.

What I know is that every time I used “plain data” in my programs, I ended up refactoring it by introducing (data)classes and my code improved. I’ve never felt the need to go back to using “plain data”. I like auto-completion and ahead-of-time type checking too much.
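To make the trade-off concrete, here is a small sketch of both styles. The sample values and the Book class are made up for illustration; only the three field names come from the JSON above.

```python
from dataclasses import asdict, dataclass

# Plain-data style: records are ordinary dicts (hypothetical sample values).
books = [
    {"title": "B", "author": "Ann", "publisher": "P1"},
    {"title": "A", "author": "Bob", "publisher": "P2"},
]

# Standard functions work directly on the data.
titles = sorted(book["title"] for book in books)

# A misspelled key is only noticed at run time, when the lookup happens.
try:
    books[0]["titel"]
    typo_caught_at_runtime = False
except KeyError:
    typo_caught_at_runtime = True


# Dataclass style: the shape of a record is declared once.
@dataclass
class Book:
    title: str
    author: str
    publisher: str


typed_books = [Book(**book) for book in books]

# A static type checker can flag `typed_books[0].titel` before the program runs,
# and asdict() converts back to plain data when that is needed.
round_trip = [asdict(book) for book in typed_books]
```

The conversion in both directions is cheap to write, which is part of why the dataclass version rarely feels like a loss.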

Even though I’ll probably never adopt DOP in my programs, this got me thinking about how we access and organize data.

Yesterday I had to sort some data by custom keys:

taus.sort(key=key_func)

The introduction of a key arg was a good idea, but I also wanted to check the keys beforehand, so I had to do something like this:

keys = [get_key(tau) for tau in taus]

# inspect and alter the keys (don't ask)
...

tau_to_key = dict(zip(taus, keys))
taus.sort(key=tau_to_key.__getitem__)

The introduction of tau_to_key and the use of __getitem__ is just noise. Wouldn’t the following be better?

taus.sort(keys=keys)

(Even better, especially if we had several args:

taus.sort(*, keys)

The "*, " would indicate that only keyword arguments follow, and each bare name x would stand for x=x, where x is any valid identifier.)

One simple solution is to implement our own sort with an arg keys: Callable[[T], K] | list[K] | dict[T, K], but the general problem remains.
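A minimal sketch of such a function, assuming the three formats named above. The name sort_with_keys is hypothetical, and a list of keys is assumed to be parallel to the items (same length, same order).

```python
from collections.abc import Callable, Mapping, Sequence
from typing import TypeVar, Union

T = TypeVar("T")
K = TypeVar("K")


def sort_with_keys(
    items: list[T],
    keys: Union[Callable[[T], K], Mapping[T, K], Sequence[K]],
) -> None:
    """Sort items in place; keys may be a function, a mapping, or a parallel sequence."""
    if callable(keys):
        # A key function: exactly what list.sort already supports.
        items.sort(key=keys)
    elif isinstance(keys, Mapping):
        # A mapping from item to key: look each item up (items must be hashable).
        items.sort(key=keys.__getitem__)
    else:
        # A parallel sequence: keys[i] belongs to items[i].
        if len(keys) != len(items):
            raise ValueError("keys and items must have the same length")
        # Sort the positions by key, then rebuild the list in that order.
        # This avoids hashing the items, so duplicates and unhashable items work.
        order = sorted(range(len(items)), key=keys.__getitem__)
        items[:] = [items[i] for i in order]


taus = ["c", "a", "b"]
sort_with_keys(taus, [3, 1, 2])  # parallel list of keys
```

Sorting positions instead of building a dict also sidesteps the requirement that the items be hashable and unique, which the tau_to_key workaround silently relies on.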

All we want is a key for each element. The exact way the pairing is expressed should be abstracted away:

the caller shouldn’t have to put the data in a specific format;
the writer of the function shouldn’t have to add explicit support for all the formats.

As far as I know, no one has ever tried to generalize the way data is accessed and manipulated. I feel like we’re still in the “asm era” regarding this aspect of programming.

Thoughts?

JamesParrott · July 3, 2024, 9:13am

Memoising key_func would allow doing away with tau_to_key.
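A minimal sketch of what that could look like, assuming a hypothetical key_func and hashable taus; functools.lru_cache is from the standard library.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def key_func(tau: int) -> int:
    # Hypothetical stand-in for an expensive key computation.
    return -tau


taus = [3, 1, 2]

# First pass: compute the keys so they can be inspected.
keys = [key_func(tau) for tau in taus]

# Second pass: sort() calls key_func again, but every call is served from the cache.
taus.sort(key=key_func)
```

This only helps when the keys are used exactly as computed, since the cache returns whatever key_func returned the first time.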

kiuhnm · July 3, 2024, 10:24pm

The keys are inspected and altered, which rules out memoization.

But that was just an example. Pretend we have a list of keys right from the start.

In this case, one possible solution would be to define a type Paired[T, U] that would convert between formats:

from collections.abc import Callable, Iterator
from typing import overload


# Interface only; the bodies are left out. The class syntax needs Python 3.12+.
class Paired[T, U]:
    # One constructor overload per accepted input format.
    @overload
    def __init__(self, ts_and_us: list[tuple[T, U]], /) -> None: ...
    @overload
    def __init__(self, ts: list[T], us: list[U], /) -> None: ...
    @overload
    def __init__(self, t_to_u: dict[T, U], /) -> None: ...
    @overload
    def __init__(self, u_of_t: Callable[[T], U], /) -> None: ...
    def __init__(self, *args) -> None:
        ...

    # list-of-pairs view
    def __iter__(self) -> Iterator[tuple[T, U]]:
        ...

    # dict view
    def __getitem__(self, t: T) -> U:
        ...

    # function view
    def __call__(self, t: T) -> U:
        ...

# stupid example, I know
xs = [1, 2, 3, 4, 5, 6]
keys = [5, 1, 8, 2, 4, 3]

xs.sort(key=Paired(xs, keys))

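Ideally, Paired would be of type list[tuple[T, U]] & dict[T, U] & Callable[[T], U], but this isn’t possible because the interfaces conflict: for example, a list’s __getitem__ takes an index while a dict’s takes a key, and iterating a list yields its items while iterating a dict yields its keys.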

Conversions in Paired would need to be lazy, i.e. done only when needed.
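Here is one way the stubs could be filled in, as a rough sketch rather than a finished design. It uses TypeVar and Generic so it also runs before Python 3.12, and the private attributes and error messages are my own choices. Only the dict lookup table is built lazily; the inputs are snapshotted up front.

```python
from collections.abc import Iterator, Mapping
from typing import Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Paired(Generic[T, U]):
    """Pairs each T with a U, given as pairs, parallel lists, a dict, or a function."""

    def __init__(self, *args) -> None:
        self._func = None   # set when backed by a function
        self._map = None    # dict view, built on first lookup
        self._pairs = None  # tuple-of-pairs view
        if len(args) == 2:
            ts, us = args
            if len(ts) != len(us):
                raise ValueError("both sequences must have the same length")
            # Snapshot now: the caller's lists may change (or be mid-sort) later.
            self._pairs = tuple(zip(ts, us))
        elif len(args) == 1 and callable(args[0]):
            self._func = args[0]
        elif len(args) == 1 and isinstance(args[0], Mapping):
            self._map = dict(args[0])
        elif len(args) == 1:
            self._pairs = tuple(args[0])
        else:
            raise TypeError("expected one or two positional arguments")

    def __iter__(self) -> Iterator[tuple[T, U]]:
        if self._pairs is not None:
            return iter(self._pairs)
        if self._map is not None:
            return iter(self._map.items())
        raise TypeError("a function-backed Paired cannot be enumerated")

    def __getitem__(self, t: T) -> U:
        if self._func is not None:
            return self._func(t)
        if self._map is None:
            # Lazy conversion: pay for the dict only if a lookup actually happens.
            self._map = dict(self._pairs)
        return self._map[t]

    # Calling and indexing mean the same thing here.
    __call__ = __getitem__


xs = [1, 2, 3, 4, 5, 6]
keys = [5, 1, 8, 2, 4, 3]
xs.sort(key=Paired(xs, keys))
```

The snapshot in the two-list case matters for this exact usage: in CPython a list appears empty while it is being sorted in place, so a version that read xs lazily inside the key call would not find any elements.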

If these types of “utilities” were standard, we’d use them just like zip, which is confusing to beginners, but extremely handy in the long run.

What I’m wondering is why we’re still doing lots of explicit data conversions and bookkeeping. It’s like this has never been perceived as a problem worth solving. I’m talking in general, not just about Python.

BTW, I know this is “Python Help”, but I couldn’t find a more general section.