Additional features for the struct module

The struct module covers most basic needs for parsing binary data, but sadly it lacks support for some common features. When the binary data one wants to parse uses one of these features, one is often forced to either use slow Python workarounds or write a C extension.

Therefore I would like to propose adding some of these features to the struct module.

Character Groups/Sub Structures
While it’s currently possible to use a repeat count to parse a specific number of values of a single type, it’s not possible to do this with a group of types or, in a sense, a sub-structure.
E.g., instead of being able to use 4(ch) to parse 4 tuples of (char, short), one is forced to use chchchch in combination with a more complex grouping of the return values.
I think the main difficulty of this feature would be how CPython stores a group internally.
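
As a rough illustration of the proposed syntax, the sketch below expands non-nested groups such as 4(ch) into the flat format string that struct accepts today. The helper name expand_groups and the N(...) notation are assumptions taken from the example above, not an existing struct feature, and the unpacked values still come back as one flat tuple.

```python
import re
import struct

# Matches a repeat count followed by a parenthesised group without nesting.
_GROUP = re.compile(r"(\d+)\(([^()]*)\)")


def expand_groups(fmt):
    """Expand every non-nested 'N(...)' group into N copies of its body.

    Hypothetical helper: struct itself has no group syntax.
    """
    return _GROUP.sub(lambda m: m.group(2) * int(m.group(1)), fmt)


fmt = expand_groups("<4(ch)")  # becomes "<chchchch"
data = struct.pack(fmt, b"a", 1, b"b", 2, b"c", 3, b"d", 4)
flat = struct.unpack(fmt, data)  # flat tuple, grouping is left to the caller
```

The "<" prefix selects little-endian layout without padding, so each (char, short) pair occupies exactly three bytes.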

Additional types

Namely:

  • varint
  • null-terminated strings

These two types have in common that they don’t have a predictable size and would therefore require a different way to ensure that the decoder doesn’t segfault. They might also require adding an additional field to the Struct class that indicates whether Struct.size is always correct or just the minimum required size.
I guess that this might have been the prime reason why these two types haven’t been added yet.
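
To make the size problem concrete, here is a minimal sketch of the kind of pure-Python workaround currently needed for a null-terminated string. The function name read_cstring is hypothetical; the point is that the length is only known after scanning the data, so no fixed size can be computed from the format alone.

```python
def read_cstring(buf, offset=0):
    """Return (string_bytes, next_offset) for a NUL-terminated string.

    Hypothetical helper; raises ValueError if no terminator is found.
    """
    end = buf.index(b"\x00", offset)  # position of the terminating NUL
    return buf[offset:end], end + 1   # skip the terminator itself


buf = b"abc\x00defgh\x00"
first, pos = read_cstring(buf)
second, pos = read_cstring(buf, pos)
```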

Variable Instance Count
Often binary-encoded data contains a variable number of instances of a type, e.g., a string described by an integer x followed by x chars. This could be described via i*s. This feature runs into the issues of both previous feature proposals.
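
For comparison, a length-prefixed string currently takes two unpacking steps, because the second format string depends on the value read by the first. The sketch below assumes a little-endian 32-bit length prefix; the function name read_prefixed_bytes is hypothetical.

```python
import struct


def read_prefixed_bytes(buf, offset=0):
    """Read an int32 length followed by that many bytes.

    Hypothetical helper; returns (payload, next_offset).
    """
    (length,) = struct.unpack_from("<i", buf, offset)
    if length < 0:
        raise ValueError("negative length prefix")
    offset += 4
    # The count in the second format is only known at run time.
    (payload,) = struct.unpack_from(f"<{length}s", buf, offset)
    return payload, offset + length


buf = struct.pack("<i5s", 5, b"hello") + b"rest"
payload, pos = read_prefixed_bytes(buf)
```

struct.unpack_from raises struct.error when the buffer is shorter than the format requires, which is what keeps a truncated input from being read past its end.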


Finally, I would like to note that I’m already working on a Python package that handles basically all of the named features. I mainly want to discuss this topic here because I would love to add these features to CPython itself.

It sounds to me like a “struct-plus” module on PyPI would be a good way to develop and refine those extra features, and then once they are stable and demonstrated to be useful, they can be added to the struct module itself. If you’re saying that’s what you have right now, then post a link to it, so people can take a look! :slightly_smiling_face:

Personally, I’ve very occasionally had a need for something like this. Not often, but enough that I’m broadly supportive of the idea.

Struct format strings are just strings, so the normal string operations apply. Use 'ch'*4.

In most modern Python interpreters, the peephole optimizer will build that at compile time so it is just as efficient as 'chchchch'.
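
A small sketch of that workaround, together with two ways of regrouping the flat result into pairs. The names FLAT and PAIR are just illustrative, and the "<" prefix is an assumption added to avoid native alignment padding between the char and the short.

```python
import struct

FLAT = struct.Struct("<" + "ch" * 4)  # same format as "<chchchch"
PAIR = struct.Struct("<ch")           # a single (char, short) pair

data = FLAT.pack(b"a", 1, b"b", 2, b"c", 3, b"d", 4)

# Option 1: unpack everything, then pair up even and odd positions.
flat = FLAT.unpack(data)
pairs_from_flat = list(zip(flat[0::2], flat[1::2]))

# Option 2: let iter_unpack walk the buffer one pair at a time.
pairs_from_iter = list(PAIR.iter_unpack(data))
```

iter_unpack only works when the whole buffer is a repetition of the same group, so it does not help when the group is embedded in a larger format.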

I don’t know what a varint is.

See discussion in this thread and this feature request.

Variable-length integer. Unfortunately there are several meanings for this. The MIDI specification has one, RFC 2795 has another, etc.

I was more or less thinking along those lines.

My current implementation is from scratch and is still a bit away from being mature enough to make it public. I hope that I will be able to get it to a shareable degree within the next one to two weeks. Once this is accomplished, I will share the link here; then, we can discuss how it could be improved.

For the implementation itself, I decided to stick to a state-machine approach.

Struct format strings are just strings, so the normal string operations apply. Use 'ch'*4.

I know that this can be done, and the case I mentioned was just an example.
But things like this can be part of a more complex format, in which case the string formatting would make the format string less readable, in my opinion.

See discussion in this thread and this feature request.

Thanks a lot for sharing these. I couldn’t find them myself.
Having read them, it looks like variable size is the major problem, so I guess that Paul Moore’s suggestion is the best option for now.

Variable-length integer. Unfortunately there are several meanings for this. The MIDI specification has one, RFC 2795 has another, etc.

Thanks for pointing out this issue in general. The one I have in mind is C#'s 7-bit encoded int, which seems to be the MIDI specification.

Yeah, that does look like the same specification. But I would avoid calling it “varint” and instead be very VERY specific as to which one you’re implementing (and be prepared for debates about whether this should be the only one implemented, or if it should be included at all).
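
To show why the exact variant matters, here is a minimal decoder sketch for one common layout: seven payload bits per byte, with the high bit set on every byte except the last. The byte order is an assumption here. This sketch reads the least significant 7-bit group first, as LEB128-style encodings do; MIDI's variable-length quantity places the most significant group first, so the loop would need adjusting for that layout. The function name read_7bit_int is hypothetical.

```python
def read_7bit_int(buf, offset=0):
    """Decode an unsigned 7-bit-group integer, least significant group first.

    Hypothetical helper; returns (value, next_offset) and raises
    IndexError if the buffer ends before the final byte.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear marks the last byte
            return result, offset
        shift += 7


value, pos = read_7bit_int(b"\xac\x02rest")
```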

There’s also the Construct project. This is a much more powerful library. The question is whether there’s any room for a library that’s a bit more powerful than struct, but not as powerful as something like construct.

Personally, I’m more likely to just go for construct for anything that’s not fairly simple.

Ooh I should look into Construct. Normally, if I need something the stdlib can’t do, I just end up rolling my own from primitives, which often ends up unideal.

The main problem I have with Construct is its overhead: it’s a bit slow for my taste because it’s written entirely in Python. C strings (null-terminated strings) in particular usually take quite a while to parse. While this usually doesn’t matter much for Python, I like to keep the speed up to trash-talk C#. Using struct or even just a C-implemented binary reader that emits Python objects is faster than C#'s binary reader.

I’m planning to expand my project later on as well, trying to implement a decorator that automatically generates a Struct based on type hinted fields or via a specified field, similar to the attrs project.
But that’s in the future once the things mentioned previously work fine, and is not something I would propose to add to struct.

One other alternative is to use ctypes’ Structure, c_uint8, and friends.
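
A minimal sketch of that approach, which gives nested, repeated sub-structures for fixed-size layouts. The class and field names are made up for illustration, and the fields are chosen so that natural alignment introduces no padding.

```python
import ctypes
import struct


class Pair(ctypes.LittleEndianStructure):
    # 1 + 1 + 2 bytes, so no padding is needed before the 16-bit field.
    _fields_ = [
        ("tag", ctypes.c_char),
        ("flags", ctypes.c_uint8),
        ("value", ctypes.c_int16),
    ]


class Record(ctypes.LittleEndianStructure):
    _fields_ = [
        ("count", ctypes.c_uint16),
        ("pairs", Pair * 2),  # fixed-length array of sub-structures
    ]


raw = (
    struct.pack("<H", 2)
    + struct.pack("<cBh", b"A", 1, -5)
    + struct.pack("<cBh", b"B", 0, 7)
)
rec = Record.from_buffer_copy(raw)
values = [(p.tag, p.flags, p.value) for p in rec.pairs]
```

The array length is part of the type, so this covers sub-structures but not the variable-size cases discussed above.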

Ooh I should look into Construct.

Me too. That use of __rtruediv__ is very cute.

Normally, if I need something the stdlib can’t do, I just end up rolling my own from primitives, which often ends up unideal.

Yah. My own foray is the cs.binary module, which at least has a struct-based form for structures that can be parsed with the struct module (and of course flexible forms for… everything else).

Cheers,
Cameron Simpson cs@cskk.id.au