TL;DR:
- What would be the best implementation for UUID v7? should it solely focus on providing monotonicity within a milli-second and unguessability for the remaining components or should it provide an improved timestamp precision?
- What would be the best implementation for UUID v8? should it solely focus on being the most generic one where the random blocks can be customized and no monotonicity is guaranteed, should it feature time-based properties or name-based properties?
RFC 9562, superseeding RFC 4122, is now in the standard tracks, meaning that it is part of the standard. There are many implementations of RFC 9562, as acknowledged by the following study: GitHub - uuid6/prototypes: Draft Prototypes and Tests for UUIDv6 and beyond.
While the implementation of version 6 is explicitly stated in the RFC, versions 7 and 8 may have different purposes and different implementations. There has been a previous discussion on UUID v7 (Add uuid7 in uuid
module in standard library) but the latter did not discuss the implementation details of the UUIDv7 nor UUIDv8. A first PR was drafted for v6/v7 in 2021 by oittaa #29824 but was based on previous specifications. The same author updated their implementation oittaa/uuid6-python but did not make another PR on that topic.
I want to discuss the implementation of UUIDv7/v8 for the standard library. Since Python is not the only language in which UUIDs are implemented, I would appreciate the feedback of any other language implementing them at the level of their standard library.
Version 7
For version 7, the UUID consists of a 48-bit timestamp (milliseconds precision) from UNIX epoch together with 74 bits of randomness split into two (non-adjacent) blocks of 12 and 62 bits. The content of those blocks depends on the implementation. The RFC allows the following constructions:
- Select 74 bits uniformly at random. This is the simplest way to implement version 7 while maintaining monotonicity within a millisecond. This is implemented in #120650.
- Use 12 bits for sub-millisecond precision and the remaining 62 bits for a seeded counter. There are various methods for implementing this approach, one given by #120830 and another one given by stevesimmons/uuid7. There is also a rust implementation of v7 uuid-rs/uuid but this one is again a bit different (at least in the way it samples the components).
There are pros and cons for those two approaches. The first approach gives you a relatively good entropy but not much of a timestamp precision. However, the purpose of v7 is to improve the entropy characteristics over UUIDv1 or UUIDv6. In particular, you may want to balance the precision and the randomness constraints in such a way that you can batch-generate UUIDv7 objects without having too many collisions, while being better than UUIDv1 or UUIDv6 (the RFC states that “Implementations SHOULD utilize UUIDv7 instead of UUIDv1 and UUIDv6 if possible”, so it can be thought of as a replacement).
More generally, the UUIDv7 structure is as follows, with the first row being the number of bits:
# --- 48 --- -- 4 -- - 12 - -- 2 -- - 62 -
# unix_ts_ms | version | rand_a | variant | rand_b
In my alternative implementation #120830, rand_a
is used for an additional 12-bit sub-millisecond precision constructed with Method 3, §6.2. This choice follows the recommendation of §5.7. Note that we are NOT allowed to have more than 12-bit of sub-ms precision according to the RFC since otherwise we would lose entropy guarantees.
On the other hand, rand_b
is a 62-bit seeded counter generated according to Method 2, §6.2. The initial value of rand_b
is a random 62-bit integer and is incremented by a random 32-bit integer if they are generated within the same timestamp tick. For instance:
# ms = 0
u0 = uuid7() # u0.rand_b = random(62)
# ms = 1
u1 = uuid7() # u1.rand_b = random(62)
# recall uuid7() within the same ms tick
u2 = uuid7() # u2.rand_b = u1.rand_b + random(32)
# ms = 2
u3 = uuid7() # u3.rand_b = random(62)
If rand_b
overflows within the same ms tick, it is replaced by a uniformly distributed 62-bit integer and rand_a
is advanced by 1 (this would emulate the fact that the UUID was in fact generated in the next sub-millisecond tick). In particular, the monotonicity is kept.
If rand_a
also overflows, the entire routine is executed again (which would emulate the fact that some time has passed). Note that this could be possibly be used in side-channels since the execution of uuid7()
would be twice slower. The technique of advancing rand_a
in case of an overflow in rand_b
tries to follow the RFC guidelines about counter rollover handling.
Actually, for this last point, I would like to discuss which direction is the best. Currently, it’s an “update-ahead” approach which allows to handle overflows in both the rand_a
and rand_b
parts. However, this might become disruptive in the sense that the exact sub-ms precision is actually lost. For instance, if I call uuid7()
at some point and the timestamp that is being used would cause an overflow in rand_b
and rand_a
, the resulting UUID would be “as if” I called the function a bit later. Similarly, if I only have an overflow in rand_b
, I advance rand_a
by 1, namely I change the actual sub-millisecond component. I don’t have much experience in such corner cases in practice, so I would be happy if anyone could tell me whether this could cause serious issues.
Version 8
UUIDv8 is a versatile UUID because it has almost no constraints. It is quite close to UUIDv4, but does not need to be cryptographically secure. More precisely, UUIDv8 considers 122 free bits, split into three non-adjacent chunks of 48, 12 and 62 bits, as depicted on the following figure:
# -- 48 -- -- 4 -- -- 12 -- -- 2 -- -- 62 --
# custom_a | version | custom_b | variant | custom_c
The user might provide any of those three blocks separately. In #120650, those blocks are generated by default with random.getrandbits
, which is quite efficient (in general, MT19937-based generators are faster than those targetting the uniform distribution). Since UUIDv8 is an implementation-detail, the RFC provides two alternatives:
- Time-based UUIDv8: this one features a 10-ns precision with a 60-bit timestamp and 62 bits of random data. This could be used as an alternative to UUIDv7 lack of precision.
- Name-based UUIDv8: this one allows the use of secure hashing algorithms such as SHA256/SHA-3/SHAKE-256 (see here for an example of SHA256-based UUIDv8).
I did not find explicit version 8 implementations that are not just simple implementations. However, I think that the standard library, if supporting v8, should either take the simplest implementation where monotonicity is no more guaranteed or use the implementations for time-based or name-based UUIDv8 suggested by the RFC. While I understand that there are packages already exposed, I think the role of the standard library is first and foremost to be standard and thus should follow the RFC as much as possible.