Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support UUIDv6, UUIDv7, and UUIDv8 from RFC 9562 #89083

Open
stevesimmons mannequin opened this issue Aug 15, 2021 · 42 comments
Open

Support UUIDv6, UUIDv7, and UUIDv8 from RFC 9562 #89083

stevesimmons mannequin opened this issue Aug 15, 2021 · 42 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@stevesimmons
Copy link
Mannequin

stevesimmons mannequin commented Aug 15, 2021

BPO 44920
Nosy @serhiy-storchaka, @loganasherjones, @stevesimmons

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-08-15.17:26:05.956>
labels = ['type-feature', 'library', '3.11']
title = 'Support UUIDv6, UUIDv7, and UUIDv8 from the new version of RFC4122'
updated_at = <Date 2021-08-17.00:36:53.563>
user = 'https://github.com/stevesimmons'

bugs.python.org fields:

activity = <Date 2021-08-17.00:36:53.563>
actor = 'loganasherjones'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2021-08-15.17:26:05.956>
creator = 'stevesimmons'
dependencies = []
files = []
hgrepos = []
issue_num = 44920
keywords = []
message_count = 2.0
messages = ['399624', '399647']
nosy_count = 3.0
nosy_names = ['serhiy.storchaka', 'loganasherjones', 'stevesimmons']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue44920'
versions = ['Python 3.11']

Linked PRs

Related

@stevesimmons
Copy link
Mannequin Author

stevesimmons mannequin commented Aug 15, 2021

Three new types of UUIDs have been proposed in the latest draft of the next version of RFC4122. Full text of that draft is in [1] (published 21 April 2021; draft period ends 21 Oct 2021).

Support for these should be included in uuid.py for Python 3.11, with backport for 3.9 and 3.10. The timetable for Python 3.11 should fit with the end of the IETF draft period.

Implementation should be similar to the existing UUID classes in uuid.py, the prototypes in [2], or even parts of my own uuid6 version [3].

[1] https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format
[2] https://github.com/uuid6/prototypes/tree/main/python
[3] https://github.com/stevesimmons/pyuuid6/blob/main/uuid6.py

@stevesimmons stevesimmons mannequin added 3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Aug 15, 2021
@serhiy-storchaka
Copy link
Member

It is a new feature, and we usually do not backport new features to old Python versions, so it can only be included in Python 3.11 (backports can be provided by third-party libraries). Do you want to create a PR?

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@freundTech
Copy link
Contributor

Is there anyone currently working on this?

If not I'd like to have a look at implementing this.

@AlexWaygood AlexWaygood added 3.12 bugs and security fixes and removed 3.11 only security fixes labels Sep 8, 2022
@ambv
Copy link
Contributor

ambv commented Oct 15, 2023

Note: the spec for UUIDv5 - UUIDv8 is still a draft, it's still being revised:
https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis

Therefore, it is too early to add this to the Python standard library.

@erlend-aasland erlend-aasland added 3.13 bugs and security fixes and removed 3.12 bugs and security fixes labels Jan 5, 2024
@shane-ns1
Copy link

UUIDv6, UUIDv7, and UUIDv8 are now in a standards-track RFC:

https://www.rfc-editor.org/rfc/rfc9562

@picnixz
Copy link
Contributor

picnixz commented Jun 17, 2024

I'll make a PR for this (I'm interested in those versions).

@hugovk hugovk added 3.14 new features, bugs and security fixes and removed 3.13 bugs and security fixes labels Jun 17, 2024
@hugovk hugovk changed the title Support UUIDv6, UUIDv7, and UUIDv8 from the new version of RFC4122 Support UUIDv6, UUIDv7, and UUIDv8 from RFC 9562 Jun 17, 2024
@picnixz

This comment was marked as resolved.

@gpshead
Copy link
Member

gpshead commented Jun 22, 2024

FYI - there are PyPI packages from people in the community attempting to come up with ways to use UUID v6-8 today:

What we'd be seeking to do within the stdlib is settle upon how these should fit as features into the standard library's existing uuid APIs. (here and with your other fields related issue)

@picnixz
Copy link
Contributor

picnixz commented Jun 22, 2024

Actually, I first tried an implementation based on those packages but after reading the RFC again, I was wondering: "which is the best course of action for the standard library?" and thus I decided to pick the (only) possible variant of v6 where the implementation is RFC-compliant (and then I hit the issue with the fields...) and for v7 and v8, I decided to first take the generic one (and made an alternative for v7 using monotonicity as specified in the RFC alternatives). I did not decide anything on v8 since discussion should first be done.

Note that oittaa's v7 is more or less like #120650 (non-monotonous sub-sec v7) since it follows the basic RFC but Simmons' v7 seems to follow the alternative (Method 3) combined with Method 1, §6.2 (Fixed Bit-Length Dedicated Counter) whereas #120830 is Method 3 combined with Method 2, §6.2 (Monotonic Random). I say "seems to" because it's not really clear whether the RFC allows mixing Method 1 & Method 3 (Method 1 forces the counter to immediately follow the 48-bit timestamp part but Method 3 says that the sub-seconds precision should be at that place so...). Method 2 explicitly tells me that I need to use the last 62 bits to make whatever I need so it's closer to RFC compliance.

Actually, there are more prototypes that I found last week: https://github.com/uuid6/prototypes, and they like to differ in the implementation of v7 and v8... For v6, the implementation is RFC-decided so we don't need to bother with a discussion, just the other issue on the fields. For v7/v8, do you think we need a Discourse (different from https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390/7) & a PEP perhaps?

There's also https://github.com/uuid-rs/uuid which uses the same techniques that I presented in the first PR (namely, UUIDv7 has 80-bit security and UUIDv8 has custom chunks).

@picnixz
Copy link
Contributor

picnixz commented Jun 26, 2024

I've opened https://discuss.python.org/t/rfc-4122-9562-uuid-version-7-and-8-implementation/56725 to discuss the implementations more in detail.

@picnixz picnixz removed the 3.14 new features, bugs and security fixes label Aug 22, 2024
@kyzer-davis
Copy link

kyzer-davis commented Aug 22, 2024

Thanks, I found your PRs after I submitted my writeup! (I will comment on #120878 rather than here)

When you edit your post, please make it a new comment and ping me instead so that I get the notification.

Will do

But generally I agree. Each item I mention should be a PR. Some come before others but they all have some benefit.

I will address a few of the comments to expand on them:


Support all v7 counter methods:

I like this approach but this would make the size of the module blow up.

I don't think it will add that much overhead.
We should have the default as none which generates timestamp and random since this is technically the base UUIDv7 spec. IMO implement this first then we can layer on methods 1-3 in future PRs. The others are "nice to have" for batch UUID generation rather than "I need a UUID give me one now" every so often. (Similar to timestamp_offset... get the base v7 then we can add a flag for that additional item later.

All Methods 1-3:

  • Generate timestamp and random as if none was supplied
  • Check if old v7 timestamp is same as current v7 timestamp
  • When timestamp is greater than old one, ensure we clear the old counter
  • Pseudocode below assumes we have already verified the timestamp is the same and need to do counter logic:

Fixed Bit-Length Dedicated Counter (Method 1):

  • Instantiate N-1 counter randomly if no counter state using 12 as default or user supplied input
  • If counter: increment +1
  • Add counter to MSB of random. Can be done a few ways (front-trim+append, bitwise operations, etc.)

Monotonic Random (Method 2):

  • The key for this one is that we track the state of both the timestamp and the random and don't roll for each UUID gen call, just re-roll random when the timestamp is different. Thus the random just increments +1 from previous when timestamp is the same or timestamp changes and random also changes.

Replace Leftmost Random Bits with Increased Clock Precision (Method 3):

  • I will be honest, I did not come up with this method or the text so I could be wrong with what comes next but I will put some logic down regardless.
  • From a NS timestamp, truncate to MS but keep track of the extra bits we would discard to calculate a counter as per this paragraph: https://www.rfc-editor.org/rfc/rfc9562.html#section-6.2-5.6.3
  • Check if we have an old counter state and increment by 1
  • If no old counter, perform the calc above and save it for future comparisons.
  • This counter is a fixed length of 12 bits so it should be easy to get into the MSB of the random (akin to what would be done with method 1)

v8
I will check the PR, didn't see that one yet.

By the way, anyone should be able to generate v8, not just an "admin".

Yeah, that is what I meant. Admins==Users in my world.


I think the standard library is responsible for providing something standard.

That true. This specific example, while illustrative, was an ask of many folks who want to use v5 but for one reason or another can't use sha1. The request came so late into the spec writing that we didn't get to give it a proper version number. This was a simple way of providing some extensibility to v5 via the v8 space.

With the level chatter around it that I saw while authoring the document: I think folks will find it useful if there was a way to generate a sha256 name-based UUIDv8 using that specific example. Maybe we could label it experimental?

Note: I wrote that example by actually modifying uuidv5 in uuid.py so I know the code required is pretty minimal.

But I tend to agree with the fact that v8 name-based (using whatever algo is probably overkill). If we create the v8 generic function; whomever creates that library can piggy back off our uuid.uuidv8 later.


monotonic_check=true

One question: what would this monotonic_check do? should it warn? should it fail? should it sleep a bit to make sure that monotonicity is correct? assuming no race conditions, the current implementation technically (I hope) guarantees monotonicity so no need for an additional helper.

We have that here for general checks:
https://www.rfc-editor.org/rfc/rfc9562.html#section-6.2-12.2

For counter rollover (shouldn't happen with the methods built-in guards but maybe it does) the guidance is here:
https://www.rfc-editor.org/rfc/rfc9562.html#section-6.2-9.4


What would UUID.NIL represent? and UUID.MAX?

NIL = UUID('00000000-0000-0000-0000-000000000000')

https://www.rfc-editor.org/rfc/rfc9562.html#name-nil-uuid

https://www.rfc-editor.org/rfc/rfc9562.html#name-max-uuid

MAX = UUID('FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF')


AFAIK, 9562 is an extension of 4122 so v1, v3, v4, and v5 should be ok if they conform to 4122.

We did make one small change to v1 (and v6) which is that we now encourage random > MAC nodes now.

The default for this lib is if node is not supplied it gets one from the host and if that is not possible; generates random.

IMO we should add a new v1/v6 flag that is like node_pref=4122,9562 and default to 9562. (Alt default to 4122 for v1 and 9562 for v6)
Then in check for this to determine if the node should be generated a per mac > random (4122) or random > mac (9562).

Note: I do not think we need to add the MAC address randomization techniques [IEEE802.11bh] to this logic.

More on nodes: there is lots of logic added for other types of IEEE 802 node derivation (can be handled in some other PR or not at all. Just guidance for what to do if we obtain these values and how to convert them into a node ID)
https://www.rfc-editor.org/rfc/rfc9562.html#section-5.1-10

Lastly, I looked at my notes and I did check v3/v5 when performing various checks during the authoring.
The outputs provided by this library matched the other libraries so I am inclined to say they are valid under the hood. (Some libraries I saw ran into issues with mixing the variant into the hash. See my note here for an example of what I mean.

@kyzer-davis
Copy link

kyzer-davis commented Aug 22, 2024

Updated my original comment with a link to the errata that change a binary bit from a test vector.
e.g https://www.rfc-editor.org/errata/eid7929
Edit, this is only for the sha256 v8 UUID so if that is not used this errata is moot.

@gbdlin
Copy link

gbdlin commented Aug 22, 2024

We do not currently have a lock mechanism (not even for v1) so we could perhaps make it more robust.

Synchronization issues might happen but we're probably very unlucky if we ever hit them (or if someone noticed it).

For synchronization and lock mechanism, there is actually one for v1, but a bit obscured. When an implementation in C exists (using unix-tools underneath), it can call a separate process, if it is running on the system, called uuidd. This is a single-thread process capable of generating v1 UUIDs monothonically and without collisions. As v1 UUIDs are mac address based, this is needed for UUIDs to be collision-free, as on multi-threaded systems there is a chance 2 UUIDs wil be generated on the same clock cycle. If uuidd is used for it, the generated uuid will have uuid.SafeUUID.safe flag set.

This method is for now not present for UUID v6, as the uuidd process is not capable of generating them for now.

UUID v7 would not need such treatment, as each process should generate a different random value, which is used in v7 instead of a mac address.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Aug 22, 2024

Replace Leftmost Random Bits with Increased Clock Precision (Method 3):

  • I will be honest, I did not come up with this method or the text so I could be wrong with what comes next but I will put some logic down regardless

Although fault tolerance requires that each microservice writes to its own database tables, in practice this requirement is often violated.

The implementation of UUIDv7 for PostgreSQL had to switch from Method 1 to Method 3 (Increased Clock Precision with 12 bits sub-millisecond timestamp fraction) to synchronize the UUIDv7s generated by different microservices for the same database table. This turned out to be simpler than the autoincrement-like analogue. See the C implementation v27-0001-Implement-UUID-v7.patch of Method 3 at the page as a reference. The entire timestamp acts as a counter in rare case when more than about 4 identifiers per microsecond are generated.

This implementation also added the ability to offset the timestamp by a specified interval to hide the record creation time for information security. If offset would cause the timestamp to be outside the allowed range, it should not be applied.

It would be nice to add such a special UUIDv7 function for microservices.

@kyzer-davis
Copy link

Edited my original comment to feature a note about uuid.RFC_4122 in the code and some suggestions for handling that.

@piranna
Copy link

piranna commented Sep 6, 2024

From my understand of the v7 spec, it's possible to have both sub-millisecond precission AND a counter, and also the sub-millisecond DON'T need to be at least 12 bits, just only recommended since it's the size of the rand_a field.

I was planning on doing my own implementation (I can share it if you are interested to include it), and my idea was to have a sub_ms_bits flag to specify how much bits do you want to use for the sub-millisecond field (0-16, that's the most we can get with time.time_ns() function, or True to enable it with the suggested 12 bits), and a counter_bits int flag to indicate how much bits to use for the counter (I think the spec says up to 48, to ensure there's at least 10-14 pure-random bits at the end, but maybe I misunderstood). Later we can have a third and final flag to define what method to use for the counter bits. This way we could fully implement the UUID v7 spec in a flexible way, by allowing to opt-in to the extended precision and counter bits and define their length.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Sep 7, 2024

@piranna

When using submillisecond precision, I advise you to use the whole timestamp as a counter if the timestamps for consecutive UUIDs have not increased. This will eliminate the need for a separate counter, and therefore it will be possible to preserve a sufficiently long random segment (at least 32 bits) to make attacks difficult by sequential brute force of UUID values.

As for the length of the submillisecond part, it is necessary to take into account not only the available precision of time sources in operating systems, but also the maximum performance of recording in the DBMS, which does not require nanosecond accuracy.

@picnixz
Copy link
Contributor

picnixz commented Sep 7, 2024

You're right that the RFC does not give a lower bound but it gives an upper bound (emphasis mine):

An OPTIONAL sub-millisecond timestamp fraction (12 bits at maximum) as per Section 6.2 (Method 3).

That being said, I chose to have a non-modularizable implemention of v7 for the standard library as a first draft. It's always possible to make it extensible in the future but I think the standard library should propose one way (if you want more, then I think 3rd-party libs should be used).

The issue with specifying a flexible counter bit length is that we need to keep track of multiple global variables (for instance, all UUIDv7 objects generated with a counter with say 7 bits will have their own global timestamp and counter for synchronization). This is quite easy since you would create a dictionary entry with its state, each entry being a possible configuration. But I don't really want to go there.

What could be done, however, is to create a factory of UUID factories. In other words, you specify a configuration for your UUIDv7 algorithm and you'll get an object that would create UUIDv7 objects according to that specific configuration. The factory object would have a single method, namely new() which would just create a new UUIDv7.

This would probably the easiest way to have a flexible UUIDv7 implementation included in the standard library. The standard library would however expose by default the UUIDv7 implementation using the RFC recommended methods.

@piranna
Copy link

piranna commented Sep 7, 2024

My comments were about made the v7 implementation according to the spec, but i'm ok with having base support with just 48 timestamp bits and no counters or already enabled opt-in featured, as far as API is designes to allow add and enabled the featured in upcoming versions. It's said, i don't want implentation to be opinión based but be spec based, like 12 bits submilliseconds being all or nothing when spec allows It to be variable, or force to choose between submilliseconds precission or counter when spec allows both. Call me purist if you want.

@picnixz
Copy link
Contributor

picnixz commented Sep 8, 2024

I'll work on designing a separate factory for that. What I want to more or less ensure is to be consistent with other languages if possible (e.g., PHP/Rust that can be both used for microservices and/or backends) rather than having Python follows its own rules.

Call me purist if you want.

I won't call you that because that's what I would personally do for my personal projects. However, for the stdlib we sometimes need to make design and implementation choices. But part of me do like a flexible implementation (especially if it is desired by the community).

Now, another of example of this is actually the random module which features standard random techniques (namely MT-19937 based generators) but allows you to specify your own PRG algorithm if you want to. I think having a similar interface could be beneficial (I'll work on it a bit later, just to suggest a PoC).

@sergeyprokhorenko
Copy link

@picnixz

What I want to more or less ensure is to be consistent with other languages if possible (e.g., PHP/Rust that can be both used for microservices and/or backends) rather than having Python follows its own rules.

See uuid-rs/uuid#767 (comment)

@piranna
Copy link

piranna commented Sep 23, 2024

I have create a full implementation of the UUIDv7 latest frozen spec (RFC 9562) at https://github.com/piranna/UUIDv7, with 100% tests coverage. The most complex part was to understand that in practice, having a monotonic random makes the counter as a sort of guard, since spec text is confusing explaining the relationship between methods 1 & 2. After that, implementation was easy-ish and I think got to get a very simple API, although complete and and at the same time unopinionated.

I have done it with the intention of being considered to be included as a built-in library in the Python batteries. Besides adding (more and better) documentation, what else would I need to get it included?

@picnixz
Copy link
Contributor

picnixz commented Sep 23, 2024

Thank you for that but I already (and completely) implemented the v7 but I'll have a look at yours. We can update my PR and avoid having two different PRs but not today (I'm currently travelling)

The issue here is not really the API but rather which standard implementation to choose by default. I followed what other languages decided to do (and will update it accordingly) but we will probably have another interface for more versatility.

In general, the Python library does not like having a single function with lots of parameters which make the implementation different and rather like having different functions with different names (but I'm not sure if this principle applies here; it did for the (yet to be accepted) fnmatch.filterfalse function). This is also the reason why I first wanted a parameterless uuidv7 function as a first implementation.

@piranna
Copy link

piranna commented Sep 23, 2024

My own one by default works without arguments, all of them are optional, and that just provides a 48 bits timestamp + 74 random bits. Later you can provide the arguments to enable the different methods, or tune Up, or define the explicit values for each one of the fields.

@picnixz
Copy link
Contributor

picnixz commented Sep 25, 2024

My own one by default works without arguments, all of them are optional

Actually this is what we try to avoid. One reason is that it makes maintenance harder (if any) and optimization harder as well (you create if-branches due to that). Finally, the fact that there are assertions / checks for checking whether the parameters combinations are good or not is something I would like to avoid (I think it's easier to make multiple functions rather than a single one; but we can make a single class with multiple class methods acting as a factory, which is what I originally had in mind).

I had a quick look at the API and I think we'll need to rethink the UUID class itself and the class itself should only be a view and not be responsible for generating the value.


What I can suggest is: we first decide on a default implementation that is really parameterless, namely def uuid7(): ... as in my PR and then we can add the features one by one. While I may seem critizing I am grateful that you took the time to look at the RFC and provide a versatile implementation.

@picnixz picnixz self-assigned this Sep 25, 2024
@piranna
Copy link

piranna commented Sep 25, 2024

I also think UUID API needs to be rethink, It seems like It was a uuid1 that later was refitted to allow support for the other versions. A UUID base class and several UUIDx chikd classes with their own properties would be better.

A parameterless version of UUID would just only create current timestamp + random, that can work pretty much as replacement of uuid4. I think we can start with that, just only in my use case i needed to set the timestamp explicitly too, just only meanwhile i was there, i wanted to go the extra (ten :-P) miles :-D And i'm glad you liked i did It :-)

@picnixz
Copy link
Contributor

picnixz commented Sep 25, 2024

I have a separate issue for tracking the UUID interface itself (burried in this huge conversation): #120878 (I added it to the issue; should have done that earlier...). It's only about the time fields but this is roughly one thing that is annoying (namely some attributes are not supported or have different meanings depending on the version).

coderbydesign added a commit to coderbydesign/insights-rbac that referenced this issue Oct 25, 2024
… column

This would start a new convention with the v2 APIs and models in order to have
consistent, clear naming, particularly when it comes to FK references.

We currently have the `uuid` field as the self-referencial FK column on the `Workspace`
model. More details around the impetus for changing the naming around IDs can be
found in RedHatInsights#1257.

These changes offer an alternate approach, since we have no data in stage/production,
where we no longer use the `uuid` as the `lookup_field` in Django, but rather
use a `uuid` as the `id` format. The rationale for not doing this, and having an
explicit `uuid` was primarily for having sequential integers as the PK/FK relations.
However, UUID7 is a time-ordered UUID, eliminating index issues and solving the
need for having distributed ID values across our services.

We're using `uuid-utils` [1] which is a compliant implementation using Rust's
UUID library. There's also an open proposal [2,3] to add it to Python's standard
library.

This updates the model, view and serializer. In order to move the `id` from int
to uuid, we need two migrations:
- one to move the current `id` column, and the `parent` column (because of the FK ref)
  as well as making the current `uuid` column the PK
- a second to then rename the `uuid` column to `id` and add the `parent` FK ref/column back

[1] https://github.com/aminalaee/uuid-utils
[2] python/cpython#89083
[3] https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390
vstinner pushed a commit that referenced this issue Nov 12, 2024
@vstinner
Copy link
Member

Change 03924b5 added uuid.uuid8().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests