Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL serialiser does not produce idempotent strings if an opaque path has a trailing space and "exclude fragment" is true #784

Open
karwa opened this issue Sep 5, 2023 · 15 comments · May be fixed by #785 or #844

Comments

@karwa
Copy link
Contributor

karwa commented Sep 5, 2023

  • A URL's opaque path can contain trailing spaces if there is a query or fragment after it

  • This presents a problem, because if the URL somehow loses its query or fragment, the URL's serialisation will have trailing spaces. Since the parser trims spaces, re-parsing such a URL would not return an equivalent value to one that was serialised.

  • In Setters can cause failure to roundtrip #651 we modified the query and fragment setters to safeguard against this - if you use the API to remove the query/fragment, trailing spaces in the opaque path are trimmed.

  • However, the URL serialiser also has an "exclude fragment" parameter (default false). If this parameter were to be set true, it would produce the same issue as Setters can cause failure to roundtrip #651

Fixing this will be tricky. I think there are only a couple of options:

  1. We just document the status quo; the API/parser/serialiser are idempotent except when "exclude fragment" is set to true, in which case they might corrupt the path. That doesn't sound good. It's quite common to exclude the fragment.

  2. We trim opaque paths when serialising. That's also not good, because it means the API tells me one thing, but somebody (such as a browser) could serialise the URL without fragment, reparse the result, and the same API property would have a different value.

  3. We always trim trailing spaces from opaque paths. Technically a breaking change, but overall it's better at ensuring everything stays consistent.

@karwa
Copy link
Contributor Author

karwa commented Sep 5, 2023

To illustrate using JSDOM:

const urlRecord = parseURL("data:space #hello");
const serialized = serializeURL(urlRecord, /* excludeFragment: */ true);
  
const reparsed = parseURL(serialized);
const serializedAgain = serializeURL(reparsed, /* excludeFragment: */ true);
assert.strictEqual(serialized, serializedAgain);  // Fails
assert.strictEqual(received, expected)

Expected value to strictly be equal to:
  "data:space"
Received:
  "data:space "

@annevk
Copy link
Member

annevk commented Sep 5, 2023

Nice catch!

We could also try to encode a trailing space, maybe. Or maybe we already considered and discarded that last time for a good reason?

Within a browser serializing without a fragment is mainly done for networking-related code (and thus https:), but data: URLs could be impacted I suppose.

@karwa
Copy link
Contributor Author

karwa commented Sep 7, 2023

I don't see any discussion last time about escaping trailing spaces. I suggested escaping all spaces, but that was considered likely incompatible with the web. I think it should be fine.

That said, I would suggest that we escape all trailing spaces (not just the final trailing space). I can't imagine that escaping just the final space would be any better for compatibility.

@karwa karwa linked a pull request Sep 7, 2023 that will close this issue
4 tasks
@rmisev
Copy link
Member

rmisev commented Sep 7, 2023

I'd prefer to just escape only the last trailing space. It's simpler and the algorithm would be faster.
So I don't see any advantage in escaping all the trailing spaces, or am I wrong?

@annevk
Copy link
Member

annevk commented Sep 8, 2023

@achristensen07 @valenting @hayatoito thoughts on how to best tackle this? Given the discussion so far I see these options:

  1. Always replace a trailing space in an opaque path with %20.
  2. Replace a trailing space with %20 when it becomes problematic, such as when serializing or using one of the setters. (This would also change setter behavior away from removing trailing spaces.)
  3. Only replace a trailing space with %20 when serializing, but keep removing spaces when using the setters.

And then variants of 1/2/3 where we instead replace all trailing spaces with %20, let's call those 1b/2b/3b.

I like the simplicity of 1 personally.

@karwa
Copy link
Contributor Author

karwa commented Sep 8, 2023

I'd prefer to just escape only the last trailing space. It's simpler and the algorithm would be faster.
So I don't see any advantage in escaping all the trailing spaces, or am I wrong?

So, I think there is probably broad agreement that unescaped spaces are not ideal. Unfortunately, they are required for web compatibility, so we have to allow them as much as possible.

As part of this change, one or more trailing spaces that would previously have been unescaped would now be escaped. That's a change which has the potential to break some applications, but it is equally breaking whether we escape one trailing space or all trailing spaces. There is no compatibility advantage to only escaping a single space - the same applications would break either way, and the remedy would be the same.

I can't think of another time where we escape only a single occurrence of a particular character, and leave all other instances of that character in the same URL component without escaping; generally it's always "code point X is escaped in component Y". So both possibilities add not-insignificant complexity - but, if I were a developer working these kinds of URLs, I think it would be overall simpler and more predictable if the parser escaped all of those trailing spaces for me at once. Then I could do my processing (perhaps removing some of those trailing spaces), and I wouldn't keep seeing the parser add escaping for me all the time.

As for performance? The difference is negligible. We're talking about an edge case of an edge case of an edge case (multiple occurrences of unescaped trailing spaces in a URL with opaque path); the important thing is that the more common scenarios can be fast-pathed, and they can.

@annevk
Copy link
Member

annevk commented Sep 8, 2023

@karwa however, the current proposal gives data:text/html,blah blah%20 for data:text/html,blah blah so it's not quite correct that all spaces end up replaced. So we might as well do what is simpler.

@karwa
Copy link
Contributor Author

karwa commented Sep 8, 2023

Right. The question is only about multiple trailing unescaped spaces. If the source only contains a single trailing space to begin with, the result would be the same.

Result
Source data:blah blah ?q
Escape single trailing data:blah blah %20?q
Escape all trailing data:blah blah%20%20?q

I think escaping all trailing spaces leads to a simpler and more predictable outcome (and fewer unescaped spaces, even if we can't escape all of them) -- but again, it's an edge case of an edge case of an edge case, so it's not extremely important to me.

@annevk
Copy link
Member

annevk commented Sep 20, 2023

https://software.hixie.ch/utilities/js/live-dom-viewer/saved/12022 demonstrates this issue using XMLHttpRequest. Gecko strips trailing spaces and Chromium/WebKit do not (none escape). Perhaps this lack of interoperability suggests we have some freedom here to do something better?

@achristensen07 @valenting @hayatoito thoughts?

@valenting
Copy link
Collaborator

While I would prefer a consistent behaviour such as escaping all spaces in a path, I suspect it might be easier to just escape or strip the trailing spaces. I don't have any preference between the two.

@hayatoito
Copy link
Member

hayatoito commented Sep 21, 2023

I haven't had time to understand the issue yet.

Can't we consider that encoding a trailing space is a user's responsibility, instead of URL Standard's responsibility?

For example, the following is a bad practice. Spaces can be trimmed anytime after some accidental operations.

new URL("data:blah #hello")

Thus, we encourage users to escape trailing spaces by themselves before passing it to URL to avoid an accidental removal.

new URL("data:blah %20#hello");

Please correct me if I don't understand the issue.

@annevk
Copy link
Member

annevk commented Sep 21, 2023

You understand it correctly, the problem here is that we have an idempotence goal: https://url.spec.whatwg.org/#goals. And while the goals are not immutable, idempotence is a property I think we want to keep. There have been quite a few security issues with URL implementations that do not have that property.

@hayatoito
Copy link
Member

Thanks! I didn't notice that idempotence is clearly stated as a goal in the URL Standard. I understand now.

  1. Always replace a trailing space in an opaque path with %20.

Proposal 1 means:

const url = new URL("data:blah  #a");
assertEquals(url.pathname, "blah %20");
url.hash = "";
assertEquals(url.pathname, "blah %20");

, right? This sounds best to me (out of 1/2/3/1b/2b/3b).

However, as far as I understand, the following URLs (as a result of serialization) are not equivalent to each other:

  • "data:blah "
  • "data:blah%20%20"
  • "data:blah %20"

So every option proposed here seems a technically breaking change.

  1. We always trim trailing spaces from opaque paths. Technically a breaking change, but overall it's better at ensuring everything stays consistent.

This doesn't seem a popular option here, however, this looks the simplest and easy-to-understand rule to me.

I assume we introduce a breaking change anyway.

@karwa
Copy link
Contributor Author

karwa commented Dec 6, 2023

We always trim trailing spaces from opaque paths. Technically a breaking change, but overall it's better at ensuring everything stays consistent.

This doesn't seem a popular option here, however, this looks the simplest and easy-to-understand rule to me.

I’d be fine with this, too.

annevk added a commit that referenced this issue Dec 2, 2024
In fdaa0e5 we tackled a problem whereby removing the fragment or query from a URL with an opaque path through the API would not make the URL roundtrip due to the opaque path being able to end in non-percent-encoded spaces.

However, this failed to address other ways of serializing the URL. As such this is a new approach whereby opaque paths simply cannot end with non-percent-encoded spaces. Enforcing this in the URL parser allows us to completely revert the aforementioned commit, greatly simplifying the API implementation.

Fixes #784.
@annevk annevk linked a pull request Dec 2, 2024 that will close this issue
5 tasks
@annevk
Copy link
Member

annevk commented Dec 2, 2024

I put up #844 which removes trailing U+0020 from opaque paths during parsing. Now that all browsers have a good handle on non-special URLs it seems like a good time to be fixing this (modulo having to wait a bit for Interop 2024 results to be frozen).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
5 participants