Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden validation of user-submitted LCP background image URL #1713

Merged
merged 29 commits into from
Dec 17, 2024

Conversation

westonruter
Copy link
Member

@westonruter westonruter commented Dec 2, 2024

Fixes #1584

Accepting the LCP background image URL from the untrusted client is dangerous as it could lead to abuse. The JSON Schema to enforce the validity of the URL Schema with such a background image URL is not robust enough to ensure that the URL is safe and that it actually references a valid image file. Therefore, a new od_url_metric_data_pre_storage filter is introduced in the REST API endpoint. The URL Metric data is passed into this filter, allowing plugins to further sanitize the data prior to storage. Alternatively, if the data cannot be sanitized due to it being invalid, an exception can be thrown (ideally an OD_Data_Validation_Exception) which blocks the URL Metric from being stored.

@westonruter westonruter added [Plugin] Optimization Detective Issues for the Optimization Detective plugin [Plugin] Image Prioritizer Issues for the Image Prioritizer plugin (dependent on Optimization Detective) labels Dec 2, 2024
@westonruter westonruter added this to the image-prioritizer n.e.x.t milestone Dec 2, 2024
Base automatically changed from add/external-bg-preload to trunk December 11, 2024 19:36
@westonruter westonruter added the [Type] Enhancement A suggestion for improvement of an existing feature label Dec 13, 2024
@westonruter
Copy link
Member Author

westonruter commented Dec 13, 2024

@felixarntz This is ready for review although I still have yet to:

  • Guard against attempting to preload gigantic images (e.g. 2MB).
  • Add tests for functions added in plugins/image-prioritizer/helper.php.

I intend to do these Friday morning my time, so hopefully this can be reviewed and merged by mid-day Friday PST in preparation for the Monday release.

@westonruter
Copy link
Member Author

I ran out of time to work on this today, unfortunately. So I'm going to pick it up tomorrow.

@westonruter westonruter marked this pull request as ready for review December 15, 2024 22:30
Copy link

github-actions bot commented Dec 15, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: westonruter <[email protected]>
Co-authored-by: adamsilverstein <[email protected]>
Co-authored-by: felixarntz <[email protected]>
Co-authored-by: swissspidy <[email protected]>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Copy link
Member

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter The actual validation logic in Image Prioritizer looks good, but a few points of feedback on the OD API implementation.

plugins/optimization-detective/class-od-element.php Outdated Show resolved Hide resolved
plugins/optimization-detective/class-od-url-metric.php Outdated Show resolved Hide resolved
* @noinspection PhpUnhandledExceptionInspection
*/
wp_trigger_error( __FUNCTION__, $image_validity->get_error_message() . ' Background image URL: ' . $data['url'] );
$url_metric->unset( 'lcpElementExternalBackgroundImage' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is the only unset() usage in production code? Since this is a very specific use-case, maybe we can find another solution to make it possible, without allowing to unset metric and element data generally.

Could we for instance generate a copy of the given $url_metric, without the field that is unset? Obviously that's more or less the same thing, but without the need for an extra method just because of this use-case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work. Such a sanitization filter would be something like this:

$data = $url_metric->jsonSerialize();
unset( $data['lcpElementExternalBackgroundImage'] );
$url_metric = new OD_Strict_URL_Metric( $data );

The downside here is it's going to end up re-validating the data against the schema twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably fine. I don't think there's a considerable performance cost with that, and it only happens when sending data (vs e.g. on every page load).

* @param OD_Strict_URL_Metric $url_metric URL Metric, already validated against the JSON Schema.
* @param array<string, mixed> $url_metric_data Original URL Metric data before any mutations.
*/
$validity = apply_filters( 'od_url_metric_storage_validity', true, $url_metric, $url_metric->jsonSerialize() );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really storage specific? Why not call it just od_url_metric_validity?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale for this being storage-specific explained in the filter description. In particular, this extra validation is only applied at the time of storage. It doesn't apply when loading URL Metrics from the od_url_metrics post type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's quite normal though, to validate and sanitize prior to receiving data. When loading from a post type, it's fair to assume that this data is valid - otherwise it should have never been saved to a post in the first place.

I don't think WordPress commonly validates or sanitizes outside of receiving data, so I think mentioning "storage" seems unnecessary.

Comment on lines 289 to 295
/**
* No WP_Exception is thrown by wp_trigger_error() since E_USER_ERROR is not passed as the error level.
*
* @noinspection PhpUnhandledExceptionInspection
*/
wp_trigger_error( __FUNCTION__, $image_validity->get_error_message() . ' Background image URL: ' . $data['url'] );
$url_metric->unset( 'lcpElementExternalBackgroundImage' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the unset() point, this function strictly speaking sanitizes the URL metric, it doesn't just validate it. Given the validity filter name, I would find it more appropriate to throw an exception and call it a day. Validation shouldn't remove invalid properties. Also, since this wp_trigger_error() would typically happen in the REST API, it would be suppressed even with WP_DEBUG_DISPLAY, making this harder to spot - so an exception / WP_Error would be better from that perspective too. Would that not work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking behind unset here is that what if someone using an external background image from another domain not in the allowed list of origins. If any exception it thrown then every single URL Metric will be rejected. This is not good because then none of the other data in the URL Metric would ever be stored. What we want in this case is to just exclude that non-required property from the URL Metric as if it wasn't set in the first place. The use of wp_trigger_error() is to provide a mechanism for a site owner to find out when WP_DEBUG is enabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, but then shouldn't we have a filter for sanitization instead for this purpose, rather than validation?

Given this is for now the only usage of the validation filter, but it's not throwing an exception, I think it would make sense to change this to a sanitization filter that acts on the URL_Metric instance. This would be similar to e.g. how Core has a sanitization filter for options, but not validation. Obviously we could also add a validation extension point, but that could then happen later and doesn't seem as relevant to this as a sanitization filter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an alternative to the current validity filter in this PR. We could have a filter that runs on the $data before it is passed into the OD_Strict_URL_Metric class for instantiation. So instead of:

try {
// The "strict" URL Metric class is being used here to ensure additionalProperties of all objects are disallowed.
$url_metric = new OD_Strict_URL_Metric(
array_merge(
$data,
array(
// Now supply the readonly args which were omitted from the REST API params due to being `readonly`.
'timestamp' => microtime( true ),
'uuid' => wp_generate_uuid4(),
'etag' => $request->get_param( 'current_etag' ),
)
)
);

We could have:

try {
	$data = array_merge(
		$data,
		array(
			// Now supply the readonly args which were omitted from the REST API params due to being `readonly`.
			'timestamp' => microtime( true ),
			'uuid'      => wp_generate_uuid4(),
			'etag'      => $request->get_param( 'current_etag' ),
		)
	);

	/**
	 * Filters the URL Metric data prior to validation for storage.
	 *
	 * This can be used to apply additional sanitization to the URL Metric data which
	 * cannot be done with JSON Schema alone. Additionally, this filter can be used to
	 * apply validation constraints by throwing an OD_Data_Validation_Exception which
	 * will block the URL Metric from being stored.
	 *
	 * @since n.e.x.t
	 * @param array<string, mixed> $data URL Metric data. This is the Data type from OD_URL_Metric.
	 */
	$data = apply_filters( 'od_url_metric_data_pre_storage', $data );

	// The "strict" URL Metric class is being used here to ensure additionalProperties of all objects are disallowed.
	$url_metric = new OD_Strict_URL_Metric( $data );

In this way, the same filter can be used for both sanitization and validation. Even without explicitly throwing an exception, an exception would end up getting thrown anyway if a filter did something like unset( $data['viewport] ) since that is a required property.

Copy link
Member Author

@westonruter westonruter Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter That looks good to me. Though I think that filter name would be better as od_sanitize_url_metric_data, as it's a more intuitive name in terms of naming structure, and its primary purpose would be to sanitize data.

Makes sense that this filter could also be used for validation, so the try catch wrap around it makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using "sanitize" in the filter name makes sense since other core filters aren't named in that way. It's up to the filter callback to decide whether the purpose is to sanitize. Going the Options API as a model, there is the pre_update_option filter which is used to (maybe) serialize, how about od_pre_store_url_metric? Or to follow wp_insert_post() which has the wp_insert_post_data filter, what about od_store_url_metric_data? I like that more.

Copy link
Member Author

@westonruter westonruter Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there there is sanitize_option_{$option_name}.

Copy link
Member

@felixarntz felixarntz Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using "sanitize" in the filter name makes sense since other core filters aren't named in that way.

I don't think that's true, see e.g. sanitize_option_{$option}.

It's up to the filter callback to decide whether the purpose is to sanitize. Going the Options API as a model, there is the pre_update_option filter which is used to (maybe) serialize, how about od_pre_store_url_metric?

What other use-cases would such a filter have than sanitize? You can throw an exception (effectively validate), but technically you can throw an exception from anywhere, so that to me is not necessarily a reason to change the name. The example of serialization wouldn't be relevant here.

Or to follow wp_insert_post() which has the wp_insert_post_data filter, what about od_store_url_metric_data? I like that more.

This would be an okay name to me if we go with filtering the raw data. But that's not as clean of an API surface as filtering the metric instance (per my other comment), so I would prefer not to do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other use-cases would such a filter have than sanitize? You can throw an exception (effectively validate), but technically you can throw an exception from anywhere, so that to me is not necessarily a reason to change the name. The example of serialization wouldn't be relevant here.

A use case could be an extension which amends a stored URL Metric with additional data, like the IP address of the client which submitted the URL Metric.

This would be an okay name to me if we go with filtering the raw data. But that's not as clean of an API surface as filtering the metric instance (per my other comment), so I would prefer not to do that.

I think the API surface is actually cleaner by passing around an array rather then the OD_URL_Metric instance. See #1713 (comment).

@westonruter westonruter force-pushed the add/external-bg-preload-validation branch from 94c5e6b to e6cfcfd Compare December 16, 2024 18:34
@westonruter westonruter force-pushed the add/external-bg-preload-validation branch from e6cfcfd to 250f094 Compare December 16, 2024 18:39
Copy link
Member

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonruter This looks close, two follow up comments.

*
* @param array<string, mixed> $data URL Metric data. This is the Data type from OD_URL_Metric.
*/
$data = apply_filters( 'od_url_metric_data_pre_storage', $data );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1713 (comment) and #1713 (comment), I think there are better names for such a filter. Two things:

  • It's missing a verb. I think sanitize would be suitable.
  • The mentioning of "pre storage" is unnecessary for this kind of sanitization (or validation), and I think it's unnecessarily verbose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the filter in f74f4f4

Comment on lines 219 to 222
$data = apply_filters( 'od_url_metric_data_pre_storage', $data );

// The "strict" URL Metric class is being used here to ensure additionalProperties of all objects are disallowed.
$url_metric = new OD_Strict_URL_Metric( $data );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not keep filtering the OD_URL_Metric instance for this, as that would be a better filter interface than dealing with raw data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not because then we're passing around an object to the filters, and objects are passed by reference. And any time that you construct an OD_URL_Metric it applies all of the JSON Schema for validation. It's also not ergonomic to make a change to a URL Metric via a filter:

function ( $url_metric ) {
    $data = $url_metric->jsonSerialize();
    unset( $data['bad_key'] );
    $new_url_metric = new OD_Strict_URL_Metric( $data );
    return $new_url_metric;
}

Compared with:

function ( $data ) {
    unset( $data['bad_key'] );
    return $data;
}

So I think it is better to pass around the URL Metric data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, by passing around an object then the user needs to be aware of the distinction between OD_Strict_URL_Metric and OD_URL_Metric, where the former is intended to be an internal class used just in the REST API endpoint.

Copy link
Member

@felixarntz felixarntz Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's only for the specific example of unsetting data though (because there's no unset() method). Regular sanitization would be as simple as:

function ( $url_metric ) {
    $url_metric->set( 'prop', 'sanitized-value' );
    return $new_url_metric;
}

If you feel strongly about the ergonomics of this for the scenario of unsetting a property, I'd rather bring back the unset() method than filtering raw data just because of the lack of an unset() method.

Honestly, at that point, it would be even better to use an action hook, since the object can be modified directly either way, and an exception can still be thrown too, so a filter is not needed. This would also be safer, as a filter would allow to exchange the entire object which, despite me having suggested that earlier in favor of unset(), also has drawbacks as normally people shouldn't be doing that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no set() method either. So you'd have to do the same:

function ( $url_metric ) {
    $data = $url_metric->jsonSerialize();
    $data['prop'] = 'sanitized-value';
    $new_url_metric = new OD_Strict_URL_Metric( $data );
    return $new_url_metric;
}

If we were to support set() then this would bring back concerns about cache invalidation for the group and the collection, and it would be confusing if a user tries to call set() on a URL Metric from a collection in a tag visitor, as obviously that would not be right since it's only relevant here.

So I think it's preferable for the OD_URL_Metric object to be immutable, siding with you for suggesting that we don't allow unset().

Copy link
Member

@felixarntz felixarntz Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filters are the middleware.

Not in the sense that I meant it, they don't allow controlling what you can modify. With an array you can do whatever you want. If we later renamed a property and handle it backwards compatibly, or wanted to trigger a warning if someone changes a specific property, things like that, we can't do that if we use an array - but we can if we use a class instance.

I really think this is too much. I think it is somewhat over-engineered.

Can you clarify why? What's "WordPressy" is a bit subjective, and FWIW some parts of WordPress are not necessarily examples for good engineering.

If we want something more structured in the future I suggest deprecating this filter to instead go with the replacement.

That's a short-term option, though it would not be a good long-term solution IMO per my previous comment. Since this is in the milestones due tomorrow, I'd be okay going with it for now, but it's really not great to ship new API surfaces without sufficient consideration just because we want to ship something fast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we later renamed a property and handle it backwards compatibly, or wanted to trigger a warning if someone changes a specific property, things like that, we can't do that if we use an array - but we can if we use a class instance.

That's a good point.

Can you clarify why?

It's subjective 😀

What's "WordPressy" is a bit subjective

I'm not sure its subjective. WordPressy is what aligns with previous established patterns in WordPress.

FWIW some parts of WordPress are not necessarily examples for good engineering.

This is true.


I think there is another way to go about this with an existing filter in the REST API without introducing a new sanitization/validation mechanism: the rest_request_brefore_callbacks filter. I've never used this filter before, but it is billed as being specifically for performing additional validation. I'm investigating.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, check this out: d4c9f40

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, if we can implement this without a new API surface in OD entirely, that would simplify things, at least for now :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that looks great and elegantly avoids expanding the new API. So obviously this is a workaround now, but it gives us time to think whether and how we would want to make this more intuitively possible.

This is using get_json_params() instead of get_params() because any URL query parameter would then be included among the params, so by just getting the params from the JSON body we avoid potential failures due to additional query parameters being injected into the request URL.
Copy link
Member

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the iterations @westonruter!

Comment on lines 291 to 292
// The strtolower() is due to \WP_REST_Server::match_request_to_handler() using case-insensitive pattern match.
OD_REST_API_NAMESPACE . OD_URL_METRICS_ROUTE !== trim( strtolower( trim( $request->get_route(), '/' ) ) )
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By chance I had the thought of checking whether the route was case sensitive. It wasn't. So this would have been a vulnerability. I want to double check how $request->get_route() gets populated to see if there could potentially be any other ways to make a request to the endpoint and yet bypass this condition here.

Copy link
Member Author

@westonruter westonruter Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I found that if you tried making a request without pretty permalinks and added %0A to the end of the rest_route query parameter:

http://localhost:8888/index.php?rest_route=/optimization-detective/v1/url-metrics:store%0A

Then the route is parsed as "/optimization-detective/v1/url-metrics:store\n" and this matches here in \WP_REST_Server::match_request_to_handler() on this line:

$match = preg_match( '@^' . $route . '$@i', $path, $matches );

This is because $ matches a newline. In reality this should be using \z instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test case to explicitly check for this: 5ab7fd1

@westonruter westonruter force-pushed the add/external-bg-preload-validation branch from b31d894 to 5ab7fd1 Compare December 17, 2024 19:18
@westonruter westonruter merged commit f5f50f9 into trunk Dec 17, 2024
15 checks passed
@westonruter westonruter deleted the add/external-bg-preload-validation branch December 17, 2024 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Plugin] Image Prioritizer Issues for the Image Prioritizer plugin (dependent on Optimization Detective) [Plugin] Optimization Detective Issues for the Optimization Detective plugin [Type] Enhancement A suggestion for improvement of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Image Prioritizer should be able to store the LCP (background) image URL for prioritization
4 participants