Entity escaping is missing #42
Comments
One would wonder why somebody would put those into an href... Should be corrected regardless.
The ampersand is used for GET parameters, so it is definitely something that can appear in an href.
Here are the exact rules: https://www.sitemaps.org/protocol.html#escaping Every data value has to be entity escaped, and URLs have to be URL-escaped and encoded. The rules are very clear.
I'm toying with the idea of parsing properly-encoded hrefs, letting cURL handle the weirdness, and encoding it all right before inserting it.
If you encounter encoded and/or escaped URLs, you should decode and unescape them before adding them to the crawl list. All text should be encoded when it is written to the sitemap, but not before. Otherwise you will lose the uniqueness of URLs if you start encoding them while crawling. The encoding is needed only in the sitemap and for the sitemap.
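To illustrate the decode-while-crawling, encode-on-write idea, here is a minimal Python sketch. It is not this project's code: `normalize_href` and `sitemap_loc` are hypothetical names, the example hrefs are made up, and blindly percent-decoding every URL is a simplification (it can change the meaning of URLs that contain encoded reserved characters such as `%2F`).

```python
from html import unescape
from urllib.parse import quote, unquote, urlsplit, urlunsplit
from xml.sax.saxutils import escape


def normalize_href(href: str) -> str:
    """Undo HTML entity escaping and percent-encoding (hypothetical helper).

    The decoded form is what goes into the crawl list, so that two
    spellings of the same URL compare equal.
    """
    return unquote(unescape(href))


def sitemap_loc(url: str) -> str:
    """Percent-encode, then entity-escape, a decoded URL for <loc>.

    Assumption: only path, query and fragment are encoded; non-ASCII
    host names are not handled in this sketch.
    """
    parts = urlsplit(url)
    encoded = urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/"),
        quote(parts.query, safe="=&"),
        quote(parts.fragment),
    ))
    # escape() handles &, < and > by default; quotes are added explicitly.
    return escape(encoded, {"'": "&apos;", '"': "&quot;"})


# Four hrefs as they might appear in HTML; they name only two distinct URLs.
seen = set()
for href in ["/a?x=1&amp;y=2", "/a?x=1&y=2", "/caf%C3%A9", "/café"]:
    seen.add(normalize_href(href))

print(sorted(seen))
print(sitemap_loc("http://example.com/café?x=1&y=2"))
```

Because the crawl list stores only decoded URLs, duplicates collapse before any request is made, and encoding happens exactly once, at write time.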
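Currently only the ampersand (&) is entity escaped (&amp;). The sitemap specification also requires the single quote ('), double quote ("), greater-than (>) and less-than (<) characters to be entity escaped, as &apos;, &quot;, &gt; and &lt; respectively.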
This should be done to all the strings that are written into sitemap.xml.
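As a rough illustration of the requested behaviour (a Python sketch, not the project's actual implementation), all five characters can be escaped in a single pass. The names `XML_ENTITIES`, `entity_escape` and `url_entry` are hypothetical.

```python
# Translation table for the five characters the sitemap protocol
# requires to be entity escaped.
XML_ENTITIES = str.maketrans({
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
    ">": "&gt;",
    "<": "&lt;",
})


def entity_escape(text: str) -> str:
    """Entity-escape a string in one pass (hypothetical helper)."""
    return text.translate(XML_ENTITIES)


def url_entry(loc: str) -> str:
    """Build a <url> element; every string written to the file is escaped."""
    return f"<url><loc>{entity_escape(loc)}</loc></url>"


print(url_entry("http://example.com/?a=1&b='x'"))
```

A single-pass translation avoids the double-escaping that chained replacements can cause, for example when `&` is replaced after `<` has already become `&lt;`.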