Entity escaping is missing #42

Open
ghost opened this issue Sep 9, 2017 · 5 comments

@ghost

ghost commented Sep 9, 2017

Currently only the ampersand (&) is entity-escaped (as &amp;). The sitemap specification also requires the single quote, double quote, greater-than sign, and less-than sign to be entity-escaped:

Ampersand	&	&amp;
Single Quote	'	&apos;
Double Quote	"	&quot;
Greater Than	>	&gt;
Less Than	<	&lt;

This should be done to all the strings that are written into sitemap.xml.
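
A minimal sketch of that escaping step, in Python for illustration only (the name `escape_for_sitemap` is hypothetical and not part of this project). It assumes the standard library's `xml.sax.saxutils.escape`, which handles `&`, `<` and `>` by default and accepts the two quote characters as extra entities:

```python
from xml.sax.saxutils import escape

# escape() already converts &, < and >; the two quote characters are
# passed as extra entities so all five characters from the table are covered.
_EXTRA_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_sitemap(text: str) -> str:
    """Entity-escape a string just before it is written into sitemap.xml."""
    return escape(text, _EXTRA_ENTITIES)


if __name__ == "__main__":
    print(escape_for_sitemap("http://example.com/?a=1&b='x'"))
```

The ampersand is replaced first, so the `&` characters introduced by the other entities are not escaped a second time.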

@vezaynk
Owner

vezaynk commented Sep 9, 2017

One would wonder why somebody would put those into a href...

Should be corrected regardless.

@studiosi

The ampersand separates GET parameters, so it is definitely something that can appear in an href.

@ghost
Author

ghost commented Sep 13, 2017

Here are the exact rules: https://www.sitemaps.org/protocol.html#escaping

Everything has to be entity-escaped, and URLs additionally have to be URL-escaped and encoded. The rules are very clear.
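
As a rough illustration of the two steps (URL-escape first, then entity-escape), here is a hedged Python sketch. The helper name `sitemap_loc`, the choice of "safe" characters, and the sample URL are assumptions made for this example, not something taken from the project:

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Assumption: keep the RFC 3986 reserved characters as they are, plus "%"
# so that percent-escapes already present in the URL are not escaped twice.
_URL_SAFE = ":/?#[]@!$&'()*+,;=%"


def sitemap_loc(url: str) -> str:
    """Return a URL that is URL-escaped (UTF-8) and then entity-escaped."""
    url_escaped = quote(url, safe=_URL_SAFE)
    return escape(url_escaped, {"'": "&apos;", '"': "&quot;"})


if __name__ == "__main__":
    print(sitemap_loc("http://www.example.com/ümlat.php&q=name"))
```

The order matters: entity escaping runs last, so the `&` in the result becomes `&amp;` while the percent-escapes produced by the first step are left alone.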

@vezaynk
Owner

vezaynk commented Sep 13, 2017

I'm toying with the idea of parsing properly encoded hrefs, letting cURL handle the weirdness, and encoding it all right before inserting it.

@ghost
Author

ghost commented Sep 13, 2017

If you encounter encoded and/or escaped URLs, you should decode and unescape them before adding them to the crawl list.

All text should be encoded when it's written to the sitemap, but not before. Otherwise you will lose the uniqueness of URLs if you start encoding them while crawling. The encoding is needed only in the sitemap and for the sitemap.
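
To make the uniqueness point concrete, here is a small Python sketch under stated assumptions: `normalize_href` is a hypothetical helper, and it simply undoes entity escaping and percent-encoding with the standard library before the URL goes into a set that stands in for the crawl list:

```python
from html import unescape
from urllib.parse import unquote


def normalize_href(href: str) -> str:
    """Undo entity escaping, then percent-encoding, before queueing a URL."""
    return unquote(unescape(href))


# Four hrefs as they might appear in crawled pages; they name only two pages.
found = [
    "http://example.com/?a=1&amp;b=2",
    "http://example.com/?a=1&b=2",
    "http://example.com/caf%C3%A9",
    "http://example.com/café",
]

# The set stands in for the crawl list: equivalent spellings collapse.
crawl_list = {normalize_href(h) for h in found}

if __name__ == "__main__":
    print(sorted(crawl_list))
```

Decoding everything is a simplification. A sequence such as `%26` or `%2F` changes the meaning of a URL once decoded, so a real crawler may need to be more selective about what it unescapes.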
