Indefinite loop #34
Comments
Interesting case. ../ and ./ segments are not handled yet. They should be simplified after the relative-to-absolute conversion.
I believe this is the problem you are talking about: Remove Dot Segments. There is a PHP gist that might be helpful here: https://gist.github.com/rdlowrey/5f56cc540099de9d5006
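For readers following along, here is a minimal Python sketch of the dot-segment removal step, in the spirit of RFC 3986 section 5.2.4. It is not the project's code and not the linked gist. The function names `remove_dot_segments` and `normalize_url` are made up for illustration, and the sketch assumes the path is already absolute (it starts with "/"), which matches the idea of simplifying after the relative-to-absolute conversion.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments in an absolute path (hypothetical helper)."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # "." refers to the current directory; a trailing "." leaves a trailing slash.
            if is_last:
                output.append("")
        elif segment == "..":
            # ".." drops the previous segment, but never the leading "" of the root.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)


def normalize_url(url: str) -> str:
    """Return the URL with dot segments removed from its path only."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))
```

With this sketch, two links that differ only in ../ or ./ segments map to the same string, so a crawler's "already seen" check can catch them.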
Great find! Woke up this morning thinking I would have to read RFCs again. Shouldn't be hard to implement this.
I think I got it. Hopefully I didn't break anything in the process.
I am not sure about that. That forum has only a handful of topics and posts, but the crawler is finding hundreds of URLs. Try it and see: https://www.forum.2globalnomads.info/
That's because it's finding a lot of URLs you should have blacklisted, such as posting.php and search.php.
I will let it run to the end to see that it's not in an indefinite loop.
It ran out of memory and crashed before finishing. I am pretty sure there are still problems with phpBB3 forums.
The temp sitemap file had 7565 lines when it crashed. That is practically impossible without duplicates or looping.
I will take a closer look. There are too many safeguards against duplicate links. The issue is something else.
My closer look yielded results. The
sid (or similar functionality) is pretty common in systems that have session handling. If I am not completely wrong, it should be ignored by default. How about making it something that can be enabled with an option like
Sitemaps don't need sids, I assure you. The sane default is going to be to ignore all query arguments, with a small number of arguments whitelisted by default.
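The "ignore everything, whitelist a few" default could look roughly like the Python sketch below. It is only an illustration, not the project's implementation: `strip_query` and `ALLOWED_PARAMS` are hypothetical names, and the whitelist contents (`f` and `t`, as used by phpBB's viewtopic.php) are an assumption chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: only these query arguments survive.
# Everything else (sid, view, ...) is dropped by default.
ALLOWED_PARAMS = frozenset({"f", "t"})


def strip_query(url: str, allowed=ALLOWED_PARAMS) -> str:
    """Drop every query argument that is not explicitly whitelisted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    # Sort so that the same arguments in a different order give the same URL.
    kept.sort()
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Because a new session id no longer produces a new URL, the crawler's existing duplicate checks would see the page as already visited.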
Sounds good to me.
Out of interest, I analysed the data.
The same page was indexed 62 times, each time with a different sid and view.
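An analysis like this can be reproduced with a short script. The Python sketch below is one possible way, not the method actually used in this thread: it groups crawled URLs by page after dropping the arguments assumed to be irrelevant (`sid` and `view` here, both assumptions) and reports the pages that appear more than once. The names `page_key` and `duplicate_counts` are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these arguments do not change which page is being shown.
IGNORED_PARAMS = frozenset({"sid", "view"})


def page_key(url: str) -> str:
    """Reduce a URL to a key that identifies the underlying page."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit(parts._replace(query=urlencode(kept), fragment=""))


def duplicate_counts(urls) -> dict:
    """Map each page key to its count, keeping only pages seen more than once."""
    counts = Counter(page_key(url) for url in urls)
    return {page: count for page, count in counts.items() if count > 1}
```

Feeding it the URLs from the temp sitemap file would show which pages account for the inflated line count.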
Command:
php sitemap.php file=sitemap.xml site=https://www.forum.2globalnomads.info
Output: