Indefinite loop #34

Closed
ghost opened this issue Sep 8, 2017 · 15 comments

Owner

vezaynk commented Sep 8, 2017

Interesting case.

../ and ./ segments are not handled yet. They should be simplified after the relative-to-absolute conversion.


Thyra commented Sep 9, 2017

I believe this is the problem you are talking about: Remove Dot Segments. There is a PHP gist that might be helpful here: https://gist.github.com/rdlowrey/5f56cc540099de9d5006
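
For readers following along, here is a minimal sketch of the "Remove Dot Segments" procedure from RFC 3986, section 5.2.4. It is written in Python purely for illustration; it is not the linked PHP gist and not the code that ended up in the project. The function name `remove_dot_segments` is an assumption made for this example.

```python
def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments in a URL path (RFC 3986, section 5.2.4)."""
    output = []  # completed segments, each kept with its leading "/" if it had one
    while path:
        # Step A: drop a leading "../" or "./"
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: "/./" or a trailing "/." collapses to "/"
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: "/../" or a trailing "/.." collapses to "/" and
        # discards the most recent output segment, if there is one
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a lone "." or ".." is discarded
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (up to, not including, the next "/")
        else:
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g (example from the RFC)
```

As noted above, this would run on the path only after a relative link has been resolved against the page's base URL.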

Owner

vezaynk commented Sep 9, 2017

Great find! Woke up this morning thinking I would have to read RFCs again.

Shouldn't be hard to implement this.

@vezaynk vezaynk closed this as completed in 9d5691f Sep 9, 2017
Owner

vezaynk commented Sep 9, 2017

I think I got it. Hopefully I didn't break anything in the process.

Author

ghost commented Sep 9, 2017

I am not sure about that. In that forum there are only a handful of topics and posts, but the crawler is finding hundreds of URLs. Try and see: https://www.forum.2globalnomads.info/

Owner

vezaynk commented Sep 9, 2017

That's because it's finding a lot of URLs you should have blacklisted, such as posting.php and search.php.

Author

ghost commented Sep 9, 2017

I will let it run to the end to see that it's not in an indefinite loop.

Author

ghost commented Sep 9, 2017

It ran out of memory and crashed before finishing. I am pretty sure there are still problems with PHPBB3 forums.

Author

ghost commented Sep 9, 2017

In the temp sitemap file there were 7565 lines when it crashed. That's hardly possible without duplicates or looping.

Owner

vezaynk commented Sep 9, 2017

I will take a closer look. There are too many safeguards against duplicate links. The issue is something else.

Owner

vezaynk commented Sep 9, 2017

My closer look yielded results. The sid argument is at fault here, mostly. #31 would fix this.

Author

ghost commented Sep 9, 2017

sid (or similar functionality) is pretty common in systems that have session handling. If I am not completely wrong, it should be ignored by default. How about making it something that can be enabled with an option like:
php sitemap.php --enable-sid

Owner

vezaynk commented Sep 9, 2017

Sitemaps don't need sids, I assure you.

The sane default is going to be to ignore all query arguments, with a small number of arguments whitelisted by default.
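
A rough sketch of what such a whitelist could look like, in Python for illustration only. The thread does not say which arguments would be whitelisted, so the `ALLOWED_PARAMS` set and the `filter_query` name are hypothetical; this is not the project's implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: the thread does not name the arguments to keep.
ALLOWED_PARAMS = {"f", "t"}


def filter_query(url: str, allowed=ALLOWED_PARAMS) -> str:
    """Return the URL with every query argument dropped unless it is whitelisted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    # An empty query string makes urlunsplit omit the "?" entirely.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = ("https://www.forum.2globalnomads.info/viewtopic.php"
       "?f=3&t=31&sid=9a0e59593d7f1a504e0556c92f59dc5e&view=print")
print(filter_query(url))
# https://www.forum.2globalnomads.info/viewtopic.php?f=3&t=31
```

With this approach, URLs that differ only in sid or view collapse to one entry, so the existing duplicate checks can catch them.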

Author

ghost commented Sep 9, 2017

Sounds good to me.

Owner

vezaynk commented Sep 9, 2017

For the sake of interest, I analysed the data.

=> cat sitemap.xml.partial | grep "php?f=3" | tee >(wc -l) | cat

<loc>https://www.forum.2globalnomads.info/viewtopic.php?f=3&amp;t=31&amp;sid=9a0e59593d7f1a504e0556c92f59dc5e&amp;view=print</loc>
........
<loc>https://www.forum.2globalnomads.info/viewforum.php?f=3&amp;sid=793f5d0f359deb76f70c4be8e155bd41</loc>
62

The same page was indexed 62 times but with different sids and views.
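
The same kind of check can be sketched in Python: read the `<loc>` entries, drop the sid argument, and count how many entries collapse onto each remaining URL. The helper name `count_without_param` is an assumption for this example, and the second sample line uses a made-up sid.

```python
import html
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

LOC_RE = re.compile(r"<loc>(.*?)</loc>")


def count_without_param(lines, ignored="sid"):
    """Count sitemap entries per URL after removing one query argument."""
    counts = Counter()
    for line in lines:
        for raw in LOC_RE.findall(line):
            # The sitemap escapes "&" as "&amp;", so unescape before parsing.
            parts = urlsplit(html.unescape(raw))
            query = [
                (key, value)
                for key, value in parse_qsl(parts.query, keep_blank_values=True)
                if key != ignored
            ]
            counts[urlunsplit(parts._replace(query=urlencode(query)))] += 1
    return counts


sample = [
    "<loc>https://www.forum.2globalnomads.info/viewforum.php?f=3&amp;sid=793f5d0f359deb76f70c4be8e155bd41</loc>",
    # Made-up sid, for illustration only.
    "<loc>https://www.forum.2globalnomads.info/viewforum.php?f=3&amp;sid=00000000000000000000000000000000</loc>",
]
for url, n in count_without_param(sample).most_common():
    print(n, url)
```

Any URL with a count above 1 was written to the sitemap more than once with a different session ID.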
