"noindex" URL are listed in sitemap #82

Open
stephanros opened this issue Mar 14, 2019 · 12 comments

Comments

@stephanros

My sitemap contains a lot of URLs that have a "noindex" meta tag,
so Webmaster Tools sends me an alert.

@stephanros stephanros changed the title "noindex" URL are listed "noindex" URL are listed in sitemap Mar 14, 2019
@stephanros
Author

Hi,

Here is an example to explain my issue:
In my sitemap (https://buen-polvo.es/sitemap.xml) you can see URLs like "https://buen-polvo.es/miembro_motera1_2408242.html".
But if you open the source code of this URL, you can see a meta tag for bots with "noindex".

This is inconsistent for Google: the sitemap asks Googlebot to index this URL, but when Googlebot analyses the URL it sees a noindex, so it can't index it.

I hope it's clearer now.
Regards

@vezaynk
Owner

vezaynk commented Mar 17, 2019

Thanks, this is very helpful. I'm really short on time lately and I'm unlikely to be able to address this until late April.

Hopefully you'll manage until then.

@stephanros
Author

Thanks. I'll try to manage while waiting for your update.

Regards.

@stephanros
Author

Hi,

On my side, I created a script that cleans out the bad URLs, but it's very slow, so I never update my sitemap.
Have you had time to look at this noindex problem?

Best regards.

@vezaynk
Owner

vezaynk commented Apr 17, 2019

I'll get on it in a few days.

@vezaynk
Owner

vezaynk commented Apr 26, 2019

Aaaand I'm done with final exams.
Expect the patch this week 😎

@stephanros
Author

Greaaaat 👍
I look forward to testing it :)

@stephanros
Author

I'm sorry, but I can't find your patch.

@vezaynk
Owner

vezaynk commented May 22, 2019

It's a work in progress. I thought this would be easier. A major problem is that with links I only need to match a single attribute (href), whereas with meta tags I need to match both the name and the content. It's tricky to get right.
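
To illustrate why both attributes matter, here is a minimal sketch in Python (not the project's PHP code) that uses the standard library's `html.parser` to read a robots meta tag. The names `RobotsMetaParser` and `has_noindex` are hypothetical, and the sketch assumes the directive appears in a `<meta name="robots" content="...">` tag rather than in an HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        # The name attribute must be "robots"; content alone is not enough.
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """Return True if any robots meta tag in the page carries noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Because the parser hands over attributes as a list of pairs, attribute order, quoting style, and letter case no longer matter, which is the part that is awkward to express in a single regular expression.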

@vezaynk
Owner

vezaynk commented May 22, 2019

A cheap fix that you can apply yourself is to simply check whether the meta tag string is present in the HTML by hard-coding the check here: https://github.com/knyzorg/Sitemap-Generator-Crawler/blob/0b89cd5f53b02472d33131a2ebb62396003bf8df/sitemap.functions.php#L367

But my regular expression skills are somewhat rusty and regular expressions were never meant to parse html.
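
As a rough idea of what such a hard-coded check could look like, here is a sketch in Python rather than the project's PHP. The function name `cheap_noindex_check` and the default `needle` string are assumptions for illustration; the exact tag used on the affected site is not shown in this thread, so the needle would have to be adjusted to match it.

```python
import re


def cheap_noindex_check(html, needle='<meta name="robots" content="noindex'):
    """Return True if the hard-coded meta tag prefix appears in the page.

    Whitespace is collapsed and case is ignored, but nothing else is
    normalised: a tag with the attributes in another order, or with
    single quotes, will not be detected.
    """
    normalised = re.sub(r"\s+", " ", html).lower()
    return needle.lower() in normalised
```

This is deliberately fragile. It only works when every page on the site emits the tag in the same form, which is why it is a stopgap and not a general fix.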

The entire project was written back when PHP installations had finicky support for parsing HTML natively, and should have become unnecessary with the release of PHP7... yet here we are. I will eventually rewrite it as a binary with a proper HTML parser and deprecate the project.

@vezaynk
Owner

vezaynk commented Jun 27, 2019

@wcmohler is working on it in #83.

@mylselgan

@knyzorg
Pull request #83 works as expected, but it should still follow the links from noindex pages and add them to the sitemap.xml file.

Example:
Page A has a "noindex" meta tag
Page A links to Page B and Page C
Page B and Page C don't have a "noindex" meta tag

Result: Page A should be omitted, but Page B and Page C should be added to the sitemap.xml file.
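
The requested behaviour can be sketched as a breadth-first crawl in which the noindex flag only decides whether a page is listed, not whether its links are followed. This is an illustration in Python, not the code from pull request #83. The function `build_sitemap` and the in-memory `pages` mapping are hypothetical stand-ins for fetching and parsing real pages, and a "nofollow" directive is not considered here.

```python
from collections import deque


def build_sitemap(start, pages):
    """Crawl from start and return the URLs that belong in the sitemap.

    pages maps each URL to a (has_noindex, linked_urls) pair and stands
    in for downloading the page and extracting its meta tags and links.
    """
    seen = {start}
    queue = deque([start])
    sitemap = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # unknown or unreachable page
        has_noindex, links = pages[url]
        # noindex only controls listing the page itself ...
        if not has_noindex:
            sitemap.append(url)
        # ... its outgoing links are still queued for crawling.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sitemap
```

With the example above, `build_sitemap("A", {"A": (True, ["B", "C"]), "B": (False, []), "C": (False, [])})` visits Page A, skips listing it, and still reaches and lists Page B and Page C.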

3 participants