bug in text directive #91

Open
Anniepoo opened this issue Feb 16, 2022 · 1 comment

Anniepoo commented Feb 16, 2022

This is a bit of Wikipedia scraper code. It grabs the country name as the inner HTML of the a tag.

:- module(datascoop, [test/1]).


:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
:- use_module(library(xpath)).
:- use_module(library(sgml_write)).

test_url('https://en.wikipedia.org/wiki/Visa_requirements_for_Yemeni_citizens').

test(row(
         WIKIREF,
         Country

     )) :-
    test_url(URL),
    load_html(URL, DOM, []),
    !,
    xpath(DOM, //div(h2(index(I), span(@id(string)="Visa_requirements_map"))), Div),
    succ(I, TI),
    xpath(Div, table(@class='sortable wikitable', index(TI)), Table),
    findall(Head, xpath(Table, /self(tbody/tr/th(text(atom)=Head)), _),
           [ 'Country',
             'Visa requirement',
             'Allowed stay',
             'Notes (excluding departure fees)'
           ]),
    !,
    xpath(Table, tbody/tr, Row),
    xpath(Row, td(index(1))/a(@href), WIKIREF),
% If you change normalize_space to text here, Country is bound to "\240\Afghanistan".
% With normalize_space it is "Afghanistan".
    xpath(Row, td(index(1), normalize_space), Country).

@JanWielemaker

Why is this a bug? \240\ is a non-breaking space, which you should get if you ask for text and not if you normalize spaces (normalizing removes white space from both ends and uses a single space for each internal sequence).
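
For anyone who wants the raw text but without the leading non-breaking space, here is a minimal sketch that works on the extracted text itself rather than on the xpath call. Both `strip_nbsp/2` and `leading_code/2` are hypothetical helpers, not part of library(xpath). The sketch assumes the only padding to drop is U+00A0 plus ordinary blanks, and it uses split_string/4 with an empty separator set so that only the two ends are stripped.

```prolog
% strip_nbsp(+Text, -Stripped)
% Hypothetical helper (not part of library(xpath)): remove non-breaking
% spaces (U+00A0, written \240\ in octal) and ordinary white space from
% both ends of Text. Internal characters are left untouched.
% Stripped is always a string, whatever text type Text is.
strip_nbsp(Text, Stripped) :-
    split_string(Text, "", "\u00A0 \t\n", [Stripped]).

% leading_code(+Text, -Code)
% Hypothetical helper: Code is the character code of the first
% character of Text. Handy for checking what the "odd" prefix really is.
leading_code(Text, Code) :-
    atom_codes(Text, [Code|_]).

% ?- leading_code('\240\Afghanistan', C).
% C = 160.
%
% ?- strip_nbsp('\240\Afghanistan', S).
% S = "Afghanistan".
```

Code 160 is U+00A0, so the prefix seen with text is a real character in the page and not a decoding error.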
