There are still some irrelevant triples. Only triples with a subject
below http://ceur-ws.org/ should end up in our dataset. If we need to
define CEUR-specific vocabulary terms, then we should define them in the http://ceur-ws.org/vocab/ namespace.
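As a rough illustration of the subject filter described above, here is a minimal sketch in Python. It assumes triples are available as plain `(subject, predicate, object)` string tuples; the name `keep_ceur_triples` and the sample data are hypothetical and not taken from the crawler's code.

```python
CEUR_PREFIX = "http://ceur-ws.org/"


def keep_ceur_triples(triples):
    """Keep only triples whose subject URI lies below the CEUR-WS prefix."""
    return [t for t in triples if t[0].startswith(CEUR_PREFIX)]


# Hypothetical sample data: one relevant and one irrelevant triple.
triples = [
    ("http://ceur-ws.org/Vol-100/", "http://purl.org/dc/elements/1.1/title", "Example"),
    ("http://example.org/other", "http://purl.org/dc/elements/1.1/title", "Unrelated"),
]

print(keep_ceur_triples(triples))
```

A plain prefix test is probably enough here, since the requirement is only about where the subject lives, not about the predicate or object.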
Some persons, e.g. http://ceur-ws.org/resource/person/M.%20Daud%20Ahmed, are owl:sameAs a
lot of other persons with different names. Something must be wrong
here. To see the details, try the following query:
SELECT ?s (COUNT(?o) AS ?count)
WHERE { ?s <http://www.w3.org/2002/07/owl#sameAs> ?o }
GROUP BY ?s
HAVING (COUNT(?o) > 10)
For papers in PDF format the URI generation works fine, e.g.
paper1.pdf translates to http://ceur-ws.org/Vol-XXX/#paper1. Please
adapt it in the same way for papers in other formats, e.g. paper1.ps.
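One way to make the URI generation independent of the file format is to drop whatever extension the file has. The sketch below is only an assumption about how this could look; `paper_uri` is a hypothetical helper, not the crawler's actual function.

```python
import posixpath


def paper_uri(volume_uri, filename):
    """Build a paper URI from the volume URI and the file name without its extension."""
    # splitext removes only the last extension: "paper1.pdf" -> "paper1"
    stem, _ext = posixpath.splitext(posixpath.basename(filename))
    return f"{volume_uri.rstrip('/')}/#{stem}"


print(paper_uri("http://ceur-ws.org/Vol-XXX/", "paper1.pdf"))
print(paper_uri("http://ceur-ws.org/Vol-XXX/", "paper1.ps"))
```

Both calls yield the same fragment `#paper1`. Note that a doubly suffixed name such as `paper1.ps.gz` would keep the `.ps` part with this approach and would need extra handling.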
Apparently the extraction tool does not normalise strings. Look at the
dc:title of http://ceur-ws.org/Vol-100/#Aditya_Kalyanpur-et-al: it contains a lot
of whitespace, which comes straight from the HTML. This whitespace has
no meaning; it is just an artifact of how the source code was written. I
think that in most string literals it is safe to replace any sequence of
whitespace characters with a single space.
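A minimal sketch of such a normalisation step in Python follows. It assumes that tabs and line breaks should be collapsed along with spaces, and that leading and trailing whitespace can be dropped too; `normalise_whitespace` is a hypothetical name.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def normalise_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(normalise_whitespace("  A title\n      spread over\t several   lines  "))
```

This would be applied to literals such as dc:title before they are written out; literals where whitespace is significant (if there are any) would have to be excluded.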
As of RDF 1.1 the default datatype for a literal is xsd:string, so
the rdf:datatype attribute could be omitted in this case, to save some
space.
Some non-ASCII characters, e.g. in names, are output using combining
diacritical marks. Please check e.g. Wikipedia if you don't know what
these are. Output should be in Unicode Normalization Form C (NFC). This
means, e.g., that ñ should be output as the single precomposed character
'ñ' (U+00F1) instead of an 'n' followed by a combining tilde (U+0303).
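In Python, the standard library's `unicodedata.normalize` can do this conversion. The sketch below assumes the normalisation is applied to each string just before output; `to_nfc` is a hypothetical wrapper name.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normalization Form C (precomposed characters)."""
    return unicodedata.normalize("NFC", text)


decomposed = "n\u0303"          # 'n' followed by a combining tilde
composed = to_nfc(decomposed)   # single code point U+00F1

print(len(decomposed), len(composed))
```

The decomposed form has two code points and the composed form has one, although both render as the same glyph.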
Papers seem to have a bibo:numPages property but no start and end
page. (They should!)
Hi @liyakun! Thank you for your comments! You're right, there are a lot of issues that should be fixed to improve the crawler. If you have the time and inclination, your help is welcome!
The code in this repo was used to participate in the challenge last year, but if you want to improve it and participate in the next challenge, feel free to do so. We would only ask that you cite our paper from last year if you reuse our work. Thanks!
http://choosealicense.com/licenses/no-license/. However, those volumes
that are licensed under CC0 should also be annotated with the right
license.
any papers, which is wrong.