issues to get all and clean ceur dataset #11

Open
liyakun opened this issue Feb 23, 2016 · 1 comment

liyakun commented Feb 23, 2016

  • The following query lists subjects with more than 10 owl:sameAs links:

        SELECT ?s (COUNT(?o) AS ?count)
        WHERE { ?s <http://www.w3.org/2002/07/owl#sameAs> ?o }
        GROUP BY ?s
        HAVING (?count > 10)
  • The only license that we have is
    http://choosealicense.com/licenses/no-license/. However, those volumes
    that are licensed under CC0 should also be annotated with the right
    license.
  • For papers in PDF format the URI generation works fine, e.g.
    paper1.pdf translates to http://ceur-ws.org/Vol-XXX/#paper1. Please
    adapt it in the same way for papers in other formats, e.g. paper1.ps
    (a sketch follows this list).
  • Apparently the extraction tool does not apply normalisation to
    strings. Look at the dc:title of
    http://ceur-ws.org/Vol-100/#Aditya_Kalyanpur-et-al: it contains a lot
    of whitespace, which comes straight from the HTML. This whitespace has
    no meaning; it is just an artifact of how the source code was written.
    In most string literals it should be safe to replace any sequence of
    spaces with a single space (see the sketch after this list).
  • As of RDF 1.1 the default datatype for a literal is xsd:string, so
    the rdf:datatype attribute could be omitted in this case, to save some
    space.
  • Some workshops (at least http://ceur-ws.org/Vol-1513/) do not have
    any papers, which is wrong.
  • Some non-ASCII characters, e.g. in names, are output using combining
    diacritical marks (see e.g. Wikipedia if you are not familiar with
    these). Output should be in NFC, which means, for example, that ñ
    should be output as the single character 'ñ' instead of a combining
    '~' character applied to an 'n' (see the sketch after this list).
  • Papers seem to have a bibo:numPages property but no start and end
    page properties (they should have them).
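
For the URI point above, here is a minimal Python sketch; the function name
paper_uri and the use of os.path.splitext are my assumptions, not the actual
code of this crawler:

    import os
    from urllib.parse import quote

    def paper_uri(volume, filename):
        """Build a paper URI from a file name, independent of its extension,
        so that paper1.pdf and paper1.ps both map to .../#paper1."""
        stem, _ext = os.path.splitext(filename)  # drops .pdf, .ps, .html, ...
        return "http://ceur-ws.org/%s/#%s" % (volume, quote(stem))

    # paper_uri("Vol-100", "paper1.ps")  ->  "http://ceur-ws.org/Vol-100/#paper1"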
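
For the whitespace point, a possible normalisation step (a sketch, assuming
Python; the regex also collapses tabs and newlines coming from the HTML, which
goes slightly beyond plain spaces):

    import re

    def normalise_whitespace(text):
        """Collapse any run of whitespace into a single space and trim."""
        return re.sub(r"\s+", " ", text).strip()

    # normalise_whitespace("Some   title\n   from HTML")  ->  "Some title from HTML"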
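
For the NFC point, Python's standard unicodedata module can do the
recomposition; applying it to every extracted literal is my suggestion,
not something the crawler currently does:

    import unicodedata

    def to_nfc(text):
        """Recompose decomposed characters, e.g. 'n' + combining tilde
        (U+006E U+0303) becomes the single character 'ñ' (U+00F1)."""
        return unicodedata.normalize("NFC", text)

    # to_nfc("n\u0303")  ->  "ñ"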

KMax commented Feb 24, 2016

Hi @liyakun! Thank you for your comments! You're right, there are a lot of issues that should be fixed to improve the crawler. If you have the time and the desire, your help is welcome!

The code in this repo was used to participate in last year's challenge, but if you want to improve the code and participate in the next challenge, feel free to do so. We would only ask you to cite our paper from last year if you are going to reuse our work. Thanks!
