issues to get all and clean ceur dataset #11

Open
liyakun opened this issue Feb 23, 2016 · 1 comment

liyakun commented Feb 23, 2016

  • The following query lists subjects with more than 10 owl:sameAs links:

        SELECT ?s (COUNT(?o) AS ?count)
        WHERE { ?s <http://www.w3.org/2002/07/owl#sameAs> ?o }
        GROUP BY ?s
        HAVING (?count > 10)
  • The only license that we have is
    http://choosealicense.com/licenses/no-license/. However, those volumes
    that are licensed under CC0 should also be annotated with the right
    license.
  • For papers in PDF format the URI generation works fine, e.g.
    paper1.pdf translates to http://ceur-ws.org/Vol-XXX/#paper1. Please
    adapt it in the same way for papers in other formats, e.g. paper1.ps
    (a sketch follows this list).
  • Apparently the extraction tool does not apply normalisation to
    strings. Look at the dc:title of
    http://ceur-ws.org/Vol-100/#Aditya_Kalyanpur-et-al: it contains a lot
    of whitespace, which comes straight from the HTML. This whitespace has
    no meaning; it is just an artifact of how the source code was written.
    In most string literals it should be safe to replace any sequence of
    spaces with a single space (see the sketch after this list).
  • As of RDF 1.1 the default datatype for a literal is xsd:string, so
    the rdf:datatype attribute could be omitted in this case, to save some
    space.
  • Some workshops (at least http://ceur-ws.org/Vol-1513/) do not have
    any papers, which is wrong.
  • Some non-ASCII characters, e.g. in names, are output using combining
    diacritical marks (see e.g. Wikipedia if you are not familiar with
    these). Output should be in NFC, which means, for example, that ñ
    should be output as the single character 'ñ' instead of a combining
    '~' character applied to an 'n' (see the sketch after this list).
  • Papers seem to have a bibo:numPages property but no start and end
    page properties (they should have them).
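
For the URI point above, here is a minimal Python sketch; the function name
paper_uri and the use of os.path.splitext are my assumptions, not the actual
code of this crawler:

    import os
    from urllib.parse import quote

    def paper_uri(volume, filename):
        """Build a paper URI from a file name, independent of its extension,
        so that paper1.pdf and paper1.ps both map to .../#paper1."""
        stem, _ext = os.path.splitext(filename)  # drops .pdf, .ps, .html, ...
        return "http://ceur-ws.org/%s/#%s" % (volume, quote(stem))

    # paper_uri("Vol-100", "paper1.ps")  ->  "http://ceur-ws.org/Vol-100/#paper1"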
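
For the whitespace point, a possible normalisation step (a sketch, assuming
Python; the regex also collapses tabs and newlines coming from the HTML, which
goes slightly beyond plain spaces):

    import re

    def normalise_whitespace(text):
        """Collapse any run of whitespace into a single space and trim."""
        return re.sub(r"\s+", " ", text).strip()

    # normalise_whitespace("Some   title\n   from HTML")  ->  "Some title from HTML"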
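
For the NFC point, Python's standard unicodedata module can do the
recomposition; applying it to every extracted literal is my suggestion,
not something the crawler currently does:

    import unicodedata

    def to_nfc(text):
        """Recompose decomposed characters, e.g. 'n' + combining tilde
        (U+006E U+0303) becomes the single character 'ñ' (U+00F1)."""
        return unicodedata.normalize("NFC", text)

    # to_nfc("n\u0303")  ->  "ñ"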

KMax commented Feb 24, 2016

Hi @liyakun! Thank you for your comments! You're right, there are a lot of issues that should be fixed to improve the crawler. If you have the time and the desire, your help is welcome!

The code in this repo was used to participate in last year's challenge, but if you want to improve the code and participate in the next challenge, feel free to do so. We would only ask you to cite our paper from last year if you are going to reuse our work. Thanks!
