Experiments in SPARQL, or how I learned to stop worrying and name the graph. #6

Open
pudo opened this issue Aug 25, 2014 · 7 comments

Comments

@pudo

pudo commented Aug 25, 2014

So I’ve had the worst possible weekend, implementing a version of the grano API that is based on RDF/SPARQL. The RDF tooling for anything other than Java is rotten. If you want to use RDF, I would seriously look at something that runs on the JVM for server-side processing (Clojure, Scala…?).

All of that would be a nice challenge, but the result is incredibly slow: running a simple count query on my network entities on Jena Fuseki now takes 300-400ms, and that’s not even a large dataset (5k entities, something like 3k relationships). This remains pretty much the same if I use an in-memory server. It’s 3 seconds on dydra (the fuck?). I must be doing something seriously wrong, but I can’t figure out what - perhaps it’s related to named graphs.

In any case, I thought you might be interested in playing with the raw data. It's a quarter million quads, modelled along the lines of what we discussed in #2 and #3. Provenance graphs are UUIDs, everything else is in http://example/update-base/default.

@pudo
Author

pudo commented Aug 25, 2014

Here's a sample SPARQL query. It's generated, which gives it these weirdly named labels:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>

SELECT ?root ?status_f66de9cdcc ?schemata_048d504a3d ?hidden_427d6a6016 ?name_fd4d44795e 
?label_4a212b3078 ?_any_d01be58ea7_name ?_any_d01be58ea7_value ?_any_d01be58ea7_graph 
?_any_d01be58ea7_source_url ?id_b79420d01c

WHERE {
  ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> .
  ?root a gd:entities .
  OPTIONAL { ?root gf:status ?status_f66de9cdcc }
  GRAPH ?_any_d01be58ea7_graph { ?root ?_any_d01be58ea7_attr ?_any_d01be58ea7_value }
  ?_any_d01be58ea7_graph gf:isActive true .
  OPTIONAL { ?_any_d01be58ea7_graph dc:source ?_any_d01be58ea7_source_url }
  ?_any_d01be58ea7_attr a gd:attributes .
  ?_any_d01be58ea7_attr dc:identifier ?_any_d01be58ea7_name .
  ?root a ?schemata_048d504a3d .
  ?schemata_048d504a3d a gd:schemata .
  OPTIONAL { ?schemata_048d504a3d gf:isHidden ?hidden_427d6a6016 }
  OPTIONAL { ?schemata_048d504a3d dc:identifier ?name_fd4d44795e }
  OPTIONAL { ?schemata_048d504a3d <http://www.w3.org/2000/01/rdf-schema#label> ?label_4a212b3078 }
  OPTIONAL { ?root gf:id ?id_b79420d01c }

  { SELECT DISTINCT ?root
    WHERE {
      ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> .
      ?root a gd:entities .
      OPTIONAL { ?root gf:status ?status_f66de9cdcc }
      GRAPH ?_any_d01be58ea7_graph { ?root ?_any_d01be58ea7_attr ?_any_d01be58ea7_value }
      ?_any_d01be58ea7_graph gf:isActive true .
      OPTIONAL { ?_any_d01be58ea7_graph dc:source ?_any_d01be58ea7_source_url }
      ?_any_d01be58ea7_attr a gd:attributes .
      ?_any_d01be58ea7_attr dc:identifier ?_any_d01be58ea7_name .
      ?root a ?schemata_048d504a3d .
      ?schemata_048d504a3d a gd:schemata .
      OPTIONAL { ?schemata_048d504a3d gf:isHidden ?hidden_427d6a6016 }
      OPTIONAL { ?schemata_048d504a3d dc:identifier ?name_fd4d44795e }
      OPTIONAL { ?schemata_048d504a3d <http://www.w3.org/2000/01/rdf-schema#label> ?label_4a212b3078 }
      OPTIONAL { ?root gf:id ?id_b79420d01c }
    }
    LIMIT 25 }
}

@jmatsushita
Member

Thanks for sharing your adventures!

I think Jena is not meant for speed. Also, we're definitely reaching the limits of my practical experience! Maybe an index thing? It might be related to named graphs; not all stores are optimised for that. Funnily enough, when looking into this on StackOverflow I found out that Virtuoso's quad store is based on SQL?! http://stackoverflow.com/questions/17719341/difference-between-virtuoso-native-rdf-quad-store-and-virtuoso-sql-based-rdf-tri/17720682#17720682. Also some interesting stuff there:

Benchmark related stuff:

From when I looked, the only really good tooling for RDF was in Ruby (I particularly liked Spira: https://github.com/ruby-rdf/spira). I wouldn't be surprised if stuff starts coming up in the JavaScript arena too. I have an irrational dislike of Java… :)

Maybe @elf-pavlik or @lisp could help with the performance question?

@lisp

lisp commented Aug 25, 2014

i have looked closer at your query. there are two issues.
first, i suspect the query expects the default dataset to include the named graphs.
this is not the case with dydra. in order to apply a query to such a dataset, it should include

from <urn:dydra:all>

to specify that intent.

second, we are working on changes to our control structures, with the unfortunate consequence that, at the moment, caches are disabled and the query set-up time is much higher than it should be. in this case a query (with the inclusive dataset specification) which has an actual execution time under 200ms has a set-up time ten times that.
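To make the first point concrete, here is a toy Python model of a SPARQL dataset. It is only a sketch: the quads, the `ex:` names and the helper functions are invented for illustration, and it assumes (as the opening comment suggests) that every statement was loaded into some named graph, so the default graph itself is empty. It shows why a pattern written outside a GRAPH block can match nothing until the named graphs are merged into the default graph.

```python
# Toy model: each quad is (graph, subject, predicate, object).
# DEFAULT marks the default graph. All data below is made up.
DEFAULT = None

QUADS = [
    ("http://example/update-base/default", "ex:alice", "rdf:type", "gd:entities"),
    ("urn:uuid:g1", "ex:alice", "ex:name", "Alice"),
]


def _fits(triple, pattern):
    """True if every non-None slot of the pattern equals the triple's slot."""
    return all(want is None or want == got for got, want in zip(triple, pattern))


def match_default(quads, s=None, p=None, o=None, union=False):
    """Match a triple pattern written outside any GRAPH block.

    union=False searches only the default graph.
    union=True treats the default graph as the merge of all named graphs,
    which is the effect of an inclusive dataset specification.
    """
    return [
        (qs, qp, qo)
        for g, qs, qp, qo in quads
        if (union or g is DEFAULT) and _fits((qs, qp, qo), (s, p, o))
    ]


def match_graph(quads, s=None, p=None, o=None):
    """Match a pattern inside GRAPH ?g { ... }: named graphs only, keeping ?g."""
    return [
        (g, qs, qp, qo)
        for g, qs, qp, qo in quads
        if g is not DEFAULT and _fits((qs, qp, qo), (s, p, o))
    ]
```

In this model `match_default(QUADS, p="rdf:type")` finds nothing, while the same call with `union=True` finds the entity. Patterns inside a GRAPH block are unaffected either way. Whether a given store defaults to the merged view is store-specific.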

@pudo
Author

pudo commented Aug 25, 2014

@lisp many thanks for that analysis! For your reference, here's the actual COUNT query I was referring to:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT COUNT(DISTINCT(?root))
WHERE { ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> . ?root a gd:entities . GRAPH ?_any_5b726eb44c_graph { ?root ?_any_5b726eb44c_attr ?_any_5b726eb44c_value } ?_any_5b726eb44c_graph gf:isActive true . OPTIONAL { ?_any_5b726eb44c_graph dc:source ?_any_5b726eb44c_source_url } ?_any_5b726eb44c_attr a gd:attributes . ?_any_5b726eb44c_attr dc:identifier ?_any_5b726eb44c_name }

@lisp

lisp commented Aug 25, 2014

On 2014-08-25, at 20:34, Friedrich Lindenberg wrote:

@lisp many thanks for that analysis! For your reference, here's the actual COUNT query I was referring to:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT COUNT(DISTINCT(?root))
WHERE { ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> . ?root a gd:entities . GRAPH ?_any_5b726eb44c_graph { ?root ?_any_5b726eb44c_attr ?_any_5b726eb44c_value } ?_any_5b726eb44c_graph gf:isActive true . OPTIONAL { ?_any_5b726eb44c_graph dc:source ?_any_5b726eb44c_source_url } ?_any_5b726eb44c_attr a gd:attributes . ?_any_5b726eb44c_attr dc:identifier ?_any_5b726eb44c_name }

i expect, this would need to declare the dataset as follows, as it intends to both incorporate the named graphs into the default graph and match each one separately

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT count(*) # COUNT(DISTINCT(?root))
from <urn:dydra:all>
from named <urn:dydra:named>
WHERE {
?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> . # 159
?root a gd:entities . # 31 / 5951
GRAPH ?_any_5b726eb44c_graph { ?root ?_any_5b726eb44c_attr ?_any_5b726eb44c_value }
?_any_5b726eb44c_graph gf:isActive true .
OPTIONAL { ?_any_5b726eb44c_graph dc:source ?_any_5b726eb44c_source_url }
?_any_5b726eb44c_attr a gd:attributes .
?_any_5b726eb44c_attr dc:identifier ?_any_5b726eb44c_name
}

still, i am not clear, what you intend. it looks like you want to restrict the graphs, but somehow that restriction eliminates everything,

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT count(*)
from <urn:dydra:all>
from named <urn:dydra:named>
WHERE {
GRAPH ?_any_d01be58ea7_graph { ?root ?_any_d01be58ea7_attr ?_any_d01be58ea7_value } # 262025
?_any_d01be58ea7_graph gf:isActive true . # 47848
}

in that, for your current dataset, the count here is zero, despite the respective statement pattern cardinality.
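For anyone generating queries in code, as the grano queries above are, the dataset can also be declared outside the query text. The SPARQL 1.1 Protocol defines `default-graph-uri` and `named-graph-uri` request parameters for this. The sketch below only builds such a request URL with the Python standard library. The endpoint URL and the helper name are hypothetical, and whether a particular store accepts the `urn:dydra:*` identifiers through these parameters is an assumption that would need checking against that store.

```python
from urllib.parse import urlencode


def build_query_url(endpoint, query, default_graphs=(), named_graphs=()):
    """Build a SPARQL 1.1 Protocol GET URL with an explicit dataset.

    Each entry in default_graphs becomes one default-graph-uri parameter,
    each entry in named_graphs one named-graph-uri parameter.
    """
    params = [("query", query)]
    params += [("default-graph-uri", g) for g in default_graphs]
    params += [("named-graph-uri", g) for g in named_graphs]
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint; the graph identifiers are the ones used in the
# from / from named clauses above.
url = build_query_url(
    "https://example.org/sparql",
    "SELECT (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } }",
    default_graphs=["urn:dydra:all"],
    named_graphs=["urn:dydra:named"],
)
```

Keeping the dataset out of the generated query string means the generator does not have to splice clauses in front of WHERE, at the cost of relying on the endpoint to honour the protocol parameters.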

@pudo
Author

pudo commented Sep 1, 2014

I've blogged about this whole thing here: http://pudo.org/blog/2014/09/01/grano-linked-data.html

@akuckartz

@pudo Still interested in resolving this issue?
