Importing RDF data into Neo4j

The previous blog post might have been a bit too dense to start with, so I’ll try something a bit lighter this time like importing RDF data into Neo4j. It asumes, however, a certain degree of familiarity with both RDF and graph databases.

There are a number of RDF datasets out there that you may be aware of and you may have asked yourself at some point: “if RDF is a graph, then it should be easy to load it into a graph database like Neo4j, right?”. Well, the RDF model and the property graph model (implemented by Neo4j) are both graph models but with some important differences that I wont go over in this post. What I’ll do though, is describe one possible way of migrating data from an RDF graph into Neo4j’s property graph database.

I’ve also implemented this approach as a Neo4j stored procedure, so if you’re less interested in the concept and just want to see how to use the procedure you can go straight to the last section. Give it a try and share your experience, please.

The mapping

The first thing to do is plan a way to map both models. Here is my proposal.

An RDF graph is a set of tiples or statements (subject,predicate,object) where both the subject and the predicate are resources and the object can be either another resource or a literal. The only particularity about literals is that they cannot be the subject of other statements. In a tree structure we would call them leaf nodes. Also keep in mind that resources are uniquely identified by URIs.
Rule1: Subjects of triples are mapped to nodes  in Neo4j. A node in Neo4j representing an RDF resource will be labeled :Resource and have a property uri with the resource’s URI.
(S,P,O) => (:Resource {uri:S})...
Rule2a: Predicates of triples are mapped to node properties in Neo4j if the object of the triple is a literal
(S,P,O) && isLiteral(O) => (:Resource {uri:S, P:O})
Rule 2b: Predicates of triples are mapped to relationships in Neo4j if the object of the triple is a resource
(S,P,O) && !isLiteral(O) => (:Resource {uri:S})-[:P]->(:Resource {uri:O})
Let’s look at an example: Here is a short RDF fragment from the RDF Primer by the W3C that describes a web page and links it to its author. The triples are the following:
ex:index.html   dc:creator              exstaff:85740 .
ex:index.html   exterms:creation-date   "August 16, 1999" .
ex:index.html   dc:language             "en" .
The URIs of the resources are shortened by using the xml namespace mechanism. In this example, ex stands for http://www.example.org/, exterms stands for http://www.example.org/terms/, exstaff stands for http://www.example.org/staffid/  and dc stands for http://purl.org/dc/elements/1.1/
The full URIs are shown in the graphical representation of the triples (the figure is taken from the W3C page).
threetriples
If we iterate over this set of triples applying the  three rules defined before, we would get the following elements in a Neo4j property graph. I’ll use Cypher to describe them.
The application of rules 1 and 2b to the first triple would produce:
(:Resource { uri:"ex:index.html"})-[:`dc:creator`]->(:Resource { uri:"exstaff:85740"})
The second triple is transformed using rules 1 and 2a:
(:Resource { uri:"ex:index.html", `exterms:creation-date`: "August 16, 1999"})
And finally the third triple is transformed also with rules 1 and 2a producing:
(:Resource { uri:"ex:index.html", `dc:language`: "dc"})

Categories

The proposed set of basic mapping rules can be improved by adding one obvious exception for categories. RDF can represent both data and metadata as triples in the same graph and one of the most common uses of this is to categorise resources by linking them to classes through an instance-of style relationships (called rdf:type). So let’s add a new rule to deal with this case.
rule3: The rdf:type statements are mapped to categories in Neo4j.
(Something ,rdf:type, Category) => (:Category {uri:Something})
The rule basically maps the way individual resources (data) are linked to classes (metadata) in RDF through the rdf:type predicate to the way you categorise nodes in Neo4j i.e. by using labels.
This has also the advantage of removing dense nodes that aren’t particularly nice to deal with for any database. Rather than having a few million nodes representing people in your graph all of them connected to a single Person class node, we will have them all labeled as :Person which makes a lot more sense and there is no semantic loss.

The naming of things

Resources in RDF are identified by URIs which makes them unique, and thats great, but they are meant to be machine readable rather than nice to the human eye. So even though you’d like to read ‘Person’, RDF will use http://vocabularies.com/socialNetowrk/voc#Person (for example). While these kind of names can be used in Neo4j with no problem, they will make your labels and property names horribly long and hard to read and your Cypher queries will be polluted with http://… making the logic harder to grasp.
So what can we do? We have two options: 1) leave things named just as they are in the RDF model, with full URIS, and just deal with it in your queries. This would be the right thing to do if your data uses multiple schemas not necessarily under your control and/or more schemas can be added dynamically. Option 2) would be to make the pragmatic decision of shortening names to make both the model and the queries more readable. This will require some governance to ensure there are no name clashes. Probably a reasonable thing to do if you are migrating into Neo4j data from an RDF graph where you are the owner of the vocabularies being used or at least you have control over what schemas are used.
The initial version of the importRDF stored procedure supports both approaches as we will see in the final sections.

Datatypes in RDF literals

Literals can have data types associated in RDF by by pairing a string with a URI that identifies a particular XSD datatype.

exstaff:85740  exterms:age  "27"^^xsd:integer .

As part of the import process you may want to map the XSD datatype used in a triple to one of Neo4j’s datatypes. If datatypes are not explicitly declared in your RDF data you can always just load all literals as Strings and then cast them if needed at query time or through some batch post-import processing.

Blank nodes

The building block of the RDF model is the triple and this implies an atomic decomposition of your data in individual statements. However -and I quote here the W3C’s RDF Primer again- most real-world  data involves structures that are more complicated than that and the way to model structured information is by linking the different components to an aggregator resource. These aggregator resources may never need to be referred to directly, and hence may not require universal identifiers (URIs). Blank nodes are the artefacts in RDF that fulfil this requirement of representing anonymous resources. Triple stores will give them some sort of graph store local unique ID for the purposes of keeping unicity and avoiding clashes.

Our RDF importer will label blank nodes as BNode, and resources identified with URIs as URI, however, it’s important to keep in mind that if you bring data into Neo4j from multiple RDF graphs, identifiers of blank nodes are not guaranteed to be unique and unexpected clashes may occur so extra controls may be required.

The importRDF stored procedure

As I mentioned at the beginning of the post, I’ve implemented these ideas in the form of a Neo4j stored procedure. The usage is pretty simple. It takes four arguments as input.

  • The url of the RDF data to import.
  • The type of serialization used. The most frequent serializations for RDF are JSON-LD, Turtle, RDF/XML, N-Triples and  TriG. There are a couple more but these are the ones accepted by the stored proc for now.
  • A boolean indicating whether we want the names of labels, properties and relationships shortened as described in the “naming of things” section.
  • The periodicity of the commits. Number of triples ingested after which a commit is run.
CALL semantics.importRDF("file:///Users/jbarrasa/Downloads/opentox-example.turtle","Turtle", false, 500)

Will produce the following output:

Screen Shot 2016-06-08 at 23.39.43

UPDATE [16-Nov-2016] The stored procedure has been evolving over the last few months and the signature has changed. It takes now an extra boolean argument indicating whether category optimisation (Rule3) is applied or not. I expect the code to keep evolving so take this post as an introduction to the approach and look for the latest on the implementation in the github repo README.

The URL can point at a local RDF file, like in the previous example or to one accessible via HTTP. The next example loads a public dataset with 3.5 million triples on food products, their ingredients, allergens, nutrition facts and much more from Open Food Facts.

CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML", true, 25000)

On my laptop the whole import took just over 4 minutes to produce this output.

Screen Shot 2016-06-09 at 00.45.38

 

When shortening of names is selected, the list of prefix being used is included in the import summary. If you want to give it a try don’t forget to create the following indexes beforehand, otherwise the stored procedure will abort the import and will remind you:

CREATE INDEX ON :Resource(uri) 
CREATE INDEX ON :URI(uri)
CREATE INDEX ON :BNode(uri) 
CREATE INDEX ON :Class(uri)

Once imported, I can find straight away what’s the set of shared ingredients between your Kellogg’s Coco Pops cereals and a bag of pork pies that you can buy at your local Spar.

Screen Shot 2016-06-08 at 23.57.51

 

Below is the cypher query that produces these results. Notice how the urls have been shortened but unicity of names is preserved by prefixing them with a namespace prefix.

MATCH (prod1:Resource { uri: 'http://world-fr.openfoodfacts.org/produit/9310055537194/coco-pops-kellogg-s'})
MATCH (prod2:ns3_FoodProduct { ns3_name : '2 Snack Pork Pies'})
MATCH (prod1)-[:ns3_containsIngredient]->(x1)-[:ns3_food]->(sharedIngredient)<-[:ns3_food]-(x2)<-[:ns3_containsIngredient]-(prod2)
RETURN prod1, prod2, x1, x2, sharedIngredient

I’ve intentionally written the two MATCH blocks for the two products in different ways, one identifying the product by its unique identifier (URI) and the other combining the category and the name.

A couple of open points

There are a couple of thing that I have not explored in this post and that the current implementation of the RDF importer does not deal with.

Mutltivalued properties

The current implementation does not deal with multivalued properties, although an obvious implementation could be to use arrays of values for this.

And the metadata?

This works great for instance data, but there is a little detail to take into account: An RDF graph can contain metadata statements. This means that you can find in the same graph (JB, rdf:type, Person) and (Person, rdf:type, owl:Class) and even (rdf:type, rdf:type, refs:Property). The post on Building a semantic graph in Neo4j gives some ideas on how to deal with RDF metadata but this is a very interesting topic and I’ll be coming back to it in future posts.

Conclusions

Migrating data from an RDF graph into a property graph like the one implemented by Neo4j can be done in a generic and relatively straightforward way as we’ve seen. This is interesting because it gives an automated way of importing your existing RDF graphs (regardless of your serialization: JSON-LD, RDF/XML, Turtle, etc.) into Neo4j without loss of its graph nature and without having to go through any intermediate flattening step.

The import process being totally generic results in a graph in Neo4j that of course inherits the modelling limitations of RDF like the lack of support for attributes on relationships so you will probably want to enrich / fix your raw graph once it’s been loaded in Neo4j. Both potential improvements to the import process and post-import graph processing will be discussed in future posts. Watch this space.

Advertisements