Importing RDF data into Neo4j

The previous blog post might have been a bit too dense to start with, so I'll try something a bit lighter this time: importing RDF data into Neo4j. It assumes, however, a certain degree of familiarity with both RDF and graph databases.

There are a number of RDF datasets out there that you may be aware of, and you may have asked yourself at some point: "if RDF is a graph, then it should be easy to load it into a graph database like Neo4j, right?". Well, the RDF model and the property graph model (implemented by Neo4j) are both graph models, but with some important differences that I won't go over in this post. What I'll do, though, is describe one possible way of migrating data from an RDF graph into Neo4j's property graph database.

I've also implemented this approach as a Neo4j stored procedure, so if you're less interested in the concept and just want to see how to use the procedure, you can go straight to the last section. Please give it a try and share your experience.

The mapping

The first thing to do is plan a way to map both models. Here is my proposal.

An RDF graph is a set of triples or statements (subject, predicate, object) where both the subject and the predicate are resources, and the object can be either another resource or a literal. The only particularity of literals is that they cannot be the subject of other statements; in a tree structure we would call them leaf nodes. Also keep in mind that resources are uniquely identified by URIs.
Rule 1: Subjects of triples are mapped to nodes in Neo4j. A node in Neo4j representing an RDF resource will be labeled :Resource and have a property uri with the resource's URI.
(S,P,O) => (:Resource {uri:S})...
Rule 2a: Predicates of triples are mapped to node properties in Neo4j if the object of the triple is a literal.
(S,P,O) && isLiteral(O) => (:Resource {uri:S, P:O})
Rule 2b: Predicates of triples are mapped to relationships in Neo4j if the object of the triple is a resource.
(S,P,O) && !isLiteral(O) => (:Resource {uri:S})-[:P]->(:Resource {uri:O})
Let's look at an example. Here is a short RDF fragment from the RDF Primer by the W3C that describes a web page and links it to its author. The triples are the following:
ex:index.html   dc:creator              exstaff:85740 .
ex:index.html   exterms:creation-date   "August 16, 1999" .
ex:index.html   dc:language             "en" .
The URIs of the resources are shortened by using the XML namespace mechanism. In this example, ex stands for http://www.example.org/, exterms for http://www.example.org/terms/, exstaff for http://www.example.org/staffid/ and dc for http://purl.org/dc/elements/1.1/.
The full URIs are shown in the graphical representation of the triples (the figure is taken from the W3C page).
[Figure: the three triples represented as a graph, from the W3C RDF Primer]
If we iterate over this set of triples applying the three rules defined before, we get the following elements in a Neo4j property graph. I'll use Cypher to describe them.
The application of rules 1 and 2b to the first triple would produce:
(:Resource { uri:"ex:index.html"})-[:`dc:creator`]->(:Resource { uri:"exstaff:85740"})
The second triple is transformed using rules 1 and 2a:
(:Resource { uri:"ex:index.html", `exterms:creation-date`: "August 16, 1999"})
And finally the third triple is transformed also with rules 1 and 2a producing:
(:Resource { uri:"ex:index.html", `dc:language`: "en"})
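To make the mapping concrete, here is a minimal Cypher sketch of how an importer could apply the rules triple by triple; this is just an illustration of the idea, not the actual code of the stored procedure. MERGE on the uri property keeps the process idempotent, so a resource appearing in several triples is created only once.

// rules 1 and 2b: the object is a resource
MERGE (s:Resource { uri: "ex:index.html" })
MERGE (o:Resource { uri: "exstaff:85740" })
MERGE (s)-[:`dc:creator`]->(o)

// rules 1 and 2a: the object is a literal
MERGE (s:Resource { uri: "ex:index.html" })
SET s.`exterms:creation-date` = "August 16, 1999"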

Categories

The proposed set of basic mapping rules can be improved by adding one obvious exception for categories. RDF can represent both data and metadata as triples in the same graph, and one of the most common uses of this is to categorise resources by linking them to classes through an instance-of style relationship (called rdf:type). So let's add a new rule to deal with this case.
Rule 3: The rdf:type statements are mapped to categories in Neo4j.
(Something, rdf:type, Category) => (:Category {uri:Something})
The rule basically maps the way individual resources (data) are linked to classes (metadata) in RDF through the rdf:type predicate to the way you categorise nodes in Neo4j i.e. by using labels.
This also has the advantage of removing dense nodes, which aren't particularly nice to deal with for any database. Rather than having a few million nodes representing people in your graph, all of them connected to a single Person class node, we will have them all labeled as :Person, which makes a lot more sense, and there is no semantic loss.
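The effect is easy to see in a query. With Rule 3 applied, finding all people is a simple label scan, whereas without it every lookup would have to traverse through the dense class node (voc:Person here is a made-up shortened URI, just for illustration):

// with Rule 3: categories are labels
MATCH (p:Person) RETURN count(p)

// without Rule 3: every person hangs off a single dense class node
MATCH (p:Resource)-[:`rdf:type`]->(:Resource { uri: "voc:Person" })
RETURN count(p)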

The naming of things

Resources in RDF are identified by URIs, which makes them unique, and that's great, but they are meant to be machine readable rather than nice to the human eye. So even though you'd like to read 'Person', RDF will use http://vocabularies.com/socialNetwork/voc#Person (for example). While these kinds of names can be used in Neo4j with no problem, they will make your labels and property names horribly long and hard to read, and your Cypher queries will be polluted with http://… making the logic harder to grasp.
So what can we do? We have two options. Option 1: leave things named just as they are in the RDF model, with full URIs, and deal with it in your queries. This would be the right thing to do if your data uses multiple schemas not necessarily under your control, and/or more schemas can be added dynamically. Option 2: make the pragmatic decision of shortening names to make both the model and the queries more readable. This will require some governance to ensure there are no name clashes, but it is probably a reasonable thing to do if you are migrating data into Neo4j from an RDF graph where you own the vocabularies being used, or at least have control over which schemas are used.
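The difference shows up immediately in queries. Here is a sketch of the same lookup under each option (the names below are illustrative only; the actual prefixes are assigned during the import):

// option 1: full URIs, escaped with backticks
MATCH (p:`http://vocabularies.com/socialNetwork/voc#Person`)
RETURN p.`http://vocabularies.com/socialNetwork/voc#name`

// option 2: shortened, namespace-prefixed names
MATCH (p:ns0_Person)
RETURN p.ns0_name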
The initial version of the importRDF stored procedure supports both approaches as we will see in the final sections.

Datatypes in RDF literals

Literals can have data types associated with them in RDF by pairing a string with a URI that identifies a particular XSD datatype:

exstaff:85740  exterms:age  "27"^^xsd:integer .

As part of the import process you may want to map the XSD datatype used in a triple to one of Neo4j's datatypes. If datatypes are not explicitly declared in your RDF data, you can always load all literals as strings and then cast them if needed at query time or through some batch post-import processing.
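For instance, if the age literal above ends up stored as a string, it can be cast on the fly or fixed in place with a one-off update. A sketch, assuming names were kept as-is and using toInt() (renamed toInteger() in later Neo4j versions):

// cast at query time
MATCH (p:Resource { uri: "exstaff:85740" })
RETURN toInt(p.`exterms:age`) AS age

// or fix the stored values in a batch post-import update
MATCH (p:Resource) WHERE exists(p.`exterms:age`)
SET p.`exterms:age` = toInt(p.`exterms:age`)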

Blank nodes

The building block of the RDF model is the triple, and this implies an atomic decomposition of your data into individual statements. However -and I quote here the W3C's RDF Primer again- most real-world data involves structures that are more complicated than that, and the way to model structured information is by linking the different components to an aggregator resource. These aggregator resources may never need to be referred to directly, and hence may not require universal identifiers (URIs). Blank nodes are the artefacts in RDF that fulfil this requirement of representing anonymous resources. Triple stores will give them some sort of graph-store-local unique ID for the purposes of preserving uniqueness and avoiding clashes.

Our RDF importer will label blank nodes as :BNode and resources identified with URIs as :URI. However, it's important to keep in mind that if you bring data into Neo4j from multiple RDF graphs, the identifiers of blank nodes are not guaranteed to be unique, so unexpected clashes may occur and extra controls may be required.
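One simple control would be to namespace the blank node identifiers of each source graph before importing the next one, so that locally scoped IDs cannot collide. A sketch (the "graph1/" prefix is an arbitrary marker chosen for the example):

// prefix the ids of blank nodes from the graph just imported
MATCH (b:BNode)
WHERE NOT b.uri STARTS WITH "graph1/"
SET b.uri = "graph1/" + b.uri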

The importRDF stored procedure

As I mentioned at the beginning of the post, I've implemented these ideas in the form of a Neo4j stored procedure. The usage is pretty simple: it takes four arguments as input.

  • The URL of the RDF data to import.
  • The type of serialization used. The most frequent serializations for RDF are JSON-LD, Turtle, RDF/XML, N-Triples and TriG. There are a couple more, but these are the ones accepted by the stored proc for now.
  • A boolean indicating whether we want the names of labels, properties and relationships shortened as described in the “naming of things” section.
  • The commit frequency: the number of triples ingested after which a commit is run.
CALL semantics.importRDF("file:///Users/jbarrasa/Downloads/opentox-example.turtle","Turtle", false, 500)

will produce the following output:

[Screenshot: import summary output]

UPDATE [16-Nov-2016] The stored procedure has been evolving over the last few months and the signature has changed. It now takes an extra boolean argument indicating whether the category optimisation (Rule 3) is applied or not. I expect the code to keep evolving, so take this post as an introduction to the approach and look for the latest on the implementation in the GitHub repo README.

The URL can point at a local RDF file, as in the previous example, or at one accessible via HTTP. The next example loads a public dataset with 3.5 million triples on food products, their ingredients, allergens, nutrition facts and much more from Open Food Facts.

CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML", true, 25000)

On my laptop the whole import took just over 4 minutes to produce this output.

[Screenshot: import summary output for the Open Food Facts dataset]


When shortening of names is selected, the list of prefixes being used is included in the import summary. If you want to give it a try, don't forget to create the following indexes beforehand; otherwise the stored procedure will abort the import and remind you:

CREATE INDEX ON :Resource(uri) 
CREATE INDEX ON :URI(uri)
CREATE INDEX ON :BNode(uri) 
CREATE INDEX ON :Class(uri)

Once imported, I can find straight away the set of shared ingredients between your Kellogg's Coco Pops cereals and a bag of pork pies that you can buy at your local Spar.

[Screenshot: query result showing the shared ingredients of the two products]


Below is the Cypher query that produces these results. Notice how the URIs have been shortened but uniqueness of names is preserved by prefixing them with a namespace prefix.

MATCH (prod1:Resource { uri: 'http://world-fr.openfoodfacts.org/produit/9310055537194/coco-pops-kellogg-s'})
MATCH (prod2:ns3_FoodProduct { ns3_name : '2 Snack Pork Pies'})
MATCH (prod1)-[:ns3_containsIngredient]->(x1)-[:ns3_food]->(sharedIngredient)<-[:ns3_food]-(x2)<-[:ns3_containsIngredient]-(prod2)
RETURN prod1, prod2, x1, x2, sharedIngredient

I've intentionally written the MATCH blocks for the two products in different ways: one identifies the product by its unique identifier (URI), and the other combines the category and the name.

A couple of open points

There are a couple of things that I have not explored in this post and that the current implementation of the RDF importer does not deal with.

Multivalued properties

The current implementation does not deal with multivalued properties, although an obvious implementation would be to use arrays of values for this.
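A minimal sketch of what the array-based approach could look like in Cypher, assuming multivalued properties are stored as lists from the first value onwards (the exterms:phone predicate and the number are made up for the example):

// append a newly seen value; coalesce creates the list on first use
MATCH (p:Resource { uri: "exstaff:85740" })
SET p.`exterms:phone` = coalesce(p.`exterms:phone`, []) + "555-0001"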

And the metadata?

This works great for instance data, but there is a little detail to take into account: an RDF graph can contain metadata statements. This means that you can find in the same graph (JB, rdf:type, Person) and (Person, rdf:type, owl:Class), and even (rdf:type, rdf:type, rdf:Property). The post on Building a semantic graph in Neo4j gives some ideas on how to deal with RDF metadata, but this is a very interesting topic and I'll be coming back to it in future posts.

Conclusions

Migrating data from an RDF graph into a property graph like the one implemented by Neo4j can be done in a generic and relatively straightforward way, as we've seen. This is interesting because it gives you an automated way of importing your existing RDF graphs (regardless of the serialization: JSON-LD, RDF/XML, Turtle, etc.) into Neo4j without loss of their graph nature and without having to go through any intermediate flattening step.

Because the import process is totally generic, the resulting graph in Neo4j of course inherits the modelling limitations of RDF, like the lack of support for attributes on relationships, so you will probably want to enrich or fix your raw graph once it's been loaded into Neo4j. Both potential improvements to the import process and post-import graph processing will be discussed in future posts. Watch this space.


23 thoughts on “Importing RDF data into Neo4j”

  1. Hello Jesús,

    I'm new to the Neo4j world and interested in importing RDF N-Triples data into Neo4j, which is how I got to your blog post. At the moment I am a little bit stuck, because I don't know how to use your procedure. These are the steps that I've done:

    1. Downloaded your source code
    2. Generated a jar file with 'mvn package'
    3. Put the generated jar in the plugins folder of Neo4j
    4. Called the procedure over the web interface

    But calling the procedure generates the error:
    ‘There is no procedure with the name `semantics.importRDF` registered for this database instance. Please ensure you’ve spelled the procedure name correctly and that the procedure is properly deployed.’.

    Can you give me some suggestions on how to solve it? Maybe APOC is a necessary dependency?

    Best regards
    Timo


    • Hi Timo, great to hear that you're giving the loader a try! You seem to have followed the right steps, but let me ask you a couple of questions.

      1. Did you restart the Neo4j server after copying the jar to the plugins folder?

      2. Did you also copy across the dependent jars? APOC is not a required jar in the current version, but there are a number of required third-party jars listed in the pom.xml that need to be copied to the plugins directory as well.

      3. Finally, if you've checked the previous two and it still does not work, I'd recommend opening an issue on GitHub (look at this similar one: https://github.com/jbarrasa/neosemantics/issues/1). At this point the version of Neo4j you're working with and the server logs would be useful.

      Hope this helps. Let me know how it goes.

      JB.


  2. Hello,

    I have some trouble with your procedure…
    In Neo4j, calling the procedure always returns a KO termination status with the same extra info:

    “At least one of the required indexes was not found [ :Resource(uri), :URI(uri), :BNode(uri), :Class(uri) ]”

    Even your example
    CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML",false,5000)

    returns the same issue.

    Any idea?

    Thank you for your help,
    Thomas


    • Hi Thomas,
      I did add this stopper to avoid kicking off the RDF load if the indexes were not present, because without indexes it can be really slow on large RDF imports.

      All you have to do is create the indexes as described in the post. You can do this either from the browser or from the shell by running the following instructions:

      CREATE INDEX ON :Resource(uri)
      CREATE INDEX ON :URI(uri)
      CREATE INDEX ON :BNode(uri)
      CREATE INDEX ON :Class(uri)

      You can check that the indexes have been created by running :SCHEMA on your Neo4j browser.
      Once the indexes are present the importRDF procedure should work nicely.
      Let me know if this solves the problem.

      Cheers,

      JB.


  3. Hello there,

    Unfortunately I still run into trouble getting it to work.
    Let me recap the steps I took in case I did anything wrong (Neo4j version 3.0.6):

    1. downloaded the entire neosemantics folder into the maven/bin directory, ran mvn package shade:shade
    2. copied the target output into neo4j/plugins, save for original-neosemantics-1.0-SNAPSHOT.jar, which causes a crash at Neo4j startup
    3. CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML", true, 25000) causes the following error: At least one of the required indexes was not found [ :Resource(uri), :URI(uri), :BNode(uri), :Class(uri) ]
    4. copied the JARs from "alternate location" directly into the neo4j/plugins folder; after a server restart, Neo4j crashes altogether.

    I uploaded the error/warning portion from my log because it would explode this comment section (http://incinerator.lima-city.de/errors.txt). I don't understand the error, because all the mentioned JARs are in the plugins folder…

    Maybe you know what could be the matter?
    Thanks,
    Christian


    • For goodness' sake, why can't I read instructions properly? It was the index creation issue, which was even pointed out in the original blog article AND was mentioned in this very comment section. Sometimes you can't see the forest for the trees, I guess.
      Seems to work now, sorry for posting the question.
      Christian


  4. Hi Jesús,

    Thanks for the detailed blogs and code that you have made available to us, the "online community". It has helped me a lot to get a running start on combining the two worlds of RDF and Neo4j graphs with each other.
    I have a conundrum with a real-life use case and would like to know if you can get in touch with me. I'm interested in your view on a few points…

    Kind regards, Black


  5. Hello Jesús,

    I was really searching for something like this. Thank you for providing it! I tried your loader today and it is working fine. The only problem is scalability. I'm trying to import a 100-million-triple file and it takes very long. In fact, I have the impression that the number of triples imported per second is going down. I'm trying on a machine with 60GB of RAM. Did you try on larger datasets? Do you have some ideas about what is happening?

    Thank you
    Dennis


    • Hi Dennis,

      The degradation in write performance is expected. In Neo4j all connections (relationships) between nodes are materialised at write time, as opposed to triple stores, where they are computed via joins at query time. It's a trade-off between write and read performance. What you're getting with Neo4j is a more expensive transactional data load but lightning-speed traversals at read time.

      Also keep in mind that this approach is transactional, which may or may not be the best approach for super large datasets. If your dataset is really massive you may want to try the non-transactional import tool (https://neo4j.com/docs/operations-manual/current/import/).

      The largest dataset I've imported with my stored procedure is 107 million triples, and it took 2h50min on my 16GB laptop.

      So I'd say it depends: if it's a one-off import it can be acceptable, but if you want to import 100 million triples every hour then you'll probably need to find alternatives.

      BTW, I'm currently working on the next post, describing the experience of loading the larger dataset I mentioned before, so watch this space.

      Cheers,

      JB.


  6. Hi Jesus,

    thank you for your work, which allows the Semantic Web community to use Neo4j technology more easily. I commented once before on this post about importing RDF datasets. My problem was that I had trouble importing a bigger RDF dataset into Neo4j. Unfortunately I still have the problem. You pointed me to the neo4j-import utility. So I did the following: I parsed an RDF file and created a dictionary for all the nodes using a quite big HashMap. Then I created two files:

    reduced_dbpedia_nodes.csv:

    id:ID,uri,:LABEL
    7,”http://dbpedia.org/resource/Albedo”,Resource

    reduced_dbpedia_relations.csv:

    :START_ID,:END_ID,:TYPE
    1,2,”http://www.w3.org/1999/02/22-rdf-syntax-ns#type”

    I used neo4j-import for the import. I can import a part of DBpedia (and it is fine), but when I want to load the yago classes as well (just a lot of classes) I get a strange error, which I reported here:

    http://stackoverflow.com/questions/41909362/neo4j-import-fails

    You said that you wanted to write a post about importing big datasets. Is this still the case? Can you help?

    Merci
    Dennis Diefenbach


    Thanks for your code. But when I run CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML", { languageFilter: 'fr', commitSize: 5000, nodeCacheSize: 250000 }), the procedure always returns a KO termination status with the same extra info: "unqualified property element not allowed [line 2, column 14]"

