Neo4j is your RDF store (part 2)

As in previous posts, for those of you less familiar with the differences and similarities between RDF and the Property Graph, I recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

In the previous post in this series, I showed the most basic way in which a portion of your graph can be exposed as RDF: identifying a node by ID, or by URI if your data was imported from an RDF dataset. In this one, I’ll explore a more interesting way: running Cypher queries and serialising the resulting subgraph as RDF.

The dataset

For this example I’ll use the Northwind database, which you can easily load into your Neo4j instance by running the following in your Neo4j browser.

:play northwind graph

If you follow the step-by-step instructions you should get the graph built in no time. You’re then ready to run queries like “Get the detail of the orders by Rita Müller containing at least one dairy product”. Here is the Cypher for it:

MATCH (cust:Customer {contactName : "Rita Müller"})-[p:PURCHASED]->(o:Order)-[or:ORDERS]->(pr:Product)
WHERE (o)-[:ORDERS]->()-[:PART_OF]->(:Category {categoryName:"Dairy Products"})
RETURN *

And this is the resulting graph:

Screen Shot 2016-12-16 at 12.46.40.png

Serialising the output of a cypher query as RDF

The result of the previous query is a portion of the Northwind graph, a set of nodes and relationships that can be serialised as RDF using the neosemantics Neo4j extension.

Once it is installed on your Neo4j instance, you’ll notice that the neosemantics extension includes a Cypher endpoint, /rdf/cypher (described here), that takes a Cypher query as input and returns the results serialised as RDF, with the usual choice of serialisation format in the HTTP request.

The endpoint can be tested directly from the browser and will produce JSON-LD by default.

Screen Shot 2016-12-16 at 12.58.39.png

The URIs of the resources in RDF are generated from the node IDs in Neo4j, and in this first version of the LPG-to-RDF endpoint all elements in the graph (RDF properties and types) share the same generic vocabulary namespace. (It will be different if your graph has been imported from an RDF dataset, as we’ll see in the final section.)

Validating the RDF output on the W3C RDF Validation Service

A simple way of validating the output of the serialisation could be to load it into the W3C RDF validation service. It takes two simple steps:

Step one: Run your Cypher query on the rdf/cypher endpoint, selecting application/rdf+xml as the serialisation format in the Accept header of the HTTP request. This is what the curl expression would look like:

curl http://localhost:7474/rdf/cypher -H "Accept: application/rdf+xml" \
     -d "MATCH (cust:Customer {contactName : 'Rita Müller'})-[p:PURCHASED]->(o:Order)-[or:ORDERS]->(pr:Product) WHERE (o)-[:ORDERS]->()-[:PART_OF]->(:Category {categoryName:'Dairy Products'}) RETURN *"

This should produce something like this (showing only the first few rows):

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF xmlns:neovoc="neo4j://vocabulary#"
         xmlns:neoind="neo4j://indiv#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description rdf:about="neo4j://indiv#77511">
    <rdf:type rdf:resource="neo4j://vocabulary#Customer"/>
    <neovoc:country rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Germany</neovoc:country>
    <neovoc:address rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Adenauerallee 900</neovoc:address>
    <neovoc:contactTitle rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Sales Representative</neovoc:contactTitle>
    <neovoc:city rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Stuttgart</neovoc:city>
    <neovoc:phone rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0711-020361</neovoc:phone>
    <neovoc:contactName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Rita Müller</neovoc:contactName>
    <neovoc:companyName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Die Wandernde Kuh</neovoc:companyName>
    <neovoc:postalCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string">70563</neovoc:postalCode>
    <neovoc:customerID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">WANDK</neovoc:customerID>
    <neovoc:fax rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0711-035428</neovoc:fax>
    <neovoc:region rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NULL</neovoc:region>
</rdf:Description>

<rdf:Description rdf:about="neo4j://indiv#77937">
    <neovoc:ORDERS rdf:resource="neo4j://indiv#76432"/>
</rdf:Description>
...

I know the XML-based format is pretty horrible, but we need it because it’s the only one that the RDF validator accepts 😦
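If you would rather script step one than type the curl command, here is a minimal Python sketch of the same request using only the standard library. It assumes, as the curl example does, that the endpoint lives at http://localhost:7474/rdf/cypher and takes the Cypher query as the POST body; the function name is mine, not part of neosemantics.

```python
import urllib.request

# Endpoint taken from the curl example; adjust host and port to your instance.
RDF_CYPHER_URL = "http://localhost:7474/rdf/cypher"


def build_rdf_cypher_request(cypher, accept="application/rdf+xml",
                             url=RDF_CYPHER_URL):
    """Prepare (but do not send) a POST whose body is the Cypher query.

    build_rdf_cypher_request is a hypothetical helper name used for
    illustration only.
    """
    return urllib.request.Request(
        url,
        data=cypher.encode("utf-8"),
        headers={"Accept": accept},
        method="POST",
    )


query = (
    "MATCH (cust:Customer {contactName : 'Rita Müller'})-[p:PURCHASED]->(o:Order)"
    "-[or:ORDERS]->(pr:Product) "
    "WHERE (o)-[:ORDERS]->()-[:PART_OF]->(:Category {categoryName:'Dairy Products'}) "
    "RETURN *"
)
request = build_rdf_cypher_request(query)

# Sending it needs a running Neo4j instance with neosemantics installed:
# with urllib.request.urlopen(request) as response:
#     rdf_xml = response.read().decode("utf-8")
```

Changing the accept argument is all it should take to ask for a different serialisation.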

Step two: Go to the W3C RDF validation service page (https://www.w3.org/RDF/Validator/), paste the XML from the previous step into the text box and select triples and graph in the display options. Hit Parse RDF and… you should get the list of 266 parsed triples plus a graphical representation of the RDF graph like this one:

266triples.png

Yes, I know, it’s huge compared to the original property graph, but this is normal. RDF makes an atomic decomposition of every single statement in your data. In an RDF graph not only entities but also every single property value produces a new vertex, leading to this explosion in the size of the graph.

Screen Shot 2016-12-16 at 15.58.33.png

That’s a slide from this talk at Graph Connect SF in Oct 2016, where I discussed why it’s normal for the number of triples in an RDF dataset to be an order of magnitude bigger than the number of nodes in an LPG.

The portion of the Northwind graph returned by our example query is no exception: 19 nodes => 266 triples.
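A rough way to see where those triples come from is to count one triple per label, one per node property and one per relationship. The Python sketch below does exactly that; it is my own back-of-the-envelope model, not the extension’s code, and it assumes single-valued properties. The customer node is the one shown in the RDF/XML output earlier.

```python
def count_triples(nodes, relationships):
    """Estimate the triples produced by a property graph fragment.

    Assumes one triple per label, per (single-valued) node property and
    per relationship; relationship properties are not counted.
    """
    total = 0
    for node in nodes:
        total += len(node["labels"]) + len(node["properties"])
    return total + len(relationships)


# The Customer node from the RDF/XML output shown earlier in the post.
customer = {
    "labels": ["Customer"],
    "properties": {
        "country": "Germany",
        "address": "Adenauerallee 900",
        "contactTitle": "Sales Representative",
        "city": "Stuttgart",
        "phone": "0711-020361",
        "contactName": "Rita Müller",
        "companyName": "Die Wandernde Kuh",
        "postalCode": "70563",
        "customerID": "WANDK",
        "fax": "0711-035428",
        "region": "NULL",
    },
}

# 12: one rdf:type triple plus eleven property triples.
customer_triples = count_triples([customer], [])
print(customer_triples)
```

A single node already accounts for a dozen triples, so a couple of hundred for 19 nodes and their relationships is about what you would expect.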

If the graph was imported from RDF…

If your graph in Neo4j was imported using the semantics.importRDF procedure (described in previous blog posts and with some examples), then you want to use the rdf/cypheronrdf endpoint (described here) instead. It works in exactly the same way, but uses the URIs as unique identifiers for nodes instead of the IDs.

If you’re interested in what this would look like, watch this space for part three of this series.

Takeaways

As in the previous post, the main takeaway is that it is pretty straightforward to offer an RDF “open standards compliant” API for publishing your graph while still getting the benefits of native graph storage and Cypher querying in Neo4j.

 

 

 


Neo4j is your RDF store (part 1)

If you want to understand the differences and similarities between RDF and the Labeled Property Graph implemented by Neo4j, I’d recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

Intro

Let me start with some basics: RDF is a standard for data exchange, but it does not impose any particular way of storing data.

What do I mean by that? I mean that data can be persisted in many ways: tables, documents, key-value pairs, property graphs, triple graphs… and still be published/exchanged as RDF.

It is true though that the bigger the paradigm impedance mismatch -the difference between RDF’s modelling paradigm (a graph) and the underlying store’s one-, the more complicated and inefficient the translation for both ingestion and publishing will be.

I’ve been blogging over the last few months about how Neo4j can easily import RDF data and in this post I’ll focus on the opposite: How can a Neo4j graph be published/exposed as RDF.

Because in case you didn’t know, you can work with Neo4j getting the benefits of native graph storage and processing -best performance, data integrity and scalability- while being totally ‘open standards‘ to the eyes of any RDF aware application.

Oh! hang on… and your store will also be fully open source!

A “Turing style” test of RDFness

In this first section I’ll show the simplest way in which data from a graph in Neo4j can be published as RDF, but I’ll also demonstrate that it is possible to import an RDF dataset into Neo4j without loss of information, so that the RDF produced when querying Neo4j is identical to that produced by the original triple store.

Screen Shot 2016-11-17 at 01.18.36.png

You’ll probably be familiar with the Turing test where a human evaluator tests a machine’s ability to exhibit intelligent behaviour, to the point where it’s indistinguishable from that of a human. Well, my test aims to prove Neo4j’s ability to exhibit “RDF behaviour” to an RDF consuming application, making it indistinguishable from that of a triple store. To do this I’ll use the neosemantics neo4j extension.

The simplest test one can think of could be something like this:

Starting from an RDF dataset living in a triple store, we migrate it (all or partially) into Neo4j. Now if we run a SPARQL DESCRIBE <uri> query on the triple store and its equivalent rdf/describe/uri?nodeuri=<uri> in Neo4j, do they return the same set of triples? If that is the case -and if we also want to be pompous- we could say that the results are semantically equivalent, and therefore indistinguishable to a consumer application.

We are going to run this test step by step on data from the British National Bibliography dataset:

Get an RDF node description from the triple store

To do that, we’ll run the following SPARQL DESCRIBE query in the British National Bibliography public SPARQL endpoint, or alternatively in the more user friendly SPARQL editor.

DESCRIBE <http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940>

The request returns an RDF fragment containing all information about Mikhail Bulgakov in the BNB. A pretty cool author, by the way, whom I strongly recommend. The fragment actually contains 86 triples, the first of which are these:

<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/givenName> "Mikhail" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.w3.org/2000/01/rdf-schema#label> "Bulgakov, Mikhail, 1891-1940" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/familyName> "Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/name> "Mikhail Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/010535795> .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/008720599> .
...

You can get the whole set by running the query in the SPARQL editor I mentioned before or by sending an HTTP request with the query to the SPARQL endpoint:

curl -i http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E -H Accept:text/plain

Ok, so that’s our baseline: exactly the output we want to get from Neo4j to be able to affirm that they are indistinguishable to an RDF consuming application.
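The percent-encoded URL in the curl command does not have to be typed by hand. Here is a small Python sketch that builds it with the standard library; the helper name describe_url is mine, and the endpoint address is the one used in the curl example.

```python
from urllib.parse import urlencode

# Endpoint address taken from the curl example above.
BNB_SPARQL_ENDPOINT = "http://bnb.data.bl.uk/sparql"


def describe_url(resource_uri, endpoint=BNB_SPARQL_ENDPOINT):
    """Build the GET URL for a SPARQL DESCRIBE on resource_uri.

    describe_url is a hypothetical helper name used for illustration.
    """
    query = "DESCRIBE <{}>".format(resource_uri)
    # urlencode turns spaces into '+' and escapes '<', '>', ':' and '/'.
    return endpoint + "?" + urlencode({"query": query})


url = describe_url("http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940")
print(url)
```

The same string can be reused wherever a URL returning the RDF fragment is needed.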

Move the data from the triple store to Neo4j

We need to load the RDF data into Neo4j. We could load the whole British National Bibliography since it’s available for download as RDF, but for this example we are going to load just the portion of data that we need.

I will not go into the details of how this happens as it’s been described in previous blog posts and with some examples. The semantics.importRDF procedure runs a straightforward and lossless import of RDF data into Neo4j. The procedure is part of the neosemantics extension. If you want to run the test with me on your Neo4j instance, now is the moment when you need to install it (instructions in the README).

Once the extension is installed, the migration could not be simpler. Just run the following stored procedure:

CALL semantics.importRDF("http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E",
"RDF/XML",true,true,500)

We are passing as parameter the url of the BNB SPARQL endpoint returning the RDF data needed for our test, along with some import configuration options. The output of the execution shows that the 86 triples have been correctly imported into Neo4j:

Screen Shot 2016-11-16 at 03.01.52.png

Now the data is in Neo4j and you can query it with Cypher and visualise it in the browser. Here is a query example returning Bulgakov and all the nodes he’s connected to:

MATCH (a)-[b]-(c:Resource { uri: "http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940"})
RETURN *

Screen Shot 2016-11-16 at 02.54.34.png

There is actually not much information in the graph yet, just the node representing good old Mikhail with a few properties (name, uri, etc…) and connections to the works he created or contributed to, the events of his birth and death and a couple more. But let’s not worry about size for now, we’ll deal with that later. The question was: can we now query our Neo4j graph and produce the original set of RDF triples? Let’s see.

Get an RDF description of the same node, now from Neo4j

The neosemantics repo also includes extensions (HTTP endpoints) that provide precisely this capability. The equivalent in Neo4j of the SPARQL DESCRIBE on Mikhail Bulgakov would be the following:

:GET /rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940

If you run it in the browser, you will get the default serialisation which is JSON-LD, something like this:

Screen Shot 2016-11-16 at 16.40.23.png

But if you set in the request header the serialisation format of your choice -for example using curl again- you can get the RDF fragment in any of the available formats.

curl -i http://localhost:7474/rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940 -H accept:text/plain

Well, you should not be surprised to know that it returns 86 triples, exactly the same set that the original query on the triple store returned.
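If you want to check that claim rather than take my word for it, a simple comparison of the two N-Triples outputs as sets of lines will do. The sketch below is a deliberately naive check: it assumes both sides serialise each triple identically and that there are no blank nodes, which would need a proper graph isomorphism test. The function names are mine.

```python
def triple_set(ntriples_text):
    """Return the set of non-empty, non-comment lines of an N-Triples text."""
    lines = (line.strip() for line in ntriples_text.splitlines())
    return {line for line in lines if line and not line.startswith("#")}


def same_triples(left, right):
    """True when both texts contain the same lines, in any order."""
    return triple_set(left) == triple_set(right)


BULGAKOV = "<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940>"
FOAF = "http://xmlns.com/foaf/0.1/"

# Two of the triples listed earlier, in a different order on each side.
from_triple_store = "\n".join([
    '{} <{}givenName> "Mikhail" .'.format(BULGAKOV, FOAF),
    '{} <{}familyName> "Bulgakov" .'.format(BULGAKOV, FOAF),
])
from_neo4j = "\n".join([
    '{} <{}familyName> "Bulgakov" .'.format(BULGAKOV, FOAF),
    '{} <{}givenName> "Mikhail" .'.format(BULGAKOV, FOAF),
])

print(same_triples(from_triple_store, from_neo4j))
```

In practice you would feed it the two curl outputs saved to files.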

So mission accomplished. At least for the basic case.

RDF out of Neo4j’s movie database

I thought it could be interesting to prove that an RDF dataset can be imported into Neo4j and then published without loss of information but OK, most of you may not care much about existing RDF datasets, that’s fair enough. You have a graph in Neo4j and you just want to publish it as RDF. This means that in your graph, the nodes don’t necessarily have a property for the uri (why would they?) or are labelled as Resources. Not a problem.

Ok, so if your graph is not the result of some RDF import, the service you want to use instead of the uri based one is the nodeid based equivalent.

:GET /rdf/describe/id?nodeid=<nodeid>

We’ll use for this example Neo4j’s movie database. You can get it loaded in your Neo4j instance by running

:play movies

You can get the ID of a node either directly by clicking on it in the browser or by running a simple query like this one:

MATCH (x:Movie {title: "Unforgiven"}) 
RETURN ID(x)

In my Neo4j instance, the returned ID is 97 so the GET request would pass this ID and return in the browser the JSON-LD serialisation of the node representing the movie “Unforgiven” with its attributes and the set of nodes connected to it (both inbound and outbound connections):

screen-shot-2016-11-16-at-17-07-26

But as in the previous case, the endpoint can also produce your favourite serialisation just by setting it in the accept parameter in the request header.

curl -i http://localhost:7474/rdf/describe/id?nodeid=97 -H accept:text/plain

When setting the serialisation to N-Triples format, the previous request gets you these triples:

<neo4j://indiv#97> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <neo4j://vocabulary#Movie> .
<neo4j://indiv#97> <neo4j://vocabulary#tagline> "It's a hell of a thing, killing a man" .
<neo4j://indiv#97> <neo4j://vocabulary#title> "Unforgiven" .
<neo4j://indiv#97> <neo4j://vocabulary#released> "1992"^^<http://www.w3.org/2001/XMLSchema#long> .
<neo4j://indiv#167> <neo4j://vocabulary#REVIEWED> <neo4j://indiv#97> .
<neo4j://indiv#89> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#DIRECTED> <neo4j://indiv#97> .
<neo4j://indiv#98> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .

The sharpest of you may notice when you run it that there is a bit missing. There are relationship properties in the movie database that are lost in the RDF fragment. Yes, that is because there is no way of expressing that in RDF. At least not without resorting to horribly complicated patterns like reification or singleton property that are effectively unusable in any practical real world use case. But we’ll get to that too in future posts.
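To make the mapping behind those triples concrete, here is a minimal Python sketch of how a node and a relationship could be turned into N-Triples lines. It only mimics what the output above looks like (the two neo4j:// namespaces, integers typed as xsd:long); it is not the extension’s actual code, it skips string escaping, and the function names and the sample relationship property are hypothetical.

```python
INDIV = "neo4j://indiv#"        # namespace for node identifiers
VOCAB = "neo4j://vocabulary#"   # shared namespace for labels, keys, types
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_LONG = "http://www.w3.org/2001/XMLSchema#long"


def literal(value):
    """Render a property value as an N-Triples literal (no escaping)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return '"{}"^^<{}>'.format(value, XSD_LONG)
    return '"{}"'.format(value)


def node_triples(node_id, labels, properties):
    """One rdf:type triple per label, one triple per property."""
    subject = "<{}{}>".format(INDIV, node_id)
    triples = ["{} <{}> <{}{}> .".format(subject, RDF_TYPE, VOCAB, label)
               for label in labels]
    for key, value in properties.items():
        triples.append("{} <{}{}> {} .".format(subject, VOCAB, key,
                                               literal(value)))
    return triples


def relationship_triple(start_id, rel_type, end_id, properties=None):
    """A relationship becomes a single triple.

    properties is accepted but ignored: a plain triple has nowhere to
    hold them, which is the loss described above.
    """
    return "<{}{}> <{}{}> <{}{}> .".format(INDIV, start_id, VOCAB, rel_type,
                                           INDIV, end_id)


movie = node_triples(97, ["Movie"], {"title": "Unforgiven", "released": 1992})
# The roles value is a made-up placeholder, not data from the movie graph.
acted_in = relationship_triple(98, "ACTED_IN", 97, {"roles": ["some role"]})

for line in movie + [acted_in]:
    print(line)
```

Notice how the roles map passed to relationship_triple simply disappears from the output.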

Takeaways

 

I guess the main one is that if you want to get the benefits of native graph storage and be able to query your graph with Cypher in Neo4j but also want to:

  •  be able to easily import RDF data into your graph and/or
  •  offer an RDF “open standards compliant” API for publishing your graph

Well, that’s absolutely fine, because we’ve just seen how Neo4j does a great job at producing and consuming RDF.

Remember: RDF is about data exchange, not about storage.

There is more to come on producing RDF from Neo4j than what I’ve shown in this post. For instance, publishing the results of a Cypher query as RDF. Does it sound interesting? Watch this space.

Also I’d love to hear your feedback!

 

 

 

QuickGraph#3 A step-by-step example of RDF to Property Graph transformation

The dataset

For this example I am going to use a sample movie dataset from the Cayley project. It’s a set of half a million triples about actors, directors and movies that can be downloaded here. Here is what the dataset looks like:

</en/meet_the_parents> <name> "Meet the Parents" .
</en/meet_the_parents> <type> </film/film> .
</en/meet_the_parents> </film/film/directed_by> </en/jay_roach> .
</en/meet_the_parents> </film/film/starring> _:28754 . 
_:28754 </film/performance/actor> </en/ben_stiller> .
_:28754 </film/performance/character> "Gaylord Focker" .
</en/meet_the_parents> </film/film/starring> _:28755 .
...

One could argue whether this dataset is actual RDF or just a triple based graph, since it does not use valid URIs or even the RDF vocabulary (note for example that instead of http://www.w3.org/1999/02/22-rdf-syntax-ns#type we find just type). But this would be a rather pointless discussion in my opinion. For what it’s worth, the graph is parseable with standard RDF parsers, which is enough, and as we’ll see the problems derived from this can be fixed, which is the point of this post.

 

Loading the data into Neo4j

I’ll use the RDF Importer described here for the data load. Now, there is something to take into account, even though the data set is called ‘30kmoviedata.nq’ it does not contain quads but triples, so I tried the parser setting the serialization format to ‘N-Triples’. The parser threw an error complaining about the structure of the URIs:

Not a valid (absolute) IRI: /film/performance/actor [line 1]

However, funnily enough the file parses as Turtle format. So if you want to give it a try, remember to set the second parameter of the importRDF stored procedure to ‘Turtle’ and run the import in the usual way. It took only 39 seconds to load the 471K triples on my laptop.

screen-shot-2016-09-09-at-16-56-13

Fixing the model

Fixing dense nodes representing categories

First thing we notice is that because the data set does not use the RDF vocabulary, the a <type> b statements are not transformed into labeled nodes as would have happened if rdf:type was used instead. So there are a couple of unusually dense nodes representing the categories (person and movie) because most of the nodes in the dataset are either actors or movies and are therefore linked to either one or the other category node. The two dense nodes are immediately visible in a small sample of 1000 nodes:

screen-shot-2016-09-09-at-17-11-09

We can get counts on the number of nodes connected to each of them by running this query:

MATCH (x)-[:ns1_type]->(t) RETURN t.uri, count (x)

screen-shot-2016-09-09-at-16-27-39

The natural way of representing categories in the Labeled Property Graph model is by using labels, so let’s fix this! Here is the Cypher fragment that does the job:

MATCH (x)-[:ns1_type]->({uri : 'file:/film/film'}) 
SET x:Film

And once we have the nodes labeled with their categories we can get rid of the dense nodes and the links that connect the rest of the nodes to them.

MATCH (f {uri : 'file:/film/film'}) DETACH DELETE f

Exactly the same applies to the other category: ‘file:/people/person’

MATCH (x)-[:ns1_type]->({uri : 'file:/people/person'}) 
SET x:Person 

MATCH (p {uri : 'file:/people/person'}) DETACH DELETE p
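The idea behind these four statements can be sketched outside Neo4j too. The Python below walks a list of triples, turns the type statements into labels and keeps everything else, so the category nodes are no longer needed. It works on the sample lines from the top of the post; the function name and the mapping dictionary are my own illustration.

```python
def types_to_labels(triples, label_for_type):
    """Replace '<type>' statements by labels on their subjects.

    Returns (labels, remaining): labels maps each subject to its set of
    labels, remaining holds every triple that was not a mapped type
    statement.
    """
    labels = {}
    remaining = []
    for subject, predicate, obj in triples:
        if predicate == "type" and obj in label_for_type:
            labels.setdefault(subject, set()).add(label_for_type[obj])
        else:
            remaining.append((subject, predicate, obj))
    return labels, remaining


# Sample triples from the dataset fragment shown earlier.
triples = [
    ("/en/meet_the_parents", "name", "Meet the Parents"),
    ("/en/meet_the_parents", "type", "/film/film"),
    ("/en/meet_the_parents", "/film/film/directed_by", "/en/jay_roach"),
]
label_for_type = {"/film/film": "Film", "/people/person": "Person"}

labels, remaining = types_to_labels(triples, label_for_type)
print(labels)
print(len(remaining))
```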

Fixing unneeded intermediate nodes holding relationship properties

In the tiny fragment that I copied at the beginning of the post, we can already see that the data set suffers from one of the known limitations of triple based graph models which is the impossibility of adding attributes to relationships. To do that, intermediate nodes need to be created. Let’s have a look at the example in the previous data fragment graphically.

Ben Stiller plays the role of Gaylord Focker in the movie Meet the Parents, and when modelling this (think how you would draw that on a whiteboard) our intuition says something like this:

 

Screen Shot 2016-09-09 at 21.11.20.png

But in a triple based model you will need to introduce an intermediate node to hold the role played by an actor in a movie. Something like this.

screen-shot-2016-09-09-at-14-43-32

This obviously creates a gap between what you conceive when modelling a domain and what is stored on disk and ultimately queried. You will have to map what’s in your head, what you drew on the whiteboard when sketching the model, to what the triple based formalism forces you to actually create. Does this ring a bell? Join tables in the relational model maybe? In your head it’s a many-to-many relationship, but in the relational model it has to be modelled in a separate join table, an artificial construct imposed by the modelling paradigm that inevitably builds a gap between the conceptual model and the physical one. This ultimately makes your model harder to understand and maintain and your SQL queries looooooonger and less performant. But not to worry, we’ll fix this by using the property graph model, the one that is closer to the way we as humans understand and model domains.

But before we do that, let’s look at another problem derived from this. This complex model introduces the possibility of data quality problems in the form of broken links. What if we have the first leg connecting our intermediate node with the movie but no connection with the actor?  It would be a totally meaningless piece of information. The pattern I’m describing would be expressed like this:

()-[r:ns2_starring]->(x) WHERE NOT (x)-[:ns0_actor]->()

And a query producing a ‘Data Quality’ report on this particular issue could look something like this:

MATCH ()-[r:ns2_starring]->(x) WHERE NOT (x)-[:ns0_actor]->() 
WITH COUNT(r) as brokenLinks
MATCH ()-[r:ns2_starring]->(x)-[:ns0_actor]->() 
WITH COUNT(r) as linked, brokenLinks
RETURN linked + brokenLinks as total, linked, brokenLinks,  
     toFloat(brokenLinks)* 100/(linked + brokenLinks) as percentageBroken

Screen Shot 2016-09-09 at 17.43.59.png

So 0.03% does not seem to be significant, probably the dataset was truncated in a bad way, which would explain the missing bits. Anyway, we can get rid of these broken links that don’t add any value to our graph. Here’s how:

MATCH ()-[r:ns2_starring]->(x) WHERE NOT (x)-[:ns0_actor]->() 
DETACH DELETE x

Ok, so now we are in a position to get rid of the ugly and unintuitive intermediate nodes that I described before and replace them with relationships containing attributes on them.

MATCH (film)-[r:ns2_starring]->(x)-[:ns0_actor]->(actor)
CREATE (actor)-[:ACTS_IN { character: x.ns0_character}]->(film)
DETACH DELETE x
...
Deleted 136694 nodes, set 15043 properties, created 136694 relationships, statement executed in 7029 ms.
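For those who prefer to see the transformation spelled out step by step, here is a small Python sketch of the same two moves: spot the intermediate nodes with no actor attached, and collapse the rest into direct ACTS_IN relationships carrying the character as an attribute. The sample data comes from the fragment at the top of the post; treating _:28755 as a broken link is purely an assumption for illustration, since the fragment is truncated there. The function name is mine.

```python
def collapse_performances(starring, actor_of, character_of):
    """Turn film -> performance -> actor paths into ACTS_IN relationships.

    starring:     list of (film, performance) pairs
    actor_of:     dict performance -> actor
    character_of: dict performance -> character name
    Returns (acts_in, broken): acts_in holds (actor, properties, film)
    tuples, broken lists the performances that have no actor.
    """
    acts_in = []
    broken = []
    for film, performance in starring:
        actor = actor_of.get(performance)
        if actor is None:
            broken.append(performance)   # first leg only: meaningless
            continue
        properties = {}
        character = character_of.get(performance)
        if character is not None:        # skip missing values, as Cypher
            properties["character"] = character  # does with nulls
        acts_in.append((actor, properties, film))
    return acts_in, broken


starring = [
    ("/en/meet_the_parents", "_:28754"),
    ("/en/meet_the_parents", "_:28755"),  # assumed broken for this example
]
actor_of = {"_:28754": "/en/ben_stiller"}
character_of = {"_:28754": "Gaylord Focker"}

acts_in, broken = collapse_performances(starring, actor_of, character_of)
percentage_broken = len(broken) * 100 / (len(acts_in) + len(broken))

print(acts_in)
print(broken, percentage_broken)
```

The percentage uses the same formula as the data quality query, just on two rows instead of the whole dataset.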

And voilà! Here is the final model zooming on the ‘Gaylord Focker’ area:

MATCH (actor)-[:ACTS_IN { character : 'Gaylord Focker' }]->(movie) 
RETURN * LIMIT 25

 

Screen Shot 2016-09-09 at 18.37.38.png

And to finish, one of our favourites at Neo4j, a recommendation engine for Hollywood actors. Who should Ben Stiller work with? We’ll base this on the concept of friend-of-a-friend. If Ben has worked several times with actor X and actor X has worked several times with actor Y, then there is a good chance that Ben might be interested in working with actor Y.

Here is the Cypher query that returns our best recommendations for Ben Stiller:

MATCH (ben:Person {ns1_name: 'Ben Stiller'})-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(friend) 
WITH ben, friend, count(movie) AS timesWorkedWithBen ORDER BY timesWorkedWithBen DESC LIMIT 3 //limit to top 3 
MATCH (friend)-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(friendOfFriend)
WHERE NOT (ben)-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(friendOfFriend) AND friendOfFriend <> ben
RETURN friend.ns1_name AS friendOfBen, timesWorkedWithBen, friendOfFriend.ns1_name AS recommendationForBen, count(movie) AS timesWorkedWithFriend ORDER BY timesWorkedWithFriend DESC limit 50

Easy, right? And here are the recommendations:

Screen Shot 2016-09-09 at 20.46.32.png
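If you want to trace the friend-of-a-friend logic by hand, here is a plain Python sketch of what the Cypher query does, run on a tiny made-up cast list. The movies and names (m1, X, Y…) are placeholders, not data from the dataset, and the function is my own reading of the query: take the person’s top collaborators, then count who those collaborators worked with in movies the person was not in.

```python
from collections import Counter


def recommend(cast, person, top_friends=3):
    """Friend-of-a-friend recommendations from a movie -> actors mapping.

    Returns (friend, timesWorkedWithPerson, recommendation,
    timesWorkedWithFriend) tuples, most frequent pairings first.
    """
    worked_with = Counter()
    for actors in cast.values():
        if person in actors:
            for other in actors - {person}:
                worked_with[other] += 1
    # Sort by count (descending), then by name so ties are deterministic.
    friends = sorted(worked_with.items(),
                     key=lambda kv: (-kv[1], kv[0]))[:top_friends]

    rows = []
    for friend, times_with_person in friends:
        pairings = Counter()
        for actors in cast.values():
            # Only movies the friend made without the person.
            if friend in actors and person not in actors:
                for other in actors - {friend}:
                    pairings[other] += 1
        for other, times_with_friend in pairings.items():
            rows.append((friend, times_with_person, other, times_with_friend))
    return sorted(rows, key=lambda row: (-row[3], row[0], row[2]))


# Hypothetical cast list, for illustration only.
cast = {
    "m1": {"Ben", "X"},
    "m2": {"Ben", "X"},
    "m3": {"X", "Y"},
    "m4": {"X", "Y"},
    "m5": {"X", "Z"},
    "m6": {"Ben", "W"},
}

recommendations = recommend(cast, "Ben")
for row in recommendations:
    print(row)
```

It works, but compare its length with the five lines of Cypher above.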

The following two visualisations give an idea of the portion of the graph explored with our recommendation query. This first one shows Ben’s friends and the movies where they worked together (~400 nodes in total):

Screen Shot 2016-09-09 at 19.26.35.png

And the next shows Ben’s friends’ friends, again with the movies that connect them (~1800 nodes):

Screen Shot 2016-09-09 at 19.34.10.png

You can try to write something similar on the original triple based graph using SPARQL, Gremlin or any other language but I bet you it will be less compact, less intuitive and certainly less performant than the Cypher I wrote. Prove me wrong if you can 😉

What’s interesting about this QuickGraph?

The example highlights some of the modelling limitations of triple based graph models like RDF, and shows how a model originally created as RDF can be transformed into a more intuitive one that is easier to query and explore, using the Labeled Property Graph in Neo4j.

 

 

 

 

 

QuickGraph #1 European Politics from DBpedia. Loading data from an RDF triple store into Neo4j via SPARQL

The first of a series of quick graphs in Neo4j built from public data. Watch this space! The example is also available as a GraphGist here

The dataset

For this example I’ve used DBpedia data about European cities, the political parties in their local governments and their ideologies. So big disclaimer, this data is not meant to be complete or more accurate than what can be expected from DBpedia/Wikipedia data. DBpedia stores its data using the RDF model, so in order to query it we’ll have to use the SPARQL query language.

Here is the SPARQL query I’ve used to extract the cities and countries and the political parties currently in the local government. The query can be tested on the DBpedia SPARQL endpoint.

select distinct ?party ?city ?cityName ?ctr ?ctrName ?pop
where {

     ?city a dbo:Location ;
          rdfs:label ?cityName ;
           dbo:country ?ctr ;
           dbo:leaderParty ?party .
     FILTER(langMatches(lang(?cityName), "EN"))

     ?ctr dct:subject dbc:Countries_in_Europe ;
          rdfs:label ?ctrName .
     FILTER(langMatches(lang(?ctrName), "EN"))

     optional { ?city dbo:populationTotal ?pop }
}

And a second SPARQL query that returns, for each party, the ideologies it is associated with. Again, this is according to what DBpedia/Wikipedia says about the different political parties.

select distinct ?party ?partyName ?ideology ?ideologyName where {

     ?city a dbo:Location ;
           dbo:country ?ctr ;
           dbo:leaderParty ?party .

     ?ctr dct:subject dbc:Countries_in_Europe .

     ?party dbo:ideology ?ideology ;
            rdfs:label ?partyName .
     FILTER(langMatches(lang(?partyName), "EN"))

     ?ideology rdfs:label ?ideologyName
     FILTER(langMatches(lang(?ideologyName), "EN"))
}

SPARQL queries of type SELECT return a set of variable bindings, or in other words, a tabular structure that the endpoint can serialise as CSV. This is quite convenient because it can be used directly by the LOAD CSV instruction in Cypher.
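To picture what that tabular structure looks like once it is read row by row with headers, here is a short Python sketch using the standard csv module. The column names are the variables of the first SELECT query; the two rows are invented placeholders (example.org URIs), not real DBpedia results. Treating an empty pop field as 0 is my assumption about how a missing optional value should be handled.

```python
import csv
import io


def rows_from_csv(csv_text):
    """Read CSV text with a header line into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical result of the first SELECT query, serialised as CSV.
sample = (
    "party,city,cityName,ctr,ctrName,pop\n"
    "http://example.org/party/P1,http://example.org/city/C1,City One,"
    "http://example.org/country/K1,Country One,1000\n"
    "http://example.org/party/P2,http://example.org/city/C2,City Two,"
    "http://example.org/country/K1,Country One,\n"
)

rows = rows_from_csv(sample)

# An empty 'pop' field (the OPTIONAL part did not match) falls back to 0.
populations = [int(row["pop"] or 0) for row in rows]

for row, population in zip(rows, populations):
    print(row["cityName"], row["ctrName"], population)
```

Each dict plays the role of one row in the data load, with the SPARQL variable names as keys.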

Loading the data into Neo4j

Here are the two Cypher statements that create the model in Neo4j. You may also want to create an index on city nodes first to get better performance:

CREATE INDEX ON :City(city_uri);

WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fparty+%3Fcity+%3FcityName+%3Fctr+%3FctrName+%3Fpop+where+%7B%0D%0A%3Fcity+a+dbo%3ALocation+%3B%0D%0A++++++dbo%3Acountry+%3Fctr+%3B%0D%0A++++++dbo%3AleaderParty+%3Fparty+.%0D%0A%0D%0A%3Fctr+dct%3Asubject+dbc%3ACountries_in_Europe+.%0D%0A%0D%0A%3Fparty+dbo%3Aideology+%3Fideology+.%0D%0A%0D%0Aoptional+%7B+%3Fcity+dbo%3ApopulationTotal+%3Fpop+%7D%0D%0A%0D%0A%3Fcity+rdfs%3Alabel+%3FcityName%0D%0AFILTER%28langMatches%28lang%28%3FcityName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%3Fctr+rdfs%3Alabel+%3FctrName%0D%0AFILTER%28langMatches%28lang%28%3FctrName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%7D&format=csv" AS queryOnSPARQLEndpoint
LOAD CSV WITH HEADERS FROM queryOnSPARQLEndpoint AS row
WITH row
MERGE (cty:City {city_uri: row.city}) 
      SET cty.city_pop= coalesce(row.pop,0), cty.city_name = row.cityName
MERGE (ctr:Country {ctry_uri: row.ctr}) 
      SET ctr.ctry_name = row.ctrName
MERGE (pty:Party {party_uri: row.party})
MERGE (cty)-[:CTY_IN_COUNTRY]->(ctr)
MERGE (cty)-[:GOVERNING_PARTY]->(pty)

Note that the query runs directly off the DBpedia SPARQL endpoint (same for the next one) so if the endpoint is down for maintenance or any other reason, this script won’t do much 🙂

WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fparty+%3FpartyName+%3Fideology+%3FideologyName+where+%7B%0D%0A%0D%0A%3Fcity+a+dbo%3ALocation+%3B%0D%0A++++++dbo%3Acountry+%3Fctr+%3B%0D%0A++++++dbo%3AleaderParty+%3Fparty+.%0D%0A%0D%0A%3Fctr+dct%3Asubject+dbc%3ACountries_in_Europe+.%0D%0A%0D%0A%3Fparty+dbo%3Aideology+%3Fideology+%3B%0D%0A+++++++rdfs%3Alabel+%3FpartyName+.%0D%0AFILTER%28langMatches%28lang%28%3FpartyName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%3Fideology+rdfs%3Alabel+%3FideologyName%0D%0AFILTER%28langMatches%28lang%28%3FideologyName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%7D&format=csv" AS queryOnSPARQLEndpoint
LOAD CSV WITH HEADERS FROM queryOnSPARQLEndpoint AS row
WITH row
MERGE (pty:Party {party_uri: row.party}) 
      SET pty.party_name = row.partyName
MERGE (ide:Ideology {ideology_uri: row.ideology}) 
      SET ide.ideology_name = row.ideologyName
MERGE (pty)-[:HAS_IDEOLOGY]->(ide)

And that’s pretty much it. The data in your graph should look something like this for a given city, for instance Bern in Switzerland:

 

Screen Shot 2016-07-27 at 10.22.33

Or like this for a few cities in Spain:

Screen Shot 2016-07-27 at 15.40.54

Querying the graph

Now a couple of interesting queries:

Which are the most transnational ideologies in Europe?

MATCH (ctr:Country)<-[:CTY_IN_COUNTRY]-(city)-[:GOVERNING_PARTY]->(party)-[:HAS_IDEOLOGY]->(i:Ideology)
RETURN i.ideology_name, COUNT(DISTINCT(ctr)) AS presentInCountries, COLLECT(DISTINCT(ctr.ctry_name)) AS CountryList
ORDER BY presentInCountries DESC LIMIT 5

Resulting in the following records:

╒═══════════════════╤══════════════════╤══════════════════════════════╕
│i.ideology_name    │presentInCountries│CountryList                   │
╞═══════════════════╪══════════════════╪══════════════════════════════╡
│Social democracy   │25                │[Bosnia and Herzegovina, Bulga│
│                   │                  │ria, Albania, Austria, Denmark│
│                   │                  │, United Kingdom, Estonia, Cze│
│                   │                  │ch Republic, Turkey, Switzerla│
│                   │                  │nd, Croatia, Cyprus, Spain, Sw│
│                   │                  │eden, Greece, Germany, France,│
│                   │                  │ Lithuania, Italy, Slovenia, S│
│                   │                  │erbia, Romania, Portugal, Norw│
│                   │                  │ay, Netherlands]              │
├───────────────────┼──────────────────┼──────────────────────────────┤
│Christian democracy│17                │[Croatia, Bulgaria, Bosnia and│
│                   │                  │ Herzegovina, Austria, Estonia│
│                   │                  │, Switzerland, Germany, France│
│                   │                  │, Italy, Lithuania, Spain, Slo│
│                   │                  │venia, Slovakia, Serbia, Roman│
│                   │                  │ia, Poland, Netherlands]      │
├───────────────────┼──────────────────┼──────────────────────────────┤
│Euroscepticism     │15                │[Ukraine, Turkey, Croatia, Swi│
│                   │                  │tzerland, Romania, Hungary, Gr│
│                   │                  │eece, Germany, France, United │
│                   │                  │Kingdom, Lithuania, Italy, Spa│
│                   │                  │in, Serbia, Netherlands]      │
├───────────────────┼──────────────────┼──────────────────────────────┤
│Conservatism       │14                │[Bulgaria, Kosovo, Bosnia and │
│                   │                  │Herzegovina, Austria, Turkey, │
│                   │                  │Estonia, Denmark, Cyprus, Croa│
│                   │                  │tia, France, Spain, Italy, Slo│
│                   │                  │venia, Romania]               │
├───────────────────┼──────────────────┼──────────────────────────────┤
│Pro-Europeanism    │13                │[Bosnia and Herzegovina, Bulga│
│                   │                  │ria, Albania, Kosovo, Turkey, │
│                   │                  │Denmark, Switzerland, Croatia,│
│                   │                  │ Greece, Spain, Romania, Polan│
│                   │                  │d, Norway]                    │
└───────────────────┴──────────────────┴──────────────────────────────┘

Which are the most ideologically similar parties from different countries?

MATCH (p1:Party)-[:HAS_IDEOLOGY]->(i)<-[:HAS_IDEOLOGY]-(p2:Party)
WHERE (ID(p1) < ID(p2))
WITH p1, p2, COUNT(DISTINCT(i)) AS sharedIdeologyCount, 
     COLLECT(DISTINCT(i.ideology_name)) AS sharedIdeologies
WHERE sharedIdeologyCount > 2
MATCH (p1)<-[:GOVERNING_PARTY]-()-[:CTY_IN_COUNTRY]->(ctr1), 
      (p2)<-[:GOVERNING_PARTY]-()-[:CTY_IN_COUNTRY]->(ctr2)
WHERE ctr1 <> ctr2
RETURN DISTINCT p1.party_name + ' from ' + ctr1.ctry_name AS Party1, 
      p2.party_name + ' from ' + ctr2.ctry_name AS Party2, 
      sharedIdeologyCount, sharedIdeologies 
ORDER BY sharedIdeologyCount DESC LIMIT 5

Producing these results:

╒═════════════════╤═════════════════╤═════════════════╤═════════════════╕
│Party1           │Party2           │sharedIdeologyCou│sharedIdeologies │
│                 │                 │nt               │                 │
╞═════════════════╪═════════════════╪═════════════════╪═════════════════╡
│Croatian Party of│Independent Greek│3                │[Euroscepticism, │
│ Rights from Croa│s from Greece    │                 │Social conservati│
│tia              │                 │                 │sm, National cons│
│                 │                 │                 │ervatism]        │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│Movement for Righ│Convergence and U│3                │[Liberalism, Popu│
│ts and Freedoms f│nion from Spain  │                 │lism, Centrism]  │
│rom Bulgaria     │                 │                 │                 │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│Greater Romania P│Swiss People's Pa│3                │[Euroscepticism, │
│arty from Romania│rty from Switzerl│                 │Right-wing populi│
│                 │and              │                 │sm, National cons│
│                 │                 │                 │ervatism]        │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│Movement for Righ│ANO 2011 from Cze│3                │[Liberalism, Cent│
│ts and Freedoms f│ch Republic      │                 │rism, Populism]  │
│rom Bulgaria     │                 │                 │                 │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│Independent Greek│Swiss People's Pa│3                │[Euroscepticism, │
│s from Greece    │rty from Switzerl│                 │Right-wing populi│
│                 │and              │                 │sm, National cons│
│                 │                 │                 │ervatism]        │
└─────────────────┴─────────────────┴─────────────────┴─────────────────┘

 

What’s interesting about this QuickGraph?

The graph is built directly from querying an RDF triple store via a SPARQL endpoint and consuming the output of the SPARQL query directly with ‘LOAD CSV’ in Cypher.

Importing RDF data into Neo4j

The previous blog post might have been a bit too dense to start with, so I’ll try something a bit lighter this time, like importing RDF data into Neo4j. It assumes, however, a certain degree of familiarity with both RDF and graph databases.

There are a number of RDF datasets out there that you may be aware of and you may have asked yourself at some point: “if RDF is a graph, then it should be easy to load it into a graph database like Neo4j, right?”. Well, the RDF model and the property graph model (implemented by Neo4j) are both graph models but with some important differences that I won’t go over in this post. What I’ll do, though, is describe one possible way of migrating data from an RDF graph into Neo4j’s property graph database.

I’ve also implemented this approach as a Neo4j stored procedure, so if you’re less interested in the concept and just want to see how to use the procedure you can go straight to the last section. Give it a try and share your experience, please.

The mapping

The first thing to do is plan a way to map both models. Here is my proposal.

An RDF graph is a set of triples or statements (subject,predicate,object) where both the subject and the predicate are resources and the object can be either another resource or a literal. The only particularity about literals is that they cannot be the subject of other statements. In a tree structure we would call them leaf nodes. Also keep in mind that resources are uniquely identified by URIs.
Rule 1: Subjects of triples are mapped to nodes in Neo4j. A node in Neo4j representing an RDF resource will be labeled :Resource and have a property uri with the resource’s URI.
(S,P,O) => (:Resource {uri:S})...
Rule 2a: Predicates of triples are mapped to node properties in Neo4j if the object of the triple is a literal
(S,P,O) && isLiteral(O) => (:Resource {uri:S, P:O})
Rule 2b: Predicates of triples are mapped to relationships in Neo4j if the object of the triple is a resource
(S,P,O) && !isLiteral(O) => (:Resource {uri:S})-[:P]->(:Resource {uri:O})
Let’s look at an example: Here is a short RDF fragment from the RDF Primer by the W3C that describes a web page and links it to its author. The triples are the following:
ex:index.html   dc:creator              exstaff:85740 .
ex:index.html   exterms:creation-date   "August 16, 1999" .
ex:index.html   dc:language             "en" .
The URIs of the resources are shortened by using the XML namespace mechanism. In this example, ex stands for http://www.example.org/, exterms stands for http://www.example.org/terms/, exstaff stands for http://www.example.org/staffid/ and dc stands for http://purl.org/dc/elements/1.1/
The full URIs are shown in the graphical representation of the triples (the figure is taken from the W3C page).
threetriples
If we iterate over this set of triples applying the three rules defined before, we get the following elements in a Neo4j property graph. I’ll use Cypher to describe them.
The application of rules 1 and 2b to the first triple would produce:
(:Resource { uri:"ex:index.html"})-[:`dc:creator`]->(:Resource { uri:"exstaff:85740"})
The second triple is transformed using rules 1 and 2a:
(:Resource { uri:"ex:index.html", `exterms:creation-date`: "August 16, 1999"})
And finally the third triple is transformed also with rules 1 and 2a producing:
(:Resource { uri:"ex:index.html", `dc:language`: "en"})
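The three rules can also be sketched in a few lines of Python over an in-memory list of triples. This is only an illustration of the mapping, not the code of the stored procedure: the Literal class and the map_triples function are names invented for the example, and prefixed names are kept as they appear in the triples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Marks the object of a triple as a literal rather than a resource."""
    value: str


TRIPLES = [
    ("ex:index.html", "dc:creator", "exstaff:85740"),
    ("ex:index.html", "exterms:creation-date", Literal("August 16, 1999")),
    ("ex:index.html", "dc:language", Literal("en")),
]


def map_triples(triples):
    """Return (nodes, relationships) following rules 1, 2a and 2b."""
    nodes = {}          # uri -> property map; every entry stands for a :Resource node
    relationships = []  # (subject uri, relationship type, object uri)
    for subject, predicate, obj in triples:
        # Rule 1: the subject becomes a node identified by its uri.
        node = nodes.setdefault(subject, {"uri": subject})
        if isinstance(obj, Literal):
            # Rule 2a: literal object, so the predicate becomes a node property.
            node[predicate] = obj.value
        else:
            # Rule 2b: resource object, so the predicate becomes a relationship
            # and the object gets its own node.
            nodes.setdefault(obj, {"uri": obj})
            relationships.append((subject, predicate, obj))
    return nodes, relationships


nodes, relationships = map_triples(TRIPLES)
print(nodes["ex:index.html"])
print(relationships)
```

Note that the two literal-valued triples end up on the same node, because the node is looked up by uri before any property is set.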

Categories

The proposed set of basic mapping rules can be improved by adding one obvious exception for categories. RDF can represent both data and metadata as triples in the same graph and one of the most common uses of this is to categorise resources by linking them to classes through an instance-of style relationship (called rdf:type). So let’s add a new rule to deal with this case.
Rule 3: The rdf:type statements are mapped to categories in Neo4j.
(Something, rdf:type, Category) => (:Category {uri:Something})
The rule basically maps the way individual resources (data) are linked to classes (metadata) in RDF through the rdf:type predicate to the way you categorise nodes in Neo4j i.e. by using labels.
This also has the advantage of removing dense nodes, which aren’t particularly nice to deal with for any database. Rather than having a few million nodes representing people in your graph, all of them connected to a single Person class node, we will have them all labeled as :Person, which makes a lot more sense, and there is no semantic loss.
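As a rough sketch of the category rule, the Python fragment below separates the rdf:type statements from the rest and turns them into labels per resource, so no class node (and no dense node) is ever created. The function name and the sample triples are made up for the example.

```python
RDF_TYPE = "rdf:type"


def extract_labels(triples):
    """Split triples into (labels per resource, remaining triples).

    rdf:type statements become labels; everything else is left untouched
    so it can go through the basic mapping rules.
    """
    labels = {}
    remaining = []
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            labels.setdefault(subject, set()).add(obj)
        else:
            remaining.append((subject, predicate, obj))
    return labels, remaining


# Hypothetical sample data: two people typed as Person, one of them also as Actor.
sample = [
    ("ex:jb", RDF_TYPE, "Person"),
    ("ex:jb", "ex:knows", "ex:ann"),
    ("ex:ann", RDF_TYPE, "Person"),
    ("ex:ann", RDF_TYPE, "Actor"),
]

labels, remaining = extract_labels(sample)
print(labels)
print(remaining)
```

However many resources are typed as Person, the result is one more label on each of them rather than one more relationship pointing at a shared class node.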

The naming of things

Resources in RDF are identified by URIs which makes them unique, and that’s great, but they are meant to be machine readable rather than nice to the human eye. So even though you’d like to read ‘Person’, RDF will use http://vocabularies.com/socialNetowrk/voc#Person (for example). While these kinds of names can be used in Neo4j with no problem, they will make your labels and property names horribly long and hard to read and your Cypher queries will be polluted with http://… making the logic harder to grasp.
So what can we do? We have two options: 1) leave things named just as they are in the RDF model, with full URIs, and just deal with it in your queries. This would be the right thing to do if your data uses multiple schemas not necessarily under your control and/or more schemas can be added dynamically. Option 2) would be to make the pragmatic decision of shortening names to make both the model and the queries more readable. This will require some governance to ensure there are no name clashes. It is probably a reasonable thing to do if you are migrating into Neo4j data from an RDF graph where you are the owner of the vocabularies being used or at least you have control over what schemas are used.
The initial version of the importRDF stored procedure supports both approaches as we will see in the final sections.
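To give an idea of what option 2 can look like, here is a small Python sketch that replaces the namespace part of a URI with a generated prefix and keeps the local name. It is an assumption about one reasonable scheme (prefixes numbered in order of appearance, joined with an underscore), not a description of what the stored procedure does internally; NameShortener and split_uri are illustrative names.

```python
def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


class NameShortener:
    """Assigns a short prefix to every namespace the first time it is seen."""

    def __init__(self):
        self.prefixes = {}  # namespace -> generated prefix

    def shorten(self, uri):
        namespace, local_name = split_uri(uri)
        if namespace not in self.prefixes:
            self.prefixes[namespace] = "ns%d" % len(self.prefixes)
        # Keeping the prefix in the name avoids clashes between vocabularies
        # that happen to use the same local name.
        return self.prefixes[namespace] + "_" + local_name


shortener = NameShortener()
print(shortener.shorten("http://vocabularies.com/socialNetowrk/voc#Person"))
print(shortener.shorten("http://purl.org/dc/elements/1.1/creator"))
print(shortener.shorten("http://vocabularies.com/socialNetowrk/voc#name"))
```

The prefixes dictionary is the piece of governance mentioned above: it has to be kept with the data so that shortened names can be expanded back to full URIs.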

Datatypes in RDF literals

Literals can have data types associated in RDF by pairing a string with a URI that identifies a particular XSD datatype.

exstaff:85740  exterms:age  "27"^^xsd:integer .

As part of the import process you may want to map the XSD datatype used in a triple to one of Neo4j’s datatypes. If datatypes are not explicitly declared in your RDF data you can always just load all literals as Strings and then cast them if needed at query time or through some batch post-import processing.
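A minimal sketch of such a mapping in Python could look like the following. The table of casts is deliberately tiny and is an assumption made for the example (a real importer would cover more XSD types and target Neo4j’s own datatypes); literals without a known datatype are kept as strings, as suggested above.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# Illustrative subset of XSD datatypes and the Python type each one is cast to.
CASTS = {
    XSD_NS + "integer": int,
    XSD_NS + "double": float,
    XSD_NS + "boolean": lambda lexical: lexical in ("true", "1"),
}


def cast_literal(lexical, datatype=None):
    """Convert the lexical form of a literal according to its XSD datatype.

    Unknown or missing datatypes fall back to the plain string.
    """
    cast = CASTS.get(datatype, str)
    return cast(lexical)


print(cast_literal("27", XSD_NS + "integer"))
print(cast_literal("August 16, 1999"))
```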

Blank nodes

The building block of the RDF model is the triple and this implies an atomic decomposition of your data in individual statements. However -and I quote here the W3C’s RDF Primer again- most real-world data involves structures that are more complicated than that and the way to model structured information is by linking the different components to an aggregator resource. These aggregator resources may never need to be referred to directly, and hence may not require universal identifiers (URIs). Blank nodes are the artefacts in RDF that fulfil this requirement of representing anonymous resources. Triple stores will give them some sort of unique ID, local to the graph store, to keep them unique and avoid clashes.

Our RDF importer will label blank nodes as BNode and resources identified with URIs as URI. However, it’s important to keep in mind that if you bring data into Neo4j from multiple RDF graphs, identifiers of blank nodes are not guaranteed to be unique, so unexpected clashes may occur and extra controls may be required.
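One possible extra control, sketched here in Python, is to scope every blank node identifier with something derived from the graph it came from before storing it. This is not something the importer does; the function name and the identifier format are assumptions made for the example.

```python
import hashlib


def scoped_bnode_id(source_graph, local_id):
    """Return a blank node id that is unique across source graphs.

    source_graph: URL (or any stable name) of the RDF graph being imported.
    local_id: the blank node identifier as it appears in that graph.
    """
    # A short digest of the source keeps the id compact and deterministic.
    scope = hashlib.sha1(source_graph.encode("utf-8")).hexdigest()[:8]
    return "bnode_%s_%s" % (scope, local_id)


first = scoped_bnode_id("http://example.org/graphA.ttl", "b0")
second = scoped_bnode_id("http://example.org/graphB.ttl", "b0")
print(first, second)
```

The same local identifier coming from two different files now maps to two different nodes, while re-importing the same file keeps producing the same identifiers.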

The importRDF stored procedure

As I mentioned at the beginning of the post, I’ve implemented these ideas in the form of a Neo4j stored procedure. The usage is pretty simple. It takes four arguments as input.

  • The url of the RDF data to import.
  • The type of serialization used. The most frequent serializations for RDF are JSON-LD, Turtle, RDF/XML, N-Triples and TriG. There are a couple more but these are the ones accepted by the stored proc for now.
  • A boolean indicating whether we want the names of labels, properties and relationships shortened as described in the “naming of things” section.
  • The periodicity of the commits. Number of triples ingested after which a commit is run.
CALL semantics.importRDF("file:///Users/jbarrasa/Downloads/opentox-example.turtle","Turtle", false, 500)

Will produce the following output:

Screen Shot 2016-06-08 at 23.39.43

UPDATE [16-Nov-2016] The stored procedure has been evolving over the last few months and the signature has changed. It now takes an extra boolean argument indicating whether category optimisation (Rule 3) is applied or not. I expect the code to keep evolving so take this post as an introduction to the approach and look for the latest on the implementation in the github repo README.

The URL can point at a local RDF file, like in the previous example or to one accessible via HTTP. The next example loads a public dataset with 3.5 million triples on food products, their ingredients, allergens, nutrition facts and much more from Open Food Facts.

CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML", true, 25000)

On my laptop the whole import took just over 4 minutes to produce this output.

Screen Shot 2016-06-09 at 00.45.38

 

When shortening of names is selected, the list of prefixes being used is included in the import summary. If you want to give it a try, don’t forget to create the following indexes beforehand, otherwise the stored procedure will abort the import and remind you:

CREATE INDEX ON :Resource(uri) 
CREATE INDEX ON :URI(uri)
CREATE INDEX ON :BNode(uri) 
CREATE INDEX ON :Class(uri)

Once imported, I can find straight away what’s the set of shared ingredients between your Kellogg’s Coco Pops cereals and a bag of pork pies that you can buy at your local Spar.

Screen Shot 2016-06-08 at 23.57.51

 

Below is the Cypher query that produces these results. Notice how the URIs have been shortened but uniqueness of names is preserved by prefixing them with a namespace prefix.

MATCH (prod1:Resource { uri: 'http://world-fr.openfoodfacts.org/produit/9310055537194/coco-pops-kellogg-s'})
MATCH (prod2:ns3_FoodProduct { ns3_name : '2 Snack Pork Pies'})
MATCH (prod1)-[:ns3_containsIngredient]->(x1)-[:ns3_food]->(sharedIngredient)<-[:ns3_food]-(x2)<-[:ns3_containsIngredient]-(prod2)
RETURN prod1, prod2, x1, x2, sharedIngredient

I’ve intentionally written the two MATCH blocks for the two products in different ways, one identifying the product by its unique identifier (URI) and the other combining the category and the name.

A couple of open points

There are a couple of things that I have not explored in this post and that the current implementation of the RDF importer does not deal with.

Multivalued properties

The current implementation does not deal with multivalued properties, although an obvious implementation could be to use arrays of values for this.
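The array idea can be sketched in Python as follows: the first value for a property is stored as is, and the property is promoted to a list as soon as a second, different value arrives for the same subject and predicate. The function name is made up and this is only one way of doing it. Keep in mind that arrays stored as Neo4j properties have to hold values of a single type, so mixed-type values would need extra handling.

```python
def add_property_value(node, key, value):
    """Set key on node, switching to a list when several distinct values appear."""
    if key not in node:
        node[key] = value
    elif isinstance(node[key], list):
        if value not in node[key]:
            node[key].append(value)
    elif node[key] != value:
        node[key] = [node[key], value]


page = {"uri": "ex:index.html"}
add_property_value(page, "dc:language", "en")
add_property_value(page, "dc:language", "fr")
add_property_value(page, "dc:language", "en")  # repeated statement, ignored
print(page)
```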

And the metadata?

This works great for instance data, but there is a little detail to take into account: An RDF graph can contain metadata statements. This means that you can find in the same graph (JB, rdf:type, Person) and (Person, rdf:type, owl:Class) and even (rdf:type, rdf:type, rdf:Property). The post on Building a semantic graph in Neo4j gives some ideas on how to deal with RDF metadata but this is a very interesting topic and I’ll be coming back to it in future posts.

Conclusions

Migrating data from an RDF graph into a property graph like the one implemented by Neo4j can be done in a generic and relatively straightforward way as we’ve seen. This is interesting because it gives an automated way of importing your existing RDF graphs (regardless of your serialization: JSON-LD, RDF/XML, Turtle, etc.) into Neo4j without loss of its graph nature and without having to go through any intermediate flattening step.

The import process being totally generic results in a graph in Neo4j that of course inherits the modelling limitations of RDF like the lack of support for attributes on relationships so you will probably want to enrich / fix your raw graph once it’s been loaded in Neo4j. Both potential improvements to the import process and post-import graph processing will be discussed in future posts. Watch this space.

Building a semantic graph in Neo4j

There are two key characteristics of RDF stores (aka triple stores): the first and by far the most relevant is that they represent, store and query data as a graph. The second is that they are semantic, which is a rather pompous way of saying that they can store not only data but also explicit descriptions of the meaning of that data. The RDF and linked data community often refer to these explicit descriptions as ontologies. In case you’re not familiar with the concept, an ontology is a machine readable description of a domain that typically includes a vocabulary of terms and some specification of how these terms inter-relate, imposing a structure on the data for such domain. This is also known as a schema. In this post both terms schema and ontology will be used interchangeably to refer to these explicitly described semantics.

Making the semantics of your data explicit in an ontology will enable data and/or knowledge exchange and interoperability which will be useful in some situations. In other scenarios you may want to use your ontology to run generic inferencing on your data to derive new facts from existing ones. Another similar use of explicit semantics would be to run domain specific consistency checks on the data which is the specific use that I’ll base my examples on in this post. The list of possible uses goes on but these are some of the most common ones.

Right, so let’s clarify what I mean by domain specific consistency because it is quite different from integrity constraints in the relational style. I’m actually talking about the definition of rules such as “if a data point is connected to a Movie through the ACTED_IN relationship then it’s an Actor” or “if a User is connected to a BlogEntry through a POSTED relationship then it’s an ActiveUser”. We’ll look first at how to express these kinds of consistency checks and then I’ll show how they can be applied in three different ways:

  • Controlling data insertion into the database at a transaction level by checking before committing that the state in which a given change leaves the database would still be consistent with the semantics in the schema, and rolling back the transaction if not.
  • Using the ontology to drive a generic front end guaranteeing that only data that keeps the graph consistent is added through such front end.
  • Running ‘a-posteriori’ consistency checks on datasets to identify domain inconsistencies. While the previous two could be considered preventive, this approach would be corrective. An example of this could be when you build a data set (a graph in our case) by bringing data from different sources and you want to check if they link together in a consistent way according to some predefined schema semantics.

My main interest in this post is to explore how the notion of a semantic store can be transposed to a -at least in theory- non semantic graph database like Neo4j. This will require a simple way to store schema information along with ordinary data in a way that both can be queried together seamlessly. This is not a problem with Neo4j+Cypher as you’ll see if you keep reading. We will also need a formalism, a vocabulary to express semantics in Neo4j, and in order to do that, I’ll pick some of the primitives in OWL and RDFS to come up with a basic ontology definition language as you will see in the next section. I will show how to use it by modelling a simple domain and prototyping ways of checking the consistency of the data in your graph according to the defined schema.

Defining an RDFS/OWL style ontology in Neo4j

The W3C recommends the use of RDFS and OWL for the definition of ontologies. For my example, I will use a language inspired by these two, in which I’ll borrow some elements like owl:Class, owl:ObjectProperty and owl:DatatypeProperty, rdfs:domain and rdfs:range, etc. If you’ve worked with any of these before, what follows will definitely look familiar.

Let’s start with a brief explanation of the main primitives in our basic schema definition language: A Class defines a category which in Neo4j exists as a node label. A DatatypeProperty describes an attribute in Neo4j (each of the individual key-value pairs in either nodes or relationships) and an ObjectProperty describes a relationship in Neo4j. The domain and range properties in an ontology are used to further describe ObjectProperty and DatatypeProperty definitions. In the case of an ObjectProperty, the domain and range specify the source and the target classes of the nodes connected by instances of the ObjectProperty. Similarly, in the case of a DatatypeProperty, the domain will specify the class of nodes holding values for such property and the range, when present, can be used to specify the XML Schema datatype of the property values. Note that because the property graph model accepts properties on relationships, the domain of a DatatypeProperty can be an ObjectProperty, which is not valid in the RDF world (and actually not even representable without using quite complicated workarounds, but this is another discussion). If you have not been exposed to the RDF model before this can all sound a bit too ‘meta’ so if you have time (a lot) you can find a more detailed description of the different elements in the OWL language reference or probably better just keep reading because it’s easier than it may initially seem and most probably the examples will help clarify things.

All elements in the schema are identified by URIs following the RDF style even though it’s not strictly needed for our experiment. What is definitely needed is a reference to how each element in the ontology is actually used in the Neo4j data and this name is stored in the label property. Finally there is a comment with a human friendly description of the element in question.

If you’ve downloaded Neo4j you may be familiar by now with the movie database. You can load it in your neo4j instance by running :play movies on your browser. I’ll use this database for my experiment; here is a fragment of the ontology I’ve built for it. You can see the Person class, the name DatatypeProperty and the ACTED_IN ObjectProperty defined as described in the previous paragraphs.

// A Class definition (a node label in Neo4j)
(person_class:Class {	uri:'http://neo4j.com/voc/movies#Person',
			label:'Person',
			comment:'Individual involved in the film industry'})

// A DatatypeProperty definition (a property in Neo4j) 
(name_dtp:DatatypeProperty {	uri:'http://neo4j.com/voc/movies#name',
				label:'name',
				comment:"A person's name"}),
(name_dtp)-[:DOMAIN]->(person_class)

// An ObjectProperty definition (a relationship in Neo4j) 
(actedin_op:ObjectProperty { 	uri:'http://neo4j.com/voc/movies#ACTED_IN',
				label:'ACTED_IN',
				comment:'Actor had a role in film'}),
(person_class)<-[:DOMAIN]-(actedin_op)-[:RANGE]->(movie_class)

The whole code of the ontology can be found in the github repository jbarrasa/explicitsemanticsneo4j. Grab it and run it on your Neo4j instance and you should be able to visualise it in the Neo4j browser. It should look something like this:

Screen Shot 2016-03-17 at 16.51.21

You may want to write your ontology from scratch in Cypher as I’ve just done but it is also possible that if you have some existing OWL or RDFS ontologies (here is an OWL version of the same movies ontology) you will want a generic way of translating them into Neo4j ontologies like the previous one, and that’s exactly what this Neo4j stored procedure does. So an alternative to running the previous Cypher could be to deploy and run this stored procedure. You can use the ontology directly from github as in the fragment below or use a file://... URI if you have the ontology in your local drive.

CALL semantics.LiteOntoImport('https://raw.githubusercontent.com/jbarrasa/explicitsemanticsneo4j/master/moviesontology.owl','RDF/XML')
+=================+==============+=========+
|terminationStatus|elementsLoaded|extraInfo|
+=================+==============+=========+
|OK               |16            |         |
+-----------------+--------------+---------+

Checking data consistency with Cypher

So I have now a language to describe basic ontologies in Neo4j but if I want to use it in any interesting way (other than visualising it colorfully as we’ve seen), I will need to implement mechanisms to exploit schemas defined using this language. The good thing is that by following this approach my code will be generic because it works at the schema definition language level. Let’s see what that means exactly with an example. A rule that checks the consistent usage of relationships in a graph could be written in Cypher as follows:

// ObjectProperty domain semantics check
MATCH (n:Class)<-[:DOMAIN]-(p:ObjectProperty) 
WITH n.uri as class, n.label as classLabel, p.uri as prop, p.label as propLabel 
MATCH (x)-[r]->() WHERE type(r)=propLabel AND NOT classLabel in Labels(x) 
RETURN id(x) AS nodeUID, 'domain of ' + propLabel + ' [' + prop + ']' AS `check failed`, 
'Node labels: (' + reduce(s = '', l IN Labels(x) | s + ' ' + l) + ') should include ' + classLabel AS extraInfo

This query scans the data in your DB looking for illegal usage of relationships according to your ontology. If in your ontology you state that ACTED_IN is a relationship between a Person and a Movie then this rule will pick up situations where this is not true. Let me try to describe very briefly the semantics it implements. Since our schema definition language is inspired by RDFS and OWL, it makes sense to follow their standard semantics. Our Cypher works on the :DOMAIN relationship. The :DOMAIN relationship, when defined between an ObjectProperty p and a Class c, states that any node in a dataset that has a value for that particular property p is an instance of the class c. So for example when I state (actedin_op)-[:DOMAIN]->(person_class), I mean that Persons are the subject of the ACTED_IN predicate, or in other words, if a node is connected to another node through the ACTED_IN relationship, then it should be labeled as :Person because only a person can act in a movie.

Ok, so back to my point on generic code. This consistency check in Cypher is completely domain agnostic. There is no mention of Person, or Movie or ACTED_IN…  it only uses the primitives in the ontology definition language (DOMAIN, DatatypeProperty, Class, etc.). This means that as long as a schema is defined in terms of these primitives this rule will pick up eventual inconsistencies in a Neo4j graph. It’s kind of a meta-rule.
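To make the ‘meta-rule’ idea a bit more tangible outside Cypher, here is the same domain check sketched in Python over plain in-memory structures. It is only an illustration under simplified assumptions (the ontology is reduced to pairs of class label and relationship type, and the graph to a dictionary of labels plus a list of relationships); the function and variable names are invented for the example.

```python
def check_object_property_domains(domains, node_labels, relationships):
    """Return the relationships whose start node lacks the declared domain label.

    domains: list of (class label, relationship type) pairs, one for each
             (:Class)<-[:DOMAIN]-(:ObjectProperty) pattern in the ontology.
    node_labels: dict mapping node id -> set of labels.
    relationships: list of (start node id, relationship type, end node id).
    """
    violations = []
    for class_label, rel_type in domains:
        for start, current_type, _end in relationships:
            if current_type == rel_type and class_label not in node_labels[start]:
                violations.append((start, "domain of " + rel_type, class_label))
    return violations


# The rule itself never mentions Person, Movie or ACTED_IN; they only appear as data.
domains = [("Person", "ACTED_IN")]
node_labels = {1: {"Person"}, 2: {"Movie"}, 3: {"Movie"}}
relationships = [(1, "ACTED_IN", 2), (3, "ACTED_IN", 2)]

print(check_object_property_domains(domains, node_labels, relationships))
```

Swapping in a different ontology (a different list of domain pairs) changes what gets flagged without touching the function, which is the whole point of working at the schema definition language level.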

Ok, so I’ve implemented a couple more meta-rules for consistency checking to play with in this example but I leave it to you, interested reader, to experiment and extend the set and/or tune it to your specific needs.

Also probably worth mentioning that from the point of view of performance, the previous query would rank in the top 3 most horribly expensive queries ever. It scans all relationships in all nodes in the graph… but I won’t care much about this here, I’d just say that if you were to implement something like this to work on large graphs, you would most likely write some server-side java code probably something like this stored procedure, but that’s another story.

Here is another Cypher rule very similar to the previous one, except that it applies to relationship attributes instead. The semantics of the :DOMAIN primitive are the same when defined on DatatypeProperties (describing neo4j attributes) or on ObjectProperties (describing relationships) so if you got the idea in the previous example this will be basically the same.

// DatatypeProperties on ObjectProperty domain semantics meta-rule (property graph specific. attributes on relationships) 
MATCH (r:ObjectProperty)<-[:DOMAIN]-(p:DatatypeProperty) 
WITH r.uri as rel, r.label as relLabel, p.uri as prop, p.label as propLabel 
MATCH ()-[r]->() WHERE r[propLabel] IS NOT NULL AND relLabel<>type(r) 
RETURN id(r) AS relUID, 'domain of ' + propLabel + ' [' + prop + ']' AS `check failed`, 'Rel type: ' + type(r) + ' but should be ' + relLabel AS extraInfo 

If I run this query on the movie database I should get no results because the data in the graph is consistent with the ontology: Only persons act in movies, only movies have titles, only persons have a date of birth, and so on. But we can disturb this boring peace by setting a property on a relationship like this:

MATCH (:Person {name:'Demi Moore'})-[r:ACTED_IN]->(:Movie {title:'A Few Good Men'}) 
SET r.rating = 88

This is wrong: we are setting a value for the attribute rating on a relationship where a rating does not belong. If we now re-run the previous consistency check query, it should produce some results:

+======+=====================================================+=========================================+
|relUID|check failed                                         |extraInfo                                |
+======+=====================================================+=========================================+
|64560 |domain of rating [http://neo4j.com/voc/movies#rating]|Rel type: ACTED_IN but should be REVIEWED|
+------+-----------------------------------------------------+-----------------------------------------+

Our ontology stated that the rating property was exclusive to the :REVIEWED relationship, so our data is now inconsistent with that. I can fix the problem by unsetting the value of the property with the following Cypher and get the graph back to a consistent state:

MATCH (:Person {name:'Demi Moore'})-[r:ACTED_IN]->(:Movie {title:'A Few Good Men'}) 
REMOVE r.rating

Right, so I could extend the meta-model primitives by adding things like hierarchical relationships between classes like RDFS’s SubClassOf or even more advanced ones like OWL’s Restriction elements or disjointness relationships but my objective today was to introduce the concept, not to define a full ontology language for Neo4j. In general, the choice of the schema language will depend on the level of expressivity that one needs in the schema. In the RDF world you will decide whether you want to use RDFS if your model is simple or OWL if you want to express more complex semantics. The cost, of course, is more expensive processing both for reasoning and consistency checking. Similarly, if you go down the DIY approach that we are following here, keep in mind that for every primitive added to your language, the corresponding meta-rules, stored procedures or any alternative implementation of its semantics, will be required too.

The use of meta-data definition languages with well-defined semantics is pretty powerful, as we’ve seen, because it enables the construction of general purpose engines that are based only on the language primitives and are therefore reusable across domains. Examples of this idea are the set of Cypher rules and the stored procedure linked above. You can try to reuse them on other Neo4j databases just by defining an ontology in the language we’ve used here. Finally, when we combine this approach with the fact that the schema definition itself is stored as ordinary data, we get a pretty dynamic setup: storing Classes, ObjectProperties, etc. as nodes and relationships in Neo4j means that they may evolve over time and that there are no precompiled rules or static/hardcoded logic to detect consistency violations.

This is precisely the kind of approach that RDF stores follow. In addition to storing data as RDF triples and offering query capabilities for it, if you make your model ontology explicit by using RDFS or OWL, you get inferencing and consistency checking out of the box.
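To make the idea of a domain-agnostic engine more tangible, here is a minimal in-memory sketch in Python of what a "domain of a datatype property" rule checks. It is only an analogue of the Cypher meta-rules, not their actual code: the function name check_property_domains and the dictionary shapes used for the schema and the nodes are assumptions made for this illustration.

```python
def check_property_domains(schema, nodes):
    """Report nodes that carry a property without the label its domain requires.

    schema: maps a property name to the class label declared as its domain
            (hypothetical shape, chosen for this sketch).
    nodes:  list of dicts with 'id', 'labels' and 'props' keys
            (also a hypothetical shape).
    """
    failures = []
    for node in nodes:
        for prop in node["props"]:
            required = schema.get(prop)
            # Properties that the schema does not describe are ignored.
            if required is not None and required not in node["labels"]:
                failures.append({
                    "nodeUID": node["id"],
                    "check failed": f"domain of property {prop}",
                    "extraInfo": f"Node labels: ({', '.join(node['labels'])}) "
                                 f"should include {required}",
                })
    return failures


# The schema is plain data, so the same function works for any domain.
movie_schema = {"name": "Person", "title": "Movie",
                "tagline": "Movie", "released": "Movie"}
sample_nodes = [
    {"id": 1, "labels": ["Person"], "props": {"name": "Marlon Brando"}},
    {"id": 2, "labels": ["Thing"], "props": {"name": "Marlon Brando"}},
]
failures = check_property_domains(movie_schema, sample_nodes)
```

The function knows nothing about movies. Swapping movie_schema for the property-to-class mapping of another domain is all it takes to reuse it, which is the same property the Cypher meta-rules get from reading the ontology stored in the graph.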

Consistency checking at transaction level

So far we’ve used our ontology to drive the execution of meta-rules to check the consistency of a data set after some change was made (an attribute was added to a relationship in our example). That is one possible way of doing things (a posteriori check) but I may want to use the explicit semantics I’ve added to the graph in Neo4j in a more real-time scenario. As I described before, I may want transactions to be committed only if they leave the graph in a consistent state, and that’s what these few lines of Python do.

The logic of the consistency check is on the client-side, which may look a bit strange but my intention was to make the whole thing as explicit and clear as I possibly could. The check consists of running the whole set of individual meta-rules defined in the previous sections one by one and breaking if any of them picks up an inconsistency of any type. The code requires the meta-rules to be available in the Neo4j server and the way I’ve done it is by storing each of them individually as ConsistencyCheck nodes with the Cypher code stored as a property. Something like this:

CREATE (ic:ConsistencyCheck { ccid:1, ccname: 'DTP_DOMAIN', 
cccypher: 'MATCH (n:Class)<-[:DOMAIN]-(p:Datatyp ... '})

The code of the Cypher meta-rule in the cccypher property has been truncated in this snippet but you can view the whole lot on GitHub. Now the transaction consistency checker can grab all the meta-rules and cache them (line 25 of the Python code) with this simple query:

MATCH (cc:ConsistencyCheck)
RETURN cc.ccid AS ccid, cc.ccname AS ccname, cc.cccypher AS cccypher

that returns the batch of individual meta-rules to be run. Something like this:

Screen Shot 2016-03-18 at 23.18.41.png
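The client-side logic described above (load the meta-rules, apply the change, run every rule and stop at the first one that reports a problem) can be sketched roughly as follows. This is not the original script, just a minimal outline under some assumptions: the function name run_checked and the constant LOAD_RULES are made up for the illustration, and tx is assumed to be any object offering run, commit and rollback methods, where run returns an iterable of rows that can be converted with dict().

```python
# Query that fetches the stored meta-rules, as shown above.
LOAD_RULES = (
    "MATCH (cc:ConsistencyCheck) "
    "RETURN cc.ccid AS ccid, cc.ccname AS ccname, cc.cccypher AS cccypher"
)


def run_checked(tx, update_cypher):
    """Apply update_cypher inside tx and commit only if every meta-rule passes.

    Returns (True, []) after a commit, or (False, violations) after a rollback.
    'tx' is an assumed interface: run(query), commit() and rollback().
    """
    # Cache the meta-rules before touching the data.
    rules = [dict(row) for row in tx.run(LOAD_RULES)]

    # Apply the change; it stays uncommitted until the checks have passed.
    tx.run(update_cypher)

    # Run the meta-rules one by one and stop at the first inconsistency.
    for rule in rules:
        violations = [dict(row) for row in tx.run(rule["cccypher"])]
        if violations:
            tx.rollback()
            return False, violations

    tx.commit()
    return True, []
```

With the official Neo4j Python driver, the transaction object returned by session.begin_transaction() provides run, commit and rollback, so it should be usable as tx here. The sketch rolls back on the first rule that returns any row, which mirrors the "breaking if any of them picks up an inconsistency" behaviour described earlier.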

The code can be tested with different Cypher fragments. If I try to populate the graph with data that is consistent with the schema then things will go smoothly:

$ python transactional.py " CREATE (:Person { name: 'Marlon Brando'})-[:ACTED_IN]->(:Movie{ title: 'The Godfather'}) "
Consistency Checks passed. Transaction committed

Now if I try to update the graph in inconsistent ways, the meta-rules should pick this up. Let’s try to insert a node labeled as :Thing with an attribute that’s meant to be used by Person nodes:

$ python transactional.py " CREATE (:Thing { name: 'Marlon Brando'}) "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                               | extraInfo                                  
---+---------+------------------------------------------------------------+---------------------------------------------
 1 |    7231 | domain of property name [http://neo4j.com/voc/movies#name] | Node labels: ( Thing) should include Person

Or link an existing actor (:Person) to something that is not a movie through the :ACTED_IN relationship:

$ python transactional.py " MATCH (mb:Person { name: 'Marlon Brando'}) CREATE (mb)-[:ACTED_IN]->(:Play { playTitle: 'The mousetrap'}) "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                              | extraInfo                                
---+---------+-----------------------------------------------------------+-------------------------------------------
 1 |    7241 | domain of ACTED_IN [http://neo4j.com/voc/movies#ACTED_IN] | Node labels: ( Play) should include Movie

I can also try to update the labels of a few nodes, causing a bunch of inconsistencies:

$ python transactional.py " MATCH (n:Person { name: 'Kevin Bacon'})-[:ACTED_IN]->(bm) REMOVE bm:Movie SET bm:BaconMovie  "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                                       | extraInfo                                      
---+---------+--------------------------------------------------------------------+-------------------------------------------------
 1 |    7046 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 2 |    7166 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 3 |    7173 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 4 |    7046 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 5 |    7166 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 6 |    7173 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 7 |    7046 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
 8 |    7166 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
 9 |    7173 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie

Again a word of warning about the implementation. My intention was to make the example didactic and easy to understand, but in a real-world scenario you would probably go with a different implementation, most certainly involving some server-side code like this stored procedure. One option would be to write the logic of running the consistency checks and either committing or rolling back the transaction as an unmanaged extension, expose it as a ‘consistency-protected’ Cypher HTTP endpoint and invoke it from your client.

Consistency check via dynamic UI generation

Another way of guaranteeing the consistency of the data in your graph when it is manually populated through a front-end is to build a generic UI driven by your model ontology. Let’s say, for example, that you need to generate a web form to enter new movies. Well, you can retrieve both the structure of your form and the data insertion query with this bit of Cypher:

MATCH (c:Class { label: {className}})<-[:DOMAIN]-(prop:DatatypeProperty) 
RETURN { cname: c.label, catts: collect( {att:prop.label})} as object, 
 'MERGE (:' + c.label + ' { uri:{uri}' + reduce(s = "", x IN collect(prop.label) | s + ',' +x + ':{' + x + '}' ) +'})' as cypher

Again, a generic (domain agnostic) query that works on your schema. The query takes a parameter ‘className’. When set to ‘Movie’, you get in the response all the info you need to generate your movie insertion UI. Here’s a fragment of the JSON structure returned by Neo4j:

 "row": [
            {
              "cname": "Movie",
              "catts": [
                {
                  "att": "tagline"
                },
                {
                  "att": "title"
                },
                {
                  "att": "released"
                }
              ]
            },
            "MERGE (:Movie { uri:{uri},tagline:{tagline},title:{title},released:{released}})"
          ]
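As a rough illustration of how a client could consume that response, the Python sketch below turns the object part of the row into an HTML form and collects the submitted values into the parameters expected by the generated MERGE statement. The function names render_form and build_params, the form action path and the HTML layout are all assumptions made for this example. Only the cname and catts keys come from the query above.

```python
from html import escape


def field_names(obj):
    """List the inputs the form needs: the uri plus one per datatype property."""
    return ["uri"] + [entry["att"] for entry in obj["catts"]]


def render_form(obj, action="/new"):
    """Build a plain HTML form from the 'object' part of the query result."""
    lines = [
        f'<form method="post" action="{escape(action)}">',
        f'  <h2>New {escape(obj["cname"])}</h2>',
    ]
    for name in field_names(obj):
        safe = escape(name)
        lines.append(f'  <label>{safe} <input name="{safe}"></label>')
    lines.append('  <button type="submit">Save</button>')
    lines.append('</form>')
    return "\n".join(lines)


def build_params(obj, submitted):
    """Keep only the values that match the placeholders of the generated Cypher."""
    return {name: submitted.get(name) for name in field_names(obj)}


# The object returned for className = 'Movie', as in the fragment above.
movie = {"cname": "Movie",
         "catts": [{"att": "tagline"}, {"att": "title"}, {"att": "released"}]}
form_html = render_form(movie)
params = build_params(movie, {"uri": "http://example.org/m/1",
                              "title": "The Godfather",
                              "unexpected": "ignored"})
```

The params dictionary can then be sent together with the generated MERGE string as query parameters. Because the form fields are derived from the ontology, a user has no way of entering an attribute that the schema does not define for the class.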

Conclusion

With this simple example, I’ve tried to prove that it’s relatively straightforward to model and store explicit domain semantics in Neo4j, making it effectively a semantic graph. Too often I hear or read that ‘being semantic’ is a key differentiator between RDF stores and other graph databases. By explaining exactly what it means to ‘be semantic’ and experimenting with how this concept can be transposed to a property graph like Neo4j, I’ve tried to show that such a difference is not an essential one.

I’ve focused on using the explicit semantics to run consistency checks in a couple of different ways but the concept could be extended to other uses such as automatically inferring new facts from existing ones, for instance.