Building a semantic graph in Neo4j

There are two key characteristics of RDF stores (aka triple stores): the first and by far the most relevant is that they represent, store and query data as a graph. The second is that they are semantic, which is a rather pompous way of saying that they can store not only data but also explicit descriptions of the meaning of that data. The RDF and linked data community often refers to these explicit descriptions as ontologies. In case you’re not familiar with the concept, an ontology is a machine-readable description of a domain that typically includes a vocabulary of terms and some specification of how these terms inter-relate, imposing a structure on the data for that domain. This is also known as a schema. In this post the terms schema and ontology will be used interchangeably to refer to these explicitly described semantics.

Making the semantics of your data explicit in an ontology enables data and/or knowledge exchange and interoperability, which will be useful in some situations. In other scenarios you may want to use your ontology to run generic inferencing on your data to derive new facts from existing ones. Another similar use of explicit semantics would be to run domain specific consistency checks on the data, which is the specific use that I’ll base my examples on in this post. The list of possible uses goes on but these are some of the most common ones.

Right, so let’s clarify what I mean by domain specific consistency because it is quite different from integrity constraints in the relational style. I’m actually talking about the definition of rules such as “if a data point is connected to a Movie through the ACTED_IN relationship then it’s an Actor” or “if a User is connected to a BlogEntry through a POSTED relationship then it’s an ActiveUser”. We’ll look first at how to express these kinds of consistency checks and then I’ll show how they can be applied in three different ways:

  • Controlling data insertion into the database at a transaction level by checking before committing that the state in which a given change leaves the database would still be consistent with the semantics in the schema, and rolling back the transaction if not.
  • Using the ontology to drive a generic front end guaranteeing that only data that keeps the graph consistent is added through such front end.
  • Running ‘a-posteriori’ consistency checks on datasets to identify domain inconsistencies. While the previous two could be considered preventive, this approach would be corrective. An example of this could be when you build a data set (a graph in our case) by bringing data from different sources and you want to check if they link together in a consistent way according to some predefined schema semantics.

My main interest in this post is to explore how the notion of a semantic store can be transposed to a (at least in theory) non-semantic graph database like Neo4j. This will require a simple way to store schema information along with ordinary data so that both can be queried together seamlessly. This is not a problem with Neo4j+Cypher as you’ll see if you keep reading. We will also need a formalism, a vocabulary to express semantics in Neo4j, and in order to do that, I’ll pick some of the primitives in OWL and RDFS to come up with a basic ontology definition language as you will see in the next section. I will show how to use it by modelling a simple domain and prototyping ways of checking the consistency of the data in your graph according to the defined schema.

Defining an RDFS/OWL style ontology in Neo4j

The W3C recommends the use of RDFS and OWL for the definition of ontologies. For my example, I will use a language inspired by these two, in which I’ll borrow some elements like owl:Class, owl:ObjectProperty and owl:DatatypeProperty, rdfs:domain and rdfs:range, etc. If you’ve worked with any of these before, what follows will definitely look familiar.

Let’s start with a brief explanation of the main primitives in our basic schema definition language. A Class defines a category, which in Neo4j exists as a node label. A DatatypeProperty describes an attribute in Neo4j (each of the individual key-value pairs on nodes or relationships) and an ObjectProperty describes a relationship in Neo4j. The domain and range properties in an ontology are used to further describe ObjectProperty and DatatypeProperty definitions. In the case of an ObjectProperty, the domain and range specify the source and the target classes of the nodes connected by instances of the ObjectProperty. Similarly, in the case of a DatatypeProperty, the domain specifies the class of nodes holding values for that property and the range, when present, can be used to specify the XMLSchema datatype of the property values. Note that because the property graph model accepts properties on relationships, the domain of a DatatypeProperty can be an ObjectProperty, which is not valid in the RDF world (and actually not even representable without using quite complicated workarounds, but this is another discussion).

If you have not been exposed to the RDF model before, this can all sound a bit too ‘meta’, so if you have time (a lot) you can find a more detailed description of the different elements in the OWL language reference. Probably better, just keep reading, because it’s easier than it may initially seem and the examples will most probably help clarify things.

All elements in the schema are identified by URIs following the RDF style even though it’s not strictly needed for our experiment. What is definitely needed is a reference to how each element in the ontology is actually used in the Neo4j data and this name is stored in the label property. Finally there is a comment with a human friendly description of the element in question.
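To make the mapping between schema primitives and Neo4j constructs a bit more tangible, here is a minimal Python sketch of the same vocabulary. It is only an illustration: the class names mirror the primitives described above, but the `range_` field name, the `neo4j_construct` helper and the use of plain label strings for domain and range are my own assumptions, not part of the ontology code in the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaElement:
    """Fields shared by every element of the schema."""
    uri: str      # RDF-style identifier
    label: str    # the name actually used in the Neo4j data
    comment: str  # human friendly description


@dataclass(frozen=True)
class Class(SchemaElement):
    """Describes a node label."""


@dataclass(frozen=True)
class ObjectProperty(SchemaElement):
    """Describes a relationship type."""
    domain: Optional[str] = None   # label of the source class
    range_: Optional[str] = None   # label of the target class


@dataclass(frozen=True)
class DatatypeProperty(SchemaElement):
    """Describes a property key on nodes or relationships."""
    domain: Optional[str] = None   # class label or relationship type
    range_: Optional[str] = None   # optional datatype of the values


def neo4j_construct(element: SchemaElement) -> str:
    """Return the Neo4j construct that a schema element describes."""
    return {
        Class: "node label",
        ObjectProperty: "relationship type",
        DatatypeProperty: "property key",
    }[type(element)]


person = Class(
    uri="http://neo4j.com/voc/movies#Person",
    label="Person",
    comment="Individual involved in the film industry",
)
acted_in = ObjectProperty(
    uri="http://neo4j.com/voc/movies#ACTED_IN",
    label="ACTED_IN",
    comment="Actor had a role in film",
    domain="Person",
    range_="Movie",
)

for element in (person, acted_in):
    print(element.label, "->", neo4j_construct(element))
```

In the graph itself these elements are of course nodes and relationships rather than Python objects, which is what makes it possible to query schema and data together.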

If you’ve downloaded Neo4j you may be familiar by now with the movie database. You can load it in your Neo4j instance by running :play movies in the Neo4j browser. I’ll use this database for my experiment; here is a fragment of the ontology I’ve built for it. You can see the Person class, the name DatatypeProperty and the ACTED_IN ObjectProperty defined as described in the previous paragraphs.

// A Class definition (a node label in Neo4j)
(person_class:Class { uri:'http://neo4j.com/voc/movies#Person',
                      label:'Person',
                      comment:'Individual involved in the film industry'}),

// A DatatypeProperty definition (a property in Neo4j)
(name_dtp:DatatypeProperty { uri:'http://neo4j.com/voc/movies#name',
                             label:'name',
                             comment:'A person\'s name'}),
(name_dtp)-[:DOMAIN]->(person_class),

// An ObjectProperty definition (a relationship in Neo4j)
// (movie_class is defined in the part of the ontology not shown here)
(actedin_op:ObjectProperty { uri:'http://neo4j.com/voc/movies#ACTED_IN',
                             label:'ACTED_IN',
                             comment:'Actor had a role in film'}),
(person_class)<-[:DOMAIN]-(actedin_op)-[:RANGE]->(movie_class)

The whole code of the ontology can be found in the github repository jbarrasa/explicitsemanticsneo4j. Grab it and run it on your Neo4j instance and you should be able to visualise it in the Neo4j browser. It should look something like this:

[Screenshot: the movies ontology visualised in the Neo4j browser]

You may want to write your ontology from scratch in Cypher as I’ve just done. But if you have some existing OWL or RDFS ontologies (here is an OWL version of the same movies ontology) you will probably want a generic way of translating them into Neo4j ontologies like the previous one, and that’s exactly what this Neo4j stored procedure does. So an alternative to running the previous Cypher could be to deploy and run this stored procedure. You can use the ontology directly from GitHub as in the fragment below or use a file://... URI if you have the ontology on your local drive.

CALL semantics.LiteOntoImport('https://raw.githubusercontent.com/jbarrasa/explicitsemanticsneo4j/master/moviesontology.owl','RDF/XML')
+=================+==============+=========+
|terminationStatus|elementsLoaded|extraInfo|
+=================+==============+=========+
|OK               |16            |         |
+-----------------+--------------+---------+

Checking data consistency with Cypher

So I now have a language to describe basic ontologies in Neo4j but if I want to use it in any interesting way (other than visualising it colorfully as we’ve seen), I will need to implement mechanisms to exploit schemas defined using this language. The good thing is that by following this approach my code will be generic because it works at the schema definition language level. Let’s see what that means exactly with an example. A rule that checks the consistent usage of relationships in a graph could be written in Cypher as follows:

// ObjectProperty domain semantics check
MATCH (n:Class)<-[:DOMAIN]-(p:ObjectProperty) 
WITH n.uri as class, n.label as classLabel, p.uri as prop, p.label as propLabel 
MATCH (x)-[r]->() WHERE type(r)=propLabel AND NOT classLabel in Labels(x) 
RETURN id(x) AS nodeUID, 'domain of ' + propLabel + ' [' + prop + ']' AS `check failed`, 
'Node labels: (' + reduce(s = '', l IN Labels(x) | s + ' ' + l) + ') should include ' + classLabel AS extraInfo

This query scans the data in your DB looking for illegal usage of relationships according to your ontology. If in your ontology you state that ACTED_IN is a relationship between a Person and a Movie then this rule will pick up situations where this is not true. Let me try to describe very briefly the semantics it implements. Since our schema definition language is inspired by RDFS and OWL, it makes sense to follow their standard semantics. Our Cypher works on the :DOMAIN relationship. The :DOMAIN relationship, when defined between an ObjectProperty p and a Class c, states that any node in a dataset that has a value for that particular property p is an instance of the class c. So for example when I state (actedin_op)-[:DOMAIN]->(person_class), I mean that Persons are the subject of the ACTED_IN predicate, or in other words, if a node is connected to another node through the ACTED_IN relationship, then it should be labeled as :Person because only a person can act in a movie.
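If the Cypher feels dense, the same domain semantics can be spelled out on a toy in-memory graph. The following Python sketch is not the meta-rule itself, just an illustration of its logic; the node ids, the `Thing` label and the `check_domains` function are made up for the example.

```python
from typing import Dict, List, Set, Tuple

# Toy data (hypothetical): node id -> labels, and (source, type, target) edges.
nodes: Dict[int, Set[str]] = {
    1: {"Person"},
    2: {"Movie"},
    3: {"Thing"},
}
edges: List[Tuple[int, str, int]] = [
    (1, "ACTED_IN", 2),  # consistent: the source node is a Person
    (3, "ACTED_IN", 2),  # inconsistent: the source node is not a Person
]

# Schema: relationship type -> class label declared as its domain.
domains: Dict[str, str] = {"ACTED_IN": "Person"}


def check_domains(
    nodes: Dict[int, Set[str]],
    edges: List[Tuple[int, str, int]],
    domains: Dict[str, str],
) -> List[Tuple[int, str, str]]:
    """Return (node id, relationship type, expected label) per violation."""
    violations = []
    for source, rel_type, _target in edges:
        expected = domains.get(rel_type)
        # Only relationship types with a declared domain are checked.
        if expected is not None and expected not in nodes[source]:
            violations.append((source, rel_type, expected))
    return violations


violations = check_domains(nodes, edges, domains)
for node_id, rel_type, expected in violations:
    print(f"node {node_id}: domain of {rel_type} should include {expected}")
```

Note that the function never mentions Person or ACTED_IN; everything domain specific lives in the `domains` mapping, just as it lives in the ontology nodes in the Cypher version.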

Ok, so back to my point on generic code. This consistency check in Cypher is completely domain agnostic. There is no mention of Person, or Movie or ACTED_IN… it only uses the primitives in the ontology definition language (DOMAIN, DatatypeProperty, Class, etc.). This means that as long as a schema is defined in terms of these primitives this rule will pick up any inconsistencies in a Neo4j graph. It’s kind of a meta-rule.

Ok, so I’ve implemented a couple more meta-rules for consistency checking to play with in this example but I leave it to you, interested reader, to experiment and extend the set and/or tune it to your specific needs.

It’s also probably worth mentioning that from the point of view of performance, the previous query would rank in the top 3 most horribly expensive queries ever. It scans all relationships of all nodes in the graph… but I won’t care much about this here. I’d just say that if you were to implement something like this to work on large graphs, you would most likely write some server-side Java code, probably something like this stored procedure, but that’s another story.

Here is another Cypher rule very similar to the previous one, except that it applies to relationship attributes instead. The semantics of the :DOMAIN primitive are the same when defined on DatatypeProperties (describing Neo4j attributes) or on ObjectProperties (describing relationships) so if you got the idea in the previous example this will be basically the same.

// DatatypeProperties on ObjectProperty domain semantics meta-rule (property graph specific: attributes on relationships)
// op is the schema node; r is rebound to the data relationships after the WITH
MATCH (op:ObjectProperty)<-[:DOMAIN]-(p:DatatypeProperty)
WITH op.uri AS rel, op.label AS relLabel, p.uri AS prop, p.label AS propLabel
MATCH ()-[r]->() WHERE r[propLabel] IS NOT NULL AND relLabel <> type(r)
RETURN id(r) AS relUID, 'domain of ' + propLabel + ' [' + prop + ']' AS `check failed`, 'Rel type: ' + type(r) + ' but should be ' + relLabel AS extraInfo

If I run this query on the movie database I should get no results because the data in the graph is consistent with the ontology: Only persons act in movies, only movies have titles, only persons have date of birth, and so on. But we can disturb this boring peace by setting a property on a relationship like this:

MATCH (:Person {name:'Demi Moore'})-[r:ACTED_IN]->(:Movie {title:'A Few Good Men'}) 
SET r.rating = 88

This is wrong: we are setting a value for the attribute rating on a relationship where a rating does not belong. If we now re-run the previous consistency check query, it should produce some results:

+======+=====================================================+=========================================+
|relUID|check failed                                         |extraInfo                                |
+======+=====================================================+=========================================+
|64560 |domain of rating [http://neo4j.com/voc/movies#rating]|Rel type: ACTED_IN but should be REVIEWED|
+------+-----------------------------------------------------+-----------------------------------------+

Our ontology stated that the rating property was exclusive to the :REVIEWED relationship, so our data is now inconsistent with that. I can fix the problem by unsetting the value of the property with the following Cypher and get the graph back to a consistent state:

MATCH (:Person {name:'Demi Moore'})-[r:ACTED_IN]->(:Movie {title:'A Few Good Men'}) 
REMOVE r.rating
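The same detect-and-fix cycle can be replayed without a database. This is a rough Python sketch under my own assumptions: relationships are plain dictionaries, the ids and the rating of 95 are invented, and each property key is taken to have a single relationship type as its domain, as in the Cypher meta-rule above.

```python
from typing import Any, Dict, List, Tuple

# Schema: property key -> relationship type declared as its domain.
rel_property_domains: Dict[str, str] = {"rating": "REVIEWED"}

# Toy relationships (hypothetical ids and values).
relationships: List[Dict[str, Any]] = [
    {"id": 1, "type": "REVIEWED", "props": {"rating": 95}},
    {"id": 2, "type": "ACTED_IN", "props": {"rating": 88}},
]


def check_rel_property_domains(
    relationships: List[Dict[str, Any]],
    rel_property_domains: Dict[str, str],
) -> List[Tuple[int, str, str, str]]:
    """Return (rel id, property key, actual type, expected type) per violation."""
    violations = []
    for rel in relationships:
        for key in rel["props"]:
            expected = rel_property_domains.get(key)
            # Properties without a declared domain are not checked.
            if expected is not None and expected != rel["type"]:
                violations.append((rel["id"], key, rel["type"], expected))
    return violations


before = check_rel_property_domains(relationships, rel_property_domains)
print("before fix:", before)

# Equivalent of REMOVE r.rating on the offending relationship.
relationships[1]["props"].pop("rating")

after = check_rel_property_domains(relationships, rel_property_domains)
print("after fix:", after)
```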

Right, so I could extend the meta-model primitives by adding things like hierarchical relationships between classes like RDFS’s SubClassOf or even more advanced ones like OWL’s Restriction elements or disjointness relationships but my objective today was to introduce the concept, not to define a full ontology language for Neo4j. In general, the choice of the schema language will depend on the level of expressivity that one needs in the schema. In the RDF world you will decide whether you want to use RDFS if your model is simple or OWL if you want to express more complex semantics. The cost, of course, is more expensive processing both for reasoning and consistency checking. Similarly, if you go down the DIY route that we are following here, keep in mind that for every primitive added to your language, the corresponding meta-rules, stored procedures or any alternative implementation of its semantics will be required too.
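As a rough idea of what one extra primitive costs, here is a Python sketch of how a domain check might change if a SubClassOf primitive were added. Everything in it is hypothetical: the ontology in this post has no class hierarchy, the `Actor` class is invented, and each class is assumed to have at most one direct superclass.

```python
from typing import Dict, Set


def ancestors(label: str, subclass_of: Dict[str, str]) -> Set[str]:
    """Collect all superclasses of a label by walking the hierarchy upwards."""
    seen: Set[str] = set()
    current = label
    # Stop at the root, or when a parent was already visited (cycle guard).
    while current in subclass_of and subclass_of[current] not in seen:
        current = subclass_of[current]
        seen.add(current)
    return seen


def satisfies(node_labels: Set[str], expected: str, subclass_of: Dict[str, str]) -> bool:
    """A node satisfies a domain if one of its labels is the expected class
    or a (direct or indirect) subclass of it."""
    return any(
        label == expected or expected in ancestors(label, subclass_of)
        for label in node_labels
    )


# Hypothetical hierarchy: every Actor is a Person.
subclass_of = {"Actor": "Person"}

print(satisfies({"Actor"}, "Person", subclass_of))   # an Actor counts as a Person
print(satisfies({"Thing"}, "Person", subclass_of))   # a Thing does not
```

The plain label membership test of the original check becomes a walk over the hierarchy, which is exactly the kind of extra processing that a more expressive language brings with it.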

The use of meta-data definition languages with well defined semantics is pretty powerful, as we’ve seen, because it enables the construction of general purpose engines that are based only on the language primitives and are therefore reusable across domains. Examples of this idea are the set of Cypher rules and the stored procedure linked above. You can try to reuse them on other Neo4j databases just by defining an ontology in the language we’ve used here. Finally, by combining this approach with the fact that the schema definition itself is stored just as ordinary data, we get a pretty dynamic setup: storing Classes, ObjectProperties, etc. as nodes and relationships in Neo4j means that they may evolve over time and that there are no precompiled rules or static/hardcoded logic to detect consistency violations.

This is precisely the kind of approach that RDF stores follow. In addition to storing data as RDF triples and offering query capabilities for it, if you make your model ontology explicit by using RDFS or OWL, you get out of the box inferencing and consistency checking.

Consistency checking at transaction level

So far we’ve used our ontology to drive the execution of meta-rules to check the consistency of a data set after some change was made (an attribute was added to a relationship in our example). That is one possible way of doing things (a posteriori check) but I may want to use the explicit semantics I’ve added to the graph in Neo4j in a more real-time scenario. As I described before, I may want transactions to be committed only if they leave our graph in a consistent state and that’s what these few lines of Python do.

The logic of the consistency check is on the client-side, which may look a bit strange but my intention was to make the whole thing as explicit and clear as I possibly could. The check consists of running the whole set of individual meta-rules defined in the previous sections one by one and breaking if any of them picks up an inconsistency of any type. The code requires the meta-rules to be available in the Neo4j server and the way I’ve done it is by storing each of them individually as ConsistencyCheck nodes with the Cypher code stored as a property. Something like this:

CREATE (ic:ConsistencyCheck { ccid:1, ccname: 'DTP_DOMAIN', 
cccypher: 'MATCH (n:Class)<-[:DOMAIN]-(p:Datatyp ... '})

The code of the Cypher meta-rule in the cccypher property has been truncated in this snippet but you can view the whole lot on GitHub. Now the transaction consistency checker can grab all the meta-rules and cache them (line 25 of the Python code) with this simple query:

MATCH (cc:ConsistencyCheck)
RETURN cc.ccid AS ccid, cc.ccname AS ccname, cc.cccypher AS cccypher

that returns the batch of individual meta-rules to be run. Something like this:

[Screenshot: the ConsistencyCheck meta-rules returned by the query]

The code can be tested with different Cypher fragments. If I try to populate the graph with data that is consistent with the schema then things will go smoothly:

$ python transactional.py " CREATE (:Person { name: 'Marlon Brando'})-[:ACTED_IN]->(:Movie{ title: 'The Godfather'}) "
Consistency Checks passed. Transaction committed

Now if I try to update the graph in inconsistent ways, the meta-rules should pick this up. Let’s try to insert a node labeled as :Thing with an attribute that’s meant to be used by Person nodes:

$ python transactional.py " CREATE (:Thing { name: 'Marlon Brando'}) "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                               | extraInfo                                  
---+---------+------------------------------------------------------------+---------------------------------------------
 1 |    7231 | domain of property name [http://neo4j.com/voc/movies#name] | Node labels: ( Thing) should include Person

Or link an existing actor (:Person) to something that is not a movie through the :ACTED_IN relationship:

$ python transactional.py " MATCH (mb:Person { name: 'Marlon Brando'}) CREATE (mb)-[:ACTED_IN]->(:Play { playTitle: 'The mousetrap'}) "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                              | extraInfo                                
---+---------+-----------------------------------------------------------+-------------------------------------------
 1 |    7241 | domain of ACTED_IN [http://neo4j.com/voc/movies#ACTED_IN] | Node labels: ( Play) should include Movie

I can also try to update the labels of a few nodes, causing a bunch of inconsistencies:

$ python transactional.py " MATCH (n:Person { name: 'Kevin Bacon'})-[:ACTED_IN]->(bm) REMOVE bm:Movie SET bm:BaconMovie  "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                                       | extraInfo                                      
---+---------+--------------------------------------------------------------------+-------------------------------------------------
 1 |    7046 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 2 |    7166 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 3 |    7173 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 4 |    7046 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 5 |    7166 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 6 |    7173 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 7 |    7046 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
 8 |    7166 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
 9 |    7173 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
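For readers who want to see the commit-or-rollback flow behind these runs, here is a minimal Python sketch of the control logic. It is not the transactional.py script from the repository: the `Transaction` interface, the `run_checked` function and the `FakeTransaction` stand-in are all assumptions of mine, chosen so that the example runs without a Neo4j server. With a real driver you would pass in whatever transaction object it provides, as long as it can run a statement, commit and roll back.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class Transaction(Protocol):
    """The minimal interface assumed here (hypothetical, not a driver API)."""
    def run(self, cypher: str) -> List[dict]: ...
    def commit(self) -> None: ...
    def rollback(self) -> None: ...


def run_checked(
    tx: Transaction,
    update: str,
    rules: Iterable[Tuple[str, str]],
) -> Tuple[bool, List[dict]]:
    """Apply an update, run every meta-rule, commit only if none reports rows."""
    tx.run(update)
    for _name, cypher in rules:
        violations = list(tx.run(cypher))
        if violations:
            # Break on the first rule that picks up an inconsistency.
            tx.rollback()
            return False, violations
    tx.commit()
    return True, []


class FakeTransaction:
    """Stand-in used only for this illustration: returns canned rows per query."""
    def __init__(self, canned: Dict[str, List[dict]]) -> None:
        self.canned = canned
        self.state = "open"

    def run(self, cypher: str) -> List[dict]:
        return self.canned.get(cypher, [])

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled back"


# Placeholder rule texts; the real ones are the Cypher meta-rules stored
# in the cccypher property of the ConsistencyCheck nodes.
rules = [("DTP_DOMAIN", "<meta-rule 1>"), ("ANOTHER_RULE", "<meta-rule 2>")]

ok_tx = FakeTransaction({})
ok, _ = run_checked(ok_tx, "<consistent update>", rules)
print(ok, ok_tx.state)

bad_tx = FakeTransaction({"<meta-rule 2>": [{"nodeUID": 7241}]})
bad, rows = run_checked(bad_tx, "<inconsistent update>", rules)
print(bad, bad_tx.state, rows)
```

Because the update and the checks run inside the same transaction, the meta-rules see the graph as it would look after the change, and a rollback leaves the stored data untouched.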

Again a word of warning about the implementation. My intention was to make the example didactic and easy to understand but of course in a real world scenario you would probably go with a different implementation, most certainly involving some server-side code like this stored proc. An option would be to write the logic of running the consistency checks and either committing or rolling back the transaction as an unmanaged extension that you can expose as a ‘consistency-protected’ Cypher HTTP endpoint and invoke it from your client.

Consistency check via dynamic UI generation

Another way of guaranteeing the consistency in the data in your graph when it is manually populated through a front-end is to build a generic UI driven by your model ontology. Let’s say for example that you need to generate a web form to populate new movies. Well, you can retrieve both the structure of your form and the data insertion query with this bit of Cypher:

MATCH (c:Class { label: {className}})<-[:DOMAIN]-(prop:DatatypeProperty) 
RETURN { cname: c.label, catts: collect( {att:prop.label})} as object, 
 'MERGE (:' + c.label + ' { uri:{uri}' + reduce(s = "", x IN collect(prop.label) | s + ',' +x + ':{' + x + '}' ) +'})' as cypher

Again, a generic (domain agnostic) query that works on your schema. The query takes a parameter ‘className’. When set to ‘Movie’, you get in the response all the info you need to generate your movie insertion UI. Here’s a fragment of the JSON structure returned by Neo4j.

 "row": [
            {
              "cname": "Movie",
              "catts": [
                {
                  "att": "tagline"
                },
                {
                  "att": "title"
                },
                {
                  "att": "released"
                }
              ]
            },
            "MERGE (:Movie { uri:{uri},tagline:{tagline},title:{title},released:{released}})"
          ]
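What a front end does with this response is up to you. As one possible illustration, the following Python sketch turns the returned object into a bare-bones HTML form and prepares the parameters for the generated MERGE statement. The function names and the HTML layout are my own choices, not something that comes from the post’s repository.

```python
from html import escape
from typing import Any, Dict, List

# The two values of the row shown above.
movie_object = {
    "cname": "Movie",
    "catts": [{"att": "tagline"}, {"att": "title"}, {"att": "released"}],
}
movie_cypher = "MERGE (:Movie { uri:{uri},tagline:{tagline},title:{title},released:{released}})"


def form_fields(obj: Dict[str, Any]) -> List[str]:
    """The generated MERGE expects a uri plus one value per schema attribute."""
    return ["uri"] + [entry["att"] for entry in obj["catts"]]


def render_form(obj: Dict[str, Any]) -> str:
    """Render one text input per field as a very plain HTML form."""
    rows = [
        f'  <label>{escape(name)} <input name="{escape(name)}"></label>'
        for name in form_fields(obj)
    ]
    header = f'<form data-class="{escape(obj["cname"])}">'
    return "\n".join([header, *rows, "</form>"])


def build_params(obj: Dict[str, Any], submitted: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields the schema knows about; fail if one is missing."""
    fields = form_fields(obj)
    missing = [name for name in fields if name not in submitted]
    if missing:
        raise ValueError(f"missing form values: {missing}")
    return {name: submitted[name] for name in fields}


print(render_form(movie_object))
params = build_params(
    movie_object,
    {
        "uri": "http://example.org/movies/the-godfather",  # made-up URI
        "tagline": "An offer you can't refuse.",
        "title": "The Godfather",
        "released": 1972,
        "rating": 10,  # not in the schema for Movie, so it is dropped
    },
)
print(params)
```

The resulting `params` dictionary is what you would send together with the generated Cypher string, so the form can only ever write attributes that the ontology declares for the class.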

Conclusion

With this simple example, I’ve tried to prove that it’s relatively straightforward to model and store explicit domain semantics in Neo4j, making it effectively a semantic graph. Too often I hear or read that ‘being semantic’ is a key differentiator between RDF stores and other graph databases. By explaining exactly what it means to ‘be semantic’ and experimenting with how this concept can be transposed to a property graph like Neo4j, I’ve tried to show that such a difference is not an essential one.

I’ve focused on using the explicit semantics to run consistency checks in a couple of different ways but the concept could be extended to other uses such as automatically inferring new facts from existing ones, for instance.
