Neo4j is your RDF store (part 3) : Thomson Reuters’ OpenPermID

If you’re new to RDF/LPG, here is a good introduction to the differences between both types of graphs.  

For the last post in this series, I will work with a larger public RDF dataset in Neo4j. We've already seen a few times that importing an RDF dataset into Neo4j is easy, so in this post I will focus on what I think is the more interesting part: what comes after the data import. Here are some highlights:

  1. Applying transformations to the imported RDF graph so that it benefits from the LPG modelling capabilities, and enriching it with complementary data sources.
  2. Querying the graph to do complex path analysis, and using graph patterns to detect data quality issues (like data duplication) and to profile the dataset.
  3. Integrating Neo4j with standard BI tools to build nice charts on the output of Cypher queries on your graph.
  4. Building an RDF API on top of your Neo4j graph.

All the code I’ll use is available on GitHub. Enjoy!

 

The dataset

In this example, I’ll use the Open PermID dataset by Thomson Reuters. Here is a diagram representing the main entities we will find in it (from https://permid.org/).

[Diagram: main entities in the Open PermID dataset (from permid.org)]

I'll be using for this experiment the dump from the 7th January 2018. You can download the current version of the dataset from the bulk download section. The data dump contains around 127 million triples. [Note: halfway through the writing of this blog entry, a new file containing person data was added to the downloads page, growing the full dump from the roughly 38 million triples it had before by 300%. If you had looked at this dataset before, I recommend you have a look at the new content.]

The Data import

The dump is broken down into seven files named OpenPermID-bulk-XXX.ntriples (where XXX is Person, Organization, Instrument, Industry, Currency, AssetClass or Quote). I will import them into Neo4j as usual with the semantics.importRDF stored procedure that you can find in the neosemantics Neo4j extension. I'll use the N-Triples RDF serialization format, but a Turtle version is also available.

Here is the script that will do the data load for you. You’ll find in it one call to the semantics.importRDF stored procedure per RDF file, just like this one:

CALL semantics.importRDF("file:////OpenPermID/OpenPermID-bulk-organization-20180107_070346.ntriples","N-Triples", {})

The data load took just over an hour on my 16GB MacBook Pro. While this may sound like modest load performance (~35K triples ingested per second), we have to keep in mind that in Neo4j all connections (relationships) between nodes are materialised at write time, as opposed to triple stores or other non-native graph stores where they are computed via joins at query time. So it's a trade-off between write and read performance: with Neo4j you get a more expensive transactional data load but lightning-fast traversals at read time.

It's also worth mentioning that the approach described here is transactional, which may or may not be the best choice when loading super-large datasets. It's fine for this example because the graph is relatively small by Neo4j standards, but if your dataset is really massive you may want to consider alternatives like the non-transactional import tool (https://neo4j.com/docs/operations-manual/current/import/).
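For reference, a minimal invocation of the offline import tool could look like the sketch below. It assumes you have first flattened the triples into node and relationship CSV files (resources.csv and triples.csv are hypothetical names), and the exact flags vary between Neo4j versions, so check the import tool documentation linked above:

bin/neo4j-admin import --nodes resources.csv --relationships triples.csv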

I completed the data load by importing the countryInfo.txt file from Geonames, to enrich the country information that in the PermID data export is reduced to bare URIs. The Geonames data is tabular, so I used the following LOAD CSV script:

LOAD CSV WITH HEADERS from "file:///OpenPermID/countryInfo.txt" as row fieldterminator '\t'
MATCH (r:Resource { uri: "http://sws.geonames.org/" + row.geonameid + "/" } )
SET r+= row, r:Country;

The imported graph

The raw import produces a graph with 18.8 million nodes and 101 million relationships. As usual, there is an order of magnitude more triples in the RDF graph than nodes in the LPG, as discussed in the previous post.
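If you want to reproduce these counts on your own instance, the following two queries will do it (the second one counts every directed relationship once):

MATCH (n) RETURN count(n) AS nodeCount;

MATCH ()-[r]->() RETURN count(r) AS relCount;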

I'll start the analysis by getting some metrics on the imported dataset. This query will give us the avg/max/min and percentile degrees of the nodes in the graph:

MATCH (n)
WITH n, size((n)-[]-()) AS degree
RETURN AVG(degree), MAX(degree), MIN(degree), 
 percentileCont(degree,.99) as `percent.99`, 
 percentileCont(degree,.9999)as `percent.9999`, 
 percentileCont(degree,.99999)as `percent.99999`;
+-------------------------------------------------------------------------------------+
| AVG(degree) | MAX(degree) | MIN(degree) | percent.99 | percent.9999 | percent.99999 |
+-------------------------------------------------------------------------------------+
| 5.377148198 | 9696290     | 0           | 20.0       | 205.0        | 8789.78792    |
+-------------------------------------------------------------------------------------+

We see that the average degree of the nodes in the graph is close to 5, but there are super-dense nodes with up to 10 million (!) relationships. These dense nodes are a very small minority, as the percentile information shows, and they are due to specific modelling decisions that are OK in RDF but total anti-patterns when modelling a property graph. But don't worry, we will be able to refactor the graph, as I'll show in the following section.

We also see from MIN(degree) being zero in the previous query that there are some orphan nodes, not connected to any other node in the graph. Let's find out more about them by running the following query (the equivalent one for Organizations follows below):

MATCH (n:ns7__Person) WHERE NOT (n)--()
RETURN count(n)

In the public dataset, 34% of the Person entities are completely disconnected (that is 1.48 million out of a total of 4.4 million). The same happens with 5% of the Organizations (200K out of a total of 4.5 million).
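And here is the Organization variant. The ns3__Organization label is the one the organizations got in my import (it's the same label used in the queries later in this post), so adjust the prefix if yours differs:

MATCH (n:ns3__Organization) WHERE NOT (n)--()
RETURN count(n)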

Refactoring LPG antipatterns in the RDF graph

Nodes that could (should) be properties

In the RDF OpenPermID graph, there are a number of properties (hasPublicationStatus or hasGender on Persons, hasActivityStatus on Organizations) whose values are either boolean-style (active/inactive for hasActivityStatus) or drawn from a very reduced set (male/female for hasGender). In the RDF graph, these values are modelled as resources instead of literals, and they generate very dense nodes when imported into an LPG like Neo4j. Think of three or four million Person nodes with status active, all linked through the hasStatus relationship to one single node representing the 'active' status. The status (and the same goes for all the other properties mentioned before) can perfectly well be modelled as a node attribute (a literal property in RDF) without any loss of information, so we'll get rid of these dense nodes with Cypher expressions like the following:

MATCH (x:ns7__Person)-[r:ns0__hasPublicationStatus]->(v) 
SET x.ns0__publicationStatus = substring(v.uri,length('http://permid.org/ontology/common/publicationstatus'))
DELETE r

If we tried to run this on the whole PermID dataset, it would involve millions of updates, so we will want to batch it using APOC's periodic commit procedure as follows:

CALL apoc.periodic.commit("
MATCH (x:ns7__Person)-[r:ns0__hasPublicationStatus]->(v) 
WITH x,r,v LIMIT $limit
SET x.ns0__publicationStatus = substring(v.uri,length('http://permid.org/ontology/common/publicationstatus'))
DELETE r
RETURN COUNT(r)
", { limit : 50000});

I'll apply the same transformation to the rest of the properties mentioned above; you can see the whole script here (and you can run it too).
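As an illustration of the pattern, here is what the same transformation could look like for the organizations' activity status. Treat it as a sketch: the ns3__hasActivityStatus relationship type and the URI prefix in the substring are assumptions based on my import, so double-check the generated names in your own graph before running it.

CALL apoc.periodic.commit("
MATCH (x:ns3__Organization)-[r:ns3__hasActivityStatus]->(v)
WITH x,r,v LIMIT $limit
// same trick as before: keep only the last URI segment as the literal value
SET x.ns3__activityStatus = substring(v.uri,length('http://permid.org/ontology/organization/activitystatus'))
DELETE r
RETURN COUNT(r)
", { limit : 50000});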

Nodes that are not needed

As I've discussed in previous articles and talks, in the LPG each instance of a relationship is uniquely identified and can have properties. This is not the case in RDF, where qualified relationships need to be reified as nodes (there are other options, but I'm not going to talk about them here). Here is an example (that I used in a presentation a while ago) of what I'm talking about:

[Diagram: the same qualified relationship modelled as an intermediate node in RDF (green) and as a single relationship with properties in the LPG (blue)]

A very similar approach is used in OpenPermID to represent tenures in organizations. Tenures are modelled as resources in RDF (nodes in the LPG) that contain information like the role and the start and end dates indicating their duration, and they are connected to both the person and the organization.

With the following refactoring, we will remove the intermediate node representing the tenure (unnecessary in an LPG) and store all the information about it in a newly created relationship (:HOLDS_POSITION_IN) directly connecting the Person and the Organization. We are essentially transforming a graph like the green one in the previous diagram into one like the blue one.

Here’s the Cypher that will do this for you:

MATCH (p)-[:ns7__hasTenureInOrganization]->(t:ns7__TenureInOrganization)-[:ns7__isTenureIn]->(o)
MERGE (p)-[hpi:HOLDS_POSITION_IN { ns0__hasPermId : t.ns0__hasPermId }]->(o) ON CREATE SET hpi+=t
DETACH DELETE t

As in the previous example, this is a heavy operation over the whole graph that will update millions of nodes and relationships, so we will want to batch it in a similar way using apoc.periodic.commit. Remember that the complete script with all the changes is available here.

call apoc.periodic.commit("
MATCH (p)-[:ns7__hasTenureInOrganization|ns7__hasHolder]-(t:ns7__TenureInOrganization)-[:ns7__isTenureIn]->(o)
WITH p, t, o LIMIT $limit
MERGE (p)-[hpi:HOLDS_POSITION_IN { ns0__hasPermId : t.ns0__hasPermId }]->(o) ON CREATE SET hpi+=t
DETACH DELETE t
RETURN count(hpi)", {limit : 25000}
) ;

OK, we could go on with the academic qualifications and so on, but we'll leave the transformations here for now. It's time to start querying the graph.

Querying the graph

Path analysis

Let's start with a classic: "What is the shortest path between…?" I'll use the last two presidents of the US for this example, but feel free to make your own choice of nodes. For our first attempt, we can be lazy and ask Neo4j to find a shortestPath between the two nodes, which we look up by given name and family name. Here's how:

MATCH p = shortestPath((trump:ns7__Person {`ns2__family-name` : "Trump", `ns2__given-name` : "Donald" })-[:HOLDS_POSITION_IN*]-(obama:ns7__Person {`ns2__family-name` : "Obama", `ns2__given-name` : "Barack" }))
RETURN p

[Screenshot: one of the shortest paths between the Trump and Obama nodes]

This query returns one of the multiple shortest paths starting at Trump and ending at Obama, formed of people connected to organisations where they hold, or have held, positions at some point.

If we were doing this in a social/professional network platform to find an indirect connection (a path) linking two people via common friends, ex-colleagues, etc., then the previous approach would be incomplete. This is because if we want to be sure that two people have sat on the board of an organization at the same time, or held director positions at the same time, we have to check that there is a time overlap between their tenures. In our model, the required information is in the HOLDS_POSITION_IN relationship, which is qualified with two properties (from and to) indicating the duration of the tenure.

Let’s try to express this time-based constraint in Cypher: In a pattern like this one…

(x:Person)-[h1:HOLDS_POSITION_IN]->(org)<-[h2:HOLDS_POSITION_IN]-(y:Person)

…for x and y to know each other, there must be an overlap between the intervals defined by the from and to properties of h1 and h2. Two intervals overlap precisely when neither ends before the other starts. Sounds complicated? Nothing is too hard for the combination of Cypher + APOC.

MATCH p = shortestPath((trump:ns7__Person {`ns2__family-name` : "Trump", `ns2__given-name` : "Donald" })-[:HOLDS_POSITION_IN*]-(obama:ns7__Person {`ns2__family-name` : "Obama", `ns2__given-name` : "Barack" }))
WHERE all(x in filter( x in apoc.coll.pairsMin(relationships(p)) WHERE (startNode(x[0])<>startNode(x[1]))) WHERE 
 not ( (x[0].to < x[1].from) or (x[0].from > x[1].to) ) 
)
return p

We are checking that every two adjacent tenures in the path (connecting two individuals to the same organization) overlap in time. Even with the additional constraints, Neo4j takes only a few milliseconds to compute the shortest path.

[Screenshot: shortest path between Trump and Obama constrained to time-overlapping tenures]

In the first path visualization (and you'll definitely see this too if you run your own data analysis) you may have noticed that some of the Person nodes appear in yellow and with no name. This is because they have no properties other than the URI. The OpenPermID data dump seems to be incomplete: for some entities it only includes the URI and the connections (object properties in RDF lingo) but is missing the attributes (the datatype properties). There are over 40K persons in this situation, as we can see by running this query:

MATCH (p)-[:HOLDS_POSITION_IN]->()
WHERE NOT p:ns7__Person
RETURN COUNT( DISTINCT p) AS count

╒═══════╕
│"count"│
╞═══════╡
│46139  │
└───────┘

And the symmetric query shows that nearly 5K organizations have the same problem.
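For completeness, here is what the symmetric query could look like, again assuming that fully populated organizations carry the ns3__Organization label:

MATCH ()-[:HOLDS_POSITION_IN]->(o)
WHERE NOT o:ns3__Organization
RETURN COUNT(DISTINCT o) AS count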

Luckily, the nodes in this situation have their URI, which is all we need to look them up in the PermID API. Let's look at one random example: https://permid.org/1-34413241385. No attributes for this node, although we see that it has a number of :HOLDS_POSITION_IN connections to other nodes.

MATCH (r :Resource { uri : "https://permid.org/1-34413241385"})-[hpi:HOLDS_POSITION_IN]->(o) 
RETURN *

[Screenshot: the attribute-less Person node and its HOLDS_POSITION_IN relationships]

Who is this person? Let's get the data from the URI (available as HTML and as RDF). It turns out the node represents Mr David T. Nish, effectively connected to the organisations we saw in the graph view. So we can fix the information gap by ingesting the missing triples directly from the API as follows:

call semantics.importRDF("https://permid.org/1-34413241385?format=turtle","Turtle",{})

This invocation of the importRDF procedure will pull the triples from the API and import them into Neo4j in the same way we did the initial data load.

It's not clear to me why some random triples are excluded from the bulk download… Let's run another example of path analysis with someone closer to me than the POTUSes. Let's find the node in OpenPermID representing Neo4j:

MATCH (neo:ns3__Organization {`ns2__organization-name` : "Neo4j Inc"})
RETURN neo

[Screenshot: the 'Neo4j Inc' organization node]

If we expand the Neo4j node in the browser, we'll find that the details of all the individuals connected to it are again not included in the bulk export dataset (all the empty yellow nodes connected to 'Neo4j Inc' through HOLDS_POSITION_IN relationships), so let's complete them in a single go with this fragment of Cypher, calling semantics.importRDF for each of the incomplete Person nodes:

MATCH (neo:ns3__Organization {`ns2__organization-name` : "Neo4j Inc"})<-[:HOLDS_POSITION_IN]-(person)
WITH DISTINCT person
CALL semantics.importRDF(person.uri + "?format=turtle&access-token=","Turtle",{}) 
YIELD terminationStatus,triplesLoaded,extraInfo
RETURN *

Once we have the additional info in the graph, we can ask Neo4j what would be Emil’s best chance of meeting Mark Zuckerberg (why not?).

MATCH p = shortestPath((emil:ns7__Person {`ns2__family-name` : "Eifrem", `ns2__given-name` : "Emil" })-[:HOLDS_POSITION_IN*]-(zuck:ns7__Person {`ns2__family-name` : "Zuckerberg", `ns2__given-name` : "Mark" })) 
WHERE all(x in filter( x in apoc.coll.pairsMin(relationships(p)) WHERE (startNode(x[0])<>startNode(x[1]))) WHERE NOT ( (x[0].to < x[1].from) or (x[0].from > x[1].to) ) ) 
RETURN p

[Screenshot: shortest path between Emil Eifrem and Mark Zuckerberg]

Nice! Now, finally and just for fun, let's return the same result as a description in English of what this path looks like. All we need to do is add the following transformation to the previous path query; just replace the RETURN with the following two lines:

UNWIND filter( x in apoc.coll.pairsMin(relationships(p)) WHERE (startNode(x[0])<>startNode(x[1]))) as match
RETURN apoc.text.join(collect(startNode(match[0]).`ns2__family-name` + " knows " + startNode(match[1]).`ns2__family-name` + " from " + endNode(match[0]).`ns2__organization-name`),", ") AS explanation

And here’s the result:

Eifrem knows Treskow from Neo4j Inc, Treskow knows Earner from Space Ape Games (UK) Ltd, Earner knows Breyer from Accel Partners & Co Inc, Breyer knows Zuckerberg from Facebook Inc.

Industry / Economic Sector classification queries

Another interesting analysis is the exploration of the industry/economic sector hierarchy. The downside is that only 8% of the organizations in the dataset are classified according to it.
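That 8% figure can be checked with a quick profiling query along these lines; the ns3__hasPrimaryEconomicSector relationship type is the same one used in the BI section further down:

MATCH (o:ns3__Organization)
WITH count(o) AS total,
     sum(CASE WHEN size((o)-[:ns3__hasPrimaryEconomicSector]->()) > 0 THEN 1 ELSE 0 END) AS classified
RETURN total, classified, 100 * classified / total AS percentClassified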

Here is, however, a beautiful bird's-eye view of the whole economic sector taxonomy, where we can see that the industry sub-taxonomies are completely disjoint from each other and that the two richest hierarchies are the ones for Consumer Cyclicals and Industrials:

MATCH (n:ns10__EconomicSector)<-[b:ns6__broader*]-(c) 
RETURN *

[Screenshot: bird's-eye view of the complete economic sector taxonomy]

And a probably more useful detailed view of the Telecoms sector:

MATCH (n:ns10__EconomicSector)<-[b:ns6__broader*]-(c) 
WHERE n.ns9__label = "Telecommunications Services"
RETURN *

[Screenshot: detailed view of the Telecommunications Services sector hierarchy]

Instrument classification queries

Similar to the industries and economic sectors, instruments issued by organizations are classified in a rich hierarchy of asset categories. The problem, again, is that in the public dataset only 2% (32 out of nearly 1,400) of the asset categories have instruments in them, which makes the hierarchy of limited use. Let's run a couple of analytic queries on it.

The following table shows the top-level asset categories that have at least one instrument or subcategory in them:

[Table: top-level asset categories with their child category and instrument counts]

And here is the Cypher query that produces the previous results:

MATCH (ac:ns5__AssetClass) 
WHERE NOT (ac)-[:ns6__broader]->() //Top level only
WITH ac.ns9__label as categoryName,
     size((ac)<-[:ns6__broader*]-()) as childCategoryCount,
     size((ac)<-[:ns6__broader*0..]-()<-[:ns5__hasAssetClass]-(:ns5__Instrument)) as instrumentsInCategory
WHERE childCategoryCount + instrumentsInCategory > 0
RETURN categoryName, childCategoryCount, instrumentsInCategory 
ORDER BY childCategoryCount + instrumentsInCategory DESC

The instruments in the public DB all fall under the "Equities" category, and within it they are distributed as follows (only the top 10 shown; remove the LIMIT in the query to see all):

[Table: top 10 Equities subcategories by number of instruments]

Here’s the Cypher query that produces the previous results.

MATCH (:ns5__AssetClass { ns9__label : "Equities"})<-[:ns6__broader*0..]-(ac)
WITH ac.uri as categoryId, ac.ns9__label as categoryName, size((ac)<-[:ns5__hasAssetClass]-(:ns5__Instrument)) as instrumentsInCategory 
RETURN categoryName, instrumentsInCategory 
ORDER BY instrumentsInCategory DESC LIMIT 10

To finalise the analysis of the asset categories, here is a visualization of the "Equities" hierarchy with the top three categories (by number of instruments in the dataset) highlighted.

[Screenshot: the Equities hierarchy with its three largest categories highlighted]

Global queries and integration with BI tools

Cypher queries on the graph can return rich structures like nodes or paths, but they can also produce data frames (data in tabular form) that can easily be used by standard BI tools like Tableau, Qlik, MicroStrategy and many others to create nice charts. The following query returns the ratio of women holding positions at organizations, both by country and by economic sector.

MATCH (c:Country)<-[:ns3__isIncorporatedIn|ns4__isDomiciledIn]-(o)<-[:HOLDS_POSITION_IN]-(p)
OPTIONAL MATCH (o)-[:ns3__hasPrimaryEconomicSector]->(es)
WITH c.Country AS country, coalesce(es.ns9__label,'Unknown') AS economicSector, size(filter(x in collect(p) where x.ns2__gender = "female")) AS femaleCount, count(p) AS totalCount
RETURN country, economicSector, femaleCount, totalCount, femaleCount*100/totalCount as femaleRatio

Here is an example of how this dataset can be used in Tableau (Tableau can query the Neo4j graph directly using Cypher via the Neo4j Tableau WDC).

[Screenshot: Tableau charts built on the output of the previous Cypher query]

According to the public OpenPermID dataset, the top three countries ranked by ratio of women holding positions at organisations are Burundi, Albania and Kyrgyzstan. Interesting… not what I was expecting, to be perfectly honest, and definitely not aligned with other analyses and sources. The chart on the right shows that women are evenly distributed across economic sectors, which sounds plausible.

Well, I would expect you to be at least as sceptical as I am about these results. I believe this has, again, to do with the fact that the dataset seems to be incomplete, so please do not take this as anything more than a data integration/manipulation exercise.

Other data quality issues

Entity duplication

While running some of the queries, I noticed some duplicate nodes. Here is an example of what I mean: the one and only Barack Obama, duplicated.

MATCH (obama:ns7__Person) 
WHERE obama.`ns2__family-name` = "Obama" AND 
      obama.`ns2__given-name` = "Barack" 
RETURN obama

[Screenshot: two Person nodes both representing Barack Obama]

One of the Obama nodes (permID: 34414148146) connects to the organizations he's held positions in, and the other one (permID: 34418851840) links to his academic qualifications. Maybe two sources (academic + professional) have been combined in OpenPermID but not completely integrated yet?

Here's how Cypher can help detect these complex patterns, proving why graph analysis is extremely useful for entity resolution. The following query (notice it's quite a heavy one) will get you some 30K instances of this pattern (which means at least twice as many nodes are affected). I've saved the likely duplicates in the dupes.csv file in this directory.

MATCH (p1:ns7__Person)
WHERE (p1)-[:ns7__hasQualification]->() AND NOT (p1)-[:HOLDS_POSITION_IN]->()
WITH DISTINCT p1.`ns2__given-name` AS name, p1.`ns2__family-name` AS sur, coalesce(p1.`ns2__additional-name`,'') AS add
MATCH (p2:ns7__Person)
WHERE p2.`ns2__family-name` = sur AND 
p2.`ns2__given-name` = name AND 
coalesce(p2.`ns2__additional-name`,'') = add AND
(p2)-[:HOLDS_POSITION_IN]->() AND 
NOT (p2)-[:ns7__hasQualification]->()
RETURN DISTINCT name, add, sur
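Once a duplicate pair has been confirmed, APOC can also help with the clean-up. Here is a minimal sketch using apoc.refactor.mergeNodes on the two Obama nodes from the example above; it assumes the permID is stored in the ns0__hasPermId property (as it is on the tenure nodes), so treat it as an illustration rather than a production-grade resolution step:

// merge the 'academic' Obama node into the 'professional' one,
// transferring properties and relationships
MATCH (p1:ns7__Person { ns0__hasPermId : "34414148146" })
MATCH (p2:ns7__Person { ns0__hasPermId : "34418851840" })
CALL apoc.refactor.mergeNodes([p1, p2]) YIELD node
RETURN node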

Default dates

If you're doing date or interval computations, keep an eye also on default values for dates, because you will find both nulls and the famous 1st Jan 1753 (imported from SQL Server, perhaps?). Make sure you deal with both cases in your Cypher queries.

Here's the Cypher query you can use to profile the dates in the HOLDS_POSITION_IN relationship, showing at the top of the frequency list the two special cases just mentioned.

MATCH (:ns7__Person)-[hpi:HOLDS_POSITION_IN]->()
RETURN hpi.ns0__from AS fromDate, count(hpi) AS freq
ORDER BY freq DESC

[Table: frequency of the from date values in HOLDS_POSITION_IN, with null and 1753-01-01 at the top]
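When the dates then feed into interval computations, a defensive filter along these lines keeps both special cases out. The string comparison is an assumption about how the sentinel date was serialized in my import, so check the exact format in your graph first:

MATCH (:ns7__Person)-[hpi:HOLDS_POSITION_IN]->()
WHERE hpi.ns0__from IS NOT NULL
  AND NOT hpi.ns0__from STARTS WITH "1753-01-01"
RETURN count(hpi) AS usableTenures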

Your data in Neo4j serialized as RDF

So, going back to the title of the post: if "Neo4j is your RDF store", then you should be able to serialize your graph as RDF. The neosemantics extension not only helped us with the ingestion of the RDF, it also gives us a simple way to expose the graph in Neo4j as RDF, and the output is identical to the one in the PermID API. Here is, as an example, the RDF description generated directly from Neo4j of Dr Jim Webber, our Chief Scientist:

[Screenshot: RDF description of Jim Webber's node, served directly by Neo4j]
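This output comes from the uri-based describe endpoint covered in part 1 below. The request would look something like the following, with <jim-webber-permid-uri> standing in for the node's actual URI (not shown here):

:GET /rdf/describe/uri?nodeuri=<jim-webber-permid-uri>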

Even more interesting, we can have the result of a Cypher query serialized as RDF. Here's an example producing the graph of Dr Webber's colleagues at Neo4j.

[Screenshot: RDF serialization of the Cypher query returning Jim Webber's colleagues at Neo4j]
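Since this graph was imported from RDF, the endpoint behind this serialization is rdf/cypheronrdf, described in part 2 below. A request along these lines could produce it; the exact Cypher pattern for 'colleagues' is my reconstruction of what the screenshot shows:

curl http://localhost:7474/rdf/cypheronrdf -H Accept:text/turtle -d 'MATCH (p)-[:HOLDS_POSITION_IN]->(:ns3__Organization {`ns2__organization-name` : "Neo4j Inc"})<-[:HOLDS_POSITION_IN]-(colleague) RETURN *'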

Takeaways

Once again we've seen how straightforward it is to import RDF data into Neo4j, and the same goes for serializing a Neo4j graph as RDF. Remember: RDF is a model for data exchange, but it does not impose any constraint on where or how the data is stored.

I hope I've given a nice practical example of (1) how to import a medium-size RDF graph into Neo4j, (2) how to transform and enrich it using Cypher and APOC, and finally (3) how to query it to benefit from Neo4j's native graph storage.

Now download Neo4j if you have not done so yet, check out the code from GitHub and give it a try! I'd love to hear your feedback.

 

 


Neo4j is your RDF store (part 2)

As in previous posts, for those of you less familiar with the differences and similarities between RDF and the Property Graph, I recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

In the previous post of this series, I showed the most basic way in which a portion of your graph can be exposed as RDF: identifying a node by ID, or by URI if your data was imported from an RDF dataset. In this one, I'll explore a more interesting alternative: running Cypher queries and serialising the resulting subgraph as RDF.

The dataset

For this example I'll use the Northwind database, which you can easily load in your Neo4j instance by running the following in your Neo4j browser:

:play northwind graph

If you follow the step-by-step instructions, you should get the graph built in no time. You're then ready to run queries like "Get the details of the orders by Rita Müller containing at least one dairy product". Here is the Cypher for it:

MATCH (cust:Customer {contactName : "Rita Müller"})-[p:PURCHASED]->(o:Order)-[or:ORDERS]->(pr:Product)
WHERE (o)-[:ORDERS]->()-[:PART_OF]->(:Category {categoryName:"Dairy Products"})
RETURN *

And this is the resulting graph:

[Screenshot: subgraph of Rita Müller's orders containing dairy products]

Serialising the output of a cypher query as RDF

The result of the previous query is a portion of the Northwind graph: a set of nodes and relationships that can be serialised as RDF using the neosemantics Neo4j extension.

Once it's installed on your Neo4j instance, you'll notice that the neosemantics extension includes a Cypher endpoint /rdf/cypher (described here) that takes a Cypher query as input and returns the results serialised as RDF, with the usual choice of serialisation format in the HTTP request.

The endpoint can be tested directly from the browser and will produce JSON-LD by default.

[Screenshot: default JSON-LD output of the /rdf/cypher endpoint in the browser]

The URIs of the resources in RDF are generated from the node IDs in Neo4j, and in this first version of the LPG-to-RDF endpoint all elements in the graph (RDF properties and types) share the same generic vocabulary namespace. (It will be different if your graph has been imported from an RDF dataset, as we'll see in the final section.)

Validating the RDF output on the W3C RDF Validation Service

A simple way of validating the output of the serialisation could be to load it into the W3C RDF validation service. It takes two simple steps:

Step one: run your Cypher query on the rdf/cypher endpoint, selecting application/rdf+xml as the serialization format in the Accept header of the HTTP request. This is what the curl expression would look like:

curl http://localhost:7474/rdf/cypher -H Accept:application/rdf+xml 
     -d "MATCH (cust:Customer {contactName : 'Rita Müller'})-[p:PURCHASED]->(o:OrdeERS]->(pr:Product) WHERE (o)-[:ORDERS]->()-[:PART_OF]->(:Category {categoryName:'Dairy Products'}) RETURN *"

This should produce something like this (showing only the first few rows):

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF xmlns:neovoc="neo4j://vocabulary#"
         xmlns:neoind="neo4j://indiv#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description rdf:about="neo4j://indiv#77511">
    <rdf:type rdf:resource="neo4j://vocabulary#Customer"/>
    <neovoc:country rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Germany</neovoc:country>
    <neovoc:address rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Adenauerallee 900</neovoc:address>
    <neovoc:contactTitle rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Sales Representative</neovoc:contactTitle>
    <neovoc:city rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Stuttgart</neovoc:city>
    <neovoc:phone rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0711-020361</neovoc:phone>
    <neovoc:contactName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Rita Müller</neovoc:contactName>
    <neovoc:companyName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Die Wandernde Kuh</neovoc:companyName>
    <neovoc:postalCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string">70563</neovoc:postalCode>
    <neovoc:customerID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">WANDK</neovoc:customerID>
    <neovoc:fax rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0711-035428</neovoc:fax>
    <neovoc:region rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NULL</neovoc:region>
</rdf:Description>

<rdf:Description rdf:about="neo4j://indiv#77937">
    <neovoc:ORDERS rdf:resource="neo4j://indiv#76432"/>
</rdf:Description>
...

I know the XML-based format is pretty horrible, but we need it because it's the only one that the RDF validator accepts 😦

Step two: go to the W3C RDF validation service page (https://www.w3.org/RDF/Validator/), copy the XML from the previous step into the text box and select "triples and graph" in the display options. Hit "Parse RDF" and… you should get the list of 266 parsed triples plus a graphical representation of the RDF graph like this one:

[Screenshot: graphical representation of the 266 parsed triples in the W3C validator]

Yes, I know: huge if we compare it to the original property graph, but this is normal. RDF makes an atomic decomposition of every single statement in your data. In an RDF graph, not only entities but every single property produces a new vertex, leading to this explosion in the size of the graph.

[Slide: triple counts in RDF vs. node counts in the LPG, from the Graph Connect SF talk]

That's a slide from this talk at Graph Connect SF in October 2016, where I discussed that it's normal for the number of triples in an RDF dataset to be an order of magnitude bigger than the number of nodes in an LPG.

The portion of the Northwind graph returned by our example query is no exception: 19 nodes => 266 triples.

If the graph was imported from RDF…

So, if your graph in Neo4j has been imported using the semantics.importRDF procedure (described in previous blog posts and with some examples), you want to use the rdf/cypheronrdf endpoint (described here) instead. It works in exactly the same way, but uses the URIs as unique identifiers for nodes instead of the IDs.
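To make the difference concrete, here is a sketch of an equivalent call against a graph imported from RDF (it reuses the Bulgakov node from the post below; apart from the endpoint, the shape of the request is identical):

curl http://localhost:7474/rdf/cypheronrdf -H Accept:text/plain -d "MATCH (a)-[b]-(c:Resource { uri: 'http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940'}) RETURN *"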

If you're interested in what this would look like, watch this space for part three of this series.

Takeaways

As in the previous post, the main takeaway is that it is pretty straightforward to offer an RDF “open standards compliant” API for publishing your graph while still getting the benefits of native graph storage and Cypher querying in Neo4j.

 

 

 

Neo4j is your RDF store (part 1)

If you want to understand the differences and similarities between RDF and the Labeled Property Graph implemented by Neo4j, I’d recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

Intro

Let me start with some basics: RDF is a standard for data exchange, but it does not impose any particular way of storing data.

What do I mean by that? I mean that data can be persisted in many ways: tables, documents, key-value pairs, property graphs, triple graphs… and still be published/exchanged as RDF.

It is true, though, that the bigger the paradigm impedance mismatch (the difference between RDF's modelling paradigm, a graph, and the underlying store's), the more complicated and inefficient the translation for both ingestion and publishing will be.

I've been blogging over the last few months about how Neo4j can easily import RDF data, and in this post I'll focus on the opposite: how a Neo4j graph can be published/exposed as RDF.

Because, in case you didn't know, you can work with Neo4j and get the benefits of native graph storage and processing (best performance, data integrity and scalability) while being totally 'open standards' in the eyes of any RDF-aware application.

Oh! hang on… and your store will also be fully open source!

A “Turing style” test of RDFness

In this first section, I'll show the simplest way in which data from a graph in Neo4j can be published as RDF, but I'll also demonstrate that it is possible to import an RDF dataset into Neo4j without loss of information, in a way that the RDF produced when querying Neo4j is identical to that produced by the original triple store.

[Image: 'Turing test'-style comparison between a triple store and Neo4j serving the same RDF]

You’ll probably be familiar with the Turing test where a human evaluator tests a machine’s ability to exhibit intelligent behaviour, to the point where it’s indistinguishable from that of a human. Well, my test aims to prove Neo4j’s ability to exhibit “RDF behaviour” to an RDF consuming application, making it indistinguishable from that of a triple store. To do this I’ll use the neosemantics neo4j extension.

The simplest test one can think of, could be something like this:

Starting from an RDF dataset living in a triple store, we migrate it (all or partially) into Neo4j. Now, given a SPARQL DESCRIBE <uri> query on the triple store and its equivalent rdf/describe/uri request in Neo4j, do they return the same set of triples? If that is the case (and if we also want to be pompous) we could say that the results are semantically equivalent, and therefore indistinguishable to a consumer application.

We are going to run this test step by step on data from the British National Bibliography dataset:

Get an RDF node description from the triple store

To do that, we’ll run the following SPARQL DESCRIBE query in the British National Bibliography public SPARQL endpoint, or alternatively in the more user friendly SPARQL editor.

DESCRIBE <http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940>

The request returns an RDF fragment containing all the information about Mikhail Bulgakov in the BNB. A pretty cool author, by the way, whom I strongly recommend. The fragment actually contains 86 triples, the first of which are these:

<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/givenName> "Mikhail" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.w3.org/2000/01/rdf-schema#label> "Bulgakov, Mikhail, 1891-1940" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/familyName> "Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/name> "Mikhail Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/010535795> .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/008720599> .
...

You can get the whole set by running the query in the SPARQL editor I mentioned before, or by sending an HTTP request with the query to the SPARQL endpoint:

curl -i http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E -H Accept:text/plain

OK, so that's our baseline: exactly the output we want to get from Neo4j to be able to affirm that they are indistinguishable to an RDF-consuming application.

Move the data from the triple store to Neo4j

We need to load the RDF data into Neo4j. We could load the whole British National Bibliography since it’s available for download as RDF, but for this example we are going to load just the portion of data that we need.

I will not go into the details of how this happens as it’s been described in previous blog posts and with some examples. The semantics.importRDF procedure runs a straightforward and lossless import of RDF data into Neo4j. The procedure is part of the neosemantics extension. If you want to run the test with me on your Neo4j instance, now is the moment when you need to install it (instructions in the README).

Once the extension is installed, the migration could not be simpler; just run the following stored procedure:

CALL semantics.importRDF("http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E",
"RDF/XML",true,true,500)

We are passing as a parameter the URL of the BNB SPARQL endpoint returning the RDF data needed for our test, along with some import configuration options. The output of the execution shows that the 86 triples have been correctly imported into Neo4j:

[Screenshot: importRDF output reporting 86 triples loaded]

Now that the data is in Neo4j, you can query it with Cypher and visualise it in the browser. Here is an example query returning Bulgakov and all the nodes he's connected to:

MATCH (a)-[b]-(c:Resource { uri: "http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940"})
RETURN *

[Screenshot: Bulgakov's node and its neighbours in the Neo4j browser]

There is actually not much information in the graph yet: just the node representing good old Mikhail with a few properties (name, uri, etc.) and connections to the works he created or contributed to, the events of his birth and death, and a couple more. But let's not worry about size for now; we'll deal with that later. The question was: can we now query our Neo4j graph and produce the original set of RDF triples? Let's see.

Get an RDF description of the same node, now from Neo4j

The neosemantics repo also includes extensions (HTTP endpoints) that provide precisely this capability. The equivalent in Neo4j of the SPARQL DESCRIBE on Mikhail Bulgakov would be the following:

:GET /rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940

If you run it in the browser, you will get the default serialisation which is JSON-LD, something like this:

[Screenshot: JSON-LD description of the Bulgakov node]

But if you set in the request header the serialisation format of your choice -for example using curl again- you can get the RDF fragment in any of the available formats.

curl -i http://localhost:7474/rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940 -H accept:text/plain

Well, you should not be surprised to learn that it returns 86 triples: exactly the same set that the original query on the triple store returned.

So mission accomplished. At least for the basic case.

RDF out Neo4j’s movie database

I thought it would be interesting to prove that an RDF dataset can be imported into Neo4j and then published without loss of information, but OK, most of you may not care much about existing RDF datasets; that's fair enough. You have a graph in Neo4j and you just want to publish it as RDF. This means that in your graph, the nodes don't necessarily have a property for the URI (why would they?) nor are they labelled as Resources. Not a problem.

OK, so if your graph is not the result of some RDF import, the service you want to use instead of the URI-based one is the node-ID-based equivalent:

:GET /rdf/describe/id?nodeid=<nodeid>

We’ll use for this example Neo4j’s movie database. You can get it loaded in your Neo4j instance by running

:play movies

You can get the ID of a node either directly, by clicking on it in the browser, or by running a simple query like this one:

MATCH (x:Movie {title: "Unforgiven"}) 
RETURN ID(x)

In my Neo4j instance, the returned ID is 97, so the GET request passes this ID and returns in the browser the JSON-LD serialisation of the node representing the movie "Unforgiven", with its attributes and the set of nodes connected to it (both inbound and outbound connections):

[Screenshot: JSON-LD description of the 'Unforgiven' node]

But as in the previous case, the endpoint can also produce your favourite serialisation just by setting it in the Accept header of the request.

curl -i http://localhost:7474/rdf/describe/id?nodeid=97 -H accept:text/plain

Setting the serialisation to N-Triples format, the previous request gets you these triples:

<neo4j://indiv#97> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <neo4j://vocabulary#Movie> .
<neo4j://indiv#97> <neo4j://vocabulary#tagline> "It's a hell of a thing, killing a man" .
<neo4j://indiv#97> <neo4j://vocabulary#title> "Unforgiven" .
<neo4j://indiv#97> <neo4j://vocabulary#released> "1992"^^<http://www.w3.org/2001/XMLSchema#long> .
<neo4j://indiv#167> <neo4j://vocabulary#REVIEWED> <neo4j://indiv#97> .
<neo4j://indiv#89> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#DIRECTED> <neo4j://indiv#97> .
<neo4j://indiv#98> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .

The sharpest of you may notice when running it that there is a bit missing: there are relationship properties in the movie database that are lost in the RDF fragment. Yes, that is because there is no way of expressing that in RDF. At least not without resorting to horribly complicated patterns like reification or singleton properties, which are effectively unusable in any practical real-world use case. But we'll get to that too in future posts.

Takeaways

 

I guess the main one is that if you want to get the benefits of native graph storage and be able to query your graph with Cypher in Neo4j but also want to:

  •  be able to easily import RDF data into your graph and/or
  •  offer an RDF “open standards compliant” API for publishing your graph

Well, that’s absolutely fine, because we’ve just seen how Neo4j does a great job at producing and consuming RDF.

Remember: RDF is about data exchange, not about storage.

There is more to come on producing RDF from Neo4j than what I've shown in this post, for instance publishing the results of a Cypher query as RDF. Does it sound interesting? Watch this space.

Also I’d love to hear your feedback!

 

 

 

The ‘hidden’ connections in Google’s Knowledge Graph

As far as I know, the only way to query Google's Knowledge Graph currently is the search API. Let's run a query on it: search, for instance, for Miles Davis' album "Sketches of Spain".

https://kgsearch.googleapis.com/v1/entities:search?query=sketches%20of%20spain&key=<your_key_here>&limit=1

The API returns this JSON-LD fragment (thanks, Jos de Jong, for the great JSON Editor Online):

[Screenshot: JSON-LD response from the Knowledge Graph search API]

Strip out the wrapping entities and each search result returned is just a node from the Knowledge Graph for which we get the id, type (category), name and description. Additionally, you may get your node linked to a Wikipedia page that provides a detailed description of the entity. That’s what the red box highlights in the previous fragment. Visually, what we get is something like this:

[Diagram: the single entity node returned by the search API, linked to its Wikipedia page]

This is nice, because your text search is returning an entity in Google's knowledge graph, and it's structured data… yes, but there's something missing. I don't think I'd be exaggerating if I said that the most important bit is missing: the context, the connections, the other bits of the graph that this entity relates to. Let me explain what I mean: if I run the same search in a browser, I get a much richer result from the Knowledge Graph:

[Screenshot: the Knowledge Graph panel for 'Sketches of Spain' in Google search results]

The dashed red box shows what the search API currently returns, and the bits connected with the arrows are the context that I'm talking about: the author of the album, the producers, the awards received, the genre… The data is obviously in the graph, and JSON-LD's capabilities for expressing rich linked data are crying out to be used. If that weren't enough, the relationships are already defined in schema.org, so it looks like we have all we need. Actually, Google, you have all you need 🙂

Right, so based on this, what would a (WAY) richer result look like? Look at the little blue box that I added to the original query output:

[Screenshot: the original JSON-LD output extended with a block of linked entities]

Or, for a probably more intuitive representation, look at the graph that this new JSON-LD fragment represents:

[Diagram: the album node connected to its author, producers, awards and genre]

Wouldn’t it be cool? And not only cool but also extremely useful? Let me know your thoughts.

And yes, for those of you who may be wondering where I got the IRIs of the extra nodes and whether they are real or made up: I ran separate queries on the search API for each of the related entities and stuck it all together manually, so they are valid IRIs, just retrieved separately.

One final comment: If you’re interested in publishing/sharing connected data (graph data) as JSON-LD straight from your Neo4j Graph Database, have a look at this repo.

 

 

 

Importing RDF data into Neo4j

The previous blog post might have been a bit too dense to start with, so I'll try something a bit lighter this time, like importing RDF data into Neo4j. It assumes, however, a certain degree of familiarity with both RDF and graph databases.

There are a number of RDF datasets out there that you may be aware of, and you may have asked yourself at some point: "if RDF is a graph, then it should be easy to load it into a graph database like Neo4j, right?". Well, the RDF model and the property graph model (implemented by Neo4j) are both graph models, but with some important differences that I won't go over in this post. What I'll do, though, is describe one possible way of migrating data from an RDF graph into Neo4j's property graph database.

I’ve also implemented this approach as a Neo4j stored procedure, so if you’re less interested in the concept and just want to see how to use the procedure you can go straight to the last section. Give it a try and share your experience, please.

The mapping

The first thing to do is plan a way to map both models. Here is my proposal.

An RDF graph is a set of triples or statements (subject, predicate, object) where both the subject and the predicate are resources and the object can be either another resource or a literal. The only particularity of literals is that they cannot be the subject of other statements; in a tree structure, we would call them leaf nodes. Also keep in mind that resources are uniquely identified by URIs.
Rule 1: Subjects of triples are mapped to nodes in Neo4j. A node in Neo4j representing an RDF resource will be labeled :Resource and have a property uri with the resource's URI.
(S,P,O) => (:Resource {uri:S})...
Rule 2a: Predicates of triples are mapped to node properties in Neo4j if the object of the triple is a literal.
(S,P,O) && isLiteral(O) => (:Resource {uri:S, P:O})
Rule 2b: Predicates of triples are mapped to relationships in Neo4j if the object of the triple is a resource
(S,P,O) && !isLiteral(O) => (:Resource {uri:S})-[:P]->(:Resource {uri:O})
Let's look at an example. Here is a short RDF fragment from the RDF Primer by the W3C that describes a web page and links it to its author. The triples are the following:
ex:index.html   dc:creator              exstaff:85740 .
ex:index.html   exterms:creation-date   "August 16, 1999" .
ex:index.html   dc:language             "en" .
The URIs of the resources are shortened using the XML namespace mechanism. In this example, ex stands for http://www.example.org/, exterms for http://www.example.org/terms/, exstaff for http://www.example.org/staffid/ and dc for http://purl.org/dc/elements/1.1/.
The full URIs are shown in the graphical representation of the triples (the figure is taken from the W3C page).
[Figure from the W3C RDF Primer: graphical representation of the three triples, with full URIs]
If we iterate over this set of triples applying the three rules defined before, we get the following elements in a Neo4j property graph. I'll use Cypher to describe them.
The application of rules 1 and 2b to the first triple would produce:
(:Resource { uri:"ex:index.html"})-[:`dc:creator`]->(:Resource { uri:"exstaff:85740"})
The second triple is transformed using rules 1 and 2a:
(:Resource { uri:"ex:index.html", `exterms:creation-date`: "August 16, 1999"})
And finally, the third triple is transformed, also with rules 1 and 2a, producing:
(:Resource { uri:"ex:index.html", `dc:language`: "en"})

Categories

The proposed set of basic mapping rules can be improved by adding one obvious exception for categories. RDF can represent both data and metadata as triples in the same graph, and one of the most common uses of this is to categorise resources by linking them to classes through an instance-of style relationship (called rdf:type). So let's add a new rule to deal with this case.
Rule 3: The rdf:type statements are mapped to categories in Neo4j.
(Something ,rdf:type, Category) => (:Category {uri:Something})
The rule basically maps the way individual resources (data) are linked to classes (metadata) in RDF through the rdf:type predicate to the way you categorise nodes in Neo4j, i.e. by using labels.
This also has the advantage of removing dense nodes, which aren't particularly nice for any database to deal with. Rather than having a few million nodes representing people in your graph, all of them connected to a single Person class node, we will have them all labeled as :Person, which makes a lot more sense, and there is no semantic loss.

The naming of things

Resources in RDF are identified by URIs, which makes them unique, and that's great, but they are meant to be machine-readable rather than nice to the human eye. So even though you'd like to read 'Person', RDF will use something like http://vocabularies.com/socialNetwork/voc#Person. While these kinds of names can be used in Neo4j with no problem, they will make your labels and property names horribly long and hard to read, and your Cypher queries will be polluted with http://… making the logic harder to grasp.
So what can we do? We have two options. Option 1 is to leave things named just as they are in the RDF model, with full URIs, and deal with it in your queries. This would be the right thing to do if your data uses multiple schemas not necessarily under your control and/or more schemas can be added dynamically. Option 2 is to make the pragmatic decision of shortening names to make both the model and the queries more readable. This will require some governance to ensure there are no name clashes; probably a reasonable thing to do if you are migrating into Neo4j data from an RDF graph where you own the vocabularies being used, or at least control which schemas are used.
The initial version of the importRDF stored procedure supports both approaches as we will see in the final sections.

Datatypes in RDF literals

Literals can have datatypes associated in RDF by pairing a string with a URI that identifies a particular XSD datatype:

exstaff:85740  exterms:age  "27"^^xsd:integer .

As part of the import process, you may want to map the XSD datatype used in a triple to one of Neo4j's datatypes. If datatypes are not explicitly declared in your RDF data, you can always load all literals as strings and then cast them if needed at query time or through some batch post-import processing.
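A post-import cast could be as simple as the following sketch, reusing the age property from the example above (toInteger is the function name in recent Cypher versions; older releases called it toInt):

MATCH (n:Resource)
WHERE exists(n.`exterms:age`)
SET n.`exterms:age` = toInteger(n.`exterms:age`)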

Blank nodes

The building block of the RDF model is the triple, and this implies an atomic decomposition of your data into individual statements. However (and I quote here the W3C's RDF Primer again) most real-world data involves structures that are more complicated than that, and the way to model structured information is by linking the different components to an aggregator resource. These aggregator resources may never need to be referred to directly, and hence may not require universal identifiers (URIs). Blank nodes are the artefacts in RDF that fulfil this requirement of representing anonymous resources. Triple stores will give them some sort of store-local unique ID for the purposes of keeping unicity and avoiding clashes.

Our RDF importer will label blank nodes as :BNode and resources identified with URIs as :URI. However, it's important to keep in mind that if you bring data into Neo4j from multiple RDF graphs, the identifiers of blank nodes are not guaranteed to be unique, and unexpected clashes may occur, so extra controls may be required.

The importRDF stored procedure

As I mentioned at the beginning of the post, I’ve implemented these ideas in the form of a Neo4j stored procedure. The usage is pretty simple. It takes four arguments as input.

  • The url of the RDF data to import.
  • The type of serialization used. The most frequent serializations for RDF are JSON-LD, Turtle, RDF/XML, N-Triples and TriG. There are a couple more, but these are the ones accepted by the stored proc for now.
  • A boolean indicating whether we want the names of labels, properties and relationships shortened as described in the “naming of things” section.
  • The periodicity of the commits. Number of triples ingested after which a commit is run.
CALL semantics.importRDF("file:///Users/jbarrasa/Downloads/opentox-example.turtle","Turtle", false, 500)

Will produce the following output:

[Screenshot: importRDF output for the OpenTox example file]

UPDATE [16-Nov-2016]: The stored procedure has been evolving over the last few months and the signature has changed. It now takes an extra boolean argument indicating whether the category optimisation (Rule 3) is applied or not. I expect the code to keep evolving, so take this post as an introduction to the approach and look for the latest on the implementation in the GitHub repo README.

The URL can point at a local RDF file, as in the previous example, or at one accessible via HTTP. The next example loads a public dataset with 3.5 million triples on food products, their ingredients, allergens, nutrition facts and much more, from Open Food Facts.

CALL semantics.importRDF("http://fr.openfoodfacts.org/data/fr.openfoodfacts.org.products.rdf","RDF/XML", true, 25000)

On my laptop the whole import took just over 4 minutes to produce this output.

[Screenshot: importRDF output for the Open Food Facts dataset]

 

When shortening of names is selected, the list of prefixes being used is included in the import summary. If you want to give it a try, don't forget to create the following indexes beforehand; otherwise the stored procedure will abort the import and remind you:

CREATE INDEX ON :Resource(uri) 
CREATE INDEX ON :URI(uri)
CREATE INDEX ON :BNode(uri) 
CREATE INDEX ON :Class(uri)

Once imported, I can find straight away the set of shared ingredients between your Kellogg's Coco Pops cereals and a bag of pork pies that you can buy at your local Spar.

[Screenshot: shared ingredients between Coco Pops and the pork pies]

 

Below is the Cypher query that produces these results. Notice how the URLs have been shortened, but uniqueness of names is preserved by prefixing them with a namespace prefix.

MATCH (prod1:Resource { uri: 'http://world-fr.openfoodfacts.org/produit/9310055537194/coco-pops-kellogg-s'})
MATCH (prod2:ns3_FoodProduct { ns3_name : '2 Snack Pork Pies'})
MATCH (prod1)-[:ns3_containsIngredient]->(x1)-[:ns3_food]->(sharedIngredient)<-[:ns3_food]-(x2)<-[:ns3_containsIngredient]-(prod2)
RETURN prod1, prod2, x1, x2, sharedIngredient

I've intentionally written the two MATCH blocks for the two products in different ways: one identifying the product by its unique identifier (URI), and the other combining the category and the name.

A couple of open points

There are a couple of things that I have not explored in this post and that the current implementation of the RDF importer does not deal with.

Multivalued properties

The current implementation does not deal with multivalued properties, although an obvious implementation could be to use arrays of values; a sketch of what that could look like follows.
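For illustration, this is roughly what accumulating a second value into an array could look like, assuming the property is stored as a list from its first value (the "fr" literal is hypothetical):

MATCH (n:Resource { uri: "ex:index.html" })
// [] + value produces a single-element list; list + value appends
SET n.`dc:language` = coalesce(n.`dc:language`, []) + "fr"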

And the metadata?

This works great for instance data, but there is a little detail to take into account: an RDF graph can contain metadata statements. This means that you can find in the same graph (JB, rdf:type, Person), (Person, rdf:type, owl:Class) and even (rdf:type, rdf:type, rdf:Property). The post on Building a semantic graph in Neo4j gives some ideas on how to deal with RDF metadata, but this is a very interesting topic and I'll be coming back to it in future posts.

Conclusions

Migrating data from an RDF graph into a property graph like the one implemented by Neo4j can be done in a generic and relatively straightforward way, as we've seen. This is interesting because it gives you an automated way of importing your existing RDF graphs (regardless of the serialization: JSON-LD, RDF/XML, Turtle, etc.) into Neo4j without loss of their graph nature and without having to go through any intermediate flattening step.

The import process being totally generic results in a graph in Neo4j that, of course, inherits the modelling limitations of RDF, like the lack of support for attributes on relationships, so you will probably want to enrich/fix your raw graph once it's been loaded in Neo4j. Both potential improvements to the import process and post-import graph processing will be discussed in future posts. Watch this space.