QuickGraph#5 Learning a taxonomy from your tagged data

The Objective

Say we have a dataset of multi-tagged items: books with multiple genres, articles with multiple topics, products with multiple categories… We want to organise these tags (the genres, the topics, the categories…) logically, in a way that is both descriptive and actionable. A typical organisation is hierarchical, like a taxonomy.

But rather than building it manually, we are going to learn it from the data in an automated way. This means that the quality of the results will totally depend on the quality and distribution of the tagging in your data, so sometimes we’ll produce a rich taxonomy but sometimes the data will only yield a set of rules describing how tags relate to each other.

Finally, we’ll want to show how this taxonomy can be used and I’ll do it with an example on content recommendation / enhanced search.

The dataset

We’ll use data from Goodreads on books and how they’ve been categorised by readers. In Goodreads, there is a notion of “shelf” which is a user created public category or tag that can be added to books and reused by other readers. Here is a page from Goodreads on a book along with the graph view of the data that we’ll extract from the page for this experiment.

Each book has a few shelves (genres) and an author, although I will not use the author information in this case.

Here is the data load script that you can try on your local Neo4j instance:

CREATE INDEX ON :Author(name);
CREATE INDEX ON :Book(id);
CREATE INDEX ON :Genre(name);
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/jbarrasa/datasets/master/goodreads/booklist.csv" AS row
MERGE (b:Book { id : row.itemUrl})
SET b.description = row.description, b.title = row.itemTitle
WITH b, row
UNWIND split(row.genres,';') AS genre
MERGE (g:Genre { name: substring(genre,8)})
MERGE (b)-[:HAS_GENRE]->(g)
WITH b, row
UNWIND split(row.author,';') AS author
MERGE (a:Author { name: author})
MERGE (b)-[:HAS_AUTHOR]->(a)

The data model is pretty simple, as we've seen: it only models three types of entities, the books, their authors and the genres. Here is the db.schema:

[Screenshot: db.schema view of the model]
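You can reproduce this view on your own instance from the Neo4j browser with the built-in schema procedure (available in Neo4j 3.x):

CALL db.schema()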

And some metrics on the tagging:

  • AVG number of genres per book: 5.4
  • MAX number of genres per book: 10
  • MIN number of genres per book: 1

You can get them by running this query:

MATCH (n:Book) 
WITH id(n) AS bookid, size((n)-[:HAS_GENRE]->()) AS genreCount
RETURN AVG(genreCount) AS avgNumGenres, MAX(genreCount) AS maxNumGenres, MIN(genreCount) AS minNumGenres
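It can also be useful to see how the books spread across genres, since the co-occurrence step below applies a minimum-size threshold per genre. A minimal sketch (the exact threshold is something you would tune to your own data):

MATCH (g:Genre)
RETURN g.name AS genre, size((g)<-[:HAS_GENRE]-()) AS numBooks
ORDER BY numBooks DESC LIMIT 10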

The Taxonomy Learning Algorithm

The algorithm is based on tag co-occurrence. The items in our dataset have multiple tags each, which means that tags will co-occur (will appear together) in a number of items. This algorithm will analyse the sets of items where tags co-occur and apply some pretty straightforward logic: If every item tagged as A is also tagged as B, we can derive that A implies B or in other words, the category defined by tag A is “narrower-than” the category defined by tag B. Easy, right?

Let's look at the algo step by step using the simple model described before on books and genres. Books are our tagged items and the genres are the tags.

(b:Book)-[:HAS_GENRE]->(g:Genre)

STEP1: Compute co-occurrence

Co-occurrence is the basic building block for the algorithm and is in itself a quite useful relationship because it indicates some degree of overlap between tags and therefore a certain degree of similarity which is something that can be exploited for query expansion or recommendation.

The co-occurrence index between two categories A and B is computed as the portion of items in category A that are also in category B; this is simply the number of items tagged as both A and B divided by the number of items tagged as A.

COOC(A,B) = #items tagged as both A and B /  #items tagged as A

[Diagram: co-occurrence of two tags]
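Before computing the index graph-wide, you can eyeball it for a single pair of tags with a quick query like this sketch (the two genre names are just examples taken from this dataset):

MATCH (a:Genre {name: 'space-opera'})<-[:HAS_GENRE]-(b:Book)
WITH a, count(b) AS itemsInA
MATCH (a)<-[:HAS_GENRE]-(b:Book)-[:HAS_GENRE]->(:Genre {name: 'science-fiction'})
RETURN toFloat(count(b)) / itemsInA AS coocIndex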

Notice that while the co-occurrence relationship is not directional, the co-occurrence index is, so we will persist the co-occurrence of two tags in our graph as two relationships, one in each direction, each carrying its co-occurrence index.

In cypher:

MATCH (g:Genre) WHERE SIZE((g)<-[:HAS_GENRE]-()) > 5 //Threshold 
WITH g, size((g)<-[:HAS_GENRE]-()) as totalCount
MATCH (g)<-[:HAS_GENRE]-(book)-[:HAS_GENRE]->(relatedGenre)
WITH g, relatedGenre, toFloat(count(book)) / totalCount AS coocIndex
CREATE (g)-[:CO_OCCURS {index: coocIndex }]->(relatedGenre)

I've also included in the Cypher implementation a WHERE clause to exclude categories that contain fewer items than a given threshold. This is an optional adjustment that you may want to apply, and there are a number of similar optimisations that can be applied to the basic co-occurrence computation to make it produce higher quality results.

STEP2: Infer same-as relationships

Once we have co-occurrence in the graph, we want to detect equivalent genres. Two genres are equivalent if the co-occurrence index in both directions is 1, or in other words, if every item having genre g1 also has genre g2 and vice versa.

Here is the Cypher that does the job. Equivalent categories (genres) are linked through the SAME_AS relationship:

MATCH (g1)-[co1:CO_OCCURS {index : 1}]->(g2),
      (g2)-[co2:CO_OCCURS { index: 1}]->(g1)
WHERE ID(g1) > ID(g2)
MERGE (g1)-[:SAME_AS]-(g2)

STEP3: Infer narrower-than relationships

Here is where we infer the hierarchical relationship between two categories (genres). Very similar to the previous rule, we now check if one of the co-occurrence indexes is 1 and the other is less than 1. Or in English, if every item having genre g1 also has genre g2 but the opposite is not true.

The Cypher that creates the NARROWER_THAN hierarchy is as follows.

MATCH (g1)-[co1:CO_OCCURS]->(g2), 
      (g2)-[co2:CO_OCCURS]->(g1)
WHERE ID(g1) > ID(g2) 
      AND co1.index = 1 AND co2.index < 1
MERGE (g1)-[:NARROWER_THAN]->(g2)

STEP4: Reduce transitive narrower-than relationships

Finally, this computation may have produced more NARROWER_THAN relationships than needed, so we want to remove the transitive ones. If (X)-[:NARROWER_THAN]->(Y) and (Y)-[:NARROWER_THAN]->(Z), then we may want to get rid of any (X)-[:NARROWER_THAN]->(Z) as it is redundant. But this is an optional step that you may or may not want to include.

MATCH (g1)-[:NARROWER_THAN*2..]->(g3), 
      (g1)-[d:NARROWER_THAN]->(g3)
DELETE d

So that's it! Let's have a look at the taxonomy we've just built.
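One simple way to inspect it is to list complete NARROWER_THAN chains, from the most specific genres up to the most general ones. A minimal sketch:

MATCH path = (leaf:Genre)-[:NARROWER_THAN*]->(top:Genre)
WHERE NOT ()-[:NARROWER_THAN]->(leaf)
      AND NOT (top)-[:NARROWER_THAN]->()
RETURN [g IN nodes(path) | g.name] AS hierarchy
LIMIT 25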

A couple of interesting examples:

  • military-science-fiction << space-opera << science-fiction
  • nutrition << health << non-fiction

Worth mentioning that while these hold in our dataset, as the graph grows and evolves over time we may find contradictions that invalidate them, so it will make sense to re-run the taxonomy computation as the data evolves.
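Re-computing it is just a matter of deleting the previously inferred relationships and running steps 1 to 4 again. A minimal cleanup sketch (assuming only the relationship types created above need to go):

MATCH ()-[r:CO_OCCURS|SAME_AS|NARROWER_THAN]->()
DELETE r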

Using the taxonomy

The objective of learning how a set of tags relate was to then be able to use it in some meaningful way. The CO_OCCURS relationship is useful in itself, as it indicates some degree of overlap between tags and therefore a certain degree of similarity. But NARROWER_THAN has stronger semantics, so let's see how we could use it.

One possible use would be to recommend books based on the taxonomy we've just learned. This first query lists available subcategories along with the number of items in them:

MATCH (b:Book) WHERE b.title = {title}
MATCH (b)-[:HAS_GENRE]->(g)<-[:NARROWER_THAN]-(childGenre)
WHERE size((g)<-[:HAS_GENRE]-()) < 500
      AND NOT (b)-[:HAS_GENRE]->()-[:NARROWER_THAN]->(g)
     AND NOT (b)-[:HAS_GENRE]->(childGenre)
RETURN "Hello! '" + b.title + "' is tagged as '" + g.name + "', and we have " + size((childGenre)<-[:HAS_GENRE]-()) + " books on '" + childGenre.name + "' which is a narrower category. Want to have a look? " AS recommendationQuestion

When run on ‘Pride and Prejudice’ it produces this output:

[Screenshot: recommendation question output for 'Pride and Prejudice']

Actually it does a bit more than just that. If we analyse the Cypher, we can see that we start from a selected book and, for each of its genres with fewer than 500 books in it (we want to exclude large ones like 'fiction' as they are too generic to provide relevant recommendations), we get the subcategories that the current book is not tagged with. We also stop generating subcategory-based recommendations if a sibling subcategory is already among the tags of the selected book. Basically, if a book is tagged as 'sports' and 'tennis', tennis being a subcategory of sports, we will not recommend other subcategories of sports like 'hockey' or 'football'. Yes, all that in 5 lines of Cypher! Anyway, this is just one possible query that uses the hierarchy and that makes sense in my dataset, but you may want to tune it to yours.

And this second query, very similar to the previous one, lists the actual items in the subcategory:

MATCH (b:Book) WHERE b.title = {title}
MATCH (b)-[:HAS_GENRE]->(g)<-[:NARROWER_THAN]-(childGenre)
WHERE size((g)<-[:HAS_GENRE]-()) < 500 AND NOT (b)-[:HAS_GENRE]->()-[:NARROWER_THAN]->(g) AND NOT (b)-[:HAS_GENRE]->(childGenre)
WITH childGenre
MATCH (booksInChildCategory)-[:HAS_GENRE]->(childGenre)
RETURN booksInChildCategory.title AS bookTitle, substring(booksInChildCategory.description,1, 70) + '...' AS description

We can run it this time on ‘Slow Bullets’ and it will produce:

[Screenshot: books in the narrower category for 'Slow Bullets']

Notice that both queries are neutral with regard to what's in the taxonomy: as the taxonomy evolves over time, the results will change accordingly.

Of course, recommendation can get a lot more complicated, but this is just a basic suggestion of how the taxonomy could be used. Another option is to use NARROWER_THAN in combination with CO_OCCURS for richer recommendations when there are no NARROWER_THAN alternatives, as sketched below. More on this in future blog posts.
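To give an idea of what that combination could look like, here is a hedged sketch (not part of the algorithm above): when a genre of the selected book has no narrower subcategories, fall back to its most strongly co-occurring genres instead.

MATCH (b:Book {title: {title}})-[:HAS_GENRE]->(g)
WHERE NOT ()-[:NARROWER_THAN]->(g)
MATCH (g)-[co:CO_OCCURS]->(related:Genre)
WHERE NOT (b)-[:HAS_GENRE]->(related)
RETURN g.name AS genre, related.name AS relatedGenre, co.index AS similarity
ORDER BY similarity DESC LIMIT 5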

What’s interesting about this QuickGraph?

This is a basic attempt at analysing tag co-occurrence using a graph. The algorithm can be refined in a number of ways, but I thought it would be interesting to share it in its basic form and maybe blog later about how to improve it.

I think the most interesting aspect is that the approach is generic and can be used in many contexts to build a purely dynamic and automated solution. The taxonomy creation algorithm can be re-run on a regular basis as new tagged data is added to the graph, and the logic (like the one described in the "using the taxonomy" section) will produce results adapted to the fresh version of the taxonomy.

It’s worth mentioning that the quality of the hierarchy will directly depend on the quality of your data tagging! We are not creating a formal ontology here but rather building a pragmatic and actionable taxonomy derived in an automated way from your data.

Watch this space for other examples of use of this approach and some suggested refinements.

I’d also love to get your feedback!

Quick note on getting the data

To get some data from Goodreads your best option is to write some code using their API. Other alternatives are scraping services like import.io, or HTML scraping libraries for your favourite programming language: rvest if you're an R fan, or Beautiful Soup or lxml if you prefer Python.

If you want to test the algorithm on your Neo4j instance with the same dataset I used, you just need to run the data load scripts above; they include the link to the data.

Graph DB + Data Virtualization = Live dashboard for fraud analysis

The scenario

Retail banking: your graph-based fraud detection system powered by Neo4j is being used as part of the controls run when processing line-of-credit applications or when accounts are provisioned. Its job is to block, or at least flag, potentially fraudulent submissions as they come into your systems. It also sends alarms to fraud operations analysts whenever unusual patterns are detected in the graph so they can be individually investigated ASAP.

This is all working great, but you want other analysts in your organisation to benefit from the super rich insights that your graph database can deliver: people whose job is not to react on the spot to individual fraud threats but rather to understand the bigger picture. They are probably more strategic business analysts, maybe some data scientists doing predictive analysis too, and they will typically want to look at fraud patterns globally rather than individually, combine the information in your fraud detection graph with other data sources (external to the graph) for reporting purposes or to get new insights, or even 'learn' new patterns by running algorithms or applying ML techniques.

In this post I’ll describe through an example how Data Virtualization can be used to integrate your Neo4j graph with other data sources providing a single unified view easy to consume by standard analytical/BI tools.

Don't get confused by the name: DV is about data integration, nothing to do with hardware or infrastructure virtualization.

The objective

I thought a good example for this scenario could be the creation of an integrated dashboard on your fraud detection platform aggregating data from a couple of different sources.

Nine times out of ten, integration will be synonymous with ETL-ing your data into a centralised store or data warehouse and then running your analytics/BI from there. Fine. This is of course a valid approach, but it also has its shortcomings, especially regarding agility, time to solution and cost of evolution, just to name a few. And as I said in the intro, I wanted to explore an alternative, more modern and agile approach called data virtualization; or, to use the term in vogue these days, I'll be building a logical data warehouse.

The “logical” in the name comes from the fact that data is not necessarily replicated (materialised) into a store but rather “wrapped” logically at the source and exposed as a set of virtual views that are run on demand. This is what makes this federated approach essentially different from the ETL based one.

[Diagram: logical data warehouse architecture for the experiment]

The architecture of my experiment is not too ambitious but rich enough to prove the point. It uses an off the shelf commercial data virtualization platform (Data Virtuality) abstracting and integrating two data sources (one relational, one graph) and offering a unified view to a BI tool.

Before I go into the details, a quick note of gratitude: When I decided to go ahead with this experiment, I reached out to Data Virtuality, and they very kindly gave me access to a VM with their data virtualization platform preinstalled and supported me along the way. So here is a big thank you to them, especially to Niklas Schmidtmer, a top solutions engineer who has been super helpful and answered all my technical questions on DV.

The data sources

 

Neo4j  for fraud detection

In this post I'm focusing on the integration aspects, so I will not go into the details of what a graph-based fraud detection solution built on Neo4j looks like. I'll just say that Neo4j is capable of keeping a real-time view of your account holders' information and detecting potentially fraudulent patterns as they appear. By "real time" here, I mean as accounts are provisioned or updated in your system, or as transactions arrive; in other words, as suspicious patterns are formed in your graph.

In our example, say we have a Cypher query returning the list of potential fraudsters. A potential fraudster in our example is an individual account holder involved in a suspicious ring pattern like the one in the Neo4j browser capture below. The query also returns some additional information derived from the graph like the size of the fraud ring and the financial risk associated with it. The list of fraudsters returned by this query will be driving my dashboard but we will want to enrich them first with some additional information from the CRM.

For a detailed description of what first party bank fraud is and how graph databases can fight it read this post.

[Screenshot: fraud ring pattern in the Neo4j browser]
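To give an idea of the shape of such a query, here is a heavily hedged sketch; the labels, relationship types and properties (:AccountHolder, HAS_SSN, HAS_PHONE, globalUserId) are assumptions for illustration, not the actual model behind the solution:

// account holders sharing identity attributes form a potential fraud ring
MATCH (holder:AccountHolder)-[:HAS_SSN|HAS_PHONE]->(pii)
WITH pii, collect(DISTINCT holder) AS ring
WHERE size(ring) > 2
UNWIND ring AS fraudster
RETURN DISTINCT fraudster.globalUserId AS userId, size(ring) AS ringSize
ORDER BY ringSize DESC

The financial risk figure mentioned above would be derived from the graph too, for instance by aggregating the credit exposure of the accounts in each ring.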

 

RDBMS backed CRM system

The second data source is any CRM system backed by a relational database. You can put here the name of your preferred one or whichever in-house built solution your organisation is currently using.

The data in a CRM is less frequently updated and contains additional information about our account holders.

Data Virtualization

As I said before, data virtualization is a modern approach to data integration based on the idea of data on demand. A data virtualization platform wraps different types of data sources: relational, NoSQL, APIs, etc… and makes them all look like relational views. These views can then be combined through standard relational algebra operations to produce rich derived (integrated) views that will ultimately be consumed by all sorts of BI, analytics and reporting tools or environments as if they came from a single relational database.

The process of creating a virtual integrated view of a number of data sources can be broken down in three parts. 1) Connecting to the sources and virtualizing the relevant elements in them to create base views, 2) Combining the base views to create richer derived ones and 3) publishing them for consumption by analytical and BI applications. Let’s describe each step in a bit more detail.

Connecting to the sources from the data virtualization layer and creating base views

The easiest way to interact with your Neo4j instance from a data virtualization platform is through the JDBC driver. The connection string and authentication details are pretty much all that's needed, as we can see in the following screen capture.

[Screenshot: Neo4j JDBC data source configuration in Data Virtuality]

Once the data source is created, we can easily define a virtual view on it based on our Cypher query with the standard CREATE VIEW… expression in SQL. Notice the usage of the ARRAYTABLE function to take the array structure returned by the request and produce a tabular output.

[Screenshot: CREATE VIEW statement using the ARRAYTABLE function]

Once our fraudsters view is created, it can be queried just as if it was a relational one. The data virtualization layer will take care of the “translation” because obviously Neo4j actually talks Cypher and not SQL.

[Screenshot: querying the fraudsters view with SQL]

If for whatever reason you want to hit Neo4j's HTTP REST API directly, you can do that by creating a POST request on the Cypher transactional endpoint and building the JSON message containing the Cypher (described in Neo4j's developer manual). In Data Virtuality this can easily be done through a web service data import wizard, see the next screen capture:

You’ll need to provide the endpoint details, the type of messages exchanged, the structure of the request. The wizard will then send a test request to figure out what the returned structure looks like and offer you a visual point and click way to select which values are relevant to your view and even offer a preview of the results.

Similar to the previous JDBC-based case, we now have a virtual relational view built on our Cypher query that can be queried through SQL. Again, the DV platform takes care of translating it into an HTTP POST request behind the scenes…

[Screenshot: SQL query on the virtual view built on the HTTP endpoint]

Now let's go to the other data source, our CRM. Virtualizing relational data sources is pretty simple because they are already relational. So once we've configured the connection (identical to the previous case, indicating server, port and authentication credentials), the DV layer can introspect the relational schema and do the work for us by offering the tables and views discovered.

[Screenshot: CRM tables discovered by the DV layer]

So we create a view on customer details from the CRM. This view includes the global user ID that we will use to combine this table with the fraudster data coming from Neo4j.

Combining the data from the two sources

Since we now have two virtual relational views in our data virtualization layer, all we need to do is combine them using a straightforward SQL JOIN. This can be achieved visually:

[Screenshot: visual join definition]

…or directly typing the SQL script

[Screenshot: SQL script defining the join]

The result is a new fraudster360 view combining information from both our CRM system and the Neo4j powered fraud detection platform. As in the previous cases, it is a relational view that can be queried and most interestingly exposed to consumer applications.

Important to note: there is no data movement at this point; data stays at the source. We are only defining a virtual view (metadata, if you want). Data will be retrieved on demand when a consumer application queries the virtual view, as we'll see in the next section.

We can however test what will happen by running a test query from the Data Virtuality SQL editor. It is a simple projection on the fraudster360 view.

[Screenshot: test query on the fraudster360 view]

We can visualise the query execution plan to see that…

[Screenshot: query execution plan]

…the query on the fraudster360 is broken down into two, one hitting Neo4j and the other the relational DB. The join is carried out on the fly and the results streamed to the requester.

Even though I'm quite familiar with the data virtualization world, it was not my intention in this post to dive too deep into the capabilities of these platforms. It is probably worth mentioning, though, that it is possible to use the DV layer as a way to control access to your data in a centralised way by defining role-based access rules. Or that DV platforms are pretty good at coming up with execution plans that delegate as much of the processing as possible down to the sources, or alternatively at caching at the virtual level if the desired behaviour is precisely the opposite (i.e. protecting the sources from being hit by analytical/BI workloads).

But there is a lot more, so if you’re interested, ask the experts.

Exposing composite views

I'll use Tableau for this example. Tableau can connect to the Data Virtualization server via ODBC. The virtual views created in Data Virtuality are listed, and all that needs to be done is selecting our fraudster360 view and checking that data types are imported correctly.

[Screenshot: connecting Tableau to the fraudster360 view]

I'm obviously not a Tableau expert, but I managed to easily create a couple of charts and put them into a reasonably nice looking dashboard. You can see it below: it shows how the different potential fraud cases are distributed by state, how the size of a ring (group of fraudsters collaborating) relates to the financial risk associated with it, and how these two factors are distributed regionally.

[Screenshot: fraud analysis dashboard in Tableau]

And the most interesting thing about this dashboard is that, since it is built on a virtual (non-materialised) view, whenever the dashboard is re-opened or refreshed the Data Virtualization layer will query the underlying Neo4j graph for the most recent fraud rings and join them with the CRM data, so the dashboard is guaranteed to be built on the freshest version of the data from all sources.

Needless to say that if instead of Tableau you are a Qlik or an Excel user, or you write R or python code for data analysis, you would be able to consume the virtual view in exactly the same way (or very similar if you use JDBC instead of ODBC).

Well, that’s it for this first experiment. I hope you found it interesting.

Takeaways

Abstraction: Data virtualization is a very interesting way of exposing Cypher based dynamic views on your Neo4j graph database to non technical users making it possible for them to take advantage of the value in the graph without necessarily having to write the queries themselves. They will consume the rich data in your graph DB through the standard BI products they feel comfortable with (Tableau, Excel, Qlik, etc).

Integration: the graph DB is a key piece in your data architecture, but it will not hold all the information, and integration will be required sooner or later. Data Virtualization proves to be quite a nice, agile approach to integrating your graph with other data sources, offering controlled virtual integrated datasets to business users and enabling self-service BI.

Interested in more ways in which Data Virtualization can integrate with Neo4j? Watch this space.

 

 

Neo4j is your RDF store (part 1)

If you want to understand the differences and similarities between RDF and the Labeled Property Graph implemented by Neo4j, I’d recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

Intro

Let me start with some basics: RDF is a standard for data exchange, but it does not impose any particular way of storing data.

What do I mean by that? I mean that data can be persisted in many ways: tables, documents, key-value pairs, property graphs, triple graphs… and still be published/exchanged as RDF.

It is true, though, that the bigger the paradigm impedance mismatch (the difference between RDF's modelling paradigm, a graph, and the underlying store's one), the more complicated and inefficient the translation for both ingestion and publishing will be.

I’ve been blogging over the last few months about how Neo4j can easily import RDF data and in this post I’ll focus on the opposite: How can a Neo4j graph be published/exposed as RDF.

Because, in case you didn't know, you can work with Neo4j, getting the benefits of native graph storage and processing (best performance, data integrity and scalability) while being totally 'open standards' in the eyes of any RDF-aware application.

Oh! hang on… and your store will also be fully open source!

A “Turing style” test of RDFness

In this first section I'll show the simplest way in which data from a graph in Neo4j can be published as RDF, but I'll also demonstrate that it is possible to import an RDF dataset into Neo4j without loss of information, in such a way that the RDF produced when querying Neo4j is identical to that produced by the original triple store.

[Diagram: the 'RDFness' test setup]

You’ll probably be familiar with the Turing test where a human evaluator tests a machine’s ability to exhibit intelligent behaviour, to the point where it’s indistinguishable from that of a human. Well, my test aims to prove Neo4j’s ability to exhibit “RDF behaviour” to an RDF consuming application, making it indistinguishable from that of a triple store. To do this I’ll use the neosemantics neo4j extension.

The simplest test one can think of, could be something like this:

Starting from an RDF dataset living in a triple store, we migrate it (all or partially) into Neo4j. Now, if we run a SPARQL DESCRIBE <uri> query on the triple store and its equivalent rdf/describe/uri request in Neo4j, do they return the same set of triples? If that is the case (and if we also want to be pompous) we could say that the results are semantically equivalent, and therefore indistinguishable to a consumer application.

We are going to run this test step by step on data from the British National Bibliography dataset:

Get an RDF node description from the triple store

To do that, we’ll run the following SPARQL DESCRIBE query in the British National Bibliography public SPARQL endpoint, or alternatively in the more user friendly SPARQL editor.

DESCRIBE <http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940>

The request returns an RDF fragment containing all the information about Mikhail Bulgakov in the BNB. A pretty cool author, by the way, whom I strongly recommend. The fragment actually contains 86 triples, the first of which are these:

<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/givenName> "Mikhail" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.w3.org/2000/01/rdf-schema#label> "Bulgakov, Mikhail, 1891-1940" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/familyName> "Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/name> "Mikhail Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/010535795> .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/008720599> .
...

You can get the whole set by running the query in the SPARQL editor I mentioned before or by sending an HTTP request with the query to the SPARQL endpoint:

curl -i http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E -H Accept:text/plain

Ok, so that's our baseline: exactly the output we want to get from Neo4j to be able to affirm that they are indistinguishable to an RDF consuming application.

Move the data from the triple store to Neo4j

We need to load the RDF data into Neo4j. We could load the whole British National Bibliography since it’s available for download as RDF, but for this example we are going to load just the portion of data that we need.

I will not go into the details of how this happens, as it's been described in previous blog posts with some examples. The semantics.importRDF procedure runs a straightforward and lossless import of RDF data into Neo4j. The procedure is part of the neosemantics extension. If you want to run the test with me on your Neo4j instance, now is the moment to install it (instructions in the README).

Once the extension is installed, the migration could not be simpler; just run the following stored procedure:

CALL semantics.importRDF("http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E",
"RDF/XML",true,true,500)

We are passing as a parameter the URL of the BNB SPARQL endpoint returning the RDF data needed for our test, along with some import configuration options. The output of the execution shows that the 86 triples have been correctly imported into Neo4j:

[Screenshot: semantics.importRDF output showing 86 triples loaded]

Now that the data is in Neo4j, you can query it with Cypher and visualise it in the browser. Here is a query example returning Bulgakov and all the nodes he's connected to:

MATCH (a)-[b]-(c:Resource { uri: "http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940"})
RETURN *

[Screenshot: the Bulgakov node and its connections in the Neo4j browser]

There is actually not much information in the graph yet: just the node representing good old Mikhail with a few properties (name, uri, etc…) and connections to the works he created or contributed to, the events of his birth and death, and a couple more. But let's not worry about size for now, we'll deal with that later. The question was: can we now query our Neo4j graph and produce the original set of RDF triples? Let's see.

Get an RDF description of the same node, now from Neo4j

The neosemantics repo also includes an extension (HTTP endpoints) that provides precisely this capability. The equivalent in Neo4j of the SPARQL DESCRIBE on Mikhail Bulgakov would be the following:

:GET /rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940

If you run it in the browser, you will get the default serialisation which is JSON-LD, something like this:

[Screenshot: JSON-LD serialisation returned by the endpoint]

But if you set in the request header the serialisation format of your choice -for example using curl again- you can get the RDF fragment in any of the available formats.

curl -i http://localhost:7474/rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940 -H accept:text/plain

Well, you should not be surprised to learn that it returns 86 triples, exactly the same set that the original query on the triple store returned.

So mission accomplished. At least for the basic case.

RDF out of Neo4j's movie database

I thought it would be interesting to prove that an RDF dataset can be imported into Neo4j and then published without loss of information, but OK, most of you may not care much about existing RDF datasets, and that's fair enough. You have a graph in Neo4j and you just want to publish it as RDF. This means that in your graph the nodes don't necessarily have a property for the uri (why would they?) nor are they labelled as Resources. Not a problem.

Ok, so if your graph is not the result of some RDF import, the service you want to use instead of the uri-based one is the nodeid-based equivalent.

:GET /rdf/describe/id?nodeid=<nodeid>

We’ll use for this example Neo4j’s movie database. You can get it loaded in your Neo4j instance by running

:play movies

You can get the ID of a node either directly by clicking on it on the browser or by running a simple query like this one:

MATCH (x:Movie {title: "Unforgiven"}) 
RETURN ID(x)

In my Neo4j instance, the returned ID is 97 so the GET request would pass this ID and return in the browser the JSON-LD serialisation of the node representing the movie “Unforgiven” with its attributes and the set of nodes connected to it (both inbound and outbound connections):

[Screenshot: JSON-LD description of the 'Unforgiven' node]

But as in the previous case, the endpoint can also produce your favourite serialisation just by setting it in the accept parameter in the request header.

curl -i http://localhost:7474/rdf/describe/id?nodeid=97 -H accept:text/plain

Setting the serialisation to N-Triples format, the previous request gets you these triples:

<neo4j://indiv#97> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <neo4j://vocabulary#Movie> .
<neo4j://indiv#97> <neo4j://vocabulary#tagline> "It's a hell of a thing, killing a man" .
<neo4j://indiv#97> <neo4j://vocabulary#title> "Unforgiven" .
<neo4j://indiv#97> <neo4j://vocabulary#released> "1992"^^<http://www.w3.org/2001/XMLSchema#long> .
<neo4j://indiv#167> <neo4j://vocabulary#REVIEWED> <neo4j://indiv#97> .
<neo4j://indiv#89> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#DIRECTED> <neo4j://indiv#97> .
<neo4j://indiv#98> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .

The sharpest of you may notice when you run it that there is a bit missing: there are relationship properties in the movie database that are lost in the RDF fragment. Yes, that is because there is no way of expressing that in RDF. At least not without resorting to horribly complicated patterns like reification or the singleton property, which are effectively unusable in any practical real-world use case. But we'll get to that too in future posts.

Takeaways

 

I guess the main one is that if you want to get the benefits of native graph storage and be able to query your graph with Cypher in Neo4j but also want to:

  •  be able to easily import RDF data into your graph and/or
  •  offer an RDF “open standards compliant” API for publishing your graph

Well, that’s absolutely fine, because we’ve just seen how Neo4j does a great job at producing and consuming RDF.

Remember: RDF is about data exchange, not about storage.

There is more to come on producing RDF from Neo4j than what I've shown in this post, for instance publishing the results of a Cypher query as RDF. Does it sound interesting? Watch this space.

Also I’d love to hear your feedback!

 

 

 

QuickGraph#4 Explore your browser history in Neo4j

The dataset

For this example I am going to use my browser history data. Most browsers store this data in SQLite, which means relational data, easy to access from Neo4j using the apoc.load.jdbc stored procedure. I'm a Chrome user, and on my Mac, Chrome stores the history DB at

~/Library/Application Support/Google/Chrome/Default/History

There are two main tables in the History DB: urls and visits. I'm going to explore them directly from Neo4j's browser using the same apoc.load.jdbc procedure. In order to do that, you'll have to first download a JDBC driver for SQLite and copy it into the plugins directory of your Neo4j instance. Also keep in mind that Chrome locks the History DB while the browser is open, so if you want to play with it (even read-only access) you will have to either close the browser or, as I did, copy the DB (a single file) somewhere else and work from that snapshot.

This Cypher fragment will return the first few records of the urls table, and we see in them things like a unique identifier for the page, its url, the title of the page and some counters with the number of visits and the number of times the url has been typed as opposed to reached by following a hyperlink.

CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    "urls") yield row 
WITH row LIMIT 10
RETURN *

The results look like this on my browser history.

[Screenshot: first records of the urls table]

The visits table contains information about each page visit event: a timestamp (visit_time), a unique identifier (id) for each visit and, most interestingly, whether the visit follows a previous one (from_visit), which would mean that there was a click on a hyperlink that led from page A to page B.

[Screenshot: records from the visits table]

A bit of SQL manipulation using the date and time functions on the SQLite side will filter out the columns from the visits table that we don’t care about for this experiment and also format the timestamp in a user friendly date and time.

SELECT id, url, time(((visit_time/1000000)-11644473600), 'unixepoch') as visit_time, 
date(((visit_time/1000000)-11644473600), 'unixepoch') as visit_date,
visit_time as visit_time_raw 
FROM visits

Here’s what records look like using this query. Nice and ready to be imported into Neo4j.

[Screenshot: visit records with formatted date and time]

Loading the data into Neo4j

The model I'm planning to build is quite simple: I'll use a node to represent a web page and a separate one to represent each individual visit to a page. Each visit event is linked to the page through the :VISIT_TO_PAGE relationship, and chained page visits (hyperlink navigation) are linked through the :NAVIGATION_TO relationship. Here is what that looks like visually on an example navigation from a post on the Neo4j blog to a page with some code on GitHub:

[Screenshot: example navigation modelled as a graph]

Ok, so let’s go with the import scripts.  First the creation of Page nodes out of every record in the urls table:

CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    "urls") yield row 
WITH row 
CREATE (p:Page {page_id: row.id, 
                page_url: row.url, 
                page_title: row.title, 
                page_visit_count: row.visit_count, 
                page_typed_count: row.typed_count})

And I’ll do the same with the visits, but linking them to the pages we’ve just loaded. Actually, to accelerate the page lookup I’ll create an index on page ids first.

CREATE INDEX ON :Page(page_id)

And here’s the Cypher running the visit data load.

WITH "SELECT id, url, visit_time as visit_time_raw, 
 time(((visit_time/1000000)-11644473600), 'unixepoch') as visit_time, 
 date(((visit_time/1000000)-11644473600), 'unixepoch') as visit_date 
 FROM visits" AS sqlstring

CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    sqlstring ) yield row
WITH row 
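// note: in Chrome's History DB the visits.url column is a foreign key to urls.id, hence the match on page_id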
MATCH (p:Page {page_id: row.url}) 
CREATE (v:PageVisit { visit_id: row.id, 
                      visit_time: row.visit_time, 
                      visit_date: row.visit_date, 
                      visit_timestamp: row.visit_time_raw}) 
CREATE (v)-[:VISIT_TO_PAGE]->(p)

And finally, I’ll load the transitions between visits but as we did before with the pages, let’s create first an index on visit ids:

CREATE INDEX ON :PageVisit(visit_id)

WITH "SELECT id, from_visit, transition, segment_id, visit_duration 
      FROM visits" AS sqlstring
CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    sqlstring
                    ) yield row 
WITH row 
MATCH (v1:PageVisit {visit_id: row.from_visit}),
      (v2:PageVisit {visit_id: row.id}) 
CREATE (v1)-[:NAVIGATION_TO]->(v2)

So we are ready to start querying our graph!

Querying the graph

Let’s look for a direct navigation in the graph that goes for instance from a page in the Neo4j web site to Twitter.

MATCH (v1)-[:VISIT_TO_PAGE]->(p1),
      (v2)-[:VISIT_TO_PAGE]->(p2),
      (v1)-[:NAVIGATION_TO]->(v2) 
WHERE p1.page_url CONTAINS 'neo4j.com' 
      AND p2.page_url CONTAINS 'twitter.com'
RETURN * LIMIT 10

In my browser history data, this produces the following output. Notice that I’ve extended it to include an extra navigation step. I’ve done that just by clicking on the graph visualisation in the Neo4j browser to make the example more understandable:

[Screenshot: navigation from neo4j.com to twitter.com]

It actually corresponds to a visit to the Neo4j blog, followed by me tweeting about how cool the post I had just read was. The proof that I'm working with real data is the actual tweet (!)

Ok, so while this basic model is good to analyse individual journeys, I think extracting a Site node by aggregating all pages in the same site can give us interesting insights. Let’s go for it.

Extending the model

This could be done in different ways, for example we could write a stored procedure and call it from a Cypher script. Having the full power of java, we could do a proper parsing of the url string to extract the domain.

I will do it differently though: I'll run a SQL query on the History SQLite DB including string transformations to substring the urls and extract the domain name (sort of). The SQL that extracts the root of the url could be the following:

SELECT id, substr(url,9,instr(substr(url,9),'/')-1) as site_root 
FROM urls 
WHERE instr(url, 'https://')=1 
UNION
SELECT id, substr(url,8,instr(substr(url,8),'/')-1) as site_root 
FROM urls
WHERE instr(url, 'http://')=1

Quite horrible, I know. But my intention is to show how the graph can be extended with new data without having to recreate it. This is quite a common scenario when you work with graphs but, relax, graphs are good at accommodating change; nothing like RDBMS migrations when you have to change your schema.

So this new query produces rows containing just the domain (the root of the url) and the page id that I will use to match to previously loaded pages. Something like this:

[Screenshot: rows with page id and site root]

And the Cypher that loads it and adds the extra information in our graph would be this:

WITH "select substr(url,9,instr(substr(url,9),'/')-1) as site_root, id 
      from urls where instr(url, 'https://')=1 
      UNION
      select substr(url,8,instr(substr(url,8),'/')-1) as site_root, id 
      from urls where instr(url, 'http://')=1"  AS query
CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                     query) yield row 
WITH row 
MATCH (p:Page {page_id: row.id})
MERGE (s:Site {site_root: row.site_root})
CREATE (p)-[:PAGE_IN_SITE]->(s)

And once we have the sites, we can include weighted site-level navigation. The weight is simply calculated by summing the number of transitions between pages belonging to each site. Here is the Cypher that does the job:

MATCH (s:Site)<-[:PAGE_IN_SITE]-()<-[:VISIT_TO_PAGE]-()<-[inbound:NAVIGATION_TO]-()-[:VISIT_TO_PAGE]->()-[:PAGE_IN_SITE]->(otherSite) 
WHERE otherSite <> s 
WITH otherSite, s, count(inbound) as weight 
CREATE (otherSite)-[sn:SITE_DIRECT_NAVIGATION{weight:weight}]->(s)
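A quick way to sanity-check the result is to list the heaviest site-to-site transitions. A minimal sketch:

MATCH (s1:Site)-[n:SITE_DIRECT_NAVIGATION]->(s2:Site)
RETURN s1.site_root AS fromSite, s2.site_root AS toSite, n.weight AS weight
ORDER BY weight DESC LIMIT 10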

This is a much richer graph, where we can traverse not only individual journeys but also site-level connections. In the following visualisation we can see that there are some transitions between the http://www.theguardian.co.uk and the http://www.bbc.co.uk sites (indicated in green), and also to other sites like en.wikipedia.org. In the same capture we can see one of the individual navigations that explain the existence of a :SITE_DIRECT_NAVIGATION relationship between the Guardian node and the BBC one. It actually represents a hyperlink I clicked in a Guardian article that took me to a BBC one. The purple sequence of events (page visits) details my journey and the yellow nodes represent the pages, pretty much the same as we saw in the previous example from neo4j.com to twitter.com.

[Screenshot: site-level navigation with an individual journey highlighted]

We can also have a bird’s eye view of a few thousand of the nodes on the graph and notice some interesting facts:

[Screenshot: bird's eye view of a few thousand nodes in the graph]

I've highlighted some interesting Site nodes. We can see that the most highly connected (more central in the visualisation) are the Google domains and the URL shortening services (t.co, bit.ly, etc.). It makes sense because you typically navigate in and out of them; they are kind of bridge nodes in your navigation. This is confirmed if we run the betweenness centrality algorithm on the sites and their connections. Briefly, betweenness centrality is an indicator of a node's centrality in a graph and is equal to the number of shortest paths from all nodes to all others that pass through that node.

Here is the Cypher script, again invoking the graph algo implementation as a stored procedure that you can find in the amazing APOC library:

MATCH (s:Site)
WITH collect(s) AS nodes
CALL apoc.algo.betweenness(['SITE_DIRECT_NAVIGATION'],nodes,'BOTH') 
  YIELD node, score
RETURN node.site_root, score
ORDER BY score DESC LIMIT 5

And these are the top five results of the computation on my browser history.

[Screenshot: top five sites by betweenness centrality]

I'm sure you can think of many other interesting queries on your own navigation: what's the average length of a journey, how many different sites does it traverse, is it mostly intra-site? Are there any isolated clusters? An example of this in my browser history are the Amazon sites (amazon.co.uk and music.amazon.co.uk): there seem to be loads of transitions (navigation) between them but none in or out to/from other sites. You can see this visually in the bottom left part of the previous bird's eye view. I'm sure you will come up with many more (a quick sketch for the journey-length question follows below), but I'll finish this QuickGraph with a query involving some serious path exploration.
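For the journey-length question, here is a minimal, hedged sketch on the model built above, where a journey is taken to be any maximal chain of :NAVIGATION_TO relationships (single-page visits are therefore not counted):

MATCH p = (v1:PageVisit)-[:NAVIGATION_TO*]->(v2:PageVisit)
WHERE NOT ()-[:NAVIGATION_TO]->(v1)
      AND NOT (v2)-[:NAVIGATION_TO]->()
RETURN AVG(length(p)) AS avgJourneyLength, MAX(length(p)) AS longestJourney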

The question is: which sites have I navigated to from LinkedIn pages, how many times have I reached them, and how long (as in how many hyperlink clicks) did it take me to get to them? You may be asking yourself how on earth you would even express that in SQL(?!?!). Well, not to worry: you'll be pleased to see that it takes less writing to express the query in Cypher than it does in English. Here it is:

MATCH (v1)-[:VISIT_TO_PAGE]->(p1)-[:PAGE_IN_SITE]-(s1:Site {site_root: "www.linkedin.com"}) 
MATCH p = (v1)-[:NAVIGATION_TO*]->(v2)-[:VISIT_TO_PAGE]->(p2)-[:PAGE_IN_SITE]-(s2)
WHERE s2 <> s1
WITH length(p) AS pathlen, s2.site_root AS site 
RETURN AVG(pathlen) AS avglen, count(*) AS count, site ORDER BY avglen

And my results, 21 milliseconds later…

[Screenshot: query results]

What’s interesting about this QuickGraph?

This experiment shows several interesting things, the first being how straightforward it can be to load relational data into Neo4j using the apoc.load.jdbc stored procedure. As a matter of fact, the same applies to other types of data sources, for example web services, as I described in previous posts.

The second takeaway is how modelling and storing data that is naturally a graph (sequences of page visits) as a graph, as opposed to shoehorning it into relational tables, opens up a lot of opportunities for querying and exploration that would be unthinkable in SQL.

Finally I’ve also shown how some graph algorithms (betweenness centrality) can be applied easily to your graph using stored procedures in Cypher. Worth mentioning that you can extend the list of available ones by writing your own and easily deploying it on your Neo4j instance.

The ‘hidden’ connections in Google’s Knowledge Graph

As far as I know, the only way to query Google’s Knowledge Graph currently is the search API. Let’s run a query on it, search for instance for Miles Davis’ album “Sketches of Spain”.

https://kgsearch.googleapis.com/v1/entities:search?query=sketches%20of%20spain&key=<your_key_here>&limit=1

The API returns this JSON-LD fragment back (thanks, Jos de Jong for the great JSON Editor Online):

[Screenshot: JSON-LD fragment returned by the Knowledge Graph search API]

Strip out the wrapping entities and each search result returned is just a node from the Knowledge Graph for which we get the id, type (category), name and description. Additionally, you may get your node linked to a Wikipedia page that provides a detailed description of the entity. That’s what the red box highlights in the previous fragment. Visually, what we get is something like this:

[Diagram: the entity returned by the search]

This is nice, because your text search is returning an entity in Google's knowledge graph, and it's structured data… yes, but there's something missing. I don't think I'd be exaggerating if I said that the most important bit is missing: the context, the connections, the other bits of the graph that this entity relates to. Let me explain what I mean: if I run the same search in a browser I get a much richer result from the Knowledge Graph:

[Screenshot: Knowledge Graph panel in Google search results]

The dashed red box shows what the search API currently returns, and the bits connected with the arrows are the context that I’m talking about. The author of the album, the producers, the awards received, the genre… The data is obviously in the graph and JSON-LD’s capabilities for expressing rich linked data are crying to be used. If that was not enough, the relationships are already defined in schema.org so it looks like we have all we need. Actually, Google! you have all you need 🙂

Right, so based on this, what would a (WAY) richer result look like? Look at the little blue box that I added to the original query output:

[Screenshot: the JSON-LD fragment extended with linked entities]

Or probably for a more intuitive representation, look at the graph that this new JSON-LD fragment represents:

[Diagram: graph represented by the enriched JSON-LD fragment]

Wouldn’t it be cool? And not only cool but also extremely useful? Let me know your thoughts.

And yes, for those of you who may be wondering where I got the IRIs of the extra nodes and whether they are real or made up: I ran separate queries on the search API for each of the related entities and stuck it all together manually, so they are valid IRIs, just retrieved separately.

One final comment: If you’re interested in publishing/sharing connected data (graph data) as JSON-LD straight from your Neo4j Graph Database, have a look at this repo.

 

 

 

QuickGraph#2 How is Wikipedia’s knowledge organised

The dataset

For this QuickGraph I'll use data about Wikipedia categories. You may have noticed at the bottom of every Wikipedia article a section listing the categories it's classified under. Every Wikipedia article will have at least one category, and categories branch into subcategories forming overlapping trees. It is possible for a category to be a subcategory of more than one parent category (and the Wikipedia hierarchy is an example of this), so the hierarchy is effectively a graph.

I guess an example will be helpful at this point: if you open the Wikipedia page on the mathematician Leonhard Euler you will find at the bottom of it the following list of categories:

[Screenshot: categories listed at the bottom of the Leonhard Euler article]

If you click on any of them, for instance ‘Swiss mathematicians‘ you will be able to navigate up and down the hierarchy of categories from the selected one. I’ll go a couple of steps ahead here and show the final product before explaining how to build it: The structure you are navigating is a graph and looks something like this:

[Screenshot: category hierarchy around 'Swiss mathematicians']

In the diagram, the colour coding of the Category nodes indicates the (shortest) distance to the root category, so we have Level 1 (green) to Level 4 (pink) categories. We can see that the 'Swiss mathematicians' category can be reached following many different paths from the root. A couple of examples would be:

People > People by occupation> People in STEM fields > Mathematicians > Mathematicians by nationality > Swiss mathematicians 

or

Science > Academic disciplines> STEM > Mathematics > Mathematicians > Mathematicians by nationality > Swiss mathematicians 

So this is the graph that we are going to build and explore/query in Neo4j in this blog post.

The data source

Wikipedia data can be accessed through the MediaWiki action API, and the specific request that returns information for a given category is described here and looks as follows:

https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:Fundamental_categories&cmprop=ids|title&cmlimit=500

The request returns a simple JSON structure containing all subcategories of the one passed as parameter (cmtitle). Here is an example of the web service output:

[Screenshot: JSON returned by the categorymembers request]

Loading the data into Neo4j

Since the Wikipedia categorymembers web service takes the name of a category as an input parameter, we'll have to seed the graph with a root node in order to start loading categories. This node will be the starting point for our data load. For this example we will use the category "Main topic classifications", which groups Wikipedia's major topic classifications.

CREATE (c:Category:RootCategory {catId: '0000000', catName: 'Main_topic_classifications', fetched : false})

An index on category ids will help speed up the merges, so let's add it.

CREATE INDEX ON :Category(catId)

Now, from the top-level category, we'll iteratively call the categorymembers web service and process the JSON fragment returned to create nodes and relationships in the graph. We will do this in Neo4j using the apoc.load.json procedure in the APOC library. Here is the Cypher script that does the job. It first invokes the web service with the name of the category, and then processes the JSON that the web service returns by creating new instances of Category nodes connected to their parent through :SUBCAT_OF relationships.

MATCH (c:Category { fetched: false}) 
call apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:' + replace(c.catName,' ', '%20') + '&cmprop=ids|title&cmlimit=500') 
YIELD value as results 
UNWIND results.query.categorymembers AS subcat 
MERGE (sc:Category:Level1Category {catId: subcat.pageid}) 
ON CREATE SET sc.catName = substring(subcat.title,9), 
              sc.fetched = false
MERGE (sc)-[:SUBCAT_OF]->(c)
WITH DISTINCT c
SET c.fetched = true

Each iteration of the loader adds a new level to the category hierarchy by picking up all 'unexpanded' nodes and fetching for each of them all their subcategories (note that the Level1Category label in the MERGE will need to change on each run, Level2Category for the second iteration and so on, to produce the level labels used later). Nodes are marked on creation with the fetched: false property, and once they have been expanded (subcategories retrieved and added to the graph) the property fetched is set to true.
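Between iterations, a quick way to check how much work is left is to count the categories still pending expansion:

MATCH (c:Category {fetched: false})
RETURN count(c) AS categoriesPendingExpansion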

So the first iteration of the loader, run right after creating the top-level category, instantiates 13 new categories.

[Screenshot: loader output after the first iteration]

Which are obviously the 13 main topic classifications in Wikipedia.

[Screenshot: the 13 main topic classifications on Wikipedia]

[Screenshot: the corresponding Category nodes in the Neo4j browser]

The second iteration of the loader picks up each one of these 13 and instantiates their subcategories, producing 423 new next-level nodes.

[Screenshot: loader output after the second iteration]

A third iteration brings in 5815 new nodes, and a fourth one takes the final number close to 50K categories by adding 42840 more. It's important to keep in mind that every category expansion involves a request being sent to the Wikipedia API, so I would not recommend this approach for hierarchies more than 4 levels deep. If you're interested in loading the whole set of categories, probably the best option would be to download them as an RDB dump (the files enwiki-latest-category.sql.gz and enwiki-latest-categorylinks.sql.gz from the downloads page) and then use the superfast batch importer to create your graph in Neo4j. Maybe an idea for a future graphgist?

Ok, we now have a graph of some 50K categories from Wikipedia organised hierarchically via :SUBCAT_OF relationships, so let's try to run some interesting queries on them. Here's how a few thousand categories (up to level three) look in the Neo4j browser: the root node is the red one towards the centre of the image, Level 1 categories are the green ones, followed by Level 2 in blue and Level 3 in yellow.

 

[Screenshot: categories up to level three in the Neo4j browser]

An alternative graph based on a smaller number of initial thematic classifications can be built using “Fundamental_categories” as root. In Wikipedia’s terms, this alternative classification “is intended to contain all and only the few most fundamental ontological categories which can reasonably be expected to contain every possible Wikipedia article under their category trees”. I leave this for the interested reader to try.

Querying the graph

What is the shortest full hierarchy for a given category?

Let's use the 'Swiss mathematicians' category from the introduction for this example.

MATCH shortestFullHierarchy = shortestPath((swissMatem:Category {catName : 'Swiss mathematicians'})-[:SUBCAT_OF*]->(root:RootCategory)) 
RETURN shortestFullHierarchy

The shortestPath function returns the expected depth four hierarchy:

[Screenshot: the depth-four hierarchy]

There will be cases where more than one shortest path exists, and these can be picked up with the allShortestPaths variant of the shortestPath function:

MATCH shortestFullHierarchies = allShortestPaths((phil:Category {catName : 'Philosophers by period'})-[:SUBCAT_OF*]->(root:RootCategory)) 
RETURN shortestFullHierarchies

The query returns all shortest full hierarchies (all with a max depth of 4) for the 'Philosophers by period' category.

[Screenshot: the shortest full hierarchies]

What are the richest categories in the top four levels of Wikipedia?

What is a rich category? Well, this is a QuickGraph, so I'll come up with my own quick (although, I hope, still minimally reasoned) definition. Let's keep in mind that I'm not trying to build a theory of hierarchy relevance but rather to show how the graph can be easily queried in interesting ways.

Right, so a rich category is one that has lots of subcategories, because that means it's complex enough to be broken down to such a level of detail. Similarly, a rich category is also one that can be categorised from different perspectives, or in other words, one that has multiple parent categories. But we will want to discard unbalanced ones (many subcategories but very few super-categories, or vice versa) to avoid enumeration-style categories like "Religion by country", "Culture by nationality" or "People by ethnic or national descent". With these three criteria we can build a graph query that returns our top ten. Here is the Cypher for our richest categories:

MATCH (cat:Category)
WITH cat, 
 size((cat)-[:SUBCAT_OF]->()) as superCatDegree, 
 size(()-[:SUBCAT_OF]->(cat)) as subCatDegree
WHERE ABS(toFloat(superCatDegree - subCatDegree)/(superCatDegree + subCatDegree)) < 0.4
RETURN cat.catName, (superCatDegree + subCatDegree)/2 AS richness 
ORDER BY richness DESC 
LIMIT 10

And the results are quite interesting…

 

[Screenshot: the top ten richest categories]

If we now want to explore the Marxism category visually…

[Screenshot: the Marxism category]

Unexpectedly long hierarchies!

One thing that caught my eye from the point of view of the structure of the graph was the fact that even though I had only loaded 4 levels, I was able to find much longer chains of parent-child categories. This is because not all :SUBCAT_OF relationships go from level n to level n-1. That is actually what makes the structure a graph and not a tree. The following query returns :SUBCAT_OF chains of length 12 or more ( path =()-[r:SUBCAT_OF*12..]->() ), but checks that every node in the chain is at a maximum distance of 4 from the root ( shortestPath((node)-[:SUBCAT_OF*..4]->(root:Category {catId:'0000000'})) ).

MATCH path =()-[r:SUBCAT_OF*12..]->() WITH path LIMIT 1
UNWIND nodes(path) AS node
MATCH shortest = shortestPath((node)-[:SUBCAT_OF*..4]->(root:Category {catId:'0000000'}))
RETURN path, shortest

And the first result shows that, indeed, even though there are chains of length 12 (and longer than that, actually), all the nodes in the chain are never more than 4 hops away from the root node. This is consistent with the fact that we've only loaded 4 levels of the category hierarchy.

[Screenshot: the long chain and the shortest paths to the root]

 

Loops!

Yes, there are some. So if one day you decide to read the whole Wikipedia (as you do) and you choose to do it by category, then be careful not to enter into an infinite loop of knowledge 🙂

Loops are easy to pick up with this simple Cypher query:

MATCH loop = (cat)-[r:SUBCAT_OF*]->(cat) 
RETURN loop LIMIT 1

The capture on the left shows a loop and the one on the right shows the same loop but keeping the hierarchical order. Green nodes are level 1 categories, blue ones are level 2 and yellow ones represent level 3 categories. The interesting thing is that Geography, even though it is level 1, is also a subcategory of Earth Sciences, which is level 3. The same thing happens with Nature, which is both level 1 and a subcategory of Environmental social science concepts, which is level 3. These two extra relationships create the loop that produces the following chain of :SUBCAT_OF relationships:

MATCH loop = (cat)-[r:SUBCAT_OF*]->(cat) 
RETURN reduce(s = "", x IN nodes(loop) | s + ' > '+ x.catName) AS categoryChain limit 1

[Screenshot: the category chain in the loop]

You can also try to find longer loops by just specifying the minimum length of the path

loop = (cat)-[r:SUBCAT_OF*12..]->(cat)
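For instance, combining that pattern with the reduce expression used above, a complete query along the same lines would be:

MATCH loop = (cat)-[r:SUBCAT_OF*12..]->(cat)
RETURN reduce(s = "", x IN nodes(loop) | s + ' > ' + x.catName) AS categoryChain
LIMIT 1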

And get chains like:

[Screenshot: a longer loop]

What’s interesting about this QuickGraph?

The graph is built directly by querying a web service from the MediaWiki action API. The apoc.load.json procedure in the APOC library can invoke this service and ingest the JSON structure returned, all within your Cypher script. Find out how to install and use (and extend!) the APOC library of stored procedures for Neo4j.

Rich category hierarchies are graphs, and an extremely useful tool for scenarios like recommendation systems or graph-enhanced search.

 

Building a semantic graph in Neo4j

There are two key characteristics of RDF stores (aka triple stores): the first and by far the most relevant is that they represent, store and query data as a graph. The second is that they are semantic, which is a rather pompous way of saying that they can store not only data but also explicit descriptions of the meaning of that data. The RDF and linked data community often refer to these explicit descriptions as ontologies. In case you’re not familiar with the concept, an ontology is a machine readable description of a domain that typically includes a vocabulary of terms and some specification of how these terms inter-relate, imposing a structure on the data for such domain. This is also known as a schema. In this post both terms schema and ontology will be used interchangeably to refer to these explicitly described semantics.

Making the semantics of your data explicit in an ontology will enable data and/or knowledge exchange and interoperability, which will be useful in some situations. In other scenarios you may want to use your ontology to run generic inferencing on your data to derive new facts from existing ones. Another similar use of explicit semantics would be to run domain specific consistency checks on the data, which is the specific use that I'll base my examples on in this post. The list of possible uses goes on, but these are some of the most common ones.

Right, so let's clarify what I mean by domain specific consistency, because it is quite different from integrity constraints in the relational style. I'm actually talking about the definition of rules such as "if a data point is connected to a Movie through the ACTED_IN relationship then it's an Actor" or "if a User is connected to a BlogEntry through a POSTED relationship then it's an ActiveUser". We'll look first at how to express these kinds of consistency checks and then I'll show how they can be applied in three different ways:

  • Controlling data insertion into the database at a transaction level by checking before committing that the state in which a given change leaves the database would still be consistent with the semantics in the schema, and rolling back the transaction if not.
  • Using the ontology to drive a generic front end guaranteeing that only data that keeps the graph consistent is added through such front end.
  • Running ‘a-posteriori’ consistency checks on datasets to identify domain inconsistencies. While the previous two could be considered preventive, this approach would be corrective. An example of this could be when you build a data set (a graph in our case) by bringing data from different sources and you want to check if they link together in a consistent way according to some predefined schema semantics.

My main interest in this post is to explore how the notion of a semantic store can be transposed to an (at least in theory) non-semantic graph database like Neo4j. This will require a simple way to store schema information along with ordinary data so that both can be queried together seamlessly. This is not a problem with Neo4j+Cypher, as you'll see if you keep reading. We will also need a formalism, a vocabulary, to express semantics in Neo4j; to that end, I'll pick some of the primitives in OWL and RDFS to come up with a basic ontology definition language, as you will see in the next section. I will show how to use it by modelling a simple domain and prototyping ways of checking the consistency of the data in your graph according to the defined schema.

Defining an RDFS/OWL style ontology in Neo4j

The W3C recommends the use of RDFS and OWL for the definition of ontologies. For my example, I will use a language inspired by these two, in which I’ll borrow some elements like owl:Class, owl:ObjectProperty and owl:DatatypeProperty, rdfs:domain and rdfs:range, etc. If you’ve worked with any of these before, what follows will definitely look familiar.

Let's start with a brief explanation of the main primitives in our basic schema definition language. A Class defines a category, which in Neo4j exists as a node label. A DatatypeProperty describes an attribute in Neo4j (each of the individual key-value pairs on nodes or relationships) and an ObjectProperty describes a relationship in Neo4j.

The domain and range properties in an ontology are used to further describe ObjectProperty and DatatypeProperty definitions. In the case of an ObjectProperty, the domain and range specify the source and the target classes of the nodes connected by instances of the ObjectProperty. Similarly, in the case of a DatatypeProperty, the domain specifies the class of nodes holding values for that property and the range, when present, can be used to specify the XMLSchema datatype of the property values. Note that because the property graph model accepts properties on relationships, the domain of a DatatypeProperty can be an ObjectProperty, which is not valid in the RDF world (and actually not even representable without quite complicated workarounds, but that is another discussion).

If you have not been exposed to the RDF model before, this can all sound a bit too 'meta'. If you have time (a lot), you can find a more detailed description of the different elements in the OWL language reference, or probably better, just keep reading: it's easier than it may initially seem and the examples will most probably help clarify things.

All elements in the schema are identified by URIs following the RDF style, even though that is not strictly needed for our experiment. What is definitely needed is a reference to how each element in the ontology is actually used in the Neo4j data; this name is stored in the label property. Finally, there is a comment with a human-friendly description of the element in question.

If you've downloaded Neo4j you may be familiar by now with the movie database. You can load it in your Neo4j instance by running :play movies in your browser. I'll use this database for my experiment; here is a fragment of the ontology I've built for it. You can see the Person class, the name DatatypeProperty and the ACTED_IN ObjectProperty defined as described in the previous paragraphs.

// A Class definition (a node label in Neo4j)
(person_class:Class {	uri:'http://neo4j.com/voc/movies#Person',
			label:'Person',
			comment:'Individual involved in the film industry'})

// A DatatypeProperty definition (a property in Neo4j) 
(name_dtp:DatatypeProperty {	uri:'http://neo4j.com/voc/movies#name',
				label:'name',
				comment :'A person\'s name'}),
(name_dtp)-[:DOMAIN]->(person_class)

// An ObjectProperty definition (a relationship in Neo4j) 
(actedin_op:ObjectProperty { 	uri:'http://neo4j.com/voc/movies#ACTED_IN',
				label:'ACTED_IN',
				comment:'Actor had a role in film'}),
(person_class)<-[:DOMAIN]-(actedin_op)-[:RANGE]->(movie_class)

The whole code of the ontology can be found in the github repository jbarrasa/explicitsemanticsneo4j. Grab it and run it on your Neo4j instance and you should be able to visualise it in the Neo4j browser. It should look something like this:

[Screenshot: the movies ontology in the Neo4j browser]

You may want to write your ontology from scratch in Cypher as I’ve just done but it is also possible that if you have some existing OWL or RDFS ontologies (here is an OWL version of the same movies ontology) you will want a generic way of translating them into Neo4j ontologies like the previous one, and that’s exactly what this Neo4j stored procedure does. So an alternative to running the previous Cypher could be to deploy and run this stored procedure. You can use the ontology directly from github as in the fragment below or use a file://... URI if you have the ontology in your local drive.

CALL semantics.LiteOntoImport('https://raw.githubusercontent.com/jbarrasa/explicitsemanticsneo4j/master/moviesontology.owl','RDF/XML')
+=================+==============+=========+
|terminationStatus|elementsLoaded|extraInfo|
+=================+==============+=========+
|OK               |16            |         |
+-----------------+--------------+---------+

Checking data consistency with Cypher

So I have now a language to describe basic ontologies in Neo4j but if I want to use it in any interesting way (other than visualising it colorfully as we’ve seen), I will need to implement mechanisms to exploit schemas defined using this language. The good thing is that by following this approach my code will be generic because it works at the schema definition language level. Let’s see what that means exactly with an example. A rule that checks the consistent usage of relationships in a graph could be written in Cypher as follows:

// ObjectProperty domain semantics check
MATCH (n:Class)<-[:DOMAIN]-(p:ObjectProperty) 
WITH n.uri as class, n.label as classLabel, p.uri as prop, p.label as propLabel 
MATCH (x)-[r]->() WHERE type(r)=propLabel AND NOT classLabel in Labels(x) 
RETURN id(x) AS nodeUID, 'domain of ' + propLabel + ' [' + prop + ']' AS `check failed`, 
'Node labels: (' + reduce(s = '', l IN Labels(x) | s + ' ' + l) + ') should include ' + classLabel AS extraInfo

This query scans the data in your DB looking for illegal usage of relationships according to your ontology. If in your ontology you state that ACTED_IN is a relationship between a Person and a Movie, then this rule will pick up situations where this is not true. Let me try to describe very briefly the semantics it implements. Since our schema definition language is inspired by RDFS and OWL, it makes sense to follow their standard semantics. Our Cypher works on the :DOMAIN relationship. The :DOMAIN relationship, when defined between an ObjectProperty p and a Class c, states that any node in a dataset that has a value for that particular property p is an instance of the class c. So for example when I state (actedin_op)-[:DOMAIN]->(person_class), I mean that Persons are the subject of the ACTED_IN predicate, or in other words, if a node is connected to another node through the ACTED_IN relationship, then it should be labeled as :Person because only a person can act in a movie.
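If you want to see the rule fire, you can create an offending node on purpose and re-run it (the :Robot label and the node below are made up purely for illustration; remember to delete them afterwards):

// this node violates the domain of ACTED_IN: it uses the relationship without carrying the Person label
CREATE (:Robot {name: 'HAL 9000'})-[:ACTED_IN]->(:Movie {title: '2001: A Space Odyssey'})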

Ok, so back to my point on generic code. This consistency check in Cypher is completely domain agnostic. There is no mention of Person, or Movie or ACTED_IN…  it only uses the primitives in the ontology definition language (DOMAIN, DatatypeProperty, Class, etc.). This means that as long as a schema is defined in terms of these primitives this rule will pick up eventual inconsistencies in a Neo4j graph. It’s kind of a meta-rule.

Ok, so I’ve implemented a couple more meta-rules for consistency checking to play with in this example but I leave it to you, interested reader, to experiment and extend the set and/or tune it to your specific needs.
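As one example of such an extension, here is a sketch of what a :RANGE counterpart of the domain meta-rule could look like; it mirrors the check above but inspects the node at the target end of the relationship (the ontology above already defines :RANGE for ACTED_IN, so no new primitives are needed):

// ObjectProperty range semantics check (sketch)
MATCH (n:Class)<-[:RANGE]-(p:ObjectProperty) 
WITH n.uri as class, n.label as classLabel, p.uri as prop, p.label as propLabel 
MATCH ()-[r]->(x) WHERE type(r)=propLabel AND NOT classLabel in Labels(x) 
RETURN id(x) AS nodeUID, 'range of ' + propLabel + ' [' + prop + ']' AS `check failed`, 
'Node labels: (' + reduce(s = '', l IN Labels(x) | s + ' ' + l) + ') should include ' + classLabel AS extraInfo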

Also probably worth mentioning that, from the point of view of performance, meta-rule queries like the ones above would rank in the top 3 most horribly expensive queries ever: they scan all relationships on all nodes in the graph. I won't care much about that here; I'd just say that if you were to implement something like this to work on large graphs, you would most likely write some server-side Java code, probably something like this stored procedure, but that's another story.
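If what you need is to sanity-check a single relationship type rather than the whole graph, a much cheaper variant (a sketch, using ACTED_IN from the movies ontology as an example) is to fix the type in the pattern so only relationships of that type are scanned:

// domain check restricted to one ObjectProperty, here ACTED_IN
MATCH (c:Class)<-[:DOMAIN]-(p:ObjectProperty {label: 'ACTED_IN'})
WITH c.label AS classLabel
MATCH (x)-[:ACTED_IN]->()
WHERE NOT classLabel IN labels(x)
RETURN count(DISTINCT x) AS offendingNodes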

Here is another Cypher meta-rule, very similar to the ObjectProperty domain check above, except that it applies to relationship attributes instead. The semantics of the :DOMAIN primitive are the same whether it is defined on DatatypeProperties (describing Neo4j attributes) or on ObjectProperties (describing relationships), so if you got the idea in the previous example this will be basically the same.

// DatatypeProperties on ObjectProperty domain semantics meta-rule (property graph specific. attributes on relationships) 
MATCH (r:ObjectProperty)<-[:DOMAIN]-(p:DatatypeProperty) 
WITH r.uri as rel, r.label as relLabel, p.uri as prop, p.label as propLabel 
MATCH ()-[r]->() WHERE r[propLabel] IS NOT NULL AND relLabel<>type(r) 
RETURN id(r) AS relUID, 'domain of ' + propLabel + ' [' + prop + ']' AS `check failed`, 'Rel type: ' + type(r) + ' but should be ' + relLabel AS extraInfo 

If I run this query on the movie database I should get no results because the data in the graph is consistent with the ontology: only persons act in movies, only movies have titles, only persons have a date of birth, and so on. But we can disturb this boring peace by setting a property on a relationship like this:

MATCH (:Person {name:'Demi Moore'})-[r:ACTED_IN]->(:Movie {title:'A Few Good Men'}) 
SET r.rating = 88

This is wrong: we are setting a value for the rating attribute on a relationship where a rating does not belong. If we now re-run the previous consistency check query, it should produce some results:

+======+=====================================================+=========================================+
|relUID|check failed                                         |extraInfo                                |
+======+=====================================================+=========================================+
|64560 |domain of rating [http://neo4j.com/voc/movies#rating]|Rel type: ACTED_IN but should be REVIEWED|
+------+-----------------------------------------------------+-----------------------------------------+

Our ontology stated that the rating property is exclusive to the :REVIEWED relationship, so our data is now inconsistent with it. I can fix the problem by unsetting the property with the following Cypher and get the graph back to a consistent state:

MATCH (:Person {name:'Demi Moore'})-[r:ACTED_IN]->(:Movie {title:'A Few Good Men'}) 
REMOVE r.rating

Right, so I could extend the meta-model primitives by adding things like hierarchical relationships between classes (like RDFS's SubClassOf) or even more advanced ones like OWL's Restriction elements or disjointness relationships, but my objective today was to introduce the concept, not to define a full ontology language for Neo4j. In general, the choice of the schema language will depend on the level of expressivity that one needs in the schema. In the RDF world you would decide whether to use RDFS if your model is simple, or OWL if you want to express more complex semantics. The cost, of course, is more expensive processing, both for reasoning and for consistency checking. Similarly, if you go down the DIY approach that we are following here, keep in mind that for every primitive added to your language, the corresponding meta-rules, stored procedures or any alternative implementation of its semantics will be required too.
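Just to give a flavour of what that would involve, here is a rough sketch of a subclass-aware version of the domain meta-rule. It assumes a hypothetical :SCO relationship between Class nodes (not part of the ontology defined above) stating that one class is a subclass of another:

// hypothetical: a domain check that accepts the domain class or any of its subclasses
MATCH (c:Class)<-[:DOMAIN]-(p:ObjectProperty)
OPTIONAL MATCH (sub:Class)-[:SCO*]->(c)
WITH p.label AS propLabel, collect(DISTINCT c.label) + collect(DISTINCT sub.label) AS validLabels
MATCH (x)-[r]->() WHERE type(r) = propLabel AND none(l IN labels(x) WHERE l IN validLabels)
RETURN id(x) AS nodeUID, 'domain of ' + propLabel AS `check failed`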

The use of meta-data definition languages with well defined semantics is pretty powerful, as we've seen, because it enables the construction of general purpose engines based only on the language primitives, which are reusable across domains. Examples of this idea are the set of Cypher rules and the stored procedure linked above. You can try to reuse them on other Neo4j databases just by defining an ontology in the language we've used here. Finally, by combining this approach with the fact that the schema definition itself is stored just as ordinary data, we get a pretty dynamic setup: storing Classes, ObjectProperties, etc. as nodes and relationships in Neo4j means that they may evolve over time and that there are no precompiled rules or static/hardcoded logic to detect consistency violations.

This is precisely the kind of approach that RDF stores follow. In addition to storing data as RDF triples and offering query capabilities for it, if you make your model ontology explicit by using RDFS or OWL, you get out of the box inferencing and consistency checking.

Consistency checking at transaction level

So far we've used our ontology to drive the execution of meta-rules that check the consistency of a data set after some change was made (an attribute was added to a relationship in our example). That is one possible way of doing things (an a-posteriori check) but I may want to use the explicit semantics I've added to the graph in a more real-time scenario. As I described before, I may want transactions to be committed only if they leave our graph in a consistent state, and that's what these few lines of Python do.

The logic of the consistency check is on the client-side, which may look a bit strange but my intention was to make the whole thing as explicit and clear as I possibly could. The check consists of running the whole set of individual meta-rules defined in the previous sections one by one and breaking if any of them picks up an inconsistency of any type. The code requires the meta-rules to be available in the Neo4j server and the way I’ve done it is by storing each of them individually as ConsistencyCheck nodes with the Cypher code stored as a property. Something like this:

CREATE (ic:ConsistencyCheck { ccid:1, ccname: 'DTP_DOMAIN', 
cccypher: 'MATCH (n:Class)<-[:DOMAIN]-(p:Datatyp ... '})

The code of the Cypher meta-rule in the cccypher property has been truncated in this snippet but you can view the whole lot in github. Now the transaction consistency checker can grab all the meta-rules and cache them (line 25 of the python code) with this simple query:

MATCH (cc:ConsistencyCheck)
RETURN cc.ccid AS ccid, cc.ccname AS ccname, cc.cccypher AS cccypher

that returns the batch of individual meta-rules to be run. Something like this:

[Screenshot: the ConsistencyCheck meta-rules returned by the query]

The code can be tested with different Cypher fragments. If I try to populate the graph with data that is consistent with the schema then things will go smoothly:

$ python transactional.py " CREATE (:Person { name: 'Marlon Brando'})-[:ACTED_IN]->(:Movie{ title: 'The Godfather'}) "
Consistency Checks passed. Transaction committed

Now if I try to update the graph in inconsistent ways, the meta-rules should pick this up. Let’s try to insert a node labeled as :Thing with an attribute that’s meant to be used by Person nodes:

$ python transactional.py " CREATE (:Thing { name: 'Marlon Brando'}) "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                               | extraInfo                                  
---+---------+------------------------------------------------------------+---------------------------------------------
 1 |    7231 | domain of property name [http://neo4j.com/voc/movies#name] | Node labels: ( Thing) should include Person

Or link an existing actor (:Person) to something that is not a movie through the :ACTED_IN relationship:

$ python transactional.py " MATCH (mb:Person { name: 'Marlon Brando'}) CREATE (mb)-[:ACTED_IN]->(:Play { playTitle: 'The mousetrap'}) "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                              | extraInfo                                
---+---------+-----------------------------------------------------------+-------------------------------------------
 1 |    7241 | domain of ACTED_IN [http://neo4j.com/voc/movies#ACTED_IN] | Node labels: ( Play) should include Movie

I can try also to update the labels of a few nodes, causing a bunch of inconsistencies:

$ python transactional.py " MATCH (n:Person { name: 'Kevin Bacon'})-[:ACTED_IN]->(bm) REMOVE bm:Movie SET bm:BaconMovie  "
Consistency Checks failed. Transaction rolled back
   | nodeUID | check failed                                                       | extraInfo                                      
---+---------+--------------------------------------------------------------------+-------------------------------------------------
 1 |    7046 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 2 |    7166 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 3 |    7173 | domain of property tagline [http://neo4j.com/voc/movies#tagline]   | Node labels: ( BaconMovie) should include Movie
 4 |    7046 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 5 |    7166 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 6 |    7173 | domain of property title [http://neo4j.com/voc/movies#title]       | Node labels: ( BaconMovie) should include Movie
 7 |    7046 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
 8 |    7166 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie
 9 |    7173 | domain of property released [http://neo4j.com/voc/movies#released] | Node labels: ( BaconMovie) should include Movie

Again a word of warning about the implementation. My intention was to make the example didactic and easy to understand but of course in a real world scenario you would probably go with a different implementation, most certainly involving some server-side code like this stored proc. An option would be to write the logic of running the consistency checks and either committing or rolling back the transaction as an unmanaged extension that you can expose as a ‘consistency-protected’ Cypher HTTP endpoint and invoke it from your client.

Consistency check via dynamic UI generation

Another way of guaranteeing the consistency of the data in your graph when it is manually populated through a front-end is to build a generic UI driven by your model ontology. Let's say, for example, that you need to generate a web form to add new movies. Well, you can retrieve both the structure of your form and the data insertion query with this bit of Cypher:

MATCH (c:Class { label: {className}})<-[:DOMAIN]-(prop:DatatypeProperty) 
RETURN { cname: c.label, catts: collect( {att:prop.label})} as object, 
 'MERGE (:' + c.label + ' { uri:{uri}' + reduce(s = "", x IN collect(prop.label) | s + ',' +x + ':{' + x + '}' ) +'})' as cypher

Again, a generic (domain agnostic) query that works on your schema. The query takes a parameter ‘className’. When set to ‘Movie’, you get in the response all the info you need to generate your movie insertion UI. Here’s a fragment of the json structure returned by Neo4j.

 "row": [
            {
              "cname": "Movie",
              "catts": [
                {
                  "att": "tagline"
                },
                {
                  "att": "title"
                },
                {
                  "att": "released"
                }
              ]
            },
            "MERGE (:Movie { uri:{uri},tagline:{tagline},title:{title},released:{released}})"
          ]
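If you want to run the parameterised query above from the Neo4j browser, you can set the parameter first with the :param command (the exact parameter syntax may vary slightly across browser versions) and then execute the query:

:param className => 'Movie'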

Conclusion

With this simple example, I've tried to prove that it's relatively straightforward to model and store explicit domain semantics in Neo4j, making it effectively a semantic graph. I too often hear or read that 'being semantic' is a key differentiator between RDF stores and other graph databases. By explaining exactly what it means to 'be semantic' and experimenting with how this concept can be transposed to a property graph like Neo4j, I've tried to show that such a difference is not an essential one.

I’ve focused on using the explicit semantics to run consistency checks in a couple of different ways but the concept could be extended to other uses such as automatically inferring new facts from existing ones, for instance.