QuickGraph#6 Building the Wikipedia Knowledge Graph in Neo4j (QG#2 revisited)

After last week’s Neo4j online meetup, I thought I’d revisit QuickGraph#2 and update it a bit to include a couple of new things:

  • How to load not only categories but also pages (as in Wikipedia articles) and enrich the graph by querying DBpedia. In doing this I’ll describe some advanced usage of APOC procedures.
  • How to batch load the whole Wikipedia hierarchy of categories into Neo4j

Everything I explain here will also go into an interactive guide that you can easily run from your Neo4j instance. Or why not give it a try in the Neo4j Sandbox?

All you have to do is run this on your Neo4j browser:

:play https://guides.neo4j.com/wiki

For a description of the Wikipedia data and the MediaWiki API, check QuickGraph#2.

Loading the data into Neo4j

First, let’s prepare the DB with a few indexes to accelerate the ingestion and querying of the data:

CREATE INDEX ON :Category(catId)
CREATE INDEX ON :Category(catName)
CREATE INDEX ON :Page(pageTitle)

Approach 1: Loading a reduced subset incrementally through the MediaWiki API

This approach uses the MediaWiki API and is adequate if all you want is a portion of the category hierarchy around a particular topic. Let’s say we want to create the Wikipedia Knowledge Graph about Databases.

The first thing we’ll do is create the root category: Databases.

CREATE (c:Category:RootCategory {catId: 0, catName: 'Databases', subcatsFetched : false, pagesFetched : false, level: 0 })

Now we’ll iteratively load the next level of subcategories to a depth of our choice. I’ve chosen to expand categories from level 0 to level 3, which takes us four levels down from the root.

UNWIND range(0,3) as level 
CALL apoc.cypher.doit("
MATCH (c:Category { subcatsFetched: false, level: $level})
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:' + apoc.text.urlencode(c.catName) + '&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value as results
UNWIND results.query.categorymembers AS subcat
MERGE (sc:Category {catId: subcat.pageid})
ON CREATE SET sc.catName = substring(subcat.title,9),
 sc.subcatsFetched = false,
 sc.pagesFetched = false,
 sc.level = $level + 1
WITH sc,c
CALL apoc.create.addLabels(sc,['Level' + ($level + 1) + 'Category']) YIELD node
MERGE (sc)-[:SUBCAT_OF]->(c)
WITH DISTINCT c
SET c.subcatsFetched = true", { level: level }) YIELD value
RETURN value

Once we have the categories, we can load the pages in a similar way:

UNWIND range(0,4) as level 
CALL apoc.cypher.doit("
MATCH (c:Category { pagesFetched: false, level: $level })
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=page&cmtitle=Category:' + apoc.text.urlencode(c.catName) + '&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value as results
UNWIND results.query.categorymembers AS page
MERGE (p:Page {pageId: page.pageid})
ON CREATE SET p.pageTitle = page.title, p.pageUrl = 'http://en.wikipedia.org/wiki/' + apoc.text.urlencode(replace(page.title, ' ', '_'))
WITH p,c
MERGE (p)-[:IN_CATEGORY]->(c)
WITH DISTINCT c
SET c.pagesFetched = true", { level: level }) yield value
return value

Notice that we are only loading the id and the title for each page. This is because the MediaWiki API only exposes metadata about pages, but we can get some extra information about them from DBpedia. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.
There is a public DBpedia instance that exposes a SPARQL endpoint which we can query to get a short description of a given Wikipedia page. The Cypher fragment below embeds the SPARQL query that’s sent to the endpoint.

WITH "SELECT ?label
WHERE {
?x <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <@wikiurl@> ;
<http://dbpedia.org/ontology/abstract> ?label .
FILTER(LANG(?label) = '' || LANGMATCHES(LANG(?label), 'en')) } LIMIT 1
" AS sparqlPattern
UNWIND range(0,3) as level
CALL apoc.cypher.doit("
MATCH (c:Category { level: $level })<-[:IN_CATEGORY]-(p:Page)
WHERE NOT exists(p.abstract) 
WITH DISTINCT p, apoc.text.replace($sparqlPattern,'@wikiurl@',p.pageUrl) as runnableSparql LIMIT 100
CALL apoc.load.json('http://dbpedia.org/sparql/?query=' + apoc.text.urlencode(runnableSparql) + '&format=application%2Fsparql-results%2Bjson') YIELD value
SET p.abstract = value.results.bindings[0].label.value
", { level: level, sparqlPattern: sparqlPattern }) yield value
return value

I’ve limited it to 100 pages per level because we are generating an HTTP request to the DBpedia endpoint for each Page node in our graph. Feel free to remove this limit, but keep in mind that this can take a while.
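If you want to keep track of how much enrichment work remains (for instance before deciding to remove the LIMIT), a trivial query along these lines counts the pages that still have no abstract:

MATCH (p:Page)
WHERE NOT exists(p.abstract)
RETURN count(p) AS pagesWithoutAbstract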

Ok, so we have our Wikipedia Knowledge Graph on Databases and we can start querying it.

Querying the graph

We can list categories by the number of sub/super categories or by the number of pages. We can also create custom indexes like the balanceIndex below, which tells us how ‘balanced’ a category is (the ratio between supercategories and subcategories). Values closer to zero indicate more balanced categories; values closer to one, more unbalanced ones.

MATCH (c:Category)
WITH c.catName AS category, 
size((c)<-[:SUBCAT_OF]-()) AS subCatCount,  size((c)-[:SUBCAT_OF]->()) AS superCatCount,
size((c)<-[:IN_CATEGORY]-()) AS pageCount WHERE subCatCount > 0 AND superCatCount > 0
RETURN category, 
pageCount, 
subCatCount, 
superCatCount,
ABS(toFloat(superCatCount - subCatCount)/(superCatCount + subCatCount)) as balanceIndex
ORDER BY subCatCount DESC 
LIMIT 500

We can also aggregate these values to produce stats on our Knowledge Graph

MATCH (c:Category)
WITH c.catName AS category,
size((c)<-[:SUBCAT_OF]-()) AS subCatCount, size((c)-[:SUBCAT_OF]->()) AS superCatCount,
size((c)<-[:IN_CATEGORY]-()) AS pageCount,
size((c)-[:SUBCAT_OF]-()) AS total
RETURN AVG(subCatCount) AS `AVG #subcats`,
MIN(subCatCount) AS `MIN #subcats`,
MAX(subCatCount) AS `MAX #subcats`,
percentileCont(subCatCount,0.9) AS `.9p #subcats`,
AVG(pageCount) AS `AVG #pages`,
MIN(pageCount) AS `MIN #pages`,
MAX(pageCount) AS `MAX #pages`,
percentileCont(pageCount,0.95) AS `.95p #pages`,
AVG(superCatCount) AS `AVG #supercats`,
MIN(superCatCount) AS `MIN #supercats`,
MAX(superCatCount) AS `MAX #supercats`,
percentileCont(superCatCount,0.95) AS `.95p #supercats`

Screen Shot 2017-04-26 at 01.53.16

Approach 2: Batch loading the data with LOAD CSV from an SQL dump

There is a snapshot of the Wikipedia categories and their hierarchical relationships (as of mid-April 2017) here. It contains 1.4 million categories and 4 million hierarchical relationships. They can both be loaded into Neo4j using LOAD CSV. You can run the queries as they are, or download the files to your Neo4j instance’s import directory and use LOAD CSV FROM "file:///..." instead.

First the categories. Notice that we are loading a couple of extra properties on the Category nodes: the pageCount and the subcatCount. These numbers are precomputed in the data dump and are not always accurate.

USING PERIODIC COMMIT 10000
LOAD CSV FROM "https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/cats.csv?raw=true" as row
CREATE (c:Category { catId: row[0]}) 
SET c.catName = row[2], c.pageCount = toInt(row[3]), c.subcatCount = toInt(row[4])

And then the subcategory relationships

USING PERIODIC COMMIT 10000
LOAD CSV FROM "https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" as row
MATCH (from:Category { catId: row[0]}) 
MATCH (to:Category { catId: row[1]})
CREATE (from)-[:SUBCAT_OF]->(to)
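Since those counters come precomputed from the dump, a quick sanity check along these lines compares them with the subcategory relationships we just created; keep in mind that genuine differences are expected, since the dump counts things we haven’t loaded:

MATCH (c:Category)
WITH c, size(()-[:SUBCAT_OF]->(c)) AS actualSubcatCount
WHERE actualSubcatCount <> c.subcatCount
RETURN count(c) AS categoriesWithStaleSubcatCounts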

If you’re interested in regenerating fresh CSV files, here’s how:

  • Start by downloading the latest DB dumps from the Wikipedia downloads page.
    For the category hierarchy, you’ll only need the following tables: category, categorylinks and page.
  • Load them in your DBMS.
  • Generate the categories CSV file by running the following SQL
select p.page_id as PAGE_ID, c.cat_id as CAT_ID, cast(c.cat_title as nCHAR) as CAT_TITLE , c.cat_pages as CAT_PAGES_COUNT, c.cat_subcats as CAT_SUBCAT_COUNT
into outfile '/Users/jbarrasa/Applications/neo4j-enterprise-3.1.2/import/wiki/cats.csv' fields terminated by ',' enclosed by '"' escaped by '\\' lines terminated by '\n' 
from test.category c, test.page p
where c.cat_title = p.page_title
and p.page_namespace = 14
  • Generate the relationships file by running the following SQL
select p.page_id as FROM_PAGE_ID, p2.page_id as TO_PAGE_ID
into outfile '/Users/jbarrasa/Applications/neo4j-enterprise-3.1.2/import/wiki/rels.csv' fields terminated by ',' enclosed by '"' escaped by '\\' lines terminated by '\n' 
from test.category c, test.page p , test.categorylinks l, test.category c2, test.page p2
where l.cl_type = 'subcat'
and c.cat_title = p.page_title
and p.page_namespace = 14
and l.cl_from = p.page_id
and l.cl_to = c2.cat_title
and c2.cat_title = p2.page_title
and p2.page_namespace = 14

What’s interesting about this QuickGraph?

It showcases interesting usages of procedures like apoc.cypher.doit to run Cypher fragments within our query or apoc.load.json to interact with APIs producing JSON results.

Rich category hierarchies like the one in Wikipedia are graphs, and they are extremely useful for recommendation or graph-enhanced search. Have a look at the queries in QG#2 and the ones in the interactive guide for some ideas.

:play https://guides.neo4j.com/wiki

QuickGraph#5 Learning a taxonomy from your tagged data

The Objective

Say we have a dataset of multi-tagged items: books with multiple genres, articles with multiple topics, products with multiple categories… We want to logically organise these tags (the genres, the topics, the categories…) in a way that is descriptive but also actionable. A typical organisation will be hierarchical, like a taxonomy.

But rather than building it manually, we are going to learn it from the data in an automated way. This means that the quality of the results will totally depend on the quality and distribution of the tagging in your data, so sometimes we’ll produce a rich taxonomy but sometimes the data will only yield a set of rules describing how tags relate to each other.

Finally, we’ll want to show how this taxonomy can be used and I’ll do it with an example on content recommendation / enhanced search.

The dataset

We’ll use data from Goodreads on books and how they’ve been categorised by readers. In Goodreads, there is a notion of “shelf” which is a user created public category or tag that can be added to books and reused by other readers. Here is a page from Goodreads on a book along with the graph view of the data that we’ll extract from the page for this experiment.

Each book has a few shelves (genres) and an author, although I will not use the author information in this case.

Here is the data load script that you can try on your local Neo4j instance:

CREATE INDEX ON :Author(name)
CREATE INDEX ON :Book(id)
CREATE INDEX ON :Genre(name)
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/jbarrasa/datasets/master/goodreads/booklist.csv" AS row
MERGE (b:Book { id : row.itemUrl})
SET b.description = row.description, b.title = row.itemTitle
WITH b, row
UNWIND split(row.genres,';') AS genre
MERGE (g:Genre { name: substring(genre,8)})
MERGE (b)-[:HAS_GENRE]->(g)
WITH b, row
UNWIND split(row.author,';') AS author
MERGE (a:Author { name: author})
MERGE (b)-[:HAS_AUTHOR]->(a)

The data model is pretty simple, as we’ve seen, and it only models three types of entities: books, their authors and the genres. Here is the db.schema:

Screen Shot 2017-03-31 at 02.50.34.png

And some metrics on the tagging:

  • AVG number of genres per book: 5.4
  • MAX number of genres per book: 10
  • MIN number of genres per book: 1

You can get them by running this query:

MATCH (n:Book) 
WITH id(n) AS bookid, size((n)-[:HAS_GENRE]->()) AS genreCount
RETURN AVG(genreCount) AS avgNumGenres, MAX(genreCount) AS maxNumGenres, MIN(genreCount) AS minNumGenres

The Taxonomy Learning Algorithm

The algorithm is based on tag co-occurrence. The items in our dataset have multiple tags each, which means that tags will co-occur (will appear together) in a number of items. This algorithm will analyse the sets of items where tags co-occur and apply some pretty straightforward logic: If every item tagged as A is also tagged as B, we can derive that A implies B or in other words, the category defined by tag A is “narrower-than” the category defined by tag B. Easy, right?

Let’s look at the algorithm step by step using the simple model described before on books and genres. Books are our tagged items and the genres are the tags.

(b:Book)-[:HAS_GENRE]->(g:Genre)

STEP1: Compute co-occurrence

Co-occurrence is the basic building block for the algorithm and is in itself a quite useful relationship because it indicates some degree of overlap between tags and therefore a certain degree of similarity which is something that can be exploited for query expansion or recommendation.

The co-occurrence index between two categories A and B is computed as the portion of items in category A that are also in category B; that is, the number of items tagged as both A and B divided by the number of items tagged as A.

COOC(A,B) = #items tagged as both A and B /  #items tagged as A
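To make that concrete with some purely hypothetical numbers: if 40 books are tagged ‘space-opera’ and 36 of them are also tagged ‘science-fiction’, then COOC(space-opera, science-fiction) = 36/40 = 0.9, whereas if only 120 out of 800 ‘science-fiction’ books are also tagged ‘space-opera’, COOC(science-fiction, space-opera) = 120/800 = 0.15.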

Screen Shot 2017-03-31 at 00.56.10.png

Notice that while the co-occurrence relationship is not directional, the co-occurrence index is, so we will persist the co-occurrence of two tags in our graph as two relationships, one in each direction, each containing its co-occurrence index.

In cypher:

MATCH (g:Genre) WHERE SIZE((g)<-[:HAS_GENRE]-()) > 5 //Threshold 
WITH g, size((g)<-[:HAS_GENRE]-()) as totalCount
MATCH (g)<-[:HAS_GENRE]-(book)-[:HAS_GENRE]->(relatedGenre)
WITH g, relatedGenre, toFloat(count(book)) / totalCount AS coocIndex
CREATE (g)-[:CO_OCCURS {index: coocIndex }]->(relatedGenre)

I’ve also included in the Cypher implementation a WHERE clause (marked with the //Threshold comment) to exclude categories that contain fewer items than a given threshold. This is an optional adjustment that you may want to apply, and like this one there are a number of optimisations that can be applied to the basic co-occurrence computation to make it produce higher-quality results.

STEP2: Infer same-as relationships

Once we have co-occurrence in the graph, we want to detect equivalent genres. Two genres are equivalent if the co-occurrence index in both directions is 1 or in other words, if every item having genre g1 has also genre g2 and vice versa.

Here is the Cypher that does the job. Equivalent categories (genres) are linked through the SAME_AS relationship

MATCH (g1)-[co1:CO_OCCURS {index : 1}]->(g2),
      (g2)-[co2:CO_OCCURS { index: 1}]->(g1)
WHERE ID(g1) > ID(g2)
MERGE (g1)-[:SAME_AS]-(g2)

STEP3: Infer narrower-than relationships

Here is where we infer the hierarchical relationship between two categories (genres). Very similar to the previous rule, we check now if one of the co-occurrence indexes is 1 and the other is less than 1. Or in English, if every item having genre g1 has also genre g2 but the opposite is not true.

The Cypher that creates the NARROWER_THAN hierarchy is as follows.

MATCH (g1)-[co1:CO_OCCURS]->(g2), 
      (g2)-[co2:CO_OCCURS]->(g1)
WHERE ID(g1) > ID(g2) 
      AND co1.index = 1 AND co2.index < 1
MERGE (g1)-[:NARROWER_THAN]->(g2)

STEP4: Reduce transitive narrower-than relationships

Finally, this computation may have produced more NARROWER_THAN relationships than needed so we want to remove transitive ones. If (X)-[:NARROWER_THAN]->(Y) and (Y)-[:NARROWER_THAN]->(Z), then we may want to get rid of any (X)-[:NARROWER_THAN]->(Z) as it is kind of redundant.  But this is an optional step that you may or may not want to include.

MATCH (g1)-[:NARROWER_THAN*2..]->(g3), 
      (g1)-[d:NARROWER_THAN]->(g3)
DELETE d

So that’s it! Here is the taxonomy:

A couple of interesting examples:

  • military-science-fiction << space-opera << science-fiction
  • nutrition << health << non-fiction

It’s worth mentioning that while these hold in our dataset, as the graph grows and evolves over time we may find contradictions that invalidate parts of the taxonomy, so it will make sense to re-run the taxonomy computation as the data evolves.
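A minimal sketch of such a re-computation, assuming you simply want to throw away the derived relationships and run steps 1 to 4 again from scratch:

MATCH ()-[r:CO_OCCURS|SAME_AS|NARROWER_THAN]->()
DELETE r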

Using the taxonomy

The objective of learning how a set of tags relate was to then be able to use that knowledge in some meaningful way. The CO_OCCURS relationship is in itself a useful one, as it indicates some degree of overlap between tags and therefore a certain degree of similarity. But NARROWER_THAN has stronger semantics, so let’s see how we could use it:

One possible use would be to recommend books based on the taxonomy we’ve just learned. This first query lists available subcategories along with the number of items in them:

MATCH (b:Book) WHERE b.title = 'Pride and Prejudice'
MATCH (b)-[:HAS_GENRE]->(g)<-[:NARROWER_THAN]-(childGenre)
WHERE size((g)<-[:HAS_GENRE]-()) < 500
      AND NOT (b)-[:HAS_GENRE]->()-[:NARROWER_THAN]->(g)
     AND NOT (b)-[:HAS_GENRE]->(childGenre)
RETURN "Hello! '" + b.title + "' is tagged as '" + g.name + "', and we have " + size((childGenre)<-[:HAS_GENRE]-()) + " books on '" + childGenre.name + "' which is a narrower category. Want to have a look? " AS recommendationQuestion

When run on ‘Pride and Prejudice’ it produces this output:

Screen Shot 2017-03-31 at 12.22.46.png

Actually it does a bit more than just that. If we analyse the Cypher, we can see that we start from a selected book and, for each genre with fewer than 500 books in it (we want to exclude large ones like ‘fiction’ as they are too generic to provide relevant recommendations), we get the subcategories that the current book is not tagged with. We also stop the generation of subcategory-based recommendations if a sibling subcategory is already among the tags of the selected book. Basically, if a book is tagged as ‘sports’ and ‘tennis’, tennis being a subcategory of sports, we will not recommend other subcategories of sports like ‘hockey’ or ‘football’. Yes, all that in five lines of Cypher! Anyway, this is just one possible query that uses the hierarchy and makes sense in my dataset, but you may want to tune it to yours.

And this second query, very similar to the previous one, lists the actual items in the subcategory:

MATCH (b:Book) WHERE b.title = {title}
MATCH (b)-[:HAS_GENRE]->(g)<-[:NARROWER_THAN]-(childGenre)
WHERE size((g)<-[:HAS_GENRE]-()) < 500 AND NOT (b)-[:HAS_GENRE]->()-[:NARROWER_THAN]->(g) AND NOT (b)-[:HAS_GENRE]->(childGenre)
WITH childGenre
MATCH (booksInChildCategory)-[:HAS_GENRE]->(childGenre)
RETURN booksInChildCategory.title AS bookTitle, substring(booksInChildCategory.description,1, 70) + '...' AS description

We can run it this time on ‘Slow Bullets’ and it will produce:

Screen Shot 2017-03-31 at 12.17.55.png

Notice that both queries are neutral with respect to what’s in the taxonomy: as the taxonomy evolves over time, the results will change.

Of course, recommendation can get a lot more complicated, but this is just a basic suggestion of how the taxonomy could be used. Another option is to use NARROWER_THAN in combination with CO_OCCURS for richer recommendations when there are no NARROWER_THAN alternatives. More on this in future blog posts.
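As a rough sketch of that idea: when a genre of the selected book has no NARROWER_THAN children, we could fall back to its most strongly co-occurring genres with something like this:

MATCH (b:Book {title: 'Slow Bullets'})-[:HAS_GENRE]->(g)
WHERE NOT (g)<-[:NARROWER_THAN]-()
MATCH (g)-[co:CO_OCCURS]->(related)
WHERE NOT (b)-[:HAS_GENRE]->(related)
RETURN g.name AS genre, related.name AS relatedGenre, co.index AS coocIndex
ORDER BY coocIndex DESC LIMIT 10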

What’s interesting about this QuickGraph?

This is a basic attempt at analysing tag co-occurrence using a graph. The algorithm can be refined in a number of ways, but I thought it would be interesting to share it in its basic form and maybe blog later on how to improve it.

I think the most interesting aspect is that the approach is generic and can be used in many contexts to build a purely dynamic and automated solution. The taxonomy creation algorithm can be re-run on a regular basis as new tagged data is added to the graph, and the logic (like the one described in the “using the taxonomy” section) will produce results adapted to the fresh version of the taxonomy.

It’s worth mentioning that the quality of the hierarchy will directly depend on the quality of your data tagging! We are not creating a formal ontology here but rather building a pragmatic and actionable taxonomy derived in an automated way from your data.

Watch this space for other examples of use of this approach and some suggested refinements.

I’d also love to get your feedback!

Quick note on getting the data

To get some data from Goodreads your best option is to write some code using their API. Other alternatives are, for instance, import.io (click to try the import.io ‘lightning’ scraper on a GoodReads list) or HTML scraping libraries for your favourite programming language: rvest if you’re an R fan, or Beautiful Soup or lxml if you prefer Python.

If you want to test the algorithm on your Neo4j instance with the same dataset I used, just run the data load scripts above; they include the link to the data.

QuickGraph#4 Explore your browser history in Neo4j

The dataset

For this example I am going to use my browser history data. Most browsers store this data in SQLite. This means relational data, easy to access from Neo4j using the apoc.load.jdbc stored procedure. I’m a Chrome user, and on my Mac, Chrome stores the history DB at

~/Library/Application Support/Google/Chrome/Default/History

There are two main tables in the History DB: urls and visits. I’m going to explore them directly from Neo4j’s browser using the same apoc.load.jdbc procedure. In order to do that, you’ll first have to download a JDBC driver for SQLite and copy it into the plugins directory of your Neo4j instance. Also keep in mind that Chrome locks the History DB while the browser is open, so if you want to play with it (even read-only access) you will have to either close the browser or, as I did, copy the DB (a single file) somewhere else and work from that snapshot.

This Cypher fragment will return the first few records of the urls table, where we see things like a unique identifier for the page, its url, the title of the page, and some counters: the number of visits, and the number of times the url has been typed as opposed to reached by following a hyperlink.

CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    "urls") yield row 
WITH row LIMIT 10
RETURN *

The results look like this on my browser history.

screen-shot-2016-09-29-at-02-52-48

The visits table contains information about each page visit event: a timestamp (visit_time), a unique identifier (id) for each visit and, most interestingly, whether the visit follows a previous one (from_visit). This would mean that there was a click on a hyperlink that led from page A to page B.

screen-shot-2016-09-29-at-13-36-33

A bit of SQL manipulation using the date and time functions on the SQLite side will filter out the columns from the visits table that we don’t care about for this experiment, and also format the timestamp as a user-friendly date and time.

SELECT id, url, time(((visit_time/1000000)-11644473600), 'unixepoch') as visit_time, 
date(((visit_time/1000000)-11644473600), 'unixepoch') as visit_date,
visit_time as visit_time_raw 
FROM visits
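For reference, Chrome stores visit_time as microseconds since 1601-01-01 (the Windows epoch); dividing by 1,000,000 converts it to seconds, and subtracting 11,644,473,600 (the number of seconds between 1601-01-01 and 1970-01-01) shifts it to the Unix epoch that SQLite’s date and time functions expect.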

Here’s what records look like using this query. Nice and ready to be imported into Neo4j.

screen-shot-2016-09-29-at-14-35-41

Loading the data into Neo4j

The model I’m planning to build is quite simple: I’ll use a node to represent a web page and a separate one to represent each individual visit to a page. Each visit event is linked to the page through the :VISIT_TO_PAGE relationship, and chained page visits (hyperlink navigation) are linked through the :NAVIGATION_TO relationship. Here is what that looks visually on an example navigation from a post on the Neo4j blog to a page with some code on Github:

screen-shot-2016-09-29-at-19-28-38

Ok, so let’s go with the import scripts.  First the creation of Page nodes out of every record in the urls table:

CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    "urls") yield row 
WITH row 
CREATE (p:Page {page_id: row.id, 
                page_url: row.url, 
                page_title: row.title, 
                page_visit_count: row.visit_count, 
                page_typed_count: row.typed_count})

And I’ll do the same with the visits, but linking them to the pages we’ve just loaded. Actually, to accelerate the page lookup I’ll create an index on page ids first.

CREATE INDEX ON :Page(page_id)

And here’s the Cypher running the visit data load.

WITH "SELECT id, url, visit_time as visit_time_raw, 
 time(((visit_time/1000000)-11644473600), 'unixepoch') as visit_time, 
 date(((visit_time/1000000)-11644473600), 'unixepoch') as visit_date 
 FROM visits" AS sqlstring

CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    sqlstring ) yield row
WITH row 
MATCH (p:Page {page_id: row.url}) 
CREATE (v:PageVisit { visit_id: row.id, 
                      visit_time: row.visit_time, 
                      visit_date: row.visit_date, 
                      visit_timestamp: row.visit_time_raw}) 
CREATE (v)-[:VISIT_TO_PAGE]->(p)

And finally, I’ll load the transitions between visits but as we did before with the pages, let’s create first an index on visit ids:

CREATE INDEX ON :PageVisit(visit_id)
WITH "SELECT id, from_visit, transition, segment_id, visit_duration 
      FROM visits" AS sqlstring
CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                    sqlstring
                    ) yield row 
WITH row 
MATCH (v1:PageVisit {visit_id: row.from_visit}),
      (v2:PageVisit {visit_id: row.id}) 
CREATE (v1)-[:NAVIGATION_TO]->(v2)

So we are ready to start querying our graph!

Querying the graph

Let’s look for a direct navigation in the graph that goes for instance from a page in the Neo4j web site to Twitter.

MATCH (v1)-[:VISIT_TO_PAGE]->(p1),
      (v2)-[:VISIT_TO_PAGE]->(p2),
      (v1)-[:NAVIGATION_TO]->(v2) 
WHERE p1.page_url CONTAINS 'neo4j.com' 
      AND p2.page_url CONTAINS 'twitter.com'
RETURN * LIMIT 10

In my browser history data, this produces the following output. Notice that I’ve extended it to include an extra navigation step. I’ve done that just by clicking on the graph visualisation in the Neo4j browser to make the example more understandable:

screen-shot-2016-09-29-at-16-46-26

It actually corresponds to a visit to the Neo4j blog, followed by me tweeting about how cool what I’d just read was. The proof that I’m working with real data is the actual tweet (!)

Ok, so while this basic model is good for analysing individual journeys, I think extracting a Site node by aggregating all pages in the same site can give us interesting insights. Let’s go for it.

Extending the model

This could be done in different ways; for example, we could write a stored procedure and call it from a Cypher script. With the full power of Java, we could properly parse the url string to extract the domain.

I will do it differently though: I’ll run a SQL query on the History SQLite DB that uses string transformations to substring the urls and extract the domain name (sort of). The SQL that extracts the root of the url could be the following:

SELECT id, substr(url,9,instr(substr(url,9),'/')-1) as site_root 
FROM urls 
WHERE instr(url, 'https://')=1 
UNION
SELECT id, substr(url,8,instr(substr(url,8),'/')-1) as site_root 
FROM urls
WHERE instr(url, 'http://')=1

Quite horrible, I know. But my intention is to show how the graph can be extended with new data without having to recreate it. Quite a common scenario when you work with graphs, but relax: graphs are good at accommodating change, nothing like RDBMS migrations when you have to change your schema.

So this new query produces rows containing just the domain (the root of the url) and the page id, which I will use to match previously loaded pages. Something like this:

screen-shot-2016-09-29-at-19-47-56

And the Cypher that loads it and adds the extra information in our graph would be this:

WITH "select substr(url,9,instr(substr(url,9),'/')-1) as site_root, id 
      from urls where instr(url, 'https://')=1 
      UNION
      select substr(url,8,instr(substr(url,8),'/')-1) as site_root, id 
      from urls where instr(url, 'http://')=1"  AS query
CALL apoc.load.jdbc("jdbc:sqlite:/Users/jbarrasa/Documents/Data/History",
                     query) yield row 
WITH row 
MATCH (p:Page {page_id: row.id})
MERGE (s:Site {site_root: row.site_root})
CREATE (p)-[:PAGE_IN_SITE]->(s)
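Alternatively, a rough pure-Cypher sketch, assuming all stored urls follow the usual scheme://host/path shape, could derive the same site_root directly from the Page nodes already in the graph:

MATCH (p:Page)
WHERE p.page_url STARTS WITH 'http'
WITH p, split(p.page_url, '/') AS parts
// for 'http://host/path...' the host sits at position 2 after splitting on '/'
MERGE (s:Site {site_root: parts[2]})
MERGE (p)-[:PAGE_IN_SITE]->(s)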

And once we have the sites we can include weighted site-level navigation. The weight is simply calculated by counting the transitions between pages belonging to each pair of sites. Here is the Cypher that does the job:

MATCH (s:Site)<-[:PAGE_IN_SITE]-()<-[:VISIT_TO_PAGE]-()<-[inbound:NAVIGATION_TO]-()-[:VISIT_TO_PAGE]->()-[:PAGE_IN_SITE]->(otherSite) 
WHERE otherSite <> s 
WITH otherSite, s, count(inbound) as weight 
CREATE (otherSite)-[sn:SITE_DIRECT_NAVIGATION{weight:weight}]->(s)
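With those site-level relationships in place, a quick follow-up query lists the heaviest site-to-site transitions:

MATCH (a:Site)-[n:SITE_DIRECT_NAVIGATION]->(b:Site)
RETURN a.site_root AS fromSite, b.site_root AS toSite, n.weight AS weight
ORDER BY weight DESC LIMIT 10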

This is a much richer graph, where we can traverse not only individual journeys but also Site-level connections. In the following visualisation we can see that there are some transitions between the http://www.theguardian.co.uk and the http://www.bbc.co.uk sites (indicated in green), and also to other sites like en.wikipedia.org. In the same capture we can see one of the individual navigations that explains the existence of a :SITE_DIRECT_NAVIGATION relationship between the Guardian node and the BBC one. It actually represents a hyperlink I clicked in a Guardian article that connected it to a BBC one. The purple sequence of events (page visits) details my journey and the yellow nodes represent the pages, pretty much the same as we saw in the previous example from neo4j.com to twitter.com.

screen-shot-2016-09-30-at-01-09-41

We can also have a bird’s eye view of a few thousand of the nodes on the graph and notice some interesting facts:

Screen Shot 2016-09-29 at 21.41.11.png

I’ve highlighted some interesting Site nodes. We can see that the most highly connected (more central in the visualisation) are the Google sites and the URL-shortening services (t.co, bit.ly, etc.). It makes sense because you typically navigate in and out of them; they act as bridge nodes in your navigation. This is confirmed if we run the betweenness centrality algorithm on the sites and their connections. Briefly, betweenness centrality is an indicator of a node’s centrality in a graph and is equal to the number of shortest paths from all nodes to all others that pass through that node.

Here is the Cypher script, again invoking the graph algo implementation as a stored procedure that you can find in the amazing APOC library:

MATCH (s:Site)
WITH collect(s) AS nodes
CALL apoc.algo.betweenness(['SITE_DIRECT_NAVIGATION'],nodes,'BOTH') 
  YIELD node, score
RETURN node.site_root, score
ORDER BY score DESC LIMIT 5

And these are the top five results of the computation on my browser history.

screen-shot-2016-09-30-at-00-44-51

I’m sure you can think of many other interesting queries on your own navigation: what’s the average length of a journey? How many different sites does it traverse? Is it mostly intra-site? Are there any isolated clusters? An example of the latter in my browser history is the Amazon sites (amazon.co.uk and music.amazon.co.uk): there seem to be loads of transitions (navigation) between them but none in or out to/from other sites. You can see this visually in the bottom-left part of the previous bird’s eye view. I’m sure you will come up with many more, but I’ll finish this QuickGraph with a query involving some serious path exploration.
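Before that, here’s one rough sketch for the intra-site question: counting how many page-to-page transitions stay within the same site versus crossing to a different one.

MATCH (v1)-[:NAVIGATION_TO]->(v2),
      (v1)-[:VISIT_TO_PAGE]->()-[:PAGE_IN_SITE]->(s1),
      (v2)-[:VISIT_TO_PAGE]->()-[:PAGE_IN_SITE]->(s2)
RETURN s1 = s2 AS sameSite, count(*) AS transitions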

The question is: which sites have I navigated to from LinkedIn pages, how many times have I reached them, and how long (as in how many hyperlink clicks) did it take me to get to them? You may be asking yourself how on earth you would even express that in SQL (?!?!). Well, not to worry: you’ll be pleased to see that it takes less writing to express the query in Cypher than it does in English. Here it is:

MATCH (v1)-[:VISIT_TO_PAGE]->(p1)-[:PAGE_IN_SITE]-(s1:Site {site_root: "www.linkedin.com"}) 
MATCH p = (v1)-[:NAVIGATION_TO*]->(v2)-[:VISIT_TO_PAGE]->(p2)-[:PAGE_IN_SITE]-(s2)
WHERE s2 <> s1
WITH length(p) AS pathlen, s2.site_root AS site 
RETURN AVG(pathlen) AS avglen, count(*) AS count, site ORDER BY avglen

And my results, 21 milliseconds later…

screen-shot-2016-09-30-at-01-59-38

What’s interesting about this QuickGraph?

This experiment shows several interesting things, the first being how straightforward it can be to load relational data into Neo4j using the apoc.load.jdbc stored procedure. As a matter of fact, the same applies to other types of data sources, for example web services, as I described in previous posts.

The second takeaway is how modelling and storing data that is naturally a graph (sequences of page visits) as a graph, as opposed to shoehorning it into relational tables, opens up a lot of opportunities for querying and exploration that would be unthinkable in SQL.

Finally, I’ve also shown how some graph algorithms (betweenness centrality) can easily be applied to your graph using stored procedures from Cypher. It’s worth mentioning that you can extend the list of available algorithms by writing your own and easily deploying them on your Neo4j instance.

QuickGraph#3 A step-by-step example of RDF to Property Graph transformation

The dataset

For this example I am going to use a sample movie dataset from the Cayley project. It’s a set of half a million triples about actors, directors and movies that can be downloaded here. Here is what the dataset looks like:

</en/meet_the_parents> <name> "Meet the Parents" .
</en/meet_the_parents> <type> </film/film> .
</en/meet_the_parents> </film/film/directed_by> </en/jay_roach> .
</en/meet_the_parents> </film/film/starring> _:28754 . 
_:28754 </film/performance/actor> </en/ben_stiller> .
_:28754 </film/performance/character> "Gaylord Focker" .
</en/meet_the_parents> </film/film/starring> _:28755 .
...

One could argue about whether this dataset is actual RDF or just a triple-based graph, since it does not use valid URIs or even the RDF vocabulary (note for example that instead of http://www.w3.org/1999/02/22-rdf-syntax-ns#type we find just type). But this would be a rather pointless discussion in my opinion. For what it’s worth, the graph is parseable with standard RDF parsers, which is enough, and as we’ll see the problems derived from this can be fixed, which is the point of this post.

 

Loading the data into Neo4j

I’ll use the RDF Importer described here for the data load. Now, there is something to take into account: even though the data set is called ‘30kmoviedata.nq’, it does not contain quads but triples, so I first tried the parser with the serialisation format set to ‘N-Triples’. The parser threw an error complaining about the structure of the URIs:

Not a valid (absolute) IRI: /film/performance/actor [line 1]

However, funnily enough the file parses as Turtle format. So if you want to give it a try, remember to set the second parameter of the importRDF stored procedure to ‘Turtle’ and run the import in the usual way. It took only 39 seconds to load the 471K triples on my laptop.

screen-shot-2016-09-09-at-16-56-13
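Once the import finishes, a quick count confirms what landed in the graph:

MATCH (n) WITH count(n) AS nodes
MATCH ()-[r]->()
RETURN nodes, count(r) AS relationships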

Fixing the model

Fixing dense nodes representing categories

The first thing we notice is that, because the data set does not use the RDF vocabulary, the a <type> b statements are not transformed into node labels, as would have happened if rdf:type had been used instead. So there are a couple of unusually dense nodes representing the categories (person and movie), because most of the nodes in the dataset are either actors or movies and are therefore linked to either one or the other category node. The two dense nodes are immediately visible in a small sample of 1000 nodes:

screen-shot-2016-09-09-at-17-11-09

We can get counts on the number of nodes connected to each of them by running this query:

MATCH (x)-[:ns1_type]->(t) RETURN t.uri, count(x)

screen-shot-2016-09-09-at-16-27-39

The natural way of representing categories in the labeled property graph model is by using labels, so let’s fix this! Here is the Cypher fragment that does the job:

MATCH (x)-[:ns1_type]->({uri : 'file:/film/film'}) 
SET x:Film

And once we have the nodes labeled with their categories we can get rid of the dense nodes and the links that connect the rest of the nodes to them.

MATCH (f {uri : 'file:/film/film'}) DETACH DELETE f

Exactly the same applies to the other category: ‘file:/people/person’

MATCH (x)-[:ns1_type]->({uri : 'file:/people/person'}) 
SET x:Person 

MATCH (p {uri : 'file:/people/person'}) DETACH DELETE p

Fixing unneeded intermediate nodes holding relationship properties

In the tiny fragment that I copied at the beginning of the post, we can already see that the data set suffers from one of the known limitations of triple-based graph models, which is the impossibility of adding attributes to relationships. To do that, intermediate nodes need to be created. Let’s have a look at the example in the previous data fragment graphically.

Ben Stiller plays the role of Gaylord Focker in the movie Meet the Parents, and when modelling this (think how you would draw it on a whiteboard) our intuition says something like this:

 

Screen Shot 2016-09-09 at 21.11.20.png

But in a triple based model you will need to introduce an intermediate node to hold the role played by an actor in a movie. Something like this.

screen-shot-2016-09-09-at-14-43-32

This obviously creates a gap between what you conceive when modelling a domain and what is stored on disk and ultimately queried. You will have to map what’s in your head, what you drew on the whiteboard when sketching the model, to what the triple-based formalism forces you to actually create. Does this ring a bell? Join tables in the relational model maybe? In your head it’s a many-to-many relationship, but in the relational model it has to be modelled in a separate join table, an artificial construct imposed by the modelling paradigm that inevitably builds a gap between the conceptual model and the physical one. This ultimately makes your model harder to understand and maintain and your SQL queries looooooonger and less performant. But not to worry, we’ll fix this by using the property graph model, the one that is closer to the way we as humans understand and model domains.

But before we do that, let’s look at another problem derived from this. This complex model introduces the possibility of data quality problems in the form of broken links. What if we have the first leg connecting our intermediate node with the movie, but no connection with the actor? It would be a totally meaningless piece of information. The pattern I’m describing would be expressed like this:

()-[r:ns2_starring]->(x) WHERE NOT (x)-[:ns0_actor]->()

And a query producing a ‘Data Quality’ report on this particular issue could look something like this:

MATCH ()-[r:ns2_starring]->(x) WHERE NOT (x)-[:ns0_actor]->() 
WITH COUNT(r) as brokenLinks
MATCH ()-[r:ns2_starring]->(x)-[:ns0_actor]->() 
WITH COUNT(r) as linked, brokenLinks
RETURN linked + brokenLinks as total, linked, brokenLinks,  
     toFloat(brokenLinks)* 100/(linked + brokenLinks) as percentageBroken

Screen Shot 2016-09-09 at 17.43.59.png

So 0.03% does not seem significant; probably the dataset was truncated in a bad way, which would explain the missing bits. Anyway, we can get rid of these broken links that don’t add any value to our graph. Here’s how:

MATCH ()-[r:ns2_starring]->(x) WHERE NOT (x)-[:ns0_actor]->() 
DETACH DELETE x

Ok, so now we are in a position to get rid of the ugly and unintuitive intermediate nodes that I described before and replace them with relationships containing attributes on them.

MATCH (film)-[r:ns2_starring]->(x)-[:ns0_actor]->(actor)
CREATE (actor)-[:ACTS_IN { character: x.ns0_character}]->(film)
DETACH DELETE x
...
Deleted 136694 nodes, set 15043 properties, created 136694 relationships, statement executed in 7029 ms.

And voilà! Here is the final model zooming on the ‘Gaylord Focker’ area:

MATCH (actor)-[:ACTS_IN { character : 'Gaylord Focker' }]->(movie) 
RETURN * LIMIT 25

 

Screen Shot 2016-09-09 at 18.37.38.png

And to finish, one of our favourites at Neo4j: a recommendation engine, this time for Hollywood actors. Who should Ben Stiller work with? We’ll base this on the concept of friend-of-a-friend. If Ben has worked several times with actor X, and actor X has worked several times with actor Y, then there is a good chance that Ben might be interested in working with actor Y.

Here is the Cypher query that returns our best recommendations for Ben Stiller:

MATCH (ben:Person {ns1_name: 'Ben Stiller'})-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(friend) 
WITH ben, friend, count(movie) AS timesWorkedWithBen ORDER BY timesWorkedWithBen DESC LIMIT 3 //limit to top 3 
MATCH (friend)-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(friendOfFriend)
WHERE NOT (ben)-[:ACTS_IN]->(movie)<-[:ACTS_IN]-(friendOfFriend) AND friendOfFriend <> ben
RETURN friend.ns1_name AS friendOfBen, timesWorkedWithBen, friendOfFriend.ns1_name AS recommendationForBen, count(movie) AS timesWorkedWithFriend ORDER BY timesWorkedWithFriend DESC limit 50

Easy, right? And here are the recommendations:

Screen Shot 2016-09-09 at 20.46.32.png

The following two visualisations give an idea of the portion of the graph explored with our recommendation query. This first one shows Ben’s friends and the movies where they worked together (~400 nodes in total):

Screen Shot 2016-09-09 at 19.26.35.png

And the next shows Ben’s friends’ friends, again with the movies that connect them (~1800 nodes):

Screen Shot 2016-09-09 at 19.34.10.png

You can try to write something similar on the original triple based graph using SPARQL, Gremlin or any other language but I bet you it will be less compact, less intuitive and certainly less performant than the Cypher I wrote. Prove me wrong if you can 😉

What’s interesting about this QuickGraph?

The example highlights some of the modelling limitations of triple-based graph models like RDF, and how it is possible to transform a model originally created as RDF into one that is more intuitive and easier to query and explore using the labeled property graph in Neo4j.


QuickGraph#2 How is Wikipedia’s knowledge organised

The dataset

For this QuickGraph I’ll use data about Wikipedia Categories. You may have noticed at the bottom of every Wikipedia article a section listing the categories it’s classified under. Every Wikipedia article will have at least one category, and categories branch into subcategories forming overlapping trees. It is sometimes possible for a category (and the Wikipedia hierarchy is an example of this) to be a subcategory of more than one parent category, so the hierarchy is effectively a graph.

I guess an example will be helpful at this point: if you open the Wikipedia page on the mathematician Leonhard Euler you will find at the bottom of it the following list of categories:

Screen Shot 2016-08-03 at 09.43.02

If you click on any of them, for instance ‘Swiss mathematicians‘, you will be able to navigate up and down the hierarchy of categories from the selected one. I’ll jump a couple of steps ahead here and show the final product before explaining how to build it: the structure you are navigating is a graph and looks something like this:

Screen Shot 2016-08-03 at 09.57.45

In the diagram the colour coding of the Category nodes indicates the (shortest) distance to the root category, so we will have Level1 (green) to Level4 (pink) categories. We can see that the ‘Swiss mathematicians’ category can be reached following many different paths from the root. A couple of examples would be:

People > People by occupation> People in STEM fields > Mathematicians > Mathematicians by nationality > Swiss mathematicians 

or

Science > Academic disciplines> STEM > Mathematics > Mathematicians > Mathematicians by nationality > Swiss mathematicians 

So this is the graph that we are going to build and explore/query in Neo4j in this blog post.

The data source

Wikipedia data can be accessed through the MediaWiki action API, and the specific request that returns information for a given category is described here and looks as follows:

https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:Fundamental_categories&cmprop=ids|title&cmlimit=500

The request returns a simple JSON structure containing all subcategories of the one passed as parameter (cmtitle). Here is an example of the web service output:

Screen Shot 2016-08-02 at 04.15.33

Loading the data into Neo4j

Since the Wikipedia categorymembers web service takes the name of a category as input parameter, we’ll have to seed the graph with a root node in order to start loading categories. This node will be the starting point for our data load. For this example we will use the category “Main topic classifications”, which groups Wikipedia’s major topic classifications.

CREATE (c:Category:RootCategory {catId: '0000000', catName: 'Main_topic_classifications', fetched : false})

An index on category ids will help speed up the merges, so let’s add it.

CREATE INDEX ON :Category(catId)

Now, from the top-level category, we’ll iteratively call the categorymembers web service and process the JSON fragment returned to create nodes and relationships in the graph. We will do this in Neo4j using the apoc.load.json procedure in the APOC library. Here is the Cypher script that does the job. It first invokes the web service with the name of the category, and then processes the JSON that the web service returns, creating new Category nodes connected to the parent through :SUBCAT_OF relationships.

MATCH (c:Category { fetched: false}) 
call apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:' + replace(c.catName,' ', '%20') + '&cmprop=ids|title&cmlimit=500') 
YIELD value as results 
UNWIND results.query.categorymembers AS subcat 
MERGE (sc:Category:Level1Category {catId: subcat.pageid}) 
ON CREATE SET sc.catName = substring(subcat.title,9), 
              sc.fetched = false
MERGE (sc)-[:SUBCAT_OF]->(c)
WITH DISTINCT c
SET c.fetched = true

Each iteration of the loader adds a new level to the category hierarchy by picking up all ‘unexpanded’ nodes and fetching all the subcategories for each of them. Nodes are marked on creation with the fetched: false property and, once they have been expanded (subcategories retrieved and added to the graph), the property fetched is set to true.
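Between iterations, a trivial query like this one tells you how many categories are still pending expansion:

MATCH (c:Category {fetched: false})
RETURN count(c) AS pendingExpansion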

So a first iteration of the loader, run right after creating the top-level category ‘Main_topic_classifications’, instantiates 13 new categories

Screen Shot 2016-08-02 at 11.47.20

Which are obviously the 13 main topic classifications in Wikipedia.

Screen Shot 2016-08-02 at 11.54.11

Screen Shot 2016-08-03 at 11.50.49

The second iteration of the loader picks up each one of these 13 and instantiates its subcategories, producing 423 new next-level nodes.

Screen Shot 2016-08-02 at 11.56.51

A third iteration brings in 5815 new nodes and a fourth one takes the final number close to 50K categories by adding 42840 more. It’s important to keep in mind that every category expansion involves a request being sent to the Wikipedia API, so I would not recommend this approach for hierarchies more than 4 levels deep. If you’re interested in loading the whole set of categories, probably the best option would be to download them as an RDB dump (the files enwiki-latest-category.sql.gz and enwiki-latest-categorylinks.sql.gz from the downloads page) and then use the super-fast batch importer to create your graph in Neo4j. Maybe an idea for a future graphgist?

Ok, we now have a graph of 50K-odd categories from Wikipedia organised hierarchically via :SUBCAT_OF relationships, so let’s try and run some interesting queries on them. Here’s what a few thousand categories (up to level three) look like in the Neo4j browser: the root node is the red one towards the centre of the image, Level 1 categories are the green ones, followed by Level 2 in blue and Level 3 in yellow.

 

Screen Shot 2016-08-02 at 14.42.05

An alternative graph based on a smaller number of initial thematic classifications can be built using “Fundamental_categories” as root. In Wikipedia’s terms, this alternative classification “is intended to contain all and only the few most fundamental ontological categories which can reasonably be expected to contain every possible Wikipedia article under their category trees”. I leave this for the interested reader to try.

Querying the graph

What is the shortest full hierarchy for a given category?

Let’s use for this example the ‘Swiss Mathematicians’ example from the introduction.

MATCH shortestFullHierarchy = shortestPath((swissMatem:Category {catName : 'Swiss mathematicians'})-[:SUBCAT_OF*]->(root:RootCategory)) 
RETURN shortestFullHierarchy

The shortestPath function returns the expected depth four hierarchy:

Screen Shot 2016-08-03 at 10.48.23

There will be cases where more than one shortest path exists, and these can be picked up with the allShortestPaths variant of the shortest path function:

MATCH shortestFullHierarchies = allShortestPaths((swissMatem:Category {catName : 'Philosophers by period'})-[:SUBCAT_OF*]->(root:RootCategory)) 
RETURN shortestFullHierarchies

The query returns all shortest full hierarchies (all with a max depth of 4) for the ‘Philosophers by period’ category.

Screen Shot 2016-08-03 at 11.15.05

What are the richest categories in the top four levels of the Wikipedia?

What is a rich category? Well, this is a QuickGraph so I’ll come up with my own quick (although I hope still minimally reasoned) definition but let’s keep in mind that I’m not trying to build a theory about hierarchy relevance but rather show how the graph can be easily queried in interesting ways.

Right, so a rich category is one that has loads of subcategories, because that means it’s complex enough to be broken down at a high level of detail. Similarly, a rich category is also one that can be categorised from different perspectives, or in other words, one that has multiple parent categories. But we will want to discard unbalanced ones (many subcategories but very few supercategories, or vice versa) to avoid enumeration-style categories like “Religion by country”, “Culture by nationality” or “People by ethnic or national descent”. With these three criteria we can build a graph query that returns our top ten. Here is the Cypher for our richest categories:

MATCH (cat:Category)
WITH cat, 
 size((cat)-[:SUBCAT_OF]->()) as superCatDegree, 
 size(()-[:SUBCAT_OF]->(cat)) as subCatDegree
WHERE ABS(toFloat(superCatDegree - subCatDegree)/(superCatDegree + subCatDegree)) < 0.4
RETURN cat.catName, (superCatDegree + subCatDegree)/2 AS richness 
ORDER BY richness DESC 
LIMIT 10

And the results are quite interesting…

 

Screen Shot 2016-08-02 at 20.47.04

If we now want to visually explore the Marxism category…

Screen Shot 2016-08-02 at 21.43.07

Unexpectedly long hierarchies!

One thing that caught my eye from the point of view of the structure of the graph was the fact that even though I had only loaded 4 levels, I was able to find much longer chains of parent-child categories. This is because not all :SUBCAT_OF relationships go from level n to level n-1. That is actually what makes the structure a graph and not a tree. The following query returns :SUBCAT_OF chains of length 12 or more ( path = ()-[r:SUBCAT_OF*12..]->() ), but checks that every node in the chain is at a maximum distance of 4 from the root ( shortestPath((node)-[:SUBCAT_OF*..4]->(root:Category {catId:'0000000'})) ).

MATCH path =()-[r:SUBCAT_OF*12..]->() WITH path LIMIT 1
UNWIND nodes(path) AS node
MATCH shortest = shortestPath((node)-[:SUBCAT_OF*..4]->(root:Category {catId:'0000000'}))
RETURN path, shortest

And the first result shows that, even though there are such long chains of :SUBCAT_OF, all the nodes in them are never more than 4 hops away from the root node. This is consistent with the fact that we’ve only loaded 4 levels of the category hierarchy.

Screen Shot 2016-08-02 at 19.20.14

 

Loops!

Yes, there are some. So if one day you decide to read the whole Wikipedia (as you do) and you choose to do it by category, then be careful not to enter into an infinite loop of knowledge 🙂

Loops are easy to pick up with this simple Cypher query:

MATCH loop = (cat)-[r:SUBCAT_OF*]->(cat) 
RETURN loop LIMIT 1

The capture on the left shows a loop and the one on the right shows the same loop but keeping the hierarchical order. Green nodes are level 1 categories, blue ones are level 2 and yellow ones represent level 3 categories. The interesting thing is that Geography, even though it’s level 1, is also a subcategory of Earth Sciences, which is level 3. The same thing happens with Nature, being both level 1 and a subcategory of Environmental social science concepts, which is level 3. These two extra relationships create the loop, and the following query produces the chain of :SUBCAT_OF that it traverses.

MATCH loop = (cat)-[r:SUBCAT_OF*]->(cat) 
RETURN reduce(s = "", x IN nodes(loop) | s + ' > '+ x.catName) AS categoryChain limit 1

Screen Shot 2016-08-02 at 17.23.52

You can also try to find longer loops by just specifying the minimum length of the path

loop = (cat)-[r:SUBCAT_OF*12..]->(cat)

And get chains like:

Screen Shot 2016-08-02 at 18.19.21

What’s interesting about this QuickGraph?

The graph is built directly by querying a web service from the MediaWiki action API. The apoc.load.json procedure in the APOC library can invoke this service and ingest the returned JSON structure, all within your Cypher script. Find out how to install and use (and extend!) the APOC library of stored procedures for Neo4j.

Rich category hierarchies are graphs, and an extremely useful tool for scenarios like recommendation systems or graph-enhanced search.

 

QuickGraph #1 European Politics from DBpedia. Loading data from an RDF triple store into Neo4j via SPARQL

The first of a series of quick graphs in Neo4j built from public data. Watch this space! The example is also available as a GraphGist here

The dataset

For this example I’ve used DBpedia data about European cities, the political parties in their local governments and their ideologies. So, big disclaimer: this data is not meant to be complete or more accurate than what can be expected from DBpedia/Wikipedia data. DBpedia stores data using the RDF model, so in order to query it we’ll have to use the SPARQL query language.

Here is the SPARQL query I’ve used to extract the cities and countries, and the political parties currently in local government. The query can be tested on the DBpedia SPARQL endpoint.

select distinct ?party ?city ?cityName ?ctr ?ctrName ?pop
where {

     ?city a dbo:Location ;
          rdfs:label ?cityName ;
           dbo:country ?ctr ;
           dbo:leaderParty ?party .
     FILTER(langMatches(lang(?cityName), "EN"))

     ?ctr dct:subject dbc:Countries_in_Europe ;
          rdfs:label ?ctrName .
     FILTER(langMatches(lang(?ctrName), "EN"))

     optional { ?city dbo:populationTotal ?pop }
}

And here is a second SPARQL query that returns, for each party, the ideologies it is associated with, again according to what DBpedia/Wikipedia says about the different political parties.

select distinct ?party ?partyName ?ideology ?ideologyName where {

     ?city a dbo:Location ;
           dbo:country ?ctr ;
           dbo:leaderParty ?party .

     ?ctr dct:subject dbc:Countries_in_Europe .

     ?party dbo:ideology ?ideology ;
            rdfs:label ?partyName .
     FILTER(langMatches(lang(?partyName), "EN"))

     ?ideology rdfs:label ?ideologyName
     FILTER(langMatches(lang(?ideologyName), "EN"))
}

SPARQL queries of type SELECT return a set of variable bindings, or in other words, a tabular structure that the endpoint can serialise as CSV. This is quite convenient because it can be used directly by the LOAD CSV instruction in Cypher.

Loading the data into Neo4j

Here are the two Cypher statements that create the model in Neo4j. You may also want to create an index on city nodes first to get better performance:

CREATE INDEX ON :City(city_uri)
WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fparty+%3Fcity+%3FcityName+%3Fctr+%3FctrName+%3Fpop+where+%7B%0D%0A%3Fcity+a+dbo%3ALocation+%3B%0D%0A++++++dbo%3Acountry+%3Fctr+%3B%0D%0A++++++dbo%3AleaderParty+%3Fparty+.%0D%0A%0D%0A%3Fctr+dct%3Asubject+dbc%3ACountries_in_Europe+.%0D%0A%0D%0A%3Fparty+dbo%3Aideology+%3Fideology+.%0D%0A%0D%0Aoptional+%7B+%3Fcity+dbo%3ApopulationTotal+%3Fpop+%7D%0D%0A%0D%0A%3Fcity+rdfs%3Alabel+%3FcityName%0D%0AFILTER%28langMatches%28lang%28%3FcityName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%3Fctr+rdfs%3Alabel+%3FctrName%0D%0AFILTER%28langMatches%28lang%28%3FctrName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%7D&format=csv" AS queryOnSPARQLEndpoint
LOAD CSV WITH HEADERS FROM queryOnSPARQLEndpoint AS row
WITH row
MERGE (cty:City {city_uri: row.city}) 
      SET cty.city_pop= coalesce(row.pop,0), cty.city_name = row.cityName
MERGE (ctr:Country {ctry_uri: row.ctr}) 
      SET ctr.ctry_name = row.ctrName
MERGE (pty:Party {party_uri: row.party})
MERGE (cty)-[:CTY_IN_COUNTRY]->(ctr)
MERGE (cty)-[:GOVERNING_PARTY]->(pty)

Note that the query runs directly off the DBpedia SPARQL endpoint (same for the next one), so if the endpoint is down for maintenance or for any other reason, this script won’t do much 🙂
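
A quick way to check that the endpoint is up before kicking off a load is to fire a trivial SPARQL query at it. The snippet below is just a sketch (the query, URL-encoded in the link, is select (1 as ?ok) where {}); if it comes back with a single row, the endpoint is alive:

// Minimal availability check against the DBpedia SPARQL endpoint
LOAD CSV WITH HEADERS FROM "http://dbpedia.org/sparql?query=select+%281+as+%3Fok%29+where+%7B%7D&format=csv" AS row
RETURN row.ok

The second statement, below, completes the model by loading each party’s ideologies: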

WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fparty+%3FpartyName+%3Fideology+%3FideologyName+where+%7B%0D%0A%0D%0A%3Fcity+a+dbo%3ALocation+%3B%0D%0A++++++dbo%3Acountry+%3Fctr+%3B%0D%0A++++++dbo%3AleaderParty+%3Fparty+.%0D%0A%0D%0A%3Fctr+dct%3Asubject+dbc%3ACountries_in_Europe+.%0D%0A%0D%0A%3Fparty+dbo%3Aideology+%3Fideology+%3B%0D%0A+++++++rdfs%3Alabel+%3FpartyName+.%0D%0AFILTER%28langMatches%28lang%28%3FpartyName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%3Fideology+rdfs%3Alabel+%3FideologyName%0D%0AFILTER%28langMatches%28lang%28%3FideologyName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%7D&format=csv" AS queryOnSPARQLEndpoint
LOAD CSV WITH HEADERS FROM queryOnSPARQLEndpoint AS row
WITH row
MERGE (pty:Party {party_uri: row.party}) 
      SET pty.party_name = row.partyName
MERGE (ide:Ideology {ideology_uri: row.ideology}) 
      SET ide.ideology_name = row.ideologyName
MERGE (pty)-[:HAS_IDEOLOGY]->(ide)
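
Once both statements have run, an optional sanity check is to count what was created per label (just a quick sketch):

// Optional sanity check: node counts per label after both loads
MATCH (n)
RETURN labels(n) AS label, count(*) AS nodes
ORDER BY nodes DESC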

And that’s pretty much it. The data in your graph should look something like this for a given city, for instance Bern in Switzerland:

 

[Screenshot: the graph around the city of Bern, Switzerland]

Or like this for a few cities in Spain:

[Screenshot: the graph around a few cities in Spain]

Querying the graph

Now a couple of interesting queries:

Which are the most transnational ideologies in Europe?

MATCH (ctr:Country)<-[:CTY_IN_COUNTRY]-(city)-[:GOVERNING_PARTY]->(party)-[:HAS_IDEOLOGY]->(i:Ideology)
RETURN i.ideology_name, COUNT(DISTINCT(ctr)) AS presentInCountries, COLLECT(DISTINCT(ctr.ctry_name)) AS CountryList
ORDER BY presentInCountries DESC LIMIT 5

Resulting in the following records:

  • Social democracy: present in 25 countries (Bosnia and Herzegovina, Bulgaria, Albania, Austria, Denmark, United Kingdom, Estonia, Czech Republic, Turkey, Switzerland, Croatia, Cyprus, Spain, Sweden, Greece, Germany, France, Lithuania, Italy, Slovenia, Serbia, Romania, Portugal, Norway, Netherlands)
  • Christian democracy: present in 17 countries (Croatia, Bulgaria, Bosnia and Herzegovina, Austria, Estonia, Switzerland, Germany, France, Italy, Lithuania, Spain, Slovenia, Slovakia, Serbia, Romania, Poland, Netherlands)
  • Euroscepticism: present in 15 countries (Ukraine, Turkey, Croatia, Switzerland, Romania, Hungary, Greece, Germany, France, United Kingdom, Lithuania, Italy, Spain, Serbia, Netherlands)
  • Conservatism: present in 14 countries (Bulgaria, Kosovo, Bosnia and Herzegovina, Austria, Turkey, Estonia, Denmark, Cyprus, Croatia, France, Spain, Italy, Slovenia, Romania)
  • Pro-Europeanism: present in 13 countries (Bosnia and Herzegovina, Bulgaria, Albania, Kosovo, Turkey, Denmark, Switzerland, Croatia, Greece, Spain, Romania, Poland, Norway)

Which are the most ideologically similar parties from different countries?

MATCH (p1:Party)-[:HAS_IDEOLOGY]->(i)<-[:HAS_IDEOLOGY]-(p2:Party)
WHERE (ID(p1) < ID(p2))
WITH p1, p2, COUNT(DISTINCT(i)) AS sharedIdeologyCount, 
     COLLECT(DISTINCT(i.ideology_name)) AS sharedIdeologies
WHERE sharedIdeologyCount > 2
MATCH (p1)<-[:GOVERNING_PARTY]-()-[:CTY_IN_COUNTRY]->(ctr1), 
      (p2)<-[:GOVERNING_PARTY]-()-[:CTY_IN_COUNTRY]->(ctr2)
WHERE ctr1 <> ctr2
RETURN DISTINCT p1.party_name AS Party1, ctr1.ctry_name AS Country1, 
      p2.party_name AS Party2, ctr2.ctry_name AS Country2, 
      sharedIdeologyCount, sharedIdeologies 
ORDER BY sharedIdeologyCount DESC LIMIT 5

Producing these results:

  • Croatian Party of Rights (Croatia) and Independent Greeks (Greece): 3 shared ideologies (Euroscepticism, Social conservatism, National conservatism)
  • Movement for Rights and Freedoms (Bulgaria) and Convergence and Union (Spain): 3 shared ideologies (Liberalism, Populism, Centrism)
  • Greater Romania Party (Romania) and Swiss People's Party (Switzerland): 3 shared ideologies (Euroscepticism, Right-wing populism, National conservatism)
  • Movement for Rights and Freedoms (Bulgaria) and ANO 2011 (Czech Republic): 3 shared ideologies (Liberalism, Centrism, Populism)
  • Independent Greeks (Greece) and Swiss People's Party (Switzerland): 3 shared ideologies (Euroscepticism, Right-wing populism, National conservatism)

 

What’s interesting about this QuickGraph?

The graph is built by querying an RDF triple store via a SPARQL endpoint and consuming the output of the SPARQL query directly with ‘LOAD CSV’ in Cypher.