Graph DB + Data Virtualization = Live dashboard for fraud analysis

The scenario

Retail banking: your graph-based fraud detection system, powered by Neo4j, runs as part of the controls applied when processing line of credit applications or when accounts are provisioned. Its job is to block, or at least flag, potentially fraudulent submissions as they come into your systems. It also sends alerts to fraud operations analysts whenever unusual patterns are detected in the graph, so they can be individually investigated ASAP.

This is all working great, but you want other analysts in your organisation to benefit from the rich insights your graph database can deliver: people whose job is not to react on the spot to individual fraud threats, but rather to understand the bigger picture. They are probably more strategic business analysts, and maybe data scientists doing predictive analysis too. They will typically want to look at fraud patterns globally rather than individually, and to combine the information in your fraud detection graph with other data sources (external to the graph) for reporting purposes, to get new insights, or even to ‘learn’ new patterns by running algorithms or applying machine learning techniques.

In this post I’ll describe, through an example, how data virtualization can be used to integrate your Neo4j graph with other data sources, providing a single unified view that is easy to consume from standard analytics/BI tools.

Don’t be confused by the name: data virtualization (DV) is about data integration; it has nothing to do with hardware or infrastructure virtualization.

The objective

I thought a good example for this scenario would be the creation of an integrated dashboard on your fraud detection platform, aggregating data from a couple of different sources.

Nine times out of ten, integration will be synonymous with ETL-ing your data into a centralised store or data warehouse and then running your analytics/BI from there. Fine. This is of course a valid approach, but it also has its shortcomings, especially regarding agility, time to solution and cost of evolution, to name a few. As I said in the intro, I wanted to explore a more modern and agile alternative called data virtualization; in other words, I’ll be building what is these days called a logical data warehouse.

The “logical” in the name comes from the fact that data is not necessarily replicated (materialised) into a store, but rather “wrapped” logically at the source and exposed as a set of virtual views that are run on demand. This is what makes this federated approach fundamentally different from the ETL-based one.

[Diagram: the logical data warehouse approach]

The architecture of my experiment is not too ambitious, but rich enough to prove the point. It uses an off-the-shelf commercial data virtualization platform (Data Virtuality) abstracting and integrating two data sources (one relational, one graph) and offering a unified view to a BI tool.

Before I go into the details, a quick note of gratitude: When I decided to go ahead with this experiment, I reached out to Data Virtuality, and they very kindly gave me access to a VM with their data virtualization platform preinstalled and supported me along the way. So here is a big thank you to them, especially to Niklas Schmidtmer, a top solutions engineer who has been super helpful and answered all my technical questions on DV.

The data sources

 

Neo4j for fraud detection

In this post I’m focusing on the integration aspects, so I will not go into the details of what a graph-based fraud detection solution built on Neo4j looks like. I’ll just say that Neo4j is capable of keeping a real-time view of your account holders’ information and detecting potentially fraudulent patterns as they appear. By “real time” I mean as accounts are provisioned or updated in your system, or as transactions arrive; in other words, as suspicious patterns are formed in your graph.

In our example, say we have a Cypher query returning the list of potential fraudsters. A potential fraudster, in our example, is an individual account holder involved in a suspicious ring pattern like the one in the Neo4j browser capture below. The query also returns some additional information derived from the graph, like the size of the fraud ring and the financial risk associated with it. The list of fraudsters returned by this query will drive my dashboard, but we will want to enrich it first with some additional information from the CRM.

For a detailed description of what first-party bank fraud is and how graph databases can fight it, read this post.

[Screenshot: fraud ring pattern in the Neo4j browser]
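
For illustration, the kind of Cypher query I have in mind could look roughly like the sketch below. The labels (AccountHolder), relationship types (HAS_PHONE, HAS_SSN, HAS_ADDRESS, HAS_CREDITCARD, HAS_LOAN), property names and the risk calculation are all assumptions made up for this post, based on the typical first-party fraud model where account holders share identifiers; your actual model and query will differ.

// Hypothetical model: account holders sharing contact identifiers form a potential fraud ring
MATCH (holder:AccountHolder)-[:HAS_PHONE|:HAS_SSN|:HAS_ADDRESS]->(id)
      <-[:HAS_PHONE|:HAS_SSN|:HAS_ADDRESS]-(other:AccountHolder)
WITH holder, count(DISTINCT other) + 1 AS ringSize
WHERE ringSize > 2
// Approximate the financial risk as the holder's aggregated credit exposure
OPTIONAL MATCH (holder)-[:HAS_CREDITCARD|:HAS_LOAN]->(product)
RETURN holder.accountHolderId AS accountHolderId,
       ringSize,
       sum(coalesce(product.limit, 0)) AS financialRisk
ORDER BY financialRisk DESC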

 

RDBMS-backed CRM system

The second data source is any CRM system backed by a relational database. Insert here the name of your preferred one, or whichever in-house solution your organisation is currently using.

The data in a CRM is less frequently updated and contains additional information about our account holders.

Data Virtualization

As I said before, data virtualization is a modern approach to data integration based on the idea of data on demand. A data virtualization platform wraps different types of data sources (relational, NoSQL, APIs, etc.) and makes them all look like relational views. These views can then be combined through standard relational algebra operations to produce rich derived (integrated) views that will ultimately be consumed by all sorts of BI, analytics and reporting tools or environments, as if they came from a single relational database.

The process of creating a virtual integrated view of a number of data sources can be broken down into three parts: 1) connecting to the sources and virtualizing the relevant elements in them to create base views, 2) combining the base views to create richer derived ones, and 3) publishing them for consumption by analytical and BI applications. Let’s describe each step in a bit more detail.

Connecting to the sources from the data virtualization layer and creating base views

The easiest way to interact with your Neo4j instance from a data virtualization platform is through the JDBC driver. The connection string and authentication details are pretty much all that’s needed, as we can see in the following screen capture.

[Screenshot: Neo4j JDBC data source connection settings in Data Virtuality]
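
For reference, the Neo4j JDBC connection URL is typically something along these lines (localhost and the default Bolt port are assumptions for a local instance; the exact scheme depends on the driver version you use):

jdbc:neo4j:bolt://localhost:7687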

Once the data source is created, we can easily define a virtual view on it based on our Cypher query, with a standard CREATE VIEW… statement in SQL. Notice the use of the ARRAYTABLE function to take the array structure returned by the request and produce a tabular output.

[Screenshot: CREATE VIEW statement using ARRAYTABLE over the Cypher query]
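
Roughly, the definition follows the pattern below. This is only a sketch: the source name (neo4j_source), the native() procedure used to push the Cypher down to the source, and the column names and types are assumptions for illustration; the exact syntax depends on how the source is registered in Data Virtuality.

-- Sketch: wrap the Cypher result (an array per row) into a relational view
CREATE VIEW views.fraudsters AS
SELECT f.*
FROM (CALL neo4j_source.native('MATCH ...fraud ring pattern... RETURN accountHolderId, ringSize, financialRisk')) r,
     ARRAYTABLE(r.tuple COLUMNS
         accountHolderId string,
         ringSize integer,
         financialRisk bigdecimal
     ) AS f;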

Once our fraudsters view is created, it can be queried just as if it were a relational one. The data virtualization layer takes care of the “translation”, because Neo4j, of course, talks Cypher and not SQL.

[Screenshot: SQL query on the virtual fraudsters view and its results]

If for whatever reason you want to hit Neo4j’s HTTP REST API directly, you can do that by creating a POST request on the Cypher transactional endpoint and building the JSON message containing the Cypher (you’ll find a description in Neo4j’s developer manual here). In Data Virtuality this can easily be done through a web service data import wizard.

You’ll need to provide the endpoint details, the type of messages exchanged and the structure of the request. The wizard then sends a test request to figure out what the returned structure looks like, offers you a visual, point-and-click way to select which values are relevant to your view, and even shows a preview of the results.
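
Under the hood this is just a POST to Neo4j’s transactional Cypher endpoint. Issued by hand it would look something like the following, assuming a local Neo4j 3.x instance with default ports and a placeholder Cypher statement:

curl -X POST http://localhost:7474/db/data/transaction/commit \
  -u neo4j:<password> \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"statements":[{"statement":"MATCH (h:AccountHolder) RETURN h.accountHolderId LIMIT 10"}]}'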

As in the previous JDBC-based case, we now have a virtual relational view built on our Cypher query that can be queried through SQL. Again, the DV platform takes care of translating it into an HTTP POST request behind the scenes.

[Screenshot: SQL query on the web-service-based virtual view]

Now let’s move on to the other data source, our CRM. Virtualizing relational data sources is pretty simple because they are already relational. So once we’ve configured the connection (identical to the previous case: server, port and authentication credentials), the DV layer can introspect the relational schema and do the work for us, offering up the tables and views it discovers.

[Screenshot: introspected tables and views from the CRM database]

So we create a view on customer details from the CRM. This view includes the global user ID that we will use to combine this table with the fraudster data coming from Neo4j.
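
As a sketch, and with completely made-up schema and column names (your CRM will have its own), the base view could be as simple as this:

-- Sketch: base view over the CRM's customer table; crm and its columns are hypothetical
CREATE VIEW views.crm_customers AS
SELECT customer_id,
       global_uid,   -- the global user ID shared with the fraud detection graph
       first_name,
       last_name,
       state
FROM crm.customers;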

Combining the data from the two sources

Since we now have two virtual relational views in our data virtualization layer, all we need to do is combine them using a straightforward SQL JOIN. This can be achieved visually:

[Screenshot: visual join editor in Data Virtuality]

…or by typing the SQL script directly:

[Screenshot: SQL script defining the fraudster360 view]
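
In terms of the hypothetical views sketched earlier, the script would be something along these lines:

-- Sketch: join the graph-derived fraudsters view with the CRM base view
CREATE VIEW views.fraudster360 AS
SELECT f.accountHolderId,
       f.ringSize,
       f.financialRisk,
       c.first_name,
       c.last_name,
       c.state
FROM views.fraudsters f
JOIN views.crm_customers c
  ON c.global_uid = f.accountHolderId;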

The result is a new fraudster360 view combining information from both our CRM system and the Neo4j-powered fraud detection platform. As in the previous cases, it is a relational view that can be queried and, most interestingly, exposed to consumer applications.

Important to note: there is no data movement at this point; data stays at the source. We are only defining a virtual view (metadata, if you want). Data will be retrieved on demand when a consumer application queries the virtual view, as we’ll see in the next section.

We can, however, test what will happen by running a test query from the Data Virtuality SQL editor. It is a simple projection on the fraudster360 view.
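
Something as simple as this, reusing the hypothetical column names from the sketches above:

SELECT accountHolderId, ringSize, financialRisk, state
FROM views.fraudster360
ORDER BY financialRisk DESC;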

[Screenshot: test query on the fraudster360 view in the SQL editor]

We can visualise the query execution plan to see that…

[Screenshot: query execution plan]

…the query on fraudster360 is broken down into two: one hitting Neo4j and the other hitting the relational database. The join is carried out on the fly and the results are streamed to the requester.

Even though I’m quite familiar with the data virtualization world, it was not my intention in this post to dive too deep into the capabilities of these platforms. It is probably worth mentioning, though, that the DV layer can be used to control access to your data in a centralised way by defining role-based access rules. Also, DV platforms are pretty good at coming up with execution plans that push down to the sources as much of the processing as possible, or, alternatively, at caching at the virtual level if the desired behaviour is precisely the opposite (i.e. protecting the sources from being hit by analytical/BI workloads).

But there is a lot more, so if you’re interested, ask the experts.

Exposing composite views

I’ll use Tableau for this example. Tableau can connect to the data virtualization server via ODBC. The virtual views created in Data Virtuality are listed, and all that needs to be done is to select our fraudster360 view and check that the data types are imported correctly.

[Screenshot: the fraudster360 view selected in Tableau via ODBC]

I’m obviously not a Tableau expert, but I managed to easily create a couple of charts and put them into a reasonably nice-looking dashboard. You can see it below: it shows how the potential fraud cases are distributed by state, how the size of a ring (a group of collaborating fraudsters) relates to the financial risk associated with it, and how these two factors are distributed regionally.

[Screenshot: fraud analysis dashboard in Tableau]

And the most interesting thing about this dashboard is that, since it is built on a virtual (non-materialised) view, whenever the dashboard is re-opened or refreshed, the data virtualization layer will query the underlying Neo4j graph for the most recent fraud rings and join them with the CRM data, so the dashboard is guaranteed to be built on the freshest version of the data from all sources.

Needless to say, if instead of Tableau you are a Qlik or Excel user, or you write R or Python code for data analysis, you will be able to consume the virtual view in exactly the same way (or a very similar one if you use JDBC instead of ODBC).

Well, that’s it for this first experiment. I hope you found it interesting.

Takeaways

Abstraction: Data virtualization is a very interesting way of exposing Cypher-based dynamic views on your Neo4j graph database to non-technical users, making it possible for them to take advantage of the value in the graph without necessarily having to write the queries themselves. They can consume the rich data in your graph DB through the standard BI products they feel comfortable with (Tableau, Excel, Qlik, etc.).

Integration: The graph DB is a key piece in your data architecture, but it will not hold all the information, and integration will be required sooner or later. Data virtualization proves to be a nicely agile approach to integrating your graph with other data sources, offering controlled, virtual, integrated datasets to business users and enabling self-service BI.

Interested in more ways in which Data Virtualization can integrate with Neo4j? Watch this space.

 

 


Neo4j is your RDF store (part 1)

If you want to understand the differences and similarities between RDF and the Labeled Property Graph implemented by Neo4j, I’d recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

Intro

Let me start with some basics: RDF is a standard for data exchange, but it does not impose any particular way of storing data.

What do I mean by that? I mean that data can be persisted in many ways: tables, documents, key-value pairs, property graphs, triple graphs… and still be published/exchanged as RDF.

It is true, though, that the bigger the paradigm impedance mismatch (the difference between RDF’s modelling paradigm, a graph, and the underlying store’s), the more complicated and inefficient the translation will be, for both ingestion and publishing.

I’ve been blogging over the last few months about how Neo4j can easily import RDF data, and in this post I’ll focus on the opposite: how a Neo4j graph can be published/exposed as RDF.

Because, in case you didn’t know, you can work with Neo4j, getting the benefits of native graph storage and processing (best performance, data integrity and scalability), while looking totally ‘open standards’ to the eyes of any RDF-aware application.

Oh! hang on… and your store will also be fully open source!

A “Turing-style” test of RDFness

In this first section I’ll show the simplest way in which data from a graph in Neo4j can be published as RDF, but I’ll also demonstrate that it is possible to import an RDF dataset into Neo4j without loss of information, in such a way that the RDF produced when querying Neo4j is identical to that produced by the original triple store.

[Diagram: the “RDFness” test, comparing the output of a triple store and Neo4j]

You’ll probably be familiar with the Turing test, where a human evaluator tests a machine’s ability to exhibit intelligent behaviour to the point where it’s indistinguishable from that of a human. Well, my test aims to prove Neo4j’s ability to exhibit “RDF behaviour” to an RDF-consuming application, making it indistinguishable from that of a triple store. To do this I’ll use the neosemantics Neo4j extension.

The simplest test one can think of could be something like this:

Starting from an RDF dataset living in a triple store, we migrate it (fully or partially) into Neo4j. Now, if we run a SPARQL DESCRIBE <uri> query on the triple store and its rdf/describe/uri equivalent in Neo4j, do they return the same set of triples? If that is the case (and if we also want to be pompous), we could say that the results are semantically equivalent, and therefore indistinguishable to a consumer application.

We are going to run this test step by step on data from the British National Bibliography dataset:

Get an RDF node description from the triple store

To do that, we’ll run the following SPARQL DESCRIBE query against the British National Bibliography public SPARQL endpoint, or alternatively in the more user-friendly SPARQL editor.

DESCRIBE <http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940>

The request returns an RDF fragment containing all the information about Mikhail Bulgakov in the BNB. A pretty cool author, by the way, whom I strongly recommend. The fragment actually contains 86 triples, the first of which are these:

<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/givenName> "Mikhail" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.w3.org/2000/01/rdf-schema#label> "Bulgakov, Mikhail, 1891-1940" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/familyName> "Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://xmlns.com/foaf/0.1/name> "Mikhail Bulgakov" .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/010535795> .
<http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940> <http://www.bl.uk/schemas/bibliographic/blterms#hasCreated> <http://bnb.data.bl.uk/id/resource/008720599> .
...

You can get the whole set by running the query in the SPARQL editor I mentioned before, or by sending an HTTP request with the query to the SPARQL endpoint:

curl -i http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E -H Accept:text/plain

OK, so that’s our baseline: exactly the output we want to get from Neo4j to be able to affirm that the two are indistinguishable to an RDF-consuming application.

Move the data from the triple store to Neo4j

We need to load the RDF data into Neo4j. We could load the whole British National Bibliography since it’s available for download as RDF, but for this example we are going to load just the portion of data that we need.

I will not go into the details of how this happens, as it’s been described in previous blog posts, with some examples. The semantics.importRDF procedure runs a straightforward and lossless import of RDF data into Neo4j. The procedure is part of the neosemantics extension. If you want to run the test with me on your Neo4j instance, now is the moment to install it (instructions in the README).

Once the extension is installed, the migration could not be simpler; just run the following stored procedure:

CALL semantics.importRDF("http://bnb.data.bl.uk/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fbnb.data.bl.uk%2Fid%2Fperson%2FBulgakovMikhail1891-1940%3E",
"RDF/XML",true,true,500)

We are passing as a parameter the URL of the BNB SPARQL endpoint returning the RDF data needed for our test, along with some import configuration options. The output of the execution shows that the 86 triples have been correctly imported into Neo4j:

[Screenshot: output of semantics.importRDF showing 86 triples loaded]

Now that the data is in Neo4j, you can query it with Cypher and visualise it in the browser. Here is an example query returning Bulgakov and all the nodes he’s connected to:

MATCH (a)-[b]-(c:Resource { uri: "http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940"})
RETURN *

[Screenshot: the Bulgakov node and its connections in the Neo4j browser]

There is actually not much information in the graph yet, just the node representing good old Mikhail with a few properties (name, uri, etc.) and connections to the works he created or contributed to, the events of his birth and death, and a couple more. But let’s not worry about size for now, we’ll deal with that later. The question was: can we now query our Neo4j graph and produce the original set of RDF triples? Let’s see.

Get an RDF description of the same node, now from Neo4j

The neosemantics repo also includes an extension (HTTP endpoints) that provides precisely this capability. The Neo4j equivalent of the SPARQL DESCRIBE on Mikhail Bulgakov would be the following:

:GET /rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940

If you run it in the browser, you will get the default serialisation, which is JSON-LD; something like this:

[Screenshot: JSON-LD response in the Neo4j browser]

But if you set the serialisation format of your choice in the request header, for example using curl again, you can get the RDF fragment in any of the available formats.

curl -i http://localhost:7474/rdf/describe/uri?nodeuri=http://bnb.data.bl.uk/id/person/BulgakovMikhail1891-1940 -H accept:text/plain

Well, you should not be surprised to learn that it returns 86 triples, exactly the same set that the original query on the triple store returned.

So mission accomplished. At least for the basic case.

RDF out of Neo4j’s movie database

I thought it would be interesting to prove that an RDF dataset can be imported into Neo4j and then published without loss of information, but OK, most of you may not care much about existing RDF datasets; that’s fair enough. You have a graph in Neo4j and you just want to publish it as RDF. This means that in your graph the nodes don’t necessarily have a property for the uri (why would they?), nor are they labelled as Resources. Not a problem.

OK, so if your graph is not the result of an RDF import, the service you want to use, instead of the uri-based one, is the node-ID-based equivalent.

:GET /rdf/describe/id?nodeid=<nodeid>

For this example we’ll use Neo4j’s movie database. You can get it loaded into your Neo4j instance by running:

:play movies

You can get the ID of a node either directly, by clicking on it in the browser, or by running a simple query like this one:

MATCH (x:Movie {title: "Unforgiven"}) 
RETURN ID(x)

In my Neo4j instance the returned ID is 97, so the GET request passes this ID and returns in the browser the JSON-LD serialisation of the node representing the movie “Unforgiven”, with its attributes and the set of nodes connected to it (both inbound and outbound connections):

[Screenshot: JSON-LD description of the “Unforgiven” node]

But as in the previous case, the endpoint can also produce your favourite serialisation, just by setting it in the Accept header of the request.

curl -i http://localhost:7474/rdf/describe/id?nodeid=97 -H accept:text/plain

When setting the serialisation to N-Triples format, the previous request gets you these triples:

<neo4j://indiv#97> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <neo4j://vocabulary#Movie> .
<neo4j://indiv#97> <neo4j://vocabulary#tagline> "It's a hell of a thing, killing a man" .
<neo4j://indiv#97> <neo4j://vocabulary#title> "Unforgiven" .
<neo4j://indiv#97> <neo4j://vocabulary#released> "1992"^^<http://www.w3.org/2001/XMLSchema#long> .
<neo4j://indiv#167> <neo4j://vocabulary#REVIEWED> <neo4j://indiv#97> .
<neo4j://indiv#89> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#DIRECTED> <neo4j://indiv#97> .
<neo4j://indiv#98> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .
<neo4j://indiv#99> <neo4j://vocabulary#ACTED_IN> <neo4j://indiv#97> .

The sharpest of you may notice when you run it that there is a bit missing: there are relationship properties in the movie database that are lost in the RDF fragment. Yes, that is because there is no way of expressing them in RDF. At least not without resorting to horribly complicated patterns like reification or the singleton property, which are effectively unusable in any practical real-world use case. But we’ll get to that too in future posts.

Takeaways

 

I guess the main one is that if you want to get the benefits of native graph storage and be able to query your graph with Cypher in Neo4j, but also want to:

  •  be able to easily import RDF data into your graph and/or
  •  offer an RDF “open standards compliant” API for publishing your graph

Well, that’s absolutely fine, because we’ve just seen how Neo4j does a great job at producing and consuming RDF.

Remember: RDF is about data exchange, not about storage.

There is more to come on producing RDF from Neo4j than what I’ve shown in this post, for instance publishing the results of a Cypher query as RDF. Does that sound interesting? Watch this space.

Also I’d love to hear your feedback!