This article is part of a series intended to show users how to use Memgraph on real-world data and, by doing so, retrieve some interesting and useful information.
We highly recommend checking out the other articles from this series which are listed in our tutorial overview section.
Getting useful information from your data is always challenging. With Memgraph, you can manipulate and extract a huge amount of information by writing queries. Memgraph even offers a set of built-in graph algorithms. You can use those algorithms in your queries, extending your power over the data. But what if you wanted to do more?
At its core, Memgraph "simply" builds a graph from your data. Graphs were always interesting to scientists and engineers because of their interesting properties allowing you to represent a specific kind of data in an intuitive way which makes extracting useful information much easier. A field called graph theory emerged in mathematics, producing a great number of algorithms, metrics, and other functions that are applied to the graphs.
Memgraph allows you to use the underlying graph in your functions by using
Query modules, and those functions are called procedures. In this tutorial,
we'll see how easy it is to extend the capabilities of Memgraph using Python.
It will also show you how to use one of the most popular Python libraries for graphs,
called NetworkX, which contains an insane amount
of functions and algorithms for the graphs.
Social graphs is a relatively recent term. Simply said, it's a representation of a social network. Social networks appear in various sites, e.g., Youtube, which is primarily a site for watching videos, can be classified as a social network. For this tutorial, we'll use data consisting of users of the music streaming platform called Deezer.
The data consists of around 50k Deezer users from Croatia, but we prepared a subset of the dataset only composed of 2k users. Each user is defined by id and a list of genres he loved. The edges represent the mutual friendship between the users. You can find a more detailed explanation of the dataset on the GitHub alongside many more similar datasets kindly provided by the same authors.
To import the dataset, download the Memgraph Lab
desktop application and navigate to the
Datasets tab in the sidebar. From there,
choose the dataset
Music genres social network and continue with the tutorial.
We need to set up the directory from which Memgraph will search for
custom query modules by changing the
--query-modules-directory flag in the main
/etc/memgraph/memgraph.conf) or by supplying it as a
command-line parameter using the directory containing our modules as the value.
For a more detailed explanation take a look at
Load and call query modules.
When using Memgraph installed from a DEB or RPM package, the currently running Memgraph server may need to be stopped. The user can do so using the following command:
When using Docker, the query module directory can be mounted with the following command:
First, let's create a Python script in our modules directory which will contain
our procedures. We'll name the script
After each change to the script, we need to tell Memgraph to reload all the modules. We can do that by calling the following command:
Let's count how many times a specific genre occurs in our network!
We will define a procedure called
Let's reload the modules and check the results:
We can notice multiple things:
- import of the
mgpmodule which contains helper functions and types for defining custom procedures
@mgp.read_procdecorator which marks the function as a procedure
- every argument is annotated with a type
- result is defined as an object of
mgp.Recordwhich also has annotated types of its members
This example is probably not that interesting to you because we can get the same result using the following query:
Let's try something more interesting. The genres are listed in the order the users have added them. If we assume that users picked the genres in order of preference, let's write a function that tells us in what percentage each genre appears in top n places.
Let's see what we got:
As said in the introduction, we want to use the power of the graphs to extract
some useful information from our data which would be otherwise impossible. Most
of those functions are complex and writing them from scratch would be tedious.
As every modern programmer, we'll just use a package that has everything we need
and more. To be precise, we'll be using
NetworkX that has a huge amount of
utility functions and graph algorithms implemented.
NetworkX algorithms we need to transform our graph to a type
recognizes. In our case, we need to use an undirected graph
make our lives easier, let's write a helper function that transforms Memgraph
And to get and display the data let's run the following command:
Another interesting property of a node in a graph is its centrality. Centrality tells us how important a node is for a graph. In our case, it would mean higher the centrality, the more popular the user is. Let's find out which user is the most popular in our network and take a peek at his/her music taste. We will use the betweenness centrality.
Now let's take a look at the results:
Calculating betweenness centrality for each node can take some time to
finish. The issue of slower
NetworkX implementations is something we at
Memgraph would like to address in the future. An example of this can be seen
in the next section of this tutorial.
For our last trick, let's try to locate communities inside our network. Communities are a set of nodes that are densely connected. The goal of the community detection algorithms can be nicely described with the next visualization:
As for centrality, there are multiple algorithms for finding communities in a graph. We will write a function that takes a method for calculating communities, uses it to find the communities, and, optionally, calculates some metrics specific to the graph partitioning so we can compare algorithms. To make things more interesting, let's find out which genre is the most popular in the community and return the percentage which tells us how many of the users have that genre on their list. In the end, music is something that connects us!
Now that we have our function in place let's test some algorithms. We will be checking out community detection using greedy modularity maximization by Clauset-Newman-Moore and label propagation.
In the above snippet, we can notice an optional argument
usage of the type
mgp.Map which is provided by Memgraph.
Let's see the results with:
Your results should look something like this:
Pop sure is popular. Let's ignore that genre:
and call our procedures again for each algorithm. Well, people always liked to dance!
We saw the biggest communities in our network using two different methods. It's
hard to say which of the methods did better so let's check a couple of metrics
by calling the same procedure with
calculate_quality set to true:
I think it should come as no surprise that an algorithm that maximizes modularity has higher modularity...
As noted before, we at Memgraph are aware of NetworkX's potential but the performance for some functions isn't that good. We decided to tackle this problem by writing a wrapper object for Memgraph's graph and with smarter usage of NetworkX algorithms. To make things even more convenient, we decided to implement procedures that call NetworkX functions with our graphs, so you have out-of-the-box access to the graph algorithms. NetworkX contains a huge amount of functions, and writing procedures for all of them require insane effort, so don't blame us if some of the algorithms aren't available. We are always open to any feedback, so if you think that an implementation for an algorithm is needed, we will surely take that into account.
To demonstrate performance improvement, we will calculate the betweenness centrality again, this time by using Memgraph's procedure.
To get access to the NetworkX procedures, start your Memgraph server without
modifying the query modules directory path. That way, the path will be set to
the default path, which contains
Now let's call the procedure:
You should get the same results as with our previous procedure for the betweenness centrality but in a much lower time.
We encourage you to take a look at our
How to for the modules at the
How to Implement Query Modules?.
This tutorial showed you how with a little effort you can extend your control
over the data. Using packages like
NetworkX you get a huge amount of already
implemented graph algorithms while Memgraph allows you complete access to its
If you are not a big fan of Python, don't worry! We have a C API with the exact same functionalities.