# Understanding music (with query modules)
This article is part of a series intended to show users how to use Memgraph on real-world data and, by doing so, retrieve some interesting and useful information.
We highly recommend checking out the other articles from this series, which are listed in our tutorial overview section.
## Introduction

Getting useful information from your data is always challenging. With Memgraph, you can manipulate and extract a huge amount of information by writing queries. Memgraph even offers a set of built-in graph algorithms. You can use those algorithms in your queries, extending your power over the data. But what if you wanted to do more?

At its core, Memgraph "simply" builds a graph from your data. Graphs have always interested scientists and engineers because their properties allow certain kinds of data to be represented in an intuitive way, which makes extracting useful information much easier. A whole field called graph theory emerged in mathematics, producing a great number of algorithms, metrics, and other functions that can be applied to graphs.
Memgraph allows you to use the underlying graph in your own functions through query modules, and those functions are called procedures. In this tutorial, we'll see how easy it is to extend the capabilities of Memgraph using Python. We'll also show you how to use one of the most popular Python libraries for graphs, NetworkX, which contains a vast number of functions and algorithms for working with graphs.
## Data model

Social graph is a relatively recent term. Simply put, it's a representation of a social network. Social networks appear on various sites; e.g., YouTube, which is primarily a site for watching videos, can also be classified as a social network. For this tutorial, we'll use data consisting of users of the music streaming platform Deezer.

The original dataset consists of around 50k Deezer users from Croatia, but we prepared a subset composed of only 2k users. Each user is defined by an id and a list of genres they like. The edges represent mutual friendships between users. You can find a more detailed explanation of the dataset on GitHub, alongside many more similar datasets kindly provided by the same authors.
## Importing the dataset

To import the dataset, download the Memgraph Lab desktop application and navigate to the Datasets tab in the sidebar. From there, choose the dataset Music genres social network and continue with the tutorial.
## Defining a directory with query modules

We need to set the directory in which Memgraph will search for custom query modules, either by changing the `--query-modules-directory` flag in the main configuration file (`/etc/memgraph/memgraph.conf`) or by supplying it as a command-line parameter, with the directory containing our modules as the value. For a more detailed explanation, take a look at Load and call query modules.
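For illustration, if our modules lived in a hypothetical `/home/user/modules` directory, the relevant line in the configuration file would look something like this:

```
--query-modules-directory=/home/user/modules
```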
When using Memgraph installed from a DEB or RPM package, you may need to stop the currently running Memgraph server first. You can do so with the following command:
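On most such installations, Memgraph runs as a systemd service, so the command would be something like:

```bash
sudo systemctl stop memgraph
```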
When using Docker, the query module directory can be mounted with the following command:
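A minimal sketch of such a command, assuming the modules live in a local `modules` directory (ports and image tag are illustrative):

```bash
docker run -it -p 7687:7687 \
  -v $(pwd)/modules:/usr/lib/memgraph/query_modules \
  memgraph/memgraph
```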
## Example queries and procedures

First, let's create a Python script in our modules directory that will contain our procedures. We'll name the script `deezer_example.py`.
After each change to the script, we need to tell Memgraph to reload all the modules. We can do that by calling the following command:
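Reloading every module can be done with the built-in `mg.load_all()` procedure:

```cypher
CALL mg.load_all();
```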
Let's count how many times a specific genre occurs in our network!
We will define a procedure called `genre_count`:
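A minimal sketch of such a procedure (the `genres` property name is an assumption based on the dataset description; the original implementation may differ):

```python
import mgp


@mgp.read_proc
def genre_count(context: mgp.ProcCtx,
                genre: str) -> mgp.Record(genre=str, count=int):
    '''Counts the users that have the given genre on their list.'''
    count = 0
    for vertex in context.graph.vertices:
        # The 'genres' property is assumed to hold a list of genre names.
        genres = vertex.properties.get('genres')
        if genres is not None and genre in genres:
            count += 1
    return mgp.Record(genre=genre, count=count)
```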
Let's reload the modules and check the results:
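Using the names from the sketch above, that could be:

```cypher
CALL mg.load_all();
CALL deezer_example.genre_count("Pop") YIELD *;
```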
We can notice multiple things:

- the import of the `mgp` module, which contains helper functions and types for defining custom procedures
- the `@mgp.read_proc` decorator, which marks the function as a procedure
- every argument is annotated with a type
- the result is defined as an object of `mgp.Record`, which also has annotated types for its members
This example is probably not that interesting to you because we can get the same result using the following query:
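Assuming the users are labeled `:User` and keep their genres in a `genres` list property, an equivalent query might be:

```cypher
MATCH (n:User)
WHERE "Pop" IN n.genres
RETURN "Pop" AS genre, count(n) AS count;
```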
Let's try something more interesting. The genres are listed in the order the users added them. If we assume that users picked the genres in order of preference, let's write a function that tells us in what percentage of users each genre appears among the top n places.
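One possible sketch of such a procedure (again, the `genres` property name and the procedure name are assumptions):

```python
import mgp


@mgp.read_proc
def in_top_n_percentage(context: mgp.ProcCtx,
                        genre: str,
                        n: int) -> mgp.Record(genre=str, percentage=float):
    '''Returns the percentage of users that list the given genre among
    their first n genres.'''
    total = 0
    in_top_n = 0
    for vertex in context.graph.vertices:
        genres = vertex.properties.get('genres')
        if not genres:
            continue
        total += 1
        # Genres are assumed to be stored in order of preference.
        if genre in genres[:n]:
            in_top_n += 1
    percentage = 100.0 * in_top_n / total if total > 0 else 0.0
    return mgp.Record(genre=genre, percentage=percentage)
```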
Let's see what we got:
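For example, checking how often a genre appears among each user's first three picks (names follow the sketch above):

```cypher
CALL deezer_example.in_top_n_percentage("Pop", 3) YIELD *;
```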
As said in the introduction, we want to use the power of graphs to extract some useful information from our data that would otherwise be impossible to get. Most of those functions are complex, and writing them from scratch would be tedious, so, like every modern programmer, we'll just use a package that has everything we need and more. To be precise, we'll be using NetworkX, which has a huge number of utility functions and graph algorithms implemented.
To use NetworkX algorithms, we need to transform our graph into a type NetworkX recognizes. In our case, that is the undirected graph `networkx.Graph`. To make our lives easier, let's write a helper function that transforms a Memgraph graph into a `networkx.Graph`.
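Here is a minimal sketch of such a helper. It uses vertex ids as node identifiers so the procedures below can map results back to Memgraph's graph with `get_vertex_by_id()`:

```python
import mgp
import networkx as nx


def create_networkx_graph(graph: mgp.Graph) -> nx.Graph:
    '''Builds an undirected networkx.Graph from Memgraph's graph object,
    using vertex ids as node identifiers.'''
    g = nx.Graph()
    g.add_nodes_from(vertex.id for vertex in graph.vertices)
    # Every edge appears exactly once among the out-edges of its source node.
    g.add_edges_from((edge.from_vertex.id, edge.to_vertex.id)
                     for vertex in graph.vertices
                     for edge in vertex.out_edges)
    return g
```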
Now let's get some information about the graph. As our data represents a social network, we would like to know whether it has bridges and to calculate the average clustering coefficient.
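A sketch of such a procedure, assuming the `create_networkx_graph()` helper from above lives in the same module file:

```python
import mgp
import networkx as nx

# Assumes create_networkx_graph() from the helper sketch above.


@mgp.read_proc
def analyze_graph(
        context: mgp.ProcCtx) -> mgp.Record(has_bridges=bool,
                                            average_clustering=float):
    '''Checks whether the graph has bridges and computes the average
    clustering coefficient.'''
    g = create_networkx_graph(context.graph)
    return mgp.Record(has_bridges=nx.has_bridges(g),
                      average_clustering=nx.average_clustering(g))
```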
To get and display the data, let's run the following command:
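Using the procedure name from the sketch above:

```cypher
CALL deezer_example.analyze_graph() YIELD *;
```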
Another interesting property of a node in a graph is its centrality. Centrality tells us how important a node is for the graph. In our case, the higher the centrality, the more popular the user. Let's find out which user is the most popular in our network and take a peek at their music taste. We will use betweenness centrality.
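A possible sketch of the procedure, again reusing the conversion helper:

```python
import mgp
import networkx as nx

# Assumes create_networkx_graph() from the helper sketch above.


@mgp.read_proc
def betweenness_centrality(
        context: mgp.ProcCtx) -> mgp.Record(user=mgp.Vertex,
                                            centrality=float):
    '''Returns the node with the highest betweenness centrality.'''
    g = create_networkx_graph(context.graph)
    centralities = nx.betweenness_centrality(g)
    best_id, best_value = max(centralities.items(), key=lambda kv: kv[1])
    return mgp.Record(user=context.graph.get_vertex_by_id(best_id),
                      centrality=best_value)
```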
Now let's take a look at the results:
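The `genres` property name is again an assumption:

```cypher
CALL deezer_example.betweenness_centrality() YIELD user, centrality
RETURN user.genres AS genres, centrality;
```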
NOTE
Calculating betweenness centrality for each node can take some time to
finish. The issue of slower NetworkX
implementations is something we at
Memgraph would like to address in the future. An example of this can be seen
in the next section of this tutorial.
For our last trick, let's try to locate communities inside our network. Communities are sets of densely connected nodes. The goal of community detection algorithms can be nicely described with the following visualization:
As with centrality, there are multiple algorithms for finding communities in a graph. We will write a function that takes a method for calculating communities, uses it to find them, and, optionally, calculates some metrics specific to the resulting graph partitioning so we can compare algorithms. To make things more interesting, let's also find out which genre is the most popular in each community and return the percentage of its users that have that genre on their list. In the end, music is something that connects us!
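This is only one possible sketch of such a function (the exact original implementation may differ, and the `genres` property is again an assumption). For each of the biggest communities it reports the size, the most popular genre, and the percentage of members listing it, and it optionally computes the modularity of the whole partition:

```python
import mgp
from networkx.algorithms.community.quality import modularity

# Assumes create_networkx_graph() from the helper sketch above.


def analyze_communities(context: mgp.ProcCtx, method, calculate_quality,
                        top_k=5):
    '''Detects communities with the given method and describes the top_k
    biggest ones by their most popular genre.'''
    g = create_networkx_graph(context.graph)
    communities = sorted((set(c) for c in method(g)), key=len, reverse=True)

    community_info = []
    for comm in communities[:top_k]:
        genre_count = {}
        for node_id in comm:
            vertex = context.graph.get_vertex_by_id(node_id)
            for genre in (vertex.properties.get('genres') or []):
                genre_count[genre] = genre_count.get(genre, 0) + 1
        if not genre_count:
            continue
        top_genre, count = max(genre_count.items(), key=lambda kv: kv[1])
        community_info.append({'size': len(comm),
                               'top_genre': top_genre,
                               'percentage': 100.0 * count / len(comm)})

    quality = {}
    if calculate_quality:
        quality['modularity'] = modularity(g, communities)

    return community_info, quality
```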
Now that we have our function in place let's test some algorithms. We will be checking out community detection using greedy modularity maximization by Clauset-Newman-Moore and label propagation.
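A sketch of the two procedures, wired to NetworkX's `greedy_modularity_communities` and `label_propagation_communities` and built on the helper above:

```python
import mgp
from networkx.algorithms import community

# Assumes analyze_communities() from the sketch above.


@mgp.read_proc
def greedy_modularity_communities(
        context: mgp.ProcCtx,
        calculate_quality: bool = False
) -> mgp.Record(communities=mgp.List[mgp.Map], quality=mgp.Map):
    '''Communities found by Clauset-Newman-Moore greedy modularity maximization.'''
    communities, quality = analyze_communities(
        context, community.greedy_modularity_communities, calculate_quality)
    return mgp.Record(communities=communities, quality=quality)


@mgp.read_proc
def label_propagation_communities(
        context: mgp.ProcCtx,
        calculate_quality: bool = False
) -> mgp.Record(communities=mgp.List[mgp.Map], quality=mgp.Map):
    '''Communities found by label propagation.'''
    communities, quality = analyze_communities(
        context, community.label_propagation_communities, calculate_quality)
    return mgp.Record(communities=communities, quality=quality)
```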
In the above snippet, we can notice an optional argument `calculate_quality` and the usage of the type `mgp.Map`, which is provided by Memgraph.
Let's see the results of both procedures:
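Using the procedure names from the sketches above:

```cypher
CALL deezer_example.greedy_modularity_communities() YIELD *;
CALL deezer_example.label_propagation_communities() YIELD *;
```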
Your results should look something like this:
Hmm, `Pop` sure is popular. Let's ignore that genre:
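One way to do that, assuming the `:User` label and the `genres` list property from before, is to simply remove it from every user's list:

```cypher
MATCH (n:User)
SET n.genres = [genre IN n.genres WHERE genre != "Pop"];
```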
and call our procedures again for each algorithm. Well, people have always liked to dance!
We saw the biggest communities in our network using two different methods. It's hard to say which of the methods did better, so let's check a couple of metrics by calling the same procedures with `calculate_quality` set to true:
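Again with the names from the sketches above:

```cypher
CALL deezer_example.greedy_modularity_communities(true) YIELD *;
CALL deezer_example.label_propagation_communities(true) YIELD *;
```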
I think it should come as no surprise that an algorithm that maximizes modularity has higher modularity...
## Optimized NetworkX integration

As noted before, we at Memgraph are aware of NetworkX's potential, but the performance of some of its functions isn't that good. We decided to tackle this problem by writing a wrapper object for Memgraph's graph and with smarter usage of NetworkX algorithms. To make things even more convenient, we implemented procedures that call NetworkX functions with our graphs, so you have out-of-the-box access to those graph algorithms. NetworkX contains a huge number of functions, and writing procedures for all of them would require an enormous effort, so don't blame us if some of the algorithms aren't available. We are always open to feedback, so if you think an implementation of some algorithm is needed, we will surely take that into account.
To demonstrate performance improvement, we will calculate the betweenness centrality again, this time by using Memgraph's procedure.
To get access to the NetworkX procedures, start your Memgraph server without modifying the query modules directory path. That way, the path will be set to the default one, which contains the `nxalg` module.
Now let's call the procedure:
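The `nxalg` module exposes a betweenness centrality procedure; the call should look something like this (the exact signature and yielded fields may differ between versions):

```cypher
CALL nxalg.betweenness_centrality() YIELD *;
```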
You should get the same results as with our previous procedure for betweenness centrality, but in much less time.
## Further reading

We encourage you to take a look at our how-to guide for query modules: How to Implement Query Modules?
This tutorial showed you how, with a little effort, you can extend your control over your data. Using packages like NetworkX gives you access to a huge number of already implemented graph algorithms, while Memgraph allows you complete access to its internal graph.
If you are not a big fan of Python, don't worry! We have a C API with the exact same functionalities.