Hands-On Graph Analytics with Neo4j

Measuring performance and tuning your query for speed

To measure the performance of a Cypher query, we have to look at the Cypher query planner, which details the operations performed under the hood. This section shows how to access the Cypher execution plan, covers some good practices for avoiding the most expensive operations, and concludes with a well-known example.

Cypher query planner

As you would do with SQL, you can check the Cypher query planner to understand what happens under the hood and how to improve your query. Two options are possible:

  • EXPLAIN: Shows the execution plan without running the query, so it never makes any changes to your graph.
  • PROFILE: Actually runs the query, so any updates it contains are applied to your graph, and measures the performance of each operation.

In the rest of this chapter, we will use a dataset released by Facebook in 2012 for a recruiting competition hosted by Kaggle. The dataset can be downloaded here: https://www.kaggle.com/c/FacebookRecruiting/data. I have only used the training sample, containing a list of connections between anonymized people. It contains 1,867,425 nodes and 9,437,519 edges.

We already talked about one of the operations that can be identified in the query planner: Eager operations, which we need to avoid as much as possible since they really hurt performance. Let's see some more operators and how to tune our queries for performance.

A simple query that selects a node with a given id and profiles its execution could be written as follows:

PROFILE
MATCH (p { id: 1000})
RETURN p

When executing this query, a new tab called Plan is available in the result cell, shown in the following screenshot:

The query profile shows the use of the AllNodesScan operator, which performs a search on ALL nodes of the graph. In this specific case, this won't have a big impact since we have only one node label, Person. But if your graph happens to have many different labels, performing a scan on all nodes can be horribly slow. For this reason, it is highly recommended to explicitly set the node labels and relationship types of interest in our queries:

PROFILE
MATCH (p:Person { id: 1000})
RETURN p

In that case, Cypher uses the NodeByLabelScan operation as can be seen in the following screenshot:

In terms of performance, this query is executed in approximately 650 ms on my laptop, in both cases. In some cases, performance can be increased even more thanks to Neo4j indexing.
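The same inspection can be done outside Neo4j Browser. The following is a minimal sketch that walks a plan represented as nested dictionaries and reports the operators that usually deserve a second look. The dictionary shape (an `operatorType` key, a `children` list, and an optional `@database` suffix on the operator name) is an assumption modeled on what recent versions of the Neo4j Python driver expose through the result summary. The `SUSPECT_OPERATORS` set and `sample_plan` are hypothetical and written by hand, so no database is needed to run it.

```python
# Operators this chapter singles out as potentially expensive (illustrative list).
SUSPECT_OPERATORS = {"AllNodesScan", "Eager"}


def operator_names(plan):
    """Return the operator names of a plan tree, root first (depth-first).

    `plan` is assumed to be a dict with an "operatorType" key and an
    optional "children" list holding dicts of the same shape.
    """
    # Drop an optional "@<database>" suffix, e.g. "AllNodesScan@neo4j".
    names = [plan["operatorType"].split("@")[0]]
    for child in plan.get("children", []):
        names.extend(operator_names(child))
    return names


def suspect_operators(plan):
    """Return the operators of `plan` that appear in SUSPECT_OPERATORS."""
    return [name for name in operator_names(plan) if name in SUSPECT_OPERATORS]


# Hand-written stand-in for the plan of `MATCH (p {id: 1000}) RETURN p`.
sample_plan = {
    "operatorType": "ProduceResults@neo4j",
    "children": [
        {
            "operatorType": "Filter@neo4j",
            "children": [{"operatorType": "AllNodesScan@neo4j", "children": []}],
        }
    ],
}

print(suspect_operators(sample_plan))  # ['AllNodesScan']
```

With a real plan, the same function would be called on the plan returned for a query prefixed with EXPLAIN or PROFILE.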

Neo4j indexing

Neo4j indexes are used to easily find the start node of a pattern matching query. Let's see the impact of creating an index on the execution plan and execution time:

CREATE INDEX ON :Person(id)

And let's run our query again:

PROFILE
MATCH (p:Person { id: 1000})
RETURN p

You can see that the query is now using our index through the NodeIndexSeek operation, which reduces the execution time to 1 ms:

An index can also be dropped with the following statement:

DROP INDEX ON :Person(id)

The Neo4j indexing system also supports combined indexes and full-text indexes. Check https://neo4j.com/docs/cypher-manual/current/schema/index/ for more information.
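The CREATE INDEX ON syntax used in this section is the Neo4j 3.x form. Neo4j 4.0 introduced the CREATE INDEX ... FOR ... ON form, and the older one was removed in Neo4j 5. As a small sketch, the hypothetical helper below (`create_index_statement` is not part of any Neo4j library) builds the statement in either form, so that a script can target both generations of the server.

```python
def create_index_statement(label, prop, legacy=False):
    """Build a Cypher statement creating a single-property index.

    legacy=True  -> syntax used in this section (Neo4j 3.x).
    legacy=False -> syntax introduced in Neo4j 4.0.
    """
    if legacy:
        return f"CREATE INDEX ON :{label}({prop})"
    return f"CREATE INDEX FOR (n:{label}) ON (n.{prop})"


print(create_index_statement("Person", "id", legacy=True))
print(create_index_statement("Person", "id"))
```

The label and property names are interpolated directly into the statement, so this helper should only be fed trusted, hard-coded names.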

Back to LOAD CSV

Remember that we talked about the Eager operator earlier in this chapter, when we imported US states with the following LOAD CSV statement:

LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" AS row FIELDTERMINATOR ';'
MERGE (n:State {code: row.code})
MERGE (m:State {code: row.neighbor_code})
MERGE (n)-[:SHARE_BORDER_WITH]->(m)

To understand this query better and identify the root cause of the warning message it triggers, we can ask Neo4j to EXPLAIN it. We then get a complex diagram like the one displayed here:

I have highlighted three elements for you:

  • The violet part corresponds to the first MERGE statement.
  • The green part contains the same operations for the second MERGE statement.
  • The red box is the Eager operation.

From this diagram, you can see that the Eager operation is performed between step 1 (the first MERGE) and step 2 (the second MERGE). This is where your query needs to be split in order to avoid this operator.
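One common way to perform this split is to load the file in three passes: one per MERGE on nodes, then one that only matches the existing nodes and merges the relationship. The sketch below keeps the three statements as Python strings so that they can be sent one after the other by a driver or pasted into Neo4j Browser. The names `LOAD_CLAUSE` and `SPLIT_QUERIES` are illustrative, and whether Eager actually disappears should be confirmed by running EXPLAIN on each statement.

```python
# Clause shared by the three passes (same file and separator as above).
LOAD_CLAUSE = (
    'LOAD CSV WITH HEADERS FROM "file:///usa_state_neighbors_edges.csv" '
    "AS row FIELDTERMINATOR ';'"
)

SPLIT_QUERIES = [
    # Pass 1: create the states found in the `code` column.
    LOAD_CLAUSE + "\nMERGE (:State {code: row.code})",
    # Pass 2: create the states found in the `neighbor_code` column.
    LOAD_CLAUSE + "\nMERGE (:State {code: row.neighbor_code})",
    # Pass 3: both endpoints exist now, so they are matched, not merged.
    LOAD_CLAUSE
    + "\nMATCH (n:State {code: row.code})"
    + "\nMATCH (m:State {code: row.neighbor_code})"
    + "\nMERGE (n)-[:SHARE_BORDER_WITH]->(m)",
]

for query in SPLIT_QUERIES:
    print(query, end="\n\n")
```

Each statement now contains a single MERGE, at the cost of reading the CSV file three times.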

You now know more about how to understand the operations Cypher performs when executing your query and how to identify and fix bottlenecks. It is time to actually measure query performance in terms of time. For this, we are going to use the famous friend-of-friend example in a social network.

The friend-of-friend example

The friend-of-friend example is the most famous argument in favor of Neo4j when talking about performance. Since Neo4j is known to be incredibly performant at traversing relationships, contrary to other database engines, we expect the response time of this query to be quite low.

Neo4j Browser displays the query execution time in the result cell:

It can also be measured programmatically. For instance, using the Neo4j Python driver from the neo4j package, we can measure the total execution and streaming time with the following:

from neo4j import GraphDatabase

URL = "bolt://localhost:7687"
USER = "neo4j"
PWD = "neo4j"

driver = GraphDatabase.driver(URL, auth=(USER, PWD))

query = "MATCH (a:Person {id: 203749})-[:IS_FRIEND_WITH]-(b:Person) RETURN count(b.id)"

with driver.session() as session:
    with session.begin_transaction() as tx:
        result = tx.run(query)
        # summary() is the 1.x driver call; 4.0+ drivers use result.consume() instead
        summary = result.summary()
        avail = summary.result_available_after  # ms for the server to have the result available
        cons = summary.result_consumed_after  # ms for the server to consume the result
        total_time = avail + cons

With that code, we were able to measure the total time of execution for different starting nodes, with different degrees, and different depths (first-degree friends, second degree... up to the fourth degree).
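The queries used for the deeper levels are not listed here. As an assumption about how they could be written, the sketch below builds one query per depth with a variable-length pattern and a `$id` parameter for the starting node. The function name `friends_query` is hypothetical, and counting distinct people within 1 to `depth` hops is only one possible reading of "friends at depth n". The depth is interpolated into the text because Cypher does not accept a parameter as a path-length bound.

```python
def friends_query(depth):
    """Return a Cypher query counting the distinct people reachable
    from the person with id `$id` in 1 to `depth` IS_FRIEND_WITH hops."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    return (
        "MATCH (a:Person {id: $id})"
        f"-[:IS_FRIEND_WITH*1..{depth}]-(b:Person) "
        "RETURN count(DISTINCT b)"
    )


# One query per depth, from first-degree to fourth-degree friends.
for depth in range(1, 5):
    print(friends_query(depth))
```

Each query can then be passed to `tx.run` together with the starting node, for example `tx.run(friends_query(2), id=203749)`, and timed as in the listing above.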

The following figure shows the results. As you can see, the amount of time before the results are made available is below 1 ms for all depth-1 queries, independently of the number of first-degree neighbors of the node:

The time Neo4j needs to get the results increases with the depth of the query, as expected. However, the number of friends of the initial node only makes a large difference to the response time when that number is high. When starting from the node with 100 friends at depth 4, almost 450,000 matching nodes are identified in approximately 1 minute.

This benchmark was performed without any changes to the Neo4j Community Edition default configuration. Some gains can be expected from tuning some of its parameters, such as the maximum heap size.

More information about these configurations will be given in Chapter 12, Neo4j at Scale.