Pages

Neo4j - Cypher Query Language

 
Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of the graph store.It is designed to be a humane query language, suitable for both developers and operations professionals.

- clarity on what to retrieve from a graph, not on how to retrieve it.
- constructs are based on English prose and neat iconography.
- Design goal: make simple things easy, and the complex things possible.

  1. Introduction
    • Structure
      • Borrows structure from SQL - queries are built using various clauses.
      • Clauses are chained together, and the they feed intermediate result sets between each other.
      • query which finds a user called John and John’s friends
      • MATCH (john {name: 'John'})-[:friend]->()-[:friend]->(fof)
        RETURN john.name, fof.name
    • Read and Update
      • Cypher query part can’t both match and update the graph at the same time.
      • If query only performs read, cypher will be lazy and not actually match the pattern until results are asked.
      • In update queries, all reading will be done before a write actually happens.
      • WITH statement - barrier between a plan and the finished execution of that plan. Here we specify how we want to aggregate data before filtering happens.
      • MATCH (n {name: 'John'})-[:FRIEND]-(friend)
        WITH n, count(friend) as friendsCount
        WHERE friendsCount > 3
        RETURN n, friendsCount
      • MATCH (n {name: 'John'})-[:FRIEND]-(friend)
        WITH n, count(friend) as friendsCount
        SET n.friendCount = friendsCount
        RETURN n.friendsCount
    • Returning Data
      • RETURN clause - Period symbol at the end of the query.
      • Has three sub-clauses - SKIP/LIMIT and ORDER BY
    • Transactions
      • Any query that updates the graph, runs in a transaction.
      • Existing transaction in running context is used. If absent, new one is created.
      • updating query will always either fully succeed, or not succeed at all.
    • Uniqueness
      • While matching pattern, Neo4j makes sure to not include matches where the same graph relationship is found multiple times in a single pattern.
      • Example: looking for a user’s friends of friends should not return said user. It happens in Facebook.
      • MATCH (user:User { name: 'Adam' })-[r1:FRIEND]-()-[r2:FRIEND]-(friend_of_a_friend)
        RETURN friend_of_a_friend.name AS fofName
      • If it is desired that we want to have the user name returned as well, we can spread matching over multiple MATCH  clauses.
      • MATCH (user:User { name: 'Adam' })-[r1:FRIEND]-(friend)
        MATCH (friend)-[r2:FRIEND]-(friend_of_a_friend)
        RETURN friend_of_a_friend.name AS fofName
    • Parameters
      • Cypher supports querying with parameters. It works with use of { }.
      • Parameters can be used for:
        • Literals and Expressions in WHERE clause. WHERE n.name = { name }
        • Start clause. 
        • index queries and
        • Node/relationship ids.
      • Examples:
        • WHERE n.name =~ { regex }
        • WHERE n.name STARTS WITH { name }
        • CREATE ({ props })
        • SET n = { props }
        • SKIP { s } LIMIT { l }
  2. Syntax
    • Values: Supported values are all distinct and are as under:
      • Numeric
      • String
      • Boolean
      • Nodes
      • Relationship
      • Paths
      • Maps from Strings to other values
      • Collection of any other type of value.
    • Expressions
      • Escape sequences.
      • Case Expressions
        • Simple case
        • CASE test
          WHEN value THEN result
          [WHEN ...]
          [ELSE default]
          END
        • Generic case
        • CASE
          WHEN predicate THEN result
          [WHEN ...]
          [ELSE default]
          END
    • Identifiers
      • When we reference parts of a pattern or a query, we name them. These names are called identifiers.
      • If multiple query parts are chained together using WITH, identifiers have to be listed in the WITH clause to be carried over to the next part.
    • Operators
      • Mathematical: +, -, *, / and %, ^
      • Comparison: =, <>, <, >, <=, >=, IS NULL, IS NOT NULL, STARTS WITH, ENDS WITH and CONTAINS
      • Boolean: AND, OR, XOR, NOT
      • String: +, =~ (regular expression matching)
      • Collections: + (Concatenation)
    • Comments: //this is a comment
    • Patterns: Describes the shape of data we want to fetch.
      • Pattern for nodes: e.g. (a)
      • Pattern for related nodes: e.g. (a)-->(b), (a)-->(b)<--(c), (a)-->()<--(c)
      • Labels: e.g. (a:User:Admin)-->(b) 
      • Specifying properties:  e.g. (a { name: "Andres", sport: "Brazilian Ju-Jitsu" })
      • Describing relationship: e.g. (a)-[r:TYPE1|TYPE2]->(b)
      • Path:  e.g. p = (a)-[*3..5]->(b)
    • Collections: created by using brackets and separating the elements with commas
      • RETURN range(0,10)[3]         //returns 3
      • RETURN range(0,10)[-3]       //returns 8
      • RETURN range(0,10)[0..3]     //returns [0,1,2]
      • RETURN range(0,10)[0..-5]    //returns [0,1,2,3,4,5]  
      • RETURN range(0,10)[-5..]
      • RETURN range(0,10)[..4]
    • Working with NULL: a missing unknown value
      • expression NULL = NULL yields NULL and not TRUE
      • Logical operation with NULL: treated as unknown. 
      • IN operator: NULL IN [] is FALSE. Otherwise we always get NULL
  3. General Clauses
    • RETURN
      • return node, return relationship
      • return all elements (RETURN *)
      • Column alias (RETURN a.age AS age)
      • Optional properties (RETURN a.age)
      • Other expressions  e.g. RETURN a.age > 30, "I'm a literal",(a)-->()
      • Unique results e.g. RETURN DISTINCT b
    • ORDER BY: sub-clause following RETURN or WITH
    • LIMIT: e.g. LIMIT 3, LIMIT toInt(3 * rand())+ 1
    • SKIP: e.g. SKIP 3, SKIP 1 LIMIT 2, SKIP toInt(3*rand())+ 1
    • WITH: manipulate the output before it is passed on to the following query parts
      • Filter using ORDER BY, LIMIT and aggregate functions
      • MATCH (david { name: "David" })--(otherPerson)-->()
        WITH otherPerson, count(*) AS foaf
        WHERE foaf > 1
        RETURN otherPerson
      • MATCH (n)
        WITH n
        ORDER BY n.name DESC LIMIT 3
        RETURN collect(n.name)
    • UNWIND: expands a collection into a sequence of rows.
      • UNWIND[1,2,3] AS x
        RETURN x
    • UNION: combines the result of multiple queries.
      • Allow duplicates
      • MATCH (n:Actor)
        RETURN n.name AS name
        UNION ALL MATCH (n:Movie)
        RETURN n.title AS name
      • Do not allow duplicates
      • MATCH (n:Actor)
        RETURN n.name AS name
        UNION
        MATCH (n:Movie)
        RETURN n.title AS name
    • USING: influences the decisions of the planner when building an execution plan for a query
      • Query using one or more index hints
      • MATCH (m:German)-->(n:Swede)
        USING INDEX m:German(surname)
        USING INDEX n:Swede(surname)
        WHERE m.surname = 'Plantikow' AND n.surname = 'Taylor'
        RETURN m
      • Label scan
      • MATCH (m:German)
        USING SCAN m:German
        WHERE m.surname = 'Plantikow'
        RETURN m
  4. Reading Clauses
    • MATCH: primary way of getting data into the current set of bindings
      • -- is used without regard for relationship.
      • --> mentions that relationship is interesting.
    • OPTIONAL MATCH
      • searches for the pattern described in it, while using NULLs for missing parts of the pattern.
      • e.g. OPTIONAL MATCH (a)-->(x), OPTIONAL MATCH (a)-[r:ACTS_IN]->()
    • WHERE
      • adds constraints to the patterns in a MATCH or OPTIONAL MATCH clause or filters the results of a WITH clause.
    • START: Find starting points through legacy indexes.
      • START n=node:nodes(name = "A")
        RETURN n
    • Aggregation
      • COUNT: count the number of rows
        • count nodes
        • Group count relationship types
        • count entities
        • count not null values
      • statistics
        • sum
        • average
        • percentileDisc: e.g.  RETURN percentileDisc(n.property, 0.5) //0.5 = median
        • percentileCont
        • stdev
        • max
        • min
      • collect: collects all the values into a list. It will ignore NULLs.
      • DISTINCT
  5. Writing Clauses
    • CREATE
    • MERGE: use of ON CREATE and ON MATCH
    • SET: update labels on nodes or properties on nodes and relationships.
    • DELETE - deletes node or relationship.
    • REMOVE - removes a property or label.
    • FOREACH -  Iterates the nodes
      • MATCH p =(begin)-[*]->(END )
        WHERE begin.name='A' AND END .name='D'
        FOREACH (n IN nodes(p)| SET n.marked = TRUE )
    • CREATE UNIQUE:  mix of MATCH and CREATE — it will match what it can, and create what is missing.
    • PERIODIC COMMIT: USING PERIODIC COMMIT 500
  6. Functions
    • Predicates: boolean functions that return true or false for a given set of input.
      • ALL: syntax - ALL(identifier in collection WHERE predicate)
      • ANY: syntax - ANY(identifier in collection WHERE predicate)
      • NONE: syntax - NONE(identifier in collection WHERE predicate)
      • SINGLE: syntax - SINGLE(identifier in collection WHERE predicate)
      • EXISTS: syntax - EXISTS( pattern-or-property )
    • Scalar functions
      • TYPE: returns the string representation of relationship type. TYPE( relationship )
      • COALESCE: returns first non-NULL values in the list of expressions passed to it.
        • COALESCE( expression [, expression]* )
      • STARTNODE: returns the start node of a relationship
      • MATCH (x:foo)-[r]-()
        RETURN startNode(r)
      • ENDNODE: returns the end node of a relationship
      • others: SIZE, LENGTH, ID, HEAD, LAST, TIMESTAMP, TOINT, TOFLOAT 
    • Collection functions
      • NODES: returns all nodes in the path
      • MATCH p=(a)-->(b)-->(c)
        WHERE a.name='Alice' AND c.name='Eskil'
        RETURN nodes(p)
      • RELATIONSHIPS: returns all relationships in the path
      • MATCH p=(a)-->(b)-->(c)
        WHERE a.name='Alice' AND c.name='Eskil'
        RETURN relationships(p)
      • KEYS: returns a collection of propety names attached to a node/relationship
      • MATCH (a)
        WHERE a.name='Alice'
        RETURN keys(a)
      • LABELS: returns a collection of labels attached to a node
      • MATCH (a)
        WHERE a.name='Alice'
        RETURN labels(a)
      • EXTRACT: returns a collection of properties
        • syntax:EXTRACT( identifier in collection | expression )
        • MATCH p=(a)-->(b)-->(c)
          WHERE a.name='Alice' AND b.name='Bob' AND c.name='Daniel'
          RETURN extract(n IN nodes(p)| n.age) AS extracted
      • FILTER:  returns all the elements in a collection that comply to a predicate
        • syntax: FILTER(identifier in collection WHERE predicate)
        • MATCH (a)
          WHERE a.name='Eskil'
          RETURN a.array, filter(x IN a.array WHERE size(x)= 3)
      • TAIL: returns all but the first element in a collection.
      • RANGE: returns numerical values in range with a non zero step value.
        • syntax: RANGE( start, end [, step] )
        • RETURN range(0,10), range(2,18,3)
      • REDUCE: case of an accumulator
        • syntax: REDUCE( accumulator = initial, identifier in collection | expression )
        • MATCH p=(a)-->(b)-->(c)
          WHERE a.name='Alice' AND b.name='Bob' AND c.name='Daniel'
          RETURN reduce(totalAge = 0, n IN nodes(p)| totalAge + n.age) AS reduction
    • Mathematical functions:operate on numerical expressions only, and will return an error if used on any other values
    • String functions: operate on string expressions only, and will return an error if used on any other values. TOSTRING() accepts numbers as well.
  7. Schema
    • Neo4j 2.0 introduced an optional schema for the graph, based around the concept of labels.
    • Labels are used in the specification of indexes, and for defining constraints on the graph.
    • Together, indexes and constraints are the schema of the graph.
    • Indexes
      • Create an index
      • CREATE INDEX ON :Person(name)
      • Drop an index
      • DROP INDEX ON :Person(name)
      • Use an index
        • In WHERE clause with equality and also with inequality
        • MATCH (person:Person)
          WHERE person.name > 'B'
          RETURN person
        • with IN
        • MATCH (person:Person)
          WHERE person.name IN ['Andres', 'Mark']
          RETURN person
        • with STARTS WITH
        • MATCH (person:Person)
          WHERE person.name STARTS WITH 'And'
          RETURN person
        • while checking for the existence of a property
        • MATCH (p:Person)
          WHERE HAS (p.name)
          RETURN p
    • Constraints
      • enforces data integrity with the use of constraints.
      • Constraints can be applied to either nodes or relationships.
      • Unique Node property constraints
        • CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
        • DROP CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
      • Node Property Existence constraint
        • CREATE CONSTRAINT ON (book:Book) ASSERT exists(book.isbn)
        • DROP CONSTRAINT ON (book:Book) ASSERT exists(book.isbn)
      • Relationship property existence
        • CREATE CONSTRAINT ON ()-[like:LIKED]-() ASSERT exists(like.day)
    • Statistics
      • statistical information that Neo4j keeps is:
        • The number of nodes with a certain label.
        • Selectivity per index.
        • The number of relationships by type.
        • The number of relationships by type, ending or starting from a node with a specific label.
      • We can configure some options and also sample using shell.
  8. Query Tuning
    • Each Cypher query gets optimized and transformed into an execution plan by the Cypher execution engine. To minimize the resources used for this, make sure to use parameters instead of literals when possible.
    • How are queries executed?
      • Each query is turned into an execution plan by the execution planner.
      • Execution planning strategies:
        • Rule: This planner has rules that are used to produce execution plans. The planner considers available indexes, but does not use statistical information to guide the query compilation.
        • Cost: This planner uses the statistics service in Neo4j to assign cost to alternative plans and picks the cheapest one
      • Can be configured by using query.planner.version (2.2 version has cost planner for some queries)
      • or by adding CYPHER planner=cost or CYPHER planner=rule before the query.
    • How to profile a query?
      • We can look at the execution plan in following ways:
        • EXPLAIN: Append Cypher statement with EXPLAIN. It does not run the statement but shows the plan.
        • PROFILE: If you want to run the statement and see which operators are doing most of the work, use PROFILE. This will run your statement and keep track of how many rows pass through each operator, and how much each operator needs to interact with the storage layer to retrieve the necessary data
  9. Execution Plans

Satyam Shandilya

Satyam Shandilya is a Software Engineer based in Hyderabad, India. Apart from his usual coding and software stuffs, Satyam does enjoy writing and public speaking.

No comments:

Post a Comment

Instagram