I am working on making the framework for a music recommendation system (with data from Million Songs in a CSV file) by connecting songs in a graph database using neo4j. This is my first time using neo4j, but I have used SQL before.
So far I have three nodes: Song, Artist, and Tempo.
I have already created the relationship between artists and their songs, and now I'm trying to create a relationship between each song and a range of tempos.
I could give each song a relationship to a specific tempo (e.g. 120 bpm), but that would not be very useful, since I could not then backtrack from the Tempo node to find another song that is very close in speed (e.g. 119 or 121 bpm).
Therefore, I'm attempting to change my Tempo nodes (which are floats) from one exact number (e.g. 120 bpm) to a range such as 0-80 (classified as very slow), 81-100 (slow), 101-130 (moderate), etc.
I know it would theoretically be better not to have set groups of tempos, but I'm just beginning and it will be ok for now.
Each Song node has the properties title, artistName, and tempo.
Each Artist node has the properties artistName and title.
Each Tempo node has the properties tempo and title.
I have tried creating a new node via:
CREATE (Tempo {Tempo.tempo<80});
... and several other ways I can't remember right now. Does anyone know how to do this, or whether it's possible?
You seem to be duplicating properties unnecessarily across multiple node labels, in a way that would prevent a given node from being related to multiple other nodes. For example, an Artist node should not have a title property, since that would tie that node to a specific Song. Every Song would presumably have a relationship to the appropriate Artist anyway, so there is no need to store the song's title in the Artist node.
Also, as @InverseFalcon suggested, you can represent a range by using a pair of properties, say min and max.
Here is an example of a path in a suitable data model:
(:Tempo {min: 0, max: 79})<-[:HAS_TEMPO]-(:Song {title: 'Foo'})<-[:PERFORMED]-(:Artist {name: 'Fred'})
There would be one Tempo node for each tempo range.
Using the above data model, this simple query will return all songs that have the same tempo range ($speed is a parameter indicating the specific tempo of interest):
MATCH (t:Tempo)
WHERE t.min <= $speed <= t.max
MATCH (t)<-[:HAS_TEMPO]-(s:Song)
RETURN s;
And this is how you'd return the distinct artists who have ever performed a song in the desired tempo range:
MATCH (t:Tempo)
WHERE t.min <= $speed <= t.max
MATCH (t)<-[:HAS_TEMPO]-(:Song)<-[:PERFORMED]-(a:Artist)
RETURN DISTINCT a;
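As a rough sketch of how this data model might be populated: the statements below create Tempo nodes for the three ranges named in the question and link each song to its range. The name property on Tempo, the rounding of the float tempo to the nearest whole number (so that a value such as 80.4 does not fall between two ranges), and the assumption that s.tempo is stored as a number are illustrative choices, not part of the original answer.

```cypher
// Tempo ranges from the question (inclusive bounds).
// Further ranges would be added to the list in the same way.
UNWIND [
  {name: 'very slow', min: 0,   max: 80},
  {name: 'slow',      min: 81,  max: 100},
  {name: 'moderate',  min: 101, max: 130}
] AS bucket
MERGE (t:Tempo {min: bucket.min, max: bucket.max})
SET t.name = bucket.name;

// Link every song to the range that contains its rounded tempo.
// Assumes s.tempo is numeric; songs without a tempo are skipped.
MATCH (s:Song), (t:Tempo)
WHERE t.min <= toInteger(round(s.tempo)) <= t.max
MERGE (s)-[:HAS_TEMPO]->(t);
```

If $speed in the queries above can be a non-integer, the same rounding would need to be applied to it before comparing against min and max.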
I'm new to Neo4j, and playing around by trying to set up a music database. To start simple, I'm just playing with two labels:
Artist
Song
Obviously this is a parent-child relationship, where a Song is a child of an Artist (or possibly multiple Artists), and might look something like:
(:Artist {name:'name'})-[:RECORDED]->(:Song {title:'title'})
I am making the following assumptions:
Artist names are unique
Song titles are not unique
Duplicate ingest data is unavoidable
To give an example of what I'd like to do:
1) I ingest "Hallelujah" by Leonard Cohen. A new Artist node and Song node are created, with a RECORDED relationship.
2) I ingest "Hallelujah" by Jeff Buckley. Again, a new Artist node and Song node are created, with a RECORDED relationship. The first "Hallelujah" Song is not associated with this new graph at all.
3) I ingest "Hallelujah" by Jeff Buckley again. Nothing happens.
4) I ingest "Lilac Wine" by Jeff Buckley. We reuse our old Artist node, but I have a new Song node with a RECORDED relationship.
From what I can tell, using MERGE gets me close, but not quite there (it stops duplication of the Artist, but not of the Song). If I use CREATE, then point number 3 doesn't work properly.
I guess I could add another property to the Song node which tracks its Artist (and which I can therefore make unique), but that seems a little redundant and unidiomatic for a graph database, no?
Does anyone have any bright ideas on the most succinct way of enforcing these relationships and requirements?
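Merge the Artist first, then the Song:
// Example input values
WITH 'Leonard Cohen' AS ArtistName,
     'Hallelujah' AS SongTitle
// Reuse the artist if it already exists, otherwise create it
MERGE (A:Artist {name: ArtistName})
WITH A, SongTitle
// Look for a song with this title already recorded by this artist
OPTIONAL MATCH p = (A)-[:RECORDED]->(:Song {title: SongTitle})
// Create the song and its relationship only when no such path was found
FOREACH (x IN CASE WHEN p IS NULL THEN [1] ELSE [] END |
  CREATE (S:Song {title: SongTitle})
  MERGE (A)-[:RECORDED]->(S)
)
WITH A, SongTitle
MATCH p = (A)-[:RECORDED]->(:Song {title: SongTitle})
RETURN p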
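A shorter variant of the same idea, offered as a sketch rather than as the answerer's method: once the artist is bound, a MERGE on the whole pattern only matches songs connected to that artist, and otherwise creates a new Song together with the relationship. The parameter names $artistName and $songTitle are assumptions for illustration.

```cypher
// Find or create the artist (artist names are assumed unique)
MERGE (a:Artist {name: $artistName})
// With a already bound, this matches only a song of this title recorded by a;
// if none exists, a new Song node and RECORDED relationship are created
MERGE (a)-[:RECORDED]->(s:Song {title: $songTitle})
RETURN a, s
```

Run once per ingested row, this should cover the four cases in the question: a second artist's "Hallelujah" gets its own Song node, and re-ingesting the same artist and title changes nothing.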
I don't think song title is something you can rely on for uniqueness, especially if this graph includes covers of existing songs.
Determining some additional means to imply uniqueness is the way to go.
Artist is one way. Recorded date might be another piece of data to consider. If you're reading this from some other kind of database, there may be some other unique ID they use for uniqueness.
Whatever the case, once you have the fields that you want to use to determine uniqueness, MERGE your song node with all those fields present.
The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH (n {name: "xxx"})-[*]->(o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
First, because you are not using a label and an indexed property for your starting node, the query must scan ALL the nodes in the graph and open each one's PropertyContainer to check whether it has a name property with the value "xxx".
Secondly, if you know an approximate maximum depth of parentship, you may want to limit the depth of the search.
I would suggest you add a label of your choice to your nodes and index the name property.
Use a label, e.g. :Group, for your starting point and an index on :Group(name).
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
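Put together, the suggestions above might look like the following sketch. The index name group_name and the depth limit of 5 are assumptions chosen for illustration, and the index syntax shown is the Neo4j 4.x and later form.

```cypher
// Index the lookup property so the start node is found without a full scan.
// On Neo4j 3.x the equivalent statement is: CREATE INDEX ON :Group(name)
CREATE INDEX group_name FOR (g:Group) ON (g.name);

// Start from the indexed node, follow only the two known relationship types,
// and cap the traversal depth (5 is an assumed upper bound).
PROFILE
MATCH (n:Group {name: "xxx"})-[:IS_PARENT_OF|PLAYS_ROLE_IN*1..5]->(o:Group)
RETURN DISTINCT o.name;
```

DISTINCT is there because the same group can be reached along several paths, especially when the graph contains cycles.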
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|PLAYS_ROLE_IN]->(m:Group)
RETURN n,m
Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capitalOf relationships is relatively small (say, one hundred) compared to that of all nodes and other types of relationships.
I want to fetch any capital which is also the largest city in their country by the following query:
MATCH (city)-[:capitalOf]->(country), (city)-[:largestCityOf]->(country) RETURN city
Clearly it would be wise to take the capitalOf type as a clue, scan all 100 relationships of that type, and refine by [:largestCityOf]. However, the current execution plan engine of Neo4j does an AllNodesScan and Expand. Why not consider adding a "RelationshipByTypeScan" operator to the query optimization engine, similar to NodeByLabelScan?
I know that I can transform relationship types to relationship properties, index it using the legacy index and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make it efficient. But for a more complicated pattern query with many relationship types but no node id/label/property to start from, it is clearly a duty of the optimization engine to decide which relationship type to start with.
I saw many questions asking the same problem but getting answers like "negative... a query TYPICALLY starts from nodes... ". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. You can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches to the cities. Since every country does have one capital and one largest city this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab the ~100 capitalOf relationships, that operation scales logarithmically with the total number of relationships in your graph.
I've just recently started using Neo4j and I've run into an issue. There doesn't seem to be an answer to this on here, but I might also be wording it incorrectly. I'm building a small site that categorizes music. There are multiple song nodes with BELONGS_TO relationships to genre nodes. How can I get every song that belongs to a set of user-specified genres?
For example, Song1, Song2, and Song3 all belong to both Pop and Electronic. Song4 just belongs to Pop. How can I query to get every song belonging to both Pop and Electronic? In this case Song1, Song2, and Song3.
I've been struggling with this for a while. This is what I have so far but it doesn't return anything. If I replace AND with OR I get all the songs that belong to one of those genres.
MATCH (n:Song)-[r:BELONGS_TO]->(Genre)
WHERE (n)-[r]->(Genre{name:"Pop"}) AND (n)-[r]->(Genre{name:"Electronic"})
RETURN n
Thank you.
What you're trying to do in the WHERE clause you should actually do in the MATCH clause. Here you go:
MATCH (g1:Genre {name: "Pop"})<-[:BELONGS_TO]-(popElectronicSongs:Song)-[:BELONGS_TO]->(g2:Genre {name: "Electronic"})
RETURN popElectronicSongs;
You can actually do quite a lot with just the MATCH clause, as you can see here. The WHERE bit usually gets used for filtering based on various predicates. For example, you might say WHERE popElectronicSongs.title =~ 'S.*' to filter for only songs whose title starts with S.
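Since the question asks about an arbitrary set of user-specified genres, a more general form could look like the sketch below. It assumes the genre names arrive as a list parameter called $genres (a hypothetical name) with no duplicates in it.

```cypher
// Collect, per song, how many of the requested genres it belongs to
MATCH (s:Song)-[:BELONGS_TO]->(g:Genre)
WHERE g.name IN $genres
WITH s, count(DISTINCT g) AS matched
// Keep only songs that belong to every requested genre
WHERE matched = size($genres)
RETURN s
```

With $genres set to ["Pop", "Electronic"], this should return the same songs as the two-genre MATCH above, without having to rewrite the pattern when the number of genres changes.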
I have read the Neo4j manual and saw the numerous short examples regarding movie graph. I have also installed it locally and played with the cypher.
Here is the setup:
I have the following nodes: Movies (with name and id, owned by friend), Actors (with names and ids), Directors (with names and ids), Genre (with id and name).
Relations are: Actors acted in Movies (1 movie - many actors), Directors directed a movie (1 director per movie, but a director can direct many movies), and Movies have several genres (many to many).
1) Owned by friend: I don't know why, but in the LOAD CSV example they put USA as a node rather than a property. Is there a logical reason why it's better to make it a node rather than a property like I did?
2)
What I want to search is similar to the answer given to this question:
Nearest nodes to a give node, assigning dynamically weight to relationship types
However - I do not have a weight on the relationship and its more of a "go find the first give nodes connected to it"
Given that the "owned by friend" can only be owned by 1 person.
If given movie title "Spider-Man" (which for example purpose is owned by frank) go find the next occurrence of a movie that is owned by John.
So after reading about Neo4j, I believe that I don't need to specify which relationship to traverse, but can just go find the next movie that meets my criteria, right?
So Following the above link
MATCH (n:Start { title: 'Spider-Man' }),
(n)-[:CONNECTED*0..2]-(x)
RETURN x
So: go to the node Spider-Man and find me x as long as it is connected. But I got stumped by *0..2 because it's a range... what if I just want to say "go find me the first one that is owned by John"?
3) Following up on #2: how do I insert the filter "owned by John"?
There are a number of things in your question that don't quite make sense. Here's a stab at an answer.
1) Making 'USA' a node rather than a property is useful if you want to search based on country. If 'USA' is a node, you are able to limit your search by starting at the 'USA' node. If you don't care to do this, then it doesn't really matter. It may also save a small amount of space for longer country names to store the name once and link to it via relationships.
2) Your example doesn't match your described graph. I can't really speak to it without a better example.
3) This is probably easy to answer once you improve your example.
OK. Based on the comments to the answer, here's what you need. To find one movie owned by John that is connected via common actors, directors, etc to the movie Spider-man owned by Frank (that is, sub-graphs like (movie)<--(actor)-->(movie) ) you can write:
MATCH (n:Movie {title : 'Spider-Man', owned_by : 'Frank'})<-[*2]->(m:Movie {owned_by : 'John'})
RETURN m LIMIT 1
If you want more results, raise or remove the LIMIT on the RETURN clause. If you want to allow longer chains like (movie)<--(actor)-->(movie)<--(director)-->(movie), you can increase the number of relationships matched (the *2) to 4, 6, 8, etc. You probably shouldn't write the relationship part of the MATCH clause as an unbounded -[*]-, because the number of paths to explore can grow explosively in a well-connected graph and the query may take a very long time.
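If the goal is the nearest movie owned by John rather than one at a fixed distance, one option is shortestPath with an upper bound, sketched below. The bound of 6 hops is an assumed value, and owned_by is the property name used in the query above.

```cypher
// For each of John's movies, find the shortest connection to Frank's Spider-Man
// within at most 6 hops, then keep the closest one
MATCH (n:Movie {title: 'Spider-Man', owned_by: 'Frank'}),
      (m:Movie {owned_by: 'John'}),
      p = shortestPath((n)-[*..6]-(m))
RETURN m, length(p) AS hops
ORDER BY hops
LIMIT 1
```

Movies owned by John that are not reachable within the bound simply produce no row, so the query returns nothing if no connection exists.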