how to cluster users based on tags - machine-learning

I'd like to cluster users based on the categories or tags of shows they watch. What's the easiest/best algorithm to do this?
Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using say pig/hadoop/mortar or perhaps on neo4j?
In terms of data I have users, programs they've watched, and the tags that a program has (usually around 10 tags per program).
At the end I would expect k clusters (maybe a dozen?), or broad buckets, which I can use to classify and group my users, and also to gain some insight into how they would be divided - with a set of tags representing each cluster.
I've seen some posts out there suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be the distance between two users, or between a user and a set of tags, etc.?

You basically want to cluster the users according to their tags.
To keep it simple, assume that you only have 10 tags (instead of 20,000). Assume that a user, say user_34, has the 2nd and 7th tags. For this clustering task, user_34 can be represented as a point in the 10-dimensional space, with coordinates [0,1,0,0,0,0,1,0,0,0].
In your own case, each user can be similarly represented as a point in a 20,000-dimensional space.
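For instance, if your data lives in a graph shaped like the model suggested in the neo4j answer below, you could pull out each user's non-zero coordinates (their distinct tag set) with a query along these lines (the labels, relationship types, and the id property are assumptions for illustration):

// Collect each user's distinct tags - the non-zero entries of their sparse vector.
MATCH (u:User)-[:WATCHED]->(:Program)-[:HAS_TAG]->(t:Tag)
RETURN u.id AS user, collect(DISTINCT t.name) AS tags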
You can use Apache Mahout which contains many effective clustering algorithms, such as K-means.
Since everything is well defined in a mathematical coordinate system, computing the distance between any two users is easy! It can be computed using any distance function, but the Euclidean distance is the de facto standard.
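In symbols, for two users u and v represented as n-dimensional tag vectors (n = 20,000 in your case):

d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}

For 0/1 vectors this is just the square root of the number of tags on which the two users disagree.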
Note: Mahout and many other data-mining programs support formats suitable for SPARSE features, i.e., you do not need to insert ...,0,0,0,0,... in the file, but only need to specify which tags are selected. (See RandomAccessSparseVector in Mahout.)
Note: I assumed you only want to cluster your users. Extracting representative info from clusters is somewhat tricky. For example, for each cluster you may select the tags that are most common among the users of the cluster. Alternatively, you may use concepts from information theory, such as information gain, to find out which tags contain more information about the cluster.

You should consider using neo4j. You can model your data using the following node labels and relationship types.
If you are not familiar with neo4j's Cypher language notation, (:Foo) represents a node with the label Foo, and [:BAR] represents a relationship with the type BAR. The arrows around a relationship indicate its directionality. neo4j efficiently traverses relationships in both directions.
(:Cluster) -[:INCLUDES_TAG]-> (:Tag) <-[:HAS_TAG]- (:Program) <-[:WATCHED]- (:User)
You'd have k Cluster nodes, 20K Tag nodes, and several million WATCHED relationships.
With this model, starting with any given Cluster node, you can efficiently find all its related tags, programs, and users.
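For example, a query like this one (the cluster id property is hypothetical) returns every user belonging to a given cluster:

// Find all users whose watched programs carry a tag included in cluster 1.
MATCH (c:Cluster {id: 1})-[:INCLUDES_TAG]->(:Tag)<-[:HAS_TAG]-(:Program)<-[:WATCHED]-(u:User)
RETURN DISTINCT u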

Related

Neo4j Link prediction ML Pipeline

I am working on a use case to predict a relation between nodes of different types.
I have a graph something like this:
(:customer)-[:has]->(:session)
(:session)-[:contains]->(:order)
(:order)-[:has]->(:product)
(:order)-[:to]->(:relation)
There are many customers who have placed orders. Some of the orders specify whom the order was intended for (relation), i.e., mother/father etc., and some orders do not. For these orders my intention is to predict whom the order was likely intended for.
I have prepared a Link Prediction ML pipeline on neo4j. The gds.beta.pipeline.linkPrediction.predict.mutate procedure has 2 ways of predicting: Exhaustive search and Approximate search. The first one predicts for all unconnected nodes and the second one applies KNN to predict. I want neither; rather, I want the model to predict the link only between 2 specific node types, the 'order' node and the 'relation' node. How do I specify this in the predict procedure?
You can also frame this problem as node classification and get what you are looking for. Treat Relation as the target variable and it will become a multi class classification problem. Let's say that Relation is a categorical variable with a few types (Mother/Father/Sibling/Friend etc.) and the hypothesis is that based on the properties on the Customer and the Order nodes, we can predict which relation a certain order is intended to.
Some of the examples of the properties of Customer nodes are age, location, billing address etc., and the properties of the Order nodes are category, description, shipped address, billing address, location etc. Properties of Session nodes are probably not useful in predicting anything about the order or the relation that order is intended to.
For running any algorithm in Neo4j, we have to project a graph into memory. Some of the properties on Customer and Order nodes are strings and graph projection procs do not support projecting strings into memory. Hence, the strings have to be converted into numerical values.
For example, Customer age can be used as is but the order description has to be converted into a word/phrase embedding using some NLP methodology etc. Some creative feature engineering also helps - instead of encoding billing/shipping addresses, a simple flag to identify if they are the same or different makes it easier to differentiate if the customer is shipping the order to his/her own address or to somewhere else.
Since we are using Relation as a target variable, let's label encode the relation type and add that as a class label property on Order nodes where a relationship to a Relation node exists (labelled examples). For all other orders, add a class label property of 0 (or any other number outside the label-encoded relation types).
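A sketch of that encoding in Cypher, using the node labels and relationship types from the question (the classLabel property name, the relation node's type property, and the specific codes are all made up for illustration):

// Label-encode the relation type on orders that have a known relation.
MATCH (o:order)-[:to]->(r:relation)
SET o.classLabel = CASE r.type
  WHEN 'Mother' THEN 1
  WHEN 'Father' THEN 2
  WHEN 'Sibling' THEN 3
  WHEN 'Friend' THEN 4
  ELSE 5
END;

// Every other order gets class 0 (the unlabelled examples to predict).
MATCH (o:order)
WHERE NOT (o)-[:to]->(:relation)
SET o.classLabel = 0;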
Now, project a graph with the Customer, Session and Order nodes, along with the properties of interest, into memory. Since we are not using Session nodes in our prediction task, we can collapse the path between Customer and Order nodes. One customer can connect to multiple orders via multiple session nodes, and orders are unique, so the collapse path procedure will not result in multiple relationships between a customer and an order node; hence, aggregation is not needed.
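A hedged sketch of that projection and collapse step (procedure and config names as in the GDS 2.x docs; the graph name and property lists are placeholders, so check them against your version):

// Project the relevant nodes, numeric properties, and relationships into memory.
CALL gds.graph.project(
  'orders',
  {
    customer: { properties: ['age'] },
    session: {},
    order: { properties: ['classLabel', 'sameAddressFlag'] }
  },
  ['has', 'contains']
);

// Collapse customer->session->order paths into one direct relationship.
CALL gds.beta.collapsePath.mutate(
  'orders',
  {
    pathTemplates: [['has', 'contains']],
    mutateRelationshipType: 'PLACED'
  }
);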
You can now use the node classification ML pipeline in the Neo4j GDS library to generate embeddings, use the embedding property on Order nodes as the feature vector and the class label property as the target, and train a multi-class classification model to predict the class a particular order belongs to, or the likelihood that a particular order is intended for some relation type.
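A sketch of that pipeline (procedure names from the GDS 2.x beta namespace; the configuration values are illustrative, not a tested recipe):

// Create the pipeline and add FastRP embeddings as a node property step.
CALL gds.beta.pipeline.nodeClassification.create('order-class');
CALL gds.beta.pipeline.nodeClassification.addNodeProperty('order-class', 'fastRP', {
  mutateProperty: 'embedding',
  embeddingDimension: 64
});

// Use the embedding as the feature vector and train a logistic regression model.
CALL gds.beta.pipeline.nodeClassification.selectFeatures('order-class', ['embedding']);
CALL gds.beta.pipeline.nodeClassification.addLogisticRegression('order-class');
CALL gds.beta.pipeline.nodeClassification.train('orders', {
  pipeline: 'order-class',
  modelName: 'order-class-model',
  targetProperty: 'classLabel',
  metrics: ['F1_WEIGHTED'],
  randomSeed: 42
});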
This use case is not supported by the latest stable release of GDS (2.1.11, at the time of writing). In GDS pipelines, we assume a homogeneous graph, where the training algorithm will consider each node as the same type as any other node, and similarly for relationships.
However, we are currently building features to support heterogeneous use cases. In 2.2 we will add so-called context configuration, where you can direct your training algorithm to attempt to learn only a specific relationship type between specific source and target node labels, while still allowing the feature-producing node property steps to use the richer graph.
This will be effective relative to the node features you are using -- if you are using an embedding, be aware that these are still homogeneous and will likely not be able to tell the various relationship types apart (except for GraphSAGE). Even if you do use them, you will only get predictions for the relevant label-type-label triple which you specified for training. But I would recommend thinking about what features to use and how to tune your models effectively.
You can already try out the 2.2 features by using our alpha releases -- find our latest alpha through this download link. Preview documentation is available here. Note that this is preview software and that the API may change a lot until the final 2.2.0 release.

(neo4j) Best practice for the number of properties in relationships and nodes

I've just started using neo4j and, having done a few experiments, am ready to start organizing the database in itself. Therefore, I've started by designing a basic diagram (on paper) and came across the following doubt:
Most examples in the material I'm using (Cypher and neo4j tutorials) present only a few properties per relationship/node. But I have to wonder what the cost of carrying a long list of properties is.
Q: Is it more efficient to favor a wide variety of relationship types (GOODFRIENDS_WITH, FRIENDS_WITH, ACQUAINTANCE, RIVAL, ENEMIES, etc.) or fewer types with varying properties (SEES_AS with a type property: good friend, friend, acquaintance, rival, enemy, etc.)?
The same holds for nodes. The first draft of my diagram has a staggering number of properties (title, first name, second name, first surname, second surname, suffix, nickname, and then there's physical characteristics, personality, age, jobs...) and I'm thinking it may lower the performance of the db. Of course some nodes won't need all of the properties, but the basic properties will still be quite a few.
Q: What is the actual, and the advisable, limit for the number of properties, in both nodes and relationships?
FYI, I am going to remake my draft to cut down the properties by using nodes instead (create a node for :family names, another for :job, and so on), but I've only just started thinking it over, as I'll need to carefully analyse which 'would-be properties' make sense to keep, especially since the change will amplify the number of relationship types I'll be dealing with.
Background information:
1) I'm using neo4j to map out all relationships between the people living in a fictional small town. The queries I'll perform will mostly be as follows:
a. find all possible paths between 2 (or more) characters
b. find all locations which 2 (or more) characters frequent
c. find all characters which have certain types of relationship (friends, cousins, neighbors, etc) to character X
d. find all characters with the same age (or similar age) who studied in the same school
e. find all characters with the same age / first name / surname / hair color / height / hobby / job / temper (easy to anger) / ...
and variations of the above.
2) I'm not a programmer, but having self-learnt HTML and advanced Excel, I feel confident I'll pick up the intuitive Cypher quickly enough.
First off, for small-data "sandbox" use, this is a moot point. Even with the most inefficient data layout, as long as you avoid Cartesian products and the like, the only thing you will notice is how intuitive your data is to yourself. So if this is a "toy"-scale project, just focus on what makes the most organizational sense to you. If you change your mind later, reformatting via Cypher won't be too hard.
Now assuming this is a business project that needs to scale to some degree, remember that non-indexed properties are basically invisible to the Cypher planner. The more meaningful and diverse your relationships, the better the Cypher planner is going to be at finding your data quickly. Favor relationships for connections you want to be able to explore, and favor properties for data you just want to see. Index any properties or use labels that will be key for finding a particular (or set of) node(s) in your queries.
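To make that concrete, compare these two ways of asking the same question (labels and types here are hypothetical, echoing the examples in the question):

// Property on a generic relationship: without an index this scans every SEES_AS.
MATCH (a:Person)-[r:SEES_AS]->(b:Person)
WHERE r.type = 'rival'
RETURN a, b;

// Dedicated relationship type: the planner expands only RIVAL relationships.
MATCH (a:Person)-[:RIVAL]->(b:Person)
RETURN a, b;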

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding the best way to implement the graph such that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper.
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
It doesn't seem like having sentence and positional information as an array of elements contained within a property of each node is efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the best way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
OTOH you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.
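A minimal sketch of that relationship-centric variant (the sid/pid property names and the NEXT type are illustrative):

// One NEXT relationship per adjacency, carrying sentence id and word position.
MERGE (w1:Word:TwitterWord {orth: '#stackoverflow'})
MERGE (w2:Word:TwitterWord {orth: 'is'})
CREATE (w1)-[:NEXT {sid: 1, pid: 1}]->(w2);

// Aggregate dynamically: how often does one word follow another across tweets?
MATCH (a:Word)-[r:NEXT]->(b:Word)
RETURN a.orth, b.orth, count(r) AS strength
ORDER BY strength DESC;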

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling and use Neo4j 2.2.3 database with NodeJs and Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle-selling scenario with the following structure:
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), majority of the features of a vehicle is created as separate nodes and shared among vehicles with common feature with relationships.
Could you please advise whether roles (labels) like color, body type, driving side (left drive or right drive), gearbox and others should be separate nodes or properties of the vehicle node? Which option is more performance friendly and easier to query?
I want to write JS code that allows querying the graph above with one or many search criteria. If the majority of those features were properties of the VEHICLE node, then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with the existing graph model that I have, it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships.
Any ideas and advice regarding the existing structure of the graph, to make it more queryable as well as performance friendly, would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes, that would generate 15,000 relationships, which sounds a bit scary; and if it hits a million VEHICLEs, then at most 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs; it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's slower to scan all nodes of a label and find the items where a property equals a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There are manuals and other types... if it's a property value, you won't easily be able to later store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would be easy, because you could add more properties to the node or relate other things to it.
If a piece of data really is a primitive (String, integer, etc.) and you don't need extra detail about it, it usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person's "date of birth" as a separate node; that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
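To tie this back to the question: with the node-per-feature model, a multi-criteria search is still one MATCH; the criteria just become pattern elements instead of property comparisons (a sketch using the labels from the question, with $minPrice/$maxPrice as query parameters):

// Manual, petrol vehicles within a price range.
MATCH (v:VEHICLE)-[:VGEARBOX_IS]->(:VGEARBOX {type: 'MANUAL'}),
      (v)-[:VCONSUMES_FUEL_TYPE]->(:VFUEL_TYPE {type: 'PETROL'})
WHERE v.price > $minPrice AND v.price < $maxPrice
RETURN v;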

Finding clusters of similar nodes with Cypher

Given a set of hundreds of thousands of nodes with a likes relationship, (Foodie)-[:likes]->(Food), I would like to find logical clusters of Foodie nodes.
For instance, suppose I want to divide the foodies into two sets. As an output I would like two sets which have the most common eating habits.
The same logic can be extended to 3, 4, 5 sets, etc. In the case of three sets, each set would have the most alike eating habits. Please note that the sets may NOT have the same number of nodes.
An application, for instance, could be coloring of nodes: if the foodies were from different countries, the colors of the nodes could point to the various countries, assuming people of different countries ate similar food.
I would like to write a Cypher query to extract the nodes. I am stumped as to where to start. Any solution or pointers would be appreciated.
What about trying the current milestone of Neo4j 2.0 (http://www.neo4j.org/download, milestone section) and assigning your nodes different labels according to their characteristics (http://www.neo4j.org/develop/labels)?
Then, you'll only have to execute Cypher queries like:
MATCH (nodes:MY_LABEL)
WHERE /.../
RETURN nodes
so that you can retrieve nodes by cluster.
You might want to look into Cliques. This is a general Graph Theory idea, but it sounds like what you want is to define certain 'cliques' of foodies, say there are BBQ foodies, Food Truck foodies, etc.
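Either way, a useful first step is to compute pairwise overlap directly in Cypher as a similarity signal (a sketch assuming the likes relationship from the question):

// Count shared foods per pair of foodies - a crude similarity score
// that clique detection or clustering could build on.
MATCH (a:Foodie)-[:likes]->(f:Food)<-[:likes]-(b:Foodie)
WHERE id(a) < id(b)
RETURN a, b, count(f) AS shared
ORDER BY shared DESC;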