For an academic project I have to analyse a customer database of an insurance company.
This insurance company would like to identify a few things: first, to classify the customers who leave the company, in order to target them with offers and the like.
They would also like to know which customers to target for upselling or cross-selling, as well as to find risky customers in terms of insurance claims.
So I am focusing on customer cancellations, as that seems the most important task.
The attributes provided by the insurance company are:
Bundled/Unbundled, Policy Status, Policy Type, Policy Combination, Issue Date, Effective Date, Maturity Date, Policy Duration, Loan Duration, Cancellation Date, Reason for cancellation, Total Premium, Splitter Premium, Partner ID, Agency ID, Country Agency, Zone ID, Agency potential, Sex Contractor, Birth Year Contractor, Job Contractor, Sex Insured, Job Insured, Birth Year Insured, Year Claim, Claim Status, Claim Provision, Claim Payments
The database is composed of ~200k records and there are many missing values for some attributes.
I started using RapidMiner to mine the dataset.
I cleaned the dataset a bit, removing incoherent or wrong values.
I then tried applying decision trees, adding a new attribute derived from Policy Status (which can be issued, renewed or cancelled) called isCanceled, and using it as the label of the decision tree.
I tried changing every single parameter of the decision tree, but I either get a tree with only 1 leaf node and no splits, or a tree that is completely irrelevant because its leaf nodes contain almost the same number of instances of the 2 classes.
This is getting really frustrating.
I'd like to know what the usual procedures for churn analysis are, ideally using RapidMiner. Can anybody help me?
In my experience, most data mining or machine learning projects spend most of their time cleaning, tidying, formatting and understanding the data.
Assuming this has been done, then as long as there is a relationship between some or all of the attributes and the label to be predicted it will be possible to perform some sort of churn analysis.
There are lots of ways to determine this relationship, of course, but a quick way is to try one of the Weight By operators. This will output a weight for each attribute, with those near 1 being potentially more predictive of the label.
If you determine there are attributes of value, you can use Decision Trees or another operator to build a model that can be used to predict. The attributes you have are a mix of nominal and numeric types, so Decision Trees will work, and this operator's output is easy to visualize anyway. The tricky part is getting the parameters right, and the way to do that is to observe the performance of the model on unseen data as the parameters are varied. The Loop Parameters operator can help with this.
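As a concrete illustration of both steps outside RapidMiner, here is a hedged scikit-learn sketch of the same workflow: weight the attributes, then tune a tree against held-out data. The file name, the date handling and the leakage-prone columns are assumptions based on the attribute list in the question.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("policies.csv")  # hypothetical export of the ~200k records
df["isCanceled"] = (df["Policy Status"] == "cancelled").astype(int)

# Drop the label source plus attributes only known after a cancellation --
# they leak the answer and yield trivial or useless trees. Raw dates are
# dropped here for brevity; in practice derive durations/ages from them.
drop = ["isCanceled", "Policy Status", "Cancellation Date",
        "Reason for cancellation", "Issue Date", "Effective Date",
        "Maturity Date"]
X = pd.get_dummies(df.drop(columns=drop)).fillna(0)
y = df["isCanceled"]

# Rough analogue of the Weight By operators: score attributes against the label.
weights = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(weights.sort_values(ascending=False).head(10))

# Rough analogue of Loop Parameters: vary parameters, score on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced"),
    {"max_depth": [3, 5, 8], "min_samples_leaf": [50, 200, 1000]},
    scoring="f1")
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```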
I am working on a use case to predict relations between nodes of different types.
I have a graph something like this.
(:customer)-[:has]->(:session)
(:session)-[:contains]->(:order)
(:order)-[:has]->(:product)
(:order)-[:to]->(:relation)
There are many customers who have placed orders. Some of the orders specify whom the order was intended for (the relation), i.e., mother/father etc., and some orders do not. For these orders my intention is to predict whom the order was likely intended for.
I have prepared a Link Prediction ML pipeline in Neo4j. The gds.beta.pipeline.linkPrediction.predict.mutate procedure has 2 modes of prediction: exhaustive search and approximate search. The first one predicts for all pairs of unconnected nodes and the second one applies KNN to predict. I want neither; rather, I want the model to predict links only between 2 specific node types, 'order' and 'relation'. How do I specify this in the predict procedure?
You can also frame this problem as node classification and get what you are looking for. Treat Relation as the target variable and it becomes a multi-class classification problem. Let's say that Relation is a categorical variable with a few types (Mother/Father/Sibling/Friend etc.) and the hypothesis is that, based on the properties of the Customer and Order nodes, we can predict which relation a certain order is intended for.
Some of the examples of the properties of Customer nodes are age, location, billing address etc., and the properties of the Order nodes are category, description, shipped address, billing address, location etc. Properties of Session nodes are probably not useful in predicting anything about the order or the relation that order is intended to.
For running any algorithm in Neo4j, we have to project a graph into memory. Some of the properties on Customer and Order nodes are strings and graph projection procs do not support projecting strings into memory. Hence, the strings have to be converted into numerical values.
For example, Customer age can be used as is but the order description has to be converted into a word/phrase embedding using some NLP methodology etc. Some creative feature engineering also helps - instead of encoding billing/shipping addresses, a simple flag to identify if they are the same or different makes it easier to differentiate if the customer is shipping the order to his/her own address or to somewhere else.
Since we are using Relation as the target variable, let's label-encode the relation type and add it as a class label property on the Order nodes where a relationship to a Relation node exists (the labelled examples). For all other orders, add a class label property of 0 (or any other number distinct from the label-encoded relation types).
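As a hedged illustration of those two steps, run through the official Neo4j Python driver (the property names, the r.type property and the connection details are assumptions, not the actual schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Feature engineering: a 1/0 flag for shipping address == billing address.
    session.run("""
        MATCH (o:order)
        SET o.sameAddress = CASE
            WHEN o.shippingAddress = o.billingAddress THEN 1 ELSE 0 END
    """)
    # Label-encode the relation type on labelled orders; 0 marks the rest.
    session.run("""
        MATCH (o:order)
        OPTIONAL MATCH (o)-[:to]->(r:relation)
        SET o.classLabel = CASE r.type
            WHEN 'Mother'  THEN 1
            WHEN 'Father'  THEN 2
            WHEN 'Sibling' THEN 3
            WHEN 'Friend'  THEN 4
            ELSE 0 END
    """)
driver.close()
```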
Now, project a graph with the Customer, Session and Order nodes, along with the properties of interest, into memory. Since we are not using Session nodes in our prediction task, we can collapse the path between Customer and Order nodes. One customer can connect to multiple orders via multiple session nodes, and orders are unique, so the Collapse Path procedure will not result in multiple relationships between a customer and an order node; hence, aggregation is not needed.
You can now use the Node Classification ML pipeline in the Neo4j GDS library to generate embeddings, use the embedding property on the Order nodes as a feature vector and the class label property as the target, and train a multi-class classification model to predict the class a particular order belongs to, i.e. the likelihood that the order is intended for a given relation type.
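A hedged sketch of that pipeline, driven from Python. The procedure names follow the gds.beta.* namespace used in the question (GDS 2.x) and may differ in other versions; the graph name, properties and embedding settings are assumptions.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
queries = [
    # Project the nodes and numeric properties into memory (defaults cover
    # labels that lack a given property).
    """CALL gds.graph.project('orders', ['customer', 'session', 'order'],
           ['has', 'contains'],
           {nodeProperties: {age: {defaultValue: 0.0},
                             sameAddress: {defaultValue: 0},
                             classLabel: {defaultValue: 0}}})""",
    # Collapse customer-session-order paths into one direct relationship.
    """CALL gds.beta.collapsePath.mutate('orders',
           {pathTemplates: [['has', 'contains']],
            mutateRelationshipType: 'PLACED'})""",
    # Assemble the pipeline: embeddings as features, classLabel as target.
    "CALL gds.beta.pipeline.nodeClassification.create('orderPipe')",
    """CALL gds.beta.pipeline.nodeClassification.addNodeProperty('orderPipe',
           'fastRP', {mutateProperty: 'embedding', embeddingDimension: 64})""",
    """CALL gds.beta.pipeline.nodeClassification.selectFeatures('orderPipe',
           ['embedding', 'sameAddress'])""",
    "CALL gds.beta.pipeline.nodeClassification.addLogisticRegression('orderPipe')",
    # In practice, train only on the labelled orders (classLabel > 0),
    # e.g. via a filtered subgraph projection.
    """CALL gds.beta.pipeline.nodeClassification.train('orders',
           {pipeline: 'orderPipe', modelName: 'orderModel',
            targetProperty: 'classLabel', metrics: ['F1_WEIGHTED']})""",
]
with driver.session() as session:
    for q in queries:
        session.run(q)
driver.close()
```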
This use case is not supported by the latest stable release of GDS (2.1.11, at the time of writing). In GDS pipelines, we assume a homogeneous graph, where the training algorithm will consider each node as the same type as any other node, and similarly for relationships.
However, we are currently building features to support heterogeneous use cases. In 2.2 we will add so-called context configuration, where you can direct your training algorithm to attempt to learn only a specific relationship type between specific source and target node labels, while still allowing the feature-producing node property steps to use the richer graph.
This will be effective relative to the node features you are using -- if you are using an embedding, you must know that these are still homogeneous and will likely not be able to tell the various relationship types apart (except for GraphSAGE). Even if you do use them, you will only get predictions for the relevant label-type-label triple which you specified for training. But I would recommend thinking about what features to use and how to tune your models effectively.
You can already try out the 2.2 features by using our alpha releases -- find our latest alpha through this download link. Preview documentation is available here. Note that this is preview software and that the API may change a lot until the final 2.2.0 released version.
I am developing a Neo4j database that will contain genomic and clinical data for cancer patients. A common design issue in developing graph databases is whether a data item should be represented by a Node or by a property within a Node. In my case, patients will have hundreds of clinical and demographic measurements (e.g. sex, medications, tumor size). Some of these data will be constant (e.g. sex) while others will be subject to variation with each patient visit. The responses I've seen to previous node vs property questions have recommended using the anticipated queries against the data to make the decision. I think I can identify some properties that will be common search criteria and should be nodes (e.g. smoking history, sex, cancer type) but that still leaves me with hundreds of other properties. Is there a practical limit in Neo4j for the number of properties that a Node should contain? Also, a hybrid approach, where some data are properties and others are Nodes would seem to make both loading data from source files and subsequent queries more complicated.
The main idea behind "look at your queries to decide" is that how data relates to other data affects whether a node or a property is better. Really, the main point of a graph database is to make walking relationships easier to query. So the real question you should ask yourself is: "Does (a)-->()<--(b) have any significant meaning?" In other words, do I need to be able to find other nodes that share this property?
Here are some quick rule-of-thumb guidelines
Node
Has its own sub-values or relations
Multiple nodes sharing this value has meaning, and you need to be able to walk along this shared value between them
Changes infrequently
If more than 1 value can apply at the same time
Properties
Has a large range of possible values
Changes over time
If more than 1 value can apply, values are usually updated as a set (rather than individually)
Label
Has a small, finite range of mutually exclusive values
Almost never changes
So let's go through the thought process for a few of your properties...
Sex
Will either be "Male" or "Female", and everyone will be connected to one of the two, so both will end up being super nodes (overloaded). Also, if you do end up needing to find two people that share the same sex, almost any other method would be more efficient than finding them through the super node. However, these are mutually exclusive, immutable, genetic traits, so making this a label is also perfectly acceptable (and sometimes preferred).
Address
This is a variable value with sub-properties, it won't be shared by very many nodes, and the walk from one person to another at the same address (or, by extension, living in the same area) has valuable meaning. So this should almost definitely be a node.
Height and Weight
These change constantly with time, have no sub-values, and two people sharing this value has little to no meaning. The range of values is far too wide, so labels make no sense either; this should be a property.
Blood type
While it has more options than Sex does, all the same logic applies, except that the relation does matter now (because people must have compatible blood types to donate). The problem is that this value will be so overloaded that you will need to filter on area first and then just verify the blood type. Could be a property or a label. The case for a node arises if you include a "Can_Donate_To" or "Can_Accept" relation between the blood types. While you likely won't walk these relations to find a potential donor (because they are too overloaded, and you will have to filter by area first), you can use them to verify that someone can be a donor.
Social Security Number
Is highly sensitive, and a lawsuit waiting to happen. Keep it out of the DB if at all possible. If you absolutely have to store it: this property is immutable but unique to every person, so because of the lack of reuse it would be a bad label and pointless as a node. Definitely a property (but it should be salted and hashed, if only for verification purposes).
Mother's maiden name
The possible values are endless, and two nodes sharing this value has no real meaning. Definitely a property.
First born child
Since the child is already their own node, with its own sub-properties, just create a relation between the two. While the value of this info is questionable, any time you need to reference another node, always use a relationship for it. Definitely a node.
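Pulling those rules together, a minimal Cypher sketch run through the official Python driver (labels and property names are hypothetical): Sex as a label, height/weight and the maiden name as properties, the address and the child as nodes.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run("""
        CREATE (p:Patient:Male {height_cm: 180, weight_kg: 82,
                                mothers_maiden_name: 'Smith'})
        CREATE (a:Address {street: '1 Main St', city: 'Springfield'})
        CREATE (c:Patient {birth_year: 2010})
        CREATE (p)-[:LIVES_AT]->(a)
        CREATE (p)-[:PARENT_OF]->(c)
    """)
driver.close()
```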
Let's say I have a large dataset from an online gaming platform (like Steam) which has 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I am not sure what to do when there are millions of unique classes. Also, please suggest any other method we could use to preprocess the data.
Using the user id directly in the model is not a good idea, since, as you said, that would result in a large number of features, but also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain your model each time you get a new user.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to perform a clustering of the users based on the other features, and then pass the cluster as a feature instead of the user id; I don't know for sure whether this is a good idea, though, since I don't know what kind of data you have.
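A small sketch of that clustering idea, using the column names from the question (the 20-cluster choice and the aggregations are arbitrary assumptions):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Columns: date, user_id, number_of_hours_played, no_of_games
df = pd.read_csv("plays.csv")

# Build a per-user behavioural profile instead of using the raw id.
profile = df.groupby("user_id").agg(
    mean_hours=("number_of_hours_played", "mean"),
    total_games=("no_of_games", "max"),
    active_days=("date", "nunique"))

# The cluster id becomes the feature; user_id itself is dropped before training.
profile["cluster"] = KMeans(n_clusters=20, n_init=10,
                            random_state=0).fit_predict(profile)
df = df.merge(profile["cluster"].reset_index(), on="user_id")
```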
Also, you are talking about making a prediction for a given date. The data you described doesn't suggest that, but if you have the number of hours across multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I have an unlabelled corpus, from which I extracted (SUBJECT, RELATION, OBJECT) triples. For relation extraction I use Stanford OpenIE. But I only need some of these triples; for example, I need the relation "funded".
Text:
Every startup needs a steady diet of funding to keep it strong and growing. Datadog, a monitoring service that helps customers bring together data from across a variety of infrastructure and software is no exception. Today it announced a massive $94.5 million Series D Round. The company would not discuss valuation.
From this text I want to extract the triple (Datadog, announced, $94.5 million Round).
The only ideas I have:
Use Stanford coreference resolution to detect that 'Datadog' in the first sentence and 'it' in the second sentence are the same entity
Try to cluster the relations, but I think that won't work well
Maybe there is a better approach? Maybe I need a labelled corpus (which I don't have)?
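The coreference idea can be tested directly: CoreNLP's OpenIE annotator has a resolve_coref option that substitutes coreferent mentions before triples are extracted. A hedged sketch via stanza's CoreNLP client (assumes a local CoreNLP server installation; the keyword filter is a stand-in for real relation selection):

```python
from stanza.server import CoreNLPClient

TEXT = ("Every startup needs a steady diet of funding to keep it strong and "
        "growing. Datadog, a monitoring service, is no exception. Today it "
        "announced a massive $94.5 million Series D Round.")

with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos", "lemma", "ner",
                    "depparse", "coref", "natlog", "openie"],
        properties={"openie.resolve_coref": "true"},
        timeout=60000, memory="4G") as client:
    ann = client.annotate(TEXT)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            # Keep only the relations of interest, e.g. funding announcements.
            if any(k in triple.relation.lower() for k in ("announce", "fund")):
                print((triple.subject, triple.relation, triple.object))
```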
I'm looking for a machine learning method to recognize input ranges that result in customer dissatisfaction.
For instance, assume that we have a database of customers' age, customers' gender, the date and time each customer stops by, the person in charge of serving the customer, etc., and finally a number in the range 0 to 10 which stands for customer satisfaction (extracted from the customer's feedback).
Now I'm looking for a method to determine the input ranges which result in dissatisfaction. For example: male customers who are served by John between 10 and 12 are mostly dissatisfied.
I believe there already is a kind of clustering or neural network method for this purpose. Could you help me?
This is not a clustering problem. You have training data.
Instead, you may be looking for a decision tree.
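A minimal sketch of that route with scikit-learn (the column names are assumptions): train a shallow tree on the visit features and print its rules, which read directly as the "input ranges" being asked about.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical columns: age, gender, hour, employee, satisfaction (0-10).
df = pd.read_csv("visits.csv")
X = pd.get_dummies(df[["age", "gender", "hour", "employee"]])
y = (df["satisfaction"] <= 5).astype(int)  # 1 = dissatisfied

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=30).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```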
There is more than one method to do this (correlation analysis, for example).
One simple way is to classify your data by the degree of satisfaction (the target):
Classes:
0-5 DISSATISFIED
6-10 SATISFIED
Then look for repetitions across the features within each class.
For example:
If you are interested in one feature, e.g. the person who served the client, then just get the most frequent name within each of the two classes, to get a result like "80% of dissatisfied clients were served by John".
If you are interested in more than one feature, e.g. the person who served the client AND the time of day, you can treat the pair of features as a single one and do the same as in the first case; you will then get something like "30% of dissatisfied clients were served by John between 10 and 11 am".
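In pandas, those frequency counts are one-liners (column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("visits.csv")  # age, gender, hour, employee, satisfaction
df["dissatisfied"] = df["satisfaction"] <= 5

# Share of dissatisfied visits per employee, then per (employee, hour) pair.
print(df.groupby("employee")["dissatisfied"].mean().sort_values(ascending=False))
print(df.groupby(["employee", "hour"])["dissatisfied"].mean().nlargest(10))
```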
What do you want to know? Which service person to fire, what the best hours to provide the service are, or something else? I mean, what are your classes?
Provided you want to evaluate the service persons, the classes are the persons. For an SVM (and I think the same applies to NNs) I would split all non-numerical data into boolean attributes.
Age: unchanged, number
Gender: male 1/0, female 1/0
Date: 7 features for the days of the week, possibly the number of days of experience of the service person; for each special date, an attribute, e.g. national holiday 1/0
Time: split the time span into reasonable ranges, e.g. 15 min; each range is a feature
Satisfaction: unchanged - number 1-10
With this model you could predict the satisfaction index for each service person for a given date, time, gender, and age.
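A sketch of that encoding with scikit-learn (the timestamp column and the per-person models are assumptions):

```python
import pandas as pd
from sklearn.svm import SVR

# Hypothetical columns: age, gender, timestamp, employee, satisfaction.
df = pd.read_csv("visits.csv")
ts = pd.to_datetime(df["timestamp"])

X = pd.concat([
    df[["age"]],                                    # age: unchanged
    pd.get_dummies(df["gender"], prefix="sex"),     # male 1/0, female 1/0
    pd.get_dummies(ts.dt.dayofweek, prefix="dow"),  # 7 day-of-week features
    pd.get_dummies(ts.dt.hour * 4 + ts.dt.minute // 15,
                   prefix="slot"),                  # 15-minute time ranges
], axis=1)

# One regressor per service person, predicting the satisfaction index.
models = {
    name: SVR().fit(X[df["employee"] == name],
                    df.loc[df["employee"] == name, "satisfaction"])
    for name in df["employee"].unique()
}
```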
I guess you can try using anomaly detection algorithms. Basically, if you consider the satisfaction level as the dependent variable, then you can try to find the samples which are located away from the majority of the samples in Euclidean space. These outlying samples could signify dissatisfaction.
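No specific algorithm is named above; as one concrete stand-in, an Isolation Forest flags the samples that sit away from the majority (column names are assumptions):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("visits.csv")  # hypothetical columns as above
X = pd.get_dummies(df[["age", "gender", "hour", "employee", "satisfaction"]])

# fit_predict returns -1 for points lying far from the bulk of the data.
df["outlier"] = IsolationForest(random_state=0).fit_predict(X) == -1
print(df[df["outlier"]].head())
```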