I'm looking for a machine learning method to recognize input ranges that result in customer dissatisfaction.
For instance, assume we have a database of each customer's age and gender, the date and time the customer stopped by, the person in charge of serving the customer, etc., and finally a number in the range 0 to 10 which stands for customer satisfaction (extracted from the customer's feedback).
Now I'm looking for a method to determine the input ranges that result in dissatisfaction. For example: male customers who are served by John between 10 am and 12 pm are mostly dissatisfied.
I believe there is already some kind of clustering or neural network method for this purpose. Could you help me?
This is not a clustering problem: you have labeled training data, so this is supervised learning.
Instead, you may be looking for a decision tree.
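For instance, a minimal sketch with scikit-learn; the column names here are made up to match your description:

```python
# A minimal sketch, assuming a DataFrame with hypothetical columns
# "age", "gender", "server" (who served the customer), "hour", and the
# 0-10 "satisfaction" score from the feedback.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("visits.csv")                      # hypothetical file

y = (df["satisfaction"] >= 6).astype(int)           # 1 = satisfied
X = pd.get_dummies(df[["age", "gender", "server", "hour"]],
                   columns=["gender", "server"])

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50).fit(X, y)

# The printed split rules are exactly the "input ranges" you are after,
# e.g. a path combining "server_John = 1" and "hour <= 12" that ends in
# the dissatisfied class.
print(export_text(tree, feature_names=list(X.columns)))
```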
There is more than one method to do this (correlation analysis, for example).
One simple way is to classify your data by the degree of satisfaction (the target).
Classes:
0-5 DISSATISFIED
6-10 SATISFIED
Then look for repeated feature values within each class.
For example:
If you are interested in one feature, e.g. the person who served the clients, just find the most frequent name within each of the two classes, to get a result like "80% of dissatisfied clients were served by John".
If you are interested in more than one feature, e.g. the person who served the client AND the time of day, you can treat the pair of features as a single one and do the same thing as in the first case; you will then get something like "30% of dissatisfied clients were served by John between 10 and 11 am".
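If your data is in a pandas DataFrame, this counting takes only a couple of lines (column names are hypothetical):

```python
# A sketch of the frequency counting above, assuming hypothetical
# column names "satisfaction", "server" and "hour".
import pandas as pd

df = pd.read_csv("visits.csv")                      # hypothetical file
dissatisfied = df[df["satisfaction"] <= 5]

# One feature: share of dissatisfied clients per service person
print(dissatisfied["server"].value_counts(normalize=True))

# Two features treated as one: (server, hour) pairs
pairs = dissatisfied.groupby(["server", "hour"]).size() / len(dissatisfied)
print(pairs.sort_values(ascending=False).head())
```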
What do you want to know? Which service person to fire, what the best hours to provide the service are, or something else? I mean: what are your classes?
Suppose you want to evaluate the service persons - then the classes are the persons. For an SVM (and I think the same applies to a NN) I would split all non-numerical data into boolean attributes:
Age: unchanged, a number
Gender: male 1/0, female 1/0
Date: 7 features for the days of the week, possibly the number of days of experience of the service person, and an attribute for each special date, e.g. national holiday 1/0
Time: split the time span into reasonable ranges, e.g. 15 min. Each range is a feature.
Satisfaction: unchanged - a number 0-10
With this model you could predict the satisfaction index for each service person for a given date, time, gender, and age.
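As a rough sketch of this encoding in Python (the column names, the 15-minute grid, and the choice of SVR are my assumptions, not a fixed recipe):

```python
# A rough sketch of the boolean encoding described above.
import pandas as pd
from sklearn.svm import SVR

df = pd.read_csv("visits.csv")                      # hypothetical file
ts = pd.to_datetime(df["timestamp"])

X = pd.DataFrame({"age": df["age"]})
X["male"] = (df["gender"] == "male").astype(int)
X["female"] = (df["gender"] == "female").astype(int)

# 7 boolean features for the day of the week
for d in range(7):
    X[f"dow_{d}"] = (ts.dt.dayofweek == d).astype(int)

# one boolean feature per 15-minute range of the day (96 in total)
quarter = ts.dt.hour * 4 + ts.dt.minute // 15
for q in range(96):
    X[f"t_{q}"] = (quarter == q).astype(int)

# regress the 0-10 satisfaction score on the encoded features
model = SVR().fit(X, df["satisfaction"])
```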
I guess you can try anomaly detection algorithms. Basically, if you consider the satisfaction level as the dependent variable, you can try to find the samples that lie far away from the majority of the samples in Euclidean space. These outlying samples could signify dissatisfaction.
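If you want to try this, here is a minimal sketch with IsolationForest as one possible detector (column names and the contamination rate are assumptions):

```python
# A sketch of the anomaly detection idea with scikit-learn.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("visits.csv")                      # hypothetical file
X = pd.get_dummies(df[["age", "gender", "server", "hour", "satisfaction"]])

labels = IsolationForest(contamination=0.1, random_state=0).fit_predict(X)

# -1 marks the outliers; check whether they coincide with low scores
print(df.loc[labels == -1, "satisfaction"].describe())
```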
Let's say I have a large dataset from an online gaming platform (like Steam) which has 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I'm not sure what to do when I have millions of unique classes. Also, please suggest any other method we could use to preprocess the data.
Using the user id directly in the model is not a good idea, since, as you said, that would result in a large number of features, but also in overfitting, because you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain your model each time you get a new user.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to perform clustering on the users based on the other features, and then pass the cluster as a feature instead of the user id, but I don't know whether this is a good idea since I don't know the kind of data you have.
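Something like this, as a sketch (the aggregated features are just examples built from the columns you listed, and k is chosen arbitrarily):

```python
# A sketch of the "cluster instead of user id" idea.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("plays.csv")                       # hypothetical file

users = df.groupby("user_id").agg(
    mean_hours=("number_of_hours_played", "mean"),
    max_games=("no_of_games", "max"),
)
users["cluster"] = KMeans(n_clusters=20, random_state=0).fit_predict(users)

# the model now sees a coarse cluster label instead of millions of ids
df = df.merge(users[["cluster"]], left_on="user_id", right_index=True)
```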
Also, you are talking about making a prediction for a given date. The data you described doesn't suggest that, but if you have the number of hours over multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I have an unlabeled corpus from which I extracted (SUBJECT, RELATION, OBJECT) triples. For relation extraction I use Stanford OpenIE. But I only need some of these triples; for example, I need the relation "funded".
Text:
Every startup needs a steady diet of funding to keep it strong and growing. Datadog, a monitoring service that helps customers bring together data from across a variety of infrastructure and software is no exception. Today it announced a massive $94.5 million Series D Round. The company would not discuss valuation.
From this text I want to extract the triple (Datadog, announced, $94.5 million Round).
I have only one idea:
Use Stanford coreference resolution to detect that 'Datadog' in the first sentence and 'it' in the second sentence are the same entity
Try to cluster the relations, but I think that won't work well
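For idea 1, a sketch of how this could look against a locally running CoreNLP server; the "openie.resolve_coref" property (replace pronouns with their representative mention before extraction) and the keyword filter are assumptions to verify against your CoreNLP version:

```python
# A sketch of coref-aware OpenIE triple filtering via the CoreNLP server.
import json
import requests

props = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref,openie",
    "openie.resolve_coref": "true",
    "outputFormat": "json",
}
text = ("Every startup needs a steady diet of funding to keep it strong "
        "and growing. Datadog, a monitoring service, is no exception. "
        "Today it announced a massive $94.5 million Series D Round.")

resp = requests.post("http://localhost:9000/",
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))

WANTED = {"announced", "funded"}                    # relations you care about
for sentence in resp.json()["sentences"]:
    for t in sentence.get("openie", []):
        if any(w in t["relation"] for w in WANTED):
            print((t["subject"], t["relation"], t["object"]))
```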
Maybe there is a better approach? Maybe I need a labeled corpus (which I don't have)?
I have a question:
I have a domain class: LoanAccount. We have different loan products, but they differ only in how the interest is calculated.
For example:
1. A Regular Loan calculates interest using the Annuity Interest Rate formula
2. A Vehicle Loan calculates interest using the Flat Interest Rate formula
3. A Temporary Loan calculates interest with yet another formula (I have no idea what that is).
We could also change the rules every year ... and use a different formula as well ...
My Question:
Should I put all the formula logic in services?
Should I make every loan a different domain class?
Or should I make one domain class that has the different interest rate calculation methods?
Any example would be good :)
Thank you in advance !
My suggestion is to separate interest calculating logic from the domain objects.
Hard-wiring the domain object to its interest calculation is likely to lead you into trouble:
It would be more complicated to change the type of interest calculation for an existing account type (which could well be a business request).
When a new account type is created, you can easily reuse all the calculation methods you have already implemented.
It's likely that the interest-calculating algorithm will grow in complexity in the future, and it may need inputs that should not be part of the Account domain object, like business constants, a list of transactions, etc.
Grails (because of Spring) naturally supports keeping business logic in services (declarative transactions etc.) rather than in the domain objects. You will always have less pain going along with the framework than against it.
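To make the shape of this concrete, here is a sketch in Python for brevity; in Grails each calculator would naturally be its own service bean, and the formulas and names here are illustrative, not your real business rules:

```python
# Sketch: the account refers to a calculator instead of hard-wiring one.
from dataclasses import dataclass
from typing import Protocol


class InterestCalculator(Protocol):
    def interest(self, principal: float, annual_rate: float, months: int) -> float: ...


class AnnuityInterestCalculator:
    def interest(self, principal, annual_rate, months):
        m = annual_rate / 12
        payment = principal * m / (1 - (1 + m) ** -months)
        return payment * months - principal          # total interest paid


class FlatInterestCalculator:
    def interest(self, principal, annual_rate, months):
        return principal * annual_rate * months / 12


@dataclass
class LoanAccount:
    principal: float
    annual_rate: float
    months: int
    # swapping the rule for an existing product (or adding next year's
    # formula) is a configuration change, not a domain-class change
    calculator: InterestCalculator

    def interest(self) -> float:
        return self.calculator.interest(self.principal, self.annual_rate, self.months)


vehicle_loan = LoanAccount(10_000, 0.08, 24, FlatInterestCalculator())
print(vehicle_loan.interest())                       # 1600.0
```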
For an academic project I have to analyse a customer database of an insurance company.
This insurance company would like to identify a couple of things: first of all, classifying customers who leave the company, in order to make them offers and such.
They would also like to know which customers to target for upselling or cross-selling, as well as to find risky customers in terms of insurance claims.
So I am focusing on customer cancellations, as that seems the most important task.
The attributes provided by the insurance company are:
Bundled/Unbundled, Policy Status, Policy Type, Policy Combination, Issue Date, Effective Date, Maturity Date, Policy Duration, Loan Duration, Cancellation Date, Reason for cancellation, Total Premium, Splitter Premium, Partner ID, Agency ID, Country Agency, Zone ID, Agency potential, Sex Contractor, Birth Year Contractor, Job Contractor, Sex Insured, Job Insured, Birth Year Insured, Year Claim, Claim Status, Claim Provision, Claim Payments
The database is composed of ~200k records and there are many missing values for some attributes.
I started using RapidMiner to mine the dataset.
I cleaned the dataset a bit, removing incoherent or wrong values.
I then tried applying decision trees, adding a new attribute called isCanceled, derived from Policy Status (which can be issued, renewed, or cancelled), and using it as the label of the decision tree.
I tried changing every single parameter of the decision tree, but I either get a tree with only one leaf node and no splits, or a tree that is completely irrelevant because its leaf nodes contain almost the same number of instances of the two classes.
This is getting really frustrating.
I'd like to know what the usual procedures for churn analysis are, possibly using RapidMiner. Can anybody help me?
In my experience most data mining or machine learning activities spend most of their time cleaning, tidying, formatting and understanding the data.
Assuming this has been done, then as long as there is a relationship between some or all of the attributes and the label to be predicted it will be possible to perform some sort of churn analysis.
There are lots of ways to determine this relationship, of course, but a quick way is to try one of the Weight By operators. These output a weight for each attribute, with weights near 1 indicating attributes that are potentially more predictive of the label.
If you determine there are attributes of value, you can use Decision Trees or another operator to build a model that can be used to predict. The attributes you have are a mix of nominal and numeric types, so Decision Trees will work, and this operator is easy to visualize anyway. The tricky part is getting the parameters right, and the way to do this is to observe the performance of the model on unseen data as the parameters are varied. The Loop Parameters operator can help with this.
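If you ever want to cross-check RapidMiner's output, a rough scikit-learn counterpart of Weight By plus Loop Parameters could look like this (the column names come from your list; everything else is an assumption):

```python
# Rough scikit-learn analogue of "Weight By" and "Loop Parameters".
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("policies.csv")                     # hypothetical file
y = (df["Policy Status"] == "cancelled").astype(int) # the isCanceled label
# drop the columns that trivially leak the label
X = pd.get_dummies(df.drop(columns=["Policy Status", "Cancellation Date",
                                    "Reason for cancellation"])).fillna(0)

# counterpart of a Weight By operator: one relevance weight per attribute
weights = pd.Series(mutual_info_classif(X, y), index=X.columns)
print(weights.sort_values(ascending=False).head(10))

# counterpart of Loop Parameters: score a parameter grid on held-out folds
grid = GridSearchCV(DecisionTreeClassifier(),
                    {"max_depth": [3, 5, 8], "min_samples_leaf": [10, 50, 200]},
                    cv=5, scoring="roc_auc").fit(X, y)
print(grid.best_params_, grid.best_score_)
```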
I'm a huge football (soccer) fan and interested in machine learning too. As a project for my ML course I'm trying to build a model that predicts the chance of the home team winning, given the names of the home and away teams. (I query my dataset and create data points based on previous matches between those two teams.)
I have data for several seasons for all teams, but I have the following issue that I would like some advice on. The EPL (English Premier League) has 20 teams which play each other home and away (380 total games in a season). Thus, each season, any two teams play each other only twice.
I have data for the past 10+ years, resulting in 2*10 = 20 data points for any two teams. However, I do not want to go back more than 3 years, since I believe teams change considerably over time (ManCity, Liverpool) and older data would only introduce more error into the system.
So this results in just around 6-8 data points for each pair of teams. However, I do have quite a few features (20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form, etc.
However, the idea of having only 6-8 data points to train with seems wrong to me. Any thoughts on how I could counter this problem (if it is a problem in the first place)?
Thanks!
EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).
That's an interesting problem which I don't think has a unique solution. However, there are a couple of things I would try in your position.
I share your concern that 6-8 points per class is too little data to build a reliable model, so I would model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home win / away win), and I would add two features: one identifying the home team and one identifying the away team. In that setup you can still predict which team will win given whether it plays home or away, and you have far more data to produce a result.
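As a sketch of that reformulation (the column names and the logistic regression are my assumptions):

```python
# Sketch: teams as features, one training point per match in the league.
import pandas as pd
from sklearn.linear_model import LogisticRegression

matches = pd.read_csv("epl.csv")                     # hypothetical file

# teams become one-hot features (separately in home and away roles);
# every match is now a training point, not just the 6-8 meetings of
# one particular pair of teams
X = pd.get_dummies(matches[["home", "away"]], columns=["home", "away"])
y = matches["home_win"]                              # 1/0 label

model = LogisticRegression(max_iter=1000).fit(X, y)

# chance of a home win for one fixture (teams assumed present in the data)
fixture = pd.DataFrame(0, index=[0], columns=X.columns)
fixture[["home_Arsenal", "away_Chelsea"]] = 1
print(model.predict_proba(fixture)[0, 1])
```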
Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model, and you could benefit from the additional data (assuming those features are valid in other leagues).
I have a similar system - a good source of data is football-data.co.uk.
I have used the last N seasons for each league and built a model (believe me, more than 3 years is a must!). It depends on your criterion function: whether the criterion is best fit or maximum profit, you can build your own prediction model accordingly.
One very good thing to know is that each league is different; bookmakers also give different home-win odds on the favourite in Belgium than in the 5th English league, where you can find really good value odds, for instance.
Out of that you can compile an interesting model, such as betting tips to beat the bookmakers on specific matches, using your pattern to find value bets. Or you can try to chase as many winning tips as you can, but possibly earn less (draws pay a lot even though fewer of them win).
Hopefully I gave you some ideas; for more, feel free to ask.
I don't know if this is still helpful, but features like full-time goals, half-time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match that you want to classify.
I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also compute for the new match, e.g.: the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which team is the home team (already mentioned), goals scored in the last few matches home and away, etc.
Having only 6-8 matches is the real problem. This dataset is very small and there would be a lot of overfitting, but if you use features like the ones I mentioned, I think you could also use older data.
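As a sketch of how such pre-match form features could be computed with pandas (the columns "date", "home", "away", "home_goals", "away_goals" are hypothetical):

```python
# Sketch: rolling form features using only information known before kick-off.
import pandas as pd

matches = pd.read_csv("matches.csv", parse_dates=["date"]).sort_values("date")

def recent_goals(team, before, n=5):
    """Average goals scored by `team` in its last `n` matches before `before`."""
    played = matches[((matches["home"] == team) | (matches["away"] == team))
                     & (matches["date"] < before)].tail(n)
    if played.empty:
        return 0.0
    goals = played.apply(lambda m: m["home_goals"] if m["home"] == team
                         else m["away_goals"], axis=1)
    return goals.mean()

# only information available BEFORE the match goes into the feature row
matches["home_form"] = [recent_goals(h, d)
                        for h, d in zip(matches["home"], matches["date"])]
matches["away_form"] = [recent_goals(a, d)
                        for a, d in zip(matches["away"], matches["date"])]
```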