I'm looking for a good algorithm recommendation.
I have Users and Achievements. Users create Achievements and then give them to other Users. Associated with each Achievement is the point value that the user specifies. A User's total points is the sum of all their achievements.
Basically:
    class Achievement:
        def __init__(self, owner, points):
            self.owner = owner    # the user (Alias) who created and gives this achievement
            self.points = points  # point value that the owner specified

    class User:
        def __init__(self):
            self.achievements = []  # achievements this user has received

        def points(self):
            return sum(a.points for a in self.achievements)
Ok, so this system is obviously very game-able. You can make many accounts and give tons of achievements to each other. I'm trying to reduce that a little bit by scaling the point values to something different from what the user specified.
Assuming all users are honest, but they just gauge difficulty differently, how should I normalize the point values? E.g. if one user gives 5 points for every easy achievement and another gives 10 points, how can I normalize them to one value? The goal would be a distribution where points are proportional to difficulty.
If one user isn't very good at judging point values, how can I figure out difficulty based on the number of users that have gotten the achievement?
Assume that Users could be mostly partitioned into disjoint groups, with one User giving achievements to a whole set of other ones. Does that help the previous two algorithms? For example, User A only gives achievements to Users that end with an odd number and User B only gives achievements to Users that end with an even number.
If everyone is malicious, how close can I get to preventing users from being able to hyper-inflate their point values?
Note: The quality of a giving user is not in any way related to how many achievements he has received. Many givers are just bots that haven't received anything themselves but automatically reward users for doing certain actions.
My current plan is something like this. I have an allocation of 10 points per person that has received an achievement from me. If I have given out 10 achievements to 55 people total, my allocation is 550. This allocation is then divided among the achievements based on the number of people who got each one. If the distribution was [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] people who got each achievement, then the point values would be [50, 25, 16.6, 12.5, 10, 8.3, 7.1, 6.25, 5.5, 5].
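For concreteness, here is a rough sketch of one reading of that allocation scheme: each achievement's value is taken to be inversely proportional to the number of people who received it, and the values are rescaled so that the total points handed out equal the giver's allocation. The exact scaling constant is an assumption on my part (the example numbers above imply a slightly different one), and all names are illustrative.

    def achievement_values(recipient_counts, points_per_recipient=10):
        """recipient_counts[i] = number of people who received achievement i.
        Returns the point value of each achievement."""
        allocation = points_per_recipient * sum(recipient_counts)
        # Rarer achievements (fewer recipients) are worth more.
        weights = [1.0 / c for c in recipient_counts]
        # If achievement i is worth v_i, the total handed out is sum(v_i * count_i);
        # scale the weights so that this total equals the allocation.
        handed_out = sum(w * c for w, c in zip(weights, recipient_counts))
        scale = allocation / handed_out
        return [w * scale for w in weights]

    # achievement_values([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) -> roughly [55.0, 27.5, 18.3, ...]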
Any problems with my approach and alternative recommendations are welcome and appreciated. Also, post other cases that you can think of that I've missed, and I'll add them to the list. Thanks!
I think that in your system, as on Stack Overflow, Digg, Slashdot, etc., your basic goals are to:
Identify honest users
Promote their actions
Generally we identify honest users by their actions: those accounts that have existed for a long time on the site and have been vetted by other users, and by you. Stack Overflow uses the reputation score for this, Slashdot uses karma points.
Once you identify these honest users then you can have their votes count in proportion to the reputation score: the more honest a user seems to be, the more we trust his achievements.
Thus, you might give new accounts an initial score of 10. That user can then give any number of achievements he wants, but their actual total value will be 10 (like the proportional allocation you suggest). That is, if a new user gives 100 achievements (all worth the same number of points), then each one will be worth 0.1 points because his score is 10. Then, as that user gets achievements from other users, his score increases.
Basically, I'm suggesting you use PageRank, but instead of ranking web pages you are ranking users, and instead of hyperlinks the links are achievements given by that user to others.
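A minimal sketch of that PageRank-over-users idea, using plain power iteration over an in-memory dict; the `gave` structure and all names are illustrative assumptions, not part of the answer.

    # gave[u] = list of users that u has given achievements to.
    def user_rank(gave, damping=0.85, iterations=50):
        users = set(gave) | {v for targets in gave.values() for v in targets}
        rank = {u: 1.0 / len(users) for u in users}
        for _ in range(iterations):
            new_rank = {u: (1 - damping) / len(users) for u in users}
            for giver, receivers in gave.items():
                if not receivers:
                    continue  # users who gave nothing contribute nothing in this simplification
                share = damping * rank[giver] / len(receivers)
                for r in receivers:
                    new_rank[r] += share
            rank = new_rank
        return rank  # higher rank = more trusted; scale a user's achievement values by it

    # Example: user_rank({'alice': ['bob', 'carol'], 'bob': ['carol'], 'carol': []})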
That's one way to solve this problem. There are many others. It depends on your specific needs. Auctions are always fun. You can have everyone bid on an achievement before it is actually achieved in order to establish the price (score) that the community places on that achievement. You will need to limit the amount of 'money' people have.
I've been struggling with this type of problem on my own site. If you have a large corpus of existing data you can use as a baseline, score normalization seems pretty effective. First get the mean value and standard deviation for the user's created achievements:
SELECT AVG(Points) AS user_average,
STDDEV_POP(Points) AS user_stddev
FROM Achievements WHERE Owner = X
Use these values to calculate a context-free "z-score":
$zscore = ($rating - $user_average) / $user_stddev;
Get the mean and standard deviation for all achievements:
SELECT AVG(Points) AS all_average,
STDDEV_POP(Points) AS all_stddev
FROM Achievements
Use these values to create a normalized "t-score":
$tscore = $all_average + ($all_stddev * $zscore);
Then use the t-score as your internal representation of an achievement's value. YMMV. :)
Correct, $rating is input and $tscore is the normalized output.
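Putting the pieces together, here is a minimal end-to-end sketch in Python of the same normalization, assuming the point values are already loaded into memory; the function and variable names are illustrative.

    from statistics import mean, pstdev

    def t_score(points, owner_points, all_points):
        """Map one achievement's raw points onto the site-wide scale."""
        user_avg, user_std = mean(owner_points), pstdev(owner_points)
        all_avg, all_std = mean(all_points), pstdev(all_points)
        if user_std == 0:
            return all_avg  # the owner gives every achievement the same value
        zscore = (points - user_avg) / user_std
        return all_avg + all_std * zscore  # the "t-score"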
Ideally, everyone would assign points for their achievements on an identical scale. One point for stupid or trivial achievements, ten points for modest achievements, 50 points for truly epic achievements, or whatever. But people have very different behavior when it comes to assigning scores. Some will be very generous, and make every achievement worth the max. Others will be strict and accurate, adhering carefully to the scale as it relates to the difficulty of the achievement. Others may think it's dumb that people worry about points, and assign the minimum value for all the achievements they create.
Normalization attempts to handle these individual abnormalities and fit everyone's ratings to the same scale. It's like what they do with the judges' scores in the Olympics. You don't "blindly trust" the value a user assigned to an achievement, but it's something you want to account for if it's part of the system. Otherwise you could presumably just hard-code the point value of achievements, limit how often they can be created, and it sounds like that would curb the worst abuse. But the score is useful because, after normalization, you can figure out what the achievement's value would be worth if it was created by a stereotypically average user. That makes it difficult for people to "game" the system because the further they get from the average value and distribution for achievements, the more their own values get normalized back towards the baseline.
I should mention that I am not a professionally trained programmer, and I have never taken a statistics class or any higher math. Due to my own limitations of understanding, perhaps I'm not the best person to be explaining this. But I have been struggling with a similar problem on my own site (user-to-user ratings), and after trying numerous approaches this one seems like the most promising. Most of the inspiration for the implementation came from http://www.ericdigests.org/2003-4/score-normilization.html so you might like to read that as well.
Related
The problem I would like to solve is how to choose the best seats on a train based on some ordered user preferences, e.g. whether they'd like a seat facing forwards or backwards (or don't care), whether they'd like a seat at a table or not, whether they need to be near a toilet, luggage rack, buffet car, or near the door, window / aisle seat, and whether they want the aisle to the left or the right (which can be very important for someone with a stiff knee!).
Most customers will specify one or two preferences, others may specify more. For some, being near the toilet might be the most important factor, for others having that table to work at might be the most important.
There may be more than one passenger (although they will share preferences). These should be sat as close to each other as possible. 2 passengers would ideally be sat next to each other, or opposite each other at a table seat. A group of 8 passengers might best be split into 2 groups of 4 or 4 groups of 2...
Position is defined by carriage number (seats in the same carriage are better than seats in different carriages) and by x/y coordinate within that carriage, so it is easy enough to calculate the distance between any pair of seats, but a BIG job to calculate distances between EVERY pair of seats...
Each [available] seat (pre-filtered by ticket class) will have the above attributes either defined or set to NULL (for unknown - seat facing is often unknown).
So for training I can provide a vast array of example trains and customer preferences with the best balance of preferences versus position.
For execution I want to provide a run-time-specific array of seats with attributes, a set of user preferences, and a set of weightings for those preferences (e.g. passenger 1 thinks being near the toilet is most important, passenger 2 thinks having a table is most important, passenger 3 thinks being in the quiet carriage is...), and finally the number of passengers.
Output will be an array of seats (one per passenger) that strike the best compromise between matching as many customer preferences as possible (usually not possible to match all preferences) and keeping the seats fairly close to each other.
e.g. We might be able to match 2 preferences with seats 2 rows apart, but match 3 preferences with seats 10 rows apart...
Obviously distance will need a weighting just like the individual preferences, and that weighting is what lets you choose between those two options. I suppose a distance not greater than X becomes just one more customer preference...
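As a point of comparison with the ML route, here is a minimal sketch of that "weighted preferences plus distance" idea as a plain scoring function; every attribute name and weight below is illustrative, not something from the original post.

    def seat_score(seat, preferences, weights, distance_to_group, distance_weight=1.0):
        """seat: dict of attribute -> value (None = unknown, e.g. seat facing);
        preferences: dict of attribute -> desired value;
        weights: dict of attribute -> importance for this passenger;
        distance_to_group: distance from this seat to the rest of the party."""
        score = 0.0
        for attr, wanted in preferences.items():
            if seat.get(attr) is None:
                continue  # unknown attribute: neither reward nor penalise
            if seat[attr] == wanted:
                score += weights.get(attr, 0.0)
        # Treat distance as just one more (negative) preference, as suggested above.
        return score - distance_weight * distance_to_group

    # best = max(free_seats, key=lambda s: seat_score(s, prefs, wts, dist_to_group(s)))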
I've not done any ML work before, so it's all going to be a learning exercise for me. I wish I had the time to just play and see what comes out, but I don't. I'm happy to do that, but I need to have a reasonable expectation of a positive result; otherwise I'll have to focus on a more traditional approach. Limited time and all that...
So, my questions are:
Is this a suitable problem for machine learning?
If so, is brain.js a good choice, or is something else more suitable? AWS ML service perhaps?
Any advice on how to organise all my data into something suitable for an ML engine to process?
Machine Learning is good at finding hidden patterns in complex data. In your case, you would need a lot of data where user preferences are already matched with optimal seating arrangements.
You could then try to see if the ML model can actually make optimal seating arrangements by itself. It’s an interesting problem but it may also lead to unexpected seating :)
If you don’t have training data you could collect it live, by registering where people sit down, knowing their preferences.
I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show up.
If I fetch the highest-rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1); a rough code sketch of this scoring follows the list:

score = 0.25 * (1 - (distance - distance of closest venue in remaining set) / (distance of furthest venue - distance of closest venue))
      + 0.75 * (quality score / highest quality score in remaining set)
Sort them by score and take the top n
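A minimal Python sketch of the scoring and sorting steps above, assuming the venues have already been filtered; the field names and the min-max normalization of distance are assumptions on my part.

    def rank_venues(venues, n=10, w_dist=0.25, w_quality=0.75):
        """venues: list of dicts with 'distance' (e.g. miles) and 'quality' (e.g. 0-100)."""
        d_min = min(v['distance'] for v in venues)
        d_max = max(v['distance'] for v in venues)
        q_max = max(v['quality'] for v in venues)
        def score(v):
            # Min-max-normalized distance, inverted so that closer is better.
            dist_part = (1 - (v['distance'] - d_min) / (d_max - d_min)) if d_max > d_min else 1.0
            qual_part = v['quality'] / q_max if q_max else 0.0
            return w_dist * dist_part + w_quality * qual_part
        return sorted(venues, key=score, reverse=True)[:n]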
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at say 7.
Discard all venues with scores in the lowest quartile of reviewer scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, then look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
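Here is a rough sketch of that postcode-based selection, with illustrative field names and a crude quartile cutoff; a sketch under those assumptions, not a definitive implementation.

    def recommend(venues, postcode, neighboring_postcodes, n=7):
        """venues: list of dicts with 'postcode' and 'score' (e.g. 0-100)."""
        scores = sorted(v['score'] for v in venues)
        cutoff = scores[len(scores) // 4]  # approximate lowest-quartile cutoff
        good = [v for v in venues if v['score'] >= cutoff]
        by_score = lambda v: v['score']
        local = sorted((v for v in good if v['postcode'] == postcode),
                       key=by_score, reverse=True)[:n]
        if len(local) < n:  # top up from the neighboring postcodes
            nearby = sorted((v for v in good if v['postcode'] in neighboring_postcodes),
                            key=by_score, reverse=True)
            local += nearby[:n - len(local)]
        return local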
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First, I suggest that you use a Bayesian average to maintain an overall rating for all the venues; more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues ordered by distance and then by rating: two ORDER BY clauses in your query.
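For illustration (in Python rather than Ruby), a Bayesian average typically damps each venue's mean rating towards the global mean. The prior weight and the global mean below are assumptions you would tune, and the linked gem may compute it differently.

    def bayesian_average(ratings, global_mean, prior_weight=10):
        """Damped mean: behaves like the global mean until a venue has enough ratings."""
        return (prior_weight * global_mean + sum(ratings)) / (prior_weight + len(ratings))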
I posted this question on the stats Stack Exchange but unfortunately got no answer so far, so I'm cloning it here and hope someone can help.
I'm a newbie in machine learning. Recently I tried to learn something about it and have the following concern:
I have products classified into categories. I also have users with gender and device model information.
First, I made a chi-square test to check whether categories and gender + device information are associated. For example, my p-value is 0.000012, so I stated that the user (gender + device) is associated with categories.
So if a new user comes along with gender (Female) + device (iPhone):
According to the chi-square test result, there should be an association between gender + device and categories. So I select the top 10 categories that were consumed by Females using iPhones. I've got the list, e.g. [1. Fashion, 2. Mobile devices, 3. Cameras, 4. Home furniture, 5. Bikes, etc.]
I also make a z-test on categories (without any user information), and got the list (higher z-score will be on top), e.g. [1. Mobile devices, 2. Bikes, 3. Fashion, 4. Laptops, etc.]
So in this case, which list should I give to that user? Or any possibility to combine them? Or did I do something wrong?
Thanks in advance :-)
Strictly speaking, neither test is appropriate. In both tests you have a null hypothesis (that gender and model are not related to category), and you are trying to find the probability that this hypothesis is wrong. However, these two tests are parametric tests; that is, for the results to be correct you have to know that the test statistic follows a specific distribution (chi-square and normal distributions respectively). In your case you can make no such assumption, so the tests are not suitable. If you want to use significance tests, you should use a non-parametric test, the Wilcoxon and Friedman tests being the most common. However, significance tests are usually used after the problem has been solved, to check whether the results achieved can be attributed to luck. They are not used to solve the problem.
If you want to find correlation between gender, model and category, you should use some correlation coefficient, such as the Pearson correlation or the intraclass correlation. However, you have not described your data in detail, so I'm not sure what you are trying to achieve. Based on gender and model only, probably the safest and simplest thing you can do is return the most visited categories (by number of occurrences) for women who use an iPhone.
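A minimal sketch of that "most visited categories per (gender, device) segment" suggestion, assuming the interaction data fits in a pandas DataFrame with these illustrative column names.

    import pandas as pd

    def top_categories(interactions, gender, device, n=10):
        """interactions: DataFrame with columns 'gender', 'device', 'category'."""
        segment = interactions[(interactions['gender'] == gender) &
                               (interactions['device'] == device)]
        return segment['category'].value_counts().head(n).index.tolist()

    # top_categories(interactions, 'Female', 'iPhone')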
I have a Machine Learning project: given the reactions of a group of users to a collection of online articles (expressed by means of like/dislike), I need to make a decision for a newly arrived article.
The task dictates that, given each individual's reactions, I should be able to predict whether the newly arrived article should be recommended to the community as a whole.
I have been wondering how am I supposed to incorporate each user's feedback to dictate whether this would be an interesting article to recommend.
Bearing in mind that among the users' reactions there will be users who like and users who dislike the same article, is there a way to incorporate all this information and reach a conclusion about the article?
Thank you in advance.
There are a lot of different ways to determine what's "interesting." I think reddit has a pretty good model to look at in considering different options. They have different categories, like "hot", or "controversial", etc.
So a couple options depending on what you/your professor want:
Take the net number of likes (like = +1, dislike = -1)
Take just the number of likes
Take the total number of ratings (who's read it at all)
Take the ones with the highest percentage of likes vs. dislikes
Some combination of these things
Etc.
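A few of these boil down to one-liners; here is a minimal sketch with illustrative names.

    def article_scores(likes, dislikes):
        total = likes + dislikes
        return {
            'net_likes': likes - dislikes,                  # like = +1, dislike = -1
            'likes': likes,
            'total_ratings': total,                         # how many people rated it at all
            'like_ratio': likes / total if total else 0.0,  # percentage of likes vs. dislikes
        }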
So there are a lot of different things you could try. Maybe try a few and see which produce results most like what you want?
In terms of predicting how a new article compares to the articles you already have information about, that's a much broader question, but I don't think that's what you're asking, and it seems like that's what the Machine Learning project is about.
I am not sure if recommending an article in this way is good, but if this is what your requirement is, then let me suggest an approach.
Approach:
First, for every article give a label (like/dislike) based on the number of likes & dislikes. Now you have a set of articles with like/dislike labels. Based on this data you need to identify whether a new article's label is like or dislike. This is a simple (linear) classification problem, which can be solved using any of the open-source ML frameworks.
Let us say we have:
- n users in the group
- m articles

Sample data:

    user1 article1 like
    user1 article2 dislike
    user2 article3 dislike
    ....
    usern articlem like
Implementation:

    for each article:
        count the number of likes
        count the number of dislikes
        if number of likes > number of dislikes:
            label = like
        else:
            label = dislike
Give this input (articles with labels) to a Naive Bayes (or any other) classifier to build a model.
Use this model to classify the new article.
Output: like/dislike; if you get like, recommend the article.
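A minimal sketch of that classifier step, assuming scikit-learn and assuming each article has already been turned into a numeric feature vector; the feature extraction is not specified in this answer, so the features below are purely illustrative.

    from sklearn.naive_bayes import GaussianNB

    # Illustrative features per article: [word count, number of images, number of links].
    X_train = [[800, 2, 5], [450, 0, 1], [1200, 4, 9], [300, 1, 0]]
    y_train = ['like', 'dislike', 'like', 'dislike']  # labels from the counting step above

    model = GaussianNB().fit(X_train, y_train)
    print(model.predict([[700, 3, 4]])[0])  # -> 'like' or 'dislike' for the new article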
Known issues:
1. What if half of the users like the article and the other half dislike it? Will you consider it a like or a dislike?
2. What if 11 users dislike it and 10 users like it? Is it okay to consider this a dislike?
Such questions should be answered by you or your client as part of requirement clarification.
I'm a huge football (soccer) fan and interested in Machine Learning too. As a project for my ML course I'm trying to build a model that would predict the chance of winning for the home team, given the names of the home and away teams. (I query my dataset and accordingly create data points based on previous matches between those 2 teams.)
I have data for several seasons for all teams; however, I have the following issues that I would like some advice on. The EPL (English Premier League) has 20 teams which play each other at home and away (380 total games in a season). Thus, each season, any 2 teams play each other only twice.
I have data for the past 10+ years, resulting in 2*10=20 datapoints for the two teams. However I do not want to go past 3 years since I believe teams change quite considerably over time (ManCity, Liverpool) and this would only introduce more error into the system.
So this results in just around 6-8 data points for each pair of teams. However, I do have several features (up to 20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form, etc.
However, the idea of having only 6-8 data points to train with seems incorrect to me. Any thoughts on how I could counter this problem? (If this is a problem in the first place, that is.)
Thanks!
EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).
That's an interesting problem which I don't think has a unique solution. However, there are a couple of little things that I could try if I were in your position.
I share your concern about 6-8 points per class being too little data to build a reliable model. So I would try to model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home/away), and I would add two features, one for the home team and another for the away team. In that setup, you can still predict which team would win given whether it is playing at home or away, and your problem has more data to produce a result.
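A minimal sketch of that reshaping with pandas; the column names and the two example rows are illustrative only.

    import pandas as pd

    matches = pd.DataFrame([
        {'home_team': 'Arsenal', 'away_team': 'Chelsea', 'home_goals': 2, 'away_goals': 1},
        {'home_team': 'Chelsea', 'away_team': 'Arsenal', 'home_goals': 0, 'away_goals': 0},
    ])
    # Binary label: did the home side win?
    matches['home_win'] = (matches['home_goals'] > matches['away_goals']).astype(int)
    # One-hot encode team identities so they become features rather than classes.
    X = pd.get_dummies(matches[['home_team', 'away_team']])
    y = matches['home_win']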
Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model, and you could benefit from the additional data (assuming that those features are valid in other leagues).
I have a similar system; a good base for source data is football-data.co.uk.
I have used the last N seasons for each league and built a model (believe me, more than 3 years is a must!). It depends on your criterion function: whether the criterion is best fit or maximum profit, you may build your own predictive model.
One very good thing to know is that each league is different; also, bookmakers give different home-win odds on the favorite in Belgium than in the 5th English league, where you can find real value odds, for instance.
Out of that you can compile an interesting model, such as betting tips to beat bookmakers on specific matches using your pattern, in order to have value bets. Or you can try to chase as many winning tips as you can, but possibly earn less (draws earn a lot of money even though fewer draws win).
Hopefully I gave you some ideas, for more feel free to ask.
I don't know if this is still helpful, but features like full-time goals, half-time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match that you want to classify.
I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also compute for the new match, e.g. the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which team is the home team (already mentioned), goals scored in the last few matches home and away, etc.
Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.