Machine learning - predicting multiple variables that add up to 100%

I am working on some toy models for the current US presidential election. There are four candidates and each will win some % of the vote. It's my goal to predict each candidate's %.
So far I have tried building a data set with one learned variable (% of vote received) and several dozen dependent variables. Using WEKA, I have experimented with MLP and several other learning methods. My issue is that once I learn a model for vote %, my predictions for each candidate's share of the vote never add up to 100%.
Clearly, in this case it's a necessity that the total % of votes received adds up to 100%. Am I approaching the problem wrong? What can I do to improve my method?

This happens because the interdependence of the percentages isn't embedded in your equations. If you were writing it from scratch, somewhere you would include an equation forcing v0 + v1 + v2 + v3 = 1. These are called constraints, and they are used to impose some underlying physics on your set of equations that the equations would otherwise not respect.
I don't oppose abhiieor's answer, however. If you have a model that seems logical to you, scaling is a fine option:
v0 = v0 * 100 / ( v0 + v1 + v2 + v3 ) etc.
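For example, a minimal Python sketch of that rescaling step, assuming `raw` holds the four independently predicted vote shares (the numbers below are made-up model outputs, purely for illustration):
raw = [27.5, 31.2, 18.9, 25.1]             # hypothetical per-candidate predictions, in percent
total = sum(raw)
scaled = [100.0 * v / total for v in raw]  # same rescaling as v0 = v0 * 100 / (v0 + v1 + v2 + v3)
print(scaled, sum(scaled))                 # the rescaled shares now sum to 100 (up to rounding)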

How to identify the normalized feature

I have been trying to solve a problem from a Coursera exam. I am not seeking the solution itself, but I need the steps and concepts to work through it.
Can anyone share the concepts and steps to help me find the solution?
UPDATE:
I was expecting a down-vote, and it's not unusual, as it's the easiest thing people can do. I am seeking direction on how to solve the problem, as I wasn't able to work it out after watching the videos on Coursera. I hope someone sensible out there can share a direction and the steps to achieve this goal.
Mean Normalization
Mean normalization, also known as 'standardization', is one of the most popular techniques of feature scaling.
Andrew Ng describes it on slide 12a of lecture 4.
How to resolve the problem
The problem asks you to standardize the first feature in the third example: midterm = 94. We just have to solve the equation!
Just for clarity, the notation:
μ (mu) = "avg value of x in training set", in other words: the mean of the x1 column.
σ (sigma) = "range (max - min)", literally σ = max - min (of the x1 column).
So, applying x_std = (x - μ) / σ:
μ = (89 + 72 + 94 + 69) / 4 = 81
σ = 94 - 69 = 25
x_std = (94 - 81) / 25 = 0.52
Result: 0.52
Best regards,
Marco.
The first step in solving this question is to identify what the quantity in question is; from the content of the lecture, it refers to the first feature of the third training case, which is the unsquared version of the midterm score in the third row of the table.
Secondly, you need to understand the concept of normalization. The reason why we need normalization is that the values of some features across the training examples may be much larger than the values of other features, which can give the cost function a badly conditioned shape and make it harder for gradient descent to find the minimum. To solve this, we want all features to have nearly the same scale, and we want each feature's range to be centered at zero.
In this question, we want to scale every feature to a range of 1. To do this, you need to find the max and min values of the feature among all training examples, then squeeze the feature's range down to between 0 and 1. The second step is to find the center value of the feature (the average value in this case) and move it to 0.
I think these are pretty much all the hints I can give you; you should be able to calculate the answer to this question yourself from this point.
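To make the above concrete, here is a minimal Python sketch of the normalization applied to the midterm column from the example (89, 72, 94, 69), using max - min as the range, as in the answer above:
midterm = [89.0, 72.0, 94.0, 69.0]             # the x1 column from the table
mu = sum(midterm) / len(midterm)               # mean = 81.0
rng = max(midterm) - min(midterm)              # range = 94 - 69 = 25.0
normalized = [(x - mu) / rng for x in midterm]
print(normalized[2])                           # third training example: (94 - 81) / 25 = 0.52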

Polynomial regression for one input feature

I am new to machine learning, and I have a question regarding polynomial regression using one feature.
My understanding is that if there is one input feature, we can create a hypothesis function by taking the squares and cubes of that feature.
Suppose x1 is the input feature and our hypothesis function becomes something like this:
h_theta(x) = theta_0 + theta_1*x1 + theta_2*x1^2 + theta_3*x1^3
My question is: what is the use case for such a scenario? For what type of data will this type of hypothesis function help?
This scenario is for simple curve-fitting problems. For example, you might have a spring and want to know how far the spring stretches as a function of how much force you apply (the spring needn't be a linear spring obeying Hooke's law). You could build a model by collecting a bunch of measurements of different forces applied to the spring (measured in Newtons) and the resulting spring extension (also called displacement) in centimeters. You could then build a model of the form F(x) = theta_1 * x + theta_2 * x^3 + theta_3 * x^5 and fit the three theta parameters. You could of course do this with any other single-variable problem (height vs. age, weight vs. blood pressure, current vs. voltage). In practice, though, you generally have many more than just one input variable.
It is also worth pointing out that the transformations needn't be polynomial in the input variable (x in this case). You could just as well try logs, square roots, exponentials, etc. If you're asking why it's always a parameter times a function of the input variable, that is more of a modeling choice than anything (specifically, a linear model, since it's linear in theta). It doesn't have to be this way; it's a simple assumption that restricts the class of functions. Linear models also satisfy some intuitive statistical properties which further justify their use (see here).
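As a minimal sketch of the spring example, here is one way to fit F(x) = theta_1 * x + theta_2 * x^3 + theta_3 * x^5 by ordinary least squares in Python; the force and extension measurements below are made up purely for illustration:
import numpy as np

extension = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # spring extension in cm (hypothetical)
force = np.array([1.1, 2.3, 3.8, 5.9, 8.8, 12.9])      # applied force in Newtons (hypothetical)

# The model is linear in the thetas, so build a design matrix with columns x, x^3, x^5
X = np.column_stack([extension, extension**3, extension**5])
theta, *_ = np.linalg.lstsq(X, force, rcond=None)
print(theta)    # fitted values of theta_1, theta_2, theta_3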

Explanation for Values in Scharr-Filter used in OpenCV (and other places)

The Scharr filter is explained in Scharr's dissertation. However, the values given on page 155 (167 in the PDF) are [47 162 47] / 256. Multiplying this with the derivative filter would yield the full 3x3 kernel.
Yet all other references I found use integer coefficients ([3 10 3], as in OpenCV), which are roughly the same as the ones given by Scharr, scaled by a factor of 32.
Now my guess is that the range can be represented better, but I'm curious if there is an official explanation somewhere.
To get the ball rolling on this question in case no "expert" can be found...
I believe the values [3, 10, 3] ... instead of [47 162 47] / 256 ... are used simply for speed. Recall that this method is competing against the Sobel operator, whose coefficient values are 0 and positive/negative 1's and 2's.
Even though the divisor, 256 or 512, is a power of 2 and the division can be performed by a shift, doing that and multiplying by 47 or 162 is going to take more time. A multiplication by 3, however, can in fact be done on some RISC architectures, like the IBM POWER series, in a single shift-and-add operation; that is, 3x = (x << 1) + x. (On these architectures, the shifter and adder are separate units and can operate independently.)
I don't find it surprising that the PhD thesis used the more complicated and probably more precise formula; it needed to prove or demonstrate something, and the author probably wasn't totally certain or concerned that it be used and implemented alongside other methods. The purpose in the thesis was probably to have "perfect rotational symmetry". Afterwards, when someone decided to implement it, I suspect that person used the approximation and gave up a little of the perfect rotational symmetry to gain speed. That person's goal, as I said, was to have something competitive in speed, at the cost of a little bit of this rotational accuracy.
Since I'm guessing you are willing to do this work as it is your thesis, my suggestion is to implement the original algorithm and benchmark it against both the OpenCV Scharr and Sobel code.
The other thing to try to get an "official" answer is: "Use the source, Luke!" The code is on GitHub, so check it out, see who added the Scharr filter there, and contact that person. I won't put the person's name here, but I will say that the code was added on 2010-05-11.
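As a quick numerical check of the "scaled by a factor of 32" observation above, here is a small sketch. It assumes the derivative part of the separable filter is [-1 0 1] / 2, which is what makes the overall factor come out to roughly 32:
import numpy as np

smooth = np.array([47, 162, 47]) / 256.0      # smoothing coefficients from the dissertation
deriv = np.array([-1, 0, 1]) / 2.0            # assumed derivative filter
k_thesis = np.outer(smooth, deriv)            # 3x3 kernel built from the thesis values
k_int = np.outer(np.array([3, 10, 3]), np.array([-1, 0, 1]))   # integer kernel used in practice

print(np.round(k_thesis * 32, 2))   # approx. [[-2.94 0 2.94] [-10.12 0 10.12] [-2.94 0 2.94]]
print(k_int)                        # [[-3 0 3] [-10 0 10] [-3 0 3]]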

Need help maximizing 3 factors in multiple, similar objects and ordering appropriately

I need to write an algorithm in any language that would order an array based on 3 factors. I use resorts as an example (like Hipmunk). Let's say I want to go on vacation. I want the cheapest spot, with the best reviews, and the most attractions. However, there is obviously no way I can find one that is #1 in all 3.
Example (assuming there are 20 important attractions):
Resort A: $150/night...98/100 in favorable reviews...18 of 20 attractions
Resort B: $99/night...85/100 in favorable reviews...12 of 20 attractions
Resort C: $120/night...91/100 in favorable reviews...16 of 20 attractions
Resort B looks the most appealing in price, but is 3rd in the other 2 categories. However, I can choose Resort C for only $21 more a night and get more attractions and better reviews. Price is still important to me, but Resort A has outstanding reviews and a ton of attractions: is $51 more worth the splurge?
I want to be able to populate a list that orders the resorts from "best to worst" (in quotes because it is subjective to the consumer). How would I go about maximizing the value for each resort?
Should I put a weight on each factor (e.g. 55% price, 30% reviews, 15% amenities), compute a single number for each resort, and order them that way?
Do I need a mode, median and range for all the hotels and determine the average price, and have the hotels around the average price hold the most weight?
If it is a little confusing then check out www.hipmunk.com. They have an airplane sort they call Agony (and a hotel sort which is similar to my question) that they use as their own. I used resorts as an example to make my question hopefully make a little more sense. How does one put math to a problem like this?
I was about to ask the same question about multiple-factor weighted sorting, because my research only came up with answers (e.g. formulas with explanations) for two-factor sorting.
Even though we're both asking about 3 factors, I'll list the possibilities I've found in case they're helpful.
Possibilities:
Note: S is the "sorting score", which is what you'd sort by (asc or desc).
"Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax).
"Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 ....
Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
Bayesian Estimate - looks really similar to ETV; this is how IMDb calculates its rating. See this StackOverflow post for an explanation; equation: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg, where Fx are the same as in #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 and #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but then the problem is how do you do this for more than two factors?
In your case, assigning the weights in #1 would probably be fine (see the sketch at the end of this answer). You'll need to fine-tune the algorithm depending on what your users consider more important - you could expose the weights wx as a filter (like a 1-10 dropdown) so your users can adjust their search on the fly. Or, if you wanted to get clever, you could poll your users before they search ("Which is more important to you?"), assign a weighting set based on the response, and after tracking enough polls you could auto-suggest a weighting scheme based on the most common responses.
Hope that gets you on the right track.
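As a footnote, a minimal sketch of option #1 applied to the three resorts from the question. The 55/30/15 weight split is the example split from the question and is arbitrary; inverting price so that cheaper scores higher, and dividing reviews by 100 and attractions by 20, are choices made for this illustration:
resorts = {
    "A": {"price": 150, "reviews": 98, "attractions": 18},
    "B": {"price": 99,  "reviews": 85, "attractions": 12},
    "C": {"price": 120, "reviews": 91, "attractions": 16},
}
weights = {"price": 0.55, "reviews": 0.30, "attractions": 0.15}
min_price = min(r["price"] for r in resorts.values())

def score(r):
    # Normalize each factor to [0, 1]; price uses min_price / price so lower prices score higher.
    return (weights["price"] * (min_price / r["price"])
            + weights["reviews"] * (r["reviews"] / 100.0)
            + weights["attractions"] * (r["attractions"] / 20.0))

ranked = sorted(resorts, key=lambda name: score(resorts[name]), reverse=True)
print(ranked)   # ['B', 'C', 'A'] with these price-heavy weights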
What about having variable weights, and letting the user adjust them through some input like levers, so that the sort order is dynamically updated?

Recommended anomaly detection technique for simple, one-dimensional scenario?

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try and use my knowledge of the particular domain to detect anomalies - for instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is a bad thing to assume, please let me know. In terms of clustering, unless there are also standard algorithms to choose a k value, I would find it hard to provide this value to a k-means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user and recommend that the data point basically be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so it will not be used as input to another function.
So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
This test is what box plots usually employ (indicated by the whiskers).
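A minimal Python/NumPy sketch of both tests, run on the five example values from the question:
import numpy as np

x = np.array([10, 14, 25, 467, 12])

# Three-sigma rule
mu, std = x.mean(), x.std()
print(x[np.abs(x - mu) > 3 * std])   # empty here: on such a tiny sample the outlier inflates mu and std

# IQR test
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
mild = x[(x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)]
extreme = x[(x < q25 - 3.0 * iqr) | (x > q75 + 3.0 * iqr)]
print(mild, extreme)                 # both flag 467 on this data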
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That, however, isn't applicable to multivariate data.
@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier-detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better-suited technique is DBSCAN, a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, which will be maximal sets of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.
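A minimal sketch of this idea using scikit-learn's DBSCAN on the 1D example data; eps and min_samples are hand-picked here and would need tuning for real data:
import numpy as np
from sklearn.cluster import DBSCAN

x = np.array([10, 14, 25, 467, 12], dtype=float).reshape(-1, 1)
labels = DBSCAN(eps=15, min_samples=2).fit_predict(x)   # -1 marks noise points
print(x[labels == -1].ravel())                          # [467.] -- the noise point is the outlier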
There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there is more than one related set of data, such as a bimodal distribution. It does require you to have some knowledge of how many clusters to expect, but it is fairly efficient and easy to implement.
After you have the means, you could then try to find out if any point is far from any of the means. You can define 'far' however you want, but I would recommend the suggestions by @Amro as a good starting point.
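A minimal sketch of that idea on the example data (k = 1 here, since this data has a single central tendency; the 'far' threshold below is an arbitrary choice made for illustration):
import numpy as np
from sklearn.cluster import KMeans

x = np.array([10, 14, 25, 467, 12], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(x)

# Distance of each point to its assigned cluster centre
dist = np.abs(x - km.cluster_centers_[km.labels_]).ravel()

# Call a point 'far' if its distance exceeds three standard deviations of the distances
print(x.ravel()[dist > 3 * dist.std()])   # flags 467 on this data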
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.
This is an old topic, but it still lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage that the mean and sigma themselves depend on the outliers.
The case of the small-sample limit (see the question for an example) is not adequately covered by 3-sigma, K-Means, IQR, etc.
And I could go on... However, the statistical literature offers a simple metric: the median absolute deviation (MAD). (Medians are insensitive to outliers.)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of python code like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently, you reject values above a certain threshold (the 97.5th percentile of the distribution of the data); for an assumed normal distribution the threshold is 2.24. Here it translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
i.e. the 467 entry is rejected.
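In code, applying that cut-off is one more line (continuing the snippet above):
mad_score = np.abs(x - np.median(x)) / (sts.median_abs_deviation(x) / 0.6745)
print(x[mad_score > 2.24])   # [467] -- the entry to reject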
Of course, one could argue that the MAD (as presented) also assumes a normal distribution. So why doesn't argument 2 above (the small sample) apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different thresholds from different distributions and come to the same conclusion: 467 is the outlier.
Both the three-sigma rule and the IQR test are often used, and they give a couple of simple algorithms to detect anomalies.
The three-sigma rule is correct
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF x > Q75 + 1.5 * IQR OR x < Q25 - 1.5 * IQR THEN x is a mild outlier
IF x > Q75 + 3.0 * IQR OR x < Q25 - 3.0 * IQR THEN x is an extreme outlier
