Sampling with condition - machine-learning

From a pool of points I want to sample a fixed number of points so that the sample satisfies some condition.
Is there any procedure to do this?
If there is any paper on this topic, that would also be helpful.
Example:
Let us consider that we have 10,000 users, and for each user I know their income. Now suppose I want to sample 150 users from this pool so that the mean income of the sample becomes M.
Note: this mean income (the condition) M is not the same as the overall population mean.
Thanks in advance.

If the goal of your procedure is to have equal income distribution in each of your samples, you could use stratified sampling. You make income classes and you draw a random sample of people from each income class.
For more theoretical information see the Wikipedia page here: https://en.wikipedia.org/wiki/Stratified_sampling .
For implementation examples see here: Stratified random sampling from data frame
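As a concrete sketch of that idea (not from the linked answer), you could bin users into income strata with pandas and draw equally from each stratum. The DataFrame, column name and the choice of five quintile strata below are made up for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical pool of 10,000 users with a log-normal income distribution.
users = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=0.5, size=10_000)})

# Bin incomes into strata (here: quintiles), then draw the same number of
# users from each stratum: 5 strata x 30 users = 150 sampled users.
users["stratum"] = pd.qcut(users["income"], q=5, labels=False)
sample = users.groupby("stratum").sample(n=30, random_state=0)

print(len(sample), sample["income"].mean())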

Related

Classification based on likelihood toward maximum or minimum of survey score

So, I'm developing a model to classify a dataset into risk levels.
The dataset is labeled based on a survey score that each subject completed.
From this survey score I have a minimum and a maximum. I've read some papers that label the dataset as 'High' or 'Low' based on the overall average survey score.
What I'm curious about is whether there is any method to build a model that classifies based on the likelihood (for example, a data instance is 60% of the way toward the maximum score), or whether the practical approach is to divide the score by deciles or quartiles.
I'm still new to this kind of problem, so any advice/answers would be really appreciated. Any keywords to search for would also be really appreciated.
Thanks in advance!
First thing to do is to decide the number of risk levels. For instance, for a two-level assignment (i.e. high and low), scores between minimum and median can be assigned to low and scores between median and maximum can be assigned to high.
Similarly, a 4-level assignment can be made using the minimum, 1st quartile, median, 3rd quartile and maximum. This way you obtain a dataset that is balanced with respect to labels (i.e. each label has the same number of observations).
Then, you can apply any classification technique to provide a model to your problem.
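A minimal sketch of the 4-level quartile assignment described above, assuming the survey scores sit in a NumPy array (the scores and label names are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = rng.integers(low=10, high=101, size=500)  # hypothetical survey scores

# Four roughly equal-sized risk classes split at the quartiles.
labels = pd.qcut(scores, q=4, labels=["low", "medium-low", "medium-high", "high"])
print(pd.Series(labels).value_counts())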

Which classification to choose?

I have a huge amount of Yelp data and I have to classify the reviews into 8 different categories.
Categories
Cleanliness
Customer Service
Parking
Billing
Food Pricing
Food Quality
Waiting time
Unspecified
Reviews can cover multiple categories, so I have used multi-label classification. But I am confused about how to handle positive/negative sentiment. For example, a review may be positive about food quality but negative about customer service, e.g. "food taste was very good but staff behaviour was very bad", so the review contains positive Food Quality but negative Customer Service. How can I handle this case? Should I do sentiment analysis before classification? Please help me.
I think your data is very similar to the Restaurants reviews dataset. It contains around 100 reviews, with a varying number of aspect terms in each (more information). So you can use Aspect-Based Sentiment Analysis, like this:
1-Aspect term extraction
Extract the aspect terms from the reviews.
2-Aspect polarity detection
For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive or negative.
3-Identify the aspect categories
Given a predefined set of aspect categories (e.g., Food Quality, Customer Service), identify the aspect categories discussed in a given sentence.
4-Determine the polarity of each category
Given a set of pre-identified aspect categories (e.g., Food Quality, Customer Service), determine the polarity (positive or negative) of each aspect category.
Please see this for more information about a similar project.
I hope this can help you.
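To make steps 3 and 4 concrete, here is a toy keyword-lexicon sketch; the word lists and the clause splitting on "but"/"and" are purely illustrative, and a real system would train a model on annotated data such as the restaurant reviews mentioned above:

# Hypothetical keyword lexicons for a few aspect categories and polarities.
CATEGORY_KEYWORDS = {
    "Food Quality": {"food", "taste", "dish"},
    "Customer Service": {"staff", "waiter", "service"},
    "Cleanliness": {"clean", "dirty", "cleanliness"},
}
POSITIVE = {"good", "great", "tasty", "friendly"}
NEGATIVE = {"bad", "rude", "dirty", "slow"}

def aspect_sentiment(review: str) -> dict:
    """Assign a rough polarity to each aspect category found in the review."""
    results = {}
    # Split on 'but'/'and' so each clause carries its own sentiment.
    for clause in review.lower().replace(" and ", " but ").split(" but "):
        words = set(clause.split())
        polarity = "positive" if words & POSITIVE else "negative" if words & NEGATIVE else None
        for category, keywords in CATEGORY_KEYWORDS.items():
            if words & keywords and polarity:
                results[category] = polarity
    return results

print(aspect_sentiment("food taste was very good but staff behaviour was very bad"))
# -> {'Food Quality': 'positive', 'Customer Service': 'negative'}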
Yes, you would need sentiment analysis. You could tokenize your data, i.e. pick out the relevant words in each sentence, and then link each aspect word to the sentiment words next to it. For example, in "food was good but the cleanliness was not appropriate"
you have the tokens [food, good, cleanliness, not, appropriate]; "food" links to its next term "good", and "cleanliness" to its next terms "not appropriate".
Again, you can classify each aspect into two classes, i.e. 1/0 for good and bad, or you can add classes based on your case.
Then you would have data such as:
--------------------
FEATURE          | VAL
--------------------
Cleanliness      |  0
Customer Service | -1
Parking          | -1
Billing          | -1
Food Pricing     | -1
Food Quality     |  1
Waiting time     | -1
Unspecified      | -1
I have given this just as an example, where -1, 1 and 0 stand for no review, good and bad respectively. You can add more categories, e.g. 0, 1, 2 for bad, fair and good.
I may not be the best at answering this, but this is what I feel about it.
Note: you need to understand that your model cannot be perfect; that is part of what machine learning is about. Your model cannot give a perfect classification: it will be wrong for certain inputs, and it will learn from them and improve over time.
There are many ways of doing multi label classification.
The simplest one would be having a model for each class, and if the review achieves a certain threshold score for that label, you would apply that label to the review.
This would treat the classes independently, but it seems like a good solution to your problem.
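Here is a minimal scikit-learn sketch of this one-model-per-label (binary relevance) setup; the three reviews and the four label columns are made up, and a real model would of course need far more data:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

reviews = [
    "food taste was very good but staff behaviour was very bad",
    "parking was a nightmare and the bill was wrong",
    "spotless place, short waiting time",
]
# Label columns: Food Quality, Customer Service, Parking, Billing (1 = mentioned).
labels = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
])

# One logistic-regression classifier per label over shared TF-IDF features;
# each label is applied independently when its own classifier fires.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(reviews, labels)
print(model.predict(["the staff was friendly and the food was great"]))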

Why is the principal component in the direction of maximum variance?

What does the "variation in the data" refer to in the context of Principal Component Analysis? I mean, suppose we have 5 features, or we can say 5 dimensions; what is the variation in the data then? Does it refer to the variation of the data in every feature? And why is PCA in the direction of maximum variation in the data?
This answer from Cross Validated provides an excellent answer to your questions.
On top of that, to answer "And why is PCA in the direction of maximum variation in the data?", I suggest reading some basics of information theory; this blog article delivers a great introduction to the subject. To give a tangible example, imagine that among your 5 features you have a vector that is all ones. It is intuitive that it does not help you: all samples share the same feature value. The variance of this particular feature will be zero, so it bears no information. Zero entropy, a perfect order if you will, means nothing ever changes along the given direction: a clear candidate to be dropped from the data. Increased variance = increased information content.
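To see the all-ones example numerically, here is a small scikit-learn sketch (the synthetic data is mine, not from the linked answer):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # four informative features
X = np.hstack([X, np.ones((200, 1))])  # fifth feature: all ones, zero variance

pca = PCA().fit(X)
print(pca.explained_variance_)     # last value is ~0: that direction carries no information
print(np.abs(pca.components_[0]))  # the leading direction puts zero weight on the constant feature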

How to calculate precision and recall of a web service ranking algorithm?

I want to calculate the precision and recall of a web service ranking algorithm. I have different web services in a database.
A customer specifies some conditions in his/her search. According to the customer's requirements, my algorithm should assign a score to each web service in the database and retrieve the web services with the highest scores.
I have searched the net and have read all the questions on this site about this topic, and I know about precision and recall, but I don't know how to calculate them in my case. The most relevant result was this link:
http://ijcsi.org/papers/IJCSI-8-3-2-452-460.pdf
According to this article,
Precision = Highest rank score / Total rank score of all services
Recall = Highest rank score / Score of 2nd highest service
But I think this is not right. Can you help me, please?
Thanks a lot.
There is no such thing as "precision and recall for ranking". Precision and recall are defined for the binary classification task and extended to multi-label tasks. Rankings require different measures, as this is a much more complex problem. There are numerous ways to compute something similar to precision (recall goes similarly); I will summarize some basic approaches, with a small sketch of the first and third after the list:
limit the search algorithm to some K best results and count a true positive for each query whose desired result is among those K results, so precision is the fraction of queries for which you can find a relevant result in the K best outputs
a very strict variation of the above: set K=1, meaning that the result has to come out "the best of all"
assign weights to each position, so for example you can give 1/T of a "true positive" to each query whose valid result came T-th. In other words, if the valid result was not returned you assign 1/inf = 0, if it was the first one on the list then 1/1 = 1, if second 1/2, etc.; precision is then simply the mean of these scores
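Here is a small sketch of the first and third approaches, i.e. precision@K and the mean of the 1/T scores (mean reciprocal rank). It assumes each query has a single relevant result whose 1-based rank we know, or None if it was never returned; the example ranks are made up:

def precision_at_k(relevant_ranks, k):
    """Fraction of queries whose relevant result appears in the top K."""
    hits = sum(1 for r in relevant_ranks if r is not None and r <= k)
    return hits / len(relevant_ranks)

def mean_reciprocal_rank(relevant_ranks):
    """Average of 1/rank, counting 0 for queries whose relevant result was missing."""
    return sum(0.0 if r is None else 1.0 / r for r in relevant_ranks) / len(relevant_ranks)

ranks = [1, 3, None, 2, 1]          # hypothetical outcomes for five queries
print(precision_at_k(ranks, k=2))   # 0.6
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 0 + 1/2 + 1) / 5 ≈ 0.57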
As lejlot pointed out, using "precision and recall for ranking" is an odd way to measure ranking performance. The definitions of "precision" and "recall" in the paper you referenced are very much customized:
"It is a measure of the tradeoff between the precision and recall of the particular ranking algorithm. Precision is the accuracy of the ranks, i.e. how well the algorithm has ranked the services according to the user preferences. Recall is the deviation between the top ranked service and the next relevant service in the list. Both these metrics are used together to arrive at the f-measure which then tests the algorithm efficiency."
Probably the original author had some specific motivation to use such a definition. Some usual metrics for evaluating ranking algorithms include:
Normalized discounted cumulative gain, or nDCG (used in a lot of Kaggle competitions)
Precision@K, Recall@K
This paper also lists a few common ranking measures.
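For reference, a tiny sketch of nDCG@K with graded relevance scores listed in the order the ranker returned them (this uses the rel / log2(rank + 1) variant of DCG; the example relevances are made up):

import numpy as np

def dcg(relevances, k):
    relevances = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, relevances.size + 2))  # log2(2), log2(3), ...
    return float(np.sum(relevances / discounts))

def ndcg(relevances, k):
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    return dcg(relevances, k) / dcg(ideal, k)

print(ndcg([3, 2, 0, 1], k=4))  # ≈ 0.985: nearly the ideal ordering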
This is what I could think of:
Recall could be the fraction of queries for which the user clicks one of the top 5 results, and precision could be the fraction for which the user clicks the very first result rather than the rest. I don't know; it seems very vague to speak about precision and recall in such a scenario.

How many principal components to take?

I know that principal component analysis does an SVD on a matrix and then generates an eigenvalue matrix. To select the principal components we take only the first few eigenvalues. Now, how do we decide on the number of eigenvalues that we should take from the eigenvalue matrix?
To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it for reducing storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don't have any strict constraints, I recommend plotting the cumulative sum of eigenvalues (assuming they are in descending order). If you divide each value by the total sum of eigenvalues prior to plotting, then your plot will show the fraction of total variance retained vs. number of eigenvalues. The plot will then provide a good indication of when you hit the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues).
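A short sketch of that cumulative-variance plot with scikit-learn; the synthetic data and the 95% reference line are only for illustration:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 50))  # correlated features

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)  # fraction of total variance retained

plt.plot(np.arange(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(0.95, linestyle="--")  # e.g. keep components until 95% of the variance
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.show()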
There is no correct answer, it is somewhere between 1 and n.
Think of a principal component as a street in a town you have never visited before. How many streets should you take to get to know the town?
Well, you should obviously visit the main street (the first component), and maybe some of the other big streets too. Do you need to visit every street to know the town well enough? Probably not.
To know the town perfectly, you should visit all of the streets. But what if you could visit, say 10 out of the 50 streets, and have a 95% understanding of the town? Is that good enough?
Basically, you should select enough components to explain enough of the variance that you are comfortable with.
As others said, it doesn't hurt to plot the explained variance.
If you use PCA as a preprocessing step for a supervised learning task, you should cross-validate the whole data processing pipeline and treat the number of PCA dimensions as a hyperparameter to select using a grid search on the final supervised score (e.g. F1 score for classification or RMSE for regression).
If a cross-validated grid search on the whole dataset is too costly, try it on two subsamples, e.g. one with 1% of the data and another with 10%, and see if you come up with the same optimal value for the PCA dimensions.
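A hedged sketch of that cross-validated pipeline, using scikit-learn's digits dataset and an arbitrary grid of candidate dimensions as stand-ins for your own data and search space:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Treat the number of PCA dimensions as a hyperparameter of the whole pipeline.
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 30, 40]},
                    scoring="f1_macro", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)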
There are a number of heuristics used for that.
E.g. taking the first k eigenvectors that capture at least 85% of the total variance.
However, for high dimensionality these heuristics are usually not very good.
Depending on your situation, it may be interesting to define the maximal allowed relative error when projecting your data onto ndim dimensions.
Matlab example
I will illustrate this with a small matlab example. Just skip the code if you are not interested in it.
I will first generate a random matrix of n samples (rows) and p features (columns) containing exactly 100 non-zero principal components.
n = 200;              % number of samples (rows)
p = 119;              % number of features (columns)
data = zeros(n, p);
for i = 1:100
    data = data + rand(n, 1)*rand(1, p);   % add a rank-one component
end
For this sample data, one can calculate the relative error made by projecting the input data onto ndim dimensions as follows:
[coeff, score] = pca(data, 'Economy', true);
relativeError = zeros(p, 1);
for ndim = 1:p
    % Reconstruct the data from the first ndim principal components
    reconstructed = repmat(mean(data, 1), n, 1) + score(:, 1:ndim)*coeff(:, 1:ndim)';
    residuals = data - reconstructed;
    relativeError(ndim) = max(max(residuals./data));   % worst-case relative error
end
Plotting the relative error as a function of the number of dimensions (principal components) results in the following graph:
Based on this graph, you can decide how many principal components you need to take into account. In this theoretical example, taking 100 components results in an exact representation, so taking more than 100 components is useless. If you want, for example, at most 5% error, you should take about 40 principal components.
Disclaimer: the obtained values are only valid for my artificial data. So do not use the proposed values blindly in your situation; perform the same analysis and make a trade-off between the error you make and the number of components you need.
Code reference
The iterative algorithm is based on the source code of pcares
A StackOverflow post about pcares
I highly recommend the following paper by Gavish and Donoho: The Optimal Hard Threshold for Singular Values is 4/sqrt(3).
I posted a longer summary of this on CrossValidated (stats.stackexchange.com). Briefly, they obtain an optimal procedure in the limit of very large matrices. The procedure is very simple, does not require any hand-tuned parameters, and seems to work very well in practice.
They have a nice code supplement here: https://purl.stanford.edu/vg705qn9070
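As a rough illustration of the simplest case in the paper, a square n-by-n matrix with known noise level sigma, where the threshold is (4/sqrt(3)) * sqrt(n) * sigma (the rectangular and unknown-noise cases use adjusted formulas given in the paper), here is a small synthetic sketch:

import numpy as np

rng = np.random.default_rng(0)
n, rank, sigma = 200, 5, 1.0
signal = 3 * rng.normal(size=(n, rank)) @ rng.normal(size=(rank, n))  # low-rank signal
noisy = signal + sigma * rng.normal(size=(n, n))                      # plus white noise

# Keep only singular values above the optimal hard threshold.
s = np.linalg.svd(noisy, compute_uv=False)
threshold = (4 / np.sqrt(3)) * np.sqrt(n) * sigma
print("singular values kept:", int(np.sum(s > threshold)))  # should recover roughly `rank`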
