Classify users based on demographics - machine-learning

I am working on a personal project just for fun. Basically, I have collected data with demographic information about each country, like:
Germany: 74% male, 26% female, 16% married, 40% age 30-35, etc.
Now what I want to do is, when I get a new user, look at that user's country and try to predict information about the user, i.e. whether the user is a male who is married and aged 30-35 (just an example).
My question is: how can I make such a prediction? I can't just make a rule which says that if a country has more than 50% males, a new user from this country is also male. Basically, I want to know how I can decide on the value which would help me predict the user's demographics with certainty.

This is not really a prediction question but rather a probability question, since you can just calculate all the probabilities if you have values as described.
Here's an example:
Male population = 74%
People married = 16%
People between 30-35 = 40%
To get the probability that a new German user is male, married, and between 30-35 years old, you multiply the marginal probabilities, assuming the three attributes are independent:
P(Male, Married, 30-35) = P(Male) * P(Married) * P(30-35)
P(Male, Married, 30-35) = 0.74 * 0.16 * 0.40 = 0.04736 ≈ 4.7%
You don't need more to calculate this. If, however, you want to classify all users automatically, I suggest you take a look at Naive Bayes classification.
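For concreteness, here is a minimal Python sketch of that calculation (my own illustration, not from the answer; the country_stats dictionary is just a hypothetical way to store the per-country percentages):

country_stats = {
    "Germany": {"male": 0.74, "married": 0.16, "age_30_35": 0.40},
}

def joint_probability(country, attributes):
    # Multiply the marginal probabilities, assuming the attributes
    # are independent (the same assumption naive Bayes makes).
    p = 1.0
    for attr in attributes:
        p *= country_stats[country][attr]
    return p

print(joint_probability("Germany", ["male", "married", "age_30_35"]))  # 0.04736, i.e. ~4.7%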

Related

Estimating both the category and the magnitude of output using neural networks

Let's say I want to predict which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students whose results we want to estimate. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I want to use a neural network specifically, to see whether the problem can be properly solved with one.)
The way I want to set up the output (label) space is by having a feature for each of the possible courses a student can take, with a value between 0 and 1 in each entry describing whether the student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (e.g. if the student attends class A and gets 57%, the label for class A will be 0.57).
Am I setting the output space properly?
If yes, which optimization and activation functions should I use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking it) and also give the expected grade? The interpretation of the output for a single course would then be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit this pattern. For example, how would you represent that an (A+) student is unlikely to take a course? Should the network output 0.9999, because they are very likely to get the maximum number of points if they take the course, or should it output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values in [0, 1] for each student and each course:
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As the loss, I'd propose something like binary cross-entropy on the first value and simple squared error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, or square and add for p=2).
A few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
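A minimal sketch of this two-values-per-course setup with the proposed combined loss (p = 1, i.e. adding the two terms). The answer names no framework; Keras here, and all course/feature counts and layer sizes, are illustrative assumptions:

import tensorflow as tf
from tensorflow import keras

N_COURSES = 20   # hypothetical number of possible courses
N_FEATURES = 50  # hypothetical size of the encoded student history

def combined_loss(y_true, y_pred):
    # y_true and y_pred have shape (batch, N_COURSES, 2):
    # channel 0 = participation probability, channel 1 = expected grade.
    bce = keras.losses.binary_crossentropy(y_true[..., 0], y_pred[..., 0])
    se = tf.reduce_mean(tf.square(y_true[..., 1] - y_pred[..., 1]), axis=-1)
    return bce + se  # p = 1: simply add the two loss terms

inputs = keras.Input(shape=(N_FEATURES,))
x = keras.layers.Dense(64, activation="relu")(inputs)
x = keras.layers.Dense(N_COURSES * 2, activation="sigmoid")(x)
outputs = keras.layers.Reshape((N_COURSES, 2))(x)  # two values per course
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=combined_loss)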
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
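As a sketch only, here is how that architecture might look in Keras (my assumption; the answer names no framework, and n and the layer widths are placeholders):

import tensorflow as tf
from tensorflow import keras

n = 30       # hypothetical number of possible classes/courses
RESULTS = 6  # ['A', 'B', 'C', 'D', 'F', 'did_not_take']

inputs = keras.Input(shape=(n, RESULTS))      # one-hot result per course
x = keras.layers.Flatten()(inputs)
x = keras.layers.Dense(2 * n, activation="relu")(x)
x = keras.layers.Dense(2 * n, activation="relu")(x)
x = keras.layers.Dense(n * RESULTS)(x)
x = keras.layers.Reshape((n, RESULTS))(x)
outputs = keras.layers.Softmax(axis=-1)(x)    # softmax per group of 6
model = keras.Model(inputs, outputs)
# cross-entropy per course, reduced into a single loss value
model.compile(optimizer="adam", loss="categorical_crossentropy")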
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for the input data only, so the two networks take the same input, but for the grade-regression network the last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider an architecture that takes into account the order in which students took previous classes. For example, if a student took French I the previous year, it is more likely they will take French II this year than if they took French in freshman year and did not continue with it after that.

Using SVM to predict text with label

I have data in a csv file in the following format
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
I consider values like "jon", "red", and "30" as inputs. Each input has a label. For instance, the inputs [jon, george, tom, bob] have the label "name", and the inputs [red, blue, purple] have the label "power". This is basically my training data: I have a bunch of values that are each mapped to a label.
Now I want to use an SVM to train a model on my training data to accurately identify, given a new input, its correct label. So, for instance, if the input provided is "444", the model should be smart enough to categorize it under the "Money" label.
I have installed Python and also installed sklearn, and I have completed the following tutorial as well. I am just not sure how to prepare the input data to train the model.
Also, I am new to machine learning, so if I have said something that sounds wrong or odd, please point it out, as I will be happy to learn the correct way.
As your question is currently formulated, you are not dealing with a typical machine learning problem. Currently, you have column-wise data:
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
If a user now inputs "Jon", you know it is going to be of type "Name" by a simple hash-map lookup, e.g.:
hashmap["Jon"] -> "Name"
The main reason people are saying it is not a machine learning problem is that your "categorisation" or "prediction" is defined by your column names. Machine learning problems, instead, typically predict some response variable. For example, imagine you had asked this:
Name Power Money Bought_item
Jon Red 30 yes
George blue 20 no
Tom Red 40 no
Bob purple 10 yes
We could build a model to predict Bought_item using the features Name, Power, and Money using SVM.
Your problem would have to look more like:
Feature1 Feature2 Feature3 Category
1.0 foo bar Name
3.1 bar foo Name
23.4 abc def Money
22.22 afb dad Power
223.1 dad vxv Money
You then use Feature1, Feature2, and Feature3 to predict Category. At the moment your question does not give enough information for anyone to really understand what you need or what data you have; consider reformulating it this way, or consider an unsupervised approach.
Edit:
So frame it this way:
Name Power Money Label
Jon Red 30 Foo
George blue 20 Bar
Tom Red 40 Foo
Bob purple 10 Bar
OneHotEncode Name and Power, so you now have a 0/1 variable for each name and each power.
Standardise Money so that it ranges between approximately -1 and 1.
LabelEncode your labels so that they are 0,1,2,3,4,5,6 and so on.
Use a One vs. All classifier, http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
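A minimal sketch of those steps with scikit-learn (the toy data mirrors the example table; the choice of LinearSVC and the rest of the wiring are my assumptions):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

df = pd.DataFrame({
    "Name": ["Jon", "George", "Tom", "Bob"],
    "Power": ["Red", "blue", "Red", "purple"],
    "Money": [30, 20, 40, 10],
    "Label": ["Foo", "Bar", "Foo", "Bar"],
})

y = LabelEncoder().fit_transform(df["Label"])  # labels -> 0, 1, 2, ...
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["Name", "Power"]),  # 0/1 per name/power
    ("scale", StandardScaler(), ["Money"]),          # roughly -1 to 1
])
clf = Pipeline([("pre", pre), ("ovr", OneVsRestClassifier(LinearSVC()))])
clf.fit(df[["Name", "Power", "Money"]], y)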

Right approach to find similar products solely based on content and not on user history using machine learning algorithms

I have around 2-3 million products. Each product follows this structure
{
  "sku": "Unique ID of product (string of 20 chars)",
  "title": "Title of product, e.g. OnePlus 5 - 6GB + 64GB",
  "brand": "Brand of product, e.g. OnePlus",
  "cat1": "First category of product, e.g. Phone",
  "cat2": "Second category of product, e.g. Mobile Phones",
  "cat3": "Third category of product, e.g. Smart Phones",
  "price": 500.00,
  "shortDescription": "Short description about the product (around 8-10 lines)",
  "longDescription": "Long description about the product (around 50-60 lines)"
}
The problem statement is
Find similar products based on content or product data only. So when the e-commerce user clicks on a product (SKU), I will show similar products to that SKU in the recommendations.
For example, if the user clicks on an Apple iPhone 6s Silver, I will show these products under "Similar Products Recommendation":
1) apple iphone 6s gold or other color
2) apple iphone 6s plus options
3) apple iphone 6s options with other configurations
4) other apple iphones
5) other smart-phones in that price range
What I have tried so far
A) I have tried to use 'user view events' to recommend similar products, but we do not have enough good data. It gives fine results, but only for a few products, so this template is not suitable for my use case.
B) One hot encoder + Singular Value Decomposition ( SVD ) + Cosine Similarity
I have trained my model on around 250 thousand products with dimension = 500, using a modification of this PredictionIO template. It is giving good results. I have not included the long description of the product in the training.
But I have some questions here
1) Is using a one-hot encoder and SVD the right approach in my use case?
2) Is there any way or trick to give extra weight to the title and brand attributes in the training?
3) Do you think it is scalable? I am trying to increase the product count to 1 million and dimension = 800-1000, but it is taking a lot of time and the system hangs/stalls or goes out of memory. (I am using Apache PredictionIO.)
4) What should my dimension value be when I want to train on 2 million products?
5) How much memory would I need to deploy the SVD-trained model for in-memory cosine similarity over 2 million products?
What should I use in my use case so that I can give some weight to my important attributes and get good results with reasonable resources? What would be the best machine learning algorithm to use in this case?
Taking the questions in order:
1) A "right approach" often doesn't exist in ML. The supreme arbiter is whether the result has the characteristics you need. Most importantly: is the accuracy what you need, and can you find a better method? We can't tell without having a significant subset of your data set.
2) Yes. Most training methods will adjust whatever factors improve the error (loss) function. If your chosen method (SVD or other) doesn't do this automatically, then alter the error function.
3) Yes, it's scalable. The basic inference process is linear in the data set size. You got poor results because you didn't scale up the hardware when you enlarged the data set; that's part of "scaling up". You might also consider scaling out (more compute nodes).
4) Well, how should the dimension scale with the database size? I believe that empirical evidence supports this being a log(n) relationship ... you'd want 600-700 dimensions. However, you should determine this empirically.
5) That depends on how you use the results. From what you've described, all you'll need is a sorted list of the N top matches, which requires only the references and the similarity scores (a simple float each). That's trivial memory compared to the model size: a matter of N*8 bytes.
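A minimal scikit-learn sketch of approach B (bag-of-words features, truncated SVD, cosine similarity). The field names follow the product structure in the question; everything else, including the crude token-repetition trick for weighting title and brand (question 2), is an assumption:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

products = [
    {"sku": "SKU1", "title": "Apple iPhone 6s Silver", "brand": "Apple"},
    {"sku": "SKU2", "title": "Apple iPhone 6s Gold", "brand": "Apple"},
    {"sku": "SKU3", "title": "OnePlus 5 6GB 64GB", "brand": "OnePlus"},
]

# Repeating title and brand tokens is one crude way to give them extra weight.
docs = [" ".join([p["title"]] * 3 + [p["brand"]] * 2) for p in products]
X = TfidfVectorizer().fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)  # e.g. 500 in practice

sim = cosine_similarity(X_reduced)  # (n_products, n_products) matrix
print(sim[0].argsort()[::-1][1:])   # products most similar to SKU1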

machine learning, why do we need to weight data

This may sound like a very naive question. I checked Google and many YouTube videos for beginners, and pretty much all of them explain data weighting as if it were completely obvious. I still do not understand why data is being weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value to a sigmoid function, I'll already receive a value between 0 and 1.
I really don't understand why the data needs to be, or is recommended to be, weighted first. If you could explain this to me in a very simple manner, I would appreciate it a lot.
I think you are not talking about weighting data but features.
A feature is a column in your table, and by data I would understand rows.
The confusion comes from the fact that weighting rows is also sometimes sensible, e.g. if you want to punish misclassification of the positive class more.
Why do we need to weight features?
I assume you are talking about a model like
prediction = sigmoid(sum_i weight_i * feature_i) > base
Let's assume you want to predict whether a person is overweight based on body weight, height, and age.
In R we can generate a sample dataset as follows:
height = rnorm(100, 1.80, 0.1) # normally distributed, mean 1.8, sd 0.1
weight = rnorm(100, 70, 10)    # mean 70, sd 10
age = runif(100, 0, 100)       # uniform between 0 and 100
ow = weight / (height**2) > 25 # overweight if BMI > 25
data = data.frame(height, weight, age, ow)
If we now plot the data, you can see that (at least in my sample) the data can be separated with a straight line in the weight/height plane; age, however, does not provide any value. If we weight the features prior to the sum/sigmoid, we can put all factors into relation.
Furthermore, weight and height have very different domains, so they need to be put into relation for the separating line to have the right slope: the values of weight are an order of magnitude larger than those of height.
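To make the point concrete, here is a small Python counterpart to the R example (my own sketch; scikit-learn's LogisticRegression stands in for the weighted-sum-plus-sigmoid model above):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
height = rng.normal(1.80, 0.1, 100)  # metres
weight = rng.normal(70, 10, 100)     # kilograms: an order of magnitude larger
age = rng.uniform(0, 100, 100)       # carries no signal for this target
ow = (weight / height**2 > 25).astype(int)  # overweight if BMI > 25

X = np.column_stack([height, weight, age])
model = LogisticRegression(max_iter=1000).fit(X, ow)
print(model.coef_)  # informative features get large weights; age stays near 0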

What is the difference between a feature and a label? [closed]

I'm following a tutorial about machine learning basics and there is mentioned that something can be a feature or a label.
From what I know, a feature is a property of data that is being used. I can't figure out what the label is, I know the meaning of the word, but I want to know what it means in the context of machine learning.
Briefly, feature is input; label is output. This applies to both classification and regression problems.
A feature is one column of the data in your input set. For instance, if you're trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.
Once you've trained your model, you will give it sets of new input containing those features; it will return the predicted "label" (pet type) for that person.
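As a tiny illustration of that split (the column names follow the pet example above and are otherwise assumptions):

import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 8],
    "home_region": ["urban", "rural", "urban"],
    "family_income": [40000, 70000, 55000],
    "pet": ["dog", "fish", "iguana"],
})

X = df[["age", "home_region", "family_income"]]  # features: the model's input
y = df["pet"]                                    # label: what the model predicts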
Feature:
In machine learning, a feature is a property of your training data, or you can say a column name in your training dataset.
Suppose this is your training dataset
Height Sex Age
61.5 M 20
55.5 F 30
64.5 M 41
55.5 F 51
...    ...  ...
Then here Height, Sex and Age are the features.
Label:
The output you get from your model after training it is called a label.
Suppose you fed the above dataset to some algorithm and it generated a model to predict gender as Male or Female. To this model you pass features like age, height, etc.
After computing, it will return the gender as Male or Female. That's called a label.
Here comes a more visual approach to explain the concept. Imagine you want to classify the animal shown in a photo.
The possible classes of animals are e.g. cats or birds.
In that case, the labels would be the possible class associations, e.g. cat or bird, that your machine learning algorithm will predict.
The features are patterns, colors, and forms that are part of your images, e.g. fur, feathers, or, at a lower level of interpretation, pixel values.
Label: Bird
Features: Feathers
Label: Cat
Features: Fur
Prerequisite: Basic Statistics and exposure to ML (Linear Regression)
It can be answered in a sentence -
They are alike, but their definition changes according to need.
Explanation
Let me explain my statement. Suppose you have a dataset; for this purpose, consider exercise.csv. The columns in the dataset are called features. Gender, Age, Height, Heart_Rate, Body_Temp, and Calories might be some of its columns. Each column represents a distinct feature or property.
exercise.csv
User_ID Gender Age Height Weight Duration Heart_Rate Body_Temp Calories
14733363 male 68 190.0 94.0 29.0 105.0 40.8 231.0
14861698 female 20 166.0 60.0 14.0 94.0 40.3 66.0
11179863 male 69 179.0 79.0 5.0 88.0 38.7 26.0
To solidify the understanding and clear up the puzzle, let us take two different problems (prediction cases), as sketched in the code below.
CASE 1: Here we might consider using Gender, Height, and Weight to predict the Calories burnt during exercise. The prediction (Y), Calories, is the label: Calories is the column you want to predict using features like x1: Gender, x2: Height, and x3: Weight.
CASE 2: Here we might want to predict the Heart_Rate using Gender and Weight as features. Heart_Rate is the label, predicted using the features x1: Gender and x2: Weight.
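A minimal pandas sketch of the two cases (the rows are copied from the table above; the variable names are my assumptions):

import pandas as pd

df = pd.DataFrame({
    "Gender": ["male", "female", "male"],
    "Height": [190.0, 166.0, 179.0],
    "Weight": [94.0, 60.0, 79.0],
    "Heart_Rate": [105.0, 94.0, 88.0],
    "Calories": [231.0, 66.0, 26.0],
})

# CASE 1: features Gender, Height, Weight; label Calories
X1, y1 = df[["Gender", "Height", "Weight"]], df["Calories"]
# CASE 2: features Gender, Weight; label Heart_Rate
X2, y2 = df[["Gender", "Weight"]], df["Heart_Rate"]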
Once you have understood the above explanation you won't really be confused with Label and Features anymore.
Let's take an example where we want to recognize alphabet letters from handwritten photos. We feed these sample images into the program, and the program classifies them on the basis of the features they exhibit.
An example of a feature in this context: the letter 'C' can be thought of as a concave shape opening to the right.
A question now arises: how do we store these features? We need to name them. This is where labels come in: a label is given to a set of features to distinguish it from other features.
Thus, we obtain labels as output when provided with features as input.
Labels are not associated with unsupervised learning.
A feature, briefly explained, is the input you feed to the system, and the label is the output you expect. For example, if you have fed in many features of a dog, like its height, fur color, etc., then after computing, the model will return the breed of dog you want to know.
Suppose you want to predict the climate; then the features given to you would be historic climate data, current weather, temperature, wind speed, etc., and the labels would be the months.
The above combination can help you derive predictions.
