I have a dataset with several features of houses including type, location, the number of bedrooms, etc. For example:
Type: Apartment, Semi-detached House, Single-detached House
Location: (Lat, Lon) Pairs like (40.7128° N, 74.0059° W)
Number of Bedrooms: 1, 2, 3, 4 ...
The target variable I want to predict is the house price. However, the house price given in the original dataset is the intervals of prices instead of numeric values, for example:
House Price: [0,100000), [100000,150000), [150000,200000), [200000,250000), etc.
So my question is: what model should I use if I want to predict the range of the house price? Simple regression models don't seem to work because we are predicting intervals instead of continuous numeric values.
Thanks in advance.
I would use the midpoint of each price range and run a linear regression. In your case the labels would be {50000, 125000, 175000, 225000, ...}. After you get the predicted price, just pick the range it falls into.
Alternatively, if the price ranges are fixed, you can use a one-vs-all logistic regression, although I am sure this is not the best approach.
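A minimal sketch of that midpoint idea (the features and interval labels below are random placeholders; the bin edges follow the example intervals):

import numpy as np
from sklearn.linear_model import LinearRegression

# Interval boundaries from the example, and the midpoint used as the regression label
bins = [0, 100_000, 150_000, 200_000, 250_000]
midpoints = np.array([50_000, 125_000, 175_000, 225_000])

# Placeholder feature matrix and interval labels (replace with the real encoded features)
X = np.random.rand(200, 5)
y_interval = np.random.randint(0, 4, size=200)   # index of the price interval of each house
y_midpoint = midpoints[y_interval]               # regression target: midpoint of that interval

model = LinearRegression().fit(X, y_midpoint)

# Predict a numeric price, then map it back to the interval it falls into
predicted_price = model.predict(X[:5])
predicted_interval = np.digitize(predicted_price, bins) - 1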
I have a set of features x1,x2,x3,x4 where x1,x2,x3 are floats and x4 is an array of floats.
To give an example, say that I am trying to predict the price of a house. I could use the size of the house as an array (e.g. length, width, and height) along with other features like the number of bedrooms, the age of the house, the number of bathrooms, etc.
This is simple, but I am struggling with how to represent it.
Here is a similar sample based on heart attack prediction https://colab.research.google.com/drive/1CQX2d0vkjlZKjX6wbG4ga6tRcI-zOMNA
I tried to append the array feature as extra columns at the end with np.c_:
import numpy as np
import tensorflow as tf

# X_s and y_s come from the linked notebook
##################################-Check-########################
print("Before", X_s[:1])
X_s = np.c_[X_s, np.random.rand(303, 2)]  # append a 2-column numpy array as an extra feature
print("After", X_s[:1])
print("shape of X_s", X_s.shape)
print(X_s[:1])
dataset = tf.data.Dataset.from_tensor_slices((X_s, y_s))
But the problem is that the array is added as two extra columns at the end:
shape of X_s (303, 13)
shape of X_s (303, 15)
So if I have a feature array of, say, 330*300, the above approach will add 300 columns at the end, which is not something I want.
I am aware of CNNs, and one option is to model the whole problem as a CNN; that is, pad the other features into arrays as well, create an n-dimensional tensor, and use a CNN, roughly as sketched below.
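To make that concrete, this is roughly what I mean (the shapes and the padding scheme are just placeholders):

import numpy as np

n_samples, arr_len = 303, 300
x1 = np.random.rand(n_samples)            # scalar feature
x2 = np.random.rand(n_samples)            # scalar feature
x3 = np.random.rand(n_samples)            # scalar feature
x4 = np.random.rand(n_samples, arr_len)   # array feature

# Repeat each scalar so it matches the array length, then stack everything
# into an (n_samples, 4, arr_len) tensor that a CNN could consume
features = np.stack(
    [np.tile(x1[:, None], arr_len),
     np.tile(x2[:, None], arr_len),
     np.tile(x3[:, None], arr_len),
     x4],
    axis=1,
)
print(features.shape)                     # (303, 4, 300)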
Is there something simpler and better than these approaches?
I am trying to build a model that can predict insurance names based on the insurance ID.
Before putting the question to this forum, I tried KNN and a Decision Tree, but the accuracy does not exceed 60%.
In my data frame, I have one column as a feature and the other as a label.
I can also extract other features from this data, such as whether the ID is numeric, its length, etc.
I have 2.8M rows of data in this shape.
insurance_id    insurance_name
XOH830990804    Medicare
XOH01179276     Medicare
H55575577       Medicare
H71096147       WELLMED
IBPW01981926    BCBS
MT25110S        Aetna
WXQQ07123       Aetna
6WU7NSSGY63     Oxford
MX7ZZ35T        Oxford
DU00079Z        Welcare
PB95800M        UHC
Please guide me on which approach or model can help me to achieve an accuracy of more than 80%.
You can try to diversify your inputs.
As an example, you can pass additional features to the network, such as:
Length of the insurance_id
Quantity of numbers in the insurance_id
Quantity of letters in the insurance_id
Sum of all numbers in the insurance_id
And any other transform you might think of.
As the output layer of your network, you might want to use Dense(n_of_different_insurance_names, activation='softmax')
and a categorical_crossentropy loss function when compiling the model.
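A rough sketch of this setup (the toy data and layer sizes are assumptions; because the labels below are integer-encoded, the sparse variant of categorical_crossentropy is used):

import pandas as pd
import tensorflow as tf

df = pd.DataFrame({
    "insurance_id":   ["XOH830990804", "H71096147", "IBPW01981926", "MT25110S"],
    "insurance_name": ["Medicare", "WELLMED", "BCBS", "Aetna"],
})

# Hand-crafted features derived from the insurance_id string
features = pd.DataFrame({
    "length":    df["insurance_id"].str.len(),
    "n_digits":  df["insurance_id"].str.count(r"\d"),
    "n_letters": df["insurance_id"].str.count(r"[A-Za-z]"),
    "digit_sum": df["insurance_id"].apply(lambda s: sum(int(c) for c in s if c.isdigit())),
})

labels, names = pd.factorize(df["insurance_name"])   # one integer per distinct insurance name

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(features.shape[1],)),
    tf.keras.layers.Dense(len(names), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(features.values.astype("float32"), labels, epochs=10, verbose=0)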
Let's say I want to predict which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students whose results we want to estimate. I want to use a recurrent neural network with long short-term memory (LSTM) to solve this problem. (I know this problem can be solved by regression, but I want the neural network specifically, to see if the problem can be properly solved with one.)
The way I want to set up the output (label) space is by having a feature for each possible course a student can take, with a value between 0 and 1 in each entry describing whether the student will attend the class (if not, the entry for that course is 0) and, if so, what their mark will be (i.e. if the student attends class A and gets 57%, the label for class A will be 0.57).
Am I setting the output space properly?
If yes, which optimization and activation functions should I use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then to output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking it) and also give the expected grade. Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise rethinking it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent that an (A+) student is unlikely to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, or should the network output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values in [0, 1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As a loss, I'd propose something like binary cross-entropy on the first value and a simple squared error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, or square and add for p=2).
A few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
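A rough Keras sketch of that two-valued output and combined loss (the layer sizes and input shape are assumptions; the two losses are simply added, i.e. p=1):

import tensorflow as tf

n_courses, n_features = 20, 50   # assumed: number of courses, features per year of history

inputs = tf.keras.Input(shape=(None, n_features))   # sequence of per-year feature vectors
h = tf.keras.layers.LSTM(128)(inputs)
participation = tf.keras.layers.Dense(n_courses, activation="sigmoid", name="participation")(h)
grade = tf.keras.layers.Dense(n_courses, activation="sigmoid", name="grade")(h)

model = tf.keras.Model(inputs, [participation, grade])
model.compile(
    optimizer="adam",
    loss={"participation": "binary_crossentropy",   # first value: probability of taking each course
          "grade": "mse"},                          # second value: expected relative grade
    loss_weights={"participation": 1.0, "grade": 1.0},   # add the two losses (p = 1)
)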
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
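A sketch of that architecture (sizes are placeholders; a Reshape followed by a per-course Softmax implements the "softmax for every group of 6 outputs"):

import tensorflow as tf

n_classes, n_results = 40, 6   # assumed: n possible classes, 6 result buckets

inputs = tf.keras.Input(shape=(n_classes, n_results))   # bucketed history, one one-hot result per class
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(2 * n_classes, activation="relu")(x)
x = tf.keras.layers.Dense(2 * n_classes, activation="relu")(x)
x = tf.keras.layers.Dense(n_classes * n_results)(x)
x = tf.keras.layers.Reshape((n_classes, n_results))(x)
outputs = tf.keras.layers.Softmax(axis=-1)(x)           # softmax over the 6 results of each class

model = tf.keras.Model(inputs, outputs)
# Targets have shape (m, n_classes, n_results): one one-hot result per class;
# the per-class cross-entropies are reduced into a single loss value
model.compile(optimizer="adam", loss="categorical_crossentropy")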
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.
Training Data Set:
--------------------
Patient Age: 25
Patient Weight: 60
Diagnosis one: Fever
Diagnosis two: Headache
> Medicine: **Crocin**
---------------------------------
Patient Age: 25
Patient Weight: 60
Diagnosis one: Fever
Diagnosis two: no headache
> Medicine: Paracetamol
----------------------------------
Given the sample data set above, with the drug/medicine prescribed to each patient:
How do I find which medicine to prescribe based on patient info (age/weight) and diagnosis (fever/headache/etc.)?
The task you are aiming at is classification, since the target values are on a nominal scale.
Getting the vocabulary right is crucial, since the rest of the work has already been done by others, for example in the sklearn library for Python, which contains the most relevant algorithms as well as plenty of data sets to test them on and to learn the algorithms with.
It seems you have four variables as input:
age - metric variable
weight - metric variable
Diagnosis one - nominal variable
Diagnosis two - nominal variable
You will have to encode your nominal variables; I would recommend an array over all possible diagnoses, such as:
Fever, Headache, Stomach pain, x - [0, 0, 0, 0]
Each array element is set to 1 if the diagnosis applies and 0 otherwise.
Therefore you have a total of 2 + n input variables, where n is the number of possible symptoms.
Then you can simply go to the sklearn library and start with the simplest classification algorithm: nearest-neighbour classification.
If this does not yield good results (results probably won't be good at first), you can move on to more sophisticated models (SVM, random forest). But first you should learn the vocabulary and use simple models to get to know the methods and the processing chain.
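A minimal sklearn sketch of that processing chain, using the two example patients above (the column names are made up):

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# The two example patients from the training data above
df = pd.DataFrame({
    "age":      [25, 25],
    "weight":   [60, 60],
    "diag_one": ["Fever", "Fever"],
    "diag_two": ["Headache", "no headache"],
    "medicine": ["Crocin", "Paracetamol"],
})

# One column per possible symptom: 1 if either diagnosis matches, 0 otherwise
symptoms = ["Fever", "Headache", "Stomach pain"]
for s in symptoms:
    df[s] = ((df["diag_one"] == s) | (df["diag_two"] == s)).astype(int)

X = df[["age", "weight"] + symptoms]   # 2 + n input variables
y = df["medicine"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
new_patient = pd.DataFrame([[25, 60, 1, 0, 0]], columns=X.columns)   # age 25, weight 60, fever only
print(clf.predict(new_patient))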
I am trying to fully understand the difference between categorical and ordinal data when doing regression analysis. For now, this is what is clear:
Categorical feature and data example:
Color: red, white, black
Why categorical: red < white < black is logically incorrect
Ordinal feature and data example:
Condition: old, renovated, new
Why ordinal: old < renovated < new is logically correct
Categorical-to-numeric and ordinal-to-numeric encoding methods:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Example for categorical:
data = {'color': ['blue', 'green', 'green', 'red']}
Numeric format after One-Hot encoding:
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Example for ordinal:
data = {'con': ['old', 'new', 'new', 'renovated']}
Numeric format after using mapping: Old < renovated < new → 0, 1, 2
0 0
1 2
2 2
3 1
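For completeness, both encodings above can be produced with pandas (a small sketch):

import pandas as pd

# One-hot encoding of the categorical feature
color = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(color, columns=['color'], dtype=int))

# Ordinal mapping of the ordinal feature: old < renovated < new -> 0 < 1 < 2
con = pd.DataFrame({'con': ['old', 'new', 'new', 'renovated']})
print(con['con'].map({'old': 0, 'renovated': 1, 'new': 2}))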
In my data price increases as condition changes from "old" to "new". "Old" in numeric was encoded as '0'. 'New' in numeric was encoded as '2'. So, as condition increases, then price also increases. Correct.
Now let's have a look at the 'color' feature. In my case, different colors also affect the price; for example, 'black' will be more expensive than 'white'. But in the numeric representation of the categorical data above, I do not see an increasing dependency the way I did with the 'condition' feature. Does that mean a change in color does not affect the price in a regression model when using one-hot encoding? Why use one-hot encoding for regression if it does not affect the price anyway? Can you clarify this?
UPDATE TO QUESTION:
First, the linear regression formula: price = theta_0 + theta_1*x_1 + theta_2*x_2 + ... + theta_n*x_n.
Let's have a look at the data representations for color.
Let's predict the price for the 1st and 2nd items using this formula with both data representations:
One-hot encoding:
In this case there is a separate theta for each color, and the predictions will be:
Price (1 item) = 0 + 20*1 + 50*0 + 100*0 = 20$ (thetas are assumed for example)
Price (2 item) = 0 + 20*0 + 50*1 + 100*0 = 50$ (thetas are assumed for example)
Ordinal encoding for color:
In this case all colors share a common theta, but the multipliers (encoded values) differ:
Price (1 item) = 0 + 20*10 = 200$ (theta assumed for example)
Price (2 item) = 0 + 20*20 = 400$ (theta assumed for example)
In my model White < Red < Black in price. The predictions seem logical in both cases, for both the ordinal and the categorical representations. So can I use either encoding for my regression regardless of the data type (categorical or ordinal)? Is this division just a matter of convention and software-oriented representation rather than a matter of regression logic itself?
You will not see an increasing dependency. The whole point of this distinction is that colour is not a feature you can meaningfully place on a continuum, as you've already noted.
The one-hot encoding makes it very convenient for the software to analyze this dimension. Instead of having a feature "colour" with the listed values, you have a set of boolean (present / not-present) features. For instance, your row 0 above has features color_blue = true, color_green = false, and color_red = false.
The prediction data you get should show each of these as a separate dimension. For instance, presence of color_blue may be worth $200, while green is -$100.
Summary: don't look for a linear regression line running across a (non-existent) color axis; rather, look for color_* factors, one for each color. As far as your analysis algorithm is concerned, these are utterly independent features; the "one-hot" encoding (a term from digital circuit design) is merely our convention for dealing with this.
Does this help your understanding?
After your edit of the question 02:03 Z 04 Dec 2015:
No, your assumption is not correct: the two representations are not merely a matter of convenience. The ordering of colors works for this example -- because the effect happens to be a neat, linear function of the chosen encoding. As your example shows, your simpler encoding assumes that White-to-Red-to-Black pricing is a linear progression. What do you do when Green, Blue, and Brown are all $25, the rare Yellow is worth $500, and Transparent reduces the price by $1,000?
Also, how is it that you know in advance that Black is worth more than White, in turn worth more than Red?
Consider the case of housing prices based on elementary school district, with 50 districts in the area. If you use a numerical coding -- school district number, ordinal position alphabetically, or some other arbitrary ordering -- the regression software will have great trouble finding a correlation between that number and the housing price. Is PS 107 a more expensive district than PS 32 or PS 15? Are Addington and Bendemeer preferred to Union City and Ventura?
Splitting these into 50 different features under that one-hot principle decouples the feature from the encoding, and allows the analysis software to handle them in a mathematically meaningful manner. It's not perfect by any means -- expanding from, say, 20 features to 70 means that it will take longer to converge -- but we do get meaningful results for the school district.
If you wish, you could now encode that feature in the expected order of value, and get a reasonable fit with little loss of accuracy and faster prediction from your model (fewer variables).
You cannot use ordinal encoding for a categorical variable where the order doesn't matter. The main purpose of building a regression model is to see how much a change in one variable affects the response variable. When you obtain the regression formula, this is how you read it: "a 1-unit change in variable X causes a theta_X change in the response variable".
For example, let's say you built a regression model on housing prices and you got this: price = 1000 + (-50)*age_of_house. This means a 1-year increase in the age of the house causes the price to go down by 50.
When you have a categorical variable, you cannot speak of a unit change in that variable; you cannot say "1 unit increase/decrease in the color", etc. So one-hot encoding, as Prune said in his/her answer, is merely a convention for dealing with categorical variables. It lets you interpret the results as, for example: if the coefficient of color_white in your final model is +200, a white house adds $200 to the value. If the house is not white, that variable has no impact on the response variable because its value is 0.
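A small sketch of that interpretation (toy data; fitted without an intercept so each coefficient is simply the value the model attaches to that color):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: the price differs only by color
df = pd.DataFrame({
    "color": ["white", "white", "red", "red", "black", "black"],
    "price": [1200,    1180,    1300,  1320,  1500,    1510],
})

X = pd.get_dummies(df[["color"]], dtype=int)   # color_black, color_red, color_white
model = LinearRegression(fit_intercept=False).fit(X, df["price"])

# One coefficient per color; a color whose dummy is 0 contributes nothing to the prediction
print(dict(zip(X.columns, model.coef_.round(1))))
# {'color_black': 1505.0, 'color_red': 1310.0, 'color_white': 1190.0}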
Don't forget that "Linear Regression" models can only explain linear relations between variables.
I hope this helps.