ML/DL Train Model on Single Column Feature DataSet - machine-learning

I am trying to build a model that can predict insurance names based on insurance Id.
Before putting the question to this forum, I have tried KNN and Decian Tree, but the accuracy does not exceed more than 60%.
In my data frame, I have one column as a feature and the other as a label.
I can also extract other features from this data as well like Is Numeric, length, etc.
I have 2.8M rows of data in this shape.
insurance_id
insurance_name
XOH830990804
Medicare
XOH01179276
Medicare
H55575577
Medicare
H71096147
WELLMED
IBPW01981926
BCBS
MT25110S
Aetna
WXQQ07123
Aetna
6WU7NSSGY63
Oxford
MX7ZZ35T
Oxford
DU00079Z
Welcare
PB95800M
UHC
Please guide me on which approach or model can help me to achieve an accuracy of more than 80%.

You can try to diversify your inputs
As an example, you can pass aditional features to the network, such as:
Length of the insurance_id
Quantity of numbers in the insurance_id
Quantity of letters in the insurance_id
Sum of all numbers in the insurance_id
And any other transform you might think of.
As the output layer of your network, you might wanna use Dense(n_of_different_insurance_names, activation='softmax')
and a categorical_crossentropy loss function when compiling the model

Related

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the overall similarity of the input X vectors for these values are in no way more similar than any random selection of X values across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular the methodology, but it could be something like cosine similarity), and those with not similar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
>> (23,312, 2149)
print(len(target))
>> 23,312
embedding = umap.UMAP().fit_transform(df, y=target)

LSTM, multi-variate, multi-feature in pytorch

I'm having trouble understanding the format of data for an LSTM in pytorch. Lets say i have a CSV file with 4 features, laid out in timestamps one after the other ( a classic time series)
time1 feature1 feature2 feature3 feature4
time2 feature1 feature2 feature3 feature4
time3 feature1 feature2 feature3 feature4
time4 feature1 feature2 feature3 feature4, label
However, this entire set of 4 sequences only has a single label. The thing we're trying to classify started at time1, but we don't know how to label it until time 4.
My question is, can a typical pytorch LSTM support this? All of the tutorials i've read, watched, walked through, involve looking at a time sequence of a single feature, or a word model, which is still a dataset with a single dimension.
If it can support it, does the data need to be flattened in some way?
Pytorch's LSTM reference states:
input: tensor of shape (L,N,Hin)(L, N, H_{in})(L,N,Hin​) when batch_first=False or (N,L,Hin)(N, L, H_{in})(N,L,Hin​) when batch_first=True containing the features of the input sequence. The input can also be a packed variable length sequence.
Does this mean that it cannot support any input that contains multiple sequences? Or is there another name for this?
I'm really lost here, and could use any advice, pointers, help, so on. Maybe some disambiguation too.
I've posted a couple times here but gotten no responses at all. If this post is misplaced, could someone kindly direct me towards the correct place to post it?
Edit: Following Daniel's advice, do i understand correctly that the four features should be put together like this:
[(feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4), label] when given to the LSTM?
If that's correct, is the input size (16) in this case?
Finally, I was under the impression that the output of the LSTM Would be the predicted label. Do I have that wrong?
As you show, the LSTM layer's input size is (batch_size, Sequence_length, feature_size). This means that the feature is assumed to be a 1D vector.
So to use it in your case you need to stack your four features into one vector (if they are more then 1D themselves then flatten them first) and use that vector as the layer's input.
Regarding the label. It is defiantly supported to have a label only after a few iterations. The LSTM will output a sequence with the same length as the input sequence, but when training the LSTM you can choose to use any part of that sequence in the loss function. In your case you will want to use the last element only.

Estimating both the category and the magnitude of output using neural networks

Let's say I want to calculate which courses a final year student will take and which grades they will receive from the said courses. We have data of previous students'courses and grades for each year (not just the final year) to train with. We also have data of the grades and courses of the previous years for students we want to estimate the results for. I want to use a recurrent neural network with long-short term memory to solve this problem. (I know this problem can be solved by regression, but I want the neural network specifically to see if this problem can be properly solved using one)
The way I want to set up the output (label) space is by having a feature for each of the possible courses a student can take, and having a result between 0 and 1 in each of those entries to describe whether if a student will attend the class (if not, the entry for that course would be 0) and if so, what would their mark be (ie if the student attends class A and gets 57%, then the label for class A will have 0.57 in it)
Am I setting the output space properly?
If yes, what optimization and activation functions I should use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want that the network is given the history of a student, and then outputs one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking the course), and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent an (A+)-student is "unlikely" to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum amount of points if (s)he takes the course, OR should the network output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values between [0,1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
Few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
There is more than one way to solve this problem. Andrey's answer gives a one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.

How to decide numClasses parameter to be passed to Random Forest algorithm in SPark MLlib with pySpark

I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)

Logistic regression training data set true/false ratio

I am working on a classifier, by logistic regression, based on Spark ML.
and I wonder should I train the equal quantity of data for true , false.
I mean
When I want to classify people into male or female,
Is it ok that train a model with 100 male data + 100 female data.
The online people may 40% male and 60% female , but this percent is forcasted based on the past, so it can be change(like 30% female, 70% male)
In this situation.
what female/male percent of data should I train?
is this related with overfitting?
when If I trained a model with 40%female + 60%male, It is useless to classifying a field data composed with 70%female+30%male?
Spark classification sample data has 43 false, 57true.
https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
what means the true/false ratio of trainig data in logisticregression?
I am really not good at English, but hope you understand me.
It should not matter what ratio you use, as long as it is reasonable.
60:40, 30:70, 50:50, it's okay. Just make sure it's not too lopsided, like 99:1.
If the entire data set is 70:30 female:male, and you want to only use a subset of this dataset, going for a 60:40 female:male ratio will not kill you.
Consider the following example:
Your test data contains 99% males, and 1 % female.
Technically, you can classify all males correctly, ALL females incorrectly, and your algorithm would show an error of 1%. Seems pretty good right? No, because your data is too lopsided.
This low error is not a result of overfitting (high variance), but rather a result of a lopsided data set.
This is an extreme example, but you get the point.

Resources