OneHotEncoding 2500 different categorical variables

I am working on a flight recommendation project where the airport code of each source is given along with some other data, and from that I have to predict the destinations the airplane can reach.
I have to deal with 6+ million rows, so I am running into a problem while one-hot encoding the airport codes (which in the present dataset number more than 3000) before fitting a model.
Can anyone suggest how to one-hot encode, or otherwise deal with, this kind of problem?
from sklearn.preprocessing import OneHotEncoder

onehotencoder1 = OneHotEncoder()
onehotencoder1.fit(X)            # learn the set of airport codes
X = onehotencoder1.transform(X)
and I am getting "Unable to allocate 11.3 GiB".
I tried it with less data and it works.

Have you tried pandas? It has a similar get_dummies function which may work.
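For what it's worth, a minimal sketch of the memory-friendly route (the DataFrame df and the column name 'airport_code' are my assumptions, not from the question): keep the encoded matrix sparse instead of converting it to a dense array.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# scikit-learn route: keep the result as a scipy sparse matrix; avoid calling .toarray() on it
enc = OneHotEncoder(handle_unknown='ignore')
X_codes = enc.fit_transform(df[['airport_code']])

# pandas route suggested above, with sparse output
dummies = pd.get_dummies(df['airport_code'], sparse=True)

Many scikit-learn estimators (the linear models, for example) accept sparse input directly, so the 3000+ columns never need to be materialized densely.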


Hello all I need some help regarding embeddings selection for relational GNN

Currently I am using a CSV file converted from a pcap. I took the Length column from the CSV and used it as my embedding. The code runs and I do get accuracy in the high 70s, around 77 percent. I am just not sure whether this is an appropriate choice for an embedding. I am also getting an issue where some datasets give weird results, such as: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples. I know people have answered that question before, but I tried all of their methods and still have no clue why my model works for some datasets and not all. If someone could confirm whether what I am doing makes sense, that would really help me.
CSV file snapshot for reference
import torch
from sklearn.preprocessing import StandardScaler

df['embeddings'] = df['Length']
embeddings = torch.from_numpy(df['embeddings'].to_numpy())
# standardizing the length values
scale = StandardScaler()
embeddings = scale.fit_transform(embeddings.reshape(-1, 1))
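A hedged sketch of one alternative: build the node features from several numeric columns rather than Length alone (the extra column name here is illustrative; substitute whatever numeric fields exist in the CSV):

import torch
from sklearn.preprocessing import StandardScaler

feature_cols = ['Length', 'Time']                           # assumed numeric columns from the pcap CSV
features = StandardScaler().fit_transform(df[feature_cols].to_numpy())
embeddings = torch.tensor(features, dtype=torch.float32)    # shape: (num_nodes, num_features)

Whether this is "appropriate" depends on the graph construction, but richer per-node features generally give the GNN more to work with than a single scalar.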

Image Classification with single class dataset using Transfer Learning [closed]

I only have around 1000 images of computers. I need to train a model that can identify whether an image is a computer or not a computer. I do not have a dataset for "not computer", as it could be anything.
I guess the best method for this would be transfer learning. I am trying to train on a pre-trained VGG19 model, but I am still unsure how to train a model with only computer images and no non-computer images.
I am new to ML overall, so sorry if the question is not to the point.
No way, I'm sorry. You'll need a lot of non-computer images (at least another 1000). You can take them from anywhere; the more they "vary", the easier it is for your model to extract what features characterize a computer.
Imagine being a baby that is trained to always say "yes" in front of something; next time you see something, you'll say "yes" no matter what is in front of you.
The same goes for machine learning models: you need positive examples and negative examples, or your model will reach 100% accuracy by always predicting "yes".
If you want to see it mathematically/geometrically, you can view each sample (in your case an image) as a point in feature space: imagine drawing an axis for each attribute you have (x, y, z and so on); an image is then a point in that space.
For simplicity, consider a 2-dimensional space, which means each image is described by 2 attributes (not the case for real images, where the features are usually numerous, but for simplicity imagine feature_1 = number of colors, feature_2 = number of angles). In this example we can simply draw one point in a Cartesian graph for each image.
The objective of a classifier is to draw a line that best separates the red dots from the blue dots, i.e. separates positive examples from negative examples.
If you give the model only positive samples (which is what you were going to do), you'll have infinitely many models with 100% accuracy! You can put the line wherever you want; the only requirement is not to "cut" your dataset.
Given that I suppose you are a beginner, I'll just tell you what to do, not how, because that would take years ;)
1) Collect data - as I said, you need negative examples too, at least another 1000 samples.
2) Split the data into train/test - a good split could be 2/3 of the samples in the training set and 1/3 in the test set. [REMEMBER] Keep the class distribution consistent, i.e. if you have 50%-50% of the classes "computer"-"non computer", keep that percentage in both the train set and the test set (see the sketch right after this list).
3) Train a model - have a look at this link for a guided example; it uses the MNIST dataset, a famous image classification benchmark, but you should use your own data.
4) Test the model on the test set and look at the performance.
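A minimal sketch of the stratified split in step 2 (the names X and y and the 1/3 test fraction are illustrative):

from sklearn.model_selection import train_test_split

# stratify=y keeps the computer / non-computer ratio identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)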
While it is not impossible to take data belonging to only one class and then use methods to classify whether other data belong to the same class or not, you usually do not end up with very good accuracy that way.
One way to do this is to use something called "autoencoders". The point is that you use the same image as both input and target, and you force the model (usually a neural network) to compress the image so that it only stores what is important to recreate images of computers. Ideally, this leads to a model that is good at recreating images of computers and bad at everything else, which means you can measure the reconstruction loss on the output, and if it is higher than some threshold you've decided on, you deem it to be something else. Again, you're probably not going to get anything close to 90% accuracy this way, but it is one approach to your problem.
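A rough sketch of that idea (the architecture, image size, and threshold below are my own illustrative choices, not a tuned recipe):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# small dense autoencoder over 64x64 RGB images scaled to [0, 1]
inputs = keras.Input(shape=(64, 64, 3))
x = layers.Flatten()(inputs)
x = layers.Dense(256, activation="relu")(x)
code = layers.Dense(64, activation="relu")(x)            # bottleneck
x = layers.Dense(256, activation="relu")(code)
x = layers.Dense(64 * 64 * 3, activation="sigmoid")(x)
outputs = layers.Reshape((64, 64, 3))(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(computer_images, computer_images, epochs=20, batch_size=32)

def looks_like_a_computer(img, threshold=0.02):
    # high reconstruction error suggests the input is unlike the training images
    recon = autoencoder.predict(img[np.newaxis])[0]
    return float(np.mean((img - recon) ** 2)) < threshold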
A better approach is to go hunting for models that were pre-trained on a dataset which included computers, take that same dataset and map all computers to one class (plus your own images, making sure they adhere to the dataset format) and a selection of the other images to the other class. Make sure the classes are not too unbalanced, otherwise your model will suffer. Extend the pre-trained model with a couple of layers (fully connected should do fine) and make the pre-trained part of the model non-trainable, so you don't mess up its good weights while you're effectively telling it to ignore everything that is not a computer.
This is probably your best bet, but it will require a bit more effort on your side in terms of finding all of the parts you need and understanding how to integrate that code into yours; a minimal sketch of the idea follows below.
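Something along these lines (VGG19, the input size, and the head layers are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG19

base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False   # keep the pre-trained weights frozen

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # computer vs. not-computer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=10)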
You can use transfer learning with a model pre-trained on the ImageNet dataset. As mentioned in another answer, ImageNet contains a bunch of classes close to computers and electronic devices (such as monitors, CD players, laptops, speakers, etc.), so you can fine-tune the model on your dataset and train it to predict computers (train on around 750 images and test on the remaining 250).
Alternatively, you can manually collect images of objects other than computers, preferably a lot of electronic devices (because they are close to computers) and a bunch of other household things (there is a home objects dataset by Caltech). You should collect about 1000 such images to keep the classes balanced. You can then train your own custom model on this dataset.
No problem!
Step one: install a deep-learning toolkit of your choice. They all come with nice tutorials these days.
Step two: grab a pre-trained ImageNet model. That model already has a few computer classes built into it ("desktop_computer", "laptop", "notebook", and a class for hand-held computers, "hand-held_computer").
Step three: use the model to predict. For this, you'll need your images resized to the correct input size.
More steps: further fine-tune the model. A bit more advanced, but it will give you some gains.
Something to think about: what is your goal? Accuracy? False positives/negatives? It's always good to know from the start what you need to accomplish.
EDIT: probably the easiest way to get started (if you don't have the libraries, a GPU, etc.) is to go to Google Colab ( https://colab.research.google.com/notebooks/welcome.ipynb ), make a notebook in your browser, and run the following code.
# some code taken and modified from https://www.learnopencv.com/keras-tutorial-using-pre-trained-imagenet-models/
import keras
import numpy as np
from keras.applications import vgg16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.imagenet_utils import decode_predictions
import matplotlib.pyplot as plt
from PIL import Image
import requests
from io import BytesIO
%matplotlib inline

vgg_model = vgg16.VGG16(weights='imagenet')

def predict_image(image_url, model):
    # download the image and resize it to the VGG16 input size
    response = requests.get(image_url)
    original = Image.open(BytesIO(response.content))
    newsize = (224, 224)
    original = original.resize(newsize)

    # convert the PIL image to a numpy array
    # in PIL the image is (width, height, channel); in numpy it is (height, width, channel)
    numpy_image = img_to_array(original)

    # convert the image into batch format:
    # expand_dims adds an extra dimension at axis 0, so the input becomes
    # (batchsize, height, width, channels)
    image_batch = np.expand_dims(numpy_image, axis=0)
    plt.imshow(np.uint8(image_batch[0]))
    plt.show()

    # prepare the image for the VGG model
    processed_image = vgg16.preprocess_input(image_batch.copy())

    # get the predicted probabilities for each class
    predictions = model.predict(processed_image)

    # convert the probabilities to class labels (top 5 predictions by default)
    label = decode_predictions(predictions)
    print(label[0][0:2])  # just display the top 2

urls = ['https://4.imimg.com/data4/CO/YS/MY-29352968/samsung-desktop-computer-500x500.jpg',
        'https://cdn.britannica.com/77/170477-050-1C747EE3/Laptop-computer.jpg']
for u in urls:
    predict_image(u, vgg_model)
This should be a good starting point. Oh, and if the top predicted label is not in the computer, laptop, etc set, then it's NOT a computer!
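To make that last check concrete, a hedged sketch (the set of class names is illustrative, not an exhaustive list of ImageNet's computer-related classes):

from keras.applications.imagenet_utils import decode_predictions

COMPUTER_CLASSES = {'desktop_computer', 'laptop', 'notebook', 'hand-held_computer'}

def is_computer(predictions):
    # decode_predictions returns (class_id, class_name, score) tuples; take the top-1 name
    top_label = decode_predictions(predictions, top=1)[0][0][1]
    return top_label in COMPUTER_CLASSES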

When is a dataset called unbalanced?

I have a dataset (based on the Million Song Dataset) on which I need to do genre classification. The following is the distribution of genre classes in the dataset.
     Genre        Count     %age
 1.  Rock        115104     39.94364359
 2.  Pop          47534     16.49535337
 3.  Electronic   24313      8.437150809
 4.  Jazz         16465      5.713720564
 5.  Rap          15347      5.325749741
 6.  RnB          13769      4.778148706
 7.  Country      13509      4.687922933
 8.  Reggae        8739      3.032627027
 9.  Blues         7075      2.455182083
10.  Latin         7042      2.44373035
11.  Metal         6257      2.171317921
12.  World         4624      1.604630664
13.  Folk          3661      1.270448283
14.  Punk          3479      1.207290242
15.  New Age       1248      0.433083709
Would you call this data unbalanced? I've tried reading around, but I found that people describe datasets as unbalanced when one of the classes is 99% of the dataset and it's a binary classification problem. I'm not sure whether the above set falls into this category. Please help: I'm not able to get the classification right, and being a newbie I can't decide whether it's the data or my naivety. This is one of the hypotheses I have and need to validate.
In general there is no strict definition of an imbalanced dataset, but as a rule of thumb, if the smallest class is 10x smaller than the largest one, calling it imbalanced is reasonable.
In your case, the smallest class is actually about 100x smaller than the biggest one, so it maps onto the "99-1" situation you mention for binary classification. If you only ask about differentiating between New Age and Rock, you end up with a 99-1 imbalance, so you can expect the problems typical of imbalanced classification to appear in your project.
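A quick way to quantify this and to compensate for it in scikit-learn (df and the column name 'genre' are assumed names; the classifier is just an example):

from sklearn.linear_model import LogisticRegression

counts = df['genre'].value_counts()
print(counts.max() / float(counts.min()))   # roughly 92x between Rock and New Age

# most scikit-learn classifiers accept class_weight='balanced',
# which reweights samples inversely to class frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
# clf.fit(X_train, y_train)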

Why doesn't model.predict() work well on novel MNIST-like input?

I'm an experienced developer, new to Machine Learning. I'm experimenting with Keras/TensorFlow, starting with the mnist_mlp.py example. I installed Keras and TensorFlow using pip on a Mac.
In order to understand the inner workings better, instead of running the file ('python mnist_mlp.py'), I'm cutting and pasting the file contents into a Python (2.7.12) interactive window.
Everything runs fine and I get the 98.4% test accuracy as noted in the comments of that file.
What I want to do next is to feed it novel input and use model.predict() to see how it performs. I create 28x28 images in GIMP and bring them into my Python session (being careful to convert from 4-channel, 8-bit RGBA images to a linear single-channel floating-point array).
When I feed this into the model, I get what look like strange results to me. Some images are correctly categorized while others are wildly off.
They look like perfectly reasonable numbers to me, and they match the MNIST examples pretty closely. When I extract the array back out and look at it, it looks OK, so it doesn't seem to be a flipping or flopping issue. When I feed MNIST images in the same way, they appear to work correctly.
I'm not sure what's going on here. Is it a case of overfitting? Why is the validation data set the same as the test set?
Test images and python code with instructions can be found here:
https://s3.amazonaws.com/stackoverflow-47799896/StackOverflow_47799896.zip
Thanks.
EDIT: I tried the same test with the convnet example (mnist_cnn.py) and got slightly better results but still similar errors. If anyone wants to try that, they can use the same functions in the readme.py file but make these changes:
import numpy as np

x = np.ndarray((1, 28, 28, 1), dtype='float32')

def l(s):
    # read a raw 28x28 4-channel dump, keeping one byte per pixel scaled to [0, 1]
    with open(s, 'rb') as fd:
        _ = fd.read(1)                     # skip a leading byte
        for i in xrange(28):
            for j in xrange(28):
                v = ord(fd.read(1))
                x[0][i][j][0] = v / 255.0
                _ = fd.read(3)             # skip the remaining bytes of this pixel
EDIT 2: Interestingly, if I replace the first 19 items in the training data set (out of 60,000) with my images in the MLP case, I get at or near perfect prediction of all my images after training. Does this suggest overfitting?
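For anyone reproducing this, a hedged PIL-based alternative to the raw-byte reader above (the file name is illustrative; depending on how the digits were drawn you may also need to invert them, since MNIST digits are white on black):

import numpy as np
from PIL import Image

img = Image.open('digit.png').convert('L')         # collapse RGBA to single-channel grayscale
arr = np.asarray(img, dtype='float32') / 255.0     # scale to [0, 1]
x = arr.reshape(1, 28, 28, 1)                      # batch of one, channels-last, for the convnet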

Scikit-learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation I was getting around 62-65% for each split (cv=5).
I thought that if I made the data quadratic, the accuracy was supposed to increase. However, I'm getting the opposite effect (each cross-validation split is in the 40s, percentage-wise), so I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using:
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

X_scaled = preprocessing.scale(X)
poly = PolynomialFeatures(3)
poly_x = poly.fit_transform(X_scaled)
classifier = LogisticRegression(penalty='l2', max_iter=200)
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338,  0.4269341 ,  0.49425287,  0.58908046,  0.60518732])
Which makes me suspect I'm doing something wrong.
I tried transforming the raw data into quadratic features first and then using preprocessing.scale to scale the data, but it produced a warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
  warnings.warn("Numerical issues were encountered "
So I didn't bother going down this route.
The other thing that's bothering me is the speed of the polynomial computations: cross_val_score takes around a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM on Windows 7.
Thank you.
Have you tried using MinMaxScaler instead of scale? scale will output values both above and below 0, so you will run into situations where a value scaled to -0.1 and one scaled to 0.1 have the same square despite not really being similar at all. Intuitively this seems like something that would lower the score of a polynomial fit. That being said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
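To illustrate, a minimal sketch of that suggestion (degree 2 here purely to keep the run time manageable; variable names mirror the question):

from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_scaled = MinMaxScaler().fit_transform(X)          # everything in [0, 1], so squares keep their ordering
poly_x = PolynomialFeatures(degree=2).fit_transform(X_scaled)
classifier = LogisticRegression(penalty='l2', max_iter=200)
print(cross_val_score(classifier, poly_x, y, cv=5))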
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true only if the polynomial features bring the model closer to the underlying data generating process. Polynomial features, especially ones that make every feature interact and polynomial, may move the model further from the data generating process; hence worse results can be expected.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 regularization, grouped lasso, RFE, or Bayesian methods. Ask SMEs (subject matter experts who may be able to identify specific features that are likely to be polynomial). Plot the data to see which features may interact or be best expressed as a polynomial (a feature-selection sketch follows at the end of this answer).
I have not looked at it for a while, but I recall discussions on hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your model behaves best as an ill-formulated hierarchical model.
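A hedged sketch of one of those feature-selection routes, L1-based selection on the expanded matrix (the C value is arbitrary and would need tuning):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

selector = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
poly_x_reduced = selector.fit_transform(poly_x, y)
print(poly_x_reduced.shape)   # far fewer than 41664 columns, depending on C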
