I am using the H2O autoencoder for anomaly detection to find outlier data in my model, but the issue is that the autoencoder only accepts numerical predictors.
My requirement is to find outliers based on CardNumber or merchant number.
CardNumber is 12 digits (e.g. 342178901244) and mostly unique, so it is nominal data, and we cannot one-hot encode it either, since that would create as many new fields as there are unique card numbers.
So please suggest a way we can include categorical data and still run the autoencoder.
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[70],
                                ignore_const_cols=False,
                                epochs=40)
model.train(x=predictors, training_frame=train.hex)

# Get per-feature reconstruction error (anomaly score)
test_rec_error = model.anomaly(test.hex, per_feature=True)
train_rec_error = model.anomaly(train.hex, per_feature=True)
# Flag rows whose reconstruction MSE exceeds the boxplot's top whisker
recon_error_df['outlier'] = np.where(recon_error_df['Reconstruction.MSE'] > top_whisker, 'outlier', 'no_outlier')
You can't put an almost-unique categorical feature into a predictor (an autoencoder or anything else) and expect it to work.
Instead you need to extract meaningful features from it, which depend on the problem you want to solve. For example, if it is a credit card number you could add a feature encoding the card network (Visa, Mastercard, American Express, ...).
The limit is only your imagination and knowledge of the domain.
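For instance, here is a minimal sketch of deriving the card network from the leading digits, assuming a pandas DataFrame df containing the CardNumber column from the question; the prefix mapping below covers only a few common cases and is illustrative:

import pandas as pd

def card_network(card_number: str) -> str:
    # The leading digits (issuer identification number) indicate the network.
    if card_number.startswith(("34", "37")):
        return "AMEX"
    if card_number.startswith("4"):
        return "VISA"
    if card_number[:2] in {"51", "52", "53", "54", "55"}:
        return "MASTERCARD"
    return "OTHER"

df["card_network"] = df["CardNumber"].astype(str).map(card_network)
# This low-cardinality feature can now be one-hot encoded safely.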
I want to build a classifier that predicts whether user i will retweet tweet j.
The dataset is huge: it contains 160 million tweets. Each tweet comes with some metadata (e.g. whether the retweeter follows the author of the tweet).
The text tokens for a single tweet are an ordered list of BERT ids. To get the embedding of the tweet, you just use the ids (so it is not raw text).
Is it possible to fine-tune BERT to do the prediction? If yes, what courses/sources do you recommend for learning how to fine-tune? (I'm a beginner.)
I should add that the prediction should be a probability.
If it's not possible, I'm thinking of converting the embeddings back to text and then using some arbitrary classifier that I'll train.
You can fine-tune BERT, and you can use BERT to do retweet prediction, but you need more architecture in order to predict if user i will retweet tweet j.
Here is an architecture off the top of my head.
At a high level:
Create a dense vector representation (embedding) of user i (perhaps containing something about the user's interests, such as sports).
Create an embedding of tweet j.
Create an embedding of the combination of the first two embeddings, for example by concatenation or the Hadamard product.
Feed this embedding through a NN that performs binary classification to predict retweet or non-retweet.
Let's break this architecture down by item.
To create an embedding of user i, you will need to create some kind of neural network that accepts whatever features you have about the user and produces a dense vector. This part is the most difficult component of the architecture. This area is not in my wheelhouse, but a quick google search for "user interest embedding" brings up this research paper on an algorithm called StarSpace. It suggests that it can "obtain highly informative user embeddings according to user behaviors", which is what you want.
To create an embedding of tweet j, you can use any type of neural network that takes tokens and produces a vector. Research prior to 2018 would have suggested using an LSTM or a CNN to produce the vector. However, BERT (as you mentioned in your post) is the current state of the art. It takes in text (or token indices) and produces a vector for each token; one of those tokens is the prepended [CLS] token, which is commonly taken to be the representation of the whole sentence. This article provides a conceptual overview of the process. It is in this part of the architecture that you can fine-tune BERT. This webpage provides concrete code using PyTorch and the Huggingface implementation of BERT to do this step (I've gone through the steps and can vouch for it). In the future, you'll want to google for "BERT single sentence classification".
To create an embedding representing the combination of user i and tweet j, you can do one of several things. You can simply concatenate them into one vector; so if user i is an M-dimensional vector and tweet j is an N-dimensional vector, the concatenation produces an (M+N)-dimensional vector. An alternative approach is to compute the Hadamard product (element-wise multiplication); in this case, both vectors must have the same dimension.
To make the final classification of retweet or not-retweet, build a simple NN that takes the combination vector and produces a single value. Since you are doing binary classification, a NN with a logistic (sigmoid) output is appropriate. You can interpret the output as the probability of retweeting, so a value above 0.5 would be classified as a retweet. See this webpage for basic details on building a NN for binary classification.
In order to get this whole system to work, you need to train it all together end-to-end. That is, you have to get all the pieces hooked up first and train it rather than training the components separately.
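Here is a minimal PyTorch sketch of that end-to-end model, assuming a recent version of the Huggingface transformers library; names like user_feature_dim and the tower sizes are illustrative placeholders, not a prescribed design:

import torch
import torch.nn as nn
from transformers import BertModel

class RetweetClassifier(nn.Module):
    def __init__(self, user_feature_dim, user_emb_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # User tower: maps raw user features to a dense embedding.
        self.user_tower = nn.Sequential(
            nn.Linear(user_feature_dim, user_emb_dim), nn.ReLU()
        )
        combined_dim = self.bert.config.hidden_size + user_emb_dim
        # Head: combined vector -> probability of retweet.
        self.head = nn.Sequential(nn.Linear(combined_dim, 1), nn.Sigmoid())

    def forward(self, input_ids, attention_mask, user_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        tweet_emb = out.last_hidden_state[:, 0]              # [CLS] vector
        user_emb = self.user_tower(user_features)
        combined = torch.cat([tweet_emb, user_emb], dim=-1)  # concatenation
        return self.head(combined).squeeze(-1)               # P(retweet)

Training this with nn.BCELoss() fine-tunes BERT and the user tower jointly, which is exactly the "train it all together" point above.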
Your input dataset would look something like this:
user                          tweet                  retweet?
----                          -----                  --------
20 years old, likes sports    Great game             Y
30 years old, photographer    Teen movie was good    N
If you want an easier route where there is no user personalization, then just leave out the components that create an embedding of user i. You can use BERT to build a model to determine if the tweet is retweeted without regard to user. You can again follow the links I mentioned above.
There is already an answer on this on Data Science SE, which explains why BERT cannot be used for next-word prediction. Here is the gist:
BERT can't be used for next word prediction, at least not with the current state of the research on masked language modeling.
BERT is trained on a masked language modeling task and therefore you cannot "predict the next word". You can only mask a word and ask BERT to predict it given the rest of the sentence (both to the left and to the right of the masked word).
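To make that masked-prediction point concrete, here is a minimal sketch assuming the Huggingface transformers pipeline API:

from transformers import pipeline

# BERT can fill a masked slot given context on both sides of it...
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK]."))  # top candidates with scores
# ...but it has no head for predicting the *next* word of unfinished text.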
But as I understand from your case, you want to do classification, and BERT is fully equipped to do that. Please refer to the link I have posted below. It will help you classify the tweets according to their topic so you can then view them at your leisure.
I'm working on a regression algorithm, in this case k-nearest neighbors, to predict the price of a product.
My training set has only one categorical feature, with 4 possible values. I've dealt with it using a one-of-K (one-hot) encoding scheme, which means I now have 3 more columns in my Pandas DataFrame with a 0/1 depending on the value present.
The other features in the DataFrame are mostly distances, like latitude/longitude for locations, and prices, all numerical.
Should I standardize (to zero mean and unit variance) and normalize before or after the categorical encoding?
I'm thinking it might be beneficial to normalize after encoding, so that when measuring distances between neighbors every feature is as important to the estimator as every other, but I'm not really sure.
Seems like an open problem, so I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit-learn's preprocessing.StandardScaler(), and it doesn't work if your feature vectors do not all have the same length: scaler.fit(X_train) yields ValueError: setting an array element with a sequence.
I can see from your description that your data have a fixed number of features, but for generalization purposes (maybe you will have new features in the future?) it's good to assume that each data instance may have a different feature-vector length. For instance, I transform my text documents into word indices with Keras's text_to_word_sequence (which gives me varying vector lengths), then convert them to one-hot vectors, and then standardize them. I have actually not seen a big improvement from the standardization.
I think you should also reconsider which of your features to standardize, as dummies might not need to be standardized; here it doesn't seem like the categorical attributes need any standardization or normalization. K-nearest neighbors is distance-based, so it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and checking how different models react with your dataset and task.
After. Just imagine that you had strings in your column rather than numbers. You can't standardize strings, right? :)
But given what you wrote about the categories: if they are represented with values, I suppose there is some kind of ranking inside, so you could probably use the raw column rather than one-hot encoding it. Just a thought.
You generally want to standardize all your features, so it would be done after the encoding (assuming you want to standardize in the first place; some machine learning algorithms do not need standardized features to work well).
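As a concrete illustration, here is a minimal scikit-learn sketch of encode-then-scale for the k-NN setup in the question; the column names and the X_train/X_test/y_train frames are illustrative assumptions:

from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ("scale", StandardScaler(), ["latitude", "longitude"]),
])
model = Pipeline([("prep", preprocess), ("knn", KNeighborsRegressor())])
model.fit(X_train, y_train)   # encoder categories and scaler statistics come from training data only
predictions = model.predict(X_test)

Note that the dummy columns go through the encoder, not the scaler, matching the earlier point that dummies may not need standardization.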
So the voting on whether to standardize the data is about 50/50.
I would suggest, given the positive effects in terms of improvement gains, however small, and the absence of adverse effects, that one do the standardization before splitting and training the estimator.
I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10,000. The output of the random forest should be the items the user is likely to interact with (like a recommender system). For any user, I want to use a feature describing the items the user has interacted with in the past. However, mapping the categorical product feature to a one-hot encoding seems very memory-inefficient, since a user interacts with no more than a couple of hundred items at most, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10,000 possible values and the output is a categorical variable with ~10,000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features to vectors of small dimension.
This is similar to word embeddings, but for categorical features. In practical terms, you define an embedding of your discrete feature space into a vector space of low dimension. It can improve your results and save memory. The downside is that you do need to train a neural network model beforehand to define the embedding.
Check this article for more information.
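As a minimal sketch of the idea, assuming PyTorch; the dimensions are illustrative:

import torch
import torch.nn as nn

n_items = 10_000   # distinct categorical values
emb_dim = 32       # small dense dimension replacing 10,000 one-hot columns

embedding = nn.Embedding(n_items, emb_dim)
item_ids = torch.tensor([5, 42, 9999])  # integer-encoded category values
vectors = embedding(item_ids)           # shape (3, emb_dim)
# Train this layer inside a small supervised network first, then reuse the
# learned vectors as numeric input features for the random forest.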
XGBoost doesn't support categorical features directly; you need to do preprocessing to use it with categorical features. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your categorical feature.
CatBoost does have categorical-feature support: both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
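A minimal sketch of that CatBoost setup, assuming the catboost package and a pandas DataFrame X_train with an illustrative item_id column:

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=200,
    one_hot_max_size=10,  # one-hot encode categorical features with <= 10
                          # values; higher-cardinality ones get statistics
)
model.fit(X_train, y_train, cat_features=["item_id"])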
Assuming you have enough domain expertise, you could create a new categorical column from the existing column.
For example, if your column has the values
A, B, C, D, E, F, G, H
and you are aware that A, B, C are similar, D, E, F are similar, and G, H are similar,
your new column would be
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new column. Note that by transforming your features like this you lose some explainability of your model.
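A minimal pandas sketch of that grouping; the mapping itself has to come from domain knowledge:

import pandas as pd

df = pd.DataFrame({"col": ["A", "B", "C", "D", "E", "F", "G", "H"]})
grouping = {"A": "Z", "B": "Z", "C": "Z",
            "D": "Y", "E": "Y", "F": "Y",
            "G": "X", "H": "X"}
df["col_grouped"] = df["col"].map(grouping)
df = df.drop(columns=["col"])  # keep only the new, coarser column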
I am researching which features I'll have for my machine learning model, given the data I have. My data contains a lot of text, so I was wondering how to extract valuable features from it. Contrary to my previous belief, this often consists of a representation such as bag-of-words, or something like word2vec (http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
Because my understanding of the subject is limited, I don't understand why I can't analyze the text first to get numeric values (for example TextBlob's sentiment, https://textblob.readthedocs.io/en/dev/, or Google Cloud Natural Language, https://cloud.google.com/natural-language/).
Are there problems with this, or could I use these values as features for my machine learning model?
Thanks in advance for all the help!
Of course, you can convert text input to a single number with sentiment analysis and then use this number as a feature in your machine learning model. There is nothing wrong with this approach.
The question is what kind of information you want to extract from the text data. Sentiment analysis converts text input to a number between -1 and 1 that represents how positive or negative the text is. For example, you may want sentiment information from customers' comments about a restaurant to measure their satisfaction. In this case, it is fine to use sentiment analysis to preprocess the text data, as in the sketch below.
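A minimal sketch with TextBlob; the example sentences are made up:

from textblob import TextBlob

texts = ["The food was amazing!", "Terrible service, never again."]
polarity_features = [TextBlob(t).sentiment.polarity for t in texts]
# Each polarity is in [-1, 1]; use it directly as a numeric feature.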
But again, sentiment analysis only gives an idea of how positive or negative a text is. You may want to cluster the text data, and sentiment information is not useful in that case, since it does not provide any information about the similarity of texts. Thus, other approaches such as word2vec or bag-of-words are used to represent text data in those tasks, because those algorithms provide a vector representation of the text instead of a single number.
In conclusion, the approach depends on what kind of information you need to extract from the data for your specific task.
There are two data sets: the training one, and a data set of features whose labels are yet to be predicted (the new one).
I built a Random Forest classifier. Along the way I had to do two things:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Now I have two questions. When I am predicting labels for the new data:
Do I need to normalize the incoming features? (Common sense tells me yes :).) If so, should I take the mean, max, and min values for a specific feature from the training data set, or should I somehow take the new values of the features into account?
How do I one-hot-encode the new values of the features? Do I expand the dictionary of possible categories for a specific feature to take the possibly new values into account?
In my case I possess both data sets, so I could calculate all this stuff in advance, but what if I only had a classifier and a new data set?
I only have a basic knowledge of the type of classifiers and normalization techniques you're using, but the general rule, that I think applies to what you're doing as well, is to do the following.
Your classifier is not a Random Forest Classifier. That is only one step of the pipeline that acts as your actual classifier. This pipeline / actual classifier is what you describe:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Use a Random Forest Classifier on what you get from the first 2 steps.
This pipeline, that encompasses 3 things, is what you're actually using as your classifier.
Now, how does a classifier work?
You build some state based on the training data.
You use that state to make predictions on the test data.
So:
Do I need to normalize the incoming features? (Common sense tells me yes :).) If so, should I take the mean, max, and min values for a specific feature from the training data set, or should I somehow take the new values of the features into account?
Your classifier normalizes the incoming features for the training data, so it will normalize those for unseen instances too. To do this, it must use the state it has built during training.
For example, if you were doing min-max scaling on your features, your state would store a min(f) and max(f) for each feature f. Then, during testing / prediction, you would do min-max scaling for each feature f using the stored min(f) and max(f) values.
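If you happen to use scikit-learn, that stored state is exactly what fit learns and transform reuses; a minimal sketch (X_train and X_test are placeholders):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min(f) and max(f)
X_test_scaled = scaler.transform(X_test)        # reuses the stored state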
I'm not sure what you mean by "normalize continuous numeric features". Do you mean discretization? If you build some state for this discretization during training, then you need to find a way to factor that in.
How do I one-hot-encode the new values of the features? Do I expand the dictionary of possible categories for a specific feature to take the possibly new values into account?
Don't you know how many values each category can have beforehand? Usually you do, since categoricals are things like nationality or continent: things you know in advance. If you can get a value for a categorical feature that you haven't seen during training, it raises the question of whether you should even care about it. What good is a categorical value you've never trained on?
Maybe add an "unknown" category. I think expanding by a single one should be fine; what good would more do if you've never trained on them?
What kind of categoricals do you have?
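On the "unknown" idea: scikit-learn's OneHotEncoder can map unseen categories to an all-zero row instead of failing. A minimal sketch (the country codes are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["DE"], ["FR"], ["US"]]))          # categories seen in training
print(enc.transform(np.array([["BR"]])).toarray())   # [[0. 0. 0.]]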
I could be wrong, but do you really need one-hot encoding? AFAIK, tree-based classifiers don't seem to benefit that much from it.