Sentiment Analysis classifier using Machine Learning - machine-learning

How can we make a working classifier for sentiment analysis since for that we need to train our classifier on huge data sets.
I have the huge data set to train, but the classifier object (here using Python), gives memory error when using 3000 words. And I need to train for more than 100K words.
What I thought was dividing the huge data set into smaller parts and make a classifier object for each and store it in a pickle file and use all of them. But it seems using all the classifier object for testing is not possible as it takes only one of the object during testing.
The solution which is coming in my mind is either to combine all the saved classifier objects stored in the pickle file (which is just not happening) or to keep appending the same object with new training set (but again, it is being overwritten and not appended).
I don't know why, but I could not find any solution for this problem even when it is the basic of machine learning. Every machine learning project needs to be trained in huge data set and the object size for training those data set will always give a memory error.
So, how to solve this problem? I am open to any solution, but would like to hear what is followed by people who do real time machine learning projects.
Code Snippet :
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
all_words = []
for w in movie_reviews.words():
all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]
def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words)
return features
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
PS : I am using the NLTK toolkit using NaiveBayes. My training dataset is being opened and stored in the documents.

There are two things you seem to be missing:
Datasets for text are usually extremely sparse, and you should store them as sparse matrices. For such representation, you should be able to store milions of documents inyour memory with vocab. of 100,000.
Many modern learning methods are trained in mini-batch scenario, meaning that you never need whole dataset in memory, instead, you feed it to the model with random subsets of data - but still training a single model. This way your dataset can be arbitrary large, memory consumption is constant (fixed by minibatch size), and only training time scales with the amount of samples.

Related

How to save large sklearn RandomForestRegressor model for inference

I trained a Sklearn RandomForestRegressor model on 19GB of training data. I would like to save it to disk in order to use it later for inference. As have been recomended in another stackoverflow questions, I tried the following:
Pickle
pickle.dump(model, open(filename, 'wb'))
Model was saved successfully. It's size on disk was 1.9 GB.
loaded_model = pickle.load(open(filename, 'rb'))
Loading of the model resulted in MemorError (despite 16 GB RAM)
cPickle - the same result as Pickle
Joblib
joblib.dump(est, 'random_forest.joblib' compress=3)
It also ends with the MemoryError while loading the file.
Klepto
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d['sklearn_random_forest'] = est
d.dump()
Arhcive is created, but when I want to load it using the following code, I get the KeyError: 'sklearn_random_forest'
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d.load(model_params)
est = d[model_params]
I tried saving dictionary object using the same code, and it worked, so the code is correct. Apparently Klepto cannot persist sklearn models. I played with cached and serialized parameters and it didn't help.
Any hints on how to handle this would be very appreciated. Is it possible to save the model in JSON, XML, maybe HDFS, or maybe other formats?
Try using joblib.dump()
In this method, you can use the param "compress". This param takes in Integer values between 0 and 9, the higher the value the more compressed your file gets. Ideally, a compress value of 3 would suffice.
The only downside is that the higher the compress value slower the write/read speed!
The size of a Random Forest model is not strictly dependent on the size of the dataset that you trained it with. Instead, there are other parameters that you can see on the Random Forest classifier documentation which control how big the model can grow to be. Parameters like:
n_estimators - the number of trees
max_depth - how "tall" each tree can get
min_samples_split and min_samples_leaf - the number of samples that allow nodes in the tree to split/continue splitting
If you have trained your model with a high number of estimators, large max depth, and very low leaf/split samples, then your resulting model can be huge - and this is where you run into memory problems.
In these cases, I've often found that training smaller models (by controlling these parameters) -- as long as it doesn't kill the performance metrics -- will resolve this problem, and you can then fall back on joblib or the other solutions you mentioned to save/load your model.

Why doc2vec is giving different and un-reliable results?

I have a set of 20 small document which talks about a particular kind of issue (training data). Now i want to identify those docs out of 10K documents, which are talking about the same issue.
For the purpose i am using the doc2vec implementation:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
# Tokenize_and_stem is creating the tokens and stemming and returning the list
# documents_prb store the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)]) for i, _d in enumerate(documents_prb)]
max_epochs = 20
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
alpha=alpha,
min_alpha=0.00025,
min_count=1,
dm =1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
print('iteration {0}'.format(epoch))
model.train(tagged_data,
total_examples=model.corpus_count,
epochs=model.iter)
# decrease the learning rate
model.alpha -= 0.0002
# fix the learning rate, no decay
model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
def doc2vec_score(s):
s_list = tokenize_and_stem(s)
v1 = model.infer_vector(s_list)
similar_doc = model.docvecs.most_similar([v1])
original_match = (X[int(similar_doc[0][0])])
score = similar_doc[0][1]
match = similar_doc[0][0]
return score,match
final_data = []
# df_ws is the list of 10K docs for which i want to find the similarity with above 20 docs
for index, row in df_ws.iterrows():
print(row['processed_description'])
data = (doc2vec_score(row['processed_description']))
L1=list(data)
L1.append(row['Number'])
final_data.append(L1)
with open('file_cosine_d2v.csv','w',newline='') as out:
csv_out=csv.writer(out)
csv_out.writerow(['score','match','INC_NUMBER'])
for row in final_data:
csv_out.writerow(row)
But, I am facing the strange issue, the results are highly un-reliable (Score is 0.9 even if there is not a slightest match) and score is changing with great margin every time. I am running the doc2vec_score function. Can someone please help me what is wrong here ?
First & foremost, try not using the anti-pattern of calling train multiple times in your own loop.
See this answer for more details: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
If there's still a problem after that fix, edit your question to show the corrected code, and a more clear example of the output you consider unreliable.
For example, show the actual doc-IDs & scores, and explain why you think the probe document you're testing should be "not a slightest match" for any documents returned.
And note that if a document is truly nothing like the training documents, for example by using words that weren't in the training documents, it's not really possible for a Doc2Vec model to detect that. When it infers vectors for new documents, all unknown words are ignored. So you'll be left with a document using only known words, and it will return the best matches for that subset of your document's words.
More fundamentally, a Doc2Vec model is really only learning ways to contrast the documents that are in the universe demonstrated by the training set, by their words' cooccurrences. If presented with a document with either totally different words, or words whose frequencies/cooccurrences are totally unlike anything seen before, its output will be essentially random, without much meaningful relationship to other more-typical documents. (That'll be maybe-close, maybe-far, because in a way the training on the 'known universe' tends to fill the whole available space.)
So, you wouldn't want to use a Doc2Vec model trained only only positive examples of what you want to recognize, if you also want to recognize negative examples. Rather, include all kinds, then remember the subset that's relevant for certain in/out decisions – and use that subset for downstream comparisons, or multiple subsets to feed a more-formal classification or clustering algorithm.

what does lightgbm python Dataset reference parameter mean?

I am trying to figure out how to train a gbdt classifier with lightgbm in python, but getting confused with the example provided on the official website.
Following the steps listed, I find that the validation_data comes from nowhere and there is no clue about the format of the valid_data nor the merit or avail of training model with or without it.
Another question comes with it is that, in the documentation, it is said that "the validation data should be aligned with training data", while I look into the Dataset details, I find that there is another statement shows that "If this is Dataset for validation, training data should be used as reference".
My final questions are, why should validation data be aligned with training data? what is the meaning of reference in Dataset and how is it used during training? is the alignment goal accomplished with reference set to training data? what is the difference between this "reference" strategy and cross-validation?
Hope someone could help me out of this maze, thanks!
The idea of "validation data should be aligned with training data" is simple :
every preprocessing you do to the training data, you should do it the same way for validation data and in production of course. This apply to every ML algorithm.
For example, for neural network, you will often normalize your training inputs (substract by mean and divide by std).
Suppose you have a variable "age" with mean 26yo in training. It will be mapped to "0" for the training of your neural network. For validation data, you want to normalize in the same way as training data (using mean of training and std of training) in order that 26yo in validation is still mapped to 0 (same value -> same prediction).
This is the same for LightGBM. The data will be "bucketed" (in short, every continuous value will be discretized) and you want to map the continuous values to the same bins in training and in validation. Those bins will be calculated using the "reference" dataset.
Regarding training without validation, this is something you don't want to do most of the time! It is very easy to overfit the training data with boosted trees if you don't have a validation to adjust parameters such as "num_boost_round".
still everything is tricky
can you share full example with using and without using this "reference="
for example
will it be different
import lightgbm as lgbm
importance_type_LGB = 'gain'
d_train = lgbm.Dataset(train_data_with_NANs, label= target_train)
d_valid = lgbm.Dataset(train_data_with_NANs, reference= target_train)
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
from
import lightgbm as lgbm
importance_type_LGB = 'gain'
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)

Best strategy to reduce false positives: Google's new Object Detection API on Satellite Imagery

I'm setting up the new Tensorflow Object Detection API to find small objects in large areas of satellite imagery. It works quite well - it finds all 10 objects I want, but I also get 50-100 false positives [things that look a little like the target object, but aren't].
I'm using the sample config from the 'pets' tutorial, to fine-tune the faster_rcnn_resnet101_coco model they offer. I've started small, with only 100 training examples of my objects (just 1 class). 50 examples in my validation set. Each example is a 200x200 pixel image with a labeled object (~40x40) in the center. I train until my precision & loss curves plateau.
I'm relatively new to using deep learning for object detection. What is the best strategy to increase my precision? e.g. Hard-negative mining? Increase my training dataset size? I've yet to try the most accurate model they offer faster_rcnn_inception_resnet_v2_atrous_coco as i'd like to maintain some speed, but will do so if needed.
Hard-negative mining seems to be a logical step. If you agree, how do I implement it w.r.t setting up the tfrecord file for my training dataset? Let's say I make 200x200 images for each of the 50-100 false positives:
Do I create 'annotation' xml files for each, with no 'object' element?
...or do I label these hard negatives as a second class?
If I then have 100 negatives to 100 positives in my training set - is that a healthy ratio? How many negatives can I include?
I've revisited this topic recently in my work and thought I'd update with my current learnings for any who visit in the future.
The topic appeared on Tensorflow's Models repo issue tracker. SSD allows you to set the ratio of how many negative:postive examples to mine (max_negatives_per_positive: 3), but you can also set a minimum number for images with no postives (min_negatives_per_image: 3). Both of these are defined in the model-ssd-loss config section.
That said, I don't see the same option in Faster-RCNN's model configuration. It's mentioned in the issue that models/research/object_detection/core/balanced_positive_negative_sampler.py contains the code used for Faster-RCNN.
One other option discussed in the issue is creating a second class specifically for lookalikes. During training, the model will attempt to learn class differences which should help serve your purpose.
Lastly, I came across this article on Filter Amplifier Networks (FAN) that may be informative for your work on aerial imagery.
===================================================================
The following paper describes hard negative mining for the same purpose you describe:
Training Region-based Object Detectors with Online Hard Example Mining
In section 3.1 they describe using a foreground and background class:
Background RoIs. A region is labeled background (bg) if its maximum
IoU with ground truth is in the interval [bg lo, 0.5). A lower
threshold of bg lo = 0.1 is used by both FRCN and SPPnet, and is
hypothesized in [14] to crudely approximate hard negative mining; the
assumption is that regions with some overlap with the ground truth are
more likely to be the confusing or hard ones. We show in Section 5.4
that although this heuristic helps convergence and detection accuracy,
it is suboptimal because it ignores some infrequent, but important,
difficult background regions. Our method removes the bg lo threshold.
In fact this paper is referenced and its ideas are used in Tensorflow's object detection losses.py code for hard mining:
class HardExampleMiner(object):
"""Hard example mining for regions in a list of images.
Implements hard example mining to select a subset of regions to be
back-propagated. For each image, selects the regions with highest losses,
subject to the condition that a newly selected region cannot have
an IOU > iou_threshold with any of the previously selected regions.
This can be achieved by re-using a greedy non-maximum suppression algorithm.
A constraint on the number of negatives mined per positive region can also be
enforced.
Reference papers: "Training Region-based Object Detectors with Online
Hard Example Mining" (CVPR 2016) by Srivastava et al., and
"SSD: Single Shot MultiBox Detector" (ECCV 2016) by Liu et al.
"""
Based on your model config file, the HardMinerObject is returned by losses_builder.py in this bit of code:
def build_hard_example_miner(config,
classification_weight,
localization_weight):
"""Builds hard example miner based on the config.
Args:
config: A losses_pb2.HardExampleMiner object.
classification_weight: Classification loss weight.
localization_weight: Localization loss weight.
Returns:
Hard example miner.
"""
loss_type = None
if config.loss_type == losses_pb2.HardExampleMiner.BOTH:
loss_type = 'both'
if config.loss_type == losses_pb2.HardExampleMiner.CLASSIFICATION:
loss_type = 'cls'
if config.loss_type == losses_pb2.HardExampleMiner.LOCALIZATION:
loss_type = 'loc'
max_negatives_per_positive = None
num_hard_examples = None
if config.max_negatives_per_positive > 0:
max_negatives_per_positive = config.max_negatives_per_positive
if config.num_hard_examples > 0:
num_hard_examples = config.num_hard_examples
hard_example_miner = losses.HardExampleMiner(
num_hard_examples=num_hard_examples,
iou_threshold=config.iou_threshold,
loss_type=loss_type,
cls_loss_weight=classification_weight,
loc_loss_weight=localization_weight,
max_negatives_per_positive=max_negatives_per_positive,
min_negatives_per_image=config.min_negatives_per_image)
return hard_example_miner
which is returned by model_builder.py and called by train.py. So basically, it seems to me that simply generating your true positive labels (with a tool like LabelImg or RectLabel) should be enough for the train algorithm to find hard negatives within the same images. The related question gives an excellent walkthrough.
In the event you want to feed in data that has no true positives (i.e. nothing should be classified in the image), just add the negative image to your tfrecord with no bounding boxes.
I think I was passing through the same or close scenario and it's worth it to share with you.
I managed to solve it by passing images without annotations to the trainer.
On my scenario I'm building a project to detect assembly failures from my client's products, at real time.
I successfully achieved very robust results (for production env) by using detection+classification for components that has explicity a negative pattern (e.g. a screw that has screw on/off(just the hole)) and only detection for things that doesn't has the negative pattens (e.g. a tape that can be placed anywhere).
On the system it's mandatory that the user record 2 videos, one containing the positive scenario and another containing the negative (or the n videos, containing n patterns of positive and negative so the algorithm can generalize).
After a while testing I found out that if I register to detected only tape the detector was giving very confident (0.999) false positive detections of tape. It was learning the pattern where the tape was inserted instead of the tape itself. When I had another component (like a screw on it's negative format) I was passing the negative pattern of tape without being explicitly aware of it, so the FPs didn't happen.
So I found out that, in this scenario, I had to necessarily pass the images without tape so it could differentiate between tape and no-tape.
I considered two alternatives to experiment and try to solve this behavior:
Train passing an considerable amount of images that doesn't has any annotation (10% of all my negative samples) along with all images that I have real annotations.
On the images that I don't have annotation I create a dummy annotation with a dummy label so I could force the detector to train with that image (thus learning the no-tape pattern). Later on, when get the dummy predictions, just ignore them.
Concluded that both alternatives worked perfectly on my scenario.
The training loss got a little messy but the predictions work with robustness for my very controlled scenario (the system's camera has its own box and illumination to decrease variables).
I had to make two little modifications for the first alternative to work:
All images that didn't had any annotation I passed a dummy annotation (class=None, xmin/ymin/xmax/ymax=-1)
When generating the tfrecord files I use this information (xmin == -1, in this case) to add an empty list for the sample:
def create_tf_example(group, path, label_map):
with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
encoded_jpg = fid.read()
encoded_jpg_io = io.BytesIO(encoded_jpg)
image = Image.open(encoded_jpg_io)
width, height = image.size
filename = group.filename.encode('utf8')
image_format = b'jpg'
xmins = []
xmaxs = []
ymins = []
ymaxs = []
classes_text = []
classes = []
for index, row in group.object.iterrows():
if not pd.isnull(row.xmin):
if not row.xmin == -1:
xmins.append(row['xmin'] / width)
xmaxs.append(row['xmax'] / width)
ymins.append(row['ymin'] / height)
ymaxs.append(row['ymax'] / height)
classes_text.append(row['class'].encode('utf8'))
classes.append(label_map[row['class']])
tf_example = tf.train.Example(features=tf.train.Features(feature={
'image/height': dataset_util.int64_feature(height),
'image/width': dataset_util.int64_feature(width),
'image/filename': dataset_util.bytes_feature(filename),
'image/source_id': dataset_util.bytes_feature(filename),
'image/encoded': dataset_util.bytes_feature(encoded_jpg),
'image/format': dataset_util.bytes_feature(image_format),
'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
'image/object/class/label': dataset_util.int64_list_feature(classes),
}))
return tf_example
Part of the traning progress:
Currently I'm using tensorflow object detection along with tensorflow==1.15, using faster_rcnn_resnet101_coco.config.
Hope it will solve someone's problem as I didn't found any solution on the internet. I read a lot of people telling that faster_rcnn is not adapted for negative training for FPs reduction but my tests proved the opposite.

what's meaning of function predict's returned value in OpenCV?

I use function predict in opencv to classify my gestures.
svm.load("train.xml");
float ret = svm.predict(mat);//mat is my feature vector
I defined 5 labels (1.0,2.0,3.0,4.0,5.0), but in fact the value of ret are (0.521220207,-0.247173533,-0.127723947······)
So I am confused about it. As Opencv official document, the function returns a class label (classification) in my case.
update: I don't still know why to appear this result. But I choose new features to train models and the return value of predict function is what I defined during train phase (e.g. 1 or 2 or 3 or etc).
During the training of an SVM you assign a label to each class of training data.
When you classify a sample the returned result will match up with one of these labels telling you which class the sample is predicted to fall into.
There's some more documentation here which might help:
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
With Support Vector Machines (SVM) you have a training function and a prediction one. The training function is to train your data and save those informations on an xml file (it facilitates the prediction process in case you use a huge number of training data and you must do the prediction function in another project).
Example : 20 images per class in your case : 20*5=100 training images,each image is associated with a label of its appropriate class and all these informations are stocked in train.xml)
For the prediction function , it tells you what's label to assign to your test image according to your training DATA (the hole work you did in training process). Your prediction results might be good and might be bad , it's all about your training data I think.
If you want try to calculate the error rate for your classifier to see how much it can give good results or bad ones.

Resources