I want to ask a question about action detection in video with proposed frames. I've used a Temporal 3D ConvNet for action recognition on video, trained it successfully, and can recognize actions in videos.
When I do inference, I just collect 20 frames from the video, feed them to the model, and it gives me the result. The point is that events in different videos are not of similar size: some cover 90% of the frame, but some may cover only 10%. Take as an example two objects colliding; this can happen at different scales, and I want to detect this action.
How can I provide the model with the exact position for action recognition if the action can happen at different scales with different objects? What comes to mind is to use YOLO to collect regions of interest and feed the collected frames to the 3D ConvNet each time. But if there are a lot of objects, this will be very slow. How can I handle that?
Are there any end-to-end solutions for action recognition with object location proposals feeding the action recognition network?
I've already looked at papers and blogs for what people suggest, but couldn't find a solution for the localization issue, i.e. making sure the action recognition model gets the correct frames.
So, to summarize, the idea is to get an object that can potentially collide, at any scale, and then feed, for example, 20 frames of it to the 3D ConvNet to make the judgment.
Any advice? Maybe someone can explain an approach to me?
This is my CNN+LSTM model, which I'm currently trying to improve:
from keras.applications import VGG16
from keras.layers import (Input, Dense, Dropout, GlobalAveragePooling2D,
                          LSTM, TimeDistributed)
from keras.models import Model
from keras.regularizers import l2

# Variable-length sequence of 224x224 RGB frames
video = Input(shape=(None, 224, 224, 3))

# Frozen VGG16 backbone used as a per-frame feature extractor
cnn_base = VGG16(input_shape=(224, 224, 3), weights="imagenet", include_top=False)
cnn_out = GlobalAveragePooling2D()(cnn_base.output)
cnn = Model(inputs=cnn_base.input, outputs=cnn_out)
cnn.trainable = False

# Apply the CNN to every frame, then aggregate over time with an LSTM
encoded_frames = TimeDistributed(cnn)(video)
encoded_sequence = LSTM(32, dropout=0.5, kernel_regularizer=l2(0.01),
                        recurrent_dropout=0.5)(encoded_frames)
hidden_layer = Dense(units=64, activation="relu")(encoded_sequence)
dropout = Dropout(0.2)(hidden_layer)
outputs = Dense(5, activation="softmax")(dropout)
model = Model([video], outputs)
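To illustrate the idea, here is a rough sketch of the inference pipeline I have in mind: crop a detector's box out of each of the 20 sampled frames, resize, and classify the stacked clip. detect_main_box is just a hypothetical placeholder for a YOLO-style detector returning a single box per frame.
import cv2
import numpy as np

def classify_roi_clip(frames, model, detect_main_box, size=(224, 224)):
    # frames: list of 20 BGR images from the video
    crops = []
    for frame in frames:
        x1, y1, x2, y2 = detect_main_box(frame)   # hypothetical detector call
        roi = frame[y1:y2, x1:x2]
        crops.append(cv2.resize(roi, size).astype("float32"))
    clip = np.expand_dims(np.stack(crops), axis=0)  # shape (1, 20, 224, 224, 3)
    # apply the same preprocessing used during training before predicting
    return model.predict(clip)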
I have EEG data, for which I want to calculate the peaks' amplitudes and latencies. I'm working with MNE and found the get_peak method of the Evoked object. However, I want to find peaks on epochs data (not averaged). How can I do that? I didn't see a similar function for the Epochs object. I would prefer to do it through MNE, but other Python libraries could also work. It's important that there is an option to get the amplitude and latency of the peaks, and to choose a time window for detection.
In addition, I didn't understand whether get_peak returns only the highest peak or something else, if there is more than one peak.
Thanks!
You can create an Evoked data structure from a single trial by just selecting a single trial from your Epochs structure and then applying .average(), e.g., as follows:
tmp_evoked = all_epochs[subj][cond][trial].average()
The above assumes you have an all_epochs object organized as trials within conditions within subjects (i.e., you're working at the group level). If your Epochs object has only one subject, then it would just be:
tmp_evoked = all_epochs[cond][trial].average()
You can further refine this to find the peak at only one channel as:
tmp_evoked = all_epochs[subj][cond][trial].pick(chan).average()
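From there, get_peak can give you both the latency and the amplitude, restricted to a time window of interest. A minimal sketch (the return_amplitude argument requires a reasonably recent MNE version):
# Peak channel, latency and amplitude within an 80-200 ms window
tmp_evoked = all_epochs[subj][cond][trial].pick(chan).average()
ch_name, latency, amplitude = tmp_evoked.get_peak(
    tmin=0.08, tmax=0.2,      # search window in seconds
    mode='pos',               # 'pos', 'neg' or 'abs'
    return_amplitude=True)
Note that get_peak only returns the single largest peak in the window; if you need multiple peaks per epoch, you would have to run something like scipy.signal.find_peaks on the underlying data array yourself.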
I have a set of 20 small documents which talk about a particular kind of issue (training data). Now I want to identify, out of 10K documents, those which talk about the same issue.
For this purpose I am using the doc2vec implementation:
import csv

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# tokenize_and_stem creates the tokens, stems them and returns the list
# documents_prb stores the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(documents_prb)]

max_epochs = 20
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")

# Find the best match for a document that is not in the training data
def doc2vec_score(s):
    s_list = tokenize_and_stem(s)
    v1 = model.infer_vector(s_list)
    similar_doc = model.docvecs.most_similar([v1])
    original_match = (X[int(similar_doc[0][0])])  # X: the original training docs
    score = similar_doc[0][1]
    match = similar_doc[0][0]
    return score, match

final_data = []
# df_ws is the list of 10K docs for which I want to find the similarity with the above 20 docs
for index, row in df_ws.iterrows():
    print(row['processed_description'])
    data = (doc2vec_score(row['processed_description']))
    L1 = list(data)
    L1.append(row['Number'])
    final_data.append(L1)

with open('file_cosine_d2v.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['score', 'match', 'INC_NUMBER'])
    for row in final_data:
        csv_out.writerow(row)
But I am facing a strange issue: the results are highly unreliable (the score is 0.9 even when there is not the slightest match), and the score changes by a large margin every time I run the doc2vec_score function. Can someone please help me with what is wrong here?
First & foremost, try not using the anti-pattern of calling train multiple times in your own loop.
See this answer for more details: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
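In short, call build_vocab() and train() once each, and let the model manage alpha and epochs itself. A minimal sketch with current gensim parameter names (older versions use size/iter):
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=20,   # 'size' in older gensim versions
                min_count=1,
                dm=1,
                epochs=20)        # 'iter' in older gensim versions
model.build_vocab(tagged_data)
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)  # single call; alpha decays automatically
model.save("d2v.model")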
If there's still a problem after that fix, edit your question to show the corrected code, and a more clear example of the output you consider unreliable.
For example, show the actual doc-IDs & scores, and explain why you think the probe document you're testing should be "not a slightest match" for any documents returned.
And note that if a document is truly nothing like the training documents, for example by using words that weren't in the training documents, it's not really possible for a Doc2Vec model to detect that. When it infers vectors for new documents, all unknown words are ignored. So you'll be left with a document using only known words, and it will return the best matches for that subset of your document's words.
More fundamentally, a Doc2Vec model is really only learning ways to contrast the documents that are in the universe demonstrated by the training set, by their words' cooccurrences. If presented with a document with either totally different words, or words whose frequencies/cooccurrences are totally unlike anything seen before, its output will be essentially random, without much meaningful relationship to other more-typical documents. (That'll be maybe-close, maybe-far, because in a way the training on the 'known universe' tends to fill the whole available space.)
So, you wouldn't want to use a Doc2Vec model trained only on positive examples of what you want to recognize, if you also want to recognize negative examples. Rather, include all kinds, then remember the subset that's relevant for certain in/out decisions – and use that subset for downstream comparisons, or multiple subsets to feed a more formal classification or clustering algorithm.
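For example, after training on a broad mix of documents, you could score new documents only against the vectors of your 20 known-relevant ones. A rough sketch, assuming those docs kept the tags '0'..'19' in the larger training set:
import numpy as np

relevant_tags = [str(i) for i in range(20)]          # assumed tags of the 20 relevant docs
relevant_vecs = np.array([model.docvecs[t] for t in relevant_tags])

def relevance_score(text):
    v = model.infer_vector(tokenize_and_stem(text))
    # cosine similarity against each known-relevant doc vector
    sims = relevant_vecs @ v / (np.linalg.norm(relevant_vecs, axis=1) * np.linalg.norm(v))
    best = int(np.argmax(sims))
    return float(sims[best]), relevant_tags[best]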
I've recently started with deep learning and CNNs, which, as is often quoted, attempt to extract the optimal features from the samples on their own.
I made a model to recognize characters, where the training set had images with a black background and the script in white.
[Image sample]
This type of model, though, fails to recognize images with the pattern in black on a white background (I tried with my own input and also the negative of the previous set). [Negative of image sample]
Is it possible to recognize both types of images using the same model or do I need to train two separate models?
I don't know if it's possible using the ImageDataGenerator class.
Following is the current code snippet:
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
Well, this is a situation ("negative images") where, as was revealed relatively recently, the results are not what we might expect them to be...
There is an unpublished paper at arXiv which shows exactly that: CNN models that have achieved almost perfect test accuracy on datasets like MNIST & CIFAR-10 fail to give similar performance on the respective "negative" images (i.e. with inverted background & foreground, like your case here):
On the Limitation of Convolutional Neural Networks in Recognizing Negative Images
Here is the main result of the paper (figure omitted): accuracy drops drastically on the negative test images.
The issue is rather non-trivial, and there has been strong disagreement in the community as to whether this result is indeed expected & unsurprising or not; see the (now archived) relevant discussion at Reddit, as well as a relevant piece at KDnuggets.
All in all, as the paper also suggests, you can do it with one model, but you will need to include at least some such "negative" images in your training. See also the SO thread High training accuracy but low prediction performance for Tensorflow's official MNIST model.
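If you don't want to collect separate negative images, one practical way to include them is ImageDataGenerator's preprocessing_function, e.g. randomly inverting part of the training images. A rough sketch (the function runs before rescale, so pixel values are still in the 0-255 range at that point):
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

def random_invert(img):
    # invert roughly half of the training images (pixel values still 0-255 here)
    return 255.0 - img if np.random.rand() < 0.5 else img

train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   preprocessing_function=random_invert)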
I'm setting up the new Tensorflow Object Detection API to find small objects in large areas of satellite imagery. It works quite well - it finds all 10 objects I want, but I also get 50-100 false positives [things that look a little like the target object, but aren't].
I'm using the sample config from the 'pets' tutorial, to fine-tune the faster_rcnn_resnet101_coco model they offer. I've started small, with only 100 training examples of my objects (just 1 class). 50 examples in my validation set. Each example is a 200x200 pixel image with a labeled object (~40x40) in the center. I train until my precision & loss curves plateau.
I'm relatively new to using deep learning for object detection. What is the best strategy to increase my precision? E.g. hard-negative mining? Increasing my training dataset size? I've yet to try the most accurate model they offer, faster_rcnn_inception_resnet_v2_atrous_coco, as I'd like to maintain some speed, but will do so if needed.
Hard-negative mining seems to be a logical step. If you agree, how do I implement it with respect to setting up the tfrecord file for my training dataset? Let's say I make 200x200 images for each of the 50-100 false positives:
Do I create 'annotation' xml files for each, with no 'object' element?
...or do I label these hard negatives as a second class?
If I then have 100 negatives to 100 positives in my training set - is that a healthy ratio? How many negatives can I include?
I've revisited this topic recently in my work and thought I'd update with my current learnings for any who visit in the future.
The topic appeared on Tensorflow's Models repo issue tracker. SSD allows you to set the ratio of negative:positive examples to mine (max_negatives_per_positive: 3), and you can also set a minimum number for images with no positives (min_negatives_per_image: 3). Both of these are defined in the model > ssd > loss section of the config (see the sketch below).
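For reference, such a block in an SSD pipeline config looks roughly like this (values illustrative):
loss {
  classification_weight: 1.0
  localization_weight: 1.0
  hard_example_miner {
    num_hard_examples: 3000
    iou_threshold: 0.99
    loss_type: CLASSIFICATION
    max_negatives_per_positive: 3
    min_negatives_per_image: 3
  }
}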
That said, I don't see the same option in Faster-RCNN's model configuration. It's mentioned in the issue that models/research/object_detection/core/balanced_positive_negative_sampler.py contains the code used for Faster-RCNN.
One other option discussed in the issue is creating a second class specifically for lookalikes. During training, the model will attempt to learn class differences which should help serve your purpose.
Lastly, I came across this article on Filter Amplifier Networks (FAN) that may be informative for your work on aerial imagery.
===================================================================
The following paper describes hard negative mining for the same purpose you describe:
Training Region-based Object Detectors with Online Hard Example Mining
In section 3.1 they describe using a foreground and background class:
Background RoIs. A region is labeled background (bg) if its maximum IoU with ground truth is in the interval [bg_lo, 0.5). A lower threshold of bg_lo = 0.1 is used by both FRCN and SPPnet, and is hypothesized in [14] to crudely approximate hard negative mining; the assumption is that regions with some overlap with the ground truth are more likely to be the confusing or hard ones. We show in Section 5.4 that although this heuristic helps convergence and detection accuracy, it is suboptimal because it ignores some infrequent, but important, difficult background regions. Our method removes the bg_lo threshold.
In fact this paper is referenced and its ideas are used in Tensorflow's object detection losses.py code for hard mining:
class HardExampleMiner(object):
  """Hard example mining for regions in a list of images.

  Implements hard example mining to select a subset of regions to be
  back-propagated. For each image, selects the regions with highest losses,
  subject to the condition that a newly selected region cannot have
  an IOU > iou_threshold with any of the previously selected regions.
  This can be achieved by re-using a greedy non-maximum suppression algorithm.
  A constraint on the number of negatives mined per positive region can also be
  enforced.

  Reference papers: "Training Region-based Object Detectors with Online
  Hard Example Mining" (CVPR 2016) by Srivastava et al., and
  "SSD: Single Shot MultiBox Detector" (ECCV 2016) by Liu et al.
  """
Based on your model config file, the HardExampleMiner object is returned by losses_builder.py in this bit of code:
def build_hard_example_miner(config,
                             classification_weight,
                             localization_weight):
  """Builds hard example miner based on the config.

  Args:
    config: A losses_pb2.HardExampleMiner object.
    classification_weight: Classification loss weight.
    localization_weight: Localization loss weight.

  Returns:
    Hard example miner.
  """
  loss_type = None
  if config.loss_type == losses_pb2.HardExampleMiner.BOTH:
    loss_type = 'both'
  if config.loss_type == losses_pb2.HardExampleMiner.CLASSIFICATION:
    loss_type = 'cls'
  if config.loss_type == losses_pb2.HardExampleMiner.LOCALIZATION:
    loss_type = 'loc'

  max_negatives_per_positive = None
  num_hard_examples = None
  if config.max_negatives_per_positive > 0:
    max_negatives_per_positive = config.max_negatives_per_positive
  if config.num_hard_examples > 0:
    num_hard_examples = config.num_hard_examples
  hard_example_miner = losses.HardExampleMiner(
      num_hard_examples=num_hard_examples,
      iou_threshold=config.iou_threshold,
      loss_type=loss_type,
      cls_loss_weight=classification_weight,
      loc_loss_weight=localization_weight,
      max_negatives_per_positive=max_negatives_per_positive,
      min_negatives_per_image=config.min_negatives_per_image)
  return hard_example_miner
which is returned by model_builder.py and called by train.py. So basically, it seems to me that simply generating your true positive labels (with a tool like LabelImg or RectLabel) should be enough for the training algorithm to find hard negatives within the same images. The related question gives an excellent walkthrough.
In the event you want to feed in data that has no true positives (i.e. nothing should be classified in the image), just add the negative image to your tfrecord with no bounding boxes, as sketched below.
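Concretely, such an entry just carries empty box and class lists. A rough sketch using the object_detection dataset_util helpers:
import tensorflow as tf
from object_detection.utils import dataset_util

def negative_tf_example(encoded_jpg, filename, height, width):
    # An image with no objects: all box/class lists stay empty.
    return tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(b'jpg'),
        'image/object/bbox/xmin': dataset_util.float_list_feature([]),
        'image/object/bbox/xmax': dataset_util.float_list_feature([]),
        'image/object/bbox/ymin': dataset_util.float_list_feature([]),
        'image/object/bbox/ymax': dataset_util.float_list_feature([]),
        'image/object/class/text': dataset_util.bytes_list_feature([]),
        'image/object/class/label': dataset_util.int64_list_feature([]),
    }))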
I think I was going through the same or a very similar scenario, so it's worth sharing with you.
I managed to solve it by passing images without annotations to the trainer.
In my scenario I'm building a project to detect assembly failures in my client's products, in real time.
I achieved very robust results (for a production environment) by using detection+classification for components that explicitly have a negative pattern (e.g. a screw that is either screwed on or off (just the hole)), and detection only for things that don't have negative patterns (e.g. a piece of tape that can be placed anywhere).
In the system it's mandatory that the user records 2 videos, one containing the positive scenario and another containing the negative one (or n videos, containing n patterns of positive and negative, so the algorithm can generalize).
After a while of testing, I found out that if I registered the detector to detect only tape, it was giving very confident (0.999) false positive detections of tape. It was learning the pattern where the tape was inserted instead of the tape itself. When I had another component (like a screw in its negative form), I was passing the negative pattern of the tape without being explicitly aware of it, so the FPs didn't happen.
So I found out that, in this scenario, I necessarily had to pass images without tape so the model could differentiate between tape and no tape.
I considered two alternatives to experiment with and try to solve this behavior:
Train passing a considerable number of images that don't have any annotation (10% of all my negative samples) along with all the images for which I have real annotations.
On the images for which I don't have annotations, create a dummy annotation with a dummy label so I can force the detector to train on that image (thus learning the no-tape pattern). Later on, when getting the dummy predictions, just ignore them.
I concluded that both alternatives worked perfectly in my scenario.
The training loss got a little messy, but the predictions work robustly for my very controlled scenario (the system's camera has its own box and illumination to reduce variables).
I had to make two little modifications for the first alternative to work:
To all images that didn't have any annotation, I passed a dummy annotation (class=None, xmin/ymin/xmax/ymax=-1).
When generating the tfrecord files, I use this information (xmin == -1, in this case) to add an empty list for the sample:
import io
import os

import pandas as pd
import tensorflow as tf
from PIL import Image
from object_detection.utils import dataset_util


def create_tf_example(group, path, label_map):
    # Read and decode the image to get its dimensions
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'

    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    # Keep the lists empty for dummy annotations (xmin == -1),
    # so negative images end up with no boxes in the tfrecord.
    for index, row in group.object.iterrows():
        if not pd.isnull(row.xmin):
            if not row.xmin == -1:
                xmins.append(row['xmin'] / width)
                xmaxs.append(row['xmax'] / width)
                ymins.append(row['ymin'] / height)
                ymaxs.append(row['ymax'] / height)
                classes_text.append(row['class'].encode('utf8'))
                classes.append(label_map[row['class']])

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example
Part of the training progress: [screenshot omitted]
Currently I'm using the TensorFlow Object Detection API with tensorflow==1.15 and faster_rcnn_resnet101_coco.config.
I hope this solves someone's problem, as I didn't find any solution on the internet. I read a lot of people saying that faster_rcnn is not suited to negative training for FP reduction, but my tests proved the opposite.
I am creating an AR application that tracks features, calculates a homography, then obtains the object's pose from 3D-2D point correspondences and uses that to render any 3D object.
I am selecting a specific area for detecting features in my source image (by masking), and then matching it with features detected in subsequent frames. Then I filter those matches and estimate the homography of the unmasked region.
The problem lies in the homography estimation. It differs every time (very slightly, but nonetheless, it differs). The effect is that even when keeping my camera still, I get a vibrating rectangle around my tracked region, which I draw using the estimated homography.
I have already posted a question titled Unstable homography estimation using ORB and got reassurance about an approach I was considering (not recalculating my homography if the position of the region is similar to its last position).
However, I recently came to know of the Kalman filter, and that it gives a better estimate of the position by combining our prior knowledge with our measurement observation.
So, after looking at various examples (one in particular, http://www.youtube.com/watch?v=GBYW1j9lC1I), I modeled a Kalman filter (rather 4, one for every corner point of the rectangular region) for my scenario:
m_KF1.init(4, 2, 1);
setIdentity(m_KF1.transitionMatrix);
m_measurement1 = Mat::zeros(2, 1, cv::DataType<float>::type);
m_KF1.statePre.setTo(0);
m_KF1.controlMatrix.setTo(0);

// initializing the filter with the first reading
m_KF1.statePre.at<float>(0) = m_scene_corners[1].x;
m_KF1.statePre.at<float>(1) = m_scene_corners[1].y;
m_KF1.statePre.at<float>(2) = 0;
m_KF1.statePre.at<float>(3) = 0;
setIdentity(m_KF1.measurementMatrix);
setIdentity(m_KF1.processNoiseCov, Scalar::all(.1));     // updated at every step
setIdentity(m_KF1.measurementNoiseCov, Scalar::all(4));  // assuming measurement error of
                                                         // not more than 2 pixels
setIdentity(m_KF1.errorCovPost, Scalar::all(.1));
4 state variables (position in x, y and velocity in x,y).
2 measurement variables (position in x,y)
1 control variable (acceleration)
The following steps are taken at every iteration:
//---First, the prediction phase, to update the internal variables-------//
// 'dt' is the time taken between the measurements
//Updating the transitionMatrix
m_KF1.transitionMatrix.at<float>(0,2) = dt;
m_KF1.transitionMatrix.at<float>(1,3) = dt;
//Updating the Control matrix
m_KF1.controlMatrix.at<float>(0,1) = (dt*dt)/2;
m_KF1.controlMatrix.at<float>(1,1) = (dt*dt)/2;
m_KF1.controlMatrix.at<float>(2,1) = dt;
m_KF1.controlMatrix.at<float>(3,1) = dt;
//Updating the processNoiseCovmatrix
m_KF1.processNoiseCov.at<float>(0,0) = (dt*dt*dt*dt)/4;
m_KF1.processNoiseCov.at<float>(0,2) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(1,1) = (dt*dt*dt*dt)/4;
m_KF1.processNoiseCov.at<float>(1,3) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(2,0) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(2,2) = dt*dt;
m_KF1.processNoiseCov.at<float>(3,1) = (dt*dt*dt)/2;
m_KF1.processNoiseCov.at<float>(3,3) = dt*dt;
Mat prediction1 = m_KF1.predict();
Point2f predictPt1(prediction1.at<float>(0),prediction1.at<float>(1));
// Get the measured corner
m_measurement1.at<float>(0,0) = scene_corners[0].x;
m_measurement1.at<float>(0,1) = scene_corners[0].y;
//----Then, the correction phase which uses the predicted value and our measured value
Mat estimated = m_KF1.correct(m_measurement1);
Point2f statePt1(estimated.at<float>(0),estimated.at<float>(1));
This model hardly corrects my measured value.
Now my questions are:
Is a Kalman filter suited to my scenario? Will it give me any better results?
If it is, then what's missing? Am I modelling it right? Instead of creating 4 filters for the four corner points of the rectangle, should I model it in some other manner (for instance, take the 10 strongest matches based on distance and use those as input to the filter)?
If a Kalman filter isn't suited, what else can I do to provide more stability to the estimated homography?
Any help would be highly appreciated.
Thanks.
This question is badly titled; after reading your explanation, what you really ask is: "Why does my OpenCV Kalman filter still leave lots of noise?"
Anyway, your answers are:
Yes, a Kalman filter works for your scenario.
You are using it wrong.
Modify KF.processNoiseCov. You can get the code from here: Opencv kalman filter prediction without new observtion; there is a nice line explaining it.
See:
setIdentity(KF.processNoiseCov, Scalar::all(.005)); //adjust this for faster convergence - but higher noise
From what I see, you have a very basic understanding of it, so you can go with a naive approach and use four 2D Kalman filters; for this you can use the code linked here. It will work, and from there you can grow and adapt it until you get a better understanding.
After that, you could model it more closely to your problem, or you can keep using four filters; there is no "perfect" implementation, so if that works for you, just go for it.
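For reference, a minimal sketch of one such per-corner constant-velocity filter, using OpenCV's Python API (the same structure applies in C++); the noise magnitudes follow the values discussed above and should be tuned for your setup:
import cv2
import numpy as np

def make_corner_filter(x0, y0, dt=1.0):
    kf = cv2.KalmanFilter(4, 2)                     # state: x, y, vx, vy; measurement: x, y
    kf.transitionMatrix = np.array([[1, 0, dt, 0],
                                    [0, 1, 0, dt],
                                    [0, 0, 1,  0],
                                    [0, 0, 0,  1]], np.float32)
    kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 0.005    # small => smoother, slower to react
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 4.0  # ~2 px measurement std dev
    kf.statePost = np.array([[x0], [y0], [0], [0]], np.float32)
    return kf

# per frame, for each corner:
#   prediction = kf.predict()
#   estimate   = kf.correct(np.array([[mx], [my]], np.float32))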