I am working on creating a video from a transcript, where the idea is to choose a series of images based on the meaning of the text. I need to create a model that can pick out images based on the text, but I am struggling with how to choose images in a meaningful way and also with how to format a dataset of images and text so that it can be used to train a model. Has anyone done anything similar to this?
Related
I have searched a lot to find out how we can test whether an object exists in an image. I am looking for the name of the scientific technique/technology that provides this. As an example I can mention Instagram, where you upload an image and Instagram writes: "This image may contain sea, people, car." Is this content-based image retrieval? Do I need local feature extraction for it? Are these systems based on deep learning, or do they work with something like SIFT?
Everything I studied was only able to take a query image and search a database to say which images are "similar" to it, not what an image contains.
Yes, this uses deep learning: a model is trained to recognize a number of objects in an image, using either a bounding-box approach or multi-label classification. When a new image is passed to the model, it predicts the labels of all the objects present in that image.
This is known as object detection.
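As a rough illustration of the multi-label variant mentioned above, here is a minimal sketch in Keras. The tag names, input size, network layers, and 0.5 threshold are all placeholders for illustration, not anything a real service like Instagram necessarily uses.

```python
# Minimal multi-label tagging sketch (Keras); tag names, shapes, and threshold are illustrative.
import numpy as np
from tensorflow.keras import layers, models

TAGS = ["sea", "people", "car"]

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    # Sigmoid (not softmax) so every tag is scored independently.
    layers.Dense(len(TAGS), activation="sigmoid"),
])
# Binary cross-entropy treats each tag as its own yes/no decision.
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, any tag whose score clears a threshold is reported for the image.
image = np.random.rand(1, 224, 224, 3)  # stand-in for a real preprocessed photo
scores = model.predict(image)[0]
print([t for t, s in zip(TAGS, scores) if s > 0.5])
```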
I have one image stored in my bundle or in the application.
Now I want to scan images with the camera and compare them with my locally stored image. When an image is matched I want to play a video, and if the user moves the camera away from that particular image I want to stop the video.
For that I have tried the Wikitude SDK for iOS, but it is not working properly; it keeps crashing because of memory issues or other reasons.
Other options that came to mind are Core ML and ARKit, but Core ML detects an image's properties (name, type, colors, etc.) whereas I want to match the image itself. ARKit does not support all devices and iOS versions, and I have no idea whether image matching as per my requirement is even possible with it.
If anybody has any idea how to achieve this, please share. Every bit of help will be appreciated. Thanks :)
The easiest way is ARKit's image detection. You know the limitations on the devices it supports, but the results it gives are good and it is really easy to implement. Here is an example.
Next is Core ML, which is the hardest way. You need to understand machine learning, at least briefly. Then comes the tough part: training with your dataset. The biggest drawback is that you have a single image, so I would discard this method.
Finally, the middle-ground solution is to use OpenCV. It might be hard, but it suits your need. You can use various feature-matching methods to find your image in the camera feed (example here). You can use Objective-C++ to write the C++ code for iOS.
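To make the feature-matching idea concrete, here is a rough sketch in Python (on iOS the same OpenCV calls would be made from Objective-C++). The file names and the distance/match thresholds are placeholders you would tune for your own images.

```python
# Rough OpenCV feature-matching sketch; file names and thresholds are placeholders.
import cv2

reference = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)   # the locally stored image
frame = cv2.imread("camera_frame.jpg", cv2.IMREAD_GRAYSCALE)    # one frame from the camera feed

orb = cv2.ORB_create()
kp_ref, des_ref = orb.detectAndCompute(reference, None)
kp_frame, des_frame = orb.detectAndCompute(frame, None)

# Hamming-distance matcher for ORB's binary descriptors, with cross-checking.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_ref, des_frame)

# Keep only reasonably close matches; if enough survive, treat it as "image found"
# (start the video), otherwise treat it as "not found" (stop the video).
good = [m for m in matches if m.distance < 50]
print("match" if len(good) > 25 else "no match")
```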
Your task is image similarity, and you can do it simply and with more reliable results using machine learning. Since your task involves camera scanning, the better option is Core ML. You can refer to this link by Apple for image similarity. You can improve your results by training with your own datasets. If you need any more clarification, comment.
Another approach is to use a so-called "siamese network", which really means that you run a model such as Inception-v3 or MobileNet on both images and compare their outputs.
However, these models usually give a classification output, e.g. "this is a cat". If you remove that classification layer from the model, the output is just a bunch of numbers that describe what sort of things are in the image, but in a very abstract sense.
If these numbers for two images are very similar -- if the "distance" between them is very small -- then the two images are very similar too.
So you can take an existing Core ML model, remove the classification layer, run it twice (once on each image), which gives you two sets of numbers, and then compute the distance between these numbers. If this distance is lower than some kind of threshold, then the images are similar enough.
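A minimal sketch of that idea, using a Keras MobileNetV2 in Python for illustration rather than the Core ML model itself; the file names and the 0.3 distance threshold are placeholders.

```python
# Feature-distance sketch: a pretrained network minus its classifier acts as the embedding,
# and a small distance between two embeddings means the images look alike.
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False + pooling="avg" drops the classification layer and returns a feature vector.
model = MobileNetV2(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]

a, b = embed("image_a.jpg"), embed("image_b.jpg")
# Cosine distance: 0 means identical direction, larger means less similar.
cosine_distance = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("similar" if cosine_distance < 0.3 else "different")
```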
I need to recognize specific images using the iPhone camera. My goal is to have a set of 20 images, that when a print or other display of one of them is present in front of the camera, the app recognizes that image.
I thought about using classifiers (Core ML), but I don't think that would give the intended result. For example, if I had a model that recognizes fruits and I showed it two different pictures of a banana, it would recognize them both as bananas, which is not what I want. I want my app to recognize specific images, regardless of their content.
The behavior I want is exactly what ARToolKit does (https://www.artoolkit.org/documentation/doku.php?id=3_Marker_Training:marker_nft_training), but I do not wish to use this library.
So my question is: are there any other libraries, or other ways, for me to recognize specific images from the camera on iOS (preferably in Swift)?
Since you are using images specific to your use case there isn't going to be an existing model that you can use. You'd have to create a model, train it, and then import it into CoreML. It's hard to provide specific advice since I know nothing about your images.
As far as libraries are concerned, check out this list and Swift-AI.
Swift-AI has a neural network that you might be able to train if you had enough images.
Most likely you will have to create the model in another language, such as Python and then import it into your Xcode project.
Take a look at this question.
This blog post goes into some detail about how to train your own model for CoreML.
Keras is probably your best bet to build your model. Take a look at this tutorial.
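If you go the Keras route, the end-to-end flow might look like the sketch below: build and train a small classifier in Python, then convert it for Xcode with coremltools. The architecture and file names are illustrative, and the exact converter call depends on your coremltools version.

```python
# Sketch of the "build in Python, import into Xcode" route: a tiny Keras classifier
# converted to Core ML. Architecture, names, and the converter call are illustrative.
import coremltools as ct
from tensorflow.keras import layers, models

NUM_CLASSES = 20  # one class per specific image you want to recognize

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# ... train on your (augmented) image set here ...

# Convert the trained model and save an .mlmodel you can drag into the Xcode project.
mlmodel = ct.convert(model)
mlmodel.save("SpecificImages.mlmodel")
```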
There are other problems too, though: you only have 20 images, which is certainly not enough to train an accurate model. Also, the user can present modified versions of these images. You'd have to generate realistic samples of each possible image and then use that entire set to train the model. I'd say you need a minimum of 20 samples per image (400 total).
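One way to generate those extra samples is simple data augmentation. A hedged sketch follows; the directory layout, counts, and transform ranges are all placeholders.

```python
# Augmentation sketch: turn each of the 20 source images into many varied training samples.
# Directory names, counts, and transform ranges are placeholders.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # small rotations
    width_shift_range=0.1,    # slight translations
    height_shift_range=0.1,
    zoom_range=0.2,           # scale changes
    brightness_range=(0.7, 1.3),
    horizontal_flip=False,    # flips would change which "specific image" it is
)

# Assumes one sub-folder per source image under originals/, e.g. originals/image_01/...
flow = datagen.flow_from_directory(
    "originals", target_size=(224, 224), batch_size=1,
    save_to_dir="augmented", save_prefix="aug", save_format="jpg",
)
for _ in range(400):  # roughly 20 augmented samples per original image
    next(flow)
```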
You'll want to pre-process the image and extract features that you can compare to the known features of your images. This is how facial recognition works. Here is a guide for facial recognition that might be able to help you with feature extraction.
Simply put, without a model that is based on your images you can't do much.
Answering my own question.
I ended up following this awesome tutorial that uses OpenCV to recognize specific images, and teaches how to make a wrapper so this code can be accessed by Swift.
I am trying to understand openface.
I have installed the docker container, run the demos and read the docks.
What I am missing is, how to start using it correctly.
Let me explain to you my goals:
I have an app on a Raspberry Pi with a webcam. If I start the app, it will take a picture of the person in front of it.
Now it should send this picture to my openface app and check whether the face is known. Known in this context means that I already added pictures of this person to openface before.
My questions are:
Do I need to train openface beforehand, or could I just put the images of the persons in a directory or something and compare the webcam picture on the fly against these directories?
Do I compare with images or with a generated .pkl?
Do I need to train openface for each new person?
It feels like I am missing a big thing that makes the required workflow clearer to me.
Just for the record: with the help of the link I mentioned, I was able to figure it out.
Do I need to train openface beforehand, or could I just put the images of the persons in a directory or something and compare the webcam picture on the fly against these directories?
Yes, training is required before any images can be compared.
Do I compare with images or with a generated .pkl?
Images are compared with the generated classifier pkl.
The pkl file gets generated when training openface:
./demos/classifier.py train ./generated-embeddings/
This will generate a new file called ./generated-embeddings/classifier.pkl. This file has the SVM model you'll use to recognize new faces.
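For what it's worth, in the demo script that pickle appears to hold a (label encoder, classifier) pair, so recognizing a new webcam face roughly boils down to the sketch below. This assumes that pickle format and that you already have the 128-dimensional OpenFace embedding of the face; the paths and shapes are placeholders.

```python
# Sketch of how the generated classifier.pkl might be used at recognition time.
# Assumes the demo script's format: a pickled (LabelEncoder, classifier) pair, and that
# the 128-dimensional embedding of the webcam face has already been computed.
import pickle
import numpy as np

with open("generated-embeddings/classifier.pkl", "rb") as f:
    label_encoder, classifier = pickle.load(f)

embedding = np.zeros((1, 128))  # placeholder: the real embedding comes from the OpenFace net

probabilities = classifier.predict_proba(embedding)[0]
best = int(np.argmax(probabilities))
person = label_encoder.inverse_transform([best])[0]
print(person, probabilities[best])
```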
Do I need to train openface for each new person?
For this question I don't have an answer yet, but only because I have not looked deeper into this topic.
So I am working on a project for school, and what we are trying to do is teach a neural network to distinguish buildings from non-buildings. The problem I am having right now is representing the data in a form that would be "readable" by the classifier function.
The training data is a bunch of pictures plus a .wkt file with the coordinates of the buildings in each picture. So far we have been able to rescale the polygons, but we kind of got stuck there.
Can you give any hints or ideas of how to bring this all to an appropriate form?
Edit: I do not need the code written for me, a link to an article on a similar subject or a book is more of stuff I am looking for.
You did not mention which framework you are using, but I will give an answer for caffe.
Your problem is very close to detecting objects within an image. You have full images with object (building in your case) bounding boxes.
The easiest way of doing this is through a python data layer which reads an image and a file with stored coordinates for that image and feeds that into your network. A tutorial on how to use it can be found here: https://github.com/NVIDIA/DIGITS/tree/master/examples/python-layer
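A bare-bones sketch of such a Python data layer is below; the paths, blob shapes, and the parsing of your .wkt-derived coordinates are placeholders, and a real layer would iterate over your whole dataset.

```python
# Bare-bones Caffe Python data layer sketch: feeds an image and its building
# coordinates into the network. Paths, shapes, and coordinates are placeholders.
import caffe
import numpy as np
import cv2

class BuildingDataLayer(caffe.Layer):
    def setup(self, bottom, top):
        # top[0]: image blob, top[1]: one bounding box per image (x1, y1, x2, y2)
        self.image_paths = ["example.jpg"]           # placeholder list of training images
        self.boxes = [np.array([10, 10, 100, 120])]  # placeholder coordinates from the .wkt files
        self.index = 0

    def reshape(self, bottom, top):
        top[0].reshape(1, 3, 224, 224)
        top[1].reshape(1, 4)

    def forward(self, bottom, top):
        img = cv2.imread(self.image_paths[self.index])
        img = cv2.resize(img, (224, 224)).transpose(2, 0, 1)  # HWC -> CHW as Caffe expects
        top[0].data[...] = img
        top[1].data[...] = self.boxes[self.index]
        self.index = (self.index + 1) % len(self.image_paths)

    def backward(self, top, propagate_down, bottom):
        pass  # data layers have nothing to back-propagate
```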
To accelerate the process you may want to store the image/coordinate pairs in a custom lmdb database.
Finally a good working example with complete caffe implementation can be found within Faster-RCNN library here: https://github.com/rbgirshick/caffe-fast-rcnn/
You should check roi_pooling_layer.cpp in their custom caffe branch and roi_data_layer to see how the data is fed into the network.