Generating Captions for Captioned Food Dataset - machine-learning

I'm working on a college project involving a captioned food dataset. The dataset currently consists of captioned and uncaptioned food and food-related images.
There are 2 problems I have:
1) The number of captioned images is smaller than the number of uncaptioned images.
2) Some captions are just food names, while others include information about what the food is like, how it looks, etc.
Given these problems, I'm not sure how to use the uncaptioned data effectively.
I can train an integrated CNN and RNN with LSTM on the captioned images, but since the uncaptioned image set is larger than the captioned one, how can I use that data effectively?
Or should I start with a clustering algorithm on CNN features to find similar images, and then give the captions of the captioned images in a cluster to the uncaptioned images in the same cluster? If so, consider a cluster that contains two captioned images and one uncaptioned image: which caption should be given to the uncaptioned image?
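For reference, here is a minimal sketch of that clustering idea, assuming a pretrained CNN as the feature extractor (ResNet50 from Keras, purely as an example), K-Means for clustering, and a nearest-neighbour rule to decide which caption an uncaptioned image inherits; none of these specific choices come from the original setup:

```python
# Sketch: cluster CNN features, then propagate captions within each cluster.
# ResNet50, KMeans, and the nearest-neighbour caption rule are assumptions.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import KMeans

extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x, verbose=0)[0]

def propagate_captions(captioned, uncaptioned, n_clusters=100):
    """captioned: list of (path, caption); uncaptioned: list of paths."""
    paths = [p for p, _ in captioned] + uncaptioned
    feats = np.stack([features(p) for p in paths])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)

    n_cap = len(captioned)
    assigned = {}
    for i, path in enumerate(uncaptioned):
        cluster = labels[n_cap + i]
        members = [j for j in range(n_cap) if labels[j] == cluster]  # captioned cluster mates
        if members:
            # give the uncaptioned image the caption of its nearest captioned neighbour
            dists = [np.linalg.norm(feats[n_cap + i] - feats[j]) for j in members]
            assigned[path] = captioned[members[int(np.argmin(dists))]][1]
    return assigned
```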
I've used spaCy to get rid of unnecessary information in the captions, such as prepositions and other words that carry little meaning.
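A minimal sketch of that kind of caption cleaning with spaCy (the exact set of POS tags to drop is an assumption; the original post doesn't list them):

```python
# Sketch: strip low-information tokens (prepositions, articles, etc.) from captions.
# The set of POS tags to drop is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")
DROP = {"ADP", "DET", "CCONJ", "PUNCT"}  # prepositions, determiners, conjunctions, punctuation

def clean_caption(text):
    doc = nlp(text)
    return " ".join(tok.text for tok in doc if tok.pos_ not in DROP)

print(clean_caption("A bowl of spicy ramen with an egg on top"))
# e.g. -> "bowl spicy ramen egg top"
```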
Number of food items:
Without captions: 57906
With captions: 56968
Non-food: 85126
Any good ways to use the data set effectively?

Related

Does an ML model classify between the desired image classes or between datasets?

If I had Dataset 1 with 90% cat images and 10% dog images, and I combined it with Dataset 2, which contains only dogs, to equalize the class imbalance, will my model classify cats vs. dogs, or Dataset 1 images vs. Dataset 2 images?
If it's the latter, how do I get the model to classify between cats and dogs?
Your model will only do what it is trained for, regardless of what names your dataset(s) have.
The name of a dataset is just an organizational detail; it does not go into training and does not affect the loss produced during a training step. What does affect your model's responses is the properties of the data.
Sometimes data from different datasets have different properties even though the datasets serve the same purpose, e.g. images with different illumination, backgrounds, or resolutions. That certainly has an effect on model performance, which is why mixing datasets should be done with caution. You might find it useful to have a look at this paper.
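To make the first point concrete, here is a minimal sketch (the file paths and the tf.data pipeline are assumptions) showing that only the labels you assign drive training, never which dataset an image came from:

```python
# Sketch: labels come from your annotation, not from which dataset a file lives in.
# File paths are assumptions for illustration.
import tensorflow as tf

def load(path, label):
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (224, 224)), label

# 0 = cat, 1 = dog, regardless of which dataset the file came from
paths = ["dataset1/cats/001.jpg", "dataset1/dogs/002.jpg", "dataset2/dogs/003.jpg"]
labels = [0, 1, 1]

ds = (tf.data.Dataset.from_tensor_slices((paths, labels))
        .map(load)
        .shuffle(1000)
        .batch(32))
# Training on `ds` optimizes cat-vs-dog only; "Dataset 1 vs Dataset 2" never enters the loss.
```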

Why does object detection result in multiple found objects?

I trained an object detector with CreateML, and when I test the model in CreateML I get a high number of identified objects.
Notes:
The model was trained on a small dataset of ~30 images, with that particular label face-gendermale occurring ~20 times.
Each training image has 1-3 labelled objects.
There are 5 labels in total.
Questions:
Is that expected or is there something wrong with the model?
If this is expected, how should I evaluate these multiple results or even count the number of objects found in the model?
Cross-posted in Apple Developer Forums. Photo of man © Jason Stitt | Dreamstime.com
A typical object detection model will make about 1000 predictions for every image (it can be far more, depending on the model architecture). Most of these predictions have very low confidence, so they are filtered out. The ones that are left are then sent through non-maximum suppression (NMS), which removes bounding boxes that overlap too much.
In your case it seems the confidence threshold is too low, or the NMS overlap threshold is too high, because many overlapping boxes survive.
However, it also seems that the model hasn't been trained very well yet, probably because you used very few images.
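A minimal sketch of that confidence filter plus NMS step (plain Python; the 0.5 confidence and 0.45 IoU thresholds are illustrative assumptions, not CreateML's actual defaults):

```python
# Sketch: keep only confident boxes, then suppress heavily overlapping ones (greedy NMS).
# Thresholds are illustrative assumptions.
def iou(a, b):
    """Boxes as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_and_nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    keep_idx = [i for i, s in enumerate(scores) if s >= conf_thresh]  # confidence filter
    keep_idx.sort(key=lambda i: scores[i], reverse=True)              # best boxes first
    kept = []
    for i in keep_idx:
        # drop a box if it overlaps too much with an already-kept box
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept  # indices of surviving detections
```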

Foreground extraction using Haar cascade classifier

I am working on a dataset (training + testing) which contains different shopping cart items (e.g. biscuits, soaps, etc.) with different backgrounds. I need to predict the product ID for all test images (product IDs are unique for each product; say, a 10 Rs Good-day pack has product ID 1, and so on for the other products).
My approach was to:
Extract the foreground from the image.
Apply the SIFT/SURF algorithm to find matching keypoints.
However, the results are not satisfactory.
This is the input image:
This is the output image:
As you can see, the bounding box generated by the Haar cascade doesn't cover the whole biscuit packet correctly.
Can you please tell me how to get correct bounding boxes using a Haar-cascade classifier (the positive images are my product dataset, and the negative images folder consists of persons and different climate conditions)?
I know that in my dataset each biscuit packet is a distinct product and there is only one image for a particular product; is this the reason why my Haar cascade is not performing well?
If yes, please specify the data preprocessing steps to take.
Also, please suggest other foreground extraction algorithms that could solve my problem.
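Since one of the asks is an alternative foreground extraction method, here is a minimal sketch of GrabCut for foreground extraction followed by keypoint matching against the single reference image per product. ORB stands in for SIFT/SURF only because it ships with stock OpenCV; the seed rectangle and the "most good matches wins" rule are assumptions:

```python
# Sketch: GrabCut foreground extraction + ORB keypoint matching against reference images.
# The seed rectangle, ORB settings, and match-count rule are assumptions.
import cv2
import numpy as np

def extract_foreground(img):
    mask = np.zeros(img.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    h, w = img.shape[:2]
    rect = (int(0.05 * w), int(0.05 * h), int(0.9 * w), int(0.9 * h))  # assumes product roughly centred
    cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
    return cv2.bitwise_and(img, img, mask=fg)

def best_product_match(query_img, references):
    """references: dict of product_id -> reference image (one image per product)."""
    orb = cv2.ORB_create(nfeatures=1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    _, q_desc = orb.detectAndCompute(cv2.cvtColor(query_img, cv2.COLOR_BGR2GRAY), None)
    best_id, best_score = None, 0
    for pid, ref in references.items():
        _, r_desc = orb.detectAndCompute(cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY), None)
        if q_desc is None or r_desc is None:
            continue
        matches = matcher.match(q_desc, r_desc)
        good = [m for m in matches if m.distance < 50]  # assumed distance cut-off
        if len(good) > best_score:
            best_id, best_score = pid, len(good)
    return best_id
```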

Image classification - Car or Not

I am building a system that can classify cars based on damage severity. In this system I need a module that can tell me whether an uploaded image is a car or not. I am using TensorFlow for this purpose. My only idea so far is to keep images of cars in one folder and some random images of other things in another folder, but this is not feasible at all because I cannot add images of every possible thing.
Is there any other solution for this?
Thanks in advance.
First solution
You can find images of "every possible thing" in a dataset such as CIFAR-100 (you can specify that you don't want the car images before downloading), then train your network to tell car images apart from all the others.
Second solution
Use a pretrained model: many models have already been trained to recognize cars in TensorFlow, you just have to pick one (a sketch follows after the third solution).
Third solution
If you have a folder with car images, you can train a Generative Adversarial Network to generate pictures of cars from a random vector; after training, the discriminator should be able to recognise cars!
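As a hedged sketch of the second solution: one possible choice is an ImageNet-pretrained Keras classifier used as a "car or not" gate. MobileNetV2 and the list of car-like ImageNet labels below are assumptions, not something prescribed above:

```python
# Sketch: use an ImageNet-pretrained classifier as a "car or not" gate.
# MobileNetV2 and the set of car-like labels are assumptions for illustration.
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

CAR_LABELS = {"sports_car", "convertible", "limousine", "cab", "jeep",
              "minivan", "pickup", "beach_wagon", "racer"}

model = MobileNetV2(weights="imagenet")

def looks_like_car(path, top=5):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = decode_predictions(model.predict(x, verbose=0), top=top)[0]
    # treat the image as a car if any top prediction is a car-like ImageNet class
    return any(label in CAR_LABELS for _, label, _ in preds)
```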

How to specify the region of interest to the computer from the image having lots of details?

Background:
I am working on my final year undergraduate college project, and the topic I am working on is paper note detection by optical character recognition. I have already started working on basic image processing techniques, and since I am new to image processing with Java, progress is a bit slow.
I have a basic idea of image processing, since I took a course on it in a previous semester.
Basically, I am working on Nepali paper notes, and the idea is to extract the key information from them. The notes I am using are the Nepali currency notes of 100, 500 and 1000 rupees.
The image above is the Nepalese currency note of 500 rupees. The idea is to extract the information from the image and identify which currency the image belongs to.
The primary goal of my project is to determine the currency type, which is basically done with the recognition of the bottom right area. The bottom right area of the image defines the value of the currency.
The secondary goal is to fetch the unique serial number of the note and store it in the database.
Question:
Well, my question is: how feasibly can this problem be solved? What are the necessary prerequisites before starting this project? How do I select the region of interest from the image?
The other two paper notes that my project should recognize are listed below:
Nepalese Paper Note: Rs. 1000
Nepalese Paper Note: Rs. 100
Since I am new to image processing with Java, I would appreciate suggestions on how to approach this problem successfully.
I'm going to try and answer this step by step, and since the steps are sequential, your accuracy will depend on how well you do each and every one of them.
Determining and extracting the ROI: Considering you're working on currency notes, it is safe to assume that your test/train data will be aligned the way it is in the images given above. Try using contouring to extract a region of interest around the numbers. Another thing you can do is create a mask which filters out the rest of the image and leaves you with only the area you require. The second approach is more of a hardcode and will fail if the image is not aligned.
Pre-processing: Once you have your ROI, you will need to apply some preprocessing before you feed the data to an OCR. Most OCRs show better accuracy on binary images, sometimes on grayscale too. This step is essential for getting good results from your OCR.
Applying OCR: You can always use Tesseract OCR or others, but since the types of currency notes are limited, I would also suggest you have a look at object detection models. Many of them are readily available online, and you can train them yourself by providing images of currency and manually labelling them with the corresponding value. OCRs don't always return the best results, and in your use case I would suggest you also try out other alternatives such as image matching or building a model.
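A minimal sketch of the mask-based ROI crop, binarization, and Tesseract steps described above, using OpenCV and pytesseract from Python (the crop fractions, Otsu thresholding, and Tesseract config are all assumptions, and the original poster is working in Java, so this is only an outline of the idea):

```python
# Sketch: crop a fixed bottom-right region (the value area), binarize it,
# and read the digits with Tesseract. Crop fractions and Tesseract settings
# are assumptions for illustration.
import cv2
import pytesseract

def bottom_right_roi(img, h_frac=0.25, w_frac=0.30):
    h, w = img.shape[:2]
    return img[int(h * (1 - h_frac)):, int(w * (1 - w_frac)):]

def read_value(roi):
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"  # one text line, digits only
    return pytesseract.image_to_string(binary, config=config).strip()

note = cv2.imread("note_500.jpg")  # hypothetical file name
print(read_value(bottom_right_roi(note)))
```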
