Text Image Combined Classification Model - machine-learning

I am working on a problem where I have to match a (text, image) pair: given a furniture description and a furniture image, I have to say whether they match. This is a binary classification problem, but I have to combine both text and image data.
One possible solution I am trying is as follows: I combine the features from a pre-trained text model and a pre-trained image model and train a linear layer on top of the concatenated features end to end.
Is there any other way to handle this type of problem? Any leads are most welcome. Thanks a lot in advance for your help.
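The late-fusion approach described above can be sketched in PyTorch. This is a minimal sketch under assumed feature dimensions (768 for a BERT-style text encoder, 2048 for a ResNet-style image encoder); the class name and dimensions are illustrative, not from any specific library.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate pre-extracted text and image features, then classify."""
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one logit: match / no match
        )

    def forward(self, text_feat, image_feat):
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
text_feat = torch.randn(4, 768)    # e.g. BERT [CLS] embeddings
image_feat = torch.randn(4, 2048)  # e.g. ResNet pooled features
logits = model(text_feat, image_feat)  # shape (4, 1)
```

Training this head end to end with a binary cross-entropy loss (`nn.BCEWithLogitsLoss`) on the matching labels, optionally unfreezing the encoders, matches the setup described above.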

I found a few recent works on the problem of "Image-Text Matching", which is slightly different from my problem statement, but I can adapt the code for my project.
Transformer Reasoning Network for Image-Text Matching and Retrieval
Stacked Cross Attention for Image-Text Matching

Related

How can we apply masked language modelling on images using multimodal models, and how can we implement it and get MLM scores?

It might not be clear from the question what I want to ask: how can we apply masked language modelling to text and an image together using multimodal models like LXMERT? For example, if there is some text given ("This is a [MASK]") with a masked word, and an image given (say, of a cat), how can we apply MLM to predict the word as "cat"? How can we implement such a thing and get MLM scores out of it using the Hugging Face library API? A snippet of code explaining this would be great. If anyone can help, it would aid my understanding.
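The scoring step can be illustrated independently of any particular model. A multimodal MLM head (e.g. LXMERT's pre-training head in Hugging Face Transformers) ultimately produces a `(sequence_length, vocab_size)` matrix of logits; the MLM score for a candidate word is the softmax probability at the masked position. This sketch uses random logits as a stand-in, since it does not download a model; the shapes are illustrative.

```python
import numpy as np

def mlm_scores(logits, mask_position):
    """Softmax over the vocabulary at the masked token position."""
    row = logits[mask_position]
    exp = np.exp(row - row.max())  # numerically stable softmax
    return exp / exp.sum()

# Stand-in for model output with shape (sequence_length, vocab_size).
# In practice these logits would come from a multimodal model's MLM
# prediction head, given tokenized text containing [MASK] plus image features.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 30522))  # 30522 = BERT-style vocab size
probs = mlm_scores(logits, mask_position=3)
best_token_id = int(probs.argmax())   # decode with the tokenizer to get the word
```

With a real model, `probs[token_id]` for the tokenizer's id of "cat" would be the MLM score of "cat" at the masked position.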

How to get the outputs of hidden layers of a pre-trained StyleGAN?

I am currently studying StyleGAN and want to add some new loss functions, but the main problem I have met is how to get the outputs of the hidden layers given a pre-trained StyleGAN model. I want to get some feature maps from the middle layers of both the Discriminator and the Generator. I've looked through the code released by NVlabs but didn't find a clue. Any suggestions? Thanks in advance.
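If you are working with a PyTorch port of StyleGAN (the official NVlabs release is TensorFlow), forward hooks let you grab intermediate feature maps without modifying the model code at all. A minimal sketch on a toy stand-in network; the same pattern applies to any layer of a pre-trained generator or discriminator.

```python
import torch
import torch.nn as nn

# Toy stand-in for a discriminator; the hook pattern is identical
# for a real pre-trained PyTorch StyleGAN port.
discriminator = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(16, 32, 3, padding=1),  # we want this layer's output
    nn.LeakyReLU(0.2),
)

features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Register the hook on the layer whose feature map we want.
handle = discriminator[2].register_forward_hook(save_output("conv2"))

x = torch.randn(1, 3, 64, 64)
_ = discriminator(x)       # hook fires during the forward pass
handle.remove()            # clean up once the activation is captured

feat = features["conv2"]   # feature map of the chosen middle layer
```

For a new loss term, you would keep the hook registered during training and use `features["conv2"]` inside the loss computation (without `.detach()` if gradients should flow through it).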

How to specify the region of interest in an image that has lots of detail?

Background:
I am working on my final-year undergraduate college project, and the topic is Paper Note Detection by Optical Character Recognition. I have already started working on basic image-processing techniques, and since I am new to image processing with Java, progress is a bit slow.
I have a basic idea of image processing since I took a course on it in a previous semester.
Basically, I am working on Nepali paper notes, and the idea is to extract the key information from them. The notes I am using are Nepali currency notes of rupees 100, 500 and 1000.
The image above shows the Nepalese currency note of rupees 500. The idea is to extract the information from the image and identify which currency denomination the image belongs to.
The primary goal of my project is to determine the currency type, which basically comes down to recognizing the bottom-right area of the image, since that area shows the value of the currency.
The secondary goal is to fetch the unique serial number of the note and store it in the database.
Question:
Well, my question is: how feasibly can this problem be solved? What are the necessary prerequisites before starting this project? How do I select the region of interest from the image?
The other two paper notes that my project should recognize are listed below:
Nepalese Paper Note: Rs. 1000
Nepalese Paper Note: Rs. 100
Since I am new to image processing with Java, I need a fair suggestion on how to carry this project through to success.
I'm going to try to answer this step by step, and since the steps are sequential, your accuracy will depend on how well you do each and every one.
Determining and extracting the ROI: Considering you're working on currency notes, it is safe to assume that your input test/train data will be aligned the way it is in the images given above. Try using contouring to extract a region of interest around the numbers. Another option is to create a mask that filters out the rest of the image and leaves you with only the area you require. The second approach is more hard-coded and will fail in case the image is not aligned.
Pre-processing: Once you have your ROI, you will need to apply some preprocessing techniques before you feed the data to an OCR engine. Most OCR engines show better accuracy on binary images, sometimes on grayscale too. This step is essential for getting good results from your OCR.
Applying OCR: You can always use Tesseract OCR or others, but since the types of currency notes are limited, I would also suggest you look at object-detection models. Many of them are readily available online, and you can train them yourself by providing images of currency notes and manually labeling them with the corresponding value. OCR engines don't always return the best results, and in your use case I would suggest you also try alternatives such as image matching or training your own model.

How to define the objects that the selective search algorithm needs to detect

Currently, I am trying to generate "bounding box" annotations for object detection with a deep neural network (R-CNN). The problem is that I would have to do it by hand, and I have more than 500 images, so I want to generate the annotations automatically. For that I found the Selective Search algorithm, but what I don't understand is how I can tell my algorithm which label corresponds to each generated bounding box.
Thanks,
Look at the paper "Selective Search for Object Recognition". The authors describe how to generate labels: you have to construct a classifier and train it to assign the labels. Hope this helps.
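One common way to turn a small set of hand-labeled ground-truth boxes into labels for all selective-search proposals (this is how the R-CNN training pipeline labels its proposals) is an intersection-over-union (IoU) match: a proposal inherits the label of the best-overlapping ground-truth box, or "background" if nothing overlaps enough. The box coordinates, labels, and 0.5 threshold below are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def label_proposal(proposal, ground_truth, threshold=0.5):
    """Label of the best-overlapping ground-truth box, or 'background'
    if no overlap reaches the threshold."""
    best_label, best_iou = "background", threshold
    for gt_box, gt_label in ground_truth:
        overlap = iou(proposal, gt_box)
        if overlap >= best_iou:
            best_label, best_iou = gt_label, overlap
    return best_label

gt = [((10, 10, 50, 50), "chair")]
label1 = label_proposal((12, 8, 48, 52), gt)       # heavy overlap
label2 = label_proposal((100, 100, 140, 140), gt)  # no overlap
```

So you still need ground-truth boxes for a subset of the 500 images, but labeling a few images per class is far cheaper than boxing all of them by hand, and the IoU match propagates the labels to every proposal.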

Training a Text Detection System

I'm currently developing a text detection system for a given image using logistic regression, and I need training data like the image below:
The first column shows positive examples (y=1) of text, whereas the second column shows images without text (y=0).
I'm wondering where I can get a labeled dataset of this kind?
Thanks in advance.
A good place to start for these sorts of things is the UC Irvine Machine Learning Repository:
http://archive.ics.uci.edu/ml/
But maybe also consider heading over to Cross-Validated as well, for machine learning-related questions:
https://stats.stackexchange.com/
You can get a similar dataset here.
Hope it helps.
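Once a dataset of labeled patches like the ones described in the question is in hand, the logistic-regression setup itself is short. This sketch uses scikit-learn on synthetic stand-in data (random "patches" whose pixel statistics separate the two classes); with a real dataset you would flatten each patch image into a feature row instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for 20x20 grayscale patches, flattened to 400
# features: "text" patches get different pixel statistics than
# "background" patches so the toy problem is separable.
n = 200
text_patches = rng.normal(0.7, 0.1, size=(n, 400))        # y = 1
background_patches = rng.normal(0.3, 0.1, size=(n, 400))  # y = 0

X = np.vstack([text_patches, background_patches])
y = np.concatenate([np.ones(n), np.zeros(n)])

clf = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = clf.score(X, y)  # near-perfect on this separable toy data
```

At detection time you would slide a 20x20 window over the input image, flatten each window the same way, and call `clf.predict_proba` to score it for text.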
