CNN that generates a new image from an input image - machine-learning

Can a CNN be created that outputs the input image with a feature added to it?
For example, if an image of a person's face is input, it outputs an image of that person's face wearing glasses.

There are several options, but basically, in the same way that you have one input for every pixel of the input image, you must have one output for every pixel of the output image.
In MLPs you must have the same number of neurons in the output layer as in the input layer.
In CNNs you can also have convolutional layers at the beginning, followed by deconvolutional (transposed convolution) layers.
Take a look at this paper (it is awesome), which creates very realistic images from other images (for example, converting between satellite and map views in Google Maps). It trains a neural network to solve the task while also trying to create images that another neural network cannot distinguish from real images (the source code is also available):
https://phillipi.github.io/pix2pix/
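To make the "one output per input pixel" point concrete, here is a minimal sketch that only tracks spatial sizes through a stack of strided convolutions followed by the same number of transposed convolutions. The function names and the kernel/stride/padding values (4, 2, 1) are assumptions chosen for illustration, not taken from the paper; the two size formulas are the standard ones for convolutions without dilation or output padding.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def encoder_decoder_sizes(size, n_layers=3, kernel=4, stride=2, padding=1):
    """Trace the size through n_layers convolutions, then n_layers deconvolutions."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    for _ in range(n_layers):
        sizes.append(deconv_out(sizes[-1], kernel, stride, padding))
    return sizes


if __name__ == "__main__":
    # A 256-pixel side is halved three times, then doubled three times.
    print(encoder_decoder_sizes(256))
```

With these settings the decoder ends at the same size the encoder started from, which is what lets the network emit one value per input pixel.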

To add to the answer above, another way of doing this is neural style transfer, where we feed two images to a CNN, which then generates a new image combining the content of the second image with the style of the first. Check out this paper for further details: https://arxiv.org/abs/1508.06576
GANs are, of course, another option when more realistic results are needed.
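In the style transfer paper linked above, "style" is captured by Gram matrices of CNN feature maps. The sketch below shows that one ingredient in plain Python; the feature maps would normally come from a pretrained CNN, which is not included here, and the loss normalisation is simplified (an assumption, the paper uses its own constants).

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of N activations.
    Returns the C x C matrix of inner products between channels."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def style_loss(generated_features, style_features):
    """Mean squared difference between the two Gram matrices
    (simplified normalisation, assumed for illustration)."""
    g = gram_matrix(generated_features)
    s = gram_matrix(style_features)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)
```

Because the Gram matrix sums over spatial positions, it keeps which features co-occur but discards where they occur, which is why it describes style rather than content.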

Related

How to verify if the image contains noise in background before ‘OCR’ing

I have several types of images that I need to extract text from.
I can manually classify the images into 3 categories based on the noise in the background:
1. Images with no noise.
2. Images with some light noise in the background.
3. Images with heavy noise in the background.
For the category 1 images, OCR works fine without problems. → basic case.
For the category 2 images and some of the category 3 images, I managed to extract the text by applying the following steps:
* Grayscale, Gaussian blur, Otsu’s threshold
* Morph open to remove noise and invert the image
→ then perform text extraction.
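For reference, a minimal sketch of the Otsu step from the list above, written in plain Python so it runs without OpenCV (with OpenCV one would normally call cv2.threshold with the THRESH_OTSU flag instead). The function names and the flat-list pixel format are assumptions made for illustration; blur and morphological opening are not shown.

```python
def otsu_threshold(pixels):
    """pixels: list of ints in 0..255 (a flattened grayscale image).
    Returns the threshold t that maximises the between-class variance;
    pixels <= t form one class, pixels > t the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the lower class yet
        if w1 == 0:
            break     # no pixels left for the upper class
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(pixels, t):
    """Map pixels above t to 255 and all others to 0."""
    return [255 if p > t else 0 for p in pixels]
```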
For OCR, a single noise-removal method obviously does not work for all images. So, is there any method for classifying the level of background noise in an image?
All suggestions are welcome.
Thanks in advance.
Following up on your comment on the other question, here are some things you could try. Some combination of the ideas below should help.
Image Embedding and Vector Clustering
Manual
Use a pretrained network such as ResNet trained on ImageNet (may not work well) or a simple network pretrained on MNIST/EMNIST.
Extract, flatten and concatenate the output vectors of some layers toward the end of the network. Apply dimensionality reduction, then apply nearest neighbour or approximate nearest neighbour algorithms to find the closest matches. Set the number of clusters to 3, as you have 3 types of images.
For nearest neighbour search, start with k-NN. There are also many libraries on GitHub that may help, such as faiss and annoy.
More can be found at:
https://github.com/topics/nearest-neighbor-search
https://github.com/topics/approximate-nearest-neighbor-search
If the result of the above is not good enough, try fine-tuning only the last few layers of the MNIST/EMNIST-trained network.
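As a rough illustration of the nearest neighbour step, here is a brute-force search over embedding vectors in plain Python. The embeddings themselves are assumed to have been extracted already; the function names and the choice of cosine distance are assumptions, and libraries such as faiss or annoy would replace this for large collections.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbours(query, vectors, k=3):
    """Return the indices of the k vectors closest to the query."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine_distance(query, vectors[i]))
    return order[:k]
```

Given a few hand-labelled images per noise category, the labels of the returned neighbours can be used to assign a category to a new image.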
Using Existing Libraries
For grouping or finding similar images, look into:
https://github.com/jina-ai/jina
You should be able to find more similarity and clustering tools under the tags neural-search and image-search on GitHub:
https://github.com/topics/neural-search
https://github.com/topics/image-search
OCR
Try easyocr, as it worked better for me than Tesseract the last time I used OCR.
Run it first on the whole document to see if the requirements are met.
Instead of tight cropping, leave some (or a large) padding around the text, if possible with no other text nearby. Another option is to add padding in all directions to tightly cropped text and see if it improves the OCR result.
For Tesseract, see if the tools mentioned in its "improving quality" doc help.
Classification
If you already have data sorted into 3 different directories and only want to classify future images, then I suggest a neural network. Modify the MNIST or CIFAR example of PyTorch or TensorFlow to train on and classify your images.
Based on the sample images, the text looks like a computer font rather than handwriting. If that is the case, template matching at multiple scales may help. You will have to check whether the noise affects the matching result. Image from https://www.pyimagesearch.com/2021/03/22/opencv-template-matching-cv2-matchtemplate/
Noise Removal
Here too you can go with a neural network. Train a denoising autoencoder on Category 1 images and on corrupted Category 1 images, produced by adding noise that mimics Category 2 and Category 3 images. This way a network can learn the 3 image categories without you needing to create a dataset manually, and in post-processing you can use another neural network or an image processing method to remove noise based on the category. Image from https://keras.io/examples/vision/autoencoder/
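A minimal sketch of how the corrupted training data could be generated. Salt-and-pepper noise and the two noise amounts are assumptions standing in for whatever actually mimics your Category 2 and Category 3 backgrounds; the function names are hypothetical and the autoencoder itself is not shown.

```python
import random


def add_salt_and_pepper(image, amount, rng=None):
    """Return a copy of a grayscale image (list of rows, values 0-255)
    in which roughly a fraction `amount` of pixels is set to 0 or 255."""
    rng = rng or random.Random()
    noisy = [row[:] for row in image]
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return noisy


def make_training_pairs(clean_images, amounts=(0.02, 0.15), seed=0):
    """Build (noisy, clean, category) triples from clean Category 1 images.
    The amounts for categories 2 and 3 are illustrative guesses."""
    rng = random.Random(seed)
    pairs = []
    for image in clean_images:
        for category, amount in enumerate(amounts, start=2):
            pairs.append((add_salt_and_pepper(image, amount, rng), image, category))
    return pairs
```

The noisy image is the network input and the clean image is the target; the category label comes for free from the noise level used.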
Try existing libraries or pretrained networks on GitHub to remove noise from the whole document or from a cropped region. Check whether rembg works on text documents.
Your samples are not very convincing. All images binarize easily (threshold 25).

How do we transfer a particular structure (defined by a mask) in an image to another image?

Image showing required content transfer between source and target images
Essentially, is there a way to grow an image patch defined by a mask and keep it realistic?
For your first question, you are talking about image style transfer. In that case, CNNs may help you.
For the second, if I understand correctly, by growing you mean introducing variations in the image patch while keeping it realistic. If that's the goal, you may use GANs for generating images, provided you have a reasonably sized dataset to train with:
Image Synthesis with GANs
Intuitively, conditional GANs model the joint distribution of the input dataset (which in your case, are images you want to imitate) and can draw new samples (images) from the learned distribution, thereby allowing you to create more images having similar content.
Pix2Pix is the open-source code of a well-known paper that you can play around with to generate images. Specifically, let X be your input image and Y be a target image. You can train the network and feed X to observe the output O of the generator. Thereafter, by tweaking the architecture a bit or by changing the skip connections (read the paper) and training again, you can generate variety in the output images O.
Font Style Transfer is an interesting experiment with text on images (rather than image on image, as in your case).

Can I retrain Inception's final layer using depth images from Kinect?

I would like to know whether I can use a data set of signs made using Kinect to retrain Inception's final layer, as described in the TensorFlow tutorial, which uses ordinary RGB images. I am new to this field. Opinions are much appreciated.
The short answer is "No": you cannot fine-tune only the last layer, but you can fine-tune the whole pre-trained network. The first layers of the pre-trained network look for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with a little contrast stretching and data whitening ((x-mu)/sigma). This will take a very long time for the ivy league networks like Inception and ResNet. Also, keep in mind that most Python-based deep learning frameworks rely on PIL image loaders, which by default assume images have 8-bit channels mapped to the range [0, 1]. These image loaders cast all 16-bit pixels to ones.
If the recognised/tracked objects (hands) are masked, meaning your background is set to a single value or has barely any gradient in it, the network will overfit on the silhouette of the object because this is where the strongest edges are. The solution is to colorise the depth image using normal maps, HSA, HSV or JET colour coding to convert it into a 3x8-bit channel image. This makes training converge much faster, and in my recent experiments we found that you can fine-tune the ivy league networks on the colorised depth.
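The two preprocessing ideas above can be sketched in plain Python as follows. The whitening function implements (x-mu)/sigma over one depth frame; the colour coding uses a common piecewise-linear approximation of the JET colormap, not the exact table shipped with any particular library. The function names and the d_min/d_max depth range are assumptions for illustration.

```python
import math


def whiten(depth):
    """Zero-mean, unit-variance version of a depth frame (list of rows)."""
    flat = [v for row in depth for v in row]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    sigma = sigma or 1.0  # avoid dividing by zero on a constant frame
    return [[(v - mu) / sigma for v in row] for row in depth]


def jet(x):
    """Approximate JET colour for x in [0, 1], as an 8-bit (r, g, b) tuple."""
    def clamp(v):
        return max(0.0, min(1.0, v))
    r = clamp(1.5 - abs(4 * x - 3))
    g = clamp(1.5 - abs(4 * x - 2))
    b = clamp(1.5 - abs(4 * x - 1))
    return tuple(int(round(255 * c)) for c in (r, g, b))


def colorize_depth(depth, d_min, d_max):
    """Map raw depth values (e.g. 16-bit) to a 3x8-bit colour image."""
    span = (d_max - d_min) or 1
    return [[jet(min(1.0, max(0.0, (d - d_min) / span))) for d in row]
            for row in depth]
```

Converting to 8-bit colour before saving also sidesteps the 16-bit loading issue, since the image loaders then see ordinary RGB data.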
Since you are new to this field, I suggest you read about transfer learning and its three types. Apply whichever form of transfer learning suits your data set. If your data set is very similar to the data the model was trained on, you can retrain just the last layers. If your data is not similar, you have to fine-tune the existing model and use it.
As the depth of a neural network increases, feature extraction becomes more data-specific, so you have to take care of the specific layers if your dataset is not very similar to the dataset of the pre-built model. The early layers contain more generic features.

How to use a Neural Network for face detection?

I'm trying to build a face detection system using a neural network written in Theano. I am a bit confused as to what the expected output should be, against which I would have to calculate the cross-entropy. I don't want to know whether a face is present or not; I need to highlight the face in an image (find the location of the face). The size of the images is constant, but the size of the faces in the images is not. How do I go about that? Also, my webcam currently captures 480x640 images. Creating that number of neurons in the input layer would be very heavy on the system. How do I compress the images without losing any features?
There are many possible solutions. One of the easiest is to perform a sliding window search and ask a network "is there a face in this part of the image?", which is quite a "standard" approach. In particular, you can do it hierarchically: split the image into 9 overlapping squares (I assume the image is square) and ask for each of them "is there a face in it?" by rescaling it to your network's input size. Next, split the one answering "yes" into 9 squares again and repeat. This way you can find the face fairly fast. Another option is to perform supervised segmentation, where you try to predict which parts of the image (pixels/superpixels) belong to a face and which do not. This is not an exhaustive list, but it should give you a general idea of how to proceed.
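The hierarchical search described above can be sketched as follows. The classifier is passed in as a callback, has_face(x, y, size), which is a hypothetical stand-in for "crop, rescale, and run the network"; the 3x3 layout with half-size windows and the min_size stopping rule are assumptions about how to realise "9 overlapping squares".

```python
def nine_windows(x, y, size):
    """Split the square at (x, y) with side `size` into a 3x3 grid of
    overlapping squares, each with half the side length."""
    half = size // 2
    step = size // 4
    return [(x + i * step, y + j * step, half)
            for j in range(3) for i in range(3)]


def find_face(has_face, x, y, size, min_size=32):
    """Narrow down a face location by repeatedly descending into the
    first sub-window for which the classifier answers yes."""
    if size <= min_size:
        return (x, y, size)
    for wx, wy, ws in nine_windows(x, y, size):
        if has_face(wx, wy, ws):
            return find_face(has_face, wx, wy, ws, min_size)
    # No sub-window fully contains a face: keep the current window.
    return (x, y, size)
```

Each level costs at most 9 classifier calls, so the number of calls grows with the logarithm of the image size rather than with the number of pixels.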
How do I compress the images without losing any features?
You do not. It is not possible. You will always lose some data when downscaling (lossless compression exists but it destroys structure, thus making classification extremely hard).
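A small sketch of why downscaling is lossy: averaging blocks of pixels maps different inputs to the same output. The function name and the block-averaging scheme are assumptions chosen for illustration; real pipelines usually use a library resize.

```python
def downscale(image, factor):
    """Average non-overlapping factor x factor blocks of a grayscale
    image given as a list of rows."""
    height = len(image) // factor
    width = len(image[0]) // factor
    area = factor * factor
    return [[sum(image[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / area
             for x in range(width)]
            for y in range(height)]
```

Two different 2x2 checkerboards, for example, both reduce to the same single grey pixel, so the original pattern cannot be recovered.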
You should first create a training set from the images received through the webcam. The training set must contain face and non-face images (such as an apple, a car, and so on). For better generalization you may use some off-the-shelf data sets. After you have trained the network on the images, you can use it to classify unseen images.
This approach is suitable if your goal is only to detect whether an image contains a face. However, if you want to identify faces (e.g. this face belongs to John and not to other people), you need to train the network with images of the people you want to identify. The number of classes in such a network is equal to the number of distinct people.

How do I test if a JPEG is a photo (rather than a logo)?

I am extracting all images from given PDF files (containing real estate synopses) using the pdfimages tool as jpegs. Now I want to automatically distinguish between photos and other pictures, like maybe the broker's logo. How should I do this?
Is there an open tool that can distinguish between photos and clipart/line drawings etc. like google image search does?
Is there an open tool that gives me the number of colors used for a given jpeg?
I know this will bear a certain uncertainty, but that's okay.
I would look at colour distribution. The colours are likely to be densely packed or "too" evenly spread in the case of gradients. Alternatively, you could look at the frequency distribution of the image.
You can solve your problem in two steps: (1) extract some kind of information from the image and (2) train a classifier that can distinguish the two types of images:
1 - Feature Extraction
In this step you will have to write a program/function that takes an image as input and returns a numeric vector to describe its visual information. As koan points out in his answer, the color distribution contains a lot of useful information. So I would try the following measures:
* Histogram of each color channel (Red, Green, Blue), as this is a basic description of the color distribution of the image;
* Mean, standard deviation and other statistical moments of each histogram. This should give you information on how the colors are distributed in the image. For a drawing, such as a logo, the color distribution should be significantly different from that of a photo;
* Fourier Descriptors. In a drawing, you will probably find a lot of edges, whereas in a photo this is not expected. With Fourier descriptors, you can get this kind of information.
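The first two measures in the list above could be sketched like this in plain Python. Pixels are assumed to be already decoded into (r, g, b) tuples (decoding the JPEG is not shown), the function names and the 16-bin histogram are illustrative choices, and Fourier descriptors are left out. A distinct-colour count is included as well, with the caveat that JPEG compression artefacts tend to inflate it.

```python
import math


def channel_histogram(values, bins=16):
    """Normalised histogram of 8-bit values."""
    hist = [0] * bins
    for v in values:
        hist[v * bins // 256] += 1
    return [h / len(values) for h in hist]


def color_features(pixels, bins=16):
    """pixels: list of (r, g, b) tuples with values 0-255.
    Returns, per channel: histogram, mean, standard deviation."""
    features = []
    for c in range(3):
        values = [p[c] for p in pixels]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        features += channel_histogram(values, bins) + [mean, std]
    return features


def count_colors(pixels):
    """Number of distinct (r, g, b) values in the image."""
    return len(set(pixels))
```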
2 - Classification
In this step you will train some sort of classifier. Basically, get a set of images and label each one manually as a drawing or a photo. Then use the extraction function that you wrote in step 1 to extract a vector from each image. This will be your training set, which is used as input to train a classifier. As Neil N commented, a neural network may be overkill (or maybe not?), but there are a lot of classifiers that you can use (e.g. k-NN, SVM, decision trees). You don't have to implement the classifier yourself, as you can use machine learning software such as Weka.
Finally, after you have trained your classifier, extract the vector from the image you want to test. Use this vector as input to the classifier to get a prediction of whether the image is a photo or a logo.
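As an illustration of the classification step, here is a tiny k-NN classifier in plain Python (one of the classifier types mentioned above; a tool such as Weka would normally do this for you). The function name, the Euclidean distance, and the toy two-dimensional vectors in the usage note are assumptions; it needs Python 3.8+ for math.dist.

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """training_set: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest training vectors."""
    nearest = sorted(training_set,
                     key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, with training vectors clustered around (0, 0) labelled "logo" and around (10, 10) labelled "photo", a query near (9, 9) is predicted as "photo".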
A simpler solution is to automatically send the image to google image search with the 'similar images' setting on, and see if google sends back primarily PNG results or JPEG results.
