Feature Extraction and Cross-Validation of an image dataset - machine-learning

I have a dataset consisting of fMRI images. Each image belongs to one class. The dataset is as follows:
Class 1: 9 images
Class 2: 10 images
Class 3: 6 images
Class 4: 12 images
Each image is 4D (time series), i.e. 90x60x10x350 where 350 is the time dimension (i.e. 350 3D volumes). I want to train a classifier on this data.
I want to first extract features and then apply feature selection (e.g. PCA followed by clustering), as described in the paper "Principal Feature Analysis: A Multivariate Feature Selection Method for fMRI Data" (http://www.hindawi.com/journals/cmmm/2013/645921/). For feature extraction I see the following possibilities:
1. Each voxel is a feature, and the average of each voxel's time series is taken. Each image has exactly one feature vector of dimension 90*60*10 = 54'000.
2. Each voxel is a feature and each time point (i.e. each 3D volume) is a data point. Each image has 350 feature vectors of dimension 90*60*10 = 54'000 each.
3. Put all voxels of the whole time series of an image into one feature vector of size 90*60*10*350 = 18'900'000. Each image has only one feature vector.
4. Take the correlation values between the voxels as feature values. But this is computationally not feasible.
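To make the shapes of the first three options concrete, here is a minimal NumPy sketch. It assumes a scan is stored as an array with axes (x, y, z, time); the array `scan` is random and deliberately tiny (9 x 6 x 2 x 35 instead of 90 x 60 x 10 x 350), and all variable names are illustrative.

```python
import numpy as np

# Hypothetical small stand-in for one 4D scan with axes (x, y, z, time).
# The real scans in the question are 90 x 60 x 10 x 350.
rng = np.random.default_rng(0)
scan = rng.standard_normal((9, 6, 2, 35))
n_voxels = scan.shape[0] * scan.shape[1] * scan.shape[2]
n_time = scan.shape[3]

# Option 1: average each voxel's time series -> one vector per image.
opt1 = scan.mean(axis=3).reshape(n_voxels)

# Option 2: each 3D volume is a sample -> n_time vectors per image.
opt2 = scan.reshape(n_voxels, n_time).T

# Option 3: concatenate everything -> one very long vector per image.
opt3 = scan.reshape(-1)

print(opt1.shape, opt2.shape, opt3.shape)
```

Averaging the rows of `opt2` gives back `opt1`, so option 1 is a summary of option 2.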
I prefer option 2, but I'm not sure whether it is a good idea.
How would you do the feature extraction? And how could a correlation-based approach be made computationally feasible?
Last but not least, how would you do cross-validation on the dataset? The problem is that the different classes are imbalanced.
Thank you so much for the answers beforehand.
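One common way to deal with class imbalance in cross-validation is stratification, i.e. spreading each class evenly over the folds. The following is a minimal sketch of that idea, not a recommendation specific to fMRI. The helper `stratified_folds`, the choice of 3 folds and the seed are assumptions made for illustration. Only the class sizes come from the question.

```python
import numpy as np

def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample a fold index so every fold keeps the class mix."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal the shuffled members of this class round-robin into the folds.
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

# Class sizes from the question: 9, 10, 6 and 12 images.
labels = np.repeat([1, 2, 3, 4], [9, 10, 6, 12])
fold_of = stratified_folds(labels, n_folds=3)

for k in range(3):
    test_idx = np.flatnonzero(fold_of == k)
    train_idx = np.flatnonzero(fold_of != k)
    # Per-class counts in the held-out fold (classes 1..4).
    print(k, len(train_idx), len(test_idx), np.bincount(labels[test_idx])[1:])
```

If each image contributes many samples (as in option 2), folds should be assigned per image, so that volumes from one scan never end up in both the training and the test set.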


Weights in eigenface approach

1) In the eigenface approach, each eigenface is a combination of elements from different faces. What are these elements?
2) The output face is an image composed of different eigenfaces with different weights. What exactly do the weights of the eigenfaces mean? I know that a weight is the percentage of an eigenface in the image, but what does that mean exactly? Does it mean the number of selected pixels?
Please study PCA to understand the physical meaning of eigenfaces when PCA is applied to images. The answer lies in understanding the eigenvectors and eigenvalues associated with PCA.
EigenFaces is based on Principal Component Analysis (PCA).
PCA performs dimensionality reduction: it finds the distinctive features in the training images and removes the features that the face images have in common.
Having these distinctive features makes the recognition task simpler.
Using PCA, you calculate the eigenvectors of your face image data.
From these eigenvectors you calculate the EigenFace of every training subject, or, put differently, an EigenFace for every class in your data.
So if you have 9 classes, the number of EigenFaces will be 9.
A weight usually expresses how important something is.
In EigenFaces, the weight of a particular EigenFace is a vector that tells you how important that EigenFace is in contributing to the MeanFace.
If you have 9 EigenFaces, then for every EigenFace you get exactly one weight vector of dimension N, where N is the number of eigenvectors.
So each of the N elements of a weight vector tells you how important the corresponding eigenvector is for that EigenFace.
Facial recognition in EigenFaces is done by comparing the weights of the training images and the test images with some kind of distance function.
You can refer to this GitHub link: https://github.com/jayshah19949596/Computer-Vision-Course-Assignments/blob/master/EigenFaces/EigenFaces.ipynb
The code at that link is well documented, so if you know the basics you will understand it.
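For comparison, here is a minimal sketch of the textbook PCA formulation. In that formulation each retained eigenvector is itself displayed as an eigenface, and every image gets one weight per eigenvector (its projection coefficient after subtracting the mean face). The terminology may therefore differ slightly from the description above. The random `faces` array, the 8 x 8 image size and `n_components = 5` are illustrative assumptions and are not taken from the linked notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training set: 20 "faces" of 8 x 8 pixels, flattened to rows.
faces = rng.random((20, 64))

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# SVD of the centered data: the rows of vt are the principal directions
# (eigenvectors of the pixel covariance matrix), i.e. the eigenfaces.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
n_components = 5
eigenfaces = vt[:n_components]             # shape (5, 64)

def project(image):
    """Weights: coordinates of the mean-subtracted image on each eigenface."""
    return eigenfaces @ (image - mean_face)

train_weights = centered @ eigenfaces.T    # shape (20, 5)

# Recognition by nearest neighbour in weight space (Euclidean distance).
query = faces[3] + 0.01 * rng.standard_normal(64)
distances = np.linalg.norm(train_weights - project(query), axis=1)
best_match = int(np.argmin(distances))

# A face is approximated as the mean face plus a weighted sum of eigenfaces.
reconstruction = mean_face + project(faces[3]) @ eigenfaces
```

Read this way, a weight is neither a pixel count nor a percentage. It is a signed coefficient that says how much of one eigenface must be added to the mean face to approximate the image.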

Understanding Faster rcnn

I'm trying to understand Fast(er) R-CNN, and these are the questions I'm trying to answer:
1. To train a Fast R-CNN model, do we have to provide bounding box information in the training phase?
2. If we have to provide bounding box information, what is the role of the ROI layer?
3. Can we use a pre-trained model that was trained only for classification, not object detection, for Fast(er) R-CNN?
Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size output from variable-sized regions of interest. This is done with max-pooling, but instead of the typical fixed n by n pooling cells, the region is divided into n by n non-overlapping sub-regions (which vary in size) and the maximum value in each sub-region is output. The ROI layer also does the job of projecting the bounding box from input space to feature space.
3.- Faster R-CNN MUST be used with a pretrained network (typically pretrained on ImageNet); it cannot be trained end-to-end. This might be a bit hidden in the paper, but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc.).
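The pooling described in answer 2 can be sketched in NumPy as follows. This is a simplified single-channel illustration, not the implementation of any particular framework. The helpers `project_box` and `roi_max_pool`, the stride of 16 and the toy feature map are assumptions, and the region is assumed to span at least `out_size` cells along each axis.

```python
import numpy as np

def project_box(box, stride):
    """Map an (r0, c0, r1, c1) box from image pixels to feature-map cells."""
    r0, c0, r1, c1 = box
    # Floor the start and ceil the end so the projected box covers the original.
    return (r0 // stride, c0 // stride, -(-r1 // stride), -(-c1 // stride))

def roi_max_pool(feature_map, roi, out_size):
    """Max-pool one region of a 2D feature map to an out_size x out_size grid.

    roi = (r0, c0, r1, c1) in feature-map coordinates, end-exclusive.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    # Bin edges split the region into out_size roughly equal parts per axis.
    row_edges = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled

feature_map = np.arange(48).reshape(6, 8)           # toy 6 x 8 feature map
roi_a = project_box((16, 32, 80, 112), stride=16)   # a 4 x 5 region
roi_b = project_box((0, 0, 96, 48), stride=16)      # a 6 x 3 region

pooled_a = roi_max_pool(feature_map, roi_a, out_size=2)
pooled_b = roi_max_pool(feature_map, roi_b, out_size=2)
print(pooled_a.shape, pooled_b.shape)               # both are (2, 2)
```

Regions of different sizes yield outputs of the same shape, which is what lets the subsequent fully connected layers accept them.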

Which machine learning model would be feasible for stripping the background from product photos?

My goal is to pass a product photo through the model and have it return the same photo with the product against a white background. The product photos will be of varying sizes and product types.
I'd like to feed the model photos of products with backgrounds, and those without. In the future I will also expand on the dataset with partially removed backgrounds.
If you are looking for an easy way of doing this, I'd suggest the K-means clustering algorithm. Assuming that you have a simple plain background and an image (of interest) you can obtain the RGB pixel values and use a K-means clustering algorithm with the number of clusters set to 2.
Let me explain this to you with the help of an example. Suppose you have an image of dimension 28*28 (just another arbitrary dimension). The total number of pixels in the image would be 784. Each pixel is represented as a combination of 3 RGB values ranging from 0-255.
A K-means clustering algorithm groups the pixel values into K clusters, so that pixel values within a cluster are more similar to each other than to pixel values in other clusters. This technique is especially helpful for drawing contours (borders) around objects of interest.
In this example, the K-means clustering algorithm sees 784 sample points, each represented in a 3-dimensional (RGB) space. It clusters these data points into K (2 in this example) clusters.
Here is a very simple implementation of the K-means clustering algorithm.
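A minimal sketch of the two-cluster idea, written independently of the implementation linked above. It assumes a synthetic 28 x 28 image with a plain background, uses a farthest-first initialisation (chosen here for determinism), and treats the cluster containing the top-left corner pixel as background. All of these are illustrative assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Farthest-first initialisation: deterministic and spreads the seeds out.
    centroids = [points[0]]
    for _ in range(k - 1):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d2.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Squared distance of every point to every centroid: shape (n, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):      # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, labels

# Synthetic 28 x 28 RGB image: a dark square "product" on a light background.
image = np.full((28, 28, 3), 230, dtype=np.uint8)
image[8:20, 8:20] = (40, 60, 90)

pixels = image.reshape(-1, 3).astype(float)    # 784 points in RGB space
centroids, labels = kmeans(pixels, k=2)

# Assumption: the background is the cluster that contains the corner pixel.
background = labels[0]
mask = (labels != background).reshape(28, 28)  # True where the product is

result = image.copy()
result[~mask] = 255                            # paint the background white
```

On real photos with textured or shadowed backgrounds, two colour clusters will usually not separate the product this cleanly.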
If you are looking for a more advanced machine learning approach, I'd suggest looking into deep convolutional neural networks for background removal in images. This technique has been used successfully for the task of image background removal.
Read more about it from here, here and here.

How to use wavelet decomposition for feature extraction (for fMRI images)?

I have a dataset consisting of fMRI images (from mice) which are divided into 4 groups (different drug dose levels applied). Each fMRI image is 4D, that means each voxel is a time series. For each fMRI image I want to extract one feature vector.
Now I want to use wavelet decomposition for feature extraction. In MATLAB there is no 4D wavelet decomposition, so I turn the 4D images into 3D by averaging over the time series. I could then apply a 3D wavelet decomposition and take the LL component as features, i.e. do something like this:
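% fMRI: time-averaged 3D volume; 4-level decomposition with the db4 wavelet
WT = wavedec3(fMRI, 4, 'db4');
% dec{1} holds the lowpass (approximation) coefficients
LL = WT.dec{1};
% Flatten the coefficient array into a column feature vector
feature_vector = LL(:);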
Of course afterwards feature selection algorithms (like recursive feature elimination) could be applied to reduce dimensionality.
What do you think of this approach? Are there better approaches?

How to create a single constant-length feature vector from a variable number of image descriptors (SURF)

My problem is as follows:
I have 6 types of images, or 6 classes. For example, cat, dog, bird, etc.
For every type of image, I have many variations of that image. For example, brown cat, black dog, etc.
I'm currently using a Support Vector Machine (SVM) to classify the images using one-versus-rest classification. I'm unfolding each image into a single pixel vector and using that as the feature vector for a given image. I'm getting decent classification accuracy, but I want to try something different.
I want to use image descriptors, particularly SURF features, as the feature vector for each image. The issue is that I can only have a single feature vector per image, while the feature extraction process gives me a variable number of SURF features. For example, one picture of a cat may give me 40 SURF features, while one picture of a dog gives me 68 SURF features. I could pick the n strongest features, but I have no way of guaranteeing that the chosen SURF features are ones that describe my image (for example, they could focus on the background). There's also no guarantee that ANY SURF features are found.
So my problem is: how can I take many observations (each being a SURF feature vector) and "fold" them into a single feature vector that describes the raw image and can be fed to an SVM for training?
Thanks for your help!
Typically, the SURF descriptors are quantized against a K-means dictionary (codebook) and aggregated into one L1-normalized histogram. The input to the SVM then has a fixed size, regardless of how many descriptors were found in an image.
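A minimal sketch of that quantize-and-histogram step, assuming 64-dimensional descriptors and a codebook of 16 visual words. The helper `bovw_histogram` and the random stand-in data are illustrative. In practice the codebook would be the K-means centroids of descriptors pooled from the training images.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """L1-normalised histogram of nearest-codeword assignments."""
    if len(descriptors) == 0:
        # No keypoints found: fall back to an all-zero vector.
        return np.zeros(len(codebook))
    # Squared distance of every descriptor to every codeword: shape (n, k).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
# Hypothetical codebook: 16 visual words for 64-dimensional descriptors.
codebook = rng.standard_normal((16, 64))

cat = rng.standard_normal((40, 64))    # 40 descriptors from one image
dog = rng.standard_normal((68, 64))    # 68 descriptors from another image
empty = np.empty((0, 64))              # an image with no keypoints

cat_vec = bovw_histogram(cat, codebook)
dog_vec = bovw_histogram(dog, codebook)
empty_vec = bovw_histogram(empty, codebook)
print(cat_vec.shape, dog_vec.shape, empty_vec.shape)
```

Every image maps to a vector whose length equals the codebook size, so 40, 68 or zero descriptors all produce a valid SVM input.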
