What does 'training instance' mean? - machine-learning

I am new to machine learning. I just stumbled across the term 'training instances' in a paper about using a CNN for image segmentation. In that paper, a total of 1100 images were used for modeling. The authors chose sub-regions from the images for training, and they presented a classification performance curve over 500K training instances. I am confused about how they got such a large number of training instances from only 1100 images. Does one training instance mean one training sample, or something else related to the training size?

A training instance is a single training example, not a batch. In this paper, each sub-region (patch) taken from an image is one instance, so the number of instances can be much larger than the number of images.
If you take 'n' images and split each image into 'm' sub-sections, you get n x m sub-sections.
For example, if each image is split on a 64x64 grid (4096 patches per image), you get
1100 * 4096 = 4,505,600 sub-sections of training data.
So 500K training instances is only a small fraction of the patches that could be extracted; the authors presumably sampled sub-regions until they had accumulated 500K of them.
If the images are sufficiently dense in terms of pixel resolution, and hence large in size, the sub-sections can be made smaller or allowed to overlap, which increases the number of training instances even further.
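To make the counting concrete, here is a minimal sketch of patch extraction (the 512x512 image size and non-overlapping 8x8 patches are illustrative assumptions, not values from the paper):

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split a 2-D image into non-overlapping patch_size x patch_size patches."""
    h, w = image.shape
    return [image[r:r + patch_size, c:c + patch_size]
            for r in range(0, h - patch_size + 1, patch_size)
            for c in range(0, w - patch_size + 1, patch_size)]

# A single (hypothetical) 512x512 image split into 8x8 patches
# already yields 64 * 64 = 4096 training instances.
image = np.random.rand(512, 512)
patches = extract_patches(image, 8)
print(len(patches))         # 4096 instances from one image
print(1100 * len(patches))  # 4,505,600 instances from 1100 such images
```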

An instance in a training dataset is a single observation, i.e. one record of the data.

Related

NiftyNet Selective Sampler batches not taken from mix of volumes?

I'm training on three CT volumes using the Selective Sampler to ensure that enough samples are taken around the RoI (due to class imbalance), with some random samples. I'm also augmenting the data by scaling, rotation, and flipping, which takes a significant amount of time whenever samples are created.
Setting sample_per_volume to some large value (such as 32768) and batch_size to 128, it seems like NiftyNet will do 256 iterations of 128 samples just taken from the first volume, then switch to samples only taken from the 2nd volume (with a sharp jump in loss) and so on. I want each batch of 128 samples to be a roughly even mixture of samples taken from all of the training volumes.
I've tried setting sample_per_volume to roughly 1/3 of the batch_size so that samples are reselected for each iteration, but this slows down each iteration from around 2s to 50-60s.
Am I misunderstanding something? Or is there a way around this to ensure my batches are made up of samples from a mix of all the training data? Thanks.
The samples populate a queue of length queue_length, given in the .ini file. They are then randomly taken from the queue to populate the batch.
I would make the queue_length parameter larger. Then the queue will be filled with samples from several different subjects, and each batch drawn from it will mix volumes.
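As a rough sketch, the relevant parameters in the configuration .ini file might look like the following (the values are illustrative, not taken from the question; the key point is that queue_length is several times sample_per_volume, so the shuffling queue spans more than one subject):

```ini
; Illustrative values only; keep each parameter in the same section of your
; existing NiftyNet .ini file where it currently appears.
sample_per_volume = 512
batch_size = 128
; queue_length several times sample_per_volume, so windows from more than one
; volume sit in the queue before batches are drawn from it
queue_length = 2048
```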

Correctly splitting the dataset

I have downloaded a dataset with 10 object classes for object detection. The dataset is not divided into training, validation, and testing sets. However, the author mentions in his paper that the dataset was divided into 20% training, 20% validation, and 60% testing, with images chosen randomly.
Following the author's criteria, I have randomly selected 20% of the images for training, 20% for validation, and 60% for testing.
I want to know a couple of things:
1) Do I need to put the difficult images in the training set, the validation set, or the testing set? For example, currently there are 41 difficult images in the test set, 30 in the training set, and 20 in the validation set.
2) How can I ensure that all ten object classes are equally distributed?
Updated
3) Ideally, for a balanced split, should the difficult images be equally distributed? And how much does it affect the results if the test set has more difficult images, or the training set, or the validation set?
Ten classes: Airplane, Storage tank, Baseball ground, Tennis Court, Basketball court, ground track field, Bridge, Ship, Harbor, and Vehicle.
I have 650 images in total. Among them, 466 images contain exactly one class (an image may contain more than one object of that class):
Airplane = 88 images, Storage tank = 10 images, Baseball ground = 46 images, Tennis court = 29 images, Basketball court = 32 images, Ground track field = 55 images, Bridge = 58 images, Ship = 36 images, Harbor = 27 images, and Vehicle = 85 images.
The remaining 184 images contain multiple classes.
In total there are 757 airplanes, 302 ships, 655 storage tanks, 390 baseball diamonds, 524 tennis courts, 159 basketball courts, 163 ground track fields, 224 harbors, 124 bridges, and 477 vehicles.
The most common technique is random selection. For example, if you have 1000 images you can create an array containing the name of every file, shuffle the elements with a random permutation, and then use the first 200 elements for training, the next 200 for validation, and the remaining elements for testing (in the case of a 20%/20%/60% split).
If a class is extremely unbalanced you can force the same proportion of classes in every set. To do that, apply the same procedure class by class (a stratified split), as sketched below.
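Here is a minimal sketch of both variants (assuming the image filenames are available as Python lists; split_filenames, stratified_split, and the seed argument are illustrative names, not from the original answer):

```python
import numpy as np

def split_filenames(filenames, fractions=(0.2, 0.2, 0.6), seed=0):
    """Randomly split a list of filenames into train/validation/test."""
    rng = np.random.default_rng(seed)
    shuffled = list(rng.permutation(filenames))
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def stratified_split(files_by_class, fractions=(0.2, 0.2, 0.6), seed=0):
    """Split each class separately, then merge, so every set keeps the class proportions."""
    train, val, test = [], [], []
    for class_name, files in files_by_class.items():
        tr, va, te = split_filenames(files, fractions, seed)
        train += tr
        val += va
        test += te
    return train, val, test
```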
You shouldn't choose images by hand. Even if you know that some images in your dataset are difficult, you should not hand-pick which of them go into the training, validation, and test sets.
If you want a fair comparison of your algorithm when a few images can strongly change the accuracy, you can repeat the random split several times. In some runs there will be many difficult images in the training set, and in other runs they will fall in the validation or test set. Then you can report the mean and standard deviation of your accuracy (or whatever metric you are using).
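A short sketch of that repeated-split procedure, reusing split_filenames from the sketch above (evaluate_model is a hypothetical placeholder for your own training-and-evaluation routine):

```python
import numpy as np

def repeated_evaluation(filenames, evaluate_model, n_repeats=10):
    """Repeat the random 20/20/60 split and report mean and std of the metric."""
    scores = []
    for seed in range(n_repeats):
        train, val, test = split_filenames(filenames, seed=seed)
        scores.append(evaluate_model(train, val, test))
    return np.mean(scores), np.std(scores)
```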
UPDATED:
I see, from your description you can have more than one object in an image, right?
For example, can you have two ships and one bridge?
I usually work with datasets that contain a single object in every image. To detect several objects in an image, I then scan different parts of the image looking for single objects.
The author of the paper you mention probably divided the dataset randomly. If you use a more elaborate division in a research paper, you should mention it.
Regarding your question about the effect of having more difficult images in each set, the answer is complex. It depends on the algorithm and on how similar the training images are to the validation and test images.
With a complex model (for example a neural net with many layers and neurons) you can reach whatever accuracy you want on the training set (for example 100%). If the training images are very similar to the validation and test images, the accuracy on those sets will be similar. But if they are not very similar, you have overfitted and the accuracy on the validation and test sets will be lower. To solve that you need a simpler model (for example one with fewer neurons, or one trained with a good regularization technique); in that case the training accuracy will be lower, but the validation and test accuracy will be closer to it.
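As an illustration of that last point, here is a hedged sketch in Keras of a large model next to a smaller, L2-regularised one (the layer sizes and the 100x100 input shape are made-up examples, not from the question):

```python
import tensorflow as tf

# A large classifier that can easily overfit a small training set ...
big_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(100, 100)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# ... and a simpler, L2-regularised one that usually generalises better
# when the training set is small.
small_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(100, 100)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```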

Effect of Data Parallelism on Training Result

I'm currently trying to implement multi-GPU training for a TensorFlow network. One solution would be to run one model replica per GPU, each with its own data batch, and combine their weights after each training iteration; in other words, "data parallelism".
So, for example, if I use 2 GPUs, train them in parallel, and combine their weights afterwards, shouldn't the resulting weights differ from training on those two data batches sequentially on one GPU? Both GPUs start from the same input weights, whereas the single GPU has already modified its weights before processing the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the model's variables a bit towards the minimum of the loss. A different order makes the path towards the minimum slightly different, but as long as the loss is decreasing, your model is training and its evaluation keeps improving.
Sometimes, to prevent the same batches from "pulling" the model in the same direction and making it too good only on some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
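To see why the difference is usually marginal, here is a toy sketch (plain NumPy, a single weight, squared-error loss, and made-up batches; not the questioner's actual network) comparing two sequential SGD steps with one averaged "data-parallel" step:

```python
import numpy as np

def grad(w, batch):
    """Gradient of the mean squared error 0.5 * mean((w * x - y)^2) w.r.t. w."""
    x, y = batch
    return np.mean((w * x - y) * x)

rng = np.random.default_rng(0)
batch_a = (rng.normal(size=128), rng.normal(size=128))
batch_b = (rng.normal(size=128), rng.normal(size=128))
w0, lr = 0.0, 0.1

# Sequential: one GPU sees batch A, then batch B (second step uses updated weights).
w_seq = w0 - lr * grad(w0, batch_a)
w_seq = w_seq - lr * grad(w_seq, batch_b)

# Data-parallel: two replicas start from the same weights and their updates are
# averaged, equivalent to one step on the average gradient of both batches.
w_par = w0 - lr * 0.5 * (grad(w0, batch_a) + grad(w0, batch_b))

print(w_seq, w_par)  # close, but not identical; the gap shrinks as lr gets smaller
```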

Optimizing Neural Network Input for Convergence

I'm building a neural network for image classification/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use the back-propagation algorithm to train the net.
Does the order in which I feed training examples into the net affect its convergence?
Should I feed training examples in random order?
First I will answer your questions:
Yes, it will affect its convergence.
Yes, it's encouraged to do that; it's called randomized arrangement (shuffling).
But why?
referenced from here
A common example in most ANN software is the Iris dataset, which comprises 150 instances. These belong to three different types of iris flower (Setosa, Versicolor, and Virginica), and the dataset contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first 50 cases belong to Setosa, cases 51-100 belong to Versicolor, and the rest belong to Virginica. Now, what you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 Setosa instances, then all 50 Versicolor instances, then all 50 Virginica instances. Without randomization, what the network sees at any point in training does not represent all the classes; hence it does not converge and fails to generalize.
Another example: in the past I had 100 images for each letter of the alphabet (26 classes).
When I trained on them in order (letter by letter), training failed to converge, but after I randomized the order it converged easily, because the neural network could then generalize across the alphabet.
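A minimal sketch of that shuffling step (assuming the images and labels are already loaded as NumPy arrays stored in class order; the array shapes follow the question's 10 x 1000 images of 30x30 pixels):

```python
import numpy as np

def shuffled_epochs(images, labels, n_epochs, seed=0):
    """Yield the training set in a fresh random order at every epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(len(images))
        yield images[order], labels[order]

# Hypothetical data: 10 classes x 1000 images of 30x30 greyscale pixels,
# stored in class order (all class 0 first, then class 1, ...).
images = np.zeros((10000, 30, 30), dtype=np.float32)
labels = np.repeat(np.arange(10), 1000)

for epoch_images, epoch_labels in shuffled_epochs(images, labels, n_epochs=3):
    pass  # feed the shuffled arrays to the back-propagation training loop here
```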

How to use pattern recognition on graphs/charts?

I can create time-series graphs from data (charts) as images in C#. One might be the moving average of a measured value, say 100 pixels by 100 pixels, with time on X and value on Y.
I only train with graphs of values that give a desired (or undesired) result. This means I have many (on the order of 10k) images of successes that I can use for training a NN.
The idea is to look at a current graph and establish a % match against the training data (many successful images, either compiled/summed, averaged, etc.). A high % match suggests that the same situation exists now as with previous successes.
But I cannot figure out:
Q: How do I compare images, or more basically, how do I load a current image to test against a trained NN? Do I really need 10,000 input nodes?!
There has to be a better way.
Right now I'm trying to make Encog/C# work for the image recognition/matching. There seems to be a lot of research on OCR, where a hard yes/no decision is made on the input data, but not much at all about a 'fuzzy' match against the training data...
