What are coarse and fine labels in Cifar10?

I have come across the CIFAR10 dataset in binary. It says there are two labels, a coarse label and a fine label. Can anyone explain the difference between the two, as the format states:
<1 x coarse label><1 x fine label><3072 x pixel>
<1 x coarse label><1 x fine label><3072 x pixel>
Thank You!

I'm fairly certain only CIFAR100 has both coarse labels and fine labels.
CIFAR10 only has one set of labels: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
For CIFAR100, coarse labels are the superclass to which the image belongs, like: fish, flowers or insects. Fine labels are the subclass, like:
bee, beetle, butterfly, caterpillar...
See https://www.cs.toronto.edu/~kriz/cifar.html for a complete list.
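To make the record layout from the question concrete, here is a minimal sketch of how such records could be split with NumPy. The function name `parse_cifar100_records` and the synthetic bytes are made up for illustration; the sketch assumes the layout quoted in the question (1 coarse label byte, 1 fine label byte, 3072 pixel bytes) and the channel-by-channel pixel order described on the CIFAR page.

```python
import numpy as np

RECORD_BYTES = 1 + 1 + 3072  # coarse label, fine label, 32*32*3 pixel bytes

def parse_cifar100_records(raw):
    """Split raw bytes into (coarse_labels, fine_labels, images)."""
    records = np.frombuffer(raw, dtype=np.uint8).reshape(-1, RECORD_BYTES)
    coarse = records[:, 0]  # superclass index
    fine = records[:, 1]    # subclass index
    # Assumed pixel order: 1024 red, 1024 green, 1024 blue values, row-major.
    images = records[:, 2:].reshape(-1, 3, 32, 32)
    return coarse, fine, images

# Two synthetic records standing in for the contents of a real file.
fake = bytes([4, 30] + [0] * 3072 + [1, 1] + [255] * 3072)
coarse, fine, images = parse_cifar100_records(fake)
print(coarse, fine, images.shape)
```

For a real file you would pass the bytes read from disk instead of `fake`.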


Linear Interpolation in image processing

I am going through a paper in computer vision, and I came across this line:
the L values, or the luminance values, for these pixels are then linearly and horizontally interpolated between the pixels on the (one pixel wide) brightest column in region B, and the pixels in regions A and C.
What does linear and horizontal interpolation mean?
I tried looking up linear interpolation. Does it mean that we average the values of pixels that lie on a line with each other? I can't find a proper definition.
Paper :
Every programmer should know linear interpolation!!! Especially if you're entering the domain of image-processing.
Please read this and never ever forget about it.
The paper describes pretty well what is going on. They synthesize skin texture by sampling the face and then interpolating between those samples. They sample 3 regions A, B and C.
They pick the brightest column of B, the left-most column of A and the right-most column of C.
Then for every row they linearly interpolate between the columns' pixels.
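The per-row interpolation described above can be sketched as follows. This is only an illustration under my own assumptions: `interpolate_rows` is a made-up helper, the two input columns stand for the luminance values of two sampled columns (for example the left-most column of A and the brightest column of B), and `width` is the number of pixels to fill between them, endpoints included.

```python
import numpy as np

def interpolate_rows(left_col, right_col, width):
    """Linearly blend from left_col to right_col across `width` pixels, row by row."""
    left = np.asarray(left_col, dtype=float)[:, None]    # shape (rows, 1)
    right = np.asarray(right_col, dtype=float)[:, None]  # shape (rows, 1)
    t = np.linspace(0.0, 1.0, width)[None, :]            # 0 at the left, 1 at the right
    return (1.0 - t) * left + t * right                  # shape (rows, width)

# Two rows: the first goes from L=0 to L=10, the second from L=10 to L=20.
patch = interpolate_rows([0, 10], [10, 20], width=5)
print(patch)
```

Each output row depends only on the two sample values in that same row, which is what "horizontally" means here.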

Meaning of Histogram on Tensorboard

I am working with Google TensorBoard, and I'm confused about the meaning of the Histogram Plot. I read the tutorial, but it is still unclear to me. I would really appreciate it if anyone could help me figure out the meaning of each axis of the TensorBoard Histogram Plot.
Sample histogram from TensorBoard
I came across this question earlier, while also seeking information on how to interpret the histogram plots in TensorBoard. For me, the answer came from experiments of plotting known distributions.
So, the conventional normal distribution with mean = 0 and sigma = 1 can be produced in TensorFlow with the following code:
import tensorflow as tf
cwd = "test_logs"
W1 = tf.Variable(tf.random_normal([200, 10], stddev=1.0))
W2 = tf.Variable(tf.random_normal([200, 10], stddev=0.13))
w1_hist = tf.summary.histogram("weights-stdev_1.0", W1)
w2_hist = tf.summary.histogram("weights-stdev_0.13", W2)
summary_op = tf.summary.merge_all()
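# TensorFlow 1.x API; global_variables_initializer supersedes initialize_all_variables
init = tf.global_variables_initializer()
sess = tf.Session()
writer = tf.summary.FileWriter(cwd, sess.graph)
sess.run(init)
for i in range(2):
    # write one histogram summary per step
    writer.add_summary(sess.run(summary_op), i)
writer.close()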
Here is what the result looks like:
The horizontal axis represents time steps.
The plot is a contour plot and has contour lines at the vertical axis values of -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, and 1.5.
Since the plot represents a normal distribution with mean = 0 and sigma = 1 (and remember that sigma means standard deviation), the contour line at 0 represents the mean value of the samples.
The area between the contour lines at -0.5 and +0.5 represents the area under a normal distribution curve captured within +/- 0.5 standard deviations from the mean, suggesting that it is 38.3% of the sampling.
The area between the contour lines at -1.0 and +1.0 represents the area under a normal distribution curve captured within +/- 1.0 standard deviations from the mean, suggesting that it is 68.3% of the sampling.
The area between the contour lines at -1.5 and +1.5 represents the area under a normal distribution curve captured within +/- 1.5 standard deviations from the mean, suggesting that it is 86.6% of the sampling.
The palest region extends a little beyond +/- 4.0 standard deviations from the mean, and only about 60 per 1,000,000 samples will be outside of this range.
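If you want to check these percentages yourself, the fraction of a normal distribution within +/- k standard deviations is erf(k / sqrt(2)). A small sketch using only the standard library (`fraction_within` is just a name I picked):

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

for k in (0.5, 1.0, 1.5, 4.0):
    inside = fraction_within(k)
    outside_per_million = (1.0 - inside) * 1e6
    print(f"+/-{k} sigma: {inside:.1%} inside, {outside_per_million:.0f} per million outside")
```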
While Wikipedia has a very thorough explanation, you can get the most relevant nuggets here.
Actual histogram plots will show several things. The plot regions will grow and shrink in vertical width as the variation of the monitored values increases or decreases. The plots may also shift up or down as the mean of the monitored values increases or decreases.
(You may have noted that the code actually produces a second histogram with a standard deviation of 0.13. I did this to clear up any confusion between the plot contour lines and the vertical axis tick marks.)
@marc_alain, you're a star for making such a simple script for TB; those are hard to find.
To add to what he said, the histograms show the 1, 2, and 3 sigma bands of the distribution of weights, which for a normal distribution cover about 68%, 95%, and 99.7% of the values. So if your model has 784 weights, the histogram shows how the values of those weights change with training.
These histograms are probably not that interesting for shallow models, but you could imagine that with deep networks, weights in higher layers might take a while to grow because of the logistic function being saturated. Of course I'm just mindlessly parroting this paper by Glorot and Bengio, in which they study the weight distributions through training and show how the logistic function is saturated for the higher layers for quite a while.
When plotting histograms, we put the bin limits on the x-axis and the count on the y-axis. However, the whole point of the histogram plot is to show how a tensor changes over time. Hence, as you may have already guessed, the depth axis (z-axis), containing the numbers 100 and 300, shows the epoch numbers.
The default histogram mode is Offset mode. Here the histogram for each epoch is offset in the z-axis by a certain value (to fit all epochs in the graph). This is like seeing all histograms placed one after the other, from one corner of the ceiling of the room (from the midpoint of the front ceiling edge, to be precise).
In the Overlay mode, the z-axis is collapsed, and the histograms become transparent, so you can move and hover over to highlight the one corresponding to a particular epoch. This is more like the front view of the Offset mode, with only outlines of histograms.
As explained in the documentation here:
takes an arbitrarily sized and shaped Tensor, and compresses it into a
histogram data structure consisting of many bins with widths and
counts. For example, let's say we want to organize the numbers [0.5,
1.1, 1.3, 2.2, 2.9, 2.99] into bins. We could make three bins:
a bin containing everything from 0 to 1 (it would contain one element, 0.5),
a bin containing everything from 1-2 (it would contain two elements, 1.1 and 1.3),
a bin containing everything from 2-3 (it would contain three elements: 2.2, 2.9 and 2.99).
TensorFlow uses a similar approach to create bins, but unlike in our
example, it doesn't create integer bins. For large, sparse datasets,
that might result in many thousands of bins. Instead, the bins are
exponentially distributed, with many bins close to 0 and comparatively
few bins for very large numbers. However, visualizing
exponentially-distributed bins is tricky; if height is used to encode
count, then wider bins take more space, even if they have the same
number of elements. Conversely, encoding count in the area makes
height comparisons impossible. Instead, the histograms resample the
data into uniform bins. This can lead to unfortunate artifacts in
some cases.
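The three-bin example from the quoted documentation can be reproduced with NumPy. This only illustrates the binning idea; it is not how TensorBoard builds its exponentially distributed bins.

```python
import numpy as np

values = [0.5, 1.1, 1.3, 2.2, 2.9, 2.99]
# Bin edges 0, 1, 2, 3 give the three bins 0-1, 1-2 and 2-3.
counts, edges = np.histogram(values, bins=[0, 1, 2, 3])
print(counts)  # one count per bin
print(edges)   # the bin limits that would go on the x-axis
```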
Please read the documentation further to get the full knowledge of plots displayed in the histogram tab.
The histogram plot allows you to plot variables from your graph.
w1 = tf.Variable(tf.zeros([1]),name="a",trainable=True)
For the example above the vertical axis would have the units of my w1 variable. The horizontal axis would have units of the step which I think is captured here:
summary_str = sess.run(summary_op, feed_dict=feed_dict)
summary_writer.add_summary(summary_str, step)
It may be useful to see this on how to make summaries for the tensorboard.
Each line on the chart represents a percentile in the distribution over the data: for example, the bottom line shows how the minimum value has changed over time, and the line in the middle shows how the median has changed. Reading from top to bottom, the lines have the following meaning: [maximum, 93%, 84%, 69%, 50%, 31%, 16%, 7%, minimum]
These percentiles can also be viewed as standard deviation boundaries on a normal distribution: [maximum, μ+1.5σ, μ+σ, μ+0.5σ, μ, μ-0.5σ, μ-σ, μ-1.5σ, minimum] so that the colored regions, read from inside to outside, have widths [σ, 2σ, 3σ] respectively.
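The link between the two lists above can be checked numerically: evaluating the normal CDF at each standard deviation boundary gives back the percentile of the corresponding line. A minimal sketch (`normal_cdf` is my own helper, written with the standard library):

```python
import math

def normal_cdf(z):
    """Cumulative probability of a standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Offsets from the mean, in units of sigma, for the lines between maximum and minimum.
offsets = [1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]
percentiles = [round(100 * normal_cdf(z)) for z in offsets]
print(percentiles)
```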

What is Depth of a convolutional neural network?

I was taking a look at Convolutional Neural Network from CS231n Convolutional Neural Networks for Visual Recognition. In Convolutional Neural Network, the neurons are arranged in 3 dimensions(height, width, depth). I am having trouble with the depth of the CNN. I can't visualize what it is.
In the link they said The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
For example, look at this picture. Sorry if the image is too crappy.
I can grasp the idea that we take a small area off the image, then compare it with the "Filters". So will the filters be a collection of small images? Also they said We will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron. So does the receptive field have the same dimensions as the filters? Also, what will the depth be here? And what do we signify using the depth of a CNN?
So, my question mainly is: if I take an image having dimensions of [32*32*3] (let's say I have 50000 of these images, making the dataset [50000*32*32*3]), what shall I choose as its depth, and what would the depth mean? Also, what will the dimensions of the filters be?
Also, it would be very helpful if anyone could provide a link that gives some intuition on this.
So in one part of the tutorial(Real-world example part), it says The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96].
Here we see the depth is 96. So is depth something that I choose arbitrarily, or something I compute? Also, in the example above (Krizhevsky et al.) they had a depth of 96. So what does a depth of 96 mean? Also the tutorial stated Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
So that means the depth will be like this? If so, then can I assume Depth = Number of Filters?
In Deep Neural Networks the depth refers to how deep the network is but in this context, the depth is used for visual recognition and it translates to the 3rd dimension of an image.
In this case you have an image, and the size of this input is 32x32x3, which is (width, height, depth). The neural network should be able to learn based on these parameters, as depth translates to the different channels of the training images.
In each layer of your CNN it learns regularities about training images. In the very first layers, the regularities are curves and edges; then, when you go deeper along the layers, you start learning higher levels of regularities such as colors, shapes, objects etc. This is the basic idea, but there are lots of technical details. Before going any further give this a shot : http://www.datarobot.com/blog/a-primer-on-deep-learning/
Have a look at the first figure in the link you provided. It says 'In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).' It means that a ConvNet transforms the input image by arranging its neurons in three dimensions.
As an answer to your question, depth corresponds to the different color channels of an image.
Moreover, about the filter depth. The tutorial states this.
Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
Which basically means that a filter is a small window that slides across the width and height of the image while covering its full depth, in order to learn the regularities in the image.
For the real world example I just browsed the original paper and it says this : The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels.
The tutorial refers to the depth as the channels, but in the real world you can design whatever dimension you like. After all, that is your design.
The tutorial aims to give you a glimpse of how ConvNets work in theory, but if I design a ConvNet nobody can stop me proposing one with a different depth.
Does this make any sense?
The depth of a CONV layer is the number of filters it is using.
The depth of a filter is equal to the depth of the image it is using as input.
For example: let's say you are using an image of 227*227*3.
Now suppose you are using a filter of size 11*11 (spatial size).
This 11*11 square will be slid along the whole image to produce a single 2-dimensional array as a response. But in order to do so, it must cover everything inside the 11*11 area. Therefore the depth of the filter will be the depth of the image = 3.
Now suppose we have 96 such filters, each producing a different response. This will be the depth of the convolutional layer. It is simply the number of filters used.
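The sizes in this example can be worked out with a few lines of Python. This is a sketch with a made-up helper name (`conv_output_shape`); it uses the formula quoted in the question, (W - F + 2P)/S + 1, and the stride of 4 from that quoted example.

```python
def conv_output_shape(input_shape, num_filters, field, stride=1, pad=0):
    """Return (output volume shape, shape of one filter) for a conv layer."""
    height, width, depth = input_shape
    out_h = (height - field + 2 * pad) // stride + 1
    out_w = (width - field + 2 * pad) // stride + 1
    # Each filter spans the full input depth; the output depth is the filter count.
    return (out_h, out_w, num_filters), (field, field, depth)

output_shape, filter_shape = conv_output_shape((227, 227, 3), num_filters=96, field=11, stride=4)
print(output_shape, filter_shape)
```

The input depth (3) only shows up in the filter shape, while the output depth (96) comes purely from the number of filters.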
I'm not sure why this is skimped over so heavily. I also had trouble understanding it at first, and very few outside of Andrej Karpathy (thanks d00d) have explained it. Although, in his writeup (http://cs231n.github.io/convolutional-networks/), he calculates the depth of the output volume using a different example than in the animation.
Start by reading the section titled 'Numpy examples'
Here, we go through iteratively.
In this case we have an 11x11x4. (why we start with 4 is kind of peculiar, as it would be easier to grasp with a depth of 3)
Really pay attention to this line:
A depth column (or a fibre) at position (x,y) would be the activations X[x,y,:].
A depth slice, or equivalently an activation map at depth d
would be the activations X[:,:,d].
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V is your output volume. In V[0,0,0], the first index is your column: here it is 0, the first column in your output volume.
The second index is your row: here it is 0, the first row in your output volume. The third index is the depth: here it is 0, the first output layer.
Now, here's where people get confused (at least I did). The input depth has absolutely nothing to do with your output depth. The input depth only has control of the filter depth. W in Andrej's example.
Aside: A lot of people wonder why 3 is the standard input depth. For color input images, this will always be 3 for plain ole images.
np.sum(X[:5,:5,:] * W0) + b0 (convolution 1)
Here, we are multiplying elementwise a 5x5x4 slice of the input by a weight vector W0, which is also 5x5x4, and summing the result. 5x5 is an arbitrary choice. 4 is the depth, since we need to match our input depth. The weight vector is your filter, kernel, receptive field or whatever obfuscated name people decide to call it down the road.
If you come at this from a non-Python background, that's maybe why there's more confusion, since array slicing notation is non-intuitive. The calculation is a dot product of the first 5x5x4 patch of your image with the weight vector. The output is a single scalar value, which takes the first position of your first filter's output matrix. Imagine a 4 x 4 matrix representing the sum product of each of these convolution operations across the entire input. Now stack them for each filter. That gives you your output volume. In Andrej's writeup, he starts moving along the x axis. The y axis remains the same.
Here's an example of what V[:,:,0] would look like in terms of convolutions. Remember, the third value of our index is the depth of your output layer.
[result of convolution 1, result of convolution 2, ..., ...]
[..., ..., ..., ..., ...]
[..., ..., ..., ..., ...]
[..., ..., ..., result of convolution n]
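Here is a small runnable sketch of how that output volume gets filled in. The random input, the choice of two filters and the zero biases are my own assumptions for illustration; the 11x11x4 input, the 5x5x4 filters and the stride of 2 follow the numbers discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((11, 11, 4))         # input volume, depth 4
filters = rng.standard_normal((2, 5, 5, 4))  # two filters, each 5x5 with depth 4
biases = np.zeros(2)
stride = 2
out_size = (11 - 5) // stride + 1            # 4 positions along each spatial axis

V = np.zeros((out_size, out_size, len(filters)))
for d, (W, b) in enumerate(zip(filters, biases)):
    for i in range(out_size):
        for j in range(out_size):
            # Take a 5x5 patch through the full input depth and sum the products.
            patch = X[i * stride:i * stride + 5, j * stride:j * stride + 5, :]
            V[i, j, d] = np.sum(patch * W) + b

print(V.shape)  # spatial size 4x4, depth = number of filters
```

Each filter contributes one 4x4 slice V[:,:,d], so the output depth is 2 here even though the input depth is 4.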
The animation is best for understanding this, but Andrej decided to swap it with an example that doesn't match the calculation above.
This took me a while, partly because numpy doesn't index the way Andrej does in his example; at least it didn't when I played around with it. Also, there's an assumption that the sum product operation is clear. That's the key to understanding how your output layer is created, what each value represents and what the depth is.
Hopefully that helps!
When we are doing an image classification problem, the input volume is N x N x 3. At the beginning it is not difficult to imagine what the depth means: just the number of channels (Red, Green, Blue). OK, so the meaning for the first layer is clear. But what about the next ones? Here is how I try to visualize the idea.
On each layer we apply a set of filters which convolve around the input. Let's imagine that we are currently at the first layer and we convolve around a volume V of size N x N x 3. As @Semih Yagcioglu mentioned, at the very beginning we are looking for some rough features: curves, edges, etc. Let's say we apply M filters of equal size (3x3) with stride 1. Then each of these filters is looking for a different curve or edge while convolving around V. Of course, each filter has the same depth as V: we want to supply the whole information, not just the grayscale representation.
Now, M filters will look for M different curves or edges, and each of these filters will produce a feature map consisting of scalars (the meaning of each scalar is the filter saying: the probability of having this curve here is X%). When we convolve with the same filter around the volume, we obtain this map of scalars telling us where exactly we saw the curve.
Then comes feature map stacking. Imagine stacking as the following thing. We have information about where each filter detected a certain curve. Nice, then when we stack them we obtain information about what curves / edges are available at each small part of our input volume. And this is the output of our first convolutional layer.
It is easy to grasp the idea behind non-linearity when taking into account the feature map stacking above. When we apply the ReLU function on some feature map, we say: remove all negative probabilities for curves or edges at this location. And this certainly makes sense.
Then the input for the next layer will be a volume $V_1$ carrying info about different curves and edges at different spatial locations (remember: each depth slice carries info about one curve or edge).
This means that the next layer will be able to extract information about more sophisticated shapes by combining these curves and edges. To combine them, again, the filters should have the same depth as the input volume.
From time to time we apply pooling. Its purpose is exactly to shrink the volume, since when we use stride = 1 we usually look at a pixel (neuron) too many times for the same feature.
Hope this makes sense. Look at the amazing graphs provided by the famous CS231 course to check how exactly the probability for each feature at a certain location is computed.
In simple terms, it can be explained as below.
Let's say you have 10 filters, where each filter has size 5x5x3. What does this mean? The depth of this layer is 10, which is equal to the number of filters. The size of each filter can be defined as we want, e.g., 5x5x3 in this case, where 3 is the depth of the previous layer. To be precise, the depth of each filter in the next layer should be 10 (n x n x 10), where n can be defined as you want, like 5 or something else. Hope this makes everything clear.
The first thing you need to note is
receptive field of a neuron is 3D
i.e., if the receptive field is 5x5, the neuron will be connected to 5x5x(input depth) points. So whatever your input depth is, one layer of neurons will only produce 1 layer of output.
Now, the next thing to note is
depth of output layer = depth of conv. layer
i.e., the depth of the output volume is independent of the depth of the input volume, and it only depends on the number of filters (depth). This should be pretty obvious from the previous point.
Note that the number of filters (depth of the CNN layer) is a hyperparameter. You can set it to whatever you want, independent of image depth. Each filter has its own set of weights, enabling it to learn a different feature on the same local region covered by the filter.
The depth of the network is the number of layers in the network. In the Krizhevsky paper, the depth is 9 layers (modulo a fencepost issue with how layers are counted?).
If you are referring to the depth of the filter (I came to this question searching for that) then this diagram of LeNet is illustrating
Source http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
How to create such a filter? In Python, for example: https://github.com/alexcpn/cnn_in_python/blob/main/main.py#L19-L27
This will give you a list of numpy arrays, and the length of the list is the depth.
Taking the example in the code above, but adding a depth of 3 for color (RGB), the network is shown below. The first convolutional layer uses filters of shape (5,5,3) and has depth 6.
Input (R,G,B)= [32.32.3] *(5.5.3)*6 == [28.28.6] * (5.5.6)*1 = [24.24.1] * (5.5.1)*16 = [20.20.16] *
FC layer 1 (20, 120, 16) * FC layer 2 (120, 1) * FC layer 3 (20, 10) * Softmax (10,) =(10,1) = Output
In Pytorch
import numpy as np
import torch
import torch.nn as nn
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
# Generate a random image
image_size = 32
image_depth = 3
image = np.random.rand(image_size, image_size)
# to mimic RGB channel
image = np.stack([image,image,image], axis=image_depth-1) # 0 to 2
image = np.moveaxis(image, [2, 0], [0, 2])
print("Image Shape=",image.shape)
input_tensor = torch.from_numpy(image)
m = nn.Conv2d(in_channels=3,out_channels=6,kernel_size=5,stride=1)
output = m(input_tensor.float())
print("Output Shape=",output.shape)
Image Shape= (3, 32, 32)
Output Shape= torch.Size([6, 28, 28])

Defining an (initial) set of Haar Like Features

When it comes to cascade classifiers (using haar like features) I always read that methods like AdaBoosting are used to select the 'best' features for detection. However this only works if there is some initial set of features to begin boosting.
Given a 24x24 pixel image there are 162,336 possible haar features. I might be wrong here, but I don't think libraries like openCV initially test against all of these features.
So my question is how are the initial features selected or how are they generated? Is there any guideline about the initial number of features?
And if all 162,336 features are used initially. How are they generated?
From your question I understand that you want to know what the 162,336 features are.
From the 4 original Viola-Jones features (http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework)
we can generate 162,336 features by varying the size of the 4 original features and their position on the 24*24 input image.
For example, consider one of the original features, which has two rectangles adjacent to each other.
Let us consider the size of each rectangle to be 1 pixel. Initially, if one rectangle is at (0,0) of the 24*24 image, it is considered one feature; if you move it horizontally by one pixel (to (1,0)), it is considered a second feature, as its position has changed to (1,0). In this way you can move it horizontally up to (22,0), generating 23 features. Similarly, if you move along the vertical axis from (0,0) up to (0,23), you can generate 24 features. Now if you move over the image covering every position (for example (1,1), (1,2), ..., (22,23)), you can generate 24*23=552 features.
Now consider the width of each rectangle to be 2 pixels and the height 1 pixel. Initially, if one rectangle is at (0,0) and is moved along the horizontal axis up to (20,0) as above, we get 21 features; as its height is the same, moving along the vertical axis from (0,0) to (0,23) gives 24 features. Thus, covering every position on the image gives 24*21=504 features.
In this way, if we increase the width of each rectangle by one pixel, keeping the height of each rectangle at 1 pixel and covering the complete image every time, so that the width of each rectangle changes from 1 pixel to 12 pixels (a total feature width of 2 to 24 pixels), we get no. of features = 24*(23+21+19.....3+1)
Now consider the width of each rectangle to be 1 pixel and the height 2 pixels. Initially, if one rectangle is at (0,0) and is moved along the horizontal axis up to (22,0), we get 23 features, as its width is 1 pixel; as its height is 2 pixels, moving along the vertical axis from (0,0) to (0,22) gives 23 features. Thus, covering every position on the image gives 23*23=529 features.
Similarly, if we increase the width of each rectangle by one pixel, keeping the height of each rectangle at 2 pixels and covering the complete image every time, so that the width of each rectangle changes from 1 pixel to 12 pixels, we get no. of features = 23*(23+21+19.....3+1)
Now, if we increase the height of each rectangle by 1 pixel after each sweep of the rectangle width from 1 pixel to 12 pixels, until the height of each rectangle becomes 24 pixels, then
no. of features = 24*(23+21+19.....3+1) + 23*(23+21+19.....3+1) + 22*(23+21+19.....3+1) +.................+ 2*(23+21+19.....3+1) + 1*(23+21+19.....3+1)
= 43,200 features
Now consider the 2nd original Viola-Jones feature, which has two rectangles with one rectangle above the other (that is, the rectangles are arranged vertically). As this is similar to the 1st original Viola-Jones feature, it will also have
no. of features = 43,200
Similarly, following the above process for the 3rd original Viola-Jones feature, which has 3 rectangles arranged along the horizontal direction, we get
no. of features = 24*(22+19+16+....+4+1) + 23*(22+19+16+....+4+1) + 22*(22+19+16+....+4+1) +................+ 2*(22+19+16+....+4+1) + 1*(22+19+16+....+4+1)
= 27,600 features
Now, if we consider another feature which has 3 rectangles arranged vertically (that is, one rectangle upon another), then we get
no. of features = 27,600 (as it is similar to the 3rd original Viola-Jones feature)
Lastly, if we consider 4th original viola jones feature which has 4 rectangles we get
no.of features = 23*(23+21+19+......3+1) + 21*(23+21+19+......3+1) + 19*(23+21+19+......3+1) ..................+ 3*(23+21+19+......3+1) + 1*(23+21+19+......3+1)
= 20,736
Now summing up all these features we get = 43,200 + 43,200 + 27,600 + 27,600 + 20,736
= 162,336 features
Thus, from the above 162,336 features, AdaBoost selects some of them to form a strong classifier.
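The counting above can be double-checked by brute force. In this sketch (the function name and the way shapes are encoded are my own choices) each feature type is described by its smallest size in pixels, (2,1) for two rectangles side by side, (3,1) for three side by side, (2,2) for the four-rectangle feature, and so on. Every integer multiple of that size is then placed at every position where it fits in the window.

```python
def count_haar_features(window=24, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count scaled and shifted copies of each base shape (width, height) in the window."""
    counts = {}
    for base_w, base_h in shapes:
        total = 0
        for w in range(base_w, window + 1, base_w):      # feature widths
            for h in range(base_h, window + 1, base_h):  # feature heights
                # Number of top-left positions where a w x h feature still fits.
                total += (window - w + 1) * (window - h + 1)
        counts[(base_w, base_h)] = total
    return counts

counts = count_haar_features()
print(counts)
print(sum(counts.values()))
```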
I presume, you're familiar with Viola/Jones' original work on this topic.
You start by manually choosing a feature type (e.g. Rectangle A). This gives you a mask with which you can train your weak classifiers. In order to avoid moving the mask pixel by pixel and retraining (which would take huge amounts of time and not give any better accuracy), you can specify how much the feature moves in x and y direction per trained weak classifier. The size of your jumps depends on your data size. The goal is to have the mask be able to move in and out of the detected object. The size of the feature can also be variable.
After you've trained multiple classifiers with a respective feature (i.e. mask position), you proceed with AdaBoost and Cascade training as usual.
The number of features/weak classifiers is highly dependent on your data and experimental setup (i.e. also the type of classifier you use). You'll need to test the parameters extensively to also know which type of features work best (rectangle/circle/tetris-like objects etc). I worked on this 2 years ago and it took us quite a long time to evaluate which features and feature-generation-heuristics yielded the best results.
If you wanna start somewhere, just take 1 of the 4 original Viola/Jones features and train a classifier applying it anchored to (0,0). Train the next classifier with (x,0). The next with (2x,0)....(0,y), (0,2y), (0,4y),.. (x,y), (x, 2y) etc...
And see what happens. Most likely you'll see that it's OK to have fewer weak classifiers, i.e. you can proceed to increase the x/y step values which determine how the mask slides. You can also have the mask grow or do other stuff to save time. The reason this "lazy" feature generation works is AdaBoost: as long as these features make the classifiers slightly better than random, AdaBoost will combine these classifiers into a meaningful classifier.
It seems to me that there is a little bit of confusion here.
Even the accepted answer seems not correct to me (maybe I haven’t understood it well).
The original Viola-Jones algorithm, its main later improvements such as the Lienhart-Maydt algorithm, and the OpenCV implementation all evaluate each and every feature of the feature set in turn.
You can check the source code of Opencv (and whatever implementation you prefer).
At the end of function void CvHaarEvaluator::generateFeatures() you have numFeatures, which is just 162,336 for BASIC mode and size 24x24.
And all of them are checked in turn, when all the feature set is provided in the form of featureEvaluator (source):
bool isStageTrained = tempStage->train( (CvFeatureEvaluator*)featureEvaluator, curNumSamples,
_precalcValBufSize, _precalcIdxBufSize, *((CvCascadeBoostParams*)stageParams) );
Every weak classifier is constructed by checking each feature and choosing the one that yields the best result at that point (in case of a decision tree the process is similar).
After this choice, the weights of samples are changed accordingly, so that at the next round a different feature, from all the feature set again, will be selected.
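To illustrate the select-then-reweight loop described above, here is a minimal sketch of one boosting round with decision stumps. It is not OpenCV's implementation: the function names, the tiny feature matrix and the labels are all made up, and a real trainer works on precomputed Haar feature values for thousands of samples.

```python
import numpy as np

def stump_predict(column, threshold, polarity):
    """Predict +1 where polarity * value < polarity * threshold, else -1."""
    return np.where(polarity * column < polarity * threshold, 1, -1)

def adaboost_round(features, labels, weights):
    """Scan every feature, keep the best stump, and reweight the samples."""
    best = (np.inf, None, None, None)  # (weighted error, feature index, threshold, polarity)
    for j in range(features.shape[1]):
        for threshold in np.unique(features[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(features[:, j], threshold, polarity)
                error = weights[pred != labels].sum()
                if error < best[0]:
                    best = (error, j, threshold, polarity)
    error, j, threshold, polarity = best
    error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero below
    alpha = 0.5 * np.log((1 - error) / error)  # vote of this weak classifier
    pred = stump_predict(features[:, j], threshold, polarity)
    new_weights = weights * np.exp(-alpha * labels * pred)  # boost misclassified samples
    return j, alpha, new_weights / new_weights.sum()

# Hypothetical data: 4 samples, 2 feature values each; only feature 1 separates the labels.
features = np.array([[3.0, 0.0], [3.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([1, -1, 1, -1])
weights = np.full(4, 0.25)
chosen, alpha, weights = adaboost_round(features, labels, weights)
print(chosen, alpha, weights)
```

Calling `adaboost_round` again with the returned weights would scan the whole feature set once more, which is the point made above.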
A single feature evaluation is computationally cheap, but multiplied by numFeatures it can be demanding.
The whole training of a cascade can take weeks, but the bottleneck is not the feature evaluation process; it is the negative sample gathering in the later stages.
From the wikipedia link you provided I read:
in a standard 24x24 pixel sub-window, there are a total of M= 162,336
possible features, and it would be prohibitively expensive to evaluate
them all when testing an image.
Don’t be misled by this. It means that after the long training process your detection algorithm should be very fast, as it only needs to check a few features (just the ones selected during training).

Merging two labels in connected components during the first pass

In connected components labeling, if I see that the pixel to the left and the pixel above the current pixel have the same color but different labels, can't I automatically reassign their labels to be the same (instead of doing with an equivalence table)?
Wikipedia and MathWorks assign the minimum label to the current pixel but otherwise leave the neighboring pixels the same. Then, they polish the label table with another pass. Unless I'm mistaken, my tweak will allow me to label the image uniformly in a single pass. Is there an example in which my little tweak will break the algorithm?
You wouldn't eliminate the second pass. If you did change the labels of the neighboring pixels, what about their neighboring pixels? Basically, if this event happens, you've discovered the two labels are in the same equivalence class; but you'd still have to walk over everything you've examined so far to reassign those labels. You may as well just do that on the second pass and do all the reassigning in one sweep.
You examine pixel x, it matches both pixels north and west. Suppose A is the minimum label. So you choose to label the three pixels A, but that won't relabel the other B pixel. You still have to record that A==B, and will still have to sweep through to relabel any B's that remain. Furthermore, you might later find that A itself is equivalent to some other smaller label, and you'd have to relabel all these pixels later.
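To make this concrete, here is a minimal sketch of the classic two-pass approach with a union-find equivalence table. The function name and the tiny test image are made up; it assumes 4-connectivity (north and west neighbors) and treats 0 as background.

```python
def label_components(image):
    """Two-pass 4-connected labeling; pixels with value 0 are background."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[i] is the union-find parent of label i; index 0 is unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smaller label as the root

    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0:
                continue
            north = labels[r - 1][c] if r > 0 and image[r - 1][c] == image[r][c] else 0
            west = labels[r][c - 1] if c > 0 and image[r][c - 1] == image[r][c] else 0
            if not north and not west:
                parent.append(len(parent))  # brand-new label, its own root
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(l for l in (north, west) if l)
                if north and west:
                    union(north, west)  # north and west turned out to be equivalent

    # Second pass: replace every provisional label with its representative.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels

# A U-shaped region: the two arms get different labels until the bottom row joins them.
u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
result = label_components(u_shape)
print(result)
```

In the U shape, the right arm is labeled 2 in the first two rows, and the equivalence with label 1 is only discovered at the bottom-right pixel. Relabeling just the north and west neighbors at that moment would leave the top-right pixel stale, which is why the second pass (or a full re-walk) is still needed.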
