What is the correct way of calculating MIPS? - instruction-set

I found the question below about MIPS:
A computer system has a CPU with a word length of 64 bits and a clock speed of 1.5 GHz. For a certain task, the measured average CPI (cycles per instruction) of the processor is 0.6. What is the MIPS rate of the processor?
The given answer is:
Instructions/second = cycles/second x instructions/cycle
                    = 1.5 x 10^9 x 1/0.6
                    = 2.5 x 10^9
MIPS = 2.5 x 10^9 / 10^6 = 2500 MIPS
When I searched for more examples, I found that in most places the CPU speed was taken in MHz (not in Hz) to calculate MIPS. Refer to the question and answer below.
Alternatively, divide the number of cycles per second (CPU) by the number of cycles per instruction (CPI) and then divide by 1 million to find the MIPS. For instance, if a computer with a CPU of 600 megahertz had a CPI of 3: 600/3 = 200; 200/1 million = 0.0002 MIPS.
But if I solve the second one the same way as the first, the answer would be:
= 600 x 10^6 x 1/3
= 200 x 10^6 instructions/second
= 200 MIPS
What is the correct way of calculating MIPS?
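To make the unit handling explicit, here is a minimal standalone C++ sketch of the calculation (the helper name mips and the variable names are mine, not from either quoted source). It applies the same definition both times: MIPS = (clock in Hz / CPI) / 10^6.

#include <cstdio>

// MIPS = (instructions per second) / 1e6,
// where instructions per second = clock frequency in Hz / CPI.
double mips(double clock_hz, double cpi) {
    return (clock_hz / cpi) / 1.0e6;
}

int main() {
    std::printf("1.5 GHz, CPI 0.6 -> %.0f MIPS\n", mips(1.5e9, 0.6));   // 2500 MIPS
    std::printf("600 MHz, CPI 3   -> %.0f MIPS\n", mips(600.0e6, 3.0)); // 200 MIPS
    return 0;
}

With the clock expressed in Hz and a single division by one million, the second example also comes out at 200 MIPS, which suggests the 0.0002 figure in the quoted answer divides by a million a second time after the megahertz value has already accounted for it.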

Related

Time needed to generate all of the bits in a packet... why is the packet size divided by the data rate?

I'm learning about packet switching systems and trying to understand this problem from a textbook. It's about the time needed to generate all of the bits in a packet. What we've learned so far is how to calculate the delay that happens after a packet is made, so the time to make a packet feels new to me. Can anyone help me understand why they divided the packet size by the data rate in the solution?
Information)
"Host A converts analog voice to a digital 64 kbps bit stream on the fly. Host A then groups the bits into 56-byte packets."
Answer) 56*8 / (64*1000) = 0.007 s = 7 msec
They are calculating the time needed to generate all of bits in a packet.
Each new bit is added to the packet until the packet is full. The full packet
will then be sent on its way, and a new empty packet will be created to hold the
next set of bits. Once it fills up, it will be sent also.
This means that each packet will contain bits that range from brand new ones, to
bits that have been waiting around for up to 7ms. (The age of the oldest bit in
the packet is important, because it contributes to the observed latency of the
application.)
Your bits are being created in a stream at a fixed rate of 64*1000 bits per second. In one second, 64,000 bits are generated, so one bit is generated every 1/64,000 s ≈ 0.0156 milliseconds.
Those bits are assembled into packets, where each packet contains exactly 56*8 = 448 bits. If each bit takes about 0.0156 milliseconds to create, then all 448 bits will be created in approximately 7 milliseconds.
I like to sanity-check this kind of formula by looking at the units: BITS / SECOND.
56*8 BITS / 0.007 SECONDS = 64,000 BITS/SECOND, which is exactly your bit rate.
If BITRATE = BITS / SECONDS, then by simple algebra, SECONDS = BITS / BITRATE.
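To see the same division as code, here is a minimal standalone C++ sketch (the variable names are mine) that computes the packetization delay as bits per packet divided by the bit rate:

#include <cstdio>

int main() {
    const double bit_rate_bps = 64.0 * 1000.0;              // 64 kbps voice stream
    const double packet_bits  = 56.0 * 8.0;                 // 56-byte packets

    const double per_bit_ms   = 1000.0 / bit_rate_bps;      // ~0.0156 ms per bit
    const double fill_time_s  = packet_bits / bit_rate_bps; // SECONDS = BITS / BITRATE

    std::printf("time per bit:        %.4f ms\n", per_bit_ms);           // 0.0156 ms
    std::printf("time to fill packet: %.1f ms\n", fill_time_s * 1000.0); // 7.0 ms
    return 0;
}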

Unit of digital numbers?

What is the unit of digital numbers (https://en.wikipedia.org/wiki/Numerical_digit)? For example, what is the unit of the difference of two ADC values:
10 - 2 = 8 digits
10 - 2 = 8 units
10 - 2 = 8 symbols
10 - 2 = 8 ???
Or, for example, if I want to describe a slope:
Temperature example: 2 °C per second = 2 °C/sec
ADC example: 2 ??? per second = 2 ???/sec
What is correct?
Best regards
Zlatan
Numbers don't have units by default. A unit is simply a multiplied symbol that represents the "nature" of the quantity.
First of all, figure out the LSB (least significant bit) step size of the ADC.
Example: the ADC uses a Vref of 1.2 V and has 8 bits => LSB = 1.2 V / (2^8 - 1) ≈ 4.7 mV.
A typical temperature sensor using a bipolar junction has a slope of about -2 mV/K. The example ADC with LSB = 4.7 mV will therefore respond with a change of 1 LSB per 2.35 K of temperature decrease.
A change of 1 LSB per second means you have a change of about -2.35 K per second.
If this isn't accurate enough for your application, you can use an ADC with more bits or stack several diodes acting as temperature sensors.
If you use something other than a bipolar junction, the sensitivity of the temperature sensor can be different. Just check the spec of the sensor and of the ADC (and its reference) to calculate the LSB.
Parameters you need:
Reference voltage of the ADC
Number of bits of the ADC (to calculate LSB)
Temperature coefficient of the sensor
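As a small worked sketch of the steps above (standalone C++, using the example values from this answer rather than any particular datasheet; the variable names are mine):

#include <cmath>
#include <cstdio>

int main() {
    const double vref_volts      = 1.2;    // ADC reference voltage
    const int    adc_bits        = 8;      // ADC resolution
    const double sensor_mv_per_k = -2.0;   // ~-2 mV/K for a bipolar junction

    // LSB step size: Vref / (2^N - 1)
    const double lsb_mv = 1000.0 * vref_volts / (std::pow(2.0, adc_bits) - 1.0); // ~4.7 mV

    // Temperature change corresponding to a change of 1 LSB
    const double k_per_lsb = lsb_mv / std::fabs(sensor_mv_per_k);                // ~2.35 K

    std::printf("LSB = %.2f mV\n", lsb_mv);
    std::printf("1 LSB/second corresponds to roughly -%.2f K/second\n", k_per_lsb);
    return 0;
}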

CUDA: Best number of pixel computed per thread (grayscale)

I'm working on a program to convert an image to grayscale. I'm using the CImg library. For each pixel I have to read the three R, G, B values, compute the corresponding gray value, and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:
Microarchitecture: Fermi
Compute capability (version): 2.0
Cores per SM (warp size): 32
Streaming Multiprocessors: 15
Maximum number of resident warps per multiprocessor: 48
Maximum amount of shared memory per multiprocessor: 48KB
Maximum number of resident threads per multiprocessor: 1536
Number of 32-bit registers per multiprocessor: 32K
I'm using a square grid with blocks of 256 threads.
The program can take input images of different sizes (e.g. 512x512 px, 10000x10000 px). I observed that increasing the number of pixels assigned to each thread increases performance, so it's better than computing one pixel per thread. The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to take this number into account? The following is the code executed by the kernel.
// Grid-stride loop: each thread starts at its global index and advances by the
// total number of threads in the grid, so one thread processes many pixels.
for (i = (blockIdx.x * blockDim.x) + threadIdx.x; i < width * height; i += gridDim.x * blockDim.x) {
    // Planar RGB layout: the R, G and B planes are stored one after another.
    float r = static_cast< float >(inputImage[i]);
    float g = static_cast< float >(inputImage[(width * height) + i]);
    float b = static_cast< float >(inputImage[(2 * width * height) + i]);
    // Standard luma weights, followed by a scale/offset and truncation to 8 bits.
    float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[i] = static_cast< unsigned char >(grayPix);
}
The problem is, how can I determine the number of pixels to assign to each thread statically? Computing tests with every possible number?
You've mentioned an observed characteristic of your code:
I observed that incrementing the number of the pixels assigned to each thread increments the performance,
This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple element per thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.
GPUs hide latency by having lots of available work to switch to, when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.
The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and the matrix transpose case, packing as much work into each thread means handling multiple elements per thread.
So the steps are fairly simple:
Launch as many warps as the machine can handle instantaneously
Put all remaining work in the thread code, if possible.
Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.
My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can distribute to each of the 2 SMs) and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 = 4096 elements per thread.
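To make that arithmetic concrete, here is a small standalone C++ sketch of the sizing calculation (the helper name target_threads and the variable names are mine; the GTX 480 figures are the ones quoted in the question):

#include <cstdio>

// Target thread count = SMs * max resident warps per SM * warp size.
// Beyond this, extra blocks don't add latency-hiding capacity; instead,
// give each thread more elements to process.
long long target_threads(int sms, int warps_per_sm, int warp_size) {
    return 1LL * sms * warps_per_sm * warp_size;
}

int main() {
    // Toy device from the example: 2 SMs, 4 resident warps each, warp size 32.
    long long toy      = target_threads(2, 4, 32);    // 256 threads
    long long toy_work = 1024LL * 1024LL;              // 1024 x 1024 elements
    std::printf("toy device: %lld threads, ~%lld elements per thread\n",
                toy, toy_work / toy);                  // 256 threads, 4096 elements

    // GTX 480 figures from the question: 15 SMs, 48 resident warps per SM.
    long long gtx480 = target_threads(15, 48, 32);     // 23040 threads
    std::printf("GTX 480: at least %lld threads, i.e. %lld blocks of 256\n",
                gtx480, gtx480 / 256);                 // 23040 threads, 90 blocks
    return 0;
}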
Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread, in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.
Following this method will tend to remove latency as a limiting factor for performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread, multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple elements per thread case, and this is indicating that latency is less of a limiting factor in that case.
Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, then the kernel tends to run at a speed limited by memory bandwidth.

VLFeat: computation of number of octaves for SIFT

I am trying to go through and understand some of the VLFeat code to see how they generate the SIFT feature points. One thing that baffled me early on is how they compute the number of octaves in their SIFT computation.
According to the documentation, if one provides a negative value for the initial number of octaves, it will compute the maximum, which is given by log2(min(width, height)). The corresponding code is:
if (noctaves < 0) {
  noctaves = VL_MAX (floor (log2 (VL_MIN(width, height))) - o_min - 3, 1) ;
}
This code is in the vl_sift_new function. Here o_min is supposed to be the index of the first octave (I guess one does not need to start with the full-resolution image). I am assuming this can be set to 0 in most use cases.
Still, I do not understand why they subtract 3 from this value. It seems very confusing. I am sure there is a good reason, but I have not been able to figure it out.
The reason they subtract 3 is to ensure a minimum size for the patches you're looking at, so that you get some appreciable output. When analyzing patches and extracting features, depending on the algorithm, there is a minimum patch size the feature detection needs in order to give a good output, and subtracting 3 ensures that this minimum patch size is met once you get to the lowest octave.
Let's take a numerical example. Let's say we have a 64 x 64 patch. We know that at each octave the sizes of each dimension are divided by 2. Therefore, taking the log2 of the smallest of the rows and columns theoretically gives you the total number of possible octaves, as you have noticed in the above code. In our case the rows and columns are equal, and log2(64) = 6, which gives 7 octave levels in total once you count the original 64 x 64 size. The octaves are arranged like so:
Octave | Size
--------------------
1 | 64 x 64
2 | 32 x 32
3 | 16 x 16
4 | 8 x 8
5 | 4 x 4
6 | 2 x 2
7 | 1 x 1
However, looking at octaves 5, 6 and 7 will probably not give you anything useful, so there's actually no point in analyzing those octaves. Therefore, by subtracting 3 from the total number of octaves, we stop analyzing things at octave 4, and so the smallest patch to analyze is 8 x 8.
As such, this subtraction is commonly performed when working with scale spaces in images, because it enforces that the last octave is of a useful size for analyzing features. The number 3 is arbitrary; I've seen people subtract 4 and even 5, but from all of the feature detection code I have seen, 3 seems to be the most widely used number. After all, it wouldn't really make much sense to look at an octave whose size is 1 x 1, right?
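For reference, here is a minimal standalone C++ sketch that simply mirrors the default-octave expression from vl_sift_new quoted in the question (the function name default_noctaves is mine):

#include <algorithm>
#include <cmath>
#include <cstdio>

// noctaves = max(floor(log2(min(width, height))) - o_min - 3, 1)
int default_noctaves(int width, int height, int o_min) {
    double smallest = static_cast<double>(std::min(width, height));
    int n = static_cast<int>(std::floor(std::log2(smallest))) - o_min - 3;
    return std::max(n, 1);
}

int main() {
    // 64 x 64, o_min = 0: floor(log2(64)) = 6, so noctaves = 6 - 0 - 3 = 3.
    std::printf("64x64:   %d octaves\n", default_noctaves(64, 64, 0));
    // 512 x 512, o_min = 0: floor(log2(512)) = 9, so noctaves = 9 - 0 - 3 = 6.
    std::printf("512x512: %d octaves\n", default_noctaves(512, 512, 0));
    return 0;
}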

Compute max memory bandwidth of a processor

I'm reading this CPU specification: http://ark.intel.com/products/67356/Intel-Core-i7-3612QM-Processor-6M-Cache-up-to-3_10-GHz-rPGA
It says the CPU has 2 memory channels, so I think it has two memory controllers inside. Then the max memory bandwidth should be 1.6 GHz * 64 bits * 2 * 2 = 51.2 GB/s if the supported DDR3 RAM is 1600 MHz. But the specification says its max memory bandwidth is 25.6 GB/s.
I multiplied by 2 twice here: once for the double data rate, and once for the second memory channel.
Is this a problem with the specification, or am I misunderstanding something?
Double data rate memory specs usually already take the doubled effective transfer rate into account. "1600 MHz" DDR3 really runs on an 800 MHz I/O clock and transfers data on both clock edges, so you can leave out one factor of 2 from your calculation.
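A minimal C++ sketch of the corrected arithmetic, assuming DDR3-1600 on a dual-channel controller with a 64-bit bus per channel (the variable names are mine):

#include <cstdio>

int main() {
    // DDR3-1600: 800 MHz I/O clock, data transferred on both clock edges -> 1600 MT/s.
    const double io_clock_hz     = 800.0e6;
    const double transfers_per_s = io_clock_hz * 2.0;  // the single DDR factor of 2
    const double bytes_per_xfer  = 64.0 / 8.0;         // 64-bit channel = 8 bytes
    const int    channels        = 2;                  // dual-channel controller

    const double bw = transfers_per_s * bytes_per_xfer * channels;
    std::printf("max memory bandwidth: %.1f GB/s\n", bw / 1.0e9); // 25.6 GB/s
    return 0;
}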
