Calculate power consumption when the CPU doesn't change to LPM mode in Contiki

I need to calculate the power consumption of the CPU according to this formula:
Power (mW) = cpu * 1.8 / time
where time is the sum of cpu + lpm.
I need to measure at the start of a certain process and at the end; however, the time that passes is too short and the CPU doesn't switch to LPM mode, as seen in the following values taken with powertrace_print():
all_cpu all_lpm all_transmit all_listen
116443 1514881 148 1531616
17268 1514881 148 1532440
Calculating the power consumption of the CPU I get 1.8 mW (which is exactly the figure for the CPU's draw in active mode).
My question is: how do I calculate the power consumption in this case?

If the MCU never enters LPM, then it spends all of its time in active mode, so the 1.8 mW result you get looks correct.
Perhaps you want to ask something different? If you want to measure the time required to execute a specific block of code, you can add RTIMER_NOW() calls at the start and end of the block.
The resolution of RTIMER_NOW() may be too coarse for short operations. Depending on your platform, you can use a higher-frequency timer for that, e.g. read the TBR register directly if you are compiling for an msp430-based sensor node.
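For example, a minimal sketch of timing a block with RTIMER_NOW() inside a Contiki process (block_under_test() is a placeholder, not code from the question):
#include "contiki.h"
#include "sys/rtimer.h"
#include <stdio.h>

static void block_under_test(void) { /* the code you want to time */ }

PROCESS(timing_process, "Timing example");
AUTOSTART_PROCESSES(&timing_process);

PROCESS_THREAD(timing_process, ev, data)
{
  PROCESS_BEGIN();
  {
    rtimer_clock_t start = RTIMER_NOW();
    block_under_test();
    rtimer_clock_t end = RTIMER_NOW();
    /* RTIMER_SECOND gives the number of rtimer ticks per second on the platform */
    printf("elapsed: %u ticks\n", (unsigned)(end - start));
  }
  PROCESS_END();
}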

Related

How much computing time does the kernel need

I wrote a program for an LED display. The program allows the refresh rate to be set via a web configuration. To meet the refresh rate, I measure the processing time of one loop iteration; at the end I calculate the delay and wait until the next iteration.
e.g. a refresh rate of 5 Hz -> 200 milliseconds per loop; 50 milliseconds of computing time results in a 150 millisecond delay.
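A minimal sketch of that measure-then-delay pattern, assuming Arduino-style millis()/delay() helpers (update_display() is a placeholder for the real per-frame work, not a name from the actual program):
extern unsigned long millis(void);      /* milliseconds since boot */
extern void delay(unsigned long ms);    /* wait the given number of milliseconds */
extern void update_display(void);       /* the ~50 ms of work per frame */

void loop(void)
{
    const unsigned long period_ms = 200;    /* 1000 ms / 5 Hz */
    unsigned long start = millis();
    update_display();
    unsigned long elapsed = millis() - start;
    if (elapsed < period_ms)
        delay(period_ms - elapsed);         /* wait out the rest of the frame */
}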
The ratio of processing time (50 milliseconds) to total time (200 milliseconds) indicates my program's processor load. But to find the optimal setting, I need the actual total processor load, not only that of my program. Since I don't know the real processor load inside delay() (in which WiFi etc. is handled), I don't really know the total load. In other words, I don't know how much time the system spends on system tasks during the delay(150).
Is there a way to find out how much of a delay is actually used for system tasks before the processor truly waits?
In other words, I'm looking for a way to get the kernel time within a certain time frame.
Cheers Gabriel

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate the reads/second and writes/second in my Cassandra 2.1 cluster. After searching and reading, I came to know about the JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see oneMinuteRate. I have started a brand-new cluster and started collecting these metrics from 0.
When I wrote my first record, I can see
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my writes/s figure is 0.0159911?
Or does it mean that, based on one minute of data, my write latency is 0.01599, where write latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that in the last minute your writes were occurring at a rate of 0.01599 writes per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would decay toward zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages. That means they are not simply dividing readings by elapsed time; instead, as the name implies, they take an exponentially weighted series of averages:
result(t) = (1 - w) * result(t - 1) + w * event_this_period
where w is the weighting factor and t is the tick time. In other words, each update takes roughly 20% of the new reading and 80% of the old readings; it is the same way UNIX systems compute their load averages.
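As an illustration only, a small C sketch of such an exponentially weighted rate (the 5-second tick and the decay constant are assumptions in the spirit of the Metrics library, not values taken from Cassandra):
#include <math.h>
#include <stdio.h>

int main(void)
{
  const double tick_s = 5.0;                       /* assumed update interval */
  const double window_s = 60.0;                    /* the "one minute" window */
  const double w = 1.0 - exp(-tick_s / window_s);  /* weight of the new reading */
  const long counts[] = { 1, 0, 0, 0 };            /* one write, then silence */
  double rate = 0.0;                               /* events per second */
  int t;

  for (t = 0; t < 4; t++) {
    double instant = counts[t] / tick_s;           /* rate seen in this tick */
    rate = (1.0 - w) * rate + w * instant;         /* the formula above */
    printf("after tick %d: %.5f events/s\n", t + 1, rate);
  }
  return 0;
}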
However, when this applies to requests that a server receives, consider the curve these rates draw over time after a single request (as measured with Dropwizard): one event produces a curve that decays gradually. That is really useful for spotting trends, but I am not sure it is great for monitoring live traffic, especially critical traffic.

Contiki OS CC2538: Reducing current / power consumption

I am trying to drive down the current consumption of Contiki OS running on the CC2538 development kit.
I would like to operate the device from a CR2032 with a run life of 2 years. To achieve this I would need an average current of less than 100 uA.
However, when I run the following examples at 3 V, I get these results:
contiki/examples/hello-world = 0.4mA - 2mA
contiki/examples/er-rest-example/er-example-client = 27mA
contiki/examples/er-rest-example/er-example-server = 27mA
thingsquare websocket example = 4mA
I have also designed my own target platform based on the CC2538 and get similar results.
I have read the guide at https://github.com/contiki-os/contiki/blob/648d3576a081b84edd33da05a3a973e209835723/platform/cc2538dk/README.md
and have ensured that the following are set in the contiki-conf.h file (a minimal sketch follows the list):
- LPM_CONF_ENABLE 1
- LPM_CONF_MAX_PM 2
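For reference, this is how those two settings look when defined (values exactly as in the list above; LPM_CONF_ENABLE and LPM_CONF_MAX_PM are the macro names from the question):
/* low-power-mode configuration in contiki-conf.h, as listed above */
#define LPM_CONF_ENABLE 1   /* allow the MCU to drop into low-power modes */
#define LPM_CONF_MAX_PM 2   /* deepest power mode the LPM driver may select (PM2) */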
Can anyone give me some pointers as to how I can get the current down? It would be most appreciated.
Regards,
Shane
How did you measure the current?
You have to be aware that using a basic ampere meter to measure the current consumption of Contiki OS won't give you relevant results. The system turns the radio on and off at a relatively high rate (8 Hz by default) in order to perform CCA, which may be hard for an ampere meter to catch.
To get an idea of the current consumption when the device is in deep sleep (and then calculate the averaged current consumption from that), I would rather put the device into the PM state before the program reaches the infinite while loop. I used the following code to do that:
/* force the MCU into PM2 before the main loop, to measure deep-sleep current */
lpm_enter();
REG(SYS_CTRL_PMCTL) = SYS_CTRL_PMCTL_PM2;  /* request power mode 2 (PM2) */
do { asm("wfi"::); } while(0);             /* wait for interrupt: the MCU sleeps here */
leds_on(LEDS_RED);                         /* should not reach here */
while(1) {
  ...
On the CC2538, the CCA check consumes about 10-15 mA and lasts approximately 2 ms. When the radio transmits a packet, it consumes 25 mA. Have a look at this post: Contiki UDP packet transmission duration with CC2538.
Furthermore, to save a little more current, turn off the serial communication:
#define CC2538_CONF_QUIET 1
Are you using the SmartRF board? If you want to make proper current measurements with this board, you have to remove all of the jumpers: P486, P487, P411 and P408. Keep only the jumpers for BTN_SEL and the RESET signals.

Profiling with an external counter or best alternative

I am a non-programmer trying to assess the time spent in (OpenCV) functions. We have an AD converter that comes with a counter able to count external signals (e.g. from a function generator) at a frequency of 1 MHz, i.e. 1 µs resolution. The current counter value can be queried with a function cbIn32(..., unsigned long *pointertovalue).
So my idea was to query the counter value before and after calling the function of interest and then calculate the difference. However, doubts came up when I calculated the difference without a function call in between, which revealed relatively high fluctuations (values between 80 and 400 µs or so). I wondered whether calculating the average time of a cbIn32() call (approx. 180 µs) and subtracting this from the putative time spent in the function of interest is a valid solution.
So my first two questions:
Is that approach generally feasible or useless?
Where do the fluctuations come from?
Alternatively, we tried using getTickCount(), which seemed to deliver reasonable values. But checking forums revealed that it has a low resolution of about 10 ms, which would be unsatisfactory (100 µs resolution would be appreciated). However, the values we got were in the sub-millisecond range.
This brings me to the next questions:
How can the time measured for a function with getTickCount() be in the microsecond range when the resolution is around 10 ms?
Should I trust the obtained values or not?
I also tried gprof, but it reported "no time accumulated", although I am sure that the time spent in a function containing OpenCV-related calls is at least a few milliseconds. I even tried rebuilding OpenCV with ENABLE_PROFILING=ON, but got the same result. I read somewhere that you need to build static OpenCV libraries to enable profiling, but I am not sure whether this would improve the situation. So the question here is:
What do I have to do so that gprof also "sees" OpenCV functions?
The next alternative would be the QueryPerformanceCounter() function of the WinAPI. I don't know how to use it, but I would fight my way through if you recommend it. Questions about that approach:
Will it be problematic because of multiple cores?
If yes, is there an "easy" way to handle that problem?
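For reference, a minimal sketch of the usual QueryPerformanceCounter() timing pattern (work_under_test() is a placeholder, not one of the actual OpenCV calls):
#include <windows.h>
#include <stdio.h>

static void work_under_test(void) { /* the code being measured */ }

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   /* counter ticks per second */
    QueryPerformanceCounter(&t0);
    work_under_test();
    QueryPerformanceCounter(&t1);
    printf("elapsed: %.1f us\n",
           (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart);
    return 0;
}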
I also tried verysleepy, but it somehow exits too early (it worked fine with other .exe files).
Newbie-friendly answers would be very, very appreciated. My goal is to find the easiest approach with the highest precision. I am working on Win7 64-bit, Eclipse with MinGW.
Thanks for your help...

Problem with the timings of a program that uses 1-8 threads on a server that has 4 dual-core CPUs?

I am running a program on a university server that has 4 Dual-Core AMD Opteron(tm) 2210 HE processors and runs Linux version 2.6.27.25-78.2.56.fc9.x86_64. My program implements Conway's Game of Life and runs using both pthreads and OpenMP. I timed the parallel part of the program with gettimeofday() using 1-8 threads. But the timings don't seem right: I get the biggest time with 1 thread (as expected), then the time gets smaller, but the smallest time I get is with 4 threads.
Here is an example using a 1000x1000 array.
Using 1 thread ~9.62 sec, 2 threads ~4.73 sec, 3 ~3.64 sec, 4 ~2.99 sec, 5 ~4.19 sec, 6 ~3.84 sec, 7 ~3.34 sec, 8 ~3.12 sec.
The above timings are with pthreads. With OpenMP the timings are smaller but follow the same pattern.
I expected the time to keep decreasing from 1 to 8 threads because of the 4 dual-core CPUs: since there are 4 CPUs with 2 cores each, I thought 8 threads could run at the same time. Does it have to do with the operating system the server runs?
I also tested the same programs on another server that has 7 Dual-Core AMD Opteron(tm) 8214 processors and runs Linux version 2.6.18-194.3.1.el5. There the timings are what I expected: they get smaller from 1 thread (the biggest) to 8 (the smallest execution time).
The program implements the Game of Life correctly, both with pthreads and with OpenMP; I just can't figure out why the timings look like the example above. So, in conclusion, my questions are:
1) Does the number of threads that can run at the same time on a system depend on the number of cores, only on the number of CPUs regardless of how many cores each one has, or on all of the above plus the operating system?
2) Does it have to do with the way I divide the 1000x1000 array among the threads? If that were the problem, though, wouldn't the OpenMP code show a different pattern of timings?
3) What could be the reason for such timings?
This is the code I use with openmp:
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

#define Row (1000+2)
#define Col (1000+2)

int num;
int (*temp)[Col];
int (*a1)[Col];
int (*a2)[Col];
int main() {
int i,j,l,sum;
int array1[Row][Col],array2[Row][Col];
struct timeval tim;
struct tm *tm;
double start,end;
int st,en;
for (i=0; i<Row; i++)
for (j=0; j<Col; j++)
{
array1[i][j]=0;
array2[i][j]=0;
}
array1[3][16]=1;
array1[4][16]=1;
array1[5][15]=1;
array1[6][15]=1;
array1[6][16]=1;
array1[7][16]=1;
array1[5][14]=1;
array1[4][15]=1;
a1=array1;
a2=array2;
printf ("\nGive number of threads:");
scanf("%d",&num);
gettimeofday(&tim,NULL);
start=tim.tv_sec+(tim.tv_usec/1000000.0);
omp_set_num_threads(num);
#pragma omp parallel private(l,i,j,sum)
{
printf("Number of Threads:%d\n",omp_get_num_threads());
for (l=0; l<100; l++)
{
#pragma omp for
for (i=1; i<(Row-1); i++)
{
for (j=1; j<(Col-1); j++)
{
sum=a1[i-1][j-1]+a1[i-1][j]+a1[i-1][j+1]+a1[i][j-1]+a1[i][j+1]+a1[i+1][j-1]+a1[i+1][j]+a1[i+1][j+1];
if ((a1[i][j]==1) && (sum==2||sum==3))
a2[i][j]=1;
else if ((a1[i][j]==1) && (sum<2))
a2[i][j]=0;
else if ((a1[i][j]==1) && (sum>3))
a2[i][j]=0;
else if ((a1[i][j]==0 )&& (sum==3))
a2[i][j]=1;
else if (a1[i][j]==0)
a2[i][j]=0;
}//end of iteration J
}//end of iteration I
#pragma omp barrier
#pragma omp single
{
temp=a1;
a1=a2;
a2=temp;
}
#pragma omp barrier
}//end of iteration L
}//end of paraller region
gettimeofday(&tim,NULL);
end=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("\nTime Elapsed:%.6lf\n",end-start);
printf("all ok\n");
return 0;
}
Timings with the OpenMP code:
a) System with 7 dual-core CPUs:
Using 1 thread ~7.72 sec, 2 threads ~4.53 sec, 3 threads ~3.64 sec, 4 threads ~2.24 sec, 5 ~2.02 sec, 6 ~1.78 sec, 7 ~1.59 sec, 8 ~1.44 sec
b) System with 4 dual-core CPUs:
Using 1 thread ~9.06 sec, 2 threads ~4.86 sec, 3 threads ~3.49 sec, 4 threads ~2.61 sec, 5 ~3.98 sec, 6 ~3.53 sec, 7 ~3.48 sec, 8 ~3.32 sec
Above are the timings I get.
One thing you have to remember is that you're doing this on a shared-memory architecture. The more loads/stores you try to do in parallel, the more likely you are to hit contention on memory access, which is a relatively slow operation. So typical applications, in my experience, don't benefit from more than 6 cores. (This is anecdotal; I could go into a lot of detail, but suffice it to say, take these numbers with a grain of salt.)
Try instead to minimize access to shared resources if possible, and see what that does to your performance. Otherwise, optimize for what you've got, and remember this:
Throwing more cores at a problem does not mean it will go quicker. As with taxation, there is a curve beyond which adding cores starts to become a detriment to getting the most performance out of your program. Find that "sweet spot" and use it.
You write:
"The above timings are with pthreads. With OpenMP the timings are smaller but follow the same pattern."
Congratulations, you have discovered the pattern that all parallel programs follow! If you plot execution time against the number of processors, the curve eventually flattens out and starts to rise; you reach a point where adding more processors slows things down.
The interesting question is how many processors you can profitably use, and the answer depends on many factors. #jer has pointed out some of the factors that affect the scalability of programs on shared-memory computers. Other factors, principally the ratio of communication to computation, ensure that the shape of the performance curve will be the same on distributed-memory computers too.
The other factor that is important when measuring the parallel scalability of your program is the problem size(s) you use. How does your performance curve change when you try a grid of 1414 x 1414 cells? I would expect that curve to lie below the curve for the 1000 x 1000 problem and to flatten out later.
For further reading, Google Amdahl's Law and Gustafson's Law.
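As a quick illustration of Amdahl's Law: if a fraction s of the work is serial, the speedup on N processors is at most 1 / (s + (1 - s)/N); with s = 0.1 and N = 8 that bound is about 4.7x, which is why the curve flattens well before 8x.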
It could be that your sysadmin is controlling how many threads you can execute simultaneously or how many cores you run on. I don't know whether that is possible at the sysadmin level, but it certainly is possible to tell a process that.
Or your algorithm could be using the L2 cache poorly. Hyper-threading, or whatever they call it now, works best when one thread is doing something that takes a long time and the other thread is not. Accessing memory that is not in the L2 cache is SLOW, and the thread doing so will stall while it waits. This is just one example of where the extra time of running multiple threads on a single core comes from. A quad-core memory bus might allow each core to access some of the RAM at the same time, but not each thread in each core. If both threads go for RAM, they basically run sequentially. That could be where your 4 comes from.
You might look at whether you can change your loops so they operate on contiguous RAM. If you break the problem into small blocks of data that fit in your L2 cache and iterate through those blocks, you might get 8x; a sketch of that blocking idea follows. If you search the Intel machine-language programmer's guides for their latest processors, they talk about these issues.
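A minimal sketch of that cache-blocking idea, reusing the a1/a2 arrays and Row/Col bounds from the question's code (the tile size B is an assumption to tune, not a value from the question):
#define B 64   /* tile edge; tune so a few tiles of both grids fit in L2 */

for (int ii = 1; ii < Row - 1; ii += B)
  for (int jj = 1; jj < Col - 1; jj += B)
    for (int i = ii; i < ii + B && i < Row - 1; i++)
      for (int j = jj; j < jj + B && j < Col - 1; j++) {
        int sum = a1[i-1][j-1] + a1[i-1][j] + a1[i-1][j+1]
                + a1[i][j-1]                + a1[i][j+1]
                + a1[i+1][j-1] + a1[i+1][j] + a1[i+1][j+1];
        /* same rule as above: a live cell survives with 2 or 3 neighbours, a dead cell is born with 3 */
        a2[i][j] = (sum == 3) || (a1[i][j] == 1 && sum == 2);
      }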
