When I run docker stats, I see that CPU usage is greater than 100% most of the time. My machine has 8 cores. Does the output below mean that 100% CPU corresponds to one fully occupied core, so that 690% means close to 7 cores are fully occupied?
d99e067cfffc 690.00% 5.517 GiB / 12.7 GiB 43.46% 1.47 GB / 1.03 GB 9.15 MB / 0 B 338
Exactly as you stated. You can have up to N * 100% CPU usage, where N is the number of cores you have.
By the way, you can run the container with the --cpus <your_num> flag to limit how many CPU cores it may use, if you like.
More details in the official docs.
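As a quick sanity check of the arithmetic, here is a minimal Python sketch. The helper names are made up for illustration and are not part of Docker; the only inputs are the 690.00% reading and the 8 cores from the question.

```python
def cores_in_use(cpu_percent: float) -> float:
    """Convert a docker stats CPU % reading into the equivalent number of fully busy cores."""
    return cpu_percent / 100.0


def max_cpu_percent(num_cores: int) -> float:
    """Upper bound of the CPU % column on a host with num_cores cores (N * 100%)."""
    return num_cores * 100.0


reading = 690.00   # CPU % from the docker stats line in the question
host_cores = 8     # cores on the asker's machine

print(f"{reading}% is about {cores_in_use(reading):.1f} of {host_cores} cores")
print(f"Maximum possible reading on this host: {max_cpu_percent(host_cores):.0f}%")
```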
NVIDIA specifies that their GeForce RTX 3090 (Ti) has 24 GB of memory. How are you supposed to know how much data you can fit on it, when some sources use 1 GB = 1,024^3 bytes, while other sources use 1 GB = 1,000^3 bytes? Can you assume that hardware manufacturers always use base 1,000, since that lets them write a higher number in the specifications, or do some hardware manufacturers still use base 1,024?
I would expect 1 GB = 1,024^3 as I only encountered 1 GB = 1,000^3 in the context of storage as in SSD.
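To see how far apart the two readings are, here is a small Python sketch. It does not say which convention NVIDIA actually uses; it only computes both interpretations of "24 GB".

```python
GB_DECIMAL = 1000 ** 3   # bytes in 1 GB under the base-1,000 convention
GB_BINARY = 1024 ** 3    # bytes in 1 GB under the base-1,024 convention (1 GiB)

advertised = 24  # "24 GB" from the specification

bytes_if_decimal = advertised * GB_DECIMAL
bytes_if_binary = advertised * GB_BINARY

print(f"24 x 1,000^3 = {bytes_if_decimal:,} bytes")
print(f"24 x 1,024^3 = {bytes_if_binary:,} bytes")
# The same decimal byte count expressed in 1,024^3 units:
print(f"24 x 1,000^3 bytes = {bytes_if_decimal / GB_BINARY:.2f} units of 1,024^3 bytes")
```

The gap between the two conventions is a bit over 7% at this size, so it matters when you are sizing data close to the limit.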
I executed a job in Google Cloud Dataflow and am now looking at the result in Stackdriver. I don't understand the memory chart. I used only 1 worker and later 3 workers, but the scale of this chart is on the order of TB s. Is that normal? Or is the scale perhaps GB? Also, in the metrics of this job, at one particular instant I looked at, the current memory value was 45 GB, which does not appear in this chart and is much smaller. Can someone explain this chart to me?
The Total memory usage time is one of the Dataflow metrics used to measure consumption of computing capacity (system memory in this case). This is
The total GB seconds of memory allocated to this Dataflow job.
Customers are billed for the consumed resources according to the established pricing.
Memory consumption is measured in GB-seconds. 1 GB.s is 1 second of wall clock time with 1GB of memory provisioned. Compute time is measured in 100ms increments, rounded up to the nearest increment.
Since memory usage on the chart is a time-aggregated value, values expressed in TB.s can be converted into GB.h by dividing by 3.6 (multiply by 1,000 GB per TB, then divide by 3,600 s per hour):
1 GB.h = 3.6 TB.s
The curve shape and Y-coordinate depend on the aggregation and alignment settings you use: max or mean, 1m or 1h alignment period, etc. For instance in case of a short peak load, the wide time window will act as a big denominator for the mean aligner.
Memory usage (measured in GB or TB) and memory usage time (typically measured in GB hr or TB s) are different measurements.
The Dataflow UI gives the following explanation for memory time: "The total running time for all memory used by all workers associated with your job. For example, if your job used 3GB of memory for 4 hours, the total memory time is 12 [GB] hours."
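The unit conversion can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes 1 TB = 1,000 GB, which is what the 3.6 factor above implies.

```python
SECONDS_PER_HOUR = 3600
GB_PER_TB = 1000  # assumption: decimal terabytes


def memory_time_gb_hours(memory_gb: float, hours: float) -> float:
    """Memory time for a constant memory footprint held for a given duration."""
    return memory_gb * hours


def gb_hours_to_tb_seconds(gb_hours: float) -> float:
    """Convert GB.h to TB.s (1 GB.h = 3.6 TB.s)."""
    return gb_hours * SECONDS_PER_HOUR / GB_PER_TB


def tb_seconds_to_gb_hours(tb_seconds: float) -> float:
    """Convert TB.s back to GB.h."""
    return tb_seconds * GB_PER_TB / SECONDS_PER_HOUR


# The example from the Dataflow UI: 3 GB of memory for 4 hours.
gbh = memory_time_gb_hours(3, 4)
print(f"{gbh} GB.h = {gb_hours_to_tb_seconds(gbh):.1f} TB.s")
```

This is why a job that never holds more than tens of GB at any instant can still show a chart scaled in TB.s: the chart accumulates memory over time rather than showing the instantaneous footprint.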
I'm reading this CPU specification: http://ark.intel.com/products/67356/Intel-Core-i7-3612QM-Processor-6M-Cache-up-to-3_10-GHz-rPGA
It says the CPU has 2 memory channels, so I think it has 2 memory controllers inside. The max memory bandwidth should then be 1.6 GHz * 64 bits * 2 * 2 = 51.2 GB/s if the supported DDR3 RAM runs at 1600 MHz. But the specification says its max memory bandwidth is 25.6 GB/s.
I multiplied by 2 twice here: once for the double data rate, and once for the two memory channels.
Is the specification wrong, or am I misunderstanding something?
Double data rate memory specs usually already take into account that the effective frequency is doubled. "1600 MHz memory" really runs at 800 MHz, so you can leave out one factor of 2 from your calculation.
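Worked through in Python, the corrected calculation looks roughly like this. The function and parameter names are just for illustration; the inputs (1600 MT/s effective rate, 64-bit channel, 2 channels) come from the question.

```python
def peak_bandwidth_gb_s(transfers_per_second: float, bus_width_bits: int, channels: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# "1600 MHz" DDR3 already counts both clock edges: 800 MHz clock * 2 = 1600 MT/s.
effective_rate = 800e6 * 2

print(peak_bandwidth_gb_s(effective_rate, bus_width_bits=64, channels=2))

# Applying the DDR factor a second time reproduces the asker's 51.2 GB/s:
print(peak_bandwidth_gb_s(effective_rate * 2, bus_width_bits=64, channels=2))
```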
I have an NVIDIA GTX 570 graphics card running on an Ubuntu 10.10 system with CUDA 4.0.
I know that for performance, we need to access memory efficiently, and use registers and shared memory on the device cleverly.
However, I don't understand how to calculate the number of registers available per thread, how much shared memory a single block can use, and other such simple but important quantities for a particular kernel configuration.
I want to understand this by an explicit example.
Incidentally, I am currently trying to write a particle code, in which one of the kernels should look like this.
Each block is a 1-D collection of threads, and each grid is a 1-D collection of blocks.
Number of blocks : 16384
Number of threads per block : 32 ( => total threads 32*16384 = 524288)
Each thread-block is given a 32 x 32 two-d integer array of shared memory to work with.
Within a thread I would like to store some numbers of type double. But I am not sure how many such double numbers I can store without any register spilling into local memory (which is on device). Can someone tell me how many doubles can be stored per thread for this kernel configuration?
Also, is the above-mentioned configuration for shared memory for each of my blocks valid?
A sample computation about how one would go about deducing these things would be very illustrative and helpful.
Here is the information about my GTX 570: (using deviceQuery from CUDA-SDK)
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 570"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1279 MBytes (1341325312 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
GPU Clock Speed: 1.46 GHz
Memory Clock rate: 1900.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
[deviceQuery] test results...
PASSED
Press ENTER to exit...
The kernel configuration is a little complicated, so you should use the CUDA Occupancy Calculator. You also have to study how warps work. Once a block is assigned to an SM, it is further divided into 32-thread units called warps; a warp is the unit of thread scheduling in an SM. You can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. In your case a warp consists of 32 threads, so a block with 256 threads has 8 warps. Choosing the right kernel configuration depends on your data and operations. Remember that you want to fully occupy each SM: that is, reach the full thread capacity of each SM and the maximum number of warps, so the scheduler can work around long-latency operations. Another important point is not to exceed the maximum number of threads per block, which in your case is 1024.
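For the configuration in the question, the arithmetic can be sketched roughly as follows. This is plain Python, not CUDA, and only an estimate. The first group of constants is copied from the deviceQuery output. The second group is NOT in that output: these are the limits I am assuming for compute capability 2.0 (63 registers per thread, 8 resident blocks and 48 resident warps per SM, shared memory per SM equal to the per-block limit), plus the assumptions that an int is 4 bytes and a double occupies two 32-bit registers. Check them against the Occupancy Calculator for your device.

```python
import math

# From the deviceQuery output in the question.
SHARED_MEM_PER_BLOCK = 49152    # bytes
REGISTERS_PER_BLOCK = 32768     # 32-bit registers
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024

# Assumed limits for compute capability 2.0 (not shown by deviceQuery).
MAX_REGS_PER_THREAD = 63
MAX_BLOCKS_PER_SM = 8
MAX_WARPS_PER_SM = 48
SHARED_MEM_PER_SM = 49152       # assumed equal to the per-block limit
INT_BYTES = 4                   # assumed sizeof(int)
REGS_PER_DOUBLE = 2             # a 64-bit double needs two 32-bit registers

# Kernel configuration from the question.
threads_per_block = 32
shared_bytes = 32 * 32 * INT_BYTES   # 32 x 32 int array per block

# 1. Is the shared-memory request valid?
shared_ok = shared_bytes <= SHARED_MEM_PER_BLOCK

# 2. How many blocks can be resident on one SM at a time?
warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
blocks_per_sm = min(
    MAX_BLOCKS_PER_SM,
    MAX_WARPS_PER_SM // warps_per_block,
    SHARED_MEM_PER_SM // shared_bytes,
)
occupancy = blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM

# 3. Upper bound on registers per thread, and hence on doubles held in registers.
regs_by_file = REGISTERS_PER_BLOCK // (blocks_per_sm * threads_per_block)
regs_per_thread = min(MAX_REGS_PER_THREAD, regs_by_file)
max_doubles = regs_per_thread // REGS_PER_DOUBLE

print(f"shared memory per block: {shared_bytes} bytes, valid: {shared_ok}")
print(f"resident blocks per SM: {blocks_per_sm}, occupancy: {occupancy:.1%}")
print(f"registers per thread (upper bound): {regs_per_thread}")
print(f"doubles per thread (upper bound): {max_doubles}")
```

Under these assumptions the shared-memory request is valid, but with only 32 threads per block the SM holds far fewer warps than it could, which is the occupancy problem described above. The number of doubles is only an upper bound: the compiler also needs registers for indices, addresses and temporaries, so the real figure will be lower. Compiling with nvcc --ptxas-options=-v reports the registers a kernel actually uses.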
IDE,SCSI,SSD,SATA or all of those.
I'm surprised: Figure 3 in the middle of this article, The Pathologies of Big Data, says that memory is only about 6 times faster when you're doing sequential access (350 Mvalues/sec for memory compared with 58 Mvalues/sec for disk); but it's about 100,000 times faster when you're doing random access.
Random Access Memory (RAM) takes nanoseconds to read from or write to, while hard drive (IDE, SCSI, SATA that I'm aware of) access speed is measured in milliseconds.
2016 Hardware Update: Actual read/write seq throughput
Now the Samsung 960 PRO SSD
reading at 3,500 MB/sec
writing at 2,100 MB/sec
RAM got faster too
reading at 61,000 MB/sec
writing at 48,000 MB/sec
So by this metric, RAM now looks to be about 20x faster than the SSD above, rather than the 100,000x figure from when @ChrisW wrote his answer. And SSDs are now about 10 times faster than RAM was back then.
An important consideration is that we're only measuring memory bandwidth not latency.
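The ratios can be checked with a few lines of Python, using only the throughput figures quoted in this thread (the variable names are mine):

```python
# Sequential throughput figures quoted above, in MB/sec.
ssd_read, ssd_write = 3_500, 2_100
ram_read, ram_write = 61_000, 48_000

# Earlier figures from the article's Figure 3, in Mvalues/sec.
old_ram_seq, old_disk_seq = 350, 58

print(f"RAM vs SSD, read:  {ram_read / ssd_read:.1f}x")
print(f"RAM vs SSD, write: {ram_write / ssd_write:.1f}x")
print(f"Older RAM vs disk, sequential: {old_ram_seq / old_disk_seq:.1f}x")
```

The read and write ratios bracket the "about 20x" figure. The last comparison mixes units (MB/sec here, Mvalues/sec in the article), so treat any cross-era comparison as a rough order of magnitude only.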
It's not precisely about SCSI drives, but I think that the Latency Numbers Every Programmer Should Know table could assist you in understanding the speed and the difference between different latency numbers, including storage options.
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
Here is a great visual representation that will help you to better understand the scale:
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
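The latency table also shows where the "100,000 times" figure for random access comes from. A short Python sketch, using only numbers from the table above (the dictionary keys are my own labels):

```python
# Latencies in nanoseconds, taken from the table above.
latency_ns = {
    "main_memory_reference": 100,
    "ssd_random_4k_read": 150_000,
    "memory_seq_1mb_read": 250_000,
    "ssd_seq_1mb_read": 1_000_000,
    "disk_seek": 10_000_000,
    "disk_seq_1mb_read": 20_000_000,
}

random_disk_vs_ram = latency_ns["disk_seek"] // latency_ns["main_memory_reference"]
random_ssd_vs_ram = latency_ns["ssd_random_4k_read"] // latency_ns["main_memory_reference"]
seq_disk_vs_ram = latency_ns["disk_seq_1mb_read"] // latency_ns["memory_seq_1mb_read"]
seq_ssd_vs_ram = latency_ns["ssd_seq_1mb_read"] // latency_ns["memory_seq_1mb_read"]

print(f"Random access, disk vs RAM: {random_disk_vs_ram:,}x slower")
print(f"Random access, SSD vs RAM:  {random_ssd_vs_ram:,}x slower")
print(f"Sequential 1 MB, disk vs RAM: {seq_disk_vs_ram}x slower")
print(f"Sequential 1 MB, SSD vs RAM:  {seq_ssd_vs_ram}x slower")
```

So the gap depends heavily on the access pattern: a disk seek compared with a single memory reference gives the five-orders-of-magnitude figure, while sequential reads differ by well under 100x.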
RAM is 100 Thousand Times Faster than Disk for Database Access from
http://www.directionsmag.com/articles/ram-is-100-thousand-times-faster-than-disk-for-database-access/123964
Accessing the RAM is in the order of nanoseconds ( 10e-9 seconds ),
while accessing data on the disk or the network is in the order of
milliseconds (10e-3 seconds).
from Node.JS Design Patterns