I'd like to know if my PyTorch code is fully utilizing the GPU SMs. According to this question, GPU-Util in nvidia-smi only shows the fraction of time during which at least one SM was in use.
I also saw that typing nvidia-smi dmon gives the following table:
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 132 71 - 58 18 0 0 6800 1830
One would think that sm% is SM utilization, but I couldn't find any documentation on what sm% actually means. The number it reports is exactly the same as GPU-Util in nvidia-smi.
Is there any way to check the SM utilization?
On a side note, is there any way to check memory bandwidth utilization?
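As a side note on tooling: the NVML library that nvidia-smi itself reads from can be polled from Python. Below is a minimal sketch, assuming the pynvml package is installed and the GPU of interest is device 0:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes GPU 0

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu: % of time at least one kernel was executing (same as GPU-Util)
    # util.memory: % of time the memory controller was busy reading/writing
    print(f"gpu={util.gpu}% mem_controller={util.memory}%")
    time.sleep(0.5)

pynvml.nvmlShutdown()

Note that util.gpu is the same coarse metric as GPU-Util, and util.memory is the percent of time the memory controller was busy, which is only a rough proxy for bandwidth utilization. Neither is a true per-SM occupancy figure; for that you would need a profiler such as Nsight Compute.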
I have a use case where I need to downscale a 716x1280 mp4 video to 358x640 (half of the original). Command that I used is
ffmpeg -i ./input.mp4 -vf "scale=640:640:force_original_aspect_ratio=decrease,pad=ceil(iw/2)*2:ceil(ih/2)*2" ./output.mp4
Out of 10 sample videos, 2 of them suffered a visible impact on colors. Below I have attached a comparison from the one that was impacted the most.
NOTE: The frame on the right is from the original video and the frame on the left is from the processed (downscaled) video. Notice the colors red and green in the image (even the skin color and hair color changed).
What I am looking for is:
Is there any way I can prevent changes like these from happening? Probably some flag for saturation, brightness, contrast, or some other parameter.
I am assuming that ffmpeg uses some default settings while downscaling a video. What made ffmpeg change the colors for only these two videos? If it made similar changes for the rest of the videos as well, how can I predict this behaviour beforehand?
EDIT:
What have I already tried?
-crf with values 0 and 18.
-preset veryslow as mentioned here
Neither helped.
Mediainfo input vs. output:

param                    | input                | output
-------------------------|----------------------|-----------------------------------
color range              | Limited              | N/A (attribute not in description)
color primaries          | BT.2020              | N/A (attribute not in description)
transfer characteristics | HLG                  | N/A (attribute not in description)
matrix coefficients      | BT.2020 non-constant | N/A (attribute not in description)
bit depth                | 8                    | 8
Logs of the ffmpeg command
ffmpeg -i ./input.mp4 -vf "scale=640:640:force_original_aspect_ratio=decrease,pad=ceil(iw/2)*2:ceil(ih/2)*2" -movflags +faststart ./output.mp4
ffmpeg version 4.3.1 Copyright (c) 2000-2020 the FFmpeg developers
built with Apple clang version 12.0.0 (clang-1200.0.32.28)
configuration: --prefix=/usr/local/Cellar/ffmpeg/4.3.1_9 --enable-shared --enable-pthreads --enable-version3 --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-libsoxr --enable-videotoolbox --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack
libavutil 56. 51.100 / 56. 51.100
libavcodec 58. 91.100 / 58. 91.100
libavformat 58. 45.100 / 58. 45.100
libavdevice 58. 10.100 / 58. 10.100
libavfilter 7. 85.100 / 7. 85.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 7.100 / 5. 7.100
libswresample 3. 7.100 / 3. 7.100
libpostproc 55. 7.100 / 55. 7.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './input.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.45.100
Duration: 00:00:30.05, start: 0.000000, bitrate: 10366 kb/s
Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(tv, bt2020nc/bt2020/arib-std-b67), 716x1280, 10116 kb/s, 30 fps, 30 tbr, 19200 tbn, 38400 tbc (default)
Metadata:
handler_name : Core Media Video
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 245 kb/s (default)
Metadata:
handler_name : Core Media Audio
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))
Stream #0:1 -> #0:1 (aac (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 0x7faab4808800] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
[libx264 @ 0x7faab4808800] profile High, level 3.0, 4:2:0, 8-bit
[libx264 @ 0x7faab4808800] 264 - core 161 r3027 4121277 - H.264/MPEG-4 AVC codec - Copyleft 2003-2020 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=12 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to './output.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.45.100
Stream #0:0(und): Video: h264 (libx264) (avc1 / 0x31637661), yuv420p, 358x640, q=-1--1, 30 fps, 15360 tbn, 30 tbc (default)
Metadata:
handler_name : Core Media Video
encoder : Lavc58.91.100 libx264
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default)
Metadata:
handler_name : Core Media Audio
encoder : Lavc58.91.100 aac
[mp4 @ 0x7faab5808800] Starting second pass: moving the moov atom to the beginning of the file
frame= 901 fps=210 q=-1.0 Lsize= 3438kB time=00:00:30.02 bitrate= 938.0kbits/s speed=7.01x
video:2933kB audio:472kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.974633%
[libx264 @ 0x7faab4808800] frame I:6 Avg QP:22.60 size: 20769
[libx264 @ 0x7faab4808800] frame P:228 Avg QP:24.84 size: 7657
[libx264 @ 0x7faab4808800] frame B:667 Avg QP:27.59 size: 1697
[libx264 @ 0x7faab4808800] consecutive B-frames: 0.9% 0.9% 1.0% 97.2%
[libx264 @ 0x7faab4808800] mb I I16..4: 9.5% 64.6% 26.0%
[libx264 @ 0x7faab4808800] mb P I16..4: 2.5% 12.2% 2.5% P16..4: 37.2% 20.6% 11.2% 0.0% 0.0% skip:13.7%
[libx264 @ 0x7faab4808800] mb B I16..4: 0.4% 2.1% 0.2% B16..8: 42.2% 7.1% 1.2% direct: 1.8% skip:44.9% L0:39.4% L1:52.8% BI: 7.8%
[libx264 @ 0x7faab4808800] 8x8 transform intra:72.2% inter:74.2%
[libx264 @ 0x7faab4808800] coded y,uvDC,uvAC intra: 61.8% 67.2% 20.2% inter: 16.7% 13.9% 1.3%
[libx264 @ 0x7faab4808800] i16 v,h,dc,p: 24% 19% 7% 50%
[libx264 @ 0x7faab4808800] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 21% 16% 15% 6% 9% 11% 7% 10% 6%
[libx264 @ 0x7faab4808800] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 25% 16% 13% 7% 9% 10% 7% 9% 4%
[libx264 @ 0x7faab4808800] i8c dc,h,v,p: 53% 16% 26% 5%
[libx264 @ 0x7faab4808800] Weighted P-Frames: Y:3.9% UV:1.8%
[libx264 @ 0x7faab4808800] ref P L0: 57.8% 19.5% 14.8% 7.8% 0.1%
[libx264 @ 0x7faab4808800] ref B L0: 90.7% 7.2% 2.1%
[libx264 @ 0x7faab4808800] ref B L1: 95.3% 4.7%
[libx264 @ 0x7faab4808800] kb/s:799.80
[aac @ 0x7faab2036a00] Qavg: 189.523
We can use a bitstream filter to set the H.264 metadata.
When a video player plays a video file, it looks for metadata attached to the video stream (H.264 metadata, for example).
The H.264 metadata parameters that affect colors and brightness are: video_full_range_flag, colour_primaries, transfer_characteristics and matrix_coefficients.
If the parameters are not set, defaults apply.
The defaults for low-resolution video are "Limited Range" BT.601 (in most players; I am not sure about macOS).
The default gamma curve (which affects brightness) is the sRGB gamma curve.
The player converts the pixels from the YUV color space to RGB (for displaying the video). The conversion formula is chosen according to the metadata.
Your input video file input.mp4 has H.264 metadata parameters that are far from the defaults.
We can assume that the scale video filter does not change the color characteristics (the filter operates on the YUV components without converting to RGB).
The characteristics of input.mp4 specify BT.2020 and the HLG gamma curve, but the player converts the output as if the defaults applied (BT.601 and sRGB gamma), so the colors and brightness are very different from what they should have been.
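To make the mismatch concrete, here is a small illustrative Python sketch (my own illustration, not FFmpeg's code path) that decodes one and the same limited-range 8-bit YCbCr pixel with the BT.601 and BT.2020 matrices; the sample pixel value is an arbitrary choice, and the HLG-vs-sRGB gamma mismatch adds further error on top of this:

def yuv_to_rgb(y, cb, cr, kr, kb):
    # Generic limited-range 8-bit YCbCr -> RGB, parametrized by the
    # matrix coefficients Kr and Kb (Kg = 1 - Kr - Kb).
    kg = 1.0 - kr - kb
    yn = (y - 16) * 255.0 / 219.0     # expand limited-range luma
    cbn = (cb - 128) * 255.0 / 224.0  # expand limited-range chroma
    crn = (cr - 128) * 255.0 / 224.0
    r = yn + 2.0 * (1.0 - kr) * crn
    b = yn + 2.0 * (1.0 - kb) * cbn
    g = (yn - kr * r - kb * b) / kg
    return tuple(round(max(0.0, min(255.0, c))) for c in (r, g, b))

pixel = (145, 54, 34)  # an arbitrary, strongly green sample pixel
print(yuv_to_rgb(*pixel, kr=0.299, kb=0.114))    # decoded with BT.601
print(yuv_to_rgb(*pixel, kr=0.2627, kb=0.0593))  # decoded with BT.2020

The two printed RGB triples differ noticeably for saturated colors, which matches reds and greens being the most visibly shifted.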
When FFmpeg encodes a video stream, it does not copy the metadata parameters from the input to the output; we need to set the parameters explicitly.
The solution is to use a bitstream filter to set the metadata parameters.
Try using the following command:
ffmpeg -i ./input.mp4 -vf "scale=640:640:force_original_aspect_ratio=decrease,pad=ceil(iw/2)*2:ceil(ih/2)*2" -vcodec libx264 -crf 17 -pix_fmt yuv420p -bsf:v h264_metadata=video_full_range_flag=0:colour_primaries=9:transfer_characteristics=18:matrix_coefficients=9 ./output.mp4
video_full_range_flag=0 applies "limited color range".
colour_primaries=9 applies BT.2020 colour primaries.
transfer_characteristics=18 applies HLG gamma (see ITU-T Rec. Series H).
matrix_coefficients=9 applies BT.2020 matrix coefficients.
Most of the parameters are documented in ITU-T Rec. H.264 (section E.2.1).
Checking the parameters of output.mp4 using MediaInfo tool:
Color range : Limited
Color primaries : BT.2020
Transfer characteristics : HLG
Matrix coefficients : BT.2020 non-constant
I'm trying to transfer one of the ImageNet-pretrained architectures from keras.applications to CIFAR-10, but I'm getting a CUDA error that crashes my Jupyter notebook kernel immediately on the last line, when I try to fit my model. What could be going wrong?
Output:
2019-01-10 00:39:40.165264: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-01-10 00:39:40.495421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.2405
pciBusID: 0000:01:00.0
totalMemory: 11.93GiB freeMemory: 11.63GiB
2019-01-10 00:39:40.495476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-10 00:39:40.819773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-10 00:39:40.819812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-10 00:39:40.819819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-10 00:39:40.820066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 11256 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
2019-01-10 00:39:40.844280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-10 00:39:40.844307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-10 00:39:40.844313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-10 00:39:40.844317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-10 00:39:40.844520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11256 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
[I 00:40:58.262 NotebookApp] Saving file at /Untitled.ipynb
2019-01-10 00:42:56.543392: F tensorflow/stream_executor/cuda/cuda_dnn.cc:542] Check failed: cudnnSetTensorNdDescriptor(handle_.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (3 vs. 0)batch_descriptor: {count: 32 feature_map_count: 320 spatial: 0 0 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}
Code:
from keras.applications.inception_resnet_v2 import InceptionResNetV2
from keras.preprocessing import image
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
import keras.utils
import numpy as np
from keras.datasets import cifar10

# Load CIFAR-10 (32x32 RGB images) and one-hot encode the labels
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Define model
base_model = InceptionResNetV2(weights='imagenet', include_top=False)
x = base_model.output
print(x.shape)
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
preds = Dense(10, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=preds)

# Only fine-tune last layer
for layer in base_model.layers:
    layer.trainable = False
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# This is the line that crashes the kernel
model.fit(X_train, y_train, batch_size=32)
Check the requirements for an input to the InceptionResNetV2 network:
It should have exactly 3 input channels, and width and height should
be no smaller than 75.
And you are trying to fit CIFAR-10 images, which are 32x32.
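A minimal sketch of one possible fix (the 96x96 target size and nearest-neighbour UpSampling2D are illustrative choices, not the only option) is to upsample the 32x32 images inside the model so the pretrained base sees an input it accepts:

from keras.applications.inception_resnet_v2 import InceptionResNetV2
from keras.layers import Input, UpSampling2D, GlobalAveragePooling2D, Dense
from keras.models import Model

# The pretrained base needs inputs of at least 75x75, so fix its input shape
base_model = InceptionResNetV2(weights='imagenet', include_top=False,
                               input_shape=(96, 96, 3))

inputs = Input(shape=(32, 32, 3))      # raw CIFAR-10 images
x = UpSampling2D(size=(3, 3))(inputs)  # 32x32 -> 96x96
x = base_model(x)
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
preds = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=preds)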
Why are there different data rates in each WLAN protocol?
For example: 802.11 supports 1 and 2 Mbps,
802.11a supports 6, 9, 12, 18, 24, 36, 48 and 54 Mbps,
802.11b supports 1, 2, 5.5 and 11 Mbps,
etc.
802.11 and 802.11b:
Each data bit is converted into multiple bits of information for protection against errors due to noise or interference. Each of the new coded bits is called a chip. The different data rates use different chipping methods.
For example:
1 and 2 Mbps use the Barker code.
5.5 and 11 Mbps use Complementary Code Keying (CCK).
Both run at 11 Mchips/s.
The Barker code has an 11-chip code per symbol and CCK has an 8-chip code per symbol, so the symbol rate is 11,000,000 / 11 = 1 Msps for the Barker code and 11,000,000 / 8 = 1.375 Msps for CCK.
For the Barker code:
DBPSK modulates 1 data bit per symbol => 1 bit * 1 Msps = 1 Mbps
DQPSK modulates data bits in pairs => 2 bits * 1 Msps = 2 Mbps
For CCK:
4 bits * 1.375 Msps = 5.5 Mbps
8 bits * 1.375 Msps = 11 Mbps
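As a quick sanity check of this arithmetic, here is a small Python sketch (just an illustration; the constants come from the paragraph above):

def dsss_rate_mbps(chip_rate, chips_per_symbol, bits_per_symbol):
    # Data rate = symbol rate * bits carried per symbol
    symbol_rate = chip_rate / chips_per_symbol
    return symbol_rate * bits_per_symbol / 1e6

CHIP_RATE = 11e6  # both Barker and CCK run at 11 Mchips/s
print(dsss_rate_mbps(CHIP_RATE, 11, 1))  # Barker + DBPSK -> 1.0 Mbps
print(dsss_rate_mbps(CHIP_RATE, 11, 2))  # Barker + DQPSK -> 2.0 Mbps
print(dsss_rate_mbps(CHIP_RATE, 8, 4))   # CCK, 4 bits    -> 5.5 Mbps
print(dsss_rate_mbps(CHIP_RATE, 8, 8))   # CCK, 8 bits    -> 11.0 Mbps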
802.11g (and 802.11a):
These standards use the OFDM modulation scheme. Look at the modulation types and the raw (uncoded) rates they allow:
BPSK (1 bit per subcarrier symbol) => max raw rate 12 Mbps
QPSK (2 bits per subcarrier symbol) => max raw rate 24 Mbps
16-QAM (4 bits per subcarrier symbol) => max raw rate 48 Mbps
64-QAM (6 bits per subcarrier symbol) => max raw rate 72 Mbps
Each modulation type is combined with a code rate for error correction, which yields the advertised data rates:
BPSK 1/2 => 6 Mbps
BPSK 3/4 => 9 Mbps
QPSK 1/2 => 12 Mbps
and so on.
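The same arithmetic can be written out explicitly (a sketch; 48 data subcarriers and a 4 microsecond OFDM symbol are the 802.11a/g parameters):

def ofdm_rate_mbps(bits_per_subcarrier, coding_rate):
    # 802.11a/g OFDM: 48 data subcarriers, one symbol every 4 microseconds
    data_subcarriers = 48
    symbol_rate = 1 / 4e-6  # 250,000 OFDM symbols per second
    return data_subcarriers * bits_per_subcarrier * coding_rate * symbol_rate / 1e6

print(ofdm_rate_mbps(1, 1/2))  # BPSK 1/2   -> 6.0 Mbps
print(ofdm_rate_mbps(1, 3/4))  # BPSK 3/4   -> 9.0 Mbps
print(ofdm_rate_mbps(2, 1/2))  # QPSK 1/2   -> 12.0 Mbps
print(ofdm_rate_mbps(6, 3/4))  # 64-QAM 3/4 -> 54.0 Mbps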
I am using two graphics cards, and on the GeForce GTX 980 with 4 GB, where I compute my neural network, the utilization keeps jumping from 0% to 99% and back (repeatedly) at the last line of the pasted shell output.
The first calculation took around 90 seconds. I feed my images into the neural network one after another (in a for-loop). The following calculations only need 20 seconds (3 epochs), and the GPU stays between 96% and 100%.
Why is it jumping at the beginning?
I use the flag:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
Can I be sure that the number nvidia-smi -lms 50 shows me really reflects the memory being used?
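One way to cross-check the nvidia-smi readout (a sketch, assuming the pynvml package is installed and the GTX 980 is device 0) is to query the per-process memory directly from NVML, which is the same source nvidia-smi reads from:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the GTX 980 is device 0
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # usedGpuMemory is in bytes (None if the driver hides the value)
    if proc.usedGpuMemory is not None:
        print("pid=%d used=%.0f MiB" % (proc.pid, proc.usedGpuMemory / 1024.0 ** 2))
pynvml.nvmlShutdown()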
2017-08-10 16:33:24.836084: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-10 16:33:24.836100: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-10 16:33:25.052501: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-10 16:33:25.052861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2155
pciBusID 0000:03:00.0
Total memory: 3.94GiB
Free memory: 3.87GiB
2017-08-10 16:33:25.187760: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x8532640 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-08-10 16:33:25.188006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-10 16:33:25.188291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: GeForce GT 730
major: 3 minor: 5 memoryClockRate (GHz) 0.9015
pciBusID 0000:02:00.0
Total memory: 1.95GiB
Free memory: 1.45GiB
2017-08-10 16:33:25.188312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 1
2017-08-10 16:33:25.188319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 0
2017-08-10 16:33:25.188329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1
2017-08-10 16:33:25.188335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y N
2017-08-10 16:33:25.188339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: N Y
2017-08-10 16:33:25.188348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:03:00.0)
Epoche: 0001 cost= 0.620101001 time= 115.366318226
Epoche: 0004 cost= 0.335480299 time= 19.4528050423