Why is multi-GPU training faster than single-GPU in Caffe? - machine-learning

In the same hardware/software environment, with the same net and solver, the two runs differ only in the command line.
With this command:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=6
it takes about 50 seconds per 100 iterations.
With this command:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=4,5,6,7
it takes about 48 seconds per 100 iterations.
Normally I would expect multi-GPU training to take more time per iteration than single-GPU training because of overhead such as parameter replication. Can anyone tell me why it is faster here? Thanks very much!
Env:
2 * Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
8 * Nvidia Tesla V100 PCIE 16GB
Caffe 1.0.0 / use_cudnn on
CUDA 9.0.176
cuDNN 6.0.21
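For what it's worth, Caffe's multi-GPU training is data-parallel, and as far as I understand it the batch size in the train prototxt is applied per GPU, so 100 iterations on 4 GPUs push roughly 4x as many images through the network as 100 iterations on one GPU. A rough throughput comparison under that assumption (the per-GPU batch size of 64 below is purely hypothetical):

# Hypothetical per-GPU batch size; substitute the value from your train prototxt.
batch_per_gpu = 64
single_gpu_throughput = (100 * batch_per_gpu * 1) / 50.0  # --gpu=6, ~50 s per 100 iters
multi_gpu_throughput = (100 * batch_per_gpu * 4) / 48.0   # --gpu=4,5,6,7, ~48 s per 100 iters
print(single_gpu_throughput, multi_gpu_throughput)        # images per second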

Related

When using PySpark pandas_udf, Python workers use too much memory and exceed the memory limit

Spark version is 2.4.0; my cluster has four nodes, and each node has 16 CPUs and 128 GB of RAM.
I am using a Jupyter notebook connected to PySpark. The workflow reads Kudu data with Spark and then processes it with a pandas UDF. On the terminal I start PySpark with:
PYSPARK_DRIVER_PYTHON="/opt/anaconda2/envs/py3/bin/jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark2 \
  --jars kudu-spark2_2.11-1.7.0-cdh5.15.0.jar \
  --conf spark.executor.memory=40g --conf spark.executor.memoryOverhead=5g \
  --num-executors=4 --executor-cores=8 \
  --conf yarn.nodemanager.vmem-check-enabled=false
My dataset is only 6 GB with 32 partitions. When running, I can see that each node has one executor containing 8 Python workers, and each Python worker uses 6 GB of memory! The container is killed by YARN because of the memory limit.
Container killed by YARN for exceeding memory limits. 45.1 GB of 45 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I'm confused: why does it take up so much memory? The data size for each partition is only ~200 MB. Isn't pandas_udf supposed to avoid serialization and deserialization overhead by using PyArrow? Maybe Jupyter causes the problem?
I would be very grateful if anyone could help me.
This is my code:
from pyspark.sql import functions as F
from pyspark.sql.functions import rand, randn, pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.range(0, 800000000)
df = df.select("id",
               rand(seed=10).alias("uniform"),
               randn(seed=27).alias("normal"),
               randn(seed=27).alias("normal1"),
               randn(seed=1).alias("normal3"))

# Assign each row a random "flag" in [0, 22]; integer values so they match
# the IntegerType declared in the UDF output schema below.
df = df.withColumn(
    "flag",
    F.array([F.lit(i) for i in range(23)]).getItem((F.rand() * 23).cast("int"))
)

schema = StructType([
    StructField("flag", IntegerType()),
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def del_data(data):
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return data[["flag"]]

df.groupBy('flag').apply(del_data).write.csv('/tmp/')
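One detail that may matter here (worth checking against the Spark 2.4 docs): a GROUPED_MAP pandas_udf materializes an entire group, not a single input partition, as one pandas DataFrame inside a Python worker, so the ~200 MB-per-partition figure is not what bounds worker memory. A rough back-of-the-envelope sketch of the per-group size for this job:

# Rough estimate only; assumes the 800M rows split evenly over the 23 flag values.
rows_total = 800_000_000
groups = 23
cols = 6                 # id, uniform, normal, normal1, normal3, flag
bytes_per_value = 8      # crude average for 64-bit numbers

rows_per_group = rows_total / groups                      # ~34.8M rows
group_gb = rows_per_group * cols * bytes_per_value / 1e9
print(f"~{group_gb:.1f} GB per group before Arrow buffers and pandas overhead")

With the Arrow buffers plus the pandas copy on top of that, several GB per Python worker seems plausible, which roughly matches the ~6 GB observed.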

Ubuntu 18.04: GNU parallel can't find and use the full number of cores on a local system

I am using GNU parallel (version 20200522) on Ubuntu Linux 18.04.4 and running jobs on all cores of a local server minus 2 cores, that is, I am using the -j-2 parameter.
find /folder/ -type f -iname "*.pdf" | parallel -j-2 --nice 2 "script.sh {1} {1/.}; mv -f -v {1} /folder2/; mv -f {1/.}.txt /folder3/" :::: -
However, the program shows
Error: Cannot run any jobs.
I tried the -j100% parameter and saw that it uses just one core (one job), so I deduce that, for GNU parallel, 100% of the available cores on this system is just one core.
If I use the -j5 parameter (which does not involve autodetecting the total number of cores), everything is fine: parallel launches 5 jobs and uses 5 cores.
The interesting part is that the file /root/.parallel/tmp/sshlogin/MACHINE_NAME/cpuspec contains the following:
1
6
6
which means, I think, that GNU parallel should see 6 available cores.
I have tried deleting the cpuspec file and running parallel again to redetect the total number of cores, but the cpuspec file and the behavior of the program remain the same.
On different systems, deleting the cpuspec file solved all issues, but on this particular system it does not work. The virtual machine was copied from another server with a different configuration, which is why I need to delete the cpuspec file.
What should I do to get GNU parallel to correctly detect the number of cores on the system, so that I can use the -j-2 parameter?
Update 21.07:
After deleting once again the folder with the cpuspec file, running the parallel --number-of-sockets/cores/threads commands and using just once the -S 6/: parameter, the problem seems to have resolved itself. Now GNU parallel correctly detects the number of cores and the -j-2 parameters works.
I am not sure exactly what fixed it, but I am no longer able to reproduce the bug.
Ole, thank you for your answer. If I meet the bug again or if I am able to reproduce it, I will post it here.
And here is the output to the commands:
parallel --number-of-sockets
1
parallel --number-of-cores
6
parallel --number-of-threads
6
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping : 2
microcode : 0xffffffff
cpu MHz : 2397.218
cache size : 15360 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm pti fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4794.43
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
And it repeats itself for 5 more cores.
You may have found a bug. Please post the output of:
cat /proc/cpuinfo
parallel --number-of-sockets
parallel --number-of-cores
parallel --number-of-threads
Also see if you can make an MCVE.
As a workaround you can use -S 6/: to force GNU Parallel to detect 6 cores on your system.
find /folder/ -type f -iname "*.pdf" |
parallel -S 6/: -j-2 --nice 2 "script.sh {1} {1/.}; mv -f -v {1} /folder2/; mv -f {1/.}.txt /folder3/"
(Also, :::: - can be left out completely: if there is no ::: or ::::, GNU Parallel reads from stdin.)

Does Hyper-Threading decrease the performance of Docker?

I have a server with the configuration below. It runs only Docker, and all applications run as Docker containers; apart from Docker, no other applications are installed in the OS.
System Information :
- Server Model : Dell Power Edge R440
- Number of CPU : 2
- Number of Cores Per CPU : 10
- Memory (RAM) : 64 GB
- Hyper Threading : Enabled
- Virtualization : Enabled
- Operating System : CentOS 7
I have gone through the Dell BIOS tuning guide (Dell PowerEdge BIOS tuning for performance and power). It suggests enabling Hyper-Threading (Logical Processor), and the default is Hyper-Threading (Logical Processor) enabled.
Does Hyper-Threading decrease the performance of Docker?
The containers running in Docker are:
- Grafana
- PostgreSQL
- EMQX
- Prometheus
- Golang
Output of lscpu
Architecture : x86_64
CPU op-mode(s) : 32-bit, 64-bit
Byte Order : Little Endian
CPU(s) : 40
On-line CPU(s) list : 0-39
Thread(s) per core : 2
Core(s) per socket : 10
Socket(s) : 2
NUMA node(s) : 2
Vendor ID : GenuineIntel
CPU family : 6
Model : 85
Model name : Intel® Xeon® Silver 4210 CPU @ 2.20GHz
Stepping : 7
CPU MHz : 999.963
CPU max MHz : 3200.0000
CPU min MHz : 1000.0000
BogoMIPS : 4400.00
Virtualization : VT-x
L1d cache : 32K
L1i cache : 32K
L2 cache : 1024K
L3 cache : 14080K

TensorFlow app freezes in Docker container

I have a TensorFlow app that runs fine on Ubuntu 16.04, but when I attempt to run it in the tensorflow/tensorflow Docker image with nvidia-docker, it gets to this point and then freezes:
2017-07-12 22:06:10.917255: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-12 22:06:10.917289: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-12 22:06:11.023765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-12 22:06:11.024133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Quadro M4000
major: 5 minor: 2 memoryClockRate (GHz) 0.7725
pciBusID 0000:00:05.0
Total memory: 7.93GiB
Free memory: 7.87GiB
2017-07-12 22:06:11.024159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-07-12 22:06:11.024168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-07-12 22:06:11.024190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M4000, pci bus id: 0000:00:05.0)
Since it's not outputting an error message, I don't know where to start; any suggestions for something I might be missing or steps to troubleshoot this further?
I verified that my nvidia-docker installation is functioning correctly.
It turns out that the application was running; it just appeared to have frozen because output from Python apps running in Docker containers tends to get stuck in the buffer and never show up in the Docker logs. To fix the problem I passed -u to python. I can now see my application's output in docker logs and all is well.
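For reference, the same effect as python -u can also be achieved from inside the script by forcing flushed output; a minimal sketch (the log message is hypothetical), assuming Python 3:

import sys

# Option 1: flush each print explicitly
print("training step finished", flush=True)

# Option 2 (Python 3.7+): make stdout line-buffered once at startup
sys.stdout.reconfigure(line_buffering=True)

Setting the PYTHONUNBUFFERED=1 environment variable in the container is another equivalent of passing -u.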

How to speed up a SonarQube analysis job?

I have a Java-based application with a huge amount of source code (~1M lines). I am using Jenkins with sonar-runner-2.4 to run the analysis with code coverage and test case counts. I have upgraded the SonarQube server from 5.4 to 6.3.1. Before the upgrade this job took 9 hours to complete the whole analysis (still a very long time, but acceptable); after the upgrade to SonarQube 6.3.1 the same job takes about 13 hours for the same analysis.
How do I get the analysis time back down to at least the earlier 9 hours?
EDIT
Here are my JAVA_OPTS for the sonarqube-6.3.1 instance:
sonar.web.javaOpts=-Xmx6G -Xms2G -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true
Available Hardware :
$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Stepping: 5
CPU MHz: 1596.000
BogoMIPS: 3999.44
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
Available Memory :
$free -m
              total        used        free      shared  buff/cache   available
Mem:         128714       58945       66232         430        3535       68298
Swap:         32767         957       31810
sonar-project.properties for the long running job:
sonar-project.properties
As you haven't really given many details, I can't really give many details in the answer, but the simple answer is that you have to make the scan do less work.
Look at your codebase. Is your scan processing generated classes? Is it scanning test classes? Is it scanning classes that have little real business logic? If you answer "yes" to any of those, consider excluding those classes.
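For example, exclusions can be declared in sonar-project.properties; the patterns below are only illustrative and would need to be adapted to your actual generated-source and test layout:

# Hypothetical patterns - adjust to your project structure
sonar.exclusions=**/generated/**,**/*Generated.java
sonar.test.exclusions=src/test/**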
Look at the SonarQube plugins you're using. Are you running every possible plugin you can run? Are there some heuristics you don't need to run, or perhaps you could run less frequently?
