I wonder why increasing --ram_utilization_factor is not recommended (from the docs):
This option, which takes an integer argument, specifies what percentage of the system's RAM Bazel should try to use for its subprocesses. This option affects how many processes Bazel will try to run in parallel. The default value is 67. If you run several Bazel builds in parallel, using a lower value for this option may avoid thrashing and thus improve overall throughput. Using a value higher than the default is NOT recommended. Note that Bazel's estimates are very coarse, so the actual RAM usage may be much higher or much lower than specified. Note also that this option does not affect the amount of memory that the Bazel server itself will use.
Since Bazel has no way of knowing how much memory an action/worker uses or will use, --ram_utilization_factor seems to be the only way to control this.
That comment is very old and I believe was the result of some trial and error when --ram_utilization_factor was first implemented. The comment was added to make sure that developers would have some memory left over for other applications to run on their machines. As far as I can tell, there is no deeper reason for it.
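If you do decide to adjust it anyway, the flag can be passed on the command line or put into a .bazelrc. A minimal sketch, where the value 50 and the target //foo:bar are only illustrative:

bazel build --ram_utilization_factor=50 //foo:bar

or, to make it the default for all builds, add a line to your .bazelrc:

build --ram_utilization_factor=50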
I am using ELKI to run the LOF algorithm, but every time I run it on the same data set, ELKI reports a different runtime.
I am confused about why this is happening.
As @Anony-Mousse mentioned in the comments, getting reliable runtime values is hard.
This is not specific to ELKI - the LOF algorithm does not involve randomness, so one would assume that you get the same runtime every time. But in reality, you just don't get the same runtime every time.
Here are some hints on getting more reliable runtimes:
Do not run multiple experiments in the same process. Start a fresh JVM for each (otherwise, it will likely have to garbage collect the objects of the previous run at some point).
Runtime differences of less than 1 second are not significant. To measure runtimes of, say, less than a second, you should use specialized tools such as JMH instead.
Run several times, and use the median value, or a truncated mean.
Disable turbo boost, and make sure your CPU is well cooled. On Linux (and I would suggest that you use a dedicated Linux system for benchmarking!), you can disable CPU boost via:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
My CPU has a 3.4 GHz base clock, but can boost to 3.9 GHz. That is 15% faster, but it cannot run at 3.9 GHz when using all cores. Depending on the number of threads running, the actual clock could be 3.4, 3.7, 3.8, or 3.9 GHz; and it could even temporarily decide to throttle down to 1.6 GHz. So you may even need to monitor your CPU clock speeds (e.g. see grep "^cpu MHz" /proc/cpuinfo). But definitely disable turbo boost.
Do not use -verbose. Logging has a noticeable impact on runtime.
Do not use the MiniGUI. Notice that the MiniGUI displays a command line for you to copy, such as KDDCLIApplication -dbc.in file.csv -algorithm ..., which allows you to run the experiment from the console, without GUI overhead.
Do not use visualization. Visualization needs a lot of memory and CPU, and that will have a measurable impact. ELKI tries to measure the algorithm time independently and visualize afterwards, but there might still be some memory cleanup in the background even if you have closed the visualization window. Preferably, for runtime benchmarking, use -resulthandler DiscardResultHandler to discard the actual result, and get the timing from the -time log (the log should be only a few lines long; avoid excessive output). A complete example invocation is sketched below.
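Putting the last few hints together, a benchmarking invocation could look roughly like this; the jar name, the class names, the -lof.k value, and the input file are placeholders that depend on your ELKI version and data:

java -jar elki.jar KDDCLIApplication -dbc.in file.csv -algorithm outlier.lof.LOF -lof.k 20 -resulthandler DiscardResultHandler -time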
I'm running a job which reads about 70GB of compressed data.
In order to speed up processing, I tried to start a job with a large number of instances (500), but after 20 minutes of waiting, it doesn't seem to start processing the data (I have a counter for the number of records read). The reason for having a large number of instances is that, as one of the steps, I need to produce an output similar to an inner join, which results in a much bigger intermediate dataset for later steps.
What is the average delay between when a job is submitted and when it starts executing? Does it depend on the number of machines?
While I might have a bug that causes that behavior, I still wonder what that number/logic is.
Thanks,
G
The time necessary to start VMs on GCE grows with the number of VMs you start, and in general VM startup/shutdown performance can have high variance. 20 minutes would definitely be much higher than normal, but it is somewhere in the tail of the distribution we have been observing for similar sizes. This is a known pain point :(
To verify whether VM startup is actually at fault this time, you can look at Cloud Logs for your job ID, and see if there's any logging going on: if there is, then some VMs definitely started up. Additionally you can enable finer-grained logging by adding an argument to your main program:
--workerLogLevelOverrides=com.google.cloud.dataflow#DEBUG
This will cause workers to log detailed information, such as receiving and processing work items.
Meanwhile, I suggest enabling autoscaling instead of specifying a large number of instances manually - it should gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
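In recent SDK versions this is done with pipeline options along the following lines; check your SDK version's documentation for the exact option names and values, since autoscaling support has been evolving:

--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=500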
Another possible (and probably more likely) explanation is that you are reading a compressed file that needs to be decompressed before it is processed. It is impossible to seek in the compressed file (since gzip doesn't support it directly), so even though you specify a large number of instances, only one instance is being used to read from the file.
The best way to address this would be to split the single compressed file into many files that are compressed separately.
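As a rough sketch with standard Unix tools, assuming a line-oriented gzip file named big.csv.gz and an arbitrary chunk size of one million lines per piece:

zcat big.csv.gz | split -l 1000000 - part_
gzip part_*

This produces part_aa.gz, part_ab.gz, and so on, each of which can then be read by a separate worker.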
The best way to debug this problem would be to try it with a smaller compressed input and take a look at the logs.
When using Z3 on the command line with the "-T" switch, is there a way to set the timeout to less than one second?
I know you can set the timeout to be less than that using the API, but for various stupid reasons I've been passing text files containing SMT-LIBv2 scripts to Z3 in a loop (please don't be mad), thinking it would work just as well. I've only just noticed that this approach seems to create a lower bound of one second on timeouts. This slows things down quite a bit if I'm using Z3 to check thousands of short files.
I understand if this is just the way things are, and I accept that what I'm doing isn't sensible when there's already a perfectly good API for Z3.
There are two options:
You can use "soft timeouts". They are less reliable than the -T timeout because soft timeout expiration is only checked periodically. Nevertheless, the option "smt.soft_timeout=10" would set a timeout of 10ms (instead of 10s). You can set these options both from the command line and within the SMT-LIB2 file using (set-option :smt.soft_timeout 10). The tutorial on using tactics/solvers furthermore explains how to use more advanced features (strategies), and you can also control these advanced features using options, such as timeouts, from the textual interface. A tiny example follows after these two options.
You can load SMT-LIB2 files from the programmatic API. The assertions from the files are stored in a conjunction. You can then call a solver (again from the API) and use the "soft timeout" option for the solver object. There isn't really a reason to use option 2 unless you need to speed up your pipeline or need something more than the soft timeout feature, because that is already reasonably exposed at the SMT-LIB level.
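To illustrate the first option, a minimal sketch: put the option at the top of each script,

(set-option :smt.soft_timeout 10)

or pass it on the command line when looping over files (file.smt2 is a placeholder):

z3 smt.soft_timeout=10 file.smt2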
In Jenkins I have 100 Java projects. Each has its own build file.
Every time, I want to clean the previous build output and compile all source files again.
Using the bulk builder plugin, I tried compiling all the jobs, having 100 jobs run in parallel.
But performance is very bad: a job that takes 1 minute individually takes 20 minutes in the batch. The bigger the batch size, the more time it takes. I am running this on a powerful server, so memory and CPU should not be a problem.
Please suggest how I can overcome this; what configuration needs to be done in Jenkins?
I am launching Jenkins using the WAR file.
Thanks.
Even though you say you have enough memory and CPU resources, you seem to imply there is some kind of bottleneck when you increase the number of parallel running jobs. I think this is understandable. Even though I am not a Java developer, I think most Java build tools are able to parallelize builds internally, i.e. building a single job may well consume more than one CPU core and quite a lot of memory.
Because of this, I suggest you monitor your build server and experiment with different batch sizes to find an optimal number. You should execute e.g. "vmstat 5" while builds are running and see if you have any idle CPU left. Also keep an eye on the disk I/O: if you increase the batch size but disk I/O does not increase, you are already consuming all of the I/O capacity, and increasing the batch size further probably will not help much.
When you have found the optimal batch size (i.e. how many executors to configure for the build server), you can maybe tweak other things to make things faster:
Try to spend as little time checking out code as possible. Instead of deleting the workspace before the build starts, configure the SCM plugin to remove files that are not under version control. If you use git, you can use a local reference repo or do a shallow clone or something like that (see the example git commands after this list).
You can also try to speed things up by using SSD disks.
You can get more servers, run Jenkins slaves on them, and utilize the CPU and I/O capacity of multiple servers instead of only one.
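For the git point above, the underlying git operations look roughly like this (the URL and the cache path are placeholders; the Jenkins git plugin exposes equivalent settings in its advanced clone behaviours):

git clone --depth 1 https://example.com/big-repo.git
git clone --reference /var/cache/git/big-repo.git https://example.com/big-repo.git

A shallow clone only fetches recent history, and a reference repository lets the clone reuse objects already on local disk instead of transferring them again.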
Does anybody know if there is a sort of 'load balancer' in the Erlang standard library? I mean, if I have some really simple operations on a really large set of data, the overhead of constructing a process for every item will be larger than performing the operation sequentially. But if I can balance the work over the 'right number' of processes, it will perform better, so I'm basically asking if there is an easy way to accomplish this task.
By the way, does anybody know if an OTP application does some kind of load balancing? I mean, does an OTP application have the concept of a "worker process" (like a Java-ish worker thread)?
See modules pg2 and pool.
pg2 implements a quite simple distributed process pool. pg2:get_closest_pid/1 returns the "closest" pid, i.e. a random local process if available, otherwise a random remote process.
pool implements load balancing between nodes started with the slave module.
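For pg2, a minimal usage sketch could look like this (the group name my_workers, WorkerPid, and Item are placeholders for whatever worker processes and work items you have):

pg2:create(my_workers),
pg2:join(my_workers, WorkerPid),
Pid = pg2:get_closest_pid(my_workers),
Pid ! {work, Item}.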
The plists module probably does what you want. It is basically a parallel implementation of the lists module, designed to be used as a drop-in replacement. However, you can also control how it parallelizes its operations, for example by defining how many worker processes should be spawned, etc.
You would probably do this by calculating the number of workers based on the length of the list, the load of the system, etc.
From the website:
plists is a drop-in replacement for the Erlang module lists, making most list operations parallel. It can operate on each element in parallel, for IO-bound operations, on sublists in parallel, for taking advantage of multi-core machines with CPU-bound operations, and across erlang nodes, for parallizing inside a cluster. It handles errors and node failures. It can be configured, tuned, and tweaked to get optimal performance while minimizing overhead.
There is, in my view, no useful generic load-balancing tool in OTP, and perhaps one is only useful in specific cases. It is easy enough to implement one yourself. plists may be useful in the same cases, but I do not believe in parallel libraries as a substitute for the real thing. Amdahl will haunt you forever if you walk this path.
The right number of worker processes is equal to the number of schedulers. This may vary depending on what other work is done on the system. Use
erlang:system_info(schedulers_online) -> NS
to get the number of schedulers.
The notion of overhead when flooding the system with an abundance of worker processes is somewhat faulty. There is overhead with new processes, but not as much as with OS threads. The main overhead is message copying between processes; this can be alleviated by using binaries, since only the reference to the binary is sent. With eterms, the structure is first expanded and then copied to the other process.
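To make the partitioning idea concrete, here is a minimal, hypothetical sketch of a hand-rolled parallel map that spawns one worker per online scheduler (the module name par and its helpers are made up; this is not part of OTP, and real code would also need to handle worker crashes):

%% par.erl - split the list into one chunk per scheduler and map in parallel.
-module(par).
-export([map/2]).

map(Fun, List) ->
    NS = erlang:system_info(schedulers_online),
    ChunkSize = max(1, length(List) div NS),
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, lists:map(Fun, Chunk)} end),
                Ref
            end || Chunk <- chunks(List, ChunkSize)],
    %% Collect results; the selective receive keeps the chunk order.
    lists:append([receive {Ref, Result} -> Result end || Ref <- Refs]).

%% Split a list into sublists of at most N elements.
chunks([], _N) -> [];
chunks(List, N) when length(List) =< N -> [List];
chunks(List, N) ->
    {Head, Tail} = lists:split(N, List),
    [Head | chunks(Tail, N)].

Passing large eterms to the workers triggers exactly the message copying mentioned above, so for big inputs it is often better to send binaries or have the workers read their own share of the data.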
There is no way to predict the cost of a piece of work mechanically without measuring it, i.e. doing it. Someone must determine how to partition the work for a given class of tasks. By 'load balancer' I understand something quite different from what is described in your question.