Why "LC_ALL=C" cannot speed up "bzgrep" - grep

For cases that involve ASCII only, LC_ALL=C can significantly speed up grep.
Since bzgrep is very similar to grep, I tried the same trick, but it turned out not to help much.
The commands I used are listed below:
$ time bzgrep Debug 001.log.bz2 | sed -n '/^09:00/ , /^09:30/p' | grep "Pattern1.*Pattern2" > /dev/null
$ time LC_ALL=C bzgrep Debug 001.log.bz2 | sed -n '/^09:00/ , /^09:30/p' | grep "Pattern1.*Pattern2" > /dev/null
Update:
$ time bzgrep
real 1m51.686s
user 1m52.310s
sys 0m6.682s
$ time LC_ALL=C bzgrep
real 1m51.835s
user 1m52.455s
sys 0m6.738s
$ time grep
real 1m9.553s
user 1m3.189s
sys 0m2.120s
$ time LC_ALL=C grep
real 0m4.136s
user 0m3.187s
sys 0m0.946s

Assuming bzgrep isn't ignoring LC_ALL entirely, you're probably hitting a bottleneck: decompression limits how fast the actual grep code receives data in the first place, which makes the speed of the grep code itself largely moot.
To use a car analogy, imagine that you have two groups of car washers—one that can wash a car in a minute, and another that takes five minutes. You have ten cars to wash. The first team washes ten cars in ten minutes; the other team washes ten cars in 50 minutes.
Now suppose that you have a fire marshal who doesn't like having that many cars on the property at once and decides to solve the problem by allowing exactly one car to enter the car wash, precisely at the top of every hour. Thus, each team finishes with its car and then waits either 55 or 59 minutes for the next one.
In that scenario, the first team gets a car and washes it in a minute, waits an hour, washes another car, etc. They start washing the tenth car nine hours after they started washing the first, so it takes nine hours and one minute in total. The second team does the same thing, but takes five minutes to wash the last car, so it takes them nine hours and five minutes in total.
In much the same way, if the grep part is dramatically faster than the decompression part, then assuming you have a multicore CPU, the total time depends only upon the time required to decompress the data plus the time to grep the very last chunk of data.
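One way to check whether decompression really is the bottleneck is to time it on its own, and then let grep (and LC_ALL=C) work on already-decompressed data. A sketch, reusing the file name and pattern from the question:
$ # How long does decompression alone take?
$ time bzcat 001.log.bz2 > /dev/null
$ # Decompress once, then grep the plain text, where LC_ALL=C can actually matter:
$ bzcat 001.log.bz2 > 001.log
$ time LC_ALL=C grep Debug 001.log > /dev/null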

Related

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep with GNU parallel on a single multi-core machine, based on the "large_file" size, the "small_file" size and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What performance issues/speed bottlenecks will I run into when setting it too high or too low? I understand what block-size does, in that it chops large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would affect the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.
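A quick way to act on that is to benchmark a few candidate block sizes against each other, for example with a throwaway loop like this sketch (the candidate values are assumptions; adjust them for your data and disk):
for b in -1 -10 100M 500M; do
    echo "--block $b"
    time parallel --pipepart --block $b --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
done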

How does Locust provide state over time for load testing?

I was trying to move from Gatling to Locust (Python is a nicer language) for load tests. In Gatling I can get data for charts like 'Requests per second over time', 'Response time percentiles over time', etc. ( https://gatling.io/docs/2.3/general/reports/ ) and the really useful 'Responses per second over time'.
In Locust I can see the two reports (requests, distribution), where (if I understand it correctly) 'Distribution' is the one that does 'over time'? But I can't see where things started failing, or the early history of that test.
Is Locust able to provide 'over time' data in a CSV format (or something else easily graph-able)? If so, how?
Looked through logs, can output the individual commands, but it would be a pain to assemble them (it would push the balance toward 'just use Gatling')
Looked over https://buildmedia.readthedocs.org/media/pdf/locust/latest/locust.pdf but not spotting it
I can (and have) created a loop that triggers the locust call at incremental intervals
increment_user_count = [1, 10, 100, 1000]
# for total_users in range(user_min, user_max, increment_count):
for users in increment_user_count:
    [...]
    system(assembled_command)
And that works... but it loses the whole advantage of setting a spawn rate, and would be painful for gradually incrementing up to a large number (then having to assemble all the files back together)
Currently executing with something like
locust -f locust_base_testing.py --no-web -c 1000 -r 2 --run-time 8m30s --only-summary --csv=output_stats_20190405-130352_1000
(need to use this in automation so Web UI is not a viable use-case)
I would expect a flag, in the call or in some form of setup, that outputs the summary at regular ticks. Basically I'd expect (with no-web) to get the data that I could use to replicate the graph the web version seems to know about:
Actual: just one final summary of the overall test (and logs per individual call)
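For concreteness, the loop-based workaround described above amounts to something like this sketch around the command shown earlier (the user counts and CSV prefix are assumptions):
for users in 1 10 100 1000; do
    locust -f locust_base_testing.py --no-web -c $users -r 2 --run-time 8m30s \
        --only-summary --csv=output_stats_${users}
done
Each run then produces its own set of CSV files that have to be stitched together afterwards, which is exactly the pain point described above.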

Getting Linux perf samples even if my program is in sleep state?

For a program without a sleep call, perf collects call-graph samples well:
#include <stdio.h>

int main(void)
{
    while (1)
    {
        printf(...);
    }
}
For example, more than 1,000 samples in a second.
I collected the samples with:
sudo perf record -p <process_id> -g
and viewed them with perf report.
However, for a program with a sleep call, perf does not collect call-graph samples well: only a few samples per second.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    while (1)
    {
        sleep(1);
        printf(...);
    }
}
I want to collect call-graph samples even while my program is in the sleep state, a.k.a. device time. On Windows, VSPerf collects call graphs during sleep just fine.
Collecting call graphs for the sleep state is needed to find performance bottlenecks not only in CPU time but also in device time (e.g. accessing a database).
I guess there may be a perf option for collecting samples even if my program is in the sleep state, because many other programmers besides me probably want this too.
How can I get perf samples even if my program is in the sleep state?
After posting this question, we found that perf record -c 1 captures about 10 samples per second. Without -c 1, perf captured 0.3 samples per second. 10 samples per second is much better for now, but it is still far less than 1,000 samples per second.
Is there any better way?
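For reference, the -c 1 experiment mentioned in the update presumably corresponds to something like the following sketch (-c sets the sampling period, so -c 1 records a sample for every counted event):
sudo perf record -g -p <process_id> -c 1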
CPU samples while your process is in the sleep state are mostly useless, but you could emulate this behavior by using an event that records the beginning and end of the sleep syscall (capturing the stacks), and then add the "sleep stacks" yourself in "post processing" by duplicating the entry stack a number of times consistent with the duration of each sleep.
After all, the stack isn't going to change.
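A minimal sketch of that approach, assuming the program sleeps via nanosleep/clock_nanosleep (the exact tracepoint names vary by kernel and libc, so check perf list 'syscalls:*sleep*' on your machine first):
sudo perf record -g -p <process_id> \
    -e syscalls:sys_enter_clock_nanosleep \
    -e syscalls:sys_exit_clock_nanosleep
sudo perf report -g
The enter/exit timestamps give the duration of each sleep, and the enter-side stack is the one to duplicate in post-processing.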
When you specify a profiling target, perf will only account for events that were generated by said target. Quite naturally, a sleeping target doesn't generate many performance events.
If you would like to see other processes (like a database?) in your callgraph reports, try system-wide sampling:
-a, --all-cpus
System-wide collection from all CPUs (default if no target is specified).
(from perf man page)
In addition, if you plan to spend a lot of time actually looking at the reports, there is a tool I cannot recommend enough: FlameGraphs. This visualization may save you a great deal of effort.
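For example, a typical flame-graph pipeline looks roughly like this sketch (it assumes the stackcollapse-perf.pl and flamegraph.pl scripts from https://github.com/brendangregg/FlameGraph are on your PATH; the 30-second duration is an arbitrary choice):
sudo perf record -a -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg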

Airflow list_dags times out after exactly 30 seconds

I have a dynamic Airflow DAG (backfill_dag) that basically reads an Admin Variable (JSON) and builds itself. Backfill_dag is used for backfilling/history loading, so for example if I want to history-load DAGs x, y, and z in some order (x and y run in parallel, z depends on x), then I describe this in a particular JSON format and put it in the Admin Variable of backfill_dag.
Backfill_dag now:
parses the JSON,
renders the tasks of the DAGs x, y, and z, and
builds itself dynamically, with x and y in parallel and z depending on x.
Issue:
It works fine as long as Backfill_dag can be listed within 30 seconds.
Since Backfill_dag is a bit complex here, it takes more than 30 seconds to list (airflow list_dags -sd Backfill_dag.py), hence it times out and the DAG breaks.
Tried:
I tried setting a parameter, dagbag_import_timeout = 100, in the scheduler's airflow.cfg, but that did not help.
I fixed my code.
Fix:
I had some aws s3 cp commands in my DAG that were running during parsing/compilation, hence my list_dags command was taking more than 30 seconds. I removed them (or rather moved them into a BashOperator task), and now my code compiles (list_dags) in a couple of seconds.
Besides fixing your code, you can also increase core.dagbag_import_timeout, which defaults to 30 seconds. For me, increasing it to 150 helped.
core.dagbag_import_timeout
default: 30 seconds
The number of seconds before importing a Python file times out.
You can use this option to free up resources by increasing the time it takes before the Scheduler times out while importing a Python file to extract the DAG objects. This option is processed as part of the Scheduler "loop," and must contain a value lower than the value specified in core.dag_file_processor_timeout.
core.dag_file_processor_timeout
default: 50 seconds
The number of seconds before the DagFileProcessor times out processing a DAG file.
You can use this option to free up resources by increasing the time it takes before the DagFileProcessor times out. We recommend increasing this value if you're seeing timeouts in your DAG processing logs that result in no viable DAGs being loaded.
You can also try changing other Airflow configs, such as:
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT
AIRFLOW__CORE__DEFAULT_TASK_EXECUTION_TIMEOUT
also as mentioned above:
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
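Putting the above together, a sketch of what this can look like in practice (the 150 and 200 values are illustrative assumptions; per the docs quoted above, dagbag_import_timeout must stay lower than dag_file_processor_timeout):
# airflow.cfg
[core]
dagbag_import_timeout = 150
dag_file_processor_timeout = 200
# or, equivalently, as environment variables:
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=150
export AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=200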

In SPMD using GNU parallel, is processing the smallest files first the most efficient way?

This is pretty straightforward:
Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:
ls --sort=size data/* | tac | parallel ./proc
which lists the data according to size, then tac (reverse of cat) flips the order of that output so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?
I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!
If you need to run all jobs and want to optimize for the time to complete them all, you want them to finish at the same time. In that case you should run the small jobs last. Otherwise you may end up in the situation where all CPUs are done except one that has just started on the last big job, and you waste CPU time on every CPU except that one.
Here are 8 jobs: 7 take 1 second each, one takes 5 seconds (each digit below represents one second of work):
1 2 3 4 55555 6 7 8
On a dual core, smallest jobs first (the last job finishes after 8 seconds):
core 1: 1368
core 2: 24755555
On a dual core, biggest job first (the last job finishes after 6 seconds):
core 1: 555557
core 2: 123468
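Following that reasoning, simply dropping tac (so the largest files are dispatched first) should improve utilisation, since ls --sort=size already lists the largest files first:
ls --sort=size data/* | parallel ./proc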
