GNU parallel nesting - gnu-parallel

I have a directory of 130,000+ .tif files. I want to use find with GNU parallel. All my files are named in the pattern and sequence k-001 to k-163. One of the challenges is matching the zero-padded 001 with the unpadded 1 that seq produces.
I tried this:
seq 111 163 | parallel -j10 find . -name 'k-{}\*' -print0 | parallel -0 'tesseract {/} /mnt/ramdisk/output/{/.} > /dev/null 2>&1'
I am not getting parallelism from the seq part. Where am I going wrong?

Not sure what the actual issue is, but you can generate zero-padding like this if that is the problem:
printf "%03d\n" {0..10} | parallel -k echo
000
001
002
003
004
005
006
007
008
009
010
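If the padding is what is breaking the match, one way to plug it into the pipeline from the question might look like this (only a sketch: it keeps the paths, job count and tesseract call from the question, and quotes the find template so the k-001* pattern reaches find without being expanded by the shell):
printf '%03d\n' {1..163} |
  parallel -j10 "find . -name 'k-{}*' -print0" |
  parallel -0 'tesseract {/} /mnt/ramdisk/output/{/.} > /dev/null 2>&1'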

Related

Alternative to grep in a Lua script

I have the following text file output.txt that I created (it has 15 columns, including the | symbols):
[66] | alert:n | 3.0 | 10/22/2020-14:45:50.066928 | local_ip | 123.123.123.123 | United States of America | SURICATA STREAM ESTABLISHED SYNACK resend with different ACK
[67] | alert:n | 3.0 | 10/22/2020-14:45:51.096955 | local_ip | 12.12.12.11 | United States of America | SURICATA STREAM ESTABLISHED SYNACK resend with different ACK
[68] | alert:n | 3.0 | 10/22/2020-14:45:53.144942 | 123.123.123.123 | local_ip | United States of America | SURICATA STREAM ESTABLISHED SYNACK resend with different ACK
[69] | alert:n | 3.0 | 10/22/2020-14:45:57.176956 | local_ip | 68.73.203.109 | United States of America | SURICATA STREAM ESTABLISHED SYNACK resend with different ACK
[70] | alert:n | 3.0 | 10/22/2020-14:46:05.240953 | 123.123.123.123 | local_ip | United States of America | SURICATA STREAM ESTABLISHED SYNACK resend with different ACK
[71] | alert:n | 3.0 | 10/22/2020-14:46:21.624979 | local_ip | 68.73.203.109 | United States of America | SURICATA STREAM ESTABLISHED SYNACK resend with different ACK
I'm familiar with bash scripting. Say I want to count how many times the specific IP 123.123.123.123 appears in the 9th column; I can implement it like this:
#!/bin/bash
ip="123.123.123.123"
report="output.txt"
src_ip_count=$(grep "${ip}" "${report}" | awk '{ print $9 }' | grep -v "local_ip" | uniq -c | awk '{ print $1 }')
and the output is:
[root@me lua-output]# ./test.sh
2
How do I implement the same code above in Lua? I know the popen function can be used, but is there a native way to do this in Lua? Also, if I use popen, I need to pass the variables $ip and $report inside that command, which I'm not sure is possible.
There's a bunch of ways to go about this, really. Assuming you read your data from stdin (though the same works for any file you manually open), you can do something like this:
local c = 0
for line in io.lines() do -- or file:lines() if you have a different file
  if line:find("123.123.123.123", 1, true) then -- Only lines containing the IP we care about (plain find, no pattern)
    if true then -- Whatever other conditions you want to apply
      c = c + 1
    end
  end
end
print(c)
Lua doesn't have a concept of what a "column" is, so you have to build that yourself as well. Either use a pattern to count spaces, or split the string into a table and index it.
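For example, a minimal sketch of the split-into-a-table approach, assuming whitespace-separated fields (like awk's default) and the output.txt file from the question:
-- count lines whose 9th whitespace-separated field is the IP,
-- mirroring the awk '{ print $9 }' step of the bash version
local ip = "123.123.123.123"
local count = 0
for line in io.lines("output.txt") do
  local fields = {}
  for field in line:gmatch("%S+") do -- split on whitespace
    fields[#fields + 1] = field
  end
  if fields[9] == ip then -- column 9 holds the source address
    count = count + 1
  end
end
print(count) -- 2 for the sample data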
You asked whether it is possible to use variables inside popen in Lua. It is, and you can use the grep command from Lua that way.
So in Lua you can do this:
-- lua script using grep example
local ip = "123.123.123.123"
local report = "output.txt"
local cmd = "grep -F " .. ip .. " " .. report .. " | awk '{ print $9 }' | grep -v 'local_ip' | uniq -c | awk '{ print $1 }'"
local handle = io.popen(cmd)
local src_ip_count = handle:read("*a")
print(src_ip_count)
handle:close()
output:
2

how to exclude some of the matches from grep?

I am using grep to print the matching lines from a very large file. I get hundreds of matches, some of which are not interesting, and I want to exclude those.
grep "WARNING" | grep -v "WARNING_HANDLING_THREAD" path # i tried this
When I grep the file for WARNING, I get:
0-00:00:33.392 (2127:127:250:02 = 21.278532 Fri Feb 1 10:17:22 2019) <3:0x000a>:[89]:[enter]: cest_handleFreeReq.c:116: [WARNING]: cest_handleFreeReq: sent from DECA ->UCS
0-00:00:38.263 (2189:022:166:06 = 21.891510 Fri Feb 1 10:17:28 2019) <3:0x000a>:[89]:[enter]: cest_handleConfigReq.c:176: [WARNING]: cest_handleConfigReq.c: GroupConfig NOT present.
0-00:00:38.263 (2189:022:167:03 = 21.891510 Fri Feb 1 10:17:28 2019) <3:0x000a>:[89]:[enter]: cest_handleConfigReq.c:194: [WARNING]: cest_handleConfigReq: physicalConfig NOT present.
60 0x6d77 0 0x504ea | 2 18 | 0 0 | 4 12 | 647 | 14685 0 0.0 0 500 500 | 0 | 0 | 38 | ETH_DRV_WARNING_HANDLING_thread
60 0 | 0 0 | 0 0 0 | 0 0 0 0 0 0 ! N/A N/A N/A N/A N/A N/A |ETH_DRV_WARNING_HANDLING_thread
WARNING: List of threads violating the heap & stack limit
I want to exclude those last lines, which are not interesting, and keep only:
0-00:00:33.392 (2127:127:250:02 = 21.278532 Fri Feb 1 10:17:22 2019) <3:0x000a>:[89]:[enter]: cest_handleFreeReq.c:116: [WARNING]: cest_handleFreeReq: sent from DECA ->UCS
0-00:00:38.263 (2189:022:166:06 = 21.891510 Fri Feb 1 10:17:28 2019) <3:0x000a>:[89]:[enter]: cest_handleConfigReq.c:176: [WARNING]: cest_handleConfigReq.c: GroupConfig NOT present.
0-00:00:38.263 (2189:022:167:03 = 21.891510 Fri Feb 1 10:17:28 2019) <3:0x000a>:[89]:[enter]: cest_handleConfigReq.c:194: [WARNING]: cest_handleConfigReq: physicalConfig NOT present.
Is there a way to do this using grep, find, or any other tool?
Thank you
Note that the substring thread is in lower case in the data, but in upper case in your expression.
Instead, use
grep -F 'WARNING' logfile | grep -F -v 'WARNING_HANDLING_thread'
The -F makes grep use string comparisons rather than regular expression matching (this is not really related to your current issue; it just makes explicit what type of pattern we're matching with).
Another option would be to make the second grep do case insensitive matching with -i:
grep -F 'WARNING' logfile | grep -Fi -v 'WARNING_HANDLING_THREAD'
In this case though, I would probably match the [WARNING] tag instead:
grep -F '[WARNING]:' logfile
Note that here we need the -F so that grep interprets the pattern as a string and not as a regular expression matching any single character out of the W, A, R, N, I, G set, followed by a :.
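A quick illustration with two made-up lines: without -F the bracket expression behaves as a character class, so an unrelated line matches and the real tag does not, while -F finds the literal string:
printf '%s\n' 'G: not a warning' 'foo.c:12: [WARNING]: bad thing' | grep '[WARNING]:'
G: not a warning
printf '%s\n' 'G: not a warning' 'foo.c:12: [WARNING]: bad thing' | grep -F '[WARNING]:'
foo.c:12: [WARNING]: bad thing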

grep invert match on two files

I have two text files containing one column each, for example -
File_A File_B
1 1
2 2
3 8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.
You can also use comm, if your version allows an empty output delimiter:
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
	8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also pass the common lines obtained from grep as input to grep -v. Tested with GNU grep; some versions might not support all these options.
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h to suppress file name prefix
f- to accept stdin instead of file input
With awk (for each line of the second file not present in the first, print the first file's line at the same position alongside it):
awk 'NR==FNR{a[$0]; b[FNR]=$0; next} !($0 in a){print b[FNR], $0}' f1 f2
3 8
To understand the meaning of NR and FNR, check the output of printing them below:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 1
5 2
6 3
The condition NR==FNR is used to process only the first file, since NR and FNR are equal only while the first file is being read.
With the GNU diff command (to compare the files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns

Cypher load CSV eager and long action duration

I'm loading a file with 85K lines (19 MB);
the server has 2 cores and 14 GB RAM, running CentOS 7.1 and Oracle JDK 8,
and it can take 5-10 minutes with the following server config:
dbms.pagecache.memory=8g
cypher_parser_version=2.0
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
disk mounted in /etc/fstab:
UUID=fc21456b-afab-4ff0-9ead-fdb31c14151a /mnt/neodata ext4 defaults,noatime,barrier=0 1 2
added this to /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited
* soft nofile 40000
* hard nofile 40000
added this to /etc/pam.d/su
session required pam_limits.so
added this to /etc/sysctl.conf:
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
disabled journal by running:
sudo e2fsck /dev/sdc1
sudo tune2fs /dev/sdc1
sudo tune2fs -o journal_data_writeback /dev/sdc1
sudo tune2fs -O ^has_journal /dev/sdc1
sudo e2fsck -f /dev/sdc1
sudo dumpe2fs /dev/sdc1
Besides that, when running the profiler I get lots of "Eagers", and I really can't understand why:
PROFILE LOAD CSV WITH HEADERS FROM 'file:///home/csv10.csv' AS line
FIELDTERMINATOR '|'
WITH line limit 0
MERGE (session :Session { wz_session:line.wz_session })
MERGE (page :Page { page_key:line.domain+line.page })
ON CREATE SET page.name=line.page, page.domain=line.domain,
page.protocol=line.protocol,page.file=line.file
Compiler CYPHER 2.3
Planner RULE
Runtime INTERPRETED
+---------------+------+---------+---------------------+--------------------------------------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+---------------+------+---------+---------------------+--------------------------------------------------------+
| +EmptyResult | 0 | 0 | | |
| | +------+---------+---------------------+--------------------------------------------------------+
| +UpdateGraph | 9 | 9 | line, page, session | MergeNode; Add(line.domain,line.page); :Page(page_key) |
| | +------+---------+---------------------+--------------------------------------------------------+
| +Eager | 9 | 0 | line, session | |
| | +------+---------+---------------------+--------------------------------------------------------+
| +UpdateGraph | 9 | 9 | line, session | MergeNode; line.wz_session; :Session(wz_session) |
| | +------+---------+---------------------+--------------------------------------------------------+
| +ColumnFilter | 9 | 0 | line | keep columns line |
| | +------+---------+---------------------+--------------------------------------------------------+
| +Filter | 9 | 0 | anon[181], line | anon[181] |
| | +------+---------+---------------------+--------------------------------------------------------+
| +Extract | 9 | 0 | anon[181], line | anon[181] |
| | +------+---------+---------------------+--------------------------------------------------------+
| +LoadCSV | 9 | 0 | line | |
+---------------+------+---------+---------------------+--------------------------------------------------------+
All the labels and properties have indices/constraints.
Thanks for the help,
Lior
Hey Lior,
we tried to explain the Eager Loading here:
And Mark's original blog post is here: http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Rik tried to explain it in easier terms:
http://blog.bruggen.com/2015/07/loading-belgian-corporate-registry-into_20.html
Trying to understand the "Eager Operation"
I had read about this before, but did not really understand it until Andres explained it to me again: in all normal operations, Cypher loads data lazily. See for example this page in the manual - it basically just loads as little as possible into memory when doing an operation. This laziness is usually a really good thing. But it can get you into a lot of trouble as well - as Michael explained it to me:
"Cypher tries to honor the contract that the different operations
within a statement are not affecting each other. Otherwise you might
up with non-deterministic behavior or endless loops. Imagine a
statement like this:
MATCH (n:Foo) WHERE n.value > 100 CREATE (m:Foo {m.value = n.value + 100});
If the two statements were not isolated, then each node the CREATE generates would cause the MATCH to match again, etc.: an endless loop. That's why in such cases, Cypher eagerly runs all MATCH statements to exhaustion so that all the intermediate results are accumulated and kept (in memory).
Usually with most operations that's not an issue, as we mostly match only a few hundred thousand elements max.
With data imports using LOAD CSV, however, this operation will pull in ALL the rows of the CSV (which might be millions), execute all operations eagerly (which might be millions of creates/merges/matches) and also keep the intermediate results in memory to feed the next operations in line.
This also effectively disables PERIODIC COMMIT, because when we get to the end of the statement execution all create operations will already have happened and the gigantic tx-state has accumulated."
So that's what's going on in my LOAD CSV queries: MATCH/MERGE/CREATE caused an Eager pipe to be added to the execution plan, and it effectively disables the batching of my operations "using periodic commit". Apparently quite a few users run into this issue even with seemingly simple LOAD CSV statements. Very often you can avoid it, but sometimes you can't.
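In cases like the query above, the usual way to avoid the Eager (described in the posts linked above) is to make one pass over the CSV per MERGE, so that each statement only touches one kind of node and PERIODIC COMMIT keeps working. A sketch along those lines, reusing the file and properties from the question (and dropping the WITH line limit 0 that was only there for profiling):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///home/csv10.csv' AS line FIELDTERMINATOR '|'
MERGE (session:Session { wz_session: line.wz_session });

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///home/csv10.csv' AS line FIELDTERMINATOR '|'
MERGE (page:Page { page_key: line.domain + line.page })
ON CREATE SET page.name = line.page, page.domain = line.domain,
              page.protocol = line.protocol, page.file = line.file;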

grep specific pattern from a log file

I am passing all my svn commit log messages to a file and want to grep only the JIRA issue numbers from that.
Some lines might have more than 1 issue number, but I want to grab only the first occurrence.
The pattern is XXXX-999 (number of alpha and numeric char is not constant)
Also, I don't want the entire line to be displayed, just the JIRA number, without duplicates. I used the following command, but it didn't work.
Could someone help please?
cat /tmp/jira.txt | grep '^[A-Z]+[-]+[0-9]'
Log file sample
------------------------------------------------------------------------
r62086 | userx | 2015-05-12 11:12:52 -0600 (Tue, 12 May 2015) | 1 line
Changed paths:
M /projects/trunk/gradle.properties
ABC-1000 This is a sample commit message
------------------------------------------------------------------------
r62084 | usery | 2015-05-12 11:12:12 -0600 (Tue, 12 May 2015) | 1 line
Changed paths:
M /projects/training/package.jar
EFG-1001 Test commit
Output expected:
ABC-1000
EFG-1001
First of all, it seems like you have the second + in the wrong place; it should be at the end of the [0-9] expression.
Second, I think all you need to do is use the -o option to grep (to display only the matching portion of the line), then pipe the grep output through sort -u, like this:
cat /tmp/jira.txt | grep -oE '^[A-Z]+-[0-9]+' | sort -u
Although if it were me, I'd skip the cat step and just give the filename to grep, as so:
grep -oE '^[A-Z]+-[0-9]+' /tmp/jira.txt | sort -u
Six of one, half a dozen of the other, really.
