An easy way to diff log files, ignoring the time stamps?

An easy way to diff log files, ignoring the time stamps? - parsing

I need to diff two log files but ignore the time stamp part of each line (the first 12 characters to be exact). Is there a good tool, or a clever awk command, that could help me out?

Depending on the shell you are using, you can turn the approach #Blair suggested into a 1-liner
diff <(cut -b13- file1) <(cut -b13- file2)
(+1 to #Blair for the original suggestion :-)

#EbGreen said
I would just take the log files and strip the timestamps off the start of each line then save the file out to different files. Then diff those files.
That's probably the best bet, unless your diffing tool has special powers.
For example, you could
cut -b13- file1 > trimmed_file1
cut -b13- file2 > trimmed_file2
diff trimmed_file1 trimmed_file2
See #toolkit's response for an optimization that makes this a one-liner and obviates the need for extra files. If your shell supports it. Bash 3.2.39 at least seems to...

Answers using cut are fine but sometimes keeping timestamps within the diff output is appreciable. As the OP's question is about ignoring the time stamps (not removing them), I share here my tricky command line:
diff -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
sed isolates the timestamps (# before and \n after) within a process substitution
diff -I '^#' ignores lines having these timestamps (lines beginning by #)
example
Two log files having same content but different timestamps:
$> for ((i=1;i<11;i++)) do echo "09:0${i::1}:00.000 data $i"; done > 1.log
$> for ((i=1;i<11;i++)) do echo "11:00:0${i::1}.000 data $i"; done > 2.log
Basic diff command line says all lines are different:
$> diff 1.log 2.log
1,10c1,10
< 09:01:00.000 data 1
< 09:02:00.000 data 2
< 09:03:00.000 data 3
< 09:04:00.000 data 4
< 09:05:00.000 data 5
< 09:06:00.000 data 6
< 09:07:00.000 data 7
< 09:08:00.000 data 8
< 09:09:00.000 data 9
< 09:01:00.000 data 10
---
> 11:00:01.000 data 1
> 11:00:02.000 data 2
> 11:00:03.000 data 3
> 11:00:04.000 data 4
> 11:00:05.000 data 5
> 11:00:06.000 data 6
> 11:00:07.000 data 7
> 11:00:08.000 data 8
> 11:00:09.000 data 9
> 11:00:01.000 data 10
Our tricky diff -I '^#' does not display any difference (timestamps ignored):
$> diff -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
$>
Change 2.log (replace data by foo on the 6th line) and check again:
$> sed '6s/data/foo/' -i 2.log
$> diff -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
11,13c11,13
11,13c11,13
< #09:06:00.000
< data 6
< #09:07:00.000
---
> #11:00:06.000
> foo 6
> #11:00:07.000
=> timestamps are kept in the diffoutput!
You can also use the side by side feature using -y or --side-by-side option:
$> diff -y -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
#09:01:00.000 #11:00:01.000
data 1 data 1
#09:02:00.000 #11:00:02.000
data 2 data 2
#09:03:00.000 #11:00:03.000
data 3 data 3
#09:04:00.000 #11:00:04.000
data 4 data 4
#09:05:00.000 #11:00:05.000
data 5 data 5
#09:06:00.000 | #11:00:06.000
data 6 | foo 6
#09:07:00.000 | #11:00:07.000
data 7 data 7
#09:08:00.000 #11:00:08.000
data 8 data 8
#09:09:00.000 #11:00:09.000
data 9 data 9
#09:01:00.000 #11:00:01.000
data 10 data 10
old sed
If your sed implementation does not support the -r option, you may have to count the twelve dots <(sed 's/^\(............\)/#\1\n/' 1.log) or use another pattern of your choice ;)

For a graphical option, Meld can do this using its text filters feature.
It allows for ignoring lines based on one or more python regex. The differences still appear, but lines that don't have any other differences won't be highlighted.

Use Kdiff3 and at Configure>Diff edit "Line-Matching Preprocessor command" to something like:
sed "s/[ 012][0-9]:[0-5][0-9]:[0-5][0-9]//"
This will filter out time-stamps from comparison alignment algorithm.
Kdiff3 also lets you manually align specific lines.

I want to propose a solution for Visual Studio Code:
Install this extension - https://marketplace.visualstudio.com/items?itemName=ryu1kn.partial-diff
Configure it like this - https://github.com/ryu1kn/vscode-partial-diff/issues/49#issuecomment-608299085
Run extension command "Toggle Pre-Comparison Text Normalization Rules" and enable rule added on step #2
Use the extension (here is an explanation of it's UI quirk - https://github.com/ryu1kn/vscode-partial-diff/issues/11)

Related

Vertica's vsql.exe returns errorlevel 0 when facing ERROR 3326: Execution time exceeded run time cap

I am using vsql.exe on an external Vertica database for which I don't have any administrative access. I use some views with simple SELECT+FROM+WHERE queries.
These queries 90% of the time work just fine, but some times, randomly, I get this error:
ERROR 3326: Execution time exceeded run time cap of 00:00:45
The strange thing is that this error can happen way after those 45 seconds, even after 3 minutes. I've been told this is related to having different resource pools, but anyway I don't want to dig into that.
The problem is that when this occurs, vsql.exe returns errorlevel 0 and there is (apparently almost) no way to know this failed.
The output of the query is stored in a csv file. When it succeeds, it ends with (#### rows). But when it fails with this error, it just stops at any point of the csv, and its resulting size is around half of what's expected. This is of course not what you would expect when an error occurs, like no output or an empty one.
If there is a connection error or if the query has syntax errors, the errorlevel is not 0, so in those cases it behaves as expected.
I've tried many things, like increasing the timeout or adding -v ON_ERROR_STOP=ON to the vsql.exe parameters, but none of that helped.
I've googled a lot and found many people having this error, but the solutions are mostly related to increasing the timeouts, not related to the errorlevel returned.
Any help will be greatly appreciated.
TL;DR: how can I detect an error 3326 in a batch file like this?
#echo off
vsql.exe -h <hostname> -U <user> -w <pwd> -o output.cs -Ac "SELECT ....;"
echo %errorlevel% is always 0
if errorlevel 1 echo Error!! But this is never displayed.

Now that's really unexpected to me. I don't have Windows available just now, but trying on my Mac - at first just triggering a deliberate error:
$ vsql -h zbook -d sbx -U dbadmin -w $VSQL_PASSWORD -v ON_ERROR_STOP=ON -Ac "select * from foobarfoo"
ERROR 4566: Relation "foobarfoo" does not exist
$ echo $?
1
With ON_ERROR_STOP set to ON, this should be the behaviour everywhere.
Could you try what I did above through Windows, just with echo %ERRORLEVEL% instead of echo $?, just from the Windows command prompt and not in a batch file?
Next test: I run on resource pool general in my little test database, so I temporarily modify it to a runtime cap of 30 sec, run a silly query that will take over 30 seconds with ON_ERROR_STOP set to ON, collect the value returned by vsql and set the runtime cap of general back to NONE. I also have the %VSQL_* % env variables set so I don't have to repeat them all the time:
rem Windows way to set environment variables for vsql:
set VSQL_HOST=zbook
set VSQL_DATABASE=sbx
set VSQL_USER=dbadmin
set VSQL_PASSWORD=***masked***
Now for the test (backslashes, in Linux/MacOs escape a new line, which enables you to "word wrap" a shell command. Use the caret (^) in Windows for that):
marco ~/1/Vertica/supp $ # set a runtime cap
marco ~/1/Vertica/supp $ vsql -i -c \
"alter resource pool general runtimecap '00:00:30'"
ALTER RESOURCE POOL
Time: First fetch (0 rows): 116.326 ms. All rows formatted: 116.730 ms
marco ~/1/Vertica/supp $ vsql -v ON_ERROR_STOP=ON -iAc \
"select count(*) from one_million_rows a cross join one_million_rows b"
ERROR 3326: Execution time exceeded run time cap of 00:00:30
marco ~/1/Vertica/supp $ # test the return code
marco ~/1/Vertica/supp $ echo $?
1
marco ~/1/Vertica/supp $ # clear the runtime cap
marco ~/1/Vertica/supp $ vsql -i -c \
"alter resource pool general runtimecap NONE "
ALTER RESOURCE POOL
Time: First fetch (0 rows): 11.148 ms. All rows formatted: 11.383 ms
So it works in my case. Your line:
if errorlevel 1 echo Error!! But this is never displayed.
... never echoes anything because the previous line, with echo will return 0 to the shell, overriding the previous errorlevel.
Try it command by command on your Windows command prompt, and see what happens. Just echo %errorlevel%, without evaluating it.
And I notice that you are trying to export to CSV format. Then, try this:
Format the output unaligned (-A)
set the field separator to comma (-F ',')
remove the footer '(n rows)' (-P footer)
limit the output to 5 rows in the query for test
(I show the output before redirecting to file):
marco ~/1/Vertica/supp $ vsql -A -F ',' -P footer -c "select * from one_million_rows limit 5"
id,id_desc,dob,category,busid,revenue
0,0,1950-01-01,1,====== boss ========,0.000
1,-1,1950-01-02,2,kbv-000001kbv-000001,0.010
2,-2,1950-01-03,3,kbv-000002kbv-000002,0.020
3,-3,1950-01-04,4,kbv-000003kbv-000003,0.030
4,-4,1950-01-05,5,kbv-000004kbv-000004,0.040
Not aligning is much faster than aligning.
Then, as you spend most time in the fetching of the rows (that's because you get a timeout in the middle of an output file write process), try fetching more rows at a time than the default 1000. You will need to play with the value, depending on the network settings at your site until you get your best value:
-v ROWS_AT_A_TIME=10000
Once you're happy with the tested output, try this command (change the SELECT for your needs, of course ....):
marco ~/1/Vertica/supp $ vsql -A -F ',' -P footer \
-v ON_ERROR_STOP=ON -v ROWS_AT_A_TIME=10000 -o one_million_rows.csv \
-c "select * from one_million_rows"
marco ~/1/Vertica/supp $ wc -l one_million_rows.csv
1000001 one_million_rows.csv
The table actually contains one million rows. Note the line count in the file: 1,000,001. That's the title line included, but the footer (1000000 rows) removed.

grep every fourth line in .fastq

I am working on a linux machine using bash.
My question is, how can I skip lines in the query file using grep?
I am working with a large ~16Gb .fastq file named example.fastq which has the following format.
example.fastq
#SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
#SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I need to extract lines containing a strings of interest #SRR6750041.2 #SRR6750041.5 stored in a bash array called IDarray as well as the 3 lines following each match. The following grep command allows me to do this
for ID in "${IDarray[#]}";
do
grep -F -A 3 "$ID " example.fastq
done
This correctly output the following.
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I am looking for ways to speed this process up... one way would be to reduce the number of lines searched by grep by restricting the search to lines beginning with # or skipping lines that can not possibly contain the match #SRR6750041.1 such as lines 2,3,4 and 6,7,8 etc. Is there a way to do this using grep? Alternative methods are also welcome!

Here are some thoughts with examples. For test purposes I created test case as mini version of Yours example_mini.fastq is 145 MB big and IDarray has 999 elements (interests).
Your version has this performance (more than 2 mins in user space):
$ time for i in "${arr[#]}"; do grep -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 3m16.310s
user 2m9.645s
sys 0m53.092s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
First upgrade of grep to end grep after first match -m 1, I am assuming that interest ID is unique. This narrow down by 50% of complexity and takes approx 1 min in user space:
$ time for i in "${arr[#]}"; do grep -m 1 -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 1m19.325s
user 0m55.844s
sys 0m21.260s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
These solutions are linearly dependent on number of elements. Call n times grep on huge file.
Now let's implement in AWK only for one run, I am exporting IDarray into input file so I can process in one run. I am loading big file into associative array per ID and then looping 1x through You array of IDs to search. This is generic scenario where You can define regexp and number of lines after to print. This has complexity with only one run through file + N comparisons. This is 2000% speed up:
$ for i in "${arr[#]}"; do echo $i; done > IDarray.txt
$ time awk '
(FNR==NR) && (linesafter-- > 0) { arr[interest]=arr[interest] RS $0; next; }
(FNR==NR) && /^#/ { interest=$1; arr[interest]=$0; linesafter=3; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m7.044s
user 0m6.628s
sys 0m0.307s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
As in Your title If You really can confirm that every fourth line is id of interest and three lines after are about to be printed. You can simplify into this and speed up by another 20%:
$ for i in "${arr[#]}"; do echo $i; done > IDarray.txt
$ time awk '
(FNR==NR) && (FNR%4==1) { interest=$1; arr[interest]=$0; next; }
(FNR==NR) { arr[interest]=arr[interest] RS $0; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m5.944s
user 0m5.593s
sys 0m0.242s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
On 1.5 GB file with 999 elements to search time is:
real 1m4.333s
user 0m59.491s
sys 0m3.460s
So per my predictions on my machine Your 15 GB example with 10k elements would take approx 16 minutes in user space to process.

lcov remove option is not removing coverage data as expected

I'm using lcov to generate coverage reports. I have a tracefile (broker.info) with this content (relevant fragment shown):
$ lcov -r broker.info
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/orionTypes/]
EntityTypeResponse_test.cpp | 100% 11| 100% 6| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/parse/]
CompoundValueNode_test.cpp | 100% 82| 100% 18| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/rest/]
OrionError_test.cpp |92.1% 38| 100% 6| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/serviceRoutines/]
badVerbAllFour_test.cpp | 100% 24| 100% 7| - 0
...
I want to remove all the info corresponding to test/unittest files.
I have attemped to use the -r option which, according to man page is:
-r tracefile pattern
--remove tracefile pattern
Remove data from tracefile.
Use this switch if you want to remove coverage data for a particular set of files from a tracefile. Additional command line parameters will be interpreted as
shell wildcard patterns (note that they may need to be escaped accordingly to prevent the shell from expanding them first). Every file entry in tracefile
which matches at least one of those patterns will be removed.
The result of the remove operation will be written to stdout or the tracefile specified with -o.
Only one of -z, -c, -a, -e, -r, -l, --diff or --summary may be specified at a time.
Thus, I'm using
$ lcov -r broker.info 'test/unittests/*' -o broker.info2
As far as I understand test/unittest/* matches the files under test/unittest. However, it's not working (note Deleted 0 files below):
Reading tracefile broker.info
Deleted 0 files
Writing data to broker.info2
Summary coverage rate:
lines......: 92.6% (58313 of 62978 lines)
functions..: 96.0% (6451 of 6718 functions)
branches...: no data found
I have tried also this variants (same result):
$ lcov -r broker.info "test/unittests/*" -o broker.info2
$ lcov -r broker.info "test/unittests/\*" -o broker.info2
$ lcov -r broker.info "test/unittests" -o broker.info2
So, maybe I'm doing something wrong?
I'm using lcov version 1.13 (just in case the data is relevant)
Thanks!

I have been testing another options and the following one seems to work, using the wildcard in the prefix also:
$ lcov -r broker.info "*/test/unittests/*" -o broker.info2
Maybe it is something new in version 1.13 because in version 1.11 it seems it works without wildcard in the prefix...

The below mentioned lcov command is working fine, even with wild characters (lcov 1.14):
lcov --remove meson-logs/coverage.info '/home/builduser/external/*' '/home/builduser/unittest/*' -o meson-logs/sourcecoverage.info

tar pre-run to evaluate expected size or amount of files

The problem:
I have a back-end process that at some point he collect and build a big tar file.
This tar receive few directories and an exclude files.
the process can take up to few minutes and i want to report in my front-end process (GUI) about the progress of the taring process (This is a big issue for a user that press download button and it seems like nothing is happening...).
i know i can use -v -R in the tar command and count files and size progress but i am looking for some kind of tar pre-run mode / dry run to help me evaluate either the expected number of files or the expected tar size.
the command I am using: tar -jcf 'FILE.tgz' 'exclude_files' 'include_dirs_and_files'
10x for everyone who is willing to assist.

You can pipe the output to the wc tool instead of actually making a file.
With file listing (verbose):
[git#server]$ tar czvf - ./test-dir | wc -c
./test-dir/
./test-dir/test.pdf
./test-dir/test2.pdf
2734080
Without:
[git#server]$ tar czf - ./test-dir | wc -c
2734080

Why don't you run a
DIRS=("./test-dir" "./other-dir-to-test")
find ${DIRS[#]} -type f | wc -l
beforehand. This gets all the files (-type f) one per line and counts the number of files. DIRS is an array in bash, so you can store the folders in a variable
If you want to know the size of all the stored files, you can use du
DIRS=("./test-dir" "./other-dir-to-test")
du -c -d 0 ${DIRS[#]} | tail -1 | awk -F ' ' '{print $1}'
This prints the disk usage with du, calculates a grand total (-c flag), gets the last line (example 4378921 total), and uses just the first column with awk

Extracting n rows of text from a large csv file

I have a CSV file (foo.csv) with 200,000 rows. I need to break it into four files (foo1.csv, foo2.csv... etc.) with 50,000 rows each.
I already tried simple ctrl-v/-c using gui text editors, but the my computer slows to a halt.
What unix command(s) could I use to accomplish this task?

I don't have a terminal handy to try it out, but it should be just split -d -l 50000 foo.csv.
Hopefully the naming isn't terribly important because with the -d option, the output files will be named foo.csv00 .. foo.csv03. You can add the -a 1 option so that the suffixes are 0-3, but there's no simple way to get the suffix to be injected into the middle of the filename.

you should use head and tail.
head -n 50000 myfile > part1.csv
head -n 100000 myfile | tail -n 50000 > part2.csv
head -n 150000 myfile | tail -n 50000 > part3.csv
etc ...
Else, but with no control on file names, you can use unix command split.

sed -n 2000,4000p somefile.txt
will print from lines 2000 to 4000 to stdout.

split -l50000 foo.csv

You can use sed

I wrote this little shell script for this topic very similar at yours.
This shell script + awk works fine for me:
#!/bin/bash
awk -v initial_line=$1 -v end_line=$2 '{
if (NR >= initial_line && NR <= end_line)
print $0
}' $3
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu#debian5:~$./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are the expected :-)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart