grepping either of the specific formats from a file - grep

I am trying to extract a timestamp (e.g. Dec 16 16:07:36 2020 GMT) from a text file.
The text is prefixed by either UTCTIME: or GENERALIZEDTIME:.
I can use grep -E 'UTCTIME.*GMT' or grep -E 'GENERALIZEDTIME.*GMT' to get the line from the file.
But I cannot be sure whether the file contains UTCTIME: or GENERALIZEDTIME:.
So the question is: is there a single command that checks whether a line contains either of those formats and prints it?
I tried grep -E 'UTCTIME.*GMT\|GENERALIZEDTIME.*GMT', but it does not print anything.
Sample format of the text:
some content
object: signingTime (1.2.840.113549.1.9.5)
set:
UTCTIME:Dec 16 16:07:36 2020 GMT
some content
OR
some content
object: signingTime (1.2.840.113549.1.9.5)
set:
GENERALIZEDTIME:Dec 16 16:07:36 2020 GMT
some content
Last resort: even a command that matches signingTime and extracts the line two lines below it would do, but it would be better to match the target line directly.
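For what it's worth, the attempted command prints nothing because under grep -E (extended regex) the alternation bar must be unescaped; written as \| it matches a literal pipe character instead. A minimal sketch using sample lines from the question:

```shell
# ERE alternation: the bar is unescaped under grep -E
printf 'UTCTIME:Dec 16 16:07:36 2020 GMT\n' |
  grep -E '(UTCTIME|GENERALIZEDTIME).*GMT'
printf 'GENERALIZEDTIME:Dec 16 16:07:36 2020 GMT\n' |
  grep -E '(UTCTIME|GENERALIZEDTIME).*GMT'
# Both commands print their input line; with '\|' neither would match.
```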

Why does Tika 2.1 app ignore text in a .txt file?

I am processing a file using Tika 2.1 from the command line under Ubuntu 20.04, using the following command:
java -jar tika-app-2.1.0.jar -t test.txt
The file is a pure ANSI text file (all the chars are 0x00 thru 0x7F). As hard as this is to believe, the Tika 2.1 app is dropping the leading lines when a specific string is present in the text file.
Here is the text file:
From:
Sent:
text
this is a test
testing
next
last
And here is the output:
this is a test
testing
next
last
To show that this is a pure ANSI text file with no formatting, no Unicode 2-bytes sequences, etc., here is the output of the 'od' command:
0000000 7246 6d6f 0d3a 530a 6e65 3a74 0a0d 6574
0000020 7478 0a0d 0a0d 6874 7369 6920 2073 2061
0000040 6574 7473 0a0d 6574 7473 6e69 2067 0a0d
0000060 0a0d 656e 7478 0a0d 616c 7473
However, if I simply change the "Sent:" to "sent:" the output is:
From:
sent:
text
this is a test
testing
next
last
I've been troubleshooting this issue and do not see the connection. If I append "Sent:" to the first line:
From: Sent:
Sent:
text
this is a test
testing
next
last
The results are:
this is a test
testing
next
last
But if I alter "Sent:" to be "\Sent" on the second line, I get this output:
From: Sent:
\Sent:
text
this is a test
testing
next
last
And this file:
From: Sent:
Sent:
Sent:
text
this is a test
testing
next
last
Results in this output:
this is a test
testing
next
last
But if I place "Sent:" in the first line, or a simple CRLF (0d 0a) as the first two bytes, the output is fine. Why does the start of the second line seem to matter, and why do other capitalizations work but not "Sent:"? Why does preceding "Sent" with a "\" make it work? I've also tried this on different machines - one running Ubuntu 18.04, and running the jar on a Windows 10 system - both with the same results.
What is going on with the basic Tika response to a very simple text file? I have not altered the jar in any way. This is the jar file as downloaded from the Apache Tika site. What am I missing?
Any information is very much appreciated.
Tika is interpreting the text as an email. This specific example was a text extraction of an email and includes certain keywords (e.g. "From:" and "Sent:") in specific positions. That is why, when other characters are added at the start of the file, Tika falls back to interpreting it as a plain text file.
I had thought that interpretation was based first on the ".txt" extension and then on analysis of the content (which in this case carries no metadata). But that does not appear to be the case here: content analysis comes first, before the ".txt" extension is considered.
The example was being run through Tika running as a server. Going forward I will use the Tika API and follow the suggestions provided by the commenter (#Gagravarr): skip the call to AutoDetectParser, set the content type property on the metadata, and call DefaultParser, all via the API.
Thanks to #Gagravarr for finding the solution.

Git diff: How to ignore starting space of an empty line?

I'm an iOS developer. When I press Enter, Xcode automatically indents the new line with 4 spaces, which is convenient while developing.
But when it comes to git diff, every such "empty" line is marked in red.
This can be annoying in team development.
So how do I deal with it?
Thanks in advance!
Use this when diffing:
git diff -w   # i.e. --ignore-all-space
You can create an alias for this so you will not have to type it every time.
git config --global alias.NAME 'diff --ignore-space-change'
git diff
-b / --ignore-space-change
Ignore changes in amount of whitespace.
This ignores whitespace at line end, and considers all other sequences of one or more whitespace characters to be equivalent.
-w / --ignore-all-space
Ignore whitespace when comparing lines.
This ignores differences even if one line has whitespace where the other line has none.
--ignore-blank-lines
Ignore changes whose lines are all blank.
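The -b/-w distinction also exists in plain GNU diff, which makes it easy to try outside a repository. A small sketch (the file names and contents are made up):

```shell
# Three one-line files that differ only in whitespace
printf 'x=1;\n'     > a.c   # no spaces
printf 'x = 1;\n'   > b.c   # spaces where a.c has none
printf 'x  =  1;\n' > c.c   # more spaces in the same places as b.c

diff -b b.c c.c && echo same      # -b: change in *amount* of whitespace is ignored
diff -b a.c b.c || echo differ    # -b: whitespace where the other side has none still counts
diff -w a.c b.c && echo same      # -w: all whitespace differences are ignored
```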
With Git 2.25 (Q1 2020), three+ years later, the "diff" machinery learned not to lose added/removed blank lines in the context when --ignore-blank-lines and --function-context are used at the same time.
So your intermediate 4 empty lines won't show up.
But in the context of functions, they might.
See commit 0bb313a (05 Dec 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit f0070a7, 16 Dec 2019)
xdiff: unignore changes in function context
Signed-off-by: René Scharfe
Changes involving only blank lines are hidden with --ignore-blank-lines, unless they appear in the context lines of other changes.
This is handled by xdl_get_hunk() for context added by --inter-hunk-context, -u and -U.
Function context for -W and --function-context added by xdl_emit_diff() doesn't pay attention to such ignored changes; it relies fully on xdl_get_hunk() and shows just the post-image of ignored changes appearing in function context.
That's inconsistent and confusing.
Improve the result of using --ignore-blank-lines and --function-context together by fully showing ignored changes if they happen to fall within function context.

AWK Avoid Reformatting of date-like values

This is the issue:
I give AWK a comma-delimited table as input (specifying FS=","), take the average of the 2nd-3rd columns and of the 4th-5th columns, and print: first column value \t average1 \t average2 \n
BUT
the first column holds gene names, and some of them look like dates,
AND
when I print, these names change; for example "Sept15" changed to "15-Sep", and I want to avoid this
awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{print $1,($2+$3)/2,($4+$5)/2}}' DESeqResults.csv | grep Sep
Even when using printf("%s"):
awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{printf("%s\t%d\t%d\n",$1,($2+$3)/2,($4+$5)/2) }}' DESeqResults.csv | grep Sep
I thought that using printf instead of just print would work, but it didn't. I'm pretty sure something happens when the value is read and preprocessed, because I printed out only that column and nothing else (using both print and printf), and the value is already changed.
awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{print $1}}' DESeqResults.csv | grep Sep
AWK Version:
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
Here is a sample of the table:
genes,Tet2/3_Ctrtl_A,Tet2/3_Ctrtl_B,Tet2/3_DKO_A,Tet2/3_DKO_B,baseMean,baseMean_Tet2/3_Ctrtl,baseMean_Tet2/3_DKO,foldChange(Tet2/3_DKO/Tet2/3_Ctrtl),log2FoldChange,pval,padj
Sep15,187.0874494,213.5411848,289.6434172,338.0423229,1376.196203,926.4220733,1825.970332,1.970991824,0.978921792,5.88E-05,0.003018514
Psmb2,399.4650982,355.9642309,557.3871013,632.1236546,1462.399465,983.7201408,1941.078789,1.973202244,0.980538833,6.00E-05,0.003071175
Sept1,144.2402924,114.9623101,52.39183843,18.11079498,386.2494712,579.8722584,192.6266841,0.332188135,-1.58992755,0.000418681,0.014756367
Psmd8,101.3085151,68.51270408,140.650979,154.2588735,627.727588,396.4360624,859.0191136,2.166854116,1.115602027,0.000421417,0.014825295
Sepw1,388.2193716,337.7605508,209.8232326,155.9087497,639.6596557,787.1262578,492.1930536,0.625303817,-0.677370771,0.004039946,0.080871288
Cks1b,265.8259249,287.954538,337.1108392,408.0547432,865.5821999,642.8510296,1088.31337,1.692948008,0.759537668,0.004049464,0.0809765
Sept2,358.4252141,302.9219723,393.3509343,394.2208442,4218.71214,3392.272118,5045.152161,1.48724866,0.572645878,0.004380269,0.085547008
Tuba1a,19.47153869,11.1692256,40.09945086,28.7539846,142.1610148,75.37000403,208.9520256,2.772349933,1.47110937,0.004381599,0.085547008
Sepx1,14.5941944,15.37680483,53.70015607,105.5523799,157.8475412,40.73526884,274.9598136,6.749920191,2.754870444,0.010199249,0.153896056
Apc,10.90608004,13.56070852,6.445046152,4.536589807,363.4471652,466.2312058,260.6631245,0.559085538,-0.838859068,0.010251083,0.154555416
Sephs2,38.20092337,29.90249614,41.38713976,60.29027195,328.8398211,228.5362706,429.1433717,1.877791086,0.909036565,0.088470061,0.590328676
2310008H04Rik,12.72162335,13.98659226,17.77340283,16.88409867,175.2157133,133.5326829,216.8987437,1.624312033,0.699828804,0.088572283,0.590803249
Sepn1,16.26472482,11.00430796,7.219301889,7.109776037,119.8773488,144.9435253,94.81117235,0.654124923,-0.612361911,0.129473781,0.719395557
Fancc,6.590254663,5.520421849,8.969058939,8.394987722,111.479883,79.97866541,142.9811007,1.787740518,0.838137351,0.129516654,0.719423355
Sept7,170.6589676,187.3808346,185.8091089,158.0134115,1444.411676,1313.631233,1575.192119,1.199112871,0.261967464,0.189661613,0.852792911
Obsl1,1.400612677,0.51329399,0.299847728,0.105245908,10.77805777,17.15978377,4.396331776,0.256199719,-1.964659203,0.189677412,0.852792911
Sepp1,136.2725767,142.7392758,137.5079558,135.5576156,1055.39992,948.5532274,1162.246613,1.225283494,0.293115585,0.193768055,0.862790863
Tom1l2,6.079259794,5.972711213,4.188234003,1.879086398,93.62018078,115.620636,71.61972551,0.619437221,-0.690970019,0.193795263,0.862790863
Sept10,5.07506603,4.240574236,7.415271602,7.245735277,56.38191446,38.04292126,74.72090766,1.964121187,0.973883947,0.202050794,0.874641256
Jag2,0.531592511,1.753353521,0.106692242,0.099863326,7.812876603,14.01922398,1.606529221,0.114594732,-3.125387366,0.202074037,0.874641256
Sept9,25.71885843,9.170659969,29.98187141,23.5519093,333.6707351,231.1780024,436.1634678,1.8866997,0.915864812,0.227916377,0.920255208
Mad2l2,22.00853798,17.42180189,30.74357865,21.99530555,98.71951578,74.31522721,123.1238044,1.656777608,0.72837996,0.227920237,0.920255208
Sept8,3.128945597,4.413675869,1.658838722,1.197769008,38.73123291,52.59586062,24.8666052,0.472786354,-1.080739698,0.237101573,0.929055595
BC018465,1.974718423,2.171073663,0.264221349,0.123654833,5.802858162,10.40514412,1.200572199,0.115382563,-3.115502877,0.237135522,0.929055595
Sept11,51.69299305,57.36531814,51.69117677,51.61623861,915.6234052,837.2625097,993.9843007,1.187183576,0.247543039,0.259718041,0.949870478
Ccnc,11.42168015,13.32308428,14.76060133,12.19352385,173.0536821,146.6301746,199.4771895,1.36041023,0.444041759,0.259794956,0.949870478
Sept12,0,5.10639021,0,0.158638685,5.07217061,9.738384198,0.405957022,0.041686281,-4.584283515,0.388933297,1
Gclc,24.79641294,20.9904856,13.36470176,15.92090715,146.8502169,163.0012707,130.6991632,0.801829106,-0.318633307,0.3890016,1
Sept14,0.15949349,1.753526538,0,0,2.425489894,4.850979788,0,0,#NAME?,0.396160673,1
Slc17a1,0.131471208,1.445439884,0,0,2.425489894,4.850979788,0,0,#NAME?,0.396160673,1
Sept6,34.11050622,30.16102302,28.2562382,14.56889172,602.5658704,661.8163161,543.3154247,0.820945951,-0.284640854,0.416246976,1
Unc119,6.098478253,9.710512531,4.558282355,1.738214353,23.04654843,30.90026472,15.19283214,0.491673203,-1.024228366,0.416259755,1
Sept4,2.305246374,2.534467513,1.18972284,0.618652085,8.87244411,12.13933481,5.605553408,0.461767757,-1.114760654,0.560252893,1
Ddb2,11.25366078,17.32172888,10.50269513,6.025122118,71.81085298,83.53254996,60.089156,0.719350194,-0.475233821,0.560482212,1
Sephs1,20.92060935,15.48240612,15.94132159,11.57137656,288.7538099,298.3521103,279.1555094,0.935657902,-0.095946952,0.568672243,1
BC021785,0.135120133,0.891334456,0.108476095,0.101533002,5.825443635,9.241093439,2.409793832,0.260769339,-1.939153843,0.568713405,1
Sepsecs,7.276880132,6.154194955,5.055549522,3.680417498,35.9322246,39.77711194,32.08733726,0.806678406,-0.309934458,0.673968316,1
Osbpl7,10.51628336,5.69720028,7.157857243,5.382675661,86.65916873,88.67338952,84.64494794,0.954569893,-0.06707726,0.674000752,1
Sept3,0.113880577,0.250408482,0.228561799,0.042786507,2.505996654,2.619498342,2.392494966,0.913340897,-0.13077466,1,1
Sept5,0.126649979,0,0.203352303,0,0.609528347,0.424441516,0.794615178,1.872142914,0.904690571,1,1
Serpina11,0,0,0.14524189,0,0.198653794,0,0.397307589,Inf,Inf,1,1
Transferring extensive comments into an answer
No; awk does not convert strings such as Sep15 to a date 15-Sep by default, even on a Mac. At least, not with the standard awk on Mac OS X 10.10.2 Yosemite, which I tested with, nor would I expect it to do so with any other variant of awk I've ever seen on a Mac.
[…time passed…] Somewhat to my surprise, I have gawk installed and it is GNU Awk 3.1.7 Copyright (C) 1989, 1991-2009 Free Software Foundation. (I knew I had gawk installed, but I was expecting it to be a 4.x version.) Given your data on my Mac, the output of your first awk (gawk) command on the data you gave does no mapping whatsoever on the first column. If you subsequently import the data into a spreadsheet, the spreadsheet could do all sorts of transformations, but that isn't awk's fault.
You mentioned Mac in one of your early comments; are you using Mac OS X? I am not expecting to find any problem outside the spreadsheet. If the data is imported to a spreadsheet, then I won't be surprised to find the 'date-like' values in column 1 are reformatted.
I tried importing the CSV from the data in the question into LibreOffice (4.4.1.2, or 4.4.1002, depending on where you look for the version number), and no transformation occurred on the data in column 1. Similarly, Numbers 3.5.2 and OpenOffice 4.1.1 both leave the keys starting 'Sep' alone.
Unfortunately, MS Excel (for Mac 2011, version 14.4.8 — 150116) translates such column values to a date (so Sep15 becomes 15-Sep, for example). Even embedding the column in double quotes does not help. I don't have a good solution other than "do not use MS Excel".
There probably is a way to suppress the behaviour, but you need to ask a question tagged excel and csv rather than awk and printing and printf.
Incidentally, a Google search on 'excel csv import force text' turns up Stop Excel from automatically converting certain text values to dates? Some of the techniques outlined there (notably the "rename the file from .csv to .txt" technique) work.
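To see for yourself that awk leaves the "date-like" keys untouched, here is a self-contained version of the pipeline from the question, run on a couple of the sample rows inlined via printf:

```shell
# awk passes $1 through verbatim; no date conversion takes place
printf 'genes,a,b,c,d\nSep15,1,2,3,4\nSept1,5,6,7,8\n' |
  awk 'BEGIN{FS=",";OFS="\t"} NR==1{next} {print $1,($2+$3)/2,($4+$5)/2}'
# Sep15   1.5     3.5
# Sept1   5.5     7.5
```

Any 15-Sep you see later comes from whatever opened the output afterwards, not from this step.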

extract a line from a file using csh

I am writing a csh script that will extract a line from a file xyz.
The xyz file contains a number of lines of code, and the line I am interested in appears after the first 2-3 lines of the file.
I tried the following code:
set product1 = `grep -e '<product_version_info.*/>' xyz`
I want the script, as soon as it finds that line, to save it in a variable as a string and stop reading the file immediately, i.e. it should not read any further after extracting the line.
Please help!
grep has an -m or --max-count flag that tells it to stop after a specified number of matches. Hopefully your version of grep supports it.
set product1 = `grep -m 1 -e '<product_version_info.*/>' xyz`
From the man page linked above:
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines. If the input is
standard input from a regular file, and NUM matching lines are
output, grep ensures that the standard input is positioned to
just after the last matching line before exiting, regardless of
the presence of trailing context lines. This enables a calling
process to resume a search. When grep stops after NUM matching
lines, it outputs any trailing context lines. When the -c or
--count option is also used, grep does not output a count
greater than NUM. When the -v or --invert-match option is also
used, grep stops after outputting NUM non-matching lines.
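A quick self-contained illustration of -m 1 returning only the first match (the tag contents here are made up for the example):

```shell
# grep -m 1 stops reading after the first matching line
printf 'header\n<product_version_info v1/>\nmore\n<product_version_info v2/>\n' |
  grep -m 1 -e '<product_version_info.*/>'
# <product_version_info v1/>
```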
As an alternative, you can always use the command below to just check the first few lines (since the line always occurs in the first 2-3 lines):
set product1 = `head -3 xyz | grep -e '<product_version_info.*/>'`
I think you're asking to return the first matching line in the file. If so, one solution is to pipe the grep result to head:
set product1 = `grep -e '<product_version_info.*/>' xyz | head -1`

How to print Java code on A3 page avoiding line-wrapping

I have to print Java code that sometimes reaches 300 columns (characters per line) on A3 paper, and whatever editor I use (e.g. TextMate) wraps the lines to fit A4 paper.
Any suggestions?
cheers,
Asterios
Your editor undoubtedly has either a Page Setup dialog or a Preferences dialog as part of the Print dialog that will allow you to set the paper size to use for printing.
Even Notepad supports this.
I finally managed to print it using enscript. Here is the command I used to turn the Java code into a PDF (and then printed the PDF):
enscript -r -Ejava -M A3 -C -f "Courier8" input.java -o - | ps2pdf - output.pdf
where:
-r prints in landscape mode
-C prints the line numbers
-f changes the font and size
-M sets the output media to A3 (default is A4)
-Ejava adds syntax highlighting (you can also use --color if you need colors in the highlighting, but they do not print nicely in greyscale)
It seems unlikely that every editor tries to format for A4. Which other editors have you tried? Does TextMate not have a page size option? (Hmm... seems not.)
Try a different editor that does let you set the page size. Word, even.