Printing long integers in awk
I have a pipe-delimited feed file with several fields. Since I only need a few of them, I thought of using awk to capture them for my testing purposes. However, I noticed that printf changes the value if I use "%d"; it works fine if I use "%s".
Feed File Sample:
[jaypal:~/Temp] cat temp
302610004125074|19769904399993903|30|15|2012-01-13 17:20:02.346000|2012-01-13 17:20:03.307000|E072AE4B|587244|316|13|GSM|1|SUCC|0|1|255|2|2|0|213|2|0|6|0|0|0|0|0|10|16473840051|30|302610|235|250|0|7|0|0|0|0|0|10|54320058002|906|722310|2|0||0|BELL MOBILITY CELLULAR, INC|BELL MOBILITY CELLULAR, INC|Bell Mobility|AMX ARGENTINA SA.|Claro aka CTI Movil|CAN|ARG|
I am interested in capturing the second column which is 19769904399993903.
Here are my tests:
[jaypal:~/Temp] awk -F"|" '{printf ("%d\n",$2)}' temp
19769904399993904 # Value is changed
However, the following two tests work fine:
[jaypal:~/Temp] awk -F"|" '{printf ("%s\n",$2)}' temp
19769904399993903 # Value remains same
[jaypal:~/Temp] awk -F"|" '{print $2}' temp
19769904399993903 # Value remains same
So is this a limitation of "%d", that it cannot handle long integers? If that's the case, why does it add one to the number instead of, say, truncating it?
I have tried this with BSD and GNU versions of awk.
Version Info:
[jaypal:~/Temp] gawk --version
GNU Awk 4.0.0
Copyright (C) 1989, 1991-2011 Free Software Foundation.
[jaypal:~/Temp] awk --version
awk version 20070501
Starting with GNU awk 4.1, you can use --bignum or -M:
$ awk 'BEGIN {print 19769904399993903}'
19769904399993904
$ awk --bignum 'BEGIN {print 19769904399993903}'
19769904399993903
(See the Command-Line Options section of the GNU awk manual.)
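The same flag fixes the original printf test (a sketch, assuming a gawk built with the MPFR/GMP support that -M requires; with -M, integer-looking input is kept as an arbitrary-precision integer):

$ gawk -M -F"|" '{printf ("%d\n",$2)}' temp
19769904399993903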
I believe the underlying numeric format in this case is an IEEE 754 double, so the changed value is a result of floating-point precision loss. If it is actually necessary to treat the large values as numbers and to maintain exact precision, it might be better to use something like Perl, Ruby, or Python, which have the capability (possibly via extensions) to handle arbitrary-precision arithmetic.
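A minimal sketch of the effect, runnable with any double-based awk (i.e., without -M): a double has a 53-bit significand, so 2^53 + 1 is the first integer it cannot store, and the value from the question needs 55 bits, so it gets rounded to the nearest representable neighbour:

awk 'BEGIN {
    printf "%d\n", 2^53                # 9007199254740992 -- still exact
    printf "%d\n", 2^53 + 1            # 9007199254740992 -- 2^53+1 has no double representation
    printf "%d\n", 19769904399993903   # 19769904399993904 -- needs 55 bits, rounds up by 1
}'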
UPDATE: Recent versions of GNU awk support arbitrary precision arithmetic. See the GNU awk manual for more info.
ORIGINAL POST CONTENT:
XMLgawk supports arbitrary precision arithmetic on floating-point numbers.
So, if installing xgawk is an option:
zsh-4.3.11[drado]% awk --version |head -1; xgawk --version | head -1
GNU Awk 4.0.0
Extensible GNU Awk 3.1.6 (build 20080101) with dynamic loading, and with statically-linked extensions
zsh-4.3.11[drado]% awk 'BEGIN {
x=665857
y=470832
print x^4 - 4 * y^4 - 4 * y^2
}'
11885568
zsh-4.3.11[drado]% xgawk -lmpfr 'BEGIN {
MPFR_PRECISION = 80
x=665857
y=470832
print mpfr_sub(mpfr_sub(mpfr_pow(x, 4), mpfr_mul(4, mpfr_pow(y, 4))), 4 * y^2)
}'
1.0000000000000000000000000
This question was already partially answered by @Mark Wilkins and @Dennis Williamson, but I found out that the largest integer that can be handled without losing precision is 2^53 (a double's significand is 53 bits wide).
See, e.g., gawk's reference manual:
http://www.gnu.org/software/gawk/manual/gawk.html#Integer-Programming
(Sorry if my answer is too old; I figured I'd still share for the next person, before they spend too much time on this like I did.)
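You can verify that limit empirically with a quick probe (a sketch for any awk that uses IEEE doubles internally): keep doubling until n + 1 can no longer be distinguished from n:

awk 'BEGIN { n = 1; e = 0; while ((n + 1) - n == 1) { n *= 2; e++ } printf "2^%d = %.0f\n", e, n }'
2^53 = 9007199254740992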
You're running into awk's floating-point representation issues. I don't think you can find a workaround within the awk framework to perform arithmetic on huge numbers accurately.
The only possible (and crude) way I can think of is to break the huge number into smaller chunks, perform your math, and join them again (a rough sketch of the idea follows); or, better yet, use Perl/PHP/Tcl/bash etc., scripting languages that are more powerful than awk.
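For what it's worth, here is a minimal sketch of that chunking idea, carrying out addition digit by digit on the string representation so no intermediate value ever exceeds what a double holds exactly (bigadd is a hypothetical helper, not a built-in):

awk 'BEGIN { print bigadd("19769904399993903", "99") }   # 19769904399994002

function bigadd(a, b,    la, lb, n, i, da, db, s, carry, out) {
    la = length(a); lb = length(b)
    n = (la > lb) ? la : lb
    carry = 0; out = ""
    for (i = 1; i <= n; i++) {
        da = (i <= la) ? substr(a, la - i + 1, 1) : 0   # i-th digit from the right
        db = (i <= lb) ? substr(b, lb - i + 1, 1) : 0
        s = da + db + carry
        out = (s % 10) out          # prepend the result digit
        carry = int(s / 10)
    }
    return carry ? carry out : out
}'

The same scheme extends to subtraction and multiplication, though at that point a language with native bignums is the saner choice.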
Using nawk on Solaris 11, I convert the number to a string by concatenating an empty string to the end, and then use %15s as the format string:
printf("%15s\n", bignum "")
Another caveat about precision: the errors pile up with extra operations:
echo 19769904399993903 | mawk2 '{ CONVFMT = "%.2000g";
OFMT = "%.20g";
} {
print;
print +$0;
print $0/1.0
print $0^1.0;
print exp(-log($0))^-1;
print exp(1*log($0))
print sqrt(exp(exp(log(20)-log(10))*log($0)))
print (exp(exp(log(6)-log(3))*log($0)))^2^-1
}'
19769904399993903
19769904399993904
19769904399993904
19769904399993904
19769904399993912
19769904399993908
19769904399993628 <<< off by -275
19769904399993768 <<< off by -135
The first few are only off by less than 10, but the last two expressions have triple-digit deltas.
For any of the versions that require calling helper math functions, simply passing the -M bignum flag is insufficient; one must also set the PREC variable. For this example, setting PREC=64 and OFMT="%.17g" should suffice. Beware of setting OFMT too high relative to PREC, otherwise you'll see oddities like this:
gawk -M -v PREC=256 -e '{ CONVFMT="%.2000g"; OFMT="%.80g";... } '
19769904399993903
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
This is because 80 significant digits require a precision of at least 80 × log2(10) ≈ 265.75, i.e. 266 bits. But gawk is fast enough that you can probably safely pre-set PREC=4096 or 8192 instead of having to worry about it every time.
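That 265.75 figure is just digits × log2(10); you can compute the minimum PREC for any OFMT width the same way (an illustrative one-liner, nothing more):

awk -v digits=80 'BEGIN { b = digits * log(10) / log(2); printf "%.2f bits -> PREC >= %d\n", b, int(b) + 1 }'
265.75 bits -> PREC >= 266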
Related
Correct locale setting for Devanagari Unicode text
The following output is wrong. There should be only 1 word returned instead of 2:

$ echo 'उद्योजकता' | grep -o -E '\w+'
उद
योजकता

I have been told this is due to locale settings. I have checked it on 2 different servers with 2 different OSes and the results are the same.

Ubuntu:

$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

AWS EC2:

# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I am not sure which locale setting should be selected to get the Devanagari Unicode text to break only at spaces.
Edited to add: You can use something like this instead if the grep on your machine supports Perl-compatible regular expressions (PCRE). This should be the case, e.g., on Amazon Linux.

echo 'उद्योजकता' | grep -o -P '[\w\pL\pM]+'

which will match "word" characters OR "Letter" code points OR "Mark" code points, or

echo 'उद्योजकता' | grep -o -P '(*UCP)[\w\pM]+'

which will enable Unicode character property (UCP) matching so that \w matches all letters and numbers no matter the script. But you must still include \pM in the pattern because \w simply does not match "Mark" code points in grep; they are not alphanumeric code points.

Be careful with the above! I don't know the Devanagari script, so I don't know whether it is appropriate to consider all such "Mark" characters as part of a word for your purposes. It might be that a narrower set such as \p{Mn} (for non-spacing marks) is more appropriate for your needs, or perhaps there are only a few specific points to include, in which case you'd need to select them individually in your pattern.

In the old days, \w just meant [A-Za-z0-9_]. It was oriented around ASCII and C code. Today, in typical interpretations, it still means "alphanumeric and underscore", which can vary depending on locale.

You say that the output is "wrong", but I'm afraid "wrong" depends on which regular expression engine you are using. So even though you are using "grep", the question is which grep, on which OS, etc.

Your input contains U+094D, as far as I can tell, which is not a "Letter" according to the Unicode character definition (at least, not per the link above). It is a "Mark". There is a Unicode document (recommendation) which describes how an engine might define \w to be Unicode-smart, and indeed it suggests including Mark code points in the match. So your expectation is natural in that sense. However, you can see from the same link that there is no way to do this and also be strictly POSIX-compliant at the same time, which lots of regex engines want to be. Wikipedia indicates that there are some engines which support the Unicode property definitions, but in general, grep isn't going to do it. I'm not familiar enough with those engines (Ruby, etc.) to say exactly how you should attempt the same thing on the command line as you are trying to do with grep.
The macOS (11.2.3) man page for grep has this note at the bottom:

BUGS
The grep utility does not normalize Unicode input, so a pattern containing composed characters will not match decomposed input, and vice versa.
If you are okay with a solution for Devanagari text alone, these would help. As per Wikipedia, the Unicode range is U+0900 to U+097F. So, if your shell supports the $'...' form, you can use:

$ echo 'उद्योजकता' | grep -oE $'[\u0900-\u097f]+'
उद्योजकता

If PCRE is available:

$ echo 'उद्योजकता' | grep -oP '[\x{900}-\x{97f}]+'
उद्योजकता

Use ripgrep for better Unicode support:

$ echo 'उद्योजकता' | rg -o '\w+'
उद्योजकता
Only output values within a certain range
I run a command that produces lots of lines in my terminal; the lines are floats. I only want certain numbers to be output as lines in my terminal. I know that I can pipe the results to egrep:

| egrep "(369|433|375|368)"

if I want only certain values to appear. But is it possible to show only lines whose value is within ±50 of 350 (for example)?
grep matches against string tokens, so you have to either:

1. figure out the right string match for the number range you want (e.g., for 300-400, you might do something like grep -E '[34]..', with appropriate additional context added to the expression and a number of additional .s equal to your floating-point precision), or
2. convert the number strings to actual numbers in whatever programming language you prefer to use and filter them that way.

I'd strongly encourage you to take the second option.
I would go with awk here:

./yourProgram | awk '$1>250 && $1<350'

e.g.

echo -e "12.3\n342.678\n287.99999" | awk '$1>250 && $1<350'
342.678
287.99999
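For the literal "within ±50 of 350" requirement from the question, the bounds can be passed in as variables (a sketch; ./yourProgram stands in for the real command):

./yourProgram | awk -v c=350 -v d=50 '$1 >= c - d && $1 <= c + d'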
AWK Avoid Reformatting of date-like values
This is the issue: I give AWK a comma-delimited table as input (specifying FS=","), take the average of columns 2-3 and of columns 4-5, and print first column value \t average1 \t average2 \n. BUT the first column holds gene names, and some of them look like dates, and when I print, these names change; for example, "Sept15" changed to "15-Sep", and I want to avoid this.

awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{print $1,($2+$3)/2,($4+$5)/2}}' DESeqResults.csv | grep Sep

Even when using printf("%s"):

awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{printf("%s\t%d\t%d\n",$1,($2+$3)/2,($4+$5)/2) }}' DESeqResults.csv | grep Sep

I thought that using printf instead of just print could work, but it didn't. And I'm pretty sure something is going on when reading the value and preprocessing it, because I printed out only that column and nothing else (using both print and printf), and the value was already changed.

awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{print $1}}' DESeqResults.csv | grep Sep

AWK version:

GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.

Here is a sample of the table:

genes,Tet2/3_Ctrtl_A,Tet2/3_Ctrtl_B,Tet2/3_DKO_A,Tet2/3_DKO_B,baseMean,baseMean_Tet2/3_Ctrtl,baseMean_Tet2/3_DKO,foldChange(Tet2/3_DKO/Tet2/3_Ctrtl),log2FoldChange,pval,padj
Sep15,187.0874494,213.5411848,289.6434172,338.0423229,1376.196203,926.4220733,1825.970332,1.970991824,0.978921792,5.88E-05,0.003018514
Psmb2,399.4650982,355.9642309,557.3871013,632.1236546,1462.399465,983.7201408,1941.078789,1.973202244,0.980538833,6.00E-05,0.003071175
Sept1,144.2402924,114.9623101,52.39183843,18.11079498,386.2494712,579.8722584,192.6266841,0.332188135,-1.58992755,0.000418681,0.014756367
Psmd8,101.3085151,68.51270408,140.650979,154.2588735,627.727588,396.4360624,859.0191136,2.166854116,1.115602027,0.000421417,0.014825295
Sepw1,388.2193716,337.7605508,209.8232326,155.9087497,639.6596557,787.1262578,492.1930536,0.625303817,-0.677370771,0.004039946,0.080871288
Cks1b,265.8259249,287.954538,337.1108392,408.0547432,865.5821999,642.8510296,1088.31337,1.692948008,0.759537668,0.004049464,0.0809765
Sept2,358.4252141,302.9219723,393.3509343,394.2208442,4218.71214,3392.272118,5045.152161,1.48724866,0.572645878,0.004380269,0.085547008
Tuba1a,19.47153869,11.1692256,40.09945086,28.7539846,142.1610148,75.37000403,208.9520256,2.772349933,1.47110937,0.004381599,0.085547008
Sepx1,14.5941944,15.37680483,53.70015607,105.5523799,157.8475412,40.73526884,274.9598136,6.749920191,2.754870444,0.010199249,0.153896056
Apc,10.90608004,13.56070852,6.445046152,4.536589807,363.4471652,466.2312058,260.6631245,0.559085538,-0.838859068,0.010251083,0.154555416
Sephs2,38.20092337,29.90249614,41.38713976,60.29027195,328.8398211,228.5362706,429.1433717,1.877791086,0.909036565,0.088470061,0.590328676
2310008H04Rik,12.72162335,13.98659226,17.77340283,16.88409867,175.2157133,133.5326829,216.8987437,1.624312033,0.699828804,0.088572283,0.590803249
Sepn1,16.26472482,11.00430796,7.219301889,7.109776037,119.8773488,144.9435253,94.81117235,0.654124923,-0.612361911,0.129473781,0.719395557
Fancc,6.590254663,5.520421849,8.969058939,8.394987722,111.479883,79.97866541,142.9811007,1.787740518,0.838137351,0.129516654,0.719423355
Sept7,170.6589676,187.3808346,185.8091089,158.0134115,1444.411676,1313.631233,1575.192119,1.199112871,0.261967464,0.189661613,0.852792911
Obsl1,1.400612677,0.51329399,0.299847728,0.105245908,10.77805777,17.15978377,4.396331776,0.256199719,-1.964659203,0.189677412,0.852792911
Sepp1,136.2725767,142.7392758,137.5079558,135.5576156,1055.39992,948.5532274,1162.246613,1.225283494,0.293115585,0.193768055,0.862790863
Tom1l2,6.079259794,5.972711213,4.188234003,1.879086398,93.62018078,115.620636,71.61972551,0.619437221,-0.690970019,0.193795263,0.862790863
Sept10,5.07506603,4.240574236,7.415271602,7.245735277,56.38191446,38.04292126,74.72090766,1.964121187,0.973883947,0.202050794,0.874641256
Jag2,0.531592511,1.753353521,0.106692242,0.099863326,7.812876603,14.01922398,1.606529221,0.114594732,-3.125387366,0.202074037,0.874641256
Sept9,25.71885843,9.170659969,29.98187141,23.5519093,333.6707351,231.1780024,436.1634678,1.8866997,0.915864812,0.227916377,0.920255208
Mad2l2,22.00853798,17.42180189,30.74357865,21.99530555,98.71951578,74.31522721,123.1238044,1.656777608,0.72837996,0.227920237,0.920255208
Sept8,3.128945597,4.413675869,1.658838722,1.197769008,38.73123291,52.59586062,24.8666052,0.472786354,-1.080739698,0.237101573,0.929055595
BC018465,1.974718423,2.171073663,0.264221349,0.123654833,5.802858162,10.40514412,1.200572199,0.115382563,-3.115502877,0.237135522,0.929055595
Sept11,51.69299305,57.36531814,51.69117677,51.61623861,915.6234052,837.2625097,993.9843007,1.187183576,0.247543039,0.259718041,0.949870478
Ccnc,11.42168015,13.32308428,14.76060133,12.19352385,173.0536821,146.6301746,199.4771895,1.36041023,0.444041759,0.259794956,0.949870478
Sept12,0,5.10639021,0,0.158638685,5.07217061,9.738384198,0.405957022,0.041686281,-4.584283515,0.388933297,1
Gclc,24.79641294,20.9904856,13.36470176,15.92090715,146.8502169,163.0012707,130.6991632,0.801829106,-0.318633307,0.3890016,1
Sept14,0.15949349,1.753526538,0,0,2.425489894,4.850979788,0,0,#NAME?,0.396160673,1
Slc17a1,0.131471208,1.445439884,0,0,2.425489894,4.850979788,0,0,#NAME?,0.396160673,1
Sept6,34.11050622,30.16102302,28.2562382,14.56889172,602.5658704,661.8163161,543.3154247,0.820945951,-0.284640854,0.416246976,1
Unc119,6.098478253,9.710512531,4.558282355,1.738214353,23.04654843,30.90026472,15.19283214,0.491673203,-1.024228366,0.416259755,1
Sept4,2.305246374,2.534467513,1.18972284,0.618652085,8.87244411,12.13933481,5.605553408,0.461767757,-1.114760654,0.560252893,1
Ddb2,11.25366078,17.32172888,10.50269513,6.025122118,71.81085298,83.53254996,60.089156,0.719350194,-0.475233821,0.560482212,1
Sephs1,20.92060935,15.48240612,15.94132159,11.57137656,288.7538099,298.3521103,279.1555094,0.935657902,-0.095946952,0.568672243,1
BC021785,0.135120133,0.891334456,0.108476095,0.101533002,5.825443635,9.241093439,2.409793832,0.260769339,-1.939153843,0.568713405,1
Sepsecs,7.276880132,6.154194955,5.055549522,3.680417498,35.9322246,39.77711194,32.08733726,0.806678406,-0.309934458,0.673968316,1
Osbpl7,10.51628336,5.69720028,7.157857243,5.382675661,86.65916873,88.67338952,84.64494794,0.954569893,-0.06707726,0.674000752,1
Sept3,0.113880577,0.250408482,0.228561799,0.042786507,2.505996654,2.619498342,2.392494966,0.913340897,-0.13077466,1,1
Sept5,0.126649979,0,0.203352303,0,0.609528347,0.424441516,0.794615178,1.872142914,0.904690571,1,1
Serpina11,0,0,0.14524189,0,0.198653794,0,0.397307589,Inf,Inf,1,1
Transferring extensive comments into an answer:

No; awk does not convert strings such as Sep15 to a date 15-Sep by default, even on a Mac. At least, not with the standard awk on Mac OS X 10.10.2 Yosemite, which I tested with, nor would I expect it to do so with any other variant of awk I've ever seen on a Mac. […time passed…] Somewhat to my surprise, I have gawk installed, and it is GNU Awk 3.1.7, Copyright (C) 1989, 1991-2009 Free Software Foundation. (I knew I had gawk installed, but I was expecting it to be a 4.x version.) Given your data on my Mac, your first awk (gawk) command does no mapping whatsoever on the first column.

If you subsequently import the data into a spreadsheet, the spreadsheet could do all sorts of transformations, but that isn't awk's fault. You mentioned Mac in one of your early comments; are you using Mac OS X? I am not expecting to find any problem outside the spreadsheet. If the data is imported into a spreadsheet, then I won't be surprised to find the 'date-like' values in column 1 reformatted.

I tried importing the CSV from the data in the question into LibreOffice (4.4.1.2, or 4.4.1002, depending on where you look for the version number), and no transformation occurred on the data in column 1. Similarly, Numbers 3.5.2 and OpenOffice 4.1.1 both leave the keys starting 'Sep' alone. Unfortunately, MS Excel (for Mac 2011, version 14.4.8 - 150116) translates such column values to a date (so Sep15 becomes 15-Sep, for example). Even embedding the column in double quotes does not help.

I don't have a good solution other than "do not use MS Excel". There probably is a way to suppress the behaviour, but you need to ask a question tagged excel and csv rather than awk and printing and printf. Incidentally, a Google search on 'excel csv import force text' turns up "Stop Excel from automatically converting certain text values to dates?". Some of the techniques outlined there (notably the "rename the file from .csv to .txt" technique) work.
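One workaround mentioned in that linked question, if the file must pass through Excel, is to emit the gene names as explicit Excel text formulas. A sketch of the OP's command with that change (verify against your Excel version before relying on it):

awk 'BEGIN{FS=","; OFS="\t"} NR>1 { print "=\"" $1 "\"", ($2+$3)/2, ($4+$5)/2 }' DESeqResults.csv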
How to make grep [A-Z] independent of locale?
I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:

$ echo T | grep [A-Z]

No match. How come T is not within the A-Z range? I changed the regex a tiny bit:

$ echo T | grep [A-Y]

A match! Whoa! How is T within A-Y but not within A-Z? Apparently this is because my environment is set to an Estonian locale, where Y is at the end of the alphabet but Z is somewhere in the middle:

ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY

$ echo $LANG
et_EE.UTF-8

This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all this time? What kinds of mistakes have I made because of this in the past? After trying several things, I arrived at the following solution:

$ echo T | LANG=C grep [A-Z]

Is this the recommended way to make grep locale-independent? Furthermore, would it be safe to define an alias like this:

$ alias grep="LANG=C grep"

PS. I'm also wondering why character ranges like [A-Z] are locale-dependent in the first place, while \w seems to be unaffected by locale (although the manual says \w is the equivalent of [[:alnum:]], I found out the latter depends on locale while \w does not).
POSIX regular expressions, which Linux and FreeBSD grep support natively, and some others support on request, have a series of [:xxx:] patterns that honor locales. See the man page for details.

grep '[[:upper:]]'

As the []s are part of the pattern name, you need the outer [] as well, regardless of how strange it looks. With the advent of these : classes, the classic \w etc. remain strictly in the C locale. Thus your choice of patterns determines whether grep uses the current locale or not. [A-Z] should follow the locale, but you may need to set LC_ALL rather than LANG, especially if the system sets LC_ALL to a different value for you.
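For example, in the Estonian-locale scenario above, both of these match T (the first forces the C locale for just the one command; the second uses the locale-aware class):

$ echo T | LC_ALL=C grep '[A-Z]'
T
$ echo T | grep '[[:upper:]]'
T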
How to generate random numbers under OpenWRT?
With a "normal" (i mean "full") linux distro, it works just fine: sleep $(echo "$[ ($RANDOM % 9 ) ]") ok, it waits for about 0-9 sec but under OpenWRT [not using bash, rather "ash"]: $ sleep $(echo "$[ ($RANDOM % 9 ) ]") sleep: invalid number '$[' $ and why: $ echo "$[ ($RANDOM % 9 ) ]" $[ ( % 9 ) ] $ So does anyone has a way to generate random numbers under OpenWRT, so i can put it in the "sleep"? Thank you
You might try something like this:

sleep `head /dev/urandom | tr -dc "0123456789" | head -c1`

which works on my WhiteRussian OpenWRT router. I actually don't know if this will always return a number, but when it does, it will always return 0-9, and only 1 digit (you could make it go up to 99 if you made the second head -c2). Good luck!
You could also use awk:

sleep $(awk 'BEGIN{srand();print int(rand()*9)}')
For some scenarios, this might not yield a sufficient diversity of answers. Another approach is to use /dev/urandom directly (see, e.g., https://www.2uo.de/myths-about-urandom/):

echo $(hexdump -n 4 -e '"%u"' </dev/urandom)

When using awk, note that awk uses the time of day as the seed (https://linux.die.net/man/1/awk). This might be relevant for scenarios where the time of day is reset (e.g., no battery-backed time-of-day clock) or synchronised across a fleet (e.g., a group restart).

srand([expr])
Uses expr as a new seed for the random number generator. If no expr is provided, the time of day is used. The return value is the previous seed for the random number generator.

This is confirmed by looking at the source in BusyBox (https://github.com/mirror/busybox/blob/master/editors/awk.c):

seed = op1 ? (unsigned)L_d : (unsigned)time(NULL);

At least for some versions of OpenWRT, it seems an explicit call to srand() is required to avoid obtaining the same answers repeatedly:

# awk 'BEGIN{print rand(), rand()}'
0 0.345001
# awk 'BEGIN{print rand(), rand()}'
0 0.345001
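Combining the two ideas, you can seed BusyBox awk's PRNG from /dev/urandom so repeated runs differ even without a reliable clock (a sketch; not tested on every OpenWRT release):

sleep $(awk -v s="$(hexdump -n 4 -e '"%u"' </dev/urandom)" 'BEGIN { srand(s); print int(rand() * 9) }')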