AWK: avoid reformatting of date-like values when printing

This is the issue:
I gave AWK a comma-delimited table as input (specifying FS=","), asking it to take the average of the 2nd-3rd columns and of the 4th-5th columns, and print the first column value \t average1 \t average2 \n
BUT
the first column has gene names, and some of them look like dates
AND
when I print, these names change; for example "Sept15" becomes "15-Sep", and I want to avoid this
awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{print $1,($2+$3)/2,($4+$5)/2}}' DESeqResults.csv | grep Sep
Even when using printf("%s"):
awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{printf("%s\t%d\t%d\n",$1,($2+$3)/2,($4+$5)/2) }}' DESeqResults.csv | grep Sep
I thought that using printf instead of just print might work, but it didn't. And I'm pretty sure something is going on when the value is read and preprocessed, because I printed only that column and nothing else (using both print and printf) and the value was already changed.
awk 'BEGIN{FS=",";OFS="\t"}{if(NR==1){next}{print $1}}' DESeqResults.csv | grep Sep
AWK Version:
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
Here is a sample of the table:
genes,Tet2/3_Ctrtl_A,Tet2/3_Ctrtl_B,Tet2/3_DKO_A,Tet2/3_DKO_B,baseMean,baseMean_Tet2/3_Ctrtl,baseMean_Tet2/3_DKO,foldChange(Tet2/3_DKO/Tet2/3_Ctrtl),log2FoldChange,pval,padj
Sep15,187.0874494,213.5411848,289.6434172,338.0423229,1376.196203,926.4220733,1825.970332,1.970991824,0.978921792,5.88E-05,0.003018514
Psmb2,399.4650982,355.9642309,557.3871013,632.1236546,1462.399465,983.7201408,1941.078789,1.973202244,0.980538833,6.00E-05,0.003071175
Sept1,144.2402924,114.9623101,52.39183843,18.11079498,386.2494712,579.8722584,192.6266841,0.332188135,-1.58992755,0.000418681,0.014756367
Psmd8,101.3085151,68.51270408,140.650979,154.2588735,627.727588,396.4360624,859.0191136,2.166854116,1.115602027,0.000421417,0.014825295
Sepw1,388.2193716,337.7605508,209.8232326,155.9087497,639.6596557,787.1262578,492.1930536,0.625303817,-0.677370771,0.004039946,0.080871288
Cks1b,265.8259249,287.954538,337.1108392,408.0547432,865.5821999,642.8510296,1088.31337,1.692948008,0.759537668,0.004049464,0.0809765
Sept2,358.4252141,302.9219723,393.3509343,394.2208442,4218.71214,3392.272118,5045.152161,1.48724866,0.572645878,0.004380269,0.085547008
Tuba1a,19.47153869,11.1692256,40.09945086,28.7539846,142.1610148,75.37000403,208.9520256,2.772349933,1.47110937,0.004381599,0.085547008
Sepx1,14.5941944,15.37680483,53.70015607,105.5523799,157.8475412,40.73526884,274.9598136,6.749920191,2.754870444,0.010199249,0.153896056
Apc,10.90608004,13.56070852,6.445046152,4.536589807,363.4471652,466.2312058,260.6631245,0.559085538,-0.838859068,0.010251083,0.154555416
Sephs2,38.20092337,29.90249614,41.38713976,60.29027195,328.8398211,228.5362706,429.1433717,1.877791086,0.909036565,0.088470061,0.590328676
2310008H04Rik,12.72162335,13.98659226,17.77340283,16.88409867,175.2157133,133.5326829,216.8987437,1.624312033,0.699828804,0.088572283,0.590803249
Sepn1,16.26472482,11.00430796,7.219301889,7.109776037,119.8773488,144.9435253,94.81117235,0.654124923,-0.612361911,0.129473781,0.719395557
Fancc,6.590254663,5.520421849,8.969058939,8.394987722,111.479883,79.97866541,142.9811007,1.787740518,0.838137351,0.129516654,0.719423355
Sept7,170.6589676,187.3808346,185.8091089,158.0134115,1444.411676,1313.631233,1575.192119,1.199112871,0.261967464,0.189661613,0.852792911
Obsl1,1.400612677,0.51329399,0.299847728,0.105245908,10.77805777,17.15978377,4.396331776,0.256199719,-1.964659203,0.189677412,0.852792911
Sepp1,136.2725767,142.7392758,137.5079558,135.5576156,1055.39992,948.5532274,1162.246613,1.225283494,0.293115585,0.193768055,0.862790863
Tom1l2,6.079259794,5.972711213,4.188234003,1.879086398,93.62018078,115.620636,71.61972551,0.619437221,-0.690970019,0.193795263,0.862790863
Sept10,5.07506603,4.240574236,7.415271602,7.245735277,56.38191446,38.04292126,74.72090766,1.964121187,0.973883947,0.202050794,0.874641256
Jag2,0.531592511,1.753353521,0.106692242,0.099863326,7.812876603,14.01922398,1.606529221,0.114594732,-3.125387366,0.202074037,0.874641256
Sept9,25.71885843,9.170659969,29.98187141,23.5519093,333.6707351,231.1780024,436.1634678,1.8866997,0.915864812,0.227916377,0.920255208
Mad2l2,22.00853798,17.42180189,30.74357865,21.99530555,98.71951578,74.31522721,123.1238044,1.656777608,0.72837996,0.227920237,0.920255208
Sept8,3.128945597,4.413675869,1.658838722,1.197769008,38.73123291,52.59586062,24.8666052,0.472786354,-1.080739698,0.237101573,0.929055595
BC018465,1.974718423,2.171073663,0.264221349,0.123654833,5.802858162,10.40514412,1.200572199,0.115382563,-3.115502877,0.237135522,0.929055595
Sept11,51.69299305,57.36531814,51.69117677,51.61623861,915.6234052,837.2625097,993.9843007,1.187183576,0.247543039,0.259718041,0.949870478
Ccnc,11.42168015,13.32308428,14.76060133,12.19352385,173.0536821,146.6301746,199.4771895,1.36041023,0.444041759,0.259794956,0.949870478
Sept12,0,5.10639021,0,0.158638685,5.07217061,9.738384198,0.405957022,0.041686281,-4.584283515,0.388933297,1
Gclc,24.79641294,20.9904856,13.36470176,15.92090715,146.8502169,163.0012707,130.6991632,0.801829106,-0.318633307,0.3890016,1
Sept14,0.15949349,1.753526538,0,0,2.425489894,4.850979788,0,0,#NAME?,0.396160673,1
Slc17a1,0.131471208,1.445439884,0,0,2.425489894,4.850979788,0,0,#NAME?,0.396160673,1
Sept6,34.11050622,30.16102302,28.2562382,14.56889172,602.5658704,661.8163161,543.3154247,0.820945951,-0.284640854,0.416246976,1
Unc119,6.098478253,9.710512531,4.558282355,1.738214353,23.04654843,30.90026472,15.19283214,0.491673203,-1.024228366,0.416259755,1
Sept4,2.305246374,2.534467513,1.18972284,0.618652085,8.87244411,12.13933481,5.605553408,0.461767757,-1.114760654,0.560252893,1
Ddb2,11.25366078,17.32172888,10.50269513,6.025122118,71.81085298,83.53254996,60.089156,0.719350194,-0.475233821,0.560482212,1
Sephs1,20.92060935,15.48240612,15.94132159,11.57137656,288.7538099,298.3521103,279.1555094,0.935657902,-0.095946952,0.568672243,1
BC021785,0.135120133,0.891334456,0.108476095,0.101533002,5.825443635,9.241093439,2.409793832,0.260769339,-1.939153843,0.568713405,1
Sepsecs,7.276880132,6.154194955,5.055549522,3.680417498,35.9322246,39.77711194,32.08733726,0.806678406,-0.309934458,0.673968316,1
Osbpl7,10.51628336,5.69720028,7.157857243,5.382675661,86.65916873,88.67338952,84.64494794,0.954569893,-0.06707726,0.674000752,1
Sept3,0.113880577,0.250408482,0.228561799,0.042786507,2.505996654,2.619498342,2.392494966,0.913340897,-0.13077466,1,1
Sept5,0.126649979,0,0.203352303,0,0.609528347,0.424441516,0.794615178,1.872142914,0.904690571,1,1
Serpina11,0,0,0.14524189,0,0.198653794,0,0.397307589,Inf,Inf,1,1

Transferring extensive comments into an answer
No; awk does not convert strings such as Sep15 to a date 15-Sep by default, even on a Mac. At least, not with the standard awk on Mac OS X 10.10.2 Yosemite, which I tested with, nor would I expect it to do so with any other variant of awk I've ever seen on a Mac.
[…time passed…] Somewhat to my surprise, I have gawk installed and it is GNU Awk 3.1.7 Copyright (C) 1989, 1991-2009 Free Software Foundation. (I knew I had gawk installed, but I was expecting it to be a 4.x version.) Given your data, on my Mac the output of your first awk (gawk) command does no mapping whatsoever on the first column. If you subsequently import the data into a spreadsheet, the spreadsheet could do all sorts of transformations, but that isn't awk's fault.
You mentioned Mac in one of your early comments; are you using Mac OS X? I am not expecting to find any problem outside the spreadsheet. If the data is imported to a spreadsheet, then I won't be surprised to find the 'date-like' values in column 1 are reformatted.
I tried importing the CSV from the data in the question into LibreOffice (4.4.1.2, or 4.4.1002, depending on where you look for the version number), and no transformation occurred on the data in column 1. Similarly, Numbers 3.5.2 and OpenOffice 4.1.1 both leave the keys starting 'Sep' alone.
Unfortunately, MS Excel (for Mac 2011, version 14.4.8 — 150116) translates such column values to a date (so Sep15 becomes 15-Sep, for example). Even embedding the column in double quotes does not help. I don't have a good solution other than "do not use MS Excel".
There probably is a way to suppress the behaviour, but you need to ask a question tagged excel and csv rather than awk and printing and printf.
Incidentally, a Google search on 'excel csv import force text' turns up Stop Excel from automatically converting certain text values to dates? Some of the techniques outlined there (notably the "rename the file from .csv to .txt" technique) work.
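For what it's worth, one of the workarounds from that linked question can be applied directly in the awk step. This is an untested sketch (reusing the file name and columns from the question) that wraps the gene name in ="..." so that Excel should import it as literal text rather than a date:
awk 'BEGIN{FS=","; OFS=","} NR==1{next} {print "=\"" $1 "\"", ($2+$3)/2, ($4+$5)/2}' DESeqResults.csv > averages.csv
The other technique described there, renaming the output to .txt and importing it through Excel's text-import wizard with the first column set to Text, avoids touching the data at all.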

Related

How to grep a still unknown specific word in a matched row

I have a file made of some rows and the one I am interested in is like this one:
free energy TOTEN = -96.86706464 eV
So with grep I can find the row I need and assign to a variable the value of the row with:
E=$(grep "free energy" OUTCAR_$i)
Now, how do I assign to E a specific word present in the matched line obtained with grep, the numeric value in this case? Please note that the value I want to grep is the unknown I am looking for, but it is always present at the same position in the row!
Thank you
With GNU grep, you may use a PCRE regex solution:
E=$(grep -oP 'free energy.* \K-?[0-9][0-9.]*' "OUTCAR_$i")
With GNU sed, you may extract the negative value from a line:
E=$(sed -n '/free energy/{s/.* \(-\{0,1\}[0-9][0-9.]*\).*/\1/p}' "OUTCAR_$i")
If the number of non-whitespace chunks per line is fixed, extract the fifth field when the line contains free energy:
E=$(awk '$0 ~ /free energy/{print $5}' "OUTCAR_$i")
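As a quick sanity check of the awk variant against the sample line from the question (OUTCAR_1 here is just a throwaway test file):
printf 'free energy TOTEN = -96.86706464 eV\n' > OUTCAR_1
E=$(awk '$0 ~ /free energy/{print $5}' "OUTCAR_1")
echo "$E"
-96.86706464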

Need to selectively remove newline characters from a file using unix (solaris)

I am trying to find a way to selectively remove newline characters from a file. I have no issues removing all of them, but I need some to remain.
Here is the example of the bad input file. Note that rows with Permit ID COO789 & COO012 have newlines embedded in the description field that I need to remove.
"Permit Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians
Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race
weekend",,"05/11/2013","05/11/2013"
Here is an example of how I need the file to look like:
"Permit Number/Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race weekend",,"05/11/2013","05/11/2013"
NOTE: I did simplify the file by removing a few extra columns. The logic should be able to accommodate any number of columns, though. The actual full header line with all columns is below. Technically, I expect the "extra" newlines to be found in the Description and Location columns.
"Permit Number/Id","Permit Name","Description","Start Date","End Date","Custom Status","Owner Name","Total Expected Attendance","Location"
I have tried sed, cut, tr, nawk, etc. Open to any solution that can do this and that can be called from within a unix script.
Thanks!!!
If you must remove newline characters from only within the 'Description' and 'Location' fields, you will need a proper csv parser (think Text::CSV). You could also do this fairly easily using GNU awk, but you won't have access to gawk on Solaris unfortunately. Therefore, the next best solution would be to join lines that don't start with a double-quote to the previous line. You can do this using sed. I've written this with compatibility in mind:
sed -e :a -e '$!N; s/ *\n\([^"]\)/ \1/; ta' -e 'P;D' file
Results:
"Permit Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race weekend",,"05/11/2013","05/11/2013"
sed ':a;N;$!ba;s/ \n/ /g'
Reads the whole file into the pattern space, then removes all newlines which occur directly after a space - assuming that all the errant newlines fit this pattern. If not, when else should newlines be removed?

Using sed or awk to parse current field/column into additional fields/columns on the same line

Here are 4 sample rows of the text file of interest
EnumerateKey,explorer.exe,HKCR\\Directory\\shellex\\ContextMenuHandlers,NOMORE
CreateSec,explorer.exe,\\WINDOWS\\system32\\verclsid.exe,SUCCESS
QueryKey,AcroRd32.exe,HKCU\\Control Panel\\International,BUFOVRFLOW
QueryValue,AcroRd32.exe,HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer\\NoRecentDocsHistory,NOTFOUND
I would like to augment the rows by appending K fields/columns (for example K=3 below) which contain the elements of the path found in $3 but parsed by \\
Here are the desired output for the 4 lines.
EnumerateKey,explorer.exe,HKCR\\Directory\\shellex\\ContextMenuHandlers,NOMORE, Directory, shellex, ContextMenuHandlers
CreateSec,explorer.exe,\\WINDOWS\\system32\\verclsid.exe,SUCCESS, WINDOWS, system32, verclsid.exe
QueryKey,AcroRd32.exe,HKCU\\Control Panel\\International,BUFOVRFLOW, Control Panel, International,
QueryValue,AcroRd32.exe,HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer\\NoRecentDocsHistory,NOTFOUND, Software, Microsoft, Windows
After some more study, here are 2 nuances:
Some of the paths begin with HK**, others don't. However, in both cases I only care about the path that starts after the initial \\. This difference is captured between line 1 and 2. Therefore I believe the parsing must be anchored at \\ rather than simply $3 if possible. (Am I using that terminology correctly?)
Second, the depth of the path varies. In order to keep consistency in column/fields I'm willing to lose some information (line 4) as well as have empty fields for the short paths (line 3) in order to maintain this.
Here's one way using GNU awk:
awk 'BEGIN { FS=OFS="," } { split($3,a,"\\\\\\\\"); print $0, a[2], a[3], a[4] }' file
Results:
EnumerateKey,explorer.exe,HKCR\\Directory\\shellex\\ContextMenuHandlers,NOMORE,Directory,shellex,ContextMenuHandlers
CreateSec,explorer.exe,\\WINDOWS\\system32\\verclsid.exe,SUCCESS,WINDOWS,system32,verclsid.exe
QueryKey,AcroRd32.exe,HKCU\\Control Panel\\International,BUFOVRFLOW,Control Panel,International,
QueryValue,AcroRd32.exe,HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer\\NoRecentDocsHistory,NOTFOUND,Software,Microsoft,Windows
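A variation on the same approach, as an untested sketch, with the number of extra columns as a variable K and explicit padding with empty fields when the path is shorter than K components:
awk -v K=3 'BEGIN { FS = OFS = "," }
{
    n = split($3, a, /\\\\/)              # a[1] is whatever precedes the first \\
    out = $0
    for (i = 2; i <= K + 1; i++)          # take up to K components after that anchor
        out = out OFS (i <= n ? a[i] : "")
    print out
}' file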
With some ugly Perl:
perl -lane '{$l=$_;s/.*?\\\\([^,]*),.*/$1/;@v=split(/\\\\/,$_); print "$l,".join(",",@v[0,1,2]);}' input

Printing long integers in awk

I have a pipe delimited feed file which has several fields. Since I only need a few, I thought of using awk to capture them for my testing purposes. However, I noticed that printf changes the value if I use "%d". It works fine if I use "%s".
Feed File Sample:
[jaypal:~/Temp] cat temp
302610004125074|19769904399993903|30|15|2012-01-13 17:20:02.346000|2012-01-13 17:20:03.307000|E072AE4B|587244|316|13|GSM|1|SUCC|0|1|255|2|2|0|213|2|0|6|0|0|0|0|0|10|16473840051|30|302610|235|250|0|7|0|0|0|0|0|10|54320058002|906|722310|2|0||0|BELL MOBILITY CELLULAR, INC|BELL MOBILITY CELLULAR, INC|Bell Mobility|AMX ARGENTINA SA.|Claro aka CTI Movil|CAN|ARG|
I am interested in capturing the second column which is 19769904399993903.
Here are my tests:
[jaypal:~/Temp] awk -F"|" '{printf ("%d\n",$2)}' temp
19769904399993904 # Value is changed
However, the following two tests works fine -
[jaypal:~/Temp] awk -F"|" '{printf ("%s\n",$2)}' temp
19769904399993903 # Value remains same
[jaypal:~/Temp] awk -F"|" '{print $2}' temp
19769904399993903 # Value remains same
So is this a limitation of "%d" in not being able to handle long integers? If that's the case, why would it add one to the number instead of, say, truncating it?
I have tried this with BSD and GNU versions of awk.
Version Info:
[jaypal:~/Temp] gawk --version
GNU Awk 4.0.0
Copyright (C) 1989, 1991-2011 Free Software Foundation.
[jaypal:~/Temp] awk --version
awk version 20070501
Starting with GNU awk 4.1 you can use --bignum or -M
$ awk 'BEGIN {print 19769904399993903}'
19769904399993904
$ awk --bignum 'BEGIN {print 19769904399993903}'
19769904399993903
See § Command-Line Options in the gawk manual.
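Applied to the original feed file, that would be something like the following (untested; requires gawk 4.1+ built with MPFR/GMP support):
gawk -M -F"|" '{ printf "%d\n", $2 }' temp
19769904399993903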
I believe the underlying numeric format in this case is an IEEE double. So the changed value is a result of floating point precision errors. If it is actually necessary to treat the large values as numerics and to maintain accurate precision, it might be better to use something like Perl, Ruby, or Python which have the capabilities (maybe via extensions) to handle arbitrary-precision arithmetic.
UPDATE: Recent versions of GNU awk support arbitrary precision arithmetic. See the GNU awk manual for more info.
ORIGINAL POST CONTENT:
XMLgawk supports arbitrary precision arithmetic on floating-point numbers.
So, if installing xgawk is an option:
zsh-4.3.11[drado]% awk --version |head -1; xgawk --version | head -1
GNU Awk 4.0.0
Extensible GNU Awk 3.1.6 (build 20080101) with dynamic loading, and with statically-linked extensions
zsh-4.3.11[drado]% awk 'BEGIN {
x=665857
y=470832
print x^4 - 4 * y^4 - 4 * y^2
}'
11885568
zsh-4.3.11[drado]% xgawk -lmpfr 'BEGIN {
MPFR_PRECISION = 80
x=665857
y=470832
print mpfr_sub(mpfr_sub(mpfr_pow(x, 4), mpfr_mul(4, mpfr_pow(y, 4))), 4 * y^2)
}'
1.0000000000000000000000000
This question was partially answered by @Mark Wilkins and @Dennis Williamson already, but I found out that the largest integer that can be handled without losing precision (in awk's 64-bit floating-point representation) is 2^53.
E.g., see awk's reference page:
http://www.gnu.org/software/gawk/manual/gawk.html#Integer-Programming
(sorry if my answer is too old. Figured I'd still share for the next person before they spend too much time on this like I did)
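To illustrate the boundary (2^53 = 9007199254740992), with a stock double-precision awk (i.e. gawk without -M) you would expect something like:
awk 'BEGIN { printf "%d\n%d\n%d\n", 2^53 - 1, 2^53, 2^53 + 1 }'
9007199254740991
9007199254740992
9007199254740992
The last line shows 2^53 + 1 rounded back down, since it can no longer be represented exactly as a double.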
You're running into Awk's Floating Point Representation Issues. I don't think you can find a work-around within awk framework to perform arithmetic on huge numbers accurately.
The only possible (and crude) way I can think of is to break the huge number into smaller chunks, perform your math, and join them again; or better yet, use Perl/PHP/Tcl/bash etc., scripting languages that are more powerful than awk.
Using nawk on Solaris 11, I convert the number to a string by concatenating a null (empty) string to the end, and then use %15s as the format string:
printf("%15s\n", bignum "")
Another caveat about precision: the errors pile up with extra operations:
echo 19769904399993903 | mawk2 '{ CONVFMT = "%.2000g";
OFMT = "%.20g";
} {
print;
print +$0;
print $0/1.0
print $0^1.0;
print exp(-log($0))^-1;
print exp(1*log($0))
print sqrt(exp(exp(log(20)-log(10))*log($0)))
print (exp(exp(log(6)-log(3))*log($0)))^2^-1
}'
19769904399993903
19769904399993904
19769904399993904
19769904399993904
19769904399993912
19769904399993908
19769904399993628    <<<-- -275
19769904399993768    <<<-- -135
The first few are only off by less than 10; the last two equations have triple-digit deltas.
For any of the versions that require calling helper math functions, simply passing the -M bignum flag is insufficient; one must also set the PREC variable.
For this example, setting PREC=64 and OFMT="%.17g" should suffice.
Beware of setting OFMT too high relative to PREC, otherwise you'll see oddities like this:
gawk -M -v PREC=256 -e '{ CONVFMT="%.2000g"; OFMT="%.80g";... } '
19769904399993903
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
19769904399993903.000000000000000000000000000000000000000000000000000000000003734
since 80 significant digits require a precision of at least 265.75 bits, i.e. basically 266 bits; but gawk is fast enough that you can probably safely pre-set PREC=4096 or 8192 instead of having to worry about it every time.

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links, to produce a correct word count for the whole document.
With vim, I currently use ggVGg CTRL+G but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
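As I understand the options described above, a typical invocation for the grand total of a multi-file document would be something like (main.tex being whatever your root file is called):
texcount -inc -total main.tex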
I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe: if you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps|wc -w
I compared this count to the count in Microsoft Word in a 1599 word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words. 4 more than in Word.
I say that's a pretty good count. I am not sure where the 4-word difference comes from, though. :)
In the Texmaker interface you can get the word count by right-clicking in the PDF preview.
Overleaf also has a built-in word count feature (available in both Overleaf v1 and v2).
I use the following VIM script:
function! WC()
let filename = expand("%")
let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
let result = system(cmd)
echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
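The same detex | wc pipeline can also be run directly from the shell for a single file (report.tex is just a placeholder name here):
detex report.tex | wc -w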
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
*VimtexImapsList*
*<plug>(vimtex-imaps-list)*
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.

Resources