Correct word-count of a LaTeX document

Correct word-count of a LaTeX document - latex

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that only work on a single file but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files...ie follow \include and \input links to produce a correct word-count for the whole document.
With vim, I currently use ggVGg CTRL+G but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?

I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.

I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc - w

latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.

To add to #aioobe,
If you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps|wc -w
I compared this count to the count in Microsoft Word in a 1599 word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words. 4 more than in Word.
I say that's a pretty good count. I am not sure where's the 4 word difference, though. :)

In Texmaker interface you can get the word count by right clicking in the PDF preview:

Overleaf has a word count feature:
Overleaf v2:
Overleaf v1:

I use the following VIM script:
function! WC()
let filename = expand("%")
let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
let result = system(cmd)
echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.

If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
*VimtexImapsList*
*<plug>(vimtex-imaps-list)*
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.

For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.

Related

Read a single number from a text file and advance stream position in Julia

I understand that Julia has a complete set of low level tools for interfacing with binary files on one hand and some powerfull utilities such as readdlm to load text files containing rectangular data into Array structures on the other hand.
What I cannot discover in the standard library docs, however, is how to easily get input from less structured text files. In particular, what would be the Julia equivalent of the c++ idiom
some_input_stream >> a_variable_int_perhaps;
Given this is such a common usage scenario I am surprised something like this does not feature prominently in the standard library...

You can use readuntil http://docs.julialang.org/en/latest/stdlib/io-network/#Base.readuntil
shell> cat test.txt
1 2 3 4
julia> i,j = open("test.txt") do f
parse(Int, readuntil(f," ")), parse(Int, readuntil(f," "))
end
(1,2)
EDIT: To address comments
To get the last integer in an irregularly formatted ascii file you could use split if you know the character preceding the integer (I've use a blank space here)
shell> cat test.txt
1.0, two five:$#!() + 4
last line 3
julia> i = open("test.txt") do f
parse(Int, split(readline(f), " ")[end])
end
4
As far as code length is concerned, the above examples are completely self contained and the file is opened and closed in an exception safe manner (i.e. wrapped in a try-finally block). To do the same in C++ would be quite verbose.

How to make the output of Maxima cleaner?

I want to make use of Maxima as the backend to solve some computations used in my LaTeX input file.
I did the following steps.
Step 1
Download and install Maxima.
Step 2
Create a batch file named cas.bat (for example) as follows.
rem cas.bat
echo off
set PATH=%PATH%;"C:\Program Files (x86)\Maxima-5.31.2\bin"
maxima --very-quiet -r %1 > solution.tex
Save the batch in the same directory in which your input file below exists. It is just for the sake of simplicity.
Step 3
Create the input file named main.tex (for example) as follows.
% main.tex
\documentclass[preview,border=12pt,12pt]{standalone}
\usepackage{amsmath}
\def\f(#1){(#1)^2-5*(#1)+6}
\begin{document}
\section{Problem}
Evaluate $\f(x)$ for $x=\frac 1 2$.
\section{Solution}
\immediate\write18{cas "x: 1/2;tex(\f(x));"}
\input{solution}
\end{document}
Step 4
Compile the input file with pdflatex -shell-escape main and you will get a nice output as follows.
!
Step 5
Done.
Questions
Apparently the output of Maxima is as follows. I don't know how to make it cleaner.
solution.tex
1
-
2
$${{15}\over{4}}$$
false
Now, my question are
how to remove such texts?
how to obtain just \frac{15}{4} without $$...$$?

(1) To suppress output, terminate input expressions with dollar sign (i.e. $) instead of semicolon (i.e. ;).
(2) To get just the TeX-ified expression sans the environment delimiters (i.e. $$), call tex1 instead of tex. Note that tex1 returns a string, which you have to print yourself (while tex prints it for you).
Combining these ideas with the stuff you showed, I think your program could look like this:
"x: 1/2$ print(tex1(\f(x)))$"
I think you might find the Maxima mailing list helpful. I'm pretty sure there have been several attempts to create a system such as the one you describe. You can also look at the documentation.

I couldn't find any way to completely clean up Maxima's output within Maxima itself. It always echoes the input line, and always writes some whitespace after the output. The following is an example of a perl script that accomplishes the cleanup.
#!/usr/bin/perl
use strict;
my $var = $ARGV[0];
my $expr = $ARGV[1];
sub do_maxima_to_tex {
my $m = shift;
my $c = "maxima --batch-string='exptdispflag:false; print(tex1($m))\$'";
my $e = `$c`;
my #x = split(/\(%i\d+\)/,$e); # output contains stuff like (%i1)
my $f = pop #x; # remove everything before the echo of the last input
while ($f=~/\A /) {$f=~s/\A .*\n//} # remove echo of input, which may be more than one line
$f =~ s/\\\n//g; # maxima breaks latex tokens in the middle at end of line; fix this
$f =~ s/\n/ /g; # if multiple lines, get it into one line
$f =~ s/\s+\Z//; # get rid of final whitespace
return $f;
}
my $e1 = do_maxima_to_tex("diff($expr,$var,1)");
my $e2 = do_maxima_to_tex("diff($expr,$var,2)");
print <<TEX;
The first derivative is \$$e1\$. Differentiating a second time,
we get \$$e2\$.
TEX
If you name this script a.pl, then doing
a.pl z 3*z^4
outputs this:
The first derivative is $12\,z^3$. Differentiating a second time,
we get $36\,z^2$.
For the OP's application, a script like this one could be what is invoked by the write18 in the latex file.

If you really want to use LaTeX then the maxiplot package is the answer. It provides a maxima environment inside of which you enter Maxima commands. When you process your LaTeX file a Maxima batch file is generated. Process this file with Maxima and process your LaTeX file again to typeset the equations generated by Maxima.
If you would rather have 2D math input with live typesetting then use TeXmacs. It is a cross-platform document authoring environment (a word processor on steroids if you like) that includes plugins for Maxima, Mathematica and many more scientific computing tools. If you need to or are not satisfied with the typesetting, you can export your document to LaTeX.

I know this is a very old post. Excellent answers for the question asked by OP. I was using --very-quiet -r options on the command line for a long time like OP, but in maxima version 5.43.2 they behave differently. See maxima command line v5.43 is behaving differently than v5.41. I am answering this question with a cross reference because when incorporating these answers in your solutions, make sure the changes in behavior of those command line flags are also incorporated.

How to diagnose errors in LaTeX generated by Doxygen 1.8.x: LT#LL#FM#cr

I've been using Doxygen successfully to generate PDF documentation for a sizable Fortran 90 project since v1.6. After a recent upgrade to Doxygen 1.8, pdflatex is choking with an error I can't understand. From refman.log:
.
.
.
<use classfate__source_a022bf629bdc1d3059ebd5fb86d13b4f4_icgraph.pdf>
Package pdftex.def Info: classfate__source_a022bf629bdc1d3059ebd5fb86d13b4f4_ic
graph.pdf used on input line 607.
(pdftex.def) Requested size: 350.0pt x 65.42921pt.
)
(./classm__aerosol.tex
! Undefined control sequence.
<recently read> \LT#LL#FM#cr
l.25 ...1833ffa6f2fae54ededb}{ia\-\_\-nsize}), \\*
? ?
Type <return> to proceed, S to scroll future error messages,
R to run without stopping, Q to run quietly,
I to insert something, E to edit your file,
1 or ... or 9 to ignore the next 1 to 9 tokens of input,
H for help, X to quit.
Looking at the first 25 lines of classm__aerosol.tex, nothing obviously matches the error message:
\hypertarget{classm__aerosol}{\section{m\-\_\-aerosol Module Reference}
\label{classm__aerosol}\index{m\-\_\-aerosol#{m\-\_\-aerosol}}
}
Contains general aerosol-\/related constants and routines.
\subsection*{Public Member Functions}
\begin{DoxyCompactItemize}
\item
subroutine \hyperlink{classm__aerosol_aa06c1f39c6bd34f22be92d21535f0320}{aerdis} (I\-A\-E\-R\-O, M\-A\-E\-R\-O, V\-O\-L, A\-R\-E\-A, M\-U, T\-G\-A\-S, R\-H\-O, A\-G\-A\-M\-M\-A, X\-L\-A\-E\-R, D\-M\-E\-A\-N, N\-A\-E\-R, X\-N\-D\-A\-E\-R, L\-S\-D\-A\-E\-R)
\begin{DoxyCompactList}\small\item\em Return aerosol mass given a volume, based on aerosol size distribution function. \end{DoxyCompactList}\item
real(kind=wp) function \hyperlink{classm__aerosol_a2dff4ff413057e8788fba7270a30c093}{lamsed} (V\-O\-L, H, M\-U\-G, R\-H\-O\-A\-E\-R, A\-G\-A\-M\-M\-A, A\-C\-H\-I, A\-F\-E\-O, K\-O, M\-A\-E\-R, F\-M\-A\-E\-R, F\-A\-E\-R\-S\-S, F\-S\-E\-D\-D\-K)
\begin{DoxyCompactList}\small\item\em Calculate aerosol removal constant and interpolation factor between steady-\/state and decaying aerosol correlations. \end{DoxyCompactList}\item
pure real(kind=wp) function \hyperlink{classm__aerosol_a6d0a04004f49c404c67e0aa69dd39ee1}{fdbend} (V\-E\-L, H\-S\-E\-D, T\-G, R\-H\-O\-G, M\-U\-G, R\-H\-O\-P\-A\-R, C\-A\-E\-R\-O, X\-D\-B\-E\-N\-D, N90\-J)
\begin{DoxyCompactList}\small\item\em Find total impaction efficiency for aerosol deposition considering 90-\/degree bends in a flow path. \end{DoxyCompactList}\end{DoxyCompactItemize}
\subsection*{Public Attributes}
\begin{DoxyCompactItemize}
\item
integer, parameter \hyperlink{classm__aerosol_a8f604b7ffe3c1833ffa6f2fae54ededb}{ia\-\_\-nsize} = 30
\item
integer, parameter \hyperlink{classm__aerosol_ae71813ecf0c7768af9d6292efb14774f}{ia\-\_\-nmass} = 10
\item
real(kind=wp), dimension(\hyperlink{classm__aerosol_a8f604b7ffe3c1833ffa6f2fae54ededb}{ia\-\_\-nsize}), \\*
Nothing obviously matches the recently read chunk "\LT#LL#FM#cr" and I don't know enough low-level TeX to translate that into something that might actually be in the source text.
Suspecting this might have been fixed in a later version of Doxygen than the one shipping with Linux Mint (v1.8.1.2), I built & installed v1.8.3.1 from source, updated my doxyfile, blew away the old documentation and regenerated it. I get the same baffling error.
There's nothing obvious in refman.log that would indicate missing or broken LaTeX packages and I'm completely at a loss as to what's causing this.

As this still gets a hit on Google when you search:
doxygen missing $ inserted
I would like to add something.
Do not use a PROJECT_NAME containing underscores (_)!
After a brief look into the doxygen's current documentation (I am using 1.8.4) it does not make that explicit.

this will be difficult to solve unless you provide a bit more information - possibly using \errorcontextlines=9999 as suggested in the comments on the question.
as a first short though, the name of the control sequence that can't be found (i.e. \LT#LL#FM#cr) is one defined by the longtable package (documentation, p. 15) - thus adding:
\usepackage{longtable}
to the preamble of the document might help.
If so, according to the doxygen documentation here, adding the following to your configuration file should do the trick:
EXTRA_PACKAGES=longtable

Need to selectively remove newline characters from a file using unix (solaris)

I am trying to find a way to selectively remove newline characters from a file. I have no issues removing all of them..but I need some to remain.
Here is the example of the bad input file. Note that rows with Permit ID COO789 & COO012 have newlines embedded in the description field that I need to remove.
"Permit Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians
Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race
weekend",,"05/11/2013","05/11/2013"
Here is an example of how I need the file to look like:
"Permit Number/Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race weekend",,"05/11/2013","05/11/2013"
NOTE: I did simplify the file by removing a few extra columns. The logic should be able to accommodation any number of columns though. The actual full header line is with all columns is. Technically, I expect the "extra" newlines to be found in Description and Location columns.
"Permit Number/Id","Permit Name","Description","Start Date","End Date","Custom Status","Owner Name","Total Expected Attendance","Location"
I have tried sed, cut, tr, nawk, etc. Open to any solution that can do this..that can be called from within a unix script.
Thanks!!!

If you must remove newline characters from only within the 'Description' and 'Location' fields, you will need a proper csv parser (think Text::CSV). You could also do this fairly easily using GNU awk, but you won't have access to gawk on Solaris unfortunately. Therefore, the next best solution would be to join lines that don't start with a double-quote to the previous line. You can do this using sed. I've written this with compatibility in mind:
sed -e :a -e '$!N; s/ *\n\([^"]\)/ \1/; ta' -e 'P;D' file
Results:
"Permit Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race weekend",,"05/11/2013","05/11/2013"

sed ':a;N;$!ba;s/ \n/ /g'
Reads the whole file into the pattern space, then removes all newlines which occur directly after a space - assuming that all the errant newlines fit this pattern. If not, when else should newlines be removed?

Easiest way to remove Latex tag (but not its content)?

I am using TeXnicCenter to edit a LaTeX document.
I now want to remove a certain tag (say, emph{blabla}} which occurs multiple times in my document , but not tag's content (so in this example, I want to remove all emphasization).
What is the easiest way to do so?
May also be using another program easily available on Windows 7.
Edit: In response to regex suggestions, it is important that it can deal with nested tags.
Edit 2: I really want to remove the tag from the text file, not just disable it.

Using a regular expression do something like s/\\emph\{([^\}]*)\}/\1/g. If you are not familiar with regular expressions this says:
s -- replace
/ -- begin match section
\\emph\{ -- match \emph{
( -- begin capture
[^\}]* -- match any characters except (meaning up until) a close brace because:
[] a group of characters
^ means not or "everything except"
\} -- the close brace
and * means 0 or more times
) -- end capture, because this is the first (in this case only) capture, it is number 1
\} -- match end brace
/ -- begin replace section
\1 -- replace with captured section number 1
/ -- end regular expression, begin extra flags
g -- global flag, meaning do this every time the match is found not just the first time
This is with Perl syntax, as that is what I am familiar with. The following perl "one-liners" will accomplish two tasks
perl -pe 's/\\emph\{([^\}]*)\}/\1/g' filename will "test" printing the file to the command line
perl -pi -e 's/\\emph\{([^\}]*)\}/\1/g' filename will change the file in place.
Similar commands may be available in your editor, but if not this will (should) work.

Crowley should have added this as an answer, but I will do that for him, if you replace all \emph{ with { you should be able to do this without disturbing the other content. It will still be in braces, but unless you have done some odd stuff it shouldn't matter.
The regex would be a simple s/\\emph\{/\{/g but the search and replace in your editor will do that one too.
Edit: Sorry, used the wrong brace in the regex, fixed now.

\renewcommand{\emph}[1]{#1}

any reasonably advanced editor should let you do a search/replace using regular expressions, replacing emph{bla} by bla etc.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Correct word-count of a LaTeX document - latex

I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc: pdftotext file.pdf - | wc - w

latex file.tex dvips -o - file.dvi | ps2ascii | wc -w should give you a fairly accurate word count.

In Texmaker interface you can get the word count by right clicking in the PDF preview:

Overleaf has a word count feature: Overleaf v2: Overleaf v1:

Related

Read a single number from a text file and advance stream position in Julia

How to make the output of Maxima cleaner?

How to diagnose errors in LaTeX generated by Doxygen 1.8.x: LT#LL#FM#cr

Need to selectively remove newline characters from a file using unix (solaris)

Easiest way to remove Latex tag (but not its content)?

Categories

Resources