I use the following grep query to find the occurrences of functions in a VB source file.
grep -nri "^\s*\(public\|private\|protected\)\s*\(sub\|function\)" formName.frm
This matches -
Private Sub Form_Unload(Cancel As Integer)
Private Sub lbSelect_Click()
...
However, it misses out on functions like -
Private Static Sub SaveCustomer()
because of the additional word "Static". How do I account for this optional word in the grep query?
You can use a \? to make something optional:
grep -nri "^\s*\(public\|private\|protected\)\s*\(static\)\?\s*\(sub\|function\)" formName.frm
In this case, the preceding group, which contains the string "static", is optional (i.e. may occur 0 or 1 times).
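As a quick check, here is the extended query run against a small sample file (the file name and contents are made up, reusing the declarations from the question):
printf 'Private Sub lbSelect_Click()\nPrivate Static Sub SaveCustomer()\n' > sample.frm
grep -nri "^\s*\(public\|private\|protected\)\s*\(static\)\?\s*\(sub\|function\)" sample.frm
# 1:Private Sub lbSelect_Click()
# 2:Private Static Sub SaveCustomer()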
When using grep, the quantifiers express cardinality as follows:
* : 0 or many
+ : 1 or many
? : 0 or 1 <--- this is what you need.
Given the following example (where the word "very" stands in for your "static"):
I am well
I was well
You are well
You were well
I am very well
He is well
He was well
She is well
She was well
She was very well
If we only want
I am well
I was well
You are well
You were well
I am very well
we'll use the '?' (also notice the careful placement of the space inside 'very ', which says we want the word 'very', followed by a space, zero or one time):
egrep "(I|You) (am|was|are|were) (very )?well" file.txt
As you may have guessed, I am inviting you to use egrep instead of grep (or you can try grep -E, for Extended Regular Expressions).
Related
I'm trying to do a word search with regex and wonder how to type AND for multiple criteria.
For example, how to type the following:
(Start with a) AND (Contains p) AND (Ends with e), such as the word apple?
Input
apple
pineapple
avocado
Code
grep -E "regex expression here" input.txt
Desired output
apple
What should the regex expression be?
In general you can't implement AND in a single regexp (though you can approximate it with .*s), but you can in a multi-regexp condition using a tool that supports it.
To better illustrate ANDs, let's make the example "starts with a AND includes p AND includes l AND ends with e", with input that includes alpine, so that it isn't trivial to express in a single regexp just by putting .*s between characters, but is trivial as a multi-regexp condition:
$ cat file
apple
pineapple
avocado
alpine
Using &&s will find both words regardless of the order of p and l as desired:
$ awk '/^a/ && /p/ && /l/ && /e$/' file
apple
alpine
but, as you can see, you can't just use .*s to implement AND:
$ grep '^a.*p.*l.*e$' file
apple
If you had to use a single regexp then you'd have to do something like:
$ grep -E '^a.*(p.*l|l.*p).*e$' file
apple
alpine
There are two ways you can do it.
First, by De Morgan's law, a chain of "&&"s is the same as the negation of a chain of "||"s over the negated conditions, so you can write the reverse of what you want and negate the whole thing.
Second, at the single-bit level, AND is the same as multiplication of the bits, which means that if you find all the &&s overly verbose, you can directly "multiply" the patterns together:
awk '/^a/ * /p/ * /l/ * /e$/' file
So by multiplying them you're doing the same as performing multiple logical ANDs all at once (but only use this shorthand when the input isn't gigantic, or when the savings from &&'s early exit are known to be negligible, since multiplication evaluates every pattern).
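As a sanity check on the earlier sample file, both forms select the same lines; the only difference is that && stops evaluating at the first failing pattern while multiplication evaluates them all:
awk '/^a/ && /p/ && /l/ && /e$/' file    # short-circuits on the first non-match
awk '/^a/ * /p/ * /l/ * /e$/' file       # evaluates every pattern, same result
# apple
# alpine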
Don't think of these as merely regex patterns. It's easier to think of anything not inside an action block (what's typically referred to as the pattern) as any combination of expressions that ultimately evaluates to a boolean TRUE or FALSE. For example, POSIX-compliant expressions that work in the pattern position include sprintf(), field assignments, and so on (even decrementing NR, if there's such a need), but not statements like next, print, printf(), delete array, or any of the loop structures. Surprisingly though, getline can be used directly in the pattern position (with some wrapper workaround).
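As a small illustration of that idea (again using the sample file from above), an ordinary expression, even one containing an assignment, can sit in the pattern position:
awk 'length($0) > 5' file                 # a comparison used as the "pattern"
awk '(n = split($0, a, "p")) > 2' file    # an assignment inside the pattern
The first prints lines longer than five characters; the second prints lines containing at least two p's.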
For example if I have file.txt with the following
object = {
'name' : 'namestring',
'type' : 'type',
'real' : 'yes',
'version' : '2.0',
}
and I want to extract just the version so that the output is 2.0, how would I go about doing this?
I would suggest that grep is probably the wrong tool for this. Nevertheless, it is possible, using grep twice.
grep 'version' input.txt | grep -Eo '[0-9.]+'
The first grep isolates the line you're interested in, and the second one prints only the characters of the line that match the regex, in this case numbers and periods. For your input data, this should work.
However, this solution is weak in a few areas. It doesn't handle cases where multiple version lines exist, and it's hugely dependent on the structure of the file (I suspect your file would still be syntactically valid if all the lines were joined into a single long line). It also uses a pipe, and in general, if there's a way to achieve something with a pipe and a way without one, you choose the latter.
One compromise might be to use awk, assuming you're always going to have things split by line:
awk '/version/ { gsub(/[^0-9.]/,"",$NF); print $NF; }' input.txt
This is pretty much identical in functionality to the dual grep solution above.
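If the presence of multiple version lines is a concern, one small tweak (a sketch, not a full fix) is to stop after the first match:
awk '/version/ { gsub(/[^0-9.]/,"",$NF); print $NF; exit }' input.txt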
If you wanted to process multiple variables within that section of file, you might do something like the following with awk:
BEGIN {
FS=":";
}
/{/ {
inside=1;
next;
}
/}/ {
inside=0;
print a["version"];
# do things with other variables too
#for(i in a) { printf("i=%s / a=%s\n", i, a[i]); } # for example
delete a;
}
inside {
sub(/^ *'/,"",$1); sub(/' *$/,"",$1); # strip whitespace and quotes
sub(/^ *'/,"",$2); sub(/',$/,"",$2); # strip whitespace and quotes
a[$1]=$2;
}
A better solution would be to use a tool that actually understands the file format you're using.
A simple and clean solution using grep and cut
grep version file.txt | cut -d \' -f4
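For clarity, the single quotes split that line into fields where field 2 is the key and field 4 is the value, so the same idea also extracts other values (a sketch against the sample file):
grep version file.txt | cut -d \' -f4    # 2.0
grep name file.txt | cut -d \' -f4       # namestring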
I want to extract a specific part out of the filenames to work with them.
Example:
ls -1
REZ-Name1,Surname1-02-04-2012.png
REZ-Name2,Surname2-07-08-2013.png
....
So I want to get only the part with the name.
How can this be achieved ?
There are several ways to do this. Here's a loop:
for file in REZ-*-??-??-????.png
do
name=${file#*-}
name=${name%-??-??-????.png}
echo "($name)"
done
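The two parameter expansions do the trimming: ${file#*-} removes the shortest prefix up to and including the first hyphen (the leading REZ-), and ${name%-??-??-????.png} removes the trailing date and extension. For example:
name='REZ-Anna-Maria,de-la-Cruz-12-32-2015.png'
name=${name#*-}                  # Anna-Maria,de-la-Cruz-12-32-2015.png
name=${name%-??-??-????.png}     # Anna-Maria,de-la-Cruz
echo "$name"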
Given a variety of filenames with all sorts of edge cases from spacing, additional hyphens and line feeds:
REZ-Anna-Maria,de-la-Cruz-12-32-2015.png
REZ-Bjørn,Dæhlie-01-01-2015.png
REZ-First,Last-12-32-2015.png
REZ-John Quincy,Adams-11-12-2014.png
REZ-Ridiculous example # this is one filename
is ridiculous,but fun-22-11-2000.png # spanning two lines
it outputs:
(Anna-Maria,de-la-Cruz)
(Bjørn,Dæhlie)
(First,Last)
(John Quincy,Adams)
(Ridiculous example
is ridiculous,but fun)
If you're less concerned with correctness, you can simplify it further:
$ ls | grep -o '[^-]*,[^-]*'
Maria,de
Bjørn,Dæhlie
First,Last
John Quincy,Adams
is ridiculous,but fun
In this case, cut makes more sense than grep:
ls -1 | cut -f2 -d-
This cuts the second field from the input, using '-' as the field delimiter. The other answer will correctly handle some cases mine will not, but for one-off uses, I generally find the semantics of cut much easier to remember.
I'm scanning the operating system's dictionary file. I'm creating a Java program that allows a user to enter any concoction of letters to find words that contain those letters. How would I do this using grep commands?
To find words that contain only the given letters:
grep -v '[^aeiou]' wordlist
The above prints the lines in wordlist that don't contain any characters other than those listed; it's sort of using a double negative to get what you want. Another way to do this would be:
grep -E '^[aeiou]+$' wordlist
which searches the whole line for a sequence of one or more of the selected letters.
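For example, with a tiny hypothetical word list:
printf 'aioee\nbanana\neau\n' > sample_words
grep -v '[^aeiou]' sample_words
# aioee
# eau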
Finding words that contain all of the given letters is a bit more involved, because there may be other letters in between the ones we want:
cat wordlist | grep a | grep e | grep i | grep o | grep u
(Yes, there is a useless use of cat above, but the symmetry is better this way.)
You can use a single grep to solve the last problem in Greg's answer, provided your grep supports PCRE. (Based on this excellent answer, boiled down a bit)
grep -P "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)" wordlist
The positive lookahead means it will match anything with an "a" anywhere, and an "e" anywhere, and.... etc etc.
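For example, both approaches select the same words from a small hypothetical list:
printf 'sequoia\neducation\nrhythm\n' > sample_words
grep -P '(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)' sample_words
# sequoia
# education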
I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links, to produce a correct word count for the whole document.
With vim, I currently use ggVG then g CTRL+G, but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It follows TeX files that are included in the document via \input or \include (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
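For example, a typical invocation that follows included files and prints only the totals might look like this (main.tex is a placeholder for your top-level file):
texcount -inc -total main.tex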
I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe's answer:
If you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps | wc -w
I compared this count to the count in Microsoft Word for a 1599-word document (according to Word). pdftotext produced text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words, 4 more than Word.
I'd say that's a pretty good count. I'm not sure where the 4-word difference comes from, though. :)
In the Texmaker interface you can get the word count by right-clicking in the PDF preview.
Overleaf (both v1 and v2) has a built-in word count feature.
I use the following VIM script:
function! WC()
let filename = expand("%")
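" build a shell pipeline: detex strips TeX markup, wc -w counts the words, perl trims whitespace from the result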
let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
let result = system(cmd)
echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.