I am tasked with white-labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for, and I would like to guarantee that all of them are removed. Since the application was not developed entirely in-house, we cannot simply look for occurrences in messages.properties and be done; we must go through the JSPs, the Java code, and the XML.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
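One way to harden that chain is to quote each pattern and match the comment markers as fixed strings with -F, so the shell cannot glob-expand /* and the markers are compared literally:

grep -ir 'SOME_PATTERN' . | grep -v 'import' | grep -vF '//' | grep -vF '/*'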
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
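For declarations whose string value wraps like that, one hedged option is a multi-line search. With GNU grep (when built with PCRE support), -z reads each file as a single record, since source files contain no NUL bytes, so a -P pattern can cross newlines; pcregrep -M is similar. Here src/ and the constant name are placeholders:

grep -rlzP 'SOME_CONSTANT\s*=\s*"[^"]*SOME_PATTERN' src/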
Alternatively, if we had an internal crawler or automated tests, I could simply pull back the XHTML from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively (a shell sketch of the first pass follows below):
1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
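A minimal shell sketch of steps [1]-[4], with illustrative file names (note that grep -ir prefixes each line with its file name, so the filter drops the ^ anchor):

grep -ir 'SOME_PATTERN' . > all.txt        # step 1: every candidate line
grep import all.txt > suspect.txt          # step 2: probable false positives
grep -v import all.txt > remaining.txt     # step 3: drop them from the working copy
# step 4: eyeball suspect.txt and paste any real matches back into
# remaining.txt, then repeat steps 2-4 with '//', '/*' and the other filters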
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try sed's s/regexp/replacement/ command.
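For example, to replace every occurrence of one of your patterns in a file (GNU sed shown; the names are illustrative, and -i.bak keeps a backup):

sed -i.bak 's/SOME_PATTERN/NEW_BRAND/g' SomeFile.java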
You can also try the awk command. Its -F option sets the field separator; for example, -F';' splits each line of your files on semicolons.
The best solution, however, would be a simple script in Perl or Python.
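The Perl version fits on one line, run over whichever files grep flags (again with illustrative names, and ignoring filenames containing whitespace):

perl -pi.bak -e 's/SOME_PATTERN/NEW_BRAND/g' $(grep -rl 'SOME_PATTERN' src/)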
I'm currently writing a simple Bash script. The idea is to use grep to find the lines where a certain pattern is found, within some files. The pattern consists of 3 capital letters at the start, followed by 6 digits; so the regex is [A-Z]{3}[0-9]{6}.
However, I need to only include the lines where this pattern is not concatenated with other strings, or in other words, if such a pattern is found, it has to be separated from other strings with spaces.
So if the string which matches the pattern is ABC123456 for example, the line something ABC123456 something should be fine, but somethingABC123456something should fail.
I've extended my regex using the [:space:] character class, like so:
[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]
And this seems to work, except for when the string which matches the pattern is the first or last one in the line.
So, the line something ABC123456 something will match correctly;
The line ABC123456 something won't;
And the line something ABC123456 won't as well.
I believe this has something to do with [:space:] not counting newlines and carriage returns as whitespace characters, even though from my understanding it should. Could anyone spot if I'm doing something wrong here?
A common solution to your problem is to normalize the input so that there is a space before and after each word.
sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]'
Your question assumes that the newlines are part of what grep sees, but that is not true (or at least not how grep is commonly implemented). Instead, grep reads just the contents of each line into a memory buffer and then applies the regular expression to that buffer.
A similar but different solution is to specify beginning of line or space, and correspondingly space or end of line:
grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' file
but this might not be entirely portable.
You might want to postprocess the results to trim any spaces from the extracted strings, too (one pipeline for that is sketched below); but I have already had to guess several things about what you are actually trying to accomplish, so I'll stop here.
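A hedged rendering of that trim, extending the first pipeline above with a final sed stage that strips leading and trailing whitespace from each extracted match:

sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]' |
sed 's/^[[:space:]]*//;s/[[:space:]]*$//'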
(Of course, sed can do everything grep can do, and then some, so perhaps switch to sed or Awk entirely rather than build an elaborate normalization pipeline around grep.)
I am trying to read and understand the man page, yet every day I find more inconsistent syntax, and I would like some clarification as to whether I am misunderstanding something.
Within the man page, it specifies the syntax for grep is grep [OPTIONS] [-e PATTERN]... [-f FILE]... [FILE...]
I got a working example that recursively searches all files within a directory for a keyword.
grep -rnw . -e 'memes'
Now, this example works, but I find it very inconsistent with the man page. The directory (which the man page writes as [FILE...], though it also specifies the behaviour when a FILE is a directory) appears last in the synopsis, yet in this example it is located after [OPTIONS] and before [-e PATTERN].... Why is this allowed? It does not follow the specified syntax rule for using this command.
Why is this allowed? It does not follow the specified syntax rule for using this command.
The lines in the SYNOPSIS section of a manpage are not to be understood as strict regular expressions, but as a brief description of the syntax of a utility's arguments.
Depending on the particular application, the parser might be more or less flexible in how it accepts its options. After all, each program can implement whatever grammar it likes for its arguments. Therefore, some allow options at the beginning, at the end, or even in between file operands (typically with ways to handle any ambiguity that may arise, e.g. reading from standard input with -, filenames starting with -, and so on).
Now, of course, there are some ways to do it that are common. For instance, POSIX.1-2017 12.1 Utility Argument Syntax says:
This section describes the argument syntax of the standard utilities and introduces terminology used throughout POSIX.1-2017 for describing the arguments processed by the utilities.
In your particular case, your implementation of grep (probably GNU grep) allows options to be passed in between the file list, as you have discovered.
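You can see this by switching GNU getopt into POSIX mode, where option parsing stops at the first non-option argument (a quick demonstration, hedged on your grep honouring POSIXLY_CORRECT):

grep -rnw . -e 'memes'                     # GNU grep permutes: '.' is a file operand
POSIXLY_CORRECT=1 grep -rnw . -e 'memes'   # parsing stops at '.', which becomes the
                                           # pattern; '-e' and 'memes' are then
                                           # taken as (nonexistent) file operands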
For more information, see:
https://unix.stackexchange.com/questions/17833/understand-synopsis-in-manpage
Are there standards for Linux command line switches and arguments?
https://www.gnu.org/software/libc/manual/html_node/Getopt-Long-Options.html
You can also place the options after the pattern and the file operands, as in:
grep 'string' * -lR
I have a log file that may be very large (10+ GB). I'd like to find the last occurrence of an expression. Is it possible to do this with standard posix commands?
Here are some potential answers, from similar questions, that aren't quite suitable.
Use tail -n <x> <file> | grep -m 1 <expression>: I don't know how far back the expression is, so I don't know what <x> would be. It could be several GB previous, so then you'd be tailing the entire file. I suppose you could loop and increment <x> until it's found, but then you'd be repeatedly reading the last part of the file. (A rough sketch of that loop follows this list.)
Use tac <file> | grep -m 1 <expression>: tac reads the entire source file. It might be possible to chain something on to sigkill tac as soon as some output is found? Would that be efficient?
Use awk/sed: I'm fairly sure these both always start from the top of the file (although I may be wrong, my sed-fu is not strong).
"There'd be no speed up so why bother": I think that's incorrect, since file systems can seek to the end of a file without reading the whole thing. There'd be a little trial and error/buffering to find each new line, but that shouldn't slow things down much, compared to reading (e.g.) 10 GB that are never used.
Write a python/perl script to do it: this is my fall-back if no one can suggest anything better. I'd rather stick to something that can be done straight through the command line, since I'm executing it straight through ssh, and I'd rather not have to upload a script file as well. Using mmap's rfind() in python, I think we can do it in a few lines, provided the expression to find is static (which mine, unfortunately, is not). A regex requires a bit more work, something like this.
If it helps, the expression is anchored at the start of a line, eg: "^foo \d+$".
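For what it's worth, here is a rough sketch of the "loop and increment <x>" idea from the first bullet, in plain shell (file name illustrative; the first line of each chunk may be partial, so an anchored pattern could in principle false-match there):

file=big.log
size=$(wc -c < "$file")
n=$((1024 * 1024))                          # start with the last 1 MiB
while :; do
    match=$(tail -c "$n" "$file" | grep '^foo [0-9][0-9]*$' | tail -n 1)
    if [ -n "$match" ]; then printf '%s\n' "$match"; break; fi
    if [ "$n" -ge "$size" ]; then break; fi # whole file scanned, no match
    n=$((n * 2))
done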
Whatever script you write will almost certainly be slower than:
tac file | grep -m 1 '^foo [0-9][0-9]*$'
This awk script will search through the whole file and print the last line matching the given /pattern/:
$ awk '/pattern/ { line=$0 } END { print line }' gigantic.log
Using tac will be a better option (this uses GNU sed to output the first (i.e. last) found match with '/pattern/', after which it terminates, killing the pipeline):
$ tac gigantic.log | gsed -n '/pattern/{p;q}'
Using Perl or C or some other language, you could seek to the end of the file, step back 4kb (or something), and then:

- read forwards 4kb,
- step back 8kb,
- repeat until the pattern is found, making sure that you handle partial lines correctly.
(This, apart from looking for a pattern, may actually be what tac does: one implementation of tac)
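A shell approximation of that backwards scan, reading block-aligned chunks from the end with dd (sizes illustrative; partial lines at chunk edges are glossed over, which a real implementation must handle):

file=gigantic.log
bs=$((64 * 1024))
blocks=$(( ($(wc -c < "$file") + bs - 1) / bs ))
i=$blocks
while [ "$i" -gt 0 ]; do
    i=$((i - 1))
    # count=2 overlaps into the following block so a line spanning the
    # block boundary is still seen whole
    match=$(dd if="$file" bs="$bs" skip="$i" count=2 2>/dev/null |
            grep '^foo [0-9][0-9]*$' | tail -n 1)
    if [ -n "$match" ]; then printf '%s\n' "$match"; break; fi
done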
I use procmail to do extensive sorting on my inbox. My next to last recipe matches the incoming From: to a (very) long white/gold list of historically good email addresses, and patterns of email addresses. The recipe is:
# Anything on the goldlist goes straight to inbox
:0
* ? formail -zxFrom: -zxReply-To: | fgrep -i -f $HOME/Mail/goldlist
{
LOG="RULE Gold: "
:0:
$DEFAULT
}
The final recipe puts everything left over into a suspect folder to be examined as probable spam. The goldlist is currently 7384 lines long (yikes...). Every once in a while, a piece of spam slips through, and I want to fix the failing pattern. I thought I read a while ago about a special flag to grep that helps show the matching patterns, but I can't find that again. Is there a way to use grep that shows which pattern from a file matched the scanned text? Or another similar tool that would answer the question, short of writing a script to scan pattern by pattern?
grep -o will output only the matched text (as opposed to the whole line). That may help. Otherwise, I think you'll need to write a wrapper script to try one pattern at a time.
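Such a wrapper can stay small. A sketch, with msg.txt standing in for a saved copy of the offending message (goldlist entries are treated as fixed strings, matching your fgrep usage):

while IFS= read -r pat; do
    if fgrep -iq -- "$pat" msg.txt; then
        printf 'matched by: %s\n' "$pat"
    fi
done < "$HOME/Mail/goldlist"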
I'm not sure if this will help you or not. There is a "-o" parameter to output only the matching expression.
From the man page:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
I have to deal with text files in a motley selection of formats. Here's an example (Columns A and B are tab delimited):
A B
a Name1=Val1, Name2=Val2, Name3=Val3
b Name1=Val4, Name3=Val5
c Name1=Val6, Name2=Val7, Name3=Val8
The files could have headers or not, have mixed delimiting schemes, have columns with name/value pairs as above etc.
I often have the ad-hoc need to extract data from such files in various ways. For example from the above data I might want the value associated with Name2 where it is present. i.e.
A B
a Val2
c Val7
What tools/techniques are there for performing such manipulations as one line commands, using the above as an example but extensible to other cases?
I don't like sed too much, but it works for such things:
var="Name2";sed -n "1p;s/\([^ ]*\) .*$var=\([^ ,]*\).*/\1 \2/p" < filename
Gives you:
A B
a Val2
c Val7
You have all the basic bash shell commands, for example grep, cut, sed and awk at your disposal. You can also use Perl or Ruby for more complex things.
From what I've seen I'd start with Awk for this sort of thing and then if you need something more complex, I'd progress to Python.
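For the Name2 example above, the awk version might look like this (assuming the tab-delimited layout shown; a sketch, not a general parser):

awk -F'\t' 'NR == 1 { print; next }
            match($2, /Name2=[^,]*/) {
                print $1 "\t" substr($2, RSTART + 6, RLENGTH - 6)
            }' file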
I would use sed:
# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive
Since you have Cygwin, I'd go with Perl. It's the easiest to learn (check out the O'Reilly book Learning Perl) and widely applicable.
I would use Perl. Write a small module (or more than one) for dealing with the different formats. You could then run Perl one-liners using that library. An example of what it could look like:
perl -MParser -e 'parser("in.input")->get("Name2")'
Don't quote me on the syntax, but that's the general idea. Abstract the task at hand to allow you to think in terms of what you need to do, not how you need to do it. Ruby would be another option, it tends to have a cleaner syntax, but either language would work.