Grep in reverse order without reading whole file

I have a log file that may be very large (10+ GB). I'd like to find the last occurrence of an expression. Is it possible to do this with standard POSIX commands?
Here are some potential answers, from similar questions, that aren't quite suitable.
Use tail -n <x> <file> | grep -m 1 <expression>: I don't know how far back the expression is, so I don't know what <x> would be. It could be several GB previous, so then you'd be tailing the entire file. I suppose you could loop and increment <x> until it's found, but then you'd be repeatedly reading the last part of the file.
Use tac <file> | grep -m 1 <expression>: tac reads the entire source file. It might be possible to chain something on to sigkill tac as soon as some output is found? Would that be efficient?
Use awk/sed: I'm fairly sure these both always start from the top of the file (although I may be wrong, my sed-fu is not strong).
"There'd be no speed up so why bother": I think that's incorrect, since file systems can seek to the end of a file without reading the whole thing. There'd be a little trial and error/buffering to find each new line, but that shouldn't slow things down much, compared to reading (e.g.) 10 GB that are never used.
Write a python/perl script to do it: this is my fall-back if no one can suggest anything better. I'd rather stick to something that can be done straight through the command line, since I'm executing it straight through ssh, and I'd rather not have to upload a script file as well. Using mmap's rfind() in python, I think we can do it in a few lines, provided the expression to find is static (which mine, unfortunately, is not). A regex requires a bit more work, something like this.
If it helps, the expression is anchored at the start of a line, e.g. "^foo \d+$".

Whatever script you write will almost certainly be slower than:
tac file | grep -m 1 '^foo [0-9][0-9]*$'

This awk script will search through the whole file and print the last line matching the given /pattern/:
$ awk '/pattern/ { line=$0 } END { print line }' gigantic.log
Using tac will be a better option (this uses GNU sed to output the first (i.e. last) found match with '/pattern/', after which it terminates, killing the pipeline):
$ tac gigantic.log | gsed -n '/pattern/{p;q}'
Using Perl or C or some other language, you could seek to the end of the file, step back 4kb (or something), and then
read forwards 4kb,
step back 8kb
repeat until the pattern is found, taking care to handle partial lines at the chunk boundaries correctly.
(This, apart from looking for a pattern, may actually be what tac does: one implementation of tac)
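The same idea can be sketched at the shell level with standard tools, reading roughly at most twice the bytes it actually needs instead of the whole file. This is only a rough illustration under stated assumptions: the file name, the 4096-byte starting chunk, and the partial-first-line handling are all mine; the pattern is the one from the question.

file=gigantic.log
pattern='^foo [0-9][0-9]*$'
size=$(wc -c < "$file")
chunk=4096
while :; do
    offset=$((size - chunk))
    [ "$offset" -lt 0 ] && offset=0
    if [ "$offset" -gt 0 ]; then
        # skip the first, possibly partial, line of the chunk before matching
        match=$(tail -c +"$((offset + 1))" "$file" | tail -n +2 | grep "$pattern" | tail -n 1)
    else
        # the chunk now covers the whole file
        match=$(grep "$pattern" "$file" | tail -n 1)
    fi
    if [ -n "$match" ]; then
        printf '%s\n' "$match"
        break
    fi
    [ "$offset" -eq 0 ] && break   # scanned everything, no match
    chunk=$((chunk * 2))           # read a larger chunk from the end and retry
done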

Related

Count Lines, grep, head, and tail inside Feather Files

Setup: I am contemplating switching from writing large (~20GB) data files with csv to feather format, since I have plenty of storage space and the extra speed is more important. One thing I like about csv files is that at the command line, I can do a quick
wc -l filename
to get a row count, even for large data files. Also, I can quickly search for a simple string with
grep search_string filename
The head and tail commands are also very useful at times. These are straight-forward and work well with csv files, but not with feather. If I try any of them on a feather file, I do not get results that make sense or are helpful.
While I certainly can read a feather file into, say, Python or R, and analyze it then, the hassle of writing out the path and importing the necessary libraries is something I'd rather dispense with.
My Question: Does there exist either a cross-platform (at least Mac and Linux) feather file reader I can use to quickly read in and view feather data (this would be in tabular format) with features corresponding to row count, grep, head, and tail? Or are there simple CLI utilities I could install that would enable me to do the equivalent of line count, grep, head, and tail?
I've seen this question, but it is very incomplete relative to my question.
With feather files you must use Python or R programs.
With csv you can use any of the common text manipulation utilities available to Linux/Unix users.
Linux text manipulation tools:
reader: less
search: grep
converters: awk, sed
extractor: split
editor: vim
Each of the above tools requires some learning and practice.
Suggestion
If you have programming skill, create a program to manipulate your feather file.
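If Python with pyarrow and pandas is available on the machine, short one-liners can stand in for wc -l and head at the command line. This is only a sketch: data.feather is a stand-in name, and both libraries must already be installed.

python3 -c "import pyarrow.feather as f; print(f.read_table('data.feather').num_rows)"
python3 -c "import pandas as pd; print(pd.read_feather('data.feather').head())"

The first prints the row count; the second prints the first few rows as a quick head substitute.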

Does [:space:] in a grep command not include newlines and carriage returns? [duplicate]

This question was closed as a duplicate of: How to grep for the whole word.
I'm currently writing a simple Bash script. The idea is to use grep to find the lines where a certain pattern is found, within some files. The pattern contains 3 capital letters at the start, followed by 6 digits; so the regex is [A-Z]{3}[0-9]{6}.
However, I need to only include the lines where this pattern is not concatenated with other strings, or in other words, if such a pattern is found, it has to be separated from other strings with spaces.
So if the string which matches the pattern is ABC123456 for example, the line something ABC123456 something should be fine, but somethingABC123456something should fail.
I've extended my regex using the [:space:] character class, like so:
[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]
And this seems to work, except for when the string which matches the pattern is the first or last one in the line.
So, the line something ABC123456 something will match correctly;
The line ABC123456 something won't;
And the line something ABC123456 won't as well.
I believe this has something to do with [:space:] not counting new lines and carriage returns as whitespace characters, even though it should from my understanding. Could anyone spot if I'm doing something wrong here?
A common solution to your problem is to normalize the input so that there is a space before and after each word.
sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]'
Your question assumes that the newlines are part of what grep sees, but that is not true (or at least not how grep is commonly implemented). Instead, grep reads each line, minus its terminating newline, into a memory buffer, and then applies the regular expression to that buffer.
A similar but different solution is to specify beginning of line or space, and correspondingly space or end of line:
grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' file
but this might not be entirely portable.
You might want to postprocess the results to trim any spaces from the extracted strings, too; but I have already had to guess several things about what you are actually trying to accomplish, so I'll stop here.
(Of course, sed can do everything grep can do, and then some, so perhaps switch to sed or Awk entirely rather than build an elaborate normalization pipeline around grep.)
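Since the question was closed as a duplicate of the whole-word question, one more option worth knowing is grep's word match. It is only an approximation of what you asked for, because -w treats any non-word character (not just spaces) as a boundary, and -w is not strictly POSIX, although GNU and BSD grep both support it:

grep -owE '[A-Z]{3}[0-9]{6}' file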

dash equivalent to bash's curly bracket syntax?

In bash, php/{composer,sismo} expands to php/composer php/sismo. Is there any way to do this with /bin/sh (which I believe is dash), the system shell? I'm writing git hooks and would like to stay away from bash as long as I can.
You can use printf.
% printf 'str1%s\t' 'str2' 'str3' 'str4'
str1str2 str1str3 str1str4
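If the goal is to pass the expanded names as arguments to a single command in plain sh, one hedged follow-up is to let the shell word-split the printf output into the positional parameters (this breaks if the names themselves contain spaces or glob characters; some_command is a placeholder):

set -- $(printf 'php/%s ' composer sismo)
some_command "$@"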
There doesn't seem to be a way. You will have to use loops to generate these names, perhaps in a function. Or use variables to substitute common parts, maybe with "set -u" to prevent typos.
I see that you prefer dash for performance reasons, however you don't seem to provide any numbers to substantiate your decision. I'd suggest you measure actual performance difference and reevaluate. You might be falling for premature optimization, as well. Consider how much implementation and debugging time you'll save by using Bash vs. possible performance drop.
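For the "use variables to substitute common parts" suggestion above, a minimal sketch (some_command is again a placeholder):

set -u        # abort on misspelled variable names
d=php
some_command "$d/composer" "$d/sismo"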
I really like the printf solution provided by @mikeserv, but I thought I'd provide an example using a loop.
The below would probably be most useful if you wish to execute one command for each expanded string, rather than provide both strings as args to the same command.
for X in composer sismo; do
echo "php/$X" # replace 'echo' with your command
done
You could, however, rewrite it as
ARGS="$(for X in composer sismo; do echo "php/$X"; done)"
echo $ARGS # replace 'echo' with your command
Note that $ARGS is unquoted in the above command, which means its content is word-split (i.e. if any of your original strings contain spaces, it will break).

GREP - finding all occurrences of a string

I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not developed in-house (entirely) we cannot simply look for occurrences in messages.properties and be done. We must go through JSP's, Java code, and xml.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively:
1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
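A rough command-level illustration of steps [1]-[3]; the file names are made up, and note that the working file holds grep's file:line output, so the false-positive filter matches anywhere in the line rather than only at line start:

grep -ir 'SOME_PATTERN' . > working.txt                     # [1] all candidate matches
grep 'import' working.txt > false-positives.tmp             # [2] probable false positives
grep -v 'import' working.txt > tmp && mv tmp working.txt    # [3] remove them from the working copy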
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try the s/regexp/replacement/ command with sed.
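For example, a hedged one-liner for the white-labeling job might look like this; SOME_PATTERN and REPLACEMENT stand in for your real strings, -i is GNU sed's in-place option, and the results should be reviewed before you trust them:

grep -rl 'SOME_PATTERN' . | xargs sed -i 's/SOME_PATTERN/REPLACEMENT/g'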
You can also try awk. Its -F option sets the field separator, so you can use -F ';' if the fields in your files are separated by ';'.
The best solution, however, would be a simple script in Perl or Python.

How to determine which pattern in a file matched with grep?

I use procmail to do extensive sorting on my inbox. My next to last recipe matches the incoming From: to a (very) long white/gold list of historically good email addresses, and patterns of email addresses. The recipe is:
# Anything on the goldlist goes straight to inbox
:0
* ? formail -zxFrom: -zxReply-To | fgrep -i -f $HOME/Mail/goldlist
{
LOG="RULE Gold: "
:0:
$DEFAULT
}
The final recipe puts everything left in a suspect folder to be examined as probable spam. Goldlist is currently 7384 lines long (yikes...). Every once in a while, I get a piece of spam that has slipped through and I want to fix the failing pattern. I thought I read a while ago about a special flag in grep that helped show the matching patterns, but I can't find that again. Is there a way to use grep that shows the pattern from a file that matched the scanned text? Or another similar tool that would answer the question short of writing a script to scan pattern by pattern?
grep -o will output only the matched text (as opposed to the whole line). That may help. Otherwise, I think you'll need to write a wrapper script to try one pattern at a time.
I'm not sure if this will help you or not. There is a "-o" parameter to output only the matching expression.
From the man page:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
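In this case, one way to see which goldlist entry fired is to re-run the extraction by hand with -o against the same pattern file the recipe uses; since fgrep patterns are fixed strings, the printed match is (up to case) the entry that matched. Here message.eml is a stand-in for a saved copy of the offending mail:

formail -zxFrom: < message.eml | fgrep -io -f $HOME/Mail/goldlist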

Resources