Does [:space:] in a grep command not include newlines and carriage returns? [duplicate] - grep

I'm currently writing a simple Bash script. The idea is to use grep to find the lines where a certain pattern is found, within some files. The pattern contains 3 capital letters at the start, followed by 6 digits; so the regex is [A-Z]{3}[0-9]{6}.
However, I need to only include the lines where this pattern is not concatenated with other strings, or in other words, if such a pattern is found, it has to be separated from other strings with spaces.
So if the string which matches the pattern is ABC123456 for example, the line something ABC123456 something should be fine, but somethingABC123456something should fail.
I've extended my regex using the [:space:] character class, like so:
[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]
And this seems to work, except for when the string which matches the pattern is the first or last one in the line.
So, the line something ABC123456 something will match correctly;
The line ABC123456 something won't;
And the line something ABC123456 won't as well.
I believe this has something to do with [:space:] not counting newlines and carriage returns as whitespace characters, even though it should, as far as I understand. Can anyone spot what I'm doing wrong here?

A common solution to your problem is to normalize the input so that there is a space before and after each word, including the first and last word on a line.
sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]'
Your question assumes that the newlines are part of what grep sees, but that is not true (or at least not how grep is commonly implemented). Instead, it reads just the contents of each new line into a memory buffer, and then applies the regular expression to that buffer.
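To see this for yourself, here is a small sketch (the sample strings are made up, and I'm assuming a typical Linux grep). It counts the lines on which [[:space:]] matches right before the end of the line:

```shell
# The trailing newline is not part of what grep matches against: no match.
plain=$(printf 'ABC123456\n' | grep -c '[[:space:]]$' || true)

# A real space at the end of the line does match.
spaced=$(printf 'ABC123456 \n' | grep -c '[[:space:]]$' || true)

# A carriage return (CRLF line endings) stays inside the line, so it matches too.
crlf=$(printf 'ABC123456\r\n' | grep -c '[[:space:]]$' || true)

echo "plain=$plain spaced=$spaced crlf=$crlf"
```

The `|| true` is only there because grep exits with a non-zero status when it finds nothing.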
A similar but different solution is to specify beginning of line or space, and correspondingly space or end of line:
grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' file
but this might not be entirely portable.
You might want to postprocess the results to trim any spaces from the extracted strings, too; but I have already had to guess several things about what you are actually trying to accomplish, so I'll stop here.
(Of course, sed can do everything grep can do, and then some, so perhaps switch to sed or Awk entirely rather than build an elaborate normalization pipeline around grep.)
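As a rough sketch of how the anchored variant behaves, here it is run against the three sample lines from the question plus the one that should fail. The input is inlined with printf instead of reading a file, and trimming with tr is just one way I'd guess you might want to clean up the -o output:

```shell
# The three lines that should match, plus the concatenated one that should not.
input=$(printf '%s\n' \
  'something ABC123456 something' \
  'ABC123456 something' \
  'something ABC123456' \
  'somethingABC123456something')

pattern='(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)'

# Whole lines that contain a free-standing code.
matched_lines=$(printf '%s\n' "$input" | grep -E "$pattern")

# Only the codes; -o keeps the surrounding blanks, so strip them afterwards.
codes=$(printf '%s\n' "$input" | grep -oE "$pattern" | tr -d '[:blank:]')

printf '%s\n' "$matched_lines"
printf '%s\n' "$codes"
```

tr -d '[:blank:]' removes spaces and tabs only, so the newlines between the extracted codes survive.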

Related

GREP: Find words containing multiple specific characters

To clarify a bit, while I am aware how to grab words with a single specific character, I'm unsure how to approach looking for multiple of them. For example, what grep command would be used to retrieve only the words containing both "b" and "p" (in any order), not just one or the other?
Using the above example, if you're given words like "bear," "pear," "biography," and "printable," it would only return the last two words. These are some of my previous attempts.
grep -E "\b[bp]\b" input
grep -E "\b(b|p)\b" input
grep -E "\bb.*p\b" input
You can do it with a regular expression. For instance, here is a snippet for your problem:
grep '\w*[b]\w*[p]\w*\|\w*[p]\w*[b]\w*' test.txt
Helpful links to read further:
https://www.cyberciti.biz/faq/grep-regular-expressions/
https://regexr.com/
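If the \w and \| extensions are not available in your grep, a more portable sketch is to put one word per line and chain two greps, one per required letter. The word list below is just the example from the question:

```shell
# Split the text into one word per line, then keep words with a "b" AND a "p".
both=$(printf 'bear pear biography printable\n' |
  tr -s '[:space:]' '\n' |
  grep 'b' |
  grep 'p')

printf '%s\n' "$both"
```

Each extra required letter is one more grep in the pipeline, and the order of the letters inside the word no longer matters.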

Snippets for Gedit: how to change the text in a placeholder to make the letters uppercase?

I’m trying to improve a snippet for Gedit that helps me write shell scripts.
Currently, the snippet encloses the name of a variable into double quotes surrounding curly brackets preceded with a dollar sign. But to make the letters uppercase, I have to switch to the caps-lock mode or hold down a shift key when entering the words. Here is the code of the snippet:
"\${$1}"
I would like that the snippet makes the letters uppercase for me. To do that, I need to know how to make text uppercase and change the content of a placeholder.
I have carefully read the following articles:
https://wiki.gnome.org/Apps/Gedit/Plugins/Snippets
https://blogs.gnome.org/jessevdk/2009/12/06/about-snippets/
https://www.marxists.org/admin/volunteers/gedit-sed.htm
How do you create a date snippet in gedit?
But I still have no idea how to achieve what I want — to make the letters uppercase. I tried to use the output of shell programs, a Python script, the regular expressions — the initial text in the placeholder is not changed. The last attempt was the following (for clarity, I removed the surrounding double-quotes and the curly brackets with the dollar — working just on the letter case):
${1}$<[1]: return $1.upper()>
But instead of MY_VARIABLE I get my_variableMY_VARIABLE.
Perhaps, the solution is obvious, but I cannot get it.
I did it! I found a solution!
First of all, I have to say that I don’t consider this solution correct or in keeping with the ideas of the Gedit editor. It’s a dirty hack, of course. But, strangely, there seems to be no standard way to change the initial content of placeholders in snippets, or at least I haven’t found one.
So, if they don’t allow us to change the text in placeholders, let’s ask the system to do that.
The first thought that struck me was to print backspace characters. There are two ways to do that: a shell script and a Python script. The first approach might look like:
$1$(printf '\b')
The second one should do the same:
$1$<[1]: return '\b'>
But neither of them works: Gedit prints surrogate squares instead of real backspace characters.
Thus, xdotool is our friend! So is metaprogramming! You will not believe — metaprogramming in a shell script inside a snippet — sed will be writing the scenario for xdotool. Also, I’ve added a feature that changes spaces to underscores for easier typing. Here is the final code of the snippet:
$1$(
eval \
xdotool key \
--delay 5 \
`echo "${1}" | sed "s/./ BackSpace/g;"`
echo "\"\${${1}}\"" \
| tr '[a-z ]' '[A-Z_]'
)$0
Here are some explanations.
I normally avoid backticks in my scripts because of their quirks and incompatibilities, but this is an exception. It seems Gedit cannot interpret the $(...) constructions correctly when they are nested, so I use the backticks here.
A couple of words about using the xdotool command. The most critical part is the --delay option. By default, it’s 12 milliseconds. If I leave it as is, there will be an error when the length of the text in the placeholder is quite long. Not to mention the snippet processing becomes slow. But if I set the time interval too small, some of the emulated keystrokes sometimes will be swallowed somewhere. So, five milliseconds is the delay that turns out optimal for my system.
At last, as I use backspaces to erase the typed text, I cannot use template parts outside the placeholder. Thus, such transformations must be inside the script. The complex heap after the echo command is what the template parts are.
The final tr command does the actual uppercasing, which was the motivation for all this work.
It turns out that Gedit snippets can be a powerful tool. Good luck!
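To check the two text transformations outside of Gedit, here is a small sketch that runs only the sed and tr parts in a plain shell. The variable name stands in for the placeholder $1, its value is a made-up example, and xdotool is deliberately not called:

```shell
name='my variable'   # hypothetical placeholder text, standing in for $1

# sed turns every typed character into one " BackSpace" argument for xdotool key.
keys=$(echo "$name" | sed 's/./ BackSpace/g')

# Build the quoted "${...}" text, then upper-case it and turn spaces into underscores.
replacement=$(printf '"${%s}"\n' "$name" | tr '[a-z ]' '[A-Z_]')

echo "$keys"
echo "$replacement"
```

The number of BackSpace words equals the number of typed characters, which is why a long placeholder text makes the xdotool call slow.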

How to find words with a recurring character in a file

This might seem like a question that has already been answered, so pardon me if that's the case, but I can't seem to find a clear answer or an explanation of how to find words in a file with a specified number of occurrences of a character (e.g. words containing the character '-' three times, such as 'long-and-complex-word').
I'm aware that it is possible to use the command
grep -oE '.{n}'
to find words with consecutive repetitions of a character, but I'm looking for a way to find repetitions of a character in no particular order.
Here are the commands that I've tried that aren't working
grep -E '*[-]*[-]*[-]*' file
grep -Ex '* \-* \-* \ -*' file
Thanks.
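One possible approach, sketched here under the assumption that a "word" is a run of letters, digits and hyphens: put one word per line, then let grep -x require that the whole word consists of exactly three hyphens with any non-hyphen characters around them. The sample text is made up for illustration.

```shell
text='a long-and-complex-word, a two-part word and a-b-c-d-e'

# Turn every run of characters that are not alphanumeric or "-" into a newline,
# then keep only the lines (words) that contain exactly three hyphens.
three_hyphens=$(printf '%s\n' "$text" |
  tr -cs '[:alnum:]-' '\n' |
  grep -xE '[^-]*(-[^-]*){3}')

printf '%s\n' "$three_hyphens"
```

Without -x the pattern would also match inside words that have more than three hyphens, so the whole-line match is what enforces "exactly three".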

GREP - finding all occurrences of a string

I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not developed in-house (entirely) we cannot simply look for occurrences in messages.properties and be done. We must go through JSP's, Java code, and xml.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively:
[1] Create a text file with all possible matches as a starting point.
[2] Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
[3] Use filter X again to remove those matches from your working file (a copy of [1]).
[4] Do a quick visual pass of the tmp file and add any real matches back in.
[5] Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
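The steps above could be sketched roughly like this. The sample file, the directory names and the exact filter are assumptions for illustration; note that because grep -rn prefixes each hit with file:line:, the import filter has to look after that prefix rather than at the start of the line.

```shell
src=$(mktemp -d)   # stands in for the code base
out=$(mktemp -d)   # holds the working files, kept outside the searched tree
printf '%s\n' \
  'import com.SOME_PATTERN.util.Helper;' \
  'String msg = "SOME_PATTERN is currently unavailable";' > "$src/Sample.java"

# [1] Every candidate match, with file name and line number.
grep -rin 'SOME_PATTERN' "$src" > "$out/all.txt" || true
cp "$out/all.txt" "$out/working.txt"

# [2] Dump probable false positives (import lines) into a tmp file.
filter=':[0-9]+:[[:space:]]*import[[:space:]]'
grep -E "$filter" "$out/working.txt" > "$out/tmp.txt" || true

# [3] Remove the same matches from the working file.
grep -vE "$filter" "$out/working.txt" > "$out/working.new" || true
mv "$out/working.new" "$out/working.txt"

# [4] is a manual review of tmp.txt; [5] repeats [2]-[4] with the next filter.
false_positives=$(cat "$out/tmp.txt")
remaining=$(cat "$out/working.txt")
printf 'set aside:\n%s\nstill to fix:\n%s\n' "$false_positives" "$remaining"
rm -rf "$src" "$out"
```

Keeping the set-aside lines in a file instead of discarding them with grep -v is what makes the visual pass in [4] possible.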
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try the s/regexp/replacement/ command of sed.
You can also try the awk command. Its -F option sets the field separator; for example, -F ';' splits each line of your files on semicolons.
The best solution will be however a simple script in Perl or in Python.
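As a minimal sketch of both suggestions: NEW_NAME and the semicolon-separated sample line are hypothetical, and the Java line is the one from the question.

```shell
line='public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";'

# sed: replace every occurrence of the pattern on each input line.
cleaned=$(printf '%s\n' "$line" | sed 's/SOME_PATTERN/NEW_NAME/g')

# awk: split on ";" and print the second field.
second_field=$(printf 'first;second;third\n' | awk -F ';' '{ print $2 }')

printf '%s\n' "$cleaned"
printf '%s\n' "$second_field"
```

Like grep, sed works line by line by default, so a string that is split across two lines in the source still needs separate handling.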

How to determine which pattern in a file matched with grep?

I use procmail to do extensive sorting on my inbox. My next to last recipe matches the incoming From: to a (very) long white/gold list of historically good email addresses, and patterns of email addresses. The recipe is:
# Anything on the goldlist goes straight to inbox
:0
* ? formail -zxFrom: -zxReply-To | fgrep -i -f $HOME/Mail/goldlist
{
LOG="RULE Gold: "
:0:
$DEFAULT
}
The final recipe puts everything left in a suspect folder to be examined as probable spam. Goldlist is currently 7384 lines long (yikes...). Every once in a while, I get a piece of spam that has slipped through and I want to fix the failing pattern. I thought I read a while ago about a special flag in grep that helped show the matching patterns, but I can't find that again. Is there a way to use grep that shows the pattern from a file that matched the scanned text? Or another similar tool that would answer the question short of writing a script to scan pattern by pattern?
grep -o will output only the matched text (as opposed to the whole line). That may help. Otherwise, I think you'll need to write a wrapper script to try one pattern at a time.
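Such a wrapper could look roughly like the sketch below. The goldlist contents and the From: value are made up, and grep -F is used in place of fgrep, which does the same fixed-string matching.

```shell
goldlist=$(mktemp)   # stands in for $HOME/Mail/goldlist
printf '%s\n' 'alice@example.com' '@example.org' 'bob@' > "$goldlist"
from='Spammer <bob@spam.invalid>'   # hypothetical header value that slipped through

# Try each pattern on its own and report the ones that match.
matched=$(
  while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue   # an empty pattern would match everything
    if printf '%s\n' "$from" | grep -qiF -e "$pattern"; then
      printf '%s\n' "$pattern"
    fi
  done < "$goldlist"
)

printf '%s\n' "$matched"
rm -f "$goldlist"
```

This starts one grep per pattern, which is slow for thousands of lines, but it only has to run when a piece of spam has actually slipped through.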
I'm not sure if this will help you or not. There is a "-o" parameter to output only the matching expression.
From the man page:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
