search multiple keywords from text files, and format output - grep

I am having trouble with the following problem:
say I have a list of words that I would like to search from multiple text files;
e.g. in keywords.txt, I have:
word1
word2
word3
I'd like to search each individual keyword in a number of other text files. Hopefully, the output of the search can be formatted like this:
Keyword Sentence
word1 This sentence contains word1.
word1 This paragraph also contains word1.
word2 word2 does not exist in this file.
In other words, I am hoping to sort the output based on keyword.
I made a little progress using grep, but I am not sure about: 1) how to sort the grep output based on keywords; 2) how to output the entire sentence, as opposed to only the line that contains the keyword (which is the default behavior of grep).
Any suggestion is greatly appreciated.
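One way to approach both points is to step outside grep. The sketch below uses Python rather than grep, and it assumes that a sentence ends with ".", "!" or "?" followed by whitespace, which is a naive rule. The function name and the sample text are illustrative only.

```python
import re

def sentences_by_keyword(keywords, text):
    """Return (keyword, sentence) pairs, grouped in keyword order."""
    # Collapse newlines so a sentence wrapped across lines is matched whole.
    flat = " ".join(text.split())
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    rows = []
    for kw in keywords:
        # \b keeps "word1" from matching inside "word10".
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b")
        for sentence in sentences:
            if pattern.search(sentence):
                rows.append((kw, sentence))
    return rows

sample = (
    "This sentence contains word1. Nothing here.\n"
    "This paragraph also\ncontains word1. word2 appears once."
)
print("Keyword", "Sentence", sep="\t")
for kw, sentence in sentences_by_keyword(["word1", "word2", "word3"], sample):
    print(kw, sentence, sep="\t")
```

Because the outer loop runs over the keywords, the rows come out grouped by keyword without a separate sort step. To use real files, read keywords.txt into a list and pass each file's contents as text.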

Related

Remove everything before and after a word that begins with "#" in Google Sheets?

I have an Ifttt setup that writes to a Google Sheet. The content of the cell, directly from the source, is a sentence. Right now I'm manually cleaning up the cells as they come in but since this is a recurring sheet that generates content it's been time consuming.
The content will always have "#" with the word after it. Examples:
Here is an #example lol words here
#AnotherExample or this
Is there a formula to take all the content before and after the # so the result should be:
example
AnotherExample
I kept trying the =REGEXREPLACE formula but I can't seem to make it work for my use case. Any help is appreciated!
Something like:
=REGEXEXTRACT(A1,"#(\S*)")
# - Match a literal "#".
(\S*) - 0+ non-whitespace characters captured in a group.
REGEXEXTRACT() will then extract this capture group. You could also use #(\w*) to capture 0+ word characters if your input can be something like "test1 #test2, test3".
An array variant:
=INDEX(IF(A1:A="","",REGEXEXTRACT(A1:A,"#(\w*)")),)
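For checking the pattern outside the spreadsheet, here is a minimal Python sketch of the same capture-group idea. The function name is hypothetical, and returning an empty string when there is no "#" is an assumption chosen to mirror the blank-cell branch of the array variant.

```python
import re

def first_hashtag(cell):
    """Return the word following the first '#', or '' if there is none."""
    match = re.search(r"#(\w*)", cell)
    return match.group(1) if match else ""

cells = [
    "Here is an #example lol words here",
    "#AnotherExample or this",
    "test1 #test2, test3",
    "",
]
print([first_hashtag(c) for c in cells])
```

With \w* the trailing comma in "#test2," is left out; with \S* it would be captured along with the word.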

Grep getting numbers in a range

I'm trying to display rows which include numbers in range 20-200 in column 12 from a csv file. However, I don't get the right input.
I tried this:
grep -E "^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,([2-9][0-9]|1[0-9][0-9]|200)" file.csv > sc1_d.csv
What am I doing wrong?
Any thoughts?
Keep it simple, just use awk:
awk -F, '(20 <= $12) && ($12 <= 200)' file
If that doesn't do exactly what you want then edit your question to explain in what way this and your own attempt "don't get the right input" and to show concise, testable sample input and expected output.
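The same numeric comparison can be sketched in Python with the standard csv module. This is an illustration rather than a drop-in replacement for the awk command: it assumes the double-quote style of CSV, skips rows where column 12 is missing or not a number, and the function name and sample rows are made up.

```python
import csv
import io

def rows_in_range(lines, column=12, low=20, high=200):
    """Yield CSV rows whose 1-based `column` holds a number in [low, high]."""
    for row in csv.reader(lines):
        try:
            value = float(row[column - 1])
        except (IndexError, ValueError):
            continue  # short row or non-numeric cell: skip it
        if low <= value <= high:
            yield row

sample = io.StringIO(
    'a,b,c,d,e,f,g,h,i,j,k,20,x\n'
    'a,b,c,d,e,f,g,h,i,j,k,201,x\n'
    'a,"Dr. Bob, MD",c,d,e,f,g,h,i,j,k,150,x\n'
)
for row in rows_in_range(sample):
    print(row[11])
```

Comparing numbers instead of matching digit patterns avoids having to spell out the range as a regular expression.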
There are two issues here.
(1) Your way of finding column 12 won't work in any rows that have cells containing commas. And that's difficult to resolve unless you know who wrote the CSV file. Since there's no single CSV spec, there are multiple ways of escaping commas and quotes in CSV files. For example, in one spec, a comma within a cell is escaped with a backslash (for example, my doctor's name might be written as Dr. Bob\, MD). In another, any cell value that contains a comma needs to be put in double quotes, and double quotes themselves need to be written as two double quotes (so, "Dr. Bob, MD").
But if for some reason you happen to know that embedded commas in cell values are not an issue in your CSV file, you can ignore that.
(2) That expression would also allow some other values, such as 201 or 20000B, that you don't want. So if you know that this is not the last column, you can just add a comma after the choices:
([2-9][0-9]|1[0-9][0-9]|200),
And if you can't make that assumption, then you can just look for a comma OR end of line:
([2-9][0-9]|1[0-9][0-9]|200)(,|$)
And finally you can employ a "repeat" to specify exactly 11 instances of the [^,]*, pattern. So now your grep command looks like this:
grep -E "^([^,]*,){11}([2-9][0-9]|1[0-9][0-9]|200)(,|$)" file.csv > sc1_d.csv
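To see the effect of the trailing (,|$), the two patterns can be compared in Python's re module, which accepts the same syntax for these expressions. The helper that builds a test line and the sample values are assumptions made for this illustration.

```python
import re

# The original idea: 11 comma-terminated fields, then the number, no terminator.
LOOSE = re.compile(r"^([^,]*,){11}([2-9][0-9]|1[0-9][0-9]|200)")
# The corrected pattern: the number must be followed by a comma or end of line.
STRICT = re.compile(r"^([^,]*,){11}([2-9][0-9]|1[0-9][0-9]|200)(,|$)")

def line_with(value):
    """Build a 13-column line whose 12th column is `value`."""
    return ",".join(["x"] * 11 + [value, "tail"])

values = ["20", "200", "201", "20000B", "19", "5"]
print([v for v in values if LOOSE.search(line_with(v))])
print([v for v in values if STRICT.search(line_with(v))])
```

The loose pattern accepts 201 and 20000B because it only checks that column 12 starts with a valid number. The strict pattern keeps just 20 and 200.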

TextPad Replace Character and Line Feed with Nothing

How do I replace a line in TextPad that contains only ' with nothing (i.e., delete lines with just that one character)?
I have an Excel Spreadsheet containing three columns:
Column A - single quote
Column B - some number
Column C - single quote plus a comma
There are over 90,000 rows on this spreadsheet with data in column B. There are over one million rows with just a single quote in column A because I did a "Ctrl+D" on that column to copy the value in that column (a single quote) down to all rows.
When I copy and paste these three columns into TextPad, I end up with over one million lines. I replaced the tabs with nothing using the F8/Replace dialog.
(Replace: tab with: empty string)
The majority of what is left are lines that contain only a single quote. I want to delete these 900,000 extra lines.
How do I specify a Replace (delete) of single quote + line feed? I do not want to delete any of the single quotes from the lines that include a number that came from column B.
I just figured it out. The backslash n is the line feed.
If I check Regular Expression and enter this Find what:
'\n
(Keeping empty string for Replace with) and Replace All, I have deleted those extra lines.
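If the file is too large to handle comfortably in an editor, the same cleanup can be sketched in Python. This version is slightly stricter than the '\n replacement: it compares whole lines, so only lines that are exactly a single quote are dropped. The function name and sample data are illustrative.

```python
def drop_quote_only_lines(text):
    """Remove every line that consists of a single quote and nothing else."""
    kept = [line for line in text.splitlines() if line != "'"]
    return "\n".join(kept)

# Two data lines surrounded by quote-only lines, as after removing the tabs.
sample = "'123',\n'\n'\n'456',\n'"
print(drop_quote_only_lines(sample))
```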
I ran into the same problem. It did not work for me until I did this:
uncheck Regular Expression before entering \n in the Find box, then replace it with whatever you choose (in my case, it was ',').
Be aware that the entire list might end up transposed (that's what happened to my data).

replace everything after a character in Google spreadsheet

I did some searching and in openoffice and excel it looks like you can simply add an * at the beginning or end of a character to delete everything before and after it, but in Google spreadsheet this isn't working. Does it support this feature? So if I have:
keyword USD 0078945jg .12 N N 5748 8
And I want to remove USD and everything after it what do I use? I have tried:
USD* and (USD*) with regular expressions checked
But it doesn't work. Any ideas?
The * quantifier just needs to be applied to a dot (.) which will match any character.
To clarify: the * wildcard used in certain spreadsheet functions (eg COUNTIF) has a different usage to the * quantifier used in regular expressions.
In addition to the options that would be available in Excel (LEFT + FIND), pointed out by pnuts, you can use a variety of regex tools available in Google Sheets for text searching and manipulation.
For example, RegexReplace:
=REGEXREPLACE(A1,"(.*)USD.*","$1")
(.*) <- capture group () with zero or more * of any character .
USD.* <- exact match on USD followed by zero or more * of any character .
$1 <- replace with match in first capture group
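The pattern can be tried outside the spreadsheet as well. Below is a minimal sketch with Python's re module, where \1 plays the role of $1. The function name is hypothetical.

```python
import re

def strip_from_usd(cell):
    """Keep only the text before "USD"; cells without "USD" are unchanged."""
    # (.*) is greedy, so the cut happens at the last "USD" in the cell.
    return re.sub(r"(.*)USD.*", r"\1", cell)

print(repr(strip_from_usd("keyword USD 0078945jg .12 N N 5748 8")))
print(repr(strip_from_usd("a USD b USD c")))
```

Note that the space before USD is kept, so the first result is 'keyword ' with a trailing space. Wrapping the call in a trim step removes it if that matters.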
For spaces within keyword I suggest a helper column with a formula such as:
=left(A1,find("USD",A1)-1)
copied down to suit. The formula could be converted to values and the raw data (assumed to be in ColumnA) then deleted, if desired.
To add to the answers here, you can get into trouble when there are special characters in the text (I have been struggling with this for years).
You can put a backslash \ in front of special characters such as ?, + or . to escape them. But I still got stuck when there were further special characters in the text. I finally figured it out after reading find and replace in google sheets with regex.
Example: I want to remove the number, period and space from the beginning of a question like this: 1. What is your name?
Go to Edit → Find and replace
In the Find field, enter the following: .+\. (note: this includes a space at the end).
Note: In the Find and replace dialogue box, be sure to check "Search using regular expressions" and "match case". Leave the Replace field blank.
The result will be this text only: What is your name?
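The effect of escaping can be checked in Python's re module, which treats ?, + and . as special characters in the same way. The sketch below applies the Find pattern from the steps above and then shows what changes when a "?" is escaped. The function names are made up for the example.

```python
import re

def strip_numbering(question):
    """Apply the Find pattern from the steps above with an empty replacement."""
    return re.sub(r".+\. ", "", question)

def find_literal(needle, haystack):
    """Search for `needle` as plain text by escaping regex metacharacters."""
    return re.search(re.escape(needle), haystack) is not None

print(strip_numbering("1. What is your name?"))
# Unescaped, "?" makes the preceding "e" optional, so "nam" matches.
print(re.search(r"name?", "nam") is not None)
# Escaped, "?" is a literal character, so "nam" no longer matches.
print(find_literal("name?", "nam"))
```

Because .+ is greedy, the Find pattern removes everything up to the last period followed by a space, so it suits cells that hold a single numbered question.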

extract number from cell in openoffice calc

I have a column in open office like this:
abc-23
abc-32
abc-1
Now, I need to get only the sum of the numbers 23, 32 and 1 using a formula and regular expressions in calc.
How do I do that?
I tried
=SUMIF(F7:F16,"([:digit:].)$")
But somehow this does not work.
Starting with LibreOffice 6.4, you can use the newly added REGEX function to generically extract all numbers from a cell / text using a regular expression:
=REGEX(A1;"[^[:digit:]]";"";"g")
Replace A1 with the cell-reference you want to extract numbers from.
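To check the idea against the three sample cells from the question, here is a minimal Python sketch of the same "remove every non-digit" replacement followed by a sum. The function name is hypothetical, and skipping cells that contain no digits at all is an assumption.

```python
import re

def digits_only(cell):
    """Drop every character that is not a decimal digit."""
    return re.sub(r"[^0-9]", "", cell)

cells = ["abc-23", "abc-32", "abc-1"]
# Skip cells with no digits so int() is never called on an empty string.
total = sum(int(digits_only(c)) for c in cells if digits_only(c))
print(total)  # 56
```

Removing non-digits joins all digits in a cell into one number, so a value such as "a1b2" would be read as 12.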
Explanation of REGEX function arguments:
Arguments are separated by a semicolon ;
A1: Value to extract numbers from. Can be a cell-reference (like A1) or a quoted text value (like "123abc"). The following regular expression will be applied to this cell / text.
"[^[:digit:]]": Match every character which is not a decimal digit. See also list of regular expressions in LibreOffice
The outer square brackets [] encapsulate the list of characters to search for
^ adds a NOT, meaning that every character not included in the search list is matched
[:digit:] represents any decimal digit
"": replace matching characters (every non-digit) with nothing = remove them
"g": replace all matches (don't stop after the first non-digit character)
Unfortunately LibreOffice only supports regex in find/replace and in search.
If this is a once-only deal, I would copy column A to column B, then use Data > Text to Columns on B with - as the separator, leaving you with all the text in column B and the numbers in column C.
Alternatively, you could use =VALUE(RIGHT(A1;LEN(A1)-FIND("-";A1))) in column B, then sum column B.
I think that this is not exactly what you want, but maybe it can help you or others.
It is all about taking a substring (the MID function in Calc):
First: choose your cell (for example, one containing "abc-23").
Second: enter the start position (for "british", start position 4 gives "tish").
After that: to return all the remaining text, pass the result of the LEN function (length) for your cell ("abc-23") as the last parameter.
Code now looks like this:
D15="abc-23"
=MID(D15; 5; LEN(D15))
And the output is: 23
Changing the number (23 in this example) is no problem. However, if you change the text before it ("abc-"), the formula breaks because the start position is hard-coded as 5.
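The weakness of a fixed start position, and one way around it, can be sketched in Python. The first function mirrors MID with position 5. The second is an assumed alternative that looks for the last "-" instead. Both function names are illustrative.

```python
def after_fixed_position(cell, start=5):
    """Mimic MID(cell; 5; LEN(cell)): `start` is a 1-based position."""
    return cell[start - 1:]

def after_last_hyphen(cell):
    """Take whatever follows the last '-', wherever it sits in the cell."""
    return cell.rpartition("-")[2]

for cell in ["abc-23", "abcd-23"]:
    print(cell, after_fixed_position(cell), after_last_hyphen(cell))
```

For "abcd-23" the fixed position returns "-23", while the separator-based version still returns "23".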
Paste the string into a cell and open the search and replace dialog (Ctrl+F). Under the extended search options, tick "regular expression", search for ([\s,0-9])([^0-9\s])+ and replace it with $1.
Adjust the regex to your needs.
I didn't figure out how to do this in OpenOffice/LibreOffice directly. After frustrations in searching online and trying various formulas, I realised my sheet was a simple CSV format, so I opened it up in vim and used vim's built-in sed-like feature to find/replace the text in vim command mode:
:%s/abc-//g
This only worked for me because there were no other columns with this matching text. If there are other columns with the same text, then the solution would be a bit more complex.
If your sheet is not a CSV, you could copy the column out to a text file and use vim to find/replace, and then paste the data back into the spreadsheet. For me, this was a lot less frustrating than trying to figure this out in LibreOffice...
I won't bother with a solution without knowing if there really is interest, but, you could write a macro to do this. Extract all the numbers and then implement the sum by checking for contained numbers in the text.
