grep for argument of a LaTeX command

I'm trying to sort out some broken references in a latex file. They are commands such as \cref{ps.1.1}. I would like to grep my file and get only the argument of the command as output, in this case ps.1.1. grep -Po \\\\cref{.*?} my.tex gives me only the command, not the rest of the line, but I'd like to also get rid of the \cref{ and } in the output, so that I could iterate over them.

Here is a Perl one-liner, printing out only the matches, including multiple ones on the same line. It puts out a line per match, even for those on the same line, prepended with their line numbers.
perl -nle 'print "$.: $1" while(/\\cref\{(.*?)\}/g)' file.tex
Adjust this to suit the exact output you want.
For example, to print only the first match on each line, drop the /g modifier (remove the g after the regex) and change while to if; without /g, the while loop would keep matching the same text forever. To match several patterns, add them to the regex (separated by | and each grouped in parentheses) and print $2, $3, and so on as well. To see the whole line, change $1 to $_.
A short script would offer far more flexibility and room for further processing.
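As a rough illustration of what such a script could look like, here is a minimal Python sketch of the same idea as the Perl one-liner. The names CREF, cref_arguments and sample are made up for this example, and the input is a hard-coded list rather than a real .tex file.

```python
import re

# Non-greedy capture of everything between \cref{ and the next closing brace.
CREF = re.compile(r"\\cref\{(.*?)\}")


def cref_arguments(lines):
    """Yield (line_number, argument) for every cref command found in lines."""
    for number, line in enumerate(lines, start=1):
        # finditer returns every match on the line, like /g in the Perl version.
        for match in CREF.finditer(line):
            yield number, match.group(1)


# Hypothetical sample input standing in for the lines of a .tex file.
sample = [
    r"See \cref{ps.1.1} and \cref{ps.1.2}.",
    "No reference here.",
    r"Also \cref{eq.3}.",
]

for number, argument in cref_arguments(sample):
    print(f"{number}: {argument}")
```

To run it on a real file, the list could be replaced by an open file handle, since the function only needs an iterable of lines.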

Related

Remove two lines using sed

I'm writing a script which can parse an HTML document. I would like to remove two lines, how does sed work with newlines? I tried
sed 's/<!DOCTYPE.*\n<h1.*/<newstring>/g'
which didn't work. I tried this statement but it removes the whole document because it seems to remove all newlines:
sed ':a;N;$!ba;s/<!DOCTYPE.*\n<h1.*\n<b.*/<newstring>/g'
Any ideas? Maybe I should work with awk?
For the simple task of removing two lines if each matches some pattern, all you need to do is:
sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'
This uses an address matching the first line you want to delete. When the address matches, it executes:
Next - append the next line to the current pattern-space (including \n)
Then, it matches on an address for the contents of the second line (following \n). If that works it executes:
delete - discard current input and start reading next unread line
If d isn't executed, then both lines will print by default and execution will continue as normal.
To adjust this for three lines, you need only use N again. If you want to pull in multiple lines until some delimiter is reached, you can use a line-pump, which looks something like this:
/<!DOCTYPE.*/{
:pump
N
/some-regex-to-stop-pump/!b pump
/regex-which-indicates-we-should-delete/d
}
However, writing a full XML parser in sed or awk is a Herculean task and you're likely better off using an existing solution.
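For readers who find the line-pump hard to follow, the same control flow can be sketched in Python. This is only an approximation of the sed behaviour under simple assumptions: the function name drop_blocks and its parameters are invented for the example, and an unfinished block at the end of the input is simply kept.

```python
import re


def drop_blocks(lines, start, stop, delete_if):
    """Collect lines from a `start` match until `stop` is seen, then
    drop the collected block if it matches `delete_if`."""
    out = []
    it = iter(lines)
    for line in it:
        if not re.search(start, line):
            out.append(line)
            continue
        block = [line]
        finished = False
        # The "pump": keep appending lines until the stop pattern appears
        # somewhere in the accumulated block.
        for extra in it:
            block.append(extra)
            if re.search(stop, "\n".join(block)):
                finished = True
                break
        if finished and re.search(delete_if, "\n".join(block)):
            continue  # discard the whole block
        out.extend(block)
    return out


doc = ["<!DOCTYPE html>", "<h1>Title</h1>", "<b>bold</b>", "<p>keep</p>"]
print(drop_blocks(doc, r"<!DOCTYPE", r"<b", r"<h1"))
```

Both loops share one iterator, so lines consumed by the pump are not visited again by the outer loop.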
If an XML parsing tool is definitely not an option, awk may be one:
awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1' file
When we encounter a line with "<!DOCTYPE" set the variable lne to the line number + 1 (NR+1) and then skip to the next line. Then when the line is equal to lne (NR==lne) and the line contains "<h1", skip to the next line. Print all other lines by using 1.
My solution for a document like this:
<b>...
<first...
<second...
<third...
<a ...
this awk command works well (it relies on an awk that treats a multi-character RS as a regular expression, such as GNU awk):
awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'
That's all.
This might work for you (GNU sed):
sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D' file
Append the following line and if the pattern matches both lines in the pattern space delete them.
Otherwise, print then delete the first of the two lines and repeat.
To replace the two lines with another string, use:
sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
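The two-line window used by N, P and D can also be written out explicitly. The following Python sketch is an approximation rather than a translation of the sed commands: replace_pairs and its parameters are hypothetical names, the first pattern is searched anywhere in the first line, and the second must match at the start of the following line.

```python
import re


def replace_pairs(lines, first, second, replacement=None):
    """Delete (or replace with `replacement`) every adjacent pair of lines
    where the first matches `first` and the next starts with `second`."""
    out = []
    i = 0
    while i < len(lines):
        is_pair = (
            i + 1 < len(lines)
            and re.search(first, lines[i])
            and re.match(second, lines[i + 1])
        )
        if is_pair:
            if replacement is not None:
                out.append(replacement)
            i += 2  # skip both lines of the pair
        else:
            out.append(lines[i])
            i += 1  # slide the window by one line
    return out


page = ["<!DOCTYPE html>", "<h1>Title</h1>", "<p>text</p>"]
print(replace_pairs(page, r"<!DOCTYPE", r"<h1"))
print(replace_pairs(page, r"<!DOCTYPE", r"<h1", "another string"))
```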

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|frankie1@hotmail.com
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|f**1@hotmail.com
092362936|Joe|Jordan|J*****|joe@joesjoinery.com
928|Bob|Horton|Farmer|b*****n@f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s is not equal to the number of characters shown in the first file. The latter two are, so they are added to output.
I have tried to reason out ways to do this in AWK and using Join, but I don't know any way to even start trying to achieve this. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]") # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^") # Convert every ^ to \^
    # Convert every * to .:
    gsub(/\*/,".")
    # Add line start/end anchors
    $0 = "^" $0 "$"
    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which represents any character in regular expressions. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the ., in order for them to be taken literally. In a regular expression, we need to use \. to represent a dot (as opposed to any character).
The first step is to perform these substitutions with sed; the second is to pass every resulting line as a search pattern to grep, which searches file1 for it. The glue that makes this possible is xargs, where {} is a placeholder for a single line of the sed output.
Note:
This is not a general, safe solution you can simply copy and paste: you should watch out for any characters, in your file containing the asterisks, that are considered special in grep regular expressions.
Update:
jhnc extends the escaping to the characters . \ ^ $ and [, which covers almost every kind of email address. He/she then avoids xargs by using -f - to pass the sed output to grep as the list of search expressions:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
This solution is both more general and more efficient.
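The escape-then-convert idea can be sketched in Python, where re.escape takes care of every special character at once. This is a minimal sketch under the assumption that * always stands for exactly one character; wildcard_to_regex, matching_lines and the shortened sample records are invented for the example.

```python
import re


def wildcard_to_regex(pattern):
    """Escape every character so it is literal, except that each *
    becomes '.', i.e. exactly one arbitrary character."""
    parts = ["." if ch == "*" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def matching_lines(full_lines, masked_lines):
    """Return the lines of full_lines that some masked line matches entirely."""
    regexes = [wildcard_to_regex(masked) for masked in masked_lines]
    # fullmatch anchors the pattern at both ends of the line.
    return [line for line in full_lines if any(r.fullmatch(line) for r in regexes)]


# Hypothetical, shortened records in the same pipe-separated layout.
full = ["123456|Jane|Johnson|Pharmacist", "928|Bob|Horton|Farmer"]
masked = ["1***6|Jane|Johnson|Pharmacist", "928|Bob|H****n|Farmer"]
print(matching_lines(full, masked))
```

Using fullmatch rather than a plain search means a masked line cannot match only part of a record.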

Replacing part of LaTeX command using BBedit grep

How can I use the BBedit grep option to replace LaTeX commands like
\textcolor{blue}{Some text}
by the contents of the second set of braces, so
Some text
?
The BBEdit Grep Tutorial gives a lot of information and good examples on using the grep option in BBEdit. What you are trying to achieve is actually a variation of one of the examples. The solution is to enter the following:
Find: \\textcolor\{blue\}\{([^\}]*)\}
Replace: \1
The relevant part is the "Find" section. The first part, \\textcolor\{blue\}\{, searches for the literal text \textcolor{blue}{. The backslashes escape characters that would otherwise be special in a grep pattern.
Next, we have the cryptic sequence ([^\}]*). The parentheses (...) save whatever they match into \1, which you can use in the "Replace" field to insert that content. Inside the square brackets, the leading ^ negates the set, so [^\}] matches any single character that is not a closing brace, and the trailing * repeats that zero or more times. Overall, this expression matches a run of characters that are not closing braces and saves it into \1.
Finally, the expression ends with a \}, i.e. a closing brace, which is the end of what we want to find.
The "Replace" only contains \1, which is everything inside the parentheses (...) in the "Find" field.
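The same find-and-replace pair can be tried outside BBEdit. Here is a small Python sketch of it; TEXTCOLOR and strip_textcolor are names chosen for this example, and like the BBEdit pattern it does not handle nested braces inside the second argument.

```python
import re

# Same structure as the Find pattern: literal \textcolor{blue}{ followed by
# a captured run of non-brace characters and a closing brace.
TEXTCOLOR = re.compile(r"\\textcolor\{blue\}\{([^}]*)\}")


def strip_textcolor(text):
    """Replace each matched command with the contents of its second braces."""
    return TEXTCOLOR.sub(r"\1", text)


print(strip_textcolor(r"Before \textcolor{blue}{Some text} after"))
```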

How can I find files that match a two-line pattern using grep?

I created a test file with the following:
<cert>
</cert>
I'm now trying to find this with grep and the following command, but it takes forever to run:
tr -d '\n' | grep '<cert></cert>' test.test
How can I search quickly for files that contain adjacent lines like these?
So, from the comments, you're trying to get the names of files that contain an empty <cert>..</cert> element. You're using several tools wrong. As @iiSeymour pointed out, tr only reads from standard input, so in your command it sits waiting for input on the terminal, which is why it seems to take forever. If you want to use it on lots of files, you'll need a loop. grep prints matching lines, not filenames, though you could use grep -l to see the filenames instead.
But you're only joining lines because grep works one line at a time; so let's use a better tool. Here's how to search with awk:
awk '/<cert>/ { started=1; }
/<\/cert>/ { if (started) { print FILENAME; nextfile;} }
!/<cert>/ { started = 0; }' file1 file2 *.txt
It checks each line and keeps track of whether the previous line matched <cert>. (!/pattern/ sets the flag back to zero on lines not matching /pattern/.) Call it with all your files (or with a wildcard like *.txt).
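The flag-based check can be written in Python as well. This is a sketch, not a drop-in replacement for the awk command: has_adjacent_pair and files_with_pair are hypothetical names, and unlike the awk version the flag is reset for each file.

```python
import re


def has_adjacent_pair(lines, first=r"<cert>", second=r"</cert>"):
    """Return True if some line matching `first` is immediately followed
    by a line matching `second`."""
    previous_matched = False
    for line in lines:
        if previous_matched and re.search(second, line):
            return True
        # Remember whether this line opens the pair, for the next iteration.
        previous_matched = re.search(first, line) is not None
    return False


def files_with_pair(paths):
    """Return the paths whose contents contain the adjacent pair."""
    found = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            if has_adjacent_pair(handle):
                found.append(path)
    return found


print(has_adjacent_pair(["<cert>\n", "</cert>\n"]))
```

The function stops reading a file at the first hit, which plays the same role as nextfile in the awk command.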
And a friendly suggestion: Next time, try each command separately (you've been stuck on this for hours and you still don't know what grep does?). And have a quick look at the manual for the tools you want to use. Unix tools are usually too complex for simple trial and error.

is there a sophisticated way to grep this file

I have one file. Written in BNF it could be
<line>:== ((<ISBN10>|<ISBN13>)([a-Z/0-9]*)) {1,4})
For example
123456789X/abscd/1234567890123/djfkldsfjj
How can I grep the ISBN10 or ISBN13 ONLY one per line even when in the line are more ISBNs. If there are more ISBNs in the line it should take only the first in line.
When I grep that way
grep -Po "[0-9]{9,13}X{0,1}" file
then I get more lines than the file originally has. (As there could be max 4 ISBNs in line)
I also need the result to have the same number of lines as the file.
Any advice?
Well, if the other answer is wrong to assume that the first ISBN always sits at the start of the line, you could always try Perl.
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    chomp;
    # In list context, /g returns every ISBN on the line; keep only the first.
    my ( $first_isbn, @rest ) = m/(\d{9,13}X{0,1})/g;
    print $., ":", $first_isbn, "\n" if $first_isbn;
}
$. is the line number in perl, and so we print that and the match if there's a match. <> says read and iterate either filenames or STDIN much like grep does. So you could invoke this in a similar way to grep:
perl myscript.pl <filename>
Or:
cat <filename> | ./myscript.pl
This would one-liner-ify as:
perl -lne 'my ( $first_isbn ) = m/(\d{9,13}X{0,1})/g; print $., ":", $first_isbn if $first_isbn;' file
One trivial solution is to include the beginning of the line in your regex:
grep -Po "^[0-9]{9,13}X{0,1}" file
This ensures that matches after the first do not satisfy the regex. It does seem from your BNF that the ISBNs, if present, are guaranteed to be the first characters of the line.
Another way is to use sed:
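sed -n "s/\([0-9]\{9,13\}X\{0,1\}\).*/\1/p" file
This matches your pattern along with the rest of the line, but prints only the pattern; the trailing X is optional here, as in your grep expression. Lines without a match are not printed. You could then use another utility to number the output lines, e.g. by piping the output to nl -nrz -w9.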
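If the result must have exactly as many lines as the input, one option is to emit an empty entry for lines without an ISBN. The Python sketch below shows that idea; ISBN and first_isbn_per_line are names invented here, and the pattern is the same loose one used above (9 to 13 digits with an optional X), not a real ISBN validator.

```python
import re

# Same loose pattern as the grep expression: 9 to 13 digits, optional X.
ISBN = re.compile(r"\d{9,13}X?")


def first_isbn_per_line(lines):
    """Return one entry per input line: the first ISBN-like match,
    or an empty string when the line has none."""
    results = []
    for line in lines:
        match = ISBN.search(line)  # search stops at the first match
        results.append(match.group(0) if match else "")
    return results


rows = [
    "123456789X/abscd/1234567890123/djfkldsfjj",
    "no isbn on this line",
    "abc/1234567890123/x",
]
for number, isbn in enumerate(first_isbn_per_line(rows), start=1):
    print(f"{number}:{isbn}")
```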

Resources