sed refuses to replace pattern -> file (encoding) error?

I'm having a strange problem for which I don't see any explanation. I have a file with the following content:
<me@mycomp:~> head -n 2 myfile
f800671 1 V80068 0.000 2.262 DUMMY heeft één van hen niet gezegd
f800671 1 V80068 2.262 4.090 DUMMY la*v Belgique*v sera*v latine*v
I want to remove all the *v (foreign word) markers as follows:
<me@mycomp:~> head -n 2 myfile | sed 's/*v//'
f800671 1 V80068 0.000 2.262 DUMMY heeft één van hen niet gezegd
f800671 1 V80068 2.262 4.090 DUMMY la*v Belgique*v sera*v latine*v
No luck there. The sed command seems to be correct. It's a basic command, so not much risk there, but I checked it anyway:
echo "Belgique*v" | sed 's/*v//'
Belgique
I assume that there is something wrong with the file, but I honestly can't imagine what. I checked the encoding and it's plain ISO-8859 text.
Any ideas?

Escape the * and use the g flag for global replace:
sed 's/\*v//g'
Edit:
* means 'zero or more of the preceding'. To disable this special meaning, you must escape it as \*.
The g flag replaces all occurrences. If you don't set this flag, only the first occurrence on each line will be replaced.
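Applied to the sample lines from the question (a sketch, assuming the file is exactly as shown above), that gives:
<me@mycomp:~> head -n 2 myfile | sed 's/\*v//g'
f800671 1 V80068 0.000 2.262 DUMMY heeft één van hen niet gezegd
f800671 1 V80068 2.262 4.090 DUMMY la Belgique sera latine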

Related

Grepping twice using result of first Grep in Large file

I am given a list of IDs which I need to trace back to a name in a file.
The ID file contains:
1
2
3
4
5
6
The IDs are contained in a large 2 GB file called result.txt:
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
So I cat the ID file into a variable.
I then use this variable in a loop to grep out the values that link back to the name, using grep and cut -d on results.txt, and store the output in a variable,
so the variable contains ABC CDE FG1.
In the same loop I pass the output of the grep to perform another grep on results.txt, to get the name,
i.e. re-grep the file for ABC CDE FG1.
I do get the answer, but it takes a long time. Is there a more efficient way?
Thanks
Making some assumptions about your requirement... IDs that are not found in the big file will not be shown in the output; the desired output is in the format shown below.
Here are mock input files - f1 for the IDs and f2 for the large file:
[mathguy@localhost test]$ cat f1
1
2
3
4
5
6
[mathguy@localhost test]$ cat f2
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
Proposed solution and output:
[mathguy@localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1 | grep -Ff - f2 | \
> sed -E 's/^.*\*\*id=([[:digit:]]*)\*\*.*,([^,]*)$/\1 \2/'
1 ABC
2 CDE
3 FG1
The hard work here is done by grep -F which might be just fast enough for your needs. There is some prep work and some clean-up work done by sed, but those are both on small datasets.
First we take the IDs from the input file and output strings in the format **id=<number>**. These are presented as the fixed-string patterns to grep -F via the option -f (take the patterns from a file, in this case from stdin, invoked as -; that is, from the output of sed).
After we find the needed lines from the big file, the final sed just extracts the id and the name from each line.
Note: this assumes that each id is only found once in the big file. (Actually the command will work regardless; but if there are duplicate lines for an id, your business users will have to tell you how to handle them. What if you get contradictory names for the same id? Etc.)
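To see the intermediate step, the first sed stage alone turns the IDs into the fixed-string patterns that grep -F then searches for:
[mathguy@localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1
**id=1**
**id=2**
**id=3**
**id=4**
**id=5**
**id=6**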

grep to find words with unique letters

How do I use grep to find words from a dictionary file which contain a given set of letters, with the restriction that each letter occurs once and only once?
E.g. if the letters are abc then the expected output is:
cab
EDIT:
Given a dictionary file (that is, a file containing one word per line, such as /usr/share/dict/words on the Mac OS X operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example, if the set of characters is {a,b,c} then print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
Given a series of letters, for example abc, you can convert each one to a lookahead, like this:
^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$)
You will need Perl-compatible regex support to use lookaheads with grep (e.g. GNU grep's -P flag); the "extended regex" flag -E does not support them.
To create this regex from a string, you could use sed (an exercise for the reader)
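A minimal sketch of the lookahead approach (an assumption on my part: it relies on GNU grep's -P flag, and adds a length check so only three-letter words survive):
grep -P '^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$).{3}$' /usr/share/dict/words
which should print cab, plus any other permutation of abc that happens to be in the dictionary.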
grep -E '^[abc]{3}.$' <Dictionary file> | grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c'
i.e. find all three-letter strings matching the input and pipe these through inverse grep to remove strings with doubled letters.
I'm using the '.' after {3} because my dictionary file is Windows-based, so each line has an extra carriage return or line feed. That's probably not necessary for you.
Below is a Perl solution. Note, you'll need to add more words to the dictionary, and read input into the $input variable. An array of valid words will end up in @results.
#!/usr/bin/env perl
use Data::Dumper;
my $input = "abc";
my @dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);
# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));
my @results;
foreach my $word (@dictionary) {
    # If the size of the input doesn't match the dictionary word, skip to the
    # next word.
    if (length($input) != length($word)) {
        next;
    }
    if ($word =~ /$regexp/) {
        push(@results, $word);
    }
}
print Dumper @results;
The solution I found involves using grep first to extract all n-letter words that contain only letters from the input set - although some letters might appear more than once and some may not appear at all (again I am assuming that the input letters are unique). Then it does a series of 1-letter greps to make sure each letter occurs at least once. Because the words are of length n, this ensures the word contains each letter once and only once. For example, if the input character set is {a,b,c} then the solution would be:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
A simple bash script can be written which creates this grep string and executes it against the word file, using $1 as the input letter set. It might not be the most efficient method of generating the string, but as I am not familiar with sed or awk it does seem to solve my problem. The script I created is:
#!/bin/bash
slen=${#1}
g2="'^[$1]{$slen}\$'"
g3=""
ix1=0
while [ $ix1 -lt $slen ]
do
    g3="$g3 | grep ${1:$ix1:1}"
    ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3
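Saved as, say, findwords.sh (the file name is just for illustration), the script takes the letter set as its argument:
$ bash findwords.sh abc
cab
For abc it ends up eval'ing grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c, which is the same pipeline shown above; the exact output depends on your dictionary file.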

sed add additional column

I want to add an additional column of ones to a tab separated file.
The file looks like this:
#> cat /tmp/myfile
Aal Fisch_und_Fleisch
Aalsuppe Fisch_und_Fleisch
The way I wanted to do it is with sed: match the whole line and print it out together with the new column. However, the additional column ends up in the middle of the lines instead of at the end:
#> cat /tmp/myfile | sed 's#^\(.*\)$#\1\t1#g'
Aal 1isch_und_Fleisch
Aalsuppe1 Fisch_und_Fleisch
When I do a sanity check with some manually created lines it works, though:
#> echo -e "aaaaaaaaaa\taaaaaaaaaaaa\nbbbbbbb\tbbbbbbbb" | sed 's#^\(.*\)$#\1\t1#g'
aaaaaaaaaa aaaaaaaaaaaa 1
bbbbbbb bbbbbbbb 1
I guessed it might be an encoding/line-break issue; here is what file says:
#> file /tmp/myfile
/tmp/myfile: ASCII text, with CRLF line terminators
If it is an encoding/line break issue, how do I go about it?
I'm not able to reproduce your exact issue, but have seen similar things before. Essentially, CRLF line endings can cause strangeness in the visual display, because the CR part, the carriage return, can cause the cursor to move to the beginning of the same line rather than to the beginning of a new line. The easiest fix is probably just to switch to Unix-style line endings.
To switch to Unix-style endings, use one of
dos2unix
tr -d '\r'
As a whole, something like
cat /tmp/myfile | dos2unix | sed 's#^\(.*\)$#\1\t1#g'
If you need to switch back, you could use unix2dos.
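If dos2unix is not available, tr can do the same job inline (a sketch, assuming the only offending characters are the trailing carriage returns):
tr -d '\r' < /tmp/myfile | sed 's#^\(.*\)$#\1\t1#g'
Aal Fisch_und_Fleisch 1
Aalsuppe Fisch_und_Fleisch 1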
This might work for you (GNU sed):
sed 's/$/\t1/' file

Recursively grep results and pipe back

I need to find some matching conditions from a file and recursively find the next conditions in previously matched files. I have something like this:
input.txt
123
22
33
The above terms need to be found in the following files; the challenge is that if 123 is found in, say, 10 files, then 22 should be searched in those 10 files only, and so on...
The files are named like f1,f2,f3,f4.....f1200
so it is like I need to grep -w "123" f* | grep -w "123" | .....
It's not possible to list them manually, so is there an easier way?
You can solve this using an awk script; I've encountered a similar problem and this will work fine:
awk -v n="$(wc -l < input.txt)" '{ if (NR != n) { printf("grep -w %d f*|", $1) } else { printf("grep -w %d f*\n", $1) } }' input.txt | sh
What it does:
it reads input.txt line by line
until it is at the last record, it prints grep -w <id> f*| (note the trailing pipe)
when it reaches the last record, the pipe is omitted
the assembled command string is then sent to the shell for execution
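For the input.txt shown above, the command string handed to sh would be:
grep -w 123 f*|grep -w 22 f*|grep -w 33 f*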
Perhaps taking a meta-programming viewpoint would help. Have grep output a series of grep commands. Or write a little Perl program. Maybe Ruby, if the mood suits.
You can use grep -lw to write the list of file names that matched (note that it stops scanning each file after finding the first match in it).
You capture the list of file names and use that for the next iteration in a loop, along the lines of the sketch below.
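A rough sketch of such a loop (f* and input.txt are the names from the question; this assumes the file names contain no whitespace):
files=$(ls f*)
while read -r term; do
    files=$(grep -lw "$term" $files)
    [ -z "$files" ] && break    # no file contains this term; stop early
done < input.txt
echo "$files"    # the files that contain every term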

Print text between delimiters using sed

Suppose I have op(abc)asdfasdf and I need sed to print abc between the brackets. What would work for me? (Note: I only want the text between first pair of delimiters on a line, and nothing if a particular line of input does not have a pair of brackets.)
$ echo 'op(abc)asdfasdf' | sed 's|[^(]*(\([^)]*\)).*|\1|'
abc
sed -n -e '/^[^(]*(\([^)]*\)).*/s//\1/p'
The pattern looks for lines that start with a list of zero or more characters that are not open parentheses, then an open parenthesis; then start remembering a list of zero or more characters that are not close parentheses, then a close parenthesis, followed by anything. Replace the input with the list you remembered and print it. The -n means 'do not print by default' - any lines of input without the parentheses will not be printed.
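For example, with the -n ... p variant, a line without a pair of brackets is simply not printed:
$ printf 'op(abc)asdfasdf\nno brackets here\n' | sed -n -e '/^[^(]*(\([^)]*\)).*/s//\1/p'
abc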
