I am studying the command bgrep found here.
I run bgrep "fafafa" test_27.6.2015.bin | less -M on the binary data file called test_27.6.2015.bin, but I get
test_27.6.2015.bin: 00005ee4
test_27.6.2015.bin: 0000bd3c
I would expect to get matches containing the term fafafafa.
Two is the correct number of matches.
These hex numbers are presumably offsets into some segment containing fafafafa.
How does bgrep form its search result?
bgrep's search results are formatted this way:
printf("%s: %08llx\n", filename, (unsigned long long)(offset + o - len));
Hence, it displays the filename, and then the hex offset where the search string started, as illustrated here:
$ xxd test_27.6.2015.bin | grep 5ee0
0005ee0: 0c89 0c88 fafa fafa 585e 0000 fe5a 1eda ........X^...Z..
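Note that 0x5ee4 = 0x5ee0 + 4, i.e. the match starts at the fifth byte of the dump line, exactly where the fa bytes begin. To double-check a reported offset you can also dump the bytes starting there; a minimal sketch using dd with byte-sized blocks (count=4 covers the whole fafafafa run):
dd if=test_27.6.2015.bin bs=1 skip=$((0x5ee4)) count=4 2>/dev/null | xxd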
Context:
After running the following command on my server:
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 > analisis.txt
I get a text file with thousands of lines like this example:
loggers1/PCRF1_17868/PCRF12_01_03_2022_00_15_39.log:[C]|01-03-2022:00:18:20:183401|140404464875264|TRACKING: CCR processing Compleated for SubId-5281181XXXXX, REQNO-1, REQTYPE-3,
SId-mscp01.herpgwXX.epc.mncXXX.mccXXX.XXXXX.org;25b8510c;621dbaab;3341100102036XX-27cf0XXX,
RATTYPE-1004, ResCode-5005 |processCCR|ProcessingUnit.cpp|423
(X represents incrementing numbers)
Problem:
The output is filled with unnecessary data. The only portions I need are the MSISDN and the IMSI, comma-separated, one pair per line, like this:
5281181XXXXX,3341100102036XX
Steps I tried
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022| grep -o -P
'(?<=SubId-).*?(?=, REQ)' > analisis1.txt
This gave me the first part of the solution:
5281181XXXXX
However, when I tried to get the second string, located between '334110' and '-':
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 | grep -o -P
'(?<=SubId-).*?(?=, REQ)' | grep -o -P '(?<=334110).*?(?=-)' >
analisis1.txt
it doesn't work.
Any input will be appreciated.
To get 5281181XXXXX or the second string located between '334110' and '-' you can use a pattern like:
\b(?:SubId-|334110)\K[^,\s-]+
The pattern matches:
\b A word boundary to prevent a partial word match
(?: Non-capturing group to match the alternatives as a whole
SubId- Match literally
| Or
334110 Match literally
) Close the non-capturing group
\K Forget what is matched so far
[^,\s-]+ Match 1+ occurrences of any char other than a comma, a whitespace char, or -
See the matches in this regex demo.
That will match:
5281181XXXXX
0102036XX
The command could look like:
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 | grep -oP '\b(?:SubId-|334110)\K[^,\s-]+' > analisis1.txt
I'm trying to sum a column in a medium sized data file (15M rows), but I get the following error:
$> q -Ht 'select sum(value) from datafile.txt'
Error('field larger than field limit (131072)'
My search led to links suggesting raising the default field size limit used by Python's csv module via csv.field_size_limit(); however, after checking with awk I verified that my file has no large fields.
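For reference, such an awk check might look like this sketch, which reports the longest tab-separated field. Keep in mind that awk splits purely on the delimiter and is not quote-aware, which is how a check like this can pass while a quote-aware CSV parser still sees one enormous field:
awk -F'\t' '{ for (i = 1; i <= NF; i++) if (length($i) > max) max = length($i) } END { print "longest field:", max }' datafile.txt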
never forget: cleanse your data before processing
I found that my data file is full of product names containing single and double quotes (single quotes for possessive names, and double quotes to represent inches). This causes the Python CSV parser to treat delimiters inside the quotes as literal characters belonging to one huge field.
Do this:
sed s:\"::g data.txt > tmp ; sed s:\'::g tmp > data.txt
Terrible, terrible single/double quotes in data.
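For what it's worth, the same cleanup can be done in a single pass with tr (a sketch; like the sed version above, it strips every quote character, legitimate or not):
tr -d "\"'" < data.txt > tmp && mv tmp data.txt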
I run a command that produces lots of lines in my terminal; the lines are floats.
I only want certain numbers to be output as a line in my terminal.
I know that I can pipe the results to egrep:
| egrep "(369|433|375|368)"
if I want only certain values to appear. But is it possible to only have lines that have a value within ± 50 of 350 (for example) to appear?
grep matches against string tokens, so you have to either:
figure out the right string match for the number range you want (e.g., for 300-400, you might do something like grep -E '[34]..', with appropriate additional context added to the expression and a number of additional .s equal to your floating-point precision)
convert the number strings to actual numbers in whatever programming language you prefer to use and filter them that way
I'd strongly encourage you to take the second option.
I would go with awk here:
./yourProgram | awk '$1>250 && $1<350'
e.g.
echo -e "12.3\n342.678\n287.99999" | awk '$1>250 && $1<350'
342.678
287.99999
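To express the question's "within ±50 of 350" condition directly, you can parameterize the center and tolerance (a sketch, with c as the center value and d as the allowed deviation):
./yourProgram | awk -v c=350 -v d=50 '$1 >= c-d && $1 <= c+d'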
I have a non-delimited text file and want to parse it to add tabs at specific spots to delimit columns. The columns are sometimes empty or vary in length, which is why I need to add tabs to those specific spots. I had found the answer to this once a couple of years ago on the net using batch, but now can't find it or the code. I already have the following code to replace more than 2 spaces in the file, but this doesn't account for when the columns are empty.
gc $FileToOpen | % { $_ -replace ' +',"`t" } | set-content $FileToSave
So I need to read each line, take fixed-length portions of it (a certain number of characters each), and add a tab after each portion.
Here is a sample of the data file, the top row is the header and the data rows have no blank lines in between them:
MRUN Number Name X Exception Reason Data CDM# Quantity D.O.S
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 08/13/2015 0000000 0 08/13/2015
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 0000000 0 08/13/2015
The second data row is missing Data.
Using Ansgar's answer, here is my code, which does handle empty fields:
gc $FileToOpen |
? { $_ -match '^(.{8})(.{12})(.{20})(.{3})(.{34})(.{62})(.{10})(.{22})(.{10})$' } |
% { "{0}`t{1}`t{2}`t{3}`t{4}`t{5}`t{6}`t{7}`t{8}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim(), $matches[4].Trim(), $matches[5].Trim(), $matches[6].Trim(), $matches[7].Trim(), $matches[8].Trim(), $matches[9].Trim() } |
Set-Content $FileToSave
Thanks for your patience, Ansgar, I know I tried it! I really do appreciate the help!
Since you seem to have an input file with fixed-width columns, you should probably use a regular expression for transforming the input into a tab-delimited format.
Assume the following input file:
A B C
foo 13 22
bar 4 17
baz 142 23
The file has 3 columns. The first column is 6 characters wide, the other two columns 4 characters each.
The transformation could be done with a regular expression like this:
Get-Content 'C:\path\to\input.txt' |
? { $_ -match '^(.{6})(.{4})(.{4})$' } |
% { "{0}`t{1}`t{2}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim() } |
Set-Content 'C:\path\to\output.txt'
The regular expression defines the columns by character count and captures them in groups (parentheses). The groups can then be accessed as the indexes 1 and above of the resulting $matches collection. Trimming removes the leading/trailing whitespace. The format operator (-f) then inserts the trimmed values into the tab-separated format string.
If the last column has a variable width (because its values are aligned to the left and don't have trailing spaces) you may need to change the regular expression to ^(.{6})(.{4})(.{0,4})$ to take care of that. The quantifier {0,4} means up to four occurrences of the preceding expression (note that .NET's regex engine requires the explicit lower bound; {,4} would be taken literally).
How can I use grep to find words in a dictionary file that contain a given set of letters, with the restriction that each letter occurs once and only once?
E.g. if the letters are abc then the expected output is:
cab
EDIT:
Given a dictionary file (that is, a file containing one word per line, such as /usr/share/dict/words on the Mac OS X operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example, if the set of characters is {a,b,c} then print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
Given a series of letters, for example abc, you can convert each one to a lookahead, like this:
^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$)
Each lookahead asserts that its letter occurs exactly once. Lookaheads are not part of POSIX extended regular expressions, so you'll need grep's Perl-compatible flag -P (rather than -E) to use this regex.
To create this regex from a string, you could use sed (an exercise for the reader)
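A sketch of that exercise, assuming GNU sed and a grep with PCRE support (-P):
letters=abc
# turn each letter X into an anchored lookahead (?=[^X]*X[^X]*$)
re="^$(printf '%s\n' "$letters" | sed -E 's/(.)/(?=[^\1]*\1[^\1]*$)/g')"
grep -P "$re" /usr/share/dict/words
As written this still allows other letters to appear in a matching word; to keep only permutations of the input set, additionally pipe through grep -xE "[$letters]{${#letters}}".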
grep -E '^[abc]{3}.$' <Dictionary file> | grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c'
i.e. find all three-letter strings matching the input set and pipe them through an inverse grep to remove strings with doubled letters.
I'm using the '.' after {3} because my dictionary file is Windows-based, so each line has an extra carriage return. If yours isn't, the '.' is probably not necessary.
Below is a Perl solution. Note, you'll need to add more words to the dictionary and read input into the $input variable. An array of valid words will end up in @results.
#!/usr/bin/env perl
use Data::Dumper;
my $input = "abc";
my @dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);
# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));
my @results;
foreach my $word (@dictionary) {
    # If the size of the input doesn't match the dictionary word, skip to the
    # next word.
    if (length($input) != length($word)) {
        next;
    }
    if ($word =~ /$regexp/) {
        push(@results, $word);
    }
}
print Dumper @results;
The solution I found involves using grep first to extract all n-letter words that contain only letters from the input set (some letters might appear more than once, and some might not appear at all; again I am assuming that the input letters are unique). Then it does a series of 1-letter greps to make sure each letter occurs at least once. Because the words are of length n, this ensures the word contains each letter once and only once. For example, if the input character set is {a,b,c} then the solution would be:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
A simple bash script can be written which creates this grep string and executes it against the word file, using $1 as the input letter set. It might not be the most efficient method of generating the string, but as I am not familiar with sed or awk it does seem to solve my problem. The script I created is:
#!/bin/bash
# Length of the input letter set
slen=${#1}
# First grep: words of the right length built only from the input letters
g2="'^[$1]{$slen}\$'"
# Append one "| grep <letter>" per input letter
g3=""
ix1=0
while [ $ix1 -lt $slen ]
do
    g3="$g3 | grep ${1:$ix1:1}"
    ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3
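Saved as, say, findwords.sh (the name is only for illustration) and run with the letter set as its single argument, the script reproduces the pipeline above:
./findwords.sh abc
cab
(cab being the expected output from the question, assuming the dictionary contains it.)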