I have a large set of files, which I want to search in.
I want to find all words - and only the words - that are surrounded by X. I do this with lookbehind and lookahead:
grep -rhoP "(?<=X)[a-zA-Z0-9_]*(?=X)" .
Now I want to know whether any words from the previous result are also surrounded by Y, this time with full context. I can do that for one specific word, abc:
grep -rP "(?<=Y)abc(?=Y)" .
But how can I pipe the two commands together?
Update
I have a large set of C-files.
All our api functions are first declared in one of our many .h files, and used in one of our many .c files and inlined .h files.
Some functions may not be accessed directly, but only through a function pointer; enforcing that, however, proved to be a great hindrance during development. For this purpose, we made some macros (FP and CALL) to be able to easily turn this requirement on and off.
FP(DoThis)
void DoThis(void);
With the requirement ON, FP defines a function pointer - in this case for DoThis - which can be used later on by using CALL(DoThis)();.
With the requirement OFF, FP is expanded to nothing, and CALL(DoThis) is expanded to just DoThis.
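The real macro definitions are not shown here, but based on the description above they might look roughly like this (a hypothetical reconstruction for void(void) functions; REQUIRE_FP and the _fp suffix are invented names):
/* Hypothetical reconstruction of the FP/CALL macros. */
#ifdef REQUIRE_FP
#define FP(fn)   void (*fn##_fp)(void);   /* ON: declares a function pointer for fn */
#define CALL(fn) (fn##_fp)                /* CALL(fn)() calls through that pointer  */
#else
#define FP(fn)                            /* OFF: expands to nothing                */
#define CALL(fn) (fn)                     /* OFF: CALL(fn)() calls fn directly      */
#endif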
A list of all functions for which a function pointer is created this way can be fetched by:
grep -rhoP "(?<=FP\()[a-zA-Z0-9_]*(?=\))" .
A list of all locations where the function pointer for function DoThis is used can be fetched by:
grep -rP "(?<=CALL\()DoThis(?=\))" .
Now I want a list of all locations where the function pointer of any function created using FP is used via CALL.
So somehow, I want to chain the two greps together, so that each result from the first grep is fed to the second, and the final results are grouped together.
Perhaps this, good sir:
grep -rho 'XY[a-zA-Z0-9_]*YX' . | sed 's/[XY]//g'
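For the FP/CALL case from the update, one way to actually chain the two greps is to turn the first grep's output into a single alternation for the second. A sketch, assuming GNU grep; it is safe here because C identifiers contain no regex metacharacters:
pattern=$(grep -rhoP "(?<=FP\()[a-zA-Z0-9_]*(?=\))" . | sort -u | paste -sd'|' -)
grep -rP "(?<=CALL\()($pattern)(?=\))" .
The first line collects every FP-registered name and joins them with |; the second then finds every CALL(...) site for any of those names in one pass.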
I am working with nom version 6.1.2 and I am trying to parse strings like
A 2 1 2.
At the moment I would be happy to at least differentiate between inputs that fit the requirements and inputs that don't. (After that, I would like to change the output to a tuple that has the "A" as its first value and a vector of the u16 numbers as its second.)
The string always has to start with a capital A, followed by at least one space, and after that a number. Furthermore, there can be as many additional spaces and numbers as you want. It is just important to end with a number and not with a space. All numbers will be within the range of u16. I already wrote the following function:
extern crate nom;
use nom::IResult;
use nom::sequence::{preceded, pair};
use nom::character::streaming::{char, space1};
use nom::combinator::recognize;
use nom::multi::many1;
use nom::character::complete::digit1;

pub fn parse_and(line: &str) -> IResult<&str, &str> {
    preceded(
        char('A'),
        recognize(
            many1(
                pair(
                    space1,
                    digit1
                )
            )
        )
    )(line)
}
Also, I want to mention that there are answers to problems like this which use CompleteStr, but that isn't an option anymore because it was removed some time ago.
People explained that the reason for this behavior is that nom doesn't know when the slice of a string ends, which is why I get parse_and: Err(Incomplete(Size(1))) as the answer for the provided example input.
It turns out that one of the use declarations caused the problem. The documentation (in a paragraph far enough down the page that I had overlooked it) says:
"
Streaming / Complete
Some of nom's modules have streaming or complete submodules. They hold different variants of the same combinators.
A streaming parser assumes that we might not have all of the input data. This can happen with some network protocol or large file parsers, where the input buffer can be full and need to be resized or refilled.
A complete parser assumes that we already have all of the input data. This will be the common case with small files that can be read entirely to memory.
"
Therefore, the solution to my problem is to use nom::character::complete::{char, space1}; instead of nom::character::streaming::{char, space1}; (the third use declaration in the code above). That worked for me :)
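For reference, the working version then looks like this (the only changes from the code above are the complete import path and the added sample call showing the expected result):
use nom::IResult;
use nom::character::complete::{char, digit1, space1};
use nom::combinator::recognize;
use nom::multi::many1;
use nom::sequence::{pair, preceded};

pub fn parse_and(line: &str) -> IResult<&str, &str> {
    preceded(
        char('A'),
        recognize(many1(pair(space1, digit1))),
    )(line)
}

fn main() {
    // The complete variants assume all input is present, so this now
    // returns Ok(("", " 2 1 2")) instead of Err(Incomplete(Size(1))).
    println!("{:?}", parse_and("A 2 1 2"));
}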
While trying to find out how Forth manages the dictionary (and memory in general), I came across this page. Being familiar with C, I have no problem with the concept of pointers, and I assume I understood everything correctly. However, at the end of the page are several exercises, and here I noticed something strange.
Exercise 9.4, assuming DATE has been defined as a VARIABLE, asks what the difference is between
DATE .
and
' DATE .
and exercise 9.5 does the same using the user variable BASE.
According to the supplied answers, both phrases will give the same result (also with BASE). Trying this with Win32Forth, however, gives results that differ by 4 bytes (1 cell). Here is what I did:
here . 4494668 ok
variable x ok
x . 4494672 ok
' x . 4494668 ok
Creating another variable gives a similar result:
variable y ok
y . 4494680 ok
' y . 4494676 ok
Thus, it looks like each variable gets not just one cell (for the value), but two cells. The variable itself points to where the actual value is stored, and retrieving the contents at the execution token (using ' x ?) gives 0040101F for both variables.
For exercise 9.5, my results are:
base . 195F90 ok
' base . 40B418 ok
These are not even close to each other. The answer for this exercise does however mention that the results can depend on how BASE is defined.
Returning to normal variables, my main question thus is: why are two cells reserved per variable?
Additionally:
Since only one cell contains the actual value, what do the contents of the other cell mean?
Is this specific to Win32Forth? What happens in other implementations?
Is this different for run-time and compile-time variables?
How do answers to the above questions apply to user variables (such as BASE)?
EDIT1: Okay, so Forth also stores a header for each variable, and using ' gives you the address of this header. From my tests I would then conclude that the header uses just one cell, which does not correspond to all the information the header should contain. Secondly, according to the exercise, retrieving the address of a variable should give the same result in both cases, which appears to contradict the existence of a header altogether.
My gut feeling is that this is all very implementation-specific. If so, what happens in Win32Forth, and what should happen according to the exercise?
This is roughly what a definition looks like in the dictionary, using a traditional memory layout. Note that implementations may well diverge from this, sometimes a lot. In particular, the order of the fields may differ.
Link to previous word (one cell)
Flags (a few bits)
Name length (one byte, less a few bits)
Name string (variable)
Code field (one cell)
Parameter field (variable)
Everything except the code and parameter fields is considered the header. The code field usually comes right before the parameter field.
Ticking a word with ' gives you an XT, or execution token. This can be anything the implementation fancies, but in many cases it's the address of the code field.
Executing a word created with CREATE or VARIABLE gives you the address of the parameter field.
This is probably why, in Win32Forth, the two addresses differ by 4 bytes, or one cell. I don't know why the answers to the exercises state there should be no difference.
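In a Standard Forth system you can see the relationship directly with >BODY, which converts the execution token of a word defined via CREATE or VARIABLE into its parameter-field address:
variable x
x .           \ address of the parameter field
' x .         \ execution token (here: the code field, one cell lower)
' x >body .   \ should print the same address as plain x .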
Assuming BASE is a user variable, it probably works like this: Every task has its own user area in which user variables are allocated. All user variables know their specific offset inside this area. Ticking BASE gives you its XT, which is the same for all tasks. Executing BASE computes an address by adding its offset to the base of the user area.
I am working with a file that contains thousands of proteins in an organism. I have code that will allow me to go through each individual protein one by one and determine the frequency of amino acids in each. Would there be a way to alter my current code to allow me to determine all of the frequencies of amino acids at once?
IIUC, you're reinventing the wheel a bit: BioPython contains utilities for handling files in various formats (FASTA in your case), and simple analysis. For your example, I'd use something like this:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis
for seq_record in SeqIO.parse("protein_x.txt", "fasta"):
    analysis = ProteinAnalysis(str(seq_record.seq))
    print(seq_record.id, analysis.get_amino_acids_percent())
The answer is yes, but without showing us your code we can't give much feedback. Essentially you want your amino-acid counts to persist between FASTA records, rather than being reset for each one. If you want probabilities, total the counts up outside the loop and divide through only at the end. This is trivially accomplished with something like a "counting dictionary" in Python, or by incrementing a value in a hash/dict. There are also very likely plenty of command-line tools that do this for you, since all you want is character-level counts for any line not starting with a '>' in the file.
For example, for a file that small:
grep -v '^>' yourdata.fa | perl -pe 's/(.)/$1\n/g' | sort | uniq -c
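And the counting-dictionary approach from the first paragraph might look like this in Python (a sketch using collections.Counter; the file name is taken from the other answer and may differ from yours):
from collections import Counter
from Bio import SeqIO

totals = Counter()
for seq_record in SeqIO.parse("protein_x.txt", "fasta"):
    totals.update(str(seq_record.seq))  # add this record's residues to the running totals

grand_total = sum(totals.values())
for aa, count in sorted(totals.items()):
    print(aa, count, count / grand_total)  # absolute count and overall frequency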
I would like to use grep to filter a file B according to a list of elements in A (one element per line). My goal is to keep those lines of B that appear in list A. Both files are ordered.
I am using something like this:
grep -f A B
The trouble is that file B is several million lines long and file A contains more than a million elements.
Is this the fastest way to go or are there more efficient options out there?
Thanks
fgrep or grep -F would be faster if you're searching for fixed strings instead of regexps (that's what the F stands for: fixed strings, which bypasses the regex engine). If both files consist of whole lines and, as you say, are ordered, you might do even better running comm or diff on them.
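For example (comm needs both inputs sorted, which the question says they are):
comm -12 A B     # print only the lines common to both files
grep -Fxf A B    # fixed strings (-F), whole-line match (-x), patterns read from A (-f)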
I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not developed in-house (entirely) we cannot simply look for occurrences in messages.properties and be done. We must go through JSP's, Java code, and xml.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively:
1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
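In shell terms, one pass of that loop might look like this (the file names are placeholders, and SOME_PATTERN stands for your real, escaped pattern):
grep -ir SOME_PATTERN . > working.txt    # [1] every candidate match, as file:text
grep import working.txt > tmp.txt        # [2] probable false positives
grep -v import working.txt > clean.txt   # [3] remove them from the working copy
mv clean.txt working.txt
# [4] skim tmp.txt by hand; append any real matches back into working.txt
# [5] repeat [2]-[4] with '//', '/\*', and your other filters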
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try sed's s/regexp/replacement/ command.
You can also try awk. Its -F option sets the field separator, so if your files use ';' as a delimiter you can split fields on it.
The best solution, however, may be a simple script in Perl or Python.
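For instance, to run a replacement across the JSP, Java, and XML sources in one pass (pattern and replacement are placeholders; -i assumes GNU sed):
find . -type f \( -name '*.java' -o -name '*.jsp' -o -name '*.xml' \) \
    -exec sed -i 's/SOME_PATTERN/REPLACEMENT/g' {} +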