Tools for command line file parsing in cygwin

I have to deal with text files in a motley selection of formats. Here's an example (Columns A and B are tab delimited):
A B
a Name1=Val1, Name2=Val2, Name3=Val3
b Name1=Val4, Name3=Val5
c Name1=Val6, Name2=Val7, Name3=Val8
The files may or may not have headers, may use mixed delimiting schemes, may have columns with name/value pairs as above, and so on.
I often have an ad-hoc need to extract data from such files in various ways. For example, from the above data I might want the value associated with Name2 wherever it is present, i.e.
A B
a Val2
c Val7
What tools/techniques are there for performing such manipulations as one line commands, using the above as an example but extensible to other cases?

I don't like sed too much, but it works for such things:
var="Name2";sed -n "1p;s/\([^ ]*\) .*$var=\([^ ,]*\).*/\1 \2/p" < filename
Gives you:
A B
a Val2
c Val7

You have all the basic bash shell commands, for example grep, cut, sed and awk at your disposal. You can also use Perl or Ruby for more complex things.

From what I've seen I'd start with Awk for this sort of thing and then if you need something more complex, I'd progress to Python.
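For instance, a minimal awk sketch for the Name2 example above (assuming tab-separated columns and a header row, as in the sample data):
awk -F'\t' 'NR==1 { print; next }
    match($2, /Name2=[^,]*/) { print $1 "\t" substr($2, RSTART+6, RLENGTH-6) }' filename
The first rule passes the header through; the second fires only on rows where Name2 is present, and the +6/-6 offsets skip over the literal "Name2=".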

I would use sed:
# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive

Since you have cygwin, I'd go with Perl. It's the easiest to learn (check out the O'Reilly book Learning Perl) and widely applicable.

I would use Perl. Write a small module (or more than one) for dealing with the different formats. You could then run Perl one-liners using that library. An example of what it might look like:
perl -e 'use Parser;' -e 'parser("in.input").get("Name2");'
Don't quote me on the syntax, but that's the general idea: abstract the task at hand so you can think in terms of what you need to do, not how you need to do it. Ruby would be another option; it tends to have a cleaner syntax, but either language would work.
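The Parser module above is hypothetical, but a self-contained Perl one-liner in the same spirit (a sketch, assuming tab-separated columns as in the sample data) could look like:
perl -F'\t' -lane 'if ($. == 1) { print } elsif ($F[1] =~ /Name2=([^,]*)/) { print "$F[0]\t$1" }' in.input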

Related

Silversearcher/ack vs find,grep

Currently when I have to search for complex patterns in code, I typically use a combination of find and grep in the form:
find / \( -type f -regextype posix-extended -regex '.*python3.*py' \) -exec grep -EliI '\b__[[:alnum:]]*_\b' {} \; -exec cat {} \; > ~/python.py
While this looks like a lot to type, it's actually quite short if you use zsh. I just type f (the first character) and go directly to this command from my command history. Furthermore, the regexes in find/grep are standardized and tested, so there are no surprises or missed searches.
ripgrep/ag etc. are newer software, which might not be supported a few years down the line when the original maintainer loses interest.
Is there any plan to include the .gitignore rules or other optimizations from ag/ack/rg in grep or some other version of grep? Is there any reason why these optimizations were not, or are not going to be, included in grep?
For those of you who switched over: did you find it worthwhile to switch to rg/ag/ack, especially given that there is a learning curve for these tools as well?
Use ag.
The key part of your example: ag -G '.*python3.*py' '\b__[[:alnum:]]*_\b'
Ag is here to stay and uses Perl-compatible regexes (PCRE), which are far more flexible than POSIX basic or extended regular expressions. grep -P uses the PCRE engine too, so that is just akin to using ag without some of the latter's more modern features. Likewise, ack is like ag but slower (though admittedly it has a few more bells and whistles). Ag's file-regex filtering (the -G flag exemplified above) and built-in file-type filters are very handy (e.g. --python). The recently renamed .ignore file also provides finer tuning.
Since most modern scripting languages (Perl, Python, Ruby) use PCRE or regex engines with similar features, and many compiled languages (Java, C++) have near-equivalent feature sets (e.g. java.util.regex, Boost.Regex), I consider this the main reason to switch. Moreover, it is satisfying to unify your programming and command-line skill sets.
From my point of view, ripgrep is ag's main contender because it is faster and has an easy way to add file types. That said, it doesn't have as flexible a regex engine: no backreferences nor look-arounds. With this in mind, I recommend ag.
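For comparison, a rough ripgrep equivalent of the ag command above would be something like (a sketch; rg skips binary files and honors .gitignore rules by default):
rg -li --glob '*python3*py' '\b__[[:alnum:]]*_\b'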

dash equivalent to bash's curly bracket syntax?

In bash, php/{composer,sismo} expands to php/composer php/sismo. Is there any way to do this with /bin/sh (which I believe is dash), the system shell? I'm writing git hooks and would like to stay away from bash as long as I can.
You can use printf.
% printf 'str1%s\t' 'str2' 'str3' 'str4'
str1str2 str1str3 str1str4
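Applied to the question's example, that becomes:
% printf 'php/%s ' composer sismo
php/composer php/sismo
Note that the separator is also emitted after the last word.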
There doesn't seem to be a way. You will have to use loops to generate these names, perhaps in a function. Or use variables to substitute common parts, maybe with "set -u" to prevent typos.
I see that you prefer dash for performance reasons, however you don't seem to provide any numbers to substantiate your decision. I'd suggest you measure actual performance difference and reevaluate. You might be falling for premature optimization, as well. Consider how much implementation and debugging time you'll save by using Bash vs. possible performance drop.
I really like the printf solution provided by @mikeserv, but I thought I'd provide an example using a loop.
The below would probably be most useful if you wish to execute one command for each expanded string, rather than provide both strings as args to the same command.
for X in composer sismo; do
echo "php/$X" # replace 'echo' with your command
done
You could, however, rewrite it as
ARGS="$(for X in composer sismo; do echo "php/$X"; done)"
echo $ARGS # replace 'echo' with your command
Note that $ARGS is unquoted in the above command, and be aware that this means its content is word-split (i.e. if any of your original strings contain spaces, it will break).
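If the expanded strings might contain whitespace, a more robust POSIX sh sketch is to build the positional parameters instead of a single variable:
set --                     # clear the positional parameters
for X in composer sismo; do
    set -- "$@" "php/$X"   # append one word, safely quoted
done
echo "$@"                  # replace 'echo' with your command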

POSIX sh EBNF grammar

Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
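To illustrate the mechanical conversion mentioned above: a Yacc rule from the POSIX grammar along the lines of (quoted from memory, so check the specification for the exact wording)
complete_command : list separator_op
                 | list
                 ;
would come out in EBNF roughly as
complete_command = list, [separator_op];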
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
I've made multiple attempts at writing my own full-blown Bash interpreter over the past year, and at some point I also reached the same book appendix referenced in the marked answer (#2), but it's not completely correct/up to date (for example, it doesn't define production rules using the 'coproc' reserved word, and it has a duplicate production-rule definition for a redirection using '<&'; there might be more problems, but those are the ones I've noticed).
The best way I've found was to go to http://ftp.gnu.org/gnu/bash/
Download the current bash version's sources
Open the parse.y file (which in this case is the Yacc file that basically contains all the parsing logic bash uses) and just copy-paste the lines between the '%%' markers into your favorite text editor; those define the grammar's production rules
Then, using a little bit of regex (which I'm terrible at, btw), we can delete the extra code logic in between '{...}' to make the grammar look more BNF-like.
The regex I used was:
(\{(\s+.*?)+\})\s+([;|])
It matches anything non-greedily (.*?), including spaces and newlines (\s+), that sits between curly braces, and specifically the last closing brace before a ; or | character. Then I just replaced the matched strings with \3 (i.e. the contents of the third capturing group, being either ; or |).
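For reference, the same extraction could be scripted instead of done in an editor. A sketch (untested against the current bash sources) using that regex:
# print the rules section between the two '%%' markers, then strip the '{...}' actions
sed -n '/^%%/,/^%%/p' parse.y | perl -0777 -pe 's/(\{(\s+.*?)+\})\s+([;|])/$3/g'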
Here's the grammar definition that I managed to extract at the time of posting: https://pastebin.com/qpsK4TF6
I'd expect that sh, csh, ash, bash, would contain parsers. GNU versions of these are open source; you might just go check there.

Parsing a file following a pattern

I need to parse a file with a mediawiki syntax (tables).
I know sed or awk could do it, but I'm no expert with either.
I need to find the following pattern :
beginning_of_line| [[text]] || random_stuff_until_newline
There might be spaces (or not) between the pipes and the brackets. And I need to output the text.
Any solutions for me?
Thx
Parsing text like that is like parsing XML or HTML. Regexes aren't really well suited for that type of document. You should try to find a Python or Perl module that is suited for the job.
However, here is a sed command that will work in the simple case you provided as an example.
sed 's/^[^|]*|[[:space:]]*\[\[\([^]]\+\)\]\].*/\1/' inputfile
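For example, against a line shaped like the question's pattern (GNU sed, since \+ is a GNU extension):
echo 'foo| [[text]] || random stuff' | sed 's/^[^|]*|[[:space:]]*\[\[\([^]]\+\)\]\].*/\1/'
prints just "text". Lines that don't match are passed through unchanged; add -n and a trailing p flag to print only matching lines.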
I would look for a Mediawiki parser. It must exist somewhere.
Failing that, if you have a grammar for mediawiki, you can generate a parser using ANTLR or similar, depending on what kind of grammar it is.
If you don't have a grammar, or don't want to do that because of the learning curve, then you need some reliable way to distinguish between what you're calling "text" and what you're calling random stuff. Are the pipes guaranteed to be there? If so, in Java you can just do String.split() using the pipes as the argument to split.
Is this what you mean?
This might work for you (GNU sed):
sed 's/^[^|]*|\s*\[\[\([^]]*\(][^]]*\)*\)]]\s*||.*/\1/;t;d' file
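For example, this also copes with a stray ] inside the double brackets:
echo 'a| [[te]xt]] || stuff' | sed 's/^[^|]*|\s*\[\[\([^]]*\(][^]]*\)*\)]]\s*||.*/\1/;t;d'
prints "te]xt", and the final d deletes any line that doesn't match.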

GREP - finding all occurrences of a string

I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not developed in-house (entirely) we cannot simply look for occurrences in messages.properties and be done. We must go through JSP's, Java code, and xml.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively:
1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
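A minimal shell sketch of steps [1]-[3], with the import filter as filter X (filenames are placeholders; the pattern allows for the filename: prefix that grep -r adds):
grep -ir 'SOME_PATTERN' . > /tmp/candidates.txt                       # [1] all candidate matches
grep ':[[:space:]]*import' /tmp/candidates.txt > /tmp/suspects.txt    # [2] probable false positives
grep -v ':[[:space:]]*import' /tmp/candidates.txt > /tmp/working.txt  # [3] working file without them
Then review /tmp/suspects.txt by hand and append any real matches back to /tmp/working.txt (step [4]).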
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try the s/regexp/replacement/ command with sed.
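For instance, with GNU sed (a sketch; -i.bak edits each file in place and keeps a .bak backup, and the pattern and filename are placeholders):
sed -i.bak 's/SOME_PATTERN/REPLACEMENT/g' path/to/file.jsp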
You can also try the awk command. It has a -F option for field separation; you can use it with ';' to split the lines of your files on semicolons.
The best solution, however, would be a simple script in Perl or Python.
