Currently when I have to search for complex patterns in code, I typically use a combination of find and grep in the form:
find / \( -type f -regextype posix-extended -regex '.*python3.*py' \) -exec grep -EliI '\b__[[:alnum:]]*_\b' {} \; -exec cat {} \; > ~/python.py
While this looks like a lot to type, it's actually quite short if you use zsh. I just type f (the first character) and go directly to this command from my command history. Further, the regex in find/grep is standardized and tested, so there are no surprises or missed matches.
ripgrep, ag, etc. are new software, which might not be supported a few years down the line when the original maintainer loses interest.
Is there any plan to include .gitignore rules or the optimizations from ag/ack/rg in grep or another version of grep? Is there any reason why these optimizations were not, or are not going to be, included in grep?
For those of you who switched over: Did you guys find it worthwhile to switch over to rg/ag/ack especially because there is going to be a learning curve for these tools as well?
Use ag.
The key part of your example: ag -G '.*python3.*py' '\b__[[:alnum:]]*_\b'
Ag is here to stay and uses Perl regex (PCRE), which is far more flexible than POSIX basic or extended regular expressions. Grep -P uses the Perl regex engine, so it is akin to using ag, without some of the latter's more modern features. Likewise, ack is like ag but slower (though admittedly it has a few more bells and whistles). Ag's file regex filtering (the -G flag as exemplified above) and built-in file type filters (e.g. --python) are very handy. The recently renamed .ignore file also provides finer tuning.
Most modern scripting languages either use PCRE or handle regexes with similar features (Perl, Python, Ruby), and many other languages have near-equivalent feature sets (e.g. java.util.regex for Java, Boost.Regex for C++). I consider this the main reason to switch. Moreover, it is satisfying to unify your programming with your command-line skill set.
From my point of view, ripgrep is ag's main contender because it is faster and has an easy way to add file types. That said, it doesn't have as flexible a regex engine: no backreferences and no look-arounds. With this in mind, I recommend Ag.
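To make the two features named above concrete, here is a minimal sketch of a backreference and a pair of look-arounds. It uses Python's built-in re module purely as a stand-in for a Perl-style engine; the pattern names and the sample strings are made up for illustration and are not taken from ag or ripgrep.

```python
import re

# Backreference: \1 must repeat exactly what the first group captured,
# which makes it easy to spot accidentally doubled words.
doubled_word = re.compile(r"\b(\w+)\s+\1\b")

# Look-behind and look-ahead: match a name only when it is wrapped in
# double underscores, without including the underscores in the match.
dunder_name = re.compile(r"(?<=__)[A-Za-z0-9]+(?=__)")

doubled = doubled_word.search("this is is a typo").group(1)
names = dunder_name.findall("def __init__(self): __repr__")

print(doubled)  # is
print(names)    # ['init', 'repr']
```

Neither pattern can be written for an engine that lacks backreferences and look-arounds, which is the trade-off described above.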
Setup: I am contemplating switching from writing large (~20GB) data files with csv to feather format, since I have plenty of storage space and the extra speed is more important. One thing I like about csv files is that at the command line, I can do a quick
wc -l filename
to get a row count, even for large data files. Also, I can quickly search for a simple string with
grep search_string filename
The head and tail commands are also very useful at times. These are straight-forward and work well with csv files, but not with feather. If I try any of them on a feather file, I do not get results that make sense or are helpful.
While I certainly can read a feather file into, say, Python or R, and analyze it then, the hassle of writing out the path and importing the necessary libraries is something I'd rather dispense with.
My Question: Does there exist either a cross-platform (at least Mac and Linux) feather file reader I can use to quickly read in and view feather data (this would be in tabular format) with features corresponding to row count, grep, head, and tail? Or are there simple CLI utilities I could install that would enable me to do the equivalent of line count, grep, head, and tail?
I've seen this question, but it is very incomplete relative to my question.
To work with feather files you must use Python or R programs.
To work with csv you can use any of the common text manipulation utilities available to Linux/Unix users.
Linux text manipulation tools
reader:     less
search:     grep
converters: awk, sed
extractor:  split
editor:     vim
Each of the above tools requires some learning and practice.
Suggestion
If you have programming skill, create a program to manipulate your feather file.
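As a rough sketch of what such a program could look like, the helpers below mimic wc -l, head, tail, and grep on a table held as a list of dict rows. Everything here is an assumption for illustration: it presumes the pyarrow package is installed, the function names are invented, and loading the whole file into memory is a simplification that may not suit a ~20GB file.

```python
def load_rows(path):
    """Read a feather file into a list of dict rows (assumes pyarrow is installed)."""
    import pyarrow.feather as feather  # imported lazily so the helpers below work without it
    return feather.read_table(path).to_pylist()

def row_count(rows):
    """Equivalent of `wc -l` (without a header line)."""
    return len(rows)

def head(rows, n=10):
    """First n rows."""
    return rows[:n]

def tail(rows, n=10):
    """Last n rows."""
    return rows[-n:] if n > 0 else []

def grep(rows, needle):
    """Rows in which any value, rendered as text, contains the search string."""
    return [row for row in rows if any(needle in str(value) for value in row.values())]

# Hypothetical in-memory data standing in for load_rows("data.feather").
rows = [
    {"id": 1, "city": "Des Moines"},
    {"id": 2, "city": "Helena"},
    {"id": 3, "city": "Lincoln"},
]
print(row_count(rows), head(rows, 1), tail(rows, 1), grep(rows, "Hel"))
```

Wrapping these helpers in a small script on your PATH would avoid retyping imports each time.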
Some command options are with one dash e.g. ruby -c (check syntax) and ruby --copyright (print copyright). Is there any pattern to this?
These are known as short and long options. Which name/format a developer uses for options of his program is totally up to him.
However, there are some widespread conventions. Like -v/--version for printing version number, -h/--help for printing usage instructions, etc.
Sadly, most commandline tools on OSX seem not to conform to -v/-h.
Good CLI (command-line interface) design dictates that options of a program that are most useful should have two formats, short and long. You use short format in your everyday life (because it's faster to type).
ps aux | grep ruby
Long ones are for scripts that you write and rarely touch (they're easier to read and understand).
mongod --logpath /path/to/logs --dbpath /path/to/db --fork --smallfiles
Many less-used options may have only the long version (because, you know, there are only 26 letters in the Latin alphabet).
Many Rails-related commands follow this pattern. One dash is an abbreviation for a two-dash option, e.g. rspec -o FILE is a synonym for rspec --out FILE.
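The short/long convention is easy to see from the implementing side. Here is a minimal sketch using Python's standard argparse module; the tool name "report" and its options are hypothetical, chosen only to show a short/long pair, a flag, and a long-only option.

```python
import argparse

def build_parser():
    # "report" is a made-up tool name used only for this illustration.
    parser = argparse.ArgumentParser(prog="report")
    # Frequently used options get both a short and a long spelling.
    parser.add_argument("-o", "--out", metavar="FILE", default="-",
                        help="write output to FILE")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra detail")
    # A rarely used option may exist only in its long form.
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be done without doing it")
    return parser

short_form = build_parser().parse_args(["-v", "-o", "result.txt"])
long_form = build_parser().parse_args(["--verbose", "--out", "result.txt"])
print(short_form == long_form)  # True: both spellings set the same values
```

argparse also adds -h/--help automatically, which matches the widespread convention mentioned above.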
Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
I've made multiple attempts at writing my own full-blown Bash interpreter over the past year, and at some point I also reached the same book appendix referenced in the marked answer (#2). It is not completely correct or up to date: for example, it doesn't define production rules using the 'coproc' reserved keyword, and it has a duplicate production rule definition for a redirection using '<&'. There might be more problems, but those are the ones I've noticed.
The best way I've found was to go to http://ftp.gnu.org/gnu/bash/
Download the current bash version's sources
Open the parse.y file (which in this case is the YACC file that contains essentially all the parsing logic bash uses) and copy the lines between the '%%' markers into your favorite text editor; those define the grammar's production rules
Then, using a little bit of regex (which I'm terrible at, by the way), delete the extra code logic between '{...}' to make the grammar look more BNF-like.
The regex I used was:
(\{(\s+.*?)+\})\s+([;|])
It non-greedily matches any text (.*?), including spaces and newlines (\s+), between curly braces, ending at the last closing brace before a ; or | character. I then replaced each match with \3 (i.e. the contents of the third capturing group, which is either ; or |).
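For anyone who prefers to script the replacement rather than run it in an editor, here is a small sketch that applies the same regex with Python's re.sub. The grammar fragment is a made-up miniature in the style of a yacc file, not an excerpt from bash's parse.y.

```python
import re

# The regex from above: an action block in braces, followed by ; or |.
ACTION = re.compile(r"(\{(\s+.*?)+\})\s+([;|])")

# Hypothetical yacc-style fragment used only for illustration.
GRAMMAR = """\
list: item
        {
          $$ = $1;
        }
    | list item
        {
          $$ = append($1, $2);
        }
    ;
"""

# Replace each action block plus its trailing separator with just the separator.
stripped = ACTION.sub(r"\3", GRAMMAR)
print(stripped)
```

The pattern relies on nested quantifiers, so it can be slow on very large inputs; on a file the size of a shell grammar that has not been a problem in my experience of this kind of pattern, but treat that as an assumption.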
Here's the grammar definition that I managed to extract at the time of posting https://pastebin.com/qpsK4TF6
I'd expect that sh, csh, ash, bash, would contain parsers. GNU versions of these are open source; you might just go check there.
I'm guessing it's not a Perl Compatible Regular Expression, since there's a special kind of grep which is specifically PCRE. What's grep most similar to?
Are there any special quirks of grep that I need to know about? (I'm used to Perl and the preg functions in PHP)
Default GNU grep behavior is to use a slightly flavorful variant on POSIX basic regular expressions, with a similarly tweaked species of POSIX extended regular expressions for egrep (usually an alias for grep -E). POSIX ERE is what PHP ereg() uses.
GNU grep also claims to support grep -P for PCRE, by the way. So no terribly special kind of grep required.
POSIX BRE (Basic Regular Expressions)
You can compare the various flavors here.
There's a good write-up here. To quote the page, "grep is supposed to use BREs, except that grep -E uses EREs. (GNU grep fits some extensions in where POSIX leaves the behaviour unspecified)."
In other words, it's a long story. ;)
Grep is an implementation of POSIX regular expressions. There are two types of posix regular expressions -- basic regular expressions and extended regular expressions. In grep, generally you use the -E option to allow extended regular expressions.
The grep man pages do a pretty thorough job of explaining the flavor of regexp available in grep. man grep is pretty useful.
There is no regular grep function in PHP. If you are referring to the ereg family of PHP functions then those are POSIX regular expressions. If you are referring to the Linux grep commandline utility, those are POSIX regular expressions as well. It supports both basic as well as extended POSIX regular expressions.
I have to deal with text files in a motley selection of formats. Here's an example (Columns A and B are tab delimited):
A B
a Name1=Val1, Name2=Val2, Name3=Val3
b Name1=Val4, Name3=Val5
c Name1=Val6, Name2=Val7, Name3=Val8
The files could have headers or not, have mixed delimiting schemes, have columns with name/value pairs as above etc.
I often have the ad-hoc need to extract data from such files in various ways. For example from the above data I might want the value associated with Name2 where it is present. i.e.
A B
a Val2
c Val7
What tools/techniques are there for performing such manipulations as one line commands, using the above as an example but extensible to other cases?
I don't like sed too much, but it works for such things:
var="Name2";sed -n "1p;s/\([^ ]*\) .*$var=\([^ ,]*\).*/\1 \2/p" < filename
Gives you:
A B
a Val2
c Val7
You have all the basic bash shell commands, for example grep, cut, sed and awk at your disposal. You can also use Perl or Ruby for more complex things.
From what I've seen I'd start with Awk for this sort of thing and then if you need something more complex, I'd progress to Python.
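For the Python end of that progression, a minimal sketch for the example in the question might look like this. It assumes the two columns are separated by a tab and the name/value pairs by commas, as described in the question; the function name and its parameters are invented for illustration.

```python
SAMPLE = (
    "A\tB\n"
    "a\tName1=Val1, Name2=Val2, Name3=Val3\n"
    "b\tName1=Val4, Name3=Val5\n"
    "c\tName1=Val6, Name2=Val7, Name3=Val8\n"
)

def extract(text, name, header=True):
    """Return 'key<TAB>value' lines for rows whose second column defines `name`."""
    lines = text.splitlines()
    out = []
    if header and lines:
        out.append(lines[0])  # pass the header line through unchanged
        lines = lines[1:]
    for line in lines:
        key, _, rest = line.partition("\t")
        # Turn "Name1=Val1, Name2=Val2" into {"Name1": "Val1", "Name2": "Val2"}.
        pairs = dict(p.strip().split("=", 1) for p in rest.split(",") if "=" in p)
        if name in pairs:
            out.append(f"{key}\t{pairs[name]}")
    return out

result = extract(SAMPLE, "Name2")
print("\n".join(result))
```

Handling other delimiting schemes would mean swapping the tab and comma for parameters, which is where a small reusable module starts to pay off.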
I would use sed:
# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive
Since you have cygwin, I'd go with Perl. It's the easiest to learn (check out the O'Reilly book: Learning Perl) and widely applicable.
I would use Perl. Write a small module (or more than one) for dealing with the different formats. You could then run perl one-liners using that library. An example of what it would look like:
perl -e 'use Parser;' -e 'parser("in.input").get("Name2");'
Don't quote me on the syntax, but that's the general idea. Abstract the task at hand to allow you to think in terms of what you need to do, not how you need to do it. Ruby would be another option, it tends to have a cleaner syntax, but either language would work.