ack misses results (vs. grep)

I'm sure I'm misunderstanding something about ack's file/directory ignore defaults, but perhaps somebody could shed some light on this for me:
mbuck$ grep logout -R app/views/
Binary file app/views/shared/._header.html.erb.bak.swp matches
Binary file app/views/shared/._header.html.erb.swp matches
app/views/shared/_header.html.erb.bak: <%= link_to logout_text, logout_path, { :title => logout_text, :class => 'login-menuitem' } %>
mbuck$ ack logout app/views/
mbuck$
Whereas...
mbuck$ ack -u logout app/views/
Binary file app/views/shared/._header.html.erb.bak.swp matches
Binary file app/views/shared/._header.html.erb.swp matches
app/views/shared/_header.html.erb.bak
98:<%= link_to logout_text, logout_path, { :title => logout_text, :class => 'login-menuitem' } %>
Calling ack without options can't find the result within a .bak file, but calling it with the --unrestricted option can. As far as I can tell, though, ack does not ignore .bak files by default.
UPDATE
Thanks to the helpful comments below, here are the new contents of my ~/.ackrc:
--type-add=ruby=.haml,.rake
--type-add=css=.less

ack is peculiar in that it doesn't have a blacklist of file types to ignore, but rather a whitelist of file types that it will search in.
To quote from the man page:
With no file selections, ack-grep only searches files of types that it recognizes. If you have a file called foo.wango, and ack-grep doesn't know what a .wango file is, ack-grep won't search it.
(Note that I'm using Ubuntu where the binary is called ack-grep due to a naming conflict)
ack --help-types will show a list of types your ack installation supports.
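For instance, to check whether custom mappings like the --type-add lines above were picked up, you could filter that list (a hypothetical check; the exact output format varies between ack versions):
ack --help-types | grep -E 'ruby|css'
The ruby entry should now list .haml and .rake alongside the built-in extensions, and the css entry should list .less.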

If you are ever confused about what files ack will be searching, simply add the -f option. It will list all the files that it finds to be searchable.
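In the scenario from the question, that might look like the following (a hypothetical run; the exact file list depends on your ack version and type definitions):
ack -f app/views/
# lists only files of recognized types - the .swp and .bak files are absent
ack -f -u app/views/
# with --unrestricted, the .swp and .bak files show up as well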

ack --man states:
If you want ack to search every file, even ones that it always ignores like coredumps and backup files, use the "-u" switch.
and
Why does ack ignore unknown files by default? ack is designed by a programmer, for programmers, for searching large trees of code. Most codebases have a lot of files in them which aren't source files (like compiled object files, source control metadata, etc), and grep wastes a lot of time searching through all of those as well and returning matches from those files.
That's why ack's behavior of not searching things it doesn't recognize is one of its greatest strengths: the speed you get from only searching the things that you want to be looking at.
EDIT: Also, if you look at the source code, .bak files are ignored by default.

Instead of wrestling with ack, you could just use plain old grep, from 1973. Because it uses an explicit blacklist of files and directories, instead of a whitelist of file types, it never omits correct results, ever. Given a couple of lines of config (which I created in my home-directory 'dotfiles' repo back in the 1990s), grep actually matches or surpasses many of ack's claimed advantages - in particular, speed: when searching the same set of files, grep is faster than ack.
The grep config that makes me happy looks like this, in my .bashrc:
# Custom 'grep' behaviour
# Search recursively
# Ignore binary files
# Output in pretty colors
# Exclude a bunch of files and directories by name
# (this both prevents false positives, and speeds it up)
function grp {
    grep -rI --color --exclude-dir=node_modules --exclude-dir=.bzr --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=build --exclude-dir=dist --exclude-dir=.tox --exclude=tags "$@"
}
function grpy {
    grp --include='*.py' "$@"
}
The exact list of files and directories to ignore will probably differ for you: I'm mostly a Python dev and these settings work for me.
It's also easy to add sub-customisations, as I show with 'grpy', which I use to grep Python source.
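Hypothetical usage (the search terms and paths are only placeholders):
grp TODO .              # recursive, coloured search that skips the excluded directories
grpy 'def main' src/    # the same, restricted to *.py files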
Defining bash functions like this is preferable to setting GREP_OPTIONS, which will cause ALL executions of grep from your login shell to behave differently, including those invoked by programs you have run. Those programs will probably barf on the unexpectedly different behaviour of grep.
My new functions, 'grp' and 'grpy', deliberately don't shadow 'grep', so that I can still use the original behaviour any time I need that.

Related

Search for files that contain pattern

I have this search - I would like to print out the paths of files that contain the matching text:
grep -r "jasmine" .
and it yields results that look like this:
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:- [Jasmine Google Group](http://groups.google.com/group/jasmine-js)
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:- [Jasmine-dev Google Group](http://groups.google.com/group/jasmine-js-dev)
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:git clone git#github.com:yourUserName/jasmine.git # Clone your fork
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:cd jasmine # Change directory
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:git remote add upstream https://github.com/jasmine/jasmine.git # Assign original repository to a remote named 'upstream'
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:Note that Jasmine tests itself. The files in `lib` are loaded first, defining the reference `jasmine`. Then the files in `src` are loaded, defining the reference `j$`. So there are two copies of the code loaded under test.
./app-root/runtime/repo/node_modules/jasmine-core/.github/CONTRIBUTING.md:The tests should always use `j$` to refer to the objects and functions that are being tested. But the tests can use functions on `jasmine` as needed. _Be careful how you structure any new test code_. Copy the patterns you see in the existing code - this ensures that the code you're testing is not leaking into the `jasmine` reference and vice-versa.
But I just want the file names; I don't want to print the matching contents, just the file names. How can I do that?
The problem is that the matching text wraps around in the terminal and makes the results basically unreadable.
Did you use the -l flag from grep?
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match.
A simple search on my home directory,
grep -rl 'bash' .
./.bashrc
./.bash_history
./.bash_logout
./.bash_profile
./.profile
./.viminfo
As a matter of fact, -l is a POSIX-defined option for grep, so it should be available in almost all distros.
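Applied to the example from the question, that is simply:
grep -rl "jasmine" .
which prints each matching file path (such as the CONTRIBUTING.md above) exactly once, without any of the matched text.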

Slash at the end of url

I think (correct me if I am wrong) that it is better to put a / at the end of most URLs. Like this: http://www.myweb/file/
And not put / at the end of filenames: http://www.myweb/name.html
I have to correct that in a website with a lot of links. Is there a way to do that quickly? For instance, in some programs like Dreamweaver I can use find and replace.
The second case is quite easy with Dreamweaver:
- Find: .html/"
- Replace: .html"
But how can I say something like:
- Find: all the links that end with a directory. Like http://www.myweb/file
- Replace: the same link but with a / at the end. Like http://www.myweb/file/
Your approach may work but it is based on the assumption that all files have a file extension.
There is a distinct difference between the urls http://www.myweb/file and http://www.myweb/file/ because the latter could resolve to http://www.myweb/file/index.php, or any other in the default set configured in your web server. That URL could also reference a perfectly valid file which doesn't contain a file extension, such as if it were a REST endpoint.
So you are correct insofar as you should explicitly add a "/" if you are referring to a directory, for example if you are expecting the web server to look up the correct index page to respond, or doing a directory listing.
To replace the incorrect URLs, regular expressions are your friend.
To find all files which have an erroneous "/" you could use /\.(html|php|jpg|png)\//, adding as many different file extensions into that pipe-separated list as you like. You can then replace that with .$1 or .\1 depending on your tool.
An example of doing this with Perl would be:
perl -pi -e 's/\.(html|php|jpg|png)\//.$1/g' theFileYouWantToCheck.html
Or (if you're using a Linux-based system) you can automate that nicely with find:
find path/to/html/root -type f -name "*.html" | xargs perl -pi -e 's/\.(html|php|jpg|png)\//.$1/g'
which will find all html files in the directory and do an inline find and replace. Assuming you're using version control, it's then easy to see the changes it's applied :)
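If you would rather preview the substitution before letting -i rewrite anything, one option (using the same hypothetical file name) is to drop -i and diff the output against the original:
perl -pe 's/\.(html|php|jpg|png)\//.$1/g' theFileYouWantToCheck.html | diff theFileYouWantToCheck.html -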
Update
Solving the problem of adding a slash to directories isn't trivial. The approach I'd take (a rough sketch follows below):
Write a script to recurse through your website structure locally, making a list of all files
Parse the HTML files to extract all href=".*" and replace them with href=".*/" only if the end of the URL isn't present in the list extracted by the first script.
Any text-based find and replace is not going to be aware of whether the link is actually to a file or not.
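Here is a rough shell sketch of the comparison step of that approach, assuming a local mirror of the site under ./site (all paths and file names here are illustrative):
# 1. Build a list of every file that actually exists in the mirror.
find site -type f | sed 's|^site||' | sort > files.txt
# 2. Collect every href target used in the HTML pages.
grep -rhoE --include='*.html' 'href="[^"]*"' site | sed -E 's/^href="([^"]*)"$/\1/' | sort -u > hrefs.txt
# 3. Targets that do not correspond to a real file are the candidates that may need a trailing slash.
comm -23 hrefs.txt files.txt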

How can I get the output of ls

How can I get the output of ls? I want to add an indirection operator =>, whose function is the same as >. That is, on the command line $ ls => Files, the list of files in the directory is stored in the file Files.
Output redirection (and all other redirection for that matter) is a facility provided by the shell, not by the ls program. The ls just writes its output to standard output and, if the shell has redirected that to a file, that's where it goes.
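A quick way to convince yourself of that in any POSIX shell: the redirection can appear anywhere on the command line, because the shell strips it out and opens the file before ls is even started.
ls > Files
> Files ls
# both store the listing in Files; ls itself never sees the > token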
So, if you want to add a => token, it's the shell that you're going to have to modify, recompile and install. That's not necessarily an easy task. I've made changes to bash in the past and, while it's relatively easy to tinker around the edges (I added an internal command for outputting the PS1 result string), I suspect redirection may be a little more difficult.
Still, it may be a matter of simply creating a new token => and copying the code that's currently executed for >. It may also be that ash, the Minix3 shell, is a lot cleaner than bash. My advice would be to investigate ash, specifically the version found in Minix3, and just have a play.

Options for MeCab Japanese tokenizer on iOS?

I'm using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having some trouble getting it to tokenize all possible words. Specifically, I cannot tokenize "吉本興業" into two pieces "吉本" and "興業". Are there any options that I could use to fix this? The iPhone library does not expose anything, but it uses C++ underneath the objective-c wrapper. I assume there must be some sort of setting I could change to give more fine-grained control, but I have no idea where to start.
By the way, if anyone wants to tag this 'mecab' that would probably be appropriate. I'm not allowed to create new tags yet.
UPDATE: The iOS library is calling mecab_sparse_tonode2() defined in libmecab.cpp. If anyone could point me to some English documentation on that file it might be enough.
There is nothing iOS-specific in this. The dictionary you are using with mecab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are listed as separate nouns as well, mecab has a strong preference to tag the compound name as one word.
Mecab lacks a feature that allows the user to choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
If you have a list of compounds you'd like to get segmented, a quick fix is to create a user dictionary for the parts they consist of, and make mecab use this in addition to the main dictionary.
There is Japanese documentation on how to do this here. For your particular example, it would involve the steps below.
Make a user dictionary with two entries, one for 吉本 and one for 興業:
吉本,,,100,名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業,,,100,名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
I suspect that both entries exist in the default dictionary already, but by adding them to a user dictionary and specifying a relatively low specificness indicator (I've used 100 for both -- the lower, the more likely to be split), you can get mecab to tend to prefer the parts over the whole.
Compile the user dictionary:
$> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
You may have to adjust the command. The above assumes:
Mecab was installed from source in $MECAB. If you use mecab installed by a package manager, you might have difficulties finding the mecab-dict-index tool. Best install from source.
The default dictionary is in /usr/lib64/mecab/dic/ipadic. This is not part of the mecab package; it comes as a separate package (e.g. this) and you may have difficulties finding this, too.
mydic is the name of the user dictionary created in step 1. mydic.dic is the name of the compiled dictionary you'll get as output (it need not exist beforehand).
Both the system dictionary (-t option) and the user dictionary (-f option) are encoded in UTF-8. This may be wrong, in which case you'll get an error message later when you use mecab.
Modify the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or similar. In your case it may be located somewhere else. Add the following line to the end of the configuration file:
userdic = /home/myhome/mydic.dic
Make sure the absolute path to the dictionary compiled above is correct.
If you then run mecab against your input, it will split the compound into its parts (I tested it, using mecab 0.994 on a Linux system).
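A quick way to check this yourself (assuming the compiled user dictionary from the compile step lives at /home/myhome/mydic.dic; you can also pass it explicitly with -u instead of relying on the dicrc entry):
echo "吉本興業" | mecab -u /home/myhome/mydic.dic
With the user dictionary taking precedence, the compound should now come out as two tokens, 吉本 and 興業, each on its own line followed by its feature string.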
A more thorough fix would be to get the source of the default dictionary and manually remove all compound nouns you want to get split, then recompile the dictionary. As a general remark, using a CJK tokenizer for a serious application in production mode over a longer period of time usually involves a certain amount of regular dictionary maintenance (adding/removing entries).

Getting a full list of the URLS in a rails application

How do I get a complete list of all the URLs that my Rails application could generate?
I don't want the routes that I get from rake routes; instead I want the actual URLs corresponding to all the dynamically generated pages in my application...
Is this even possible?
(Background: I'm doing this because I want a complete list of URLs for some load testing I want to do, which has to cover the entire breadth of the application)
I was able to produce useful output with the following command:
$ wget --spider -r -nv -nd -np http://localhost:3209/ 2>&1 | ack -o '(?<=URL:)\S+'
http://localhost:3209/
http://localhost:3209/robots.txt
http://localhost:3209/agenda/2008/08
http://localhost:3209/agenda/2008/10
http://localhost:3209/agenda/2008/09/01
http://localhost:3209/agenda/2008/09/02
http://localhost:3209/agenda/2008/09/03
^C
A quick reference of the wget arguments:
# --spider don't download anything.
# -r, --recursive specify recursive download.
# -nv, --no-verbose turn off verboseness, without being quiet.
# -nd, --no-directories don't create directories.
# -np, --no-parent don't ascend to the parent directory.
About ack
ack is like grep but uses Perl regexps, which are more complete/powerful.
-o tells ack to only output the matched substring, and the pattern I used looks for anything non-space preceded by 'URL:'
You could pretty quickly hack together a program that grabs the output of rake routes and then parses the output to put together a list of the URLs.
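A rough sketch of that idea, assuming the classic rake routes table layout in which the URI pattern is the second-to-last whitespace-separated column (adjust the field index for your Rails version):
rake routes | awk 'NR > 1 { print $(NF-1) }' | grep '^/' | sort -u
You would still have to substitute real values for :id and similar placeholders before the list is usable for load testing.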
What I have, typically, done for load testing is to use a tool like WebLOAD and script several different types of user sessions (or different routes users can take). Then I create a mix of user sessions and run them through the website to get something close to an accurate picture of how the site might run.
Typically I will also do this on a total of 4 different machines running about 80 concurrent user sessions to realistically simulate what will be happening through the application. This also makes sure I don't spend overly much time optimizing infrequently visited pages and can, instead, concentrate on overall application performance along the critical paths.
Check out the Spider Integration Tests written by Courtnay Gasking
http://pronetos.googlecode.com/svn/trunk/vendor/plugins/spider_test/doc/classes/Caboose/SpiderIntegrator.html
