How does `grep` determine which files are "text"?

This question pertains to the usage and inner workings of grep. If I issue grep 'needle' *, grep will search for needle within all text files of the current directory (as described at http://www.linfo.org/grep.html). But what constitutes these so-called "text files", and how does grep identify them?
For example, ASCII, UTF-8, and UTF-16 files could all be considered text files, yet grep does not search UTF-16 by default.
As for identification, does grep use solely the file signature at the beginning of the file?
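For instance, a small demonstration of the UTF-16 case (a sketch; exact behaviour and messages vary between grep implementations and locales):
printf 'needle\n' | iconv -f UTF-8 -t UTF-16 > utf16.txt
grep 'needle' utf16.txt                               # no output: UTF-16 interleaves NUL bytes, so the literal byte sequence 'needle' never occurs and grep sees a binary file
iconv -f UTF-16 -t UTF-8 utf16.txt | grep 'needle'    # matches once the data is converted back to UTF-8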

Related

How to grep for files using 'and' operator, words might not be on the same line

I have a directory /dir which has several text files in it. These files may or may not contain the words 'rock' and 'stone': some files might contain just the word 'rock', some just the word 'stone', some may contain both, and some may contain neither.
How can I list all files in this directory that contain both 'rock' and 'stone'? These words might not be on the same line, so I don't think piping through grep twice would work.
Appreciate any help; I was not able to find a Stack Overflow post with this problem, so I figured I'd ask.
To search files that match the given two (or more) words at any line anywhere in the file, you may want to try ugrep:
ugrep -F --files -e 'rock' --and -e 'stone' dir
This only matches files that have both rock and stone in them. The lines containing rock or stone are output, or you can use option -l to just list the matching files. The -F option searches for fixed strings (like grep -F and fgrep), and --files applies the --and condition file-wide, which is what you want here instead of applying it per line. Note that there is more than one pattern in this case, so option -e should be used (as grep also requires with multiple patterns).
A shorter form with --bool:
ugrep -F --files --bool 'rock stone' dir
where --bool formulates a Boolean query with space as AND (or use AND).
If you want to search directory dir recursively in subdirectories, use option -r.
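If only standard grep is available, a similar file-wide AND can be approximated by chaining two -l searches, so that the second grep only looks inside files already known to contain the first word (a sketch, assuming GNU grep and findutils for the -Z/--null and xargs -0 -r handling of unusual file names):
grep -lZ 'rock' dir/* | xargs -0 -r grep -l 'stone'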

Find the name of the file or files with a certain text. The command should search through the directory and all subdirectories to find all matches

I am trying to find the name of a file by searching for a word that's in the file.
I have tried to look at many examples online, but unfortunately I couldn't find a command that outputs the names of the files with that certain word in them.
grep -r 'Facebook' *
I expected the output to be the names of the files which have the word facebook in them, but instead I got the lines containing the word facebook, which is not what I wanted.
There are many possible solutions in a Unix/Linux environment:
find /path/to/base -name Facebook
ls -R /path/to/base | grep -i "[Ff]acebook"
etc.
Or there are programmatic approaches written in a language of your choice. Add more details about what you are trying to do for a better answer.
It should be something like this:
grep -l "Facebook" *
This is going to search all the files (but not subdirectories) in the current folder and print the name of each file where a match was found.
From the man page of grep:
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match. (-l is specified by POSIX.)
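Since the question also asks about subdirectories: with GNU grep you can combine -r with -l to search recursively and still print only the file names, for example:
grep -rl 'Facebook' /path/to/base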

SSIS pipe delimiter issue for CRLF csv file

I am facing the below pipe delimiter issue in SSIS.
CRLF Pipe delimited text file:
-----------------------------
Col1|Col2 |Col3
1 |A/C No|2015
2 |A|C No|2016
Because of the pipe embedded within a field value, SSIS is failing to read the data.
Bad news: once you have a file with this problem, there is NO standard way for ANY software program to correctly parse the file.
Good news: if you can control (or affect) the way the file is generated to begin with, you would usually address this problem by including what is called a "Text Delimiter" (for example, having field values surrounded by double quotes) in addition to the Field Delimiter (pipe). The Text Delimiter will help because a program (like SSIS) can tell the field values apart from the delimiters, even if the values contain the Field Delimiter (e.g. pipes).
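For illustration, this is roughly what the same file could look like with a double-quote text qualifier (a sketch of what the source system would need to produce); SSIS can then be given " as the text qualifier for the flat file connection:
Col1|Col2|Col3
1|"A/C No"|2015
2|"A|C No"|2016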
If you can't control how the file is generated, the best you can usually do is GUESS, which is problematic for obvious reasons.

Quickly parse large utf-8 text file in haskell

I have a 300MB file (link) with utf-8 characters in it. I want to write a haskell program equivalent to:
cat bigfile.txt | grep "^en " | wc -l
This runs in 2.6s on my system.
Right now, I'm reading the file as a normal String (readFile), and have this:
main = do
  contents <- readFile "bigfile.txt"
  putStrLn $ show $ length $ lines contents
After a couple seconds I get this error:
Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence)
I assume I need to use something more UTF-8 friendly? How can I make it both fast and UTF-8 compatible? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it doesn't support UTF-8.
Package utf8-string provides support for reading and writing UTF8 Strings. It reuses the ByteString infrastructure so the interface is likely to be very similar.
Another Unicode strings project which is likely to be related to the above and is also inspired by ByteStrings is discussed in this Masters thesis.

GExperts grep expression for source lines with string literals (for translation)

How can I find, using the GExperts grep search, all lines in Delphi source code which contain a string literal instead of a resource string, except those lines which are marked as 'do not translate'?
Example:
this line should match
ShowMessage('Fatal error! Save all data and restart the application');
this line should not match
FieldByName('End Date').Clear; // do not translate
(Asking specifically about GExperts as it has a limited grep implementation, AFAIK.)
Regular expressions cannot be negated in general.
Since you want to negate only a portion of the search, this comes as close as I could get within the regex boundaries that the GExperts Grep Search understands:
\'.*\'.*[^n][^o][^t][^ ][^t][^r][^a][^n][^s][^l][^a][^t][^e]$
Edit: Forgot the end-of-line $ marker, as GExperts Grep Search cannot do without.
blokhead explains why you cannot negate in general.
This Visual Studio Quick Search uses the tilde for negation, but the GExperts Grep Search does not support this.
The grep command-line search has the -v (invert match) option to negate a complete search (but not a partial one).
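Outside the IDE, that -v approach would look something like this (a sketch; it assumes the marker comment is literally 'do not translate' and that the sources are .pas files):
grep -n "'" *.pas | grep -v 'do not translate'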
A perfect manual negation gets complicated very rapidly.
--jeroen
