Quickly parse large UTF-8 text file in Haskell

I have a 300MB file (link) with UTF-8 characters in it. I want to write a Haskell program equivalent to:
cat bigfile.txt | grep "^en " | wc -l
This runs in 2.6s on my system.
Right now, I'm reading the file as a normal String (readFile), and have this:
main = do
  contents <- readFile "bigfile.txt"
  putStrLn $ show $ length $ lines contents
After a couple seconds I get this error:
Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence)
I assume I need to use something more UTF-8 friendly? How can I make it both fast and UTF-8 compatible? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it doesn't support UTF-8.

The utf8-string package provides support for reading and writing UTF-8 Strings. It reuses the ByteString infrastructure, so the interface is likely to be very similar.
Another Unicode string project, likely related to the above and also inspired by ByteString, is discussed in this Master's thesis.
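For the speed side specifically: since the "en " prefix being matched is plain ASCII, a lazy ByteString comparison never needs to decode the UTF-8 at all, because multi-byte characters elsewhere on a line cannot collide with an ASCII prefix. A minimal sketch of that approach, assuming the file name from the question:

import qualified Data.ByteString.Lazy.Char8 as BL

-- Count lines beginning with "en ", working directly on the raw bytes.
-- Because the prefix is pure ASCII, no UTF-8 decoding is required.
main :: IO ()
main = do
    contents <- BL.readFile "bigfile.txt"
    print (length (filter (BL.pack "en " `BL.isPrefixOf`) (BL.lines contents)))

If you later need to match non-ASCII text, decoding with the utf8-string package mentioned above (or the text package's Data.Text.Lazy) would be the more robust route, at some cost in speed.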

Related

Grep in reverse order without reading whole file

I have a log file that may be very large (10+ GB). I'd like to find the last occurrence of an expression. Is it possible to do this with standard posix commands?
Here are some potential answers, from similar questions, that aren't quite suitable.
Use tail -n <x> <file> | grep -m 1 <expression>: I don't know how far back the expression is, so I don't know what <x> would be. It could be several GB previous, so then you'd be tailing the entire file. I suppose you could loop and increment <x> until it's found, but then you'd be repeatedly reading the last part of the file.
Use tac <file> | grep -m 1 <expression>: tac reads the entire source file. It might be possible to chain something on to sigkill tac as soon as some output is found? Would that be efficient?
Use awk/sed: I'm fairly sure these both always start from the top of the file (although I may be wrong, my sed-fu is not strong).
"There'd be no speed up so why bother": I think that's incorrect, since file systems can seek to the end of a file without reading the whole thing. There'd be a little trial and error/buffering to find each new line, but that shouldn't slow things down much, compared to reading (e.g.) 10 GB that are never used.
Write a Python/Perl script to do it: this is my fall-back if no one can suggest anything better. I'd rather stick to something that can be done straight through the command line, since I'm executing it straight through ssh, and I'd rather not have to upload a script file as well. Using mmap's rfind() in Python, I think we can do it in a few lines, provided the expression to find is static (which mine, unfortunately, is not). A regex requires a bit more work, something like this.
If it helps, the expression is anchored at the start of a line, eg: "^foo \d+$".
Whatever script you write will almost certainly be slower than:
tac file | grep -m 1 '^foo [0-9][0-9]*$'
This awk script will search through the whole file and print the last line matching the given /pattern/:
$ awk '/pattern/ { line=$0 } END { print line }' gigantic.log
Using tac will be a better option (this uses GNU sed to output the first (i.e. last) found match with '/pattern/', after which it terminates, killing the pipeline):
$ tac gigantic.log | gsed -n '/pattern/{p;q}'
Using Perl or C or some other language, you could seek to the end of the file, step back 4kb (or something), and then
read forwards 4kb,
step back 8kb
repeat until the pattern is found, making sure that you handle partial lines correctly.
(This, apart from looking for a pattern, may actually be what tac does: one implementation of tac)
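For illustration, here is a rough Haskell sketch of that backward, chunk-at-a-time scan; the 4 KiB step size, the "foo " prefix test, and the gigantic.log file name are placeholder assumptions, and it makes no attempt to be as clever as tac:

import System.IO
import qualified Data.ByteString.Char8 as B

chunkSize :: Integer
chunkSize = 4096

-- Scan a file from the end in fixed-size chunks and return the last line
-- satisfying the predicate, without ever reading the whole file.
lastMatching :: (B.ByteString -> Bool) -> FilePath -> IO (Maybe B.ByteString)
lastMatching wanted path = withFile path ReadMode $ \h -> do
    size <- hFileSize h
    go h size B.empty
  where
    -- 'carry' is the (possibly partial) first line of the region already read
    go _ 0 carry = return (pick (B.lines carry))
    go h pos carry = do
        let newPos = max 0 (pos - chunkSize)
        hSeek h AbsoluteSeek newPos
        chunk <- B.hGet h (fromIntegral (pos - newPos))
        case B.lines (chunk `B.append` carry) of
            []               -> go h newPos B.empty
            (partial:wholes) -> case pick wholes of
                Just hit -> return (Just hit)    -- latest match in the file wins
                Nothing  -> go h newPos partial  -- keep scanning backwards
    -- last (i.e. furthest down the file) matching line in a block of lines
    pick ls = case filter wanted ls of
        [] -> Nothing
        xs -> Just (last xs)

main :: IO ()
main = do
    result <- lastMatching (B.pack "foo " `B.isPrefixOf`) "gigantic.log"
    maybe (putStrLn "no match") B.putStrLn result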

COBOL REPLACING ALL pattern matching

I'm working on converting some legacy COBOL code and came across a statement like this:
INSPECT WS-LOCAL-VAR REPLACING ALL X'0D25' BY ' '
I understand that the INSPECT...REPLACING ALL statement will look through WS-LOCAL-VAR, match the pattern X'0D25' and replace it with a space.
What I don't understand is the purpose of the X outside of '0D25'. All examples of REPLACING ALL that I've found online don't use anything other than a char literal for pattern matching.
How does the X affect which patterns are replaced?
COBOL is running on an EBCDIC machine and the input file is coming from a Windows machine.
The X indicates that the characters in the string are given in hexadecimal. In this case, X"0D" is the carriage-return character and X"25" the % sign (assuming an ASCII system).
A similar notation is used to indicate national strings (N"こんにちは") and boolean/bit strings (B"0101010"), as well as their respective hexadecimal equivalents (NX"01F5A4" and BX"2A").
Is the COBOL running on an EBCDIC machine (mainframe / AS400), and is the file coming from a Windows machine?
EBCDIC has only one end-of-line character, x'25', as opposed to the two (\r, \n) in ASCII. X'0D25' is the EBCDIC representation of the Windows end-of-line marker \r\n. In EBCDIC, x'0D' is not a valid text character.
Possible sources of the problem:
Poor conversion of a Windows text file when it was transferred to the mainframe / AS400.
Java (and possibly other modern languages) on Windows. Java supports writing EBCDIC text files using its standard writers, but on Windows it insists on writing \r\n even though \r is not a valid EBCDIC character, so you get corrupt files containing x'0D25'.
If you move a program that hard-codes \r\n to the mainframe and run it, you will also get x'0D25' in files.
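Outside COBOL, the equivalent cleanup is just a byte-level substring replacement. A small sketch of the same idea in Haskell (reading stdin and writing stdout; replacing each x'0D25' pair with a single blank mirrors the intent of the INSPECT statement, though whether one or two blanks is wanted is an assumption here, and the data simply gets shorter instead of sitting in a fixed-length field):

import qualified Data.ByteString.Char8 as B

-- Replace every occurrence of 'pat' in 's' with 'rep'.
replaceAll :: B.ByteString -> B.ByteString -> B.ByteString -> B.ByteString
replaceAll pat rep s
    | B.null pat = s
    | otherwise  =
        let (before, rest) = B.breakSubstring pat s
        in if B.null rest
              then before
              else before <> rep <> replaceAll pat rep (B.drop (B.length pat) rest)

-- "\x0D\x25" is the stray CR followed by the EBCDIC line-end byte.
main :: IO ()
main = B.interact (replaceAll (B.pack "\x0D\x25") (B.pack " "))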

POSIX sh EBNF grammar

Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
I've made multiple attempts at writing my own full-blown Bash interpreter over the past year, and at some point I also reached the same book appendix referenced in the marked answer (#2), but it's not completely correct/up to date (for example, it doesn't define production rules using the 'coproc' reserved word, and it has a duplicate production rule definition for a redirection using '<&'; there might be more problems, but those are the ones I've noticed).
The best way I've found is to go to http://ftp.gnu.org/gnu/bash/
Download the current bash version's sources
Open the parse.y file (the Yacc file that contains the parsing logic Bash uses) and copy-paste the lines between the '%%' markers into your favorite text editor; those define the grammar's production rules.
Then, using a little bit of regex (which I'm terrible at, by the way), we can delete the extra code logic that sits between '{...}' to make the grammar look more BNF-like.
The regex I used was:
(\{(\s+.*?)+\})\s+([;|])
It non-greedily matches anything (.*?), including spaces and newlines (\s+), between curly braces, up to the last closing brace before a ; or | character. Then I just replaced the matched strings with \3 (i.e. the result of the third capturing group, being either ; or |).
Here's the grammar definition that I managed to extract at the time of posting: https://pastebin.com/qpsK4TF6
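If the regex feels too fragile, the same stripping can be done by counting braces. A small Haskell sketch of that idea (it deliberately ignores the corner cases of braces that appear inside string literals or comments within an action):

-- Strip Yacc '{ ... }' action blocks (including nested braces) from stdin.
stripActions :: String -> String
stripActions = go (0 :: Int)
  where
    go _ []       = []
    go n ('{':cs) = go (n + 1) cs            -- entering (or nesting) an action
    go n ('}':cs) | n > 0 = go (n - 1) cs    -- leaving an action level
    go 0 (c:cs)   = c : go 0 cs              -- outside any action: keep it
    go n (_:cs)   = go n cs                  -- inside an action: drop it

main :: IO ()
main = interact stripActions

Feeding it the lines copied from between the '%%' markers (e.g. runghc stripactions.hs < rules.y, with the file names as placeholders) leaves just the production rules and their ; / | punctuation.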
I'd expect that sh, csh, ash, and bash would all contain parsers. GNU versions of these are open source; you might just go check there.

How to Decode Scrambled Character Encoding

I have data in CSV format whose character encoding has been seriously scrambled, likely from going back and forth between different software applications (LibreOffice Calc, Microsoft Excel, Google Refine, custom PHP/MySQL software; on Windows XP, Windows 7, and GNU/Linux machines from various regions of the world...). It seems that somewhere in the process the non-ASCII characters became seriously scrambled, and I'm not sure how to descramble them or detect a pattern. Doing so manually would involve a few thousand records...
Here's an example. For "Trois-Rivières", when I open this portion of the CSV file in Python, it says:
Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res
Question: through what process can I reverse
\xc3\x83\xc2\x85\xc3\x82\xc2\xa0
to get back
è
i.e. how can I unscramble this? How might this have become scrambled in the first place? How can I reverse engineer this bug?
You can check the solutions that were offered in: Double-decoding unicode in python
Another, simpler brute-force solution is to create a mapping table for the small set of scrambled characters, using a regular-expression search (((\\\x[a-c0-9]{2}){8})) on your input file. For a file from a single source, you should have fewer than 32 such sequences for French and fewer than 10 for German. Then you can run "find and replace" using this small mapping table.
Based on dan04's comment above, we can guess that somehow the letter "è" was misinterpreted as an "Š", which then had three-fold UTF-8 encoding applied to it.
So how did "è" turn into "Š", then? Well, I had a hunch that the most likely explanation would be between two different 8-bit charsets, so I looked up some common character encodings on Wikipedia, and found a match: in CP850 (and in various other related 8-bit DOS code pages, such as CP851, CP853, CP857, etc.) the letter "è" is encoded as the byte 0x8A, which in Windows-1252 represents "Š" instead.
With this knowledge, we can recreate this tortuous chain of mis-encodings with a simple Unix shell command line:
$ echo "Trois-Rivières" \
| iconv -t cp850 \
| iconv -f windows-1252 -t utf-8 \
| iconv -f iso-8859-1 -t utf-8 \
| iconv -f iso-8859-1 -t utf-8 \
| iconv -f ascii --byte-subst='\x%02X'
Trois-Rivi\xC3\x83\xC2\x85\xC3\x82\xC2\xA0res
Here, the first iconv call just converts the string from my local character encoding — which happens to be UTF-8 — to CP850, and the last one just encodes the non-ASCII bytes with Python-style \xNN escape codes. The three iconv calls in the middle recreate the actual re-encoding steps applied to the data: first from (assumed) Windows-1252 to UTF-8, and then twice from ISO-8859-1 to UTF-8.
So how can we fix it? Well, we just need to apply the same steps in reverse:
$ echo -e 'Trois-Rivi\xC3\x83\xC2\x85\xC3\x82\xC2\xA0res' \
| iconv -f utf-8 -t iso-8859-1 \
| iconv -f utf-8 -t iso-8859-1 \
| iconv -f utf-8 -t windows-1252 \
| iconv -f cp850
Trois-Rivières
The good news is that this process should be mostly reversible. The bad news is that any "ü", "ì", "Å", "É" and "Ø" letters in the original text may have been irreversibly mangled, since the bytes used to encode those letters in CP850 are undefined in Windows-1252. (If you're lucky, they may have been interpreted as the same C1 control codes that those bytes represent in ISO-8859-1, in which case back-conversion should in principle be possible. I haven't managed to figure out how to convince iconv to do it, though.)

GREP - finding all occurrences of a string

I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not developed in-house (entirely) we cannot simply look for occurrences in messages.properties and be done. We must go through JSP's, Java code, and xml.
I am using grep to filter results like this:
grep SOME_PATTERN . -ir | grep -v import | grep -v // | grep -v /* ...
The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).
All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:
public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";
I would like to find that occurrence as well as:
public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.
To address your concern about missing some occurrences, why not filter progressively:
1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
This might take some time, of course, but it doesn't sound like this is something you want to get wrong...
I would use sed, not grep!
Sed is used to perform basic text transformations on an input stream.
Try the s/regexp/replacement/ command with sed.
You can also try the awk command. It has a -F option for field separation; you can use it with ; to split the lines of your files on ;.
The best solution, however, would be a simple script in Perl or Python.
