Currently I am grepping a file for lines containing any of the following values:
342163477\|405760044\|149007683\|322391022\|77409125\|195978682\|358463993\|397650460\|171780277\|336063797\|397650502\|357636118\|168490006...............
This list is longer and contains ~700 different values.
What is the most efficient way of extracting them? I can chop the list into parts of 10/20/50/100... Or are there other Unix methods? The grep output is piped to Python for further analysis, which is fast enough.
Splitting it would only make it worse. Except in degenerate cases (and this isn't one), how long or how complex a regular expression is really doesn't matter: the execution time is the same.
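Whether or not you keep the single big pattern, note that grep -F -f ids.txt data.txt reads the fixed strings from a file and matches them literally, which avoids regex handling entirely. And since the output already goes to Python, here is a minimal sketch of doing the filtering on the Python side instead; the file names, and the assumption that the IDs appear as whitespace-separated tokens, are mine, not from the question:

# ids.txt holds one ID per line; data.txt is the file being searched.
# Set membership is O(1), so ~700 IDs cost about the same as 7.
with open("ids.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

with open("data.txt") as f:
    for line in f:
        # Assumes the IDs appear as whitespace-separated tokens.
        if any(token in wanted for token in line.split()):
            print(line, end="")  # or hand the line straight to the existing analysis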
For those who are not familiar with what a homophone is, I provide the following examples:
our & are
hi & high
to & too & two
While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.
I looked into the alternativeSubstrings property, wondering if this would help, but in my testing of the above words it always comes back empty.
I also looked into the Natural Language API, but could not find anything in there that looked useful.
I understand that as a user speaks more words, the Speech API can begin to infer context and correct for these, but my use case won't benefit much from that, since it will often involve only one or two words at most, limiting the effectiveness of context.
An example of contextual processing:
Using the words above on their own, I get these results:
are
hi
to
However, if I put them together in the following sentence, you can see that the isolated results above would all be wrong:
I am too high for our ladder
Ideally, I would either get a list back containing [are, our], [to, too, two], [hi, high] for each transcription segment, or would have a way to compare a string against a function that supports homophones.
An example of this would be:
if myDetectedWord == "to" then { ... }
Where myDetectedWord can be [to, too, two], and this function would return true for each of these.
This is a common NLP dilemma, and I'm not sure exactly what output you need in this application. If you can, you may want to sidestep this problem in your design/architecture; otherwise it turns into a real challenge.
That being said, if you do want to get into it, I like this idea of yours:
string against a function
This is likely the more efficient and performance-friendly route.
One way I would approach this is through regex processing, instead of endless loops and arrays. You could prototype with loops and arrays first to see how it works, then switch to regular expressions to gain performance.
For instance, you could define fixed alternations of words in regular expressions and quickly check them against your string (word by word, perhaps using back-references), and you can add as many boundaries to your expressions as you need for the string processing.
Those fixed alternations can also be designed around the probability of certain words occurring in certain parts of a string. For instance,
^I
vs
^eye
The probability of I being the first word is much higher than that of eye.
The probability of I appearing anywhere in a string is also higher than that of eye.
You might want to weight words based on that.
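Whatever combination of sets, regular expressions, and weighting you end up with, a minimal Swift sketch of the "string against a function" idea could look like this; the homophone groups here are hand-written and purely illustrative:

// Hand-built homophone groups; illustrative only, not exhaustive.
let homophoneGroups: [Set<String>] = [
    ["our", "are"],
    ["hi", "high"],
    ["to", "too", "two"]
]

// True if the two words are identical or belong to the same homophone group.
func matchesAllowingHomophones(_ detected: String, _ expected: String) -> Bool {
    let d = detected.lowercased()
    let e = expected.lowercased()
    if d == e { return true }
    return homophoneGroups.contains { $0.contains(d) && $0.contains(e) }
}

print(matchesAllowingHomophones("too", "to"))   // true
print(matchesAllowingHomophones("High", "hi"))  // true
print(matchesAllowingHomophones("two", "high")) // false

Checking each transcription segment through a function like this keeps the comparison independent of which homophone the recognizer happened to pick.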
I'd say the key is to narrow your desired outputs down as tightly as possible and increase accuracy (maybe even to a vocabulary of around 100 words, if possible) if you want a good, working application.
It's a good project though; I hope you enjoy the challenge.
Available in my Lua 5.1 environment are obviously the default Lua pattern matching, but also a reasonably recent version of PCRE and LPEG. I don't honestly care which of these is used; as long as my problem is tackled in an efficient manner I'm happy. (My personal knowledge of LPEG especially is next to non-existent, but I hear it has some very good qualities.)
I have a table with certain string patterns as keys; the accompanying values are to be used once a key matches, which means they aren't really important for this question.
Suppose you have:
tbl = { ["aaa"] = 12, ["aab"] = 452, ["aba"] = -2 }
Now my goal is to find out which one of these matches first in a particular string like "accaccaacaadacaabacdaaba".
In reality, the keys are more numerous and the match string is considerably lengthier, which means simply matching against all keys one by one and comparing the column each match begins at is far too inefficient to be viable for me.
Parts of the match strings can have considerable overlaps, too. In theory, I know one state machine per key pattern would be ideal in this regard: just step every pattern through the input, and the moment one of them completes a match you are done.
But I would be crazy to go and code something like that myself when there are so many pattern-matching libraries in my environment. The only one I know of that is technically capable is PCRE: just join the keys like "aaa|aab|aba" and you'll get the first feasible match.
But there's also a problem. For one, I am unsure how intelligently PCRE compiles such a pattern. (I think it first tries 'aaa', backtracks completely once that fails, then tries 'aab' in full, but I haven't tested this.) That wouldn't be very efficient compared to something like "a(a[ab]|ba)", where common prefixes get resolved sooner.
Additionally, I'd like to be able to add some flexibility ("a.ad", where the second character doesn't matter or matches a number... basic stuff like that). With patterns like that folded into one big alternation, I don't see a way to recover which original pattern matched so that I can use the value that goes with it.
(Worst case, I could just generate a lot of entries in the table to match every possible wildcard variation and do away with the pattern requirement, but I honestly don't want to.)
Which library is the right tool for the job, and, to boot, how do I best use that library to achieve the above-stated goals without reinventing the wheel?
A comment on your question mentioned the Aho–Corasick algorithm.
If your environment has access to os.execute or io.popen, you can call fgrep -o -f patterns filename, where patterns is the name of a file that contains patterns separated with newlines, and filename is the name of your input. -o means that only matches will be output, one per line. You can replace filename with - so that fgrep reads from standard input: echo "String to match" | fgrep -o -f patterns.
fgrep implements the Aho–Corasick algorithm.
However, remember that the Aho–Corasick algorithm does not recognise metacharacters.
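A minimal Lua 5.1 sketch of that approach, assuming fgrep is on the PATH; the file names are placeholders, and tbl is the table from the question:

-- patterns.txt holds one fixed string per line; input.txt is the text to search.
-- fgrep -o prints only the matched substrings, one per line, in the order found.
local f = assert(io.popen("fgrep -o -f patterns.txt input.txt"))
local first = f:read("*l")  -- first match in the input, or nil if nothing matched
f:close()

if first then
  print("first match:", first, "associated value:", tbl[first])
end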
Just as Alexander Mashin's answer said, the Aho–Corasick algorithm is an efficient way to solve your problem. In Lua land, cloudflare/lua-aho-corasick is an implementation for LuaJIT using FFI. There's also a pure Lua implementation, jgrahamc/aho-corasick-lua, which might be slower.
I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming language whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then to process that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)
The "obvious" thing that comes to mind, would be pre-processing the file using something like File.ReadLines and then parsing one line at a time.
If this doesn't work (your PDF looked, like a record is a few lines long), then you can make a seq of records or 1000 records or something like that using normal FileStream reading. This would not need to know details of the record, but it would be convenient, if you can at least delimit the records.
Either way, you end up with a lazy seq that the parser can then read.
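A minimal F# sketch of that idea, assuming the format really is one record per line; recordParser, the file path, and the batch size are placeholders, not part of the question:

open FParsec

// Placeholder record parser: here it just takes the rest of the line.
// Replace with the real per-record parser.
let recordParser : Parser<string, unit> = restOfLine false

let parseLine line =
    match run recordParser line with
    | Success (result, _, _) -> Some result
    | Failure (_, _, _)      -> None  // or collect the error for reporting

// File.ReadLines streams lazily, so the 61 GB file is never held in memory.
let records path : seq<string> =
    System.IO.File.ReadLines path
    |> Seq.choose parseLine

// Optional: parse in batches of 1000 lines, each batch in parallel.
let recordsParallel path : seq<string> =
    System.IO.File.ReadLines path
    |> Seq.chunkBySize 1000
    |> Seq.collect (Array.Parallel.choose parseLine)

Because each line (or batch) is parsed as its own independent string, CharStream.Seek never enters the picture.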
For example, I have a list that I want to save to a file; it contains a lot of other Erlang types. Then I want to load it back into a process. What would I use? io_lib:format("~P", [Term]) with io:write and then file:consult?
Yes. Note that you need a trailing dot for each term, and that file:consult returns a list of all dot-terminated terms in the file. So if you only have one term, the code would look like:
ok = file:write_file("myfile", io_lib:format("~p.~n", [Term])),
{ok, [Term]} = file:consult("myfile").
As an alternative to legoscia's solution, you can also write the result of erlang:term_to_binary/1 to a file and read it back with erlang:binary_to_term/1. There are a few caveats with this approach, though:
The file will not be human-readable (at least not easily)
You can't store multiple terms easily because erlang:term_to_binary/1 can produce null-characters and newlines, which can create problems with parsing. There are a few ways to get around this, though:
base64 encode the terms and separate by newline
store your terms inside of another term. For instance, if you have three terms you want to store, use erlang:term_to_binary({T1, T2, T3})
There's no handy file:consult equivalent for term_to_binary, so you have to explicitly read (as a binary) and then run binary_to_term
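A minimal sketch of that round trip (the file name is a placeholder):

%% Write the external binary representation of a single term ...
ok = file:write_file("myfile.bin", term_to_binary(Term)),
%% ... then read the whole file back and decode it.
{ok, Bin} = file:read_file("myfile.bin"),
Term = binary_to_term(Bin).

The final match against Term doubles as a check that the decoded term equals the original.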
So why would you bother with erlang:term_to_binary/1 at all? Two reasons:
Space efficiency (in most cases)
Parsing speed (it's faster to decode a term_to_binary binary than to parse a human-readable term)
Are there any conventions for formatting console output from a command-line app for readability and consistency? For instance, do you indent sub-information? When, if ever, do you print a blank line? How should you accent important statements?
I've found output can quickly degenerate into a chaotic blur. I'm interested in hearing about what other people do.
Update: Really this is for embedded software that spits debug status out to a terminal, but it's pretty much like a console app, and I figured everyone would be more familiar with that. Thanks so far.
I'd differentiate two kinds of programs:
Do you print information that might be used by a script (i.e. it should be parseable)? Then define a pretty strict format and use only that (for example fixed field separators).
Do you print information that need not be parsed by a script (or is there already an alternative script-parseable format)? Then write what comes naturally:
My suggestions:
write it so that you would like to read it
indent sub-information 2 or 4 spaces, definitely not more
separate blocks of information by one empty line at most
respect the COLUMNS environment variable (and possibly ROWS, if it applies to your output).
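A minimal Python sketch of that last point, just for illustration: shutil.get_terminal_size consults the COLUMNS and LINES environment variables first, then falls back to querying the terminal (or to the given default).

import shutil
import textwrap

# COLUMNS/LINES win if set; otherwise the real terminal size is used,
# and (80, 24) is the last-resort default.
width = shutil.get_terminal_size(fallback=(80, 24)).columns

message = ("This long status line wraps to whatever width the user's "
           "terminal (or an explicit COLUMNS setting) provides, with "
           "sub-information indented by four spaces.")
print(textwrap.fill(message, width=width, subsequent_indent="    "))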
If this is for a *nix environment, then I'd recommend reading Basics of Unix Philosophy. It's not specific to output but there are some good guidelines for command line programs in general.
Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.