Store NFA into data structure - parsing

I am provided with an NFA and I need to use a data structure (I cannot use a recursive descent parser) for storing it. Once the NFA is stored in a data structure, I am given a string and must check whether it is valid according to the given NFA.
Can someone please suggest a data structure for storing an NFA? Also, if there are any open-source C language examples, that would help a lot.

An NFA is just a set of triples State x Input -> State. It's usually convenient to represent a State with a small integer in a consecutive range starting at 0 (or some other defined starting point). Input symbols can also be mapped onto small integers, either directly (the ASCII code, if all the transitions are on ASCII characters) or by keeping an inventory while you read the machine. Making a list of triples is highly inefficient and making a hash table is overkill; a plausible intermediate is a two-dimensional array indexed by state and input symbol. Remember that the machine is nondeterministic, so a given [state, input symbol] pair might map to a set of next states.
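As a rough illustration of the two-dimensional array idea, here is a minimal Python sketch. The example machine (strings over a and b that end in "ab"), the names TABLE, SYMBOLS and accepts, and the absence of epsilon transitions are all assumptions made for this example, not part of the question. The same layout carries over to C as an array of bitmasks or small state lists.

```python
# Hypothetical example NFA over {a, b} accepting strings that end in "ab".
# States are small integers starting at 0; symbols map to column indices.
SYMBOLS = {"a": 0, "b": 1}
START = 0
ACCEPTING = {2}

# TABLE[state][symbol_index] is the SET of possible next states (may be empty).
TABLE = [
    [{0, 1}, {0}],   # state 0: on 'a' stay, or guess that the final "ab" starts
    [set(), {2}],    # state 1: on 'b' move to the accepting state
    [set(), set()],  # state 2: no outgoing transitions
]


def accepts(text):
    """Simulate the NFA by tracking every state it could be in at once."""
    current = {START}
    for ch in text:
        if ch not in SYMBOLS:
            return False          # symbol outside the alphabet
        col = SYMBOLS[ch]
        nxt = set()
        for state in current:
            nxt |= TABLE[state][col]
        current = nxt
        if not current:
            return False          # every branch has died
    return bool(current & ACCEPTING)
```

Checking a string then takes one pass over the input, with the set of live states updated per character.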
You can determinize the NFA into a DFA using the Subset Construction. That simplifies the data structure but it can also blow up exponentially in size.
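A minimal sketch of the subset construction in Python, assuming an NFA without epsilon transitions stored as a table of next-state sets. The example machine and the names NFA_TABLE, determinize and dfa_accepts are hypothetical. Each DFA state is a frozenset of NFA states, which is also where the exponential blow-up can come from.

```python
from collections import deque

# Hypothetical NFA over {a, b} accepting strings that end in "ab".
SYMBOLS = {"a": 0, "b": 1}
NFA_TABLE = [
    [{0, 1}, {0}],
    [set(), {2}],
    [set(), set()],
]
NFA_START = 0
NFA_ACCEPTING = {2}


def determinize(table, start, accepting, n_symbols):
    """Subset construction: explore only the reachable sets of NFA states."""
    start_set = frozenset({start})
    dfa = {}                      # (frozenset, symbol_index) -> frozenset
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        subset = queue.popleft()
        for col in range(n_symbols):
            target = frozenset(t for s in subset for t in table[s][col])
            dfa[(subset, col)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, start_set, dfa_accepting


def dfa_accepts(dfa, start_set, dfa_accepting, text):
    """Run the DFA: exactly one next state per (state, symbol) pair."""
    state = start_set
    for ch in text:
        if ch not in SYMBOLS:
            return False
        state = dfa[(state, SYMBOLS[ch])]
    return state in dfa_accepting
```

For this particular machine only three subsets are reachable, so the DFA stays small; that is not guaranteed in general.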

Related

How is a frequency table stored in Huffman coding?

So I'm looking into Huffman coding, and it's a pretty simple algorithm to understand, except I was curious about one thing. Given that "a Huffman tree that omits unused symbols produces the most optimal code lengths", I was curious whether the frequency table of a Huffman tree counts towards the total length of the encoded message? I suppose this question in itself boils down to how the frequency table is stored. Is it part of the encoded message, or is it saved as a separate file?
Yes, unless the two sides agree on a pre-determined code book, the frequency table (or equivalent information sufficient to construct the decoding tree on the receiving end) must be included in the message.
Google Canonical Huffman code for a clever way to cut down on the size of this information.
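To give a feel for why the canonical form helps: only the code length of each symbol has to be transmitted, and both sides can rebuild identical codes from the lengths. The sketch below is one common way to assign canonical codes, written in Python; the function name canonical_codes and the sample lengths are made up for illustration, and ties are broken by symbol order as an assumption.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from a {symbol: bit_length} mapping.

    Symbols are processed by (length, symbol); each code is the previous
    code plus one, shifted left whenever the length increases.
    All lengths are assumed to be positive and to form a valid prefix code.
    """
    codes = {}
    code = 0
    prev_len = 0
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Hypothetical code lengths taken from some Huffman tree.
example = canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3})
# example == {"a": "0", "b": "10", "c": "110", "d": "111"}
```

The receiver runs the same function on the transmitted lengths, so the frequency table itself never needs to be sent.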

Generating parse tree from CYK algorithm

I use the CYK algorithm (already implemented in Java) to see whether a string is recognized according to a specific grammar. Now I need to generate a parse tree for the string. Is there a way to generate the tree from the matrix that I fill in when running the CYK algorithm?
When CYK is implemented as just a recognizer, the boxes in the chart are generally just a set of bits (or other boolean values) that correspond to the productions that might apply at that point. That doesn't leave you enough information to reconstruct the parse tree.
If you instead store a set of objects in each box, each object can record the non-terminal and keep pointers to the two entries that were combined to produce it. When you're done, you check whether your final box contains an object that represents a start symbol production. If it does, you can follow the pointers back to reconstruct the parse tree.
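A minimal sketch of that idea in Python (the question uses Java, but the structure is the same). The toy grammar in Chomsky normal form, the names BINARY, LEXICAL, cyk_parse and build_tree, and the choice to keep only the first derivation found per non-terminal are all assumptions for this example.

```python
# Hypothetical grammar in Chomsky normal form.
BINARY = [("S", "NP", "VP"), ("VP", "V", "NP")]             # A -> B C
LEXICAL = {"she": ["NP"], "fish": ["NP"], "eats": ["V"]}    # A -> word


def cyk_parse(words, start="S"):
    """Return a parse tree as nested tuples, or None if not recognized."""
    n = len(words)
    # chart[i][j] maps a non-terminal covering words[i:j] to a back-pointer:
    # either the word itself, or (split_point, left_symbol, right_symbol).
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for nt in LEXICAL.get(word, []):
            chart[i][i + 1][nt] = word
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for head, left, right in BINARY:
                    if left in chart[i][k] and right in chart[k][j]:
                        # Keep only the first derivation found for each head.
                        chart[i][j].setdefault(head, (k, left, right))
    if start not in chart[0][n]:
        return None
    return build_tree(chart, 0, n, start)


def build_tree(chart, i, j, nt):
    """Follow the back-pointers from the top box down to the words."""
    back = chart[i][j][nt]
    if isinstance(back, str):
        return (nt, back)
    k, left, right = back
    return (nt, build_tree(chart, i, k, left), build_tree(chart, k, j, right))
```

To enumerate all parses of an ambiguous sentence, each box would store a list of back-pointers per non-terminal instead of a single one.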

NLP - How would you parse highly noisy sentence (with Earley parser)

I need to parse a sentence. I have an implemented Earley parser and a grammar for it, and everything works just fine when a sentence has no misspellings. The problem is that a lot of the sentences I have to deal with are highly noisy. I wonder if there's an algorithm that combines parsing with error correction? Possible errors are:
typos 'cheker' instead of 'checker'
typos like 'spellchecker' instead of 'spell checker'
contractions like 'Ear par' instead of 'Earley parser'
If you know an article that can answer my question, I would appreciate a link to it.
I assume you are using a tagger (or lexer) stage that is applied before the Earley parser, i.e. an algorithm that splits the input string into tokens and looks each token up in a dictionary to determine its part-of-speech (POS) tag(s):
John --> PN
loves --> V
a --> DT
woman --> NN
named --> JJ,VPP
Mary --> PN
It should be possible to build some kind of approximate string lookup (aka fuzzy string lookup) into that stage, so when it is presented with a misspelled token, such as 'lobes' instead of 'loves', it will not only identify the tags found by exact string matching ('lobes' as a noun plural of 'lobe'), but also tokens that are similar in shape ('loves' as third-person singular of verb 'love').
This will imply that you generally get a larger number of candidate tags for each token, and therefore a larger number of possible parse results during parsing. Whether or not this will produce the desired result depends on how comprehensive the grammar is, and how good the parser is at identifying the correct analysis when presented with many possible parse trees. A probabilistic parser may be better for this, as it assigns every candidate parse tree a probability (or confidence score), which may be used to select the most likely (or best) analysis.
If this is the solution you'd like to try, there are several possible implementation strategies. Firstly, if the tokenization and tagging is performed as a simple dictionary lookup (i.e. in the style of a lexer), you may simply use a data structure for the dictionary that enables approximate string matching. General methods for approximate string comparison are described in Approximate string matching algorithms, while methods for approximate string lookup in larger dictionaries are discussed in Quickly compare a string against a Collection in Java.
If, however, you use an actual tagger, as opposed to a lexer, i.e. something that performs POS disambiguation in addition to mere dictionary lookup, you will have to build the approximate dictionary lookup into that tagger. There must be a dictionary lookup function, which is used to generate candidate tags before disambiguation is applied, somewhere in the tagger. That dictionary lookup will have to be replaced with one that enables approximate string lookup.
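As a rough sketch of the simplest variant (a dictionary lookup that also returns near matches), the Python below uses plain Levenshtein distance with a linear scan over the lexicon. The toy LEXICON, the function names and the max_distance threshold are assumptions for illustration; a real dictionary would likely need an index structure rather than a scan.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                 # delete ca
                cur[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),    # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


# Hypothetical toy lexicon: word -> list of POS tags.
LEXICON = {
    "loves": ["V"],
    "lobes": ["NN"],
    "checker": ["NN"],
    "woman": ["NN"],
}


def candidate_tags(token, max_distance=1):
    """Return {dictionary_word: tags} for entries within max_distance edits."""
    return {
        word: tags
        for word, tags in LEXICON.items()
        if edit_distance(token, word) <= max_distance
    }
```

With this lexicon, candidate_tags("lobes") yields both the exact match 'lobes' (NN) and the near match 'loves' (V), which is the larger candidate set the parser then has to disambiguate.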

If you know the length of a string and apply a SHA1 hash to it, can you unhash it?

Just wondering if knowing the original string length means that you can better unhash a SHA1 hash.
No, not in the general case: a hash function is not an encryption function and it is not designed to be reversible.
It is usually impossible to recover the original input for certain. This is because the domain of a hash function is larger than its range. For SHA-1 the domain is effectively unbounded but the range is 160 bits.
That means that, by the pigeonhole principle, multiple values in the domain map to the same value in the range. When two such values map to the same hash, it is called a hash collision.
However, for a specific limited set of inputs (where the domain of the inputs is much smaller than the range of the hash function), if an input that produces the given hash is found, such as through a brute-force search, it may be "acceptable" to assume that this input was the original value. The above process is effectively a preimage attack. Note that this approach very quickly becomes infeasible, as demonstrated at the bottom. (There are likely some nice math formulas that can define "acceptable" in terms of chance of collision for a given domain size, but I am not this savvy.)
The only way to know that this was the only input that mapped to the hash, however, would be to perform an exhaustive search over all the values in the domain -- such as all strings with the given length -- and ensure that it was the only such input that resulted in the given hash value.
Do note, however, that in no case is the hash process "reversed". Even without the pigeonhole principle in effect, SHA-1 and other cryptographic hash functions are specifically designed to be infeasible to reverse -- that is, they are "one-way" hash functions. There are some advanced techniques which can be used to reduce the range of various hashes; these are best left to Ph.D.s or people who specialize in cryptanalysis :-)
Happy coding.
For fun, try creating a brute-force preimage attack on a string of 3 characters. Assuming only English letters (A-Z, a-z) and numbers (0-9) are allowed, there are "only" 62^3 (238,328) combinations in this case. Then try on a string of 4 characters (62^4 = 14,776,336 combinations) ... 5 characters (62^5 = 916,132,832 combinations) ... 6 characters (62^6 = 56,800,235,584 combinations) ...
Note how much larger the domain is for each additional character: this approach quickly becomes impractical (or "infeasible") and the hash function wins :-)
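A minimal Python sketch of that experiment, using the standard hashlib and itertools modules. The function name brute_force_sha1 and the sample string are made up for illustration; the alphabet is the 62-character set assumed above.

```python
import hashlib
import itertools
import string

ALPHABET = string.ascii_letters + string.digits  # A-Z, a-z, 0-9: 62 symbols


def brute_force_sha1(target_hex, length):
    """Try every string of the given length; return the first one that matches."""
    for combo in itertools.product(ALPHABET, repeat=length):
        candidate = "".join(combo)
        digest = hashlib.sha1(candidate.encode("ascii")).hexdigest()
        if digest == target_hex:
            return candidate
    return None


# Hypothetical target: the SHA-1 of a known 3-character string.
target = hashlib.sha1(b"a1Z").hexdigest()
recovered = brute_force_sha1(target, 3)   # searches at most 62**3 candidates
```

Three characters finish almost instantly; raising length to 5 or 6 makes the growth in running time easy to observe.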
One way password crackers speed up preimage attacks is to use rainbow tables (which may only cover a small set of all values in the domain they are designed to attack), which is why passwords that use hashing (SHA-1 or otherwise) should always have a large random salt as well.
Hash functions are one-way functions. For a given size there are many strings that may have produced that hash.
Now, if you know that the input size is fixed and small enough, let's say 10 bytes, and you know that each byte can have only certain values (for example ASCII's A-Za-z0-9), then you can use that information to precompute all the possible hashes and find which plain text produces the hash you have. This technique is the basis for rainbow tables.
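If this were possible, SHA1 would not be considered secure, would it? So no, you cannot, unless you have considerable computing power [on the order of 2^160 operations for a generic preimage search; 2^80 is the birthday bound for finding a collision]. In which case you don't need to know the length either.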
One of the basic properties of a good cryptographic hash function, of which SHA1 happens to be one, is:
it is infeasible to generate a message that has a given hash
Theoretically, let's say the string was also known to be solely of ASCII characters, and it's of size n.
There are 95 characters in ASCII not including controls. We'll assume controls weren't used.
There are 95ⁿ possible such strings.
There are 1.461501×10⁴⁸ possible SHA-1 values (give or take), and at just n=25 there are 2.7739×10⁴⁹ possible ASCII-only strings without controls in them, which would mean guaranteed collisions (some such strings have the same SHA-1).
So we only need to get to n=25 before this becomes impossible even with infinite resources and time.
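The arithmetic is easy to check with Python's arbitrary-precision integers. This is just a sanity check of the numbers above; the constant and function names are made up for the example.

```python
SHA1_OUTPUTS = 2 ** 160      # number of distinct 160-bit hash values
PRINTABLE_ASCII = 95         # ASCII characters excluding controls


def smallest_length_with_guaranteed_collisions():
    """Smallest n for which 95**n exceeds 2**160 (pigeonhole applies)."""
    n = 1
    while PRINTABLE_ASCII ** n <= SHA1_OUTPUTS:
        n += 1
    return n


n = smallest_length_with_guaranteed_collisions()   # 25
```

At n=24 the strings still fit into the output space (95^24 is below 2^160), so 25 is the first length at which collisions among such strings are unavoidable.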
And remember, up until now I've been making it deliberately easy with my ASCII-only rule. Real-world modern text doesn't follow that.
Of course, only a subset of such strings would be anything likely to be real (if one says "hello my name is Jon" and the other says "fsdfw09r12esaf" then it was probably the first). Still, up until now I was assuming infinite time and computing power. If we want to work it out sometime before the universe ends, we can't assume that.
Of course, the nature of the attack is also important. In some cases I want to find the original text, while in others I'll be happy with gibberish with the same hash (if I can input it into a system expecting a password).
Really though, the answer is no.
I posted this as an answer to another question, but I think it is applicable here:
SHA1 is a hashing algorithm. Hashing is one-way, which means that you can't recover the input from the output.
This picture demonstrates what hashing is, somewhat:
As you can see, both John Smith and Sandra Dee are mapped to 02. This means that you can't recover which name was hashed given only 02.
Hashing is used basically due to this principle:
If hash(A) == hash(B), then there's a really good chance that A == B. Hashing maps large data sets (like a whole database) to a tiny output, like a 10-character string. If you move the database and the hash of both the input and the output are the same, then you can be pretty sure that the database is intact. It's much faster than comparing both databases byte-by-byte.
That can be seen in the image. The long names are mapped to 2-digit numbers.
To adapt to your question, if you use bruteforce search, for a string of a given length (say length l) you will have to hash through (dictionary size)^l different hashes.
If the dictionary consists of only alphanumeric case-sensitive characters, then you have (10 + 26 + 26)^l = 62^l hashes to hash. I'm not sure how many FLOPS are required to produce one hash (as it is dependent on the hash's length). Let's be super-unrealistic and say it takes 10 FLOP to perform one hash.
For a 12-character password, that's 62^12 ~ 10^21. That's 10,000 seconds of computations on the fastest supercomputer to date.
Multiply that by a few thousand and you'll see that it is unfeasible if I increase my dictionary size a little bit or make my password longer.

When to Define "unit" in the TypeSpecifierList for Erlang Bins

I've started learning Erlang and recently wrapped up the section on bit syntax. I feel I have a firm understanding of how they can be constructed and matched but failed to come up with an example of when I would want to change the default values of "unit" inside the TypeSpecifierList.
Can anyone share a situation when this would prove useful?
Thanks for your time.
Sometimes, just for convenience: you've got a parameter from somewhere (e.g., from a file header) specifying a count of units of a given size, such as N words of 24-bit audio data, and instead of doing some multiplication, you just say:
<<Audio:N/binary-unit:24, Rest/binary>> = Data
to extract that data (as a chunk) from the rest of the file contents. After parsing the rest of the file, you could pass that chunk to some other function that splits it up into samples.
