Demonstrating a language is not regular - Fibonacci

So I'm trying to do this sort of problem that says:
Demonstrate that the language of binary strings whose lengths are Fibonacci numbers is not a regular language.
I really don't know how to approach it nor am I sure if I understand it.

The language of binary numbers whose lengths are Fibonacci numbers can be shown irregular by either the pumping lemma or the Myhill-Nerode theorem.
For the pumping lemma, take the string 0^F, where F is a Fibonacci number chosen large enough that its distance to both neighbouring Fibonacci numbers exceeds the pumping length p. No matter how you split this string for pumping, you get a contradiction in fairly short order: for any pump size a with 1 <= a <= p, it is never the case that F - a, F, and F + a are all Fibonacci numbers. Proof of that fact is by reference to the definition of the Fibonacci numbers: the consecutive gaps F(n) - F(n-1) = F(n-2) grow without bound, so such an F always exists.
For the Myhill-Nerode theorem proof, simply show that for any string x whose length is the nth Fibonacci number F(n), the shortest non-empty string that can be appended to obtain another string in the language has length F(n-1) (since F(n) + F(n-1) = F(n+1)). As this length differs for every n, strings of distinct Fibonacci lengths are pairwise distinguishable. Therefore, there are infinitely many distinguishable strings and, therefore, the language is not regular (since a minimal DFA, which has one state per equivalence class under the indistinguishability relation, must have finitely many states).
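As a quick numeric sanity check of the pumping argument, the sketch below (function names invented for illustration) finds, for a given pumping length p, a Fibonacci number F whose gaps to both Fibonacci neighbours exceed p, and verifies that no pumped length F - a or F + a lands back on a Fibonacci number:

```python
def fibs_upto(limit):
    """Fibonacci numbers 1, 2, 3, 5, 8, ... up to `limit`."""
    out, a, b = [], 1, 2
    while a <= limit:
        out.append(a)
        a, b = b, a + b
    return out

def gap_counterexample(p):
    """For pumping length p, pick a Fibonacci F whose distance to
    both Fibonacci neighbours exceeds p; then for every pump size
    1 <= a <= p, neither F - a nor F + a is a Fibonacci number,
    so pumping 0^F always produces a string outside the language."""
    fs = fibs_upto(100 * p + 100)
    fibset = set(fs)
    for prev, F, nxt in zip(fs, fs[1:], fs[2:]):
        if F - prev > p and nxt - F > p:
            break
    for a in range(1, p + 1):
        assert F - a not in fibset and F + a not in fibset
    return F
```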

Theory: LL(k) parser vs parser for LL(k) grammars

I'm concerned about the very important difference between the terms "LL(k) parser" and "parser for LL(k) grammars". When an LL(1) backtracking parser is in question, it IS a parser for LL(k) grammars, because it can parse them, but it is NOT an LL(k) parser, because it does not use k tokens of lookahead from a single position in the grammar; instead, it explores the possible cases with backtracking, even though it still consumes k tokens while exploring.
Am I right?
The question may come down to the way the lookahead is performed. If the lookahead is actually still processing the grammar with backtracking, that does not make it an LL(k) parser. To be an LL(k) parser, the parser must not use a backtracking mechanism on the grammar, because then it would be an "LL(1) parser with backtracking that can parse LL(k) grammars".
Am I right again?
I think the difference comes down to the expectation that an LL(1) parser uses constant time per token, while an LL(k) parser uses at most k * constant time per token (linear in the lookahead), not exponential time as would be the case for a backtracking parser.
Update 1: to simplify - per token, is LL(k) parsing expected to run in time exponential in k, or linear in k?
Update 2: I have changed it to LL(k), because the question does not depend on the range of k (a fixed integer or unbounded).
An LL(k) parser needs to do the following at each point in the inner loop:
Collect the next k input symbols. Since this is done at each point in the input, this can be done in constant time by keeping the lookahead vector in a circular buffer.
If the top of the prediction stack is a terminal, then it is compared with the next input symbol; either both are discarded or an error is signalled. This is clearly constant time.
If the top of the prediction stack is a non-terminal, the action table is consulted, using the non-terminal, the current state and the current lookahead vector as keys. (Not all LL(k) parsers need to maintain a state; this is the most general formulation. But it doesn't make a difference to complexity.) This lookup can also be done in constant time, again by taking advantage of the incremental nature of the lookahead vector.
The prediction action is normally done by pushing the right-hand side of the selected production onto the stack. A naive implementation would take time proportional to the length of the right-hand side, which is correlated with neither the lookahead k nor the length of the input N, but rather with the size of the grammar itself. It's possible to avoid the variability of this work by simply pushing a reference to the right-hand side, which can be used as though it were the list of symbols (since the list can't change during the parse).
However, that's not the full story. Executing a prediction action does not consume an input, and it's possible -- even likely -- that multiple predictions will be made for a single input symbol. Again, the maximum number of predictions is only related to the grammar itself, not to k nor to N.
More specifically, since the same non-terminal cannot be predicted twice in the same place without violating the LL property, the total number of predictions cannot exceed the number of non-terminals in the grammar. Therefore, even if you do push the entire right-hand side onto the stack, the total number of symbols pushed between consecutive shift actions cannot exceed the size of the grammar. (Each right-hand side can be pushed at most once. In fact, only one right-hand side for a given non-terminal can be pushed, but it's possible that almost every non-terminal has only one right-hand side, so that doesn't reduce the asymptote.) If instead only a reference is pushed onto the stack, the number of objects pushed between consecutive shift actions -- that is, the number of predict actions between two consecutive shift actions -- cannot exceed the size of the non-terminal alphabet. (But, again, it's possible that |V| is O(|G|).)
The linearity of LL(k) parsing was established, I believe, in Lewis and Stearns (1968), but I don't have that paper at hand right now so I'll refer you to the proof in Sippu & Soisalon-Soininen's Parsing Theory (1988), where it is proved in Chapter 5 for strong LL(k) (as defined by Rosenkrantz & Stearns 1970), and in Chapter 8 for canonical LL(k).
In short, the time the LL(k) algorithm spends between shifting two successive input symbols is expected to be O(|G|), which is independent of both k and N (and, of course, constant for a given grammar).
This does not really have any relation to LL(*) parsers, since an LL(*) parser does not just try successive LL(k) parses (which would not be possible, anyway). For the LL(*) algorithm presented by Terence Parr (which is the only reference I know of which defines what LL(*) means), there is no bound to the amount of time which could be taken between successive shift actions. The parser might expand the lookahead to the entire remaining input (which would, therefore, make the time complexity dependent on the total size of the input), or it might fail over to a backtracking algorithm, in which case it is more complicated to define what is meant by "processing an input symbol".
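To make the constant-work-per-token claim concrete, here is a minimal table-driven predictive parser for k = 1 (the grammar and all names are invented for illustration; for k > 1 the lookahead key would be a k-tuple maintained in a circular buffer, as described above):

```python
# Toy LL(1) grammar, invented for illustration:
#   S -> a S b | c
TABLE = {
    ("S", "a"): ["a", "S", "b"],    # action table keyed by
    ("S", "c"): ["c"],              # (non-terminal, lookahead)
}
TERMINALS = {"a", "b", "c"}

def ll1_parse(tokens):
    """Table-driven predictive parse; returns True iff `tokens`
    derives from S. Each iteration does O(1) table work, and the
    number of predictions between two shifts is bounded by the
    grammar size, so the parse runs in O(n) for a fixed grammar."""
    stack = ["S"]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else None
        if top in TERMINALS:                # shift: match and discard
            if top != look:
                return False
            pos += 1
        else:                               # predict: consult the table
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))     # push the right-hand side
    return pos == len(tokens)
```

Note there is no backtracking anywhere: each (non-terminal, lookahead) pair selects exactly one production or fails.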
I suggest you read chapter 5.1 of Aho & Ullman, Volume 1.
https://dl.acm.org/doi/book/10.5555/578789
An LL(k) parser is a k-predictive parsing algorithm (k is the lookahead, an integer >= 1).
An LL(k) parser can parse any LL(k) grammar. (chapter 5.1.2)
For all a, b with a < b, every LL(a) grammar is also an LL(b) grammar. But the reverse is not true.
An LL(k) parser is PREDICTIVE. So there is NO backtracking.
All LL(k) parsers run in O(n), where n is the length of the parsed sentence.
It is important to understand that an LL(3) parser does not parse faster than an LL(1) parser. But the LL(3) parser can parse MORE grammars than the LL(1) parser. (see points #2 and #3)

Why is determining whether the cardinality of a language is infinite a "decidable" problem?

Given two finite-state languages L1 and L2, determining whether their intersection is infinite is a decidable problem.
How can this be? Thanks.
Let M1 and M2 be minimal deterministic finite automata whose accepted languages are L1 and L2, respectively.
First, construct a deterministic finite automaton M3 whose accepted language is the intersection of L1 and L2 by using the Cartesian Product Machine construction - an algorithm which produces the desired machine.
Next, construct a deterministic finite automaton M4 which accepts the same language as M3, but which is minimal; that is, minimize M3 and call the result M4. There is an algorithm which produces this result.
Next, construct a deterministic finite automaton M5 which accepts only words of length strictly greater than k, where k is the number of states in M4. There is such a machine with k+1 states for any alphabet; its construction is not complicated.
Next, construct a deterministic finite automaton M6 whose accepted language is the intersection of the languages accepted by M4 and M5. Use the Cartesian Product Machine construction again here.
Next, construct a deterministic finite automaton M7 by minimizing M6.
At this point, either M7 is a one-state deterministic finite automaton which accepts no strings at all, or it is not. In the former case, the intersection of L1 and L2 is finite; in the latter case, that intersection is infinite. Why?
M1 accepts L1
M2 accepts L2
M3 accepts L1 intersect L2
M4 is a DFA that accepts L1 intersect L2 and has as few states as possible
M5 accepts only words which would cause M4 to enter one of its states twice
M6 accepts only words in L1 intersect L2 that also cause M4 to enter one of its states twice. Note that if M6 accepts anything, that means there are words in L1 intersect L2 which a minimal DFA for that language must loop to accept; because such a loop could be followed any number of times, this implies there must be infinitely many words in L1 intersect L2. This is closely related to the pumping lemma for regular languages.
M7 accepts what M6 does, but is minimal. Note that minimization is not necessary but it makes it trivial to check whether M6 accepts anything or not. The minimal DFA which accepts no string has one dead state. This is easy to check, and there are standard algorithms for minimization.
Another, similar way of showing the same thing: construct the DFA for the intersection and then check all strings of length |Q| to 2|Q|. No finite language will have any strings of these lengths accepted by a DFA for that language, but every infinite language will have at least one such string. This is because any DFA accepting an infinite language must loop, and the length of that loop can be no greater than the number of states.
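The construction above can be sketched directly. The code below (the tuple encoding of a DFA and all names are invented for illustration) builds the product machine for the intersection and then applies the fact that the M5/M6 steps exploit: the language is infinite exactly when some state that is reachable from the start and can reach an accepting state lies on a cycle.

```python
from itertools import product

def product_dfa(d1, d2):
    """Cartesian Product Machine for the intersection (the M3 step).
    A DFA here is (states, alphabet, delta, start, accepting),
    with delta a dict keyed by (state, symbol)."""
    q1, sigma, t1, s1, f1 = d1
    q2, _, t2, s2, f2 = d2
    states = set(product(q1, q2))
    delta = {((a, b), c): (t1[(a, c)], t2[(b, c)])
             for (a, b) in states for c in sigma}
    accept = {(a, b) for (a, b) in states if a in f1 and b in f2}
    return states, sigma, delta, (s1, s2), accept

def is_infinite(dfa):
    """A regular language is infinite iff some DFA state is reachable
    from the start, can reach an accepting state, and lies on a cycle:
    the same pigeonhole fact that the M5/M6 length filter detects."""
    states, sigma, delta, start, accept = dfa
    def reach(src):
        seen, todo = {src}, [src]
        while todo:
            q = todo.pop()
            for c in sigma:
                r = delta[(q, c)]
                if r not in seen:
                    seen.add(r)
                    todo.append(r)
        return seen
    # "useful": reachable from the start AND able to reach acceptance
    useful = {q for q in reach(start) if reach(q) & accept}
    # infinite iff some useful state can come back to itself
    return any(q in reach(delta[(q, c)]) for q in useful for c in sigma)
```

Minimization is avoided here: checking for a pumpable cycle among useful states answers the same question the one-state-dead-machine test answers after minimizing.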

Features in SVM based Sentiment Analysis

I am unable to convert semantic and lexical information into feature vectors.
I have the following information:
Part-of-speech tag - output of a POS tagger, e.g. adjective, verb
Word sense - output of word sense disambiguation, e.g. bank - financial institution vs. heap
Ontological information - e.g. mammal, location
n-gram - e.g. good-boy
Head word - e.g. act, the root word of acting
My question is how to represent these as real values. Should I just use the occurrence of each feature (POS, sense, etc.), i.e. a boolean vector? But then the semantic information is lost in the case of n-grams (e.g. "very good boy" and "good boy" have different semantic orientations in sentiment analysis).
There is no universally good method for converting nominal values into real-valued vectors. The most common approach is the one you suggested - conversion to boolean vectors. In the case of n-grams I do not see your point. What is your object? You say that you have POS; POS is a feature of a word. An n-gram, on the other hand, has no meaning at the single-word level, but rather as a representation of part of the sentence. Do you mean "the n-gram in which the word appears"? That is exactly the same as "previous word" then (or the n-1 previous words), and you do not lose any information (you simply have k dimensions per "previous" word, where k is the size of the vocabulary). Keep in mind that this representation will be huge.
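A minimal sketch of the boolean-vector encoding (the tag set and vocabulary are invented for illustration), with a "previous word" block standing in for the local n-gram context:

```python
POS_TAGS = ["ADJ", "NOUN", "VERB"]          # invented toy tag set
VOCAB = ["good", "boy", "very", "<none>"]   # invented toy vocabulary

def one_hot(value, domain):
    """Boolean (one-hot) encoding of a single nominal value."""
    return [1.0 if v == value else 0.0 for v in domain]

def token_features(pos_tag, prev_word):
    """Concatenate one boolean block per nominal feature; the
    previous-word block carries the n-gram context, so 'very good'
    and plain 'good' get different vectors."""
    return one_hot(pos_tag, POS_TAGS) + one_hot(prev_word, VOCAB)
```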

How do I combine two features of different dimension?

Let us consider the problem of text classification. If the document is represented as a bag of words, then we have an n-dimensional feature vector, where n is the number of words in the document. Now suppose I decide that I also want to use the document length as a feature; the dimension of this feature alone (length) is one. So how do I combine the two features (length and bag of words)? Should I treat them as two separate features (the n-dimensional BOW vector and the 1-dimensional length feature)? If that won't work, how do I combine the features? Any pointers on this would also be helpful.
This statement is a little ambiguous: "So if the document is represented as Bag of words, then we will have an n dimensional feature, where n- number of words in the document."
My interpretation is that you have a column for each word that occurs in your corpus (probably restricted to some dictionary of interest), and for each document you have counted the number of occurrences of that word. Your number of columns is now equal to the number of words in your dictionary that appear in ANY of the documents. You also have a "length" feature, which could be a count of the number of words in the document, and you want to know how to incorporate it into your analysis.
A simple approach would be to divide the number of occurrences of a word by the total number of words in the document.
This has the effect of scaling the word occurrences based on the size of the document, and the new feature is called a 'term frequency'. The next natural step is to weight the term frequencies to compensate for terms that are more common in the corpus (and therefore less important). Since we give HIGHER weights to terms that are LESS common, this is called 'inverse document frequency', and the whole process is called “Term Frequency times Inverse Document Frequency”, or tf-idf. You can Google this for more information.
It's possible that you are doing word counts in a different way -- for example, counting the number of word occurrences in each paragraph (as opposed to each document). In that case, for each document, you have a word count for each paragraph, and the typical approach is to merge these paragraph-counts using a process such as Singular Value Decomposition.
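A minimal sketch of the first approach above (the toy dictionary is invented for illustration): scale the word counts to term frequencies, then simply append the length as one extra dimension, giving a single (n+1)-dimensional vector:

```python
from collections import Counter

DICTIONARY = ["good", "boy", "movie"]   # invented toy dictionary

def combined_features(doc_tokens):
    """Bag of words scaled to term frequencies (counts divided by
    document length), with the 1-dimensional length feature appended
    as the last component of one flat vector."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    tf = [counts[w] / n for w in DICTIONARY]
    return tf + [float(n)]
```

Since the two features live on very different scales, standardizing the length column before training is usually a good idea.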

Approximate string matching with a letter confusion matrix?

I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do approximate string matching.
However, I want the matching to be phonetically-motivated, e.g. "m" and "n" are phonetically similar, so the substitution cost of "m" for "n" should be small, compared to say, "m" and "k". So, if I'm searching for [mein] "main", it would match the letter sequence [meim] "maim" with, say, cost 0.1, whereas it would match the letter sequence [meik] "make" with, say, cost 0.7. Similarly, there are differing costs for inserting or deleting each letter. I can supply a confusion matrix that, for each letter pair (x,y), gives the cost of substituting x with y, where x and y are any letter or the empty string.
I know that there are tools available that do approximate matching such as agrep, but as far as I can tell, they do not take a confusion matrix as input. That is, the cost of any insertion/substitution/deletion = 1. My question is, are there any open-source tools already available that can do approximate matching with confusion matrices, and if not, what is a good algorithm that I can implement to accomplish this?
EDIT: just to be clear, I'm trying to isolate approximate instances of a word such as [mein] from a longer string, e.g. [aiammeinlimeiking...]. Ideally, the algorithm/tool should report instances such as [mein] with cost 0.0 (exact match), [meik] with cost 0.7 (near match), etc, for all approximate string matches with a cost below a given threshold.
I'm not aware of any phonetic recognizers that use confusion matrices. I know of Soundex and the Match Rating Approach.
I think that the K-nearest neighbour algorithm might be useful for the type of approximations you are interested in.
Peter Kleiweg's Rug/L04 (for computational dialectology) includes an implementation of Levenshtein distance which allows you to specify non-uniform insertion, deletion, and substitution costs.
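If nothing off-the-shelf fits, the standard dynamic program for approximate substring matching extends directly to a confusion matrix: initialize row 0 of the edit-distance table to zero, so a match may begin anywhere in the stream, and read every operation's cost from the matrix. A sketch (the cost function and all names are invented; cost(x, '') is a deletion, cost('', y) an insertion):

```python
def search(pattern, text, cost, threshold):
    """Report (end_index, cost) for every position in `text` where an
    approximate occurrence of `pattern` ends with cost <= threshold."""
    m = len(pattern)
    # col[i]: best cost of aligning pattern[:i] ending at current position
    col = [0.0] * (m + 1)
    for i in range(1, m + 1):
        col[i] = col[i - 1] + cost(pattern[i - 1], '')
    hits = []
    for j, t in enumerate(text, start=1):
        diag, col[0] = col[0], 0.0        # row 0 stays 0: start anywhere
        for i in range(1, m + 1):
            best = min(diag + cost(pattern[i - 1], t),        # substitute
                       col[i - 1] + cost(pattern[i - 1], ''),  # delete
                       col[i] + cost('', t))                   # insert
            diag, col[i] = col[i], best
        if col[m] <= threshold:
            hits.append((j, col[m]))
    return hits

def phone_cost(x, y):
    """Invented toy confusion costs: identity is free, m/n are close,
    everything else costs 1."""
    if x == y:
        return 0.0
    if {x, y} == {'m', 'n'}:
        return 0.1
    return 1.0
```

For example, search("mein", "aiammeinli", phone_cost, 0.5) reports the exact occurrence ending at position 8 with cost 0.0; in a real system the cost function would be replaced by lookups in your phone confusion matrix.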
