How to do random embedded bracketing of elements - parsing

I'm writing a learning algorithm for automatic constituent bracketing. Since the algorithm starts from scratch, the (embedded) bracketing should be random at first and is then improved through iterations. I'm stuck on how to do the random bracketing. Can you please suggest code in R or Python, or give some programming idea (pseudocode)? I also need ideas on how to check a random bracketing against a proper one for correctness.
This is what I'm trying to finally arrive at, through the learning process, starting from random bracketing.
Here is a sentence:
'He' 'chased' 'the' 'dog.'
Replacing each word with its grammatical category:
N, V, D, N.
Bracketing (first phase) (D, N are constituents):
(N) (V) (D N)
Bracketing (second phase):
(N) ((V) (D N))
Bracketing (third phase):
((N) ((V) (D N)))
Please help. Thank you.

Here's all I can say with the information provided:
A naive way to do the bracketing would be to generate some trees having as many leaves as there are words (or components); generating all of them can quickly get very space-consuming. Then select a suitable one (at random, or according to a proper partitioning) and apply it as a bracketing pattern. For more efficiency, look for an algorithm that generates a random tree directly (I couldn't find one at the moment).
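To make that concrete, here is a minimal Python sketch (the function name and tag list are just illustrations). It builds a random binary bracketing by recursively splitting the token span at a random point. Note that choosing split points uniformly is not uniform over tree shapes, which is exactly why a dedicated random-tree algorithm would be preferable:

    import random

    def random_bracketing(tokens):
        # Base case: a single token is its own constituent.
        if len(tokens) == 1:
            return tokens[0]
        # Pick a random split point and bracket the two halves recursively.
        split = random.randint(1, len(tokens) - 1)
        return (random_bracketing(tokens[:split]),
                random_bracketing(tokens[split:]))

    # Example with the tagged sentence from the question:
    print(random_bracketing(['N', 'V', 'D', 'N']))
    # e.g. (('N', 'V'), ('D', 'N')) or ('N', ('V', ('D', 'N')))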
Additionally, I'd recommend reading about genetic algorithms/evolutionary programming, especially fitness functions (which cover the "check random results for correctness" part). As far as I understand you, you want the program to detect ways of parsing and then keep them in memory as "learned". That matches a genetic algorithm with memorization of the "fittest" patterns quite well (with mutation as the only changing factor).
An awesome, very elaborate (if it works), but probably extremely difficult approach would be to use genetic programming. But that's probably too different from what you want.
And last, the easiest way to check the correctness of a bracketing, in my opinion, would be to keep a table of the grammar/syntax rules and compare against them. You could also improve this into a better fitness function by keeping the rules in a tree and measuring the distance from the actual pattern ((V D) N) to the correct pattern (V (D N)). (That is just an off-the-cuff idea; I've never actually done this. A sketch of one way to score such a comparison follows.)
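As a sketch of that distance idea (my own assumption, not something I've used in practice): represent bracketings as nested tuples and score the overlap of their constituent spans, PARSEVAL-style. The score could serve directly as a fitness function:

    def spans(tree, start=0):
        # Collect the (start, end) span of every constituent in a nested-tuple tree.
        if not isinstance(tree, tuple):
            return set(), start + 1  # a leaf covers one position
        result, pos = set(), start
        for child in tree:
            child_spans, pos = spans(child, pos)
            result |= child_spans
        result.add((start, pos))
        return result, pos

    def bracket_f1(candidate, gold):
        # F1 over constituent spans: 1.0 means identical bracketings.
        cand, _ = spans(candidate)
        ref, _ = spans(gold)
        overlap = len(cand & ref)
        if overlap == 0:
            return 0.0
        p, r = overlap / len(cand), overlap / len(ref)
        return 2 * p * r / (p + r)

    # The example above: ((V D) N) scored against the correct (V (D N)).
    print(bracket_f1((('V', 'D'), 'N'), ('V', ('D', 'N'))))  # 0.5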

Related

In terms of Big O notation, what category is O(N*P) (P signifying feature size), as seen in Naive Bayes or kNN?

If the time complexity of some machine learning algorithms such as kNN and Naive Bayes can be defined as O(N*P), where N is the number of rows and P is the feature size, which Big O complexity does this count as?
Does O(N*P) fall into the same category as O(N), i.e. is it "linear complexity"? If P = N, couldn't it also be counted as O(N^2), i.e. quadratic complexity? So exactly what complexity can we call it; is it undetermined? Thank you.
As you said, it depends on the value of P. Hence, the time complexity is O(N*P) in general, and you can describe it in more detail once you know more about P. To give another example beyond the ones you mentioned: if P = N^2, the time complexity would be cubic. Therefore, you can't say anything more precise about this time complexity without knowing P.
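To make the N*P structure visible, here is a small illustrative Python snippet (a sketch, not any library's actual implementation): the distance computation at the heart of kNN touches every one of the P features for every one of the N rows, hence O(N*P) per query:

    def knn_distances(X, query):
        # Squared Euclidean distance from one query point to all N rows.
        N, P = len(X), len(X[0])
        dists = []
        for i in range(N):        # N iterations ...
            total = 0.0
            for j in range(P):    # ... times P iterations each: O(N*P)
                diff = X[i][j] - query[j]
                total += diff * diff
            dists.append(total)
        return dists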

Genetic algorithm - shortest path in weighted graph

I want to make a genetic algorithm that solves a shortest path problem in a weighted, connected graph. Similar to travelling salesman, but instead of a fully connected graph, it's just connected.
My idea is to randomly generate a path consisting of n-1 nodes for each chromosome in binary form, where the numbers indicate nodes in the path. Then I will choose the best depending on the sum of weights (if you can't go from A to B, I would give it a penalty) and crossover/mutate bits in it. Will it work? It feels a little like a smaller version of brute force. Is there a better way?
Thanks!
A genetic algorithm is pretty much a "smaller version of brute force". It is just a metaheuristic, not an optimization method with decent convergence guarantees. It basically depends on randomness to provide new solutions, so it is a "slightly better random search".
So, "will it work"? Yes, it will do something, and as long as you have enough randomness in the mutation it will even (eventually) converge to the optimum. Will it work better than a random search? Hard to say; this depends on dozens of factors, not only your encoding but also all the hyperparameters used, etc. In general, genetic algorithms are about trial and error. In particular, a representation of chromosomes that does not lose any information (yours does not) does not matter much, meaning that everything depends on a clever implementation of crossover and mutation (as long as chromosomes do not lose any information, they are all equivalent).
You can use a permutation-coded GA. In permutation coding, you give the start and end points, and the GA searches for the best chromosome according to your fitness function. Candidate solutions (chromosomes) will look like 2-5-4-3-1, 2-3-1-4-5, 1-2-5-4-3, etc. So your solution depends on your fitness function. (Look at the GA package for R to apply a permutation GA easily.)
Connections are constraints for your problem. My best advice is to create a constraint matrix like this:
FirstPoint  SecondPoint  Connected
A           B            true
A           C            true
A           E            false
...         ...          ...
In a standard TSP, only distances are considered. In your fitness function, you have to consult this matrix and add a penalty to the return value for each false.
Example chromosome: A-B-E-D-C
A-B: 1
B-E: 1
E-D: 4
D-C: 3
Fitness value: 9
Example chromosome: A-E-B-C-D
A-E: penalty
E-B: 1
B-C: 6
C-D: 3
Fitness value: 10 + penalty value.
Because your constraint is a hard constraint, you can use the maximum integer value as the penalty. The GA will then find the best solution. :)
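A minimal Python sketch of such a fitness function (hedged: the edge weights below are hypothetical, chosen only to reproduce the worked examples above):

    import sys

    # Hypothetical symmetric edge weights; a missing pair means the nodes are
    # not connected (the "false" rows of the constraint matrix).
    WEIGHTS = {('A', 'B'): 1, ('B', 'E'): 1, ('E', 'D'): 4,
               ('D', 'C'): 3, ('B', 'C'): 6}
    PENALTY = sys.maxsize  # hard constraint: effectively disqualifies the path

    def fitness(chromosome):
        # Sum the edge weights along the path; penalize each missing edge.
        total = 0
        for a, b in zip(chromosome, chromosome[1:]):
            w = WEIGHTS.get((a, b), WEIGHTS.get((b, a)))
            total += w if w is not None else PENALTY
        return total

    print(fitness(['A', 'B', 'E', 'D', 'C']))  # 9
    print(fitness(['A', 'E', 'B', 'C', 'D']))  # 10 + penalty (A-E not connected)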

How to decide p and q from the ACF and PACF in AR, MA, ARMA and ARIMA?

I am confused about how to calculate p via the ACF and q via the PACF in AR, MA, ARMA and ARIMA models. For example, in R, we use acf or pacf to get the best p and q.
However, based on what I have read, p is the order of the AR part and q is the order of the MA part. Let's say p = 2; then AR(2) is supposed to be y_t = a*y_{t-1} + b*y_{t-2} + c. We can calculate the acf function (in R) for lag = 1, 2, 3, ... to find which lag gives the biggest acf value. The same goes for MA when deciding q. But doesn't this mean that p and q have already been set?
I guess these are the steps, but I am not sure if I am right.
So, for R's acf and pacf functions, is this the actual process:
1. For p=1, set lag=1,2,3,...max to see which lag has the biggest autocorrelation value.
2. For p=2,3,4..., do the same thing to find the lags.
3. Compare those values with each other. Say the biggest autocorrelation value comes at p=2 and lag=4; do we then say the order of the AR part, i.e. p, is 2?
Could anyone please give me an example showing exactly how to estimate p and q?
This isn't a good Stack Overflow question; you want the Math site for this. To answer it, though: there isn't one single generally accepted method for finding the optimal p and q.
Generally, what most people tend to do is eyeball it using PACF visualizations (in which case, as you observe, you can't distinguish whether to put a given lag into p or q) and set p == q.
An alternative would be to estimate your time series with different values of p and q in a grid search, and pick the combination that maximizes some criterion such as log likelihood or out-of-sample error, or whatever makes sense for your dataset; see the sketch below.
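A sketch of that grid search in Python with statsmodels (the question is about R, but the idea is the same; the maximum orders and the use of AIC here are illustrative choices, not the only options):

    import itertools
    from statsmodels.tsa.arima.model import ARIMA

    def best_order(series, max_p=3, max_q=3):
        # Fit ARMA(p, q) for every (p, q) in the grid; keep the lowest AIC.
        best_pq, best_aic = None, float('inf')
        for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
            try:
                fit = ARIMA(series, order=(p, 0, q)).fit()
            except Exception:
                continue  # some orders may fail to converge
            if fit.aic < best_aic:
                best_pq, best_aic = (p, q), fit.aic
        return best_pq, best_aic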
If I might make a suggestion, however: you probably want to start by looking at the rather extensive body of research on ARIMA models and see how others have done this. That really should be your first step for questions like this.
Use the PACF plot to find the optimal p for the AR(p) part, and the ACF plot to find the optimal q for the MA(q) part.
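For reference, a quick way to produce those plots in Python (statsmodels' plot_acf/plot_pacf are rough equivalents of R's acf and pacf; the toy series below is made up so the snippet runs standalone):

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Toy AR(2) series, just so the example is self-contained.
    rng = np.random.default_rng(0)
    series = np.zeros(500)
    for t in range(2, 500):
        series[t] = 0.6 * series[t - 1] - 0.3 * series[t - 2] + rng.normal()

    fig, (ax1, ax2) = plt.subplots(2, 1)
    plot_acf(series, ax=ax1)    # significant lags here suggest the MA order q
    plot_pacf(series, ax=ax2)   # significant lags here suggest the AR order p
    plt.show()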

How can a segment in a segment tree be deleted in O(log n) time?

I just finished reading about segment trees. The proof that insertion takes O(log n) time is quite convincing, but I was not able to figure out how deletion can be carried out with the same complexity. Also, I tried searching for the paper in which segment trees were proposed but was not able to find it; if someone has it, can you post the link?
"J.L. Bentley, Algorithms for Klee's rectangle problems. Technical Report"

Reducing the number of used clauses using proof goal in Z3

I am experimenting with optimizing the use of Z3 for proving facts about a first-order theory. Currently, I specify a first-order theory in Python, ground the quantifiers there and send all the clauses along with the negation of the proof goal to Z3. I have the following idea that I hope could optimize the outcome: I only want to send the formulas in the theory to Z3 that are relevant to the proof goal. I will not discuss this concept in detail, but I think the intuition is simple: my theory is a conjunction of formulas, and I only want to send conjuncts that can possibly affect the truth value of the proof goal.
My question is the following: can this lead to an improvement in efficiency, or does Z3 already use a similar method? I would guess not, because I don't think that Z3 always assumes that the last assertion is the proof goal, so it has no way of optimizing this.
Yes, removing irrelevant facts can make a big difference. Suppose that we have an unsatisfiable formula of the form F_1 and F_2 and (not G). Moreover, let us assume that F_1 and (not G) is unsatisfiable, and F_2 is satisfiable. F_2 is what you call irrelevant. If there is a cheap way to remove F_2 before sending the formula to Z3, it will probably make a big difference.
Z3 has heuristics for "ignoring" irrelevant facts, but they are just heuristics. In our example, the worst-case scenario is an F_2 that is really hard for Z3 to satisfy. Z3 is essentially trying to build an interpretation/solution that satisfies the input formula (the formula F_1 and F_2 and (not G) in our working example). A formula is unsatisfiable when Z3 can show it is impossible to build such an interpretation. In practice, F_2 is irrelevant for Z3 only if Z3 can quickly show it to be satisfiable and the interpretation/solution for F_2 does not conflict with F_1 and (not G). If that is not the case, Z3 can waste a lot of resources on F_2.
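To illustrate with z3py (the concrete formulas below are made up; only the F_1/F_2/G structure comes from the discussion above), you can even ask Z3 which assertions a proof actually needed via an unsat core:

    from z3 import Int, Bool, Solver, And, Not, unsat

    x, y = Int('x'), Int('y')
    F1 = x > 10                 # relevant fact: forces x > 0
    F2 = And(y > 0, y < 1000)   # satisfiable but irrelevant fact
    G = x > 0                   # proof goal

    s = Solver()
    s.assert_and_track(F1, Bool('f1'))
    s.assert_and_track(F2, Bool('f2'))
    s.assert_and_track(Not(G), Bool('ng'))

    assert s.check() == unsat
    print(s.unsat_core())  # typically [f1, ng]: F2 was not needed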
