How can a segment in a segment tree be deleted in O(log n) time? - segment-tree

I just finished reading about segment trees. The proof that insertion runs in O(log n) time is quite convincing, but I was not able to figure out how deletion can be carried out with the same complexity. I also tried searching for the paper in which segment trees were proposed, but was not able to find it; if someone has it, can you post the link?
"J.L. Bentley, Algorithms for Klee's rectangle problems. Technical Report"

Related

How to restrict the sequence prediction in an LSTM model to match a specific pattern?

I have created a word-level text generator using an LSTM model. But in my case, not every word is suitable to be selected. I want the selected words to match additional conditions:
Each word has a map: a vowel is written as 1 and a consonant as 0 (for instance, overflow would be 10100010). The generated sentence then needs to match a given structure, for instance 01001100 (hi is 01 and friend is 001100).
The last vowel of the last word must be the one provided. Let's say it is e (friend will do the job, then).
Thus, to handle this scenario, I've created a pandas dataframe with the following structure:
word last_vowel word_map
----- --------- ----------
hello o 01001
stack a 00100
jhon o 0010
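For concreteness, here is a minimal sketch of how such a dataframe could be built; the helper names word_map and last_vowel are just illustrative, and only lowercase ASCII words are assumed:
import pandas as pd

VOWELS = set("aeiou")

def word_map(word):
    # encode a word as '1' for vowels and '0' for consonants
    return "".join("1" if ch in VOWELS else "0" for ch in word)

def last_vowel(word):
    # return the last vowel of the word, or '' if it has none
    vowels = [ch for ch in word if ch in VOWELS]
    return vowels[-1] if vowels else ""

words = ["hello", "stack", "jhon"]
df = pd.DataFrame({
    "word": words,
    "last_vowel": [last_vowel(w) for w in words],
    "word_map": [word_map(w) for w in words],
})
# word_map("overflow") == "10100010"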
This is my current workflow (a rough code sketch follows the list):
Given the sentence structure, I choose a random word from the dataframe which matches the pattern. For instance, if the sentence structure is 0100100100100, we can choose the word hello, as its vowel structure is 01001.
I subtract the selected word from the remaining structure: 0100100100100 will become 00100100 as we've removed the initial 01001 (hello).
I retrieve all the words from the dataframe whose map matches the beginning of the remaining structure, in this case stack 00100 and jhon 0010.
I pass the current sentence content (just hello by now) to the LSTM model, and it returns the weights of each word.
But I don't just want to select the best option overall; I want to select the best option among the words retrieved in step 3. So I choose the word with the highest estimation within that list, in this case stack.
I repeat from step 2 until the remaining sentence structure is empty.
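A minimal sketch of this greedy loop, assuming the dataframe above and a hypothetical score_candidates(sentence_words, candidates) function that wraps the LSTM call and returns one weight per candidate:
import random

def generate_sentence(df, structure, score_candidates):
    # greedy, prefix-constrained generation as described in the workflow
    sentence = []
    remaining = structure
    while remaining:
        # words whose vowel map is a prefix of the remaining structure
        mask = df["word_map"].apply(remaining.startswith)
        candidates = df.loc[mask, "word"].tolist()
        if not candidates:
            return None  # dead end: nothing fits the remaining pattern
        if not sentence:
            chosen = random.choice(candidates)          # step 1: random start
        else:
            weights = score_candidates(sentence, candidates)
            chosen = max(zip(candidates, weights), key=lambda c: c[1])[0]
        sentence.append(chosen)
        used = df.loc[df["word"] == chosen, "word_map"].iloc[0]
        remaining = remaining[len(used):]               # step 2: consume prefix
    return sentence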
That works like a charm, but there is one remaining condition to handle: the last vowel of the sentence.
My way to deal with this issue is the following:
Generating 1000 sentences, forcing the last vowel to be the one specified.
Getting the RMSE of the weights returned by the LSTM model. The better the output, the higher the weights will be.
Selecting the sentence with the highest rank.
Do you think there is a better approach? Maybe a GAN or reinforcement learning?
EDIT: I think another approach would be to add a WFST (weighted finite-state transducer). I've heard about the pynini library, but I don't know how to apply it to my specific context.
If you are happy with your approach, the easiest improvement might be to train your LSTM on the reversed sequences, so that it gives the weight of the previous word rather than the next one. In that case, you can use the method you already employ, except that the first subset of words would be the one satisfying the last-vowel constraint. I don't believe that this is guaranteed to produce the best result, though.
Now, if that reversal is not possible, or if, after reading my answer further, you find that this doesn't find the best solution, then I suggest using a pathfinding algorithm. This is similar to reinforcement learning, but not statistical, since the weights computed by the trained LSTM are deterministic. What you currently use is essentially a greedy depth-first search, which, depending on the LSTM output, might even be optimal: say, if the LSTM gives you a guaranteed monotonic increase in the sum that doesn't vary much between the acceptable next words (i.e., the difference between the N-1 and N sequences is much larger than the difference between the different options for the Nth word). In the general case, when there is no clear heuristic to help you, you will have to perform an exhaustive search. If you can come up with an admissible heuristic, you can use A* instead of Dijkstra's algorithm in the first option below, and it will run faster the better your heuristic is.
I suppose it is clear, but just in case: your graph connectivity is defined by your constraint sequence. The initial node (the 0-length sequence with no words) is connected to any word in your data frame that matches the beginning of your constraint sequence. So you do not have the graph as an explicit data structure, only its compressed description in the form of this constraint.
EDIT
As per the request in the comments, here are additional details. There are a couple of options:
Apply Dijkstra's algorithm multiple times. Dijkstra's search finds the shortest path between two known nodes, while in your case we only have the initial node (the 0-length sequence with no words) and the final words are unknown.
Find all acceptable last words (those that satisfy both the pattern and vowel constraints).
Apply Dijkstra's search for each one of those, finding the largest word sequence weight sum for each of them.
Dijkstra's algorithm is tailored to searching for the shortest path, so to apply it directly you will have to negate the weights on each step and always pick the smallest of the nodes that haven't been visited yet.
After finding all solutions (sentences that end with one of the last words you identified initially), select the smallest one; it corresponds exactly to the largest weight sum among all solutions.
Modify your existing depth-first search to do an exhaustive search (a sketch follows this list).
Perform the search as described in the OP and, if the last step yields a solution (i.e., a last word with the correct vowel is available at all), record its weight.
Roll back one step to the previous word and pick the second-best option among the previous words. If there was no solution at all, you might be able to discard all the words of the same length at the previous step. If there was a solution, it depends on whether your LSTM provides different weights depending on the previous word; it likely does, and in that case you have to perform that operation for all the words at the previous step.
When you run out of words at the previous step, move one step up and restart from there.
Keep track of the current winner at all times, as well as the list of unvisited nodes at every step, and perform the exhaustive search. Eventually you will find the best solution.
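A rough sketch of that exhaustive search, reusing the dataframe layout from the question and the same hypothetical score_candidates LSTM wrapper; because the weights may depend on the whole prefix, nothing is memoized:
def best_sentence(df, structure, final_vowel, score_candidates):
    # exhaustive depth-first search over every pattern-matching sentence;
    # returns (best_total_weight, best_word_list) or (None, None)
    best = (None, None)

    def dfs(remaining, prefix, total):
        nonlocal best
        if not remaining:
            if not prefix:
                return
            last = df.loc[df["word"] == prefix[-1], "last_vowel"].iloc[0]
            if last == final_vowel and (best[0] is None or total > best[0]):
                best = (total, list(prefix))
            return
        mask = df["word_map"].apply(remaining.startswith)
        candidates = df.loc[mask, "word"].tolist()
        if not candidates:
            return
        weights = score_candidates(prefix, candidates)
        for word, weight in zip(candidates, weights):
            used = df.loc[df["word"] == word, "word_map"].iloc[0]
            dfs(remaining[len(used):], prefix + [word], total + weight)

    dfs(structure, [], 0.0)
    return best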
I would reach for a Beam Search here.
This is much like your current approach of starting 1000 solutions randomly, but instead of expanding each of those paths independently, it expands all candidate solutions together, step by step.
Sticking with the current candidate count of 1000, it would look like this:
Generate 1000 stub solutions, for example using random starting points or selected from some "sentence start" model.
For each candidate, compute the best extensions that fit the constraints, based on your LSTM language model. This works just as it does in your current approach, except you could also try more than one option. For example, using the best 5 choices for the next word would produce 5000 child candidates.
Compute a score for each of those partial solution candidates, then reduce back to 1000 candidates by keeping only the best scoring options.
Repeat steps 2 and 3 until all candidates cover the full vowel sequence, including the end constraint.
Take the best scoring of these 1000 solutions.
You can play with the candidate scoring to trade off completed or longer solutions against very good but short fits.
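A minimal beam-search sketch along these lines, using the same hypothetical dataframe and score_candidates wrapper as in the question; it starts from a single empty stub rather than 1000 random ones, and the beam width and branching factor are just the numbers mentioned above:
def beam_search(df, structure, final_vowel, score_candidates,
                beam_width=1000, branch=5):
    # each candidate is (total_score, words, remaining_structure)
    beam = [(0.0, [], structure)]
    finished = []
    while beam:
        children = []
        for total, words, remaining in beam:
            mask = df["word_map"].apply(remaining.startswith)
            options = df.loc[mask, "word"].tolist()
            if not options:
                continue                      # dead end for this candidate
            weights = score_candidates(words, options)
            ranked = sorted(zip(options, weights), key=lambda x: -x[1])[:branch]
            for word, weight in ranked:
                used = df.loc[df["word"] == word, "word_map"].iloc[0]
                child = (total + weight, words + [word], remaining[len(used):])
                if child[2]:                  # structure not yet covered
                    children.append(child)
                else:                         # full pattern covered
                    last = df.loc[df["word"] == word, "last_vowel"].iloc[0]
                    if last == final_vowel:   # end constraint satisfied
                        finished.append(child)
        # keep only the best `beam_width` partial solutions for the next step
        beam = sorted(children, key=lambda c: -c[0])[:beam_width]
    return max(finished, key=lambda c: c[0]) if finished else None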

Optimize deep Q network with long episode

I am working on a problem that we aim to solve with deep Q-learning. However, training just takes too long for each episode, roughly 83 hours. We are envisioning solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode we need to perform 100 * 10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate into the matrix, and compute a reward function by feeding the whole matrix as the input.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" system, if I understood correctly.
Could anyone shed some light on the potential optimization opportunities here, for example some extra engineering effort? Any suggestions and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
at each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, one element per step.
2. environment:
at each step I have an updated matrix to learn from.
An oracle function F returns a quantitative value ranging from 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as the input, performs a very costly computation, and returns a quantitative value indicating the quality of the synthesized matrix so far.
This function essentially measures some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
When I say "we are envisioning to solve it within 100 episodes", that's just an empirical estimate. But it shouldn't take fewer than 100 episodes, at least.
4. constraints
Ideally, like I mentioned, "all the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than just the latest selected element.
Indeed, by appending more and more elements to the matrix, the reward could increase, or it could just as well decrease.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning (a code sketch of this whole loop follows below).
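To make the setup concrete, here is a minimal sketch of the loop as described above; oracle_F, the candidate pool, the scalar matrix entries, the incremental reward, and choose_action (where the deep Q-network would sit) are all assumptions or stand-ins, not parts of the actual system:
import numpy as np

ROWS, COLS = 100, 10          # the matrix being synthesized
TARGET = 25000                # goal value for the oracle

def run_episode(oracle_F, candidate_pool, choose_action):
    # oracle_F(matrix) -> value in roughly [5000, 30000], ~120 s per call
    # choose_action(matrix, pool) -> index of the candidate to place next
    matrix = np.zeros((ROWS, COLS))              # "empty" matrix
    prev_value = 0.0
    for step in range(ROWS * COLS):              # 100 * 10 placements
        idx = choose_action(matrix, candidate_pool)
        row, col = divmod(step, COLS)
        matrix[row, col] = candidate_pool[idx]   # insert the chosen candidate
        value = oracle_F(matrix)                 # the costly ~2 minute call
        reward = value - prev_value              # one possible reward shaping
        prev_value = value
        # ... store (state, action, reward, next_state) and update the DQN ...
        if value > TARGET:
            break                                # goal reached, stop early
    return matrix, prev_value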
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
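Purely as an illustration of throwing more compute at the reward evaluation, here is a hypothetical sketch that scores a handful of shortlisted candidate placements in parallel before committing to one; it only helps if F can be evaluated independently for different trial matrices:
from concurrent.futures import ProcessPoolExecutor

def best_of_candidates(oracle_F, matrix, position, shortlist, workers=8):
    # matrix: the current 100*10 matrix, position: the (row, col) to fill,
    # shortlist: a few candidate values picked by the Q-network
    trials = []
    for value in shortlist:
        trial = matrix.copy()
        trial[position] = value
        trials.append(trial)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(oracle_F, trials))   # F calls run concurrently
    best = max(range(len(shortlist)), key=lambda i: scores[i])
    return shortlist[best], scores[best]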
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most of the time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, for the Dota 2 game, OpenAI collected experience equivalent to 900 years per day. In the original Deep Q-network paper, achieving performance close to that of a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't much better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network that has millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallow model (e.g., random trees) would do. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons, and I perfectly understand.
You have estimated the number of required episodes based on some empirical studies using a smaller version of the problem with a 20*10 matrix. Just a note of caution: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the state space dimensionality grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.

What is the computational complexity of constructing and performing a regression tree?

What is the computational complexity of constructing and performing a regression tree? Is there any analysis or conclusion on it?
Thanks!
You can look at the xgboost paper:
"The most time consuming part of the tree learning algorithm is getting the data in sorted order. This makes the time complexity of learning each tree O(n log n)."
The answer largely depends on the procedure for selecting the best attribute to split on and the best split point. Two parameters play a key role in the analysis:
number of attributes;
number of training examples.
The expensive part is computing the best split point for a continuous attribute (this is essentially discretization), and selecting the best attribute from among the set of candidate attributes to split on.
In my experience the complexity is often quadratic in the number of attributes (denoted a) and linear in the number of examples (denoted n), that is, O(n * a^2).
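To illustrate where the sort-then-scan cost per attribute comes from, here is a sketch of a best-split search for a single continuous attribute; using variance (sum of squared errors) as the impurity criterion is an assumption, since the criterion varies between implementations:
import numpy as np

def best_split(x, y):
    # sorting costs O(n log n); the scan that follows is O(n), so doing this
    # for each of `a` attributes gives O(a * n log n) per node
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    total_sum, total_sq = y_sorted.sum(), (y_sorted ** 2).sum()
    best_sse, best_threshold = np.inf, None
    left_sum = left_sq = 0.0
    for i in range(1, n):
        left_sum += y_sorted[i - 1]
        left_sq += y_sorted[i - 1] ** 2
        if x_sorted[i] == x_sorted[i - 1]:
            continue                              # cannot split between ties
        right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
        # sum of squared errors on both sides, in running-sum form
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if sse < best_sse:
            best_sse, best_threshold = sse, (x_sorted[i] + x_sorted[i - 1]) / 2
    return best_threshold, best_sse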
But, as I said, it really depends on your specific case. Provide us with more details if you want a more concrete answer.

How to do random embedded bracketing of elements

I'm writing a learning algorithm for automatic constituent bracketing. Since the algorithm starts from scratch, the (embedded) bracketing should be random at first; it is then improved through iterations. I'm stuck on how to do the random bracketing. Can you please suggest code in R or Python, or give some programming idea (pseudocode)? I also need ideas on how to check a random bracketing against a proper one for correctness.
This is what I'm trying to finally arrive at, through the learning process, starting from random bracketing.
Here is a sentence:
'He' 'chased' 'the' 'dog.'
Replacing each word with its grammatical element:
N, V, D, N.
Bracketing (first phase) (D, N are constituents):
(N) (V) (D N)
Bracketing (second phase):
(N) ((V) (D N))
Bracketing (third phase):
((N) ((V) (D N)))
Please help. Thank you.
Here's all I can say with the information provided:
A naive way to do the bracketing would be to generate some trees with as many leaves as there are words (or components) (generating all of them can quickly become very space-consuming), then select a suitable one (at random or according to a proper partitioning) and apply it as the bracketing pattern. For more efficiency, look for a true random tree generation algorithm (I couldn't find one at the moment).
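For instance, a small sketch of the "pick a random tree and use it as the bracketing" idea in Python; it recursively picks a random split point, so it is not uniform over all binary trees, just random:
import random

def random_bracketing(tokens):
    # randomly group a list of tokens into a nested (binary) bracketing
    if len(tokens) == 1:
        return tokens[0]
    split = random.randint(1, len(tokens) - 1)   # random split point
    return (random_bracketing(tokens[:split]),
            random_bracketing(tokens[split:]))

# random_bracketing(["N", "V", "D", "N"]) might give
# (('N', 'V'), ('D', 'N')) or ('N', ('V', ('D', 'N'))), and so on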
Additionally, I'd recommend reading about genetic algorithms/evolutionary programming, especially fitness functions (which are the "check random results for correctness" part). As far as I understand you, you want the program to detect ways of parsing and then keep them in memory as "learned". That closely matches a genetic algorithm with memorization of the "fittest" patterns (and only mutation as the changing factor).
An awesome, very elaborate (if it works), but probably extremely difficult approach would be to use genetic programming. But that's probably too different from what you want.
And last, the easiest way to check the correctness of a bracketing, in my opinion, would be to keep a table with the grammar/syntax rules and compare against them. You could also improve this into a better fitness function by keeping the rules in a tree and measuring the distance from the actual pattern ((V D) N) to the correct pattern (V (D N)). (This is just a random idea; I've never actually done it.)

In an ID3 implementation, at which point should the recursion in the algorithm stop?

In an ID3 implementation, at which point should the recursion in the algorithm stop?
A branch stops when there are no examples left to classify or there are no attributes left to classify with. The algorithm description on Wikipedia is pretty easy to follow and there's a bunch of links to examples and discussions on there too.
Well, you continue to split (form two new nodes from an existing one) as long as the splitting criterion is satisfied.
The splitting criterion is usually a negative value of the difference between the weighted average Information Gain of the putative child nodes and the parent node's Information Gain, aka entropy (or variance if the variable is continuous rather than categorical):
if weighted_mean(IG_child1, IG_child2) < IG_parent:
    createNodes(IG_child1, IG_child2)    # keep splitting
else:
    pass                                 # stop splitting this branch
So this is the trivial answer, but there is likely a more sophisticated intention behind your question, which, if you don't mind, I'll re-word slightly as: should you continue to create nodes as long as the splitting criterion is satisfied?
As you might have just found out if you are coding an ID3 algorithm, applying the splitting criterion without constraint will often cause over-fitting (i.e., the tree that you've built from the training data doesn't generalize well, because it hasn't distinguished the noise from the genuine patterns).
So this is more likely the answer to your question: the techniques to 'constrain' node splitting (and therefore deal with the over-fitting problem) all fall into one of two categories, top-down or bottom-up. An example of top-down: set some threshold, e.g., do not split unless the weighted mean of the child nodes' IG is at least 5% below the parent's.
An example of bottom-up: pruning. Pruning means letting the algorithm split as long as the splitting criterion is satisfied, and then, after it has stopped, starting from the bottom layer of nodes and 'unsplitting' any node for which the IG difference between the child nodes and the parent is less than some threshold.
These two approaches do not have the same effect, and in fact pruning is the superior technique. The reason: if you enforce a splitting threshold top-down, then of course some splitting will be prevented; however, if it had been allowed to occur, the next split (of one or both of the two child nodes into grandchildren) might have been a valid split (i.e., above the threshold), yet that split will never occur. Pruning, of course, accounts for this.
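A minimal sketch of that bottom-up pruning pass on a toy node structure; the impurity field, the unweighted average of the child impurities, and the threshold are all placeholders for whatever IG measure and weighting your tree actually stores:
class Node:
    def __init__(self, impurity, left=None, right=None):
        self.impurity = impurity          # entropy / IG measure at this node
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

def prune(node, threshold):
    # 'unsplit' nodes whose children barely improve on the parent
    if node is None or node.is_leaf():
        return node
    node.left = prune(node.left, threshold)       # work bottom-up
    node.right = prune(node.right, threshold)
    if node.left.is_leaf() and node.right.is_leaf():
        child_impurity = (node.left.impurity + node.right.impurity) / 2
        if node.impurity - child_impurity < threshold:
            node.left = node.right = None         # collapse back into a leaf
    return node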
