Edit distance with swaps

Edit distance finds the number of insertions, deletions or substitutions required to transform one string into another. I want to also include swaps in this algorithm. For example, "apple" and "appel" should give an edit distance of 1.

The edit distance that you are defining is called the Damerau–Levenshtein distance. You can find possible implementations on the Wikipedia page.

See the algorithm here.
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/
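You can give different costs for swap, add, deletions. Note that in the recurrence below, cost_swap is the cost of substituting one character for another (the diagonal step); the recurrence as written does not yet handle transposing two adjacent characters.
m[i,j] = min( m[i-1,j-1] + (if s1[i]=s2[j] then 0 else cost_swap fi),
              m[i-1, j] + cost_insert,
              m[i, j-1] + cost_delete ),   i=1..|s1|, j=1..|s2|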
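To make the swap case concrete, here is a minimal sketch of the restricted Damerau–Levenshtein distance (often called optimal string alignment), which adds one extra term for adjacent transpositions to the recurrence. The function name osa_distance and its cost parameters are illustrative assumptions, not taken from either linked page, and this is the restricted variant rather than the full Damerau–Levenshtein algorithm.

```python
def osa_distance(s1, s2, cost_sub=1, cost_ins=1, cost_del=1, cost_swap=1):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    All names and default costs here are illustrative assumptions.
    """
    n, m = len(s1), len(s2)
    # d[i][j] = cost of turning the first i chars of s1 into the first j chars of s2
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * cost_del
    for j in range(1, m + 1):
        d[0][j] = j * cost_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = s1[i - 1] == s2[j - 1]
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if same else cost_sub),  # match / substitute
                d[i - 1][j] + cost_del,                       # delete from s1
                d[i][j - 1] + cost_ins,                       # insert into s1
            )
            # extra term: the last two characters are swapped relative to each other
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost_swap)
    return d[n][m]
```

With unit costs, osa_distance("apple", "appel") evaluates to 1, whereas the plain recurrence would count two substitutions.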


Difference of TermCriteria type in OpenCV : COUNT and MAX_ITER

The doc page of TermCriteria says that MAX_ITER is the same as COUNT and that the type can be one of: COUNT, EPS or COUNT + EPS. I am wondering whether there is a difference between COUNT + EPS and MAX_ITER + EPS. I have found both styles used in different places. Would they lead to different behaviour at run time?
There is no difference. COUNT and MAX_ITER mean the same. They have the same value, hence are indistinguishable.
Well, their meaning depends on what function takes a TermCriteria tuple/struct/object. Still, same value means the identifiers are interchangeable.
Those named constants live in an enum. The values are chosen to be bits in a bit field. So they're actually flags and should, ordinarily, be combined with | (bitwise OR operator).
Using + is a funny custom, and it probably arose for the following reason: if you give two termination criteria, an algorithm terminates if any of them becomes true. So one could say both the one and the other are given... and now people get their brain gyri twisted thinking of "and" and "or". Combining those flags with + sidesteps that nicely.
cv.TermCriteria_COUNT == 1
cv.TermCriteria_MAX_ITER == 1
cv.TermCriteria_EPS == 2
so your choices are:
COUNT (means MAX_ITER)
MAX_ITER (means COUNT)
EPS
COUNT + EPS
MAX_ITER + EPS
Beware that you don't say COUNT + MAX_ITER (wrong!) because that is 1 + 1 = 2 and that is now EPS, which isn't what that expression was supposed to express.
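The flag arithmetic above can be checked with a small sketch. The constants below are plain stand-ins that mirror the values quoted in this answer (1, 1 and 2); they are not imported from OpenCV, and the helper describe is a hypothetical name used only for illustration.

```python
# Stand-in constants mirroring the values quoted above (assumption: not imported from cv2).
COUNT = 1
MAX_ITER = 1   # same value as COUNT, so the two names are interchangeable
EPS = 2

def describe(criteria_type):
    """Return which termination criteria are set in a combined type value."""
    names = []
    if criteria_type & COUNT:
        names.append("COUNT/MAX_ITER")
    if criteria_type & EPS:
        names.append("EPS")
    return names

both_plus = COUNT + EPS        # the customary spelling
both_or = MAX_ITER | EPS       # the bit-flag spelling, same value
mistake = COUNT + MAX_ITER     # 1 + 1 = 2, which is EPS on its own
```

Here both_plus and both_or are equal, while mistake compares equal to EPS. Using | instead of + for the same pair (COUNT | MAX_ITER) would simply give COUNT again, which is one practical argument for the bitwise spelling.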
The documentation may not contain all the information, and it is generated from OpenCV public header files (via doxygen and its config file).
Just open the source code in an IDE or editor and search for TermCriteria; you will see the values of the MAX_ITER and COUNT enumerators. They should be the same.

Performing an "online" linear interpolation

I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in matlab, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated matlab-specific/built-in functions.
My implementation initially seems OK, but when checking my work against matlab's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
    kk = kk+1; %update the data counter
    %perform online interpolation
    if cc<length(qp)-1
        if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
            y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
        end
        if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
            kk = kk+1;
        else %otherwise, update the query point to ensure its in between the samples for the next iteration
            cc = cc + 1;
            qpi = qp(cc);
            %It is possible that if the change in x_data is greater than the resolution of the query
            %points, an update like the above wont work. In this case, we must lag the data
            if qpi<x_data(kk),
                kk=kk-1;
            end
        end
    end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
%linearly interpolate the function value y(x) over the query point qp
yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test to ensure that x_data(kk+1) exists, nor that x_data(1)<qp(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk=1; % indexes through the data
for cc=1:numel(qp)
    % advance kk to where we can interpolate
    % (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
    % but needs to be adjusted if this is not ensured prior to the loop)
    while x_data(kk+1) < qp(cc)
        kk = kk + 1;
    end
    % perform online interpolation
    y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
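Since the code is meant to be ported to other languages, here is a rough Python sketch of the same idea for the truly streaming case, where samples arrive one at a time and only the previous sample is kept. The name online_interp and its generator interface are assumptions made for illustration; it also assumes strictly increasing x values and sorted query points, and it skips query points outside the sampled range instead of extrapolating.

```python
def online_interp(samples, query_points):
    """Yield (q, y) for every query point that falls between two consecutive samples.

    samples: iterable of (x, y) pairs with strictly increasing x, read one at a time.
    query_points: sorted list of positions at which to interpolate.
    """
    it = iter(samples)
    try:
        x0, y0 = next(it)          # previous sample
    except StopIteration:
        return
    cc = 0                         # index into the query points
    # query points before the first sample cannot be interpolated
    while cc < len(query_points) and query_points[cc] < x0:
        cc += 1
    for x1, y1 in it:              # current sample
        # emit every query point lying in [x0, x1] before reading more data
        while cc < len(query_points) and query_points[cc] <= x1:
            q = query_points[cc]
            yield q, y0 + (q - x0) * (y1 - y0) / (x1 - x0)
            cc += 1
        x0, y0 = x1, y1
```

The outer for loop plays the role of reading from the sensor, and the inner while loop emits all query points bracketed by the last two samples, so no query point is ever evaluated outside its bracketing interval.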

Recurrence Relation tree method

I am currently having issues with figuring out some recurrence stuff, and since I have midterms about it coming up soon I could really use some help and maybe an explanation of how it works.
So I basically have pseudocode for solving the Tower of Hanoi
TOWER_OF_HANOI(n, FirstRod, SecondRod, ThirdRod)
    if n == 1
        move disk from FirstRod to ThirdRod
    else
        TOWER_OF_HANOI(n-1, FirstRod, ThirdRod, SecondRod)
        move disk from FirstRod to ThirdRod
        TOWER_OF_HANOI(n-1, SecondRod, FirstRod, ThirdRod)
And provided I understand how to write the relation (which, honestly I'm not sure I do...) it should be T(n) = 2T(n-1)+Θ(n), right? I sort of understand how to make a tree with fractional subproblems, but even then I don't fully understand the process that would give you the end solution of Θ(n) or Θ(n log n) or whatnot.
Thanks for any help, it would be greatly appreciated.
Let T(n) be the time complexity. The recurrence should be T(n) = T(n-1) + T(n-1) + 1 = 2T(n-1) + 1. Why "+1" and not "+n"? Because "move disk from FirstRod to ThirdRod" costs you only one move.
For T(n) = 2T(n-1) + 1, its recursion tree will exactly look like this:
https://www.quora.com/What-is-the-complexity-of-T-n-2T-n-1-+-C (You might find it helpful, the image is neat.) C is a constant; it means the cost per operation. In the case of Tower of Hanoi, C = 1.
Sum the cost of each level: level k of the tree has 2^k nodes, each costing 1, and with T(1) = 1 there are n levels (k = 0..n-1). The total cost is therefore 1 + 2 + 4 + ... + 2^(n-1) = 2^n - 1, which is exponential (expensive). Hence the solution of this recurrence is Θ(2^n).
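As a sanity check on the 2^n - 1 figure, the pseudocode from the question can be transcribed into Python and the moves counted directly. This is only a sketch: recording moves in a list and the helper count_moves are additions for illustration, not part of the original pseudocode.

```python
def hanoi(n, first, second, third, moves):
    """Move n disks from `first` to `third`, recording each move in `moves`."""
    if n == 1:
        moves.append((first, third))
    else:
        hanoi(n - 1, first, third, second, moves)   # T(n-1)
        moves.append((first, third))                # the "+1"
        hanoi(n - 1, second, first, third, moves)   # T(n-1)

def count_moves(n):
    """Return the number of single-disk moves made for n disks."""
    moves = []
    hanoi(n, "A", "B", "C", moves)
    return len(moves)
```

For every n you try, count_moves(n) should equal 2**n - 1, matching T(n) = 2T(n-1) + 1 with T(1) = 1.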

Moving Average across Variables in Stata

I have a panel data set for which I would like to calculate moving averages across years.
Each year is a variable for which there is an observation for each state, and I would like to create a new variable for the average of every three year period.
For example:
P1947=rmean(v1943 v1944 v1945), P1947=rmean(v1944 v1945 v1946)
I figured I should use a foreach loop with the egen command, but I'm not sure about how I should refer to the different variables within the loop.
I'd appreciate any guidance!
This data structure is quite unfit for purpose. Assuming an identifier id you need to reshape, e.g.
reshape long v, i(id) j(year)
tsset id year
Then a moving average is easy. Use tssmooth or just generate, e.g.
gen mave = (L.v + v + F.v)/3
or (better)
gen mave = 0.25 * L.v + 0.5 * v + 0.25 * F.v
More on why your data structure is quite unfit: Not only would calculation of a moving average need a loop (not necessarily involving egen), but you would be creating several new extra variables. Using those in any subsequent analysis would be somewhere between awkward and impossible.
EDIT I'll give a sample loop, while not moving from my stance that it is poor technique. I don't see a reason behind your naming convention whereby P1947 is a mean for 1943-1945; I assume that's just a typo. Let's suppose that we have data for 1913-2012. For means of 3 years, we lose one year at each end.
forval j = 1914/2011 {
    local i = `j' - 1
    local k = `j' + 1
    gen P`j' = (v`i' + v`j' + v`k') / 3
}
That could be written more concisely, at the expense of a flurry of macros within macros. Using unequal weights is easy, as above. The only reason to use egen is that it doesn't give up if there are missings, which the above will do.
FURTHER EDIT
As a matter of completeness, note that it is easy to handle missings without resorting to egen.
The numerator
(v`i' + v`j' + v`k')
generalises to
(cond(missing(v`i'), 0, v`i') + cond(missing(v`j'), 0, v`j') + cond(missing(v`k'), 0, v`k'))
and the denominator
3
generalises to
!missing(v`i') + !missing(v`j') + !missing(v`k')
If all values are missing, this reduces to 0/0, or missing. Otherwise, if any value is missing, we add 0 to the numerator and 0 to the denominator, which is the same as ignoring it. Naturally the code is tolerable as above for averages of 3 years, but either for that case or for averaging over more years, we would replace the lines above by a loop, which is what egen does.
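The numerator and denominator trick is language-independent, so here is a small Python sketch of it for one state's series, with None standing in for Stata's missing value. The function name centered_mean3 is a hypothetical choice, and the list is assumed to hold one value per consecutive year.

```python
def centered_mean3(values):
    """Three-year centered moving average that ignores missing values.

    values: list of numbers or None (missing), one entry per consecutive year.
    Returns one mean per interior year; the first and last years are lost.
    """
    out = []
    for j in range(1, len(values) - 1):
        window = values[j - 1 : j + 2]
        # numerator: missing values contribute 0
        total = sum(v for v in window if v is not None)
        # denominator: count of non-missing values
        count = sum(v is not None for v in window)
        # all three missing corresponds to 0/0, reported as missing
        out.append(total / count if count else None)
    return out
```

For example, centered_mean3([1, None, 3, None, None, None]) gives [2.0, 3.0, 3.0, None]: each missing value is simply left out of both the sum and the count.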
There is a user-written program that can do this very easily for you. It is called mvsumm and can be found through findit mvsumm:
xtset id time
mvsumm observations, stat(mean) win(t) gen(new_variable) end

Big-O of an operation over a single linked list

Suppose you've got a single linked list of size N, and you want to perform an operation on every element, beginning at the end.
I've come up with the following pseudocode:
while N > 0
    Current = LinkedList
    for 0 to N
        Current = Current.tail
    end
    Operation(Current.head)
    N := N-1
end
Now I've got to determine which Big-O this algorithm is.
Supposing that Operation() is O(1), I think it's something like this:
N + (N-1) + (N-2) + ... + (N-(N-1)) + 1
But I'm not sure what Big-O that actually is. I think it is definitely smaller than O(N^2), but I don't think you can say it's O(N) either ...
Your equation is basically that of the triangular numbers, and sums to N(N+1)/2. I'll leave you to determine the O() from that!
A quicker way to do this is to construct a new list that is the reverse of the original list, and then perform the operations on that.
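A minimal sketch of that reversal approach, assuming a list cell with head and tail fields as in the question's pseudocode (the Node class and function names are illustrative):

```python
class Node:
    """One cell of a singly linked list."""
    def __init__(self, head, tail=None):
        self.head = head   # the element stored in this cell
        self.tail = tail   # the rest of the list, or None at the end

def reversed_list(lst):
    """Build a new list with the elements of lst in reverse order, in O(N)."""
    result = None
    while lst is not None:
        result = Node(lst.head, result)   # prepend each element in turn
        lst = lst.tail
    return result

def for_each_from_end(lst, operation):
    """Apply operation to every element, last element first, in O(N) total."""
    current = reversed_list(lst)
    while current is not None:
        operation(current.head)
        current = current.tail
```

Both passes visit each cell once, so the total is O(N) time at the price of O(N) extra memory for the reversed copy.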
Your algorithm is O(n^2) as you suggest in your post. You can do it in O(n), though.
It's important to remember that Big-O notation is an upper bound on the algorithm's time complexity.
1+2+3+...+n = n*(n+1)/2 = 0.5*n^2+O(n)
This is O(n^2), and O(n^2) is tight, i.e. there is no lower runtime order that'd contain your runtime.
A faster algorithm that works from front-to-back could have O(n) instead of O(n^2)
Your runtime analysis is correct: the runtime is 1 + 2 + ... + N, which is the sum of an arithmetic progression and therefore equals (N² + N) / 2.
