Constrained Sequence to Index Mapping - mapping

I'm puzzling over how to map a set of sequences to consecutive integers.
All the sequences follow this rule:
A_0 = 1
A_n >= 1
A_n <= max(A_0 .. A_n-1) + 1
I'm looking for a solution that, given such a sequence, computes an integer for doing a lookup into a table and, given an index into the table, generates the sequence.
Example: for length 3, there are 5 valid sequences. A fast function for doing the following map (preferably in both directions) would be a good solution
1,1,1 0
1,1,2 1
1,2,1 2
1,2,2 3
1,2,3 4
The point of the exercise is to get a packed table with a 1-1 mapping between valid sequences and cells.
The size of the set is bounded only by the number of unique sequences possible.
I don't know yet what the length of the sequence will be, but it will be a small (<12) constant known in advance.
I'll get to this sooner or later, but thought I'd throw it out for the community to have "fun" with in the meantime.
these are different valid sequences
1,1,2,3,2,1,4
1,1,2,3,1,2,4
1,2,3,4,5,6,7
1,1,1,1,2,3,2
these are not
1,2,2,4
2,
1,1,2,3,5
Related to this

There is a natural indexing of the sequences, but it is not so easy to calculate.
We only need to look at A_n for n>0, since A_0 = 1 always.
Indexing is done in 2 steps.
Part 1:
Group sequences by places where A_n = max(A_0 .. A_n-1) + 1. Call these places steps.
The values at the steps are consecutive numbers (2,3,4,5,...).
At a non-step place k we can put any number from 1 up to the current maximum, which is 1 plus the number of steps with index less than k.
Each group can be represented as a binary string where 1 is a step and 0 a non-step. E.g. 001001010 means the group of sequences 112aa3b4c with a<=2, b<=3, c<=4. Because groups are indexed by a binary number, there is a natural indexing of groups, from 0 to 2^length - 1. Let's call the value of a group's binary representation the group order.
Part 2:
Index the sequences inside a group. Since a group defines the step positions, only the numbers at non-step positions are variable, and each of them varies in a defined range. With that it is easy to index a sequence inside its group, using the lexicographical order of the variable places.
It is also easy to calculate the number of sequences in one group. It is a number of the form 1^i_1 * 2^i_2 * 3^i_3 * ..., where i_m is the number of non-step places at which the current maximum is m.
Combining:
This gives a two-part key, <Steps, Group>, which then needs to be mapped to the integers. To do that we have to find how many sequences are in groups whose order is less than some value. For that, let's first find how many sequences there are of a given length. That can be computed by passing through all groups and summing the number of sequences in each, or similarly with a recurrence. Let T(l, n) be the number of sequences of length l (A_0 is omitted) in which the first element can be at most n+1. Then the following holds:
T(l,n) = n*T(l-1,n) + T(l-1,n+1)
T(1,n) = n + 1
Because l + n <= sequence length + 1, there are only ~sequence_length^2/2 values of T(l,n), which can be easily calculated.
Next is to calculate the number of sequences in groups of order less than or equal to a given value. That can be done by summing T(l,n) values. E.g. the number of sequences in groups with order <= 1001010 binary is equal to
T(7,1) + # for 1000000
2^2 * T(4,2) + # for 001000
2^2 * 3 * T(2,3) # for 010
Optimizations:
This will give a mapping, but the direct implementation for combining the key parts is >O(1) at best. On the other hand, the Steps portion of the key is small, and by computing the range of Groups for each Steps value, a lookup table can reduce this to O(1).
I'm not 100% sure about the formula above, but it should be something like it.
With these remarks and the recurrence it is possible to write the functions sequence -> index and index -> sequence. But it is not so trivial :-)
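For comparison, here is a minimal Python sketch of a simpler, direct lexicographic ranking. It uses the same kind of recurrence as T(l,n), with the base case taken as "zero remaining elements can be completed in exactly one way". It is not the two-part <Steps, Group> key described above, and the names completions, rank and unrank are made up for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def completions(remaining, current_max):
    """Number of ways to append `remaining` more elements to a valid prefix
    whose largest value so far is `current_max`."""
    if remaining == 0:
        return 1
    # The next element either stays within 1..current_max (current_max choices)
    # or is a new maximum, current_max + 1.
    return (current_max * completions(remaining - 1, current_max)
            + completions(remaining - 1, current_max + 1))


def rank(seq):
    """Lexicographic index of a valid sequence among all valid sequences
    of the same length (seq[0] is assumed to be 1)."""
    index = 0
    current_max = 1
    for pos in range(1, len(seq)):
        remaining = len(seq) - pos - 1
        # Every smaller value at this position is <= current_max, so each one
        # heads a block of completions(remaining, current_max) sequences.
        index += (seq[pos] - 1) * completions(remaining, current_max)
        current_max = max(current_max, seq[pos])
    return index


def unrank(index, length):
    """Inverse of rank: rebuild the sequence of the given length."""
    seq = [1]
    current_max = 1
    for pos in range(1, length):
        remaining = length - pos - 1
        block = completions(remaining, current_max)
        value = min(index // block, current_max) + 1
        index -= (value - 1) * block
        seq.append(value)
        current_max = max(current_max, value)
    return seq
```

For length 3 this reproduces the table from the question (1,1,1 -> 0 through 1,2,3 -> 4), and completions(length - 1, 1) gives the size of the packed table.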

I think a hash without sorting should be the thing.
As A_0 always starts with 1, maybe we can treat the sequence as a number in base 12 and use its decimal value as the key for the lookup. (Still not sure about this.)

This is a python function which can do the job for you assuming you got these values stored in a file and you pass the lines to the function
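def valid_lines(lines):
    """Yield each line that satisfies A_0 = 1 and 1 <= A_n <= max(A_0 .. A_n-1) + 1."""
    for line in lines:
        try:
            seq = [int(x) for x in line.strip().split(",")]
        except ValueError:
            continue  # skip blank or malformed lines such as "2,"
        if seq[0] != 1:
            continue
        # every later element must lie between 1 and (running maximum + 1)
        if all(1 <= seq[i] <= max(seq[:i]) + 1 for i in range(1, len(seq))):
            yield seq
with open('/tmp/numbers.txt') as lines:
    for valid_line in valid_lines(lines):
        print(valid_line)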

Given the sequence, I would sort it, then use the hash of the sorted sequence as the index of the table.


hypothesis function space in decision tree

I am reading the book "Artificial Intelligence" by Stuart Russell and Peter Norvig (Chapter 18). The following paragraph is from the decision trees context.
For a wide variety of problems, the decision tree format yields a
nice, concise result. But some functions cannot be represented
concisely. For example, the majority function, which returns true if
and only if more than half of the inputs are true, requires an
exponentially large decision tree.
In other words, decision trees are good for some kinds of functions
and bad for others. Is there any kind of representation that is
efficient for all kinds of functions? Unfortunately, the answer is no.
We can show this in a general way. Consider the set of all Boolean
functions on "n" attributes. How many different functions are in this
set? This is just the number of different truth tables that we can
write down, because the function is defined by its truth table.
A truth table over "n" attributes has 2^n rows, one for each
combination of values of the attributes.
We can consider the “answer” column of the table as a 2^n-bit number
that defines the function. That means there are (2^(2^n)) different
functions (and there will be more than that number of trees, since
more than one tree can compute the same function). This is a scary
number. For example, with just the ten Boolean attributes of our
restaurant problem there are 2^1024 or about 10^308 different
functions to choose from.
What does the author mean by treating the "answer" column of the table as a 2^n-bit number that defines the function?
How did the author derive (2^(2^n)) different functions?
Please elaborate on the above questions, preferably with a simple example, such as n = 3.
Consider a general truth table for a 3-input function, where the result for each triple is also a Boolean (1 or 0), represented by variables i through 'p':
A B C f(a,b,c)
0 0 0 i
0 0 1 j
0 1 0 k
0 1 1 l
1 0 0 m
1 0 1 n
1 1 0 o
1 1 1 p
We can now represent any function on three variables as an 8-bit number, ijklmnop. For instance, and is 00000001; or is 01111111; one_hot (exactly one input True) is 01101000.
For 3 variables, you have 2^3 bits in the "answer", the complete function definition. Since there are 8 bits in the "answer", there are 2^8 possible functions we can define.
Does that outline the field of comprehension for you?
More detail on an example function
You simply (once you see the pattern) make the eight bits correspond to the entries in the table. For instance, the table for one-hot looks like this:
A B C f(a,b,c)
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 0
Reading down the "answer" column, labeled f(a,b,c), you get the 8-bit sequence 01101000. That 8-bit number is sufficient to completely define the function: the rows listing all the combinations of a, b, c are in a fixed (numerical) sequence.
You can write any such function in a template format:
def and_func(a, b, c):  # "and" itself is a reserved word in Python
    and_def = '00000001'
    index = 4*a + 2*b + 1*c
    return and_def[index]  # the output bit, as the character '0' or '1'
Now, if we generalize this to any 3-input binary function:
def bin_func(a, b, c, func_def):
    return func_def[4*a + 2*b + 1*c]
If you wish, you can further generalize the template for a list of inputs: concatenate the bits and use that integer as the index into the func_def string.
Does that clear it up?
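To make the counting concrete, here is a small Python sketch that generalizes the template to any number of inputs and lists every possible "answer" column. The names evaluate and all_function_defs are made up for this illustration.

```python
def evaluate(func_def, inputs):
    """Look up the output bit of the function encoded by func_def
    (a string of 2**n characters '0'/'1') for the given input bits."""
    index = 0
    for bit in inputs:
        index = 2 * index + bit  # concatenate the input bits into a row number
    return int(func_def[index])


def all_function_defs(n):
    """Every possible answer column for n Boolean inputs,
    written as a bit string of length 2**n."""
    rows = 2 ** n
    return [format(k, "0{}b".format(rows)) for k in range(2 ** rows)]
```

For n = 3 the list has 2^8 = 256 entries, one per 8-bit number, and '01101000' (one-hot) is one of them. For n = 10 the same construction would give 2^1024 strings, which is why nobody enumerates them.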

Calculating ISIN checksum

Hi, I know there have been many questions about this here, but I wasn't able to find a detailed enough answer. Wikipedia has two examples of ISINs and how their checksums are calculated.
The part of calculation that I'm struggling with is
Multiply the group containing the rightmost character
The way I understand this statement is:
Iterate through each character from right to left
once you stumble upon a character rather than digit record its position
if the position is an even number double all numeric values in even position
if the position is an odd number double all numeric values in odd position
My understanding has to be wrong because there are at least two problems:
Every ISIN starts with two character country code so position of rightmost character is always the first character
If you omit the first two characters then there is no explanation as to what to do with ISINs that are made up of all numbers (except for first two characters)
Note
isin.org contains even less information on verifying ISINs, they even use the same example as Wikipedia.
I agree with you; the definition on Wikipedia is not the clearest I have seen.
There's a piece of text just before the two examples that explains when one or the other algorithm should be used:
Since the NSIN element can be any alpha numeric sequence (9 characters), an odd number of letters will result in an even number of digits and an even number of letters will result in an odd number of digits. For an odd number of digits, the approach in the first example is used. For an even number of digits, the approach in the second example is used
The NSIN is identical to the ISIN, excluding the first two letters and the last digit; so if the ISIN is US0378331005 the NSIN is 037833100.
So, if you want to verify the checksum digit of US0378331005, you'll have to use the "first algorithm" because there are 9 digits in the NSIN. Conversely, if you want to check AU0000XVGZA3 you're going to use the "second algorithm" because the NSIN contains 4 digits.
As to the "first" and "second" algorithms, they're identical, with the only exception that in the former you'll multiply by 2 the group of odd digits, whereas in the latter you'll multiply by 2 the group of even digits.
Now, the good news is, you can get away without this overcomplicated algorithm.
You can, instead:
1. Take the ISIN except the last digit (which you'll want to verify)
2. Convert all letters to numbers (A = 10, B = 11, ..., Z = 35), so as to obtain a list of digits
3. Reverse the list of digits
4. All the digits in an odd position are doubled and their digits summed again if the result is >= 10
5. All the digits in an even position are taken as they are
6. Sum all the digits, take the result modulo 10, subtract it from 10, and take modulo 10 once more (so that a sum ending in 0 gives a check digit of 0)
The only tricky step is #4. Let's clarify it with a mini-example.
Suppose the digits in an odd position are 4, 0, 7.
You'll double them and get: 8, 0, 14.
8 is not >= 10, so we take it as it is. Ditto for 0. 14 is >= 10, so we sum its digits again: 1+4=5.
The result of step #4 in this mini-example is, therefore: 8, 0, 5.
A minimal, working implementation in Python could look like this:
import string
isin = 'US4581401001'
def digit_sum(n):
    return (n // 10) + (n % 10)
# map '0'..'9' to 0..9 and 'A'..'Z' to 10..35
alphabet = {letter: value for (value, letter) in
            enumerate(''.join(str(n) for n in range(10)) + string.ascii_uppercase)}
isin_to_digits = ''.join(str(d) for d in (alphabet[v] for v in isin[:-1]))
isin_sum = 0
for (i, c) in enumerate(reversed(isin_to_digits), 1):
    if i % 2 == 1:
        isin_sum += digit_sum(2*int(c))
    else:
        isin_sum += int(c)
# in Python, - isin_sum % 10 is (-isin_sum) % 10, which is already in 0..9
checksum_digit = abs(- isin_sum % 10)
assert int(isin[-1]) == checksum_digit
Or, more crammed, just for functional fun:
checksum_digit = abs( - sum(digit_sum(2*int(c)) if i % 2 == 1 else int(c)
for (i, c) in enumerate(
reversed(''.join(str(d) for d in (alphabet[v] for v in isin[:-1]))), 1)) % 10)
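The same steps can also be wrapped in reusable functions. This is only a sketch: the function names are made up, it relies on int(ch, 36) mapping 'A'..'Z' to 10..35, and it checks nothing beyond the length and the check digit (no country code validation).

```python
def isin_check_digit(isin_body):
    """Check digit for the first 11 characters of an ISIN."""
    # int(ch, 36) maps '0'..'9' to 0..9 and 'A'..'Z' to 10..35
    digits = "".join(str(int(ch, 36)) for ch in isin_body.upper())
    total = 0
    for position, ch in enumerate(reversed(digits), 1):
        d = int(ch)
        if position % 2 == 1:
            d *= 2
            d = d // 10 + d % 10  # sum the two digits of the doubled value
        total += d
    return (10 - total % 10) % 10


def is_valid_isin(isin):
    """True if isin has 12 characters and its last digit matches the checksum."""
    return (len(isin) == 12
            and isin[-1].isdigit()
            and isin_check_digit(isin[:-1]) == int(isin[-1]))
```

Both ISINs mentioned above, US0378331005 (odd number of digits in the NSIN) and AU0000XVGZA3 (even number), pass this single check, so no case distinction is needed.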

Bitwise operation alternative in Neo4j cypher query

I need to do a bitwise "and" in a cypher query. It seems that cypher does not support bitwise operations. Any suggestions for alternatives?
This is what I want to detect ...
For example 268 is (2^8 + 2^3 + 2^2) and as you can see 2^3 = 8 is a part of my original number. So if I use bitwise AND it will be (100001100) & (1000) = 1000 so this way I can detect if 8 is a part of 268 or not.
How can I do this without bitwise support? any suggestions? I need to do this in cypher.
Another way to perform this type of test using cypher would be to convert your decimal values to collections of the decimals that represent the bits that are set.
// convert the binary number to a collection of decimal parts
// create an index the size of the number to convert
// create a collection of decimals that correspond to the bit locations
with '100001100' as number
, [1,2,4,8,16,32,64,128,256,512,1024,2048,4096] as decimals
with number
, range(length(number)-1,0,-1) as index
, decimals[0..length(number)] as decimals
// map the bits to decimal equivalents
unwind index as i
with number, i, (split(number,''))[i] as binary_placeholder, decimals[-i-1] as decimal_placeholder
// multiply the decimal value by the bits that are set
with collect(decimal_placeholder * toInt(binary_placeholder)) as decimal_placeholders
// filter out the zero values from the collection
with filter(d in decimal_placeholders where d > 0) as decimal_placeholders
return decimal_placeholders
For the number above, this returns the collection [4, 8, 256].
Then when you want to test whether the number is in the decimal, you can just test the actual decimal for presence in the collection.
with [4, 8, 256] as decimal_placeholders
, 8 as decimal_to_test
return
case
when decimal_to_test in decimal_placeholders then
toString(decimal_to_test) + ' value bit is set'
else
toString(decimal_to_test) + ' value bit is NOT set'
end as bit_set_test
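To show the idea outside of Cypher, here is a minimal Python sketch of the same decomposition. It deliberately uses only integer division and modulo, no bitwise operators; the function name set_bit_values is made up for this illustration.

```python
def set_bit_values(n):
    """Return the powers of two that add up to n, smallest first,
    using only integer division and modulo."""
    values = []
    bit = 1
    while bit <= n:
        # (n // bit) % 2 is 1 exactly when this power of two is part of n
        if n // bit % 2 == 1:
            values.append(bit)
        bit *= 2
    return values


def has_bit(n, bit_value):
    """True if bit_value (a power of two) is part of n."""
    return bit_value in set_bit_values(n)
```

For 268 the decomposition is [4, 8, 256], so testing for 8 succeeds and testing for 16 fails.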
Alternatively, if one had APOC available they could use apoc.bitwise.op which is a wrapper around the java bitwise operations.
RETURN apoc.bitwise.op(268, "&",8 ) AS `268_AND_8`
Which yields 8, because 268 has the 2^3 bit set.
If you absolutely have to do the operation in cypher, probably a better solution would be to implement something like @evan's SO solution, "Alternative to bitwise operation using cypher".
You could start by converting your data using cypher that looks something like this...
// convert binary to a product of prime numbers
// start with the number to convert and a collection of primes
with '100001100' as number
, [2,3,5,7,13,17,19,23,29,31,37] as primes
// create an index based on the size of the binary number to convert
// take a slice of the prime array that is the size of the number to convert
with number
, range(length(number)-1,0,-1) as index
, primes[0..length(number)] as primes
// iterate over the index and match the prime number to the bits in the number to convert
unwind index as i
with (split(number,''))[i] as binary_place_holder, primes[-i-1] as prime_place_holder
// collect the primes that are set by multiplying by the set bits
with collect(toInt(binary_place_holder) * prime_place_holder) as prime_placeholders
// filter out the zero bits
with filter(p in prime_placeholders where p > 0) as prime_placeholders
// return a product of primes of the set bits
return prime_placeholders, reduce(pp = 1, p in prime_placeholders | pp * p) as prime_product
For the number above, the query returns the primes [5, 7, 29] and a prime product of 1015. The query could be adapted to update attributes with the prime product.
Then when you want to use it you could use the modulus of the prime number in the location of the bit you want to test.
// test if the fourth bit is set in the decimal 268
// 268 is the equivalent of a prime product of 1015
// a modulus 7 == 0 will indicate the bit is set
with 1015 as prime_product
, [2,3,5,7,13,17,19,23,29,31,37] as primes
, 4 as bit_to_test
with bit_to_test
, prime_product
, primes[bit_to_test-1] as prime
, prime_product % primes[bit_to_test-1] as mod_remains
with
case when mod_remains = 0 then
'bit ' + toString(bit_to_test) + ' set'
else
'bit ' + toString(bit_to_test) + ' NOT set'
end as bit_set
return bit_set
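The arithmetic behind the prime-product trick can be checked with a short Python sketch. The prime list is copied as-is from the queries above (it skips 11, which is harmless as long as the primes are distinct); the function names are made up for this illustration.

```python
from math import gcd

# Copied from the Cypher queries above; position 1 is the lowest bit.
PRIMES = [2, 3, 5, 7, 13, 17, 19, 23, 29, 31, 37]


def prime_product(n):
    """Multiply together the primes that stand for the set bits of n."""
    product = 1
    position = 0
    while n > 0:
        if n % 2 == 1:
            product *= PRIMES[position]
        n //= 2
        position += 1
    return product


def bit_is_set(product, bit_number):
    """True if bit bit_number (counted from 1 at the low end) is set
    in the number that `product` encodes."""
    return product % PRIMES[bit_number - 1] == 0
```

With this encoding, the greatest common divisor of two prime products plays the role of a bitwise AND: gcd(prime_product(268), prime_product(8)) is 7, which is the prime standing for the fourth bit.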
It almost certainly defeats the purpose of choosing a bitwise operation in the first place but if you absolutely needed to AND the two binary numbers in cypher you could do something like this with collections.
with split('100001100', '') as bin_term_1
, split('000001000', '') as bin_term_2
, toString(1) as one
with bin_term_1, bin_term_2, one, range(0,size(bin_term_1)-1,1) as index
unwind index as i
with i, bin_term_1, bin_term_2, one,
case
when (bin_term_1[i] = one) and (bin_term_2[i] = one) then
1
else
0
end as r
return collect(r) as AND
Thanks Dave. I tried your solutions and they all worked. They were a good hint for me to find another approach. This is how I solved it. I used String comparison.
with '100001100' as number , '100000000' as sub_number
with number,sub_number,range(length (number)-1,length (number)-length(sub_number),-1) as tail,length (number)-length(sub_number) as difference
unwind tail as i
with i,sub_number,number, i - length (number) + length (sub_number) as sub_number_position
with sub_number_position, (split(number,''))[i-1] as bit_mask , (split(sub_number,''))[sub_number_position] as sub_bit
with collect(toInt(bit_mask) * toInt(sub_bit)) as result
return result
Obviously the number and sub_number can have different values.

Possible to use less/greater than operators with IF ANY?

Is it possible to use <,> operators with the if any function? Something like this:
select if (any(>10,Q1) AND any(<2,Q2 to Q10))
You definitely need to create an auxiliary variable to do this.
@Jignesh Sutar's solution is one that works fine. However, there are often multiple ways in SPSS to accomplish a certain task.
Here is another solution where the COUNT command comes in handy.
It is important to note that the following solution assumes that the values of the variables are integers. If you have float values (1.5 for instance) you'll get a wrong result.
* count occurrences where Q2 to Q10 is less than 2.
COUNT #QLT2 = Q2 TO Q10 (LOWEST THRU 1).
* select if Q1>10 and
* there is at least one occurrence where Q2 to Q10 is less than 2.
SELECT IF (Q1>10 AND #QLT2>0).
There is also a variant for this sort of solution that deals with float variables correctly. But I think it is less intuitive though.
* count occurrences where Q2 to Q10 is 2 or higher.
COUNT #QGE2 = Q2 TO Q10 (2 THRU HIGHEST).
* select if Q1>10 and
* not every one of the 9 variables Q2 to Q10 is two or higher.
SELECT IF (Q1>10 AND #QGE2<9).
Note: Variables beginning with # are temporary variables. They are not stored in the data set.
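The difference between the two COUNT variants is easier to see with the logic written out in Python. This is only an illustration of the reasoning, not SPSS syntax: the function names are made up, each case is modelled as a value q1 plus a list of the nine values Q2 to Q10, and missing values are assumed not to occur.

```python
def select_by_counting_low(q1, others):
    """Mirrors COUNT ... (LOWEST THRU 1): counts the values that are <= 1."""
    low = sum(1 for v in others if v <= 1)
    return q1 > 10 and low > 0


def select_by_counting_high(q1, others):
    """Mirrors COUNT ... (2 THRU HIGHEST): counts the values that are >= 2
    and selects the case when not all of them qualify."""
    high = sum(1 for v in others if v >= 2)
    return q1 > 10 and high < len(others)
```

A case with Q1 = 11 and a single 1.5 among otherwise large values is missed by the first variant, because 1.5 is below 2 but not within LOWEST THRU 1, and is caught by the second.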
I don't think you can (it would be nice if you could; you can do something similar in Excel with COUNTIF & SUMIF, IIRC).
You'd have to construct a new variable which tests the multiple ANY less-than condition, as per the example below:
input program.
loop #j = 1 to 1000.
compute ID=#j.
vector Q(10).
loop #i = 1 to 10.
compute Q(#i) = trunc(rv.uniform(-20,20)).
end loop.
end case.
end loop.
end file.
end input program.
execute.
vector Q=Q2 to Q10.
loop #i=1 to 9 if Q(#i)<2.
compute #QLT2=1.
end loop if Q(#i)<2.
select if (Q1>10 and #QLT2=1).
exe.

Pascal's triangle and Fibonacci sequence explanation

Okay, I need to redraw Pascal's triangle and explain the Fibonacci sequence embedded in it. I need to observe over 12 rows of the triangle (which ends on the number 144 in the Fibonacci sequence). I understand this part, as I am just explaining how the numbers along each diagonal of the triangle sum to a Fibonacci number.
But I need to use the fact that the rth number in the nth row of the triangle is
C(n, r) = n! / (r! (n-r)!)
This last part is what's confusing me. How can I use C(n,r) to explain the Fibonacci sequence in the triangle?
Please help. Thanks
Consider the following problem :
In how many ways can you go up a ladder of n steps if you can take either a single step at a time or 2 steps at a time?
Solution 1 : Let's construct a recurrence relation for this problem. It's pretty clear that the recurrence would be something like this : a(n) = a(n-1) + a(n-2); where a(1)=1 and a(2)=2
Thus, the answer for n would be the (n+1)th fibonacci term.
Solution 2 : Each unique way of climbing up the ladder corresponds to a unique sequence of 1's and 2's which adds up to n. The number of such sequences thus would be our answer. Let's start counting such sequences :
Number of sequences without a 2 = $ {n \choose 0 } $.
Number of sequences with one 2 = $ {n-1 \choose 1 } $.
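In general, a sequence with k 2's contains n-2k 1's, hence n-k entries in total, and the positions of the k 2's among them can be chosen in $ {n-k \choose k} $ ways.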
In case of even n, the last term would be $ {n/2 \choose n/2 } $.
And for odd n, it would be $ {(n+1)/2 \choose (n-1)/2 } $.
As you can see, these are the diagonal terms in Pascal's triangle.
As these two solutions count the same thing, their results must be equal. Thus we get the relation between the Fibonacci numbers and the diagonals of Pascal's triangle.
Refer to the link
http://ms.appliedprobability.org/data/files/Articles%2033/33-1-5.pdf
for any further doubts.
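The identity is easy to check numerically. The sketch below compares the binomial sum from Solution 2 with the recurrence from Solution 1; the function names are made up, and math.comb needs Python 3.8 or later.

```python
from math import comb


def ways_by_binomials(n):
    """Solution 2: sum C(n-k, k) over the possible numbers k of 2-steps."""
    return sum(comb(n - k, k) for k in range(n // 2 + 1))


def ways_by_recurrence(n):
    """Solution 1: a(n) = a(n-1) + a(n-2) with a(1) = 1 and a(2) = 2."""
    a, b = 1, 1  # a(0) and a(1)
    for _ in range(n):
        a, b = b, a + b
    return a
```

For n = 11 both give 1 + 10 + 36 + 56 + 35 + 6 = 144, the Fibonacci number mentioned in the question.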
