I'm working on a short homomorphic query that transforms certain letters of an input string into other fixed letters.
For instance, I would like every 'A' to be transformed into 'E', and every 'E' to be turned into 'O'.
I can't use sequential native replace() calls, because the following would happen:
Input : LEA
replace n°1: LEE
replace n°2: LOO
Desired output vs. actual output: LOE / LOO
So I decided to process the string letter by letter, looping over its characters. In the example below I replace every 'E' with 'O':
MATCH (...stringToReplace..)
UNWIND range(0,size(apoc.text.split(stringToReplace,'',0))-1) AS i
SET stringToReplace = CASE
WHEN apoc.text.split(stringToReplace,'',0)[i] = 'E'
THEN substring(stringToReplace,0,i) + "O" + substring(stringToReplace, i+1, size(apoc.text.split(stringToReplace,'',0))-1)
ELSE stringToReplace
END
RETURN stringToReplace
The problem I encounter is that I'll have as many SET queries as the string has letters. I think that performance-wise, this is pretty lame.
What I would like to have, and I'm not sure it's possible in Cypher, is to modify an aggregating variable inside the loop and then SET my data. I tried to use a WITH statement before my UNWIND loop but didn't manage to store data inside a var.
Edit: I managed to write a different implementation, but it still returns and sets too many times, even though the end result is right.
MATCH (...stringToReplace...)
UNWIND range(0,size(apoc.text.split(stringToReplace,'',0))-1) AS i
WITH CASE
WHEN apoc.text.split(stringToReplace,'',0)[i] = 'a'
THEN substring(stringToReplace,0,i) + "i" + substring(stringToReplace, i+1, size(apoc.text.split(stringToReplace,'',0))-1)
ELSE stringToReplace
END AS outputString, stringToReplace
SET stringToReplace = outputString
RETURN stringToReplace
This should convert every character in stringToReplace
MATCH (...stringToReplace..)
RETURN REDUCE(s = '', c IN split(stringToReplace,'') |
s + CASE c
WHEN 'A' THEN 'E'
WHEN 'E' THEN 'O'
ELSE c
END
) as result
Just add more WHEN/THEN clauses to handle all the character conversions needed.
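For comparison, here is a minimal Python sketch of the same idea (not part of the Cypher answer). It contrasts chained replaces with a single pass in which every original character is looked up exactly once, as the REDUCE/CASE query does. The names MAPPING, chained_replace and single_pass are made up for this illustration.

```python
# Hypothetical mapping taken from the question: 'A' -> 'E', 'E' -> 'O'.
MAPPING = {"A": "E", "E": "O"}


def chained_replace(text):
    """Sequential replaces: the second call also rewrites the output of the first."""
    return text.replace("A", "E").replace("E", "O")


def single_pass(text):
    """Map each original character once, so earlier substitutions are never re-read."""
    return "".join(MAPPING.get(ch, ch) for ch in text)


print(chained_replace("LEA"))  # LOO (the unwanted result)
print(single_pass("LEA"))      # LOE (the desired result)
```

Adding an entry to MAPPING plays the same role as adding another WHEN/THEN clause.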
How would I find the index of a string within a list? For example, given
WITH split ("what is porsche",' ')
how would I find the position of 'porsche' as 3?
First of all, the position would be 2 as we generally start from 0 in CS.
This is a one liner :
WITH split ("what is porsche",' ') AS spl
RETURN [x IN range(0,size(spl)-1) WHERE spl[x] = "porsche"][0]
Returns 2
WITH split ("what is porsche",' ') AS spl
RETURN [x IN range(0,size(spl)-1) WHERE spl[x] = "is"][0]
Returns 1
Cypher does not have a native IndexOf-like function, but you can install APOC Procedures and use the function apoc.coll.indexOf, like this:
WITH split ("what is porsche",' ') AS list
RETURN apoc.coll.indexOf(list, 'porsche')
The result will be:
╒════════════════════════════════════╕
│"apoc.coll.indexOf(list, 'porsche')"│
╞════════════════════════════════════╡
│2 │
└────────────────────────────────────┘
Note: The result is 2 because indexes start at 0.
Note 2: Remember to install the APOC Procedures release that matches the version of Neo4j you are using. Take a look at the version compatibility matrix.
EDIT:
One alternative approach without using APOC Procedures, using size(), reduce() and range() functions with CASE expression:
WITH split ("what is porsche",' ') AS list
WITH list, range(0, size(list) - 1) AS indexes
WITH reduce(acc=-1, index IN indexes |
CASE WHEN list[index] = 'porsche' THEN index ELSE acc END
) as reduction
RETURN reduction
If the value is not found, -1 is returned.
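To make the fold easier to follow, here is a rough Python sketch of the same reduce-over-indexes idea (an illustration only, not Cypher). The helper name last_index_of is hypothetical. Note that the accumulator is overwritten on every match, so with duplicate entries this approach reports the last matching index, not the first.

```python
from functools import reduce


def last_index_of(items, needle):
    """Fold over the indexes, keeping the most recent index whose element matches."""
    return reduce(
        lambda acc, i: i if items[i] == needle else acc,
        range(len(items)),
        -1,  # starting accumulator, returned unchanged when nothing matches
    )


words = "what is porsche".split(" ")
print(last_index_of(words, "porsche"))  # 2
print(last_index_of(words, "porsch"))   # -1
```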
As Bruno says, APOC is the right call for this but if for some reason you wanted to find the position without APOC you could go through the following rigamarole...
WITH split("what is porsche",' ') AS porsche_strings
UNWIND range(0,size(porsche_strings)-1) AS idx
WITH CASE
WHEN porsche_strings[idx] = 'porsche' THEN idx + 1
END AS position
RETURN collect(position) AS positions
Another approach for implementing this in plain Cypher:
WITH 'porsche' AS needle, 'what is porsche' AS haystack
WITH needle, split(haystack, ' ') AS words
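// size() and a list comprehension are used here because length() on lists and filter() were removed in Neo4j 4.0
WITH needle, [i IN range(0, size(words)-1) | [i, words[i]]] AS word
WITH [w IN word WHERE w[1] = needle] AS res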
RETURN coalesce(res[0][0], -1)
I need to do a bitwise "and" in a cypher query. It seems that cypher does not support bitwise operations. Any suggestions for alternatives?
This is what I want to detect ...
For example, 268 is (2^8 + 2^3 + 2^2), and as you can see 2^3 = 8 is part of my original number. With a bitwise AND, (100001100) & (1000) = 1000, so this way I can detect whether 8 is part of 268 or not.
How can I do this without bitwise support? Any suggestions? I need to do this in Cypher.
Another way to perform this type of test using cypher would be to convert your decimal values to collections of the decimals that represent the bits that are set.
// convert the binary number to a collection of decimal parts
// create an index the size of the number to convert
// create a collection of decimals that correspond to the bit locations
with '100001100' as number
, [1,2,4,8,16,32,64,128,256,512,1024,2048,4096] as decimals
with number
, range(length(number)-1,0,-1) as index
, decimals[0..length(number)] as decimals
// map the bits to decimal equivalents
unwind index as i
with number, i, (split(number,''))[i] as binary_placeholder, decimals[-i-1] as decimal_placeholder
// multiply the decimal value by the bits that are set
with collect(decimal_placeholder * toInt(binary_placeholder)) as decimal_placeholders
// filter out the zero values from the collection
with filter(d in decimal_placeholders where d > 0) as decimal_placeholders
return decimal_placeholders
For the input above, this returns [4, 8, 256].
Then, when you want to test whether a bit is set, you can just test the corresponding decimal value for presence in the collection.
with [4, 8, 256] as decimal_placeholders
, 8 as decimal_to_test
return
case
when decimal_to_test in decimal_placeholders then
toString(decimal_to_test) + ' value bit is set'
else
toString(decimal_to_test) + ' value bit is NOT set'
end as bit_set_test
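As a cross-check of the approach, here is a small Python sketch (an illustration, not Cypher) that builds the same collection from the binary string and then performs the membership test. The function name set_bit_values is an assumption of this sketch.

```python
def set_bit_values(binary):
    """Return the decimal value of every bit that is set, smallest first."""
    values = []
    # reversed() puts the least significant bit at position 0
    for position, bit in enumerate(reversed(binary)):
        if bit == "1":
            values.append(2 ** position)
    return values


parts = set_bit_values("100001100")  # binary form of 268
print(parts)        # [4, 8, 256]
print(8 in parts)   # True, the 8 bit is set
print(16 in parts)  # False, the 16 bit is not set
```

The membership test gives the same answer as a real bitwise AND would (268 & 8 is non-zero, 268 & 16 is zero).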
Alternatively, if APOC is available, you could use apoc.bitwise.op, which is a wrapper around the Java bitwise operations.
RETURN apoc.bitwise.op(268, "&",8 ) AS `268_AND_8`
Which yields 8.
If you absolutely have to do the operation in Cypher, probably a better solution would be to implement something like @evan's SO solution, "Alternative to bitwise operation using cypher".
You could start by converting your data using cypher that looks something like this...
// convert binary to a product of prime numbers
// start with the number to convert and a collection of primes
with '100001100' as number
, [2,3,5,7,13,17,19,23,29,31,37] as primes
// create an index based on the size of the binary number to convert
// take a slice of the prime array that is the size of the number to convert
with number
, range(length(number)-1,0,-1) as index
, primes[0..length(number)] as primes
// iterate over the index and match the prime number to the bits in the number to convert
unwind index as i
with (split(number,''))[i] as binary_place_holder, primes[-i-1] as prime_place_holder
// collect the primes that are set by multiplying by the set bits
with collect(toInt(binary_place_holder) * prime_place_holder) as prime_placeholders
// filter out the zero bits
with filter(p in prime_placeholders where p > 0) as prime_placeholders
// return a product of primes of the set bits
return prime_placeholders, reduce(pp = 1, p in prime_placeholders | pp * p) as prime_product
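For the input above, the query returns prime_placeholders = [5, 7, 29] and prime_product = 1015. The query could be adapted to update attributes with the prime product.
The conversion breaks down as follows: the set bits 2^2, 2^3 and 2^8 map to the primes 5, 7 and 29, whose product is 1015.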
Then when you want to use it you could use the modulus of the prime number in the location of the bit you want to test.
// test if the fourth bit is set in the decimal 268
// 268 is the equivalent of a prime product of 1015
// a modulus 7 == 0 will indicate the bit is set
with 1015 as prime_product
, [2,3,5,7,13,17,19,23,29,31,37] as primes
, 4 as bit_to_test
with bit_to_test
, prime_product
, primes[bit_to_test-1] as prime
, prime_product % primes[bit_to_test-1] as mod_remains
with
case when mod_remains = 0 then
'bit ' + toString(bit_to_test) + ' set'
else
'bit ' + toString(bit_to_test) + ' NOT set'
end as bit_set
return bit_set
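Here is a short Python sketch of the whole prime-product round trip, for readers who want to check the arithmetic outside Neo4j. It reuses the primes list exactly as given above (the list happens to skip 11, which is harmless as long as every entry is a distinct prime). The function names prime_product and bit_is_set are assumptions of this sketch.

```python
# Same list as in the Cypher above; position 0 belongs to the lowest bit.
PRIMES = [2, 3, 5, 7, 13, 17, 19, 23, 29, 31, 37]


def prime_product(binary):
    """Multiply together the primes that correspond to the set bits."""
    product = 1
    for position, bit in enumerate(reversed(binary)):
        if bit == "1":
            product *= PRIMES[position]
    return product


def bit_is_set(product, bit_to_test):
    """bit_to_test is 1-based, matching the Cypher example."""
    return product % PRIMES[bit_to_test - 1] == 0


encoded = prime_product("100001100")  # binary form of 268
print(encoded)                 # 1015
print(bit_is_set(encoded, 4))  # True  (1015 % 7 == 0)
print(bit_is_set(encoded, 2))  # False (1015 % 3 == 1)
```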
It almost certainly defeats the purpose of choosing a bitwise operation in the first place but if you absolutely needed to AND the two binary numbers in cypher you could do something like this with collections.
with split('100001100', '') as bin_term_1
, split('000001000', '') as bin_term_2
, toString(1) as one
with bin_term_1, bin_term_2, one, range(0,size(bin_term_1)-1,1) as index
unwind index as i
with i, bin_term_1, bin_term_2, one,
case
when (bin_term_1[i] = one) and (bin_term_2[i] = one) then
1
else
0
end as r
return collect(r) as AND
Thanks Dave. I tried your solutions and they all worked. They were a good hint for me to find another approach. This is how I solved it. I used String comparison.
with '100001100' as number , '100000000' as sub_number
with number,sub_number,range(length (number)-1,length (number)-length(sub_number),-1) as tail,length (number)-length(sub_number) as difference
unwind tail as i
with i,sub_number,number, i - length (number) + length (sub_number) as sub_number_position
with sub_number_position, (split(number,''))[i-1] as bit_mask , (split(sub_number,''))[sub_number_position] as sub_bit
with collect(toInt(bit_mask) * toInt(sub_bit)) as result
return result
Obviously the number and sub_number can have different values.
I have an SPSS variable containing lines like:
|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|
Every line starts and ends with a pipe. I need to recode it into boolean variables like the following:
var      var1 var2 var3 var4 var5
|2|4|5|  0    1    0    1    1
I have tried to do it with a loop like:
loop # = 1 to 72.
compute var# = SUBSTR(var,2#,1).
end loop.
exe.
My code won't work with numbers of two or more digits, and it also won't place the values into their respective variables, so I've tried nesting char.substr(var,char.rindex(var,'|') + 1) into another loop, with no luck, because it still won't let me recognize the variable number.
How can I do it?
This looks like a nice job for the DO REPEAT command. However the type conversion is somewhat tricky:
DO REPEAT var#i=var1 TO var72
/i=1 TO 72.
COMPUTE var#i = (CHAR.INDEX(var,CONCAT("|",LTRIM(STRING(i,F2.0)),"|"))>0).
END REPEAT.
Explanation: Let's go from the inside to the outside:
STRING(value,F2.0) converts the numeric value into a string of two characters (with a leading white space where the number consists of just one digit), e.g. 2 -> " 2".
LTRIM() removes the leading whitespaces, e.g. " 2" -> "2".
CONCAT() concatenates strings. In the above code it adds the "|" before and after the number, e.g. "2" -> "|2|"
CHAR.INDEX(stringvar,searchstring) returns the position at which the searchstring was found. It returns 0 if the searchstring wasn't found.
CHAR.INDEX(stringvar,searchstring)>0 returns a boolean value indicating if the searchstring was found or not.
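The reason for wrapping the number in pipes before searching can be seen in a few lines of Python (a sketch for illustration only; the function name flags is made up). Searching for "|1|" instead of "1" prevents the 1 from matching inside 12.

```python
def flags(text, upper):
    """Return one 0/1 flag per number from 1 to upper, using the '|n|' search trick."""
    return [1 if "|{}|".format(i) in text else 0 for i in range(1, upper + 1)]


print(flags("|2|4|5|", 5))  # [0, 1, 0, 1, 1]
print(flags("|12|", 3))     # [0, 0, 0], because neither "|1|" nor "|2|" occurs in "|12|"
```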
It's easier to do the manipulations in Python than native SPSS syntax.
You can use SPSSINC TRANS extension for this purpose.
/* Example data*/.
data list free / TextStr (a99).
begin data.
"|2|3|4|5|6|7|8|10|11|12|13|14|15|16|18|20|21|22|23|24|25|26|27|28|29|"
end data.
/* defining function to achieve task */.
begin program.
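def runTask(x):
    # Parse the pipe-delimited string into a list of integers, skipping empty fields
    numbers = [int(i) for i in x.split("|") if i.strip()]
    # One 0/1 flag per position from 1 up to the largest number present
    answer = [1 if i in numbers else 0 for i in range(1, max(numbers) + 1)]
    return answer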
end program.
/* Run job*/.
spssinc trans result = V1 to V30 type=0 /formula "runTask(TextStr)".
exe.
I'm trying to count the number of times that " --" occurs in a string.
So for instance, it occurs twice here 'a --b --c'
I tried the following, but it gives me 4 instead of 2, any idea why?
argv='a --b --c'
count = 0
for i in string.gfind(argv, " --") do
count = count + 1
end
print(count)
You can actually do this as a one-liner using string.gsub:
local _, count = string.gsub(argv, " %-%-", "")
print(count)
No looping required!
Not recommended for large inputs, because gsub also returns the processed string, which is assigned to _ and held in memory until that variable goes out of scope.
This snippet could be helpful, based on the response of Mike Corcoran and the optimisation suggestion of WD40:
function count(base, pattern)
return select(2, string.gsub(base, pattern, ""))
end
print(count('Hello World', 'l'))
The - character has special meaning in patterns, used for a non-greedy repetition.
You need to escape it, i.e. use the pattern " %-%-".
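To see where the 4 comes from: in the unescaped pattern " --", the first "-" acts as a lazy "zero or more" quantifier on the space, so the pattern really means "as few spaces as possible, then one hyphen". Lua patterns are not regular expressions, so the Python sketch below is only an approximation, with " *?-" standing in for the unescaped Lua pattern.

```python
import re

argv = "a --b --c"

# Rough regex equivalent of the unescaped Lua pattern " --":
# lazily match zero or more spaces, then a single literal hyphen.
unescaped = re.findall(r" *?-", argv)

# Equivalent of the escaped Lua pattern " %-%-": a literal space and two hyphens.
escaped = re.findall(r" --", argv)

print(unescaped)       # [' -', '-', ' -', '-']
print(len(unescaped))  # 4
print(len(escaped))    # 2
```

Every hyphen ends up being its own match, which is why the loop counted 4.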
Let's say I have a dictionary of names (a huge CSV file). I want to guess a name from an email that has no obvious parsable points (., -, _). I want to do something like this:
dict = ["sam", "joe", "john", "parker", "jane", "smith", "doe"]
word = "johnsmith"
x = 0
y = word.length-1
name_array = []
for i in x..y
match_me = word[x..i]
dict.each do |name|
if match_me == name
name_array << name
end
end
end
name_array
# => ["john"]
Not bad, but I want "John Smith" or ["john", "smith"]
In other words, I recursively loop through the word (i.e., unparsed email string, "johndoe@gmail.com") until I find a match within the dictionary. I know: this is incredibly inefficient. If there's a much easier way of doing this, I'm all ears!
If there's no better way of doing it, then show me how to fix the example above, for it suffers from two major flaws: (1) how do I set the length of the loop (see problem of finding "i" below), and (2) how do I increment "x" in the example above so that I can cycle through all possible character combinations given an arbitrary string?
Problem of finding the length of the loop, "i":
for an arbitrary word, how can we derive "i" given the pattern below?
for a (i = 1)
a
for ab (i = 3)
a
ab
b
for abc (i = 6)
a
ab
abc
b
bc
c
for abcd (i = 10)
a
ab
abc
abcd
b
bc
bcd
c
cd
d
for abcde (i = 15)
a
ab
abc
abcd
abcde
b
bc
bcd
bcde
c
cd
cde
d
de
e
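For what it's worth, the counts listed above (1, 3, 6, 10, 15) are the triangular numbers: a word of length n has n(n+1)/2 contiguous substrings. The Python sketch below enumerates them with two nested ranges and checks the count against that formula. The function name substrings is an assumption of this sketch.

```python
def substrings(word):
    """List every contiguous substring, ordered by start position and then by length."""
    return [
        word[start:end]
        for start in range(len(word))
        for end in range(start + 1, len(word) + 1)
    ]


for w in ["a", "ab", "abc", "abcd", "abcde"]:
    n = len(w)
    # The number of substrings matches the closed form n(n+1)/2
    assert len(substrings(w)) == n * (n + 1) // 2

print(substrings("abc"))             # ['a', 'ab', 'abc', 'b', 'bc', 'c']
print(len(substrings("johnsmith")))  # 45
```

The outer range plays the role of "x" and the inner one the role of "i" in the question's loop.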
r = /^(#{Regexp.union(dict)})(#{Regexp.union(dict)})$/
word.match(r)
=> #<MatchData "johnsmith" 1:"john" 2:"smith">
The regex might take some time to build, but it's blazing fast.
I dare suggest a brute-force solution that is not very elegant but still useful in case:
you have a large number of items (building a regexp can be a pain),
the string to analyse is not limited to two components,
you want to get all splittings of a string,
you want only complete analyses of a string, that span from ^ to $.
Because of my poor English, I could not figure out a long personal name that can be split in more than one way, so let's analyse a phrase:
word = "godisnowhere"
The dictionary:
@dict = [ "god", "is", "now", "here", "nowhere", "no", "where" ]
@lengths = @dict.collect {|w| w.length }.uniq.sort
The array @lengths adds a slight optimization to the algorithm: we use it to prune subwords of lengths that don't exist in the dictionary without actually performing a dictionary lookup. The array is sorted, which is another optimization.
The main part of the solution is a recursive function that finds the initial subword in a given word and restarts for the tail subword.
def find_head_substring(word)
  # boundary condition:
  # remaining subword is shorter than the shortest word in @dict
  return [] if word.length < @lengths[0]
  splittings = []
  @lengths.each do |len|
    break if len > word.length
    head = word[0,len]
    if @dict.include?(head)
      tail = word[len..-1]
      if tail.length == 0
        splittings << head
      else
        tails = find_head_substring(tail)
        unless tails.empty?
          tails.collect!{|tail| "#{head} #{tail}" }
          splittings.concat tails
        end
      end
    end
  end
  return splittings
end
Now see how it works
find_head_substring(word)
=>["god is no where", "god is now here", "god is nowhere"]
I have not tested it extensively, so I apologize in advance :)
If you just want the hits of matches in your dictionary:
dict.select{ |r| word[/#{r}/] }
=> ["john", "smith"]
You run a risk of too many confusing subhits, so you might want to sort your dictionary so longer names are first:
dict.sort_by{ |w| -w.size }.select{ |r| word[/#{r}/] }
=> ["smith", "john"]
You will still encounter situations where a longer name has a shorter substring following it and get multiple hits so you'll need to figure out a way to weed those out. You could have an array of first names, and another of last names, and take the first returned result of scanning for each, but given the diversity of first and last names, that doesn't guarantee 100% accuracy, and will still gather some bad results.
This sort of problem has no real good solution without further hints to the code about the person's name. Perhaps scanning the body of the message for salutation or valediction sections will help.
I'm not sure what you're doing with i, but isn't it as simple as:
dict.each do |first|
dict.each do |last|
puts first,last if first+last == word
end
end
This one bags all occurrences, not necessarily exactly two:
pattern = Regexp.union(dict)
matches = []
while match = word.match(pattern)
matches << match.to_s # Or just leave off to_s to keep the match itself
word = match.post_match
end
matches