Character Replacements - google-sheets

I have a Unicode string, UniStr.
I also have a MAP of { UnicodeChar : otherMappedStrs }.
I need the 'otherMappedStrs' version of UniStr.
E.g.: UniStr = 'ABC', MAP = { 'A':'233','B':'#$','C':'9ij' }, Result = '233#$9ij'
I have come up with the formula below, which works:
=ArrayFormula(JOIN("",VLOOKUP(REGEXEXTRACT(A1,REPT("(.)",LEN(A1))),MapRange,2,FALSE)))
The MAP covers a whole character set (40 characters), so it is quite large.
I need to use this function in multiple spreadsheets. How can I subsume the MAP into the formula for portability?
Is there a better way to iterate over a string in a formula than the REGEXEXTRACT method? This method has limitations for long strings.
I also tested the formula below. The problem is that it returns 2 results (one per element of the array passed to SUBSTITUTE); with 3 substitutions it returns three results. Can this be resolved?
=ArrayFormula(SUBSTITUTE(A1,{"s","i"},{"#","#"}))
EDIT:
@Tom's first solution appears best for my case: (1) REGEX has an upper limit on search criteria, which is not a hindrance in your solution; (2) it feels fast (I did not do empirical testing); (3) I believe this is a better way to iterate over string characters (you answered my Q2, thanks).
I digress here. I wish Google would introduce Named Formulas or Formula Aliases, hypothetically like the one below. I have sent feedback along those lines many times. Nothing :(
MyFormula($str) == ArrayFormula(join(,vlookup(mid($str,row(indirect("1:"&len($str))),1), { "A","233";"B","#$";"C","9ij" },2,false)))

Not sure how long you want your strings to be, but the more traditional
=ArrayFormula(join(,vlookup(mid(A1,row(indirect("1:"&len(A1))),1), { "A","233";"B","#$";"C","9ij" },2,false)))
seems a bit more robust for long strings.
For a more radical idea, supposing the maximum length of your otherMappedStrings is 3 characters, then you could try:
=ArrayFormula(join(,trim(mid("233 #$9ij",find(mid(A1,row(indirect("1:"&len(A1))),1), "ABC")*3-2,3))))
where I have put a space in before #$ to pad it out to 3 characters.
Incidentally the original VLOOKUP is not case sensitive. If you want this behaviour, use SEARCH instead of FIND.
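To make the two lookup strategies above easier to compare, here is a minimal Python sketch of the same logic outside of Sheets. The names `MAPPING`, `translate`, and `translate_packed` are hypothetical and chosen only for illustration; the packed string and the width of 3 come from the example formula. Unlike VLOOKUP, the dictionary lookup here is case sensitive, and `strip()` only approximates TRIM.

```python
MAPPING = {"A": "233", "B": "#$", "C": "9ij"}


def translate(text, mapping):
    """Per-character dictionary lookup, analogous to MID + VLOOKUP + JOIN."""
    # Raises KeyError for an unmapped character (VLOOKUP would give #N/A).
    return "".join(mapping[ch] for ch in text)


def translate_packed(text, keys="ABC", packed="233 #$9ij", width=3):
    """Fixed-width lookup, analogous to FIND + MID + TRIM.

    Every replacement is padded to `width` characters inside `packed`,
    so the slot for a key starts at (index of key) * width.
    """
    pieces = []
    for ch in text:
        start = keys.index(ch) * width  # 0-based, unlike the 1-based formula
        pieces.append(packed[start:start + width].strip())
    return "".join(pieces)
```

Calling `translate("ABC", MAPPING)` and `translate_packed("ABC")` should both give `233#$9ij`. The packed variant avoids a separate lookup table but only works when no replacement needs leading or trailing spaces.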

You seem to have several different Qs, but considering only portability, perhaps something like the following would help:
=join(,switch(arrayformula(regexextract(A1&"",rept("(.)",len(A1)))),"A",233,"B","#$","C","9ij"))
extended with 37 more pairs.

Related

Finding duplicate cases, string-variable, SPSS

Being a novice with SPSS, I am struggling to find duplicate cases based on a string variable in a dataset containing approx. 33,000 cases.
I have a variable named "nr" that is supposed to be a unique ID for every case. However, it turns out that some cases have two different values of "nr" entered, the only difference being the last character. As a result, one case shows up as two separate rows.
The structure of the variable "nr" is as follows: XX-XXXXXXX-X or X-XXXXXXX-X, i.e. 2-7-1 characters or 1-7-1 characters.
I would like to pick out all cases that have a "nr" equal to that of another case except for the last character.
To illustrate, with a successful syntax I would hopefully be able to pick cases like these out of the whole dataset:
20-4026988-2
20-4026988-3
5-4026992-5
5-4026992-8
20-4027281-2
20-4027281-3
Anyone have an idea on how to make a syntax for this? Would be so grateful for any input!
I suggest to create a new variable without that last character, and then look for the doubles:
* first creating some sample data to play with.
data list list/ID (a15).
begin data.
20-4026988-2
12-2345678-7
20-4026988-3
5-4026992-5
5-4026992-8
12-1234567-1
20-4027281-2
6-1234567-1
20-4027281-3
end data.
* now creating the new variable and counting the occurrences of each shortened ID.
string ShortID (a15).
compute ShortID=char.substr(ID,1,char.rindex(ID,"-")).
* also possible: compute ShortID=char.substr(ID,1,char.length(rtrim(ID))-1).
aggregate out=* mode=add /break=ShortID/occurrences=n.
* at this point you can filter based on the number of `occurrences` or sort them.
sort cases by occurrences (d) ShortID.
After removing the last character, you can use Data > Identify Duplicate Cases to find the dups. It has a number of useful options for this.
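For readers who want to check the idea outside SPSS, the same "shorten, then count" approach can be sketched in Python. This is only an illustration under the assumption that every ID contains at least one hyphen; `find_near_duplicates` is a hypothetical name, not part of any SPSS or Python library.

```python
from collections import defaultdict


def find_near_duplicates(ids):
    """Group IDs that are identical up to and including the last hyphen."""
    groups = defaultdict(list)
    for raw in ids:
        case_id = raw.strip()
        # Keep everything through the last "-", like char.rindex in the syntax above.
        short_id = case_id[: case_id.rfind("-") + 1]
        groups[short_id].append(case_id)
    # Only shortened IDs that occur more than once are of interest.
    return {short: full for short, full in groups.items() if len(full) > 1}


sample = [
    "20-4026988-2", "12-2345678-7", "20-4026988-3",
    "5-4026992-5", "5-4026992-8", "12-1234567-1",
    "20-4027281-2", "6-1234567-1", "20-4027281-3",
]
duplicates = find_near_duplicates(sample)
```

With the sample data from the answer, `duplicates` should contain the three shortened IDs `20-4026988-`, `5-4026992-`, and `20-4027281-`, each with two full IDs.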

Regex that finds a line with exactly 3 words in it

I have a problem that requires me to write a regex that finds a line containing exactly 3 groups of characters (words or numbers) and ending with another specific word. The approach I had in mind was to find a pattern that ends in a space and look for it 3 times. Assuming this is the correct way to go about it, I do not know how to find a space, but I thought it would look like .*"find a space"{3} endword$. Is this the way it would be done? Even if it is not, how do you find a space? Any suggestions?
Assuming by three groups of words you would accept any non-space character, you could write:
/^\s*(?:\S+\s+){3}endword$/
The initial caret is to make sure you have exactly 3 non-space groups on the line.
Of course you need to consider whether things like control characters could appear, and adjust accordingly.
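A quick way to sanity-check the pattern above is to run it against a few lines. The sketch below uses Python's `re` module, which is an assumption since the question does not name a regex flavor; `has_three_groups` is a hypothetical helper name.

```python
import re

# Same pattern as above, without the surrounding slashes.
PATTERN = re.compile(r"^\s*(?:\S+\s+){3}endword$")


def has_three_groups(line):
    """Return True if the line is exactly three non-space groups plus 'endword'."""
    return PATTERN.search(line) is not None
```

With this helper, `"one two three endword"` should match, while `"one two endword"` and `"one two three four endword"` should not, because `\S+\s+` can only consume a whole whitespace-delimited group.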
Depending on your flavor, something like the below would do it:
\b+.+?\b+.+?\b+.+?\bendword$
This makes use of the word boundary mark (\b) and non-greedy repetitions (+?), so it may be slightly different in your specific implementation, especially if you're using something old like grep.

How to get a % difference of two NSStrings

I'm thinking this may be impossible to do reasonably, but I figured I would take a shot at it. So let's say I have two NSStrings. One is @"Singin' In The Rain" and the other is @"Singing In The Rain". These strings are very similar, but have a small difference. I'm trying to find a way where I could write something like the following:
NSString *stringOne = @"Singin' In The Rain";
NSString *stringTwo = @"Singing In The Rain";
float dif = [stringOne differenceFrom:stringTwo];
//dif = .9634 or something like that
One project that I did find similar to this was taken from the previous similar question on Stack Overflow: Check if two NSStrings are similar. However, this simply returns a BOOL which isn't as accurate as I need it to be. I also tried looking into the compare: documentation for NSString but it all looked too basic. Another similar thing I found is at https://gist.github.com/iloveitaly/1515464. However, this gives varying results, even saying two of the same string are different occasionally. Any advice would be much appreciated.
The question is a little vague, but I would assume that the most satisfactory results will come from using NSLinguisticTagger. If you parse each for tags with the NSLinguisticTagSchemeLexicalClass scheme then your string will be broken down into verbs, nouns, adjectives, etc. In your example, even if you weren't spotting that singin' and singing are the same, you'd spot the other three words are the same and that the thing at the end is a noun, so they're both about doing something in the same thing.
It'd probably be wise to use something like a BK-Tree to compare individual words where you suspect there may be a match (a noun obviously doesn't match an adverb but two nouns may match even if spellings differ).
Another off the wall suggestion:
The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.
When comparing text strings for "closeness" then the insertion, deletion or changing of words seems as good a measure as any.
So:
Break each string into "words" (white space separated should be sufficient).
Compare the two lists using the diff algorithm, treating each "word" as a "line", use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync)
Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.
For the two example strings this would give 1:4 changes or 75% similar.
If you want greater granularity for each change split the two words into characters and repeat the algorithm giving you a fraction the word is similar by (as opposed to the whole word).
For the two example strings this would give 3 6/7 words out of 4, or 96% similar.
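The word-level comparison with per-character partial credit described above can be sketched in Python. Note the assumption: this uses `difflib.SequenceMatcher` from the standard library as a stand-in for the diff algorithm, so it is not the Unix diff implementation itself, and `word_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Fraction of words that match, with partial credit for changed words."""
    words_a, words_b = a.split(), b.split()
    total = max(len(words_a), len(words_b))
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    score = 0.0
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            score += i2 - i1  # each unchanged word counts fully
        elif tag == "replace":
            # Pair changed words by position and compare them character-wise.
            for wa, wb in zip(words_a[i1:i2], words_b[j1:j2]):
                score += SequenceMatcher(None, wa, wb, autojunk=False).ratio()
        # Inserted or deleted words contribute nothing to the score.
    return score / total
```

For the two example strings this should give (3 + 6/7) / 4, or roughly 0.964, matching the figure in the answer. Pairing changed words by position is a simplification; a fuller version might try every pairing within a changed run.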
I'd recommend dynamic time warping for such comparisons:
http://en.wikipedia.org/wiki/Dynamic_time_warping
This will, however, return the distance between the two strings (so you'll get 0 for identical strings), but this is the best starting point I can think of.

Searching a 2D array with ArrayLib does not work for some cases

I have a 2D array in which the second column holds the domain names of some emails; let us call the array myData[][]. I decided to use ArrayLib to search the second column for a specific domain.
ArrayLib.indexOf(myData, 1, domain)
Here is where I found an issue. In the myData array, one of the domains looks like this: "ewmining.com" (pay attention to the w).
While searching for "e.mining.com" (notice the first dot), the indexOf() function actually gave me the row containing "ewmining.com".
This is what is in the array: "ewmining.com"
This is what is in the search string: "e.mining.com"
It seems that ArrayLib treats the dot as meaning any character. Is this supposed to be the correct behavior? Is there a way to stop this behavior and search for an exact match?
I really need help on this issue.
Thanks in advance for your help.
The dot usually represents "any character" in regular expressions. I am not familiar with ArrayLib, but maybe you should look for a way to turn off regular expressions when searching. Otherwise you might have to escape the dot, for example search for e[.]mining[.]com
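The effect described above is easy to reproduce in any regex engine. The following Python sketch is an illustration only: it does not use ArrayLib, and whether ArrayLib treats the search term as a regular expression internally is an assumption taken from the answer. `find_regex` is a hypothetical helper.

```python
import re

domains = ["example.com", "ewmining.com", "e.mining.com"]


def find_regex(rows, pattern):
    """Return the indexes of rows that match the pattern in full."""
    return [i for i, value in enumerate(rows) if re.fullmatch(pattern, value)]


# Unescaped: each dot matches any character, so "ewmining.com" matches too.
unescaped = find_regex(domains, "e.mining.com")

# Escaped: re.escape turns each dot into a literal dot.
escaped = find_regex(domains, re.escape("e.mining.com"))

# Plain string equality avoids regular expressions altogether.
exact = [i for i, value in enumerate(domains) if value == "e.mining.com"]
```

Here `unescaped` should be `[1, 2]`, while `escaped` and `exact` should both be `[2]`. If the library offers no way to disable pattern matching, comparing the column values with plain equality in a loop is a simple fallback.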

Ruby compare two strings similarity percentage

I'd like to compare two strings in Ruby and find their similarity.
I've had a look at the Levenshtein gem, but it seems it was last updated in 2008 and I can't find documentation on how to use it. Some blogs suggest it's broken.
I tried the text gem with Levenshtein, but it gives an integer (smaller is better).
Obviously, if the two strings are of different lengths, I run into problems with the Levenshtein algorithm (say, comparing two names where one has a middle name and one doesn't).
What would you suggest I do to get a percentage comparison?
Edit: I'm looking for something similar to PHP's similar_text.
I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):
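def string_difference_percent(a, b)
  longer = [a.size, b.size].max
  # Count positions where both strings have the same character.
  same = a.each_char.zip(b.each_char).count { |x, y| x == y }
  # Fraction of the longer string that differs (0.0 means identical).
  (longer - same) / longer.to_f
end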
I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.
It's a bit like Levenshtein distance, in that it compares the strings character by character. So if two names differ only by the middle name, they'll actually be very different.
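Since the question mentions getting only an integer from Levenshtein, one common way to turn that distance into a percentage is to divide by the length of the longer string. The Python sketch below illustrates that normalization; it is one convention among several, it is not what PHP's similar_text computes, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Similarity as a percentage of the longer string's length."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longer)
```

For example, "kitten" and "sitting" have an edit distance of 3, so the similarity comes out at about 57%. Unlike the positional comparison above, an inserted middle name costs only as many edits as it has characters, rather than shifting every following character out of alignment.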
There is now a ruby gem for similar_text. https://rubygems.org/gems/similar_text
It provides a similar method that compares two strings and returns a number representing the percent similarity between the two strings.
I can recommend the fuzzy-string-match gem.
You can use it like this (taken from the docs):
require "fuzzystringmatch"
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
p jarow.getDistance("jones", "johnson")
It will return a score of ~0.832, which indicates how well the two strings match.
