rangeOfComposedCharacterSequencesForRange transforming 0 character range into 1 character range - ios

I do not understand why in the following code, the extended range that will be printed is
"location: 1, length: 1" . Why was the range length extended from 0 to 1?
NSString *text = @"abc";
NSRange range = NSMakeRange(1, 0);
NSRange extendedRange = [text rangeOfComposedCharacterSequencesForRange:range];
NSLog(@"extended range: location %lu, length: %lu", (unsigned long)extendedRange.location, (unsigned long)extendedRange.length);
The doc says that the result of this is:
The range in the receiver that includes the composed character
sequences in range.
with the following discussion
This method provides a convenient way to grow a range to include all composed character sequences it overlaps.
But the text @"abc" does not contain any composed characters, which makes me think the result should be the same range, unmodified. In any case, I would expect a range of length 0 not to overlap any character.
This looks like a bug to me, but I might have missed something. Is that normal?

It's probably a bug.
The implementation of rangeOfComposedCharacterSequencesForRange: just calls rangeOfComposedCharacterSequenceAtIndex: twice, with the start and end indexes of the range, and returns the combined range.
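For illustration, here is a minimal sketch of that presumed behavior for the zero-length range in the question (an assumption inferred from the observed output, not Apple's actual source):
NSString *text = @"abc";
NSRange range = NSMakeRange(1, 0);
// Both the start index (1) and the end index (NSMaxRange(range) == 1)
// resolve to the composed character sequence at index 1.
NSRange startSeq = [text rangeOfComposedCharacterSequenceAtIndex:range.location];
NSRange endSeq = [text rangeOfComposedCharacterSequenceAtIndex:NSMaxRange(range)];
NSRange combined = NSUnionRange(startSeq, endSeq); // {1, 1}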
The documentation does not explicitly state whether the characters at the edges of the provided range are included, but I agree that the observed behavior feels wrong.
You should file a bug.

The documentation of rangeOfComposedCharacterSequencesForRange: says:
Return Value
The range in the receiver that includes the composed character sequences in range.
Discussion
This method provides a convenient way to grow a range to include all composed character sequences it overlaps.
Since the location is a valid character index, the method considers the zero-length range to overlap the character at that location.
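To illustrate that reading, a small check (the expected output follows from the behavior described in the question):
NSString *text = @"abc";
for (NSUInteger i = 0; i < text.length; i++) {
    NSRange r = [text rangeOfComposedCharacterSequencesForRange:NSMakeRange(i, 0)];
    // Each zero-length range is widened to {i, 1}, covering the
    // character at index i.
    NSLog(@"index %lu -> {%lu, %lu}", (unsigned long)i, (unsigned long)r.location, (unsigned long)r.length);
}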

How to specify a range in Ruby

I've been looking for a good way to check whether a string of characters is all numbers. I thought there might be a way of specifying a range from 0 to 9 and seeing if each character is included in it, but everything I've looked up online has really confused me.
def validate_pin(pin)
(pin.length == 4 || pin.length == 6) && pin.count("0-9") == pin.length
end
The code above is someone else's work and I've been trying to identify how it works. It's a pin checker - takes in a set of characters and ensures the string is either 4 or 6 digits and all numbers - but how does the range work?
When I did this problem I tried to use to_a, Integer, and a bunch of other things, including ranges such as (0..9) and ("0".."9"), to validate that a character is an integer. When I saw "0-9" it confused the heck out of me, and half an hour of Googling and YouTube has only left me with regex tutorials (which I'm interested in, but currently I'm just trying to get the basics down).
So to sum this up, my goal is to understand a more semantic/concise way to identify whether a character is an integer, whatever the simplest way is. Any and all feedback is welcome. I am a new Rubyist trying to get my fundamentals down. Thank you.
Regex really is the right way to do this. It's specifically for testing patterns in strings. This is how you'd test "do all characters in this string fall in the range of characters 0-9?":
pin.match(/\A[0-9]+\z/)
This regex says "Does this string start and end with at least one of the characters 0-9, with nothing else in between?" The \A and \z are start-of-string and end-of-string anchors, and [0-9]+ matches one or more characters in that range.
You could even do your entire check in one line of regex:
pin.match(/\A([0-9]{4}|[0-9]{6})\z/)
Which says "Does this string consist of the characters 0-9 repeated exactly 4 times, or the characters 0-9, repeated exactly 6 times?"
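Wrapping that pattern in the original validate_pin signature gives a one-line alternative (a sketch; match returns MatchData or nil, so !! converts the result to a boolean):
def validate_pin(pin)
  # true only for strings of exactly 4 or exactly 6 digits
  !!pin.match(/\A([0-9]{4}|[0-9]{6})\z/)
end

validate_pin("1234")    #=> true
validate_pin("123456")  #=> true
validate_pin("12345")   #=> false
validate_pin("12a4")    #=> false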
Ruby's String#count method does something similar to this, though it just counts the number of occurrences of the characters passed, and it uses something similar to regex ranges to allow you to specify character ranges.
The sequence c1-c2 means all characters between c1 and c2.
Thus, it expands the parameter "0-9" into the list of characters "0123456789", then counts how many characters in the string appear in that list.
This will work to verify that a certain number of numbers exist in the string, and the length checks let you implicitly test that no other characters exist in the string. However, regexes let you assert that directly, by ensuring that the whole string matches a given pattern, including length constraints.
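For example (results shown as comments):
"a1b2c3".count("0-9")  #=> 3
"123456".count("0-9")  #=> 6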
Count everything in pin that is not a digit and check whether that count is zero:
pin.count("^0-9").zero?
Since you seem to be looking for answers outside regex and since Chris already spelled out how the count method was being implemented in the example above, I'll try to add one more idea for testing whether a string is an Integer or not:
pin.to_i.to_s == pin
What we're doing is converting the string to an integer, converting that result back to a string, and then testing whether anything changed in the process. If the result is true, then nothing changed during the conversion, and therefore the string contains only an integer.
EDIT:
The example above only works if the entire string is an integer, and it won't properly deal with leading zeros. If you want to check that each and every character is a digit, do something like this instead:
("1" + pin).to_i.to_s[1..-1] == pin
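A quick illustration of the leading-zero problem and the fix (results as comments):
pin = "0123"
pin.to_i.to_s == pin                 #=> false ("0123".to_i.to_s is "123")
("1" + pin).to_i.to_s[1..-1] == pin  #=> true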
Part of the question seems to be exactly HOW the following portion of code is doing its job:
pin.count("0-9")
This piece of the code is simply returning a count of how many instances of the numbers 0 through 9 exist in the string. That's only one piece of the relevant section of code though. You need to look at the rest of the line to make sense of it:
pin.count("0-9") == pin.length
The first part counts how many digit characters there are, and the second part compares that count to the length of the string. If they are equal (==), then every character in the string is a digit.
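For example, with a string containing one non-digit:
pin = "12a4"
pin.count("0-9")                #=> 3
pin.length                      #=> 4
pin.count("0-9") == pin.length  #=> false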
Sometimes negation can be used to advantage:
!pin.match?(/\D/) && [4,6].include?(pin.length)
pin.match?(/\D/) returns true if the string contains a character other than a digit (matching /\D/), in which case it would be negated to false.
One advantage of using negation here is that pin.match?(/\D/) returns true as soon as a non-digit is found, as opposed to methods that examine all the characters in the string.
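Putting it together (results as comments):
pin = "12a456"
!pin.match?(/\D/) && [4, 6].include?(pin.length)  #=> false, stops at "a"

pin = "123456"
!pin.match?(/\D/) && [4, 6].include?(pin.length)  #=> true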

Incorrect implementation of String.removeSubrange?

I came across a weird behaviour with the String.removeSubrange function.
This is what the documentation says:
Removes the characters in the given range.
Parameters
bounds: The range of the elements to remove. The upper and lower bounds of bounds must be valid indices of the string and not equal to the string's end index.
The documentation already states that the range cannot include the endIndex of the string, but I think that should be changed.
Let's look at an example of why.
I have a string "12345" and I want to remove the first three characters which would result in "45".
The code for that is the following:
// Remove the characters 123
var str = "12345"
let endRemoveIndex = str.index(str.startIndex, offsetBy: 2)
str.removeSubrange(str.startIndex...endRemoveIndex)
So far so good: I just create a closed range from the startIndex to the startIndex advanced by 2.
Let's say I want to remove the characters "345". I would expect the following code to work:
str = "12345"
let startRemoveIndex = endRemoveIndex
str.removeSubrange(startRemoveIndex...str.endIndex)
However, this does not work, as the documentation already indicates.
It results in a fatal error saying:
Can't advance past endIndex
The code that works for removing the last three characters is the following:
// Remove the characters 345
str = "12345"
let startRemoveIndex = endRemoveIndex
str.removeSubrange(startRemoveIndex..<str.endIndex)
In my opinion that is syntactically incorrect, because the half-open range operator implies that the upper bound will not be included, but in this case it is.
What do you think about that?
Hamish pointed out that String.endIndex is a "past the end" position, i.e. the position one greater than the last valid subscript argument.
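In other words, a half-open range up to endIndex already covers everything through the last character:
var str = "12345"
let startRemoveIndex = str.index(str.startIndex, offsetBy: 2)
// endIndex is "past the end", so ..<str.endIndex reaches through
// the final character without ever subscripting endIndex itself.
str.removeSubrange(startRemoveIndex..<str.endIndex)
print(str) // "12"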

Only 2 emoji return an incorrect length when compared against a character set containing them

let myString = "☺️"
let emoji = "😀😁😂😃😄😅😆😇😈👿😉😊☺️😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷🙂🙃🙄🤔🙁☹️🤒🤕🤑🤓🤗🤐🤠🤤🤥🤧🤢🤡🤣"
let characterSet = CharacterSet(charactersIn: emoji)
let range = (myString as NSString).rangeOfCharacter(from: characterSet)
let substring = (myString as NSString).substring(with: range)
(range as NSRange).location
(range as NSRange).length
(myString as NSString).length
substring == myString
This code can be run in Playgrounds. Try changing myString to any emoji face.
I'm using NSString and NSRange here as their values are easier to demonstrate, but this has the exact same behaviour with a Swift String or Range.
When I set myString to most of the face emojis, the range comes back with a length of 2, and the substring can be used appropriately elsewhere. With only 2 face emojis, the "smiling face" emoji and the "frowning face" emoji, the range comes back with a length of 1. In all cases, the length of the string comes back as 2. The substring taken with that range of length 1 is incomplete, and comparing it back to myString (i.e. comparing the string to itself) gives false. The result for the range of those 2 emojis should be 2.
Interestingly, looking at the Unicode spec, those 2 emojis have vastly different Unicode values from their neighbours.
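Decomposing the scalars in a Playground shows what sets them apart: the two problem emoji are a BMP scalar followed by a variation selector, while their neighbours are single supplementary-plane scalars.
let smiling = "☺️"   // U+263A WHITE SMILING FACE + U+FE0F VARIATION SELECTOR-16
let grinning = "😀"  // U+1F600, a single scalar (a surrogate pair in UTF-16)
print(smiling.unicodeScalars.map { String(format: "U+%04X", $0.value) })
// ["U+263A", "U+FE0F"]
print(grinning.unicodeScalars.map { String(format: "U+%04X", $0.value) })
// ["U+1F600"]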
This seems like it may be an iOS bug. I can't think of anything I could be personally doing incorrectly here, as it works with all other emoji.
Hardly an answer, but too much to fit into a comment, so bear with me :)
I don't know if you've already seen this but I think your problem is addressed in the Platform State of the Union talk from WWDC 2017 (https://developer.apple.com/videos/play/wwdc2017/102/) in the section about what is new in Swift 4.
If you look at the video at about the 23 minutes 12 seconds mark, you'll see Ted Kremenek talk about how they've fixed separating Unicode characters out as expected in Swift 4, using Unicode 9 grapheme breaking.
Also, have a look at this question and answer.
Yes...Don't ask me in detail what all this means, but it seems as if they're working on it :)

Under what conditions can [NSEvent characters] be a NSString of length greater than 1?

NSEvent has a characters property which is a NSString valid for key up/down events. Under what conditions can the string length be greater than 1?
The only condition I have been able to find till now is when the NSEvent corresponds to input from an IME (Input Method Editor).
Edit - I knew about the surrogate pair case, but it somehow slipped my mind while asking this. I am more interested in the case where the number of graphemes (characters) is itself greater than 1.
Under what conditions can the string length be greater than 1?
When you have a keyboard/input method which can input any single character which requires a surrogate pair in UTF-16, e.g. a 𐀀 (Unicode Linear B Syllable B008 A), then the length will be 2. This is because length returns the number of 16-bit code units, not the number of characters.
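For example (the length, i.e. the number of UTF-16 code units, is shown as a comment):
NSString *linearB = @"\U00010000"; // U+10000, Linear B Syllable B008 A
NSLog(@"%lu", (unsigned long)linearB.length); // 2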
You can also get this with programmatically-posted events. CGEventKeyboardSetUnicodeString() allows the caller to attach any arbitrary string to the key event.
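A minimal sketch of that route (requires Core Graphics; event posting and the accessibility permission it needs are omitted, and the surrogate pair is an arbitrary example):
CGEventRef event = CGEventCreateKeyboardEvent(NULL, 0, true);
UniChar chars[] = { 0xD800, 0xDC00 }; // surrogate pair for U+10000
CGEventKeyboardSetUnicodeString(event, 2, chars);
// Once delivered, -[NSEvent characters] for this event has length 2.
CFRelease(event);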
High Unicode code points are encoded as a sequence of UTF-16 code units in Mac OS X. Try 𫝑.

Understanding the Use of invertedSet method of NSCharacterSet

So as I work my way through understanding string methods, I came across this useful class
NSCharacterSet
which is defined quite well in this post as being similar to a string, except that it holds characters in an unordered set:
What is the difference between NSString and NSCharacterSet?
So then I came across the useful method invertedSet, and it became a little less clear what was happening exactly. Also, as I read page after page about it, they all sort of glossed over the basics and jumped into advanced explanations. So if you wanted to know what this is and why we use it, SIMPLY put, it was not so easy; instead you get statements like this from the Apple documentation: "A character set containing only characters that don't exist in the receiver." - and how do I use this exactly???
So here is what I understand to be the use. PLEASE correct me in simple terms if I have explained this incorrectly.
Example Use:
Create a list of characters in an NSCharacterSet that you want to limit a string to contain.
NSString *validNumberChars = @"0123456789"; // Only these are valid.
// Now assign to an NSCharacterSet object to use for searching and comparing later.
NSCharacterSet *validCharSet = [NSCharacterSet characterSetWithCharactersInString:validNumberChars];
// Now create an inverted set of the validCharSet.
NSCharacterSet *invertedValidCharSet = [validCharSet invertedSet];
// Now scrub your input string of bad characters, i.e. those not in the validCharSet.
NSString *scrubbedString = [inputString stringByTrimmingCharactersInSet:invertedValidCharSet];
// By passing invertedValidCharSet as the characters to trim, you are left with only characters from the original set, captured here in scrubbedString.
So is this how to use this feature properly, or did I miss anything?
Thanks
Steve
A character set is just that: a set of characters. When you invert a character set you get a new set that has every character except those from the original set.
In your example you start with a character set containing the 10 standard digits. When you invert the set you get a set that has every character except the 10 digits.
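A minimal membership check makes the relationship concrete (characterIsMember: takes a single unichar; results as comments):
NSCharacterSet *digits = [NSCharacterSet characterSetWithCharactersInString:@"0123456789"];
NSCharacterSet *nonDigits = [digits invertedSet];
NSLog(@"%d", [digits characterIsMember:'7']);    // 1
NSLog(@"%d", [nonDigits characterIsMember:'7']); // 0
NSLog(@"%d", [nonDigits characterIsMember:'x']); // 1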
validCharSet = [NSCharacterSet characterSetWithCharactersInString:validNumberChars];
This creates a character set containing the 10 characters 0, 1, ..., 9.
invertedValidCharSet = [validCharSet invertedSet];
This creates the inverted character set, i.e. the set of all Unicode characters without
the 10 characters from above.
scrubbedString = [inputString stringByTrimmingCharactersInSet:invertedValidCharSet];
This removes from the start and end of inputString all characters that are in
the invertedValidCharSet. For example, if
inputString = @"abc123d€f567ghj😄"
then
scrubbedString = @"123d€f567"
It does not, as you might perhaps expect, remove all characters from the given set throughout the string.
One way to achieve that is (copied from NSString - replacing characters from NSCharacterSet):
scrubbedString = [[inputString componentsSeparatedByCharactersInSet:invertedValidCharSet] componentsJoinedByString:@""];
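With the same inputString as above, this yields scrubbedString == @"123567": the non-digit characters are removed everywhere, not just at the ends.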
This is probably not the most efficient method, but as your question was about understanding NSCharacterSet, I hope that it helps.
