iPhone SDK: Break Chinese sentence into words and letters - iOS

I have a Chinese news feed and I want to break each sentence into smaller chunks to pass to an API.
How can I do this on iOS? For English I use a chunk length of 50 characters.
Currently I am using rangeOfString: to find a question mark, exclamation mark, comma, period, or space and break the text there.
NSString *str = nil, *rem = nil;
str = [final substringToIndex:MAX_CHAR_Private];
rem = [final substringFromIndex:MAX_CHAR_Private];
NSRange rng = [rem rangeOfString:@"?"];
if (rng.location == NSNotFound) {
    rng = [rem rangeOfString:@"!"];
    if (rng.location == NSNotFound) {
        rng = [rem rangeOfString:@","];
        if (rng.location == NSNotFound) {
            rng = [rem rangeOfString:@"."];
            if (rng.location == NSNotFound) {
                rng = [rem rangeOfString:@" "];
            }
        }
    }
}
if (rng.location+1 + MAX_CHAR_Private > MAXIMUM_LIMIT_Private) {
    rng = [rem rangeOfString:@" "];
}
if (rng.location == NSNotFound) {
    remaining = [[final substringFromIndex:MAX_CHAR_Private] retain];
}
else {
    //NSRange rng = [rem rangeOfString:@" "];
    str = [str stringByAppendingString:[rem substringToIndex:rng.location]];
    remaining = [[final substringFromIndex:MAX_CHAR_Private + rng.location+1] retain];
}
This does not work correctly for Chinese and Japanese text.

Check NSLinguisticTagger; it should work with Chinese.
From Apple: "The NSLinguisticTagger class is used to automatically segment natural-language text and tag it with information, such as parts of speech. It can also tag languages, scripts, stem forms of words, etc."
See the Apple documentation: NSLinguisticTagger Class Reference.
Also see the NSHipster article on NSLinguisticTagger.
Also see objc.io issue 7.
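As a rough illustration of the suggestion above, here is a minimal sketch of word-level tokenization with NSLinguisticTagger. The helper name WordsInText is hypothetical, ARC is assumed, and how well Chinese text is segmented depends on the system's language support, so no sample output is shown.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any SDK): returns the word tokens in `text`.
static NSArray<NSString *> *WordsInText(NSString *text) {
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    NSLinguisticTagger *tagger =
        [[NSLinguisticTagger alloc] initWithTagSchemes:@[NSLinguisticTagSchemeTokenType]
                                               options:0];
    tagger.string = text;
    // Skip whitespace and punctuation so that only word tokens are reported.
    NSLinguisticTaggerOptions options =
        NSLinguisticTaggerOmitWhitespace | NSLinguisticTaggerOmitPunctuation;
    [tagger enumerateTagsInRange:NSMakeRange(0, text.length)
                          scheme:NSLinguisticTagSchemeTokenType
                         options:options
                      usingBlock:^(NSString *tag, NSRange tokenRange,
                                   NSRange sentenceRange, BOOL *stop) {
        [words addObject:[text substringWithRange:tokenRange]];
    }];
    return words;
}
```

The block also receives sentenceRange, which could be used to group the tokens by sentence before sending them to the API.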

NSString provides that out of the box with the NSStringEnumerationBySentences enumeration option:
NSCharacterSet *whiteSpaceSet = [NSCharacterSet whitespaceAndNewlineCharacterSet];
[string enumerateSubstringsInRange:NSMakeRange(0, [string length])
                           options:NSStringEnumerationBySentences
                        usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop)
    {
        // Trim the whitespace that may surround each sentence.
        NSString *sentence = [substring stringByTrimmingCharactersInSet:whiteSpaceSet];
        // process sentence
    }
];
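Building on the sentence enumeration above, the following sketch groups whole sentences into chunks that stay under a length limit, which is roughly what the question asks for. The helper name ChunksFromText and its maxLength parameter are assumptions for illustration; lengths are measured in UTF-16 code units (what NSString's length reports), and ARC is assumed.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: groups whole sentences into chunks of at most
// maxLength UTF-16 code units. A single sentence longer than maxLength
// is kept as its own oversized chunk rather than being cut in the middle.
static NSArray<NSString *> *ChunksFromText(NSString *text, NSUInteger maxLength) {
    NSMutableArray<NSString *> *chunks = [NSMutableArray array];
    NSMutableString *current = [NSMutableString string];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationBySentences
                          usingBlock:^(NSString *substring, NSRange substringRange,
                                       NSRange enclosingRange, BOOL *stop) {
        // Use the enclosing range so separators between sentences are preserved.
        NSString *sentence = [text substringWithRange:enclosingRange];
        if (current.length > 0 && current.length + sentence.length > maxLength) {
            // The next sentence would not fit: close the current chunk first.
            [chunks addObject:[current copy]];
            [current setString:@""];
        }
        [current appendString:sentence];
    }];
    if (current.length > 0) {
        [chunks addObject:[current copy]];
    }
    return chunks;
}
```

A call such as ChunksFromText(feedText, 50) would then yield the pieces to send one by one; an oversized sentence would still need a second pass, for example with NSStringEnumerationByWords.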

Related

How to get the first alphabet character of a string in iOS

I have an example NSString in iOS:
NSString* str = @"-- This is an example string";
I want to get the first alphabetic letter. In the example above the result is the letter "T" from the word "This", because the characters before it are not letters.
How can I retrieve it? If the string does not contain any letters, the result can be nil.
The result can also be an NSRange.
NSRange range = [string rangeOfCharacterFromSet:[NSCharacterSet letterCharacterSet]];
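To turn that range into the letter itself, and to return nil when there is no letter, a small wrapper could look like the sketch below. The function name FirstLetterInString is hypothetical and ARC is assumed.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first letter of `string` as an NSString,
// or nil if the string contains no letters.
static NSString *FirstLetterInString(NSString *string) {
    NSRange range = [string rangeOfCharacterFromSet:[NSCharacterSet letterCharacterSet]];
    if (range.location == NSNotFound) {
        return nil;
    }
    // Widen to the full composed character sequence so that letters outside
    // the Basic Multilingual Plane or with combining marks are not cut in half.
    NSRange composed = [string rangeOfComposedCharacterSequenceAtIndex:range.location];
    return [string substringWithRange:composed];
}
```

Note that letterCharacterSet covers letters in all scripts, not only A-Z, so Chinese or Cyrillic characters also count as letters here.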
First create an NSCharacterSet as a global variable and write this code:
// Every character that is NOT an ASCII letter.
static NSCharacterSet *s = nil;
- (void)viewDidLoad {
    [super viewDidLoad];
    s = [[NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"] invertedSet];
    NSString *myString = @"--- This is a string";
    NSArray *arrayOfStrings = [myString componentsSeparatedByString:@" "];
    for (NSUInteger i = 0; i < arrayOfStrings.count; i++) {
        NSString *current = [arrayOfStrings objectAtIndex:i];
        char c = [self returnCharacter:current];
        if (c == '\0') {
            // This word is empty or contains non-letter characters; try the next one.
        }
        else {
            NSLog(@"%c", c);
            // Your output; stop at the first match.
            break;
        }
    }
}
And here is the method:
- (char)returnCharacter:(NSString *)string {
    NSRange r = [string rangeOfCharacterFromSet:s];
    if (string.length == 0 || r.location != NSNotFound) {
        NSLog(@"the string contains illegal characters");
        return '\0';
    }
    else {
        // The string contains only ASCII letters.
        char firstLetter = (char)[string characterAtIndex:0];
        return firstLetter;
    }
}
You can use the following method. Pass a string and get its first letter back as a string.
- (NSString *)getFirstCharacter:(NSString *)string
{
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    for (NSUInteger i = 0; i < string.length; i++)
    {
        unichar firstChar = [string characterAtIndex:i];
        if ([letters characterIsMember:firstChar]) {
            // %C formats a unichar; %c would truncate it to a single byte.
            return [NSString stringWithFormat:@"%C", firstChar];
        }
    }
    return nil;
}

NSString to treat "regular English letters" and characters like emoji or Japanese uniformly

There is a text field in which I can enter characters. The characters can be a, b, c, d, etc., or a smiley face added using the emoji keyboard.
- (void)textFieldDidEndEditing:(UITextField *)textField {
    NSLog(@"len:%lu", (unsigned long)textField.text.length);
    NSLog(@"char:%c", [textField.text characterAtIndex:0]);
}
Currently, the method above gives the following output:
if textField.text = @"qq"
len:2
char:q
if textField.text = @"😄q"
len:3
char:=
What I need is:
if textField.text = @"qq"
len:2
char:q
if textField.text = @"😄q"
len:2
char:😄
Any clue how to do this?
NSString stores text as UTF-16, so a character outside the Basic Multilingual Plane (Unicode planes above 0), which includes most emoji, takes two code units (a surrogate pair). To get the actual length it is necessary to enumerate the composed character sequences.
Note: The NSString method length does not return the number of characters but the number of UTF-16 code units (unichars). See NSString and Unicode - Strings - objc.io issue #9.
Example code:
NSString *text = @"qqq😄rrr";
int maxCharacters = 4;
__block NSInteger unicharCount = 0;
__block NSInteger charCount = 0;
[text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                         options:NSStringEnumerationByComposedCharacterSequences
                      usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    unicharCount += substringRange.length;
    if (++charCount >= maxCharacters)
        *stop = YES;
}];
NSString *textStart = [text substringToIndex: unicharCount];
NSLog(@"textStart: '%@'", textStart);
Output:
textStart: 'qqq😄'
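The same idea covers the two values the question asks for, the user-visible length and the first visible character. The sketch below uses two hypothetical helper names (ComposedLength and FirstComposedCharacter) and assumes ARC.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences
// (what a user perceives as characters), not UTF-16 code units.
static NSUInteger ComposedLength(NSString *string) {
    __block NSUInteger count = 0;
    [string enumerateSubstringsInRange:NSMakeRange(0, string.length)
                               options:NSStringEnumerationByComposedCharacterSequences
                            usingBlock:^(NSString *substring, NSRange substringRange,
                                         NSRange enclosingRange, BOOL *stop) {
        count++;
    }];
    return count;
}

// Hypothetical helper: returns the first composed character as a string,
// or nil for an empty string.
static NSString *FirstComposedCharacter(NSString *string) {
    if (string.length == 0) {
        return nil;
    }
    NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:0];
    return [string substringWithRange:range];
}
```

With the question's input @"😄q", ComposedLength gives 2 and FirstComposedCharacter gives @"😄", which can be logged with %@ instead of %c.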
An alternative approach is to use UTF-32 encoding, where every code point takes exactly four bytes. Note that this counts code points, so a sequence built from several code points (for example a letter followed by a combining mark) still counts as more than one:
int byteCount = maxCharacters*4; // 4 bytes per UTF-32 code point
char buffer[byteCount];
NSUInteger usedBufferCount;
[text getBytes:buffer maxLength:byteCount usedLength:&usedBufferCount encoding:NSUTF32StringEncoding options:0 range:NSMakeRange(0, text.length) remainingRange:NULL];
NSString * textStart = [[NSString alloc] initWithBytes:buffer length:usedBufferCount encoding:NSUTF32LittleEndianStringEncoding];
There is some rationale for this in Session 128 - Advanced Text Processing from WWDC 2011.
This is what I did to cut a string containing emoji characters:
+ (NSUInteger)unicodeLength:(NSString *)string {
    // UTF-32 uses four bytes per code point.
    return [string lengthOfBytesUsingEncoding:NSUTF32StringEncoding]/4;
}
+ (NSString *)unicodeString:(NSString *)string toLenght:(NSUInteger)len {
    if (len >= string.length) {
        return string;
    }
    NSUInteger charposition = 0;
    // Stop early if the end of the string is reached before len characters.
    for (NSUInteger i = 0; i < len && charposition < string.length; i++) {
        NSUInteger remainingChars = string.length - charposition;
        if (remainingChars >= 2) {
            NSString *s = [string substringWithRange:NSMakeRange(charposition, 2)];
            // Two code units that form one code point are a surrogate pair.
            if ([self unicodeLength:s] == 1) {
                charposition++;
            }
        }
        charposition++;
    }
    return [string substringToIndex:charposition];
}

Is there a way to work case insensitive with UITextChecker?

I need my app to check whether a word entered by the user is a real word. For this I found UITextChecker. I am using it to check words like this:
UITextChecker *textChecker = [[UITextChecker alloc] init];
NSLocale *locale = [NSLocale currentLocale];
NSString *language = [locale objectForKey:NSLocaleLanguageCode];
NSRange searchRange = NSMakeRange(0, [currentWord length]);
NSRange misspelledRange = [textChecker rangeOfMisspelledWordInString:currentWord range: searchRange startingAt:0 wrap:NO language:language];
if (misspelledRange.location == NSNotFound) NSLog(@"is a word");
else NSLog(@"is not a word");
This works fine until it comes to words that begin with an uppercase letter (and there are a lot of those in German :) ). The user enters the text in lowercase, so even if it is a real word the output is "is not a word" because it is lowercase.
I found no solution for this problem in the documentation, and I also searched for it here.
So my question is: is there a better way of doing this than capitalizing the first letter when the check fails?
Example of what I would do if there is no other way:
NSRange misspelledRange = [textChecker rangeOfMisspelledWordInString:currentWord range: searchRange startingAt:0 wrap:NO language:language];
if (misspelledRange.location == NSNotFound) NSLog(@"is a word");
else {
    // Retry with the first letter capitalized.
    NSString *firstLetter = [currentWord substringToIndex:1];
    NSString *restWord = [currentWord substringFromIndex:1];
    currentWord = [NSString stringWithFormat:@"%@%@", [firstLetter capitalizedString], restWord];
    misspelledRange = [textChecker rangeOfMisspelledWordInString:currentWord range: searchRange startingAt:0 wrap:NO language:language];
    if (misspelledRange.location == NSNotFound) NSLog(@"is a word");
    else NSLog(@"is not a word");
}
Thanks for your help!

Finding a word in an NSString and checking the characters before and after it

How do I find a word in an NSString and check the characters before and after it? For example, in:
"This pattern has two parts separated by the"
how do I find "tern" and check the characters before and after it?
Character before the word: "t"
Character after the word: " "
You can use NSScanner to get the indexes of these two characters.
Example:
NSString *string = @"tern";
NSScanner *scanner = [[NSScanner alloc] initWithString:@"This pattern has two parts separated by the"];
[scanner scanUpToString:string intoString:nil];
NSUInteger indexOfChar1 = scanner.scanLocation - 1;
NSUInteger indexOfChar2 = scanner.scanLocation + string.length;
You can also use the rangeOfString: method.
Example:
NSRange range = [sourceString rangeOfString:stringToLookFor];
NSUInteger indexOfChar1 = range.location - 1;
NSUInteger indexOfChar2 = range.location + range.length; // index just past the match
Then, when you have the indexes, getting the characters is easy (check first that both indexes lie inside the string, since the word may be missing or sit at either end):
NSString *firstCharacter = [sourceString substringWithRange:NSMakeRange(indexOfChar1, 1)];
NSString *secondCharacter = [sourceString substringWithRange:NSMakeRange(indexOfChar2, 1)];
Hope this helps.
Here is an implementation using regular expressions:
NSString *testString = @"This pattern has two parts separated by the";
NSString *regexString = @"(.)(tern)(.)";
NSError *error = nil;
NSRegularExpression *exp = [NSRegularExpression
    regularExpressionWithPattern:regexString
    options:0 error:&error];
if (error) {
    NSLog(@"%@", error);
} else {
    NSTextCheckingResult *result = [exp firstMatchInString:testString options:0 range:NSMakeRange(0, [testString length])];
    if (result) {
        NSRange groupOne = [result rangeAtIndex:1]; // 0 is the whole match.
        NSRange groupTwo = [result rangeAtIndex:2];
        NSRange groupThree = [result rangeAtIndex:3];
        NSLog(@"[%@][%@][%@]",
              [testString substringWithRange:groupOne],
              [testString substringWithRange:groupTwo],
              [testString substringWithRange:groupThree]);
    }
}
Results:
[t][tern][ ]
It's better to get the preceding and following characters as NSString values, which avoids handling raw unichar values.
NSString * testString = @"This pattern has two parts separated by the";
NSString * preString;
NSString * postString;
NSUInteger maxRange;
NSRange range = [testString rangeOfString:@"tern"];
if (range.location == NSNotFound) {
    NSLog(@"Not found");
    return;
}
if (range.location == 0) {
    // The match is at the very start, so nothing precedes it.
    preString = nil;
}
else {
    preString = [testString substringWithRange:NSMakeRange(range.location-1, 1)];
}
maxRange = NSMaxRange(range);
if (maxRange >= testString.length) {
    // The match is at the very end, so nothing follows it.
    postString = nil;
}
else {
    postString = [testString substringWithRange:NSMakeRange(maxRange, 1)];
}

Finding first letter in NSString and counting backwards

I'm new to iOS, and was looking for some guidance.
I have a long NSString that I'm parsing out. The beginning may have a few characters of garbage (can be any non-letter character) then 11 digits or spaces, then a single letter (A-Z). I need to get the location of the letter, and get the substring that is 11 characters behind the letter to 1 character behind the letter.
Can anyone give me some guidance on how to do that?
Example: '!!2553072 C'
and I want : '53072 '
You can accomplish this with the regex pattern: (.{11})\b[A-Z]\b
The (.{11}) will grab any 11 characters and the \b[A-Z]\b will look for a single character on a word boundary, meaning it will be surrounded by spaces or at the end of the string. If characters can follow the C in your example then remove the last \b. This can be accomplished in Objective-C like so:
NSError *error;
NSString *example = @"!!2553072 C";
NSRegularExpression *regex = [NSRegularExpression
    regularExpressionWithPattern:@"(.{11})\\b[A-Z]\\b"
    options:NSRegularExpressionCaseInsensitive
    error:&error];
if (!regex)
{
    //handle error
}
NSTextCheckingResult *match = [regex firstMatchInString:example
                                                options:0
                                                  range:NSMakeRange(0, [example length])];
if (match)
{
    NSLog(@"match: %@", [example substringWithRange:[match rangeAtIndex:1]]);
}
There may be a more elegant way to do this involving regular expressions or some Objective-C wizardry, but here's a straightforward solution (personally tested).
- (NSString *)getStringContent:(NSString *)input
{
    NSString *substr = nil;
    NSRange singleLetter = [input rangeOfCharacterFromSet:[NSCharacterSet letterCharacterSet]];
    // Require at least 11 characters before the letter so the range stays in bounds.
    if (singleLetter.location != NSNotFound && singleLetter.location >= 11)
    {
        NSUInteger startIndex = singleLetter.location - 11;
        NSRange substringRange = NSMakeRange(startIndex, 11);
        substr = [input substringWithRange:substringRange];
    }
    return substr;
}
You can use NSCharacterSets to split up the string, then take the first remaining component (consisting of your garbage and digits) and get a substring of that. For example (not compiled, not tested):
- (NSString *)parseString:(NSString *)myString {
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];
    NSArray *components = [myString componentsSeparatedByCharactersInSet:letters];
    assert(components.count > 0);
    NSString *prefix = components[0]; // assuming relatively new Xcode
    if (prefix.length < 11) {
        return nil; // fewer than 11 characters before the first letter
    }
    return [prefix substringFromIndex:(prefix.length - 11)];
}
// To get rid of all non-digits in an NSString:
NSString *customerphone = CustomerPhone.text;
NSUInteger phonelength = [customerphone length];
NSRange customersearchRange = NSMakeRange(0, phonelength);
for (NSUInteger i = 0; i < phonelength; i++)
{
    const unichar c = [customerphone characterAtIndex:i];
    NSString *onechar = [NSString stringWithCharacters:&c length:1];
    // isdigit() is only defined for values that fit in an unsigned char.
    if (c > 127 || !isdigit(c))
    {
        // Mark every non-digit with a placeholder of the same length.
        customerphone = [customerphone stringByReplacingOccurrencesOfString:onechar withString:@"*" options:0 range:customersearchRange];
    }
}
// Remove the placeholders, leaving only the digits.
NSString *PhoneAllNumbers = [customerphone stringByReplacingOccurrencesOfString:@"*" withString:@"" options:0 range:customersearchRange];
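A shorter way to get the same result is to split on everything that is not a digit and join the pieces again. This is only a sketch: the function name DigitsOnly is hypothetical, ARC is assumed, and the character set is limited to ASCII 0-9 on purpose, because decimalDigitCharacterSet also matches digits from other scripts.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns `input` with every character except 0-9 removed.
static NSString *DigitsOnly(NSString *input) {
    NSCharacterSet *nonDigits =
        [[NSCharacterSet characterSetWithCharactersInString:@"0123456789"] invertedSet];
    // Splitting on non-digits leaves only runs of digits (and empty strings),
    // which are then concatenated back together.
    return [[input componentsSeparatedByCharactersInSet:nonDigits]
               componentsJoinedByString:@""];
}
```

For example, DigitsOnly(@"(555) 123-4567") returns @"5551234567".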
