Is it possible to detect links within an NSString that have spaces in them with NSDataDetector? - ios

First off, I have no control over the text I am getting. Just wanted to put that out there so you know that I can't change the links.
The text I am trying to find links in using NSDataDetector contains the following:
<h1>My main item</h1>
<img src="http://www.blah.com/My First Image Here.jpg">
<h2>Some extra data</h2>
The detection code I am using is this, but it will not find this link:
NSDataDetector *linkDetector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:nil];
NSArray *matches = [linkDetector matchesInString:myHTML options:0 range:NSMakeRange(0, [myHTML length])];
for (NSTextCheckingResult *match in matches)
{
    if ([match resultType] == NSTextCheckingTypeLink)
    {
        NSURL *url = [match URL];
        // does some stuff
    }
}
Is this a bug with Apple's link detection here, where it can't detect links with spaces, or am I doing something wrong?
Does anyone have a more reliable way to detect links regardless of whether they have spaces or special characters or whatever in them?

I just got this response from Apple for a bug I filed on this:
We believe this issue has been addressed in the latest iOS 9 beta.
This is a pre-release iOS 9 update.
Please refer to the release notes for complete installation
instructions.
Please test with this release. If you still have issues, please
provide any relevant logs or information that could help us
investigate.
iOS 9 https://developer.apple.com/ios/download/
I will test and let you all know if this is fixed with iOS 9.

You could split the strings into pieces using the spaces so that you have an array of strings with no spaces. Then you could feed each of those strings into your data detector.
// assume str = <img src="http://www.blah.com/My First Image Here.jpg">
NSArray *components = [str componentsSeparatedByString:@" "];
for (NSString *strWithNoSpace in components) {
    // feed strings into data detector
}
Another alternative is to look specifically for that HTML tag. This is a less generic solution, though.
// assume that those 3 HTML strings are in a string array called strArray
for (NSString *htmlLine in strArray) {
    // hasPrefix: is safe even when the line is shorter than the prefix
    if ([htmlLine hasPrefix:@"<img src"]) {
        // Get the URL from the img src tag: skip the 10-character prefix <img src=" and drop the trailing ">
        NSString *urlString = [htmlLine substringWithRange:NSMakeRange(10, htmlLine.length - 12)];
    }
}

I've found a very hacky way to solve my issue. If someone comes up with a better solution that can be applied to all URLs, please do so.
Because I only care about URLs ending in .jpg that have this problem, I was able to come up with a narrow way to track this down.
Essentially, I split the string into an array of components on the separator "http://. Then I loop through that array and split each component again on .jpg">. The inner array's count is greater than 1 only when the .jpg"> string is found. I then keep both the string I found and the fixed version with %20 replacements, and use them to do a final string replacement on the original string.
It's not perfect and probably inefficient, but it gets the job done for what I need.
- (NSString *)replaceSpacesInJpegURLs:(NSString *)htmlString
{
    NSString *newString = htmlString;
    NSArray *array = [htmlString componentsSeparatedByString:@"\"http://"];
    for (NSString *str in array)
    {
        NSArray *array2 = [str componentsSeparatedByString:@".jpg\""];
        if ([array2 count] > 1)
        {
            NSString *stringToFix = [array2 objectAtIndex:0];
            NSString *fixedString = [stringToFix stringByReplacingOccurrencesOfString:@" " withString:@"%20"];
            newString = [newString stringByReplacingOccurrencesOfString:stringToFix withString:fixedString];
        }
    }
    return newString;
}

You can use NSRegularExpression to fix all URLs: use a simple regex to detect the links and then encode the spaces. (If you need more complex encoding, look into CFURLCreateStringByAddingPercentEscapes; there are plenty of examples out there.) If you haven't worked with NSRegularExpression before, the only part that might take some time is iterating over the results and doing the replacement. The following code should do the trick:
NSError *error = nil;
// [^\"]* stops at the closing quote, so other quoted attributes on the same line are left alone
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"src=\"[^\"]*\"" options:NSRegularExpressionCaseInsensitive error:&error];
if (regex)
{
    NSInteger offset = 0;
    NSArray *matches = [regex matchesInString:myHTML options:0 range:NSMakeRange(0, [myHTML length])];
    for (NSTextCheckingResult *result in matches)
    {
        // Shift the original match range by the length change of earlier replacements
        NSRange resultRange = [result range];
        resultRange.location += offset;
        NSString *match = [regex replacementStringForResult:result inString:myHTML offset:offset template:@"$0"];
        NSString *replacement = [match stringByReplacingOccurrencesOfString:@" " withString:@"%20"];
        myHTML = [myHTML stringByReplacingCharactersInRange:resultRange withString:replacement];
        offset += ([replacement length] - resultRange.length);
    }
}

Try this regex pattern: @"<img[^>]+src=(\"|')([^\"']+)(\"|')[^>]*>" with the case-insensitive option. Capture group 2 holds the source URL.

Give this snippet a try (I got the regexp from the first commenter, user3584460):
NSError *error = nil;
NSString *myHTML = @"<http><h1>My main item</h1><img src=\"http://www.blah.com/My First Image Here.jpg\"><h2>Some extra data</h2><img src=\"http://www.bloh.com/My Second Image Here.jpg\"><h3>Some extra data</h3><img src=\"http://www.bluh.com/My Third-Image Here.jpg\"></http>";
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"src=[\"'](.+?)[\"'].*?>" options:NSRegularExpressionCaseInsensitive error:&error];
NSArray *arrayOfAllMatches = [regex matchesInString:myHTML options:0 range:NSMakeRange(0, [myHTML length])];
for (NSTextCheckingResult *match in arrayOfAllMatches) {
    // Capture group 1 is the URL between the quotes
    NSRange range = [match rangeAtIndex:1];
    NSString *substringForMatch = [myHTML substringWithRange:range];
    NSLog(@"Extracted URL : %@", substringForMatch);
}
In my log, I have :
Extracted URL : http://www.blah.com/My First Image Here.jpg
Extracted URL : http://www.bloh.com/My Second Image Here.jpg
Extracted URL : http://www.bluh.com/My Third-Image Here.jpg
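The extracted strings still contain raw spaces, and +[NSURL URLWithString:] has historically returned nil for such input. Here is a minimal sketch of one way to finish the job, assuming the src values are never percent-encoded already (otherwise the % signs would be encoded a second time). The helper name GRImageURLsInHTML is made up for illustration, and the pattern is a variant of the ones suggested above:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns one NSURL per <img ... src="..."> found in html.
static NSArray<NSURL *> *GRImageURLsInHTML(NSString *html)
{
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<img[^>]+src=[\"']([^\"']+)[\"']"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return @[];
    }

    NSMutableArray<NSURL *> *urls = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        // Capture group 1 is the raw attribute value, spaces included.
        NSString *raw = [html substringWithRange:[match rangeAtIndex:1]];
        // Escape everything that is not legal in a URL query (a space becomes %20).
        // Assumption: the input is never pre-escaped, so re-encoding "%" is harmless.
        NSString *escaped = [raw stringByAddingPercentEncodingWithAllowedCharacters:
                                     [NSCharacterSet URLQueryAllowedCharacterSet]];
        NSURL *url = escaped ? [NSURL URLWithString:escaped] : nil;
        if (url != nil) {
            [urls addObject:url];
        }
    }
    return urls;
}
```

Calling GRImageURLsInHTML(myHTML) on the sample string should give three NSURL objects whose spaces have been replaced by %20.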

You should not use NSDataDetector with HTML. It is intended for parsing normal text (entered by a user), not computer-generated data. In fact, it has many heuristics to make sure it does not detect computer-generated things that are probably not relevant to the user.
If your string is HTML, then you should use an HTML parsing library. There are a number of open-source kits to help you do that. Then just grab the href attributes of your anchors, or run NSDataDetector on the text nodes to find things not marked up without polluting the string with tags.

URLs really shouldn't contain spaces. I'd remove all spaces from the string before doing anything URL-related with it, something like the following:
// Custom function which cleans up strings ready to be used for URLs
func cleanStringForURL(string: NSString) -> NSString {
    let clean = string.stringByReplacingOccurrencesOfString(" ", withString: "")
    return clean
}

Related

How would I use NSRegularExpression where if a section is detected and replaced, it won't be done to again?

I want to parse some Markdown, and I have an issue with emphasis, where text wrapped in underscores is to be emphasized (such as this is some _emphasized_ text).
However, links can also contain underscores, such as http://example.com/text_with_underscores/, and currently my regular expression picks up _with_ as emphasized text.
Obviously I don't want it to. Since emphasis in the middle of a word is valid (such as longword*with*emphasis), my go-to solution is to parse links first and "mark" those replacements so they are not touched again. Is this possible?
One solution is to implement it like this:
NSString *yourStr = @"this is some _emphasized_ text";
NSMutableString *mutStr = [NSMutableString string];
NSUInteger count = 0;
for (NSUInteger i = 0; i < yourStr.length; i++)
{
    unichar c = [yourStr characterAtIndex:i];
    if ((c == '_') && (count == 0))
    {
        // first underscore of a pair opens the emphasis
        [mutStr appendString:@"<em>"];
        count++;
    }
    else if ((c == '_') && (count > 0))
    {
        // second underscore of a pair closes it
        [mutStr appendString:@"</em>"];
        count = 0;
    }
    else
    {
        [mutStr appendFormat:@"%C", c];
    }
}
NSLog(@"%@", mutStr);
Output:-
this is some <em>emphasized</em> text
__block NSString *yourString = @"media_w940996738_ _help_ 476.mp3";
NSError *error = NULL;
__block NSString *yourNewString;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"([_])\\w+([_])" options:NSRegularExpressionCaseInsensitive error:&error];
yourNewString = [NSString stringWithString:yourString];
[regex enumerateMatchesInString:yourString options:0 range:NSMakeRange(0, [yourString length]) usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop){
    // detect
    NSString *subString = [yourString substringWithRange:[match rangeAtIndex:0]];
    NSRange range = [match rangeAtIndex:0];
    range.location += 1;
    range.length -= 2;
    // print
    NSString *string = [NSString stringWithFormat:@"<em>%@</em>", [yourString substringWithRange:range]];
    yourNewString = [yourNewString stringByReplacingOccurrencesOfString:subString withString:string];
}];
First a more usual way to do processing like this would be to tokenise the input; this both makes handling each kind of token easier and is probably more efficient for large inputs. That said, here is how to solve your problem using regular expressions.
Consider:
matchesInString:options:range: returns all the non-overlapping matches for a regular expression.
Regular expressions are built from smaller regular expressions and can contain alternatives. So if you have REemphasis which matches strings to emphasise and REurl which matches URLs, then (REemphasis)|(REurl) matches both.
NSTextCheckingResult, instances of which are returned by matchesInString:options:range:, reports the range of each group in the match, and if a group does not occur in the result due to alternatives in the pattern then the group's NSRange.location is set to NSNotFound. So for the above pattern, (REemphasis)|(REurl), if group 1 is NSNotFound the match is for the REurl alternative, otherwise it is for the REemphasis alternative.
The method replacementStringForResult:inString:offset:template: will return the replacement string for a match based on the template (aka the replacement pattern).
The above is enough to write an algorithm to do what you want. Here is some sample code:
- (NSString *) convert:(NSString *)input
{
    NSString *emphPat = @"(_([^_]+)_)"; // note this pattern does NOT allow for markdown's \_ escapes - that needs to be addressed
    NSString *emphRepl = @"<em>$2</em>";
    // a pattern for urls - use whatever suits
    // this one is taken from http://stackoverflow.com/questions/6137865/iphone-reg-exp-for-url-validity
    NSString *urlPat = @"([hH][tT][tT][pP][sS]?:\\/\\/[^ ,'\">\\]\\)]*[^\\. ,'\">\\]\\)])";
    // construct a pattern which matches emphPat OR urlPat
    // emphPat is first so its two groups are numbered 1 & 2 in the resulting match
    NSString *comboPat = [NSString stringWithFormat:@"%@|%@", emphPat, urlPat];
    // build the re
    NSError *error = nil;
    NSRegularExpression *re = [NSRegularExpression regularExpressionWithPattern:comboPat options:0 error:&error];
    // check for error - omitted
    // get all the matches - includes both urls and text to be emphasised
    NSArray *matches = [re matchesInString:input options:0 range:NSMakeRange(0, input.length)];
    NSInteger offset = 0; // will track the change in size
    NSMutableString *output = input.mutableCopy; // mutable copy of input to modify to produce output
    for (NSTextCheckingResult *aMatch in matches)
    {
        NSRange first = [aMatch rangeAtIndex:1];
        if (first.location != NSNotFound)
        {
            // the first group has been matched => that is the emphPat (which contains the first two groups)
            // determine the replacement string
            NSString *replacement = [re replacementStringForResult:aMatch inString:output offset:offset template:emphRepl];
            NSRange whole = aMatch.range; // original range of the match
            whole.location += offset; // add in the offset to allow for previous replacements
            offset += replacement.length - whole.length; // modify the offset to allow for the length change caused by this replacement
            // perform the replacement
            [output replaceCharactersInRange:whole withString:replacement];
        }
    }
    return output;
}
Note the above does not allow for Markdown's \_ escape sequence and you need to address that. You probably also need to consider the RE used for URLs - one was just plucked from SO and hasn't been tested properly.
The above will convert
http://example.com/text_with_underscores _emph_
to
http://example.com/text_with_underscores <em>emph</em>
HTH
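For the \_ escape that the answer leaves open, one possible direction is a negative lookbehind on the opening underscore. This is only a sketch under assumptions: the function name is hypothetical, it handles emphasis alone (no URL alternative), and it does not treat an escaped backslash (\\ followed by _) specially:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps _text_ in <em> tags but leaves \_ escapes alone.
static NSString *GREmphasizeSkippingEscapes(NSString *input)
{
    // (?<!\\)_           an underscore that is not preceded by a backslash
    // ((?:[^_\\]|\\.)+)  body: ordinary characters, or a backslash plus any character
    // _                  closing underscore (escaped ones were consumed by the body)
    NSString *pattern = @"(?<!\\\\)_((?:[^_\\\\]|\\\\.)+)_";
    NSRegularExpression *re = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                        options:0
                                                                          error:NULL];
    if (re == nil) {
        return input;
    }
    return [re stringByReplacingMatchesInString:input
                                        options:0
                                          range:NSMakeRange(0, input.length)
                                   withTemplate:@"<em>$1</em>"];
}
```

If this pattern were substituted for emphPat in the combined expression, the group numbers used by the replacement template would have to be adjusted to match.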

Removing \ from NSString (from escape sequences only)

I have tried (searching for) various possible solutions here on SO, in vain. Most of them simply replace all occurrences of backslashes, and don't respect backslashes that should otherwise be untouched.
For instance, if I have a Hi, it\'s me. How\'re you doing?, it should be Hi, it's me. How're you doing?. However, if someone tries to get creative with ASCII art, like
\\// \\// \\//
//\\ //\\ //\\
(WOW even SO won't let me add text as is, the above text needed extra backslashes to be displayed correctly.)
I cannot use [myString stringByReplacingOccurrencesOfString:@"\\" withString:@""]; since it will replace ALL backslashes. I do not want that.
I would like the string to be displayed as is.
NOTE: The strings in question here are values in NSDictionarys received as JSON from a web service. The use is in a service like a chat client, so it is important that text is handled correctly.
ULTRA IMPORTANT NOTE: I'm open to all ideas like library functions, regular expressions, human sacrifices, as long it gets the job done.
Try this. I'm not sure I understand your question, but it may be helpful:
- (void)remove:(NSString *)str
{
    NSString * const pattern = @"(\"[^\"]*\"|[^, ]+)";
    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:pattern
                                                                      options:0
                                                                        error:nil];
    NSRange searchRange = NSMakeRange(0, [str length]);
    NSArray *matches = [regex matchesInString:str
                                      options:0
                                        range:searchRange];
    for (NSTextCheckingResult *match in matches) {
        NSRange matchRange = [match range];
        NSLog(@"%@", [str substringWithRange:matchRange]);
    }
    NSLog(@"%@", str);
}
Call this method:
NSString *str = @"Hi, it\'s me. How\'re you doing?";
[self remove:str];
Then the output is:
Hi, it's me. How're you doing?
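If the backslashes really are present in the decoded string (NSJSONSerialization normally resolves JSON escapes itself, so this would suggest the service escapes twice), one narrow option is to strip a backslash only when it sits directly in front of a quote character. The sketch below assumes those are the only escape sequences that need undoing, and the function name is made up for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turns \' and \" into ' and ", leaving every other backslash alone.
static NSString *GRUnescapeQuotes(NSString *text)
{
    // Regex \\(['"]) : a literal backslash followed by a quote, which is captured as $1.
    return [text stringByReplacingOccurrencesOfString:@"\\\\(['\"])"
                                           withString:@"$1"
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, text.length)];
}
```

Backslashes used in ASCII art, which are followed by slashes or spaces rather than quotes, pass through unchanged.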

Bold word of a sentence of NSString in Xcode

I have a problem displaying a sentence with one selected word in bold.
NSString *string = @"Notes on iOS7 going to take a <-lot-> of getting used to!";
I want to print a sentence like this:
"Notes on iOS7 going to take a lot of getting used to!" (with the word "lot" in bold)
I already have the code below, which selects the word "lot". How can I use that selection to make the word bold? The string here is only an example, so the range will differ.
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(?<=<-).*?(?=->)"
                                                                       options:0 error:&error];
if (regex) {
    NSRange rangeOfFirstMatch = [regex rangeOfFirstMatchInString:string options:0 range:NSMakeRange(0, [string length])];
    if (!NSEqualRanges(rangeOfFirstMatch, NSMakeRange(NSNotFound, 0))) {
        NSString *result = [string substringWithRange:rangeOfFirstMatch];
        NSLog(@"%@", result);
    } else {
        // Match attempt failed
    }
} else {
    // Syntax error in the regular expression
}
The result should be:
Before: Notes on iOS7 going to take a <-lot-> of getting used to!
After: Notes on iOS7 going to take a lot of getting used to!
Thanks in advance.
You have to use an NSAttributedString for that. Have a look at "Any way to bold part of a NSString?"
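A minimal sketch of that approach, assuming UIKit is available and that the markers are always written as <- and ->. The function name and the fontSize parameter are hypothetical and do not come from the linked question:

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper: removes the <- and -> markers and bolds what was between them.
static NSAttributedString *GRBoldMarkedText(NSString *string, CGFloat fontSize)
{
    NSDictionary *regular = @{NSFontAttributeName: [UIFont systemFontOfSize:fontSize]};
    NSDictionary *bold = @{NSFontAttributeName: [UIFont boldSystemFontOfSize:fontSize]};
    NSMutableAttributedString *result =
        [[NSMutableAttributedString alloc] initWithString:string attributes:regular];

    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<-(.*?)->"
                                                                           options:0
                                                                             error:NULL];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:string options:0 range:NSMakeRange(0, string.length)];

    // Replace from the last match to the first so the earlier ranges stay valid.
    for (NSTextCheckingResult *match in [matches reverseObjectEnumerator]) {
        NSString *word = [string substringWithRange:[match rangeAtIndex:1]];
        NSAttributedString *boldWord =
            [[NSAttributedString alloc] initWithString:word attributes:bold];
        [result replaceCharactersInRange:match.range withAttributedString:boldWord];
    }
    return result;
}
```

The returned value can be assigned to the attributedText property of a UILabel, for example label.attributedText = GRBoldMarkedText(string, 17.0).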

What's the quickest way to do lots of NSRange calls in a very long NSString on iOS?

I have a VERY long NSString. It contains about 100 strings I need to pull out of it, all randomly scattered throughout. Each of them sits between imgurl= and &.
I could use NSRange and just loop through pulling out each string, but I'm wondering if there is a quicker way to pick out everything in a simple API call. Maybe there's something I am missing here?
Looking for the quickest way to do this. Thanks!
Use the NSString methods componentsSeparatedByString: and componentsSeparatedByCharactersInSet:
NSString *longString = ...; // some really long string
NSArray *longStringComponents = [longString componentsSeparatedByString:@"imgurl="];
for (NSString *string in longStringComponents) {
    NSString *imgURLString = [[string componentsSeparatedByCharactersInSet:[NSCharacterSet characterSetWithCharactersInString:@"&"]] firstObject];
    // do something with imgURLString...
    // note: the first component is the text before the first imgurl=, not a URL
}
If you feel adventurous, you can use a regular expression. Since you said the string you are looking for is between imgurl= and &, I assumed it's a URL and wrote the sample code accordingly.
NSString *str = @"http://www.example.com/image?imgurl=my_image_url1&imgurl=myimageurl2&somerandom=blah&imgurl=myurl3&someother=lol";
NSError *error = NULL;
// the key must be spelled imgurl= to match the input; $ also accepts a value at the very end of the string
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(?:imgurl=)(.*?)(?:&|\\r|$)"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];
// should do error checking here...
NSArray *matches = [regex matchesInString:str
                                  options:0
                                    range:NSMakeRange(0, [str length])];
for (NSTextCheckingResult *match in matches)
{
    // [match rangeAtIndex:0] <- gives you the whole string matched.
    // [match rangeAtIndex:1] <- gives you the first group, the one you really care about.
    NSLog(@"%@", [str substringWithRange:[match rangeAtIndex:1]]);
}
If I were you, I would still go with @bobnoble's method because it's easier and simpler than regex. You will have to do more error checking with the regex approach.
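A third option that neither answer mentions is NSScanner, which walks the string once without building intermediate component arrays. This is a hedged sketch rather than a measured "quickest" solution; the function name and its start/end parameters are assumptions made for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects every non-empty substring found between start and end.
static NSArray<NSString *> *GRValuesBetween(NSString *text, NSString *start, NSString *end)
{
    NSMutableArray<NSString *> *values = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:text];
    scanner.charactersToBeSkipped = nil; // keep whitespace significant
    scanner.caseSensitive = YES;

    while (!scanner.isAtEnd) {
        // Skip ahead to the next start marker and consume it.
        [scanner scanUpToString:start intoString:NULL];
        if (![scanner scanString:start intoString:NULL]) {
            break; // no further marker
        }
        // Read up to the end marker, or to the end of the text if there is none.
        NSString *value = nil;
        if ([scanner scanUpToString:end intoString:&value]) {
            [values addObject:value];
        }
    }
    return values;
}
```

For the sample URL above, GRValuesBetween(str, @"imgurl=", @"&") should return my_image_url1, myimageurl2 and myurl3.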

How to detect email addresses within arbitrary strings

I'm using the following code to detect an email address in a string. It works fine except for addresses whose local part is purely numeric, such as "536264846@gmail.com". Is it possible to work around this bug of Apple's? Any help will be appreciated!
NSString *string = @"536264846@gmail.com";
NSError *error = NULL;
NSDataDetector *detector = [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:&error];
NSArray *matches = [detector matchesInString:string
                                     options:0
                                       range:NSMakeRange(0, [string length])];
for (NSTextCheckingResult *match in matches) {
    if ([match.URL.scheme isEqualToString:@"mailto"]) {
        NSString *email = [match.URL.absoluteString substringFromIndex:match.URL.scheme.length + 1];
        NSLog(@"email :%@", email);
    } else {
        NSLog(@"[match URL] :%@", [match URL]);
    }
}
Edit:
log result is: [match URL] :http://gmail.com
What I did in the past:
tokenize the input, e.g., separate tokens using spaces (since most other common separators may be valid within an email). However, this may not be necessary if the regular expression is not anchored - but not sure how it would work without the "^" and "$" anchors (which I added to what was shown on the web site).
keep in mind that addresses may take the form '"string" <address>' as well as just the address
in each token, look for '@', as it's probably the best indicator you have that it's an email address
run the token through the regular expression shown on this Email Detector comparison site (I found in testing that the one marked #1 as of 3/21/2013 worked best)
What I did was put the regular expression in a text file, so I didn't need to escape it:
^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}@)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))\x22))(?:.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))\x22)))@(?:(?:(?!.[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+).){1,126}){1,}(?:(?:[a-z][a-z0-9])|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+))|(?:[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.[a-f0-9][:]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))]))$
Defined an ivar:
NSRegularExpression *reg;
Created the regular expression:
NSString *fullPath = [[NSBundle mainBundle] pathForResource:@"EMailRegExp" ofType:@"txt"];
NSString *pattern = [NSString stringWithContentsOfFile:fullPath encoding:NSUTF8StringEncoding error:NULL];
NSError *error = nil;
reg = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];
assert(reg && !error);
Then wrote a method to do the comparison:
- (BOOL)isValidEmail:(NSString *)string
{
NSTextCheckingResult *match = [reg firstMatchInString:string options:0 range:NSMakeRange(0, [string length])];
return match ? YES : NO;
}
EDIT: I've turned the above into a project on GitHub
EDIT2: for an alternative that is less rigorous but faster, see the comment section of this question
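For the original goal of finding addresses inside arbitrary text, a much simpler unanchored pattern can serve as a first pass. The sketch below is an assumption rather than the pattern from the comparison site or the comment section: it is deliberately loose, is not RFC-compliant, and the function name is hypothetical:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns substrings that look like email addresses.
static NSArray<NSString *> *GREmailCandidates(NSString *text)
{
    // local part, @, domain labels, a dot, and a top-level domain of two or more letters
    NSString *pattern = @"[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    if (regex == nil) {
        return found;
    }
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}
```

Because the pattern is not anchored, no tokenizing is needed first, and a purely numeric local part such as 536264846@gmail.com is picked up like any other.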
