My app has a feed with posts that can contain URLs, #hashtags and @mentions. I display them using a pod called ActiveLabel. This pod does a good job, but my feed lags slightly when scrolling. My feed is a UICollectionView, and the cell generation is what lags. I profiled my app while scrolling and analysed the lag spikes. The lag is almost unnoticeable, but it annoys me.
As you can see, the main offender is the NSRegularExpression search.
I tried to optimize this slightly by disabling data detection when there are no instances of the data type, using .contains(). This made it marginally faster, but the lag spikes remain.
let enabledTypes: [ActiveType] = {
    var types = [ActiveType]()
    if ad.caption.current.string.contains("#") { types.append(.hashtag) }
    if ad.caption.current.string.contains("@") { types.append(.mention) }
    if ad.caption.current.string.contains("://") { types.append(.url) }
    if ad.caption.canExpand { types.append(seeMore) }
    return types
}()
label.enabledTypes = enabledTypes
I also followed every step in this article, which helped slightly, but not enough. So I need to fix the regex.
The regex patterns ActiveLabel uses are
static let hashtagPattern = "(?:^|\\s|$)#[\\p{L}0-9_]*"
static let mentionPattern = "(?:^|\\s|$|[.])@[\\p{L}0-9_]*"
static let urlPattern = "(^|[\\s.:;?\\-\\]<\\(])" + "((https?://|www\\.|pic\\.)[-\\w;/?:#&=+$\\|\\_.!~*\\|'()\\[\\]%#,☺]+[\\w/#](\\(\\))?)" + "(?=$|[\\s',\\|\\(\\).:;?\\-\\[\\]>\\)])"
and it uses them with
static func getElements(from text: String, with pattern: String, range: NSRange) -> [NSTextCheckingResult] {
    guard let elementRegex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else { return [] }
    return elementRegex.matches(in: text, options: [], range: range)
}
I searched around for other regexes to detect hashtags and mentions, but I didn't find anything that made a difference.
I tried to lay out the label on a background thread, but that obviously crashed, because UIKit must only be touched on the main thread. I could rewrite ActiveLabel to do most of its work on background threads and use callbacks instead of return values, but I'd like to avoid that.
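A rough sketch of that callback approach (findMatches and its queue choices are only an illustration, not ActiveLabel code):

```swift
import Foundation

// Sketch of the callback pattern: run the expensive regex matching off the
// main thread, then hand only the result back to the main thread.
func findMatches(in text: String,
                 using regex: NSRegularExpression,
                 completion: @escaping ([NSTextCheckingResult]) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        let range = NSRange(text.startIndex..., in: text)
        let matches = regex.matches(in: text, options: [], range: range)
        DispatchQueue.main.async {
            completion(matches) // safe to touch UIKit from here
        }
    }
}
```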
Some samples of strings that I detect data on:
"Arnie says, Aspen. Str. Small. Varm og god jakke. Veldig fin på! Fremstår ubrukt. Kun brukt et par ganger, rett og slett fordi jeg har alt for mange jakker🙈 #urban #arnie #says #aspen #ubrukt"
"Skjorte pent brukt i organisk bomull fra tom tailor originalpris 300kr #organisk #bomullsjorte #bomull #flower #floral"
"Jean Paul genser i 100% ull, pent brukt✨ er i str.m, men veldig liten, passer xs-s! \n #jeanpaul #genser #classy #litebrukt #brun #ull"
As you can see, our users mainly hashtag stuff, so that one is the most important one.
Is there any way I can improve either NSRegularExpression or the regex statements to avoid the performance hit?
As @raidfive suggests, most likely your best course of action here is to create one or more NSRegularExpression instances ahead of time and reuse them whenever needed.
Note that since it's the creation/compiling of regexes that makes the biggest difference in your time profile (at least, in as much of the time profile as you've shared), caching regexes may win you back enough performance that you no longer need your intermediate optimization of enabling only the detection elements you need. In that case, you need only one regex (representing detection of all possible element types), so caching/reuse is easy.
Note furthermore that your intermediate "optimization" may not actually improve performance to begin with; it might even harm it. Matching a regex, however complicated, requires scanning the string roughly once. Deciding which element types to detect means searching the string multiple times: once for each contains("#") (etc.) test, then once more to evaluate the string against the regex. Those repeated string searches might well cost more than compiling a single regex.
If you find after implementing the single cached universal regex that you're still (somehow) hamstrung on regex performance, you can cache multiple regexes, one for each search scenario you're processing. The combinatorics presumably work out such that you still have far fewer different regexes than you have strings to process, so if you compile them all before the user even starts scrolling, you're not paying the time cost of compiling them during scrolling. Per the previous paragraph, though, this only makes sense if you have a cheap (i.e. not string search) way of detecting which regex you need for each string.
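A minimal sketch of the caching idea, assuming a helper type (RegexCache is hypothetical, not ActiveLabel's API; ActiveLabel's own getElements would simply look the compiled regex up instead of recompiling it):

```swift
import Foundation

// Hypothetical cache: compile each pattern once, then hand the same
// NSRegularExpression back on every subsequent call.
final class RegexCache {
    static let shared = RegexCache()
    private var cache = [String: NSRegularExpression]()
    private let lock = NSLock()

    func regex(for pattern: String) -> NSRegularExpression? {
        lock.lock(); defer { lock.unlock() }
        if let cached = cache[pattern] { return cached }
        let compiled = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive])
        cache[pattern] = compiled
        return compiled
    }
}

// getElements then only pays the compile cost the first time a pattern is seen.
func getElements(from text: String, with pattern: String, range: NSRange) -> [NSTextCheckingResult] {
    guard let regex = RegexCache.shared.regex(for: pattern) else { return [] }
    return regex.matches(in: text, options: [], range: range)
}
```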
You could try creating and storing the NSRegularExpression instances in a variable (class or instance) so you're only creating them once.
Related
I have the following question. I set up a Camel project to parse certain XML files, and I have to selectively extract certain nodes from each file.
I have two files, 246 KB and 347 KB in size. In the example given, I am extracting a parent-child pair of 250 nodes.
With the default factory, the times are 77 seconds and 106 seconds respectively for the two files. I wanted to improve the performance, so I switched to Saxon; the times are now 47 seconds and 54 seconds, so I was able to cut the time down by roughly half.
Is it possible to cut the time further? Any other factory or optimizations I can use would be appreciated.
I am using XPathBuilder to extract the XPaths; here is an example. Is it possible not to have to create an XPathBuilder repeatedly? It seems like it has to be constructed for every XPath. I would like to have one instance and keep pumping XPaths into it; maybe that would improve performance further.
return XPathBuilder.xpath(nodeXpath)
    .saxon()
    .namespace(Consts.XPATH_PREFIX, nameSpace)
    .evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. I am essentially joining the extracted values, which will become clear with my example below; I am combining them into a JSON document.
So here we go. Let's say we have the following mappings for the first and second paths.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in JSON like the following:
pData: {
    tinf: {
        rexd: <value_from_xml>
    },
    pIdentifi: {
        instId: <value_from_xml>
    }
}
Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.
Actually, I was able to bring the time down to 10 seconds. I am using Apache Camel, so I added threads there so that multiple files can be read in separate threads. Once a file was read, the processing was serial, based on the number of nodes that had to be traversed. I realized it did not need to be serial, so I introduced parallelStream, and that gave it enough power. One thing to guard against is a proliferation of threads, since that can degrade performance, so I try to restrict the number of threads to two or three times the number of cores on the machine.
As I was formatting my code for a project, specifically regarding uniform spacing (OCD... hehe), I had a curious thought: does having "too much" (or too little) spacing affect code performance in any way?
I tried searching the docs and Google but was only able to find performance "tips" that were unrelated. I assume any relevant knowledge might involve understanding lower-level languages (with which I have very little experience). I'll post two simple examples below:
////////// - Lots Of Spacing
func exampleFunctionOne(newText: String) {
    if newText.isEmpty {
        return
    }
    exampleLabel.text = newText
    return
}
////////// - Little Spacing
func exampleFunctionTwo(newText: String) {
    if newText.isEmpty { return }
    exampleLabel.text = newText; return
}
Keep in mind, even though these examples are small, my project is currently around 20,000 lines (if that matters).
Code formatting in general, whether it is spaces, comments or other decoration, does not affect performance after the code is compiled.
The compilation/build time of your project may be affected slightly as more and more "non-code" characters are added, since the lexer has to scan past and discard them.
No, it doesn't affect app performance. Spacing, comments and any other non-code or pre-compile directives are stripped out before compilation. They are not converted into the binary, so they have no effect on your app's performance.
That said, as the number of source files increases, the build time will naturally increase. With the incremental build feature of modern compilers, the effect on build time will not be noticeable. Also, some IDEs pre-process source files when idle, and keep that incremental as you make changes, to save build time when you need a build.
I am not able to make iOS VoiceOver / Accessibility read large amounts in money format. For example, £782284.00 should be read as "seven hundred and eighty-two thousand, two hundred and eighty-four", but VoiceOver reads it as "seven eight two two eight four".
The best way to achieve this is to format your numbers so that they are vocalized as desired by VoiceOver.
NumberFormatter with the .spellOut style, used to build the accessibilityLabel, is the important tool for adapting the VoiceOver vocalization of large amounts.
I strongly encourage you to vocalize numbers as they should be: the application content MUST be adapted to the VoiceOver users, not the opposite.
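A minimal sketch of that approach, assuming a hypothetical amountLabel and the "pounds" wording (neither comes from the question):

```swift
import UIKit

// Visually the label still shows the formatted amount.
let amountLabel = UILabel()
amountLabel.text = "£782,284.00"

// For VoiceOver, spell the number out instead of letting it read digit by digit.
let spellOutFormatter = NumberFormatter()
spellOutFormatter.numberStyle = .spellOut
if let spokenAmount = spellOutFormatter.string(from: 782284) {
    // e.g. "seven hundred eighty-two thousand two hundred eighty-four pounds"
    amountLabel.accessibilityLabel = "\(spokenAmount) pounds"
}
```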
It is really important to make sure you do all you can to make the app easier to use for VoiceOver users. I have been running an app for sighted and visually impaired players, and you can see an example of this method running in the inventory section of the app:
https://apps.apple.com/us/app/swordy-quest-an-rpg-adventure/id1446641513
The number of requests I got from blind & visually impaired players to read out millions as millions, rather than individual digits, was huge. Please do take the extra effort to make it fully VoiceOver compatible. It makes life so much easier for VoiceOver users. Here is a method I created solely for this purpose, that VoiceOver seems to like. You are basically adding thousand comma separators:
// e.g. 1000 -> "1,000"
public static func addCommaSeperatorsToInt(number: Int) -> String {
    let numberFormatter = NumberFormatter()
    numberFormatter.numberStyle = NumberFormatter.Style.decimal
    return numberFormatter.string(from: NSNumber(value: number))!
}
I agree with what @aadrian suggests: try not to break the convention VoiceOver users are used to, because large numbers take a long time to read out, which slows users' navigation across numbers.
However, if you really need it, you can use a number-to-words converter (I could not find one for Swift/Obj-C, but you will get the idea) and then set the result as the accessibilityLabel of the UIView or whatever. Then it will read as you like.
Also see this
Is it possible to have a function that returns x lines of a file counted from the end? The function would take a parameter defining how far from the end to start reading (measured in lines) and how many lines to return from that position:
get_lines_file_end(IoDevice, LineNumberPositionFromEnd, LineCount) ->
Example:
We have a file with 30 lines, 0-29
get_lines_file_end(IoDevice, -10, 10) // will return lines 20-29
get_lines_file_end(IoDevice, -20, 10) // will return lines 10-19
The problem is that with file:position I can only seek by a certain number of bytes.
Purpose:
View a large log file (hundreds of MB) in a paged manner, starting from the last "page".
Erlang is used for a REST API which is consumed by a JavaScript web front end.
The purpose of such a function is to view whole log files page by page, where a page is represented by x lines of text. No processing of the log files, or extraction of particular information from them, is needed.
Thanks
Two points to be made:
To make this efficient you must create metadata about your text file contents to amortize the work involved. This way you can directly skip to the bits you need by seeking using file:position/2 after you have created this metadata.
If this is your use case then you should be partitioning the work differently. The huge text files should either be broken down into smaller text files, or (more likely) you shouldn't be using text files at all. Depending on what your goal is (which you haven't mentioned; I strongly suspect this is an X-Y problem) you probably don't want text at all but rather want to know something represented by the text. It may be a good idea to keep the raw text around somewhere just in case, but for actual processing of the data it is almost certainly a better idea to create symbolic data that (much more briefly) represents whatever you find interesting about the data, and store that in a database where seeking, scanning, indexing and doing whatever other things you might want are natural operations.
To build metadata about the files, you will need to do something analogous to:
1> {ok, Data} = file:read_file("TheLongDarkTeaTimeOfTheSoul.txt").
{ok,<<"Douglas Adams. The Long Dark Tea-Time of the Soul\r\n\r\n"...>>}
2> LineEnds = binary:matches(Data, <<"\r\n">>).
[{49,2},
{51,2},
{53,2},
{...}|...]
And then save LineEnds somewhere separately as metadata about the file itself. Using this, seeking within the file data is elementary (as in, use file:position/2 with the offset at linebreak X, or at length(LineEnds) - X, or whatever).
But this is still silly.
If you want to hop around within log files, and especially if you want to be able to locate patterns within them, count certain aspects of them, etc. then you would almost certainly do better reading them into a database like Postgres line by line, counting the line numbers as you go. At that point, pagination becomes a trivial issue.
Log files, however, are usually full of the sort of data that is best represented by symbols, not actual text, and it is probably an even better idea to tokenize the log file. Consider the case of access log files. A repeating number of visitors access from a finite number of access points (IPs, or devices, or whatever) an arbitrary number of times. Each aspect of this can be separately indexed and compared rather trivially within a database. The tokenization itself is rather trivial as well. Not only is this solution much faster when it comes to later analysis of the data, but it lends itself naturally to answering otherwise very difficult to answer questions about the contents of the data in a very straightforward and familiar manner. ...And you don't even have to lose any of the raw data, or intermediate stages of processing (which may all be independently useful in different ways).
Also... note that all of the above work can be made parallel very easily in Erlang. Whatever your computing resource situation is, writing a solution that best leverages your hardware is certainly within grasp (assuming you have enough total data that this is even an issue).
Just like many "How to do X with data Y?" questions, the real answer is always going to revolve around "What is your goal regarding the data and why?"
You can use the file:read_line/1 function to read lines, discarding those that don't fall in your range:
%% Skip lines until we reach line number From, then collect the rest.
get_lines(File, From) when From > 0 ->
    get_lines(File, file:read_line(File), From, 1).

get_lines(_File, eof, _From, _Current) ->
    [];
get_lines(File, {ok, _Line}, From, Current) when Current < From ->
    %% Not yet at the starting line: discard and keep reading.
    get_lines(File, file:read_line(File), From, Current + 1);
get_lines(File, {ok, Line}, From, Current) ->
    %% At or past the starting line: keep the line.
    [Line | get_lines(File, file:read_line(File), From, Current + 1)];
get_lines(_IoDevice, Error, _From, _Current) ->
    Error.
Imagine we have some reference text on hand
Four score and seven years ago our fathers brought forth on this
continent a new nation, conceived in liberty, and dedicated to the
proposition that all men are created equal. Now we are engaged in a
great civil war, testing whether that nation, or any nation, so
conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that
field, as a final resting place for those who here gave their lives
that that nation might live. It is altogether fitting and proper that
we should do this. But, in a larger sense, we can not dedicate, we can
not consecrate, we can not hallow this ground. The brave men, living
and dead, who struggled here, have consecrated it, far above our poor
power to add or detract. The world will little note, nor long remember
what we say here, but it can never forget what they did here. It is
for us the living, rather, to be dedicated here to the unfinished work
which they who fought here have thus far so nobly advanced. It is
rather for us to be here dedicated to the great task remaining before
us—that from these honored dead we take increased devotion to that
cause for which they gave the last full measure of devotion—that we
here highly resolve that these dead shall not have died in vain—that
this nation, under God, shall have a new birth of freedom—and that
government of the people, by the people, for the people, shall not
perish from the earth.
and we receive snippets of that text back to us with no spaces or punctuation, and some characters deleted, inserted, and substituted
ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn
Using the reference text, what are some tools (in any programming language) we can use to try to properly space the words?
ield as a final rTsting place for who fought here gave their liZes that that n
correcting errors is not necessary, just spacing
Weird problem you've got there :)
If you can't rely on capitalization for hints, just lowercase everything to start with.
Then get a dictionary of words. Maybe just a wordlist, or you could try Wordnet.
And a corpus of similar, correctly spaced, material. If suitable, download the Wikipedia dump. You'll need to clean it up and break into ngrams. 3 grams will probably suit the task. Or save yourself the time and use Google's ngram data. Either the web ngrams (paid) or the book ngrams (free-ish).
Set a max word length cap. Let's say 20chars.
Take the first char of your mystery string and look it up in the dictionary. Then take the first 2 chars and look them up. Keep doing this until you get to 20. Store all matches you get, but the longest one is probably the best. Move the starting point 1 char at a time, through your string.
You'll end up with an array of valid word matches.
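A minimal Swift sketch of that scanning step (the tiny word set is a stand-in for a real dictionary or wordlist):

```swift
// For each starting position, collect every dictionary word that begins there,
// up to the length cap; the longest match at a position is usually the best.
func wordMatches(in text: String,
                 dictionary: Set<String>,
                 maxWordLength: Int = 20) -> [(start: Int, word: String)] {
    let chars = Array(text.lowercased())
    var matches: [(start: Int, word: String)] = []
    for start in chars.indices {
        for length in 1...min(maxWordLength, chars.count - start) {
            let candidate = String(chars[start..<start + length])
            if dictionary.contains(candidate) {
                matches.append((start, candidate))
            }
        }
    }
    return matches
}

// Toy word set standing in for a real dictionary.
let words: Set<String> = ["a", "as", "final", "resting", "place", "for", "who", "here"]
print(wordMatches(in: "asafinal", dictionary: words))
// [(start: 0, word: "a"), (start: 0, word: "as"), (start: 2, word: "a"),
//  (start: 3, word: "final"), (start: 6, word: "a")]
```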
Loop through this new array and pair each value up with the following value, comparing it to the original string, so that you identify all possible valid word combinations that don't overlap. You might end up with 1 output string, or several.
If you've got several, break each output string into 3-grams. Then lookup in your new ngram database to see which combinations are most frequent.
There might also be some time-saving techniques, like starting with stop words, checking them in a dictionary combined with incremental letters on either side, and adding spaces there first.
... or I'm over-thinking the whole issue and there's an awk one-liner that someone will humble me with :)
You can do this using edit distance and finding the minimum edit distance substring of the reference. Check out my answer (PHP implementation) to a similar question here:
Longest Common Substring with wrong character tolerance
Using the shortest_edit_substring() function from the above link, you can add this to do the search after stripping out everything but letters (or whatever you want to keep in: letters, numbers, etc.) and then correctly map the result back to the original version.
// map a stripped down substring back to the original version
function map_substring($haystack_letters, $start, $length, $haystack, $regexp)
{
    $r_haystack = str_split($haystack);
    $r_haystack_letters = $r_haystack;
    foreach ($r_haystack as $k => $l)
    {
        if (preg_match($regexp, $l))
        {
            unset($r_haystack_letters[$k]);
        }
    }
    $key_map = array_keys($r_haystack_letters);
    $real_start = $key_map[$start];
    $real_end = $key_map[$start + $length - 1];
    $real_length = $real_end - $real_start + 1;
    return array($real_start, $real_length);
}
$haystack = 'Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.';
$needle = 'ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn';
// strip out all non-letters
$regexp_to_strip_out = '/[^A-Za-z]/';
$haystack_letters = preg_replace($regexp_to_strip_out,'',$haystack);
list($start,$length) = shortest_edit_substring($needle,$haystack_letters);
list($real_start,$real_length) = map_substring($haystack_letters,$start,$length,$haystack,$regexp_to_strip_out);
printf("Found |%s| in |%s|, matching |%s|\n",substr($haystack,$real_start,$real_length),$haystack,$needle);
This will do the error correction as well; it's actually easier to do it than to not do it. The minimum edit distance search is pretty straightforward to implement in other languages if you want something faster than PHP.