Optimizing Lucid/Solr to index large text documents

Optimizing Lucid/Solr to index large text documents - memory

I am trying to index about 3 million text documents in solr. About 1/3 of these files are emails that have about 1-5 paragraphs of text in them. The remaining 2/3 files only have a few words to sentences each.
It takes Lucid/Solr nearly 1 hour to fully index the entire dataset I'm working with. I'm trying to find ways to optimize this. I have setup Lucid/Solr to only commit every 100,000 files, and it indexes the files in batches of 50,000 files at once. Memory isn't an issue anymore, as it consistently stays around 1GB of memory because of the batching.
The entire dataset has to be indexed initially. It's like a legacy system that has to be loaded to a new system, so the data has to be indexed and it needs to be as fast as possible, but I'm not sure what areas to look into to optimize this time.
I'm thinking that maybe there's a lot of little words like "the, a, because, should, if, ..." that are causing a lot of overhead and are just "noise" words. I am curious if I cut them out if it would drastically speed up the indexing time. I have been looking at the Lucid docs for a while, but I can't seem to find a way to specify what words not to index. I came across the term "stop list" but didn't see much more than a reference to it in passing.
Are there other ways to make this indexing go faster or am I just stuck with a 1 hour indexing time?

We met similar problem recently. We can't use solrj as the request and response have to go through some applications, so we take the following steps:
Creating Custom Solr Type to Stream Large Text Field!
Use GZipOutput/InputStream and Bse64Output/InputStream to compress the large text. This can reduce size of text about 85%, this can reduce the time to transfer the request/response.
To reduce memory usage at client side:
2.1 We use stream api(GSon stream or XML Stax) to read doc one by one.
2.2 Define a custom Solr Field Type: FileTextField which accepts FileHolder as value. FileTextField will eventually pass a reader to Lucene. Lucene will use the reader to read content and add to index.
2.3 When the text field is too big, first uncompress it to a temp file, create a FileHolder instance, then set the FileHolder instance as field value.

It seems from your query that Indexing time is really important for your application. Solr is a great search engine however if you need super fast indexing time and if that is a very important criteria for you, than you should go with Sphinx Search Engine. It wont take much of time for you to quickly setup and benchmark your results using Sphinx.
There can be ways (like the one you have mentioned, stopwords etc.) to optimize however whatever you do with respect to indexing time Solr won't be able to beat Sphinx. I have done benchmarking myself.
I too love Solr a lot because of its ease of use, its out of box great features like N-Gram Indexing, Faceting, Multi-core, Spelling Correctors and its integration with other apache products etc.. but when it comes to Optimized Algorithms (be it Index size, Index time etc.) Sphinx rocks!!
Sphinx too is open source. Try that out.

Related

Apache-camel Xpathbuilder performance

I have following question. I set up an camel -project to parse certain xml files. I have to selecting take out certain nodes from a file.
I have two files 246kb and 347kb in size. I am extracting a parent-child pair of 250 nodes in the above given example.
With the default factory here are the times. For the 246kb file respt 77secs and 106 secs. I wanted to improve the performance so switched to saxon and the times are as follows 47secs and 54secs. I was able to cut the time down by at least half.
Is it possible to cut the time further, any other factory or optimizations I can use will be appreciated.
I am using XpathBuilder to cut the xpaths out. here is an example. Is it possible to not to have to create XpathBuilder repeatedly, it seems like it has to be constructed for every xpath, I would have one instance and keep pumping the xpaths into it, maybe it will improve performance further.
return XPathBuilder.xpath(nodeXpath)
.saxon()
.namespace(Consts.XPATH_PREFIX, nameSpace)
.evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. So I am kind of joining them, will become clear with my example below. I am combining them into a json.
So here we go, Lets say we have following mappings for first and second path.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in a json as below
pData:{
tinf: {
rexd: <value_from_xml>
}
pIdentifi:{
instId: <value_from_xml>
}
}

Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.

Actually I was able to bring the time down to 10 secs. I am using apache-camel. So I added threads there so that multiple files can be read in separate threads. Once the file was being read, it had serial operation to based on the length of the nodes that had to be traversed. I realized that it was not necessary to be serial here so introduced parrallelStream and that now gave it enough power. One thing to guard agains is not to have a proliferation of threads since that can degrade the performance. So I try to restrict the number of threads to twice or thrice the number of cores on the operating machine.

Sorting 20GB of data

In the past I had to work with big files, somewhere about in the 0.1-3GB range. Not all the 'columns' were needed so it was ok to fit the remaining data in RAM.
Now I have to work with files in 1-20GB range, and they will probably grow as the time will pass. That is totally different because you cannot fit the data in RAM anymore.
My file contains several millions of 'entries' (I have found one with 30 mil entries). On entry consists in about 10 'columns': one string (50-1000 unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant, the rest is low quality data.
So, I need some suggestions about in which direction to head out. I definitively don't want to put data in a DB because they are hard to install and configure for non computer savvy persons. I like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.

I think a way to go is Mergesort, it's a great algorithm for sorting a
large amount of fixed records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLL in your program folder (I'm not sure about Lazarus/OSX).

If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. F.I. lets say you need only 300 entries from 1 million. Scan the first first 300 entries in the file and sort them in memory. Then for each remaining entry check if it is lower than the lowest in memory and skip it. If it is higher as the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.

Really, there are no sorting algorithms that can make moving 30gb of randomly sorted data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.

Please finde here a class which sorts a file using a slightly optimized merge sort. I wrote that a couple of years ago for fun. It uses a skip list for sorting files in-memory.
Edit: The forum is german and you have to register (for free). It's safe but requires a bit of german knowledge.

If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.

importing and processing data from a CSV File in Delphi

I had an pre-interview task, which I have completed and the solution works, however I was marked down and did not get an interview due to having used a TADODataset. I basically imported a CSV file which populated the dataset, the data had to be processed in a specific way, so I used Filtering and Sorting of the dataset to make sure that the data was ordered in the way I wanted it and then I did the logic processing in a while loop. The feedback that was received said that this was bad as it would be very slow for large files.
My main question here is if using an in memory dataset is slow for processing large files, what would have been better way to access the information from the csv file. Should I have used String Lists or something like that?

It really depends on how "big" and the available resources(in this case RAM) for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around(in most cases that I've encountered files are ~1MB+ up to ~10MB, but that's not to say that others would not dump more data in CSV format) without worrying too much(if at all) about import/export since it is extremely simplistic.
Suppose you have a 80MB CSV file, now that's a file you want to process in chunks, otherwise(depending on your processing) you can eat hundreds of MB of RAM, in this case what I would do is:
while dataToProcess do begin
// step1
read <X> lines from file, where <X> is the max number of lines
you read in one go, if there are less lines(i.e. you're down to 50 lines and X is 100)
to process, then you read those
// step2
process information
// step3
generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries(batch insert), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised, they were probably looking to see if you're capable of creating algorithm(s) and provide simple solutions on the spot, but without using "ready-made" solutions.
They were probably thinking of seeing you use dynamic arrays and creating one(or more) sorting algorithm(s).
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".

The interviewer was quite right.
The correct, scalable and fastest solution on any medium file upwards is to use an 'external sort'.
An 'External Sort' is a 2 stage process, the first stage being to split each file into manageable and sorted smaller files. The second stage is to merge these files back into a single sorted file which can then be processed line by line.
It is extremely efficient on any CSV file with over say 200,000 lines. The amount of memory the process runs in can be controlled and thus dangers of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
Good Luck

How to count occurrences of a substring within string fast with Ruby

I have a text file sized 300MB, I want to count the occurrences of each 10,000 substrings in the file. I want to know how to do it fast.
Now, I use the following code:
content = IO.read("path/to/mytextfile")
Word.each do |w|
w.occurrence = content.scan(w.name).size
w.save
end
Word is an ActiveRecord class.
It took me almost 1 day to finish the counting. Is there anyway to do it faster? Thanks.
Edit1:
Thank you again. I am running rails 2.3.9. The name filed of words table contains what I am searching for, and it contains only unique values. Instead of using Word.each, I use batch(1000 rows a time) load. It should help.
I rewrited the whole code with the idea from bpaulon. Now it only took a few hours to finish the counting.
I profiled the new version code, now the largest time costing methods are utf8 encode supported string truncating code
def truncate(n)
self.slice(/\A.{0,#{n}}/m)
end
and characters counting code
def utf8_length
self.unpack('U*').size
end
Any other faster methods to replace them?

Your use of scan creates an array, counts the size of it, then throws it away. If you have a lot of occurrences of the substring inside a big file, you will create a big array temporarily, potentially burning up CPU time with memory management, but that should still run pretty quickly, even with 300MB.
Because Word is an ActiveRecord class, it is dependent on the schema and any indexes in your database, plus any issues your database server might be having. If the database is not optimized or is responding slowly or the query used to retrieve the data is not efficient, then the iteration will be slow. You might find it a lot faster to grab groups of Word so they are in RAM, then iterate over them.
And, if the database and your code are running on the same machine, you could be suffering from resource constraints like having only one drive, not enough RAM, etc.
Without knowing more about your environment and hardware it's hard to say.
EDIT:
I can grab the substrings into an array/hash first, then add the count results to the array or hash, and write the results back to database after all the counting is done. You think it be faster, right?
No, I doubt that will help a lot, and, without knowing where the problem lies all you might do is make the problem worse because you'll have to load 10,000 records as objects from the database, then build a 10,000 element hash or array which will also be in memory along with the DB records, then write them out.
Ruby will only use a single core, currently, but you can gain speed by using Ruby 1.9+. I'd recommend installing RVM and letting it manage your Ruby. Be sure to read the instructions on that page, then run rvm notes and follow those directions.
What is your Word model and the underlying schema and indexes look like? Is the database on the same machine?
EDIT: From looking at your table schema, you have no indexes except for id which really won't help much for normal look-ups. I'd recommend presenting your schema on Stack Overflow's sibling site https://dba.stackexchange.com/ and explain what you want to do. At a minimum I'd add a key to the text fields to help avoid full table scans for any searches you do.
What might help more is to read: Retrieving Multiple Objects in Batches from "Active Record Query Interface".
Also, look at the SQL being emitted when your Word.each is running. Is it something like "select * from word"? If so, Rails is pulling in 10,000 records to iterate over them one by one. If it is something like "select * from word where id=1" then for every record you have a database read followed by a write when you update the count. That is the scenario that the "Retrieving Multiple Objects in Batches" link will help fix.
Also, I am guessing that content is the text you are searching for, but I can't tell for sure. Is it possible you have duplicated text values causing you to do scans more than once for the same text? If so, select your records using a unique condition on that field and then update your counts for all matching records at one time.
Have you profiled your code to see if Ruby itself can help you pinpoint the problem? Modify your code a little to process 100 or 1000 records. Start the app with the -r profile flag. When the app exits profiler will output a table showing where time was spent.
What version of Rails are you running?

I think you could approach this problem differently
You do not need to scan the file this many times, you could create a db, like in mongo or mysql, and for each word you find, you fetch the db for it and then adds on some "counter" field.
You could ask me "but then I will have to scan my database a lot and it could take a lot more". Well, sure you wouldn't ask this, but it won't take more time because databases are focused in IO, besides you could always index it.
EDIT: There is no way to delimit at all?? Let's say that where you have the a Word.name string you really holds a (not simple) regex. Could the regex contain the \n? Well, if the regex can contain any value, you should estimate the maximum size of string the regex can fetch, double it, and scan the file by that ammount of chars but moving the cursor by that number.
Lets say your estimate of the maximum your regex could fetch it is like 20 chars nad your file has from 0 to 30000 chars. You pass each regex you have from 0 to 40 chars, then again from 20 to 60, from 40 to 80, etc...
You should also hold the position you found of your smaller regex so it wouldn't repeat it.
Finally, this solution seems to be not worth the effort, your problem may have a greater solution based on what that regexes are, but it will be faster than invoke scan Words.count times your your 300Mb string.

You could load your entire "Word" table into a Trie, then do back-tracking since you said there are no delimiters in the text.
So for each character in the text, go down the Trie of words. If you hit a word, increment its count. "Going down the trie" involves three cases:
There's no node at this character. (If you're mid-search, pop the back-tracking stack)
There's a node at this character. (But it's not a Word)
There's a node at this character. (It's a Word - increment and "dirty")
Back-tracking is just keeping track of places you want to go after you've exhausted this "search" of the Trie, which is when you run out of nodes to visit. This will probably be each character you visit that is a root of the Trie.
After you've done this, you can then visit all the nodes you changed and just update the records they represent.
This will take some time to implement, but will surely be faster than each & scan.

Are helpers really faster than partials? What about string building?

I've got a fancy-schmancy "worksheet" style view in a Rails app that is taking way too long to load. (In dev mode, and yes I know there's no caching there, "Completed in 57893ms (View: 54975, DB: 855)") The worksheet is rendered using helper methods, because I couldn't stand maintaining umpteen teeny little partials for the different sorts of rows in the worksheet. Now I'm wondering whether partials might actually be faster?
I've profiled the page load and identified a few cases where object caching will shave a few seconds off, but the profile output suggests that a large chunk of time is spent simply looping through the Worksheet model's constituent objects and appending the string output from the helper. Here's an example of what I'm talking about:
def header_row(wksht)
content_tag(:thead, :class => "ioe") do
content_tag(:tr) do
html_row = []
for i in (0...wksht.class::NUM_COLS) do
html_row << content_tag(:th, h(wksht.column_headings[i].upcase),
:class => wksht.column_classes[i])
end
html_row.join("\n")
end
end
end
OTOH using partials means opening files, spinning off the Ruby interpreter, and in the long run, aggregating a bunch of strings, right? So I'm wondering whether there is another way to speed things up in the helpers. Should I be using something like a stringstream (does that exist in Ruby?), should I get rid of content_tag calls in favor of my own "" string interpolation... I'm willing to write my own performance tests, and share the results, if you have any suggested alternatives to the approach I've already taken.
As it's a fairly complex view (and has an editable version as well), I'd rather not rewrite-and-profile the whole thing more than once. :)
Some related reading:
http://www.viget.com/extend/helpers-vs-partials-a-performance-question/ (old)
http://www.breakingpointsystems.com/community/blog/ruby-string-processing-overhead/
http://blog.purepistos.net/index.php/2008/07/14/benchmarking-ruby-string-interpolation-concatenation-and-appending/
#tadman:
There are row totals and column totals (and more columnar arithmetic), and since they're not all just totals, but also depend on other "magic numbers" from the database, I implemented them in the Ruby code rather than Javascript. (DRY and unit testable.) Javascript is used only in the edit view, and just to add/delete rows (client side only) and to fetch a sheet with fresh totals when the cell contents change. It fetches the whole table because nearly half of the values get updated when an input cell changes.
The worksheet and its rows are actually virtual models; they don't live in the DB, but rather aggregate a boatload of real AR objects. They get created every time a view renders (but that takes 1.7 secs in dev mode, so I'm not worried about it).
I suppose I could transmit a matrix of numbers, rather than marked-up content, and have JS unpack it into the sheet. But that gets unmaintainable fast.

I ended up reading an excellent article at http://www.infoq.com/articles/Rails-Performance ("A Look At Common Performance Problems In Rails"). Then I followed the author's suggestion to cache computations during request processing:
def estimated_costs
#estimated_costs ||=
begin
# tedious vector math
end
end
Because my worksheet does stuff like the above over and over, and then builds on those results to calculate some more rows, this resulted in a 90% speedup right off the bat. Should have been plain as day, but it started with just a few totals, then I showed the prototype to the customer, and it snowballed from there :)
I also wondered whether my array-based math might be inefficient, so I replaced the Ruby Arrays of numbers with NArray (http://narray.rubyforge.org/). The speedup was negligible but the code's cleaner, so it's staying that way.
Finally, I put some object caching in place. The "magic numbers" in the database only change a few times a year at most, and some of them are encrypted, but they need to be used in most of the calculations. That's low-hanging fruit ripe for caching, and it shaved off another 1.25 seconds.
I'll look at eager loading of associations next, as there's probably some time to save there, and I'll do a quick comparison of sending "just the data" vs sending the HTML, as #tadman suggested. About the only partial I can cache is the navigation sidebar. All of the other content depends on the request parameters.
Thanks for your suggestions!

Internally all partials are translated into a block of executable Ruby code and run through exactly the same runtime as any helper methods. Periodically you can see glimpses of this when a malformed template causes the generated code to fail to compile.
Although it stands to reason that helper methods are faster than partials, and a straightforward string interpolation is faster still, it's hard to say if the performance gain from this would make it worth pursuing. Rendering a very large number of partials can be a bottleneck in terms of logging in the development environment, but in a production environment their impact seems less severe.
The only way to figure this one out is to benchmark your pages using two different rendering methods.
As you point out, caching is where you get the big gains. Using Memcached to save large chunks of pre-rendered HTML content can give you exponentially faster load times. Rendering 10,000 rows into HTML will always be slower than retrieving the same snippet from the Rails.cache subsystem.
It's also the case that the content you don't render is always rendered the quickest, so anything you can do to reduce the amount of content you generate for each helper call will provide big gains. If you're building a large spread-sheet style app that's entirely dependent on JavaScript, you may find that bundling up the data as a JSON array and expanding it client-side is significantly faster than unrolling the HTML on the server and shipping it over that way.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart