Apache Camel XPathBuilder performance - parsing

I have the following question. I set up a Camel project to parse certain XML files, from which I have to select and extract certain nodes.
I have two files, 246 KB and 347 KB in size. In the example given below, I am extracting a parent-child pair of 250 nodes.
With the default factory, the times for the two files were 77 seconds and 106 seconds respectively. I wanted to improve the performance, so I switched to Saxon, and the times came down to 47 seconds and 54 seconds, cutting the time by nearly half.
Is it possible to cut the time further? Any other factory or optimizations I can use would be appreciated.
I am using XPathBuilder to extract the XPaths; here is an example. Is it possible to avoid creating an XPathBuilder repeatedly? It seems it has to be constructed for every XPath; I would like to have one instance and keep feeding XPaths into it, which might improve performance further.
return XPathBuilder.xpath(nodeXpath)
.saxon()
.namespace(Consts.XPATH_PREFIX, nameSpace)
.evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
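For what it's worth, one possible way to avoid rebuilding the expression for every evaluation is to cache one compiled builder per XPath and reuse it across files. This is only a sketch, under the assumption that an XPathBuilder instance can safely be reused for repeated evaluations (worth verifying for your Camel version); the cache and helper method are illustrative, not part of the original code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.camel.Exchange;
import org.apache.camel.builder.xml.XPathBuilder; // Camel 2.x package; adjust for your version

// Illustrative per-expression cache: each XPath is compiled into a builder once
// and then reused for every exchange instead of being rebuilt each time.
public class XPathExtractor {

    private final Map<String, XPathBuilder> builders = new ConcurrentHashMap<>();

    public String extract(Exchange exchange, String nodeXpath, String nameSpace) {
        // Consts.XPATH_PREFIX as in the snippet above
        XPathBuilder builder = builders.computeIfAbsent(nodeXpath, xp ->
                XPathBuilder.xpath(xp)
                        .saxon()
                        .namespace(Consts.XPATH_PREFIX, nameSpace));
        return builder.evaluate(exchange.getContext(),
                exchange.getIn().getBody(String.class), String.class);
    }
}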
Adding more details based on Michael's comments. I am, in a sense, joining them; it will become clearer with my example below. I am combining the extracted values into a JSON document.
So here we go. Let's say we have the following mappings for the first and second paths.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in JSON like the following:
pData: {
  tinf: {
    rexd: <value_from_xml>
  },
  pIdentifi: {
    instId: <value_from_xml>
  }
}

Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.
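To illustrate the general idea behind the O(n*m) remark (in plain Java rather than XPath/XSLT, since the actual expressions aren't shown, and with made-up record and field names): a naive join touches every pair of items, while indexing one side first makes each lookup cheap.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {

    record Payment(String id, String reqdExecutionDate) {}
    record Transaction(String paymentId, String instrId) {}

    // O(n*m): for every transaction, scan all payments.
    static Map<String, String> naiveJoin(List<Payment> payments, List<Transaction> txs) {
        Map<String, String> out = new HashMap<>();
        for (Transaction tx : txs) {
            for (Payment p : payments) {
                if (p.id().equals(tx.paymentId())) {
                    out.put(tx.instrId(), p.reqdExecutionDate());
                }
            }
        }
        return out;
    }

    // Roughly O(n + m): index the payments by id once, then look each transaction up directly.
    static Map<String, String> indexedJoin(List<Payment> payments, List<Transaction> txs) {
        Map<String, Payment> byId = new HashMap<>();
        for (Payment p : payments) {
            byId.put(p.id(), p);   // assumes payment ids are unique
        }
        Map<String, String> out = new HashMap<>();
        for (Transaction tx : txs) {
            Payment p = byId.get(tx.paymentId());
            if (p != null) {
                out.put(tx.instrId(), p.reqdExecutionDate());
            }
        }
        return out;
    }
}

Inside a stylesheet, xsl:key plays the same role as the index above: it is built once, and each key() lookup then avoids rescanning the document.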

Actually, I was able to bring the time down to 10 seconds. I am using Apache Camel, so I added threads there so that multiple files can be read in separate threads. Once a file had been read, the processing was serial, based on the number of nodes that had to be traversed. I realized that it did not need to be serial, so I introduced parallelStream, and that gave it enough power. One thing to guard against is a proliferation of threads, since that can degrade performance, so I try to restrict the number of threads to two or three times the number of cores on the machine.
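A sketch of the thread-capping idea described above, assuming the per-node work can be expressed as a method call (processNode and nodeIndexes are hypothetical placeholders). Running the parallel stream inside a dedicated ForkJoinPool is a common, if unofficial, way to bound its parallelism instead of using the shared common pool:

import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class NodeProcessor {

    // Cap parallelism at roughly twice the number of cores, as described above.
    private final int parallelism = 2 * Runtime.getRuntime().availableProcessors();

    void processAll(List<Integer> nodeIndexes) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // Tasks submitted to this pool run the parallel stream on its workers,
            // keeping the total number of threads bounded.
            pool.submit(() -> nodeIndexes.parallelStream().forEach(this::processNode)).join();
        } finally {
            pool.shutdown();
        }
    }

    void processNode(int index) {
        // hypothetical placeholder: evaluate the XPaths for this node and add
        // the extracted values to the JSON being built
    }
}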

Array#product RangeError: too big to product

I have 93 arrays. Each array has about 18 values on average.
I need to compute the product of these arrays.
So I have a two-dimensional array that stores these 93 arrays.
Here is what I try to do:
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know a workaround to deal with this?
Some way to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64bit of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibyte on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
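(For reference, assuming exactly 18 values per array: 18^93 ≈ 5.5 × 10^116 combinations, and 5.5 × 10^116 × 93 elements × 8 bytes ≈ 4.1 × 10^119 bytes, which is where the figures above come from.)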
Does anyone know a workaround to deal with this? Some way to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
Right, so as it is hopefully clear that this should not be attempted, I'll take a risk and attempt to solve your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and this is the mistake, as the number of possible combinations is just... large...
Instead, you can test that some of the combinations work. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
where num is the number of combinations you want to test. Note that there is a chance that some of the entries will be duplicated, but given your dataset size the chances are comparable to colliding UUIDs and can be safely ignored.
Generating such a subset of the possible combinations is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combinations on each run, so remember to have some randomization setup in your test suite if you want to be able to recreate failures.

Dask - Understanding diagnostics - memory:list

I am working on a fairly complex application that makes use of the Dask framework, and I am trying to increase its performance. To that end I am looking at the diagnostics dashboard. I have two use cases: in the first I have a 1 GB Parquet file split into 50 parts, and in the second I have the first part of the above file, split over 5 parts, which is what was used for the following charts:
The red node is called "memory:list" and I do not understand what it is.
When running the bigger input, this seems to block the whole operation.
Finally, this is what I see when I go inside those nodes:
I am not sure where I should start looking to understand what is generating this memory:list node, especially since there is no "stack" button inside the task, as there usually is. Any suggestions?
Red nodes are in memory. So this computation has occurred, and the result is sitting in memory on some machine.
It looks like the type of the piece of data is a Python list object. Also, the name of the task is list-159..., so probably this is the result of calling the list Python function.

Importing and processing data from a CSV file in Delphi

I had a pre-interview task, which I completed, and the solution works; however, I was marked down and did not get an interview due to having used a TADODataset. I basically imported a CSV file which populated the dataset; the data had to be processed in a specific way, so I used filtering and sorting on the dataset to make sure the data was ordered the way I wanted, and then did the logic processing in a while loop. The feedback I received said that this was bad, as it would be very slow for large files.
My main question here is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used string lists or something like that?
It really depends on how "big" the files are and on the resources (in this case, RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around (in most cases that I've encountered, files are ~1 MB up to ~10 MB, but that's not to say that others would not dump more data in CSV format) without worrying too much (if at all) about import/export, since the format is extremely simple.
Suppose you have an 80 MB CSV file; that's a file you want to process in chunks, otherwise (depending on your processing) you can eat up hundreds of MB of RAM. In this case, what I would do is:
// Reader is a TStreamReader over the CSV file, Chunk is a TStringList, and
// ChunkSize, ProcessChunk and WriteOutput are placeholders for your own limit and logic.
while not Reader.EndOfStream do
begin
  // step 1: read up to ChunkSize lines in one go
  // (if fewer remain, e.g. 50 lines left and ChunkSize is 100, you just read those)
  Chunk.Clear;
  while (not Reader.EndOfStream) and (Chunk.Count < ChunkSize) do
    Chunk.Add(Reader.ReadLine);

  // step 2: process the information
  ProcessChunk(Chunk);

  // step 3: generate output, database inserts, etc.
  WriteOutput(Chunk);
end;
In the above case, you're not loading 80 MB of data into RAM, but only a few hundred KB at a time, and the rest of the memory you can use for processing, i.e. linked lists, dynamic insert queries (batch inserts), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised; they were probably looking to see if you're capable of creating algorithm(s) and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably expecting to see you use dynamic arrays and create one (or more) sorting algorithm(s).
"Should I have used String Lists or something like that?"
The response might have been the same; again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution on any medium file upwards is to use an 'external sort'.
An 'external sort' is a two-stage process: the first stage is to split the file into manageable, sorted smaller files; the second stage is to merge these files back into a single sorted file, which can then be processed line by line.
It is extremely efficient on any CSV file with over, say, 200,000 lines. The amount of memory the process runs in can be controlled, and thus the danger of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
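For illustration, here is a minimal sketch of the two stages (shown in Java rather than Delphi purely for compactness; the class and method names are made up, and a real implementation would add error handling and a configurable comparator):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    // Stage 1 + stage 2: split the input into sorted chunk files, then merge them.
    public static Path sort(Path input, int maxLinesInMemory) throws IOException {
        return merge(splitIntoSortedChunks(input, maxLinesInMemory));
    }

    // Stage 1: read up to maxLines lines at a time, sort each batch, write it to a temp file.
    private static List<Path> splitIntoSortedChunks(Path input, int maxLines) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(maxLines);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLines) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Stage 2: k-way merge of the sorted chunks, always emitting the smallest head line.
    private static Path merge(List<Path> chunks) throws IOException {
        Path output = Files.createTempFile("sorted-", ".txt");
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heads.add(new Head(first, reader));
                }
            }
            while (!heads.isEmpty()) {
                Head smallest = heads.poll();
                writer.write(smallest.line);
                writer.newLine();
                String next = smallest.reader.readLine();
                if (next != null) {
                    heads.add(new Head(next, smallest.reader));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
        return output;
    }

    private static final class Head {
        final String line;
        final BufferedReader reader;

        Head(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }
    }
}

The Delphi classes mentioned above (TStringList, TList, TQueue) can play the same roles in stage one and in the merge.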
Good Luck

Why do we use data structures? (when no dynamic allocation is needed)

I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? E.g.: a BST possibly has to chase some pointers before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
E.g.: using a linked list / BST / std::vector in a situation where a simple (non-dynamic) array could be used.
Each thing you are storing is being kept in its own variable (or storage location). Data structures apply organization to your data. Imagine if you had 10,000 things you were trying to track. You could store them in 10,000 separate variables. If you did that, then you'd always be limited to 10,000 different things. If you wanted more, you'd have to modify your program and recompile it each time you wanted to increase the number. You might also have to modify the code to change the way the calculations are done if the order of the items changes because a new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to be both more organized and more extensible. Using an array, which can either be created to hold the required number of elements or extended to hold more after it is first created, keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure allows you to design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help. You could, for example, organize all of your important papers by putting each of them into a separate filing cabinet. If you did that, you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store them all in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimal, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs, you can use different filing methods (data structures/algorithms) to better organize your information for its intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item you are using. Frequently there is a mixture of individual variables and more complex structures, with the appropriate method used depending on the particular item. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple, though, before the introduction of data structures wouldn't be appropriate.
Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requiring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then, changing a single constant and recompiling is sufficient for changes to that size (if the code was written with this in mind). With your approach, we'd have to type hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
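A minimal sketch of the single-constant approach (the names are illustrative; the original point is about C arrays, but the idea is the same in Java):

public class FixedCapacityExample {

    // Change MAX_ITEMS and recompile; every array and loop that uses it follows along.
    static final int MAX_ITEMS = 10_000;

    public static void main(String[] args) {
        int[] items = new int[MAX_ITEMS];
        for (int i = 0; i < MAX_ITEMS; i++) {
            items[i] = i * i;   // placeholder computation
        }
        System.out.println("capacity = " + items.length);
    }
}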
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you've still got a data structure. You merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you changed nothing.
But surely it has a performance benefit? Well, possibly, although it will be tiny. But it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code size is very useful for performance, to allow the code to stay in the cache. Also, argument passing would result in excessive copying unless you figured out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to be broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.
The main benefit of data structures, in my opinion, is that you are relationally grouping them. For instance, instead of having 10 separate variables of class MyClass, you can have a data structure that groups them all. This grouping allows certain operations to be performed because the items are structured together.
Not to mention, having data structures can potentially enforce type safety, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
Console.WriteLine(s);

Optimizing Lucid/Solr to index large text documents

I am trying to index about 3 million text documents in Solr. About 1/3 of these files are emails that have about 1-5 paragraphs of text in them. The remaining 2/3 of the files only have a few words to a few sentences each.
It takes Lucid/Solr nearly 1 hour to fully index the entire dataset I'm working with. I'm trying to find ways to optimize this. I have set up Lucid/Solr to only commit every 100,000 files, and it indexes the files in batches of 50,000 at a time. Memory isn't an issue anymore, as it consistently stays around 1 GB because of the batching.
The entire dataset has to be indexed initially. It's like a legacy system that has to be loaded into a new system, so the data has to be indexed, and it needs to be as fast as possible, but I'm not sure what areas to look into to optimize this time.
I'm thinking that maybe there are a lot of little words like "the, a, because, should, if, ..." that cause a lot of overhead and are just "noise" words. I am curious whether cutting them out would drastically speed up the indexing time. I have been looking at the Lucid docs for a while, but I can't seem to find a way to specify which words not to index. I came across the term "stop list" but didn't see much more than a passing reference to it.
Are there other ways to make this indexing go faster, or am I just stuck with a 1-hour indexing time?
We ran into a similar problem recently. We can't use SolrJ, as the request and response have to go through some other applications, so we took the following steps:
Creating Custom Solr Type to Stream Large Text Field!
1. Use GZipOutput/InputStream and Base64Output/InputStream to compress the large text. This reduces the size of the text by about 85%, which in turn reduces the time needed to transfer the request/response.
2. To reduce memory usage at the client side:
2.1 We use a streaming API (Gson streams or XML StAX) to read documents one by one.
2.2 Define a custom Solr field type, FileTextField, which accepts a FileHolder as its value. FileTextField eventually passes a Reader to Lucene, and Lucene uses the Reader to read the content and add it to the index.
2.3 When the text field is too big, first uncompress it to a temp file, create a FileHolder instance, then set the FileHolder instance as the field value.
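A minimal sketch of the compression in step 1 (plain JDK GZIP plus Base64; the class name is illustrative, and the ~85% reduction quoted above will of course depend on the text):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class FieldCompressor {

    // Gzip the large text field and Base64-encode it so it can travel as a string
    // inside the request.
    public static String compress(String largeText) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(largeText.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse of compress(): decode and gunzip back to the original text
    // (on the server side this could be streamed to a temp file instead, as in step 2.3).
    public static String decompress(String base64Gzip) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(base64Gzip);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}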
It seems from your question that indexing time is really important for your application. Solr is a great search engine, but if you need super fast indexing time and that is a very important criterion for you, then you should go with the Sphinx search engine. It won't take you much time to set Sphinx up and benchmark your results.
There can be ways to optimize (like the ones you have mentioned, stopwords, etc.); however, whatever you do with respect to indexing time, Solr won't be able to beat Sphinx. I have done the benchmarking myself.
I too love Solr a lot because of its ease of use and its great out-of-the-box features like n-gram indexing, faceting, multi-core support, spelling correctors, and its integration with other Apache products, but when it comes to optimized algorithms (be it index size, indexing time, etc.) Sphinx rocks!!
Sphinx is open source too. Try it out.
