Understanding ETL processes - data-warehouse

ETL seems to be a pretty common task. I am basically reading some ETL mistakes which designers make with very large data on http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-with-very-large-data-19264
I need some practical insights for the following points
a) Incorporating Inserts, Updates, and Deletes in to the same data flow / same process.. How is that a problem?
b) Sourcing multiple systems at the same time, depending on heterogeneous systems of data.
c) Not producing the correct indexes on the sources/ lookups that need to be accessed.
d) Believing that ‘ I need to process all the data in one pass because it’s the fastest way to do it ‘
Any help?

a) Data integrity issue
b) data quality will increase and less failure for smaller chunks.
c) will take more time to complete<
d) wrong indexes can cause more time. Better have indexes based on the query you are executing.
i.e what comes in the where clause of statement
e) splitting the data into smaller data sets and processing the same would be an efficient solution
Your a BITS-PILANI(WILP) student rite.

A) It's a problem if you find the task takes too long to complete (due to increased data volumes), and then it becomes too difficult to technically split them out afterwards. But splitting the tasks out can increase the possibility of inconsistent data loads (i.e. your DELETE works but your insert fails, meaning you are missing a load of data)
B) I don't understand 'at the same time' here - Do you mean simultaneously? You could max out bandwidth (network, disk etc.) if you simultaneously try to load data from many systems. Sometimes you don't have a choice if you need to load that data at offline times.
C) Yes incorrect indexes will slow down access. But often vendors don't like you creating indexes in the source database.
D) Performance tuning (the fastest way to do it) is a complex topic. In some cases it might be faster to do it in one pass. In other cases it may not.

Related

Apache-camel Xpathbuilder performance

I have following question. I set up an camel -project to parse certain xml files. I have to selecting take out certain nodes from a file.
I have two files 246kb and 347kb in size. I am extracting a parent-child pair of 250 nodes in the above given example.
With the default factory here are the times. For the 246kb file respt 77secs and 106 secs. I wanted to improve the performance so switched to saxon and the times are as follows 47secs and 54secs. I was able to cut the time down by at least half.
Is it possible to cut the time further, any other factory or optimizations I can use will be appreciated.
I am using XpathBuilder to cut the xpaths out. here is an example. Is it possible to not to have to create XpathBuilder repeatedly, it seems like it has to be constructed for every xpath, I would have one instance and keep pumping the xpaths into it, maybe it will improve performance further.
return XPathBuilder.xpath(nodeXpath)
.saxon()
.namespace(Consts.XPATH_PREFIX, nameSpace)
.evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. So I am kind of joining them, will become clear with my example below. I am combining them into a json.
So here we go, Lets say we have following mappings for first and second path.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in a json as below
pData:{
tinf: {
rexd: <value_from_xml>
}
pIdentifi:{
instId: <value_from_xml>
}
}
Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.
Actually I was able to bring the time down to 10 secs. I am using apache-camel. So I added threads there so that multiple files can be read in separate threads. Once the file was being read, it had serial operation to based on the length of the nodes that had to be traversed. I realized that it was not necessary to be serial here so introduced parrallelStream and that now gave it enough power. One thing to guard agains is not to have a proliferation of threads since that can degrade the performance. So I try to restrict the number of threads to twice or thrice the number of cores on the operating machine.

Control the tie-breaking choice for loading chunks in the scheduler?

I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"
Your assessment is correct. As of 2018-06-16 there is not currently any way to add in a final tie breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backup numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten then there are no race conditions, but I'm worried that I may be breaking an API contract by changing back data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to the unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.

find the last item but N in a stream w/o storing n items

Suppose there is a stream of data arriving, D(0), D(1), D(2), .... When D(i) comes, I want to know D(i - N). The most straight forward way is to store the most recent N items and keep updating them upon arrival of new data. But the problem is N can be large so that there is no enough memory to store them. Is there anyway to achieve this by storing much less items than N? A constant of M << N of spaces are preferred? Thanks in advance.
Not as far as I can see, unless there is some regularity in the data that you can exploit. If the data are completely random (such that no element can be inferred from the others), then a choice of not saving element k will make it impossible to reproduce that element in iteration k + N.
Instead, consider:
Can you reduce N?
Can you store information on disk or (if you are in an embedded environment) on a slower, cheaper form of memory?
Is there some pattern in the data? If there is e.g. a repeating pattern, you can utilize that, or if there is some mathematical relationship between the numbers, perhaps some formula can aid in reconstructing one number from others. Even if there is no perceptible pattern, perhaps you could use some compression algorithm to reduce the data size?
Is there some limitation to the data, e.g. every number is between 0 and 255? If so, you could perhaps reduce the storage requirements.
(What is the application of this, by the way?)

Multiple calculations on the same set of data: ruby or database?

I have a model Transaction for which I need to display the results of many calculations on many fields for a subset of transactions.
I've seen 2 ways to do it, but am not sure which is the best. I'm after the one that will have the least impact in terms of performance when data set grows and number of concurrent users increases.
data[:total_before] = Transaction.where(xxx).sum(:amount_before)
data[:total_after] = Transaction.where(xxx).sum(:amount_after)
...
or
transactions = Transaction.where(xxx)
data[:total_before]= transactions.inject(0) {|s, e| s + e.amount_before }
data[:total_after]= transactions.inject(0) {|s, e| s + e.amount_after }
...
edit: the where clause is always the same.
Which one should I choose? (or is there a 3rd, better way?)
Thanks, P.
Not to nag, but what about
transactions = Transaction.where(xxx)
data[:total_before] = transactions.sum(:amount_before)
data[:total_after] = transactions.sum(:amount_before)
? This looks like the union of strengths of methods 1 and 2 :) You reuse search results and employ more clean rails-specific sum aggregator.
PS If you were asking whether it's possible to rely on Rails in caching results of Transaction.where(xxx) query, that I don't know. And when I don't know, I prefer to play safe.
Really you're talking about scalability.
If you're talking about millions of rows and needing to do calculations on them, then which do you think would be faster?
Asking the DBM to summarize millions of rows and return you two numbers.
Returning millions of query results across the network which you iterate over twice.
In the first scenario you can scale up your DB host with faster CPUs, more RAM, faster drives or pre-compute your values at regular intervals. The calculations you want done in the DBM are exactly the sort of things it's written to do.
In the second scenario you have to scale up your computing host, and maybe the switch connecting the DBM and computing host, plus maybe the database host because it will have to retrieve and push the data. Imagine the impact on the network as it's handling the data, and the impact on the computing host's CPU as it's doing everything.
I'd do the first one as it seems a lot more scalable to me.

Resources