I'm using SSIS to synchronize data between two databases. I've used SSIS and DTS in the past, but I generally write an application for things of this nature (I'm a coder and it just comes easier to me).
In my package I use a SQL Task that returns about 15,000 rows. I’ve hooked that up to a Foreach Container, and within that I assign the resultset column values to variables, and then map those variables to parameters that are fed to another SQL Task.
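Roughly, the inner SQL Task runs a small parameterized statement like this once per row (table, column, and variable names here are illustrative only, not my real schema):

    -- Runs once for each of the ~15,000 rows produced by the Foreach loop.
    -- Each ? placeholder is mapped from a package variable (e.g. User::BusinessKey).
    UPDATE dbo.Destination
    SET    Name  = ?,        -- from User::Name
           Value = ?         -- from User::Value
    WHERE  BusinessKey = ?;  -- from User::BusinessKey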
The problem I’m having is with debugging, and not just more complicated debugging like breakpoints and evaluating values at runtime. I simply mean that if I run this with debugging rather than without, it takes hours to complete.
I ended up rewriting the process in Delphi, and the following is what I came up with:
Full Push of Data:
This pulls 15,000 rows, updates a destination table for each row, then pulls 11,000 rows and updates a destination table for each row.
Debugging:
Delphi App: 139s
SSIS: 4 hours, 46 minutes
Not Debugging:
Delphi App: 132s
SSIS: 384s
Update of Data:
This pulls 3,000 rows, but no updates are needed or made to the destination table. It then pulls 11,000 rows but, again, no updates are needed or made to the destination table.
Debugging:
Delphi App: 42s
SSIS: 1 hour, 10 minutes
Not Debugging:
Delphi App: 34s
SSIS: 205s
The odd thing is, I get the feeling that most of this time spent debugging is just updating UI elements in Visual Studio. If I watch the Progress tab, a node is added to a tree for each iteration (thousands in total), and this gets slower and slower as the process goes on. Trying to stop debugging usually doesn't work, as Visual Studio seems caught in a loop updating the UI, and if I check SQL Server Profiler, no actual work is being done. I'm not sure if the machine matters, but it should be more than up to the job (quad core, 4 GB of RAM, 512 MB video card).
Is this sort of behavior normal? As I’ve said I’m a coder by trade, so I have no problem writing an app for this sort of thing (in fact it takes much less time for me to code an application than “draw” it in SSIS, but I figure that margin will shrink with more work done in SSIS), but I’m trying to figure out where something like SSIS and DTS would fit into my toolbox. So far nothing about it has really impressed me. Maybe I’m misusing or abusing SSIS in some way?
Any help would be greatly appreciated, thanks in advance!
SSIS control flow and loops are not very high performance, and they are not designed for processing this amount of data. That is especially true while debugging: before and after each task execution, the debugger sends notifications to the designer process, which updates the colors of the shapes, and this can be slow.
You could get much better performance using a data flow. A data flow does not operate on single rows; it works with buffers of rows, which is much faster, and the debugger is only notified at the beginning and end of each buffer, so its impact is far less noticeable.
SSIS is not designed to do a foreach like that. If you are doing something for each incoming row, you probably want to read those rows into a data flow and then, using a Lookup or Merge Join, determine whether to do an INSERT (these happen in bulk) or send the row to a command component that issues individual SQL UPDATEs (a better-performing option is to batch these into a staging table and do a single UPDATE).
In another typical sync situation, you read all the data into a staging table, then do a SQL Server UPDATE on the existing rows (INNER JOIN) and an INSERT for the new rows (LEFT JOIN, right-hand side IS NULL); a sketch of this pattern is shown below. There is also the possibility of using linked servers, but joins across them can be slow, since all (or a lot of) the data may have to come over the network.
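A minimal T-SQL sketch of that staging-table pattern, with made-up table and column names:

    -- 1) Update rows that already exist in the destination.
    UPDATE d
    SET    d.Name  = s.Name,
           d.Value = s.Value
    FROM   dbo.Destination AS d
    INNER JOIN dbo.Staging AS s
            ON s.BusinessKey = d.BusinessKey;

    -- 2) Insert rows that are new (no match on the destination side).
    INSERT INTO dbo.Destination (BusinessKey, Name, Value)
    SELECT s.BusinessKey, s.Name, s.Value
    FROM   dbo.Staging AS s
    LEFT JOIN dbo.Destination AS d
           ON d.BusinessKey = s.BusinessKey
    WHERE  d.BusinessKey IS NULL;

Two set-based statements like these typically finish in seconds over tens of thousands of rows, which is the main reason the staging approach beats row-by-row updates.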
I have SSIS packages that regularly import 24 million rows, including handling data conversion, validation, and slowly changing dimensions using the TableDifference component, and it performs relatively quickly for that large amount of data compared to a separate client program.
I have noticed this behavior as well. I had an SSIS package for moves that handled somewhere in the neighborhood of 3 million entries; it was not possible to debug it, as it would run for about 3-4 days.
SSIS is still the way I did it; I just don't "debug" with SSIS when working with the full datasets, I simply run the packages. If I must debug, I use very small datasets.
We experience intermittent, seemingly random brownouts of a Firebase Realtime Database. We are beginning to shard our data into multiple databases; however, we are not sure this will solve our problem. It appears to us that Firebase cannot scale to meet our needs in terms of doing frequent writes to a specific data set.
We sync data from a third-party data source in cycles (every 4-10 minutes, 1,000 active jobs). Each update has the potential to change a few thousand nodes in Firebase, most of which lie pretty low in the tree, although most of the time the number of low-level nodes changed is much lower. We do differential updates on the synced data in order to allow very small writes to the lower-level nodes, which helps prevent our users from downloading a ton of additional data. We also batch all of our updates per cycle into only a handful of writes, between 10 and 20 (we're not sure of the performance impact of a batched write to multiple nodes vs. a write to a single node).
Here is an image of the database load graph, which includes some sharding:
[Image: Database Load graph]
The blue line is our "main" database. The orange line is a database containing only the data that requires many writes, as described above. Currently, the main (blue) database is supporting normal operations, including reads, writes, etc. The shard (orange) database is handling only writes. The way the two lines mirror each other is pretty indicative of a write-load issue, given that a large percentage of writes occurs in the morning.
At times, the database load reaches 100% and remains in this state for 30+ minutes.
Please let me know if I can expand on anything or explain anything in more detail. Would appreciate any suggestions on debugging strategies or explanations as to why this may be occurring.
We are actively refactoring a lot of code to mitigate this issue; however, it is not obvious what the main driver is.
By using both the synchronous=OFF and journal_mode=MEMORY options, I am able to reduce the time for an update from 15 ms to around 2 ms, which is a major performance improvement. These updates happen one at a time, so many other optimizations (like wrapping a bunch of them in a single transaction) are not applicable.
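For reference, these are the two options in question, as they would be issued on a connection:

    PRAGMA synchronous = OFF;      -- don't wait for the OS/disk to confirm each write
    PRAGMA journal_mode = MEMORY;  -- keep the rollback journal in RAM instead of on disk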
According to the SQLite documentation, the DB can go 'corrupt' in the worst case if there is a power outage of some type. However, isn't the worst thing that can happen simply that data is lost, or possibly that part of a transaction is lost (which I guess is a form of corruption)? Is it really possible for arbitrary corruption to occur with either of these options? If so, why?
I am not using any transactions, so partially written data from transactions is not a concern, and I can handle losing data once in a blue moon. But if 'corruption' means that all the data in the DB can be randomly changed in an unpredictable way, that would be a strong reason not to use these options.
Does anyone know what the real worst-case behavior would be on iOS?
Tables are organized as B-trees with the rowid as the key.
If some writes get lost while SQLite is updating the tree structure, the entire table might become unreadable.
(The same can happen with indexes, but those could be simply dropped and recreated.)
Data is organized in pages (typically 1 KB or 4 KB). If some page update gets lost while a tree is being reorganized, all the data in the affected pages (i.e., some random rows from the table with nearby rowid values) might become corrupted.
If SQLite needs to allocate a new page, and that page contains plausible data (e.g., deleted data from the same table), and the writing of that page gets lost, then you have incorrect data in the table, without the ability to detect it.
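In other words, with these settings you cannot prevent corruption caused by a lost write, only look for it after the fact; a minimal sketch using the standard SQLite pragmas:

    PRAGMA integrity_check;  -- walks the B-trees and pages, reports 'ok' or a list of problems
    PRAGMA quick_check;      -- faster variant that skips some of the index consistency checks

Note that neither check can catch the last case above, where a lost page write leaves plausible but wrong data in place.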
I'm using the latest ASP.NET MVC and Entity Framework (MVC 5.2.2, EF 6.1.2), and the latest version of Glimpse. I'm working on improving query times when eagerly loading an entity with several nested child objects, and have reduced the number of queries by using .Include("Object.Child") to bring in navigation properties. At first, I thought I was getting a good result, seeing the "Total query execution time" in the SQL tab of Glimpse reduce significantly. Yet the "Total connection open time" stays high, and is very long for the resulting combined mega-query. See screenshot below.
I'm wondering if anyone can help me understand what is going on with the differences in the two durations? Glimpse says my command takes <100 ms, but that the SQL connection takes >5 seconds. The query in this case is really messy with lots of joins etc, however it's not clear where the time goes if indeed the query itself finishes in 100 ms.
Note: I've seen the answer about why two durations here, but it doesn't explain the nature of each.
Thanks for asking the question. The timer for the connection duration starts when the connection is opened and finishes when it is closed. To work this out further: how are you using your context/connection? Are you sharing it, keeping it around, etc.?
After further testing, I think I've figured out what was happening. I saw another question which suggested that the .Include() approach to eager loading hierarchical entities in Entity Framework can result in complex queries with many joins and duplication of data in the result set. I had a long XML string as one of my properties, so if this was duplicated many times in the result set, it would take a long time to return and process.
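To make the duplication concrete, the query EF generates for an .Include() looks roughly like the following (hypothetical table and column names), where the parent's large XML column is repeated once per child row:

    SELECT p.Id, p.BigXmlColumn, c.Id AS ChildId, c.Name
    FROM   dbo.Parents  AS p
    LEFT JOIN dbo.Children AS c ON c.ParentId = p.Id;
    -- if a parent has 50 children, BigXmlColumn comes back 50 times

So even though the command itself executes quickly, streaming all of that repeated data keeps the connection open much longer.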
As a test, I cleared the data-heavy field and reran the query, getting a far shorter "connection" duration (the one listed on the right in Glimpse). It went from 9 seconds to under 200 ms total. Based on this I assume the data size was the culprit, and learned my lesson about using large data properties this way.
I'd still be interested to know whether Glimpse could show you the raw data being returned from a query, or even show the size in bytes, along with the record count. This would have likely made this problem evident.
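Until something like that exists, one way to see the size for yourself is to wrap the SQL that Glimpse shows for the query and sum the payload directly on the server; a hypothetical sketch (names are made up, and the inner SELECT stands in for EF's generated query):

    SELECT COUNT(*)                                         AS RowsReturned,
           SUM(CAST(DATALENGTH(q.BigXmlColumn) AS bigint))  AS XmlBytesReturned
    FROM (
        SELECT p.Id, p.BigXmlColumn, c.Id AS ChildId
        FROM   dbo.Parents  AS p
        LEFT JOIN dbo.Children AS c ON c.ParentId = p.Id
    ) AS q;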
A little late to this question, but I encountered the same problem and was also trying to understand the disparity between my query execution time and connection open time.
FWIW, I discovered that I was passing an enumerable in my view model to the view, rather than a concrete list. Thus, the view was triggering evaluation of the query and prolonging the amount of time that the connection remained open. By passing lists of the items (call .ToList() on the enumerable), I drastically reduced the amount of time for which the connection remained open.
This is very strange. I have a Crystal Report (CR) that takes over 30 minutes to run. It uses 5 large tables and queries the server. I made a View on the server (which is IBM i) to gather the data there. For some reason the CR is not giving me data past 08/12. When I query past that date on the server, it does have data, and even if I make a quick report in CR it will show all the data, including 2013.
The reason can possibly be this:
When I made the View, I mistakenly used a mix of databases, and one of the two databases was being used as part of a data purge, so it may not have had data past 8/12.
But since then, I have also modified the View to add some new columns, and it does show them in the data it does return (up to 8/12).
So this would tell me that the CR is fully using the new View.
So I could re-create the CR, but this is rather tedious. Perhaps there is something I am not doing?
Crystal Reports is generally better at reporting than at processing a query. For faster and easier debugging, it's often better to make a procedure in your database that joins together the data from the various sources. Once you have the data you want, use Crystal to display that data.
In other words, try to avoid doing any more work in Crystal than you have to. Sure, the grouping and headers and pretty formatting will be done there. But all of the querying, joining, and sorting is better done in your database. If the query is slow there, then you can optimize there. If the wrong data is returned, you fix your procedure until it is returning what you want.
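As a rough illustration of the idea (table, column, and procedure names are hypothetical, and the syntax shown is SQL Server flavored; DB2 for i would differ slightly):

    CREATE PROCEDURE dbo.GetOrderReportData
        @FromDate date,
        @ToDate   date
    AS
    BEGIN
        SET NOCOUNT ON;
        -- all joining, filtering, and sorting happens here, not in Crystal
        SELECT o.OrderId, o.OrderDate, c.CustomerName, l.LineTotal
        FROM   dbo.Orders     AS o
        JOIN   dbo.Customers  AS c ON c.CustomerId = o.CustomerId
        JOIN   dbo.OrderLines AS l ON l.OrderId    = o.OrderId
        WHERE  o.OrderDate BETWEEN @FromDate AND @ToDate
        ORDER  BY o.OrderDate;
    END

Crystal then just selects from the procedure and handles grouping and formatting.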
An additional benefit is when the report needs to change. If the data needs to come from a different location, you can modify the procedure and never touch Crystal. If the formatting needs to change, you can modify the Crystal and never touch the procedure. You're changing less and thus don't have to test everything.
Is the crystal report attached to a scratch server?
If you are using SQL Server, then you can modify the SQL that constitutes your view so that the table names are fully qualified, like this: databasename..tablename (database name, two dots, table name). I'm not certain how to do the equivalent in other DBMSs.
If you modify your view like that, so that it is querying tables from the correct (non-purged) database, and you are still not getting data more recent than 8/12, then check whether there are constraints in the WHERE and/or HAVING clauses, or implicit/explicit constraints in the ON sections of the JOINs.
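For example, a hypothetical SQL Server view using that qualification, so it cannot silently read from the purged copy:

    ALTER VIEW dbo.vSalesHistory AS
    SELECT h.OrderId, h.OrderDate, d.Amount
    FROM   LiveDB..SalesHeader AS h   -- databasename..tablename form
    JOIN   LiveDB..SalesDetail AS d ON d.OrderId = h.OrderId;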
I have a simple jQuery call which hits a servlet via GET, and then Neo4j is used to return data in JSON format.
The system is workable after the FIRST query, but the very first time it is used the system is unbelievably slow. This is some kind of initialisation issue. I am using Heroku web hosting.
The code is fairly long so I am not posting it now, but are there any known issues regarding the first invocation of Neo4j?
I have done limited performance testing so far, as I had a lot of JSON problems anyway and they have only just been resolved.
Summary:
jQuery (Linux) <--> GET (JSON) <--> Neo4j
First query - response is 10-20 secs
Second query - time is 2-3 secs
More queries - 2-3 secs.
This is not a one-off; I tested this a few times and always the same pattern comes up.
This is normal behaviour for Neo4j: store files are mapped into memory lazily, for the parts of the files that become hot, and becoming hot may require thousands of requests to such a part. This behaviour has big stores in mind, whereas for smaller stores it merely gets in the way (why not map the whole thing if it fits in memory?).
Then on top of that is an "object" cache that further optimizes access, which gets populated lazily for requested entities.
Using an SSD instead of spinning media will usually speed up the initial non-memory-mapped random access quite a bit, but in your scenario I recognize that's not viable.
There are thoughts on being more sensitive to hot parts of the store (i.e., memory-mapping them even if they are not as hot) at the start of a database's lifecycle, or, more precisely, having the heat sensitivity be a function of how much is currently memory-mapped versus how much can be mapped at maximum. This has been shown to make initial requests much more responsive.