I have a procedure inside a database package that does the following:
The procedure receives an ID for a table, TESTTABLE.
Then we open a cursor and, for each row of TESTTABLE, execute roughly 40 queries that fetch some reference data, like decoding codes, getting more info, etc. These queries depend on the inputs of each row from TESTTABLE, so they are dynamically built and executed.
For each row processed, an insert is made into another table.
So far so good.
When TESTTABLE has 60 rows, the execution time of the procedure is about 10 seconds (~60*40 queries plus 60 inserts into a table);
When TESTTABLE has, for instance, 120 rows, the execution time is ~30 minutes (120*40 queries plus 120 inserts). That is unacceptable!
I can't understand why this is happening: going from a few seconds to dozens of minutes.
Can you help me resolve this and reduce the execution time?
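For context, the pattern described above looks roughly like this; every table, column, and variable name here is hypothetical, since the post does not show the actual code:

```sql
-- Hypothetical PL/SQL sketch of the row-by-row pattern described above
DECLARE
  p_id   NUMBER := 42;        -- hypothetical input ID
  v_desc VARCHAR2(4000);
BEGIN
  FOR rec IN (SELECT * FROM TESTTABLE WHERE batch_id = p_id) LOOP
    -- one of ~40 dynamically built lookups per row
    EXECUTE IMMEDIATE
      'SELECT description FROM code_table WHERE code = :c'
      INTO v_desc
      USING rec.code;
    -- ... ~39 more lookups ...
    INSERT INTO result_table (id, description)
    VALUES (rec.id, v_desc);
  END LOOP;
END;
```

If each of those lookups can be expressed as a join, replacing the loop with a single set-based INSERT ... SELECT usually scales far better than row-by-row processing, and also avoids re-parsing dynamic SQL for every row.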
Thank you very much.
I have a process that generates a new record every 10 minutes. It was fine for some time; however, Datum.all now returns 30k+ records, which are unnecessary, as the purpose is simply to display them on a chart.
So, as a simple solution, I'd like to provide all available data generated in the past 24 hours, but low-resolution data (every 100th record) prior to the last 24 hours (right back to the beginning of the dataset).
I suspect the solution is some combination of this answer, which selects every nth record (but was written in 2010), and this answer, which combines two ActiveRecord relations.
But I cannot work out a working implementation that collects all the required data into one instance variable.
You can use an OR query (ActiveRecord's .or is available since Rails 5):
Datum.where("created_at > ?", 1.day.ago).or(Datum.where("id % 100 = 0"))
The ClickHouse table, using the MergeTree engine, is continuously populated with “INSERT INTO … FORMAT CSV” queries, starting empty. The average input rate is 7,000 rows per second. Insertion happens in batches of a few thousand rows. This has a severe performance impact when SELECT queries are executed concurrently. As described in the ClickHouse documentation, the system needs at most 10 minutes to merge the data of a specific table (re-index). But this is not happening while the table is continuously populated.
This is also evident in the file system. The table folder has thousands of sub-folders and the index is over-segmented. If the data ingestion stops, after a few minutes the table is fully merged, and the number of sub-folders becomes a dozen.
To work around the above weakness, the Buffer engine was used to buffer the table data ingestion for 10 minutes. Consequently, the buffer's maximum number of rows is, on average, 4,200,000.
The initial table remains at most 10 minutes behind, as the buffer keeps the most recently ingested rows. The table finally merges, and the behaviour is the same as when the table stops being populated for a few minutes.
But the Buffer table, which corresponds to the combination of the buffer and the initial table, is getting severely slower.
From the above it appears that, if the table is continuously populated, it does not merge, and indexing suffers. Is there a way to avoid this weakness?
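For reference, a Buffer table like the one described above would be defined roughly as follows; the database/table names and most thresholds are illustrative, with only the 600-second maximum flush time and 4,200,000-row cap taken from the description:

```sql
-- Illustrative Buffer table in front of the MergeTree table
CREATE TABLE db.events_buffer AS db.events
ENGINE = Buffer(db, events,
                16,                    -- num_layers
                10, 600,               -- min_time, max_time (seconds)
                10000, 4200000,        -- min_rows, max_rows
                10000000, 100000000);  -- min_bytes, max_bytes
```

INSERTs then target db.events_buffer; SELECTs against it read both the buffer and the underlying table.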
The number of sub-folders in the table data directory is not a very representative value.
Indeed, each sub-folder contains a data part consisting of sorted (indexed) rows. If several data parts are merged into a new bigger one the new sub-folder appears.
However, source data parts are not removed instantly after the merge. There is a <merge_tree> setting, old_parts_lifetime, defining a delay after which the parts are removed; by default it is set to 8 minutes. There is also a cleanup_delay_period setting defining how often a background cleaner checks for and removes outdated parts; it is 30 seconds by default.
So, it is normal to have such a number of sub-folders for about 8 minutes and 30 seconds after ingestion starts. If that is unacceptable to you, you can change these settings.
It makes sense to check only the number of active parts in a table (i.e. parts which have not been merged into a bigger one). To do so, you could run the following query:
SELECT count() FROM system.parts WHERE database = 'db' AND table = 'table' AND active
Moreover, ClickHouse does such checks internally: if the number of active parts in a partition is greater than parts_to_delay_insert = 150, it will slow down INSERTs, and if it is greater than parts_to_throw_insert = 300, it will abort insertions.
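The thresholds mentioned above can be checked on a running server; for example:

```sql
-- Current values of the relevant MergeTree settings
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('old_parts_lifetime',
               'cleanup_delay_period',
               'parts_to_delay_insert',
               'parts_to_throw_insert');
```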
I have been comparing MySQL with MonetDB. Obviously, queries that took minutes in MySQL executed in a matter of seconds in MonetDB.
However, I hit a real blocker with joins.
I have 2 tables, each with 150 columns. Among these (150+150) columns, around 60 are of CHARACTER LARGE OBJECT type. Both tables are populated with around 50,000 rows, with data in all 150 columns. The average length of data in a CLOB column is 9,000 characters (varying from 2 to 20,000). The primary keys of both tables have the same values, and the join is always based on the primary key. The rows are inserted in ascending order of the primary key by default.
When I ran an inner join query on these two tables with about 5 criteria and a limit of 1,000, MonetDB processed the query in 5 seconds, which is completely impressive compared to MySQL's 19 seconds.
But when I ran the same query with the same criteria and limit using a left or right join, MonetDB took around 5 minutes, which is clearly way behind MySQL's 22 seconds.
I went through the logs using the TRACE statement, but the traces of the inner and left joins are more or less the same, except that the time for each action is far higher in the left join trace.
Also, the time taken for the same join query varies by 2 or 3 seconds when run at different times.
Having read a lot about MonetDB's speed compared to traditional row-based relational DBs, I feel I am missing something, but I couldn't figure out what.
Can anyone please tell me why there is such a huge difference in query execution time and how I can prevent it?
I'd be grateful for any help. Thanks a lot in advance.
P.S.: I am running MonetDB on a MacBook Pro with a 2.3 GHz quad-core Core i7 and 8 GB of RAM.
I'm trying to load about 10K records from 6 different tables in my UltraLite DB.
I have created a different function for each of the 6 tables.
I have tried to load these in parallel using NSInvocationOperation, NSOperation, GCD, and subclassing NSOperation, but nothing is working out.
Loading 10K records from one table takes 4 seconds, and from another 5 seconds; if I put both in a queue, it takes 9 seconds. This means my code is not running in parallel.
How can I improve this performance?
There may be multiple ways of doing it.
What I suggest:
Set the number of rows for the table view to the exact count (10K in your case).
The table view is optimised to create only a few cells at start (it follows a pull model), so cellForRowAtIndexPath will be called only a few times at start.
Have an array and fetch only 50 entries at start. Keep a counter variable.
When the user scrolls the table view and the count goes past 50, fetch the next 50 items (it will take very little time) and populate the cells with the next 50 records.
Keep doing the same thing.
Hope it works.
You should fetch records in chunks (i.e. fetch 50-60 records at a time). Then, when the user reaches the end of the table, load another 50-60 records. Try this library: Bottom Pull to refresh more data in a UITableView.
Regarding parallelism, go with GCD, and reload the respective table when GCD's completion block is called.
OK, you have to use the Para and Time functions; look them up online for more info.
I'm trying to enhance the performance of a SQL Server 2000 job. Here's the scenario:
Table A has a maximum of 300,000 rows. If I update or delete the 100th row (based on insertion time), all rows added after it should update their values: row 101 should update its value based on row 100, and row 102 based on row 101's updated value. E.g.:
Old Table:
ID...........Value
100..........220
101..........(220/2) = 110
102..........(110/2) = 55
......................
Row No. 100 updated with new value: 300.
New Table
ID...........Value
100..........300
101..........(300/2) = 150
102..........(150/2) = 75
......................
The actual value calculation is more complex; the formula above is simplified.
Right now, a trigger is defined for UPDATE/DELETE statements. When a row is updated or deleted, the trigger adds the row's data to a log table. Also, a SQL job is created from the application code after the update/delete, which fires a stored procedure that iterates through all the following rows of table A and updates their values. The process takes ~10 days for 300,000 rows.
When the SP fires, it updates the next rows' values. I think this causes the trigger to run again for each update the SP makes, adding those rows to the log table too. Also, the task should be done on the DB side, as requested by the customer.
To solve the problem:
Modify the stored procedure and call it directly from the trigger. The stored procedure then drops the trigger, updates the following rows' values, and re-creates the trigger.
There will be multiple instances of the program running simultaneously. If another user modifies a row while the SP is being executed, the system will not fire the trigger and I'll be in trouble! Is there any workaround for this?
What's your opinion of this solution? Is there a better way to achieve this?
Thank you.
First, about the update process. As I understand it, your procedure is simply calling itself when it comes to updating the next row. With 300K rows this is certainly not going to be very fast, even without logging (though it would most probably take far fewer days to accomplish). But what is absolutely beyond me is how it is possible to update more than 32 rows that way without reaching the maximum nesting level. Maybe I've got the sequence of actions wrong.
Anyway, I would probably do that differently, with just one instruction:
UPDATE yourtable
SET @value = Value = CASE ID
                       WHEN @id THEN @value
                       ELSE @value / 2 /* basically, your formula */
                     END
WHERE ID >= @id
OPTION (MAXDOP 1);
The OPTION (MAXDOP 1) bit of the statement limits the degree of parallelism for the statement to 1, thus making sure the rows are updated sequentially and every value is based on the previous one, i.e. on the value from the row with the preceding ID value. Also, the ID column should be made a clustered index, which it typically is by default, when it's made the primary key.
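If ID is not already the clustered primary key, the clustered index can be created explicitly; the table and index names below follow the hypothetical "yourtable" naming used in this answer:

```sql
-- Only needed if ID is not already the clustered primary key
CREATE UNIQUE CLUSTERED INDEX IX_yourtable_ID ON yourtable (ID);
```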
The other functionality of the update procedure, i.e. dropping and recreating the trigger, should probably be replaced by disabling and re-enabling it:
ALTER TABLE yourtable DISABLE TRIGGER yourtabletrigger
/* the update part */
ALTER TABLE yourtable ENABLE TRIGGER yourtabletrigger
But then, you are saying the trigger shouldn't actually be dropped/disabled, because several users might update the table at the same time.
All right then, we are not touching the trigger.
Instead, I would suggest adding a special column to the table, one the users shouldn't be aware of, or at least shouldn't care about, and which they should somehow be prevented from ever touching. That column should only be updated by your 'cascading update' process. By checking whether that column is being updated you would know whether you should call the update procedure and the logging.
So, in your trigger there could be something like this:
IF NOT UPDATE(SpecialColumn) BEGIN
  /* assuming that without SpecialColumn only one row can be updated */
  SELECT TOP 1 @id = ID, @value = Value FROM inserted;
  EXEC UpdateProc @id, @value;
  EXEC LogProc ...;
END
In UpdateProc:
UPDATE yourtable
SET @value = Value = @value / 2,
    SpecialColumn = SpecialColumn /* basically, just anything, since it can
                                     only be updated by this procedure */
WHERE ID > @id
OPTION (MAXDOP 1);
You may have noticed that the UPDATE statement is slightly different this time. I understand your trigger is FOR UPDATE (= AFTER UPDATE), which means the @id row is already going to be updated by the user. So the procedure should skip it and start from the very next row, and the update expression can now be just the formula.
In conclusion, I'd like to say that my test update involved 299,995 of my table's 300,000 rows and took approximately 3 seconds on my not-so-fast system. No logging, of course, but I think that should give you a basic picture of how fast it can be.
Big theoretical problem here. It is always extremely suspicious when updating one row REQUIRES updating 299,900 other rows. It suggests a deep flaw in the data model. Not that it is never appropriate, just that it is required far less often than people think. When things like this are absolutely necessary, they are usually done as batch operations.
The best you can hope for, in some miraculous situation, is to turn those 10 days into 10 minutes, but never into 10 seconds. I would suggest explaining thoroughly WHY this seems necessary, so that another approach can be explored.