Pentaho "Make the transformation database transactional" plus commit frequency - connection

If I understand it right, enabling the "Make the transformation database transactional" property means a single commit is done at the end of the transformation (or a rollback if there is an error or an abort).
However, the Commit size is still available on the Table output step, for example.
Is the Commit size value ignored in these cases? How does the Commit size work in combination with Make the transformation database transactional? (Will there be a single commit or multiple commits?)

I'm pretty sure the end result will be the same.
The execution will still perform the batch commits, but if any of them fails, the entire execution will not be committed.
I cannot attest to the performance EXACTLY, but I can attest to the end result: checking 'Make the transformation database transactional' will effectively do what you want.

Related

How to ensure insert rate 1 insert per second when using ClickhouseIO

I'm using Apache Beam Java SDK to process events and write them to the Clickhouse Database.
Luckily there is a ready-to-use ClickhouseIO.
ClickhouseIO accumulates elements and inserts them in batches, but because of the parallel nature of the pipeline it still results in a lot of inserts per second in my case. I'm frequently receiving "DB::Exception: Too many parts" or "DB::Exception: Too much simultaneous queries" in Clickhouse.
Clickhouse documentation recommends doing 1 insert per second.
Is there a way I can ensure this with ClickhouseIO?
Maybe some KV grouping before ClickhouseIO.Write or something?
It looks like you're interpreting these errors not quite correctly:
DB::Exception: Too many parts
It means that the insert affects more partitions than allowed (by default this value is 100; it is managed by the max_partitions_per_insert_block parameter).
So either the number of affected partitions really is large, or the PARTITION BY key was defined too granularly.
How to fix it:
try to group the INSERT batch in such a way that it contains data related to fewer than 100 partitions
try to reduce the size of the insert block (if it is quite huge) - withMaxInsertBlockSize
increase the max_partitions_per_insert_block limit in the SQL query (like this: INSERT .. SETTINGS max_partitions_per_insert_block=300 - I think ClickhouseIO should have the ability to set custom options at the query level) or on the server side by modifying the user-profile settings
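For example, a minimal sketch of the query-level override (the events table and its columns are hypothetical stand-ins for your schema):
-- raises the partition limit for this single statement only
INSERT INTO events (event_date, user_id, payload)
SETTINGS max_partitions_per_insert_block = 300
VALUES ('2021-06-01', 42, '...');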
DB::Exception: Too much simultaneous queries
This one is managed by the max_concurrent_queries parameter.
How to fix it:
reduce the number of concurrent queries by Beam means
increase the limit on the server side in the user-profile or server settings (see https://github.com/ClickHouse/ClickHouse/issues/7765)
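If you have direct SQL access to the server, one way to see how close you are to that limit is to count the queries currently executing (system.processes holds one row per running query):
SELECT count() FROM system.processes;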

Join Vs Reduce In Batch Processing

What are the key differences between Join and Reduce in terms of batch processing?
The join will wait until all the tasks it needs to merge are completed, but the reduce won't wait.
However, in contrast to the join pattern described in the diagram above, the goal of reduce is not to wait until all data has been processed, but rather to optimistically merge together all of the parallel data items into a single comprehensive representation of the full set. This is a fortunate contrast to the join pattern because, unlike join, it means that reduce can be started in parallel while there is still processing going on as part of the map/shard phase. Of course, in order to produce a complete output, all of the data must be processed eventually, but the ability to begin early means that the batch computation executes more quickly overall.

Trying to hack my way around SQLite3 concurrent writing, any better way to do this?

I use Delphi XE2 along with DISQLite v3 (which is basically a port of SQLite3). I love everything about SQLite3, except the lack of concurrent writing, especially since I rely extensively on multi-threading in this project :(
My profiler made it clear I needed to do something about it, so I decided to use this approach:
Whenever I need to insert a record into the DB, instead of doing an INSERT, I write the SQL query to a file in a special folder, e.g.:
WriteToFile_Inline(SPECIAL_FOLDER_PATH + '\' + GUID, FileName + '|' + IntToStr(ID) + '|' + Hash + '|' + FloatToStr(ModifDate) + '|' + ...);
I added a timer (in the main app thread) that fires every minute, parses these files, and then runs the INSERTs inside a transaction.
The temporary files are deleted at the end.
The result is roughly a 500% performance gain. Plus, this technique is ACID, as I can always scan the SPECIAL_FOLDER_PATH after a power failure and execute the INSERTs I find.
Despite the good results, I'm not very happy with the method used (hackish, to say the least). I keep thinking that if I had a generics-like, thread-safe, ACID list with fast lookup access, this would be much cleaner (and possibly faster?).
So my question is: do you know anything like that for Delphi XE2?
PS. I suspect many of you reading the code above will be in shock and start insulting me at this point! Please be my guest, but if you know a better (i.e. faster) ACID approach, please share your thoughts!
Your idea of sending the inserts to a queue, which will rearrange the inserts and join them via prepared statements, is very good. Using a timer in the main thread or a separate thread is up to you. It will avoid any locking.
Do not forget to use a transaction, then commit it every 100/1000 inserts, for instance.
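As a minimal SQL sketch of that pattern (the files table and its columns are hypothetical):
BEGIN TRANSACTION;
INSERT INTO files (id, name, hash) VALUES (:id, :name, :hash);
-- ... re-execute the prepared INSERT for up to 100/1000 queued records ...
COMMIT;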
About high performance using SQLite3, see e.g. this blog article (and graphic below):
In this graphic, best performance (file off) comes from:
PRAGMA synchronous = OFF
Using prepared statements
Inside a transaction
In WAL mode (especially with concurrent access)
You may also change the page size or the journal size, but the settings above are the best. See https://stackoverflow.com/search?q=sqlite3+performance
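For reference, a sketch of those settings as plain SQLite statements (note that synchronous = OFF trades durability on power loss for speed):
PRAGMA journal_mode = WAL;           -- write-ahead log, better concurrency
PRAGMA synchronous = OFF;            -- fastest, but unsafe on power failure
PRAGMA page_size = 4096;             -- optional; only effective on a new or vacuumed DB
PRAGMA journal_size_limit = 1048576; -- optional; caps the journal/WAL size in bytes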
If you do not want to use a background thread, ensure WAL is ON, prepare your statements, use batches, and regroup your process to release the SQLite3 lock as soon as possible.
The best performance will be achieved by adding a Client-Server layer, just as we did for mORMot.
With files, you organized an asynchronous job queue with persistence. It allows you to avoid one-by-one inserts and use a batch (record-group) approach to insert the records. Comparing one-by-one and batch:
the first (probably) works in auto-commit mode for each record; the second wraps a batch into a single transaction, which gives the greatest performance gain.
the first (probably) prepares an INSERT command each time you need to insert a record; the second prepares it once per batch, which gives the second-largest gain.
I don't think SQLite concurrency is a problem in your case (at least not the main issue), because in SQLite a single insert is comparatively fast, and concurrency performance issues appear under high workload. You would probably get similar results with another DBMS, like Oracle.
To improve your batch approach, consider the following:
consider setting journal_mode to WAL and disabling shared cache mode.
use a background thread to process your queue. Instead of a fixed time interval (1 min), check SPECIAL_FOLDER_PATH more often, and if the queue has more than X KB of data, start processing. Or keep a count of queued records and use an event to notify the thread that the queue should start processing.
use a multi-record prepared INSERT instead of a single-record INSERT. You can build an INSERT for 100 records and process your queue data in 100-record chunks (see the sketch after this list).
consider writing/reading binary field values instead of text values.
consider using a set of files with a preallocated size.
etc
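The multi-record prepared INSERT from the third point could look like this (hypothetical columns; SQLite supports multi-row VALUES since 3.7.11):
-- prepared once, then executed for each 100-record chunk of the queue
INSERT INTO files (id, name, hash) VALUES
  (?, ?, ?),
  (?, ?, ?),
  -- ... 100 value tuples in total ...
  (?, ?, ?);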
sqlite3_busy_timeout is pretty inefficient because it doesn't return immediately when the table it's waiting on is unlocked.
I would try creating a critical section (TCriticalSection?) to protect each table. If you enter the critical section before inserting a row and exit it immediately thereafter, you will create better table locks than SQLite provides.
Without knowing your access patterns, though, it's hard to say if this will be faster than batching up a minute's worth of inserts into single transactions.

Determining the size of deltas in TFS

Our TFSVersionControl database has grown significantly in the past couple of years and is edging toward 80 GB. Unfortunately, we're in an environment where every gig of data storage is internally charged at a high rate, so there's lots of focus on keeping storage growth to a minimum.
I believe the majority of growth is happening because we chose to store binary files in our repository. This is something we will be remedying in the medium term.
In the short term, there are a few places where we do not need to keep a history of our binaries, particularly in our mainline and development branches, so we're looking into doing a TF Destroy on these binaries and recreating them as part of the upcoming release.
What I'd like to know is: Is there any way to run a query against the TFSVersionControl database to understand which files are storing deltas that are over a given size?
Ideally, what I'd like to know is for a given path (item spec), for each file, the base size, and the total size of the deltas.
I think this page may be what you're looking for.
Just like asking someone else will often drive you to find your own answer, I did some additional digging, and came up with this:
select ver.VersionFrom, ver.Command, ver.ChildItem, tf.*,
       ct.CreationDate, ct.OffsetFrom, ct.OffsetTo, DataLength(ct.Content) as Size
from tbl_version ver with (nolock)
inner join tbl_file tf with (nolock) on tf.FileId = ver.FileId
inner join tbl_content ct with (nolock) on ct.FileId = tf.FileId
where parentpath = '$\ProjectName\Branch\Folder\'
--where fullpath = '$\ProjectName\Branch\Folder\FileName.cs\'
order by ver.ChildItem, ver.VersionFrom
The query as written will iterate through all files in a particular path and retrieve one record per check-in. The calculated Size field shows the size in bytes of the delta. I'm not sure if this is the compressed size or the "actual" size.
The commented "where" clause will show you the same for an individual file; swap it in for the parentpath filter.
Note that the typical forward slashes ("/") are stored in the database as backslashes ("\"), and there is always a trailing backslash at the end.
If you pull this data into Excel, you can quickly create a pivot table on it to calculate the sizes (or you can add them up manually).
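If you would rather aggregate in SQL than in Excel, a variant of the same query (same schema assumptions, equally unofficial) can sum the delta sizes per file directly:
select ver.ChildItem, count(*) as Versions, sum(DataLength(ct.Content)) as TotalDeltaBytes
from tbl_version ver with (nolock)
inner join tbl_file tf with (nolock) on tf.FileId = ver.FileId
inner join tbl_content ct with (nolock) on ct.FileId = tf.FileId
where parentpath = '$\ProjectName\Branch\Folder\'
group by ver.ChildItem
order by TotalDeltaBytes desc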

Processing large recordsets in Rails

I'm trying to perform a daily operation on a larger-than-normal dataset (2m+ records). However, Rails seems to take a very long time performing operations on such a dataset. Operations like
Dataset.all.each do |data|
...
end
take a very long time to complete (I assume this is because it can't fit all the items into memory at once, right?).
Does anyone have any strategies on how I could handle this situation? I know SQL would probably speed up the process, but I'm looking to use the Rails environment as I can do many more complicated things to the data than I can with just SQL statements.
You want to use ActiveRecord's find_each for this; it loads and yields records in batches (1,000 by default) instead of instantiating the entire result set at once.
Dataset.find_each do |data|
...
end
When processing a large set of rows, a database is very fast and efficient; that is what databases were designed for. I would recommend attempting to do all this processing in SQL if you want maximum performance. If you prefer to use Rails, or it is impossible to do everything you want in SQL, you might attempt to do some pre-processing in SQL and the remainder in Rails. Short of that, 2m+ rows is a lot to loop over; even if each one takes only a fraction of a second, it adds up to a long time.
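As a sketch of that pre-processing idea (the datasets table and the processed flag are hypothetical), a single set-based statement can replace millions of per-row updates issued from Ruby:
-- one statement instead of 2m+ round-trips from the Rails loop
UPDATE datasets
SET processed = 1
WHERE created_at < '2021-01-01';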
