SELECT query performance impact when a ClickHouse table is continuously populated with INSERT INTO - Buffer

The ClickHouse table (MergeTree engine) is continuously populated with “INSERT INTO … FORMAT CSV” queries, starting from empty. The average input rate is 7000 rows per second, and insertion happens in batches of a few thousand rows. This has a severe performance impact when SELECT queries are executed concurrently. As described in the ClickHouse documentation, the system needs at most 10 minutes to merge the data of a specific table (re-index). But this does not happen while the table is continuously populated.
This is also evident in the file system: the table folder has thousands of sub-folders and the index is over-segmented. If the data ingestion stops, the table is fully merged within a few minutes and the number of sub-folders drops to about a dozen.
To counter the above weakness, the Buffer engine was used to buffer the table data ingestion for 10 minutes. Consequently, the buffer's maximum number of rows is on average 4,200,000 (7000 rows/s × 600 s).
The underlying table stays at most 10 minutes behind, since the buffer holds the most recently ingested rows. The table eventually merges, and its behaviour is the same as in the case where ingestion has stopped for a few minutes.
But querying the Buffer table, which combines the buffer contents with the underlying table, becomes severely slower.
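For reference, a Buffer table of this kind is declared roughly as follows. This is only a sketch: the database, table and threshold values are illustrative, following the engine signature Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes).
CREATE TABLE db.data_buffer AS db.data
ENGINE = Buffer(db, data, 16, 10, 600, 10000, 4200000, 10000000, 100000000)
INSERTs go to db.data_buffer and are flushed to db.data when a maximum threshold is hit (here 600 seconds, 4,200,000 rows or roughly 100 MB); SELECTs against db.data_buffer read the buffer contents combined with the underlying table.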
From the above it appears that, as long as the table is continuously populated, it does not merge and indexing suffers. Is there a way to avoid this weakness?

The number of sub-folders in the table data directory is not a very representative value.
Indeed, each sub-folder contains a data part consisting of sorted (indexed) rows. Whenever several data parts are merged into a new, bigger one, a new sub-folder appears.
However, the source data parts are not removed instantly after the merge. There is a <merge_tree> setting old_parts_lifetime defining a delay after which the parts are removed; by default it is set to 8 minutes. There is also a cleanup_delay_period setting defining how often a background cleaner checks for and removes outdated parts; it is 30 seconds by default.
So it is normal to have that many sub-folders for about 8 minutes and 30 seconds after ingestion starts. If that is unacceptable to you, you can change these settings.
It makes sense to check only the number of active parts in a table (i.e. parts which have not been merged into a bigger one). To do so, you could run the following query:
SELECT count() FROM system.parts WHERE database = 'db' AND table = 'table' AND active
Moreover, ClickHouse does such checks internally: if the number of active parts in a partition exceeds parts_to_delay_insert = 150, it will slow down INSERTs, and if it exceeds parts_to_throw_insert = 300 it will abort insertions.
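If you do decide to tune these thresholds, the MergeTree settings above can be changed per table. On reasonably recent ClickHouse versions something along these lines should work (the values here are purely illustrative):
ALTER TABLE db.table MODIFY SETTING old_parts_lifetime = 120
ALTER TABLE db.table MODIFY SETTING parts_to_delay_insert = 300
ALTER TABLE db.table MODIFY SETTING parts_to_throw_insert = 600
On older versions the same settings can instead be set in the <merge_tree> section of the server configuration.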

Related

Performance issues when retrieving last value

I have a measurement that keeps track of sensor readings for a bunch of machines.
There are on the order of 50 different readings per machine, and up to 1000 machines. We get one reading every 30 seconds.
The readings are stored in a single measurement with two tags, machine_id and analysis_id, and a single value.
One of the use cases I have is to retrieve the current value for each reading for a list of machines.
When this database gets to around 100 million records, which with those numbers means less than one day of data, I can no longer retrieve the last values with a query because it takes too long.
I tried the two following alternatives:
SELECT *
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
ORDER BY time DESC
LIMIT 1
and:
SELECT last(*) AS value
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
Both of them take a pretty long time to complete; at 100 million records it is on the order of 1 second.
The use case of retrieving the latest values is a very frequent one. I need to be able to get the "current" state of machines almost instantly.
I can work that out on the side of the app logic, by keeping track of the latest value in a separate place, but I was wondering what I could do with InfluxDB alone.
I was facing something similar and I worked around it by creating a continuous query.
https://docs.influxdata.com/influxdb/v0.8/api/continuous_queries/
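Not the exact query, but a sketch in the spirit of those docs (series names, column names and the interval are illustrative): the continuous query keeps rolling the latest value into a much smaller series, and the application then reads the "current" state from that series instead of scanning the raw data.
select last(value) from analysisvalue group by time(10m) into analysisvalue.current
select last(value) from analysisvalue.current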

Periodic snapshot fact table with large dimensions

I have been asked to model a star schema.
I have 3 dimensions:
Date (day, month, year, week, quarter, ...)
Place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study time on dimensional modeling, I think I should implement a periodic snapshot table. However, reading through the Kimball docs, I noticed that a periodic snapshot demands an entry for every combination of the dimensions. This means I would have to add 40M rows every day (80k × 500).
Knowing that the products are (really) slow movers and that many places store zero products for long periods, this sounds like extreme overkill.
FYI the transactions in the source DB are 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also if for whatever reason one day all stores are empty, should I make an entry for that day (with dimensions N/A for store and product)?
You modeled it correctly. It depends on the specifications, but normally you store only the products that are present in a location (you do not store zeroes), which could yield a number substantially lower than the 80k maximum.
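As a hedged sketch (all table and column names here are hypothetical), the daily load would then insert only the combinations that actually have stock:
INSERT INTO fact_daily_stock (date_key, place_key, product_key, qty_on_hand)
SELECT d.date_key, s.place_key, s.product_key, s.qty_on_hand
FROM current_stock AS s
JOIN dim_date AS d ON d.full_date = CURRENT_DATE
WHERE s.qty_on_hand > 0
With this approach, a day on which every store is empty simply produces no fact rows for that date.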
If you want to reduce the numbers further, you could store only the last N days and then start moving data to a "cold" table: keep, say, the last 10 daily snapshots and then only monthly snapshots in the main "hot" fact table.
Do not exclude the possibility of calculating the snapshot on the fly in the reporting system; depending on your environment this can be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (e.g. only the last month calculated on the fly).

iOS and Mysql Events

I'm working on an app that connects to a MySQL backend. It's a little similar to Snapchat in that once the current user gets the pics from the users they follow and views them, they can never see those pics again. However, I can't just delete the pics from the database, because the user who uploaded a pic still needs to see it. So I've come up with an interesting design and I want to know whether it's good or not.
When uploading the pic I would also create a MySQL event that would run exactly one day after the pic was uploaded and delete it. If people are uploading pics all the time, events would be created all the time. How does this affect the MySQL database? Is this even scalable?
No, not scalable: deleting single records is quick, but as your volume increases you will run into trouble. You do, however, have a classic case for partitioning:
CREATE TABLE your_images (insert_date DATE, some_image BLOB, some_owner INT)
ENGINE=InnoDB /* row_format=compressed key_block_size=4 */
PARTITION BY RANGE COLUMNS (insert_date) (
PARTITION p01 VALUES LESS THAN ('2015-07-02'),
PARTITION p02 VALUES LESS THAN ('2015-07-03'),
PARTITION p0x VALUES LESS THAN (ETC),
PARTITION p0n VALUES LESS THAN (MAXVALUE));
You can then insert just as you are used to, drop partitions once per day (using one event for all your data), and create new partitions, also once per day (in the same event that drops the old ones).
To make certain a photo lives for at least 24 hours, the partition cleanup has to happen with a one-day delay (so clean up the day before yesterday, not yesterday itself).
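A rough sketch of such a daily event (partition names and dates are illustrative; in practice you would build them dynamically with prepared statements inside the event body):
DELIMITER $$
CREATE EVENT rotate_image_partitions
ON SCHEDULE EVERY 1 DAY STARTS '2015-07-04 00:05:00'
DO
BEGIN
  -- drop the oldest partition (holding the day before yesterday and earlier)
  ALTER TABLE your_images DROP PARTITION p01;
  -- split the MAXVALUE partition so tomorrow's rows get a partition of their own
  ALTER TABLE your_images REORGANIZE PARTITION p0n INTO (
    PARTITION p03 VALUES LESS THAN ('2015-07-05'),
    PARTITION p0n VALUES LESS THAN (MAXVALUE));
END $$
DELIMITER ;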
A date filter in the query that fetches the images from the database is still needed, to prevent images older than a day from being displayed.

Resetting next sequence ID on SERIAL column

After committing a row with a SERIAL (autonumber) column, the row is deleted, but when another row is added, the deleted row's sequence ID is not reused.
The only way I have found for reusing the deleted row's sequence ID is to ALTER the SERIAL column to an INTEGER, then change it back to a SERIAL.
Is there an easier, quicker way to reset the next sequence ID so that there are no gaps in the sequence?
NOTE: This is a single-user application, so no worries about multiple users simultaneously inserting rows.
There isn't a particularly easy way to do that. You can reset the number by inserting … hmmm, once upon a time, long ago, there were bugs in this, and you're using ancient enough versions of the software that on occasion the bug might still be relevant, though no current version of the Informix products has it.
The safe technique is to insert 2^31 - 2 (note the minus 2; that's +2,147,483,646), then insert a row with 0 (to generate +2,147,483,647), then insert another row with 0 to trip the next sequence number back to 1. That insert will fail if there's already a row with 1 in the table and you have a unique constraint on the SERIAL column. You then need to insert the maximum value, or the value just before the first gap that you want to fill (another failing insert). However, note that after filling the gap, the inserted values increase by one, eventually bumping into still-existing rows and causing insertion failures (because you do have a unique constraint/index on each SERIAL column, don't you? And don't 'fess up if you do not have such indexes; just go and add them!).
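A hedged sketch of that sequence, assuming a hypothetical table t with a SERIAL column id (with a unique index) and a single data column val:
INSERT INTO t (id, val) VALUES (2147483646, 'max-2')  -- 2^31 - 2
INSERT INTO t (id, val) VALUES (0, 'max')             -- generates +2,147,483,647
INSERT INTO t (id, val) VALUES (0, 'wrap')            -- next SERIAL wraps back to 1 (fails if 1 already exists)
DELETE FROM t WHERE id >= 2147483646                  -- remove the two placeholder rows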
If you have a more recent version of Informix, you can insert +2,147,483,647 and then a single row to wrap the value without running into trouble. If you have an old version of Informix with the bug, inserting +2,147,483,647 directly caused problems. IIRC, the trouble was that you ended up with NULLs being generated, but it was long enough ago (another millennium) that I'm no longer absolutely sure of that.
None of this is really easy, in case you hadn't noticed.
Generally speaking, it is unwise to fill the gaps; you're better off leaving them and not worrying about them, or inserting some sort of dummy record that says "this isn't really a record — but the serial value is otherwise missing so I'm here to show that we know about it and it isn't missing but isn't really used either".

Load Large Data from multiple tables in parallel using multithreading

I'm trying to load about 10K records from 6 different tables in my UltraLite DB.
I have created separate functions for the 6 different tables.
I have tried to load them in parallel using NSInvocationOperation, NSOperation, GCD, and subclassing NSOperation, but nothing is working out.
Loading 10K rows from one table takes 4 seconds and from another 5 seconds; if I put the two in a queue it takes 9 seconds, which means my code is not running in parallel.
How can I fix this performance problem?
There may be multiple ways of doing it.
What I suggest is:
Set the number of rows for the table view to the exact count (10K in your case).
A table view is optimised to create only a few cells at the start (it follows a pull model), so cellForRowAtIndexPath: is called only a few times initially.
Keep an array and fetch only the first 50 entries, together with a counter variable.
When the user scrolls the table view and the counter goes past 50, fetch the next 50 items (this takes very little time) and populate the cells with that data.
Keep doing the same thing.
Hope it works.
You should fetch records in chunks (i.e. fetch 50-60 records at a time into the table), and when the user reaches the end of the table, load another 50-60 records. Try this library: Bottom Pull to refresh more data in a UITableView.
Regarding parallelism, go with GCD and reload the respective table when the GCD block completes.
OK, you have to use Para and Time functions; look them up online for more info.
