MonetDB - left/right joins far slower than inner join

I have been comparing MySQL with MonetDB. As expected, queries that took minutes in MySQL were executed in a matter of seconds in MonetDB.
However, I found a real bottleneck with joins.
I have 2 tables, each with 150 columns. Among these (150+150) columns, around 60 are of type CHARACTER LARGE OBJECT. Both tables are populated with around 50,000 rows, with data in all 150 columns. The average length of the data in a CLOB column is 9,000 characters (varying from 2 to 20,000 characters). The primary keys of both tables hold the same values, and the join is always on the primary key. Rows are inserted in ascending primary-key order by default.
When I ran an inner join query on these two tables with about 5 criteria and a limit of 1,000 rows, MonetDB processed it in 5 seconds, which is impressive compared to MySQL's 19 seconds.
But when I ran the same query with the same criteria and limit as a left or right join, MonetDB took around 5 minutes, which is clearly way behind MySQL's 22 seconds.
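For concreteness, the two query shapes looked roughly like this (the table, column, and criteria names below are placeholders, not my real schema):

SELECT t1.pk, t1.col_a, t2.col_b
FROM table1 t1
INNER JOIN table2 t2 ON t1.pk = t2.pk
WHERE t1.col_a = 'x' AND t2.col_b > 100   -- about 5 such criteria
LIMIT 1000;

-- identical criteria and limit; only the join type changes:
SELECT t1.pk, t1.col_a, t2.col_b
FROM table1 t1
LEFT JOIN table2 t2 ON t1.pk = t2.pk
WHERE t1.col_a = 'x' AND t2.col_b > 100
LIMIT 1000;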
I went through the logs using the TRACE statement, but the traces of the inner and left joins are more or less the same, except that the time for each action is far higher in the left join trace.
Also, the time taken for the same join query varies by 2 or 3 seconds when run at different times.
Having read a lot about MonetDB's speed compared to traditional row-based relational databases, I feel that I must be missing something, but I couldn't figure out what.
Can anyone please tell me why there is such a huge difference in execution time between these queries, and how I can prevent it?
I would be grateful for any help. Thanks a lot in advance.
P.S.: I am running MonetDB on a MacBook Pro - 2.3 GHz quad-core Intel Core i7 - with 8 GB RAM.

Related

Slow query with 22 million points

I have 1 TB of text data.
I installed InfluxDB on a machine with 240 GB RAM and 32 CPUs.
I have only inserted around 22 million points into one measurement, with one tag and 110 fields.
When I run a query (select id from ts limit 1), it takes more than 20 seconds, which is not good.
Can you please help me with what I should do to get good performance?
How many series do you have?
Maybe your problem comes from here:
https://docs.influxdata.com/influxdb/v1.2/concepts/schema_and_data_layout/#don-t-have-too-many-series
Tags containing highly variable information like UUIDs, hashes, and random strings will lead to a large number of series in the database, known colloquially as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads
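One way to verify this, assuming you can run InfluxQL against your instance (SHOW SERIES CARDINALITY only exists on newer 1.x releases; on 1.2 you can count the output of SHOW SERIES instead):

-- newer 1.x releases can report the count directly:
SHOW SERIES CARDINALITY
-- on 1.2, list the series and count them yourself:
SHOW SERIES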

BigQuery taking too much time on simple LEFT JOIN

So I'm doing a really basic LEFT JOIN, joining the different identifiers of my database, described below:
SELECT
main_id,
DT.table_1.mid_id AS mid_id,
final_id
FROM DT.table_1
LEFT JOIN DT.table_2 ON DT.table_1.mid_id = DT.table_2.mid_id
Table 1 is composed of four columns, main_id, mid_id, firstSeen and lastSeen.
There are 17,014,676 rows, for 519 MB of data. Each row holds a unique (main_id, mid_id) couple, but a given main_id or mid_id on its own can appear multiple times in the table.
Table 2 is composed of four columns, mid_id, final_id, firstSeen and lastSeen.
There are 66,779,079 rows, for 3.86 GB of data. In the same way, each row holds a unique (mid_id, final_id) couple, but a given mid_id or final_id on its own can appear multiple times in the table.
BigQuery reports using only 3.11 GB for the query itself.
main_id and mid_id are integers; final_id is a string.
The query result was too big for BigQuery to return directly, so I had to create a "result" table containing the main, mid and final ids with the exact types written above. The "Allow Large Results" option had to be selected, or an error was thrown.
My problem is that this simple query has already been running for an hour and still isn't finished! I read that the good practice would have been to do a RIGHT JOIN so that the first table in the join is the biggest (see the sketch below), but still, an hour is awfully long, even for that case!
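The rewrite I read about would look something like this (a sketch; it should produce the same logical result as my LEFT JOIN, just with the larger table listed first):

SELECT
main_id,
DT.table_1.mid_id AS mid_id,
final_id
FROM DT.table_2
RIGHT JOIN DT.table_1 ON DT.table_1.mid_id = DT.table_2.mid_id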
Do you, kind people of Stack Overflow, have an explanation?
Thank you in advance!

Performance issues when retrieving last value

I have a measurement that keeps track of sensor readings for a bunch of machines.
There are something of the order of 50 different readings per machine, and there are up to 1000 machines. We have one reading every 30 seconds.
I store the readings in a single measurement which has 2 tags, machine_id and analysis_id, and a single value field.
One of the use cases I have is to retrieve the current value for each reading for a list of machines.
When this database gets to around 100 million records, which with those numbers means less than one day of data, I can no longer retrieve the last values with a query, as it takes too long.
I tried the two following alternatives:
SELECT *
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
ORDER BY time DESC
LIMIT 1
and:
SELECT last(*) AS value
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
Both of them take a pretty long time to complete. At 100 million records it's something of the order of 1 second.
The use case of retrieving the latest values is a very frequent one. I need to be able to get the "current" state of machines almost instantly.
I can work around it in the app logic by keeping track of the latest value in a separate place, but I was wondering what I could do with InfluxDB alone.
I was facing something similar and I worked around it by creating a continuous query.
https://docs.influxdata.com/influxdb/v0.8/api/continuous_queries/
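As a rough sketch of what that can look like with the fanout syntax from those 0.8 docs, using the names from this question (the exact syntax depends on your InfluxDB version):

-- continuously fan the measurement out into one series per machine:
select * from analysisvalue into analysisvalue.[entity_id]
-- the latest readings for machine 1 then become a cheap per-series lookup:
select last(value) from analysisvalue.1 group by analysis_id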

InfluxDB performance

For my use case, I need to capture 15 performance metrics per device and save them to InfluxDB. Each device has a unique device id.
Metrics are written into InfluxDB in the following way. Here I only show one as an example
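// build a single point for the "perfmetric1" series: column names first, then the matching values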
new Serie.Builder("perfmetric1")
.columns("time", "value", "id", "type")
.values(getTime(), getPerf1(), getId(), getType())
.build()
Writing data is fast and easy, but I saw bad performance when running queries. I'm trying to get all 15 metric values for the last hour.
select value from perfmetric1, perfmetric2, ..., perfmetric15
where id='testdeviceid' and time > now() - 1h
For an hour, each metric has 120 data points; in total that's 1,800 data points. The query takes about 5 seconds on a c4.4xlarge EC2 instance when it's otherwise idle.
I believe InfluxDB can do better. Is this a problem with my schema design, or is it something else? Would splitting the query into 15 parallel calls be faster?
As @valentin's answer says, you need to build an index on the id column for InfluxDB to perform these queries efficiently.
In 0.8 stable you can do this "indexing" using continuous fanout queries. For example, the following continuous query will expand your perfmetric1 series into multiple series of the form perfmetric1.id:
select * from perfmetric1 into perfmetric1.[id];
Later you would do:
select value from perfmetric1.testdeviceid, perfmetric2.testdeviceid, ..., perfmetric15.testdeviceid where time > now() - 1h
This query will take much less time to complete since InfluxDB won't have to perform a full scan of the timeseries to get the points for each testdeviceid.
Build an index on the id column. It seems that the engine does a full scan of the table to retrieve the data. If you split your query into 15 threads, the engine will do 15 full scans and performance will be much worse.

Speed of ALTER TABLE ADD COLUMN in Sqlite3?

I have an iOS app that uses sqlite3 databases extensively. I need to add a column to a good portion of those tables. None of the tables are what I'd really consider large (I'm used to dealing with many millions of rows in MySQL tables), but given the hardware constraints of iOS devices I want to make sure it won't be a problem. The largest tables would be a few hundred thousand rows. Most of them would be a few hundred to a few thousand or tens of thousands.
I noticed that sqlite3 can only add columns to the end of a table. I'm assuming that's for some type of speed optimization, though possibly it's just a constraint of the database file format.
What is the time cost of adding a column to an sqlite3 table?
Does it simply update the schema and not change the table data?
Does the time increase with number of rows or number of columns already in the table?
I know the obvious answer to this is "just test" and I'll be doing that soon, but I couldn't find an answer on StackOverflow after a few minutes of searching so I figured I'd ask so others can find this information easier in the future.
From the SQLite ALTER TABLE documentation:
The execution time of the ALTER TABLE command is independent of the amount of data in the table. The ALTER TABLE command runs as quickly on a table with 10 million rows as it does on a table with 1 row.
The documentation implies the operation is O(1). It should run in negligible time.
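As a quick illustration, a minimal example (the table and column names here are hypothetical):

ALTER TABLE readings ADD COLUMN note TEXT; -- schema-only change; existing rows are not rewritten

Since only the schema entry is updated, the cost should not grow with the number of rows already in the table.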
