When working with QuestDB, are symbol columns good for performance when there are huge numbers of rows per symbol? - time-series

When working with regular SQL databases, indexes are useful for fetching a few rows, but not so useful when you are fetching a large amount of data from a table. For example, imagine you have a table with stock valuations of 10 stocks over time:
+------+--------+-------+
| time | stock  | value |
+------+--------+-------+
| ...  | stock1 | ...   |
| ...  | stock2 | ...   |
| ...  | ...    | ...   |
+------+--------+-------+
As far as I can tell, indexing it by stock (even with an enum/int/foreign key) is usually not very useful in a database like Postgres if you want to get data over a large period of time. You end up with an index spanning a large part of the table, and it ends up being faster for the database to do a sequential scan, for example, to get the average value over the whole dataset for each stock:
SELECT stock, avg(value) FROM stock_values GROUP BY stock
Given that QuestDB is column-oriented, I would guess that having a separate column for each stock would result in better performance.
So, what schema is recommended in QuestDB for a situation like this? One column per stock, or would a symbol column holding the stock symbol be as good (or good enough) even if there are millions of rows for each symbol?

A column per stock is not easy to achieve in QuestDB. If you create a table like this
+------+--------+--------+--------+
| time | stock1 | stock2 | stock3 |
+------+--------+--------+--------+
then you'll have to insert all values together in one row, or you end up with gaps
+------+--------+--------+--------+
| time | stock1 | stock2 | stock3 |
+------+--------+--------+--------+
| t1   | 1.1    |        |        |
| t2   |        | 3.45   |        |
| t3   |        |        | 103.45 |
+------+--------+--------+--------+
Even when t1 == t2 == t3, doing the insert as 3 separate operations will still result in 3 rows.
So symbols are a better choice here.
A symbol column can be indexed or not, and non-indexed symbols can perform better when the number of distinct values is low. Whether reading the full table beats reading by index is a matter of index selectivity, not data range: if selectivity is high (e.g. the distinct symbol count is, say, 10k), fetching by index is faster than range scans.
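For reference, a minimal sketch of the symbol-based schema (the table name matches the example above; the inline INDEX clause is optional and only worth adding when selectivity is high):
CREATE TABLE stock_values (
    time TIMESTAMP,
    stock SYMBOL,   -- use "stock SYMBOL INDEX" only when the distinct count is high
    value DOUBLE
) TIMESTAMP(time) PARTITION BY DAY;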

Related

With a composite index, what column order do ActiveRecord queries use to decide which composite index to search?

Rails v. 5.2.4
ActiveRecord v5.2.4.3
I have a Rails app with a MySQL database, and my app has a Skill model and a SkillAdjacency model. The SkillAdjacency model has the following attributes:
requested_skill_id, table_name: 'Skill'
adjacent_skill_id, table_name: 'Skill'
score, integer
SkillAdjacencies are used to determine how "similar" two instances of Skill are to each other.
One of the app's constraints is that you can't create more than one instance of SkillAdjacency for each combination of requested_skill and adjacent_skill, and I plan to enforce this both with ActiveModel validations and with a composite index which employs a uniqueness constraint. So far I have the following:
add_index :skill_adjacencies, [:requested_skill_id, :adjacent_skill_id], unique: true, name: 'index_adjacencies_on_requested_then_adjacent', using: :btree
However, I know that the order in which the composite columns are declared is important, so I'm considering adding this 2nd composite index to account for the other possible order:
add_index :skill_adjacencies, [:adjacent_skill_id, :requested_skill_id], unique: true, name: 'index_adjacencies_on_adjacent_then_requested', using: :btree
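In raw SQL terms, those two migrations correspond to roughly this DDL (a sketch of what Rails generates for MySQL):
CREATE UNIQUE INDEX index_adjacencies_on_requested_then_adjacent
    ON skill_adjacencies (requested_skill_id, adjacent_skill_id);
CREATE UNIQUE INDEX index_adjacencies_on_adjacent_then_requested
    ON skill_adjacencies (adjacent_skill_id, requested_skill_id);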
But because writing to an index isn't free, I only want to add the 2nd index if it will actually result in a performance benefit. The problem is that whether this 2nd index is beneficial depends on whether ActiveRecord starts with adjacent_skill_id or requested_skill_id when deciding which composite index to use.
How can I determine what order ActiveRecord uses? Does it just use the same order that's specified in the query? For example, if I query SkillAdjacency.where(requested_skill: Skill.last, adjacent_skill: Skill.first), will it always search for a composite index composed of requested_skill 1st and adjacent_skill 2nd? If that's the case, should I cover all my bases by creating that additional composite index?
Alternately, is there some under-the-hood magic which determines if the relevant composite index exists regardless of the order provided in the query?
EDIT:
I ran EXPLAIN and saw the following:
irb(main):013:0> SkillAdjacency.where(requested_skill_id: 1, adjacent_skill_id: 200).explain
SkillAdjacency Load (0.3ms) SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`requested_skill_id` = 1 AND `skill_adjacencies`.`adjacent_skill_id` = 200
=> EXPLAIN for: SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`requested_skill_id` = 1 AND `skill_adjacencies`.`adjacent_skill_id` = 200
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | skill_adjacencies | NULL | const | index_adjacencies_on_requested_then_adjacent,index_adjacencies_on_adjacent_then_requested,index_skill_adjacencies_on_requested_skill_id,index_skill_adjacencies_on_adjacent_skill_id | index_adjacencies_on_requested_then_adjacent | 10 | const,const | 1 | 100.0 | NULL |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
1 row in set (0.00 sec)
irb(main):014:0> SkillAdjacency.where(adjacent_skill_id: 200, requested_skill: 1).explain
SkillAdjacency Load (0.3ms) SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`adjacent_skill_id` = 200 AND `skill_adjacencies`.`requested_skill_id` = 1
=> EXPLAIN for: SELECT `skill_adjacencies`.* FROM `skill_adjacencies` WHERE `skill_adjacencies`.`adjacent_skill_id` = 200 AND `skill_adjacencies`.`requested_skill_id` = 1
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | skill_adjacencies | NULL | const | index_adjacencies_on_requested_then_adjacent,index_adjacencies_on_adjacent_then_requested,index_skill_adjacencies_on_requested_skill_id,index_skill_adjacencies_on_adjacent_skill_id | index_adjacencies_on_requested_then_adjacent | 10 | const,const | 1 | 100.0 | NULL |
+----+-------------+-------------------+------------+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+-------------+------+----------+-------+
1 row in set (0.00 sec)
In both cases, I see that the value in the key column is index_adjacencies_on_requested_then_adjacent, despite each query passing in a different order for the query params. Can I assume this means the order of those params doesn't matter?

What is the best way to attach a running total to selected row data?

I have a table that looks like this:
Created at    | Amount  | Register Name
--------------+---------+---------------
01/01/2019... | -150.01 | Front
01/01/2019... | 38.10   | Back
What is the best way to attach an ascending-by-date running total to each record, where the total applies only to records sharing that record's register name? I can do this in Ruby, but doing it in the database will be much faster, as this is a web application.
The application is a Rails application running Postgres 10, although the answer can be Rails-agnostic of course.
Use the aggregate sum() as a window function, e.g.:
with my_table (created_at, amount, register_name) as (
  values
    ('2019-01-01', -150.01, 'Front'),
    ('2019-01-01', 38.10, 'Back'),
    ('2019-01-02', -150.01, 'Front'),
    ('2019-01-02', 38.10, 'Back')
)
select
  created_at, amount, register_name,
  sum(amount) over (partition by register_name order by created_at)
from my_table
order by created_at, register_name;
created_at | amount | register_name | sum
------------+---------+---------------+---------
2019-01-01 | 38.10 | Back | 38.10
2019-01-01 | -150.01 | Front | -150.01
2019-01-02 | 38.10 | Back | 76.20
2019-01-02 | -150.01 | Front | -300.02
(4 rows)
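One caveat: with an ORDER BY in the window, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on created_at within the same register are summed together as peers. If you need a strict row-by-row running total, spell the frame out explicitly:
sum(amount) over (
  partition by register_name
  order by created_at
  rows between unbounded preceding and current row
)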

Restrict Sumo Logic search to one timeslice bucket

I have logs being pushed to Sumo Logic once every day, but co-workers can force a push to update statistics. This causes an issue where some Sumo Logic searches find more than one message within the allocated time range and so return double (or more) what is expected.
I am wondering if there is some way I can use timeslice so that I am only looking at the last set of results within a 24h period?
My search that works when there is only one log in 24h:
| json field=_raw "Policy"
| count by policy
| sort by _count
What I am trying to achieve:
| json field=_raw "Policy"
| timeslice 1m
| where last(_timeslice)
| count by policy
| sort by _count
Found a solution, not sure if optimal.
| json field=_raw "Policy"
| timeslice 1m
| count by policy , _timeslice
| filter _timeslice in (sort by _timeslice desc | limit 1)
| sort by _count
| fields policy, _count
If I'm understanding your question right, I think you could try something with the accum operator:
*
| json field=_raw "Policy"
| timeslice 1m
| count by _timeslice, policy
| 1 as rank
| accum rank by _timeslice
| where _accum = 1
This would be similar to doing a window partition in SQL to get rid of duplicates.
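For comparison, the equivalent window-partition dedup in SQL would look something like this (a sketch; policy_counts and its columns are hypothetical stand-ins for the aggregated results above):
select bucket, policy, cnt
from (
  select bucket, policy, cnt,
         row_number() over (partition by bucket) as rn
  from policy_counts
) t
where rn = 1;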

select distinct records based on one field while keeping other fields intact

I've got a table like this:
table: searches
+----+------------+----------+
| id | address    | date     |
+----+------------+----------+
| 1  | 123 foo st | 03/01/13 |
| 2  | 123 foo st | 03/02/13 |
| 3  | 456 foo st | 03/02/13 |
| 4  | 567 foo st | 03/01/13 |
| 5  | 456 foo st | 03/01/13 |
| 6  | 567 foo st | 03/01/13 |
+----+------------+----------+
And want a result set like this:
+----+------------+----------+
| id | address    | date     |
+----+------------+----------+
| 2  | 123 foo st | 03/02/13 |
| 3  | 456 foo st | 03/02/13 |
| 4  | 567 foo st | 03/01/13 |
+----+------------+----------+
But ActiveRecord seems unable to achieve this result. Here's what I'm trying:
Model has a 'most_recent' scope: scope :most_recent, order('date_searched DESC')
Model.most_recent.uniq returns the full set (SELECT DISTINCT "searches".* FROM "searches" ORDER BY date DESC) -- obviously the query is not going to do what I want, but neither is selecting only one column. I need all columns, but only rows where the address is unique in the result set.
I could do something like Model.select('distinct(address), date, id'), but that feels...wrong.
You could do a
select max(id), address, max(date) as latest
from searches
group by address
order by latest desc
According to sqlfiddle that does exactly what I think you want.
It's not quite the same as your required output, which doesn't seem to care about which ID is returned. Still, the query needs to specify something, which is done here with the max aggregate function.
I don't think you'll have any luck with ActiveRecord's autogenerated query methods for this case. So just add your own query method using that SQL to your model class. It's completely standard SQL that'll also run on basically any other RDBMS.
Edit: One big weakness of the query is that it doesn't necessarily return actual records. If the highest ID for a given address doesn't correlate with the highest date for that address, the resulting "record" will be different from the one actually stored in the DB. Depending on the use case, that might or might not matter. For MySQL, simply changing max(id) to id would fix that problem, but IIRC Oracle has a problem with that.
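If getting real stored rows back matters, one standard-SQL workaround (a sketch against the searches table above; note that ties on date return multiple rows per address) is to join back to the per-address maximum date:
select s.id, s.address, s.date
from searches s
join (
  select address, max(date) as max_date
  from searches
  group by address
) latest on latest.address = s.address and s.date = latest.max_date;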
To show unique addresses:
Searches.group(:address)
Then you can select columns if you want:
Searches.group(:address).select('id,date')
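For what it's worth, that call generates roughly the following SQL, which only runs under MySQL's loose GROUP BY semantics (PostgreSQL, and MySQL with ONLY_FULL_GROUP_BY enabled, will reject non-aggregated columns missing from the GROUP BY):
SELECT id, date FROM searches GROUP BY address;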

partition PostgreSQL table based on geometry column

Here is the table with the geometry field:
Table "public.regions"
Column    | Type                  |
----------+-----------------------+
id        | integer               |
parent_id | integer               |
level     | integer               |
name      | character varying(55) |
location  | geometry              |
I have stored the geometry for all continents, countries, states and cities. Since it is a huge table, I need to partition it based on the top-level location (i.e. continent) to improve performance.
How can I partition my existing table based on geometry (continent)? Is it good enough to create inheritance tables named asia, europe, australia, ... and insert rows into those tables based on queries with Contains? Will that improve the performance of my queries?
For example, I am trying to run queries like this (11.562424 48.148679 is a point in Munich):
EXPLAIN ANALYZE SELECT id, name, level FROM regions
WHERE Contains(location, GeomFromText('Point(11.562424 48.148679)'));
This takes around 500 ms with PG on my computer, whereas the same query takes around 200 ms in Oracle.
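For reference, the inheritance layout proposed above would look roughly like this (a sketch; the child-table names are illustrative, and since a CHECK constraint cannot express "geometry lies within Asia", inserts must be routed by the application or a trigger):
CREATE TABLE regions_asia   () INHERITS (regions);
CREATE TABLE regions_europe () INHERITS (regions);
-- A GiST index on each child keeps point-in-polygon probes from scanning every row:
CREATE INDEX regions_asia_location_gist ON regions_asia USING gist (location);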
