How to link jobs on coordinator and workers on a Citus database on PostgreSQL 12

I have the Citus extension on a PostgreSQL server, and I want to see the statistics from pg_stat_statements of each worker through the coordinator node. However, there is no column to match the entries from the coordinator and the workers. Does anybody know how I can do that?
I am also interested in how the queryId is computed by PostgreSQL.
So the pg_stat_statements view on the coordinator would show something like:
userid | dbid | queryid | query | other statistics related columns
1 | 2 | 123 | SELECT * FROM a; | ...
While the pg_stat_statements view on a worker would show something like:
userid | dbid | queryid | query | other statistics related columns
1 | 2 | 456 | SELECT * FROM a_shard1; | ...
1 | 2 | 789 | SELECT * FROM a_shard2; | ...

You can match the table names on the workers (the shards) to the distributed tables on the coordinator with the help of the pg_dist_partition and pg_dist_shard_placement tables. For matching the stats, you can check the citus_stat_statements view, as in the sketch below.
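A minimal sketch for the coordinator, assuming both pg_stat_statements and the citus_stat_statements view are available there: the two views share the queryid, userid, and dbid columns, so joining them attaches execution statistics to each distributed query.
-- Hedged sketch: join Citus' per-query view to pg_stat_statements on the
-- shared (queryid, userid, dbid) key. Column names match PostgreSQL 12
-- (total_time was split into total_exec_time and total_plan_time in PG 13).
SELECT cs.queryid,
       cs.query,
       cs.partition_key,  -- distribution column value, when Citus knows it
       pss.calls,
       pss.total_time
FROM citus_stat_statements cs
JOIN pg_stat_statements pss USING (queryid, userid, dbid)
ORDER BY pss.total_time DESC
LIMIT 10;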

(I can't comment on the answer above, so I'm adding my answer here.)
You can use the query below to list the locations of the shards of a specific table on a specific worker node (see the filters in the WHERE clause):
SELECT pg_dist_shard.shardid, pg_dist_node.nodename, pg_dist_node.nodeport
FROM pg_dist_shard
JOIN pg_dist_placement ON pg_dist_placement.shardid = pg_dist_shard.shardid
JOIN pg_dist_node ON pg_dist_node.groupid = pg_dist_placement.groupid
WHERE logicalrelid = '<distributedTableName>'::regclass AND
pg_dist_node.nodename = '<nodeName>' AND
pg_dist_node.nodeport = '<nodePort>';
Then you can run the query below on the worker node of interest to see what Citus executes for a specific shard on that node:
SELECT * FROM pg_stat_statements WHERE query LIKE '%_<shardId>%';

Related

Using Crosstab to Generate Data for Charts

I'm trying to write an efficient query to create a view that will contain counts of the number of successful logins per day, broken down by type of user, with no duplicate users per day.
There are 3 tables involved in this query: one table that contains all successful login attempts, one table for standard user accounts, and one table for admin user accounts. All user_id values are unique across the entire database, so no user account will share a user_id with an admin account:
TABLE 1: user_account
user_id | username
---------|----------
1 | user1
2 | user2
TABLE 2: admin_account
user_id | username
---------|----------
6 | admin6
7 | admin7
TABLE 3: successful_logins
user_id | timestamp
---------|------------------------------
1 | 2022-01-23 14:39:12.63798-07
1 | 2022-01-28 11:16:45.63798-07
1 | 2022-01-28 01:53:51.63798-07
2 | 2022-01-28 15:19:21.63798-07
6 | 2022-01-28 09:42:36.63798-07
2 | 2022-01-23 03:46:21.63798-07
7 | 2022-01-28 19:52:16.63798-07
2 | 2022-01-29 23:12:41.63798-07
2 | 2022-01-29 18:50:10.63798-07
The resulting view I would like to generate would contain the following information from the above 3 tables:
VIEW: login_counts
date_of_login | successful_user_logins | successful_admin_logins
---------------|------------------------|-------------------------
2022-01-23 | 2 | 0
2022-01-28 | 2 | 2
2022-01-29 | 1 | 0
I'm currently reading up on how crosstabs work but having trouble figuring out how to write the query based on my table setups.
I actually was able to get the values I needed by using the following query:
SELECT
to_char(s.timestamp, 'YYYY-MM-DD') AS login_date,
count(distinct u.user_id) AS successful_user_logins,
count(distinct a.user_id) AS successful_admin_logins
FROM successful_logins s
LEFT JOIN user_account u ON u.user_id = s.user_id
LEFT JOIN admin_account a ON a.user_id = s.user_id
GROUP BY login_date
However, I was told it would be even quicker using crosstabs, especially considering the successful_logins table contains millions of records. So I'm also trying to create a crosstab version of the query and compare the execution times of the two.
Any help would be greatly appreciated. Thanks!
Turns out it isn't possible to do what I was asking about using crosstabs, so the original query I have will have to do.
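For completeness, a minimal sketch that wraps the working query in the desired view, using only the table and column names from the question:
-- Wrap the accepted aggregate query in the view the question asked for.
CREATE VIEW login_counts AS
SELECT
    to_char(s.timestamp, 'YYYY-MM-DD') AS date_of_login,
    count(DISTINCT u.user_id) AS successful_user_logins,
    count(DISTINCT a.user_id) AS successful_admin_logins
FROM successful_logins s
LEFT JOIN user_account u ON u.user_id = s.user_id
LEFT JOIN admin_account a ON a.user_id = s.user_id
GROUP BY date_of_login;
Note that to_char returns text; use s.timestamp::date instead if the view should expose a real date column.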

influxdb flux lookup function

I have a table in a MySQL DB containing model_name as the key and model_desc as a secondary column:
| model_name | model_desc |
|------------|------------|
| model1     | Phone      |
| model2     | Tablet     |
In InfluxDB I have a bucket with a series in which model_name is a label:
| username | model_name | last_connected |
I have to join these two so that, for each event in InfluxDB, I can associate it with a model_desc. If there is no match, I want to set model_desc to something like 'unknown'.
I got as far as using join on both queries, but since Flux's join is explicitly an inner join, I only see the intersection. I need something like an outer join, or a lookup that returns either a match or 'unknown' for non-matching rows.
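For reference, the relational equivalent of the desired lookup is sketched below (the table names events and models are hypothetical; this illustrates only the semantics, not Flux syntax): a LEFT JOIN keeps every event row, and COALESCE fills in 'unknown' where no model matches.
-- Hypothetical tables: events mirrors the InfluxDB series, models the MySQL table.
SELECT e.username,
       e.model_name,
       e.last_connected,
       COALESCE(m.model_desc, 'unknown') AS model_desc  -- default when no match
FROM events e
LEFT JOIN models m ON m.model_name = e.model_name;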

How do I know the query_group of a query which was run?

I need help figuring out the query_group of a query that was run on Redshift. I have set a query_group in the WLM config and want to make sure the query is executed from that query group.
query_group is part of the WLM (workload management) configuration, which lets you manage how queries run through queues on the Redshift cluster. To use a query_group, you have to set up your own queue with the query_group name (label) in advance, through the AWS console ([Amazon Redshift] -> [Parameter Groups] -> select the parameter group -> [WLM]) or the CLI.
Here is an example, snipped from the Redshift documentation:
set query_group to 'Monday';
select * from category limit 1;
...
reset query_group;
You have to set the query_group before starting the query you want to assign to the specific queue, and reset the query_group after it finishes.
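As a quick sanity check, you can read the parameter back in the same session before running your query (Redshift's show command displays the current value of a session parameter):
show query_group;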
You can track the queries of a query_group as follows; 'label' is the name of the query_group.
select query, pid, substring, elapsed, label
from svl_qlog where label ='Monday'
order by query;
query | pid | substring | elapsed | label
------+------+------------------------------------+-----------+--------
789 | 6084 | select * from category limit 1; | 65468 | Monday
790 | 6084 | select query, trim(label) from ... | 1260327 | Monday
791 | 6084 | select * from svl_qlog where .. | 2293547 | Monday
792 | 6084 | select count(*) from bigsales; | 108235617 | Monday
...
This document is good for understanding how WLM works and how to use it:
http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html
This link is about query_group:
http://docs.aws.amazon.com/redshift/latest/dg/r_query_group.html

Heroku Postgres performance monitoring with pg_stat_statements

I've been trying to troubleshoot some recurring H12/13 errors on Heroku. After exhausting everything I can find on Google/Heroku/Stack Overflow I'm now checking to see if some long-running database queries are causing the problem on the advice of Heroku support.
Update: I'm on a production Crane instance. Per the accepted answer below, it appears you cannot do this on Heroku. The best I've been able to do is filter them out with the SQL below:
SELECT u.usename, (total_time / 1000 / 60) as total_minutes,
(total_time/calls) as average_time, query
FROM pg_stat_statements p
JOIN pg_user u ON (u.usesysid = p.userid)
WHERE query != '<insufficient privilege>'
ORDER BY 2 DESC
LIMIT 10;
I'm trying to use Craig Kerstiens' very useful post,
http://www.craigkerstiens.com/2013/01/10/more-on-postgres-performance/ but I'm running into some permission issues.
When I query the pg_stat_statements view I get "insufficient privileges" for some of the longer-running queries, and it doesn't appear that Heroku lets you change user permissions.
Does anyone know how I can change permissions see these queries on Heroku?
heroku pg:psql --remote production
psql (9.2.2, server 9.2.4)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.
d4k2qvm4tmu579=> SELECT
d4k2qvm4tmu579-> (total_time / 1000 / 60) as total_minutes,
d4k2qvm4tmu579-> (total_time/calls) as average_time,
d4k2qvm4tmu579-> query
d4k2qvm4tmu579-> FROM pg_stat_statements
d4k2qvm4tmu579-> ORDER BY 1 DESC
d4k2qvm4tmu579-> LIMIT 10;
total_minutes | average_time | query
------------------+-------------------+--------------------------
121.755079699998 | 11.7572250919775 | <insufficient privilege>
17.9371053166656 | 1.73208859315089 | <insufficient privilege>
13.8710526000023 | 1.33945202190106 | <insufficient privilege>
6.98494270000089 | 0.674497883626922 | <insufficient privilege>
6.75377774999972 | 0.652175543095124 | <insufficient privilege>
6.55192439999995 | 0.632683664174224 | <insufficient privilege>
3.84014626666634 | 1.12786802880252 | <insufficient privilege>
3.40574066666667 | 1399.61945205479 | <insufficient privilege>
3.16332020000008 | 0.929081204384053 | <insufficient privilege>
2.30192519999944 | 0.222284382614463 | <insufficient privilege>
(10 rows)
I can't answer your question directly, but take a look at the pg-extras plugin, which brings a lot of this goodness directly to the Heroku CLI and returns the data :)
https://github.com/heroku/heroku-pg-extras
You need to be running a production-level instance of Heroku Postgres in order to use pg_stat_statements. Even then, it will only be able to show you stats for queries run by your app (or any client using the Heroku-supplied credentials). You won't be able to see queries run by superusers (postgres, collectd). Production plans are Crane and up (I believe).
You can see the username by joining in pg_user:
SELECT u.usename, (total_time / 1000 / 60) as total_minutes,
(total_time/calls) as average_time, query
FROM pg_stat_statements p
JOIN pg_user u ON (u.usesysid = p.userid) ORDER BY 2 DESC LIMIT 10;

Rails created_at timestamp order disagrees with id order

I have a Rails 2.3.5 app with a table containing id and created_at columns. The table records state changes to entities over time, so I occasionally use it to look up the state of an entity at a particular time, by looking for state changes that occurred before the time, and picking the latest one according to the created_at timestamp.
For 10 out of 1445 entities, the timestamps of the state changes are in a different order to the ids, and the state of the last state change differs from the state which is stored with the entity itself, e.g.
id | created_at | entity_id | state |
------+---------------------+-----------+-------+
1151 | 2009-01-26 10:27:02 | 219 | 1 |
1152 | 2009-01-26 10:27:11 | 219 | 2 |
1153 | 2009-01-26 10:27:17 | 219 | 4 |
1154 | 2009-01-26 10:26:41 | 219 | 5 |
I can probably get around this by ordering on id instead of timestamp, but can't think of an explanation as to how it could have happened. The app uses several mongrel instances, but they're all on the same machine (Debian Lenny); am I missing something obvious? DB is Postgres.
Rails uses a database sequence to fetch the new id for your id field (at least in PostgreSQL) on insert, or the RETURNING keyword if the database supports it.
But it sets the created_at and updated_at fields on create with the ActiveRecord::Timestamp#create_with_timestamps method, which uses the system time of the application process.
Row 1154 was inserted later, but its created_at value was computed earlier: with several mongrel processes, one process can compute its timestamp, stall, and perform the INSERT after another process that computed a later timestamp, so created_at order need not match id order.
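As a hedged illustration (the question doesn't name the state-changes table, so state_changes below is hypothetical), a self-join like this would list the entities whose id order and created_at order disagree:
-- Find entities with a row that has a later id but an earlier created_at.
SELECT DISTINCT s.entity_id
FROM state_changes s
JOIN state_changes t
  ON t.entity_id = s.entity_id
 AND t.id > s.id
 AND t.created_at < s.created_at;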
