Redshift problem with getdate() being cached across multiple procedure calls

I would like to have timestamps along the execution of a procedure. I'm using GETDATE(), but calling the procedure a 2nd, 3rd, etc. time keeps returning the same timestamps. Is Redshift doing some sort of caching on GETDATE()?
OK - outside a procedure, getdate() correctly keeps returning different timestamps:
select getdate();
2021-03-09 13:03:18.0
select getdate();
2021-03-09 13:03:26.0
NOK - now I create a test procedure that captures start and end timestamps (with a big FOR loop in between to make a pause) and call the procedure more than once. I see the same start/end values being returned on the 2nd, 3rd, etc. executions.
create or replace procedure tests.test_getdate_cache()
as $$
declare
ts timestamp;
f float;
begin
-- execute 'SET enable_result_cache_for_session TO OFF';
ts:= getdate();
raise info 'START - getdate()=[%]', ts;
FOR i IN 1..10000000 LOOP
-- ts:= getdate();
-- raise info '**loop[%] getdate()=[%]', i, ts;
f:= i * 1234.4234;
END LOOP;
ts:= getdate();
raise info 'END - getdate()=[%]', ts;
-- execute 'SET enable_result_cache_for_session TO ON';
end;
$$ LANGUAGE plpgsql;
/
call tests.test_getdate_cache();
START - getdate()=[2021-03-09 13:15:47]
END - getdate()=[2021-03-09 13:16:00]
START - getdate()=[2021-03-09 13:15:47] <---- DUPLICATED VALUE
END - getdate()=[2021-03-09 13:16:00] <---- DUPLICATED VALUE
--cache results off
START - getdate()=[2021-03-09 13:23:22]
END - getdate()=[2021-03-09 13:23:34]
START - getdate()=[2021-03-09 13:23:22] <---- DUPLICATED VALUE
END - getdate()=[2021-03-09 13:23:34] <---- DUPLICATED VALUE
SELECT * FROM svl_stored_proc_messages WHERE 1=1 ORDER BY recordtime desc limit 4
userid | session_userid | pid   | xid     | query | recordtime          | loglevel | loglevel_text | message                                 | linenum | querytxt                        | label   | aborted
-------+----------------+-------+---------+-------+---------------------+----------+---------------+-----------------------------------------+---------+---------------------------------+---------+--------
   101 |            101 | 31870 | 1540358 |  4332 | 2021-03-09 13:16:32 |       30 | INFO          | END - getdate()=[2021-03-09 13:16:00]   |      16 | call tests.test_getdate_cache() | default | 0
   101 |            101 | 31870 | 1540358 |  4332 | 2021-03-09 13:16:19 |       30 | INFO          | START - getdate()=[2021-03-09 13:15:47] |       7 | call tests.test_getdate_cache() | default | 0
   101 |            101 | 31870 | 1540334 |  4325 | 2021-03-09 13:16:00 |       30 | INFO          | END - getdate()=[2021-03-09 13:16:00]   |      16 | call tests.test_getdate_cache() | default | 0
   101 |            101 | 31870 | 1540334 |  4325 | 2021-03-09 13:15:47 |       30 | INFO          | START - getdate()=[2021-03-09 13:15:47] |       7 | call tests.test_getdate_cache() | default | 0
I even tried turning off the result cache. That probably only applies to SELECTs, while getdate() is a function, but I gave it a try anyway: same result, the timestamps keep coming back the same. Just uncomment these lines:
-- execute 'SET enable_result_cache_for_session TO OFF';
-- execute 'SET enable_result_cache_for_session TO ON';
Any idea how I can get fresh timestamps early in the procedure and again when it is about to exit?
Thanks.

Getdate() returns the current statement time, and since a stored procedure call is a single statement, it will return the same time throughout the procedure. See: https://docs.aws.amazon.com/redshift/latest/dg/Date_functions_header.html
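You can check this from a single statement; because both calls report the statement's start time, the two columns should come back identical:
select getdate() as t1, getdate() as t2;
-- t1 and t2 show the same value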
I don't think what you are trying to do can work (reliably). Remember Redshift is a cluster of networked computers, and while their clocks are synced, this isn't perfect. Each node will have a slightly different clock, and the network time to talk to each other will also affect comparisons. Your procedure isn't running in one place but many. You can likely get at the real time of each node through a Python UDF, but then you have to sort out what it is you are seeing from the disparate results. This may be doable for single calls to the UDF, but when you apply the process to real data stored across many nodes you are likely to see many conflicting answers.
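For reference, such a node-clock UDF might look something like this. This is my sketch, not part of the original answer, and f_node_clock is a made-up name; VOLATILE is there to keep the result from being reused:
create or replace function f_node_clock() returns varchar
volatile
as $$
    # runs on whichever node executes the call; returns that node's wall clock
    from datetime import datetime
    return datetime.utcnow().isoformat()
$$ language plpythonu;
-- example: select f_node_clock();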

I was also interested in getting a datetime for the current statement and I ran into the same issues you did with getdate(). This worked for me:
cast(timeofday() as timestamp)
It feels like a hack, but I'm going forward with this for my Redshift procedure logging.
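Applied to the procedure from the question, that amounts to replacing the two getdate() calls. A minimal sketch of the approach:
create or replace procedure tests.test_timeofday()
as $$
declare
    ts timestamp;
begin
    -- timeofday() returns a wall-clock string, so cast it to timestamp
    ts := cast(timeofday() as timestamp);
    raise info 'START - timeofday()=[%]', ts;
    -- ... long-running work ...
    ts := cast(timeofday() as timestamp);
    raise info 'END - timeofday()=[%]', ts;
end;
$$ LANGUAGE plpgsql;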

Related

Google Sheets Formula to calculate actual total duration of tasks with different start/end dates, overlaps, and gaps

I know how to do this using a custom function/script but I am wondering if it can be done with a built-in formula.
I have a list of tasks with a start date and end date. I want to calculate the actual # of working days (NETWORKDAYS) spent on all the tasks.
Task days may overlap, so I can't just total the # of days spent on each task.
There may be gaps between tasks, so I can't just find the difference between the first start and last end.
For example, let's use these:
| Task Name | Start Date | End Date | NETWORKDAYS |
|:---------:|------------|------------|:-----------:|
| A | 2019-09-02 | 2019-09-04 | 3 |
| B | 2019-09-03 | 2019-09-09 | 5 |
| C | 2019-09-12 | 2019-09-13 | 2 |
| D | 2019-09-16 | 2019-09-17 | 2 |
| E | 2019-09-19 | 2019-09-23 | 3 |
Here it is visually:
Now:
If you total the NETWORKDAYS you'll get 15
If you calculate NETWORKDAYS between 2019-09-02 and 2019-09-23 you get 16
But the actual duration is 13:
A and B overlap a bit
There is a gap between B and C
There is a gap between D and E
If I was to write a custom function I would basically take all the dates, sort them, find overlaps and remove them, and account for gaps.
But I am wondering if there is a way to calculate the actual duration using built-in formulas?
sure, why not:
=ARRAYFORMULA(COUNTA(IFERROR(QUERY(UNIQUE(TRANSPOSE(SPLIT(CONCATENATE("×"&
SPLIT(REPT(INDIRECT("B1:B"&COUNTA(B1:B))&"×",
NETWORKDAYS(INDIRECT("B1:B"&COUNTA(B1:B)), INDIRECT("C1:C"&COUNTA(B1:B)))), "×")+
TRANSPOSE(ROW(INDIRECT("A1:A"&MAX(NETWORKDAYS(B1:B, C1:C))))-1)), "×"))),
"where Col1>4000", 0))))

Rails Postgres query Position of item relative to others?

I'm building a model against Amazon's SQS Standard Queue which can send updates out of order.
My goal is to properly order them.
I am long-polling to copy all data from the queue into my DB.
Table example - let's say I fetch some messages and process them:
id | published_at | run_at | payload
1 | 1:11pm | nil | ...
2 | 1:12pm | nil | ...
3 | 1:13pm | nil | ...
4 | 1:14pm | nil | ...
5 | 1:15pm | nil | ...
Then I fetch some more, and we can see that a few odd messages are now outdated.
id | published_at | run_at | payload
1 | 1:11pm | 1:15 | ...
2 | 1:12pm | 1:15 | ...
3 | 1:13pm | 1:15 | ...
4 | 1:14pm | 1:15 | ...
5 | 1:15pm | 1:15 | ...
6 | 1:13pm | nil | ...
7 | 1:14pm | nil | ...
8 | 1:16pm | nil | ...
If I were to order by published_at, you can see that the queue needs to be re-processed from ID=6 down to make sure messages are processed in order.
id | published_at | run_at | payload
1 | 1:11pm | 1:15 | ...
2 | 1:12pm | 1:15 | ...
3 | 1:13pm | 1:15 | ...
6 | 1:13pm | nil | ...
4 | 1:14pm | 1:15 | ...
7 | 1:14pm | nil | ...
5 | 1:15pm | 1:15 | ...
8 | 1:16pm | nil | ...
There is value in processing data accurately, and very little harm in processing twice, so re-running is not a problem.
I am mostly curious how best to find the oldest item that has not been run, and start running from that moment forward.
Currently doing:
# fetch oldest publish_time that has not been run
first_publish_time = AnyOfferChange.where(run_at: nil).minimum(:publish_time)
if first_publish_time
# start there, and process in ascending order
AnyOfferChange.order("publish_time DESC").where("publish_time >= ?",first_publish_time).reverse.each(&:process!)
end
It feels quite fragile; I'd like to fetch the position and use it as a limit.
limit = AnyOfferChange.where(run_at: nil).order("publish_time ASC").pluck("POSITION SOMETHIN(SOMETHING)").first
if limit > 0
# start there, and process in ascending order
AnyOfferChange.order("publish_time DESC").limit(limit).reverse.each(&:process!)
end
The following SQL query will give you the oldest publish_time:
AnyOfferChange.where(run_at: nil).minimum(:publish_time)
Or, if you want one record:
AnyOfferChange.where(run_at: nil).order(publish_time: :asc).first
This limits the SQL query to the oldest row that has not run.
Fetch all records that have not run, from old to new:
result = AnyOfferChanges.where(run_at: nil).order(publish_time: :asc)
# or
result = AnyOfferChanges.where(run_at: nil).order(:publish_time) # Defaults to :asc
result.each(&:process!) # Process result. See note below for batch info.
Fetch all records that have not run and have exactly the oldest publish_time:
# See note below to prevent unwanted SQL execution for the statements
# below when executing in the terminal.
# Create shorthand.
any_offer_changes = AnyOfferChange.arel_table
# Build query parts.
not_ran = AnyOfferChange.where(run_at: nil)
oldest_publish_time = not_ran.select(any_offer_changes[:publish_time].minimum)
# All records that have not run, with the oldest publish time.
result = not_ran.where(publish_time: oldest_publish_time)
result.each(&:process!) # Process result. See note below for batch info.
This will result in fetching all records with the lowest publish time in one SQL query, using a sub-query.
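The generated SQL should look roughly like this (an approximation, not copied from a query log):
SELECT "any_offer_changes".*
FROM "any_offer_changes"
WHERE "any_offer_changes"."run_at" IS NULL
  AND "any_offer_changes"."publish_time" = (
    SELECT MIN("any_offer_changes"."publish_time")
    FROM "any_offer_changes"
    WHERE "any_offer_changes"."run_at" IS NULL
  )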
The reason I fetch the minimum differently in the last part, instead of using AnyOfferChange.where(run_at: nil).minimum(:publish_time), is that #minimum breaks the chain and executes a separate SQL query. AnyOfferChange.where(run_at: nil).select(any_offer_changes[:publish_time].minimum) keeps the chain intact when used in a where statement.
notes:
unwanted SQL execution
When the statements above are run one by one in the console, they will result in multiple queries, since #inspect (used to show you the result) triggers the SQL to execute. In the console, follow each statement with ; nil to prevent execution while building a #where chain. This is not needed when the code is executed in a script.
using "batches"
For large amounts of records you may have to limit the resulting values. Rails supports batches, but they don't respect the given order. To keep the order you can create your own batch, although it is probably less efficient. This can be done like so:
result = AnyOfferChange.where(run_at: nil).order(:publish_time).limit(100)
result.each(&:process!) while result.reload.any?
Assuming you set the run_at attribute in #process!, otherwise the above will result in an endless loop.

Influx: doing math on the same fields in different groups

I have InfluxDB measurement currently set up with following "schema":
+----+-------------+-----------+
| ts | cost(field) | type(tag) |
+----+-------------+-----------+
| 1  | 10          | 'a'       |
| 1  | 20          | 'b'       |
| 2  | 12          | 'a'       |
| 2  | 18          | 'b'       |
| 2  | 22          | 'c'       |
+----+-------------+-----------+
I am trying to write a query that will group my table by timestamp and get a delta between field values of two different tags. If I want to get the delta between tag 'a' and tag 'b', it should give me the following result (please note that I ignore tag 'c'):
+----+-----------+------------+
| ts | type(tag) | delta_cost |
+----+-----------+------------+
| 1 | 'a' | 10 |
| 2 | 'b' | 6 |
+----+-----------+------------+
Is it something Influx can do or am I using the wrong tool?
Just managed to answer my own question. While one of the obvious ways would be performing a self-join, Influx does not support joins anymore. We can, however, use nested selects in the following format:
SELECT MEAN(cost_a) - MEAN(cost_b) AS delta_cost
FROM
    (SELECT cost AS cost_a FROM tablename WHERE "type" = 'a'),
    (SELECT cost AS cost_b FROM tablename WHERE "type" = 'b')
GROUP BY time(60s)
Since I am getting my data every 60 seconds anyway, and I have a guarantee of just one point per tag per 60 seconds, I can use GROUP BY and take MEAN without any problems.

Sqlite: replacing or updating a row only if it is changed

An Objective-C iOS app integrates a SQLite database with a set of rows, each identified by an ID. For example:
| id | user_name | age |
------------------------------
| 1 | johnny | 33 |
| 2 | mark | 30 |
| 3 | maroccia | 50 |
Asynchronously, the app receives the same set of records, but some of them are modified: it has to update (or replace) only the modified records, ignoring the other ones (those not modified).
For example, the app receives such updated rows:
| id | user_name | age |
------------------------------
| 1 | johnny | 33 |
| 2 | mark | 30 |
| 3 | ballarin | 50 | <------ CHANGED RECORD
In this case, only the third record is changed and the app should update or replace just it, ignoring the first two.
Obviously, a plain INSERT OR REPLACE does not suit me because it will write all the records. So, is there some mechanism in SQLite (or Objective-C) which can help me update only the modified records?
Thanks
You could simply replace all rows; the result is the same.
If you do not want to rewrite rows that have not actually changed, you have to compare all column values. If you have both the old rows and the received rows in separate tables, you can compare entire rows with a compound query:
INSERT OR REPLACE INTO MyData
SELECT * FROM ReceivedData
EXCEPT
SELECT * FROM MyData;
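To make that concrete: when the received records arrive over the network rather than from another table, they can be staged in a temporary table first. A sketch, with the column types assumed from the example above:
CREATE TEMP TABLE ReceivedData (id INTEGER PRIMARY KEY, user_name TEXT, age INTEGER);
-- insert each received record (values bound from Objective-C)
INSERT INTO ReceivedData (id, user_name, age) VALUES (?, ?, ?);
-- rewrite only the rows that are new or actually changed
INSERT OR REPLACE INTO MyData
SELECT * FROM ReceivedData
EXCEPT
SELECT * FROM MyData;
DROP TABLE ReceivedData;
The EXCEPT drops every received row that already exists verbatim in MyData, so unchanged rows are never rewritten.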

Iterating over irregular data with Ruby while 'filling in the blanks'

I'm rolling the following:
Rails 3.2.9
Highcharts
State Machine
I've got an irregular set of data that represents the change of state of hundreds of linux boxes. Each box checks into a central ping server every two minutes.
Every time a device heartbeats, the ping server checks if the device's current state is offline and if so, changes the state to online and sets the heartbeat table's online col to true and inserts the time this happened.
On the ping server, we have a cron that runs a rake task every 5 minutes. This finds all devices whose last heartbeat is earlier than the time now minus 5 minutes.
If it discovers a device is offline, it sets the device state to offline and marks the heartbeat table with the time of the last heartbeat and a 0.
We've been doing this for a while and it seems like an efficient way to store the uptime data without creating a row for 500 devices every 5 minutes.
The table looks a little like this:
+---------------------+--------+--------+
| created_at | dev_id | online |
+---------------------+--------+--------+
| 2012-10-08 16:29:16 | 2345 | 0 |
| 2012-11-21 16:40:22 | 2345 | 1 |
| 2012-11-03 19:15:00 | 2345 | 0 |
| 2012-11-08 09:15:01 | 2345 | 1 |
| 2012-11-08 09:18:03 | 2345 | 0 |
| 2012-11-09 17:57:22 | 2345 | 1 |
| 2012-12-09 13:57:23 | 2345 | 0 |
| 2012-12-09 14:57:25 | 2345 | 1 |
| 2012-12-09 15:00:30 | 2345 | 0 |
| 2012-12-09 15:57:31 | 2345 | 1 |
| 2012-12-09 16:07:35 | 2345 | 0 |
| 2012-12-09 16:37:38 | 2345 | 1 |
| 2012-12-09 17:57:40 | 2345 | 0 |
+---------------------+--------+--------+
Following Ryan Bates's fantastic Railscast on Highcharts, I can create a line graph of this data with irregular intervals.
The chart and data series
Following this example:
http://www.highcharts.com/demo/spline-irregular-time
And using a data series something like this:
= @devices.heartbeats.map { |o| o.online ? 1 : 0 }
It was plotting the line graph pretty nicely.
Where I'm stuck
The graph finishes at the last time it checked in and I need the graph to show a point at Now. In Ryan's example, he maps a zero to a date if there's no value. I can't translate this part.
I'm trying to achieve a graph like the stacked bar chart but can't get the data sorted.
http://www.highcharts.com/demo/bar-stacked
How can I format my query so I get the data until Now as well as each individual point so I can create such a graph?
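One way to get the series to extend to the present is to append a final point at the current time that repeats the last known state. A sketch, assuming the heartbeat model described above:
# Sketch: [epoch_ms, state] pairs for Highcharts, plus a closing point
# at the current time repeating the last known state.
data = @devices.heartbeats.order(:created_at).map do |h|
  [h.created_at.to_i * 1000, h.online ? 1 : 0]
end
data << [Time.now.to_i * 1000, data.last ? data.last[1] : 0]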
