I have a simple timeseries table:
{
  "n": "EXAMPLE",    # Name, Hash Key
  "t": 1640893628,   # Unix Timestamp, Range Key
  "v": 10            # Value being stored
}
Every 15 minutes I will poll data and insert it into the table. If I want to query values over a 24-hour period, this works well; it equates to a total of 96 records.
Now, say I want to query a larger timespan: 1 or 2 years. That is now tens of thousands of records and, in my opinion, impractical to query regularly. It would require multiple queries to retrieve the larger time range, which would hurt response times as well as being much more costly.
I have thought of a couple of potential solutions to this problem:
1. Replicate data in another table, with larger increments. A table with a single record every 6 hours, for example.
2. Have another table to store common query results, such as records for "EXAMPLE" for the past week, month, and year respectively. I would periodically update records in the new table to hold every Nth record from the main table (a total of 100). Something like:
{
  "n": "EXAMPLE#WEEKLY",
  "v": [
    {
      "t": 1640893628,
      "v": 10
    },
    {
      "t": 1640993628,
      "v": 15
    },
    ... 98 more.
  ]
}
I believe #2 is a solid approach. It seems to me like this would be a common enough problem, so I would love to hear about how other people have approached it.
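For reference, this is roughly how I would keep such a rollup item up to date with boto3 (just a sketch; the table names and the 100-point target are placeholders, not anything final):

import time

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
raw = dynamodb.Table("timeseries")            # main 15-minute table (assumed name)
rollup = dynamodb.Table("timeseries_rollup")  # table holding pre-aggregated items (assumed name)

def rebuild_weekly(name, points=100):
    # Pull the last 7 days of raw records for this series.
    week_ago = int(time.time()) - 7 * 24 * 3600
    resp = raw.query(
        KeyConditionExpression=Key("n").eq(name) & Key("t").gte(week_ago)
    )
    items = resp["Items"]
    # Keep every Nth record so the rollup item holds at most `points` samples.
    step = max(1, len(items) // points)
    sampled = items[::step][:points]
    rollup.put_item(Item={
        "n": f"{name}#WEEKLY",
        "v": [{"t": i["t"], "v": i["v"]} for i in sampled],
    })

rebuild_weekly("EXAMPLE")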
More options present themselves if you can convert your unix timestamps into ISO 8601-type strings like 2021-12-31T09:27:58+00:00.
If so, DynamoDB's begins_with key condition expression lets us query for discrete calendar time buckets. December 2021, for example,
is queryable using n = id1 AND begins_with(t, "2021-12"). The same works for days and hours. We can take this one step further by
adding other bucketing periods as the sort keys of secondary indexes.
Some rolling windows are possible, too: n = id1 AND t > [24 hours ago] gives us last 24h.
n (PK)   t (SK)             hour_bucket (LSI1 SK)   week (LSI2 SK)
id1      2021-12-31T10:45   2021-12-31T09-12        2021-52
id1      2021-12-31T13:00   2021-12-31T13-15        2021-52
id1      2022-06-01T22:00   2022-06-01T22-24        2022-01
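As a concrete sketch (assuming boto3 and the illustrative names from the table above), the two query styles look like this:

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("timeseries")  # assumed table name

# All of December 2021 for item id1 (sort key "t" holds ISO 8601 strings).
december = table.query(
    KeyConditionExpression=Key("n").eq("id1") & Key("t").begins_with("2021-12")
)

# Rolling window: everything from the last 24 hours.
cutoff = (datetime.now(timezone.utc) - timedelta(hours=24)).strftime("%Y-%m-%dT%H:%M")
last_24h = table.query(
    KeyConditionExpression=Key("n").eq("id1") & Key("t").gt(cutoff)
)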
If you are looking for arbitrary time-series queries, you might consider Athena, as the other answer suggested, or AWS's serverless
Timestream, which is a "purpose-built time series database that makes it easy to store and analyze trillions of time series data points per day."
You could export the table to Amazon S3 and run Amazon Athena on the exported data. Here’s a blog post describing the process: https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
I use Azure Data Explorer to store temperature sensor values. The timestamps are in UTC. I want to aggregate these values by day for the last 7 days. However, I want to use the local time of the place the values came from and aggregate by local-time days (e.g. for UTC+2, local midnight falls at 22:00 UTC). How can I do this with the Kusto Query Language in ADX?
For example, if you want to apply a UTC+1 offset, you can extend your Kusto query like this:
| extend Timestamp = Timestamp + 3600s
Your filters for a time range would still need to be provided in UTC though.
Offsets work, but you cannot simply use a fixed offset if you care about daylight saving time or need a very general solution. If you're building something that regularly produces reports AND the time zone handling must be correct, read on.
It felt like a bit of a hack, but the way we achieved something along these lines was to create a time zone table with columns like this:
BeginOfDay: datetime(2020-01-01 00:00:00)
Timezone: "Africa/Addis_Ababa"
UTCStart: datetime(2020-01-01 00:00:00)-3h
UTCEnd: datetime(2020-01-02 00:00:00)-3h
There should be one row for every combination of time zone and day of interest. We populated something like ten years into the future. If you're worried about storage space or speed you only need to include the date range and time zones you care about, but even with 'everything' it was not a very large table.
Each row contains the 'day' BeginOfDay, which is always midnight and equivalent to "the first of January, 2020", and then the start and end of that local day in UTC time. We wrote a program to generate the contents of the table, of course.
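For illustration, a small generator along these lines (Python's zoneinfo here, just as an example of the approach) yields one row per local day and stays correct across daylight saving changes, because each day's UTC offset is computed independently:

import csv
import sys
from datetime import date, datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def timezone_day_rows(tz_name, first_day, num_days):
    """Yield (BeginOfDay, Timezone, UTCStart, UTCEnd) for each local day."""
    tz = ZoneInfo(tz_name)
    for i in range(num_days):
        day = first_day + timedelta(days=i)
        nxt = day + timedelta(days=1)
        start_local = datetime(day.year, day.month, day.day, tzinfo=tz)  # local midnight
        end_local = datetime(nxt.year, nxt.month, nxt.day, tzinfo=tz)    # next local midnight
        yield (
            datetime(day.year, day.month, day.day),                       # BeginOfDay
            tz_name,                                                      # Timezone
            start_local.astimezone(timezone.utc).replace(tzinfo=None),    # UTCStart
            end_local.astimezone(timezone.utc).replace(tzinfo=None),      # UTCEnd
        )

# Write ten years of rows for one time zone as CSV, ready to ingest into TimezoneDay.
writer = csv.writer(sys.stdout)
for row in timezone_day_rows("Africa/Addis_Ababa", date(2020, 1, 1), 3653):
    writer.writerow(row)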
After that, you can do something like:
let TimezoneDay = datatable (BeginOfDay:datetime, Timezone:string, UTCStart:datetime, UTCEnd:datetime)
[datetime(2020-01-01), "Africa/Addis_Ababa", datetime(2019-12-31 21:00:00), datetime(2020-01-01 21:00:00),
datetime(2020-01-02), "Africa/Addis_Ababa", datetime(2020-01-01 21:00:00), datetime(2020-01-02 21:00:00),
datetime(2020-01-03), "Africa/Addis_Ababa", datetime(2020-01-02 21:00:00), datetime(2020-01-03 21:00:00)
];
let TemperatureEvents = datatable (Timestamp:datetime, Device:string, Temperature:real)
[datetime(2020-01-01 05:00:00), "Device 1", 10.5,
datetime(2020-01-01 07:00:00), "Device 1", 30.5,
datetime(2020-01-02 01:50:00), "Device 1", 24.0,
datetime(2020-01-02 20:00:00), "Device 1", 20.5,
datetime(2020-01-02 23:50:00), "Device 1", 19.5,
datetime(2020-01-01 10:20:00), "Device 2", 0.5
];
TimezoneDay
| where Timezone == "Africa/Addis_Ababa"
// Use a dummy column to emulate a cross join
| extend dummy=1
| join kind=inner (TemperatureEvents | extend dummy = 1) on dummy
// Keep only the events that fall within each local day's UTC window
| where Timestamp between (UTCStart .. UTCEnd)
| summarize AverageTemp=avg(Temperature) by BeginOfDay, Timezone, Device
The cross join may be a little expensive if you have a large dataset, but this is a starting point - you can also do a time window join to restrict the number of events you consider for each 'day'.
Azure Data Explorer doesn't have any built-in functions for converting between time zones.
The documentation recommends:
... Should time zone values be required to be kept as a part of the data, separate columns should be used (providing offset information relative to UTC).
Thus, you should store two values: the original UTC-based timestamp so you can properly order the data, and the date in the local time zone so you can aggregate by local day.
Problem: I want to aggregate values that match multiple conditions spread across non-contiguous columns.
Link: Google Sheet link
Context: I'm a mushroom grower and I want to plan my production program for 2020.
We call a "batch" an amount of inoculated mushroom substrate put into production at the same time. A batch usually weighs about a ton, and over the whole of 2020 we will put roughly 40 batches into production.
A batch has several characteristics:
An ID. 20-B1 is the first batch for 2020, 20-B2 the second, and so on.
A scheduled week of production start. Production for one batch usually lasts 14 weeks, and the production weeks are numbered from 1 to 14.
A calendar week, against which the array of 14 production weeks is placed.
An origin. We work with two companies (Eurosubstrat and CNC) and also produce our own substrate, which gives three possible values: "EURO", "CNC" and "LCU".
A cultivated species. We have two species: Shiitake and Pleurotus ostreatus.
A weight; about 1 ton, but a few batches can be much lighter.
A production capacity over time: for example, in the third week of production a batch will yield 3.4% of its mass in mushrooms.
I want to extract and put the following results in a column:
Target 1: in order to establish an ordering schedule, the number of batches that match a set of criteria for one given calendar week; for example "Production week = 1" AND "Origin = EURO" AND "Species = Shiitake".
Target 2: in order to estimate the total production capacity for one given species, the sum of the individual production capacities for one given calendar week; for example "Species = Shiitake" combined with the related "production capacity".
Hope I'm clear.
If you can think of a better way to organize my data, feel free to suggest it; I don't know if the current organization is the best.
Sorry, English is not my native tongue.
I need to save a reference number every time I save a record of a certain model. The reference is composed of 10 characters: the first 8 are related to the creator and the date, and the last 2 digits are an incremental number starting at 00 and ending at 99; this count should be reset every day.
For example:
Records created the same day:
SD01011800
GF01011801
MT01011802
...
GH01011899
------------------------------------------------------------------------------
Records created the next day:
SD02011800
GF02011801
MT02011802
...
GH02011899
Where the first 2 letters are the initials of a name, the next 2 are the current day, the next 2 the current month, the next 2 the current year, and the last 2 an incremental number (from 00 to 99, reset daily)
Also every reference HAS TO be unique.
I'm missing the last-two-digits part; any idea on how to guarantee this?
Thanks for reading.
Where the first 2 letters are the initials of a name, the next 2 are the current day, the next 2 the current month, the next 2 the current year, and the last 2 an incremental number (from 00 to 99, reset daily).
As folks in the comments have pointed out, this assumes there is a maximum of 100 entries per day, and it will have problems in 2100. One is more pressing than the other. Maybe if you go over 100 you can start using letters?
Also every reference HAS TO be unique.
For globally unique identifiers, UUIDs (Universally Unique IDentifiers) are generally the way to go. If you can change this requirement, it would be simpler (databases already support UUIDs), more robust (UUIDs aren't limited to 100 per day), and more secure (they don't leak information about the thing being identified).
Assuming you can't change the requirement, the next number can be obtained by counting the existing rows for that day.
select count(id)
from stuff
where date(created_at) = date(now());
However there is a problem if two processes both insert a new record at the same time and get the same next number. Probably highly unlikely if you're expecting only 100 a day, but still possible.
Process 1               Process 2               Time
select count(id) ...                            1
                        select count(id) ...    2
insert into stuff ...                           3
                        insert into stuff ...   4
A transaction won't save you. You could get an exclusive lock on the whole table, but that's dangerous and slow. Instead you can use an advisory lock for this one operation. Just make sure all code which writes new records uses this same advisory lock.
Stuff.with_advisory_lock(:new_stuff_record) do
...
end
Alternatively, store the daily ID in a column. Add a database trigger to add 1 on insert. Set it back to 0 with a scheduled job at midnight.
I will assume your class is named Record and it has an attribute called reference_number.
If that is the case, you can use the following method to fetch the two last digits.
def fetch_following_two_last_digits
  last_record = Record.last
  # Start over at "00" when there is no record yet or the last one is from a previous day.
  if last_record.nil? || last_record.created_at < Time.current.beginning_of_day
    "00"
  else
    # Zero-pad so that, e.g., 1 becomes "01".
    (last_record.reference_number.last(2).to_i + 1).to_s.rjust(2, "0")
  end
end
This also assumes you never reach 100 records in a day; otherwise you'd end up with three final digits.
I would like to write points into an influx 0.8 database with the time values given in seconds through HTTP. Here's a sample point in JSON format:
[
  {
    "points": [
      [
        1435692857.0,
        897
      ]
    ],
    "name": "some_series",
    "columns": [
      "time",
      "value"
    ]
  }
]
The documentation is unclear about what format the time values should have (nanoseconds or milliseconds?) and how to tell InfluxDB what to expect. Currently I'm using a query parameter: precision=s
That seems to work fine; the server returns HTTP status code 200 as expected. When querying the database through the InfluxDB admin interface with select * from some_series, the data points in the table are returned with the expected timestamps. On the graph, however, the time axis is indexed with fractions of seconds, and queries like select * from some_series where time > now() - 1h don't yield any results.
I assume there is something wrong with the timestamps. I tried multiplying my values by 1000, but then nothing gets inserted into the database, with no visible errors.
What's the problem?
By default, supplied timestamps are assumed to be in milliseconds. I think your writes are defaulting to milliseconds because the query string parameter should be time_precision=s, not precision=s.
See the details under "Time Precision on Written Data" on https://influxdb.com/docs/v0.8/api/reading_and_writing_data.html.
I also think the time value should be an integer rather than a float. I'm not sure how to explain the other behaviors, where the timestamp seems to be the right date and multiplying by 1000 doesn't solve the issue, but I wonder if it's related to writing floats.
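For reference, a seconds-precision write could look roughly like this (a sketch against the 0.8 HTTP API; host, database name, and credentials are placeholders):

import requests

payload = [
    {
        "name": "some_series",
        "columns": ["time", "value"],
        "points": [
            [1435692857, 897],  # integer seconds, matching time_precision=s
        ],
    }
]

resp = requests.post(
    "http://localhost:8086/db/mydb/series",  # placeholder host and database
    params={"u": "root", "p": "root", "time_precision": "s"},
    json=payload,
)
resp.raise_for_status()  # expect HTTP 200 on success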
Please contact the InfluxDB support team at support@influxdb.com for further assistance.
I found the solution! The problem was only partly with the precision. Your answer was correct: the query parameter is called time_precision, and I should post integers instead of floats, which was probably the first thing I attempted, with no results...
However, due to some time zone problems my time values were in the future relative to the server time, and by default every select statement includes an implicit where time < now() clause. So values were in fact written into the database, but not displayed because of that hidden where clause. The solution was to tell the database to return "future" values, too:
select value from some_series where time < now() + 1h
I have a two-part question about storing days of the week and time in a database. I'm using Rails 4.0, Ruby 2.0.0, and Postgres.
I have certain events, and those events have a schedule. For the event "Skydiving", for example, I might have Tuesday and Wednesday and 3 pm.
Is there a way for me to store the record for Tuesday and Wednesday in one row or should I have two records?
What is the best way to store the day and time? Is there a way to store day of week and time (not datetime) or should these be separate columns? If they should be separate, how would I store the day of the week? I was thinking of storing them as integer values, 0 for Sunday, 1 for Monday, since that's how the wday method for the Time class does it.
Any suggestions would be super helpful.
Is there a way for me to store the record for Tuesday and Wednesday in one row or should I have two records?
There are several ways to store multiple time ranges in a single row. @bma already provided a couple of them. That might be useful to save disk space with very simple time patterns. The clean, flexible and "normalized" approach is to store one row per time range.
What is the best way to store the day and time?
Use a timestamp (or timestamptz if multiple time zones may be involved). Pick an arbitrary "staging" week and just ignore the date part while using the day and time aspect of the timestamp. Simplest and fastest in my experience, and all date and time related sanity-checks are built-in automatically. I use a range starting with 1996-01-01 00:00 for several similar applications for two reasons:
The first 7 days coincide with the day of the month and the ISO day of the week (Mon = 1 ... Sun = 7), since 1996-01-01 was a Monday.
It's also the most recent leap year that starts on a Monday, providing a Feb. 29 for yearly patterns.
Range type
Since you are actually dealing with time ranges (not just "day and time"), I suggest using the built-in range type tsrange (or tstzrange). A major advantage: you can use the arsenal of built-in Range Functions and Operators. Requires Postgres 9.2 or later.
For instance, you can have an exclusion constraint built on that (implemented internally by way of a fully functional GiST index that may provide additional benefits) to rule out overlapping time ranges. Consider this related answer for details:
Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL
For this particular exclusion constraint (no overlapping ranges per event), you need to include the integer column event_id in the constraint, so you need to install the additional module btree_gist. Install once per database with:
CREATE EXTENSION btree_gist; -- once per db
Or you can have one simple CHECK constraint to restrict the allowed time period using the "range is contained by" operator <@.
Could look like this:
CREATE TABLE event (event_id serial PRIMARY KEY, ...);

CREATE TABLE schedule (
  event_id integer NOT NULL REFERENCES event(event_id)
                            ON DELETE CASCADE ON UPDATE CASCADE
, t_range  tsrange
, PRIMARY KEY (event_id, t_range)
, CHECK (t_range <@ '[1996-01-01 00:00, 1996-01-09 00:00)'::tsrange)  -- restrict period
, EXCLUDE USING gist (event_id WITH =, t_range WITH &&)               -- disallow overlap
);
For a weekly schedule use the first seven days, Mon-Sun, or whatever suits you. Monthly or yearly schedules in a similar fashion.
How to extract day of week, time, etc?
@CDub provided a module to deal with it on the Ruby end. I can't comment on that, but you can do everything in Postgres as well, with impeccable performance.
SELECT ts::time AS t_time -- get the time (practically no cost)
SELECT EXTRACT(DOW FROM ts) AS dow -- get day of week (very cheap)
Or in similar fashion for range types:
SELECT EXTRACT(DOW FROM lower(t_range)) AS dow_from -- day of week lower bound
, EXTRACT(DOW FROM upper(t_range)) AS dow_to -- same for upper
, lower(t_range)::time AS time_from -- start time
, upper(t_range)::time AS time_to -- end time
FROM schedule;
db<>fiddle here
Old sqlfiddle
ISODOW instead of DOW for EXTRACT() returns 7 instead of 0 for Sundays. There is a long list of what you can extract.
This related answer demonstrates how to use range type operator to compute a total duration for time ranges (last chapter):
Calculate working hours between 2 dates in PostgreSQL
Check out the ice_cube gem (link).
It can create a schedule object for you, which you can persist to your database. You need not create two separate records. For the second part, you can create a schedule based on any rule, and you need not worry about how it will be saved in the database. You can use the methods provided by the gem to get whatever information you want from the persisted schedule object.
Depending on how complex your scheduling needs are, you might want to have a look at RFC 5545, the iCalendar scheduling data format, for ideas on how to store the data.
If your needs are pretty simple, then that is probably overkill. PostgreSQL has many functions to convert date and time to whatever format you need.
For a simple way to store relative dates and times, you could store the day of week as an integer as you suggested, and the time as a TIME datatype. If you can have multiple days of the week that are valid, you might want to use an ARRAY.
Eg.
ARRAY[2,3]::INTEGER[] = Tues, Wed as Day of Week
'15:00:00'::TIME = 3pm
[EDIT: Add some simple examples]
/* Custom range type for time-of-day values (subtype time) */
CREATE TYPE timerange AS RANGE (subtype = time);
--drop table if exists schedule;
create table schedule (
event_id integer not null, /* should be an FK to "events" table */
day_of_week integer[],
time_of_day time,
time_range timerange,
recurring text CHECK (recurring IN ('DAILY','WEEKLY','MONTHLY','YEARLY'))
);
insert into schedule (event_id, day_of_week, time_of_day, time_range, recurring)
values
(1, ARRAY[1,2,3,4,5]::INTEGER[], '15:00:00'::TIME, NULL, 'WEEKLY'),
(2, ARRAY[6,0]::INTEGER[], NULL, '(08:00:00,17:00:00]'::timerange, 'WEEKLY');
select * from schedule;
event_id | day_of_week | time_of_day | time_range | recurring
----------+-------------+-------------+---------------------+-----------
1 | {1,2,3,4,5} | 15:00:00 | | WEEKLY
2 | {6,0} | | (08:00:00,17:00:00] | WEEKLY
The first entry could be read as: the event is valid at 3pm Mon - Fri, with this schedule occurring every week.
The second entry could be read as: the event is valid Saturday and Sunday between 8am and 5pm, occurring every week.
The custom range type "timerange" is used to denote the lower and upper boundaries of your time range.
The '(' means "exclusive" and the trailing ']' means "inclusive", or in other words "later than 8am and up to and including 5pm".
Why not just store the datestamp and then use the built-in functionality of Date to get the day of the week?
2.0.0p247 :139 > Date.today
=> Sun, 10 Nov 2013
2.0.0p247 :140 > Date.today.strftime("%A")
=> "Sunday"
strftime sounds like it can do everything for you. Here are the specific docs for it.
Specifically for what you're talking about, it sounds like you'd need an Event table that has_many :schedules, where a Schedule would have a start_date timestamp...