Problem: I want to aggregate multiple conditions over a discontinuous column.
Link : Google Sheet link
Context: I'm a mushroom grower and I want to schedule my production program for 2020.
We call a batch an amount of inoculated mushroom substrate put into production at the same time. Usually it weighs a ton, and for the whole of 2020 we'll have about 40 batches put into production.
A batch has several characteristics:
An ID. 20-B1 is the first batch for 2020, 20-B2 the second, and so on.
A scheduled week of production start. Usually, production for one batch lasts 14 weeks. The weeks of production are numbered from 1 to 14.
A calendar week, where the array of 14 production weeks is placed.
An origin. We work with two companies (Eurosubstrat and CNC) and also produce our own substrate. This means three possible values: "EURO", "CNC" and "LCU".
A species cultivated. We have two species: Shiitake and Pleurotus ostreatus.
A weight; about 1 ton, but a few batches can be much lighter.
A production capability vs. time: for example, in the third week of exploitation, a batch will produce 3.4% of its mass in mushrooms.
I want to extract the following results and put them in a column:
Target 1: In order to establish an order schedule, the number of batches that match a set of criteria for one given calendar week; for example, "Production week = 1" AND "Origin = EURO" AND "Species = Shiitake".
Target 2: In order to estimate the total production capability for one given species, the sum of the individual production capabilities for one given calendar week; for example, "Species = Shiitake" AND the related production capability.
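If it helps, the logic I'm after looks like this in Python-style pseudo-code (the field names are invented for illustration, not the actual ones in my sheet):

# One row per batch per production week; field names are made up.
batches = [
    {"id": "20-B1", "calendar_week": 12, "production_week": 1,
     "origin": "EURO", "species": "Shiitake", "weight_t": 1.0,
     "capability_pct": 3.4},
    # ... more rows
]

# Target 1: count the batches matching all criteria for one calendar week.
target1 = sum(
    1 for b in batches
    if b["calendar_week"] == 12 and b["production_week"] == 1
    and b["origin"] == "EURO" and b["species"] == "Shiitake"
)

# Target 2: total production for one species in one calendar week,
# summing weight * individual production capability.
target2 = sum(
    b["weight_t"] * b["capability_pct"] / 100
    for b in batches
    if b["species"] == "Shiitake" and b["calendar_week"] == 12
)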
I hope I'm clear.
If you can imagine a better way to organize my data, feel free to suggest it; I don't know if the current organization is the best.
Sorry, English is not my native tongue.
I have a simple timeseries table:
{
    "n": "EXAMPLE",   # Name, Hash Key
    "t": 1640893628,  # Unix Timestamp, Range Key
    "v": 10           # Value being stored
}
Every 15 minutes I will poll data and insert into the table. If I want to query values between a 24-hour period, this works well - this would equate to a total of 96 records.
Now, say I want to query a larger timespan of 1 or 2 years. That is tens of thousands of records, and (in my opinion) impractical to fetch regularly. Retrieving larger time ranges would require multiple queries, which would negatively impact response times as well as being much more costly.
I have thought of a couple of potential solutions to this problem:
1. Replicate data in another table, with larger increments. A table with a single record every 6 hours, for example.
2. Have another table to store common query results, such as records for "EXAMPLE" for the past week, month, and year (respectively). I would periodically update records in the new table to hold every Nth record from the main table (a total of 100). Something like:
{
    "n": "EXAMPLE#WEEKLY",
    "v": [
        {
            "t": 1640893628,
            "v": 10
        },
        {
            "t": 1640993628,
            "v": 15
        },
        ... 98 more.
    ]
}
I believe #2 is a solid approach. It seems to me like this would be a common enough problem, so I would love to hear about how other people have approached it.
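For clarity, here is roughly what I imagine the periodic update for #2 looking like in Python/boto3 (the table name is a placeholder; "n"/"t"/"v" follow the item shape above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("timeseries")  # placeholder name

def refresh_weekly_summary(name, start_ts, end_ts, target_points=100):
    # Fetch the raw points for the window (a single page is assumed here;
    # a real job would follow LastEvaluatedKey to paginate).
    resp = table.query(
        KeyConditionExpression=Key("n").eq(name) & Key("t").between(start_ts, end_ts)
    )
    items = resp["Items"]
    # Keep every Nth record so the summary stays around target_points long.
    step = max(1, len(items) // target_points)
    sampled = [{"t": item["t"], "v": item["v"]} for item in items[::step]]
    # Overwrite the single summary item. Note it still needs a sort key
    # value of its own, since "t" is the table's range key.
    table.put_item(Item={"n": f"{name}#WEEKLY", "t": 0, "v": sampled})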
More options present themselves if you can convert your unix timestamps into ISO 8601-type strings like 2021-12-31T09:27:58+00:00.
If so, DynamoDB's begins_with key condition expression lets us query for discrete calendar time buckets. December 2021, for example, is queryable using n = id1 AND begins_with(t, "2021-12"). Same for days and hours. We can take this one step further by adding other periods in indexes.
Some rolling windows are possible, too: n = id1 AND t > [24 hours ago] gives us the last 24h.
n (PK)  t (SK)            hour_bucket (LSI1 SK)  week (LSI2 SK)
id1     2021-12-31T10:45  2021-12-31T09-12       2021-52
id1     2021-12-31T13:00  2021-12-31T13-15       2021-52
id1     2022-06-01T22:00  2022-06-01T22-24       2022-22
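Sketched with boto3, the two query shapes above might look like this (assuming the layout shown; the table name is illustrative):

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("timeseries")  # illustrative name

# Discrete calendar bucket: everything in December 2021.
december = table.query(
    KeyConditionExpression=Key("n").eq("id1") & Key("t").begins_with("2021-12")
)

# Rolling window: everything in the last 24 hours.
since = (datetime.now(timezone.utc) - timedelta(hours=24)).isoformat(timespec="seconds")
last_24h = table.query(
    KeyConditionExpression=Key("n").eq("id1") & Key("t").gt(since)
)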
If you are looking for arbitrary time-series queries, you might consider Athena, as the other answer suggested, or AWS's serverless Timestream, which is a "purpose-built time series database that makes it easy to store and analyze trillions of time series data points per day."
You could export the table to Amazon S3 and run Amazon Athena on the exported data. Here’s a blog post describing the process: https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
I have to pay a commission to Agents (affiliates) based on the following conditions:
the commission starts on a monthly basis following a USER's (the USER being linked to the Agent) first deposit/purchase on a website
agents have a decreasing commission; e.g., the 1st month following their USER's first deposit = 30% of sales, the 2nd month following the USER's first deposit = 25% of sales, etc.
Commissions are calculated and paid on a monthly basis (e.g., from 01/07/2020 till 31/07/2020)
If a USER makes a first purchase on June 22nd and the sales commission for the 1st period is 30%, then the agent is eligible for a 30% commission on sales from June 22nd till July 22nd, then 25% on sales from July 23rd till August 23rd, etc.
I have designed a Google Sheet (see below) that serves the purpose (using 12 columns to get the correct commission % for a specific user on a specific day!), but I am trying to find a more straightforward formula to get the applicable commission % based on the commission sliding table and the first deposit date of a specific user.
Can anyone help?
The google sheets showing my calc is here:
https://docs.google.com/spreadsheets/d/1I1gzZ670hJH8HwCGizzbvlQkg0dgAvgOSQTfOUL0VgU/edit?usp=sharing
This might help you. (Updated to correct the row number of where the formula should go.)
If you insert a new column in your sheet, to the right of Column W (the Current Commision Month), and paste the following formula into the cell where the Current Commision Month header text should appear (Row 9 in your sample), it replicates the results in your Current Commision Month column.
But it does not require columns I through V. You can test that by deleting columns I through V - you can use Undo and Redo to go back and forth, if necessary. Depending on how you use your "End" column - the logic wasn't clear to me - the info for that can also be gained in this one column.
={"Current
Commision
Month";"";ArrayFormula(
if(
($H11:H<>"") * ($A11:A>=date(year(H11:H),month(H11:H),day(H11:H))),
ifs( $A11:A< date(year(H11:H),month(H11:H)+1,day(H11:H)),1,
$A11:A< date(year(H11:H),month(H11:H)+2,day(H11:H)),2,
$A11:A< date(year(H11:H),month(H11:H)+3,day(H11:H)),3,
$A11:A< date(year(H11:H),month(H11:H)+4,day(H11:H)),4,
$A11:A< date(year(H11:H),month(H11:H)+5,day(H11:H)),5,
$A11:A< date(year(H11:H),month(H11:H)+6,day(H11:H)),6,
$A11:A>=date(year(H11:H),month(H11:H)+7,day(H11:H)),9999),
""))}
The first IF test is to check that the FirstDeposit date is not blank, and that the sale date is greater than or equal to the FirstDeposit date.
The IFS tests go through and check whether the sale date is less than each month boundary, stopping at the first value (commission month) whose boundary is greater than the sale date. If none is, it places a value of 9999.
Note that the "9999" values are just to indicate the sale date is greater than the "End" date, and can be changed to blanks or whatever you want.
[![Final result in the sample tab][1]][1]
I've added a sample tab with the final result. Let me know if this helps. There may be several other enhancements possible for your sheet, in particular the use of ARRAYFORMULA to fill values down many of your columns.
I haven't spent time on the actual commission calculations in the final columns, but if you feel that still needs improvement, I can try to simplify there as well.
[1]: https://i.stack.imgur.com/TfFZ5.png
I have a list of prisoners and when their prison term started (PrisonStart) and when it ended (PrisonEnd). If they're still in prison, PrisonEnd is blank.
I would like to flag prisoners who were in prison at least one full calendar month during a 6-month period (1/1/16 to 5/30/16).
compute PeriodBeg = date.mdy(01,01,16).
compute PeriodEnd = date.mdy(05,30,16).
formats PeriodBeg PeriodEnd (adate10).
execute.
Any suggestions for how best to go about this? Seems I might need to compare prisoners' start and end dates separately for each month during the 6-month period (like below), and then select any prisoner with at least one full month, but I'm wondering if there's a more efficient way.
if ((PrisonStart le [January 1, 2016]) and (PrisonEnd ge ([January 31, 2016]) | missing(PrisonEnd))) InPrisonJan = 1.
if ((PrisonStart le [February 1, 2016]) and (PrisonEnd ge ([February 28, 2016]) | missing(PrisonEnd))) InPrisonFeb = 1.
etc.
execute.
Some sample data below. The first two prisoners should be flagged as having been in prison for at least one full calendar month during the 6-month period (OneMonth = 1). The last three prisoners were not in prison for at least one full calendar month during the 6-month period (OneMonth = 0).
data list list /PrisonerID (F8.0) PrisonStart (adate10) PrisonEnd (adate10) PeriodBeg (adate10) PeriodEnd (adate10).
begin data
1 10.3.14 7.12.16 1.1.16 5.30.16
2 2.9.16 4.1.16 1.1.16 5.30.16
3 5.2.16 10.11.16 1.1.16 5.30.16
4 12.1.13 2.8.14 1.1.16 5.30.16
5 1.7.16 1.20.16 1.1.16 5.30.16
6 1.1.17 3.2.17 1.1.16 5.30.16
end data.
The following syntax avoids mentioning the last day of each month separately, so it could be used to automate across any number of months. The trick to check whether a date is the last day of the month is to check day "0" of the NEXT month:
do repeat inpr=inPrison1 to inPrison5/ mon=1 to 5.
compute inpr=( PrisonStart<=date.dmy(1,mon,2016) and
(PrisonEnd>=date.dmy(0,mon+1,2016) or missing (PrisonEnd)) ).
end repeat.
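For comparison, the same trick sketched in Python (Python's date has no day "0", so the equivalent is subtracting one day from the 1st of the next month; the helper names are illustrative):

from datetime import date, timedelta

def last_day_of_month(year, month):
    # Day "0" of the next month == the last day of this month.
    if month == 12:
        return date(year, 12, 31)
    return date(year, month + 1, 1) - timedelta(days=1)

def in_prison_full_month(prison_start, prison_end, year, month):
    # prison_end is None while the prisoner is still in prison
    # (the missing(PrisonEnd) case).
    return (prison_start <= date(year, month, 1)
            and (prison_end is None
                 or prison_end >= last_day_of_month(year, month)))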
What is the best way to query InfluxDB for specific hours every day? For example, I have a series with check-in/check-out activities, and I need to see them between 2 PM and 3 PM every day for the last month. I am aware that there's no direct way to do this in the query language (current version 1.2). Is there a workaround?
I have been searching for the same thing and found your question. As you say, the syntax does not seem to allow it.
My closest attempt was trying to use a regular expression in the time WHERE clause, which is not currently supported by InfluxDB.
So that should probably be the answer, and I would not post an answer just to say that.
However, working on a different problem, I have found a way that may or may not help you in your specific case. It is a workaround that is not very nice, but it seems to work in cases where you can formulate an aggregation/selection of what you want to see in that given hour, so that you end up with one value per hour (for example, the mean/max/count of check-ins/check-outs in that hour for a given person, which may be what you are looking for, or which you may use to identify the days you would like to query individually to see what happened there).
For example, I want to obtain the electricity consumption measurement daily from 00:00 to 06:00 a.m. I make a first subquery that groups the measurements into 6-hour buckets, starting at 00:00 of a given date. Then, in the main query, I group by 24 hours and select the first value, like this:
SELECT first("mean") FROM (SELECT mean("value") FROM "Energy" WHERE "devicename" = 'Electricity' AND "deviceid" = '0_5' AND time > '2017-01-01' GROUP BY time(6h) ) WHERE time > '2017-01-01' GROUP BY time(24h)
If you want 2-4 pm, i.e. 14:00-16:00, you need to first group by 2 hours in the subquery, then offset the set by 14h so that it starts at 14:00.
SELECT first("mean") FROM ( SELECT mean("value") FROM "Energy" WHERE "devicename" = 'Electricity' AND "deviceid" = '0_5' AND time > '2017-01-01T14:00:00Z' GROUP BY time(2h) ) WHERE time > '2017-01-01T14:00:00Z' GROUP BY time(24h,14h)
Just to check it: in my InfluxDB 1.2 this is the final result:
Energy
time first
2017-01-01T14:00:00Z 86.41747572815534
2017-01-02T14:00:00Z 43.49722222222222
2017-01-03T14:00:00Z 81.05416666666666
The subquery returns:
Energy
time mean
2017-01-01T14:00:00Z 86.41747572815534
2017-01-01T16:00:00Z 91.46879334257974
2017-01-01T18:00:00Z 89.14027777777778
2017-01-01T20:00:00Z 94.47434119278779
2017-01-01T22:00:00Z 89.94305555555556
2017-01-02T00:00:00Z 86.29542302357837
2017-01-02T02:00:00Z 92.2625
2017-01-02T04:00:00Z 89.93619972260748
2017-01-02T06:00:00Z 87.78888888888889
2017-01-02T08:00:00Z 50.790277777777774
2017-01-02T10:00:00Z 0.6597222222222222
2017-01-02T12:00:00Z 0.10957004160887657
2017-01-02T14:00:00Z 43.49722222222222
2017-01-02T16:00:00Z 86.0610263522885
2017-01-02T18:00:00Z 86.59778085991678
2017-01-02T20:00:00Z 91.56527777777778
2017-01-02T22:00:00Z 90.52565880721221
2017-01-03T00:00:00Z 86.79166666666667
2017-01-03T02:00:00Z 87.15533980582525
2017-01-03T04:00:00Z 89.47988904299584
2017-01-03T06:00:00Z 91.58888888888889
2017-01-03T08:00:00Z 41.67732962447844
2017-01-03T10:00:00Z 16.216366158113733
2017-01-03T12:00:00Z 25.27739251040222
2017-01-03T14:00:00Z 81.05416666666666
If you needed 13:00-15:00 instead, you would offset the subquery in the previous example by 1h.
For 14:00-15:00:
SELECT first("mean") FROM ( SELECT mean("value") FROM "Energy" WHERE "devicename" = 'Electricity' AND "deviceid" = '0_5' AND time > '2017-01-01T14:00:00Z' GROUP BY time(1h) ) WHERE time > '2017-01-01T14:00:00Z' GROUP BY time(24h,14h)
Hope this helps :)
I have a two-part question about storing days of the week and time in a database. I'm using Rails 4.0, Ruby 2.0.0, and Postgres.
I have certain events, and those events have a schedule. For the event "Skydiving", for example, I might have Tuesday and Wednesday and 3 pm.
Is there a way for me to store the record for Tuesday and Wednesday in one row or should I have two records?
What is the best way to store the day and time? Is there a way to store day of week and time (not datetime) or should these be separate columns? If they should be separate, how would I store the day of the week? I was thinking of storing them as integer values, 0 for Sunday, 1 for Monday, since that's how the wday method for the Time class does it.
Any suggestions would be super helpful.
Is there a way for me to store the record for Tuesday and
Wednesday in one row or should I have two records?
There are several ways to store multiple time ranges in a single row. #bma already provided a couple of them. That might be useful to save disk space with very simple time patterns. The clean, flexible and "normalized" approach is to store one row per time range.
What is the best way to store the day and time?
Use a timestamp (or timestamptz if multiple time zones may be involved). Pick an arbitrary "staging" week and just ignore the date part while using the day and time aspect of the timestamp. Simplest and fastest in my experience, and all date and time related sanity-checks are built-in automatically. I use a range starting with 1996-01-01 00:00 for several similar applications for two reasons:
The first 7 days of the week coincide with the day of the month (for sun = 7).
It's the most recent leap year (providing Feb. 29 for yearly patterns) at the same time.
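To illustrate the staging-week mapping described above (a hypothetical Python helper, not part of the Postgres setup itself):

from datetime import datetime, timedelta

STAGING_MONDAY = datetime(1996, 1, 1)  # 1996-01-01 is a Monday

def staging_ts(isodow, hour, minute=0):
    # Map (ISO day of week 1..7, time of day) onto the staging week,
    # so Tuesday 15:00 becomes 1996-01-02 15:00 and Sunday lands on 1996-01-07.
    return STAGING_MONDAY + timedelta(days=isodow - 1, hours=hour, minutes=minute)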
Range type
Since you are actually dealing with time ranges (not just "day and time") I suggest using the built-in range type tsrange (or tstzrange). A major advantage: you can use the arsenal of built-in Range Functions and Operators. This requires Postgres 9.2 or later.
For instance, you can have an exclusion constraint building on that (implemented internally by way of a fully functional GiST index that may provide additional benefit), to rule out overlapping time ranges. Consider this related answer for details:
Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL
For this particular exclusion constraint (no overlapping ranges per event), you need to include the integer column event_id in the constraint, so you need to install the additional module btree_gist. Install once per database with:
CREATE EXTENSION btree_gist; -- once per db
Or you can have one simple CHECK constraint to restrict the allowed time period using the "range is contained by" operator <#.
Could look like this:
CREATE TABLE event (event_id serial PRIMARY KEY, ...);
CREATE TABLE schedule (
event_id integer NOT NULL REFERENCES event(event_id)
ON DELETE CASCADE ON UPDATE CASCADE
, t_range tsrange
, PRIMARY KEY (event_id, t_range)
, CHECK (t_range <# '[1996-01-01 00:00, 1996-01-09 00:00)') -- restrict period
, EXCLUDE USING gist (event_id WITH =, t_range WITH &&) -- disallow overlap
);
For a weekly schedule use the first seven days, Mon-Sun, or whatever suits you. Monthly or yearly schedules in a similar fashion.
How to extract day of week, time, etc?
#CDub provided a module to deal with it on the Ruby end. I can't comment on that, but you can do everything in Postgres as well, with impeccable performance.
SELECT ts::time AS t_time -- get the time (practically no cost)
SELECT EXTRACT(DOW FROM ts) AS dow -- get day of week (very cheap)
Or in similar fashion for range types:
SELECT EXTRACT(DOW FROM lower(t_range)) AS dow_from -- day of week lower bound
, EXTRACT(DOW FROM upper(t_range)) AS dow_to -- same for upper
, lower(t_range)::time AS time_from -- start time
, upper(t_range)::time AS time_to -- end time
FROM schedule;
db<>fiddle here
Old sqlfiddle
ISODOW instead of DOW in EXTRACT() returns 7 instead of 0 for Sundays. There is a long list of what you can extract.
This related answer demonstrates how to use range type operators to compute a total duration for time ranges (last chapter):
Calculate working hours between 2 dates in PostgreSQL
Check out the ice_cube gem (link).
It can create a schedule object for you, which you can persist to your database. You need not create two separate records. For the second part, you can create a schedule based on any rule, and you need not worry about how that will be saved in the database. You can use the methods provided by the gem to get whatever information you want from the persisted schedule object.
Depending on how complex your scheduling needs are, you might want to have a look at RFC 5545, the iCalendar scheduling data format, for ideas on how to store the data.
If your needs are pretty simple, then that is probably overkill. PostgreSQL has many functions to convert date and time to whatever format you need.
For a simple way to store relative dates and times, you could store the day of week as an integer as you suggested, and the time as a TIME datatype. If you can have multiple days of the week that are valid, you might want to use an ARRAY.
Eg.
ARRAY[2,3]::INTEGER[] = Tues, Wed as Day of Week
'15:00:00'::TIME = 3pm
[EDIT: Add some simple examples]
/* Custom range type over time (a timetz version could be created the same way) */
CREATE TYPE timerange AS RANGE (subtype = time);
--drop table if exists schedule;
create table schedule (
    event_id integer not null, /* should be an FK to "events" table */
    day_of_week integer[],
    time_of_day time,
    time_range timerange,
    recurring text CHECK (recurring IN ('DAILY','WEEKLY','MONTHLY','YEARLY'))
);
insert into schedule (event_id, day_of_week, time_of_day, time_range, recurring)
values
(1, ARRAY[1,2,3,4,5]::INTEGER[], '15:00:00'::TIME, NULL, 'WEEKLY'),
(2, ARRAY[6,0]::INTEGER[], NULL, '(08:00:00,17:00:00]'::timerange, 'WEEKLY');
select * from schedule;
event_id | day_of_week | time_of_day | time_range | recurring
----------+-------------+-------------+---------------------+-----------
1 | {1,2,3,4,5} | 15:00:00 | | WEEKLY
2 | {6,0} | | (08:00:00,17:00:00] | WEEKLY
The first entry could be read as: the event is valid at 3pm Mon - Fri, with this schedule occurring every week.
The second entry could be read as: the event is valid Saturday and Sunday between 8am and 5pm, occurring every week.
The custom range type "timerange" is used to denote the lower and upper boundaries of your time range.
The leading '(' means "exclusive", and the trailing ']' means "inclusive", or in other words "later than 8am and up to and including 5pm".
Why not just store the datestamp and then use the built-in functionality of Date to get the day of the week?
2.0.0p247 :139 > Date.today
=> Sun, 10 Nov 2013
2.0.0p247 :140 > Date.today.strftime("%A")
=> "Sunday"
strftime sounds like it can do everything for you. Here are the specific docs for it.
Specifically for what you're talking about, it sounds like you'd need an Event table that has_many :schedules, where a Schedule would have a start_date timestamp...