BigQuery GCP Regression Model issues - machine-learning

I am working on a lab to build a regression model in BigQuery on GCP. I followed the steps almost exactly, but the last query I run fails. Here's what I did.
In BigQuery, I clicked "Create dataset" to create a dataset with the default options.
I created a table and uploaded a CSV file of stock-price tick data that I had downloaded. The table included the columns Time (this is just the tick number), Price, and Volume. I kept all the default options, except that I checked the "Auto detect" schema option.
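For reference, a hedged sketch of doing the same load in SQL instead of the console upload; BigQuery's LOAD DATA statement reads from Cloud Storage, so the bucket path below is a hypothetical stand-in for the uploaded file, and the schema is auto-detected because none is specified:
LOAD DATA INTO `dataset.table`
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/ticks.csv'],  -- hypothetical bucket path for the same CSV
  skip_leading_rows = 1                 -- skip the header row
);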
I ran the following query and saved the results to a table, `dataset.model_data` (the save step is sketched just after the query):
WITH
  raw AS (
    SELECT
      Time,
      Price,
      LAG(Price, 1) OVER (ORDER BY Time) AS min_1_Price,
      LAG(Price, 2) OVER (ORDER BY Time) AS min_2_Price,
      LAG(Price, 3) OVER (ORDER BY Time) AS min_3_Price,
      LAG(Price, 4) OVER (ORDER BY Time) AS min_4_Price
    FROM
      `dataset.table`
    ORDER BY
      Time DESC ),
  raw_plus_trend AS (
    SELECT
      Time,
      Price,
      min_1_Price,
      IF(min_1_Price - min_2_Price > 0, 1, -1) AS min_1_trend,
      IF(min_2_Price - min_3_Price > 0, 1, -1) AS min_2_trend,
      IF(min_3_Price - min_4_Price > 0, 1, -1) AS min_3_trend
    FROM
      raw ),
  ml_data AS (
    SELECT
      Time,
      Price,
      min_1_Price AS day_prev_Price,
      IF(min_1_trend + min_2_trend + min_3_trend > 0, 1, -1) AS trend_3_day
    FROM
      raw_plus_trend )
SELECT
  *
FROM
  ml_data
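Incidentally, if you save the results with SQL rather than the console's "Save results" option, it is just a matter of prefixing the query above with a CREATE OR REPLACE TABLE statement. A shortened sketch, assuming the `dataset.model_data` name used by the later queries (only the first lag is repeated here for brevity):
CREATE OR REPLACE TABLE `dataset.model_data` AS
WITH raw AS (
  SELECT
    Time,
    Price,
    LAG(Price, 1) OVER (ORDER BY Time) AS min_1_Price  -- remaining lag and trend columns as in the full query above
  FROM
    `dataset.table` )
SELECT
  *
FROM
  raw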
I created a regression model by running this query:
CREATE OR REPLACE MODEL `dataset.model`
OPTIONS
  ( model_type='linear_reg',
    input_label_cols=['Price'],
    data_split_method='seq',
    data_split_eval_fraction=0.3,
    data_split_col='Time' ) AS
SELECT
  Time,
  Price,
  day_prev_Price,
  trend_3_day
FROM
  `dataset.model_data`
This is where it fails. I ran this query to make predictions, but it didn't produce anything. I used 130000 as my Time value, since the ticks in my table only go up to 124256.
SELECT
  *
FROM
  ml.PREDICT(MODEL `dataset.model`,
    (
      SELECT
        *
      FROM
        `dataset.model_data`
      WHERE
        Time >= 130000 ) )
What is going wrong here? I also tried the final query with Time >= 124256, and it only produced one row of results, which turned out to be the same price as in the original table.

Related

Rails Postgres, select users with relation and group them based on users' starting time

I have Users and Checkpoint tables; each User can make multiple Checkpoints per day.
I want to aggregate how many Checkpoints have been taken each day in the past 6 months, relative to each user's starting point, to create a graph showing average user Checkpoints within their first x months.
For example: if user1 started on January 1st, user2 started on March 15th, and user3 started on July 6th, each of those dates would be considered that user's day 1, and I would want the data from each of their day 1s even though they occur at different periods of time.
Here is the current query I came up with, but unfortunately it returns data based on a fixed time range for all of the users.
SELECT dates.date AS date,
checkpoints_count,
checkpoints_users
FROM (SELECT DATE_TRUNC('DAY', ('2000-01-01'::DATE - offs))::DATE AS date
-- 180 = 6 month in days
FROM GENERATE_SERIES(0, 180) AS offs) AS dates
LEFT OUTER JOIN (
SELECT
checkpoints_date::DATE AS date,
COUNT(id) AS checkpoints_count,
COUNT(user_id) AS checkpoints_users
FROM checkpoints
WHERE user_id in (1, 2, 3)
AND checkpoints_date::DATE
BETWEEN '2000-01-01'::DATE AND '2000-06-01'::DATE
GROUP BY checkpoints_date::DATE
) AS ck
ON dates.date = ck.date
ORDER BY dates.date;
EDIT
Here is a working SQL example (if I understand what you are looking for; your SQL seems really complicated for what you are asking, but I'm not a SQL expert)...
SELECT t1.*
FROM checkpoints AS t1
WHERE t1.user_id IN (1, 2, 3)
AND t1.id = (SELECT t2.id
FROM checkpoints AS t2
WHERE t2.user_id = t1.user_id
ORDER BY t2.check_date ASC
LIMIT 1)
SQL FIDDLE here
Since this is tagged Ruby on Rails, I'll give you a Rails answer.
If you know your user IDs (or use some other query to get them), you have:
user_ids = [1, 2, 3]
first_checkpoints = []
user_ids.each do |id|
first_checkpoints << Checkpoint.where(user_id: id).order(:date).first
end
#returns an array of the first checkpoint of each user in list
This assumes a column in the checkpoints table called date. You didn't give a table structure for the two tables, so this answer is a bit general. There might be a pure ActiveRecord answer, but this will get you what you want.
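For the per-user day-offset aggregation the question actually asks about, here is a hedged Postgres sketch, assuming the checkpoints table has the user_id and checkpoints_date columns used in the question's query (untested against the real schema):
SELECT (c.checkpoints_date::DATE - f.first_date) + 1 AS user_day,  -- day 1 = each user's first checkpoint
       COUNT(c.id)               AS checkpoints_count,
       COUNT(DISTINCT c.user_id) AS checkpoints_users
FROM checkpoints AS c
JOIN (SELECT user_id, MIN(checkpoints_date::DATE) AS first_date
      FROM checkpoints
      GROUP BY user_id) AS f
  ON f.user_id = c.user_id
WHERE c.user_id IN (1, 2, 3)
  AND c.checkpoints_date::DATE < f.first_date + 180  -- first ~6 months per user
GROUP BY user_day
ORDER BY user_day;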

Combining two tables (Join) in Data Studio

I'm attempting to join two tables in Data Studio. My data sources are Google Ads and Microsoft ads. I'd like to end up with a table that looks like the following example:
Campaign     Clicks
Campaign 1   500
Campaign 2   700
The clicks from each table are added together to give a total.
When I attempt to join both tables, I get a result that looks like this (full example here):
Campaign     Clicks (Table 1)   Clicks (Table 2)
Campaign 1   100                400
Campaign 2   200                500
The data appears to be joined by 'campaign', but the 'clicks' are not being consolidated into one column; instead, the clicks data from the two tables remain separate.
I've already attempted to solve this issue by:
Creating a calculated field in the newly blended data (Clicks (Table 1) + Clicks (Table 2)), but this yields strange results when trying to aggregate other metrics.
Joining on 'Clicks'; however, this doesn't work, as the number of clicks for each campaign is almost always different in each data source.
Changing the join type from 'Left outer' to right outer, inner, full outer, and cross, but none of these appear to work either.
Grouping campaigns by a 'Campaign Group' calculated field using a CASE statement, but this doesn't appear to work either; it generally results in only one set of data showing at a time (possibly whichever loads quickest).
Here's how my blend is set up. You can attempt to reproduce this issue using this page.
What is the best way to join both tables and have the metrics (like clicks) properly aggregated?
The values in the two separate fields, Clicks (Table 1) and Clicks (Table 2), can be consolidated using the calculated field:
Clicks (Table 1) + Clicks (Table 2)
This will work as long as there are no NULL values in either (or both) tables in the blend for any given row of data.
This is because 1 + NULL = NULL (where 1 stands in for any number): NULL is not a numeric literal (it's not a number), so it cannot be used in a calculation.
Since this blend has NULL values, one approach is to use the IFNULL function ("returns a result if the input is null, otherwise, returns the input"), which replaces NULL values with a numeric literal (in this case, 0) so that the values can be calculated:
IFNULL(Clicks (Table 1), 0) + IFNULL(Clicks (Table 2), 0)
This ensures that the calculations change as follows:
1 + NULL = NULL is replaced by 1 + 0 = 1
NULL + NULL = NULL is replaced by 0 + 0 = 0
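As a point of comparison, the same NULL-handling idea in plain SQL uses COALESCE; a minimal sketch, where the table and column names (blended_campaigns, clicks_table_1, clicks_table_2) are hypothetical stand-ins for the blend:
SELECT campaign,
       COALESCE(clicks_table_1, 0) + COALESCE(clicks_table_2, 0) AS clicks  -- NULL treated as 0, like IFNULL above
FROM blended_campaigns;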
An editable Google Data Studio report (embedded Google Sheets data source) and a GIF were linked to elaborate.

Stream analytics getting average for 1 year from history

I have a Stream Analytics job with
INPUTS:
1) "InputStreamCSV" - linked to an Event Hub and receives data.
2) "InputStreamHistory" - input stream linked to Blob Storage.
OUTPUTS:
1) "AlertOUT" - linked to Table Storage; inserts an alarm event as a row in the table.
I want to calculate the AVERAGE amount over all transactions for the year 2018 (a single number, e.g. 5.2) and compare it with each transaction coming in during 2019:
If the new transaction amount is bigger than the average, put that transaction in the "AlertOUT" output.
I am calculating the average as:
SELECT AVG(Amount) AS TresholdAmount
FROM InputStreamHistory
group by TumblingWindow(minute, 1)
Receiving new transactions as:
SELECT * INTO AlertOUT FROM InputStreamCSV TIMESTAMP BY EventTime
How can I combine these 2 queries to check whether a new transaction amount is bigger than the average transaction amount for last year?
You can use the JOIN operator in ASA SQL; you could refer to the SQL below to combine the 2 queries.
WITH
t2 AS
(
    SELECT AVG(Amount) AS TresholdAmount
    FROM jsoninput2
    GROUP BY TumblingWindow(minute, 1)
)
SELECT t2.TresholdAmount
FROM jsoninput t1 TIMESTAMP BY EntryTime
JOIN t2
  ON DATEDIFF(minute, t1, t2) BETWEEN 0 AND 5
WHERE t1.Amount > t2.TresholdAmount
If the history data is stable, you could also join the history data as reference data. Please refer to the official sample.
If you are comparing last year's average with the current stream, it would be better to use reference data. Compute the averages for 2018, using either ASA itself or a different query engine, and write them to a storage blob. After that you can use the blob as reference data in the ASA query; it will replace the average computation in your example.
You can then do a reference-data join with InputStreamCSV to produce alerts (see the sketch below).
Even if you would like to update the averages once in a while, the above pattern would work. Based on the refresh frequency, you can use either another ASA job or a batch analytics solution.
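A minimal sketch of that reference-data join, assuming the precomputed 2018 average sits in a blob-backed reference input (called HistoricalAverages here, a hypothetical name) and that both inputs carry a constant JoinKey column, since ASA reference-data joins need an equality predicate:
SELECT t.Amount, t.EventTime, r.TresholdAmount
INTO AlertOUT
FROM InputStreamCSV t TIMESTAMP BY EventTime
JOIN HistoricalAverages r            -- hypothetical blob-backed reference input
  ON t.JoinKey = r.JoinKey           -- constant key assumed present in both inputs
WHERE t.Amount > r.TresholdAmount    -- alert when the new amount exceeds last year's average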

Query the most recent timestamp (MAX/Last) for a specific key, in Influx

Using InfluxDB (v1.1), I have a requirement where I want to get the last entry timestamp for a specific key, regardless of which measurement it is stored in and regardless of what the value was.
The setup is simple, where I have three measurements: location, network and usage.
There is only one key: device_id.
In pseudo-code, this would be something like:
# notice the lack of a FROM clause on measurement here...
SELECT MAX(time) WHERE 'device_id' = 'x';
The question: What would be the most efficient way of querying this?
The reason why I want this is that there will be a decentralised sync process. Some devices may have been updated in the last hour, whilst others haven't been updated in months. Being able to get a distinct "last updated on" timestamp for a device (key) would allow me to more efficiently store new points to Influx.
I've also noticed there is a similar discussion on InfluxDB's GitHub repo (#5793), but the question there is not filtering by any field/key. And this is exactly what I want: getting the 'last' entry for a specific key.
Unfortunately, there won't be a single query that gets you what you're looking for. You'll have to do a bit of work client-side.
The query that you'll want is
SELECT last(<field name>), time FROM <measurement> WHERE device_id = 'x'
You'll need to run this query for each measurement.
SELECT last(<field name>), time FROM location WHERE device_id = 'x'
SELECT last(<field name>), time FROM network WHERE device_id = 'x'
SELECT last(<field name>), time FROM usage WHERE device_id = 'x'
From there you take the one with the greatest timestamp:
> select last(value), time from location where device_id = 'x'; select last(value), time from network where device_id = 'x'; select last(value), time from usage where device_id = 'x';
name: location
time last
---- ----
1483640697584904775 3
name: network
time last
---- ----
1483640714335794796 4
name: usage
time last
---- ----
1483640783941353064 4
tl;dr:
The first() and last() selectors will NOT work consistently if the measurement has multiple fields and the fields have NULL values. The most efficient solution is to use these queries:
First:
SELECT * FROM <measurement> [WHERE <tag>=value] LIMIT 1
Last:
SELECT * FROM <measurement> [WHERE <tag>=value] ORDER BY time DESC LIMIT 1
Explanation:
If you have a single field in your measurement, then the suggested solutions will work, but if you have more than one field and values can be NULL then first() and last() selectors won't work consistently and may return different timestamps for each field. For example, let's say that you have the following data set:
time                   fieldKey_1   fieldKey_2   device
--------------------------------------------------------
2019-09-16T00:00:01Z   NULL         A            1
2019-09-16T00:00:02Z   X            B            1
2019-09-16T00:00:03Z   Y            C            2
2019-09-16T00:00:04Z   Z            NULL         2
In this case querying
SELECT first(fieldKey_1) FROM <measurement> WHERE device = "1"
will return
time fieldKey_1
---------------------------------
2019-09-16T00:00:02Z X
and the same query for first(fieldKey_2) will return a different time
time fieldKey_2
---------------------------------
2019-09-16T00:00:01Z A
A similar problem will happen when querying with last.
And in case you are wondering, querying first(*) won't do either, since you'll get an 'epoch-0' time in the results, such as:
time                   first_fieldKey_1   first_fieldKey_2
-----------------------------------------------------------
1970-01-01T00:00:00Z   X                  A
So, the solution would be querying using combinations of LIMIT and ORDER BY.
For instance, for the first time value you can use:
SELECT * FROM <measurement> [WHERE <tag>=value] LIMIT 1
and for the last one you can use
SELECT * FROM <measurement> [WHERE <tag>=value] ORDER BY time DESC LIMIT 1
It is safe and fast, as it relies on indexes.
It is curious that this simpler approach was mentioned in the thread linked in the opening post but was discarded. Maybe it was just overlooked.
There's also a thread on the InfluxData blog about this subject that suggests the same approach.
I tried this and it worked for me as a single command:
SELECT last(<field name>), time FROM location, network, usage WHERE device_id = 'x'
The result I got:
name: location
time last
---- ----
1483640697584904775 3
name: network
time last
---- ----
1483640714335794796 4
name: usage
time last
---- ----
1483640783941353064 4

PSQL group by vs. aggregate speed

So, the general question is: what's faster, taking an aggregate of a field or having extra expressions in the GROUP BY clause? Here are the two queries.
Query 1 (extra expressions in GROUP BY):
SELECT sum(subquery.what_i_want)
FROM (
    SELECT table_1.some_id,
           (
             CASE WHEN some_date_field IS NOT NULL
               THEN FLOOR(((some_date_field - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
               ELSE some_integer * MAX(some_other_integer)
             END
           ) what_i_want
    FROM table_1
    JOIN table_2 ON table_1.some_id = table_2.id
    WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
    GROUP BY some_id_1, some_date_field, some_integer
) subquery
Query 2 (using an aggregate function; the choice is arbitrary, because in this dataset every record has the same value for the table_2 fields in question):
SELECT sum(subquery.what_i_want)
FROM (
    SELECT table_1.some_id,
           (
             CASE WHEN MAX(some_date_field) IS NOT NULL
               THEN FLOOR(((MAX(some_date_field) - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
               ELSE MAX(some_integer) * MAX(some_other_integer)
             END
           ) what_i_want
    FROM table_1
    JOIN table_2 ON table_1.some_id = table_2.id
    WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
    GROUP BY some_id_1
) subquery
As far as I can tell, psql doesn't provide good benchmarking tools. \timing on only times one query at a time, so running a benchmark with enough trials for meaningful results is... tedious at best.
For the record, I did do this with about n=50 and saw the aggregate method (Query 2) run faster on average, but with a p-value of ~0.13, so not quite conclusive.
'sup with that?
The general answer: they should be roughly the same. There's a chance of hitting or missing a function-based index when you do or don't wrap a field in a function, but that matters for expressions in the WHERE clause more than in the column list, and not for aggregate functions. But this is speculation only.
What you should use for analyzing execution is EXPLAIN ANALYZE. In the plan you not only see the scan types, but also the number of iterations, the cost, and the timing of individual operations. And of course you can use it from psql; a toy example is sketched below.
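A self-contained toy example of that workflow (the query below is a stand-in, not the poster's schema); run it in psql, then wrap each of the two real queries the same way and compare the plans:
EXPLAIN (ANALYZE, BUFFERS)
SELECT g.x % 10 AS bucket, sum(g.x) AS total  -- toy aggregate over generated rows
FROM generate_series(1, 100000) AS g(x)
GROUP BY g.x % 10;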
