I have a Stream Analytics job with
INPUTS:
1) "InputStreamCSV" - linked to an Event Hub and receives data.
2) "InputStreamHistory" - input stream linked to Blob Storage.
OUTPUTS:
1) "AlertOUT" - linked to Table Storage; inserts each alarm event as a row in the table.
I want to calculate the AVERAGE amount of all transactions for the year 2018 (a single number, e.g. 5,2) and compare it with each transaction that comes in during 2019:
If the new transaction amount is bigger than the average, put that transaction in the "AlertOUT" output.
I am calculating average as :
SELECT AVG(Amount) AS TresholdAmount
FROM InputStreamHistory
group by TumblingWindow(minute, 1)
I am receiving new transactions as:
SELECT * INTO AlertOUT FROM InputStreamCSV TIMESTAMP BY EventTime
How can I combine these two queries to check whether a new transaction amount is bigger than the average transaction amount for last year?
Please use the JOIN operator in ASA SQL. You could try combining the two queries along the lines of the SQL below.
WITH
t2 AS
(
SELECT AVG(Amount) AS TresholdAmount
FROM jsoninput2
group by TumblingWindow(minute, 1)
)
select t2.TresholdAmount
from jsoninput t1 TIMESTAMP BY EntryTime
JOIN t2
ON DATEDIFF(minute,t1,t2) BETWEEN 0 AND 5
where t1.Amount > t2.TresholdAmount
If the history data is stable, you could also join the history data as reference data. Please refer to the official sample.
If you are comparing last year's average with the current stream, it would be better to use reference data. Compute the averages for 2018, using either ASA itself or a different query engine, and write them to a storage blob. After that you can use the blob as reference data in the ASA query; it will replace the average computation in your example.
You can then do a reference data join with InputStreamCSV to produce alerts.
Even if you would like to update the averages once in a while, the above pattern would work. Depending on the refresh frequency, you can use either another ASA job or a batch analytics solution.
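To make the pattern concrete, here is a minimal Python sketch of the logic only, not of ASA itself: the threshold is computed once from the historical data, and each incoming transaction is then compared against it. The sample records and the function names (compute_threshold, alerts) are made up for illustration; only the Amount field comes from the question.

```python
from statistics import mean


def compute_threshold(history):
    """Average amount over the historical (2018) transactions."""
    return mean(t["Amount"] for t in history)


def alerts(stream, threshold):
    """Yield incoming transactions whose amount exceeds the precomputed threshold."""
    for t in stream:
        if t["Amount"] > threshold:
            yield t


# Hypothetical sample data standing in for the blob and the Event Hub stream.
history_2018 = [{"Amount": 4.0}, {"Amount": 6.0}, {"Amount": 5.6}]
incoming_2019 = [{"Id": 1, "Amount": 3.0}, {"Id": 2, "Amount": 7.5}]

threshold = compute_threshold(history_2018)          # computed once, like reference data
flagged = list(alerts(incoming_2019, threshold))     # only the transaction with Id 2 exceeds it
```

In the ASA job, the threshold would live in the reference data blob and the comparison would be the WHERE clause of the reference data join.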
Related
I have a measurement in InfluxDB with two keys: operation and count. The operation key can store two different values: 'add' and 'delete'.
I want to subtract the sum(count) value where operation='delete' from the sum(count) value where operation='add'.
The following query is supported in MySQL but throws an error in InfluxQL:
select (select sum(count) from measurement where operation='add') - (select sum(count) from measurement where operation='delete');
How can this be done using a single InfluxQL query? I don't think InfluxQL allows two different WHERE clauses in this case.
InfluxQL doesn't support this kind of multiquery math. You will need to calculate it on the app level.
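As a rough sketch of what "on the app level" could look like in Python: run the two sums as separate queries and subtract the results in code. The run_query callable is a hypothetical stand-in for whatever client call you use to execute a query and extract the single summed value; it is not a real library function.

```python
ADD_QUERY = "SELECT SUM(\"count\") FROM measurement WHERE \"operation\" = 'add'"
DELETE_QUERY = "SELECT SUM(\"count\") FROM measurement WHERE \"operation\" = 'delete'"


def net_count(run_query):
    """Return sum(count) for 'add' minus sum(count) for 'delete'.

    run_query is assumed to take a query string and return the summed value,
    or None when no points match; None is treated as 0.
    """
    add_total = run_query(ADD_QUERY) or 0
    delete_total = run_query(DELETE_QUERY) or 0
    return add_total - delete_total


# Stand-in for a real client: a dict lookup that returns canned results.
fake_results = {ADD_QUERY: 120, DELETE_QUERY: 45}
result = net_count(fake_results.get)
```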
Sorry, I have just started to learn Kylin.
When I execute the SQL select * from kylin_sales where price > 2 against the default sample cube of Kylin, it fails with the message
ERROR while executing SQL "select * from kylin_sales where price > 2 LIMIT 50000": Can't find any realization. Please confirm with providers SQL digest: fact table DEFAULT.KYLIN_SALES,group by [],filter on[DEFAULT.KYLIN_SALES.PRICE],with aggregates[].
Does anybody know the reason?
Thanks
Kylin is a MOLAP (multidimensional online analytical processing) engine. It divides columns into dimensions and measures, and expects queries to filter by dimensions and return aggregated measures.
Your query select * from kylin_sales where price > 2 does not work because price is not a dimension and thus is not suitable for filtering. The query also does not select any aggregated measures.
A simple MOLAP query is like select week_beg_dt, sum(price) from kylin_sales where meta_categ_name='Collectibles' group by week_beg_dt
Kylin also supports a special RAW measure type that allows filters such as price > 2, but that is not demonstrated by the sample cube.
I'm trying to process this query.
SELECT
r.src,r.dst, ROUND(r.price/50)*50 pb,COUNT(*) results
FROM [search.interesting_routes] ovr
LEFT JOIN [search.search_results2] r ON ovr.src=r.src AND ovr.dst=r.dst
WHERE DATE(r.saved_at) >= '2015-10-1' AND DATE(r.saved_at) <= '2015-10-01' AND r.price < 20000
GROUP BY pb, r.src, r.dst
ORDER BY pb
The table search_results2 contains a huge amount of search results with prices for routes (a route is defined by src and dst).
I need to count all records in search_results2 for each record in interesting_routes, split into price buckets.
The query works fine on a small sample of data, but once the data is huge it ends with
Error: Shuffle reached broadcast limit for table __I0 (broadcasted at
least 176120970 bytes). Consider using partitioned joins instead of
broadcast joins.
I am having difficulty rewriting the SELECT to use the suggested partitioned join, or at least getting the result some other way.
I would like to build a histogram on time series stored as a time tree in Neo4j.
The data consists of events performed by users, each with a timestamp (for example, a user purchasing from a category).
What I need is the number of browsing events on each category by each user between a start and an end time, with an interval ranging from 1 second to days.
My model fits a graph DB very nicely, but reading the Neo4j documentation I could not find any way to do this in one query, and I'm afraid that running a query for each user would be very slow.
I am aware of Cypher's capabilities, but I have no idea how to write such a query.
I am looking for something like this (not working):
MATCH startPath=(root)-[:`2010`]->()-[:`12`]->()-[:`31`]->(startLeaf),
endPath=(root)-[:`2011`]->()-[:`01`]->()-[:`03`]->(endLeaf),
valuePath=(startLeaf)-[:NEXT*0..]->(middle)-[:NEXT*0..]->(endLeaf),
vals=(middle)-[:VALUE]->(event)
WHERE root.name = 'Root'
RETURN event.name, count(*)
ORDER BY event.name ASC
GROUP BY event.timestamp % 1000*60*10 // 10 minutes histogram bar
Then I'd like to have a report of how many users browsed each site category:
0-9 news 5, commerce 3; 10-19 news 6, commerce 19; 20-29 news 2, commerce 8;
Any idea if this is possible with the Neo4j time tree model?
If so, how? :-)
Does this work?
MATCH
startPath=(root)-[:`2010`]->()-[:`12`]->()-[:`31`]->(startLeaf),
endPath=(root)-[:`2011`]->()-[:`01`]->()-[:`03`]->(endLeaf),
valuePath=(startLeaf)-[:NEXT*0..]->(middle)-[:NEXT*0..]->(endLeaf),
vals=(middle)-[:VALUE]->(event)
WHERE root.name = 'Root'
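RETURN event.name, event.timestamp / (1000*60*10) AS slice, count(*)
ORDER BY slice ASC
Basically I moved the bucketing expression into the RETURN so that Neo4j uses it as a grouping key (Cypher has no GROUP BY clause; the non-aggregated return columns act as the grouping keys). Assuming event.timestamp is an integer number of milliseconds, I also replaced event.timestamp % 1000*60*10 with integer division by (1000*60*10): % and * have the same precedence, so the original expression evaluates as (event.timestamp % 1000)*60*10, and a modulus would not give a bar index anyway.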
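To illustrate the bucketing arithmetic outside Cypher, here is a small Python sketch. It assumes timestamps are integer milliseconds and uses made-up sample events; the point is only that integer division by the bucket size yields the histogram bar index, which then serves as part of the grouping key. (Python parses ts % 1000*60*10 as (ts % 1000)*60*10 too, so the same precedence caveat applies.)

```python
from collections import Counter

BUCKET_MS = 1000 * 60 * 10  # 10 minutes in milliseconds


def histogram(events):
    """Count events per (10-minute bucket index, name).

    events is an iterable of (timestamp_ms, name) pairs.
    """
    return Counter((ts // BUCKET_MS, name) for ts, name in events)


# Hypothetical sample events: (timestamp in ms, category name).
events = [
    (0, "news"),
    (599_999, "news"),       # still inside the first 10-minute bar
    (600_000, "commerce"),   # first millisecond of the second bar
    (1_250_000, "news"),     # third bar
]
counts = histogram(events)
```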
I have written a Rails 4 app that accepts and plots sensor data. Sometimes there are 10 points per hour (but this number is not fixed). I'm plotting the data and doing a simple query of Points.all to get all the data points.
In order to reduce the query size, I would like to only return one record per hour. It doesn't matter which record is returned. The first record each hour using the created_at field would be fine.
How do I construct a query to do this?
You can get the first one, but maybe the average value is better. All you need to do is group by hour. I am not 100% sure about the SQLite syntax, but something along these lines:
connection.execute("SELECT AVG(READING_VALUE) FROM POINTS GROUP BY STRFTIME('%Y%m%d%H0', CREATED_AT)")
Inspired by this answer, here is an alternative which retrieves the latest record in each hour (if you don't want to average):
Point.from(
Point.select("max(unix_timestamp(created_at)) as max_timestamp")
.group("DATE(created_at), HOUR(created_at)") # subquery; grouping by date as well keeps the same hour on different days apart
)
.joins("INNER JOIN points ON subquery.max_timestamp = unix_timestamp(created_at)")
This will result in the following query:
SELECT `points`.*
FROM (
SELECT max(unix_timestamp(created_at)) as max_timestamp
FROM `points`
GROUP BY DATE(created_at), HOUR(created_at)
) subquery
INNER JOIN points ON subquery.max_timestamp = unix_timestamp(created_at)
You can also use MIN instead of MAX to get the first record of each hour.
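For comparison, here is a self-contained sketch of the same "first record per hour" idea using Python's built-in sqlite3 module. The table layout (reading_value, created_at stored as text) and the sample rows are assumptions made for the example, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE points (id INTEGER PRIMARY KEY, reading_value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO points (reading_value, created_at) VALUES (?, ?)",
    [
        (1.0, "2015-10-01 08:05:00"),
        (2.0, "2015-10-01 08:40:00"),
        (3.0, "2015-10-01 09:10:00"),
        (4.0, "2015-10-02 08:15:00"),
    ],
)

# The subquery finds the earliest timestamp in each calendar hour (date + hour),
# and the join pulls back the full row for each of those timestamps.
rows = conn.execute(
    """
    SELECT p.id, p.reading_value, p.created_at
    FROM points AS p
    JOIN (
        SELECT MIN(created_at) AS first_at
        FROM points
        GROUP BY strftime('%Y-%m-%d %H', created_at)
    ) AS firsts ON p.created_at = firsts.first_at
    ORDER BY p.created_at
    """
).fetchall()
```

If two points share the exact same earliest timestamp within an hour, this returns both of them.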