InfluxDB count() gives wrong value

I have an InfluxDB database where I store information related to Jenkins.
When I execute the below query:
SELECT
    project_name,
    build_number,
    build_result
FROM (
    SELECT
        project_name,
        build_number,
        build_result
    FROM "jenkins_data"
    WHERE (
        "project_name" =~ /^(?i)(test1|test2)$/ AND
        "project_path" =~ /.*(?i)Playground.*$/
    )
    ORDER BY time DESC
    LIMIT 15
)
WHERE (
    "build_result" = 'SUCCESS'
)
ORDER BY time DESC
I get the below result:
time project_name build_number build_result
1676039543717000000 test1 1600 SUCCESS
1676039352721000000 test1 1792 SUCCESS
1676039283509000000 test2 1669 SUCCESS
1676039543717000000 test1 1600 SUCCESS
1676039352721000000 test1 1792 SUCCESS
1676039283509000000 test2 1669 SUCCESS
1676039543717000000 test1 1600 SUCCESS
1676039352721000000 test1 1792 SUCCESS
1676039283509000000 test2 1669 SUCCESS
1676039283509000000 test2 1669 SUCCESS
The above result is correct, but when I use count() in the query it gives improper results:
SELECT
    count(build_number)
FROM (
    SELECT
        project_name,
        build_number,
        build_result
    FROM "jenkins_data"
    WHERE (
        "project_name" =~ /^(?i)(test1|test2)$/ AND
        "project_path" =~ /.*(?i)Playground.*$/
    )
    ORDER BY time DESC
    LIMIT 15
)
WHERE (
    "build_result" = 'SUCCESS'
)
ORDER BY time DESC
This gives the result as 15, which is not correct. Am I doing anything wrong here?
InfluxDB version used: InfluxDB v1.8.6 (git: 1.8 v1.8.6)

I was able to solve the issue by adding DISTINCT inside the count() function.
I used the below query:
SELECT
    count(DISTINCT build_number)
FROM (
    SELECT
        project_name,
        build_number,
        build_result
    FROM "jenkins_data"
    WHERE (
        "project_name" =~ /^(?i)(test1|test2)$/ AND
        "project_path" =~ /.*(?i)Playground.*$/
    )
    ORDER BY time DESC
    LIMIT 15
)
WHERE (
    "build_result" = 'SUCCESS'
)
ORDER BY time DESC
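With the sample rows shown above, this counts each unique build number once (1600, 1792 and 1669, i.e. 3), so duplicate points for the same build no longer inflate the count.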

Related

BigQuery ML.DETECT_ANOMALIES with model 'arima_plus' returns only nulls

I built a VERY simple model, with only a time series and a data field that is always 1, to find anomalies.
CREATE OR REPLACE MODEL `mytest.dummy`
OPTIONS(
model_type='arima_plus',
TIME_SERIES_DATA_COL='cnt',
TIME_SERIES_TIMESTAMP_COL='ts',
DATA_FREQUENCY='HOURLY',
DECOMPOSE_TIME_SERIES=TRUE
)
AS
select ts, 1 cnt
from UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-05-01', '2022-05-02', INTERVAL 1 HOUR)) as ts;
The model works fine unless I use a custom select query to find anomalies, even if the query is exactly the same as the one used to create the model.
SELECT *
FROM ML.DETECT_ANOMALIES(
MODEL `mytest.dummy`,
STRUCT (0.9 AS anomaly_prob_threshold),
(select ts, 1 cnt
from UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-05-01', '2022-05-02', INTERVAL 1 HOUR)) as ts)
)
Result:
Row  ts                       cnt  is_anomaly  lower_bound  upper_bound  anomaly_probability
1    2022-05-01 00:00:00 UTC  1.0  null        null         null         null
2    2022-05-01 01:00:00 UTC  1.0  null        null         null         null
3    ....
Does anyone know what I need to do to get the expected result of is_anomaly = false?
After a closer look into the documentation I found out that anomaly detection works only outside of the training range, at least for new queries, and only as far as the HORIZON goes (at the time of writing the default is 1,000).
Historical data can also be classified, but only without a query and only if the parameter DECOMPOSE_TIME_SERIES is set to true.
The example above would then look like this:
CREATE OR REPLACE MODEL `mytest.dummy`
OPTIONS(
model_type='arima_plus',
TIME_SERIES_DATA_COL='cnt',
TIME_SERIES_TIMESTAMP_COL='ts'
)
AS
select ts, 1 cnt
from UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-05-01', '2022-05-02', INTERVAL 1 HOUR)) as ts;
The query for the next days:
SELECT *
FROM ML.DETECT_ANOMALIES(
MODEL `mytest.dummy`,
STRUCT (0.9 AS anomaly_prob_threshold),
(select ts, 1 cnt
from UNNEST(GENERATE_TIMESTAMP_ARRAY('2022-05-03', '2022-05-04', INTERVAL 1 HOUR)) as ts)
)
Result:
Row  ts                       cnt  is_anomaly  lower_bound  upper_bound  anomaly_probability
1    2022-05-03 00:00:00 UTC  1.0  false       1.0          1.0          0.0
2    2022-05-04 01:00:00 UTC  1.0  false       1.0          1.0          0.0
3    ....
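For the historical (training) range itself, the call is made without an input query, relying on the decomposition stored with the model (DECOMPOSE_TIME_SERIES defaults to TRUE). A minimal sketch based on the model above:
SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `mytest.dummy`,
  STRUCT (0.9 AS anomaly_prob_threshold)
)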

How to do inner joins using Kusto query language on AppInsights

I'm using the following query to get the operationId values from the requests that failed with 400 using AppInsights:
requests
| project timestamp, id, operation_Name, success, resultCode, duration, operation_Id, cloud_RoleName, invocationId=customDimensions['InvocationId']
| where cloud_RoleName =~ 'xxxx' and operation_Name == 'createCase' and resultCode == 400
| order by timestamp desc
I use these operationId values in the following query to get the logs of what happened:
traces
| union exceptions
| where operation_Id == '35edbc7c13f7ac4c85fa0b8071a12b72'
| order by timestamp asc
With this I'm getting the information I want, but I need to write and execute the queries several times, so I'm trying to do a join between both queries. I haven't succeeded, as I'm not an expert on querying AppInsights and I'm not sure how to do the join together with a union. Can you help me?
Please try the query below:
requests
| project timestamp, id, operation_Name, success, resultCode, duration, operation_Id, cloud_RoleName, invocationId=customDimensions['InvocationId']
| where cloud_RoleName =~ 'xxxx' and operation_Name == 'createCase' and resultCode == 400
| join (
traces
| union exceptions
) on operation_Id
| project-away operation_Id1
| order by timestamp asc
More details on the join operator - https://learn.microsoft.com/en-us/azure/kusto/query/joinoperator
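Note that Kusto's default join flavor is innerunique, which deduplicates the left side on the join key; if you want strict inner-join semantics you can state the kind explicitly. A sketch based on the query above:
requests
| project timestamp, id, operation_Name, success, resultCode, duration, operation_Id, cloud_RoleName, invocationId=customDimensions['InvocationId']
| where cloud_RoleName =~ 'xxxx' and operation_Name == 'createCase' and resultCode == 400
| join kind=inner (
    traces
    | union exceptions
) on operation_Id
| project-away operation_Id1
| order by timestamp asc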

Error when visualize apache kylin data in apache superset

I tried to visualize Apache Kylin data in Apache Superset by following an official blog guide, but I ran into the following error when clicking the "Visualize" button after querying out the result table. I have upgraded kylinpy to the latest version. I know the correct SQL should be "WHERE MONTH_BEG_DT >= '1918-03-12' AND MONTH_BEG_DT <= '2018-03-12'", but it is generated automatically by Superset.
Caused by: java.lang.NumberFormatException: For input string: "12 00:00:00"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.calcite.avatica.util.DateTimeUtils.dateStringToUnixDate(DateTimeUtils.java:637)
at Baz$6$1.<clinit>(Unknown Source)
... 99 more
2018-03-12 18:13:12,606 INFO [Query eb988c1e-5f6c-4275-a9b8-1946f5976020-60] service.QueryService:328 :
==========================[QUERY]===============================
Query Id: eb988c1e-5f6c-4275-a9b8-1946f5976020
SQL: SELECT META_CATEG_NAME AS META_CATEG_NAME,
sum(CNT) AS sum__CNT
FROM
(select YEAR_BEG_DT,
MONTH_BEG_DT,
WEEK_BEG_DT,
META_CATEG_NAME,
CATEG_LVL2_NAME,
CATEG_LVL3_NAME,
OPS_REGION,
NAME as BUYER_COUNTRY_NAME,
sum(PRICE) as GMV,
sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL,
count(*) as CNT
from KYLIN_SALES
join KYLIN_CAL_DT on CAL_DT = PART_DT
join KYLIN_CATEGORY_GROUPINGS on SITE_ID = LSTG_SITE_ID
and KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID = KYLIN_SALES.LEAF_CATEG_ID
join KYLIN_ACCOUNT on ACCOUNT_ID = BUYER_ID
join KYLIN_COUNTRY on ACCOUNT_COUNTRY = COUNTRY
group by YEAR_BEG_DT,
MONTH_BEG_DT,
WEEK_BEG_DT,
META_CATEG_NAME,
CATEG_LVL2_NAME,
CATEG_LVL3_NAME,
OPS_REGION,
NAME) AS expr_qry
WHERE MONTH_BEG_DT >= '1918-03-12 00:00:00'
AND MONTH_BEG_DT <= '2018-03-12 18:13:11'
GROUP BY META_CATEG_NAME
ORDER BY sum__CNT DESC
LIMIT 5000
User: ADMIN
Success: true
Duration: 1.313
Project: learn_kylin
Realization Names: [CUBE[name=kylin_sales_cube]]
Cuboid Ids: [23715]
Total scan count: 9946
Total scan bytes: 556263
Result row count: 0
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Is Query Push-Down: false
Is Prepare: false
Trace URL: null
Message: null
==========================[QUERY]===============================
Please check the column (dimension) type in Superset and make sure the type is DATE, and then make sure the kylinpy version is above 1.0.9.
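For the kylinpy part, upgrading with pip is the usual route; for example (assuming a pip-managed environment):
pip install --upgrade "kylinpy>=1.0.9"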

Speed up insert new data to PostgreSQL

My current workflow is as follows; however, it is extremely slow and can only handle millions of rows per day.
I want to speed it up. Any ideas?
query_expression (generated by a Ruby rake file; the generated SQL expression is then passed to ActiveRecord::Base.connection.execute)
Step 1 of sample-1-query.sql : Aggregate data by minute, hour, ...
LEFT JOIN
(
SELECT DISTINCT ON (1)
date_trunc('#{frequence}', ticktime) AS ticktime ,
Step 2 of sample-1-query.sql : Filling the empty gaps
FROM
(
SELECT DISTINCT ON (1) generate_series
(
date_trunc('second', min(ticktime)::TIMESTAMP),
max(ticktime)::TIMESTAMP,
query_expression
SELECT DISTINCT ON (time)
time_series.ticktime AS time,
t.high,
t.low,
t.open,
t.close,
t.volume,
t.product_type,
t.contract_month
FROM
(
SELECT DISTINCT ON (1) generate_series
(
date_trunc('second', min(ticktime)::TIMESTAMP),
max(ticktime)::TIMESTAMP,
'1 #{frequence}'::interval
) AS ticktime FROM #{market} WHERE product_type ='#{product_type}' AND contract_month = '#{contract_month}'::timestamp
) time_series
LEFT JOIN
(
SELECT DISTINCT ON (1)
date_trunc('#{frequence}', ticktime) AS ticktime ,
first_value(last_price) OVER w AS open,
max(last_price) OVER w AS high ,
min(last_price) OVER w AS low,
last_value(last_price) OVER w AS close,
sum(last_volume) OVER w AS volume,
product_type,
contract_month
FROM #{market}
WHERE product_type ='#{product_type}'
AND contract_month = '#{contract_month}'::timestamp
WINDOW w AS (PARTITION BY date_trunc('#{frequence}', ticktime) ORDER BY ticktime
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
) t USING (ticktime)
WHERE time_series.ticktime::time >= '#{market_begin_end_time[market]["begin_at"]}'::time
AND time_series.ticktime::time < '#{market_begin_end_time[market]["end_at"]}'::time
AND time_series.ticktime > '#{sampling_begin_time}'::TIMESTAMP
ORDER BY 1
Then, in the Rake file:
ActiveRecord::Base.connection.execute(query_expression).each_with_index do |raw_record, j|
Model.create(raw_record)
end
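The row-by-row Model.create calls are usually the dominant cost in a loop like this. One common approach, sketched here under the assumption that the SELECT column list of query_expression matches a target table (aggregated_bars below is a hypothetical name), is to let PostgreSQL write the rows itself with a single INSERT ... SELECT instead of round-tripping every record through ActiveRecord:
# hypothetical sketch: bulk insert server-side instead of Model.create per row
ActiveRecord::Base.connection.execute(<<~SQL)
  INSERT INTO aggregated_bars
    (time, high, low, open, close, volume, product_type, contract_month)
  #{query_expression}
SQL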

How to use joins and averages together in Hive queries

I have two tables in hive:
Table1: uid, txid, amt, vendor
Table2: uid, txid
Now I need to join the tables on txid, which basically confirms that a transaction was finally recorded. There will be some transactions which are present only in Table1 and not in Table2.
I need to find the average transaction find rate per user (uid) per vendor. Then I need to find the average of these averages by adding all the averages and dividing by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example for vendor vend1:
u1 -> Avg transaction find rate = 2 (matches found in both tables, Table1 and Table2) / 4 (total occurrences in Table1) = 0.5
u2 -> Avg transaction find rate = 1 / 1 = 1
Avg of avgs = (0.5 + 1) (sum of avgs) / 2 (total unique users) = 0.75
Required output:
vend1,0.75
vend2,1
I can't seem to get the count of both the matches and the occurrences in just Table1, per user per vendor, in one Hive query. I have gotten to this query and can't figure out how to change it further.
SELECT A.vendor,A.uid,count(*) as totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid =A.txid group by vendor,A.uid
Any help would be great.
I think you are running into trouble with your JOIN. When you JOIN by txid and uid, you lose the total number of uids per group. If I were you, I would add a column of 1's to Table2, name the column something like success or transaction, and do a LEFT OUTER JOIN. Then in your new table you will have a column with the number 1 in it if there was a completed transaction and NULL otherwise. You can then use a case statement to convert these NULLs to 0.
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success in line 17 is the column of 1's that I put into table2 before the JOIN. If you are curious about case statements in Hive you can find them here.
Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from the wrong perspective.
I made small changes to the query to finally get things done.
I didn't need to add a "success" column to Table2. I checked B.txid in the above query instead of B.success; B.txid will be null if a match is not found and will have some value if a match is found, so it captures the success and failure conditions itself without adding a new column. Then I set NULL to 0 and non-NULL to 1 in the part above it. I also changed some variable names, as Hive was finding them ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;
