How to use the index in a filter in a "dask-sql" SQL query - dask

I create a sample Dask dataframe with the timestamp as the index.
df = dask.datasets.timeseries()
df.head()
                       id      name         x         y
timestamp
2000-01-01 00:00:00   915   Norbert -0.989381  0.974546
2000-01-01 00:00:01  1026     Zelda  0.919731  0.656581
2000-01-01 00:00:02  1003  Patricia -0.128303 -0.354592
2000-01-01 00:00:03   986     Jerry  0.557732  0.160812
Now I want to use dask-sql with a filter on the index in an SQL query. However, this does not work:
from dask_sql import Context
c = Context()
c.create_table("mytab", df)
result = c.sql("""
SELECT
count(*)
FROM mytab
WHERE "timestamp" > '2000-01-01 00:00:00'
""")
print(result.compute())
The Error Message is:
Traceback (most recent call last):
File "/opt/dask_sql/startup_script.py", line 15, in <module>
result = c.sql("""
File "/opt/dask_sql/dask_sql/context.py", line 458, in sql
rel, select_names, _ = self._get_ral(sql)
File "/opt/dask_sql/dask_sql/context.py", line 892, in _get_ral
raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: From line 4, column 15 to line 4, column 25: Column 'timestamp' not found in any table
The problem is probably somewhere here:
SELECT count(*)
FROM timeseries
WHERE "timestamp" > '2000-01-01'
^^^^^^^^^^^
I am using the Docker image nbraun/dask-sql:2022.1.0.
Is there an efficient way to get all the rows based on an index filter? It is important that this can be done in dask-sql, because I need to execute the SQL via the Presto endpoint provided by the dask-sql server.

dask-sql doesn't seem to expose the index column "timestamp" to SQL, so one workaround is to use reset_index:
import dask
import dask.dataframe as dd
from dask_sql import Context
ddf = dask.datasets.timeseries()
c = Context()
c.create_table("mytab", ddf.reset_index())
result = c.sql("""
SELECT
count(*)
FROM mytab
WHERE "timestamp" > '2000-01-01 00:00:00'
""")
print(result.compute())
In this specific example, we then get TypeError('Invalid comparison between dtype=datetime64[ns] and datetime') because pandas/Dask store timestamps as datetime64[ns], while the SQL literal is parsed as a Python datetime. You can work around this by converting the "timestamp" column to formatted strings, using something like:
import datetime
c.create_table("mytab", ddf.reset_index().assign(timestamp = lambda df: df["timestamp"].apply(lambda x: x.strftime('%Y-%m-%d'), meta=('timestamp', 'object'))))
This is equivalent to:
ddf_new = ddf.reset_index()
ddf_new["timestamp"] = ddf_new["timestamp"].apply(lambda x: x.strftime('%Y-%m-%d'), meta=('timestamp', 'object'))
c.create_table("mytab", ddf_new)
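Another variant that may work is to cast the literal on the SQL side instead of converting the column. This is only a sketch and assumes the SQL dialect used by dask-sql 2022.1.0 accepts CAST(... AS TIMESTAMP) and can compare the result with a datetime64[ns] column:
import dask
from dask_sql import Context

ddf = dask.datasets.timeseries()
c = Context()
c.create_table("mytab", ddf.reset_index())  # the index becomes a regular "timestamp" column

# Assumption: the dialect accepts a CAST to TIMESTAMP, so the datetime64[ns]
# column can be compared directly without converting it to strings first.
result = c.sql("""
SELECT count(*)
FROM mytab
WHERE "timestamp" > CAST('2000-01-01 00:00:00' AS TIMESTAMP)
""")
print(result.compute())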
I'd also encourage you to open relevant issues on the dask-sql issue tracker to reach the team directly. :)

Related

How to CAST one attribute to INTEGER then group and sum

I would like to do the following with a query/SQL in Rails:
Collect a batch of Orders, selecting :buyer_id and :weight_lb.
Convert every weight_lb from a string (like "12.3lb") to an integer (like 12).
Sum all the weight_lb and group by buyer_id.
The output should look like: {buyer_id_1: 65, buyer_id_2: 190}, etc., where each number is the sum of each buyer's order weights.
This is what I've tried:
Order.find_by_sql("SELECT \"orders\".\"id\", \"orders\".\"buyer_id\", CAST(\"orders\".\"weight_lb\" AS DECIMAL) FROM \"orders\" LIMIT 500 OFFSET 1000")
=> [
#<Order:0x0000000118054830 id: 15076494, buyer_id: 22918, weight_lb: "315.0">,
#<Order:0x0000000118054918 id: 15076495, buyer_id: 22918, weight_lb: "110.0">,
...]
Despite CAST() as DECIMAL, the weight is still output as a string.
When I try to CAST() as INTEGER, it fails entirely with PG::InvalidTextRepresentation: ERROR: invalid input syntax for type integer: "315.0" (ActiveRecord::StatementInvalid)
What I would ideally like to have is:
{
15076494: 425, # Sum of all weights for the ID 15076494
15076495: 0,
15076496: 95, ...
}
I'm just not sure how to get there efficiently using Postgres.
We can use a combination of REPLACE, CAST and SUM operations:
Order
.select("buyer_id, SUM(CAST(REPLACE(weight_lb, 'lb', '') AS DECIMAL)) AS weight_lb")
.group("buyer_id")
.limit(500)
.offset(1000)
The generated SQL will be:
SELECT "orders"."buyer_id", SUM(CAST(REPLACE("orders"."weight_lb", 'lb', '') AS DECIMAL)) AS weight_lb
FROM "orders"
GROUP BY "orders"."buyer_id"
LIMIT 500
OFFSET 1000
Let me know if it helps. :)

How can you tell if an Influx database contains data?

I'm currently trying to count the number of rows in an InfluxDB, but the following fails.
SELECT count(*) FROM "TempData_Quarantine_1519835017000_1519835137000"..:MEASUREMENT";
with the message
InfluxData API responded with status code=BadRequest, response={"error":"error parsing query: found :, expected ; at line 1, char 73"}
To my understanding, this query should check all measurements and count their rows?
(I inherited this code from someone else, so apologies for not understanding it better)
If you just need a binary answer to the question "does an Influx database contain data?", then do
select count(*) from /.*/
If the current retention policy in the current database is empty (contains 0 rows), it will return nothing at all. Otherwise it will return something like this:
name: api_calls
time count_value
---- -----------
0 5
name: cpu
time count_value
---- -----------
0 1
You can also specify the retention policy explicitly:
SELECT count(*) FROM "TempData_Quarantine_1519835017000_1519835137000"./.*/
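If you want to perform this check programmatically, here is a minimal sketch using the influxdb Python client; the host, port and database name below are placeholders:
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='mydb')
result = client.query('SELECT count(*) FROM /.*/')

# An empty ResultSet means the current retention policy contains no rows at all.
has_data = len(list(result.get_points())) > 0
print('database contains data' if has_data else 'database is empty')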

Error when visualizing Apache Kylin data in Apache Superset

I tried to visualize Apache Kylin data in Apache Superset following an official blog guide, but I hit the following error when clicking the "visualize" button after the query returns a result table. I have upgraded kylinpy to the latest version. I know the correct SQL should be "WHERE MONTH_BEG_DT >= '1918-03-12' AND MONTH_BEG_DT <= '2018-03-12'", but the query is generated automatically by Superset.
Caused by: java.lang.NumberFormatException: For input string: "12 00:00:00"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.calcite.avatica.util.DateTimeUtils.dateStringToUnixDate(DateTimeUtils.java:637)
at Baz$6$1.<clinit>(Unknown Source)
... 99 more
2018-03-12 18:13:12,606 INFO [Query eb988c1e-5f6c-4275-a9b8-1946f5976020-60] service.QueryService:328 :
==========================[QUERY]===============================
Query Id: eb988c1e-5f6c-4275-a9b8-1946f5976020
SQL: SELECT META_CATEG_NAME AS META_CATEG_NAME,
sum(CNT) AS sum__CNT
FROM
(select YEAR_BEG_DT,
MONTH_BEG_DT,
WEEK_BEG_DT,
META_CATEG_NAME,
CATEG_LVL2_NAME,
CATEG_LVL3_NAME,
OPS_REGION,
NAME as BUYER_COUNTRY_NAME,
sum(PRICE) as GMV,
sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL,
count(*) as CNT
from KYLIN_SALES
join KYLIN_CAL_DT on CAL_DT = PART_DT
join KYLIN_CATEGORY_GROUPINGS on SITE_ID = LSTG_SITE_ID
and KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID = KYLIN_SALES.LEAF_CATEG_ID
join KYLIN_ACCOUNT on ACCOUNT_ID = BUYER_ID
join KYLIN_COUNTRY on ACCOUNT_COUNTRY = COUNTRY
group by YEAR_BEG_DT,
MONTH_BEG_DT,
WEEK_BEG_DT,
META_CATEG_NAME,
CATEG_LVL2_NAME,
CATEG_LVL3_NAME,
OPS_REGION,
NAME) AS expr_qry
WHERE MONTH_BEG_DT >= '1918-03-12 00:00:00'
AND MONTH_BEG_DT <= '2018-03-12 18:13:11'
GROUP BY META_CATEG_NAME
ORDER BY sum__CNT DESC
LIMIT 5000
User: ADMIN
Success: true
Duration: 1.313
Project: learn_kylin
Realization Names: [CUBE[name=kylin_sales_cube]]
Cuboid Ids: [23715]
Total scan count: 9946
Total scan bytes: 556263
Result row count: 0
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Is Query Push-Down: false
Is Prepare: false
Trace URL: null
Message: null
==========================[QUERY]===============================
Please check the column (dimension) type in Superset and make sure the type is DATE, and then make sure your kylinpy version is above 1.0.9.
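For reference, one way to check the installed kylinpy version from Python (this only uses setuptools' pkg_resources and makes no assumptions about kylinpy internals):
import pkg_resources

# Should print a version above 1.0.9
print(pkg_resources.get_distribution("kylinpy").version)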

Complex DB2/SQL query with time-sampling, group, map, join and CSV export

I have data in a table (named TESTING) in a dashDB instance on IBM Bluemix (Db2 Warehouse on Cloud), which looks like this:
ID   TIMESTAMP                NAME   VALUE
abc  2017-12-21 19:55:38.762  test1  123
abc  2017-12-21 19:55:42.762  test2  456
abc  2017-12-21 19:57:38.762  test1  789
abc  2017-12-21 19:58:38.762  test3  345
def  2017-12-21 19:59:38.762  test1  678
I am looking for a query that:
1. samples the data (for each NAME) to a given time format (e.g. to a 1-minute based timestamp)
2. averages the VALUEs that fall into the same time range (the same minute); empty times should be NULL
For 1. and 2., something like this works (but only for one NAME):
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
)
select temporaer, avg(VALUE) as test1 from dummy
LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
group by temporaer
ORDER BY temporaer ASC;
3. joins all different NAMEs column-wise into a matrix, like:
TIMESTAMP            test1  test2  test3
2017-12-01 00:00:00  null   null   null
...
2017-12-21 19:55:00  123    456    null
2017-12-21 19:56:00  null   null   null
2017-12-21 19:57:00  789    null   null
2017-12-21 19:58:00  678    null   345
...
2018-01-31 23:59:00  null   null   null
4. the query result should be exported as a CSV, or returned as a CSV string
Does anybody know how this could be done in one query, or in a simple and fast way? Or is it necessary to store the data in another table format - can you give me a hint?
Here is a code snippet that does the job, but takes a very long time:
WITH
-- get all distinct names in table:
header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),
-- generate a range of times from date to date in defined steps:
dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- add each name (from header) to each time/row (in dummy):
dumpy(time, names) AS (SELECT Dummy.time, Header.names
FROM Dummy
LEFT OUTER JOIN Header
ON Dummy.time IS NOT NULL),
-- averages values by name and timeinterval and sorts result to dummy:
dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
FROM Dummy
LEFT OUTER JOIN Dummie
ON Dummie.time = Dummy.time
GROUP BY Dummie.names, Dummy.time),
-- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
FROM Dumpy
LEFT OUTER JOIN Dummj
ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),
-- converts the high amount of rows to less rows with delimited strings:
test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
FROM Testo
GROUP BY time)
SELECT * FROM test ORDER BY time ASC, names ASC;
The performance problem is in the "testo" subquery. Does anybody have an idea what the problem is here, or know how to improve the query?
Well, one problem I see is that you keep using functions on columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is very common, it may also be worth it to permanently build and index the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:
-- let's name things properly, too, to keep them straight.
WITH
-- generate a range of times from date to date in defined steps:
Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM Range
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FieldTest
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
-- just make the white space check part of the regex
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
FROM Range
CROSS JOIN Header
LEFT JOIN FieldTest
ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
AND FieldTest.name = Header.names
AND FieldTest.timestamp >= Range.rangeStart
AND FieldTest.timestamp < Range.rangeEnd
GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
-- converts the high amount of rows to fewer rows with delimited strings:
LISTAGG(name, ';') WITHIN GROUP(ORDER BY name) AS names,
LISTAGG(averaged, ';') WITHIN GROUP(ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names
(not tested)
The CROSS JOIN was definitely a nice hint. I was not able to implement the LEFT JOIN exactly as you suggested, but I found a workaround which - I am sure - still leaves room for improvement, but is acceptable for me at the moment (a time saving of about a factor of 30 compared to my first query). Here is the actual code:
WITH
-- generate a range of times from date to date in defined steps:
TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM TimeRange
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
FROM TimeRange
CROSS JOIN Header
LEFT JOIN rawData
ON rawData.names = Header.names
AND rawData.time = TimeRange.rangeStart
GROUP BY TimeRange.rangeStart, Header.names),
test(time, names, avgvalues) AS (SELECT Data.rangeStart,
LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY Data.rangeStart)
-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;
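For the CSV export itself (point 4 above), another option is to skip the string concatenation in SQL and let a small client script write the file. This is only a sketch: the connection string, file name and the placeholder SELECT are assumptions you would replace with your own credentials and with the full WITH ... SELECT query above.
import csv
import ibm_db_dbi

# Placeholder connection string for Db2 Warehouse on Cloud - fill in your own credentials.
conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;PROTOCOL=TCPIP;UID=<user>;PWD=<password>;",
    "", "")
cur = conn.cursor()
cur.execute("SELECT * FROM FIELDTEST")  # replace with the complete query from above

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow([col[0] for col in cur.description])  # header row from cursor metadata
    writer.writerows(cur.fetchall())

conn.close()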

Aerospike: How to perform IN query on PK

How do I perform (SQL-like) IN queries in Aerospike?
Do we need a UDF for this?
Something like this: Select * from ns.set where PK in (1,2,3)
If this requires a UDF, how would I go about it, since a UDF is executed on a single key:
EXECUTE <module>.<function>(<args>) ON <ns>[.<set>] WHERE PK = <key>
You're basically looking at retrieving records by a list of keys. This is a batch-read operation in Aerospike. Every language client for Aerospike should have this capability.
For example, in the Python client this is the Client.get_many method:
from __future__ import print_function
import aerospike
from aerospike.exception import AerospikeError
import sys
config = { 'hosts': [('127.0.0.1', 3000)] }
client = aerospike.client(config).connect()
try:
    # assume the fourth key has no matching record
    keys = [
        ('test', 'demo', '1'),
        ('test', 'demo', '2'),
        ('test', 'demo', '3'),
        ('test', 'demo', '4')
    ]
    records = client.get_many(keys)
    print(records)
except AerospikeError as e:
    print("Error: {0} [{1}]".format(e.msg, e.code))
    sys.exit(1)
finally:
    client.close()
Similarly, in the Java client the AerospikeClient.get() method can take a list of keys.
In version 3.12.1+, you can run such queries if you store your primary key in a bin and then apply predicate filtering on that bin. See http://www.aerospike.com/docs/guide/predicate.html
Aerospike by default does not store the PK in its raw string or numeric form as you assign it. It stores a RIPEMD160 hash of the PK+Set name.
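Note that this means you have to write the key into a bin yourself at insert time. A minimal sketch with the Python client (the bin name pk_bin is made up for illustration):
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

# Duplicate the primary key into a regular bin so a predicate filter can target it later.
for pk in (1, 2, 3):
    client.put(('test', 'demo', pk), {'pk_bin': pk, 'value': 'some data'})

client.close()
The predicate filter described in the linked documentation is then applied to pk_bin rather than to the stored digest.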