Error when visualize apache kylin data in apache superset - kylin

I tried to view apache kylin data with apache superset by an official blog guide, but I met the following error when click "visualize" button after query out result table. I have upgraded kylinpy to latest version. I know the correct sql should be "WHERE MONTH_BEG_DT >= '1918-03-12' AND MONTH_BEG_DT <= '2018-03-12'", but it is generated by superset auto.
Caused by: java.lang.NumberFormatException: For input string: "12 00:00:00"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.calcite.avatica.util.DateTimeUtils.dateStringToUnixDate(DateTimeUtils.java:637)
at Baz$6$1.<clinit>(Unknown Source)
... 99 more
2018-03-12 18:13:12,606 INFO [Query eb988c1e-5f6c-4275-a9b8-1946f5976020-60] service.QueryService:328 :
==========================[QUERY]===============================
Query Id: eb988c1e-5f6c-4275-a9b8-1946f5976020
SQL: SELECT META_CATEG_NAME AS META_CATEG_NAME,
sum(CNT) AS sum__CNT
FROM
(select YEAR_BEG_DT,
MONTH_BEG_DT,
WEEK_BEG_DT,
META_CATEG_NAME,
CATEG_LVL2_NAME,
CATEG_LVL3_NAME,
OPS_REGION,
NAME as BUYER_COUNTRY_NAME,
sum(PRICE) as GMV,
sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL,
count(*) as CNT
from KYLIN_SALES
join KYLIN_CAL_DT on CAL_DT = PART_DT
join KYLIN_CATEGORY_GROUPINGS on SITE_ID = LSTG_SITE_ID
and KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID = KYLIN_SALES.LEAF_CATEG_ID
join KYLIN_ACCOUNT on ACCOUNT_ID = BUYER_ID
join KYLIN_COUNTRY on ACCOUNT_COUNTRY = COUNTRY
group by YEAR_BEG_DT,
MONTH_BEG_DT,
WEEK_BEG_DT,
META_CATEG_NAME,
CATEG_LVL2_NAME,
CATEG_LVL3_NAME,
OPS_REGION,
NAME) AS expr_qry
WHERE MONTH_BEG_DT >= '1918-03-12 00:00:00'
AND MONTH_BEG_DT <= '2018-03-12 18:13:11'
GROUP BY META_CATEG_NAME
ORDER BY sum__CNT DESC
LIMIT 5000
User: ADMIN
Success: true
Duration: 1.313
Project: learn_kylin
Realization Names: [CUBE[name=kylin_sales_cube]]
Cuboid Ids: [23715]
Total scan count: 9946
Total scan bytes: 556263
Result row count: 0
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Is Query Push-Down: false
Is Prepare: false
Trace URL: null
Message: null
==========================[QUERY]===============================

Please check column(dimension) type in superset, make sure the type is DATA, and then please make sure kylinpy version is above 1.0.9.

Related

How do I optimise this ClickHouse DB JOIN query?

I am playing around with Clickhouse DB and I am trying to figure out why the query below is giving me a DB::Exception: Memory limit (for query) exceeded and could use some help...
SELECT * FROM
(
SELECT created_at, rates.car_id, MIN(rates.price) FROM rates
WHERE
pickup_location_id = 198
AND created_at = '2020-10-01'
GROUP BY created_at, car_id
) r
JOIN cars c2 ON r.car_id = c2.id
The inner query bit performs almost instantly (millions of records) and yields only 212 results. However, adding the JOIN causes the query to fail (memory exception, 45GB)
Looks like the JOIN happens on the whole of rates/cars - and not on the "result"?
CH uses HASHJOIN and places the right table into memory into a HashTable.
In case of inner join you can swap tables:
SELECT * FROM cars c2 JOIN
(
SELECT created_at, rates.car_id, MIN(rates.price) FROM rates
WHERE
pickup_location_id = 198
AND created_at = '2020-10-01'
GROUP BY created_at, car_id
) r
ON r.car_id = c2.id

create view of count based on latest timestamp for a particular id

I have been provided with the following code to run a query that counts the number of connector_pks grouped by group_status based on the latest timestamp:
SELECT
`group_status`,COUNT(*) 'Count of status '
FROM
(SELECT `connector_pk`, `group_status`, `status_timestamp`
FROM connector_status_report t1
WHERE `status_timestamp` = (SELECT MAX(`status_timestamp`)
FROM connector_status_report t2 WHERE t2.`connector_pk` = t1.`connector_pk`))
t3
GROUP BY `group_status`
Unfortunately this takes about 30 minutes to run so I was hoping for an optimised solution.
Example table
connector_pk group_status status timestamp
1 Available 2020-02-11 19:14:45
1 Charging 2020-02-11 19:18:45
2 Available 2020-02-11 19:15:45
2 Not Available 2020-02-11 19:18:45
3 Not Available 2020-02-11 19:14:45
The desired output would look like this:
group_Status | Count of status
Available | 0
Charging | 1
Not Available | 2
For my original question I was pointed to the following question (and answers):
Get records with max value for each group of grouped SQL results
I would like to create a view with the output
Is it possible to also add the following to the query to include in the View:
SELECT status, = IF(status = 'charging', 'Charging', if(status = 'Not
Occupied','Available', 'Occupied') AS group_status FROM
connector_status_report
I managed to speed up the query using the following:
CREATE VIEW statuscount AS
Select group_status, COUNT(*) 'Count of status'
FROM
(SELECT tt.*
FROM connector_status_report tt
INNER JOIN
(SELECT connector_pk, MAX(status_timestamp) AS MaxDateTime
FROM connector_status_report
GROUP BY connector_pk) groupedtt
ON tt.connector_pk = groupedtt.connector_pk
AND tt.status_timestamp = groupedtt.MaxDateTime) t3
GROUP BY group_status
If anybody can help with inserting the query that creates the 'Group_Status' column it would be much appreciated

Linq Query Timing Out

I have this query that uses the DBContext entities I created.
var referral = entities.StudentReferrals.Where(x => x.ReferralID == p && x.SchoolYear == year).FirstOrDefault();
When I remove x.SchoolYear == year the query works fine, but with it my query times out. The opposite of what I would expect to happen, I would expect the more you narrow a query down via Where clause constraints the less likely it would time out.
SchoolYear is a field in the query and the query itself is valid, when I perform the query within SQL Studio Manager it returns results in less than a second.
My confusion is, why would adding a constraint to the Where clause cause a query to time out??
x.SchoolYear and year are both strings.
The full query is...
SELECT [Extent1].[BirthDate] AS [BirthDate],
[Extent1].[LegalFirstName] AS [LegalFirstName],
[Extent1].[LegalLastName] AS [LegalLastName],
[Extent1].[PreferredFirstName] AS [PreferredFirstName],
[Extent1].[PreferredLastName] AS [PreferredLastName],
[Extent1].[StudentNumber] AS [StudentNumber],
[Extent1].[LegacyStudentNumber] AS [LegacyStudentNumber],
[Extent1].[TranscriptSchoolCode] AS [TranscriptSchoolCode],
[Extent1].[OEN] AS [OEN],
[Extent1].[StatusIndicator] AS [StatusIndicator],
[Extent1].[SchoolYear] AS [SchoolYear],
[Extent1].[ReferralID] AS [ReferralID],
[Extent1].[PersonID] AS [PersonID],
[Extent1].[Active] AS [Active],
[Extent1].[ServiceTypeID] AS [ServiceTypeID],
[Extent1].[IsSchoolActive] AS [IsSchoolActive],
[Extent1].[Principal] AS [Principal],
[Extent1].[SchoolName] AS [SchoolName],
[Extent1].[SchoolCode] AS [SchoolCode],
[Extent1].[NearNorthSchoolCode] AS [NearNorthSchoolCode],
[Extent1].[TranscriptSchoolPrincipal] AS [TranscriptSchoolPrincipal],
[Extent1].[TranscriptSchoolName] AS [TranscriptSchoolName],
[Extent1].[TranscriptNearNorthSchoolCode] AS [TranscriptNearNorthSchoolCode],
[Extent1].[GuardianFirstName] AS [GuardianFirstName],
[Extent1].[GuardianLastName] AS [GuardianLastName],
[Extent1].[AreaCode] AS [AreaCode],
[Extent1].[ContactNo] AS [ContactNo],
[Extent1].[ReferredByFirstName] AS [ReferredByFirstName],
[Extent1].[ReferredByLastName] AS [ReferredByLastName],
[Extent1].[ReferredDate] AS [ReferredDate],
[Extent1].[Reason] AS [Reason],
[Extent1].[gender] AS [gender],
[Extent1].[grade] AS [grade],
[Extent1].[HomeroomTeacher] AS [HomeroomTeacher],
[Extent1].[IntakeTeamMember] AS [IntakeTeamMember],
[Extent1].[IntakeMemberID] AS [IntakeMemberID]
FROM (SELECT [StudentReferrals].[BirthDate] AS [BirthDate],
[StudentReferrals].[LegalFirstName] AS [LegalFirstName],
[StudentReferrals].[LegalLastName] AS [LegalLastName],
[StudentReferrals].[PreferredFirstName] AS [PreferredFirstName],
[StudentReferrals].[PreferredLastName] AS [PreferredLastName],
[StudentReferrals].[gender] AS [gender],
[StudentReferrals].[StudentNumber] AS [StudentNumber],
[StudentReferrals].[LegacyStudentNumber] AS [LegacyStudentNumber],
[StudentReferrals].[TranscriptSchoolCode] AS [TranscriptSchoolCode],
[StudentReferrals].[OEN] AS [OEN],
[StudentReferrals].[StatusIndicator] AS [StatusIndicator],
[StudentReferrals].[SchoolYear] AS [SchoolYear],
[StudentReferrals].[grade] AS [grade],
[StudentReferrals].[ReferralID] AS [ReferralID],
[StudentReferrals].[PersonID] AS [PersonID],
[StudentReferrals].[Active] AS [Active],
[StudentReferrals].[ServiceTypeID] AS [ServiceTypeID],
[StudentReferrals].[IsSchoolActive] AS [IsSchoolActive],
[StudentReferrals].[Principal] AS [Principal],
[StudentReferrals].[SchoolName] AS [SchoolName],
[StudentReferrals].[SchoolCode] AS [SchoolCode],
[StudentReferrals].[NearNorthSchoolCode] AS [NearNorthSchoolCode],
[StudentReferrals].[TranscriptSchoolPrincipal] AS [TranscriptSchoolPrincipal],
[StudentReferrals].[TranscriptSchoolName] AS [TranscriptSchoolName],
[StudentReferrals].[TranscriptNearNorthSchoolCode] AS [TranscriptNearNorthSchoolCode],
[StudentReferrals].[GuardianFirstName] AS [GuardianFirstName],
[StudentReferrals].[GuardianLastName] AS [GuardianLastName],
[StudentReferrals].[AreaCode] AS [AreaCode],
[StudentReferrals].[ContactNo] AS [ContactNo],
[StudentReferrals].[ReferredByFirstName] AS [ReferredByFirstName],
[StudentReferrals].[ReferredByLastName] AS [ReferredByLastName],
[StudentReferrals].[ReferredDate] AS [ReferredDate],
[StudentReferrals].[IntakeTeamMember] AS [IntakeTeamMember],
[StudentReferrals].[IntakeMemberID] AS [IntakeMemberID],
[StudentReferrals].[Reason] AS [Reason],
[StudentReferrals].[HomeroomTeacher] AS [HomeroomTeacher]
FROM [dbo].[StudentReferrals] AS [StudentReferrals]) AS [Extent1]
WHERE ([Extent1].[ReferralID] = #p__linq__0) AND ([Extent1].[SchoolYear] = #p__linq__1)
Here is the StudentReferral definition...
SELECT TOP (100) PERCENT p.person_id AS PersonID, p.birth_date AS BirthDate, p.legal_first_name AS LegalFirstName, p.legal_surname AS LegalLastName, p.preferred_first_name AS PreferredFirstName,
p.preferred_surname AS PreferredLastName, p.gender, p.student_no AS StudentNumber, p.legacy_student_number AS LegacyStudentNumber, p.transcript_school_code AS TranscriptSchoolCode,
p.oen_number AS OEN, s.status_indicator_code AS StatusIndicator, s.school_year AS SchoolYear, s.grade, CAST(CASE WHEN PATINDEX('%[^A-Za-z]%', s.Grade) = 0 THEN 1 ELSE CASE WHEN CAST(s.Grade AS int)
< 9 THEN 1 ELSE 0 END END AS bit) AS IsElementary, t.SchoolName, t.SchoolCode, t.NearNorthSchoolCode, pg.person_id AS GuardianID, pg.legal_first_name AS GuardianFirstName,
pg.legal_surname AS GuardianLastName, pt.area_code AS AreaCode, pt.phone_no AS ContactNo, pt.email_account AS Email
FROM Trillium.dbo.persons AS p INNER JOIN
Trillium.dbo.student_registrations AS s ON s.person_id = p.person_id INNER JOIN
dbo.Schools AS t ON t.SchoolCode = s.school_code INNER JOIN
NNDSB_AD_Routines.dbo.Students_Trillium_Guardians AS g ON s.person_id = g.student_person_id INNER JOIN
Trillium.dbo.persons AS pg ON g.contact_person_id = pg.person_id INNER JOIN
Trillium.dbo.person_telecom AS pt ON pg.person_id = pt.person_id
WHERE (s.status_indicator_code IN ('Active', 'PreReg')) AND (pt.telecom_type_name = 'home')
GROUP BY p.person_id, p.birth_date, p.legal_first_name, p.legal_surname, p.preferred_first_name, p.preferred_surname, p.gender, p.student_no, p.legacy_student_number, p.transcript_school_code, p.oen_number,
s.status_indicator_code, s.school_year, s.grade, CAST(CASE WHEN PATINDEX('%[^A-Za-z]%', s.Grade) = 0 THEN 1 ELSE CASE WHEN CAST(s.Grade AS int) < 9 THEN 1 ELSE 0 END END AS bit), t.SchoolName,
t.SchoolCode, t.NearNorthSchoolCode, pg.person_id, pg.legal_first_name, pg.legal_surname, pt.area_code, pt.phone_no, pt.email_account, g.primary_contact_priority
ORDER BY g.primary_contact_priority
I can almost guarantee that the query that EF produces and the query you're executing in SSMS are not the exact same SELECT statement. You probably wrote something like what Stephen Byrne has in his answer, i.e.
SELECT * from StudentReferrals WHERE ReferallID=1 AND SchoolYear='2015'
Right off the bat this query doesn't have a TOP qualifier on it which your EF query probably will due to the presence of the FirstOrDefault call.
Your first step should be to use something like SQL Profiler and grab the actual query that EF is generating. It's possible that with that query the optimizer is choosing to do a table scan because of the type of query that is being generated.
This likely won't make any difference, but you could also try rewriting your query as:
var referral = entities.StudentReferrals.FirstOrDefault(x => x.ReferralID == p && x.SchoolYear == year);
As an example, when I write the following query against my database:
OrganizationalNodes.FirstOrDefault(on => on.Name == "Justice League")
EF generates the following SQL:
SELECT
[Limit1].[C1] AS [C1],
[Limit1].[Id] AS [Id],
-- columns omitted for brevity
FROM ( SELECT TOP (1)
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
-- columns omitted for brevity
'0X0X' AS [C1]
FROM [dbo].[OrganizationalItems] AS [Extent1]
INNER JOIN [dbo].[OrganizationalNodes] AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]
WHERE N'Justice League' = [Extent1].[Name]
) AS [Limit1]
Well, to answer the question
why would adding a constraint to the Where clause cause a query to time out
The most likely cause is that you have a lot of data in the table, but no index covers the SchoolYear column. Therefore when you include in in a WHERE clause, this causes a Table Scan (because every row has to be checked to see if it should be included or not in the result set)
If you use SQL Server Management Studio and write the query manually for e.g
SELECT * from StudentReferrals WHERE ReferallID=1 AND SchoolYear='2015'
And then include the actual Execution Plan (Query->Include Actual Estimation Plan) then you will get the execution breakdown which will show you clearly if there is a Table Scan involved. If there is, create an index to "cover" the columns involved and it should fix your issue.
Update
Another possible solution could be to run DBCC FREEPROCCACHE to clear out any cached execution plans just in case for some reason SQL Server has picked something insane for whatever query is generated by Entity Framework.

How to use joins and averages together in Hive queries

I have two tables in hive:
Table1: uid,txid,amt,vendor Table2: uid,txid
Now I need to join the tables on txid which basically confirms a transaction is finally recorded. There will be some transactions which will be present only in Table1 and not in Table2.
I need to find out number of avg of transaction matches found per user(uid) per vendor. Then I need to find the avg of these averages by adding all the averages and divide them by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example For vendor vend1:
u1-> Avg transaction find rate = 2(matches found in both Tables,Table1 and Table2)/4(total occurrence in Table1) =0.5
u2 -> Avg transaction find rate = 1/1 = 1
Avg of avgs = 0.5+1(sum of avgs)/2(total unique users) = 0.75
Required output:
vend1,0.75
vend2,1
I can't seem to find count of both matches and occurrence in just Table1 in one hive query per user per vendor. I have reached to this query and can't find how to change it further.
SELECT A.vendor,A.uid,count(*) as totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid =A.txid group by vendor,A.uid
Any help would be great.
I think you are running into trouble with your JOIN. When you JOIN by txid and uid, you are losing the total number of uid's per group. If I were you I would assign a column of 1's to table2 and name the column something like success or transaction and do a LEFT OUTER JOIN. Then in your new table you will have a column with the number 1 in it if there was a completed transaction and NULL otherwise. You can then do a case statement to convert these NULLs to 0
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success in line 17 is the column of 1's that I put int table2 before the JOIN. If you are curious about case statements in Hive you can find them here
Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from a wrong perspective.
I made little changes in the query to get things finally done.
I didn't need to add a "success" colummn to table2. I checked B.txid in the above query instead of B.success. B.txid will be null in case a match is not found and be some value if a match is found. That checks the success & failure conditions itself without adding a new column. And then I set NULL as 0 and !NULL as 1 in the part above it. Also I changed some variable names as hive was finding it ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;

PGError: ERROR: aggregates not allowed in WHERE clause on a AR query of an object and its has_many objects

Running the following query on a has_many association. Recommendations has_many Approvals.
I am running, rails 3 and PostgreSQL:
Recommendation.joins(:approvals).where('approvals.count = ?
AND recommendations.user_id = ?', 1, current_user.id)
This is returning the following error: https://gist.github.com/1541569
The error message tells you:
aggregates not allowed in WHERE clause
count() is an aggregate function. Use the HAVING clause for that.
The query could look like this:
SELECT r.*
FROM recommendations r
JOIN approvals a ON a.recommendation_id = r.id
WHERE r.user_id = $current_user_id
GROUP BY r.id
HAVING count(a.recommendation_id) = 1
With PostgreSQL 9.1 or later it is enough to GROUP BY the primary key of a table (presuming recommendations.id is the PK). In Postgres versions before 9.1 you had to include all columns of the SELECT list that are not aggregated in the GROUP BY list. With recommendations.* in the SELECT list, that would be every single column of the table.
I quote the release notes of PostgreSQL 9.1:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause (Peter Eisentraut)
Simpler with a sub-select
Either way, this is simpler and faster, doing the same:
SELECT *
FROM recommendations r
WHERE user_id = $current_user_id
AND (SELECT count(*)
FROM approvals
WHERE recommendation_id = r.id) = 1;
Avoid multiplying rows with a JOIN a priori, then you don't have to aggregate them back.
Looks like you have a column named count and PostgreSQL is interpreting that column name as the count aggregate function. Your SQL ends up like this:
SELECT "recommendations".*
FROM "recommendations"
INNER JOIN "approvals" ON "approvals"."recommendation_id" = "recommendations"."id"
WHERE (approvals.count = 1 AND recommendations.user_id = 1)
The error message specifically points at the approvals.count:
LINE 1: ...ecommendation_id" = "recommendations"."id" WHERE (approvals....
^
I can't reproduce that error in my PostgreSQL (9.0) but maybe you're using a different version. Try double quoting that column name in your where:
Recommendation.joins(:approvals).where('approvals."count" = ? AND recommendations.user_id = ?', 1, current_user.id)
If that sorts things out then I'd recommend renaming your approvals.count column to something else so that you don't have to worry about it anymore.

Resources