Solr pass value between subquery and upper query - join

I have to perform the following SQL query on a Solr collection:
select c.CONTRACT
from contracts c
where c.LIMIT_TYPE = 4
and c.TRANSACTION_DATE_DB = (select max(TRANSACTION_DATE_DB) from contracts c2 where c.CONTRACT = c2.CONTRACT and c2.LIMIT_TYPE = 4)
and c.STATUS = 'ENABLED'
group by c.CONTRACT
order by c.CONTRACT desc;
The objective of the query is the following:
the subquery find, for each contract, the record with last transaction date for limit type 4
the upper query checks if, for each contract, if the record returned has status enabled.
I've seen that there are two possible ways to perform this query using solr: join and subquery
Here below I'll paste my 100th attempt using subquery:
http://localhost:8983/solr/contracts_collection/select?
q=*:*
&indent=true
fq=LIMIT_TYPE:4&
fq=STATUS:ENABLED
&fl=*,mySub:[subquery]
&mySub.q=*
&mySub.fq={!term f=CONTRACT_IDENTITY v=$row.CONTRACT_IDENTITY}
&mySub.fq=LIMIT_TYPE:4
&mySub.fl=CONTRACT_IDENTITY,TRANSACTION_DATE_DB,LIMIT_TYPE
&mySub.sort=TRANSACTION_DATE_DB+desc
&mySub.rows=1
The subquery is correct. However, for each contract, the result contains all the records for that contract. In fact, I'm missing here how to make the upper query return, for each contract, only the record having the same transaction_date_db returned by the subquery.
I tried to add the following condition:
&mySub.fq={!term f=TRANSACTION_DATE_DB v=$row.TRANSACTION_DATE_DB}
But still, I don't have the desired result.
I've also saw join, but I've failed miserably :P
Does anyone have any suggestion? Going crazy for two days now.
Thanks a lot

Related

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One work around is to use not exists.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got cartesian join because this is what Hive does in this case. vegetables table is very small (just one row) and it is being broadcasted to perform the cross (most probably map-join, check the plan) join. Hive does cross (map) join first and then applies filter. Explicit left join syntax with filter as #VamsiPrabhala said will force to perform left join, but in this case it works the same, because the table is very small and CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.

Properly format an ActiveRecord query with a subquery in Postgres

I have a working SQL query for Postgres v10.
SELECT *
FROM
(
SELECT DISTINCT ON (title) products.title, products.*
FROM "products"
) subquery
WHERE subquery.active = TRUE AND subquery.product_type_id = 1
ORDER BY created_at DESC
With the goal of the query to do a distinct based on the title column, then filter and order them. (I used the subquery in the first place, as it seemed there was no way to combine DISTINCT ON with ORDER BY without a subquery.
I am trying to express said query in ActiveRecord.
I have been doing
Product.select("*")
.from(Product.select("DISTINCT ON (product.title) product.title, meals.*"))
.where("subquery.active IS true")
.where("subquery.meal_type_id = ?", 1)
.order("created_at DESC")
and, that works! But, it's fairly messy with the string where clauses in there. Is there a better way to express this query with ActiveRecord/Arel, or am I just running into the limits of what ActiveRecord can express?
I think the resulting ActiveRecord call can be improved.
But I would start improving with original SQL query first.
Subquery
SELECT DISTINCT ON (title) products.title, products.* FROM products
(I think that instead of meals there should be products?) has duplicate products.title, which is not necessary there. Worse, it misses ORDER BY clause. As PostgreSQL documentation says:
Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first
I would rewrite sub-query as:
SELECT DISTINCT ON (title) * FROM products ORDER BY title ASC
which gives us a call:
Product.select('DISTINCT ON (title) *').order(title: :asc)
In main query where calls use Rails-generated alias for the subquery. I would not rely on Rails internal convention on aliasing subqueries, as it may change anytime. If you do not take this into account you could merge these conditions in one where call with hash-style argument syntax.
The final result:
Product.select('*')
.from(Product.select('DISTINCT ON (title) *').order(title: :asc))
.where(subquery: { active: true, meal_type_id: 1 })
.order('created_at DESC')

Linq Query Timing Out

I have this query that uses the DBContext entities I created.
var referral = entities.StudentReferrals.Where(x => x.ReferralID == p && x.SchoolYear == year).FirstOrDefault();
When I remove x.SchoolYear == year the query works fine, but with it my query times out. The opposite of what I would expect to happen, I would expect the more you narrow a query down via Where clause constraints the less likely it would time out.
SchoolYear is a field in the query and the query itself is valid, when I perform the query within SQL Studio Manager it returns results in less than a second.
My confusion is, why would adding a constraint to the Where clause cause a query to time out??
x.SchoolYear and year are both strings.
The full query is...
SELECT [Extent1].[BirthDate] AS [BirthDate],
[Extent1].[LegalFirstName] AS [LegalFirstName],
[Extent1].[LegalLastName] AS [LegalLastName],
[Extent1].[PreferredFirstName] AS [PreferredFirstName],
[Extent1].[PreferredLastName] AS [PreferredLastName],
[Extent1].[StudentNumber] AS [StudentNumber],
[Extent1].[LegacyStudentNumber] AS [LegacyStudentNumber],
[Extent1].[TranscriptSchoolCode] AS [TranscriptSchoolCode],
[Extent1].[OEN] AS [OEN],
[Extent1].[StatusIndicator] AS [StatusIndicator],
[Extent1].[SchoolYear] AS [SchoolYear],
[Extent1].[ReferralID] AS [ReferralID],
[Extent1].[PersonID] AS [PersonID],
[Extent1].[Active] AS [Active],
[Extent1].[ServiceTypeID] AS [ServiceTypeID],
[Extent1].[IsSchoolActive] AS [IsSchoolActive],
[Extent1].[Principal] AS [Principal],
[Extent1].[SchoolName] AS [SchoolName],
[Extent1].[SchoolCode] AS [SchoolCode],
[Extent1].[NearNorthSchoolCode] AS [NearNorthSchoolCode],
[Extent1].[TranscriptSchoolPrincipal] AS [TranscriptSchoolPrincipal],
[Extent1].[TranscriptSchoolName] AS [TranscriptSchoolName],
[Extent1].[TranscriptNearNorthSchoolCode] AS [TranscriptNearNorthSchoolCode],
[Extent1].[GuardianFirstName] AS [GuardianFirstName],
[Extent1].[GuardianLastName] AS [GuardianLastName],
[Extent1].[AreaCode] AS [AreaCode],
[Extent1].[ContactNo] AS [ContactNo],
[Extent1].[ReferredByFirstName] AS [ReferredByFirstName],
[Extent1].[ReferredByLastName] AS [ReferredByLastName],
[Extent1].[ReferredDate] AS [ReferredDate],
[Extent1].[Reason] AS [Reason],
[Extent1].[gender] AS [gender],
[Extent1].[grade] AS [grade],
[Extent1].[HomeroomTeacher] AS [HomeroomTeacher],
[Extent1].[IntakeTeamMember] AS [IntakeTeamMember],
[Extent1].[IntakeMemberID] AS [IntakeMemberID]
FROM (SELECT [StudentReferrals].[BirthDate] AS [BirthDate],
[StudentReferrals].[LegalFirstName] AS [LegalFirstName],
[StudentReferrals].[LegalLastName] AS [LegalLastName],
[StudentReferrals].[PreferredFirstName] AS [PreferredFirstName],
[StudentReferrals].[PreferredLastName] AS [PreferredLastName],
[StudentReferrals].[gender] AS [gender],
[StudentReferrals].[StudentNumber] AS [StudentNumber],
[StudentReferrals].[LegacyStudentNumber] AS [LegacyStudentNumber],
[StudentReferrals].[TranscriptSchoolCode] AS [TranscriptSchoolCode],
[StudentReferrals].[OEN] AS [OEN],
[StudentReferrals].[StatusIndicator] AS [StatusIndicator],
[StudentReferrals].[SchoolYear] AS [SchoolYear],
[StudentReferrals].[grade] AS [grade],
[StudentReferrals].[ReferralID] AS [ReferralID],
[StudentReferrals].[PersonID] AS [PersonID],
[StudentReferrals].[Active] AS [Active],
[StudentReferrals].[ServiceTypeID] AS [ServiceTypeID],
[StudentReferrals].[IsSchoolActive] AS [IsSchoolActive],
[StudentReferrals].[Principal] AS [Principal],
[StudentReferrals].[SchoolName] AS [SchoolName],
[StudentReferrals].[SchoolCode] AS [SchoolCode],
[StudentReferrals].[NearNorthSchoolCode] AS [NearNorthSchoolCode],
[StudentReferrals].[TranscriptSchoolPrincipal] AS [TranscriptSchoolPrincipal],
[StudentReferrals].[TranscriptSchoolName] AS [TranscriptSchoolName],
[StudentReferrals].[TranscriptNearNorthSchoolCode] AS [TranscriptNearNorthSchoolCode],
[StudentReferrals].[GuardianFirstName] AS [GuardianFirstName],
[StudentReferrals].[GuardianLastName] AS [GuardianLastName],
[StudentReferrals].[AreaCode] AS [AreaCode],
[StudentReferrals].[ContactNo] AS [ContactNo],
[StudentReferrals].[ReferredByFirstName] AS [ReferredByFirstName],
[StudentReferrals].[ReferredByLastName] AS [ReferredByLastName],
[StudentReferrals].[ReferredDate] AS [ReferredDate],
[StudentReferrals].[IntakeTeamMember] AS [IntakeTeamMember],
[StudentReferrals].[IntakeMemberID] AS [IntakeMemberID],
[StudentReferrals].[Reason] AS [Reason],
[StudentReferrals].[HomeroomTeacher] AS [HomeroomTeacher]
FROM [dbo].[StudentReferrals] AS [StudentReferrals]) AS [Extent1]
WHERE ([Extent1].[ReferralID] = #p__linq__0) AND ([Extent1].[SchoolYear] = #p__linq__1)
Here is the StudentReferral definition...
SELECT TOP (100) PERCENT p.person_id AS PersonID, p.birth_date AS BirthDate, p.legal_first_name AS LegalFirstName, p.legal_surname AS LegalLastName, p.preferred_first_name AS PreferredFirstName,
p.preferred_surname AS PreferredLastName, p.gender, p.student_no AS StudentNumber, p.legacy_student_number AS LegacyStudentNumber, p.transcript_school_code AS TranscriptSchoolCode,
p.oen_number AS OEN, s.status_indicator_code AS StatusIndicator, s.school_year AS SchoolYear, s.grade, CAST(CASE WHEN PATINDEX('%[^A-Za-z]%', s.Grade) = 0 THEN 1 ELSE CASE WHEN CAST(s.Grade AS int)
< 9 THEN 1 ELSE 0 END END AS bit) AS IsElementary, t.SchoolName, t.SchoolCode, t.NearNorthSchoolCode, pg.person_id AS GuardianID, pg.legal_first_name AS GuardianFirstName,
pg.legal_surname AS GuardianLastName, pt.area_code AS AreaCode, pt.phone_no AS ContactNo, pt.email_account AS Email
FROM Trillium.dbo.persons AS p INNER JOIN
Trillium.dbo.student_registrations AS s ON s.person_id = p.person_id INNER JOIN
dbo.Schools AS t ON t.SchoolCode = s.school_code INNER JOIN
NNDSB_AD_Routines.dbo.Students_Trillium_Guardians AS g ON s.person_id = g.student_person_id INNER JOIN
Trillium.dbo.persons AS pg ON g.contact_person_id = pg.person_id INNER JOIN
Trillium.dbo.person_telecom AS pt ON pg.person_id = pt.person_id
WHERE (s.status_indicator_code IN ('Active', 'PreReg')) AND (pt.telecom_type_name = 'home')
GROUP BY p.person_id, p.birth_date, p.legal_first_name, p.legal_surname, p.preferred_first_name, p.preferred_surname, p.gender, p.student_no, p.legacy_student_number, p.transcript_school_code, p.oen_number,
s.status_indicator_code, s.school_year, s.grade, CAST(CASE WHEN PATINDEX('%[^A-Za-z]%', s.Grade) = 0 THEN 1 ELSE CASE WHEN CAST(s.Grade AS int) < 9 THEN 1 ELSE 0 END END AS bit), t.SchoolName,
t.SchoolCode, t.NearNorthSchoolCode, pg.person_id, pg.legal_first_name, pg.legal_surname, pt.area_code, pt.phone_no, pt.email_account, g.primary_contact_priority
ORDER BY g.primary_contact_priority
I can almost guarantee that the query that EF produces and the query you're executing in SSMS are not the exact same SELECT statement. You probably wrote something like what Stephen Byrne has in his answer, i.e.
SELECT * from StudentReferrals WHERE ReferallID=1 AND SchoolYear='2015'
Right off the bat this query doesn't have a TOP qualifier on it which your EF query probably will due to the presence of the FirstOrDefault call.
Your first step should be to use something like SQL Profiler and grab the actual query that EF is generating. It's possible that with that query the optimizer is choosing to do a table scan because of the type of query that is being generated.
This likely won't make any difference, but you could also try rewriting your query as:
var referral = entities.StudentReferrals.FirstOrDefault(x => x.ReferralID == p && x.SchoolYear == year);
As an example, when I write the following query against my database:
OrganizationalNodes.FirstOrDefault(on => on.Name == "Justice League")
EF generates the following SQL:
SELECT
[Limit1].[C1] AS [C1],
[Limit1].[Id] AS [Id],
-- columns omitted for brevity
FROM ( SELECT TOP (1)
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
-- columns omitted for brevity
'0X0X' AS [C1]
FROM [dbo].[OrganizationalItems] AS [Extent1]
INNER JOIN [dbo].[OrganizationalNodes] AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]
WHERE N'Justice League' = [Extent1].[Name]
) AS [Limit1]
Well, to answer the question
why would adding a constraint to the Where clause cause a query to time out
The most likely cause is that you have a lot of data in the table, but no index covers the SchoolYear column. Therefore when you include in in a WHERE clause, this causes a Table Scan (because every row has to be checked to see if it should be included or not in the result set)
If you use SQL Server Management Studio and write the query manually for e.g
SELECT * from StudentReferrals WHERE ReferallID=1 AND SchoolYear='2015'
And then include the actual Execution Plan (Query->Include Actual Estimation Plan) then you will get the execution breakdown which will show you clearly if there is a Table Scan involved. If there is, create an index to "cover" the columns involved and it should fix your issue.
Update
Another possible solution could be to run DBCC FREEPROCCACHE to clear out any cached execution plans just in case for some reason SQL Server has picked something insane for whatever query is generated by Entity Framework.

Using distinct in a join

I'm still a novice at SQL and I need to run a report which JOINs 3 tables. The third table has duplicates of fields I need. So I tried to join with a distinct option but hat didn't work. Can anyone suggest the right code I could use?
My Code looks like this:
SELECT
C.CUSTOMER_CODE
, MS.SALESMAN_NAME
, SUM(C.REVENUE_AMT)
FROM C_REVENUE_ANALYSIS C
JOIN M_CUSTOMER MC ON C.CUSTOMER_CODE = MC.CUSTOMER_CODE
/* This following JOIN is the issue. */
JOIN M_SALESMAN MS ON MC.SALESMAN_CODE = (SELECT SALESMAN_CODE FROM M_SALESMAN WHERE COMP_CODE = '00')
WHERE REVENUE_DATE >= :from_date
AND REVENUE_DATE <= :to_date
GROUP BY C.CUSTOMER_CODE, MS.SALESMAN_NAME
I also tried a different variation to get a DISTINCT.
/* I also tried this variation to get a distinct */
JOIN M_SALESMAN MS ON MC.SALESMAN_CODE =
(SELECT distinct(SALESMAN_CODE) FROM M_SALESMAN)
Please can anyone help? I would truly appreciate it.
Thanks in advance.
select distinct
c.customer_code,
ms.salesman_code,
SUM(c.revenue_amt)
FROM
c_revenue c,
m_customer mc,
m_salesman ms
where
c.customer_code = mc.customer_code
AND mc.salesman_code = ms.salesman_code
AND ms.comp_code = '00'
AND Revenue_Date BETWEEN (from_date AND to_date)
group by
c.customer_code, ms.salesman_name
The above will return you any distinct combination of Customer Code, Salesman Code and SUM of Revenue Amount where the c.CustomerCode matches an mc.customer_code AND that same mc record matches an ms.salesman_code AND that ms record has a comp_code of '00' AND the Revenue_Date is between the from and to variables. Then, the whole result will be grouped by customer code and salesman name; the only thing that will cause duplicates to appear is if the SUM(revenue) is somehow different.
To explain, if you're just doing a straight JOIN, you don't need the JOIN keywords. I find it tends to convolute things; you only need them if you're doing an "odd" join, like an LEFT/RIGHT join. I don't know your data model so the above MIGHT still return duplicates but, if so, let me know.

How do I query on a subset of ActiveModel records?

I've rewritten this question as my previous explanation was causing confusion.
In the SQL world, you have an initial record set that you apply a query to. The output of this query is the result set. Generally, the initial record set is an entire table of records and the result set is the records from the initial record set that match the query ruleset.
I have a use case where I need my application to occasionally operate on only a subset of records in a table. If a table has 10,000 records in it, I'd like my application to behave like only the first 1,000 records exist. These should be the same 1,000 records each time. In other words, I want the initial record set to be the first 1,000 devices in a table (when ordered by primary key), and the result set the resulting records from these first 1,000 devices.
Some solutions have been proposed, and it's revealed that my initial description was not very clear. To be more explicit, I am not trying to implement pagination. I'm also not trying to limit the number of results I receive (which .limit(1,000) would indeed achieve).
Thanks!
This is the line in your question that I don't understand:
This causes issues though with both of the calls, as limit limits the results of the query, not the database rows that the query is performed on.
This is not a Rails thing, this is a SQL thing.
Device.limit(n) runs SELECT * FROM device LIMIT n
Limit always returns a subset of the queried result set.
Would first(n) accomplish what you want? It will both order the result set ascending by the PK and limit the number of results returned.
SQL Statements can be chained together. So if you have your subset, you can then perform additional queries with it.
my_subset = Device.where(family: "Phone")
# SQL: SELECT * FROM Device WHERE `family` = "Phone"
my_results = my_subset.where(style: "Touchscreen")
# SQL: SELECT * FROM Device WHERE `family` = "Phone" AND `style` = "Touchscreen"
Which can also be written as:
my_results = Device.where(family: "Phone").where(style: "Touchscreen")
my_results = Device.where(family: "Phone", style: "Touchscreen")
# SQL: SELECT * FROM Device WHERE `family` = "Phone" AND `style` = "Touchscreen"
From your question, if you'd like to select the first 1,000 rows (ordered by primary key, pkey) and then query against that, you'll need to do:
my_results = Device.find_by_sql("SELECT *
FROM (SELECT * FROM devices ORDER BY pkey ASC LIMIT 1000)
WHERE `more_searching` = 'happens here'")
You could specifically ask for a set of IDs:
Device.where(id: (1..4).to_a)
That will construct a WHERE clause like:
WHERE id IN (1,2,3,4)

Resources