Getting incomplete data when running 6 Ignite servers - join

I am running 6 Ignite servers on version 2.7.5. The problem is that when I run queries through my client API I do not get all records; only some come back. I am using a partitioned cache, and I don't want to use replicated mode. When queried with DBeaver, all records are fetched.
The following code is used to fetch the data:
public List<Long> getGroupIdsByUserId(Long createdBy) {
    final String query = "select g.groupId from groups g where g.createdBy = ? and g.isActive = 1";
    SqlFieldsQuery sql = new SqlFieldsQuery(query);
    sql.setArgs(createdBy);
    List<List<?>> rsList = groupsCache.query(sql).getAll();
    List<Long> ids = new ArrayList<>();
    for (List<?> l : rsList) {
        ids.add((Long) l.get(0));
    }
    return ids;
}
Ignite version: 2.7.5
And the join query, run through the same client query method, is:
final String query = "select distinct u.userId from groupusers gu "
        + "inner join \"GroupsCache\".groups g on gu.groupId = g.groupId "
        + "inner join \"OrganizationsCache\".organizations o on gu.organizationId = o.organizationId "
        + "inner join \"UsersCache\".users u on gu.userId = u.userId where "
        + "g.groupId = ? and "
        + "g.isActive = 1 and "
        + "gu.isActive = 1 and "
        + "gu.createdBy = ? and "
        + "o.organizationId = ? and "
        + "o.isActive = 1 and "
        + "u.isActive = 1";
For the join query, the actual number of records in the DB is 120, but through the Ignite client only 3-4 records are coming, and they are not consistent: sometimes 3 records come back, sometimes 4. And for the query
select g.groupId from groups g where g.createdBy = ? and g.isActive = 1
the actual count is 27, but I get sometimes 20 records, sometimes 19, and sometimes the complete set. Please help me with this and with collocated joins.

Most likely this would mean that your affinity is incorrect.
Apache Ignite assumes that your data has proper affinity, i.e. when joining two tables, the rows to join will always be available on the same node. This works when you either join by primary key, or by a part of the primary key which is marked as the affinity column (e.g. with the @AffinityKeyMapped annotation). There's a documentation page about affinity.
You can check that by setting the distributedJoins connection setting to true. If you see all the records after that, it means you need to fix your affinity.
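A minimal sketch of both checks, assuming the caches from the question (the AffinityCheck and GroupUserKey classes and their fields are illustrative, not from the original code):
import java.util.List;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class AffinityCheck {
    // Quick check: rerun the join with distributed joins enabled. If all 120
    // rows come back, the data is simply not collocated and affinity needs
    // fixing. (The JDBC thin driver used by DBeaver accepts the same switch
    // as a distributedJoins=true URL parameter.)
    static List<List<?>> runWithDistributedJoins(IgniteCache<?, ?> cache,
                                                 String query, Object... args) {
        SqlFieldsQuery sql = new SqlFieldsQuery(query);
        sql.setArgs(args);
        sql.setDistributedJoins(true); // slower: rows are shipped between nodes at runtime
        return cache.query(sql).getAll();
    }

    // Proper fix: collocate the joined rows. For example, key groupusers by a
    // composite key whose groupId part is the affinity column, so entries for
    // a group land on the same node as the matching groups entry.
    public static class GroupUserKey {
        private Long userId;

        @AffinityKeyMapped
        private Long groupId;
    }
}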

Related

Flink temporal join works only for a few seconds

I'm trying to implement an event time temporal join in Flink. Here's the first join table:
tEnv.executeSql("CREATE TABLE AggregatedTrafficData_Kafka (" +
"`timestamp` TIMESTAMP_LTZ(3)," +
"`area` STRING," +
"`networkEdge` STRING," +
"`vehiclesNumber` BIGINT," +
"`averageSpeed` INTEGER," +
"WATERMARK FOR `timestamp` AS `timestamp`" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'seneca.trafficdata.aggregated'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'traffic-data-aggregation-job'," +
"'format' = 'json'," +
"'json.timestamp-format.standard' = 'ISO-8601'" +
")");
The table is used as a sink for the following query:
Table aggregatedTrafficData = trafficData
        .window(Slide.over(lit(30).seconds())
                .every(lit(15).seconds())
                .on($("timestamp"))
                .as("w"))
        .groupBy($("w"), $("networkEdge"), $("area"))
        .select(
                $("w").end().as("timestamp"),
                $("area"),
                $("networkEdge"),
                $("plate").count().as("vehiclesNumber"),
                $("speed").avg().as("averageSpeed")
        );
Here's the other join table. I use Debezium to stream a Postgres table into Kafka:
tEnv.executeSql("CREATE TABLE TransportNetworkEdge_Kafka (" +
"`timestamp` TIMESTAMP_LTZ(3) METADATA FROM 'value.source.timestamp' VIRTUAL," +
"`urn` STRING," +
"`flow_rate` INTEGER," +
"PRIMARY KEY(`urn`) NOT ENFORCED," +
"WATERMARK FOR `timestamp` AS `timestamp`" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'seneca.network.transport_network_edge'," +
"'scan.startup.mode' = 'latest-offset'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'traffic-data-aggregation-job'," +
"'format' = 'debezium-json'," +
"'debezium-json.schema-include' = 'true'" +
")");
Finally here's the temporal join:
Table transportNetworkCongestion = tEnv.sqlQuery("SELECT AggregatedTrafficData_Kafka.`timestamp`, `networkEdge`, " +
"congestion(`vehiclesNumber`, `flow_rate`) AS `congestion` FROM AggregatedTrafficData_Kafka " +
"JOIN TransportNetworkEdge_Kafka FOR SYSTEM_TIME AS OF AggregatedTrafficData_Kafka.`timestamp` " +
"ON AggregatedTrafficData_Kafka.`networkEdge` = TransportNetworkEdge_Kafka.`urn`");
The problem I'm having is that the join works only for the first few seconds (after an update in the Postgres table), but I need to continuously join the first table with the Debezium one. Am I doing something wrong?
Thanks,
euks
Temporal joins using the AS OF syntax you're using require:
an append-only table with a valid event-time attribute
an updating table with a primary key and a valid event-time attribute
an equality predicate on the primary key
When Flink SQL's temporal operators are applied to event time streams, watermarks play a critical role in determining when results are produced, and when the state is cleared.
When performing a temporal join:
rows from the append-only table are buffered in Flink state until the current watermark of the join operator reaches their timestamps
for the versioned table, for each key the latest version whose timestamp precedes the join operator's current watermark is kept in state, plus any versions from after the current watermark
whenever the join operator's watermark advances, new results are produced, and state that's no longer relevant is cleared
The join operator tracks the watermarks it receives from its input channels, and its current watermark is always the minimum of these two watermarks. This is why your join stalls, and only makes progress when the flow_rate is updated.
One way to fix this would be to set the watermark for the TransportNetworkEdge_Kafka table like this:
"WATERMARK FOR `timestamp` AS " + Watermark.MAX_WATERMARK
This will set the watermark for this table/stream to the largest possible value, which will have the effect of making the watermarks from this stream irrelevant -- this stream's watermarks will never be the smallest.
This will, however, have the drawback of making the join results non-deterministic.

How to combine query results?

I have three queries that are tied together. The final output requires multiple loops over the queries. This approach works, but it seems very inefficient and overly complex to me. Here is what I have:
Query 1:
<cfquery name="qryTypes" datasource="#application.datasource#">
SELECT
t.type_id,
t.category_id,
c.category_name,
s.type_shortcode
FROM type t
INNER JOIN section s
ON s.type_id = t.type_id
INNER JOIN category c
ON c.category_id = t.category_id
WHERE t.rec_id = 45 -- This parameter is passed from form field.
ORDER BY s.type_name,c.category_name
</cfquery>
Query qryTypes produces this result set (type_id, category_id, category_name, type_shortcode):
4   11   SP   PRES
4   12   CH   PRES
4   13   MS   PRES
4   14   XN   PRES
Then I loop over qryTypes and, for each record, fetch the matching records with another query:
Query 2:
<cfloop query="qryTypes">
<cfquery name="qryLocation" datasource=#application.datasource#>
SELECT l.location_id, l.spent_amount
FROM locations l
WHERE l.location_type = '#trim(category_name)#'
AND l.nofa_id = 45 -- This is form field
AND l.location_id = '#trim(category_id)##trim(type_id)#'
GROUP BY l.location_id,l.spent_amount
ORDER BY l.location_id ASC
</cfquery>
<cfset spent_total = arraySum(qryLocation['spent_amount']) />
<cfset amount_total = 0 />
<cfloop query="qryLocation">
<cfquery name="qryFunds" datasource=#application.datasource#>
SELECT sum(budget) AS budget
FROM funds f
WHERE f.location_id= '#qryLocation.location_id#'
AND nofa_id = 45
</cfquery>
<cfscript>
if(qryFunds.budgetgt 0) {
amount_total = amount_total + qryFunds.budget;
}
</cfscript>
</cfloop>
<cfset GrandTotal = GrandTotal + spent_total />
<cfset GrandTotalad = GrandTotalad + amount_total />
</cfloop>
After the loops complete, this is the result:
CATEGORY NAME   SPENT TOTAL   AMOUNT TOTAL
SP              970927        89613
CH              4804          8759
MS              9922          21436
XN              39398         4602
Grand Total:    1025051       124410
Is there a good way to merge this together and have only one query instead of three queries and nested loops? I was wondering if this might be a good fit for a stored procedure that does all the data manipulation there. If anyone has suggestions, please let me know.
qryTypes runs once and returns X records.
qryLocation runs once per qryTypes record, so you've now run 1 + X queries.
qryFunds runs once per qryLocation row, bringing the total to 1 + X + X * Y queries.
The more data each query returns, the more queries you run; with X = 4 categories and, say, 10 locations each, that's already 45 round trips. Obviously not good.
If all you want is the final totals for each category, in a stored procedure, you could create a temp table with the joined data from qryTypes and qryLocation. Then your last qryFunds is just joined against that temp table data.
SELECT sum(budget) AS budget
FROM funds f
INNER JOIN #TEMP_TABLE t
    ON t.location_id = f.location_id
    AND nofa_id = 45
You could then get other sums off the temp table if needed. It's possible this could all be worked into a single query, but maybe this helps you get there.
Also, a stored procedure can return multiple record sets, so you can have one return the per-category amounts and a second return the grand total. This keeps all the calculations in the database, with no need for CF to be involved.
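A rough sketch of such a procedure, assuming SQL Server and the table and column names from the queries above (the procedure name and the derived location_id join are illustrative, not from the original code):
CREATE PROC GetCategoryTotals @rec_id INT, @nofa_id INT
AS
BEGIN
    -- Joined data from qryTypes and qryLocation in a single pass
    SELECT DISTINCT c.category_name, l.location_id, l.spent_amount
    INTO #TEMP_TABLE
    FROM type t
    INNER JOIN section s ON s.type_id = t.type_id
    INNER JOIN category c ON c.category_id = t.category_id
    INNER JOIN locations l
        ON l.location_type = c.category_name
        AND l.location_id = CONCAT(t.category_id, t.type_id) -- mirrors '#trim(category_id)##trim(type_id)#'
    WHERE t.rec_id = @rec_id
      AND l.nofa_id = @nofa_id;

    -- Budget per location, pre-aggregated so spent_amount is not double counted
    SELECT location_id, SUM(budget) AS budget
    INTO #FUNDS
    FROM funds
    WHERE nofa_id = @nofa_id
    GROUP BY location_id;

    -- Record set 1: per-category totals
    SELECT t.category_name,
           SUM(t.spent_amount) AS spent_total,
           SUM(ISNULL(f.budget, 0)) AS amount_total
    FROM #TEMP_TABLE t
    LEFT JOIN #FUNDS f ON f.location_id = t.location_id
    GROUP BY t.category_name;

    -- Record set 2: the grand totals
    SELECT SUM(t.spent_amount) AS grand_spent,
           SUM(ISNULL(f.budget, 0)) AS grand_amount
    FROM #TEMP_TABLE t
    LEFT JOIN #FUNDS f ON f.location_id = t.location_id;
END
On the ColdFusion side, <cfstoredproc> with two <cfprocresult> tags can then read both record sets from a single database call.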

Crystal Reports External Join

I have 2 data sources that I am querying, then joining in Crystal Reports on a key string with a Left Outer Join. The intent of the report is to identify purchases made that were not processed. The issue is that CR refuses to show the matching right query records.
Data Source 1: an Excel worksheet on my local machine containing raw credit card purchases ("left table").
Data Source 2: two subqueries from a hosted Oracle database combined with a UNION, containing processed credit card transactions ("right table").
Key string: the last 4 digits of a credit card number concatenated with the date-time of the transaction, e.g. "223402-06-2019 04:15:00".
The queries return proper values when executed separately. I have verified that many records returned for the Left table actually do have matching Right table records that are not displayed. I did this using a separate report showing only the Right table query results and manually searching for different key strings.
I'm completely buffaloed and any assistance would be appreciated.
The SQL from Crystal Reports:
I:\Dept\DCS\MPOOL\Fleet Management Data\M5\M5 Automation Data Tables\ComData Transaction Data.xls
SELECT DISTINCT CD.`First Name` AS UNIT_NO,
CD.`HIERARCHY LEVEL3` AS USE_DEPT,
DATEVALUE(MONTH(CD.`Transaction Date`) & "/" & DAY(CD.`Transaction Date`) & "/" & YEAR(CD.`Transaction Date`)) + TIMEVALUE(HOUR(CD.`Transaction Time`) & ":" & MINUTE(CD.`Transaction Time`) & ":" & SECOND(CD.`Transaction Time`)) AS TRANS_DT,
CD.`Odometer` AS ODOMETER,
CD.`Card Number` AS CARD_NO,
RIGHT(CD.`Card Number`, 4) & FORMAT(DATEVALUE(MONTH(CD.`Transaction Date`) & "/" & DAY(CD.`Transaction Date`) & "/" & YEAR(CD.`Transaction Date`)) + TIMEVALUE(HOUR(CD.`Transaction Time`) & ":" & MINUTE(CD.`Transaction Time`) & ":" & SECOND(CD.`Transaction Time`)), "mm-dd-yyyy hh:mm:ss") AS KEYSTRING
FROM `Sheet1$` CD
WHERE ISDATE(CD.`Transaction Date`) AND CD.`Transaction Date` >= FORMAT('02/01/2019', 'mm-dd-yyyy') AND CD.`Transaction Date` <= FORMAT('02/15/2019', 'mm-dd-yyyy')
EXTERNAL JOIN Command.KEYSTRING={?m5oksr: Command_1.KEYSTRING}
m5oksr
SELECT DISTINCT TCC.UNIT_NO,
VUDC.USING_DEPT_NO AS USE_DEPT,
TCC.ISSUE_DT + 2/24 AS TRANS_DT,
TCC.NEW_METER AS ODOMETER,
'COMP' AS STATUS,
TCC.CARD_NO AS CARD_NO,
SUBSTR(TCC.CARD_NO, 16, 4) || TO_CHAR(TCC.ISSUE_DT + 2/24, 'MM-DD-YYYY HH24:MI:SS') AS KEYSTRING
FROM MFIVE.VIEW_TRIPCARD_COMPLETED_TRANS TCC
LEFT OUTER JOIN VIEW_UNIT_DEPT_COMP VUDC ON TCC.COMPANY = VUDC.COMPANY and TCC.UNIT_NO = VUDC.UNIT_NO
WHERE TCC.ISSUE_DT + 2/24 >= TO_DATE('02/01/2019 00:00:00', 'MM/DD/YYYY HH24:MI:SS') AND TCC.ISSUE_DT + 2/24 <= TO_DATE('02/15/2019 11:59:59', 'MM/DD/YYYY HH24:MI:SS')
UNION
SELECT DISTINCT IR.FIELD2 as UNIT_NO,
VUDC.USING_DEPT_NO AS USE_DEPT,
TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24 AS TRANS_DT,
IR.METER as ODOMETER,
'FAIL' AS STATUS,
NVL2(IR.FIELD27, CONCAT('XXXX-XXXX-XXXX-', SUBSTR(IR.FIELD27,-4)),'') as CARD_NO,
SUBSTR(NVL2(IR.FIELD27, CONCAT('XXXX-XXXX-XXXX-', SUBSTR(IR.FIELD27,-4)),''), 16, 4) || TO_CHAR(TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24, 'MM-DD-YYYY HH24:MI:SS') AS KEYSTRING
FROM INTERFACE_REJECT IR
INNER JOIN INTERFACE_STAT ST ON IR.COMPANY = ST.COMPANY and IR.STAT_ID = ST.STAT_ID
LEFT OUTER JOIN EMP_MAIN E ON IR.COMPANY = E.COMPANY AND IR.FIELD29 = E.TRIPCARD_PIN
LEFT OUTER JOIN VIEW_UNIT_DEPT_COMP VUDC ON IR.COMPANY = VUDC.COMPANY and IR.FIELD2 = VUDC.UNIT_NO
WHERE LENGTH(IR.FIELD1) = 19 AND ST.INTERFACE_NAME = 'M5-TRIP-CARD-INTF' AND TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24 >=TO_DATE('02/01/2019 00:00:00', 'MM/DD/YYYY HH24:MI:SS') AND TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24 <= TO_DATE('02/15/2019 11:59:59', 'MM/DD/YYYY HH24:MI:SS')
EXTERNAL JOIN Command_1.KEYSTRING={?I:\Dept\DCS\MPOOL\Fleet Management Data\M5\M5 Automation Data Tables\ComData Transaction Data.xls: Command.KEYSTRING}
Are you sure the join works? If the join doesn't work, you will get nulls, and my guess is that this is what is happening. Try using an INNER JOIN instead of a LEFT JOIN and check whether any rows are returned. If records are returned, you may need to cast the values to the same type and trim them. It is possible that the value returned by Excel has extra spaces or a different value type, which Crystal converts incorrectly.
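For example, on the Oracle side you could normalize the key string before it is compared; the TRIM here is illustrative, and which normalization is actually needed depends on what the Excel driver returns:
-- Strip any padding from the card-number fragment before concatenating
SELECT TRIM(SUBSTR(TCC.CARD_NO, 16, 4))
       || TO_CHAR(TCC.ISSUE_DT + 2/24, 'MM-DD-YYYY HH24:MI:SS') AS KEYSTRING
FROM MFIVE.VIEW_TRIPCARD_COMPLETED_TRANS TCC
The same TRIM would then have to be applied to the Excel-side key expression so both sides stay comparable.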

Entity Framework 6 vs Entity Framework Core Raw Sql

Entity Framework 6 example writing SQL queries for non-entity types:
context.Database.SqlQuery<string>(" ; with tempSet as " +
"(select " +
In Entity Framework 6 I can write the following query with SqlQuery. How can I run the same query with Entity Framework Core?
; with tempSet as
(
select
transitionDatetime = l.transitionDate,
gateName = g.gateName,
staffid = l.staffid,
idx = row_number() over(partition by l.staffid order by l.transitionDate) -
row_number() over(partition by l.staffid, cast(l.transitionDate as date) order by l.transitionDate),
transitionDate = cast(l.transitionDate as date)
from
logs l
inner join
staff s on l.staffid = s.staffid and staffType = 'Student'
join
gate g on g.gateid = l.gateid
), groupedSet as
(
select
t1.*,
FirstGateName = t2.gatename,
lastGateName = t3.gatename
from
(select
staffid,
mintransitionDate = min(transitionDatetime),
maxtransitionDate = case when count(1) > 1 then max(transitionDatetime) else null end,
transitionDate = max(transitionDate),
idx
from
tempSet
group by
staffid, idx) t1
left join
tempSet t2 on t1.idx = t2.idx
and t1.staffid = t2.staffid
and t1.mintransitionDate = t2.transitionDatetime
left join
tempSet t3 on t1.idx = t3.idx
and t1.staffid = t3.staffid
and t1.maxtransitionDate = t3.transitionDatetime
where
t1.transitionDate between @startDate and @endDate
)
select
t.*,
g.mintransitionDate,
g.maxtransitionDate,
g.FirstGateName,
g.LastGateName
from
groupedSet g
right join
(select
d,
staffid
from
(select top (select datediff(d, @startDate, @endDate))
d = dateadd(d, row_number() over(order by (select null)) - 1, @startDate)
from
sys.objects o1
cross join
sys.objects o2) tally
cross join
staff
where
staff.stafftype = 'Student') t on cast(t.d as date) = cast(g.transitionDate as date)
and t.staffid = g.staffid
order by
t.d asc, t.staffid asc
How can I do this with Entity Framework Core? Can it write SQL queries for non-entity types?
I have used FromSql off of the context directly when it is a single table. I realize this is not quite what you want, but it builds toward it.
var blogs = context.Blogs
    .FromSql("SELECT * FROM dbo.Blogs")
    .ToList();
However, a case like yours is complex: a join of multiple tables and CTEs. I would suggest you create a custom object, a POCO in C# code, and assign it a DbSet<> in your model builder. Then you can do something like this:
var custom = context.YOURCUSTOMOBJECT.FromSql("(crazy long SQL)").ToList();
If your return matches the type, it may work. I did something similar and just wrapped my whole method in a procedure. However, in EF Core you need to create a migration manually and then add the creation of the proc in the 'Up' method of the migration if you wish to deploy it. If you go that route, your proc needs to already exist on the server, or you deploy it as described above and do something similar to this:
context.pGetResult.FromSql("pGetResult @p0, @p1, @p2", parameters: new[] { "Flight", null, null }).ToList()
The important thing to note is that you need to create a DbSet object in your model context first, so the context you are calling knows the well-typed object it is returning from direct SQL. It must match EXACTLY the columns and types being returned.
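A minimal sketch of that pairing, with placeholder names (TransitionSummary, SchoolContext, and the columns are mine; shape them to whatever your SQL actually returns):
using System;
using System.ComponentModel.DataAnnotations;
using Microsoft.EntityFrameworkCore;

// Placeholder result type: properties must match the SQL's result columns
// exactly, in name and type.
public class TransitionSummary
{
    [Key] // a distinct property marked as the key, as described in the steps below
    public long StaffId { get; set; }
    public DateTime? MinTransitionDate { get; set; }
    public DateTime? MaxTransitionDate { get; set; }
    public string FirstGateName { get; set; }
    public string LastGateName { get; set; }
}

public class SchoolContext : DbContext
{
    // The DbSet that FromSql is called on
    public DbSet<TransitionSummary> TransitionSummaries { get; set; }
}

// Usage:
// var rows = context.TransitionSummaries.FromSql("(crazy long SQL)").ToList();
(On EF Core 2.1+ a keyless query type, registered with modelBuilder.Query<...>(), can serve the same purpose without the [Key] property.)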
EDIT 3-8
To be sure, here are the steps written out:
A POCO class that has a Data Annotation of [Key] above a distinct property. This class matches the columns your procedure returns exactly.
A DbSet<(POCO)> in your context.
Create a new migration with: dotnet ef migrations add yourname
Inspect the new migration scripts. If anything generating a table for the POCO gets created, erase it. You don't need it; this is for a result set, not storage in the database.
Change the 'Up' section to manually script your SQL to the database, something like below. Also ensure you drop the proc in the 'Down' section in case you ever want to revert:
protected override void Up(MigrationBuilder migrationBuilder)
{
    migrationBuilder.Sql(
        "create proc POCONameAbove" +
        "(@param1 varchar(16), @param2 int) as " +
        "BEGIN " +
        "Select * " +
        "From Table " +
        "Where param1 = @param1 " +
        "  AND param2 = @param2 " +
        "END"
    );
}

protected override void Down(MigrationBuilder migrationBuilder)
{
    migrationBuilder.Sql("drop proc POCONameAbove");
}
So now you essentially hijacked the migration to do explicitly what you want. Test it out by deploying the changes to the database with "dotnet ef database update 'yourmigrationname'".
Observe the database, it should have your proc if the database update succeeded and you did not accidentally create a table in your migration.
The section you said you didn't understand is what gets the data in EF Core. Let's break it up:
context.pGetResult.FromSql("pGetResult @p0, @p1, @p2", parameters: new[] { "Flight", null, null }).ToList()
context.pGetResult = using the DbSet you made up. It keeps you well typed to your proc.
.FromSql( = telling the context you are going to run some SQL directly in the string.
"pGetResult @p0, @p1, @p2" = naming a procedure in the database that takes three params.
, parameters: new[] { "Flight", null, null } = an array of objects in the order the parameters are needed. You need to match the SQL types of course, but provided that is okay it will be fine.
.ToList() = I want a collection; my go-to is always ToList when debugging something.
Hope that helps. Once I learned this would work, it opened up a whole other world of what I could do. You can take a look at an unfinished project of mine for reference. I hard-coded a controller to show the proc with preset values, but it could easily be changed to inject them through the API.
https://github.com/djangojazz/EFCoreTest/tree/master/EFCoreCodeFirstScaffolding

T-SQL Matching process using only provided fields

I am trying to write a stored procedure to match lists of physicians with existing records in our database based on the information provided by our clients. Currently we use MS Access to join manually on the given identifiers, but this process tends to be tedious and overly time-consuming, hence the desire to automate it.
What I am trying to do is create a temporary table that contains all columns that could potentially be matched on, and then run through a series of matching queries using the fields as join conditions to get our identifier to pass back.
For instance, the available matching fields are Name, NPI, MedicaidNum, and DOB so I would write something like:
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy
ON Phy.Name = Temp.Name
AND Phy.NPI = Temp.NPI
AND Phy.MedicaidNum = Temp.MedicaidNum
AND Phy.DOB = Temp.DOB
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy
ON Phy.Name = Temp.Name
AND Phy.NPI = Temp.NPI
AND Phy.MedicaidNum = Temp.MedicaidNum
WHERE Temp.RECID IS NULL
...etc
The problem lies in the fact that there are about 15 different identifiers that could potentially be provided, and clients usually supply only three or four per record set. So by the time null values are accounted for, there are potentially over a hundred different queries that need to be written to match on only half a dozen provided fields.
I am thinking that there may be a way to pass in a variable (or variables) which indicate which columns are actually provided with the data set, and then write a dynamic join statement and/or where clause, but I do not know if this will work in T-SQL. Something like:
DECLARE @Field1
DECLARE @Field2
....
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy
ON Phy.@Field1 = Temp.@Field1
AND Phy.@Field2 = Temp.@Field2
This way I would limit the number of queries I need to write, and only need to worry about the number of fields I am matching, rather than which specific ones. Perhaps there is a better approach to this problem, however?
You can do something like this, but be warned: this method is extremely prone to SQL injection. It's just to illustrate the principle of how to do something like this; I leave it up to you what you want to do with it. For this code, I made the proc take three fields:
CREATE PROC DynamicUpdateSQLFromFieldList @Field1 VARCHAR(50) = NULL,
                                          @Field2 VARCHAR(50) = NULL,
                                          @Field3 VARCHAR(50) = NULL,
                                          @RunMe BIT = 0
AS
BEGIN
    DECLARE @SQL AS VARCHAR(1000);
    -- Build one "Phy.col = Temp.col" predicate per supplied field; STUFF
    -- strips the leading AND so any mix of NULL/non-NULL fields stays valid.
    SELECT @SQL = 'UPDATE Temp
    SET Temp.RECID = Phy.RECID
    FROM TempTable Temp
    INNER JOIN Physicians Phy ON ' +
    STUFF(COALESCE(' AND Phy.' + @Field1 + ' = Temp.' + @Field1, '') +
          COALESCE(' AND Phy.' + @Field2 + ' = Temp.' + @Field2, '') +
          COALESCE(' AND Phy.' + @Field3 + ' = Temp.' + @Field3, ''),
          1, 5, '') + ';';

    IF @RunMe = 0
        SELECT @SQL AS SQL;
    ELSE
        EXEC(@SQL);
END
I've added a debug-mode flag so you can see the SQL if you don't want to run it. So, for example, if you run:
EXEC DynamicUpdateSQLFromFieldList @Field1='col1', @Field2='col2', @Field3='col3'
or
EXEC DynamicUpdateSQLFromFieldList @Field1='col1', @Field2='col2', @Field3='col3', @RunMe=0
the SQL produced will be:
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp INNER JOIN Physicians Phy
ON Phy.col1 = Temp.col1 AND
Phy.col2 = Temp.col2 AND
Phy.col3 = Temp.col3;
If you run this line:
EXEC DynamicUpdateSQLFromFieldList @Field1='col1', @Field2='col2', @Field3='col3', @RunMe=1
It will perform the update. If you wanted it to be more secure, you could whitelist the incoming field names against the sys tables to make sure the columns actually exist in each table before you execute any code.
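A sketch of that whitelist check, to run inside the proc before @SQL is built (the error message text is illustrative):
-- Reject any supplied name that is not a real column of BOTH tables.
IF EXISTS (SELECT f.name
           FROM (VALUES (@Field1), (@Field2), (@Field3)) AS f(name)
           WHERE f.name IS NOT NULL
             AND (f.name NOT IN (SELECT c.name FROM sys.columns c
                                 WHERE c.object_id = OBJECT_ID('Physicians'))
                  OR f.name NOT IN (SELECT c.name FROM sys.columns c
                                    WHERE c.object_id = OBJECT_ID('TempTable'))))
BEGIN
    RAISERROR('Unknown column passed to DynamicUpdateSQLFromFieldList.', 16, 1);
    RETURN;
END
Wrapping each validated name in QUOTENAME() when you concatenate adds one more layer of protection.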
