Flink temporal join works only for a few seconds

I'm trying to implement an event time temporal join in Flink. Here's the first join table:
tEnv.executeSql("CREATE TABLE AggregatedTrafficData_Kafka (" +
"`timestamp` TIMESTAMP_LTZ(3)," +
"`area` STRING," +
"`networkEdge` STRING," +
"`vehiclesNumber` BIGINT," +
"`averageSpeed` INTEGER," +
"WATERMARK FOR `timestamp` AS `timestamp`" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'seneca.trafficdata.aggregated'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'traffic-data-aggregation-job'," +
"'format' = 'json'," +
"'json.timestamp-format.standard' = 'ISO-8601'" +
")");
The table is used as a sink for the following query:
Table aggregatedTrafficData = trafficData
.window(Slide.over(lit(30).seconds())
.every(lit(15).seconds())
.on($("timestamp"))
.as("w"))
.groupBy($("w"), $("networkEdge"), $("area"))
.select(
$("w").end().as("timestamp"),
$("area"),
$("networkEdge"),
$("plate").count().as("vehiclesNumber"),
$("speed").avg().as("averageSpeed")
);
Here's the other join table. I use Debezium to stream a Postgres table into Kafka:
tEnv.executeSql("CREATE TABLE TransportNetworkEdge_Kafka (" +
"`timestamp` TIMESTAMP_LTZ(3) METADATA FROM 'value.source.timestamp' VIRTUAL," +
"`urn` STRING," +
"`flow_rate` INTEGER," +
"PRIMARY KEY(`urn`) NOT ENFORCED," +
"WATERMARK FOR `timestamp` AS `timestamp`" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'seneca.network.transport_network_edge'," +
"'scan.startup.mode' = 'latest-offset'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'traffic-data-aggregation-job'," +
"'format' = 'debezium-json'," +
"'debezium-json.schema-include' = 'true'" +
")");
Finally here's the temporal join:
Table transportNetworkCongestion = tEnv.sqlQuery("SELECT AggregatedTrafficData_Kafka.`timestamp`, `networkEdge`, " +
"congestion(`vehiclesNumber`, `flow_rate`) AS `congestion` FROM AggregatedTrafficData_Kafka " +
"JOIN TransportNetworkEdge_Kafka FOR SYSTEM_TIME AS OF AggregatedTrafficData_Kafka.`timestamp` " +
"ON AggregatedTrafficData_Kafka.`networkEdge` = TransportNetworkEdge_Kafka.`urn`");
The problem I'm having is that the join works only for the first few seconds (after an update in the Postgres table), but I need to continuously join the first table with the Debezium one. Am I doing something wrong?
Thanks
euks

Temporal joins using the AS OF syntax you're using require:
an append-only table with a valid event-time attribute
an updating table with a primary key and a valid event-time attribute
an equality predicate on the primary key
When Flink SQL's temporal operators are applied to event time streams, watermarks play a critical role in determining when results are produced, and when the state is cleared.
When performing a temporal join:
rows from the append-only table are buffered in Flink state until the current watermark of the join operator reaches their timestamps
for the versioned table, for each key the latest version whose timestamp precedes the join operator's current watermark is kept in state, plus any versions from after the current watermark
whenever the join operator's watermark advances, new results are produced, and state that's no longer relevant is cleared
The join operator tracks the watermarks it receives from its input channels, and its current watermark is always the minimum of these two watermarks. This is why your join stalls, and only makes progress when the flow_rate is updated.
One way to fix this would be to set the watermark for the TransportNetworkEdge_Kafka table like this:
"WATERMARK FOR `timestamp` AS " + Watermark.MAX_WATERMARK
This will set the watermark for this table/stream to the largest possible value, which will have the effect of making the watermarks from this stream irrelevant -- this stream's watermarks will never be the smallest.
This will, however, have the drawback of making the join results non-deterministic.
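The stall described above, and the effect of a maximal watermark on the versioned side, can be reproduced with a toy model. This is only a sketch: `TwoInputWatermark` and its methods are made-up names for illustration, not Flink classes, and the timestamps are arbitrary.

```java
// Toy model of a two-input operator's event-time clock; these are NOT Flink classes.
final class TwoInputWatermark {
    private long left = Long.MIN_VALUE;   // last watermark seen on the append-only input
    private long right = Long.MIN_VALUE;  // last watermark seen on the versioned input

    void onLeftWatermark(long ts)  { left = Math.max(left, ts); }
    void onRightWatermark(long ts) { right = Math.max(right, ts); }

    // The operator's current watermark is the minimum over its inputs.
    long current() { return Math.min(left, right); }

    public static void main(String[] args) {
        TwoInputWatermark op = new TwoInputWatermark();
        op.onRightWatermark(1_000L);              // a single update on the versioned side
        for (long ts = 1_000L; ts <= 5_000L; ts += 1_000L) {
            op.onLeftWatermark(ts);               // the append-only side keeps advancing
            System.out.println("left=" + ts + " operator=" + op.current());
        }
        op.onRightWatermark(Long.MAX_VALUE);      // emulate a maximal watermark on the versioned side
        System.out.println("after max: operator=" + op.current());
    }
}
```

In this model the operator's watermark stays at 1000 while the left input advances, and only jumps to 5000 once the right input stops holding it back, which is the behaviour you are seeing between Postgres updates.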

Related

Getting incomplete data when running 6 ignite servers

I am running 6 Ignite servers on version 2.7.5. The problem is that when I run queries through my client API, I do not get all the records; only some come back. I am using a partitioned cache and don't want to use replicated mode. When I query with DBeaver, it shows that all records have been fetched.
The following code is used to fetch the data:
public List<Long> getGroupIdsByUserId(Long createdBy) {
final String query = "select g.groupId from groups g where g.createdBy = ? and g.isActive = 1";
SqlFieldsQuery sql = new SqlFieldsQuery(query);
sql.setArgs(createdBy);
List<List<?>> rsList = groupsCache.query(sql).getAll();
List<Long> ids = new ArrayList<>();
for (List<?> l : rsList) {
ids.add((Long)l.get(0));
}
return ids;
}
Ignite Version - 2.7.5
Client Query method
And the join query is:
final String query = "select distinct u.userId from groupusers gu "
+ "inner join \"GroupsCache\".groups g on gu.groupId = g.groupId "
+ "inner join \"OrganizationsCache\".organizations o on gu.organizationId = o.organizationId "
+ "inner join \"UsersCache\".users u on gu.userId = u.userId where " + "g.groupId = ? and "
+ "g.isActive = 1 and " + "gu.isActive = 1 and " + "gu.createdBy = ? and " + "o.organizationId = ? and "
+ "o.isActive = 1 and " + "u.isActive = 1";
For the join query, the database actually holds 120 records, but the Ignite client returns only 3-4, and the count is not consistent: sometimes 3 records, sometimes 4. And for the query
select g.groupId from groups g where g.createdBy = ? and g.isActive = 1
there are actually 27 records, but the client returns sometimes 20, sometimes 19, and sometimes the complete set. Please help me with this and with collocated joins.
Most likely this would mean that your affinity is incorrect.
Apache Ignite assumes that your data has proper affinity, i.e. when joining two tables, the rows to join will always be available on the same node. This works when you join either by primary key, or by a part of the primary key which is marked as the affinity column (e.g. by the @AffinityKeyMapped annotation). There's a documentation page about affinity.
You can check that by setting the distributedJoins connection setting to true. If you see all the records after that, it means you need to fix your affinity.
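To see why missing collocation gives partial and seemingly random results, here is a toy model of a local-only join. It is a sketch under stated assumptions: it uses no Ignite classes, `CollocationDemo`, `nodeFor` and `localJoinMatches` are hypothetical names, and the modulo partitioner merely stands in for Ignite's real affinity function.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of partitioned data; it does NOT use any Ignite classes.
final class CollocationDemo {
    static final int NODES = 3;

    // Hypothetical partitioner: a row lives on the node chosen by its affinity key.
    static int nodeFor(long affinityKey) {
        return (int) Math.floorMod(affinityKey, (long) NODES);
    }

    // Counts join matches when every node joins only the rows it holds locally.
    // Each groupUsers row is {rowId, groupId}; groups are identified by groupId.
    static int localJoinMatches(long[][] groupUsers, long[] groupIds, boolean collocateByGroupId) {
        List<List<Long>> groupsOnNode = new ArrayList<>();
        for (int n = 0; n < NODES; n++) groupsOnNode.add(new ArrayList<>());
        for (long g : groupIds) groupsOnNode.get(nodeFor(g)).add(g);

        int matches = 0;
        for (long[] gu : groupUsers) {
            long rowId = gu[0];
            long groupId = gu[1];
            // With proper affinity the row is placed by groupId, otherwise by its own id.
            int node = nodeFor(collocateByGroupId ? groupId : rowId);
            if (groupsOnNode.get(node).contains(groupId)) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        long[][] groupUsers = {{0, 1}, {1, 1}, {2, 1}, {3, 2}, {4, 2}, {5, 2}, {6, 3}, {7, 3}, {8, 3}};
        long[] groupIds = {1, 2, 3};
        System.out.println("collocated:     " + localJoinMatches(groupUsers, groupIds, true));
        System.out.println("not collocated: " + localJoinMatches(groupUsers, groupIds, false));
    }
}
```

With this sample data the collocated layout finds all 9 matches, while the non-collocated one finds only 3. In Ignite itself, the per-query equivalent of the connection setting is `SqlFieldsQuery.setDistributedJoins(true)`; it is useful as a diagnostic, but fixing the affinity key is the cheaper long-term option.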

Crystal Reports External Join

I have 2 data sources that I am querying, then joining in Crystal Reports on a key string with a Left Outer Join. The intent of the report is to identify purchases made that were not processed. The issue is that CR refuses to show the matching right query records.
Data Source 1: Excel worksheet on my local machine containing raw credit card purchases. "Left table"
Data Source 2: 2 subqueries from a hosted Oracle database with a Union join containing processed credit card transactions. "Right table"
Key String: The last 4 digits of a credit card number concatenated with the date-time of the transaction, e.g. "223402-06-2019 04:15:00"
The queries return proper values when executed separately. I have verified that many records returned for the Left table actually do have matching Right table records that are not displayed. I did this using a separate report showing only the Right table query results and manually searching for different key strings.
I'm completely buffaloed and any assistance would be appreciated.
The SQL from Crystal Reports:
I:\Dept\DCS\MPOOL\Fleet Management Data\M5\M5 Automation Data Tables\ComData Transaction Data.xls
SELECT DISTINCT CD.`First Name` AS UNIT_NO,
CD.`HIERARCHY LEVEL3` AS USE_DEPT,
DATEVALUE(MONTH(CD.`Transaction Date`) & "/" & DAY(CD.`Transaction Date`) & "/" & YEAR(CD.`Transaction Date`)) + TIMEVALUE(HOUR(CD.`Transaction Time`) & ":" & MINUTE(CD.`Transaction Time`) & ":" & SECOND(CD.`Transaction Time`)) AS TRANS_DT,
CD.`Odometer` AS ODOMETER,
CD.`Card Number` AS CARD_NO,
RIGHT(CD.`Card Number`, 4) & FORMAT(DATEVALUE(MONTH(CD.`Transaction Date`) & "/" & DAY(CD.`Transaction Date`) & "/" & YEAR(CD.`Transaction Date`)) + TIMEVALUE(HOUR(CD.`Transaction Time`) & ":" & MINUTE(CD.`Transaction Time`) & ":" & SECOND(CD.`Transaction Time`)), "mm-dd-yyyy hh:mm:ss") AS KEYSTRING
FROM `Sheet1$` CD
WHERE ISDATE(CD.`Transaction Date`) AND CD.`Transaction Date` >= FORMAT('02/01/2019', 'mm-dd-yyyy') AND CD.`Transaction Date` <= FORMAT('02/15/2019', 'mm-dd-yyyy')
EXTERNAL JOIN Command.KEYSTRING={?m5oksr: Command_1.KEYSTRING}
m5oksr
SELECT DISTINCT TCC.UNIT_NO,
VUDC.USING_DEPT_NO AS USE_DEPT,
TCC.ISSUE_DT + 2/24 AS TRANS_DT,
TCC.NEW_METER AS ODOMETER,
'COMP' AS STATUS,
TCC.CARD_NO AS CARD_NO,
SUBSTR(TCC.CARD_NO, 16, 4) || TO_CHAR(TCC.ISSUE_DT + 2/24, 'MM-DD-YYYY HH24:MI:SS') AS KEYSTRING
FROM MFIVE.VIEW_TRIPCARD_COMPLETED_TRANS TCC
LEFT OUTER JOIN VIEW_UNIT_DEPT_COMP VUDC ON TCC.COMPANY = VUDC.COMPANY and TCC.UNIT_NO = VUDC.UNIT_NO
WHERE TCC.ISSUE_DT + 2/24 >= TO_DATE('02/01/2019 00:00:00', 'MM/DD/YYYY HH24:MI:SS') AND TCC.ISSUE_DT + 2/24 <= TO_DATE('02/15/2019 11:59:59', 'MM/DD/YYYY HH24:MI:SS')
UNION
SELECT DISTINCT IR.FIELD2 as UNIT_NO,
VUDC.USING_DEPT_NO AS USE_DEPT,
TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24 AS TRANS_DT,
IR.METER as ODOMETER,
'FAIL' AS STATUS,
NVL2(IR.FIELD27, CONCAT('XXXX-XXXX-XXXX-', SUBSTR(IR.FIELD27,-4)),'') as CARD_NO,
SUBSTR(NVL2(IR.FIELD27, CONCAT('XXXX-XXXX-XXXX-', SUBSTR(IR.FIELD27,-4)),''), 16, 4) || TO_CHAR(TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24, 'MM-DD-YYYY HH24:MI:SS') AS KEYSTRING
FROM INTERFACE_REJECT IR
INNER JOIN INTERFACE_STAT ST ON IR.COMPANY = ST.COMPANY and IR.STAT_ID = ST.STAT_ID
LEFT OUTER JOIN EMP_MAIN E ON IR.COMPANY = E.COMPANY AND IR.FIELD29 = E.TRIPCARD_PIN
LEFT OUTER JOIN VIEW_UNIT_DEPT_COMP VUDC ON IR.COMPANY = VUDC.COMPANY and IR.FIELD2 = VUDC.UNIT_NO
WHERE LENGTH(IR.FIELD1) = 19 AND ST.INTERFACE_NAME = 'M5-TRIP-CARD-INTF' AND TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24 >=TO_DATE('02/01/2019 00:00:00', 'MM/DD/YYYY HH24:MI:SS') AND TO_DATE(IR.FIELD1, 'MM/DD/YYYY HH24:MI:SS') + 2/24 <= TO_DATE('02/15/2019 11:59:59', 'MM/DD/YYYY HH24:MI:SS')
EXTERNAL JOIN Command_1.KEYSTRING={?I:\Dept\DCS\MPOOL\Fleet Management Data\M5\M5 Automation Data Tables\ComData Transaction Data.xls: Command.KEYSTRING}
Are you sure the join works? If the join doesn't match, you will get nulls on the right side, and my guess is that this is what is happening. Try using an INNER JOIN instead of the LEFT JOIN and check whether any rows are returned. If no rows come back, the keys are not matching, and you may need to cast the values to the same type and trim them. It is possible that the value returned by Excel has extra spaces or a different value type, which Crystal converts incorrectly.
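As an illustration of how invisible whitespace breaks a string key match, here is a minimal sketch. `KeyStringCheck`, `normalize` and `sameKey` are hypothetical helper names, and treating non-breaking spaces as padding is an assumption about what Excel might emit, not something confirmed for this data.

```java
// Hypothetical helper for comparing key strings that may carry stray whitespace.
final class KeyStringCheck {
    // Replaces non-breaking spaces with ordinary ones, then strips both ends.
    static String normalize(String key) {
        return key == null ? "" : key.replace('\u00A0', ' ').trim();
    }

    static boolean sameKey(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String fromExcel = "223402-06-2019 04:15:00 ";   // note the trailing space
        String fromOracle = "223402-06-2019 04:15:00";
        System.out.println("raw equals:        " + fromExcel.equals(fromOracle));
        System.out.println("normalized equals: " + sameKey(fromExcel, fromOracle));
    }
}
```

The raw comparison fails and the normalized one succeeds, which is the same effect a join on the untrimmed KEYSTRING would show.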

Convert two queries to one in YQL

Can I do a JOIN of two queries with YQL? I have two queries:
select *
from yahoo.finance.historicaldata
where symbol in ('YHOO')
and startDate='" + startDate + "'
and endDate='" + endDate + "'&format=json&diagnostics=true&env=store://datatables.org/alltableswithkeys&callback="
and
select symbol,
Earnings_per_Share,
Dividend_Yield,
week_Low,
Week_High,
Last_Trade_Date,
open,
low,
high,
volume,
Last_Trade
from csv where url="http://download.finance.yahoo.com/d/quotes.csv?s=YHOO,GOOG&f=seyjkd1oghvl1&e=.csv"
and columns="symbol,Earnings_per_Share,Dividend_Yield,Last_Trade_Date,week_Low,Week_High,open,low,high,volume,Last_Trade"
I need to convert these two queries into one. How can I do this?

T-SQL Matching process using only provided fields

I am trying to write a stored procedure to match lists of physicians with existing records in our database based off of the information provided to us by our clients. Currently we use MS Access to join manually based on the given identifiers, but this process tends to be tedious and overly time consuming, hence the desire to automate it.
What I am trying to do is create a temporary table that contains all columns that could potentially be matched on, and then run through a series of matching queries using the fields as join conditions to get our identifier to pass back.
For instance, the available matching fields are Name, NPI, MedicaidNum, and DOB so I would write something like:
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy
ON Phy.Name = Temp.Name
AND Phy.NPI = Temp.NPI
AND Phy.MedicaidNum = Temp.MedicaidNum
AND Phy.DOB = Temp.DOB
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy
ON Phy.Name = Temp.Name
AND Phy.NPI = Temp.NPI
AND Phy.MedicaidNum = Temp.MedicaidNum
WHERE Temp.RECID IS NULL
...etc
The problem lies in the fact that there are about 15 different identifiers which could potentially be provided, and clients usually only provide three or four per record set. So by the time null values are accounted for, there are potentially over a hundred different queries that need to be written to match on only half a dozen provided fields.
I am thinking that there may be a way to pass in a variable (or variables) which indicate which columns are actually provided with the data set, and then write a dynamic join statement and/or where clause, but I do not know if this will work in T-SQL. Something like:
DECLARE @Field1
DECLARE @Field2
....
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy
ON Phy.@Field1 = Temp.@Field1
AND Phy.@Field2 = Temp.@Field2
This way I would limit the number of queries I need to write, and only need to worry about the number of fields I am matching, rather than which specific ones. Perhaps there is a better approach to this problem, however?
You can do something like this, but be warned this method is super prone to SQL injection. It's just to illustrate the principle of how to do something like this. I leave it up to you what you want to do with it. For this code, I made the proc take three fields:
CREATE PROC DynamicUpdateSQLFromFieldList @Field1 VARCHAR(50) = NULL,
@Field2 VARCHAR(50) = NULL,
@Field3 VARCHAR(50) = NULL,
@RunMe BIT = 0
AS
BEGIN
DECLARE @SQL AS VARCHAR(1000);
SELECT @SQL = 'UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp
INNER JOIN Physicians Phy ON ' +
COALESCE('Phy.' + @Field1 + ' = Temp.' + @Field1 + ' AND ', '') +
COALESCE('Phy.' + @Field2 + ' = Temp.' + @Field2 + ' AND ', '') +
COALESCE('Phy.' + @Field3 + ' = Temp.' + @Field3, '') + ';';
IF @RunMe = 0
SELECT @SQL AS SQL;
ELSE
EXEC(@SQL)
END
I've added a debug mode flag just so you can see the SQL if you don't want to run it. So, for example, if you run:
EXEC DynamicUpdateSQLFromFieldList @Field1='col1', @Field2='col2', @Field3='col3'
or
EXEC DynamicUpdateSQLFromFieldList @Field1='col1', @Field2='col2', @Field3='col3', @RunMe=0
the SQL produced will be:
UPDATE Temp
SET Temp.RECID = Phy.RECID
FROM TempTable Temp INNER JOIN Physicians Phy
ON Phy.col1 = Temp.col1 AND
Phy.col2 = Temp.col2 AND
Phy.col3 = Temp.col3;
If you run this line:
EXEC DynamicUpdateSQLFromFieldList @Field1='col1', @Field2='col2', @Field3='col3', @RunMe=1
It will perform the update. If you wanted it to be more secure, you could whitelist the incoming field names against the sys tables to make sure the columns actually exist in each table before you execute any code.
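The whitelisting idea can be sketched on the client side as well. The following is a minimal illustration, not a drop-in implementation: `JoinConditionBuilder` and `buildUpdate` are hypothetical names, and the hard-coded `ALLOWED` set is an assumption standing in for a lookup against the real catalog (for example the sys tables).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical builder that only accepts column names from a fixed allow-list.
final class JoinConditionBuilder {
    // Assumed allow-list; in practice this would be loaded from the database catalog.
    static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("Name", "NPI", "MedicaidNum", "DOB"));

    static String buildUpdate(List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one match field is required");
        }
        // StringJoiner places " AND " only between conditions, so none is left dangling.
        StringJoiner on = new StringJoiner(" AND ");
        for (String f : fields) {
            if (!ALLOWED.contains(f)) {
                throw new IllegalArgumentException("unknown column: " + f);
            }
            on.add("Phy.[" + f + "] = Temp.[" + f + "]");
        }
        return "UPDATE Temp SET Temp.RECID = Phy.RECID "
                + "FROM TempTable Temp INNER JOIN Physicians Phy ON " + on
                + " WHERE Temp.RECID IS NULL;";
    }

    public static void main(String[] args) {
        System.out.println(buildUpdate(Arrays.asList("NPI", "DOB")));
        try {
            buildUpdate(Arrays.asList("NPI; DROP TABLE Physicians"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because only names from the allow-list ever reach the SQL text, a caller cannot smuggle arbitrary SQL in through a field name, and the number of provided fields can vary freely.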

ADOQuery Master/Detail

How can I set up an ADOQuery to filter the data to display all lakes that have Brook trout in a DBGrid?
Data:
Nate Pond - LakeMaps.Lake_Name
Brook trout - Species.Species_Name
Creek chub
Golden shiner
Black Pond
Brook trout
Brown bullhead
Common shiner
Lake Placid
Lake trout
Smallmouth bass
Yellow perch
MDB Database
ADOTable1 = LakeMaps MASTER
ADOTable2 = Species DETAIL
Relationship
LakeMaps Table
LakeMaps.Field[0] = Lake_ID: Autonumber --- ]
LakeMaps.Field[1] = Lake_Name: Text--- |
|Relationship set in the access database
Species Table |
Species.Field[0] = Species_ID: numeric --- ]
Species.Field[1] = Species_Name: text
The Species table is the detail; LakeMaps is the master.
How can I set up an ADOQuery to filter the data to display all lakes that have Brook trout in a DBGrid?
Filtered Data:
Nate Pond
Brook trout
Creek chub
Golden shiner
Black Pond
Brook trout
Brown bullhead
Common shiner
You can set Filtered = true and then use the OnFilterRecord event to check whether the detail dataset contains the requested value (this can be done in a loop or with the dataset's Locate method).
This will probably be very slow on larger amounts of data. In those situations I usually filter the master records directly in SQL. Something like this:
SELECT * FROM LakeMaps
WHERE Lake_ID in (SELECT Lake_ID
FROM Species INNER JOIN SpeciesLakesRelation
ON (Species.Species_ID = SpeciesLakesRelation.Species_Id)
WHERE SPECIES_NAME = 'Brook Trout')
This SQL returns the records from LakeMaps that have 'Brook Trout'.
SpeciesLakesRelation is a table that holds the relation between LakeMaps and Species.
The problem with your query is that text values in a query must be enclosed in apostrophes. If ComboBoxSpecies.Text has the value Brook Trout, then the SQL evaluates to:
SELECT * FROM LakeMaps WHERE Lake_ID in
(SELECT Lake_ID FROM Species INNER JOIN LakeMaps ON
(Species.Species_ID = LakeMaps.Lake_Id)
WHERE SPECIES_NAME = Brook Trout)
Note that Brook Trout is not in apostrophes, so you get a syntax error from MS Access.
Edit:
As Gerry noted in comment:
apostrophes should be added using QuotedStr function, instead of double apostrophe.
best solution is to use query parameter
Delphi code, using QuotedStr, should look like this:
ADOQuery1.SQL.Add( 'SELECT * FROM LakeMaps WHERE Lake_ID in ' +
'(SELECT Lake_ID FROM Species INNER JOIN LakeMaps ON ' +
'(Species.Species_ID = LakeMaps.Lake_Id) ' +
'WHERE SPECIES_NAME = ' + QuotedStr(ComboBoxSpecies.Text) + ')');
Now, if ComboBoxSpecies.Text has value Brook Trout, then this string:
'WHERE SPECIES_NAME = ' + QuotedStr(ComboBoxSpecies.Text) + ')'
evaluates as:
WHERE SPECIES_NAME = 'Brook Trout')
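For readers who don't use Delphi, the quoting step can be mimicked in a few lines. This is a Java rendering for illustration only: `quotedStr` below is a hypothetical stand-in that copies the documented behaviour of Delphi's QuotedStr (wrap in single quotes, double any embedded single quote), and the second sample value is made up.

```java
// Hypothetical Java stand-in for Delphi's QuotedStr, for illustration only.
final class QuotedStrDemo {
    // Wraps the value in single quotes and doubles any embedded single quote.
    static String quotedStr(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        // prints: WHERE SPECIES_NAME = 'Brook Trout')
        System.out.println("WHERE SPECIES_NAME = " + quotedStr("Brook Trout") + ")");
        // prints: WHERE SPECIES_NAME = 'O''Brien trout')
        System.out.println("WHERE SPECIES_NAME = " + quotedStr("O'Brien trout") + ")");
    }
}
```

The second line shows why concatenating plain apostrophes around the text is not enough: a value that itself contains an apostrophe would end the string literal early. A query parameter avoids the problem entirely, since the value never becomes part of the SQL text.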
