Duplicates in Snowflake Stream

With the setting SHOW_INITIAL_ROWS = TRUE, we created a stream on top of a view (which has many joins).
We then created a stored procedure with a single MERGE statement that ingests all of the data from the stream into a target table. The following is the merge statement used by the stored procedure:
merge into target tgt
using
(
    select id, fname, metadata$action, metadata$isupdate
    from emp_stream
    where not (metadata$action = 'DELETE' and metadata$isupdate = 'TRUE')
) src
on src.id = tgt.id
when matched and metadata$action = 'DELETE' and metadata$isupdate = 'FALSE' then delete
when matched and metadata$action = 'INSERT' and metadata$isupdate = 'TRUE' then update
    set tgt.id = src.id
      , tgt.fname = src.fname
when not matched and metadata$action = 'INSERT' and metadata$isupdate = 'FALSE' then
    insert (id, fname) values (src.id, src.fname);
A task was created to run the stored procedure every 8 hours. The first run (the full load, which inserts all of the records from the view into the target table) succeeded. However, the second load failed with a duplicate row error. When we queried the stream, we found two records with the same PK (id) but different METADATA$ROWID values, one with METADATA$ACTION = 'INSERT' and the other with 'DELETE', and with METADATA$ISUPDATE set to FALSE on both.
If this were an update, METADATA$ISUPDATE should be set to TRUE, which is not the case here.
Could someone please assist us with this? We are trying to do an incremental load using streams in Snowflake but are facing a duplicate row error.

Related

Snowflake stream behavior

I have the following fields in table1:
db, schema, jobnm, status, runtime, ins_tstmp, upd_tstmp.
A stream has been created on table1.
A stored procedure was written to loop through another table's dataset (4 records) and write all 4 records to table1 if they don't already exist, else update them (using a MERGE here; ins_tstmp gets populated via the insert branch of the merge, while upd_tstmp gets updated via the update branch).
As expected, table1 has all 4 records, and the stream also has 4 records with METADATA$ACTION as INSERT. UPD_TSTMP is null here.
Now on the 2nd run, the same 4 records were retrieved. Since they were a match, upd_tstmp got populated in both table1 and the stream, but why is METADATA$ACTION still INSERT? I am not seeing 2 entries for the update. Could someone please explain what I am missing here?
Thanks
Since they were a match, upd_tstmp got populated in both table 1 and stream but why metadata$action is INSERT only?
The METADATA$ACTION column can have 2 possible values: INSERT and DELETE. So you can't see "UPDATE" in this column.
METADATA$ISUPDATE is an extra column indicating whether the operation was part of an UPDATE statement. In your case you should also see it as FALSE, because streams record the differences between two offsets. If a row is added and then updated within the current offset, the delta change is a single new row, and its METADATA$ISUPDATE column records FALSE.
https://docs.snowflake.com/en/user-guide/streams-intro.html#stream-columns
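A minimal sketch that reproduces this behaviour (the table, column, and stream names here are made up for illustration):

-- Hypothetical objects for illustration only.
create or replace table jobs (jobnm string, upd_tstmp timestamp);
create or replace stream jobs_stream on table jobs;

insert into jobs values ('job1', null);

update jobs set upd_tstmp = current_timestamp() where jobnm = 'job1';

-- The insert and the update collapse into one delta row:
-- METADATA$ACTION = 'INSERT' and METADATA$ISUPDATE = FALSE.
select jobnm, metadata$action, metadata$isupdate from jobs_stream;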

TFDQuery failing to update?

I'm having a problem with a synchronisation issue... I have a source table (mtAllowanceCategory) which I want to sync into a copy (qryAllowanceCategory) of it. To make sure records in the copy are deleted if they are no longer present in the source, the copy has a "StillHere" boolean field, which is set to true when the record is added or updated and otherwise stays false. Afterwards, all records with StillHere=false are deleted.
That's the idea, anyway... In practice, the flag field isn't turned on when posting updates. When I trace the code, the statement is executed; when I look in Access, it stays off. Hence the delete SQL afterwards clears the entire table.
I've been trying to figure this out for hours now; what am I missing?
mtAllowanceCategory: TFDMemTable (filled from an API call; this works fine)
qryAllowanceCategory: TFDQuery
conn: TFDConnection to a local Access database (also used for qryAllowanceCategory)
conn.ExecSQL('UPDATE AllowanceCategory SET StillHere=false;');
while not mtAllowanceCategory.Eof do
begin
  if qryAllowanceCategory.Locate('WLPid', mtAllowanceCategory.FieldByName('Id').AsString, [loCaseInsensitive]) then
  begin
    Updating := True;
    qryAllowanceCategory.Edit;
  end
  else
  begin
    Updating := False;
    qryAllowanceCategory.Insert;
  end;
  qryAllowanceCategory.FieldByName('createdBy').AsString := mtAllowanceCategory.FieldByName('createdBy').AsString;
  qryAllowanceCategory.FieldByName('createdOn').AsString := mtAllowanceCategory.FieldByName('createdOn').AsString;
  qryAllowanceCategory.FieldByName('description').AsString := mtAllowanceCategory.FieldByName('description').AsString;
  qryAllowanceCategory.FieldByName('WLPid').AsString := mtAllowanceCategory.FieldByName('id').AsString;
  qryAllowanceCategory.FieldByName('isDeleted').AsBoolean := mtAllowanceCategory.FieldByName('isDeleted').AsBoolean;
  qryAllowanceCategory.FieldByName('isInUse').AsBoolean := mtAllowanceCategory.FieldByName('isInUse').AsBoolean;
  qryAllowanceCategory.FieldByName('modifiedBy').AsString := mtAllowanceCategory.FieldByName('modifiedBy').AsString;
  qryAllowanceCategory.FieldByName('modifiedOn').AsString := mtAllowanceCategory.FieldByName('modifiedOn').AsString;
  qryAllowanceCategory.FieldByName('WLPname').AsString := mtAllowanceCategory.FieldByName('name').AsString;
  qryAllowanceCategory.FieldByName('number').AsInteger := mtAllowanceCategory.FieldByName('number').AsInteger;
  qryAllowanceCategory.FieldByName('percentage').AsFloat := mtAllowanceCategory.FieldByName('percentage').AsFloat;
  qryAllowanceCategory.FieldByName('remark').AsString := mtAllowanceCategory.FieldByName('remark').AsString;
  qryAllowanceCategory.FieldByName('LocalEdited').AsBoolean := False;
  qryAllowanceCategory.FieldByName('LocalInserted').AsBoolean := False;
  qryAllowanceCategory.FieldByName('LocalDeleted').AsBoolean := False;
  qryAllowanceCategory.FieldByName('StillHere').AsBoolean := True;
  qryAllowanceCategory.Post;
  mtAllowanceCategory.Next;
end;
conn.Commit;
conn.ExecSQL('DELETE FROM AllowanceCategory WHERE StillHere=false;');
When I read your question, I was struck by two thoughts: one was that I couldn't immediately see the cause of your problem, and the other was that you could probably avoid the problem anyway if you used SQL rather than table traversals in code. It seemed to me that you might be able to do most, if not all, of what you need in terms of synchronising the two tables using Access SQL rather than traversing the qryAllowanceCategory table in a "while not Eof" loop.
(Btw, in the following I'm going to use 'mtAC' and 'qryAC' to reduce typing and typos.)
Using Access SQL
Initially, I did not have much luck, as Access rejected my attempts to refer to both tables in an UPDATE statement against qryAC using a Join or Outer Join, but then I came across a reference showing that Access does support an Inner Join syntax. These SQL statements execute successfully when passed to ExecSQL on the FireDAC connection to the database:
update qryAC set qryAC.StillHere = True
where exists(select mtAC.* from mtAC inner join qryAC on mtAC.WLPid = qryAC.WLPid)
and
update qryAC inner join mtAC on mtAC.WLPid = qryAC.WLPid set qryAC.AValue = mtAC.AValue
The first of these obviously provides a way to set the StillHere field to True, or to False with a trivial modification. The second shows a way to update a set of fields in qryAC from the matching rows in mtAC, and this could, of course, be limited to a subset of rows with a suitable Where clause.
Access SQL also supports checking whether a row in one table exists in the other, as in
select * from qryAC q where exists (select * from mtac m where q.wlpid = m.wlpid)
and deleting rows in one table which do not exist in the other:
delete from qryAC q where not exists (select * from mtac m where q.wlpid = m.wlpid)
Using FireDAC's LocalSQL
I also mentioned LocalSQL in a comment. This supports a far broader range of SQL statements than native Access SQL and can operate on any TDataSet descendant, so if you find something that Access SQL syntax doesn't support, it is worth considering using LocalSQL instead. Its main downside is that it operates on the datasets using traversals, so it is not quite as "instant" as native SQL. It can be a bit tricky to set up, so here are the settings from the DFM which show how the components need connecting up. You would use it by feeding what you want to FDQuery1.
object AccessConnection: TFDConnection
  Params.Strings = (
    'Database=D:\Delphi\Code\FireDAC\LocalSQL\Allowance.accdb'
    'DriverID=MSAcc')
  Connected = True
  LoginPrompt = False
end
object mtAC: TFDQuery
  AfterOpen = mtACAfterOpen
  Connection = AccessConnection
  SQL.Strings = (
    'select * from mtAC')
end
object qryAC: TFDQuery
  Connection = AccessConnection
end
object LocalSqlConnection: TFDConnection
  Params.Strings = (
    'DriverID=SQLite')
  Connected = True
  LoginPrompt = False
end
object FDLocalSQL1: TFDLocalSQL
  Connection = LocalSqlConnection
  DataSets = <
    item
      DataSet = mtAC
    end
    item
      DataSet = qryAC
    end>
end
object FDGUIxWaitCursor1: TFDGUIxWaitCursor
  Provider = 'Forms'
end
object FDPhysSQLiteDriverLink1: TFDPhysSQLiteDriverLink
end
object FDQuery1: TFDQuery
  Connection = LocalSqlConnection
end
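For example, with the components wired up as above, the first synchronisation statement could be run through LocalSQL by feeding it to FDQuery1. A sketch, assuming LocalSQL's SQLite dialect (which resolves the registered datasets by name and treats the boolean as 0/1):

-- Executed via FDQuery1.ExecSQL against LocalSqlConnection.
update qryAC
set StillHere = 1
where exists (select 1 from mtAC where mtAC.WLPid = qryAC.WLPid);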
If anyone is interested:
The problem was in not refreshing qryAllowanceCategory after the initial SQL statement setting StillHere to false. The in-memory version of the record in qryAllowanceCategory didn't receive that update, so as far as the dataset was concerned the flag was still on; after the field assignments it appeared there were no changes (all the other fields were unchanged as well), so the Post was ignored. In the actual table the flag was off, though, so the final delete SQL removed the record.
The problem was solved by adding a Refresh after the first UPDATE SQL statement.

Using FireDac to update only 1 of a duplicate row (no primary key or unique field)

I have an old application I am supporting that uses a Microsoft Access database. The original table design did not add primary keys to every table. I am working on a migration program that, among other things, adds and fills in a new primary key field (a GUID) where needed.
This happens in three steps:
Add a new guid field with no constraints
Fill the field with new unique guids
Add the primary key constraint
My problem is setting the unique guids when the table has duplicate rows. Here is my code to set the guids:
Query.SQL.Add('SELECT * FROM ' + TableName);
Query.Open;
while not Query.Eof do
begin
  Query.Edit;
  Query.FieldByName(NewPrimaryKeyFieldName).AsGuid := TGuid.NewGuid;
  Query.Post;
  Query.Next;
end;
FireDAC generates an UPDATE statement whose WHERE clause lists all the original fields/values of the row (since there is no unique field for it to use). However, because the rows are complete duplicates, the statement still updates two rows.
FireDAC correctly errors with this message:
Update command updated [2] instead of [1] record.
I can open up the database in Access and delete the duplicate records or assign them a unique guid by editing the table, but I would like my conversion tool to do this automatically.
Is there some way to work with these duplicate rows in FireDAC, either to update just one at a time, or to delete just one of them?
In my opinion there is no way to do it with just one SQL statement.
I would do this:
1. Copy the whole table without duplicates into a new temp table:
SELECT DISTINCT * FROM <TABLENAME>
2. Add the keys.
3. Delete the old table content and copy the new content back from the temp table.
Notes:
The DB should be unavailable to everyone else during that operation.
Make a BACKUP before.
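A sketch of those steps in Access SQL (MyTable and MyTable_Temp are placeholder names; in Access, SELECT ... INTO creates the target table):

-- 1. Copy distinct rows into a new temp table.
SELECT DISTINCT * INTO MyTable_Temp FROM MyTable;

-- 2. Clear the original table.
DELETE FROM MyTable;

-- 3. Copy the de-duplicated rows back.
INSERT INTO MyTable SELECT * FROM MyTable_Temp;

-- 4. Drop the temp table, then add the GUID column and primary key as before.
DROP TABLE MyTable_Temp;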

Migrate PropertyData to a new PropertyType

We have an existing PropertyType called IsPublic which uses a Umbraco.TrueFalse property editor.
Requirements have changed, and this value now needs to be represented by multiple checkboxes driven from an Enum with the values Public, Group1, Group2.
This all works as expected, but with tens of thousands of documents we want to save our content editors from manually populating them all.
When saving a document in Umbraco, I can see that it creates an entry in the cmsPropertyData table with the value [ "Public", "Group1", "Group2" ] in the dataNvarchar column.
I've written a script to insert a row into this table based on the value of the original IsPublic flag.
However, after running it, the changes aren't displayed when opening a document in Umbraco.
The script used to update is:
DECLARE @HasPublicFlag NVARCHAR(50) = '[ "Public", "Group1", "Group2" ]'
DECLARE @NoPublicFlag NVARCHAR(50) = '[ "Group1", "Group2" ]'
DECLARE @feature INT = (SELECT nodeId FROM cmsContentType WHERE Alias = 'Feature')
--Existing IsPublic flag
DECLARE @featureIsPublic INT = (SELECT id FROM cmsPropertyType WHERE Alias = 'IsPublic' AND contentTypeId = @feature)
--New PropertyType
DECLARE @featureRoleRestriction INT = (SELECT id FROM cmsPropertyType WHERE Alias = 'documentRoleRestriction' AND contentTypeId = @page)
--Get feature document versions that are either newest version or published
;WITH FeatureDocumentsToUpdate AS
(
    SELECT d.*, pd.dataInt
    FROM cmsDocument d
    JOIN cmsPropertyData pd ON pd.versionId = d.versionId
    LEFT JOIN cmsPropertyData pd2 ON pd2.versionId = d.versionId AND pd2.propertytypeid = @featureRoleRestriction
    WHERE (d.newest = 1 OR d.Published = 1) AND pd.propertytypeid = @featureIsPublic AND pd2.id IS NULL
)
--INSERT INTO cmsPropertyData based on value of existing flag
INSERT INTO cmsPropertyData (contentNodeId, versionId, propertytypeid, dataNvarchar)
SELECT s.nodeId, s.versionId, @featureRoleRestriction,
       CASE WHEN s.dataInt = 0 THEN @NoPublicFlag ELSE @HasPublicFlag END AS NewValue
FROM FeatureDocumentsToUpdate s
Is there another table (or tables) that will need updating, or is there a better way to do this?
My guess would be that you need to republish all of the affected pages for the caches etc. to update and populate with the new values properly.
With 10,000-plus documents, doing a full republish of everything might be quite slow.
You could also try updating the XML for each page in the cmsContentXml table to have the correct values, and then rebuilding the Examine indexes for the site, which should do the trick and be a bit quicker. This works because the contents of that table are used to rebuild the indexes, to save on speed.
Another option would be to write an API Controller task that you run once and then remove, updating all of the values using the Umbraco Services. But again, that'll be quite slow, I think, on the volume of pages you're talking about.

Self reference update on insert trigger in Informix

I'm extracting data from various sources into one table. In this new table, there's a field called lineno. This field's value should be in sequence based on company code and batch number. I wrote the following procedure:
CREATE PROCEDURE update_line(company CHAR(4), batch CHAR(8), rcptid CHAR(12));
    DEFINE lineno INT;

    SELECT COUNT(*)
      INTO lineno
      FROM tmp_cb_rcpthdr
     WHERE cbrh_company = company
       AND cbrh_batchid = batch;

    UPDATE tmp_cb_rcpthdr
       SET cbrh_lineno = lineno + 1
     WHERE cbrh_company = company
       AND cbrh_batchid = batch
       AND cbrh_rcptid = rcptid;
END PROCEDURE;
This procedure is called by the following trigger:
CREATE TRIGGER tmp_cb_rcpthdr_ins INSERT ON tmp_cb_rcpthdr
    REFERENCING NEW AS n
    FOR EACH ROW
    (
        EXECUTE PROCEDURE update_line(n.company, n.cbrh_batchid, n.cbrh_rcptid)
    );
However, I got the following error:
SQL Error = -747 Table or column matches object referenced in triggering statement.
From oninit.com, I learned that this error is raised when a triggered SQL statement acts on the triggering table, which in this case is the UPDATE statement.
So my question is, how do I solve this problem? Is there any workaround or better solution?
I think the design needs to be reconsidered. For a start, what happens if some rows get deleted from tmp_cb_rcpthdr? The COUNT(*) query will result in duplicate lineno values.
Even if this is an ETL-only process, and you can be confident the data won't be manipulated from elsewhere, performance will be an issue, and it will only get worse the more data you have for any one combination of company and batch_id.
Is it necessary for the lineno to increment from zero, or is it just to maintain the original load order? Because if it's the latter, a SEQUENCE or a SERIAL field on the table will achieve the same end and be a lot more efficient (see the sketch below).
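A minimal sketch of the SEQUENCE approach (the sequence name and the inserted values are made up for illustration):

-- Informix supports sequences; NEXTVAL preserves the load order.
CREATE SEQUENCE rcpthdr_seq START WITH 1 INCREMENT BY 1;

INSERT INTO tmp_cb_rcpthdr (cbrh_company, cbrh_batchid, cbrh_rcptid, cbrh_lineno)
VALUES ('C001', 'BATCH001', 'RCPT0001', rcpthdr_seq.NEXTVAL);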
If you must generate lineno in this way, I would suggest you create a second control table, keyed on company and batch_id, that tracks the current lineno value, i.e. (untested):
CREATE PROCEDURE update_line(company CHAR(4), batch CHAR(8));
    DEFINE lineno INT;

    SELECT cbrh_lineno INTO lineno
      FROM linenoctl
     WHERE cbrh_company = company
       AND cbrh_batchid = batch;

    UPDATE linenoctl
       SET cbrh_lineno = lineno + 1
     WHERE cbrh_company = company
       AND cbrh_batchid = batch;
    -- A test that no other process has grabbed this record
    -- might need to be considered here, i.e. cbrh_lineno = lineno

    RETURN lineno + 1;
END PROCEDURE;
Then use it as follows:
CREATE TRIGGER tmp_cb_rcpthdr_ins INSERT ON tmp_cb_rcpthdr
    REFERENCING NEW AS n
    FOR EACH ROW
    (
        EXECUTE PROCEDURE update_line(n.company, n.cbrh_batchid) INTO cbrh_lineno
    );
See the IDS documentation for more on using calculated values with triggers.
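For completeness, a sketch of the control table this assumes (the name linenoctl and its columns are taken from the procedure above):

CREATE TABLE linenoctl (
    cbrh_company CHAR(4),
    cbrh_batchid CHAR(8),
    cbrh_lineno  INT,
    PRIMARY KEY (cbrh_company, cbrh_batchid)
);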
