Pentaho Spoon: CROSS JOIN on 2 streams

I have a main stream with some fields and hundreds of thousands of records.
I created a Table Input step that just queries the max value of a date column. It returns a single record.
Now I need to do some kind of CROSS JOIN of this Table Input against the main stream and add this new column to its column set. There is no ON clause; all records will have the same value for that column.
I tried using Merge Join, but instead of adding the value to all records it added an extra record to the stream. This extra record has null in all fields and the date value in the new field, while all the original records have null in the new field.

You could use a "Join Rows (cartesian product)" step for this case.

You could also use a Stream Lookup step. You would just need to make sure your main stream has a constant lookup value (add an Add Constants step right before the Stream Lookup) and add the same constant value as a new column to your query stream. The Stream Lookup will then find the query result and add it to your main stream.

Write to BQ one field of the rows of a PColl - need the entire row for table selection

I have a problem: my PColl is made of rows with this format:
{'word': 'string', 'table': 'string'}
I want to write only the words into BigQuery; however, I need the table field to be able to select the right table in BigQuery.
This is how my pipeline looks:
tobq = (input
        | 'write names to BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table=compute_table_name,
            schema=compute_schema,
            insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
            create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND)
        )
The function compute_table_name accesses an element and returns the table field. Is there a way to write into BQ just the words while still having this table selection mechanism based on rows?
Many thanks!
Normally the best approach with a situation like this in BigQuery is to use the ignoreUnknownValues parameter in ExternalDataConfiguration. Unfortunately Apache Beam doesn't yet support enabling this parameter while writing to BigQuery, so we must find a workaround, as follows:
Pass a Mapping of IDs to Tables as a table_side_input
This solution only works if identical word values are guaranteed to map to the same table each time, or if there is some kind of unique identifier for your elements. This method is a bit more involved, but it relies only on the Beam model instead of having to touch the BigQuery API.
The solution involves making use of table_side_inputs to dynamically pick which table to place an element in, even if the element is missing the table field. The basic idea is to create a dict of ID:table (where ID is either the unique ID or just the word field). Creating this dict can be done with CombineGlobally by combining all elements into a single dict.
Meanwhile, you use a transform to drop the table field from your elements before the WriteToBigQuery transform. Then you pass the dict into the table_side_inputs parameter of WriteToBigQuery and write a callable table parameter that checks the dict to figure out which table to use, instead of the table field.
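For illustration, here is a minimal sketch of that approach; the sample rows, step labels, and table names are made up, and it assumes each word maps to a single table:

import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | 'CreateRows' >> beam.Create([
        {'word': 'alpha', 'table': 'my-project:my_dataset.table_a'},
        {'word': 'beta', 'table': 'my-project:my_dataset.table_b'},
    ])

    # Combine all (word, table) pairs into a single dict and use it as a side input.
    table_map = beam.pvalue.AsSingleton(
        rows
        | 'ToKV' >> beam.Map(lambda e: (e['word'], e['table']))
        | 'BuildTableMap' >> beam.combiners.ToDict())

    # Drop the 'table' field so only the words get written.
    words_only = rows | 'DropTableField' >> beam.Map(lambda e: {'word': e['word']})

    words_only | 'WriteToBQ' >> beam.io.gcp.bigquery.WriteToBigQuery(
        # The table callable receives the element plus each side input.
        table=lambda e, tmap: tmap[e['word']],
        table_side_inputs=(table_map,),
        schema='word:STRING',
        create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND)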

Inserting after the last record (bottom) shown in DBGrid

What should be the code for inserting after the last record using the ClientDataSet?
I tried the following:
cdsSomething.Last;
cdsSomething.Insert;
But it appears to replace the last record instead. I am sure there must be a quick way to do this.
The method to append a record to the end of the dataset (leaving any index aside) is Append. You don't even need to call Last beforehand.
cdsSomething.Append;
Insert inserts a row before the selected record, so with your code the new record should become the second-to-last record.
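For illustration, a minimal sketch of the Append approach with a value filled in and posted (the field name is made up):

cdsSomething.Append;
cdsSomething.FieldByName('Name').AsString := 'New row';
cdsSomething.Post;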
In general, where an added record (or, in fact, any record) appears in the DBGrid does not depend on the dataset operations which were used to insert it.
In fact, the DBGrid is irrelevant to this question, because it simply displays the added row in the ClientDataSet in the position the added row occurs in the CDS according to its current index order.
So, for example, if the CDS contains an integer ID field, and its current index is this ID field (e.g. because the CDS's IndexFieldNames property is set to 'ID'), to make the added row appear at the end, all you need to do is to set its ID value to something higher than any existing record in the CDS. If the field is of type ftAutoInc, this will, of course, happen automatically.
Uwe Raabe has answered this question a bit differently. What he says is correct if the CDS is not using any index, so the records are displayed in the physical order in which they appear in the CDS's data file. However, relying on the physical order to determine the display order is not necessarily a good idea if the display order is important. If it is, then use an indexed field (or fields) to determine the order.
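A small sketch of the index-based approach described above (the field names and the GetNextID helper are made up):

cdsSomething.IndexFieldNames := 'ID';  // display order follows ID
cdsSomething.Append;
cdsSomething.FieldByName('ID').AsInteger := GetNextID; // higher than any existing ID
cdsSomething.Post;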

IN clause used with an IN parameter is not working while DELETING

I want to delete records from the database based on some IDs.
This is the statement written in my STORED PROCEDURE to delete records based on DeletedID.
DELETE FROM tms_activity where activity_id IN (DeletedID);
My DeletedID is a string of comma-separated IDs, like "1,2,3".
Now when I pass DeletedID into my statement as a parameter, it takes the input as "1,2,3" and deletes only the record with the first ID it gets (1 in this case). But I want to delete all the records based on the given parameter.
DeletedID must be passed as (1,2,3) rather than ("1,2,3"); only then will it delete all the records for the provided IDs. How can I do that?
I consulted this question: MySQL wrong output with IN clause and parameter, but couldn't understand how I can achieve my result.
Did you try this?
DELETE FROM tms_activity where activity_id in
( SELECT ACTIVITY_ID FROM SOMETABLE WHERE FIELD = CRITERIA )
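If the IDs really do arrive as a single comma-separated string, another common MySQL option (not mentioned in the answer above) is FIND_IN_SET, sketched here with the table and parameter names from the question:

-- Matches each activity_id against the comma-separated DeletedID string.
DELETE FROM tms_activity
WHERE FIND_IN_SET(activity_id, DeletedID) > 0;

Note that FIND_IN_SET expects the list without spaces after the commas.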
Here are some more findings; if I were facing this problem, I would choose one of these solutions.
I investigated and found a very good link for you:
http://johnhforrest.com/2010/10/parameterized-sql-queries-in-c/
Parameterized SQL queries are beneficial if you have to send a large number of parameters and want to run them against one SQL statement. The SQL is sent in one chunk and all the parameters in another, which keeps the network traffic low.
You can search more on this topic.
Thanks
QF

SSIS Foreach through a table, insert into another and delete the source row

I have an SSIS routine that reads from a very dynamic table and inserts whichever rows it finds into a table in a different database, before truncating the original source table.
Due to the dynamic nature of the source table, this truncation, not surprisingly, leads to rows not making it to the second database.
What is the best way of deleting only those rows that have been migrated?
There is an identity column on the source table but it is not migrated across.
I can't change either table schema.
An option that might sound stupid, but works, is to delete first and use the OUTPUT clause.
I created a simple control flow that populates a table for me.
IF EXISTS
(
    SELECT 1 FROM sys.tables AS T WHERE T.name = 'DeleteFirst'
)
BEGIN
    DROP TABLE dbo.DeleteFirst;
END

CREATE TABLE dbo.DeleteFirst
(
    [name] sysname
);

INSERT INTO
    dbo.DeleteFirst
SELECT
    V.name
FROM
    master.dbo.spt_values AS V
WHERE
    V.name IS NOT NULL;
In my OLE DB Source, instead of using a SELECT, DELETE the data you want to go down the pipeline and OUTPUT the DELETED virtual table. Something like:
DELETE DF
OUTPUT DELETED.*
FROM dbo.DeleteFirst AS DF;
It works, it works!
One option would be to create a table to log the identity of your processed records into, and then a separate package (or dataflow) to delete those records. If you're already logging processed records somewhere then you could just add the identity there - otherwise, create a new table to store the data.
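As a rough sketch of that first option (the log table and column names here are made up, not from the question), the cleanup step could delete by joining the source to the log:

DELETE S
FROM dbo.SourceTable AS S
INNER JOIN dbo.MigratedLog AS M
    ON M.SourceId = S.Id;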
A second option: if you're trying to avoid creating additional tables, then separate the record selection and record processing into two stages. Broadly, you'd select all your records in the control flow, then process them one-by-one in the dataflow.
Specifically:
Create a variable of type Object to store your record list, and another variable matching your identity type (int presumably) to store the 'current record identity'.
In the control flow, add an Execute SQL task which uses a query to build a list of identity values to process, then stores them into the recordlist variable.
Add a Foreach Loop Container to process that list; the foreach task would load the current record identifier into the second variable you defined above.
In the foreach task, add a dataflow to copy that single record, then delete it from the source.
There are quite a few examples of this online; e.g. this one from the venerable Jamie Thomson, or this one, which includes a bit more detail.
Note that you didn't talk about the scale of the data; if you have very large numbers of records the first suggestion is likely a better choice. Note that in both cases you lose the advantage of the table truncation (because you're using a standard delete call).
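As a rough sketch of the second option's two SQL pieces (object names are made up; ? is the OLE DB parameter marker used in the task configuration):

-- Control flow: the Execute SQL task builds the list of identities to process.
SELECT Id FROM dbo.SourceTable;

-- Inside the Foreach loop, after the dataflow has copied the current record.
DELETE FROM dbo.SourceTable WHERE Id = ?;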

Generate unique sequence number and transaction with Entity Framework 4

I have a Document Sequence table with 1 row and 1 column. All it holds is a sequence number. Every time we need to create a new document, we call a stored proc which increments the existing sequence number in this table by 1; we read that value and use it as the id in the document table.
My question is: if multiple requests call this stored proc, which updates the sequence number and returns it, is there a chance it will give the same number to multiple callers? Right now I have another stored proc which calls this sequence-number generator sp in a transaction and then creates a document with the obtained id. I was wondering, if I had to do it with just Entity Framework in code and not use the stored proc, is it possible? Will Entity Framework 4 support transactions by itself so that only one process is calling the seq# updater sp at a time?
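For illustration, here is a minimal sketch of the kind of sequence-bumping procedure described above, assuming SQL Server (object names are made up):

CREATE PROCEDURE dbo.GetNextDocumentNumber
AS
BEGIN
    DECLARE @NewId int;
    -- A single UPDATE increments and captures the new value in one atomic statement.
    UPDATE dbo.DocumentSequence
    SET @NewId = SequenceNumber = SequenceNumber + 1;
    SELECT @NewId AS NewSequenceNumber;
END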
