What is " rows in all the tables" in for each loop? - foreach

When setting up a Foreach Loop to read products from an "objProduct" object variable, I get three options in the "Enumerator Mode" pane: "Rows in the first table", "Rows in all tables (ADO.NET dataset only)", and "All tables (ADO.NET dataset only)".
I know "Rows in the first table" is the right option for my current case. However, I'm curious in which scenarios the second and third options would be used.
It seems that the "ADO Object Source Variable" would contain multiple tables if the 2nd/3rd option were applied. That's confusing... shouldn't one variable be regarded as one table, so that only the first option is needed?
P.S.
I did some research, and only MSDN sheds some light (quoted below), but it's not quite clear when these options would be applied and for what purpose.
**Rows in all tables (ADO.NET dataset only)**
Select to enumerate rows in all tables. This option is available only if the objects to enumerate are all members of the same ADO.NET dataset.
**All tables (ADO.NET dataset only)**
Select to enumerate tables only.

Let's say that you execute the following SQL in an Execute SQL Task (using an ADO.NET connection) and you store the full result set in an SSIS Object variable.
select * from
(select 1 as id, 'test' as description) resultSet1
;
select * from
(select 2 as anotherId, 'test2' as description union
select 3 as anotherId, 'test3' as description) resultSet2
That object is actually a System.Data.DataSet, which can contain multiple result sets (accessible via the Tables property). Each of those result sets is a System.Data.DataTable object. Within each result set (or System.Data.DataTable) you have rows.
The Rows in all tables (ADO.NET dataset only) and All tables (ADO.NET dataset only) options can be used when you need to iterate through all the result sets (instead of just the first one). The difference between the two is what objects are being enumerated over.
Rows in all tables (ADO.NET dataset only) - take all the rows of data returned from the SQL above and go through them one by one, mapping the column values to variables specified in your Variable Mappings. For the example above, you would have 3 total iterations (3 total rows). This behavior in a Script Task would look something like this:
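A rough C# sketch of that logic (the variable name User::objProduct is just an assumption):
// Inside the Script Task's Main(); assumes "using System.Data;" and that the
// Object variable was passed in as User::objProduct.
DataSet ds = (DataSet)Dts.Variables["User::objProduct"].Value;

foreach (DataTable table in ds.Tables)      // every result set in the dataset
{
    foreach (DataRow row in table.Rows)     // every row of that result set
    {
        // The Foreach Loop does the equivalent of this once per iteration:
        // map the row's column values to the variables in Variable Mappings.
        object firstColumnValue = row[0];
    }
}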
All tables (ADO.NET dataset only) - take all the result sets from the SQL above and go through them one by one, mapping the result set to the variable specified in Variable Mappings. For the example above, you would have 2 total iterations (2 total result sets). This behavior in a Script Task would look something like this:
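Again, a rough C# sketch (same assumption about the variable name):
DataSet ds = (DataSet)Dts.Variables["User::objProduct"].Value;

foreach (DataTable table in ds.Tables)      // one iteration per result set
{
    // The Foreach Loop does the equivalent of this once per iteration:
    // hand the current DataTable to the Object variable named in
    // Variable Mappings, so a nested task can read its rows.
    int rowCount = table.Rows.Count;
}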
I've never had the need to use either one of these options, so I can't provide any specific scenarios where I've used them.

Related

How to identify cases that have multiple variables with the same value in SPSS

I have a dataset in which there are multiple variables recorded at various time points.
I'm trying to identify the number/percentage of cases that have the same value in any of those multiple variables.
For example, suppose I have a database of teachers who left the schools where they worked, with a variable for why the teacher left each school. So I have reasonleft1, reasonleft2, reasonleft3, up to reasonleft20. Each reasonleft variable has the same coded response options, e.g. 1 = better opportunity elsewhere, 2 = retired, 3 = otherwise left the workforce, etc. I'm stumped on how to figure out whether any case/teacher left multiple schools for the same reason. For example, which teachers left multiple schools (out of the 20) because of 1 = better opportunity elsewhere?
Thanks!
This can be done in the following two steps:
You need to restructure the dataset so that each "time" appears in a separate row.
Now you can aggregate to count the number of appearances of each reason per person.
The following syntax will do that:
varstocases
/make facilititype from facilititype1_pre facilititype2_pre facilititype3_pre
/make timeinplace from timeinplace1_pre timeinplace2_pre timeinplace3_pre
/make reasonleft from reasonleft1_pre reasonleft2_pre reasonleft3_pre
/index = timeN(reasonleft).
* continue the numbering for as many variables as needed.
dataset declare MyAgg.
aggregate outfile=MyAgg /break=ID reasonleft /Ntimes=n.
At this point you have a new dataset which has the count of each reason for each ID. If you wish you can go back to wide format, and create a column for each reason (the values in each column are the count of times this reason appeared for the ID).
This way:
dataset activate MyAgg.
casestovars /id=ID /index=reasonleft/separator="_".

Internal working of Pentaho Mondrian for query generation

I am trying to analyse how SQL queries are generated by Pentaho Mondrian. Let us assume there are no aggregate tables for now. I have noticed two types of behaviour when I try to fetch data from the data warehouse (star schema) using Pentaho.
Case 1: I apply various filters and try to get the fact count corresponding to them, which is the default measure in my case.
Case 2: I apply the same filters as in case 1 and try to get some other measure by explicitly putting it into the measures selection box.
Observation: In both cases, the SQL queries generated in the back-end include joins of the fact table with multiple dimension tables, according to the filters applied and the columns and rows selected in Pentaho.
However, the join order is different in the two cases. In case 1, the fact table is placed at the left-most position of the join, whereas in case 2 it is placed somewhere between the dimension tables.
I have connected Pentaho to AWS Athena at the back-end to execute queries on data stored in S3 through a JDBC connection. Since Athena has Presto at the back-end, and Presto does not do automatic JOIN reordering, the queries in case 2 fail.
(http://docs.qubole.com/en/latest/user-guide/presto/best-practices.html)
I noticed that Presto performs hash joins here. For a hash join to be effective, the largest table should be placed on the left side of the join, so that the smaller table is held in memory while the join is performed. This is not happening in the second case: Presto tries to hash the fact table, which holds far more data than any of the dimension tables. This causes the query to fail whenever I add a measure explicitly (other than the default measure) and the date range is large (a whole year, for example).
Can someone please give some insight into the logic behind Mondrian's query formation in these two cases? Also, is there a way to make the fact table always stay in the left-most position of the joins in the SQL generated by Mondrian? Or is there a Presto property that could be set through Athena to change the join type from a hash join to some other join type that would avoid this problem?
Pentaho version - 6.1.0
Saiku version - 3.10

SSIS Foreach through a table, insert into another and delete the source row

I have an SSIS routine that reads from a very dynamic table and inserts whichever rows it finds into a table in a different database, before truncating the original source table.
Due to the dynamic nature of the source table, this truncation unsurprisingly leads to rows not making it to the second database.
What is the best way of deleting only those rows that have been migrated?
There is an identity column on the source table but it is not migrated across.
I can't change either table schema.
An option that might sound stupid, but it works, is to delete first and use the OUTPUT clause.
I created a simple control flow that populates a table for me.
IF EXISTS
(
    SELECT 1 FROM sys.tables AS T WHERE T.name = 'DeleteFirst'
)
BEGIN
    DROP TABLE dbo.DeleteFirst;
END

CREATE TABLE dbo.DeleteFirst
(
    [name] sysname
);

INSERT INTO dbo.DeleteFirst
SELECT V.name
FROM master.dbo.spt_values V
WHERE V.name IS NOT NULL;
In my OLE DB Source, instead of using a SELECT, DELETE the data you want to send down the pipeline and OUTPUT the DELETED virtual table. Something like:
DELETE DF
OUTPUT DELETED.*
FROM dbo.DeleteFirst AS DF;
It works, it works!
One option would be to create a table to log the identities of your processed records into, and then a separate package (or dataflow) to delete those records. If you're already logging processed records somewhere then you could just add the identity there; otherwise, create a new table to store the data.
A second option, if you're trying to avoid creating additional tables: separate the record selection and record processing into two stages. Broadly, you'd select all your records in the control flow, then process them one by one in the dataflow.
Specifically:
Create a variable of type Object to store your record list, and another variable matching your identity type (int presumably) to store the 'current record identity'.
In the control flow, add an Execute SQL task which uses a query to build a list of identity values to process, then stores them into the recordlist variable.
Add a Foreach Loop Container to process that list; the foreach task would load the current record identifier into the second variable you defined above.
In the foreach task, add a dataflow to copy that single record, then delete it from the source (a rough sketch of the queries involved is below).
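As a rough sketch (the table and column names here are made up), the queries behind those steps might look like this:
-- Execute SQL task: build the full result set of identities to process
-- and store it in the Object variable.
SELECT RowId
FROM dbo.SourceTable;

-- Data flow source inside the Foreach Loop, parameterised with the
-- "current record identity" variable (? is the OLE DB parameter placeholder):
SELECT *
FROM dbo.SourceTable
WHERE RowId = ?;

-- Execute SQL task at the end of each iteration, same parameter:
DELETE FROM dbo.SourceTable
WHERE RowId = ?;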
There's quite a few examples of this online; e.g. this one from the venerable Jamie Thomson, or this one which includes a bit more detail.
Note that you didn't talk about the scale of the data; if you have very large numbers of records the first suggestion is likely a better choice. Note that in both cases you lose the advantage of the table truncation (because you're using a standard delete call).

Optimizing JOINs: comparison with indexed tables

Let's say we have a time-consuming query described below:
SELECT ...
FROM
    (SELECT ...
     FROM ...) AS FOO
LEFT JOIN
    (SELECT ...
     FROM ...) AS BAR
    ON FOO.BarID = BAR.ID
Let's suppose that
(SELECT ...
FROM ...) AS FOO
returns many rows (let's say 10 M), and every single row has to be joined with data in BAR.
Now let's say we insert the result of
(SELECT ...
 FROM ...) AS BAR
into a table, and add the ad hoc index(es) to it.
My question:
How would the performance of the "JOIN" with a live query differ from the performance of the "JOIN" to a table containing the result of that live query, to which ad hoc indexes have been added?
Another way to put it:
If a JOIN is slow, would there be any gain in actually storing and indexing the table that we JOIN to?
The answer is 'Maybe'.
It depends on the statistics of the data in question. The only way you'll find out for sure is to actually load the first query into a temp table, stick a relevant index on it, then run the second part of the query.
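As a sketch, keeping the placeholder SELECTs from the question (so the object names here are purely illustrative), that test would look something like:
-- Materialise BAR once instead of evaluating it inside the join.
SELECT ...
INTO #BAR              -- temp table; use a permanent table if you want to reuse it
FROM ...;

CREATE INDEX IX_BAR_ID ON #BAR (ID);

-- Then join the large FOO result against the indexed copy.
SELECT ...
FROM (SELECT ...
      FROM ...) AS FOO
LEFT JOIN #BAR AS BAR
    ON FOO.BarID = BAR.ID;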
I can tell you that if speed is what you want, and it's possible for you to load the results of your first query permanently into a table, then of course your query is going to be quicker.
If you want it to be even faster, depending on which DBMS you are using you could consider creating an index which crosses both tables - if you're using SQL Server they're called 'Indexed Views' or you can also look up 'Reified indexes' for other systems.
Finally, if you want the ultimate in speed, consider denormalising your data and eliminating the join that is occurring on the fly - basically you move the pre-processing (the join) offline at the cost of storage space and data consistency (your live table will be a little behind depending on how frequently you run your updates).
I hope this helps.

Fetch data from multiple tables and sort all by their time

I'm creating a history page. So I was wondering if there is any way to fetch all rows from multiple tables and then sort them by their time? Every table has a field called "created_at".
Is there any way to fetch from all the tables and sort them without having Rails do the sorting for me?
You may get a better answer, but I would presume you would need to
Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose [eg Name, Description]
Modify all tables that generate a "history" item to consume this new table via Foreign Key relationship on History.Id
"Mashing up" tables [ie merging different result sets into a single result set] is a very difficult problem, but you would effectively be doing the above anyway - just in the application layer, so why not do it correctly and more efficiently in the data layer.
Hope this helps :)
You would need to perform SQL like:
Select * from table order by created_at asc
Store the result in an array. Do this for each of the data sources, and then perform a merge sort on all the arrays in Ruby. Of course this will work well for small data sets, but once you get a data set that is large (i.e. greater than will fit into memory) you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.
Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION statements in your SQL to return your result set.
On different machines: you'll want to test this for performance, but you could use linked servers and UNION, ORDER BY. Alternatively, you could have Ruby get the results from each DB, and then combine and sort them.
EDIT: From your last comment, since these are different tables and not different DBs, use something like this:
SELECT Created FROM table1
UNION
SELECT Created FROM table2
ORDER BY Created
