Pentaho map values to field after join

I am doing a data source integration using Pentaho Data Integration where I need to join a table A with multiple Google Analytics data streams (let's call them GA_A, GA_B, GA_C, ... GA_Z). All the GA streams have the same fields, but they come from different profiles. I am using a LEFT OUTER JOIN in each merge step to keep all the data from table A while adding the values of each GA data stream. The problem is that, when I make the joins, all the GA fields from each data stream are added to the result but renamed with an underscore. Here is an example:
GA_A, GA_B and GA_C all have the field "name" and are joined to table A. In the last join result, I get the fields "name", "name_1", and "name_2".
This obviously happens because of the nature of the LEFT OUTER JOIN. However, I want to "map" or "send" all the values from "name_1", "name_2", "name_3", etc. to the field "name". How can I achieve this? I see that there's a "Value Mapper" step in PDI, but I don't want to use a step for each of the 10 fields I bring from GA (also, I'm not sure that step does what I want to do).
Thanks!

As @Brian.D.Myers said, there are multiple solutions available.
First, if all the GA streams have the same structure, there is no need to join each of them separately - you can first union all the data (just direct all streams to the same step, e.g. a Dummy step) and do the join afterwards. In that case you won't get multiple name_* fields.
However, if there are still fields with the same name in table A and the GA stream, they will obviously be renamed with underscores (this is unavoidable, as you pointed out). To handle this there are a few options:
If you just need to copy values - use the "Set field value" step, which copies a value from one field to another.
If there is some complex processing logic - use the JavaScript step.
If the streams are relatively small and you actually need to retain both fields - you may use the "Stream lookup" step instead of Merge join; it lets you specify the names of the "merged" columns, so no naming conflicts occur.

Write to BQ one field of the rows of a PColl - Need the entire row for table selection

I have a problem:
My Pcoll is made of rows with this format
{'word':'string','table':'string'}
I want to write into BigQuery only the words, however I need the table field to be able to select the right table in BigQuery.
This is how my pipeline looks:
tobq = (input
        | 'write names to BigQuery ' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table=compute_table_name,
            schema=compute_schema,
            insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
            create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND)
        )
The function compute_table_name accesses an element and returns the table field. Is there a way to write into BQ just the words while still having this table selection mechanism based on rows?
Many thanks!
Normally the best approach with a situation like this in BigQuery is to use the ignoreUnknownValues parameter in ExternalDataConfiguration. Unfortunately Apache Beam doesn't yet support enabling this parameter while writing to BigQuery, so we must find a workaround, as follows:
Pass a Mapping of IDs to Tables as a table_side_input
This solution only works if identical word values are guaranteed to map to the same table each time, or if there is some kind of unique identifier for your elements. This method is a bit more involved than simply relying on ignoreUnknownValues, but it stays entirely within the Beam model instead of having to touch the BigQuery API.
The solution involves making use of table_side_input to dynamically pick which table to place an element in, even if the element is missing the table field. The basic idea is to create a dict of ID:table (where ID is either the unique ID or just the word field). Creating this dict can be done with CombineGlobally by combining all elements into a single dict.
Meanwhile, you use a transform to drop the table field from your elements before the WriteToBigQuery transform. Then you pass the dict into the table_side_input parameter of WriteToBigQuery, and write a callable table parameter that checks with the dict to figure out which table to use, instead of the table field.
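For concreteness, here is a minimal sketch of that approach, assuming elements shaped like {'word': ..., 'table': ...}, that each word always maps to the same table, and a Beam release whose WriteToBigQuery accepts a table_side_inputs tuple (the parameter referred to above). The table names, the 'word:STRING' schema and the step labels are illustrative, and the word-to-table dict is built here with Distinct plus AsDict rather than an explicit CombineGlobally; the result is the same single mapping:
import apache_beam as beam
from apache_beam.pvalue import AsDict

with beam.Pipeline() as p:
    input = p | 'read' >> beam.Create([
        {'word': 'apple', 'table': 'my-project:my_dataset.fruit'},
        {'word': 'carrot', 'table': 'my-project:my_dataset.vegetables'},
    ])

    # Build the {word: table} mapping that will become the side input.
    word_to_table = (input
                     | 'to kv' >> beam.Map(lambda e: (e['word'], e['table']))
                     | 'dedup' >> beam.Distinct())

    # Drop the table field so only the words get written.
    words_only = input | 'drop table field' >> beam.Map(lambda e: {'word': e['word']})

    # The table callable receives the element plus the side inputs and
    # picks the destination table from the mapping.
    _ = (words_only
         | 'write words to BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
             table=lambda element, mapping: mapping[element['word']],
             table_side_inputs=(AsDict(word_to_table),),
             schema='word:STRING',
             insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
             create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND))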

IBM Cognos 10 - Simple way to globally rename a table column?

My client has decided they want to rename a very commonly used data item name.
So, for example, the database has a column called 'Cost' and they see 'Cost' on a heap of reports.
The client now wants to see 'Net Cost' everywhere.
So we need to find every occurrence of 'Cost' and change it to 'Net Cost'.
I can do this in Framework Manager easily enough, and I can even run Tools > Report Dependency to find all the reports that use the 'Cost' column. But if there's 4,000 of them, that's a lot of work to update them all.
One idea is to deploy the entire content store to a Cognos Deployment zip file, extract that & do a global search & replace on the XML. But that's going to be messy & dangerous.
Option 2 is to use MotioPI to do a search & replace. I don't think the client will spring for buying this product just for this task.
Are there other options?
- has anyone written anything in the Cognos SDK which will do a rename?
- has someone investigated the Content Store database to the degree that they could do a rename on all the report specs in SQL?
- are there other options I've overlooked?
Any ideas would be greatly welcomed ...
Your first option is the way to go. This essentially boils down to an XML find-and-replace scenario. You'll need to isolate just the instances of the word "Cost" which are relevant to you. This may involve the start and end tags.
To change the data source reference across reports, you'll need to find and replace on the three part name [Presentation Layer].[Namespace].[Cost]. If there are filters on the item, they may just reference the one part name from the Query. Likewise, any derived queries would reference the two part name. Handle these by looking through the XML report spec and figuring out how to isolate the text.
I'm assuming your column names are set to inherit the business name from the model and not hard coded (Source Type should be Data Item Label, NOT Text). If not, you'll need to handle these as well. Looking at the XML, you would see <staticValue>Cost</staticValue> for these.
It's not really dangerous as you have a backup. Just take multiple passes, each with as granular a find and replace as possible.
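If you do go the deployment-export route, a small script keeps each pass granular and repeatable. A minimal sketch, assuming the report spec XML has already been extracted from the deployment archive into a folder; the folder name and the search/replace pairs below are placeholders to adapt to your own model and the patterns you identify in the spec:
import pathlib

# Hypothetical folder holding the XML extracted from the deployment zip.
SPEC_DIR = pathlib.Path('extracted_deployment')

# Keep each pass as narrow as possible: the three part name used in
# expressions, and hard coded column titles (Source Type = Text).
REPLACEMENTS = [
    ('[Presentation Layer].[Namespace].[Cost]',
     '[Presentation Layer].[Namespace].[Net Cost]'),
    ('<staticValue>Cost</staticValue>',
     '<staticValue>Net Cost</staticValue>'),
]

for path in SPEC_DIR.rglob('*.xml'):
    text = path.read_text(encoding='utf-8')
    updated = text
    for old, new in REPLACEMENTS:
        updated = updated.replace(old, new)
    if updated != text:
        path.write_text(updated, encoding='utf-8')
        print('updated', path)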
Motio will just look at the values inside the tags, so you will be unable to isolate Cost, thus it can't be used for this. However, it would come in handy for mass validation of reports after the find and replace. A one seat license for the year could be justified by the amount of development time it could save here.
Have you tried using DRU? (http://www-01.ibm.com/support/docview.wss?uid=swg24021248)
I have used this tool before to do what you are describing.
Good luck.
You can at least search for text in the Content Store (v10.2.1) using something like:
set define off
select distinct T4.name as folder_name, T2.name as report_name
from cmobjprops7 T1
inner join cmobjnames T2 on T1.cmid=T2.cmid
inner join cmobjects T3 on T1.cmid=T3.cmid
inner join cmobjnames T4 on T3.pcmid=T4.cmid
inner join ( -- only want the latest version (this still shows reports that have been deleted)
  select T4.name as folder_name, T2.name as report_name, max(T3.modified) as latest_version_date
  from cmobjnames T2
  inner join cmobjects T3 on T2.cmid=T3.cmid
  inner join cmobjnames T4 on T3.pcmid=T4.cmid
  where T2.name like '%myReport%' -- search only reports with 'myReport' in the name
  and T4.name in ('Project Zeus','Project Jupiter') -- search only these folders
  group by T4.name, T2.name
) TL on TL.folder_name=T4.name and TL.report_name=T2.name and TL.latest_version_date=T3.modified
where T1.spec like '%[namespace].[column_name]%' -- text you want to find
and T4.name in ('Project Zeus','Project Jupiter')
order by 1 desc, 2;

Core data Math?

I have a core data application. It allows the user to enter a job, then for that job they can enter equipment info, then for that equipment, they can enter parts for the equipment, and simultaneously that same part shows up under the job part list.
My problem is that when the user enters the same part under 2 different pieces of equipment, for the same job, I want the job part list to update its quantity. Right now it simply shows up as a duplicate under 2 tableview cells, leaving the user to have to add up the quantity manually.
Is there a way to allow the NSFetchedResultsController - perhaps through a predicate - to do the math automatically and use only 1 cell...simply updating the quantity?
Any suggestions are worth trying at this point...I'm officially stumped.
You can do part of this with the NSFetchRequest you assign to your NSFetchedResultsController.
Check the documentation for:
- (void)setPropertiesToGroupBy:(NSArray *)array
- (void)setReturnsDistinctResults:(BOOL)values
If you fetch all parts and group by the part #, and only return distinct values, you’ll get one result for each part #. However, I haven’t figured out how to show the actual part count yet.
Understanding your model would help here, because it's not clear what you mean by "enter parts for the equipment":
- does this CREATE a new managedObject for each required part?
- or does this simply create a reference between equipment and part, such that every time you do this you create a reference to the same part?
I would create a RequiredPart entity and have a reference to Part to identify the type of part that is required. So you would have something like
Job ->> Equipment ->> RequiredPart - Part
Then in your predicate, use collection operators such as @sum and @count (possibly in conjunction with setPropertiesToGroupBy) to get the quantity.
Here is a link to an explanation of how to do a group-by query: http://mattconnolly.wordpress.com/2012/06/21/ios-core-data-group-by-and-count-results/

Map-side join with Hadoop Streaming

I have a file in which each line is a record. I want all records with the same value in a certain field (call it field A) to go to the same mapper. I have heard this is called a Map-Side Join, and I also heard that it's easy if the records in the file are sorted by what I call field A.
If it would be easier, the data could be spread across multiple files, but each file sorted on field A.
Is this right? How do I do this with streaming? I'm using Python. I assume it's just part of the command I use to start Hadoop?
What is the real justification for wanting only certain records to go to certain mappers? If what you want out of this is for the final result to be 3 output files (one with all A, another with all B, the last with all C), you can accomplish that with multiple reducers. We'd need to know what you really want to accomplish.

Fetch data from multiple tables and sort all by their time

I'm creating a history page. So I was wondering if there is any way to fetch all rows from multiple tables and then sort them by their time? Every table has a field called "created_at".
So is there any way to fetch from all tables and sort them without having Rails sort them for me?
You may get a better answer, but I would presume you would need to:
Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose [e.g. Name, Description]
Modify all tables that generate a "history" item to consume this new table via a Foreign Key relationship on History.Id
"Mashing up" tables [i.e. merging different result sets into a single result set] is a very difficult problem, but you would effectively be doing the above anyway - just in the application layer, so why not do it correctly and more efficiently in the data layer.
Hope this helps :)
You would need to perform SQL like:
Select * from table order by created_at asc
Store this into an array. Do this for each of the data sources, and then perform a merge sort on all the arrays in Ruby. Of course this will work well for small data sets, but once you get a data set that is large (i.e. greater than will fit into memory) then you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.
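For the small-data case, the merge of the already-sorted, per-table result sets is the easy part. A minimal sketch of the idea, shown in Python purely for illustration (a Ruby merge is analogous), assuming each result set is a list of records with a created_at value:
import heapq

# One pre-sorted list per table (each already ordered by created_at).
posts    = [{'created_at': '2023-01-02', 'title': 'Post A'}]
comments = [{'created_at': '2023-01-01', 'body': 'Comment B'}]
uploads  = [{'created_at': '2023-01-03', 'file': 'photo.png'}]

# heapq.merge lazily merges pre-sorted sequences without concatenating
# and re-sorting everything, which is the merge step described above.
history = list(heapq.merge(posts, comments, uploads,
                           key=lambda row: row['created_at']))

for row in history:
    print(row['created_at'], row)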
Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION statements in your SQL to return your result set
On different machines: you'll want to test this for performance, but you could use Linked Servers and UNION, ORDER BY. Alternatively, you could have Ruby get the results from each db, then combine and sort them
EDIT: From your last comment about different tables and not DBs, use something like this:
SELECT created_at FROM table1
UNION
SELECT created_at FROM table2
ORDER BY created_at
