Informatica Mapping: Joiner must have exactly two inputs - mapping

I get the following message when I try to validate the mapping (see Warning attached):
...Joiner jnr_Normal_jnr_Master_ZC_OR_Delay_Reason must have exactly two inputs.
WARNING: Joiner transformation jnr_Normal_jnr_Master_ZC_OR_Delay_Reason Condition field OR_CASE_ID1 is unconnected.
I have a joiner (jnr_Master_ZC_OR_Delay_Reason) and expression (exp_Text) that I would like to join. I tried to do this with a normal joiner (jnr_Normal_jnr_Master_ZC_OR_Delay_Reason). However, the data from the jnr_Master_ZC_OR_Delay_Reason does not connect to this jnr_Normal_jnr_Master_ZC_OR_Delay_Reason. See Joiners-Two Inputs attached.
Should I be using a different transformation to join the joiner and expression?
I tried to use a Sorter but I still get the same error message. Am I using the Sorter correctly? Please see the attached images.

If you want to join flows that originate from the same source (let's call that a self-join), you need to have the data sorted on both branches of the flow and check the Sorted Input property on the Joiner Transformation (jnr_Normal_jnr_Master_ZC_OR_Delay_Reason in this case).
A self-join is only allowed if both flows are sorted. Depending on your flow, it may be enough to sort data only once, before the flow gets split.
Now, if you enable the Sorted Input property but the data is not actually sorted, you will get an error during session execution.

InfluxDB: Group rows with same timestamp

Assume a DB with the following data records:
2018-04-12T00:00:00Z value=1000 [series=distance]
2018-04-12T00:00:00Z value=10 [series=signal_quality]
2018-04-12T00:01:00Z value=1100 [series=distance]
2018-04-12T00:01:00Z value=0 [series=signal_quality]
There is one field called value. Square brackets denote the tags (further tags omitted). As you can see, the data is captured in different data records instead of using multiple fields on the same record.
Given the above structure, how can I query the time series of distances, filtered by signal quality? The goal is to only get distance data points back when the signal quality is above a fixed threshold (e.g. 5).
"Given the above structure", there's no way to do it in plain InfluxDB.
Please keep in mind that Influx is NOT a relational DB; it is different, even though the query language looks familiar.
Again, given that structure, you can proceed with Kapacitor, as was already mentioned.
But I strongly suggest rethinking the structure if possible, that is, if you are able to control the way the metrics are collected.
If that is not an option, here's the way: spin up a simple Kapacitor task that joins the two points into one based on time (check this out for how), and then writes the result into a new measurement.
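A minimal TICKscript sketch of that idea (the source measurement name 'm', the database name and the retention policy below are assumptions, not from the question):

// Read both series, join points that share a timestamp,
// and write the combined point into a new measurement.
var distance = stream
    |from()
        .measurement('m')
        .where(lambda: "series" == 'distance')

var quality = stream
    |from()
        .measurement('m')
        .where(lambda: "series" == 'signal_quality')

distance
    |join(quality)
        .as('d', 'q')
    |eval(lambda: "d.value", lambda: "q.value")
        .as('distance', 'signal_quality')
    |influxDBOut()
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('DistanceQualityTogether')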
The resulting data point would then look like this:
DistanceQualityTogether,tag1=if,tag2=you,tag3=need,tag4=em distance=1000,signal_quality=10 2018-04-12T00:00:00Z
The rest is obvious with such a measurement.
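For instance, filtering by signal quality (threshold 5, as in the question) becomes a plain InfluxQL query:

SELECT distance FROM DistanceQualityTogether WHERE signal_quality > 5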
But again, if you can configure your metrics to be sent like this in the first place, better do it.

Pentaho map values to field after join

I am doing a data source integration using Pentaho Data Integration where I need to join a table A with multiple Google Analytics data streams (let's call them GA_A, GA_B, GA_C, ..., GA_Z). All the GA streams have the same fields, but they come from different profiles. I am using a LEFT OUTER JOIN in each merge step to keep all the data from table A while adding the values of each GA data stream. The problem is that, when I make the joins, all the GA fields from each data stream are added to the result but renamed with an underscore. Here is an example:
GA_A, GA_B and GA_C all have the field "name" and are joined to table A. In the last join result, I get the fields "name", "name_1", and "name_2".
This obviously happens because of the nature of the LEFT OUTER JOIN. However, I want to "map" or "send" all the values from "name_1", "name_2", "name_3", etc. to the field "name". How can I achieve this? I see that there's a "Value Mapper" step in PDI, but I don't want to use a step for each of the 10 fields I bring from GA (also, I'm not sure that step does what I want).
Thanks!
As @Brian.D.Myers said, there are multiple solutions available.
First, if all the GA streams have the same structure, there is no need to use a join for all of them - you can first union all the GA data (just direct them into the same step, e.g. a Dummy step) and do the join afterwards - in that case you won't get multiple name_* fields.
However, if there are still fields with the same name in table A and the GA stream, they will obviously be renamed with underscores (as you pointed out, that is by design). To handle this there are a few options:
If you need to just copy values - use the Set field value step - it copies a value from one field to another
If there is some complex processing logic - use the JavaScript step (see the sketch after this list)
If the streams are relatively small and you actually need to retain both fields - you may use the "Stream lookup" step instead of Merge join - it allows you to specify the names of the "merged" columns so no naming conflicts occur.
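For the JavaScript option, here is a minimal sketch of a Modified Java Script Value step that coalesces the renamed columns back into one field (field names taken from the example above; extend the pattern to the other GA fields):

// Take the first non-null value among name, name_1, name_2.
// With a LEFT OUTER JOIN, unmatched rows carry nulls here.
var merged_name = name;
if (merged_name == null) merged_name = name_1;
if (merged_name == null) merged_name = name_2;

You would then expose merged_name as an output field of the step.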

PIG error message: Projected field doesn't exist

I'm taking an on-line course. My current assignment asks me to compute an average for one of the fields.
I got this working when I was using a single relation, but the current task involves computing a value from a relation created by joining two others.
When I try a function based on the previously successful approach, I get this error message which has me confused.
Invalid field projection. Projected field [join2::Y_rate2::wtd_stars] does not exist in schema:
The code I have entered in the PIG shell is:
avg = FOREACH groupedJoin2 GENERATE AVG(join2::Y_rate2::wtd_stars);
When I enter
grunt> describe groupedJoin2
this is my output:
groupedJoin2: {group: chararray,join2: {(Y_rate2::business_id: chararray,Y_rate2::stars: int,Y_rate2::useful_clipped: int,Y_rate2::wtd_stars: double,Y_m2::business_idgroup: chararray,Y_m2::num_ratings: long,Y_m2::avg_stars: double,Y_m2::avg_useful: double,Y_m2::avg_wtdstars: double)}}
I believe that my problem is that I don't know how to reference the field I want to compute an average of, but hours of searching over several days has not enlightened me.
Can anyone point out how to reference the field, if that is my problem? If that isn't my problem, I'll be grateful for your pointing me in the right direction.
I think you want to say: AVG(join2.Y_rate2::wtd_stars)
Dot is used to dereference a bag or tuple (in this case, the join2 bag).
Colons are used after a join or group to disambiguate field names.
Bag dereferencing can be done by name (bag.field_name) or position (bag.$0). If a set of fields are dereferenced (bag.(name1, name2) or bag.($0, $1)), the expression represents a bag composed of the specified fields.
http://pig.apache.org/docs/r0.15.0/basic.html#deref
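Putting that together, the corrected statement from the question would be:

avg = FOREACH groupedJoin2 GENERATE AVG(join2.Y_rate2::wtd_stars);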

Retrieving statement in Jena by its unique ID

I'm building a REST API which will serve information about statements stored in my Jena TDB.
It would be great if each statement had its own unique ID so I could use this ID in a GET request to retrieve information about a particular statement. Is there something like that in Jena?
I know I can retrieve statement(s) by providing appropriate subject/predicate/object identifiers to model.listStatements method, but it would be quite ugly to add these parameters to API GET requests.
In RDF, a triple is defined by its subject, predicate and object. If you have two triples with the same S/P/O, it is really the same triple (value equality, not instance equality). An RDF graph is a set of triples; if you add a triple twice, the set still has only one instance. There is no triple-id concept in RDF, and there isn't one internally in TDB.
So you could assign unique identifiers, say strings of length 4, to every S, every P and every O, and save them all as key/value (id/resource, id/property) pairs. The concatenation, a string of length 12, is then a unique identifier for your statement.
Unlike tagging each statement with its own id (where deleting and re-adding a statement would produce a different id), deriving the id from S/P/O yields the same identifier for the same statement every time.
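A minimal Java sketch of that idea, deriving a stable ID by hashing the S/P/O string forms rather than storing separate key/value pairs (StatementIds and idFor are made-up names, not a Jena API; assumes a recent Jena with org.apache.jena packages):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.jena.rdf.model.Statement;

public final class StatementIds {
    // The same triple always hashes to the same ID, matching the
    // value-equality semantics of RDF triples described above.
    public static String idFor(Statement stmt) throws Exception {
        String key = stmt.getSubject().toString() + "|"
                   + stmt.getPredicate().toString() + "|"
                   + stmt.getObject().toString();
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] hash = md.digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}

Since the hash is not reversible, the REST layer still needs a lookup table from ID back to S/P/O to answer GET requests.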

TADOQuery filter and an expression always true

I am trying to filter some records from a TADOQuery. I set the Filtered property to True and, when I set the Filter to field='value', all works fine. I would like to dynamically build this filter by appending
<space>AND field='value'
to a value always true, and I thought 1=1 would do the trick. So I would have 1=1 as the default filter and then just append AND field='value' to it as necessary.
This, however, does not work. The error message reads:
Arguments are of the wrong type, are out of acceptable range, or are in conflict with one another.
Could anyone please tell me what could I use as a versatile always-true expression for this filter?
I suppose it goes without saying, but it depends on the OLE DB provider whether or not this works. When you set a filter on an existing record set, it ends up going through a different OLE DB interface (IViewFilter if I remember correctly). So even if a filter works in a WHERE clause on an SQL statement, it does not necessarily mean that it will work as a filter. The filter that you set ends up getting parsed apart into the component pieces and then passed to the OLE DB interface. It may be that the provider's implementation is not expecting a filter of the form "constant = constant". As a workaround, you might try setting it all in the WHERE clause of the SQL statement.
You have to set the Filtered property to False when you are not filtering anything, and set it to True, together with your condition, when you want the resultset to be filtered.
I would dynamically build the correct SQL property though, so that you always know exactly what is being sent to the database (and you are sure that only the records you want are received by your program).
The 1=1 trick works fine in the where clause of a query, but not in the filtered property. If you want to disable the filter, set filtered to false and all records will be returned.
The problem with filtering is that it is done client-side. If you are using a database engine such as SQL Server and expect a large set of records to filter, you are better served by changing the SQL query, which allows the database server to return only the records requested. Just remember to close your TADOQuery first, change the SQL, then re-open.
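A minimal Delphi sketch of that approach (component, table and field names are placeholders):

// Rebuild the WHERE clause server-side instead of using Filter;
// close, change the SQL, then re-open. Note that 1=1 is fine here,
// in the WHERE clause, as opposed to the Filter property.
ADOQuery1.Close;
ADOQuery1.SQL.Text := 'SELECT * FROM MyTable WHERE 1=1';
if FilterByField then
  ADOQuery1.SQL.Text := ADOQuery1.SQL.Text + ' AND field = ' + QuotedStr('value');
ADOQuery1.Open;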
A trick I use to avoid returning the entire dataset (for large datasets) is to decide on a maximum number of records I want to display, n, then use the TOP SQL syntax to return n+1 records. If I get n+1 records back, I notify the user that more than n records matched and that they should adjust the search/filter criteria.
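As an illustration of that trick with a display limit of 100 rows (SQL Server syntax; table and column names are placeholders):

-- Ask for one more row than we intend to display; if 101 rows come
-- back, the criteria match more rows than will be shown.
SELECT TOP 101 *
FROM Orders
WHERE CustomerName LIKE 'Smith%'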
