I'm attempting to join two tables in Data Studio. My data sources are Google Ads and Microsoft Ads. I'd like to end up with a table that looks like the following example:
Campaign     Clicks
----------   ------
Campaign 1   500
Campaign 2   700
The clicks from each table are added together to give a total.
When I attempt to join both tables, I get a result that looks like this:
Campaign     Clicks (Table 1)   Clicks (Table 2)
----------   ----------------   ----------------
Campaign 1   100                400
Campaign 2   200                500
The data appears to be joined by 'Campaign', but the clicks are not consolidated into one column; instead, the clicks from each table remain in separate columns.
I've already attempted to solve this issue by:
- Creating a calculated field in the newly blended data (Clicks (Table 1) + Clicks (Table 2)), but this yields strange results when trying to aggregate other metrics.
- Joining on 'Clicks'; however, this doesn't work, as the number of clicks for each campaign is almost always different in each data source.
- Changing the join type from 'Left outer' to right outer, inner, full outer, and cross, but none of these appear to work either.
- Grouping campaigns by a 'Campaign Group' calculated field using a CASE statement, but this doesn't work either; it generally results in only one data source's values showing at a time (possibly whichever loads quickest).
Here's how my blend is set up.
What is the best way to join both tables and have the metrics (like clicks) properly aggregated?
The values in the two separate fields, Clicks (Table 1) and Clicks (Table 2), can be consolidated using the calculated field:
Clicks (Table 1) + Clicks (Table 2)
This will work as long as there are no NULL values in either (or both) tables in the blend for any given row of data.
This is because 1 + NULL = NULL (using 1 as an example number): NULL is not a numeric literal (it's not a number), so it cannot take part in the calculation.
Since this blend has NULL values, one approach is to use the IFNULL function ("returns a result if the input is null, otherwise, returns the input") below, which substitutes a numeric literal (in this case, 0) for NULL values so that the sum can be calculated:
IFNULL(Clicks (Table 1), 0) + IFNULL(Clicks (Table 2), 0)
This changes the calculations as follows:
1 + NULL = NULL is replaced by 1 + 0 = 1
NULL + NULL = NULL is replaced by 0 + 0 = 0
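With the IFNULL-based field, the blended data from the question consolidates as intended (100 + 400 and 200 + 500):

Campaign     Clicks
----------   ------
Campaign 1   500
Campaign 2   700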
I'm using a QUERY function in Google Sheets. I have a named data range ("Contributions", a table on another sheet) that consists of many columns, but I'm only concerned with two of them. For simplicity's sake, it looks something like this:
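For illustration, assume data along these lines (hypothetical values; per the description, column B holds the level number and column C the name):

B (level)   C (name)
---------   --------
9           Fred
11          Fred
10          Ginger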
I have another table that contains the unique set of names (e.g. "Fred", "Ginger", etc., each only once), and I want to extract the level number (column B) from the above table, taking the most recent (largest) number, and insert it into this second table.
Right now, my query looks like this:
=QUERY(Contributions, "select B,C where C='"&A5&"' order by B desc limit 1",1)
The problem is that it outputs both the B and C data, e.g.:
11 Fred
But since I already have the name (in column A of this other table), I only want it to output the value from B, e.g.:
11
Is there a way to output only a subset (in this case, 1 of 2) of the columns based on a directive within the query itself (as opposed to doing post-processing of the results)?
Outputting a Subset of Columns Used in Query
In order to output only certain columns of a query result, the query only needs to select the columns to be displayed; the constraints/conditions may still utilize other columns of data.
For example (as an answer to my own question), take the table above: I needed to get the data from the row with a name matching another cell (on another sheet) and with the latest (largest) number, but I only want to output the number part.
My initial attempt was:
=QUERY(Contributions, "select B,C where C='"&A5&"' order by B desc limit 1",1)
But that outputs both B and C, where I only wanted B. The answer (thanks to @Calculuswhiz) was to continue using C for the condition but only select B:
=QUERY(Contributions, "select B where C='"&A5&"' order by B desc limit 1",1)
I have a Kafka topic "events" which records user image votes and has json in the following structure:
{"category":"image","action":"vote","label":"amsterdam","ip":"1.1.1.1","value":2}
I need to receive on another topic the sum of all votes for each label (e.g. amsterdam), dropping any earlier votes that came from the same IP address and counting only the last one. This topic should have JSON in this format:
{"label":"amsterdam","SCORE":8,"TOTAL":3}
SCORE is a sum of all votes and TOTAL is the number of votes counted.
The solution I made creates a stream from the topic events:
CREATE STREAM st_events
(CATEGORY STRING, ACTION STRING, LABEL STRING, VALUE BIGINT, IP STRING)
WITH (KAFKA_TOPIC='events', VALUE_FORMAT='JSON');
Then, I create a table tb_votes which calculates the score and total for each label and IP address:
CREATE TABLE tb_votes WITH (KAFKA_TOPIC='tb_votes', PARTITIONS=1, REPLICAS=1) AS SELECT
st_events.LABEL "label", SUM(st_events.VALUE-1) "score", CAST(COUNT(*) AS BIGINT) "total"
FROM st_events
WHERE
st_events.category='image' AND st_events.action='vote'
GROUP BY st_events.label, st_events.ip
EMIT CHANGES;
The problem is that instead of dropping all the previous votes coming from the same IP address for the same image, ksqlDB uses all of them. This makes sense, as it is a GROUP BY.
Any idea how to "drop" all previous votes and only use the latest value for an image/IP?
You need a two stage aggregation.
The first stage should build a table with a primary key containing both the ip and label and another column holding the value.
Build a second table from this first table to get the count and sum per-label that you need.
If another vote comes in from the same ip for the same label then the first table will be updated with the new value and the second table will be correctly updated. It will first remove the old value from the count and sum and then apply the new value.
ksqlDB does not yet support multiple primary key columns (though it's coming VERY soon!). So when you group by two columns it just does a funky string concatenation. But we can work with that for now.
CREATE TABLE BY_IP_AND_LABEL AS
SELECT
    ip + '#' + label AS ipAndLabel,
    -- LATEST_BY_OFFSET keeps only the most recent vote per ip/label key;
    -- an aggregate is required here because the query groups by the composite key
    LATEST_BY_OFFSET(value) AS value
FROM st_events
GROUP BY ip + '#' + label;
CREATE TABLE BY_LABEL AS
SELECT
    SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1) AS label,
    SUM(value - 1) AS score,
    COUNT(*) AS total
FROM BY_IP_AND_LABEL
GROUP BY SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1);
The first table creates a composite key with '#' as the separator. The second table uses INSTR and SUBSTRING to find the separator and extract the label (the +1 skips past the separator itself).
Note: I've not tested this - I could have some 'off-by-one' errors in the logic.
This should do what you need.
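To sanity-check the result, you could tail the second table with a push query (assuming a ksqlDB CLI session):

-- emits a new row every time a label's score or total changes
SELECT * FROM BY_LABEL EMIT CHANGES;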
I have an Esper query that returns multiple rows, but I'd like to instead get one row, where that row has a list (or concatenated string) of all of the values from the (corresponding columns of the) matching rows that my current query returns.
For example:
SELECT Name, avg(latency) as avgLatency
FROM MyStream.win:time(5 min)
GROUP BY Name
HAVING avgLatency / 1000 > 60
OUTPUT last every 5 min
Returns:
Name avgLatency
---- ----------
A 65
B 70
C 75
What I'd really like:
Name
----
{A, B, C}
Is this possible to do via the query itself? I tried to make this work using subqueries, but I'm not working with multiple streams. I can't find any aggregation functions or enumeration functions in the Esper documentation that fits what I'm trying to do either.
Thanks to anybody that has any insight or direction for me here.
EDIT:
If this can't be done via the query, I'm open to changing the subscriber, or anything else, if necessary.
You can have a subscriber or listener do the concatenation; there is a "Multi-Row Delivery" option for subscribers. Or use a table, as below.
// create table to hold aggregation result
create table LatencyTable(name string primary key, avgLatency avg(double));
// update aggregations in table from events coming in
into LatencyTable select name, avg(latency) as avgLatency from MyStream#time(5 min) group by name;
// do a select with the "aggregate" enumeration method
select (select * from LatencyTable where avgLatency > x).aggregate(....) from pattern[every timer:interval(5 min)]
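For instance, the aggregate enumeration method could build the concatenated name list like this (a sketch only, assuming the question's 60-second threshold; the case expression avoids a leading separator, and I haven't tested it):

// one row every 5 minutes with a comma-separated list of offending names
select (select * from LatencyTable where avgLatency / 1000 > 60)
    .aggregate('', (result, row) => case when result = '' then row.name else result || ', ' || row.name end) as names
from pattern[every timer:interval(5 min)]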
I have some denormalized data, along the lines of the following:
FruitData:
LOAD * INLINE [
ID,ColumnA, ColumnB, ColumnC
1,'Apple','Pear','Banana'
2,'Banana','Mango','Strawberry'
3,'Pear','Strawberry','Kiwi'
];
MasterFruits:
LOAD * INLINE [
Fruitname
'Apple'
'Banana'
'Pear'
'Mango'
'Kiwi'
'Strawberry'
'Papaya'
];
And what I need to do is compare these fields to a master list of fruit (held in another table). This would mean that if I chose Banana, IDs 1 and 2 would come up and if I chose Strawberry, IDs 2 and 3 would come up.
Is there any way I can create a listbox that searches across all 3 fields at once?
A list box is just a mechanism to allow you to "select" a value in a certain field as a filter. The real magic behind what QlikView is doing comes from the associations made in the data model. Since your tables have no common field, you couldn't, for example, load a list box for Fruitname, click something, and have it alter list boxes for other fields such as ColumnA, B, or C. To get the behavior you want, you need to associate the two tables. This can be accomplished by concatenating the various columns into one column (essentially normalizing the data):
[LinkTable]:
LOAD Distinct ColumnA as Fruitname,
ID
Resident FruitData;
Concatenate([LinkTable])
LOAD Distinct ColumnB as Fruitname,
ID
Resident FruitData;
Concatenate([LinkTable])
LOAD Distinct ColumnC as Fruitname,
ID
Resident FruitData;
This produces a single LinkTable of fruit names and IDs; in the data model, it links MasterFruits to FruitData, so selecting a fruit name (e.g. Banana) highlights the matching IDs (1 and 2), giving the desired behavior.
In my ETL process I am using Change Data Capture (CDC) to discover only the rows that have changed in the source tables since the last extraction. Then I do the transformation only for these rows. The problem arises when I have, for example, 2 tables which I want to join into one dimension, and only one of them has changed. For example, I have tables Countries and Towns as follows:
Countries:
ID Name
1 France
Towns:
ID Name Country_ID
1 Lyon 1
Now let's say a new row is added to the Towns table:
ID Name Country_ID
1 Lyon 1
2 Paris 2
The Countries table has not been changed, so CDC for these tables shows me only the row from the Towns table. The problem is that when I do the join between Countries and Towns, there is no row in the Countries change set, so the join will result in an empty set.
Do you have an idea how to solve this? Of course there might be more difficult cases, involving 3 or more tables and consequential joins.
This is a typical problem found when doing Realtime Change-Data-Capture, or even Incremental-only daily changes.
There are multiple ways to solve this.
One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).
Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).
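A minimal sketch of that trigger idea, assuming SQL Server-style syntax and hypothetical staging-table names (stg_towns, stg_countries):

-- when a changed town lands in staging, stage its country too,
-- so the dimension join no longer finds an empty Countries change set
CREATE TRIGGER trg_towns_stage_country ON stg_towns
AFTER INSERT AS
BEGIN
    INSERT INTO stg_countries (ID, Name)
    SELECT c.ID, c.Name
    FROM Countries c
    JOIN inserted i ON i.Country_ID = c.ID
    WHERE NOT EXISTS (SELECT 1 FROM stg_countries s WHERE s.ID = c.ID);
END;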
There is a lot I could babble on about, but I will stick to what is in your question. I would suggest the following to get the results:
1st pass: everything that matches via the join.
UNION ALL
2nd pass: all towns where there isn't a country (a left outer join with a WHERE condition that requires the ID in the Countries table to be null/missing).
You would default the Country ID value in that unmatched join to something designated as an "unmatched" value; typically 0 or -1 is used, or a series of standard negative numbers that you can assign descriptions to later to identify why the data is bad. For your example, -1 could be "Found Town Without Country".
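As a plain-SQL sketch of the two passes (table and column names taken from the question; the -1 default follows the convention above):

-- 1st pass: towns that match a country
SELECT t.ID, t.Name, c.ID AS Country_ID, c.Name AS Country_Name
FROM Towns t
JOIN Countries c ON c.ID = t.Country_ID
UNION ALL
-- 2nd pass: towns without a matching country, defaulted to the 'unmatched' key
SELECT t.ID, t.Name, -1 AS Country_ID, 'Found Town Without Country' AS Country_Name
FROM Towns t
LEFT OUTER JOIN Countries c ON c.ID = t.Country_ID
WHERE c.ID IS NULL;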