Represent on data studio entries duplicated by joins

I am working on a project building the ETL process and dashboard to control some KPI metrics. I have created a table in BigQuery where, once a month, I save some key values calculated by aggregating data extracted from other tables. I am measuring emails sent by employees, so in order to calculate one of those key values I need to read from two different tables and perform a left join to match, for each of the company's working areas present in the aggregation (left table), how many employees that area has (right table).
This is a simplification of my tables:
Sent emails, grouped by area
| Area Id | Service | Bad employees | ...
| 1 | Gmail | 3416 | ...
| 2 | Gmail | 10782 | ...
| 2 | Groups | 9267 | ...
Total number of employees, grouped by area
| Area Id | Total employees | ...
| 1 | 34124 | ...
| 2 | 82561 | ...
| 3 | 49472 | ...
The problem comes here: as you can see, the first table (sent emails) has a field which does not appear on the second one; I am talking about Service. For this reason, when I join both tables I will get duplicated values for the Total employees field:
Joined table
| Area Id | Service | Bad employees | Total employees |
| 1 | Gmail | 3416 | 34124 |
| 2 | Gmail | 10782 | 82561 |
| 2 | Groups | 9267 | 82561 |
This final table will be used to create a report in Data Studio. I want to keep the Service field in my final table because I want to give users the option of filtering by it. I cannot edit the employees table schema and add a Service field to its entries, because that information is specific to the emails table: it represents the service from which the email was sent and has nothing to do with the employees table.
I am struggling to find a valid data modeling option for this problem. If I go with this solution and want to show in Data Studio, say, the total number of employees for the selected areas, I will get the wrong value for areas containing multiple services:
Total employees area 1: 34.124
Total employees area 2: 82.561 + 82.561 = 165.122
Total employees: 34.124 + 165.122 = 199.246
Expected value: 34.124 + 82.561 = 116.685
This will affect any metric using the total employees value.
How can I keep the Service field of my joined table and still represent on Data Studio the correct value for Total employees?

It is possible to solve this by splitting the Total employees value equally among the services in each area.
Include the "Sent emails" data source a second time in the blended data, using Area Id as the only join key, and add its record count as a metric named areas in emails.
In the chart, add a calculated field defined as Total employees / areas in emails.
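
The effect of that division can be sketched outside Data Studio. The following is a minimal Python illustration using the sample rows from the question; the names `joined_rows` and `split_total_employees` are made up for this sketch and are not part of Data Studio or BigQuery.

```python
from collections import Counter

# Joined rows from the question:
# (area_id, service, bad_employees, total_employees)
joined_rows = [
    (1, "Gmail", 3416, 34124),
    (2, "Gmail", 10782, 82561),
    (2, "Groups", 9267, 82561),
]


def split_total_employees(rows):
    """Divide each area's total_employees by its number of service rows."""
    services_per_area = Counter(area_id for area_id, _, _, _ in rows)
    return [
        (area_id, service, bad, total / services_per_area[area_id])
        for area_id, service, bad, total in rows
    ]


split_rows = split_total_employees(joined_rows)

# Summing the raw joined column double counts area 2.
naive_sum = sum(total for _, _, _, total in joined_rows)
# Summing the per-service shares counts each area once.
corrected_sum = sum(share for _, _, _, share in split_rows)

print(naive_sum)      # 199246
print(corrected_sum)  # 116685.0
```

One consequence worth keeping in mind: once a single service is selected, area 2 contributes only its share (41280.5) rather than its full headcount, so this approach gives correct totals only when all services of an area are included.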

I solved the problem by using nested and repeated fields. I thought Data Studio couldn't filter by values inside a repeated field, but I have checked that it is possible, so that fits my use case perfectly.
Joined table schema:
[
{
"mode": "REQUIRED",
"name": "id",
"type": "INTEGER"
},
{
"mode": "REPEATED",
"name": "service",
"type": "RECORD",
"fields": [
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "bad_employees",
"type": "INTEGER"
}
]
},
{
"mode": "NULLABLE",
"name": "total_employees",
"type": "INTEGER",
"description": "Sum of the emails sent during off hours for all the sources"
}
]
Joined table representation:
| id | service.name | service.bad_employees | total_employees |
| 1 | Gmail | 3416 | 34124 |
| 2 | Gmail | 10782 | 82561 |
| | Groups | 9267 | |
This way, I can get the correct sum of bad_employees by performing SUM(service.bad_employees), and the correct value for total_employees with SUM(total_employees).
Also, if I want to filter only by a certain service, I can add a control on the field service.name and it will filter properly.
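
As a rough model of why the nested layout avoids double counting, the same data can be written as plain Python structures. This is only a sketch of the intent: the helper names are hypothetical, and it does not reproduce how Data Studio itself flattens or filters repeated fields.

```python
# One row per area; services are nested, so total_employees is stored once.
areas = [
    {
        "id": 1,
        "total_employees": 34124,
        "service": [{"name": "Gmail", "bad_employees": 3416}],
    },
    {
        "id": 2,
        "total_employees": 82561,
        "service": [
            {"name": "Gmail", "bad_employees": 10782},
            {"name": "Groups", "bad_employees": 9267},
        ],
    },
]


def sum_total_employees(rows, service_name=None):
    """Sum total_employees once per area, optionally keeping only
    areas that contain the given service."""
    return sum(
        row["total_employees"]
        for row in rows
        if service_name is None
        or any(s["name"] == service_name for s in row["service"])
    )


def sum_bad_employees(rows, service_name=None):
    """Sum bad_employees over the nested service records."""
    return sum(
        s["bad_employees"]
        for row in rows
        for s in row["service"]
        if service_name is None or s["name"] == service_name
    )


print(sum_total_employees(areas))           # 116685
print(sum_bad_employees(areas))             # 23465
print(sum_total_employees(areas, "Groups"))  # 82561
```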

Related

Using Crosstab to Generate Data for Charts

I'm trying to make an efficient query to create a view that will contain counts of successful logins by day and by type of user, with no duplicate users per day.
I have 3 tables involved in this query. One table that contains all successful login attempts, one table for standard user accounts, and one table for admin user accounts. All user_id values are unique across the entire database so there are no user accounts that will share the same user_id with an admin account:
TABLE 1: user_account
user_id | username
---------|----------
1 | user1
2 | user2
TABLE 2: admin_account
user_id | username
---------|----------
6 | admin6
7 | admin7
TABLE 3: successful_logins
user_id | timestamp
---------|------------------------------
1 | 2022-01-23 14:39:12.63798-07
1 | 2022-01-28 11:16:45.63798-07
1 | 2022-01-28 01:53:51.63798-07
2 | 2022-01-28 15:19:21.63798-07
6 | 2022-01-28 09:42:36.63798-07
2 | 2022-01-23 03:46:21.63798-07
7 | 2022-01-28 19:52:16.63798-07
2 | 2022-01-29 23:12:41.63798-07
2 | 2022-01-29 18:50:10.63798-07
The resulting view I would like to generate would contain the following information from the above 3 tables:
VIEW: login_counts
date_of_login | successful_user_logins | successful_admin_logins
---------------|------------------------|-------------------------
2022-01-23 | 2 | 0
2022-01-28 | 2 | 2
2022-01-29 | 1 | 0
I'm currently reading up on how crosstabs work but having trouble figuring out how to write the query based on my table setups.
I actually was able to get the values I needed by using the following query:
SELECT
to_char(s.timestamp, 'YYYY-MM-DD') AS login_date,
count(distinct u.user_id) AS successful_user_logins,
count(distinct a.user_id) AS successful_admin_logins
FROM successful_logins s
LEFT JOIN user_account u ON u.user_id= s.user_id
LEFT JOIN admin_account a ON a.user_id= s.user_id
GROUP BY login_date
However, I was told it would be even quicker using crosstabs, especially considering the successful_logins table contains millions of records. So I'm also trying to create a version of the query using crosstabs and then compare both execution times.
Any help would be greatly appreciated. Thanks!

Turns out it isn't possible to do what I was asking about using crosstabs, so the original query I have will have to do.
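
The distinct-count query can be exercised against the sample rows with a small, self-contained script. This sketch uses Python's built-in sqlite3 module rather than PostgreSQL, so `to_char` is replaced by `substr` on the timestamp text and the session time zone is ignored; both substitutions are assumptions made only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_account  (user_id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE admin_account (user_id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE successful_logins (user_id INTEGER, timestamp TEXT);

    INSERT INTO user_account  VALUES (1, 'user1'), (2, 'user2');
    INSERT INTO admin_account VALUES (6, 'admin6'), (7, 'admin7');
    INSERT INTO successful_logins VALUES
        (1, '2022-01-23 14:39:12.63798-07'),
        (1, '2022-01-28 11:16:45.63798-07'),
        (1, '2022-01-28 01:53:51.63798-07'),
        (2, '2022-01-28 15:19:21.63798-07'),
        (6, '2022-01-28 09:42:36.63798-07'),
        (2, '2022-01-23 03:46:21.63798-07'),
        (7, '2022-01-28 19:52:16.63798-07'),
        (2, '2022-01-29 23:12:41.63798-07'),
        (2, '2022-01-29 18:50:10.63798-07');
""")

# Same shape as the original query: one pass over successful_logins,
# with COUNT(DISTINCT ...) removing repeat logins per user and day.
login_counts = conn.execute("""
    SELECT substr(s.timestamp, 1, 10) AS login_date,
           count(DISTINCT u.user_id)  AS successful_user_logins,
           count(DISTINCT a.user_id)  AS successful_admin_logins
    FROM successful_logins s
    LEFT JOIN user_account  u ON u.user_id = s.user_id
    LEFT JOIN admin_account a ON a.user_id = s.user_id
    GROUP BY login_date
    ORDER BY login_date
""").fetchall()

for row in login_counts:
    print(row)
```

Because the date here is simply the first ten characters of the stored text, the grouping may differ from what PostgreSQL returns when the session time zone shifts a login across midnight.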

When working with QuestDB, are symbol columns good for performance for huge amounts of rows each?

When working with regular SQL databases, indexes are useful for fetching a few rows, but not so useful when you are fetching a large amount of data from a table. For example, imagine you have a table with stock valuations of 10 stocks over time:
|------+--------+-------+
| time | stock | value |
|------+--------+-------+
| ... | stock1 | ... |
| ... | stock2 | ... |
| ... | ... | ... |
|------+--------+-------+
As far as I can tell, indexing it by stock (even with an enum/int/foreign key) is usually not very useful in a database like Postgres if you want to get data over a large period of time. You end up with an index spanning a large part of the table, and it ends up being faster for the database to do a sequential scan, for example, to get the average value over the whole dataset for each stock:
SELECT stock, avg(value) FROM stock_values GROUP BY stock
Given that QuestDB is column oriented, I would guess that having a separate column for each stock would result in better performance.
So, what schema is recommended in QuestDB for a situation like this? One column for each stock, or would a single symbol column holding the stock symbol be as good (or good enough) even if there are millions of rows for each symbol?

A column per stock is not easy to achieve in QuestDB. If you create a table like this
|----------------------------------|
| time | stock1 | stock2 | stock3 |
|----------------------------------|
then you'll have to insert all values together in one row, or you end up with gaps:
|----------------------------------|
| time | stock1 | stock2 | stock3 |
|----------------------------------|
| t1 | 1.1 | | |
| t2 | | 3.45 | |
| t3 | | | 103.45 |
|----------------------------------|
Even if t1 == t2 == t3, doing the insert as 3 separate operations will still result in 3 rows.
So symbols are a better choice here.
A symbol column can be indexed or not indexed, and a non-indexed symbol can be the better option when the number of distinct values is low. Whether reading the full table or reading by index is faster is a matter of index selectivity, not of data range. If selectivity is high (e.g. the distinct symbol count is around 10k), fetching by index is faster than range scans.

Best practices for fact table that depends on two processes

I am building a star schema for an online business. One of the key processes is email newsletter signup.
But the analysis depends on two processes and I can't figure out how to model it the best way.
Here's how the process works:
Person visits website
Person fills out web form and is recorded as a contact in our CRM
Person receives a link asking him to confirm if this is really his email
Person clicks the link and is considered confirmed
Person can now receive emails from us
The signup and confirmation processes take place at different times. Most people click the confirmation link on the same day, but we send two follow-up emails over the few days after the signup, so some people may confirm their email only after a few days.
On top of that, a person could sign up several times on the website. Most of our signups are people who exchange their email address in return for some sort of resource like an eBook.
As long as the person's email is not marked confirmed in our system, we ask the person to confirm on each signup.
Since we have multiple offers it's not uncommon for a person to request eBook A, eBook B and eBook C and only confirm after several signups.
In the fact table, each signup for an email that is not yet confirmed is marked as ConfirmationRequested -> True.
If the person clicks a confirmation link of ANY of the confirmation request emails he should be considered confirmed for each of those signups.
How I want to analyse the data
See how many signups we had
See how many signups were re-signups and how many were new contacts in the CRM (new email address)
See how many new contacts have confirmed their email address (and become full subscribers)
See how many re-signups were asked to confirm their email and how many have done so
Analyse how long it takes for people to confirm their email address
Analyse the confirmation rate
Filter contacts by their confirmation status and analyse what people who have or have not confirmed have in common
I don't really care about confirmations in isolation from signups.
And for my purposes I would like to have a ConfirmationStatus dimension that is...
"Confirmed" if the person confirms within 7 days of sign up
"Pending" if the person hasn't confirmed, but 7 days haven't passed since signup yet
"Not Confirmed" if the person hasn't confirmed within 7 days (even if the person does confirm at some later point)
On top of that I usually look at this report on Mondays to analyse the previous week and compare it to other weeks. (I already have a working version of this report in a flat table, but I am trying to learn how to build proper star schemas.)
This has the additional challenge that contacts who signed up on a Sunday, for example, had less than a day to confirm. They would drag down the confirmation rate, and the latest week would look bad compared to previous weeks in which all contacts had the full 7 days to confirm.
So I calculate a "Confirmed within signup week" confirmation count and rate for all weeks to allow apples to apples comparisons.
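
The two rules described above, the 7-day ConfirmationStatus and the "confirmed within signup week" flag, can be written down as small functions. This is a sketch under stated assumptions rather than the report's actual logic: "within 7 days" is taken as 7 x 24 hours from the signup timestamp (inclusive), weeks are assumed to run Monday to Sunday, and the function names are invented for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "within 7 days" means 7 x 24 hours after the signup timestamp.
CONFIRMATION_WINDOW = timedelta(days=7)


def confirmation_status(
    signup_at: datetime, confirmed_at: Optional[datetime], now: datetime
) -> str:
    """Classify a signup as Confirmed, Pending or Not Confirmed."""
    deadline = signup_at + CONFIRMATION_WINDOW
    if confirmed_at is not None and confirmed_at <= deadline:
        return "Confirmed"
    if now < deadline:
        return "Pending"
    # Covers both "never confirmed" and "confirmed after the window".
    return "Not Confirmed"


def confirmed_within_signup_week(
    signup_at: datetime, confirmed_at: Optional[datetime]
) -> bool:
    """True if the confirmation falls in the same Monday-to-Sunday
    week as the signup (assumed week definition)."""
    if confirmed_at is None:
        return False
    week_start = (signup_at - timedelta(days=signup_at.weekday())).date()
    week_end = week_start + timedelta(days=7)  # exclusive upper bound
    return signup_at <= confirmed_at and confirmed_at.date() < week_end


# A Sunday signup confirmed on Monday morning: inside the 7-day window,
# but outside the signup week.
sunday_signup = datetime(2022, 1, 9, 18, 0)
monday_confirm = datetime(2022, 1, 10, 8, 0)
report_time = datetime(2022, 1, 17, 9, 0)

print(confirmation_status(sunday_signup, monday_confirm, report_time))
print(confirmed_within_signup_week(sunday_signup, monday_confirm))
```

Keeping both timestamps on the fact row, as in Option 2 below, is what makes it possible to derive either flag at report time instead of freezing one definition into a dimension.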
How to model this...
I have considered the following options...
Option #1: Separate fact tables
Since these are separate processes that happen at separate times I have learned that I should create separate fact tables and then drill across common dimensions.
I could calculate signups that requested confirmations from the signup table and then calculate confirmations within a week of the signup through the contact and date dimensions.
But that wouldn't allow me to filter the signups by confirmation status.
That's why I am considering...
Option 2: A fact table that combines both signups and confirmations
I am thinking of something like this:
| Dim Signup Info | | | Dim Contact | | | Fact Signups | |
|-----------------------|------|---|-------------|------|---|----------------------|----|
| SignupInfoKey | SK | | ContactKey | SK | | SignupDateKey | FK |
| SignupType | SCD1 | | Name | SCD1 | | ConfirmationDate | FK |
| ConfirmationRequested | SCD1 | | Email | SCD1 | | SignupInfoKey | FK |
| ConfirmationSucceeded | SCD1 | | ... | | | ContactKey | FK |
| ConfirmationStatus | SCD1 | | | | | SignupId | DD |
| | | | | | | SignupDateTime | DD |
| | | | | | | ConfirmationDateTime | DD |
| | | | | | | Signups | M |
| | | | | | | NewContacts | M |
| | | | | | | ConfirmationMin | M |
| | | | | | | ConfirmationDays | M |
I need the ConfirmationDate in the fact table to calculate the "Confirmed Within Week" measures at report time (I am using Power BI and it's easy there). I could of course also create a "ConfirmedWithinWeek" dimension and filter on that, but it wouldn't be as flexible. What if I decide later on to look at the data on a daily or monthly basis, for example?
Another concern is that this design requires reprocessing and updating the fact table rows for the past 7 days on each incremental load.
I know that's ok for dimensions, but is that ok for fact tables too?
So my questions are
Is option #2 a good solution or is there a better way to do this?
Is it ok to update fact tables or is that discouraged?
Overall my question is: What am I missing?
This seems like a very common thing. One obvious example would be an order star that has fact table columns for AmountOrdered, AmountPaid, AmountRefunded and dimensions like "Order Status", "Paid Status" and "Refunded Status".
But none of my searches have resulted in answers to this common problem. Surely there must be a term for the problem and a pattern name for the solution where I can learn more about it?

Database Design for state, cities and districts

I have users represented in a user table and need to design a model to associate them with state/cities/districts that they choose:
On the database side,
Each user will be associated with 1 state, 1 city and a number of districts within that state/city combination. For instance, User A can choose to be associated with "NY" and "Brooklyn" and any X number of districts in "Brooklyn" (or none).
On the view side,
I'd like to present the district choices with checkboxes so they should be able to be pulled from the database field with simple_form in Rails pretty easily.
The design of the database should make it easy to query for the user and get the associated state / city and district relations that the user has chosen.
One idea I have is to simply have a one-to-many field for districts and a district table listing all the different districts. However, is there a way to enforce that the districts have to be valid for the city/state combination on the backend using validate?
Any tips would be appreciated.

Below I have outlined the database schema I would use based on the information you have given.
Every city belongs to exactly one state.
cities
id unsigned int(P)
state_id unsigned int(F states.id)
name varchar(50)
+----+----------+---------------+
| id | state_id | name |
+----+----------+---------------+
| 1 | 33 | New York City |
| .. | ........ | ............. |
+----+----------+---------------+
See ISO 3166 for more information. You didn't ask for countries but it's trivial to add them...
countries
id char(2)(P)
iso3 char(3)(U)
iso_num char(3)(U)
name varchar(45)(U)
+----+------+---------+---------------+
| id | iso3 | iso_num | name |
+----+------+---------+---------------+
| ca | can | 124 | Canada |
| mx | mex | 484 | Mexico |
| us | usa | 840 | United States |
| .. | .... | ....... | ............. |
+----+------+---------+---------------+
Every district belongs to exactly one city.
districts
id unsigned int(P)
city_id unsigned int(F cities.id)
name varchar(50)
+----+---------+-----------+
| id | city_id | name |
+----+---------+-----------+
| 1 | 1 | The Bronx |
| 2 | 1 | Brooklyn |
| 3 | 1 | Manhattan |
| .. | ....... | ......... |
+----+---------+-----------+
See ISO 3166-2:US for more information. Every state belongs to exactly one country.
states
id unsigned int(P)
country_id char(2)(F countries.id)
code char(2)
name varchar(50)
+----+------------+------+----------+
| id | country_id | code | name |
+----+------------+------+----------+
| 1 | us | AL | Alabama |
| .. | .......... | .... | ........ |
| 33 | us | NY | New York |
| .. | .......... | .... | ........ |
+----+------------+------+----------+
Based on your information a user belongs to exactly one city. In the example data Bob is associated with New York City. By joining tables you can very easily find that Bob is in New York state and the country of United States.
users
id unsigned int(P)
username varchar(255)
city_id unsigned int(F cities.id)
...
+----+----------+---------+-----+
| id | username | city_id | ... |
+----+----------+---------+-----+
| 1 | bob | 1 | ... |
| .. | ........ | ....... | ... |
+----+----------+---------+-----+
Users can belong to any number of districts. In the example data Bob belongs to The Bronx and Brooklyn. user_id and district_id form the Primary Key, which ensures a user cannot be associated with the same district more than once.
users_districts
user_id unsigned int(F users.id) \_(P)
district_id unsigned int(F districts.id) /
+---------+-------------+
| user_id | district_id |
+---------+-------------+
| 1 | 1 |
| 1 | 2 |
| ....... | ........... |
+---------+-------------+
My database model does NOT enforce the rule that the districts a user belongs to must be in the city that user belongs to; in my opinion that logic should be handled at the application level. If Bob moves from New York City to Baltimore, I think all of his records should be deleted from the users_districts table and new ones added for his new city.
As for the user interface, I would have the user:
Select a country - this will auto-populate a drop down list of associated states.
Select a state - this will auto-populate a drop down list of associated cities.
Select a city - this will auto-populate a list of associated districts.
Allow the user to select any number of districts.
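
To illustrate the joins this schema relies on, here is a small sketch that loads the sample rows into an in-memory SQLite database and looks up Bob's city, state, country and districts. SQLite and the simplified column types are assumptions made for the example; the schema above is written in a MySQL-like notation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE states (
        id INTEGER PRIMARY KEY,
        country_id TEXT REFERENCES countries(id),
        code TEXT,
        name TEXT
    );
    CREATE TABLE cities (
        id INTEGER PRIMARY KEY,
        state_id INTEGER REFERENCES states(id),
        name TEXT
    );
    CREATE TABLE districts (
        id INTEGER PRIMARY KEY,
        city_id INTEGER REFERENCES cities(id),
        name TEXT
    );
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT,
        city_id INTEGER REFERENCES cities(id)
    );
    CREATE TABLE users_districts (
        user_id INTEGER REFERENCES users(id),
        district_id INTEGER REFERENCES districts(id),
        PRIMARY KEY (user_id, district_id)
    );

    INSERT INTO countries VALUES ('us', 'United States');
    INSERT INTO states VALUES (33, 'us', 'NY', 'New York');
    INSERT INTO cities VALUES (1, 33, 'New York City');
    INSERT INTO districts VALUES
        (1, 1, 'The Bronx'), (2, 1, 'Brooklyn'), (3, 1, 'Manhattan');
    INSERT INTO users VALUES (1, 'bob', 1);
    INSERT INTO users_districts VALUES (1, 1), (1, 2);
""")

# Walk up the hierarchy: user -> city -> state -> country.
location = conn.execute("""
    SELECT u.username, c.name, s.name, co.name
    FROM users u
    JOIN cities c     ON c.id = u.city_id
    JOIN states s     ON s.id = c.state_id
    JOIN countries co ON co.id = s.country_id
    WHERE u.username = ?
""", ("bob",)).fetchone()

# Districts chosen by the same user, via the junction table.
district_names = [
    name
    for (name,) in conn.execute("""
        SELECT d.name
        FROM users_districts ud
        JOIN districts d ON d.id = ud.district_id
        WHERE ud.user_id = ?
        ORDER BY d.id
    """, (1,))
]

print(location)
print(district_names)
```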

You will need some combination of database and application-level logic.
Here is how I would build the database fields:
users = id, <other user fields>, city_id
districts = id, <other district fields>, city_id
cities = id, name, state_id
states = id, name
And then in the application, set it up so that the user can type in one city and multiple districts, and can not edit the state (view only):
When the user types in a city (perhaps through an autocomplete field), the read-only state field is automatically updated with that city's state
When the user types in a district, list only the districts whose city_id matches the selected city's id
If you don't want to restrict the district selection in the UI, you will need to enforce that same city_id check in your application, though I personally think that's less intuitive than doing it right in the front-end UI.
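
A minimal sketch of such an application-level check follows. It is written in Python against an in-memory SQLite database purely for illustration (in a Rails app the same rule would more naturally live in a model validation); the `assign_district` helper, the Baltimore district name and the sample ids are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE districts (
        id INTEGER PRIMARY KEY, city_id INTEGER, name TEXT
    );
    CREATE TABLE users (
        id INTEGER PRIMARY KEY, username TEXT, city_id INTEGER
    );
    CREATE TABLE users_districts (
        user_id INTEGER, district_id INTEGER,
        PRIMARY KEY (user_id, district_id)
    );

    -- Illustrative sample data only.
    INSERT INTO cities VALUES (1, 'New York City'), (2, 'Baltimore');
    INSERT INTO districts VALUES (2, 1, 'Brooklyn'), (4, 2, 'Fells Point');
    INSERT INTO users VALUES (1, 'bob', 1);
""")


def assign_district(conn, user_id, district_id):
    """Link a user to a district only if the district is in the user's city."""
    match = conn.execute(
        """
        SELECT 1
        FROM users u
        JOIN districts d ON d.city_id = u.city_id
        WHERE u.id = ? AND d.id = ?
        """,
        (user_id, district_id),
    ).fetchone()
    if match is None:
        raise ValueError(
            f"district {district_id} is not in the city of user {user_id}"
        )
    conn.execute(
        "INSERT INTO users_districts (user_id, district_id) VALUES (?, ?)",
        (user_id, district_id),
    )


assign_district(conn, 1, 2)  # Brooklyn is in Bob's city, so this succeeds

try:
    assign_district(conn, 1, 4)  # a Baltimore district is rejected
except ValueError as error:
    print(error)
```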

Indian States AND UT MySQL QUERY
INSERT INTO `states`
VALUES
(1,'Andhra Pradesh'),
(2,'Telangana'),
(3,'Arunachal Pradesh'),
(4,'Assam'),
(5,'Bihar'),
(6,'Chhattisgarh'),
(7,'Chandigarh'),
(8,'Dadra and Nagar Haveli'),
(9,'Daman and Diu'),
(10,'Delhi'),
(11,'Goa'),
(12,'Gujarat'),
(13,'Haryana'),
(14,'Himachal Pradesh'),
(15,'Jammu and Kashmir'),
(16,'Jharkhand'),
(17,'Karnataka'),
(18,'Kerala'),
(19,'Madhya Pradesh'),
(20,'Maharashtra'),
(21,'Manipur'),
(22,'Meghalaya'),
(23,'Mizoram'),
(24,'Nagaland'),
(25,'Orissa'),
(26,'Punjab'),
(27,'Pondicherry'),
(28,'Rajasthan'),
(29,'Sikkim'),
(30,'Tamil Nadu'),
(31,'Tripura'),
(32,'Uttar Pradesh'),
(33,'Uttarakhand'),
(34,'West Bengal'),
(35,'Lakshadweep'),
(36,'Ladakh');

How to make selection content an attribute for a Rails model

I am having a hard time even formulating the question I want to be answered, so here's my situation:
I'm trying to make a simple stock market plotter tool using an existing database I populate elsewhere. My app already has a nice and dynamic plotter that works with any database, but it expects data in a certain way. So say my model (database) looks like this:
Stock:
|___ticker___|___open___|___close___|___date___|
| aapl | 100 | 101 | 1/1/11 |
| aapl | 101 | 102 | 1/2/11 |
| goog | 500 | 450 | 1/1/11 |
| goog | 450 | 451 | 1/2/11 |
...
My plotter routines work off of class attributes (I think that's the terminology), which correspond to columns in the database.
I can select all the data corresponding to 'aapl', and easily plot the open and close versus date since my model has said attributes.
@stock = Stock.select_by_ticker('aapl')
>> @stock.open #=> 100 ...
>> @stock.close #=> 101 ...
>> @stock.date #=> 1/1/11 ...
so the attributes would be
{open, close, date}
But if I want to compare say the closing price for different stocks, I need attributes pertaining to each stock. So basically I want to end up with a model with ticker names as attributes, each corresponding to that ticker's hunk in the database. Using easy to build scopes, I want something like:
@stock = Stock.select_close_by_ticker('aapl','goog')
attributes are:
{aapl, goog, date}
where aapl and goog contain the closing price data for just that ticker. I can run multiple database queries if I need to, for now I just want to be able to sort my data into this form. Also, it must be completely dynamic, so I can't hardcode 'aapl', 'goog' and all the millions of other tickers into my model.

Would something like:
stocks = ['aapl', 'goog']
# Passing an array to where generates a SQL IN (...) clause
Stock.where(ticker: stocks)
work for your scenario?
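
Fetching the rows is only half of what the question asks for; they still need to be reshaped so that each ticker becomes its own attribute. The reshaping step is sketched below in Python as a language-neutral illustration (in Rails the input rows would come from the query above). The function name `pivot_close_by_ticker` and the dictionary layout are assumptions for this example.

```python
from collections import defaultdict

# Rows as they come out of the Stock table in the question.
rows = [
    {"ticker": "aapl", "open": 100, "close": 101, "date": "1/1/11"},
    {"ticker": "aapl", "open": 101, "close": 102, "date": "1/2/11"},
    {"ticker": "goog", "open": 500, "close": 450, "date": "1/1/11"},
    {"ticker": "goog", "open": 450, "close": 451, "date": "1/2/11"},
]


def pivot_close_by_ticker(rows, tickers):
    """Turn one-row-per-ticker-and-date data into one record per date,
    with a key per requested ticker holding that ticker's closing price."""
    closes_by_date = defaultdict(dict)
    for row in rows:
        if row["ticker"] in tickers:
            closes_by_date[row["date"]][row["ticker"]] = row["close"]
    # Missing ticker/date combinations become None instead of raising.
    return [
        {"date": day, **{ticker: closes.get(ticker) for ticker in tickers}}
        for day, closes in closes_by_date.items()
    ]


pivoted = pivot_close_by_ticker(rows, ["aapl", "goog"])
for record in pivoted:
    print(record)
```

Because the ticker names are passed in at call time, nothing about 'aapl' or 'goog' is hardcoded, which matches the requirement that the approach stay fully dynamic.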
