Removing duplicates in InfluxDB - influxdb

I would like to perform a query to remove duplicates. What I define as a duplicate here is a measurement where we have more than 1 data point. They will have different tags, so they are not overwritten by default but I would like to remove the oldest inserted, regardless of the tags.
So for example, measurement of logins (it doesn't really make sense but it's to avoid using abstract entities):
> Email | Name | TS | Login Time
>
> a#a.com | Alice | xxxxx1000 | 2017-05-19
> a#a.com | Alice | xxxxx1000 | 2017-05-18
> a#a.com | Alice | xxxxx1000 | 2017-05-17
> b#b.com | Bob | xxxxx1000 | 2017-05-18
> c#c.com | Charlie | xxxxx1200 | 2017-05-19
I would like to remove the second and third line, because the data point has the same timestamp as the first, it is the same measurement but they have different login times and I would like to take only the last.
I know well that I could solve this with a query, but the requirement is more complex than this (visualization in Grafana of weird KPI data) and I need to remove actual duplicates (generated and loaded twice).
Thank you.

You can fetch all login user names using group by and then order by time , so that the latest login time will come up first ,then you can delete the remaining ones.
Also, you might need to copy your latest items to some another measurement , since you can't remove row in influxdb .
For this you might use limit 1 offset 0 so that only the latest login time will come from the query output.
Let me know, if I understand it correctly.

Related

Using Crosstab to Generate Data for Charts

I'm trying to make an efficient query to create a view that will contains counts for the number of successful logins by day as well as by type of user with no duplicate users per day.
I have 3 tables involved in this query. One table that contains all successful login attempts, one table for standard user accounts, and one table for admin user accounts. All user_id values are unique across the entire database so there are no user accounts that will share the same user_id with an admin account:
TABLE 1: user_account
user_id | username
---------|----------
1 | user1
2 | user2
TABLE 2: admin_account
user_id | username
---------|----------
6 | admin6
7 | admin7
TABLE 3: successful_logins
user_id | timestamp
---------|------------------------------
1 | 2022-01-23 14:39:12.63798-07
1 | 2022-01-28 11:16:45.63798-07
1 | 2022-01-28 01:53:51.63798-07
2 | 2022-01-28 15:19:21.63798-07
6 | 2022-01-28 09:42:36.63798-07
2 | 2022-01-23 03:46:21.63798-07
7 | 2022-01-28 19:52:16.63798-07
2 | 2022-01-29 23:12:41.63798-07
2 | 2022-01-29 18:50:10.63798-07
The resulting view I would like to generate would contain the following information from the above 3 tables:
VEIW: login_counts
date_of_login | successful_user_logins | successful_admin_logins
---------------|------------------------|-------------------------
2022-01-23 | 1 | 1
2022-01-28 | 2 | 2
2022-01-29 | 1 | 0
I'm currently reading up on how crosstabs work but having trouble figuring out how to write the query based on my table setups.
I actually was able to get the values I needed by using the following query:
SELECT
to_char(s.timestamp, 'YYYY-MM-DD') AS login_date,
count(distinct u.user_id) AS successful_user_logins,
count(distinct a.user_id) AS successful_admin_logins
FROM successful_logins s
LEFT JOIN user_account u ON u.user_id= s.user_id
LEFT JOIN admin_account a ON a.user_id= s.user_id
GROUP BY login_date
However, I was told it would be even quicker using crosstabs, especially considering the successful_logins table contains millions of records. So I'm trying to also create a version of the query using crosstabs then comparing both execution times.
Any help would be greatly appreciated. Thanks!
Turns out it isn't possible to do what I was asking about using crosstabs, so the original query I have will have to do.

Best practices for fact table that depends on two processes

I am building a star schema for an online business. One of the key processes is email newsletter signup.
But the analysis depends on two processes and I can't figure out how to model it the best way.
Here's how the process works:
Person visits website
Person fills out web form and is recorded as a contact in our CRM
Person receives a link asking him to confirm if this is really his email
Person clicks the link and is considered confirmed
Person can now receive emails from us
The signup and confirmation process take place at different times. Most people click the confirmation link on the same day, but we send two follow up email over a few days after the signup so some people may confirm their email only after a few days.
On top of that a person could signup several times on the website. Most of our signups are people who exchange their email address in return for some sort of resource like an eBook.
As long as the person's email is not marked confirmed in our system, we ask the person to confirm on each signup.
Since we have multiple offers it's not uncommon for a person to request eBook A, eBook B and eBook C and only confirm after several signups.
In the fact table each signup for emails that are unconfirmed yet is marked as ConfirmationRequested -> True.
If the person clicks a confirmation link of ANY of the confirmation request emails he should be considered confirmed for each of those signups.
How I want to analyse the data
See how many signups we had
See how many signups were re-signups and how many were new contacts in the CRM (new email address)
See how many new contacts have confirmed their email address (and become full subscribers)
See how many re-signups were asked to confirm their email and how many have done so
Analyse how long it takes for people to confirm their email address
Analyse the confirmation rate
Filter contacts by their confirmation status and analyse what people who have or have not confirmed have in common
I don't really care about confirmations in isolation from signups.
And for my purposes I would like to have a ConfirmationStatus dimension that is...
"Confirmed" if the person confirms within 7 days of sign up
"Pending" if the person hasn't confirmed, but 7 days haven't passed since signup yet
"Not Confirmed" if the person hasn't confirmed within 7 days (even if the person does confirm at some later point)
On top of that I usually look at this report on Mondays to analyse the previous week and compare it to other weeks. (I already have a working version of this report in a flat table, but I am trying to learn how to build proper star schemas.)
This has the additional challenge that contacts that signed up on Sunday for example only had less than a day to confirm and would drag down the confirmation rate and the latest week would look bad if compared to previous week where all contacts had the full 7 days to confirm.
So I calculate a "Confirmed within signup week" confirmation count and rate for all weeks to allow apples to apples comparisons.
How to model this...
I have considered the following options...
Option #1: Separate fact tables
Since these are separate processes that happen at separate times I have learned that I should create separate fact tables and then drill across common dimensions.
I could calculate signups that requested confirmations from the signup table and then calculate confirmations within a week of the signup through the contact and date dimensions.
But that wouldn't allow me to filter the signups by confirmation status.
That's why I am considering...
Option 2: A fact table that combines both signups and confirmations
I am thinking of something like this:
| Dim Signup Info | | | Dim Contact | | | Fact Signups | |
|-----------------------|------|---|-------------|------|---|----------------------|----|
| SignupInfoKey | SK | | ContactKey | SK | | SignupDateKey | FK |
| SignupType | SCD1 | | Name | SCD1 | | ConfirmationDate | FK |
| ConfirmationRequested | SCD1 | | Email | SCD1 | | SignupInfoKey | FK |
| ConfirmationSucceeded | SCD1 | | ... | | | ContactKey | FK |
| ConfirmationStatus | SCD1 | | | | | SignupId | DD |
| | | | | | | SignupDateTime | DD |
| | | | | | | ConfirmationDateTime | DD |
| | | | | | | Signups | M |
| | | | | | | NewContacts | M |
| | | | | | | ConfirmationMin | M |
| | | | | | | ConfirmationDays | M |
I need the ConfirmationDate in the fact to calculate the "Confirmed Within Week" measures at report time (I am using powerbi and it's easy there). I could of course also create a dimension "ConfirmedWIthinWeek" and then filter based on that, but it won't be as flexible... What if I decide later on to look at the data on a daily or monthly basis for example?
Another concern is that it will require to reprocess and update the fact tables on each incremental load for the past 7 days.
I know that's ok for dimensions, but is that ok for fact tables too?
So my questions are
Is option #2 a good solution or is there a better way to do this?
Is it ok to update fact tables or is that discouraged?
Overall my question is: What am I missing?
This seems like a very common thing. One obvious example would be an order star that has fact table columns for AmountOrdered, AmountPaid, AmountRefunded and dimensions like "Order Status", "Paid Status" and "Refunded Status".
But none of my searches have resulted in answers to this common problem. Surely there must be a term for the problem and a pattern name for the solution where I can learn more about it?

Unexpected behaviour with FireDAC Master-Detail relationships

I face a problem with FireDAC Master-Detail relationships.
FireDAC has two modes for M/D relationships : Parameter-Based and Range-Based http://docwiki.embarcadero.com/RADStudio/Berlin/en/Master-Detail_Relationship_(FireDAC)
The first one uses parameters on every query to retrieve the correspondent details needed after every scroll, and the second one loads first all the data in the datasets, and set the fields that define the master-detail relationships (filtering the details after every scroll on the master).
You can combine both methods, giving you the advantages of both (querys returning limited records while reduced traffic with the database, offline mode, ...).
It works nice and fast except when one of the details is empty. This seems to be the reason (quoted from the Documentation) :
Combining Methods
To combine both methods, an application should use both Parameters and
Range-based setups and include fiDetails into FetchOptions.Cache. Then
FireDAC at first uses range-based M/D. And if a dataset is empty, then
FireDAC uses parameter-based M/D. The new queried records are appended
to the internal records storage.
Also, you can use the TFDDataSet.OnMasterSetValues event handler to override M/D behavior.
Suppose you have
Master BILLS
+---------+------------+
| Bill_Id | Date |
+---------+------------+
| 1 | 01/01/2017 |
+---------+------------+
Detail LINES
+---------+---------+------------+
| Bill_Id | Line_Id | Concept |
+---------+---------+------------+
| 1 | 1 | Television |
| 1 | 2 | Computer |
+---------+---------+------------+
Subdetail TAXES
+---------+---------+-----+--------+
| Bill_Id | Line_Id | Tax | Import |
+---------+---------+-----+--------+
| 1 | 1 | 14% | 74.25 |
| 1 | 1 | 7% | 36.12 |
+---------+---------+-----+--------+
I have those 3 FDQuerys with parameters :
qryBills.SQL = 'select * from BILLS where Bill_Id = :Id';
qryLines.SQL = 'select * from LINES where Bill_Id = :Id';
qryTaxes.SQL = 'select * from TAXES where Bill_Id = :Id';
And the Master-Detail relationship is defined by range
qryLines.MasterFields = 'Bill_Id';
qryTaxes.MasterFields = 'Bill_Id;Line_Id';
If all the details contain records then everything is fine, but when a detail is empty (like in my example, where there are no Taxes for the Line #2) then when I scroll to that empty detail its query is re-launched (as the documentation says) duplicating the records for the not-empty details.
I mean :
I open the three Datasets for the Bill_Id #1
Everything looks fine, I see the master record, the Line #1 and its two taxes
I move to the second line and it still looks fine, the taxes appear empty.
When I go back to the first line, now I see two times its two taxes.
If I go to the second line again, and return to the first one, now I will see three times its two taxes.
...
The problem is that every time I move to the second line, its subdetail is empty, so it relaunches the qryTaxes query, duplicating its entire content.
Is not uncommon to have empty details, do you know of a way to prevent its query to be re-launched when it happens ?. I can't find it.
Thank you.

Join between Streaming data vs Historical Data in spark

Let say I have transaction data and visit data
visit
| userId | Visit source | Timestamp |
| A | google ads | 1 |
| A | facebook ads | 2 |
transaction
| userId | total price | timestamp |
| A | 100 | 248384 |
| B | 200 | 43298739 |
I want to join transaction data and visit data to do sales attribution. I want to do it realtime whenever transaction occurs (streaming).
Is it scalable to do join between one data and very big historical data using join function in spark?
Historical data is visit, since visit can be anytime (e.g. visit is one year before transaction occurs)
I did join of historical data and streaming data in my project. Here the problem is that you have to cache historical data in RDD and when streaming data comes, you can do join operations. But actually this is a long process.
If you are updating historical data, then you have to keep two copies and use accumulator to work with either copy at once, so it wont affect the the second copy.
For example,
transactionRDD is stream rdd which you are running at some interval.
visitRDD which is historical and you update it once a day.
So you have to maintain two databases for visitRDD. when you are updating one database, transactionRDD can work with cached copy of visitRDD and when visitRDD is updated, you switch to that copy. Actually this is very complicated.
I know this question is very old but lemme share my viewpoint.Today, this can be easily done in Apache Beam. And this job can run on same spark cluster.

Cassandra cql kind of multiget

i want to make a query for two column families at once... I'm using the cassandra-cql gem for rails and my column families are:
users
following
followers
user_count
message_count
messages
Now i want to get all messages from the people a user is following. Is there a kind of multiget with cassandra-cql or is there any other possibility by changing the datamodel to get this kind of data?
I would call your current data model a traditional entity/relational design. This would make sense to use with an SQL database. When you have a relational database you rely on joins to build your views that span multiple entities.
Cassandra does not have any ability to perform joins. So instead of modeling your data based on your entities and relations, you should model it based on how you intend to query it. For your example of 'all messages from the people a user is following' you might have a column family where the rowkey is the userid and the columns are all the messages from the people that user follows (where the column name is a timestamp+userid and the value is the message):
RowKey Columns
-------------------------------------------------------------------
| | TimeStamp0:UserA | TimeStamp1:UserB | TimeStamp2:UserA |
| UserID |------------------|------------------|------------------|
| | Message | Message | Message |
-------------------------------------------------------------------
You would probably also want a column family with all the messages a specific user has written (I'm assuming that the message is broadcast to all users instead of being addressed to one particular user):
RowKey Columns
--------------------------------------------------------
| | TimeStamp0 | TimeStamp1 | TimeStamp2 |
| UserID |------------|------------|-------------------|
| | Message | Message | Message |
--------------------------------------------------------
Now when you create a new message you will need to insert it multiple places. But when you need to list all messages from people a user is following you only need to fetch from one row (which is fast).
Obviously if you support updating or deleting messages you will need to do that everywhere that there is a copy of the message. You will also need to consider what should happen when a user follows or unfollows someone. There are multiple solutions to this problem and your solution will depend on how you want your application to behave.

Resources