Join between streaming data and historical data in Spark

Let's say I have transaction data and visit data:
visit
| userId | Visit source | Timestamp |
| A | google ads | 1 |
| A | facebook ads | 2 |
transaction
| userId | total price | timestamp |
| A | 100 | 248384 |
| B | 200 | 43298739 |
I want to join the transaction data with the visit data to do sales attribution, and I want to do it in real time, whenever a transaction occurs (streaming).
Is it scalable to join a single incoming record against a very large historical dataset using the join function in Spark?
The historical data is the visit data, since a visit can happen at any time (e.g. a visit can be one year before the transaction occurs).

I joined historical data with streaming data in my project. The problem is that you have to cache the historical data in an RDD so that, when streaming data arrives, you can do the join. But this is actually a long process.
If you are updating the historical data, then you have to keep two copies and use an accumulator to work with only one copy at a time, so that it won't affect the second copy.
For example,
transactionRDD is the stream RDD, which you process at some interval.
visitRDD is the historical RDD, which you update once a day.
So you have to maintain two databases for visitRDD. While you are updating one database, transactionRDD can work with the cached copy of visitRDD, and once visitRDD is updated, you switch to that copy. This is actually very complicated.
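The two-copy idea can be sketched in plain Python. This is not Spark code: VisitStore, refresh and attribute are hypothetical names, and the "latest visit at or before the transaction" rule is an assumed attribution rule chosen only to make the join concrete.

```python
from collections import defaultdict
import threading


class VisitStore:
    """Holds the active copy of the historical visits (hypothetical helper)."""

    def __init__(self, visits):
        self._lock = threading.Lock()
        self._active = self._index(visits)

    @staticmethod
    def _index(visits):
        # Group visits by userId so each transaction needs a single lookup.
        by_user = defaultdict(list)
        for user_id, source, ts in visits:
            by_user[user_id].append((ts, source))
        for rows in by_user.values():
            rows.sort()
        return dict(by_user)

    def refresh(self, visits):
        # Build the second copy without touching the one in use, then swap.
        fresh = self._index(visits)
        with self._lock:
            self._active = fresh

    def attribute(self, transaction):
        # Assumed rule: credit the latest visit at or before the transaction.
        user_id, price, ts = transaction
        with self._lock:
            rows = self._active.get(user_id, [])
        earlier = [source for visit_ts, source in rows if visit_ts <= ts]
        return (user_id, price, earlier[-1] if earlier else None)


# Data taken from the tables in the question.
store = VisitStore([("A", "google ads", 1), ("A", "facebook ads", 2)])
batch = [("A", 100, 248384), ("B", 200, 43298739)]
results = [store.attribute(t) for t in batch]
```

Each incoming batch only reads whichever copy is active, while the daily update builds a new copy on the side and swaps it in at the end. In a real job the same pattern would have to be expressed with the cluster's own caching mechanism.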

I know this question is very old, but let me share my viewpoint. Today, this can easily be done in Apache Beam, and the job can run on the same Spark cluster.

Swift & Firebase - Cloud firestore scalable?

I'm really new to Cloud Firestore, so structuring the database still feels a bit strange to me.
I would like to save my workouts. If I were on RealtimeDatabase I would do something like that:
WorkoutResults
|
+--AutoID
| |
| +--date
| +--userID
| +--result
AND
UserWorkoutResult
|
+--UserID
| |
| +--WodResultGeneratedID
|
That way, I only need to fetch one node for a specific user.
But if I understand Cloud Firestore correctly, it's not possible to query on subcollections.
So my question is, do you think this structure is good enough to scale?
WorkoutResults
|
+--AutoID
| |
| +--date
| +--userID
| +--result
By doing something like:
.whereField("userID", isEqualTo: "userIDString").whereField("date", isEqualTo: theDateIWant) ?
Your query looks fine to me. And as Firestore promises, its performance is only related to the number of matching WorkoutResults, not to the size of that collection.
But you could get the exact same result by querying collection("Users").document("userIDString").collection("WorkoutResults").whereField("date", isEqualTo: theDateIWant) in your first data structure. The only thing that isn't possible there is to query across the WorkoutResults of multiple users, since querying across multiple collections isn't possible.

Unexpected behaviour with FireDAC Master-Detail relationships

I am facing a problem with FireDAC Master-Detail relationships.
FireDAC has two modes for M/D relationships: Parameter-Based and Range-Based http://docwiki.embarcadero.com/RADStudio/Berlin/en/Master-Detail_Relationship_(FireDAC)
The first one uses parameters on every query to retrieve the corresponding details after every scroll. The second one first loads all the data into the datasets and then uses the fields that define the master-detail relationship to filter the details after every scroll on the master.
You can combine both methods, which gives you the advantages of both (queries returning limited records, reduced traffic with the database, offline mode, ...).
It works nice and fast except when one of the details is empty. This seems to be the reason (quoted from the Documentation) :
Combining Methods
To combine both methods, an application should use both Parameters and
Range-based setups and include fiDetails into FetchOptions.Cache. Then
FireDAC at first uses range-based M/D. And if a dataset is empty, then
FireDAC uses parameter-based M/D. The new queried records are appended
to the internal records storage.
Also, you can use the TFDDataSet.OnMasterSetValues event handler to override M/D behavior.
Suppose you have
Master BILLS
+---------+------------+
| Bill_Id | Date |
+---------+------------+
| 1 | 01/01/2017 |
+---------+------------+
Detail LINES
+---------+---------+------------+
| Bill_Id | Line_Id | Concept |
+---------+---------+------------+
| 1 | 1 | Television |
| 1 | 2 | Computer |
+---------+---------+------------+
Subdetail TAXES
+---------+---------+-----+--------+
| Bill_Id | Line_Id | Tax | Import |
+---------+---------+-----+--------+
| 1 | 1 | 14% | 74.25 |
| 1 | 1 | 7% | 36.12 |
+---------+---------+-----+--------+
I have these three FDQuery components with parameters:
qryBills.SQL.Text := 'select * from BILLS where Bill_Id = :Id';
qryLines.SQL.Text := 'select * from LINES where Bill_Id = :Id';
qryTaxes.SQL.Text := 'select * from TAXES where Bill_Id = :Id';
And the Master-Detail relationship is defined by range:
qryLines.MasterFields := 'Bill_Id';
qryTaxes.MasterFields := 'Bill_Id;Line_Id';
If all the details contain records then everything is fine. But when a detail is empty (as in my example, where there are no taxes for Line #2), scrolling to that empty detail re-launches its query (as the documentation says), which duplicates the records of the non-empty details.
I mean :
I open the three Datasets for the Bill_Id #1
Everything looks fine, I see the master record, the Line #1 and its two taxes
I move to the second line and it still looks fine, the taxes appear empty.
When I go back to the first line, now I see two times its two taxes.
If I go to the second line again, and return to the first one, now I will see three times its two taxes.
...
The problem is that every time I move to the second line, its subdetail is empty, so it relaunches the qryTaxes query, duplicating its entire content.
It is not uncommon to have empty details. Do you know of a way to prevent the query from being re-launched when that happens? I can't find one.
Thank you.
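The duplication described in this question can be reproduced with a small toy model in Python. It is only a sketch of the behaviour as the question and the quoted documentation describe it (range filter first, parameterized query appended to the cache when the range is empty); it is not FireDAC's actual implementation, and CombinedDetail, run_query and scroll_to are hypothetical names.

```python
# Toy model only: mimics "range-based first, parameter-based when empty".
TAXES_TABLE = [
    {"Bill_Id": 1, "Line_Id": 1, "Tax": "14%", "Import": 74.25},
    {"Bill_Id": 1, "Line_Id": 1, "Tax": "7%", "Import": 36.12},
]


def run_query(bill_id):
    # Stands in for: select * from TAXES where Bill_Id = :Id
    return [dict(row) for row in TAXES_TABLE if row["Bill_Id"] == bill_id]


class CombinedDetail:
    def __init__(self, bill_id):
        self.bill_id = bill_id
        self.cache = run_query(bill_id)  # initial open of the detail

    def scroll_to(self, line_id):
        def in_range():
            return [r for r in self.cache if r["Line_Id"] == line_id]

        rows = in_range()
        if not rows:
            # Empty range: re-run the query and append to the cache.
            # The query is keyed on Bill_Id only, so rows for other
            # lines are appended again.
            self.cache.extend(run_query(self.bill_id))
            rows = in_range()
        return rows


taxes = CombinedDetail(bill_id=1)
first = len(taxes.scroll_to(1))   # taxes visible for Line #1 at the start
taxes.scroll_to(2)                # Line #2 has no taxes
second = len(taxes.scroll_to(1))  # back on Line #1
taxes.scroll_to(2)
third = len(taxes.scroll_to(1))
```

In this model the row count for Line #1 grows by two on every visit to Line #2, because the re-launched query is filtered by Bill_Id alone while the range uses Bill_Id;Line_Id.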

Removing duplicates in InfluxDB

I would like to perform a query to remove duplicates. What I define as a duplicate here is a measurement where we have more than 1 data point. They will have different tags, so they are not overwritten by default but I would like to remove the oldest inserted, regardless of the tags.
So for example, measurement of logins (it doesn't really make sense but it's to avoid using abstract entities):
> Email | Name | TS | Login Time
>
> a#a.com | Alice | xxxxx1000 | 2017-05-19
> a#a.com | Alice | xxxxx1000 | 2017-05-18
> a#a.com | Alice | xxxxx1000 | 2017-05-17
> b#b.com | Bob | xxxxx1000 | 2017-05-18
> c#c.com | Charlie | xxxxx1200 | 2017-05-19
I would like to remove the second and third line, because the data point has the same timestamp as the first, it is the same measurement but they have different login times and I would like to take only the last.
I know well that I could solve this with a query, but the requirement is more complex than this (visualization in Grafana of weird KPI data) and I need to remove actual duplicates (generated and loaded twice).
Thank you.
You can fetch all logins grouped by user name and ordered by time, so that the latest login time comes first; then you can delete the remaining ones.
Also, you might need to copy your latest items to another measurement, since you can't remove a row in InfluxDB.
For this you might use limit 1 offset 0, so that only the latest login time is returned by the query.
Let me know if I understood it correctly.
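The selection logic can be sketched client-side in plain Python, before the surviving points are written to a new measurement. This is only an illustration, not an InfluxDB query: keep_latest is a hypothetical helper, and treating email plus timestamp as the duplicate key is an assumption based on the example rows.

```python
# Rows mirror the example table: (email, name, ts, login_time).
points = [
    ("a#a.com", "Alice", "xxxxx1000", "2017-05-19"),
    ("a#a.com", "Alice", "xxxxx1000", "2017-05-18"),
    ("a#a.com", "Alice", "xxxxx1000", "2017-05-17"),
    ("b#b.com", "Bob", "xxxxx1000", "2017-05-18"),
    ("c#c.com", "Charlie", "xxxxx1200", "2017-05-19"),
]


def keep_latest(rows):
    """Keep one row per (email, ts), preferring the latest login time."""
    latest = {}
    for row in rows:
        key = (row[0], row[2])  # assumed duplicate key: email + timestamp
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in latest or row[3] > latest[key][3]:
            latest[key] = row
    return list(latest.values())


deduped = keep_latest(points)
```

The rows in deduped are the ones that would be copied to the new measurement; the second and third Alice rows are dropped.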

Ruby on Rails: Join Tables Concept

So I have been out of the coding game for a while and recently decided to pick up Rails. I have a question about the concept of join tables in Rails. Specifically:
1) Why are these join tables needed in the database?
2) Why can't I just JOIN two tables on the fly like we do in SQL?
A join table allows a clean linking of association between two independent tables. Join tables reduce data duplication while making it easy to find relationships in your data later on.
E.g. if you compare a table called users:
| id | name |
-----------------
| 1 | Sara |
| 2 | John |
| 3 | Anthony |
with a table called languages:
| id | title |
----------------
| 1 | English |
| 2 | French |
| 3 | German |
| 4 | Spanish |
You can see that both truly exist as separate concepts from one another. Neither is subordinate to the other the way a single user may have many orders, (where each order row might store a unique foreign_key representing the user_id of the user that made it).
When a language can have many users, and a user can have many languages -- we need a way to join them.
We can do that by creating a join table, such as user_languages, to store every link between a user and the language(s) that they speak, with each row holding one user/language pairing:
| id | user_id | language_id |
------------------------------
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 4 |
| 4 | 2 | 1 |
| 5 | 3 | 1 |
With this data we can see that Sara (user_id: 1) is trilingual, while John (user_id: 2) and Anthony (user_id: 3) only speak English.
By creating a join table in-between both tables to store the linkage, we preserve our ability to make powerful queries in relation to data on other tables. For example, with a join table separating users and languages it would now be easy to find every User that speaks English or Spanish or both.
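As a rough illustration of such a query, here is a sketch using Python's built-in sqlite3 module rather than Rails, loaded with the three example tables above. The speakers helper is a hypothetical name; the SQL is the kind of JOIN the join table makes possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE languages (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE user_languages (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        language_id INTEGER REFERENCES languages(id)
    );
    INSERT INTO users VALUES (1, 'Sara'), (2, 'John'), (3, 'Anthony');
    INSERT INTO languages VALUES
        (1, 'English'), (2, 'French'), (3, 'German'), (4, 'Spanish');
    INSERT INTO user_languages VALUES
        (1, 1, 1), (2, 1, 2), (3, 1, 4), (4, 2, 1), (5, 3, 1);
""")


def speakers(*titles):
    """Names of users who speak at least one of the given languages."""
    placeholders = ", ".join("?" for _ in titles)
    rows = conn.execute(
        f"""
        SELECT users.name
        FROM users
        JOIN user_languages ON user_languages.user_id = users.id
        JOIN languages ON languages.id = user_languages.language_id
        WHERE languages.title IN ({placeholders})
        GROUP BY users.id
        ORDER BY users.id
        """,
        titles,
    ).fetchall()
    return [name for (name,) in rows]


english_or_spanish = speakers("English", "Spanish")
spanish_only = speakers("Spanish")
```

Sara speaks both English and Spanish, but the GROUP BY on users.id keeps her from appearing twice in the result.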
But where join tables get even more powerful is when you add new tables. If in the future we wanted to link languages to a new table called schools, we could simply create a new join table called school_languages. Even better, we can add this join table without needing to make any changes to the languages SQL table itself.
As Rails models, the data relationship between these tables would look like this:
User --> user_languages <-- Language --> school_languages <-- School
By default, every school and user would be linked to Language using the same language_id(s).
This is powerful: with two join tables (user_languages & school_languages) now referencing the same unique language_id, it becomes easy to write queries about how users and schools relate. For example, we could find all schools that speak the language(s) of a user, or find all users who speak the language(s) of a school. As our tables expand, we can ride the joins to find relations about pretty much anything in our data.
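A query that rides both join tables might look like the following sketch, again using Python's sqlite3 module instead of Rails. The user_languages rows come from the example above; the schools, their names and their school_languages rows are invented purely for illustration, and schools_for_user is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_languages (user_id INTEGER, language_id INTEGER);
    CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE school_languages (school_id INTEGER, language_id INTEGER);
    -- user_languages rows from the example above
    INSERT INTO user_languages VALUES (1, 1), (1, 2), (1, 4), (2, 1), (3, 1);
    -- hypothetical schools, invented for this sketch:
    -- School X is linked to French (2), School Y to German (3)
    INSERT INTO schools VALUES (1, 'School X'), (2, 'School Y');
    INSERT INTO school_languages VALUES (1, 2), (2, 3);
""")


def schools_for_user(user_id):
    """Schools sharing at least one language_id with the given user."""
    rows = conn.execute(
        """
        SELECT schools.name
        FROM schools
        JOIN school_languages
          ON school_languages.school_id = schools.id
        JOIN user_languages
          ON user_languages.language_id = school_languages.language_id
        WHERE user_languages.user_id = ?
        GROUP BY schools.id
        ORDER BY schools.id
        """,
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows]


sara_schools = schools_for_user(1)
john_schools = schools_for_user(2)
```

The two join tables are matched directly on language_id, so the languages table itself does not even need to be touched for this question.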
tl;dr: Join tables preserve relations between separate concepts, making it easy to make powerful relational queries as you add new tables.

Cassandra cql kind of multiget

I want to make a query against two column families at once. I'm using the cassandra-cql gem for Rails and my column families are:
users
following
followers
user_count
message_count
messages
Now I want to get all messages from the people a user is following. Is there a kind of multiget with cassandra-cql, or is there any other way to get this kind of data by changing the data model?
I would call your current data model a traditional entity/relational design. This would make sense to use with an SQL database. When you have a relational database you rely on joins to build your views that span multiple entities.
Cassandra does not have any ability to perform joins. So instead of modeling your data based on your entities and relations, you should model it based on how you intend to query it. For your example of 'all messages from the people a user is following' you might have a column family where the rowkey is the userid and the columns are all the messages from the people that user follows (where the column name is a timestamp+userid and the value is the message):
RowKey Columns
-------------------------------------------------------------------
| | TimeStamp0:UserA | TimeStamp1:UserB | TimeStamp2:UserA |
| UserID |------------------|------------------|------------------|
| | Message | Message | Message |
-------------------------------------------------------------------
You would probably also want a column family with all the messages a specific user has written (I'm assuming that the message is broadcast to all users instead of being addressed to one particular user):
RowKey Columns
--------------------------------------------------------
| | TimeStamp0 | TimeStamp1 | TimeStamp2 |
| UserID |------------|------------|------------|
| | Message | Message | Message |
--------------------------------------------------------
Now when you create a new message you will need to insert it in multiple places. But when you need to list all messages from the people a user is following, you only need to fetch from one row (which is fast).
Obviously if you support updating or deleting messages you will need to do that everywhere that there is a copy of the message. You will also need to consider what should happen when a user follows or unfollows someone. There are multiple solutions to this problem and your solution will depend on how you want your application to behave.
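The write-to-many, read-from-one pattern can be sketched with in-memory Python dictionaries standing in for the two column families. This is not Cassandra client code: the followers map, post_message and read_timeline are hypothetical names, and follow/unfollow handling is left out.

```python
from collections import defaultdict

# author -> users who follow that author (hypothetical sample data)
followers = {"UserA": ["UserC"], "UserB": ["UserC", "UserD"]}

# Row key: reader id. Column name: (timestamp, author). Value: message.
timeline = defaultdict(dict)
# Row key: author id. Column name: timestamp. Value: message.
user_messages = defaultdict(dict)


def post_message(author, timestamp, text):
    # One write to the author's own row ...
    user_messages[author][timestamp] = text
    # ... plus one write per follower, so reads never need a join.
    for reader in followers.get(author, []):
        timeline[reader][(timestamp, author)] = text


def read_timeline(reader):
    # A single-row fetch; columns come back ordered by (timestamp, author).
    row = timeline[reader]
    return [(ts, author, row[(ts, author)]) for ts, author in sorted(row)]


post_message("UserA", 0, "hello")
post_message("UserB", 1, "hi")
post_message("UserA", 2, "again")

user_c_view = read_timeline("UserC")
user_d_view = read_timeline("UserD")
```

The cost moves to write time: each post is copied once per follower, which is the trade-off the answer describes.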
