Apache Cassandra - Follow/Unfollow relationship on Twissandra example - twitter

I'm trying to learn Cassandra modeling by looking at the Twissandra project.
It seems that when a user unfollows someone, only the friendship/follower relationship is removed from the tables, but the tweets of the unfollowed user remain in the timeline of the user who has just unfollowed him.
Also, with the very basic knowledge of Cassandra modeling that I currently have, it seems practically impossible to remove tweets from the timeline. Here is the model from Twissandra:
CREATE TABLE timeline (
    username text,
    time timeuuid,
    tweet_id uuid,
    PRIMARY KEY (username, time)
) WITH CLUSTERING ORDER BY (time DESC);
Since tweet_id is neither a partition key nor a clustering column, it is impossible to query by it and delete the record.
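For illustration, a statement like the following is rejected, since a CQL DELETE may only restrict primary-key columns and must name the partition key (the uuid literal is just a placeholder):
DELETE FROM timeline WHERE tweet_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;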
Further, can someone please suggest a model where it would be possible to remove the tweets of unfollowed users from the timeline?
I've been busting my head against this problem for a day, and it seems that this is not a very easy thing to do in Cassandra. I'm coming from the relational world, so maybe my point of view is wrong as well.

Since tweet_id is neither a partition key nor a clustering column, it is impossible to query by it and delete the record.
CREATE TABLE timeline (
    username text,
    tweet_id timeuuid,
    tweet_content text,
    PRIMARY KEY (username, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);
The above data model should do the trick.
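With this model, deleting a single tweet from a user's timeline only takes the partition key plus the clustering column (a sketch; the timeuuid literal is a placeholder):
DELETE FROM timeline
WHERE username = 'Helen SUE'
  AND tweet_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;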
Further, can someone please suggest a model where it would be possible to remove the tweets of unfollowed users from the timeline?
You'll have to denormalize and create another table:
CREATE TABLE followed_users_tweets (
    username text,
    followed_user text,
    tweet_id timeuuid,
    tweet_content text,
    PRIMARY KEY ((username, followed_user), tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);
When you unfollow "John DOE" and your username is "Helen SUE":
DELETE FROM followed_users_tweets WHERE username = 'Helen SUE' AND followed_user = 'John DOE';
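Note that with this layout each (username, followed_user) pair is its own partition, so the unfollow above is a single partition delete; a sketch of reading the tweets of one followed user:
SELECT tweet_id, tweet_content
FROM followed_users_tweets
WHERE username = 'Helen SUE' AND followed_user = 'John DOE';
The trade-off is that rendering the complete timeline now takes one query per followed user.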

Related

To implement follow/unfollow functionality with a one-to-many relationship on the same table

I have a User class which stores a user's details like Id, Name, etc.
Now, every user can follow or unfollow another user. So, should I create a new table Follow with columns Original_Id and Follow_Id, where Original_Id is the user's own Id and Follow_Id is a foreign key to the User table's Id primary key?
I see two possibilities here:
1. The Follow table would have a one-to-one relationship with the User table through the Original_Id column.
2. The Follow table would have a many-to-one relationship with the User table through the Follow_Id column.
So, if you have a user named Newton who has two followers, Tesla and Edison, your Users table will have something like this:
Id, Name
1, Newton
2, Tesla
3, Edison
and the Follow table will have the following values:
Original_Id, Follow_Id
1, 2
1, 3
Is this approach correct? Is what I am thinking possible? If there is a better approach, please suggest it.
Will it be possible to filter a user's related data from other tables through this Follow table?
Here is a sample relationship of the User class with another class:
Now, I want to show a user only those posts from the users whom he is following.
Any guidance is most appreciated.
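A minimal SQL sketch of this design (assuming a Posts table with a User_Id column; all names beyond the question's own are illustrative):
CREATE TABLE Follow (
    Original_Id INT NOT NULL REFERENCES Users(Id),
    Follow_Id   INT NOT NULL REFERENCES Users(Id),
    PRIMARY KEY (Original_Id, Follow_Id)
);
-- Posts from everyone that user 1 is following
SELECT p.*
FROM Posts p
JOIN Follow f ON f.Follow_Id = p.User_Id
WHERE f.Original_Id = 1;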

Constructing a 1-many relationship with custom string foreign keys in PGSQL ActiveRecord

I have the following tables (Showing only the relevant fields):
lots
  history_id
histories
  initial_date
  updated_date
  r_doc_date
  l_doc_date
  datasheet_finalized_date
users
  username
So I am rebuilding an existing application that dealt with a rather large amount of bureaucracy and needs to keep track of five separate dates (as shown in the histories table). The problem I am having is that I don't know how best to model this in ActiveRecord; historically it was done by having the histories table represented like so:
histories
  initial_date
  updated_date
  r_doc_date
  l_doc_date
  datasheet_finalized_date
  username
Where only one of the five date fields could ever be filled at one time...which in my opinion is a terrible way to go about modeling this...
So basically I want to build a unique queryable connection between every date in the histories table and its specific relevant user. Is it possible to use every timestamp in the histories table as a foreign key to query the specific user?
I think that there's a simpler approach to what you're trying to accomplish. It sounds like you want to be able to query each lot and find the 'relevant user' (I am guessing that this refers to the user who did whatever action is necessary to update the specific column on the histories table). To do this I would first create a join table between users and histories, called user_histories:
user_histories
  user_id
  history_id
I would create a row on this table any time a lot's history is updated and one of the relevant dates changes. But that now brings up the issue of being able to differentiate which specific date-type the user actually changed (since there are five). Instead of using each one as a foreign key (since they wouldn't necessarily be unique) I would recommend creating a 'history_code' on the user_histories table to represent each one of the history date-types (much like how a polymorphic_type is used). Resulting in the user_histories table looking like this:
user_histories
  user_id
  history_id
  history_code
And an example record looking like this:
UserHistory.sample = {
  user_id: 1,
  history_id: 1,
  history_code: "Initial"
}
Allowing you to query the specific user who changed a record in the histories table with the following:
history.user_histories.select { |uhist| uhist.history_code == "Initial" }
I would recommend building these longer queries out into model methods, allowing for a faster, cleaner query down the line, for example:
# app/models/history.rb
def initial_user
  self.user_histories.select { |uhist| uhist.history_code == "Initial" }
end
This should give you the results you want, but should get around the whole issue of the dates not being suitable for foreign keys, since you can't guarantee their uniqueness.
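For completeness, here is a sketch of the associations this design assumes, plus a variant of initial_user that filters in SQL instead of Ruby (hedged; names follow the tables above):
# app/models/user_history.rb
class UserHistory < ActiveRecord::Base
  belongs_to :user
  belongs_to :history
end

# app/models/history.rb
class History < ActiveRecord::Base
  has_many :user_histories
  has_many :users, through: :user_histories

  # Push the filtering into the database instead of loading
  # every user_history row into memory
  def initial_user
    users.merge(UserHistory.where(history_code: "Initial")).first
  end
end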

Best way to save/store customer purchase order data?

I have a custom membership provider which I extended - added a couple of fields: first name, last name, address, zip code and city.
Now, these fields reside in the aspnet_Membership table so that I can easily access them when using the static Membership ASP.NET class.
Now, I want to be able to save customer purchase order data (first name, last name, address, zip code and city) to the database.
Should I use a new set of fields in my order model/table - first name, last name, address, zip code, city - or should I create a relationship between my aspnet_Membership table and my Orders table?
Also, if I have dupe data, once a user asks for his account to be removed I won't have any orphan rows in my Orders table if I use the first method.
So, which is best: to have the user data (first name, last name, address, zip code, city) in only one table and create a relationship between the aspnet_Membership table and the Orders table, OR to create the duplicate fields in my Orders table with no relationship to the aspnet_Membership table? Pros and cons?
Thanks!
/P
In this scenario, I would rather have the relationship.
Also, since the data you are storing is Orders (I assume, at least, from the name :)), I would maintain a separate set of data on the Order, so one could optionally specify different billing/shipping data than their identity on the site.
Another valid reason to duplicate at least some data on the Order table is to have all the data relevant to an Order in that table, thus avoiding problems if the client requests that his data be deleted, and preserving the original values on the order if the customer data changes over time.
If you are able to, though, you should not actually delete user data, but have a field which specifies whether the user isActive or isDeleted.
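A rough T-SQL sketch of that combination (column names are illustrative; aspnet_Users is the standard membership users table):
CREATE TABLE Orders (
    OrderId   INT IDENTITY PRIMARY KEY,
    UserId    UNIQUEIDENTIFIER NULL REFERENCES aspnet_Users (UserId),
    -- Billing data captured at purchase time; it stays intact even if
    -- the user's profile changes or the account is soft-deleted later
    FirstName NVARCHAR(100),
    LastName  NVARCHAR(100),
    Address   NVARCHAR(200),
    ZipCode   NVARCHAR(20),
    City      NVARCHAR(100)
);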

How to design DynamoDB while keeping a relation between two entities

Hi, I am new to DynamoDB and, to my knowledge, it's a non-relational DB, i.e. we can't join tables. My doubt is how best to design the table structure. Please clarify with the following example.
I have the following tables:
1) users - user_id, username, password, email, phone number, role
2) roles - id, name [i.e. admin, supervisor, etc.]
a) My first doubt is: is there any provision to set auto increment for the user_id field?
b) Is setting user_id as the primary key the correct way?
c) Is this the correct method to store a user role in DynamoDB, i.e. a roles table containing id and title, with the role id stored in the user table?
e) Is it possible to retrieve data from both tables along with each user? I am using Rails 3 and the aws-sdk gem.
If anybody can reply, it will be very helpful for me as a new DynamoDB user.
Typically with NoSQL-style databases you would provide the unique identifier yourself, rather than having an auto increment PK field do that for you. This usually means that you would have a GUID as the key for each User record.
As far as the user roles, there are many ways to accomplish this and each has benefits and problems:
One simple way would be to add a "Role" attribute to the Users table and have one entry per role for that user. Then you could grab the User and you would have all the roles in one query. DynamoDB allows attributes to have multiple values, so one attribute can have one value per role.
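A sketch with the aws-sdk-dynamodb gem (v3; the question used the older aws-sdk v1, but the idea is the same; table and attribute names are illustrative):
require 'aws-sdk-dynamodb'
require 'securerandom'
require 'set'

dynamodb = Aws::DynamoDB::Client.new(region: 'us-east-1')

# Client-generated GUID as the key; roles stored as a string-set
# attribute on the user item itself
dynamodb.put_item(
  table_name: 'users',
  item: {
    'user_id'  => SecureRandom.uuid,
    'username' => 'jdoe',
    'email'    => 'jdoe@example.com',
    'roles'    => Set.new(%w[admin supervisor])
  }
)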
If you need to be able to query users in a particular role (i.e. "Give me all the Users who are Supervisors"), then you will be doing a table scan in DynamoDB, which can be an expensive operation. But if your number of users is reasonably small, and if the need to do this kind of lookup is infrequent, this still may be acceptable for your application.
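For illustration, that lookup as a scan (a sketch; it examines every item in the table, which is what makes it expensive):
resp = dynamodb.scan(
  table_name: 'users',
  filter_expression: 'contains(#r, :role)',
  expression_attribute_names: { '#r' => 'roles' },
  expression_attribute_values: { ':role' => 'supervisor' }
)
resp.items.each { |item| puts item['username'] }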
If you really need to do this expensive type of lookup often, then you will need to create a new table something like "RolesWithUsers" having one record per Role, with the userIds of the users in the role record. For most applications I'd advise against doing something like this, because now you have two tables representing one fact: what role does a particular user have. So, delete or update needs to be done in two places each time. Not impossible to do, but it takes more vigilance and testing to be sure your application doesn't get wrong data. The other disadvantage of this approach is that you need two queries to get the information, which may be more expensive than the table scan, again, depending on the quantity of records.
Another option that makes sense for this specific use case would be to use SimpleDb. It has better querying capability (all attributes are indexed by default) and the single table with roles as multi-valued attribute is going to be a much better solution than DynamoDB in this case.
Hope this helps!
We have a similar situation and we simply use two DBs, a relational one and a NoSQL one (Dynamo). For a "User" object, everything that is tied to other things, such as roles, projects, skills, etc., goes in the relational DB, and everything about the user (attributes, etc.) goes in Dynamo. If we need to add new attributes to the user, that is fine, since NoSQL doesn't care about those attributes. The rule of thumb is: if we only need something on that object's page (that is, we don't need to associate it with other objects), then we put it in Dynamo. Otherwise, it goes in the relational DB.
Using a table scan on the NoSQL DB is not really an option after you cross even a small threshold (up to that point, you can just use an in memory DB anyway).

To normalize or not to normalize user_ids

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
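For example (a sketch; table and column names assume Rails conventions for the models above):
-- Without user_id on the join table: reach the user through addresses
SELECT ae.*
FROM addresses_envelopes ae
JOIN addresses a ON a.id = ae.address_id
WHERE a.user_id = 42;

-- With a denormalized user_id on the join table
SELECT * FROM addresses_envelopes WHERE user_id = 42;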
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data-intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. A sample for the AddressesEnvelopes case:
user (
  #user_id
)
address (
  #user_id
  , #address_num
)
envelope (
  #user_id
  , #envelope_num
)
address_envelope (
  #user_id
  , #address_num
  , #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure the user_id is always the first column in your index, it will ensure that for each table, all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
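For instance (a MySQL sketch; names are illustrative):
-- InnoDB clusters rows by primary key, so leading with user_id
-- keeps each user's rows physically adjacent on disk
CREATE TABLE address_envelope (
    user_id      INT NOT NULL,
    address_num  INT NOT NULL,
    envelope_num INT NOT NULL,
    PRIMARY KEY (user_id, address_num, envelope_num)
) ENGINE=InnoDB;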
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you
a) get a measurable performance improvement, and
b) know which parts of your database are real normalized data and which are redundant improvements,
there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. If that is not enough, only then would I look into denormalization.
The denormalizations that you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.
