I have an application where users fill out surveys on a regular basis.
Surveys are sent via e-mail and need to be semi-traceable, meaning that I need to track which question categories are sent to each user on each survey. Right now, after a survey is answered, the answers are saved in a separate table and any association with a particular user is removed to guarantee anonymity.
What I would like to achieve is a setup where it is not possible to map any answer to a particular user BUT it is possible to get all answers that any one user has submitted. We want to do this for analysis purposes, to track how a user's answers change over time, while preserving complete anonymity at the database level.
Users fill out surveys on several devices, so storing a private key on their device is not an option.
The application is written in Rails with PostgreSQL, but the solution can also involve other languages if it is not possible in Ruby.
One simple solution would be to feed a fixed set of user data to a hash function to calculate an identifier for that user, and save it in the separate table along with the answers.
This way you'll be able to gather all submissions per user, but there will be no relation to the user record.
A drawback is that anyone who wants to can still trace a user by calculating the hashes for every row in the "users" table and matching them against the "answers" table, unless you hand control of either the "users" or the "answers" entity to another team, business unit, etc.
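A minimal sketch of that idea, with one swap stated plainly: it uses a keyed hash (HMAC-SHA256) instead of a plain hash, so the identifiers cannot be recomputed from the "users" table without the key. The method name `pseudonym_for`, the choice of input fields, and the `secret` value are all assumptions made for illustration; the key would have to be held by whoever is trusted not to re-link the tables.

```ruby
require "openssl"

# Derives a stable, opaque identifier from fixed user data.
# `secret` is an assumption of this sketch: a key stored outside the
# application database, so the users table alone cannot reproduce the ids.
def pseudonym_for(user_id, signed_up_on, secret:)
  data = "#{user_id}:#{signed_up_on}"
  OpenSSL::HMAC.hexdigest("SHA256", secret, data)
end

secret = "replace-with-a-long-random-value"
first_id  = pseudonym_for(42, "2020-01-15", secret: secret)
second_id = pseudonym_for(42, "2020-01-15", secret: secret)
other_id  = pseudonym_for(43, "2020-01-15", secret: secret)
```

Here `first_id` and `second_id` are identical, so all of one user's submissions can be grouped, while `other_id` differs. The inputs must never change for a given user, otherwise that user's history splits in two.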
Overview
I'm creating a Ruby on Rails website which uses Facebook to login.
For each user I have a database entry which stores their Facebook User ID along with other basic information.
I'm also using the Koala gem in order to retrieve a user's friendlist from Facebook, but I'm unsure as to how I should store this data...
Option 1
I could store the user's friends as a serialized hash in the User table, then if I wanted to display a list of all the current user's friends, I could grab this hash and do something along the lines of SELECT * FROM Users WHERE facebook_user_id IN (the ids from the hash)
Each time the user logs in I could update this field to store the latest friends list.
Option 2
I could create a Friend table and store friendship information in here, where a User has many Friends. So there would be a row for each friendship, (User1 and User2 columns). Then to display a list of the current user's friends I could do something like SELECT User2 FROM Friends WHERE User1 = current_user
This seems like the better option to me, but...
It has the disadvantage that there would be many rows... If there were 100,000 users, each with 100 friends, that's now 10,000,000 rows in the Friends table.
It also means each time the user logs in, I'd need to loop over their Facebook friends list returned using Koala and create a Friend record if someone on their friendlist is in my User table and there isn't a corresponding entry in the Friends table. This seems like it'd be slow if a user has 1000 Facebook friends?
I'd appreciate any guidance on how it would be best to achieve this.
Apologies for the badly worded question, I'll try and reword/organise it shortly.
Thanks for any help in advance.
If you need to store a lot of data, then you need to store a lot of data. If you are like most, you probably won't run into that problem sooner than you have the cash to solve it. In other words, you are probably assuming you'll have more traffic and data than you'll get, at least in the short-term. So I doubt this is an issue, even though it is a good sign that you are thinking about it now rather than later.
As I mentioned in my comment below, the easiest solution is to have a tie table with a row for each side of the friend relationship (a has_many :friends, through: :facebook_friend_relationships, class_name: 'FacebookFriend' on FacebookFriend, per the design mentioned below). But your question seemed to be about how to reduce the number of records, so that is what the remainder of the answer will address.
If you have to store in the DB and you know for sure that you will absolutely have every FB user on the planet hitting your site because it is so awesome, but they won't all hit at once, then if you are limited in storage, you may want to use a LRU algorithm (remove the least recently used records) possibly with timed expiration also. You could just have a cron job that does a query on the DB then deletes old/unused records to do this. Wouldn't be perfect, but it would be a simple solution.
You could also archive older data rather than throw it away. So, frequently used data could stay in the table of active users, and then you might offload older data to another table or even another database (and you might see the apartment and second_base gems for that). However, once you get to that size, you're probably looking at a number of other architectural solutions that have much less to do with ActiveRecord models/associations or schema design. Though it pays to plan ahead, I wouldn't worry about that excessively until you are sure that the application will get enough users to justify investing the time.
Even though ActiveRecord has some caching, you could just avoid the DB and cache friends in memory yourself in the beginning for speed, especially if you don't yet have many users, which you probably don't. If you think you'll run out of memory because of the high number of users, LRU might be a good option here also, and lru_redux looks interesting. Again, you might want to put a time limit on the cache so that it expires and friends are fetched again afterwards. Even just storing the results in the user session may be adequate, i.e. in the controller action method, just do @friends ||= Something.find_friends(fb_user_id), and the latter is what most might do as a first shot at it while getting started.
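As a rough illustration of the timed cache idea, here is a small in-memory sketch in plain Ruby. The class name `FriendsCache` and its `fetch` method are hypothetical, the block stands in for whatever call actually retrieves the friends, and there is no size limit or LRU eviction here (a gem such as lru_redux would cover that part).

```ruby
# Minimal in-memory cache with per-entry expiry (hypothetical class name).
class FriendsCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @entries = {} # fb_user_id => [expires_at, friends]
  end

  # Returns the cached friends, or runs the block to fetch and store them.
  def fetch(fb_user_id, now: Time.now)
    expires_at, friends = @entries[fb_user_id]
    return friends if expires_at && now < expires_at

    friends = yield
    @entries[fb_user_id] = [now + @ttl, friends]
    friends
  end
end

calls = 0
cache = FriendsCache.new(60)
start = Time.now
first  = cache.fetch(1, now: start)      { calls += 1; ["ann", "bob"] }
second = cache.fetch(1, now: start + 30) { calls += 1; ["not used"] }
third  = cache.fetch(1, now: start + 61) { calls += 1; ["ann", "bob", "cy"] }
```

The second call is served from the cache, so the block runs only twice in total; the third call happens after the 60 second limit and fetches again.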
If you use ActiveRecord, in your query in the controller (or on the association in the model) consider using includes to avoid N+1 queries. That will speed things up.
For the schema design, maybe:
User - users table with email and authN info. Look at the Devise gem.
FacebookUser - info about the Facebook user.
FacebookFriendRelationship - a tie model with (id and) two columns, one for one FacebookUser id and one for the other.
By separating the authN info (User) from the FB data (FacebookUser and FacebookFriendRelationship), you make it easier to have other social media accounts, etc. each with information specific to those accounts in other tables.
The complexity comes in FacebookUser's relationship with friends if the goal is to minimize rows in the relationship table. To halve the number of rows, you'd have a single row for a relationship where the id of FacebookUser could be in either foreign key column. Either the user has a friend or is a friend, so you could have two has_many :through associations on FacebookFriend that each use a different foreign key in FacebookFriendRelationship. Or you could do HABTM without the model and use foreign_key and association_foreign_key options in each association. Either way, you could add a method to add both associations together (because they are arrays). Instead, you could use custom SQL in a single has_many if you didn't care about having to use ActiveRecord to remove associations the normal way. However, per your comments, I think you want to avoid this complexity, and I agree with you, unless you really must limit the number of relationship rows. However, it isn't the number of tie table rows that will eat the storage, it is going to be all of the user info you keep in the FacebookFriends table.
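To make the "one row per friendship" idea concrete, here is a plain Ruby sketch that does not use ActiveRecord at all. `FriendshipStore` is a hypothetical in-memory stand-in for the FacebookFriendRelationship table, and ordering each pair with the smaller id first is just one assumed convention for keeping a single row per relationship.

```ruby
require "set"

# Hypothetical in-memory stand-in for the relationship table:
# each friendship is stored once, with the smaller id first.
class FriendshipStore
  def initialize
    @rows = Set.new
  end

  # Stores a single row per friendship, whichever side adds it.
  def add(id_a, id_b)
    @rows << [id_a, id_b].minmax
  end

  # Looks in both columns, since the user's id may be in either one.
  def friends_of(id)
    @rows.map do |low, high|
      if low == id then high
      elsif high == id then low
      end
    end.compact.sort
  end

  def row_count
    @rows.size
  end
end

store = FriendshipStore.new
store.add(1, 2)
store.add(2, 1) # same friendship seen from the other side, no new row
store.add(3, 1)
```

After these calls `store.row_count` is 2, `store.friends_of(1)` is `[2, 3]`, and `store.friends_of(2)` is `[1]`. The lookup has to check both columns, which is the extra complexity the answer describes.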
I'm building a Rails 3.1 application that allows people to submit events. One of the fields for the event is a venue. On the create/edit form, the venue_name field has autocomplete functionality so it displays venues with a similar name, but the user is able to enter any name.
When the form is submitted, I'm using find_or_create_by_name when attaching the venue to the event model.
I'm doing this because it's not possible for us to maintain a complete list of venues and I don't want to prevent people from submitting an event because the venue isn't in the list.
The problem is that it's quite likely we'll get duplicates over time like "Venue Name" and "The Venue Name" or any number of other possibilities.
I was thinking that I probably just need to create an administrative tool that allows the admin to review recent venues and if he/she thinks they're duplicates to search/select a master record and have the duplicate record's association copied over to the master record and once successful to delete the duplicate record.
Is this a good approach? In terms of the data manipulation would it be best to handle this in a transaction? Would it be best to add this functionality in a sort of utility class - or directly in the Venue model?
Thanks for your time.
If I were going to put together a system like that, I'd probably try to find a unique identifier I could associate with each venue - perhaps an address or a phone number?
So, if I had "The Clubhouse" with a phone number 503-555-1212, and someone tried to input a new venue called "Clubhouse" with the phone number 503-555-1212, I might take them to an interstitial page where I ask them "Did you mean this location?"
Barring that, I might ask for a phone number or address first, then present a list of possible matches with the option to create a new venue.
Otherwise, you're introducing a lot of potential for error at the admin level, plus you run into a scalability problem. If your admin has to review 10 entries a month, maybe not so bad - but if your app takes off and that number goes to 1000, that becomes unmanageable fast!
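A small sketch of the phone number check, in plain Ruby. The `venues` array of hashes is a hypothetical stand-in for Venue records, and `normalize_phone` and `possible_matches` are names made up for this example; the point is only that formatting differences should not hide a match.

```ruby
# Reduces a phone number to its digits so formatting differences do not matter.
def normalize_phone(phone)
  phone.to_s.gsub(/\D/, "")
end

# Returns existing venues that share the submitted phone number.
# `venues` is a hypothetical array of hashes standing in for Venue records.
def possible_matches(venues, phone)
  key = normalize_phone(phone)
  return [] if key.empty?

  venues.select { |venue| normalize_phone(venue[:phone]) == key }
end

venues = [
  { name: "The Clubhouse", phone: "503-555-1212" },
  { name: "Riverside Hall", phone: "503-555-0199" }
]
matches = possible_matches(venues, "(503) 555 1212")
```

Here `matches` contains only "The Clubhouse", which is the record you would show on the "Did you mean this location?" page before creating a new venue.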
I am trying to follow the mantra of "select only what you need" when making database calls in my Rails applications. Currently I am eager loading all users against their posts, however I notice that this selects every column from the user table for each user.
I have tried to use a join instead which offers a much easier way to select what specific attributes I need, however I am using paperclip and the user has_attached_file. I currently haven't found a way to include this in the join.
I was just wondering if anyone had any suggestions or tips on what the best way to load both users and posts is in terms of database performance?
Currently to retrieve users I am simply using this sort of syntax:
Post.all.includes(:user)
Have you tried Post.eager_load(:user)? Should fetch all the Post fields and all the User fields in a single query, and instantiate AR objects for each. Paperclip attachment would be stored in the user table under 4-5 fields starting with the same prefix. Those should be loaded through Post.eager_load(:user).
The only reason this wouldn't work is if the association between Post and User is polymorphic.
I'm learning Rails by building a simple site where users can create articles and comment on those articles. I have a view which lists a user's most recent articles and comments. Now I'd like to add user 'profiles' where users can enter information like their location, age and a short biography. I'm wondering if this profile should be a separate model/resource (I already have quite a lot of fields in my user model because I'm using Authlogic and most of its optional fields).
What are the pros and cons of using a separate resource?
I'd recommend keeping profile columns in the User model for clarity and simplicity. If you find that you're only using certain fields, only select the columns you need using :select.
If you later find that you need a separate table for some reason (e.g. one user can have multiple profiles) it shouldn't be a lot of work to split them out.
I've made the mistake of having two tables and it didn't buy me anything but additional complexity.
Pros: It simplifies each model
Cons: Managing 2 at once is slightly harder
It basically comes down to how big the user and profile are. If the user is 5 fields, and the profile 3, there is no point. But if the user is 12 fields, and the profile 20, then you definitely should.
I think you'd be best served putting in a separate model. Think about how the models correspond to database tables, and then how you read those for the various use cases your app supports.
If a user only dips in to his actual profile once in a while but the User model is accessed frequently, you should definitely make it a separate object with a one-to-one relationship. If the profile data is needed every time the User data is needed, you might want to stick them in the same table.
Maybe the location is needed every time you display the user (say on a comment they left), but the biography should be a different model? You'll have to figure out the right breakdown, but the general rule is to structure things so you don't have to pull data that isn't being used right away.
A user "owns" various resources on your site, such as comments, etc. If you separate the profile from the user then it's just one more resource. The user is static, while the profile will change from time to time.
Separating it out would also allow you to easily maintain a profile history.
I would keep it separate. Not all your users would want to fill out a profile, so those would be empty fields sitting in your user table. It also means you can change the profile fields without changing any of the logic of your user model.
Depends on the width of the existing user table. Databases typically have a limit on the number of bytes a record can contain. If you are close to the limit (or over it, which is usually possible if you have lots of fields with null values), I would add a table with a one-to-one relationship, both for better performance and to reduce the likelihood of a record that suddenly can't be inserted because there is too much data for the row size. If you are nowhere near the limit, then add to the existing table.
I am developing a gallery which allows users to post photos, comments, vote and do many other tasks.
Now I think that it is correct to allow users to unsubscribe and remove all their data if they want to. However it is difficult to allow such a thing because you risk breaking your application (e.g. what should I do when a comment has many replies? what should I do with pages that have many revisions by different users?).
Photos can be easily removed, but for other data (i.e. comments, revisions...) I thought that there are three possibilities:
assign it to the admin
assign it to a user called "removed-user"
maintain the current associations (i.e. the user ID) and only rename the user's data (e.g. assign a new username such as "removed-user-24" and a non-existent e-mail such as "noreply-removed-user-24@mysite.com")
What are the best practices to follow when we allow users to remove their accounts? How do you implement them (particularly in Rails)?
I've typically solved this type of problem by having an active flag on user, and simply setting active to false when the user is deleted. That way I maintain referential integrity throughout the system even if a user is "deleted". In the business layer I always validate a user is active before allowing them to perform operations. I also filter inactive users when retrieving data.
The usual thing to do is, instead of deleting them from the database, to add a boolean flag field and have it be true for valid users and false for invalid users. You will have to add code to filter on the flag. You should also remove all the relevant personal data from the user record that you can. The primary purpose of this flag is to keep the links intact. It is a variant of renaming the user's data, but the flag will be easier to check.
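A plain Ruby sketch of the flag approach, without ActiveRecord. The `User` and `Comment` structs are hypothetical stand-ins for the real models, and the "removed-user-N" naming is borrowed from the question; nothing here is specific to any gem.

```ruby
# Hypothetical stand-ins for User and Comment records.
User = Struct.new(:id, :name, :email, :active)
Comment = Struct.new(:id, :user_id, :body)

# "Deletes" a user by clearing personal data and flipping the flag,
# leaving the row (and every foreign key pointing at it) in place.
def soft_delete(user)
  user.name = "removed-user-#{user.id}"
  user.email = nil
  user.active = false
  user
end

# Filter applied wherever users are listed.
def active_users(users)
  users.select(&:active)
end

users = [
  User.new(1, "ann", "ann@example.com", true),
  User.new(24, "bob", "bob@example.com", true)
]
comments = [Comment.new(1, 24, "Nice photo")]

soft_delete(users[1])
visible_ids = active_users(users).map(&:id)
author = users.find { |user| user.id == comments[0].user_id }
```

After the soft delete, `visible_ids` is `[1]`, while the comment still resolves to an author whose name is now "removed-user-24", so replies and revisions keep a valid link.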
Ideally in a system you would not want to "hard delete" data. The best way I know of and that we have implemented in past is "soft delete". Maintain a status column in all your data tables which ideally refers to the fact whether the row is active or not. Any row when created is "Active" by default; however as entries are deleted; they are made inactive.
All select queries which display data on screen filter results for only "active records". This way you get following advantages:
1. Data Recovery is possible.
2. You can have a scheduled task at the database level that takes care of hard deletes once in a while, if really needed (like a SQL procedure or something).
3. You can have an admin screen to be able to decide which accounts, entries etc you'd really want to mark for deletion
4. Temporary disabling of an account can also be implemented with the same solution.
In the production environments I have worked in, a hard delete is a strict no-no. In fact, audits are maintained for deletes as well. But if the application is really small, it'd be up to the user.
I would still suggest a "virtual delete" or a "soft delete" with periodic cleanup at the database level, which will be a faster, more efficient way of cleaning up.
I generally don't like to delete anything and instead opt to mark records as deleted/unpublished using states (with AASM i.e. acts as state machine).
I prefer states and events to just using flags, as you can use events to update attributes, send emails, etc. in one fell swoop. Then check states to decide what to do later on.
HTH.
I would recommend putting in a delete date field that contains the date/time the user unsubscribed - not only to the user record, but to all information related to that user. The app should check the field prior to displaying anything. You can then run a hard delete for all records 30 days (your choice of time) after the delete date. This will allow the information not to be shown (you will probably need to update the app in a few places), time to allow the user to re-subscribe (accidental or rethinking) and a scheduled process to delete old data. I would remove ALL information about the member and any related comments about the member or their prior published data (photos, etc.)
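The delete date plus delayed hard delete could be sketched like this in plain Ruby. `Record`, `visible`, and `purge_expired` are hypothetical names, the records are an in-memory stand-in for database rows, and the 30 day grace period is the example figure from the answer, not a requirement.

```ruby
# 30 days in seconds; the answer leaves the actual length up to you.
GRACE_PERIOD = 30 * 24 * 60 * 60

# Hypothetical stand-in for any row that carries a delete date.
Record = Struct.new(:id, :deleted_at)

# What the app is allowed to display: rows with no delete date.
def visible(records)
  records.select { |record| record.deleted_at.nil? }
end

# What survives the scheduled cleanup: everything except rows whose
# delete date is at least GRACE_PERIOD in the past.
def purge_expired(records, now: Time.now)
  records.reject do |record|
    record.deleted_at && record.deleted_at <= now - GRACE_PERIOD
  end
end

now = Time.utc(2024, 3, 1)
records = [
  Record.new(1, nil),                     # active
  Record.new(2, Time.utc(2024, 2, 20)),   # unsubscribed, still in grace period
  Record.new(3, Time.utc(2024, 1, 1))     # unsubscribed long ago
]
shown = visible(records).map(&:id)
kept  = purge_expired(records, now: now).map(&:id)
```

With these dates `shown` is `[1]` and `kept` is `[1, 2]`: record 2 is hidden but can still be restored if the user re-subscribes, and record 3 is removed for good.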
I am sure this has changed a lot since the updates to data protection law, GDPR, etc.
I found this page while looking for advice because of Apple's new, extended policy on account deletion requirements: https://developer.apple.com/news/?id=i71db0mv
We are using Ruby on Rails right now. The answers here seem a little outdated. Or are they still useful today?
I was thinking of something like this:
Create a new table, "old_user_table", with the old user_id, first name, second name, email, and booking slug.
This would let us keep every user who has made a booking while deleting their user ID from the app. We need to keep all booking records from the last 5 years in the app for audit purposes.
If a user signed up with the app but never made a booking, that user would not be transferred to "old_user_table".
Does something like that make sense?