In a Rails app, for business reasons, I don't want to leak how many objects I have, or how the object count changes over a period of time.
I think globally unique ids are overkill in this case, as I just need locally unique ids for each table. Since ids in a default Rails + PostgreSQL app are already bigints, I thought about generating very large (but not astronomically large) integers to use as ids:
class ApplicationRecord < ActiveRecord::Base
  primary_abstract_class

  before_create :set_large_random_id

  private

  # Replace the sequential id with a large random one before insert.
  def set_large_random_id
    self.id = rand(1..999_999_999_999)
  end
end
(I am OK with my app failing because of a unique id collision about once in a trillion inserts.)
Apart from losing the ordering given by sequential ids, is there any other problem or consideration I need to take into account when using large random ids?
I can't think of any major issues given that you don't care about collisions. But here are some small considerations:
Developer ergonomics:
If all your other tables have a normal incrementing id column, as they usually do in Rails projects, consistency may be nice. Otherwise this table will be the one exception where internal people looking at data have to order by created_at instead of by id. Debugging also becomes more difficult when you can't easily query the next and previous items.
The solution here would be to have a separate, indexed unique column, and expose that one to users. This could be some sort of random integer, or integers separated by hyphens like Amazon order ids.
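For example, a minimal sketch of that idea (the Order model and public_id column are just assumptions):

class AddPublicIdToOrders < ActiveRecord::Migration[7.0]
  def change
    # Keep the normal bigint primary key internally; expose a separate
    # random identifier to users.
    add_column :orders, :public_id, :string
    add_index :orders, :public_id, unique: true
  end
end

class Order < ApplicationRecord
  before_create -> { self.public_id ||= rand(1..999_999_999_999).to_s }
end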
Style:
If one user's id is 5 and another user's id is 999,999,999,999 then your UI will have to make both sizes look nice. It will also be noticeably odd to the user who has two objects with wildly different ids.
At a company I worked for we had our receipt ids look more like BGTM-ZIJL-52 (always 4/4/2) for consistency. (We also had to make sure we never generated rude words.)
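If you want codes in that sort of shape, a rough sketch (the alphabet and format are assumptions; dropping vowels is one crude way to avoid rude words):

ALPHABET = %w[B C D F G H J K L M N P Q R S T V W X Z].freeze

def receipt_code
  part = ->(n) { Array.new(n) { ALPHABET.sample }.join }
  format("%s-%s-%02d", part.call(4), part.call(4), rand(100))
end

receipt_code # => e.g. "BGTM-ZIJL-52"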
Imported data:
Let's say you are storing order numbers from people's purchases, and you merge with another company and want to import their orders into your system. It would be easier to do that if you had a separate but flexible column for public ids (i.e. a varchar), so that it will support the other company's ids no matter what style they chose (unless there are collisions with yours).
Asides: You're probably aware of this, but you can of course retry inserting a row if you get a collision. There's no need for your app to fail. You could also have PostgreSQL generate the random-seeming ids: https://stackoverflow.com/a/20890246
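For the retry idea, a minimal sketch (not from the question; assumes the before_create callback above, and that PostgreSQL surfaces the collision as ActiveRecord::RecordNotUnique):

def create_with_id_retry(record, attempts: 3)
  attempts.times do |i|
    begin
      return record.save!
    rescue ActiveRecord::RecordNotUnique
      raise if i == attempts - 1
      record.id = nil # the callback will pick a fresh random id on the next try
    end
  end
end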
You can use UUIDs in your database. I am using them and they work well as randomized large IDs.
You can search Google for more info, or look at this article: https://pawelurbanek.com/uuid-order-rails
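For reference, a rough sketch of switching to UUID primary keys in a Rails + PostgreSQL app (the table name is just an example):

class EnableUuidPrimaryKeys < ActiveRecord::Migration[7.0]
  def change
    enable_extension "pgcrypto" # provides gen_random_uuid()

    create_table :roles, id: :uuid do |t|
      t.string :name
      t.timestamps
    end
  end
end

# config/initializers/generators.rb -- make :uuid the default for new tables
Rails.application.config.generators { |g| g.orm :active_record, primary_key_type: :uuid }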
You can check the ID in the Rails console:
Role.first
Role Load (0.4ms) SELECT "roles".* FROM "roles" ORDER BY "roles"."id" ASC LIMIT $1 [["LIMIT", 1]]
=> #<Role:0x00007f5834fa8910
id: "0d8acf6e-63d4-4845-a804-c84e8debb501",
name: "business_analyst",
created_at: Tue, 08 Nov 2022 10:46:47.321147000 +06 +06:00, updated_at: Tue, 08 Nov 2022 10:46:47.321147000 +06 +06:00>
Related
I need to produce a budget-to-actual report table, displaying net values by month per category.
Currently I'm calculating it on the fly, but it's already slow with hardly more than seeded data.
I think the proper way to do this is with a composite key table: the key is the unique combination of Category and Period (Month-Year), and the value is the amount.
Each transaction would trigger a callback to update this table. At report time, a user would supply the period of the report (March 2022) and it would find all rows of that period.
I think it would look something like this:
key                           value
March-2021/All Categories     $335
March-2021/Food               $75
March-2021/Fuel               $60
March-2021/Entertainment      $200
...                           ...
March-2022/All Categories     $49
March-2022/Food               $25
March-2022/Fuel               -$10
March-2022/Entertainment      $34
...                           ...
April-2022/All Categories     $58
April-2022/Food               $5
April-2022/Fuel               $30
April-2022/Entertainment      $23
April-2022/Some_New_Category  $20
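In Rails terms, a rough sketch of that summary table and callback (all model and column names here are assumptions):

class CreateCategoryPeriodTotals < ActiveRecord::Migration[7.0]
  def change
    create_table :category_period_totals do |t|
      t.references :category, null: false
      t.date :period, null: false # e.g. 2022-03-01 for "March 2022"
      t.decimal :amount, default: 0, null: false
      t.timestamps
    end
    add_index :category_period_totals, [:category_id, :period], unique: true
  end
end

class Transaction < ApplicationRecord
  belongs_to :category
  after_create :update_period_total

  private

  def update_period_total
    total = CategoryPeriodTotal.find_or_create_by!(
      category: category, period: created_at.beginning_of_month.to_date
    )
    total.increment!(:amount, amount)
  end
end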
I've read the SO question below, but I'd rather not blindly use a gem, and anyway it seems to involve composite primary keys across several models (User, Organization, and Department), whereas I'm just using a Category model and a non-model period (Month-Year):
How to implement composite primary keys in rails
I can't seem to find anything about how to implement this. Am I approaching this wrong?
I'm trying to understand star schema at the moment & struggling a lot with granularity.
Say I have a fact table that has session_id, user_id, order_id, product_id and I want to roll-up to sessions by user by week (keeping in mind that not every session would lead to an order or a product & the DW needs to track the sessions for non-purchasing users as well as those who purchase).
I can see no reason to track order_ids or session_ids in the fact table so it would become something like:
week_date, user_id, total_orders, total_sessions ...
But how would I then track product_ids if a user makes more than one purchase in a week? I assume I can't keep multiple product ids in an array (eg: "20/02/2012","5","3","PR01,PR32,PR22")?
I'm thinking it may have to be kept at 'every session' level but that could lead to a very large amount of data. How would you implement granularity for an example such as above?
Dimensional modelling requires Dimensions as well as Facts.
You need a Date/Calendar dimension, which includes columns like this:
calendar (id,cal_date,cal_year,cal_month,...)
The "grain" of your fact table is the key to data storage. If you have transactions, then the transaction should be the grain, and you store one row per transaction. Use proper (integer) surrogate keys to your dimensions, and your table won't be as large as you fear.
Now you can write a query like this, to sum sales of product by year:
select product_name, cal_year, sum(purchase_amount)
from fact_whatever
inner join calendar on calendar.id = fact_whatever.calendar_id
inner join product on product.id = fact_whatever.product_id
group by product_name, cal_year
I have two tables:
"sites" has_many "users"
"users" belongs_to "sites"
Is it better that whenever a users got added to sites I added column called users_count in sites table and increment it by one. Or is doing a conditional count on users table the best way?
"Better" is a subjective term.
However, I'll be adamant about this. There should not be two sources of the same information in a database, simply because they may get out of step.
The definitive way to discover how many users belong to a site is to use count to count them.
Third normal form requires that every non-key attribute depends on the key, the whole key, and nothing but the key (so help me, Codd).
If you add a user count to sites, that does not depend solely on the sites key value, it also depends on information in other tables.
You can depart from third normal form for performance if you understand the implications and mitigate the possibility of inconsistent data (for example with triggers), but the vast majority of cases should remain in 3NF.
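In Rails terms, the normalized approach is just a count, and if you do decide to denormalize, the built-in counter cache is one way to mitigate the consistency risk (a sketch, assuming Site and User models):

# Normalized: derive the number from the data itself.
site.users.count

# Denormalized but mitigated: Rails keeps sites.users_count in step
# automatically when users are created or destroyed (requires a
# users_count integer column on sites).
class User < ApplicationRecord
  belongs_to :site, counter_cache: true
end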
In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize, if you are willing to work with composite primary keys. Sample for the AddressesEnvelopes case:
user(
  #user_id
)
address(
  #user_id
  , #address_num
)
envelope(
  #user_id
  , #envelope_num
)
address_envelope(
  #user_id
  , #address_num
  , #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure user_id is always the first column in your index, then for each table all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
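For the "proper indexes" point, a hedged sketch of what a migration might look like with the tables from the question:

class AddUserScopedIndexes < ActiveRecord::Migration[7.0]
  def change
    # Plain foreign-key indexes first.
    add_index :addresses, :user_id
    add_index :envelopes, :user_id
    add_index :addresses_envelopes, [:address_id, :envelope_id]
    # Composite, user_id-first indexes only if you do add the
    # denormalized user_id column to the join table:
    # add_index :addresses_envelopes, [:user_id, :address_id]
  end
end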
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up-to-date.
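One way to keep a duplicated user_id in step, sketched with the models from the question (the callback itself is an assumption):

class Recepient < ApplicationRecord
  belongs_to :letter
  belongs_to :user

  # Always copy the owning user from the parent, so the duplicated
  # column cannot drift away from the Letter it belongs to.
  before_validation { self.user_id ||= letter&.user_id }
end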
How do people generate auto_incrementing integers for a particular user in a typical saas application?
For example, the invoice numbers for all the invoices for a particular user should be auto_incrementing and start from 1. The rails id field can't be used in this case, as it's shared amongst all the users.
Off the top of my head, I could count all the invoices a user has, and then add 1, but does anyone know of any better solution?
A typical solution for any relational database would be a table like
user_invoice_numbers (user_id int primary key clustered, last_id int)
and a stored procedure or a SQL query like
update user_invoice_numbers set last_id = last_id + 1 where user_id = #user_id
select last_id from user_invoice_numbers where user_id = #user_id
It will work for users (since each user has only a few simultaneously running transactions), but it will not work for companies (for example, when you need companies_invoice_numbers), because transactions from different users inside the same company may block each other and this table will become a performance bottleneck.
The most important functional requirement you should check is whether your system is allowed to have gaps in invoice numbering or not. When you use standard auto_increment, you allow gaps, because in most databases I know of, when you roll back a transaction, the incremented number is not rolled back. With this in mind, you can improve performance using one of the following guidelines:
1) Exclude the procedure you use for getting new numbers from long-running transactions. Let's suppose that inserting an invoice is a long-running transaction with complex server-side logic. In this case you first acquire a new id, and then insert the new invoice in a separate transaction. If the latter transaction is rolled back, the auto-number will not decrease, but user_invoice_numbers will not be locked for a long time, so many simultaneous users can insert invoices at the same time.
2) Do not use a traditional transactional database to store the last id for each user. When you need to maintain a simple list of keys and values, there are lots of small but fast database engines that can do that work for you (see any list of key/value databases; memcached is probably the most popular). In the past, I have seen projects where simple key/value storage was implemented using the Windows Registry or even the file system: a directory where each file name was the key, and inside each file was the last id. That rough solution was still better than using a SQL table, because locks were acquired and released very quickly and were not involved in the transaction scope.
Well, if my proposed optimization seems overcomplicated for your project, forget about it for now, until you actually run into performance issues. In most projects the simple method with an additional table will work pretty fast.
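An ActiveRecord version of that counter-table idea, assuming a user_invoice_numbers table like the one above (a sketch, not production code):

class UserInvoiceNumber < ApplicationRecord
  # columns: user_id (unique), last_id
  def self.next_for(user)
    transaction do
      counter = lock.find_or_create_by!(user_id: user.id) # SELECT ... FOR UPDATE
      counter.update!(last_id: counter.last_id.to_i + 1)
      counter.last_id
    end
  end
end

invoice.number = UserInvoiceNumber.next_for(current_user)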
You could introduce another table associated with your "users" table that tracks the most recent invoice number for a user. However, reading this value will result in a database query, so you might as well just get a count of the user's invoices and add one, as you suggested. Either way, it's a database hit.
If the invoice numbers are independent for each user/customer then it seems like having "lastInvoice" field in some persistent store (eg. DB record) associated with the user is pretty unavoidable. However this could lead to some contention for the "latest" number.
Does it really matter if we send a user invoices 1, 2, 3, and 5, but never send them invoice 4? Consider whether you can relax the requirement a bit.
If the requirement is actually "every invoice number must be unique" then we can look at all the normal id generating tricks, and these can be quite efficient.
Ensuring that the numbers are sequential adds complexity; does it add business benefit?
I've just uploaded a gem that should resolve your need (a few years late is better than never!) :)
https://github.com/alisyed/sequenceid/
Not sure if this is the best solution, but you could store the last invoice ID on the User and then use that to determine the next ID when creating a new Invoice for that User. But this simple solution may have integrity problems under concurrency, so you will need to be careful.
Do you really want to generate the invoice IDs in an incremental format? Would this not open security holes (if a user can guess how invoice numbers are generated, they can change the number in a request, which may lead to information disclosure)?
I would ideally generate the numbers randomly (and keep track of used numbers). This keeps collisions rare as well, since the numbers are allocated randomly over a wide range.
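A rough sketch of that approach (the invoices association and number column are assumptions): pick a random number over a wide range and re-roll if it is already taken.

require "securerandom"

# Assumes a unique index on invoices [:user_id, :number] as a safety net.
def random_invoice_number(user)
  loop do
    candidate = 1_000_000 + SecureRandom.random_number(9_000_000)
    return candidate unless user.invoices.exists?(number: candidate)
  end
end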