Let's say I'm building a Facebook clone in Rails
Currently my routes are pretty standard, like
/group/1/post/3
I'd love to make some synthetic IDs using the same numbering scheme that sites like Facebook use. There seem to be two general types of routes:
# Only numbers
/group/10101830214008379/post/159476674458072
# Hash / Hex
/group/da295c4b/post/815fe818
Outside of aesthetics -
What are some advantages/disadvantages to using either approach?
Is there a good industry standard or best practice for generating synthetic ids for concepts like users, groups, posts, etc..
What's the best way in Ruby/Rails to generate each of these IDs? I know of SecureRandom.hex but that seems to generate a long hash.
Thanks!
What are some advantages/disadvantages to using either approach?
Using sequential numbers
Advantages: Easy to implement
Disadvantages: Possible vector for attack. See this video for a high-level overview.
Using random numbers
Advantages: Solves the problems outlined in the video re: sequential record attacks
Disadvantages: Since each digit can take only 10 values (about 3.3 bits of entropy per character), IDs have to be much longer to stay hard to guess as your app grows.
Base 64 (use this instead of hex)
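Advantages: Each character can take 64 values (6 bits of entropy), so an ID 5 chars long has 64^5 (about 1 billion) possible values. This allows for comparatively much shorter URLs. Use SecureRandom.urlsafe_base64 for this; note that its argument is the number of random bytes, so the resulting string is about 4/3 that length.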
Disadvantages: None, really.
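To make the length comparison concrete, here is a minimal sketch of the arithmetic. The helper name chars_needed is made up for illustration; it just finds the shortest ID length whose alphabet covers a given number of random bits.

```ruby
# Smallest ID length such that alphabet_size**length covers 2**target_bits
# distinct values. Uses integer arithmetic only, so there is no rounding.
def chars_needed(target_bits, alphabet_size)
  length = 1
  length += 1 while alphabet_size**length < 2**target_bits
  length
end

# Compare the three alphabets discussed above for a 64-bit random ID.
{ "decimal" => 10, "hex" => 16, "base64" => 64 }.each do |name, size|
  puts "#{name}: #{chars_needed(64, size)} characters for 64 random bits"
end
```

So for the same guessing difficulty, a digits-only ID is nearly twice as long as a base 64 one.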
Is there a good industry standard or best practice for generating synthetic ids for concepts like users, groups, posts, etc..
To my knowledge, no. Anything sufficiently random and of sufficient length should be fine. Within your model, you'd want to check if an ID is taken first so you don't have duplicates (and ideally back that with a unique index in the database, since a check alone can race under concurrent writes), but outside of that there's little to worry about.
What's the best way in Ruby/Rails to generate each of these IDs? I know of SecureRandom.hex but that seems to generate a long hash.
Like I said above, I recommend using SecureRandom.urlsafe_base64
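A rough sketch of what that could look like, with a retry when the ID is already taken. TAKEN_IDS and generate_public_id are hypothetical names; the Set stands in for whatever uniqueness lookup your model does (in Rails that would be a query, plus a unique index).

```ruby
require "securerandom"
require "set"

# Hypothetical stand-in for a uniqueness lookup such as a database query.
TAKEN_IDS = Set.new

# Returns a URL-safe token built from `bytes` random bytes.
# The argument counts random bytes, not characters: 6 bytes -> 8 characters.
def generate_public_id(bytes = 6, taken: TAKEN_IDS)
  loop do
    candidate = SecureRandom.urlsafe_base64(bytes)
    # Set#add? returns nil when the value is already present, so we retry.
    return candidate if taken.add?(candidate)
  end
end

puts generate_public_id
```

Six bytes is only an example; pick the length based on how many records you expect.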
What are some advantages/disadvantages to using either approach?
I think that the main advantage is that you can generate a new entity (e.g. a post) without having to rely on a sequential id generation (from the database). This is especially useful for highly concurrent or distributed systems, where you want to be able to create new entries without having to a) do the creation in a sequence or b) without running into conflicts.
Is there a good industry standard or best practice for generating synthetic ids for concepts like users, groups, posts, etc..
UUID is one widely used standard for this.
What's the best way in Ruby/Rails to generate each of these IDs? I know of SecureRandom.hex but that seems to generate a long hash.
SecureRandom.uuid
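As a shorter, more human-friendly alternative to UUIDs you could use SecureRandom.urlsafe_base64. By default it uses 16 random bytes, which is comparable to a UUID; if you ask for fewer bytes to get shorter IDs, the probability of generating a non-unique value goes up.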
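To get a feel for the trade-off between length and uniqueness, here is a small sketch using the usual birthday-bound approximation. collision_probability is a made-up helper, and the approximation only holds while the result is much smaller than 1.

```ruby
require "securerandom"

# Approximate probability of at least one collision among `count` IDs drawn
# uniformly from 2**bits values (birthday bound: count^2 / 2^(bits + 1)).
def collision_probability(count, bits)
  (count.to_f**2) / (2.0**(bits + 1))
end

uuid  = SecureRandom.uuid               # 36 characters, 122 random bits (version 4)
token = SecureRandom.urlsafe_base64(6)  # 8 characters, 48 random bits

puts "uuid  (#{uuid.length} chars): p ~ #{collision_probability(1_000_000, 122)}"
puts "token (#{token.length} chars): p ~ #{collision_probability(1_000_000, 48)}"
```

With a million records the 8-character token already has a small but real chance of a collision, which is why you'd still check for duplicates.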
I have been researching quite a bit and the general consensus is to avoid serialized hashes in a DB whenever possible, however the design I have lends itself to this structure, so I'm hoping to get some opinions and/or advice. Here is the scenario:
I have a model/table :products which houses financial products. Each product has_many investment strategies, which I had originally stored in a separate :strategies model/table. Since each product has completely different strategies, and each strategy has different attributes, it's become extremely difficult (and hacky) to manipulate each strategy's attributes into normalized, consistent columns (to the point where I have products that I simply cannot add to the application). Additionally, a strategy's attributes can sometimes change based on the amount of money allocated to that strategy.
In order to solve this issue, I am looking into removing the :strategies model/table altogether and simply adding a strategies column to my :products model/table. The new column would house a multi-dimensional hash of each product's strategies. This option allows for tremendous flexibility from a data storage perspective.
My primary question is, do I lose any functionality by restructuring my database this way? There will be times when I need to search for a product by its strategies' attributes, and I have read that searching within a multi-dimensional hash is difficult at best. Is this considered bad practice? Is there a third solution that I haven't considered?
The advantage of rolling with multiple tables for this design is that you can leverage the database to protect your data with constraints, functions and triggers. The database is the only place you can protect your customers' data with 100% confidence. These tried and true techniques have lost their luster in recent years and are viewed as cumbersome and/or unnecessary by those who do not understand them.
Hash-based stores within relational databases are currently changing quickly due to the popularity of NoSQL databases. Traditionally, however, it has been difficult to fully protect your customers' data at the database level with this implementation, so the application layer is where much of this protection lives. With that said, this is being innovated on and maybe someday they will solve it.
The big advantage of using a hash as a column in a table is that you can get up and going more quickly while you're figuring out your problem. In addition, you can pivot more easily because most modifications are made in the application layer on the fly.
Full-text searching and complex queries can also be a bit more difficult if you're using a hash-based store within a relational database.
The general rule of thumb is: if you need the data to be safe and/or have some complex reporting to do, go relational. Think a big financial services type app ;) Otherwise, if you're building a more social, data display style app, or just mocking things up, there is nothing wrong with a serialized hash column. Most importantly, remember to write tests so you can refactor more confidently if you choose wrong!
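To illustrate why querying gets harder, here is a small plain-Ruby sketch. The product data and the products_with_strategy helper are invented for the example; the point is that once strategies live in a serialized hash, a search by strategy attribute means loading every product and scanning it in application code.

```ruby
# Hypothetical products, each carrying a nested hash of strategies.
PRODUCTS = [
  { name: "Fund A", strategies: { growth: { min_allocation: 1_000, risk: "high" } } },
  { name: "Fund B", strategies: { income: { min_allocation: 5_000, risk: "low" },
                                  growth: { min_allocation: 2_500, risk: "high" } } }
].freeze

# Finds products with at least one strategy whose attribute equals `value`.
# Every product is scanned in Ruby; no database index can help here.
def products_with_strategy(products, attribute, value)
  products.select do |product|
    product[:strategies].each_value.any? { |attrs| attrs[attribute] == value }
  end
end

puts products_with_strategy(PRODUCTS, :risk, "low").map { |product| product[:name] }.inspect
```

With a separate strategies table the same question would be a single indexed query.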
My $0.02
I would be curious to know which decision you choose and how it has worked out.
I've got this medium-sized app that is starting to get too complex. I'm considering splitting it in two. But I'm uncertain about how would I share information between those.
I've been able to make two big groups of models; One group deals with "pictures" and the other one deals with "sales data".
Some utility models, such as the authentication/authorization related ones, will have to be copied over, I guess. But let's concentrate on the two Big Groups.
The two data sets are maintained by different people, so they would split quite naturally.
The only place the two groups "overlap" is a couple of reports that pull data from both "pictures" and "sales data". The information in both cases resembles an array of hashes, with different depths, pointing to calculations (around 60 numbers per system).
That's pretty much the only thing holding the split; I'm not sure about what would be the best way to share information between both apps.
I'd appreciate any pointers to what would be the best way to accomplish this. Should I try to use the same database for both apps? Should I use some kind of web service instead?
The simple solution would be to have both applications use the same database. The problem with doing so is that you'd get some code duplication in the models where they overlap. You could of course fix that with a git submodule or a custom gem. An interesting thing to look into here would be Rails engines.
A different solution would be for one application to own the data and expose a RESTful API that the other pulls from. But then you need to decide which one gets to "manage" the reports.
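As a very rough sketch of the API option: one app renders its numbers as JSON and the other parses them and merges in its own data for the report. All names here (sales_summary_json, combined_report, the system_id field, the endpoint path in the comment) are assumptions for illustration, and the HTTP call itself is left out; in practice the consumer would fetch the payload with something like Net::HTTP.

```ruby
require "json"

# Provider side (hypothetical "sales" app): the body an endpoint such as
# GET /api/sales_summary might render. Path and field names are assumptions.
def sales_summary_json(rows)
  JSON.generate(rows)
end

# Consumer side (hypothetical "pictures" app): merge the fetched sales rows
# with locally known picture counts, keyed by system id.
def combined_report(sales_json, picture_counts)
  JSON.parse(sales_json, symbolize_names: true).map do |row|
    row.merge(pictures: picture_counts.fetch(row[:system_id], 0))
  end
end

payload = sales_summary_json([{ system_id: 1, total: 120.5 }, { system_id: 2, total: 80.0 }])
puts combined_report(payload, { 1 => 14 }).inspect
```

Whichever app runs combined_report is the one that "manages" the report.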
It's a pretty complex decision and I can't help you make it without all the data, but I hope this has been helpful ^^
Also, duplicating the code will create caching and concurrency problems.
I'm working on a project with developers who have not worked with Ruby OR Rails before.
They have created a schema that is too complicated, in my opinion. The schema has 117 tables, and obtaining the simplest piece of information would require traversing/joining 7 tables...and of course, there's no "main" table that serves as a sort of key between them. The schema renders many of the Rails tools, like the 'find' method and many of the has_many/belongs_to relationships, almost useless. And coding for all of these relationships will likely be more time-consuming than we have the money to code for.
THE QUESTION:
Assuming you are VERY convinced (IMHO...hehe) that the schema is not ideal, and there are multiple ways to represent the domain, how would you argue FOR simplifying the schema (aside from what I've already said)?
I'll stand up in 2 roles here
DBA: Database admin/designer.
Dev: Application developer.
I assume the DBA is a person who really knows all the database tricks. Reaallyy knows.
DBA:
Database is the key of the application and should have predefined structure in order to serve its purpose well and with best performance.
If you cannot use an arbitrary schema (which is reasonably normalised and good), then the tools are wrong.
Dev:
The database is just a data store, so we need to keep it simple and concentrate on the application.
DBA:
Database is not a store it is the core of the application. There is no application without database.
Dev:
No. The application is the core. There is no application without the front-end and the business logic applied to it.
And the war begins...
Both points are valid, and it is always a trade-off.
If the database will ONLY be used by RoR, then you can use it more like a simple store.
If the DB can be used by other applications OR it will be used with a large amount of data and high traffic, it must enforce some best practices.
Generally there is no way you can disagree with the DBA.
But they can understand your situation and might allow you to loosen the standards a bit so you can be more productive.
So you need to work closely, together.
And you need to talk to each other to explain and prove the point why database should be like this or that.
Otherwise, the team is broken and the project is likely to fail.
ActiveRecord is a very handy tool, but it cannot do everything for you. By default it does not produce exactly the database structure you expect, so it should be tuned.
On the one hand, if the DBA can accept that all PKs are auto-incremented integers, that would make the developer's life easier (ActiveRecord does it by default).
On the other hand, if developers would accept some of the DBA's constraints, it would make the DBA's life easier.
Now to answer your question:
how would you argue FOR simplifying the schema
Do not argue. Meet the team and deliver the message and point on WHY it should be done.
Maybe it really shouldn't and you don't know all the things, maybe they are not aware of something.
You could agree on the general structure of the database AND try to describe it using RoR migrations as a meta language.
This way they would see the general picture, and you would use your great ActiveRecords.
And also everybody would be on the same page.
Your DB schema should reflect the domain and its relationships.
De-normalisation should only be done when you have measured that there is a performance problem.
7 joins is not excessive or bad, provided you have good indexes in place.
The general way to make this argument up the chain is based on cost. If you do things simply, there will be less code and fewer bugs. The system will be able to be built more quickly, or with more features, and thus will create more ROI. If you can get the money manager on board with that approach, he or she may let you dictate terms to the team. There is the counterargument that extreme over-normalization prevents bad data, but I have found that this is not the case, as the complexity it engenders tends to lead to more errors and more database code in general.
The architectural and technical argument here is simple. You have decided to use Ruby on Rails. Therefore you have decided to use the ActiveRecord pattern. The ActiveRecord pattern is driven by having the database tables match the object model. That's the pattern in use here, and in many other places, so the best practices they are trying to apply for extreme data normalization simply do not apply. Buy a copy of Patterns of Enterprise Application Architecture and put the little red bookmark at page 160 so they can understand how the pattern works from the architecture perspective.
What the DBA types tend to be unaware of is how much work ActiveRecord does for you, from query generation, cascading deletes, optimistic locking, auto populated columns, versioning (with acts_as_versioned), soft deletes (with acts_as_paranoid), etc. There is a strong argument to use well tested, community supported library functions to perform these operations versus custom code that must be maintained by a DBA.
The real issue with DBAs is then that they need some work to do. Let them focus on monitoring performance, finding slow queries in the code, creating indexes and doing backups.
If you end up losing the political battle for a sane schema, you may want to consider switching to DataMapper. It's the next pattern in PoEAA. The other thing you may be able to get them to do is to create views in the database that correspond to the object model. This way, you could use many of the finding capabilities in the ActiveRecord model based on the views, but have custom insert, update, and delete methods.
For example: http://stackoverflow.com/questions/396164/exposing-database-ids-security-risk and http://stackoverflow.com/questions/396164/blah-blah loads the same question.
(I guess this is DB id of Questions table? Is this standard in ASP.NET?)
What are the pros and cons of using this type of scheme in your web app?
Well, for one, simple id's are usually sequential, so it's quite easy to guess at and retrieve other data from your application.
For example, this URL still loads the question "Load JSON at runtime rather than dynamically via AJAX":
https://stackoverflow.com/questions/395858/doesnt-matter-what-I-type-here
Now, having said that, that might also be seen as a bonus, because nobody in their right mind would make their whole security hinge on the fact that you have to click on a link to get to your secure data, and thus easy discoverability of the data might be good.
However, one point is that at some point you're going to reindex your database, and anything that makes the old URLs invalid would be bad, if for no other reason than that search engines would still have the old links.
Also, here on SO it's quite normal to link to other questions like this, so if they at some point want to reindex and thus renumber things (or move to GUIDs), they will still have to keep the old structure and IDs.
Now, is this likely to ever happen or be needed? Probably no.
I wouldn't worry too much about it, just build your security as though every entrypoint to your application is known and there should be no problems.
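A minimal sketch of that idea in plain Ruby: the lookup by ID always goes through an explicit check of who is asking, so guessing an ID gains nothing. Post, User, NotAuthorized, viewable_by? and fetch_post! are all hypothetical names, and the ownership rule is just an example policy.

```ruby
# Hypothetical records; in a real app these would come from the database.
Post = Struct.new(:id, :owner_id, :is_public)
User = Struct.new(:id)

class NotAuthorized < StandardError; end

# Access depends on who is asking, never on whether the caller managed
# to guess or discover the ID.
def viewable_by?(post, user)
  post.is_public || (!user.nil? && post.owner_id == user.id)
end

# Looks up a post by ID and raises unless the user may see it.
def fetch_post!(posts, id, user)
  post = posts.find { |candidate| candidate.id == id }
  raise NotAuthorized, "post #{id} is not accessible" unless post && viewable_by?(post, user)
  post
end
```

Missing posts and forbidden posts raise the same error here, so the response does not reveal which IDs exist.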
The database ID is used to look up the question in the database. It's numerical, which means fast. If you left it out, you would have to look up the question by its title, which is a lot slower.
The question title is part of the URL to make it "search engine friendly". It'll be ranked higher by Google etc.
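Here is a small sketch of how such a URL can be handled: only the numeric ID is used for the lookup, and whatever text follows is ignored. question_id_from_path is a made-up helper and the /questions/:id/:slug shape is assumed from the URLs quoted in the question.

```ruby
# Extracts the numeric ID from a path like "/questions/396164/any-slug".
# The slug after the ID is optional and ignored. Returns nil when the
# path does not match the assumed route shape.
def question_id_from_path(path)
  match = %r{\A/questions/(\d+)(?:/[^/]*)?\z}.match(path)
  match && match[1].to_i
end

puts question_id_from_path("/questions/396164/exposing-database-ids-security-risk")
puts question_id_from_path("/questions/396164/blah-blah")
```

Both paths resolve to the same ID, which is why both URLs load the same question.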
Pro:
Super easy to retrieve the page information. Take the ID, call the database, voilà. Your table will (should) be indexed to make this lookup super fast.
Guaranteed unique URL.
Con:
IDs in your system are being publicly displayed. Not a problem in a publicly available system like SO. However, proper security measures on the back end can make this not a problem even on sensitive systems.
Ugly URLs. 6+ digit numbers are just hard to remember, and make it more difficult to distinguish pages if the number is all that identifies them. This can also have SEO consequences, as URLs with more relevant and well-structured information are generally ranked better. SO compensates by providing the post name in the URL as well. While I still can't rattle off a particular post to my buddy at lunch, I can still find it more easily in the browser history.
Slower lookups. Doing text searches on a database is generally slower.
But remember that in a community like this there is a higher (although still minimal) chance of the same question title being posted at the same time, which would break things. Some kind of unique identification therefore needs to be applied, and IDs are quite logical in the context in which this particular web application was developed.
I don't think it's bad practice, and it's fairly common, to do it in ASP.NET and other frameworks. As @lassevk said, if your security depends on it, then you need some more checks in there (can user X get to record Y), but it comes down more to the SEO-friendliness of the URLs for public sites.
For example, SO's URLs are fairly friendly:
Pros and cons of using DB id in the URL?
Google rates information at the START of the URL higher than at the end, so having it look like:
https://stackoverflow.com/pros-and-cons-of-using-db-id-in-the-url/q/407120
should get a higher ranking for "pros and cons of using db id in the url". It's not the only factor, but it is quite a major one - look at Amazon's format, they do it for a very good reason:
http://www.amazon.com/Maverick-Ricardo-Semler/dp/0712678867
http://server/book-name/dp/book-id
Wordpress does it like this:
http://server/yyyy/mm/dd/name-of-the-post
however, if you post two posts on the same day called "foo", you get:
http://server/yyyy/mm/dd/foo
http://server/yyyy/mm/dd/foo2
the slug (foo/foo2) isn't a PK, but it IS maintained as unique over the posts table.
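A rough sketch of that slug behaviour in Ruby, modelled on the foo/foo2 example above rather than on WordPress's actual code. slugify and unique_slug are made-up helpers, and the Set stands in for a uniqueness check against the posts table.

```ruby
require "set"

# Turns a title into a lowercase, hyphen-separated slug.
def slugify(title)
  title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# Appends 2, 3, ... until the slug is unused, then records it in `taken`.
def unique_slug(title, taken)
  base = slugify(title)
  candidate = base
  suffix = 1
  candidate = "#{base}#{suffix += 1}" while taken.include?(candidate)
  taken << candidate
  candidate
end

taken_slugs = Set.new
puts unique_slug("Foo", taken_slugs)
puts unique_slug("Foo", taken_slugs)
```

In a real app the uniqueness would also be enforced by a unique index on the slug column.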
I think putting the ID in the URL isn't a problem, unless your URL is a GUID! Way too long, and hard to type. If it's an int, or some kind of short guid (eg 6-8 chars), then it shouldn't be a problem.
Would any experienced Erlang programmers out there ever recommend association lists over records?
One case might be where two (or more) nodes on different machines are exchanging messages. We want to be able to upgrade the software on each machine independently. Some upgrades may involve adding a field to one (or more) of the messages being sent. It seems like using a record as the message would mean you'd always have to do the upgrade on both machines in lock step so that the extra field didn't cause the receiver to ignore the record. Whereas if you used something like an association list (which still has a "record-like" API), the not-yet-upgraded receiver would still receive the message successfully and just ignore the new field. I realize this isn't always the desired behavior, but often it is. Also, assume the messages are fairly small so the lookup time doesn't matter.
Assuming the above makes some sense, I have the following additional questions:
Is there a standard (or widely used) library for alists? Some trivial googling didn't turn up anything.
Are there other cases where you would use an association list (or something like it)?
You have basically three choices:
Use Records
Use Association Lists (proplists)
Use Combination
I use records where the likelihood of changing it is very low. That way I get the pattern matching and speed up that I want.
I use proplists where I need hashtable like functionality. I get flexibility at the expense of pattern matching and speed.
And sometimes I use both. A record with one field that is a proplist. That way I can pattern match on a portion of it and yet have flexibility where I need it.
All three choices have different trade-offs so you basically just have to evaluate your particular needs and make a choice. It may take some prototyping and playing around to figure out which trade-offs make sense and which features you absolutely must have.
For a small number of keys you can use lists (aka proplists); for bigger ones you should use dict. In both cases the biggest disadvantage is that you can't pattern match the way you can with records. There is also a speed penalty, but in most cases it is irrelevant.
Note that lists:keysearch/3 is pretty much "assq".