Rails: Caching a Tree in memory on the server - ruby-on-rails

I have a postgresql database which contains multidimensional data. What I did was I wrote a data structure that sorts all database rows into a tree format. Now the database is large and so I dont want to generate the tree every time a request comes in from a browser. What Id like to do is construct the tree once in a certain time period and persist it in memory on the server.
The tree is read only by the way. So now each time a request comes in the tree need not be generated new, its already there.
How can I make this happen. Im not an expert programmer, just a beginner and definitely new to web programming. So some of these concepts are new to me.
But if you could please point me in the right direction in terms of the concepts involved here, I can google the rest.
Or if you have actual links or examples that would be fantastic.
Thanks

There are several ways to approach this problem. It depends on just how close to the application you want the variables. If you're really looking to have them right "on top" of the application, for fastest possible use, then you could look at using a global variable "$tree" and hooking in to the application flow. Other options might include memcached, which is still pretty darn close to the application. Redis would be a good option for an in-memory database that could be shared between instances of an application, as it is a NoSQL database that you query. Not quite as close to the application though.
Generally, those are your primary options. In-application variables that survive requests. Application frameworks that will help variables survive requests and provide you a querying mechanism. Or, an In-Memory databases that will allow you to store and query rapidly from multiple instances. Each is a viable option, though I'm pretty sure you'd get a lot of 'community' flack for using a straight up global variable (such practices are considered unclean for their lack of thread-safety and other such concerns).

Related

master slave exposes technical debt

Using rails and postgresql.
I wrote my app without having in mind to use a master slave configuration.
Now, I've gotten master slave set up in the app and now I'm running into some technical debt. The same process in my app writes to the db and then immediately reads from the db. The read is not taking place on the read db but the data isn't there. Before, this wasn't efficient but it didn't cause any problems because both dbs were the same. Now, this is blowing up in my face.
The problem for me is that its difficult to find all the places in the code where this problem exists. Can someone can please suggest to me a technique to get my tests to run in such a way where the reads and the writes use different dbs that aren't updated so that I can figure out where my issues are?
Other solutions will also be welcomed!
I strongly recommend you rethink your master/slave configuration or whether master/slave is even right for your application.
It's not "tech debt" to build a system that assumes data written to persistent store can be read back immediately. It's normal and correct. While you might reasonably be able to avoid the pattern
write A, ..., look up A.key
with various simple cache schemes, trying to code around e.g.
write A, ..., complex query that *might* fetch A
requires you to retain a copy of A and determine whether it would satisfy the WHERE clause of the query in separate code, simply because you can't rely on the query results. Unless your system is very small and simple, trying to do this system-wide will produce a super-complex, fragile, expensive, and ugly code base. I strongly recommend you don't try it.
The usual purpose of a master/slave persistent store organization is to off-line read traffic that's not time-dependent on writes. For example, if your system mines data to produce summaries accessible to users, you'd offline the metric computation and have it mine the slave. This prevents mining queries from drawing resources away from user request handling. The small delay between write on master and copy to slave is no problem.
If your app is struggling because there's too much load on persistent store, you probably want partitioned data (sometimes called sharding), not master/slave. Partitioning can expose you to a different kind of problem: no cross-partition transactions. But this is usually easier to work through than what you're attempting.
After studying this area, I agree with Gene that master slave should only be used for reads that have been written a significant time before the read.
My ORIGINAL concept was that its better to utilize a functional programming style whereby the process retains all the information in the parameters and then doesn't make recourse to the database. The downside of this approach is that the human mind has a hard time with functional programming and in a massive computer program it makes sense to not insist on this added complication.
If you want to write a functional method or process then that is great and very efficient but there shouldn't be anything in the code that insists on this.

Read-only web apps with Rails/Sinatra

I am working on an administrative web app in Rails. Because of various implementation details that are not really relevant, the database backing this app will have all of the content needed to back another separate website. It seems like there are two obvious options:
Build a web app that somehow reads from the same database in a read-only fashion.
Add a RESTful API to the original app and build the second site in such a way as for it to take its content from the API.
My question is this: are either of these options feasible? If so, which of them seems like the better option? Do Rails, Sinatra, or any of the other Rack-based web frameworks lend themselves particularly well to this sort of project? (I am leaning towards Sinatra because it seems more lightweight than Rails and I think that my Rails experience will carry-over to it nicely.)
Thanks!
Both of those are workable and I have employed both in the past, but I'd go with the API approach.
Quick disclaimer: one thing that's not clear is how different these apps are in function. For example, I can imagine the old one being a CRUD app that works on individual records and the new one being a reporting app that does big complicated aggregation queries. That makes the shared DB (maybe) more attractive because the overlap in how you access the data is so small. I'm assuming below that's not the case.
Anyway, the API approach. First, the bad:
One more dependency (the old app). When it breaks, it takes down both apps.
One more hop to get data, so higher latency.
Working with existing code is less fun than writing new code. Just is.
But on the other hand, the good:
Much more resilient to schema changes. Your "old" app's API can have tests, and you can muck with the database to your heart's content (in the context of the old app) and just keep your API to its spec. Your new app won't know the difference, which is good. Abstraction FTW. This the opposite side of the "one more dependency" coin.
Same point, but from different angle: in the we-share-the-database approach, your schema + all of SQL is effectively your API, and it has two clients, the old app and the new. Unless your two apps are doing very different things with the same data, there's no way that's the best API. It's too poorly defined.
The DB admin/instrumentation is better. Let's say you mess up some query and hose your database. Which app was it? Where are these queries coming from? Basically, the fewer things that can interact with your DB, the better. Related: optimize your read queries in one place, not two.
If you used RESTful routes in your existing app for the non-API actions, I'm guessing your API needs will have a huge overlap with your existing controller code. It may be a matter of just converting your data to JSON instead of passing it to a view. Rails makes it very easy to use an action to respond to both API and user-driven requests. So that's a big DRY win if it's applicable.
What happens if you find out you do want some writability in your new app? Or at least access to some field your old app doesn't care about (maybe you added it with a script)? In the shared DB approach, it's just gross. With the other, it's just a matter of extending the API a bit.
Basically, the only way I'd go for the shared DB approach is that I hated the old code and wanted to start fresh. That's understandable (and I've done exactly that), but it's not the architecturally soundest option.
A third option to consider is sharing code between the two apps. For example, you could gem up the model code. Now your API is really some Ruby classes that know how to talk to your database. Going even further, you could write a Sinatra app and mount it inside of the existing Rails app and reuse big sections it. Then just work out the routing so that they look like separate apps to the outside world. Whether that's practical obviously depends on your specifics.
In terms of specific technologies, both Sinatra and Rails are fine choices. I tend towards Rails for bigger projects and Sinatra for smaller ones, but that's just me. Do what feels good.

Is using Redis right for this situation?

I'm planning on creating an app (Rails) that will have a very large collection of users - it'll start small but I would like it to be able to handle a million or more.
I want to build a system that will be able to handle 2500+ requests per second. Each request will require a write (for logging purposes) as well as a read from the enormous list of users, indexed by username (I was recommended to use MongoDB for this purpose) and the results of the read will be sent back to the user.
I am a little unclear about how mongo will handle both reads and writes, so I had this idea of using Mongo to sort of permanently store the records and then load them up into Redis every time the server starts up for even faster access so that Mongo doesn't have to deal with anything but the writes.
Does that sound reasonable or is that a huge misuse of Mongo and Redis?
The speed of delivery is of utmost importance.
It's possible, actually, to create the entire application using just Redis. What you'd want to do is research design patterns for Redis. A good place to start is this PDF by Karl Seguin called The Little Redis book.
For example, use Redis's hashes to save all users' information.
Further, if planned well you don't need to have another persistent storage such as Mongo or MySQL in conjunction with Redis as Redis is persistent itself. You just need to pick a good sharding/replication strategy that'll allow you to be flexible enough for future systemic changes.
I think the stack that you are asking about is certainly a very good solution and one that's pretty battle tested for high performance sites. Trello (created by same people who created this very site) uses a similar architecture as well as craigslist.
Trello Tech Stack Writeup
Craigslist also uses this
Redis is fast and has a great pub/sub mechanism in addition to normal invalidation type features that makes it a superior cache to most. Mongo is a db i'm very familiar with and think it's great for all sorts of data store purposes as well as being a solid enterprise db that scales well, protects data integrity and checks off a bunch of marks in the SLA enterprise jargon checklist
I think it's a great combination but really the question should be is do I even need this. For your load I think Mongo itself could handle this quite nicely (and give data integrity) and also if you really want you can run it on server with enough memory to make sure your dataset fits inside memory (denormalizing and good schema design is key). Foursquare runs exclusively on Mongo in memory.
So think if this is necessary but remember simple always wins. Redis/Mongo is super powerful but it will also take a lot more work to master two data stores and administer them.
Thanks,
Prasith
As others have mentioned, using a single service makes more sense to me. There's reason to keep the logging data in memory though. I'd try using something simple, a logfile if possible, or Scribe or Flume if you need to distribute the writes.

Is Using Db4o For Web Sites a judicious choice?

Is using Db4o as a backend datastore for a Web site (ASP.NET MVC) a judicious choice as an alternative to MS SQL Server ?
The main issue with DB4o is: Can you cut your object net in some useful manner? If not, then you'll keep too many objects in RAM for too long and your performance will suffer.
For example, in SQL, you can create a cursor and then easily traverse a huge set of results. You can also query for a small set of columns while DB4o always loads the whole objects (and its references and the references of the references). With DB4o, you must make sure that DB4o doesn't try to pull in all objects from the DB at once.
You'll also need to get used to querying things your "DB" by filling out example objects which feels weird in the beginning.
That depends, what kind of site your creating, the traffic your expecting etc...Are you going to handle a million requests a second, or 100 a minute...Does your domain justify using a Object Database? Do you really need it?
In general, most sites are not heavy hitters so they might not require all the scale out functionality (I believe and this is only a belief that traditional RDBMS have been tested and designed to handle extreme loads where as Object DB's might not have been given the same attention).
So then the question is does your domain justify this? Your going to base a core piece of your site on a technology that you will not find a lot of experts in. So how do you handle turn over rate? Are you willing to take the cost associated with training all current and future employees on this?

server side db programming: why?

Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!).
Server-side logic is often much faster, even with procedural approach.
You can fine-tune your grant options and hide the data you don't want to show
All queries in one places are more convenient than if they were scattered all around the code.
And here's a (very subjective) article in my blog on the reason I prefer stored procedures:
Schema Junk
BTW, triggers (as opposed to functions / stored procedures / packages) I generally dislike.
They are completely other story.
You're keeping the processing in the database, along with the data.
If you process on the server side, then you have to transfer the data out to a server process across the network, process it, and (optionally) send it back. You have the network bandwidth/latency issues, plus memory overheads.
To clarify - if I have 10m rows of data, my two extreme scenarios are to a) pull those 10m rows across the network and process on the server side, or b) process in place in the database using the server and language (SQL) optimised for this purpose. Note that this is a generalisation and not a hard-and-fast rule, but it's the one I follow for most scenarios.
When many heterogeneous applications and various other systems need to access your single database and be sure through their operations data stays consistent without integrity conflicts. So you put your logic into triggers and stored procedures that will offer an interface to external clients.
Maybe not for most web-based systems, but certainly for enterprise databases. Stored procedures and the like allow you much greater control over security and performance, as well as offering a bit of encapsulation for the database itself. You can change the schema all you want as long as the stored procedure interface remains the same.
In (almost) every situation you would keep the processing that is part of the database in the database. Application code cannot substitute for triggers, you won't get very far before you have updated the database and failed to fire the application's equivalent of the triggers (the first time you use the DBMS's management console, for instance).
Let the database do the database work and let the application to the application's work. If you have a specific performance problem with the database, and that performance problem can be addressed by moving processing from the database, in that case you might want to consider doing so.
But worrying about database performance without a database performance problem existing (which is what you seem to be doing here) is both silly and, sadly, apparently a pre-occupation of many Stackoverlow posters.
Least scalable? SQL???
Look up, "federating."
If the database is shared, having logic in the database is better in order to control everything that happens. If it's not it might just make the system overly complicated.
If you have multiple applications that talk to your database, stored procedures and triggers can enforce correctness more pervasively. Accordingly, if correctness is more important than convenience, putting logic in the database is sensible.
Scalability may be a red herring, though. Sometimes it's easier to express the behavior you want in the domain layer of an OO language, but it can be actually more expensive than doing the idiomatic SQL way.
The security mechanism at a previous company was first built in the service layer, then pushed to the db side. The motivation was actually due to some limitations in a data access framework we were using. The solution turned out to be a bit buggy because our security model was complicated, but the upside was that bugs only had to be fixed in the database; we didn't have to worry about different clients following different rules.
Triggers mean 3rd-party apps can modify the database without creating logical inconsistencies.
If you do that, you are tying your business logic to your model. If you code all your business logic in T-SQL, you aren't going to have a lot of fun if later you need to use Oracle or what have you as your database server. Actually, I'm not sure I understand this question exactly. How do you think this would improve scalability? It really shouldn't.
Personally, I'm really not a fan of triggers, particularly in a database dedicated to a single application. I hate trying to track down why some data is inconsistent, to find it's down to a poorly written trigger (and they can be tricky to get exactly correct).
Security is another advantage of using stored procs. You do not have to set the security at the table level if you don't use dynamic code (Including ithe stored proc). This means your users cannot do anything unless they have a proc to to it. This is one way of reducing the possibility of fraud.
Further procs are easier to performance tune than most application code and even better, when one needs to change, that is all you have to put on production, not recomplie the whole application.
Data integrity must be maintained at the database level. That means constraints, defaults values, foreign keys, possibly triggers (if you have very complex rules or ones involving multiple tables). If you do not do this at the database level, you will eventually have integrity issues. Peolpe will write a quick fix for a problem and run the code in the query window and the required rules are missed creating a larger problem. A millino new records will have to be imported through an ETL program that doesn't access the application because going through the application code would take too long running one record at a time.
If you think you are building an application where scalibility will be an issue, you need to hire a database professional and follow his or her suggestions for design based on performance. Databases can scale to terrabytes of data but only if they are originally designed by someone is a specialist in this kind of thing. When you wait until the while application is runnning slower than dirt and you havea new large client coming on board, it is too late. Database design must consider performance from the beginning as it is very hard to redesign when you already have millions of records.
A good way to reduce scalability of your data tier is to interact with it on a procedural basis. (Fetch row..process... update a row, repeat)
This can be done within a stored procedure by use of cursors or within an application (fetch a row, process, update a row) .. The result (poor performance) is the same.
When people say they want to do processing in their application it sometimes implies a procedural interaction.
Sometimes its necessary to treat data procedurally however from my experience developers with limited database experience will tend to design systems in a way that do not leverage the strenght of the platform because they are not comfortable thinking in terms of set based solutions. This can lead to severe performance issues.
For example to add 1 to a count field of all rows in a table the following is all thats needed:
UPDATE table SET cnt = cnt + 1
The procedural treatment of the same is likely to be orders of magnitude slower in execution and developers can easily overlook concurrency issues that make their process inconsistant. For example this kind of code is inconsistant given the avaliable read isolation levels of many RDMBS platforms.
SELECT id,cnt FROM table
...
foreach row
...
UPDATE table SET cnt = row.cnt+1 WHERE id=row.id
...
I think just in terms of abstraction and ease of servicing a running environment utilizing stored procedures can be a useful tool.
Procedure plan cache and reduced number of network round trips in high latency environments can also have significant performance advantages.
It is also true that trying to be too clever or work very complex problems in the RDBMS's half-baked procedural language can easily become a recipe for disaster.
"Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!)."
What makes you think that "scalability" is the only relevant concern in a system design ? I agree with rexem where he commented that it is very obvious that you are "not" biased ...
Databases are sets of assertions of fact. Those sets become more valuable if they can also be guaranteed to conform to certain integrity rules. Those guarantees are not worth a dime if it is the applications that are expected to enforce such integrity. Triggers and sprocs are the only way SQL systems have to allow such guarantees to be offered by the DBMS itself.
That aspect outweighs "scalability" anytime, anywhere, anyhow.

Resources