Storing a global hash in Ruby on Rails

I am looking at an application where I need to build out the user friend model as a graph structure. I need to go several degrees deep so using standard SQL in MySQL will not work due to the circular references. I have looked at the graph algorithms available and they involve loading the entire record set into a Graph object then doing operations on that. I can't afford to do this for every operation.
I would like to store the Graph object as a global object in memory and just make calls and updates to it. However, since Rails scales by creating separate processes I am going to have an almost immediate synchronization problem, since a single Rails process is only going to scale to a few simultaneous users.
Does anyone know a way to store an object in memory in Rails and keep it in sync both between requests and across the multiple Mongrel (or other server) processes?
At this point I am looking at a Java service for the graph operations since it scales using a thread model instead of a process model. I can scale big enough that I won't have to deal with the scaling issue for a while.
I would like to have an all Rails solution though because it will be easier to maintain and build.

One option would be to build a small Rack application that you load the data into. The logic for querying and computing your data would live in that Rack app. You can use normal HTTP calls to localhost to transmit whatever you need (generated HTML? Something else?) from the Rack app to your Rails app.
This is basically a workaround for having multiple processes in your Rails application. I'm sure there are better solutions out there, such as memcached, NoSQL databases, and so on.

Redis is the tool you're looking for.
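A rough sketch of the idea, under some assumptions: each user's friend list is kept as a set keyed by user id (in Redis that would be one set per user, read with SMEMBERS), and the walk is a plain breadth-first search. The FRIENDS hash and the friends_within method below are made up for illustration, and the hash is only an in-memory stand-in so the example runs without a Redis server.

```ruby
require "set"

# Hypothetical in-memory stand-in for per-user friend sets.
# With Redis, each value would be a set read via SMEMBERS.
FRIENDS = {
  1 => Set[2, 3],
  2 => Set[1, 4],
  3 => Set[1, 4],
  4 => Set[2, 3, 5],
  5 => Set[4]
}.freeze

# Breadth-first walk that returns { user_id => degree } for every user
# reachable from `start` within `max_degree` hops. The `seen` hash stops
# circular friendships from being visited twice.
def friends_within(start, max_degree, graph = FRIENDS)
  seen = { start => 0 }
  frontier = [start]

  1.upto(max_degree) do |degree|
    frontier = frontier.flat_map { |id| graph.fetch(id, Set.new).to_a }
                       .uniq
                       .reject { |id| seen.key?(id) }
    break if frontier.empty?
    frontier.each { |id| seen[id] = degree }
  end

  seen.reject { |id, _| id == start }
end
```

Because every process reads the same shared sets, nothing has to be synchronized between Rails processes; each request only pulls the friend lists it actually touches.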

It sounds like you may need a distributed hash table, or maybe something like CouchDB as an alternative to an RDBMS.

Related

Respond with large amount of objects through a Rails API

I currently have an API for one of my projects and a service that is responsible for generating export files as CSVs, archiving them, and storing them somewhere in the cloud.
Since my API is written in Rails and my service in plain Ruby, I use the Her gem in the service to interact with the API. But my current implementation performs poorly, since I do a Model.all in my service, which in turn triggers a request that may contain far too many objects in the response.
I am curious about how to improve this whole task. Here's what I've thought of:
implement pagination at API level and call Model.where(page: xxx) from my service;
generate the actual CSV at API level and send the CSV back to the service (this may be done sync or async).
If I were to use the first approach, how many objects should I retrieve per page? How big should a response be?
If I were to use the second approach, this would bring quite an overhead to the request (and I guess API requests shouldn't take that long) and I also wonder whether it's really the API's job to do this.
What approach should I follow? Or, is there something better that I'm missing?
You need to pass a lot of information through a Ruby process, and that's never simple. I don't think you're missing anything here.
If you decide to generate CSVs at the API level, then what do you gain by maintaining the service? You could ditch the service altogether, because replacing it with an nginx proxy would do the same thing better (if you're just streaming the response from the API host).
If you decide to paginate, there will be a performance reduction for sure, but nobody can tell you exactly how large your pages should be. Bigger pages will be faster and consume more memory (reducing throughput because you can run fewer workers); smaller pages will be slower and consume less memory but demand more workers because of IO wait times.
Exact numbers will depend on the IO response times of your API app, the cloud, and your infrastructure. I'm afraid no one can give you a simple answer that you can follow without experimenting with a stress test, and once you set up a stress test you will get a number of your own anyway, which will be better than anybody's estimate.
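To make the trade-off concrete, here is a minimal sketch of the paginated approach on the service side. The fetch_page method is hypothetical: it stands in for whatever HTTP call your client makes (something like GET /records?page=N&per_page=M) and just slices a local array so the example runs on its own. The page size is the knob you would tune in a stress test.

```ruby
# Hypothetical data source: stands in for the API's record set.
RECORDS = (1..25).map { |i| { id: i, name: "record-#{i}" } }.freeze

# Hypothetical page fetcher: a real service would make an HTTP request here.
def fetch_page(page, per_page)
  RECORDS.each_slice(per_page).to_a[page - 1] || []
end

# Yields every record while requesting one page at a time, so only
# `per_page` records are held in memory at once.
def each_record(per_page = 10)
  return enum_for(:each_record, per_page) unless block_given?

  page = 1
  loop do
    batch = fetch_page(page, per_page)
    break if batch.empty?
    batch.each { |record| yield record }
    break if batch.size < per_page # short page: no further request needed
    page += 1
  end
end
```

The service can then write CSV rows inside the block (each_record(500) { |r| ... }) without ever holding the full record set in memory.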
A suggestion: write a bit more about your problem, the constraints you are working under, etc., and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or Delayed Job, or maybe connecting your service to the DB directly through a DB view if you are anxious to decouple your apps, or an nginx proxy for API responses, or nothing at all... but I really can't tell without more information.
I think it really depends on how you want to define 'performance' and what your goal for your API is. If you want to make sure no request to your API takes longer than 20 ms to respond, then adding pagination would be a reasonable approach, especially if the CSV generation is just an edge case and the API is really built for other services. The number of items per page would then be limited by the speed at which you can deliver them. Your service would not be any more performant (rather less so), since it needs to call the API multiple times.
Creating an async call (maybe with a webhook as callback) would be worth adding to your API if you think it is a valid use case for services to dump the whole record set.
Having said that, I think strictly speaking it is the job of the API to be quick and responsive. So maybe try to figure out how caching can improve response times, so that paging through all the records is reasonable. On the other hand, it is the job of the service to be mindful of the number of calls it makes to the API, so maybe store old records locally and only poll for updates instead of dumping the whole set of records each time.

share session among different type of web servers

Some web services in my company are built with different web frameworks (Rails, Django, PHP).
What is the best practice for sharing session state
so that users won't have to log in again and again across the different servers?
I run my Rails apps in an AWS Auto Scaling group.
That means that even when I browse the same website, my next request may land on another server, and I have to log in again because that server doesn't have my session state.
I need some better ideas or keywords to search for this kind of issue.
Thanks in advance
I can think of two ways in which you can achieve this objective
Implement a custom session handling mechanism that makes use of database session management, i.e. all sessions will be stored in a special table in the database and will be accessible to all the servers.
Have a Central Authentication Service (CAS) which will act as a proxy to all the other servers. This will then mean that this step has to happen before the requests reach the load balancer.
If you look around, option 1 might be recommended by many, but it may also be an overkill since you'll need custom session management in each of the servers. However, your choice would probably depend on the specific objectives you want to achieve, the overall flexibility of the system architecture and the amount of time you have on your hands. The CAS might be a more straightforward way of solving the problem.
Storing user sessions in your application's database wouldn't be a recommended option for AWS.
The biggest problem with using a database is that you need to write a clean-up script that runs every so often to clear the table of all the expired user sessions. This is messy, creates more overhead, and puts more pressure on your DB.
However, if you do want to use an actual database for this, you should use a NoSQL database like Dynamo. This will give you much better performance than a relational database. It's probably more cost effective too in terms of data transfer. However, the biggest problem with this is that you still need that annoying clean-up script. Note: there is built-in support in the SDK for using PHP with Dynamo to store the user's session:
http://docs.aws.amazon.com/aws-sdk-php/v2/guide/feature-dynamodb-session-handler.html
The best but most costly solution is to use an ElastiCache cluster. You can set a TTL of your choice, which means you won't have to worry about clean-up scripts. ElastiCache will also get you much better performance than Dynamo or any relational DB, as the data is stored in the RAM of the ElastiCache nodes. The main drawback of ElastiCache is that currently it can't scale dynamically. So if too many users logged in at once and you didn't have a big enough cluster already provisioned, things could get ugly.
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
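To illustrate why a TTL removes the need for a clean-up script, here is a small sketch of the behaviour a cache-backed session store gives you. TtlSessionStore is a made-up, in-process class used only to show the semantics; in practice the store would be a memcached or Redis node that all of your Rails, Django and PHP servers talk to, and the cache itself would do the expiring.

```ruby
# Minimal in-process stand-in for a shared cache with per-key expiry.
# Assumption: a real deployment would use a networked cache instead.
class TtlSessionStore
  # `clock` is injectable so expiry can be demonstrated without sleeping.
  def initialize(ttl_seconds, clock: -> { Time.now.to_f })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
  end

  def write(session_id, data)
    @entries[session_id] = { data: data, expires_at: @clock.call + @ttl }
  end

  # Returns the session data, or nil when it is missing or expired.
  def read(session_id)
    entry = @entries[session_id]
    return nil unless entry

    if @clock.call >= entry[:expires_at]
      @entries.delete(session_id) # expired entries vanish on access
      return nil
    end
    entry[:data]
  end
end
```

Every app server looks sessions up by the same session id (for example, one carried in a cookie shared across the domain), so it no longer matters which server a request lands on.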
But you can bet that all the biggest, baddest and best applications being hosted on AWS are either using Dynamo, ElastiCache or a custom NoSQL or Cache cluster solution for storing user sessions.
Note that both of these services are supported in all of the AWS SDKs:
https://aws.amazon.com/tools/

How is request processing with rails, redis, and node.js asynchronous?

For web development I'd like to mix Rails and Node.js since I want to get the best of both worlds (Rails for fast web development and Node for concurrency). I know that some people choose to use a full Ruby stack with EventMachine integrated into the Rails controllers, so that every request can be non-blocking by using fibers in an event-loop model. I understand how that works at a high level.
At this moment, however, I want to try non-blocking request processing with Rails and Node.js using a message queue. I heard that this can be achieved by using Redis as an intermediary, but I'm still having trouble figuring out how that works. From what I understand, we have two apps, A (Rails) and B (Node.js), plus Redis. The Rails app handles requests from users, which go through controllers in a REST manner; from there Rails passes them to Redis, Redis forms queues, and the Node.js app picks up jobs from the queue and does whatever is necessary afterward (writing to or reading from the backend DB).
My questions:
1. So how would that improve concurrency and scalability? From what I know, since Rails handles the requests through controllers synchronously and then writes to Redis, the requests will still be blocking, even though the Node.js end can pick up the queue asynchronously. (I have a feeling that it's not asynchronous yet if it's not non-blocking end to end.)
2. Would Node.js be considered a proxy or an application here if Redis is the intermediary?
3. I'm new to Redis and still learning it. If I'm using a 100% NoSQL solution for my backend database, such as MongoDB or CouchDB, can it be replaced by Redis entirely, or is Redis seen more as a message queue tool like RabbitMQ?
4. Is a message queue a different concurrency concept from threading or the event-loop model, or is it supposed to supplement them?
Those are all my questions. I'm new to the message queue concept. I'd appreciate any help, pointers in the right direction, and articles that would help me learn more. Thanks.
You are mixing some things here that don't go together.
Let's first make sure we are on the same page regarding the strengths/weaknesses of the involved technologies:
Rails: Used for its web-development simplicity and perfect for serving database-backed web applications.
Not very performant when it has to serve a large number of long-running requests, as you'd run out of threads on your Ruby workers - but well suited for anything that can scale horizontally with more web nodes (multiple web servers - 1 db).
Node.js: Great for high-concurrency scenarios. Not as easy as Rails for writing a regular web application. But it can handle a nearly insane number of long-running, low-CPU tasks efficiently.
Redis: A key-value store that supports operations on its data structures (increment/decrement values, append/prepend, push/pop on lists - all operations that make this DB work consistently with multiple clients writing at once).
Now as you can see, there is no benefit in having Rails AND Node serve the same request while communicating through Redis. Going through the Rails stack would not provide any benefit if the request ends up being handled by the Node server.
And even if you only offload some processing to the node server, it's still the Rails webserver that handles the requests and has to wait for a response from node - killing the desired scalability. It simply makes no sense.
Where you would use a setup with Node and Rails together is in certain areas of your app that have drastically different scaling requirements.
If you are for example writing a Website that displays live stats for Football games you can easily see that there are two different concerns in your app: The "normal" Site that contains signup, billing and profile stuff that screams for a quick implementation through rails. And the "live" portion of the site where users see live results and you expect to handle a lot of clients at once - all waiting for something to happen (low cpu - high concurrency).
In such a case it may be beneficial to actually separate the two parts of the site into a Ruby and a Node app, which then share data about the user through a store like Redis (but actually you just need some shared state that both can look at and write to for synchronization purposes).
So you would use, for example, Rails for the signup/login portions. Once the user is signed up, write the session cookie into Redis along with the user's permissions (which games they are allowed to follow) and hand the user off to the Node.js app.
There the Node app can read the session information from Redis and serve the user.
Word of advice:
You don't get scalability by simply throwing Node.js into your Toolbox. You really have to find out what Node.js is good at (low-cpu high-io concurrent operations) and how you can leverage that to remedy some of the problems your currently chosen technology has.
I can answer 3 for you. Redis does not guarantee that when you perform an operation the result will actually be on disk; also, transaction handling is a bit "different". It also requires the whole database to be in memory. Depending on the situation this may or may not be an issue. It is, however, incredibly fast. It is not a message queue; you can easily make a queue out of it, but that is not its purpose. If you want only a queuing system, you can probably do better with something else.
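For what "making a queue out of it" looks like in principle, here is a sketch of the producer/consumer pattern. Ruby's built-in Queue and a thread stand in for a Redis list and a separate worker process (with Redis, the push would be an LPUSH and the blocking pop a BRPOP); handle_request and start_worker are made-up names. Note that this only helps when the web request does not need to wait for the result, which is the point made in the answer above.

```ruby
# "Controller" side: enqueue the job and answer immediately instead of
# doing the slow work inside the request.
def handle_request(queue, user_id)
  queue << { type: "send_welcome_email", user_id: user_id }
  "202 Accepted"
end

# Worker side: block until a job arrives, handle it, and stop when the
# :stop marker is received.
def start_worker(jobs, results)
  Thread.new do
    while (job = jobs.pop) != :stop
      results << "processed #{job[:type]} for user #{job[:user_id]}"
    end
  end
end

jobs = Queue.new
results = Queue.new
worker = start_worker(jobs, results)

responses = [1, 2].map { |id| handle_request(jobs, id) }
jobs << :stop
worker.join
processed = Array.new(results.size) { results.pop }
```

The queue decouples the two sides: the web process stays synchronous per request, but each request finishes quickly because the slow part runs elsewhere.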

How to architect Rails site that can be edited while running?

I am writing a Rails app that "scrapes/navigates" some other websites and webservices for content. I am using Mechanize and Savon to do the heavy lifting.
But given the dynamic nature of the web, I'd like to make my calls to these editable by the admin users of the site - rather than requiring me to release a new version of the site.
The actual scraping thread happens async to the website, using the daemons gem.
My requirements are:
Thinking that the scraping/webservice calling code is quite simple, the easiest route is to make the whole class editable by the admins.
Keep a history of the scraping code - so that we can fairly easily revert if we introduce a problem.
Initially use the code from the file system, but as soon as that has been edited and stored somewhere, use the stored code instead.
I am thinking my options are:
Store the code in the db (with a history table for the old versions)
Store the code in a private git repo somewhere and access that for the history/latest versions.
I am thinking the git route might be easiest, given its raison d'etre is to track file history...
But perhaps there is a gem/plugin that does all this for me, out of the box?
Thanks in advance for any tips/advice.
~chris
I really hope you aren't doing something like what's talked about here...
Assuming you are doing a proper mixin, there used to be a gem called "acts_as_versioned" which would do something like you want. It's been a while so I don't know if it's been turned into a plugin or if it's been abandoned. Essentially the process it uses was to provide a combination key for your versioned table.
Your database would have a structure like this:
Key column (id for the record)
Version column (id for the record's version)
All the record attributes
Let's say you had a table for your scripts, and the script you wanted has three versions. Your table would have the following records:
123, 3, '#Be good now'
123, 2, 'puts "Hi"'
123, 1, '#Do not be bad'
Getting the most recent version would be as simple as
Scripts.find :first, :conditions=>{:id=>123}, :order=>"version desc"
Rolling back would be as simple as removing the most recent version, or having another table with a pointer to the active version. It's up to you.
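As a plain-Ruby sketch of that (id, version) scheme, with no database involved: ScriptVersion and ScriptStore are made-up names, and the array stands in for the table. It shows the two operations described above, fetching the latest version and rolling back by removing it.

```ruby
# One row per (id, version) pair, mirroring the table layout described above.
ScriptVersion = Struct.new(:id, :version, :body)

class ScriptStore
  def initialize
    @rows = []
  end

  # Saves a new version, numbered one past the current latest.
  def save(id, body)
    current = latest(id)
    row = ScriptVersion.new(id, current ? current.version + 1 : 1, body)
    @rows << row
    row
  end

  # Same idea as ordering by version descending and taking the first row.
  def latest(id)
    @rows.select { |row| row.id == id }.max_by(&:version)
  end

  # Rolls back by deleting the most recent version; returns the removed row.
  def rollback(id)
    @rows.delete(latest(id))
  end
end
```

With the three example rows saved in order, latest(123) returns version 3, and after one rollback(123) it returns version 2 again.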
You are correct in that git, subversion, mercurial and company are going to be much better at this. To provide support, you just follow these steps:
Check out the script on the server (using a tag so you can manage what goes there at any time)
Set up a cron job to check out the new script periodically (like every six hours or whatever you feel comfortable with)
The daemon you have for running the script should run the new version automatically.
IF your site is already under source control, and IF you're running under mod_rails/passenger, you could follow this procedure:
edit scraping code
commit change locally
touch yourapp/tmp/restart.txt
That should give you a history of the change, and you shouldn't have to re-deploy.
A bit safer, though I'm not sure it's possible for you, is to use a test/development server: make the change, commit locally, test it, then on the production server, git pull, then touch tmp/restart.txt
I've written some big spiders and page analyzers in the past, and one of the things to keep in mind is what code is providing what service to the entire application.
Rails is providing the presentation of the data being gathered by your spidering engine. The presentation is one side of the coin, and spidering is the other, and they should be two separate code bases, tied together by some data-sharing mechanism, which, in your case, is the database. The database gives you some huge advantages as does having Rails available, when your spidering code is separate. It sounds like you have some separation already, but I'd recommend creating a wider gap. With that in mind, here's how I've done it before, and what I'd do now.
Previously, I had a separate app for my spidering that was spawning multiple spider tasks. Each task would look at a bunch of different URLs, throw their results in the database, then quit. Each time one quit the main app would spawn another spider to process more URLs. Each loop, the main app checked a YAML configuration file for run-time parameters, like how many sub-tasks it should have running, how many URLs they'd get, how long they'd wait for connections, etc. It stored the last modification date of the config file each time it loaded it so, if I made a change to the file, the app would sense it in a reasonably short time, reread the file, and adjust its behavior.
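The reload-on-change trick is small enough to sketch. This is not the original code, just a minimal illustration of the idea: ReloadingConfig is a made-up class that remembers the file's modification time and re-reads the YAML only when that time changes.

```ruby
require "yaml"

# Re-reads a YAML file only when its modification time has changed.
class ReloadingConfig
  def initialize(path)
    @path = path
    @loaded_mtime = nil
    @settings = {}
  end

  # Call once per loop iteration; returns the current settings hash.
  def settings
    mtime = File.mtime(@path)
    if mtime != @loaded_mtime
      @settings = YAML.safe_load(File.read(@path)) || {}
      @loaded_mtime = mtime
    end
    @settings
  end
end
```

The main loop would call config.settings each time around and read values such as the number of sub-tasks from it, so an edit to the file takes effect on the next pass without a restart.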
All state information about the URLs/pages/sites being scraped/spidered, was kept in the database so I could check on its progress. I could see how many had been processed or remained in the queue, the various result codes, and the content being returned. If I didn't like something I could even tweak the filters to skip junk pages, knowing the spidering tasks would be updated in a few minutes.
That system worked extremely well, spidered a major customer's series of websites without a glitch, running for several weeks as I added new sites to the list. (We were helping one of the Fortune 50 companies improve their sites, and every site had been designed and implemented by a different team, making every site completely different. My code had to be flexible and robust; I was really happy with how it worked out.)
To change it, these days I'd use a database table to hold all the configuration info. That way I could easily build an admin form, and let someone else inherit the task of adjusting the app's runtime configuration. The spider tasks would also be written so they'd pull their configuration from the database, rather than inherit it from the main app. I originally had the main app do all the administration and pass the config info to the spidering apps because I wanted to keep the number of connections to the database as low as possible. I was using Postgres and now know it could have easily handled the load, so by letting the individual tasks handle their configuration I could have made it more responsive.
By making the spidering engine separate from the presentation engine it was possible to temporarily stop one or the other without affecting the progress of the spidering job. Once I had the auto-reload of the prefs in place I don't think I had to stop the spidering engine, I just adjusted its prefs. It literally ran for weeks without stopping and we eventually pulled the plug because we had enough data for our needs.
So, I'd recommend tweaking your code so your spidering engine doesn't rely on Rails, instead it will be fired off by cron or a separate scheduling app. If you have to temporarily stop Rails your engine will run anyway. If you have to temporarily stop the engine then Rails can continue serving pages. The database sits between the two acting as the glue.
Of course, if the database goes down you're hosed all the way around, but what else is new? :-)
EDIT: Chris said:
"I see your point about the splitting the code out, though my Ruby-fu is low - not sure how far I can separate things without having to have copies of the ActiveModel/migrations stuff, plus some shared model classes."
If we look at your application as spider engine <--> | <-- database --> | <--> Rails/MVC/presentation, where the engine and Rails separately read and write to the database, and look at what each does well, that helps figure out how to break them into separate code bases.
Rails is designed to handle migrations, so let it. There's no reason to reinvent that wheel. But how often do you do migrations, and what is affected when you do? You do them seldom once the application is stable, and at that point you'd do them in a maintenance cycle to tweak the database. You can shut down the spidering engine and the web interface for a few minutes, migrate the database, then bring things up and you're off and running. Migrations are a necessary evil, but are hardly show-stoppers once in production. Most enterprises have "Software Sunday", or some pre-announced window of maintenance, so do the same.
ActiveRecord, modeling and associations are pretty easy to deal with too. The models are in a file that is required internally by Rails already, so the spidering engine can inherit the database know-how that way too. Multiple apps/scripts can use the same model file. You don't see the Rails books talk about it much, but ActiveRecord is actually pretty easy to use outside of Rails. Search the googles for activerecord without rails for more info.
You can pull in ActiveSupport also if you want some of its extensions to classes by doing a regular require, but the Rails "view" and "controller" logic, which normally applies to presenting the web interface, shouldn't be needed at all in the engine.
Business logic, which goes in the controllers in Rails could even be refactored into separate methods that get required by the Rails side of things and by the spidering engine. It's a different way of looking at Rails but falls in line with the "DRY" mantra - don't repeat yourself, so make things modular and require (or require_relative) bits and pieces that are the building blocks of the entire system.
If you don't want a totally separate codebase, you can take advantage of Rails' script runner, which gives a script access to ActiveRecord::Base, ActiveRecord::Associations and ActiveSupport. Do a rails runner -h from your app's main directory, or search for "rails runner" for more info. runner is not good for a job that starts and runs many times an hour, because Rails' startup cost is high. But if you have a long-running task, say one that runs in parallel with your Rails app, then it's a great choice. I'd give it serious consideration for the spidering side of your application. Eventually you might want to break the spidering engine out to a separate host so the presentation side has a dedicated host, so runner will help you buy time and do it in small steps.

Load Ruby on Rails models without loading the entire framework

I'm looking to create a custom daemon that will run various database tasks such as delaying mailings and user notifications (each notice is a separate row in the notifications table). I don't want to use script/runner or rake to do these tasks because a task may require the creation of only one or two database rows, or of thousands of rows, depending on the task. I don't want the overhead of launching a Ruby process or loading the entire Rails framework for each operation. I plan to keep this daemon in memory full time.
To create this daemon I would like to use my models from my Ruby on Rails application. I have a number of Rails plugins such as acts_as_tree and AASM that I will need loaded if I were to use the models. Some of the plugins I need to load are custom hacks on ActiveRecord::Base that I've created. (I am willing to accept removing or recoding some of the plugins if they need components from other parts of Rails.)
My questions are
Is this a good idea?
And - Is this possible to do in a way that doesn't have me manually including each file in my models and plugins?
If not a good idea
What is a good alternative?
(I am not opposed to writing my own SQL queries, but I would have to add database constraints and a separate user for the daemon to prevent any stupid accidents. Given my lack of familiarity with configuring a database, I would like to use ActiveRecord as a crutch.)
It sounds like your concern is that you don't want to pay the time- or memory- cost to spin up the rails stack every time your task needs to be run? If you plan on keeping the daemon running full-time, as you say, you can just daemonize a process that has loaded your rails stack and will only have to pay that memory- or time-related penalty for loading the stack one time, when the daemon starts up.
Async_worker is a good example of this sort of pattern: It uses beanstalk to pass messages to one or more worker processes that are each just daemons that have loaded the full rails stack.
One thing you have to pay attention to when doing this is that you'll need to restart your daemonized processes upon a deploy so they can reload your updated rails stack. I'm using this for a url-shortener app (the single async worker process I have running sits around waiting to save referral data after the visitor gets redirected), and it works well, I just have an after:deploy capistrano task that restarts any async worker(s).
You can load up one aspect of Rails such as ActiveRecord but when you get right down to it the cost of loading the entire environment is not much more than just loading ActiveRecord itself. You could certainly just not include aspects like ActionMailer or some of the side bits but I'm going to guess that you're not going to see much win out of it.
What I would suggest instead is running through runner/console, as you said you didn't want to, but rather than bootstrapping each time, try to batch things so that you're doing 1000 at a time instead of 1. There are a lot of projects that use this style; some of the bulk mailers spring to mind if you want examples. DJ (delayed_job) does something similar by storing a record in the database saying that this code needs to be run at some point in the future using the environment stack, but it tries to batch together as much as it can, so you may get a win from that.
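The batching idea can be sketched in a few lines. Everything here is illustrative: deliver_in_batches is a made-up method, and the notification hashes stand in for rows from the notifications table. The point is only that the fixed cost (bootstrapping, opening a transaction, and so on) is paid once per batch rather than once per row.

```ruby
# Processes pending notifications in slices of `batch_size`, yielding the
# size of each finished batch so the caller can log progress.
def deliver_in_batches(notifications, batch_size: 1000)
  delivered = 0
  notifications.each_slice(batch_size) do |batch|
    # One pass per batch: in a real daemon this is where a single
    # transaction or bulk insert would cover the whole slice.
    batch.each { |notification| notification[:sent] = true }
    delivered += batch.size
    yield batch.size if block_given?
  end
  delivered
end
```

For 2,500 pending rows and the default batch size, the work is done in three passes instead of 2,500.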
The other option is to have a persistent mini-rails app with as much stripped out as possible so that the memory usage is lower which can listen for requests and do your bidding when you want it to. This would be more memory but the latency of bootstrapping would be essentially nullified.
Lastly, as an afterthought, this would be a great use for Postgres.

Resources