I would like to hear from the community about a nice pattern for the following problem.
I had a "do-everything" server, which were webserver, mysql, crawlers server. Since two or three weeks, using monitoring tools, i saw that always when my crawlers were running, my load average was going over 5 (a 4 core server, would be ok to have until 4.00 as load). So, i've got another server and i want to move my crawlers to there. My question is. As soon as i have the data crawled in my crawler server, i have to insert in my database. And i wouldn't like to open a remote connection and insert it in the database, since i prefer to use the Rails framework, btw i'm using rails, to keep easier to create all relationships, and etc.
Problem to be solved:
The crawler server has the crawled data (a bunch of CSV files), and I want to move it to the remote server and insert it into my DB using Rails.
Restriction: I don't want to run MySQL replication (master + slave), since that would require a deeper analysis of where most of the write operations happen.
Ideas:
Move the CSVs from the crawler server to the remote server (via ssh/rsync) and import them during the day.
Write an API on the crawler server that my remote server can pull from (many times a day) to import the data.
Any other ideas or good patterns around this theme?
With a slight variation on the second pattern you have noted, you could have an API on your web-app/DB server which the crawler uses to report in its data. It could do this in batches, in real time, or only during a specific window of time (day/night, etc.).
This pattern lets the crawler decide when to report the data, rather than having the web app do the 'polling' for data.
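A rough sketch of what that could look like, assuming a JSON endpoint on the Rails side; the model, column, and route names here are made up:

    # Web-app/DB server: app/controllers/api/crawl_results_controller.rb
    class Api::CrawlResultsController < ApplicationController
      skip_before_filter :verify_authenticity_token  # JSON endpoint; add real auth in practice

      # POST /api/crawl_results with a JSON array of crawled rows
      def create
        params[:results].each do |row|
          CrawlResult.create!(url: row[:url], body: row[:body])  # placeholder model/columns
        end
        head :ok
      end
    end

    # Crawler server: push one batch of parsed rows
    require "net/http"
    require "json"

    uri = URI("http://webapp.example.com/api/crawl_results")      # placeholder host
    req = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
    req.body = { results: batch }.to_json                          # batch = array of hashes
    Net::HTTP.new(uri.host, uri.port).request(req)

Since the Rails app owns the models, all the usual validations and associations apply on insert, and the crawler only needs to speak HTTP.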
I am new to Ruby on Rails. I am using a PostgreSQL database with Ruby on Rails 3.2.13. We have already created 200K rows of records in the PostgreSQL database, and I need to send those same 200K records to another standalone Windows application. I created a Rails REST API for this purpose. Currently the REST API takes a long time to process the data and times out after 3 minutes.
I am sending 1000 records at a time, so the API sends 1-1000, then 1001-2000, and so on. This avoids the timeout. Is this a good approach for handling bulk data?
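The batching looks roughly like this (a simplified sketch; Record is a placeholder model name):

    # app/controllers/api/records_controller.rb
    class Api::RecordsController < ApplicationController
      PER_PAGE = 1000

      # GET /api/records?page=1 returns rows 1-1000, page=2 returns 1001-2000, ...
      def index
        page    = params.fetch(:page, 1).to_i
        records = Record.order(:id).offset((page - 1) * PER_PAGE).limit(PER_PAGE)
        render json: records
      end
    end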
Does Rails have any built-in function to handle this type of operation? Please help me.
Thanks
In short, no, I don't think this is a good approach for transferring bulk data.
Putting the security concerns aside (which are reason enough on their own), you'll need to handle data/schema consistency, connection reliability, datatype integrity, parsing and sanitizing strings, serialization/deserialization, etc. It sounds like a huge headache to me.
Bulk data transfer between databases isn't a responsibility/concern of Rails. I'd stick to doing this entirely on the backend and set up a new database as a replication slave of the master.
So the context is that I have a client application that generates logs and I want to occasionally upload this data to a backend. The backend will function as an analytics server, storing, processing and displaying this data - so as you can imagine there will be some querying involved.
In terms of data collection peak load, I expect to have about 5k clients, each generating about 50-100 lines per day, and I'd like the solution I'm tackling to be able to process that kind of data. If you do the math, that's on the order of 7-15 million log lines a month.
In terms of data analytics load, it will be fairly low - I expect a couple of us (admins) to run queries to harvest some info once a week or so from all the logs.
My application is currently running RoR + Postgres, though I'm open to using a different DB if it maps better to my needs. Current contenders in my head are MongoDB and Cassandra, but I don't really want to leave Postgres if it can scale to get the job done.
I'd recommend a purpose built tool like logstash for this:
http://logstash.net/
Another alternative would be Apache Flume:
http://flume.apache.org/
In my experience, once you have a lot of logs you will want a search engine, rather than a database, for troubleshooting and analysis (a search engine will be much faster than a database for this).
I am currently using Logstash + Elasticsearch + Kibana as a complete solution for my log system.
Logstash is a tool that parses the logs and makes them more human-readable.
Elasticsearch is a search engine that indexes and searches your logs.
Kibana is a web UI you can use to communicate with Elasticsearch.
There is a Kibana demo site you can visit at http://demo.kibana.org/ . It shows the search interface and analysis tools such as pie charts, tables, etc.
In my project, the application generates over 1.5 million log lines per day, and this log system handles all of them.
Enjoy it.
If you are looking for a database solution that will grow with requests, then I would recommend looking beyond Postgres.
Cassandra is really well-suited for time-series data, though key-value stores are not suited for ad-hoc analytics. One idea could be to store your logs in Cassandra, and then roll them up into a different system later.
For straightforward storing-and-displaying of data, take a look at Graphite, a realtime graphing project.
You can create your own custom graphs with Graphite, and save them as dashboards.
I want to store webpages fetched by a web crawler. I don't need any random access; whenever I want to read the stored data, I read it from start to end.
We have tried solutions like HBase, but one of HBase's main strengths is random access to records, which we don't need at all, and HBase has not proved stable for us after 1.5 years of testing.
I just want a stack or queue on top of HDFS, because the number of webpages is about 1 billion. I don't even want the queue behaviour of ActiveMQ; I just want to be able to store the webpages so that I can read them all back in case of a failure.
I don't want to use plain files because I don't want to handle things like file rotation, file consistency, and so on.
It is worth mentioning that we need HDFS so we can run MapReduce jobs on the data when we want to send all the stored data to a Solr cluster, and so we get HDFS's redundancy and availability.
Is there a service on top of HDFS that just stores JMS records, without any random-access functionality and without a transparent view of individual records?
I've been working on a Rails project that's unusual for me in the sense that it's not going to use a MySQL database and will instead roll with MongoDB + Redis.
The app is pretty simple: "boot up" data from MongoDB into Redis, after which point Rails will be ready to take requests from users, which will consist mainly of pulling data from Redis (I was told it'd be pretty darn fast at this), doing a quick calculation, and sending some of the data back out to the user.
This will be happening ~1500-4500 times per second, with any luck.
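The per-request work is roughly this (a simplified sketch; the key scheme, controller name, and the calculation itself are made up):

    # config/initializers/redis.rb
    REDIS = Redis.new(host: "localhost", port: 6379)

    # app/controllers/items_controller.rb (hypothetical controller)
    class ItemsController < ApplicationController
      def show
        raw  = REDIS.get("item:#{params[:id]}")    # assumes the key was loaded from MongoDB at boot
        data = JSON.parse(raw)
        data["score"] = data["score"].to_f * 1.05  # stand-in for the quick calculation
        render json: data
      end
    end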
Before the might of the user army comes down on the server, I was wondering if there was a way to "simulate" the page requests somehow internally - like running a rake task to simply execute that page N times per second or something of the sort?
If not, is there a way to test that load and then time the requests to get a rough idea of the delay most users will be looking at?
Caveat
Performance testing is a very broad topic, and the right tool often depends on the type and quality of results that you need. As just one example of the issues you have to deal with, consider what happens if you write a benchmark spec for a specific controller action, and call that method 1000 times in a row. This might give a good idea of performance of that controller method, but it might be making the same redis or mongo query 1000 times, the results of which the database driver may be caching. This also ignores the time it'll take your web server to respond and serve up the static assets that are part of the request (this may be okay, especially if you have other tests for this).
Basic Tools
ab, or ApacheBench, is a simple commandline tool that you can use to test the throughput and speed of your app. I usually go to this first when I want to send a thousand requests at a web server, or test how many simultaneous requests my app can handle (e.g. when comparing mongrel, unicorn, thin, and goliath). Because all requests originate from the same server, this is good for a small number of requests, but as the number of requests grow, you'll be limited by the resources on your testing machine (network stack, cpu, and maybe memory).
Benchmark is a standard Ruby class, and is great for quickly spitting out some profiling information. It can also be used with Test::Unit and RSpec. If you want a rake task for doing some quick benchmarking, this is probably the place to start.
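For example, a quick rake task along these lines can give a first ballpark number (a rough sketch; the URL and request count are placeholders):

    # lib/tasks/bench.rake
    require "benchmark"
    require "net/http"

    namespace :bench do
      desc "Hit a URL N times and report timing (single-threaded, very rough)"
      task :requests do
        url = URI("http://localhost:3000/")   # placeholder endpoint
        n   = 1000                            # placeholder request count

        result = Benchmark.measure do
          n.times { Net::HTTP.get_response(url) }
        end

        puts result                                       # user/system/real times
        puts "~#{(n / result.real).round} requests/sec"   # naive throughput estimate
      end
    end

Keep in mind this is single-threaded and measures the whole round trip from one client, so treat it as a sanity check rather than a real load test; ab or Tsung will give more realistic concurrency numbers.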
mechanize - I like using mechanize for quickly scripting an interaction with a page. It handles cookies and forms, but won't go and grab assets like images by default. It can be a good tool if you're rolling your own tests, but shouldn't be the first one to go to.
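As a rough example of the kind of scripted interaction mechanize is good for (the URL and form field names here are assumptions):

    require "mechanize"

    agent = Mechanize.new
    page  = agent.get("http://localhost:3000/login")   # placeholder URL

    # Fill in and submit the first form on the page
    form = page.forms.first
    form["user[email]"]    = "test@example.com"
    form["user[password]"] = "secret"
    dashboard = agent.submit(form)

    puts dashboard.title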
There are also some tools that will simulate actual users interacting with the site (they'll download assets as a browser would, and can be configured to simulate several different users). Most notable are The Grinder and Tsung. While still very much in development, I'm currently working on tsung-rails to make it easier to automate rails load testing with tsung, and would love some help if you choose to go in this direction :)
Rails Profiling Links
Good overview for writing performance tests
Great slide deck covering most of the latest tools for profiling at various levels
I have a web crawler that looks for specific information I want and returns it. This is run daily.
The issue is that my crawler has to do two things.
1. Get the links it has to crawl.
2. Crawl each link and push the results to the DB.
The issue with #1 is, there are 700+ links in total. These links don't change VERY frequently - maybe once a month?
So one option is just to do a separate crawl for the 'list of links', once a month, and dump the links into the db.
Then, have the crawler do a db hit for each of those 700 links every day.
Or, I can just have a nested crawl within my crawler, where every single time the crawler is run (daily) it refreshes this list of 700 URLs, stores it in an array, and pulls from that array to crawl each link.
Which is more efficient and less taxing on Heroku, or whichever host?
It depends on how you measure "efficiency" and "taxing", but the local database hit is almost certain to be faster and "better" than an HTTP request + parsing an HTML(?) response for the links.
Further, not that it likely matters, but (assuming your database and adapter support it) you can begin to iterate through the DB request results and process them without waiting for or fetching the entire set into memory.
Network latency and resources are going to be much worse than poking at a DB that is already sitting there, running, and designed to be queried efficiently and quickly.
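To make the batching point concrete, the daily Rails job could stream the stored links from the local DB in batches rather than loading all 700 at once (a sketch, assuming a Link model with a url column):

    require "net/http"

    # Daily crawl: stream the stored links from the local DB in batches
    # instead of re-scraping the index page for them on every run.
    Link.find_each(batch_size: 100) do |link|
      html = Net::HTTP.get(URI(link.url))   # fetch the page
      # ... parse html and persist whatever you need ...
    end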
However: once per day? Is there a good reason to spend any energy optimizing this task?