I need to fetch a lot of data, and performance is an essential requirement. Do you have any suggestions?
Thanks in advance!
I've been using the mechanize gem (http://mechanize.rubyforge.org/mechanize/) with good results. Performance is always a concern because HTTP response times vary, but there is nothing inherently slow about mechanize itself.
I was looking to get the best performance by adding concurrency where the order of retrieval was not important. It can take quite a bit of tweaking to keep the receive buffers isolated so that the response for one thread does not corrupt the buffer of another; the simplest way to do that is to give each thread its own agent, as in the sketch below.
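A minimal sketch of that pattern (the URLs and the thread-per-URL split are placeholders): each thread builds its own Mechanize agent, so cookie jars, connection state, and buffers are never shared between threads.

```ruby
require 'mechanize'

# Hypothetical list of pages to fetch; order of retrieval does not matter.
urls = %w[
  https://example.com/page1
  https://example.com/page2
  https://example.com/page3
]

results = {}
mutex   = Mutex.new

threads = urls.map do |url|
  Thread.new do
    agent = Mechanize.new            # thread-local agent, nothing shared
    page  = agent.get(url)
    mutex.synchronize { results[url] = page.body }
  end
end

threads.each(&:join)
puts "fetched #{results.size} pages"
```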
Good luck with this.
Is there a better way to measure the time taken by each piece of logic executed while processing an HTTP request? We need per-step profiling rather than grouped totals that lump multiple things together. We tried New Relic and AppSignal to get this data, but no luck.
I have added a screenshot too. In it, ActiveRecord took around 1.5 seconds and view rendering (the response) took around 300 ms. Summing those two gives roughly 2 seconds, but the total time taken for the request is 10 seconds. We cannot find where the remaining 8 seconds go. New Relic says 90% of the time is spent in the controller action, but gives no breakdown of it. Are there better tools to get more detailed information?
Note: most of the time the app works fine; the issue only appears at specific times. We don't know what is causing the slowness, and that is exactly what we hope a tool can help us identify.
AppSignal can help you with this. Here is an example of the event timeline AppSignal provides when you look at the performance of a specific action.
This is provided out of the box for Rails. In some cases you will want to narrow things down even further; that's where custom instrumentation comes in, helping you pinpoint the specific piece of code that is causing performance problems.
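A rough sketch of what custom instrumentation can look like (the event name and the wrapped code here are placeholders; see the AppSignal docs for the exact helper and naming conventions):

```ruby
class ProductsController < ApplicationController
  def index
    # Hypothetical slow section, wrapped so it shows up as a separate
    # event in the AppSignal timeline for this action.
    Appsignal.instrument("build_report.products") do
      @report = ReportBuilder.new(params).build   # placeholder for your own code
    end
  end
end
```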
You can reach out to us at AppSignal and I will be happy to help.
There may be tooling options that give you a little more information out of the box. But if you're really trying to identify bottlenecks, you may want to take things into your own hands and do some good old-fashioned debugging: use Ruby's built-in Benchmark module in your controller actions and log the results. This may help you identify which sections of code are chewing up response time.
https://ruby-doc.org/stdlib-3.0.0/libdoc/benchmark/rdoc/Benchmark.html
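A minimal sketch of that approach, with a made-up controller and made-up sections; the idea is simply to wrap each suspect block in Benchmark.realtime and log the timings so you can see where the missing seconds go:

```ruby
require 'benchmark'

class OrdersController < ApplicationController
  def show
    db_time = Benchmark.realtime do
      @order = Order.includes(:line_items).find(params[:id])   # hypothetical query
    end

    calc_time = Benchmark.realtime do
      @total = @order.line_items.sum(&:price)                  # hypothetical heavy step
    end

    # Benchmark.realtime returns elapsed seconds as a float.
    Rails.logger.info(
      "orders#show db=#{(db_time * 1000).round}ms calc=#{(calc_time * 1000).round}ms"
    )
  end
end
```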
As a Ruby on Rails programmer I am constantly struggling with memory problems, and my apps on Heroku often hit the 100% mark. Since I learned RoR by doing, and mostly on my own, I believe I have missed quite a lot of conventions and best practices, specifically around memory conservation.
Measuring memory usage is difficult; even great gems such as derailed only give me very indirect pointers to the code that uses too much memory.
I tend to use a lot of gems and a lot of attributes on my major models. The most important one, my Product model, holds some 40 attributes, and there are about 400,000 records in my database. Which brings me to my questions, or rather requests for clarification.
A. I assume that if I do a Product.all request in the controller, it will load some (400,000 * 40 =) 16M "fields" (not sure what to call them) into memory, right?
B. Product.where(<query that matches half of the records>) will load 200,000 * 40 = 8M "fields" into memory, right?
C. Product.select(:id, :name, :price) would bring 400,000 * 3 = 1.2M "fields" into memory, right?
D. I also assume that selecting attributes that are integers (such as price) is less expensive than strings (such as name), which in turn is less expensive than text (such as description). In other words, Product.select(:id) will use less memory than Product.select(:long_description_with_1000_characters), right?
E. I also assume that a query such as Product.all.limit(30) will use less memory than Product.all.limit(500), right?
F. Considering all this, and assuming the answer is YES to all of the above, I assume it would be worth the time to go through my code and find the "fat and greedy" requests of the 16M type. In that case, what would be the most effective way, or tool, to understand how many "fields" a certain request will load? At the moment I plan to simply go through the code and try to picture, and troubleshoot, each database request to see whether it can be slimmed down.
Edit: another way of phrasing question (F): if I change a certain database request, how can I tell whether it now uses less memory? My current approach is to deploy to my production app on Heroku and check the total memory usage, which is obviously VERY blunt. The kind of rough local check I have in mind instead is sketched below.
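A sketch of that rough check (it assumes the get_process_mem gem is available and only looks at process RSS before and after loading, so it is approximate at best):

```ruby
require 'get_process_mem'

# Run a block and report how much the process RSS grew, in megabytes.
def memory_delta_mb
  before = GetProcessMem.new.mb
  yield
  GetProcessMem.new.mb - before
end

# Compare a "fat" load of all ~40 columns against a slimmed-down select.
full    = memory_delta_mb { Product.all.to_a }
trimmed = memory_delta_mb { Product.select(:id, :name, :price).to_a }

puts "full load:    #{full.round(1)} MB"
puts "trimmed load: #{trimmed.round(1)} MB"
```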
FYI: I am using Scout on Heroku to find memory bloat, which is useful to some extent. I also use the bullet gem to find N+1 issues, which helps speed up requests, but I am not sure whether it affects memory at all.
Any pointers to blog posts or similar that discuss this would be interesting as well. In addition, any comments on how to make sure that requests like the above do not bloat or use too much memory would be highly appreciated.
Background:
I'm a software engineering student, and I was checking out several algorithms for recommendation systems. One of them, collaborative filtering, has a lot of loops in it: it has to go through all of the users and, for each user, all of the ratings that user has given to movies or other rateable items.
I was thinking of implementing it in Ruby for a Rails app.
The point is that there is a lot of data to be processed, so:
Should this be done in the database, using regular queries, PL/SQL, or something similar? (Testing databases is extremely time-consuming and hard, especially for these kinds of algorithms.)
Should I use a background job that caches the results of the algorithm? (If so, the data is processed in memory, and if there are millions of users, how well does that scale?)
Should I run the algorithm on every request, or every x requests? (Again, the data is processed in memory.)
The Question:
I know there are things that do this, like Apache Mahout, but they rely on Hadoop for scaling. Is there another way out? Is there a Mahout or machine-learning equivalent for Ruby, and if so, where does the computation take place?
Here are my thoughts on each of the options:
1. No, it should not. Some calculations would be much faster to run in your database and some would not, but it would be hard and time-consuming to test exactly which calculations should be run in the database, and you would probably find that some part of the algorithm is slow in PostgreSQL or whatever you use.
More importantly, the database is not the right place for this logic: as you say yourself, it would be hard to test, and it's generally bad practice. It would also hurt the performance of your other requests every time the database has to run the algorithm. And the database would still use a lot of memory for the processing, so that isn't an advantage either.
2. By far the best solution. See below for more explanation.
3. This is a much better solution than number one. However, it would make your app's performance very unstable: sometimes all resources would be free for normal requests, and sometimes they would all be consumed by your calculations.
Option 2 is the best solution, as it doesn't interfere with the performance of the rest of your app and is much easier to scale because it runs in isolation. If, for example, your worker can't keep up, you can just add more worker processes.
More importantly, you can run the background processes on a separate server, which makes it easy to monitor memory and resource usage and to scale that server as needed.
Even for near-real-time updates a background job is still the best solution (provided, of course, the calculation is not small enough to be done within the request itself). You could create a "high priority" queue with enough resources that it is almost always empty. If you need to show the result to the user without a reload, you would have to add some kind of push notification once the background job completes. That notification could then trigger an update on the page through JavaScript (you can also check out the new live-streaming feature in Rails 4).
I would recommend something like Sidekiq with Redis; a rough sketch of such a worker is below. You could then cache the results in Memcached, or recalculate the result each time; that really depends on how often you need the calculation. Either way, this solution makes it much easier to set up a stable cache if you want one.
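A minimal sketch of that setup (Recommender and the queue name are placeholders; it assumes Sidekiq is already configured with Redis and that a Rails cache store is available):

```ruby
class RecommendationWorker
  include Sidekiq::Worker
  sidekiq_options queue: :recommendations

  def perform(user_id)
    user    = User.find(user_id)
    results = Recommender.recommendations_for(user)   # hypothetical heavy calculation

    # Cache the finished result so web requests only do a fast read.
    Rails.cache.write("recommendations/#{user_id}", results, expires_in: 12.hours)
  end
end

# Enqueue from a controller, a scheduler, or a nightly task:
# RecommendationWorker.perform_async(user.id)
```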
Where I work, we have an application that runs heavy queries with a lot of calculations like this. Each night these jobs are queued and then run on an isolated server over the next few hours. This scales really well and is also easy to monitor with New Relic.
Hope this helps and makes sense (I know my English isn't perfect). Please feel free to ask if I've misunderstood something or if you have more questions.
I suspect my app is creating lots of threads via dispatch_async() calls. I've seen crash reports with north of 50 and even 80 threads. It's a large code base I didn't write and haven't fully dissected. What I would like to do is get a profile of our thread usage: how many threads we're creating, when we're creating them, and so on.
My goal is to figure out whether we are spending all of our time swapping threads, and whether using an NSOperationQueue would be better so we have more control than we get by just dispatch_async'ing blocks all over willy-nilly.
Any ideas / techniques for investigating this are welcome.
It looks like you need to take a look at Instruments. You can learn about it from the Apple docs, WWDC sessions, or wherever you like; there are many resources.
Generally, NSOperationQueues are definitely better if you need to implement dependencies between operations.
As Brad Larson pointed out, there are a few WWDC sessions that are helpful in many cases. However, besides optimizing your calls, you should consider making your code more readable and simply better. I have never seen iOS source code spin up as many as 80 threads; there must be something wrong with the app's architecture.
Anyone, please let me know if I am wrong.
If you are spinning up that many threads, you are most likely I/O bound. Also, Mike's article is great, but it's quite old (though still relevant with regard to regular queues).
Instead of using dispatch_async, you should be using dispatch_io and friends for your I/O requirements. They handle all the asynchronous monitoring and callbacks for you... and will not overrun your process with extraneous processing threads.
I have had a look through lots of posts but can't seem to find anything on UIWebView load times and the kind of performance to expect.
I have an app that loads a cached HTML page which is about 4.2 KB in size. The view takes around 3.5 seconds to render over a Wi-Fi connection. How good or bad is this?
I'm trying to get the page to display as quickly as possible, but I can't really find anything more I can do. So I guess the question is: should I be getting better performance than this?
Any useful links would be appreciated.
It really doesn't have much to do with the size of the HTML, although 4.2 KB is rather big. Rendering performance mostly depends on what is being rendered. For example, if you have lots of tables, or, even worse, nested tables, rendering will be really slow. If your HTML uses bad recursive entity definitions, it will be horribly slow too. I would recommend taking full advantage of HTML5 for complex rendering.