Best way to space out execution / requests (for a web crawler) - ruby-on-rails

I am building a web crawler. It grabs a main page, which has about 50 links. Then I need to make 2 requests for each link. I don't want to make 101 requests within a few seconds. What is the best way to randomly space these out? I want to somewhat mimic human activity.
urls.each do |url|
  # Do something with this URL
  sleep([1, 2, 3, 4, 5].sample) # pause 1-5 seconds before the next request
end
This will work, but is this the best way to do it? Let's say this takes about an hour to run; will the server become slow to respond while it's sleeping?

Mimicking human behavior is very hard. It's impossible if you're requesting literally every single link on a website anyway. I would probably just request one page per second. The server will not be impacted at all by a thread that's currently sleeping.
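If you go with roughly one page per second, a small tweak to the loop above keeps that pace with a little jitter. This is only a sketch: Net::HTTP and URI are from the standard library, urls is the array from the question, and the parsing step is left as a placeholder.
require "net/http"
require "uri"

urls.each do |url|
  page = Net::HTTP.get(URI(url)) # fetch the page
  # ... parse `page` and queue the two follow-up requests for this link ...
  sleep(1 + rand)                # about one request per second, plus up to 1s of jitter
end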

Related

Should I POST high-rate user actions to my server on a per-action basis or send the batch of events once the session is closed?

I'm building a site where users can watch a video and click as many times as they want to "like" it. It's a bit like Periscope's "Hearts" function for those who know it.
The viewers are watching the video in a web browser for now. Every "Like" is written to a Heroku-hosted Redis instance, so reads and writes are fairly cheap. However, there could potentially be a high rate of simultaneous input as many users watch a video at the same time.
In this scenario, I'm facing two options:
1. Send an event to the Redis instance every time the user "likes." Convenience: store the "like" right away with all relevant information. Inconvenience: lots of concurrent writes hitting the server.
2. Cache the "likes" locally and only send them to Redis once the session is over. Problem: at any time the user can close his browser (and potentially never return), so the "like" information could be lost permanently.
Any advice on which option is preferable?
Don't cache.
First, it's a really big complication as you won't know when the session is really over.
Second, a Redis increment is probably as fast as or faster than your cache. I bet your concern is Rails only, not Redis.
You may eventually want to make another endpoint - maybe a simple Sinatra app - to simply handle likes. I've noticed autosuggest gems sometimes do this (for example), and it saves all the overhead of a Rails request.
If the app is successful, the concern could be someone writing a script to "like" continually. You may need to add some throttling to allow only a limited number of requests over a given period.
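A minimal sketch of that non-caching approach - a tiny Sinatra endpoint writing straight to Redis. The route name and the REDIS_URL setting are assumptions for illustration, not anything from the question.
require "sinatra"
require "redis"

redis = Redis.new(url: ENV["REDIS_URL"]) # assumed connection setting

# One small endpoint whose only job is to record a like.
post "/videos/:id/like" do
  redis.incr("video:#{params[:id]}:likes") # atomic increment, no read-modify-write race
  status 204 # nothing to return; keep the response as cheap as possible
end
Because INCR is atomic, concurrent likes simply interleave increments, and there is no per-session state that can be lost.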

How can I throttle requests per minute in ASP.NET MVC?

I want to be able to say that if requests from the same user (of an API) start to come quickly enough that their requests per minute reach a certain level, I want to start denying the requests until they slow down (just like the guys at Zendesk did).
The question is twofold: what's an efficient way of calculating the request rate (with minimal DB reads/writes), and where in the MVC hierarchy (action filter? controller method override?) would this code best reside?
Two words: Reactive Framework.
It has all sorts of candy and sugary syntax for making the throttling and management of events less of a headache, and I'd bet it will trickle back through and kill some complexity downstream.
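The answer names Rx rather than showing the bookkeeping. Purely to illustrate the rate calculation with no DB reads or writes, here is a fixed one-minute window per user, sketched in Ruby rather than C#; in MVC the same logic would typically sit behind an action filter.
# Fixed-window limiter: at most `limit` requests per user per minute, kept in memory.
class RateLimiter
  def initialize(limit)
    @limit = limit
    @windows = {} # user_id => [current_minute, count]
  end

  # Returns true if this request is allowed, false if it should be denied.
  def allow?(user_id)
    minute = Time.now.to_i / 60
    window, count = @windows[user_id]
    if window == minute
      return false if count >= @limit
      @windows[user_id] = [minute, count + 1]
    else
      @windows[user_id] = [minute, 1] # a new minute started, reset the count
    end
    true
  end
end

limiter = RateLimiter.new(60)
# limiter.allow?("user-42") # => true until the 61st call within the same minute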

View counter in ASP.NET MVC

I'm going to create a view counter for articles. I have some questions:
1. Should I ignore the article's author when he opens the article?
2. I don't want to update the database each time. I can store in a Dictionary<int, int> (articleId, viewCount) how many times each article was viewed. After 100 hits I can update the database.
3. I should only count the hit once per hour for each user and article (if the user opens one article many times during one hour, the view count should be incremented only once).
For each question I want to know your suggestions on how to do it right.
I'm especially interested in how to do #3. Should I store the time when the user opened the article in a cookie? Does that mean I should create a new cookie for each page?
I think I know the answer - they are analyzing the IIS log as Ope suggested.
The hidden image src is set to http://stackoverflow.com/posts/3590653/ivc/[Random code]
[Random code] is needed because many people may share the same IP (in a network, for example) and the code is used to distinguish users.
1. Sure - I think that is a good idea.
2. and 3. are related: the issue is where you would actually store this dictionary and logic.
ASP.NET application or session scope is of course the easiest choice, but there you really need to understand the logic of application pools. ASP.NET applications are recycled from time to time: when there is no activity on the site for a certain period, or in special situations - e.g. if the process starts to take too much memory - the application is shut down and a new one is started on the next request. There are events for session and application shutdown, but at least some years ago they were not really reliable: in many special cases they did not always fire. Perhaps they are better now, but it is painful to test. And one hour is really a long time: usually sessions are kept alive only about 20 minutes after the last request.
A reliable way would be to have a separate Windows service (a lot of work to program) or to always store to the database and analyse double views there (quite a lot of overhead for such a small feature).
Do you have access to the IIS logs? How about analyzing them, e.g. every 30 minutes, with some kind of timer process and taking the count from there? Or just store all the hits to the database with user information and calculate the unique hits with a similar timed process.
One final question: are you really sure that none of the thousands of counter applications/services on the Internet would do the job close enough to your requirements?
Good luck!
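For points 2 and 3 of the answer above - an in-memory tally flushed after 100 hits, plus a one-hour dedupe per user and article - here is a rough sketch of the bookkeeping. It is written in Ruby purely to show the structure (the real thing would be a C# dictionary in application scope, with the caveats about recycling noted above), and flush_to_database is a stub.
# In-memory view counter: counts a (user, article) hit at most once per hour
# and writes the accumulated counts to the database after every 100 new hits.
class ViewCounter
  FLUSH_THRESHOLD = 100
  DEDUPE_WINDOW = 60 * 60 # one hour, in seconds

  def initialize
    @counts = Hash.new(0) # article_id => pending view count
    @last_seen = {}       # [user_id, article_id] => time of last counted hit
    @pending = 0
  end

  def record_hit(user_id, article_id)
    key = [user_id, article_id]
    now = Time.now
    return if @last_seen[key] && now - @last_seen[key] < DEDUPE_WINDOW # seen within the hour

    @last_seen[key] = now
    @counts[article_id] += 1
    @pending += 1
    flush if @pending >= FLUSH_THRESHOLD
  end

  def flush
    @counts.each { |article_id, count| flush_to_database(article_id, count) }
    @counts.clear
    @pending = 0
  end

  # Stub for illustration; a real version would run something like
  # UPDATE Articles SET Views = Views + @count WHERE Id = @articleId.
  def flush_to_database(article_id, count); end
end
As the answer warns, anything held only in process memory disappears when the application pool recycles, so whatever the flush threshold, some hits can still be lost.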
This is a screenshot of this page in Firebug. You can see that there is a request which returns a 204 status code (No Content).
This is Stack Overflow's view counter. They are using a hidden image which points to a controller's action.
I have many articles. How do I track which articles the user has already visited?
P.S. BTW, why is this request made twice?

Data logged to a file; how do I rotate logs and how do I parse the data to not have 'gaps' in the data?

I've got a web application that, for performance reasons, throws any data sent into a logfile.
I've got two concerns with this approach:
How do I best rotate logs, in order to not lose data?
For each user session multiple requests are logged. Each request has a unique id so there is an easy way for me to tie the requests to the session. The problem is, however, that if I rotate the logs I risk ending up with one request in one log and another request in another log.
How do I arrange my parsing in a way that allows me to parse all requests from a given session? I am willing to define a session time limit, for example that requests must be at most 30 minutes apart.
If I had an hourly log rotation at 00 minutes and the user made one request at 13:59 and one at 14:01, the user would end up with requests in two different logs.
Answer to part 1: If you're on *nix, use syslog/logger. Check the logger(1) and syslog.conf(5) man pages.
Answer to part 2: You're not forced to look at just one log file at a time. less ${SERVICE}* will normally open all the relevant log files together: when you get to the bottom of a page, :n will move you to the next file and :p back.
Alternatively, use a log analyser program. Steve Kemp's post on promptly finding needles in syslog haystacks covers, together with its comments, a lot of ground.
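If you do end up writing the analysis yourself, the grouping step is simple once all the rotated files are read together, which also solves the 13:59/14:01 split. A rough sketch follows; the app.log* glob and the "timestamp session_id request" line format are placeholders for whatever your application actually writes.
require "time"

SESSION_GAP = 30 * 60 # seconds; a longer silence starts a new session

# Read every rotated file together so a session split across two files stays whole.
lines = Dir.glob("app.log*").flat_map { |path| File.readlines(path, chomp: true) }

# Assumed (hypothetical) line format: "2012-05-01T13:59:02 session-abc GET /foo"
entries = lines.map do |line|
  timestamp, session_id, request = line.split(" ", 3)
  { time: Time.parse(timestamp), session: session_id, request: request }
end.sort_by { |e| e[:time] }

sessions = Hash.new { |h, k| h[k] = [] } # [session_id, visit_number] => requests
last_seen = {}
visit = Hash.new(0)

entries.each do |e|
  id = e[:session]
  # Start a new visit if we have never seen this session or it went quiet too long.
  visit[id] += 1 if last_seen[id].nil? || e[:time] - last_seen[id] > SESSION_GAP
  last_seen[id] = e[:time]
  sessions[[id, visit[id]]] << e
end

sessions.each { |(id, n), reqs| puts "#{id} visit #{n}: #{reqs.size} requests" }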

Twitter app development best practices?

Let's imagine an app which is not just another way to post tweets, but something like an aggregator that needs to store, and have access to, the tweets posted through it.
Since Twitter added a limit on API calls, the app should/may use some cache, and then it should periodically check whether a tweet has been deleted, etc.
How do you manage the limits? How do you think well-trafficked apps survive without being whitelisted?
To name a few:
Aggressive caching. Don't call out to the API unless you have to.
I generally pull down as much data as I can upfront and store it somewhere. Then I operate off the local store until it runs out and needs to be refreshed.
Avoid doing things in real time. Queue up requests and make them on a timer.
If you're on Linux, cronjobs are the easiest way to do this.
Combine requests as much as possible.
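A rough sketch of the "pull down as much as you can and operate off the local store" idea; the file-based cache, the 15-minute TTL, and fetch_from_api are all placeholders for illustration.
require "json"

CACHE_TTL = 15 * 60 # refresh the local copy at most every 15 minutes

# Stub for illustration; the real version would make one (ideally combined) API call.
def fetch_from_api(user)
  []
end

# Serve from the local store unless it has gone stale; only then spend an API call.
def timeline_for(user, cache_path: "cache/#{user}.json")
  if File.exist?(cache_path) && Time.now - File.mtime(cache_path) < CACHE_TTL
    return JSON.parse(File.read(cache_path))
  end

  tweets = fetch_from_api(user)
  File.write(cache_path, JSON.generate(tweets))
  tweets
end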
Well, you have 100 requests per hour, so the question is how you balance them between the various types of requests. I think the best option is the way TweetDeck does it, which allows you to set the percentage and saves the rest of the % for posting (because that is important too):
[screenshot of TweetDeck's API allocation settings - source: livefilestore.com]
As for the caching, a database would be good, and I would ignore deleted tweets - once you have downloaded a tweet it doesn't matter if it was deleted. If you wanted to, you could in theory just try to open the tweet's page, and if you get a 404 then it's been deleted. That means no cost against the API.
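A sketch of that last check using only the standard library; the example URL below is made up, and a 404 is taken, as above, to mean the tweet is gone.
require "net/http"
require "uri"

# True if the tweet's public page has disappeared (HTTP 404).
def tweet_deleted?(tweet_url)
  Net::HTTP.get_response(URI(tweet_url)).is_a?(Net::HTTPNotFound)
end

# tweet_deleted?("https://twitter.com/someuser/status/123456789")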
