How to prevent hackers from scraping our database? [duplicate] - ruby-on-rails

Closed 11 years ago.
Possible Duplicate:
How do you stop scripters from slamming your website hundreds of times a second?
I am building a web application in RubyOnRails, which is based on a large body of data. The application makes for powerful navigation and intersection of the data, as well as a community model for adding more data.
In that respect one could compare it with StackOverflow.com: a big bunch of data, structured in a fairly simple way.
I intend to offer the content under a CreativeCommons license, but if the site takes off, I need to discourage copycats. My biggest fear is screen-scraping scripters, not only leeching away the raw data, but also causing huge usage peaks on my servers.
I wonder if RubyOnRails offers any way to throttle (obviously automated) requests, e.g. to degrade their response time to the benefit of regular users. Perhaps this requires Apache or Phusion Passenger settings?
EDIT: My goal is not to recognize user types, but to reduce responsiveness for overly active users, e.g. by capping the number of requests handled per IP address per unit of time.

My suggestion would be to limit any easy iterative navigation of your website, which is the primary way I have seen harvesting programs work. Simply encrypting or signing the id numbers used as GET parameters would make strip-mining your info more difficult. You can only try to make getting your information onerous; you won't be able to prevent it completely.
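For illustration, here is a rough Rails sketch of that idea using ActiveSupport::MessageVerifier to sign ids so URLs can't be enumerated with a simple counter; the model name, secret and route are placeholders, not anything from the question.

    require 'active_support/message_verifier'

    # Shared verifier; the secret is a placeholder, use a long random value in practice.
    ID_VERIFIER = ActiveSupport::MessageVerifier.new('replace-with-a-long-random-secret')

    # When building links, expose a signed token instead of the raw id.
    def obscured_id(record)
      ID_VERIFIER.generate(record.id)        # an opaque, tamper-evident token
    end

    # In the controller, verify the token before looking the record up.
    def show
      id = ID_VERIFIER.verify(params[:id])   # raises InvalidSignature if tampered with
      @item = Item.find(id)
    rescue ActiveSupport::MessageVerifier::InvalidSignature
      head :not_found
    end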

You could present a captcha to the "overly active users", just like SO does when you edit too fast. That should effectively hinder automated, spider-like scraping.

You might also want to look into using some Rack middleware to do rate limiting, as a recent article covered for API throttling (the kind of limits you'd want at Twitter or similar).
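As a rough illustration (not from the linked article), here is a minimal Rack middleware that caps requests per IP per minute. It keeps counts in process memory, so a real deployment would want a shared store such as memcached or Redis, or a ready-made middleware; the class name and limits are placeholders.

    class SimpleThrottle
      WINDOW = 60    # seconds
      LIMIT  = 100   # requests allowed per IP per window

      def initialize(app)
        @app  = app
        @hits = Hash.new { |h, k| h[k] = [] }
      end

      def call(env)
        ip  = env['REMOTE_ADDR']
        now = Time.now.to_f
        @hits[ip].reject! { |t| t < now - WINDOW }   # drop hits outside the window
        @hits[ip] << now
        if @hits[ip].size > LIMIT
          [503, { 'Content-Type' => 'text/plain', 'Retry-After' => WINDOW.to_s },
           ["Slow down, please.\n"]]
        else
          @app.call(env)
        end
      end
    end

    # In a Rails app, enable it with: config.middleware.use SimpleThrottle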

I believe all you can do is put up hoops for the user to jump through. Ultimately there is no foolproof way to distinguish a regular user from a bot.

Related

Wicket: "large memory footprint!", "Does Wicket scale?", etc.

Wicket uses the Session heavily, which could mean a "large memory footprint" (as some developers put it) for larger apps with lots of pages. If you had to explain to a group of Fortune 500 CTOs that they should adopt Apache Wicket for their large web application deployments, and that their fears about Wicket's scaling problems are just bad assumptions, what would you argue?
PS: The question concerns only scaling. Technical details and real-world examples are very welcome.
IMO, credibility for Apache Wicket in very large-scale deployments is established by the following URL: http://mobile.walmart.com (view the source).
See also http://mexico.com, http://vegas.com and http://adscale.de, and look those domains up on Alexa to see their rankings.
So, yes, it is quite possible to build internet-scale applications using Wicket. But whether you are using Wicket, Struts, SpringMVC, or just plain old JSPs: internet-scale software development is hard. No framework can make that easy for you. No framework can give you software with a next-next-finish wizard that serves 5M users.
Well, first of all, explain where the footprint comes from: it is mainly the PageMap.
The next step would be to explain what a page map does, what it is for, and what problems it solves (back buttons and popup dialogs, for example): problems which would otherwise have to be solved manually, at similar memory cost but at much greater development cost and risk.
And finally, tell them how you can control what goes into the page map and the secondary page cache, and thus how its size can be kept in check.
Obviously you can also show them benchmarks, but probably an even better bet is to drop a line to Martijn Dashorst (although I believe he's reading this post anyway :)).
In any case, I'd try to put two points across:
There's nothing Wicket stores in memory which you wouldn't have to store in memory anyway. It's just better organised, easier to develop, keep consistent, and test.
Java itself means that you're carrying some inevitable excess baggage around all the time. If they are so worried about footprint, maybe Java isn't the language they want to use at all. There are hundreds of high-traffic websites written in other languages, so that's a perfectly workable option. The worst thing they can do is go with Java, take on the excess baggage, and then not use the advantages that come with an advanced framework.
Wicket saves the last N pages in the session. This is done to be able to load a page faster when it is needed, which matters mostly in two cases: when the browser back button is used, and in Ajax applications.
The back button is clear, no need to explain, I think.
About Ajax: each Ajax request needs the current page (the last page in the session cache) to find a component in it, call its callback method, update some model, etc.
From there on, the session size depends entirely on your application code. It will be the same for any web framework.
The number of pages to cache (N above) is configurable, i.e. depending on the type of your application you may tweak it as you find appropriate. Even when there is no in-memory cache (N=0), the pages are stored on disk (again configurable) and the page will still be found, just a bit more slowly.
Some references:
http://fabulously40.com/ - a social network with many users,
several education sites - I know of two in the USA and one in the Netherlands; they also have quite a lot of users,
currently I work on a project that is expected to be used by several million users. Wicket 1.5 will be improved wherever we find hotspots.
Send this to your CTO ;-)

Achieving 25K Concurrent connections in RubyOnRails Application

I am trying to build a suggestion board application where a user raises a query and multiple people post at the same time. It is expected to support at least 25k concurrent users. The question format also includes checkboxes or radio buttons; in that case the answers will be written to the DB.
Please let me know how I can achieve this in Ruby on Rails, in terms of:
- hardware support (a specific hardware LB)
- software support (DB clustering / app server clustering / web traffic resolution)
I think your best plan is to worry about scaling to this level once you have that many users. There's nothing stopping you from achieving this in Rails, or indeed any other framework/language.
The problem with trying to design your architecture up-front to scale to this level is that, at this point, you have absolutely no idea where the pain points are going to be. Are there specific pages which are going to hit the database harder, are some of your pages heavy on HTML and images... there are so many questions to ask that simply cannot be answered effectively until you've gotten something out there.
This doesn't mean that you shouldn't worry about scaling - by all means, try to design your data structures in such a way as to allow you to scale later. But put off any major decisions, and think about them later when you have some hard data to work with.

Does Seaside scale?

Seaside is known as "the heretical web framework". One of the points that makes it heretical is that it has a lot of shared state. That, however, is something which, in my current understanding, hinders easy scaling.
Ruby on Rails, on the other hand, shares as little state as possible. It has been known to scale pretty well, even if it is dog slow compared to modern Smalltalk VMs. Flickr uses PHP and has scaled to an extremely big infrastructure...
So, does anybody have experience with scaling Seaside?
Ramon Leon shares some of his experience with scaling Seaside on his (excellent) blog. You can read very concrete ideas, with sample code, about configuring and tuning Seaside.
Enjoy :-)
http://onsmalltalk.com/scaling-seaside-more-advanced-load-balancing-and-publishing
http://onsmalltalk.com/scaling-seaside-redux-enter-the-penguin
http://onsmalltalk.com/stateless-sitemap-in-seaside
Short answer:
you can scale Seaside applications like hell, yeah
Long answer:
In the IT domain, scaling is one thing, but it has two dimensions:
horizontal
vertical
Almost everybody used to think about scaling in the vertical dimension. That was until Intel and friends hit some physical barriers and started adding cores to compensate for the impossibility of adding more MHz.
That's when we all started to become more aware of horizontal scaling as the way to go.
Why am I telling you this?
Because Seaside is a Smalltalk image running in a VM, and that is roughly the same situation as a system running on a server with a single-core processor.
Taking that as the foundation, you scale web apps by making a cluster of servers. It's the natural thing to do, it's the fault-tolerant thing to do, it's the topologically intelligent thing to do, it's the flexible thing to do... I guess you get the idea.
So, if for scaling you do the same as Intel & friends, you embrace the horizontal way. And it's even cheaper than the vertical way (which will lead you to IBM and Sun servers that are as good as they are expensive).
RoR applications are typically scaled horizontally. Google has countless cheap servers to do its thing. It works perfectly fine, no matter how much people try to impress you by throwing a bunch of forgettable Twitter fail whales at you.
If they talk to you about that, just be polite and hear what they say, but remember this:
perfect is the enemy of the good
the unfinished perfect will never be as valuable as the good thing done
BTW, Amazon does something like that too (and it also couples in geolocation, so it improves the chances of serving your requests from the cluster closest to your location).
On the other hand, the way Avi scaled DabbleDB (a Seaside-based web application company bought by Twitter) was by using one VM per customer account (starting those up and shutting them down on demand).
Having a lot of state in an image doesn't make scaling impossible or complicated.
Just different.
The way to go is with a load balancer that uses sticky sessions, so you can have one image serving all the requests of a user session. You arrange things so that any worker image behind the load balancer can serve any user of a given app. And that's pretty much it.
To be able to do that, you need to share the persistent objects among workers. All the users' databases need to be accessible by the workers at any time, and they need to deal well with concurrency.
We designed airflowing to scale in that way.
It's economically convenient too because you can start with N very small (depending on the RAM of your first server) and increase it on demand until you reach the hardware limit.
Once you reach the hardware limit, you just add another host to the cluster and reconfigure the balancer (and the access to the databases).
Simple, economic and elegant.
http://dabbledb.com/ seems to scale quite well. Moreover, one can use GemStone GLASS to run Seaside.
In this interview, Avi Bryant, the creator of Seaside and co-founder of DabbleDB, explains how they make it scale.
From what I understand:
each customer has its own Squeak image.
When a customer comes in, Apache decides, based on the user name, which port to send the request to.
Based on that port, it starts the customer's Squeak image.
That way it can grow to an infinite number of servers.
I think this solution works for them because of the specifics of their application: customers don't need to share info between each other, so there is no need for a centralized DB.
Anyway, it is better to watch the interview than to rely on my half-made summary.
Yes, Seaside scales down fantastically. A single developer can create and maintain complex applications for small groups very well.
[coming back to this after a few years]
This actually is much more important than scaling up. Computer speed still grows a lot, and 99% of all applications can now run on one machine. Speed of development, and especially of maintenance, now totally dominates TCO.
I would rephrase your question slightly to: does Seaside prevent/discourage you from creating applications that scale? I would say usually no. Seaside doesn't have a default way to store your data (just like PHP on its own doesn't, though Seaside gives you a few extra options), and my impression is that interacting with your data tends to be the biggest hurdle when it comes to scaling.
If you want to store your data in a monolithic SQL db, like with rails, you can do that. Or you can use an object database. Or you can use a separate object database for each user, or separate db for each project, or a separate db for each user and project. Or you can store everything in a series of flat files or you can just store your data as objects in your VM's memory.
And because of continuations you don't need to re-fetch your data from your datastore on every page call. As with a desktop application, you can pull data out of your datastore when your user begins interacting with it, set the appropriate variables, and then use those variables between web calls until the user is done with the data, at which point you update the datastore. When you don't share state, you have to hit the datastore on every single web call.
Of course this doesn't mean scaling is free; it just means you have a larger domain in which to search for scaling solutions.
All that said, for many applications Rails will scale much more easily, simply because there exist large hosting solutions for Rails (and PHP for that matter) that will offer you a huge amount of resources without your having to rent and set up a custom box.
These are just my impressions from reading and talking to people.
I was just reminded that there is a link in Pharo's success stories: a Seaside web application with up to 1000 concurrent users for a large health insurer in Argentina.
Pharo success stories
Issys Tracking:
Load balancing: Apache as a proxy/balancer (round robin with session affinity)
Server setup: 5 Pharo images on 3 different servers (Linux and Windows 2003)
GUI: Heavily AJAX-based. All code written in Smalltalk: Seaside 3.0, Seaside JQuery integration and JQWidgetBox.
Persistency: Glorp (OR mapper) and OpenDBX (DB client)
Databases: large PostgreSQL and MS SQL Server DBs
From the Wikipedia article, it's a total pig. Prior to that, it hadn't scaled to the point where I'd heard of it. :)

To go API or not [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
My company has this Huge Database that gets fed with (many) events from multiple sources, for monitoring and reporting purposes. So far, every new dashboard or graphic from the data is a new Rails app with extra tables in the Huge Database and full access to the database contents.
Lately, there has been an idea floating around of giving external clients (as in, not our company but sister companies) access to our data, and it has been decided we should expose a read-only RESTful API for consulting it.
My point is - should we use an API for our own projects too? Is it overkill to access a RESTful API, even for "local" projects, instead of direct access to the database? I think it would pay off in terms of unifying our team's access to the data - but is it worth the extra round-trip? And can a RESTful API keep up with the demands of running 20 or so queries per second and exposing the results via JSON?
Thanks for any input!
I think there's a lot to be said for consistency. If you're providing an API for your clients, it seems to me that by using the same API yourself you'll understand it better with respect to supporting it for your clients, you'll be testing it regularly (beyond your regression tests), and you're sending a message that it's good enough for you to use, so it should be fine for your clients.
By hiding everything behind the API, you're at liberty to change the database representations and not have to change both API interface code (to present the data via the API) and the database access code in your in-house applications. You'd only change the former.
Finally, such performance questions can really only be addressed by trying it and measuring. Perhaps it's worth knocking together a prototype API system and studying it under load?
I would definitely go down the API route. This presents an easy-to-maintain interface to ALL the applications that will talk to your application, including validation etc. Sure, you can ensure database integrity with column restrictions and stored procedures, but why maintain that as well?
Don't forget - you can also cache the API calls in the file system, in memory, or using memcached (or any other service). Where datasets have not changed (check with updated_at or ETags) you can simply return cached versions for tremendous speed improvements. The addition of ETags in a recent application I developed saw HTML load time go from 1.6 seconds to 60 ms.
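A small sketch of the conditional-GET part in a Rails controller; the model and attribute names are made up for the example. stale? sets the ETag and Last-Modified headers and replies 304 Not Modified when the client's cached copy is still current, so the body is only rendered when needed.

    class EventsController < ApplicationController
      def show
        @event = Event.find(params[:id])
        # Render only when the client's cached copy is out of date.
        if stale?(etag: @event, last_modified: @event.updated_at)
          respond_to do |format|
            format.json { render json: @event }
          end
        end
      end
    end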
Off topic: an idea I have been toying with is dynamically loading API versions depending on the request. Something like this would give you the ability to dramatically alter the API while maintaining backwards compatibility. Since the different versions live in separate files, it would be simple to maintain them separately.
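One way to realize that (a sketch under my own assumptions, not necessarily what this answer had in mind) is a routing constraint that picks the API version from the Accept header, so each version lives in its own module and files:

    # config/routes.rb -- hypothetical example
    class ApiVersion
      def initialize(version)
        @version = version
      end

      # The route matches when the client asks for this version in the Accept header.
      def matches?(request)
        request.headers['Accept'].to_s.include?("application/vnd.myapp.v#{@version}")
      end
    end

    Rails.application.routes.draw do
      namespace :api do
        scope module: :v2, constraints: ApiVersion.new(2) do
          resources :events, only: [:index, :show]
        end
        scope module: :v1, constraints: ApiVersion.new(1) do
          resources :events, only: [:index, :show]
        end
      end
    end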
Also, if you use the API internally, you should be able to reduce the amount of code you have to maintain, since you will just be maintaining the API rather than the API plus your own internal methods for accessing the data.
I've been thinking about the same thing for a project I'm about to start, whether I should build my Rails app from the ground up as a client of the API or not. I agree with the advantages already mentioned here, which I'll recap and add to:
Better API design: you become a user of your own API, so it will be a lot more polished by the time you decide to open it up;
Database independence: with reduced coupling, you could later switch from an RDBMS to a Document Store without changing as much;
Comparable performance: Performance can be addressed with HTTP caching (although I'd like to see some numbers comparing both).
On top of that, you also get:
Better testability: your whole business logic is black-box testable with basic HTTP request/response (see the sketch after this list). Headless browsers / Selenium become responsible only for application-specific behavior;
Front-end independence: you not only become free to change the database representation, you become free to change your whole front-end, from vanilla Rails-with-HTML-and-page-reloads, to sprinkled Ajax, to full-blown pure JavaScript (e.g. with GWT), all sharing the same back-end.
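As a sketch of the black-box testability point above (the endpoint and expected payload are hypothetical), a test can exercise the API with nothing but HTTP and JSON:

    require 'net/http'
    require 'json'
    require 'test/unit'

    class EventsApiTest < Test::Unit::TestCase
      BASE = URI('http://localhost:3000')   # assumes the app is running locally for the test

      def test_index_returns_a_json_array_of_events
        response = Net::HTTP.get_response(BASE + '/events.json')
        assert_equal '200', response.code
        events = JSON.parse(response.body)
        assert_kind_of Array, events
      end
    end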
Criticism
One problem I originally saw with this approach was that it would make me lose all the amenities and flexibility that ActiveRecord provides: associations, named_scopes and all. But using the API through ActiveResource brings a lot of the good stuff back, and it seems like you can also have named_scopes. Not sure about associations.
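For a feel of what that looks like (class name, site and attributes are made up), an ActiveResource model gives you an ActiveRecord-ish interface over the HTTP API:

    require 'active_resource'

    class Event < ActiveResource::Base
      self.site = 'https://api.example.com'   # the API host (placeholder)
    end

    # Usage stays close to ActiveRecord:
    events = Event.find(:all, params: { source: 'monitoring', page: 1 })
    event  = Event.find(42)
    event.acknowledged = true                  # plain attribute access
    event.save                                 # PUTs back to the API, if it accepts writes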
More Criticism, please
We've all been singing the glories of this approach but, even though an answer has already been picked, I'd like to hear from other people what possible problems this approach might bring, and why we shouldn't use it.

How do image hosting sites enforce content policies?

I'm trying to figure out how to best implement a public data hosting service.
How do websites that let users upload pictures enforce their terms of service regarding obscene pictures? Do they use image processing algorithms to flag potential violations (too many skin-colored pixels)? I think ImageShack looks at the websites its pictures are hotlinked on and checks for keywords. If it detects anything porn-related, it removes the picture and bans the account. Are there other methods?
Is enforcement largely automated or is it based more on user reports?
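To make the "skin-colored pixels" idea concrete, here is a deliberately naive Ruby sketch using the chunky_png gem; the RGB thresholds are arbitrary guesses, and a heuristic like this yields plenty of false positives and negatives, so at best it could flag images for human review:

    require 'chunky_png'

    # Very rough test for "skin-like" colors; the thresholds are arbitrary.
    def skin_tone?(r, g, b)
      r > 95 && g > 40 && b > 20 && r > g && r > b && (r - g) > 15
    end

    # Fraction of pixels in the image that look skin-like.
    def skin_ratio(path)
      image = ChunkyPNG::Image.from_file(path)
      skin  = 0
      image.height.times do |y|
        image.width.times do |x|
          pixel = image[x, y]
          r = ChunkyPNG::Color.r(pixel)
          g = ChunkyPNG::Color.g(pixel)
          b = ChunkyPNG::Color.b(pixel)
          skin += 1 if skin_tone?(r, g, b)
        end
      end
      skin.to_f / (image.width * image.height)
    end

    # Flag for human review rather than removing automatically.
    puts 'needs review' if skin_ratio('upload.png') > 0.5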
I suppose it depends on the scale of your "public data hosting service".
If it's something small, with maybe a couple hundred pictures per day flowing in, you can moderate them on your own.
If it's a couple hundred thousand, you'll need a number of human beings sorting the weeds out: either a moderator team, or the users themselves submitting abuse reports.
Which way to go can depend on your budget or the financial success of your service, as well as on the type of service. If it's something simple like Rapidshare, where one user does not see what another does, the chances that users will see each other's content and thereby notice and hopefully report unacceptable content are small. If it's something very social like Flickr, you can bet reports will be flowing in.
I suppose you could automate something, but it's an almost impossible task. You can't automatically detect porn. You can't automatically detect images violating copyright: fingerprinting copyrighted material in order to compare it with the uploaded stuff is a real challenge even for companies with the resources of Rapidshare, YouTube and others. For now this kind of work can effectively be done only by humans.
There are also legal issues to it. In some countries the service owner is not liable for what users contribute (well, as long as he is cooperative enough to delete certain content on request); in others he will face charges himself for not having pre-moderated all the incoming content. Also think about this with regard to whatever, and wherever, you are going to launch.
I don't have links, but while it's certainly a difficult task prone to errors, software to detect improper content does exist. Or at least that's what the Security Manager at NASA told me - whether it was just a means to scare me, I don't know ;-)
