Wicket: “large memory footprint!”, "Does Wicket scale?".. etc - scalability

Wicket uses the Session heavily, which could mean a “large memory footprint” (as stated by some developers) for larger apps with lots of pages. If you had to explain to a bunch of Fortune 500 CTOs that they should adopt Apache Wicket for their large web application deployments, and that their fears about Wicket's problems with scaling are just bad assumptions, what would you argue?
PS: The question concerns only scaling. Technical details and real-world examples are very welcome.

IMO the credibility of Apache Wicket in very large-scale deployments is established by the following URL: http://mobile.walmart.com (view the source).
See also http://mexico.com, http://vegas.com, http://adscale.de, and look those domains up on Alexa to see their ranking.
So, yes it is quite possible to build internet scale applications using Wicket. But whether or not you are using Wicket, Struts, SpringMVC, or just plain old JSPs: internet scale software development is hard. No framework can make that easy for you. No framework can give you software with a next-next-finish wizard that services 5M users.

Well, first of all, explain where the footprint comes from, and it is mainly the PageMap.
The next step would be to explain what a page map does, what it is for and what problems it solves (back buttons and popup dialogs, for example). These are problems which would otherwise have to be solved manually, at similar memory cost but at a much bigger development cost and risk.
And finally, tell them how you can affect what goes in the page map, the secondary page cache and thus how the size can be kept under control.
Obviously you can also show them benchmarks, but probably an even better bet is to drop a line to Martijn Dashorst (although I believe he's reading this post anyway :)).
In any case, I'd try to put two points across:
There's nothing Wicket stores in memory which you wouldn't have to store in memory anyway. It's just better organised, easier to develop, keep consistent, and test.
Java itself means that you're carrying some inevitable excess baggage around all the time. If they are so worried about footprint, maybe Java isn't the language they want to use at all. There are hundreds of large traffic websites written in other languages, so that's a perfectly workable solution. The worst thing they can do is to go with Java, take on the excess baggage and then not use the advantages that come with an advanced framework.

Wicket saves the last N pages in the session. This is done to be able to load the page faster when it is needed. It is needed mostly in two cases - using browser back button or in Ajax applications.
The back button is clear, no need to explain, I think.
About Ajax: each Ajax request needs the current page (the last page in the session cache) to find a component in it and call its callback method, update some model, etc.
From there on, the session size depends entirely on your application code. It will be the same for any web framework.
The number of pages to cache (N above) is configurable, i.e. depending on the type of your application you may tweak it as you see fit. Even when there is no in-memory cache (N=0), the pages are stored on disk (again configurable) and the page will still be found, just a bit more slowly.
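The two-level idea described above (the last N pages in memory, everything also on disk) can be sketched as follows. This is a minimal illustration in Python, not Wicket's actual classes or API; the names `TwoLevelPageStore`, `store` and `load` are hypothetical.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoLevelPageStore:
    """Hypothetical illustration only; not Wicket's real page store."""

    def __init__(self, max_in_memory, disk_dir):
        self.max_in_memory = max_in_memory   # the "N" from the text
        self.disk_dir = Path(disk_dir)
        self.memory = OrderedDict()          # page_id -> page, oldest first

    def _path(self, page_id):
        return self.disk_dir / f"page-{page_id}.bin"

    def store(self, page_id, page):
        # Every page is serialized to disk so it survives memory eviction.
        self._path(page_id).write_bytes(pickle.dumps(page))
        if self.max_in_memory > 0:
            self.memory[page_id] = page
            self.memory.move_to_end(page_id)
            while len(self.memory) > self.max_in_memory:
                self.memory.popitem(last=False)   # evict the oldest page

    def load(self, page_id):
        """Return (page, source) where source is 'memory', 'disk' or 'missing'."""
        if page_id in self.memory:
            self.memory.move_to_end(page_id)
            return self.memory[page_id], "memory"
        path = self._path(page_id)
        if path.exists():
            return pickle.loads(path.read_bytes()), "disk"   # slower path
        return None, "missing"


store = TwoLevelPageStore(max_in_memory=2, disk_dir=tempfile.mkdtemp())
for page_id in (1, 2, 3):
    store.store(page_id, {"id": page_id, "title": f"Page {page_id}"})
```

With N=2, page 3 is served from memory while page 1 (reached via the back button, say) is deserialized from disk. Setting `max_in_memory=0` keeps nothing in memory and every lookup takes the disk path.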
About some references:
http://fabulously40.com/ - social network with many users,
several education sites - I know of two in the USA and one in the Netherlands. They also have quite a lot of users,
currently I work on a project that expects to be used by several million users. Wicket 1.5 will be improved wherever we find hotspots.
Send this to your CTO ;-)

Related

Reflection and performance in web

We know Reflection is quite an expensive mechanism. Nevertheless, ASP.NET MVC is full of it, and there are so many ways to use and implement additional reflection-based practices: ORMs, mappings between DTOs, entities and view models, DI frameworks, JSON parsing and many, many others.
So I wonder: do they all affect performance so much that it is strongly recommended to avoid reflection as much as possible and find other solutions such as scaffolding? And what tool can be used to load test the server?
There's nothing wrong with Reflection. Just use it wisely, i.e. cache the results so that you don't have to perform those expensive calls over and over again. Reflection is used extensively in ASP.NET MVC. For example, when the controller and action names are parsed from the route, Reflection is used to find the corresponding method to invoke. Once found, however, the result is cached so that the next time someone requests the same controller and action name, the method to be invoked is fetched from the cache.
So if you are using a third party framework check the documentation/source code whether it uses reflection and whether it caches the results of those calls.
And if you have to use it in your code, same rule applies => cache it.
For stress testing, this SO post gives quite a few possibilities: Stress Testing ASP.Net application.
I have thought about this question myself, and come to the following conclusions:
Most people don't spend their days resubmitting pages over and over again. Users spend most of a visit reading and consuming pages, which at worst contain a few Ajax calls, so the time spent actually making requests is minimal compared with the time spent on the site. Even if you have a million concurrent users of your application, you will generally not have to deal with a million requests at any given time.
The web is naturally based on string comparisons... there are no types in an HTTP response, so any web application is forced to deal with these kinds of tasks as a fact of everyday life. The fewer string comparisons and dynamic objects the better, but they are at their core, unavoidable.
Although things like mapping by string comparison or dynamic type checking are slow, a site built with a non-compiled, weakly-typed language like PHP will contain far more of these actions. Despite the number of possible performance hits in MVC compared to a C# console application, it is still a superior solution to many others in the web domain.
The use of any framework will have a performance cost associated with it. An application built in C# with the .NET framework will for all intents and purposes not perform as well as an application written in C++. However, the benefits are better reliability, faster coding time and easier testing among others. Given how the speed of computers has exploded over the past decade or two, we have come to accept a few extra milliseconds here and there in exchange for these benefits (which are huge).
Given these points, in developing ASP.NET MVC applications I don't avoid things such as reflection like the plague, because it is clear that they can have quite a positive impact on how your application functions. They are tools, and when properly employed have great benefits for many applications.
As for performance, I like to build the best solution I can and then go back and run stress tests on it. Maybe the reflection I implemented in class X isn't a performance problem after all? In short, my first task is to build a great architecture, and my second is to optimise it to squeeze every last drop of performance from it.
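The first point above, that a million concurrent users is not a million simultaneous requests, can be put into rough numbers. The sketch below uses Little's law; the think time and response time are assumed figures chosen for illustration, not measurements.

```python
def load_estimate(concurrent_users, think_time_s, response_time_s):
    """Return (requests per second, requests in flight at any instant)."""
    cycle_s = think_time_s + response_time_s       # one request per cycle per user
    requests_per_second = concurrent_users / cycle_s
    # Little's law: items in the system = arrival rate * time in the system
    in_flight = requests_per_second * response_time_s
    return requests_per_second, in_flight


rate, in_flight = load_estimate(
    concurrent_users=1_000_000,   # figure from the text
    think_time_s=59.5,            # assumed: about a minute of reading per page
    response_time_s=0.5,          # assumed: half a second of server time
)
print(f"{rate:,.0f} requests/s, about {in_flight:,.0f} in flight at once")
```

Under these assumptions the server sees well under ten thousand requests in flight at any instant, which is the quantity that matters for the cost of per-request work such as reflection.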

Internal Linking with domain or with / for seo

When we add internal links to a website, should they be written with the domain or with a leading /? Which would be better from the SEO point of view? For example, if my page is www.testdomain.com/about.htm and I link to it from another page, should I write the internal link as
About
or
About
Which would suit SEO better? Thanks in advance.
From an SEO standpoint: no difference whatsoever.
From a maintenance standpoint: please go with About
Although there is no difference between the two styles, I would say stick to the standard method. In this case I would say "About" (the second one) is the right way.
There are a couple of really good reasons to code relative URLs
1) It is much easier and faster to code.
When you are a web developer building a site with thousands of pages, coding relative rather than absolute URLs is a way to be more efficient. You'll see it happen a lot.
2) Staging environments
Another reason why you might see relative versus absolute URLs is some content management systems -- and SharePoint is a great example of this -- have a staging environment that's on its own domain. Instead of being example.com, it will be examplestaging.com. The entire website will basically be replicated on that staging domain. Having relative versus absolute URLs means that the same website can exist on staging and on production, or the live accessible version of your website, without having to go back in and recode all of those URLs. Again, it's more efficient for your web development team. Those are really perfectly valid reasons to do those things. So don't yell at your web dev team if they've coded relative URLS, because from their perspective it is a better solution.
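The staging argument above can be checked with standard URL resolution. The sketch below uses Python's `urllib.parse.urljoin`; the page paths are made up for illustration, and the two host names follow the example.com / examplestaging.com example in the text.

```python
from urllib.parse import urljoin

link = "/about.htm"   # root-relative link as it appears in the page source

# The same markup served from two different hosts (paths are hypothetical).
production_page = "http://www.example.com/products/index.htm"
staging_page = "http://www.examplestaging.com/products/index.htm"

production_target = urljoin(production_page, link)
staging_target = urljoin(staging_page, link)

print(production_target)   # http://www.example.com/about.htm
print(staging_target)      # http://www.examplestaging.com/about.htm
```

The link follows whichever host served the page, so the same files work on both environments without recoding.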
Relative URLs will also cause your page to load slightly faster. However, in my experience, the SEO benefits of having absolute versus relative URLs in your website far outweigh the teeny-tiny bit longer that it will take the page to load. It's very negligible. If you have a really, really long page load time, there's going to be a whole boatload of things that you can change that will make a bigger difference than coding your URLs as relative versus absolute.
Page load time, in my opinion, is not a concern here. However, it is something that your web dev team may bring up with you when you try to address with them the fact that, from an SEO perspective, coding your website with relative versus absolute URLs, especially in the nav, is not a good solution.
There are even better reasons to use absolute URLs
1) Scrapers
If you have all of your internal links as relative URLs, it would be very, very, very easy for a scraper to simply scrape your whole website and put it up on a new domain, and the whole website would just work. That sucks for you, and it's great for that scraper. But unless you are out there doing public services for scrapers, for some reason, that's probably not something that you want happening with your beautiful, hardworking, handcrafted website. That's one reason. There is a scraper risk.
2) Preventing duplicate content issues
But the other reason why it's very important to have absolute versus relative URLs is that it really mitigates the duplicate content risk that arises when you don't have all of the versions of your website resolving to one version. Google could potentially enter your site on any one of these four pages, which are the same page to you. They're four different pages to Google. They're the same domain to you. They are four different domains to Google.
But they could enter your site, and if all of your URLs are relative, they can then crawl and index your entire domain using whatever format these are. Whereas if you have absolute links coded, even if Google enters your site on www. and that resolves, once they crawl to another page, that you've got coded without the www., all of that other internal link juice and all of the other pages on your website, Google is not going to assume that those live at the www. version. That really cuts down on different versions of each page of your website. If you have relative URLs throughout, you basically have four different websites if you haven't fixed this problem.
Again, it's not always a huge issue. Duplicate content, it's not ideal. However, Google has gotten pretty good at figuring out what the real version of your website is.
You do want to think about internal linking, when you're thinking about this. If you have basically four different versions of any URL that anybody could just copy and paste when they want to link to you or when they want to share something that you've built, you're diluting your internal links by four, which is not great. You basically would have to build four times as many links in order to get the same authority. So that's one reason.
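The "four versions" problem described above can be reproduced with the same URL resolution rules. The text does not list the four entry points, so the sketch assumes they are the http/https and www/non-www variants of one site; the host name and link are placeholders.

```python
from urllib.parse import urljoin

# Assumption: the four versions are scheme and www variants of the same site.
entry_points = [
    "http://example.com/",
    "http://www.example.com/",
    "https://example.com/",
    "https://www.example.com/",
]

relative_link = "/about.htm"
absolute_link = "https://www.example.com/about.htm"

# Where a crawler ends up when it follows the link from each entry point.
relative_targets = {urljoin(base, relative_link) for base in entry_points}
absolute_targets = {urljoin(base, absolute_link) for base in entry_points}

print(len(relative_targets), len(absolute_targets))   # 4 1
```

With relative links each entry point stays on its own variant, so a crawler can walk four copies of the site. With absolute links every entry point converges on one version after the first hop.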
3) Crawl Budget
The other reason why it's pretty important not to use relative URLs is crawl budget.
When we talk about crawl budget, basically what that means is that every time Google crawls your website, there is a finite depth that they will go to. There's a finite number of URLs that they will crawl and then they decide, "Okay, I'm done." That's based on a few different things. Your site authority is one of them. Your actual PageRank, not toolbar PageRank, but how good Google actually thinks your website is, is a big part of that. But also how complex your site is, how often it's updated, things like that are also going to contribute to how often and how deep Google is going to crawl your site.
It's important to remember when we think about crawl budget that, for Google, crawl budget costs actual dollars. One of Google's biggest expenditures as a company is the money and the bandwidth it takes to crawl and index the Web. All of that energy that's going into crawling and indexing the Web, that lives on servers. That bandwidth comes from servers, and that means that using bandwidth costs Google actual real dollars.
So Google is incentivized to crawl as efficiently as possible, because when they crawl inefficiently, it costs them money. If your site is not efficient to crawl, Google is going to save itself some money by crawling it less frequently and crawling fewer pages per crawl. That can mean that if you have a site that's updated frequently, your site may not be updating in the index as frequently as you're updating it. It may also mean that Google, while it's crawling and indexing, may be crawling and indexing a version of your website that isn't the version that you really want it to crawl and index.
So having four different versions of your website, all of which are completely crawlable to the last page, because you've got relative URLs and you haven't fixed this duplicate content problem, means that Google has to spend four times as much money in order to really crawl and understand your website. Over time they're going to do that less and less frequently, especially if you don't have a really high authority website. If you're a small website, if you're just starting out, if you've only got a medium number of inbound links, over time you're going to see your crawl rate and frequency impacted, and that's bad. We don't want that. We want Google to come back all the time, see all our pages. They're beautiful. Put them up in the index. Rank them well. That's what we want. So that's what we should do.

Achieving 25K Concurrent connections in RubyOnRails Application

I am trying to build a suggestion board application where a user raises a query and multiple people post replies at the same time. It is expected to support at least 25k concurrent users. The question format also has checkboxes or radio buttons; in that case responses will be written to the DB.
Please let me know how I can achieve this in Ruby on Rails:
- hardware support (specific Hardware LB)
- software support like (DB clustering/App server clustering/ Web traffic resolution)
I think your best plan is to worry about scaling to this level once you have that many users. There's nothing stopping you from achieving this in Rails, or indeed any other framework/language.
The problem with trying to design your architecture up-front to scale to this level is that, at this point, you have absolutely no idea where the pain points are going to be. Are there specific pages which are going to hit the database harder, are some of your pages heavy on HTML and images... there are so many questions to ask that simply cannot be answered effectively until you've gotten something out there.
This doesn't mean that you shouldn't worry about scaling - by all means, try to design your data structures in such a way as to allow you to scale later. But put off any major decisions, and think about them later when you have some hard data to work with.

Does Seaside scale?

Seaside is known as "the heretical web framework". One of the points that make it heretical is that it has a lot of shared state. That, however, is something which, in my current understanding, hinders easy scaling.
Ruby on Rails, on the other hand, shares as little state as possible. It has been known to scale pretty well, even if it is dog slow compared to modern Smalltalk VMs. Flickr uses PHP and has scaled to an extremely big infrastructure...
So has anybody some experience in the scaling of Seaside?
Ramon Leon shares some of his experience on upscaling seaside on his (excellent) blog. You can read very concrete ideas with sample code about configuring and tuning seaside.
Enjoy :-)
http://onsmalltalk.com/scaling-seaside-more-advanced-load-balancing-and-publishing
http://onsmalltalk.com/scaling-seaside-redux-enter-the-penguin
http://onsmalltalk.com/stateless-sitemap-in-seaside
Short answer:
you can scale Seaside applications like hell yah
Long answer:
In the IT domain, scaling is one thing but it has two dimensions:
horizontal
vertical
Almost everybody used to think about scaling in the vertical dimension. That was until Intel and friends hit some physical barriers and started to add cores to compensate for the current impossibility of adding MHz.
That's when we all started to be more aware of scaling horizontally as the way to go.
Why am I telling you this?
Because Seaside is a Smalltalk image running in a VM, and that is roughly the same situation as a system on a server with a single-core processor.
Taking that as the foundation, you scale web apps by making a cluster of servers. It's the natural thing to do, it's the fault-tolerant thing to do, it's the topologically intelligent thing to do, it's the flexible thing to do, I guess you get the idea...
So if, for scaling, you do the same as Intel & friends, you embrace the horizontal way. And it's even cheaper than the vertical way (which will lead you to IBM and Sun servers that are as good as they are expensive).
RoR applications are typically scaled horizontally. Google has countless cheap servers to do their thing. It works perfectly fine, no matter how dramatically people try to impress you by throwing a bunch of forgettable Twitter whales at you.
If they talk to you about that, you just be polite and hear what they say but remember this:
perfect is the enemy of the good
the unfinished perfect will never be as valuable as the good thing done
BTW, Amazon does something like that too (and it kind of couples it with geolocation, so they improve the chances of serving your requests from the cluster that is closest to your location).
On the other hand, the way Avi scaled DabbleDB (a Seaside-based web application company bought by Twitter) was using one VM per customer account (starting up and shutting down those on demand).
Having a lot of state in an image doesn't make scaling impossible nor complicated.
Just different.
The way to go is with a load balancer that uses sticky sessions, so you can have one image attending all the requests of a user session. You set things up so that any worker image behind the load balancer can attend any user of a given app. And that's pretty much it.
To be able to do that you need to share the persistent objects among workers. All the users' databases need to be accessible by the workers at any time and need to deal well with concurrency.
We designed airflowing to be scalable in that way.
It's economically convenient too, because you can start with a very small number N of worker images (depending on the RAM of your first server) and increase it on demand until you reach the hardware limit.
Once you reach the hardware limit, you just add another host to the cluster and reconfigure the balancer (and the access to the databases).
Simple, economic and elegant.
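The sticky-session setup described above can be sketched in a few lines. This is a toy model of the routing decision only, not the configuration of any real balancer; `StickyBalancer`, `route` and `add_worker` are hypothetical names, and round robin for new sessions is an assumed policy.

```python
import itertools


class StickyBalancer:
    """Toy model: new sessions go round robin, known sessions stay put."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._next_worker = itertools.cycle(self.workers)
        self.assignments = {}   # session_id -> worker image

    def route(self, session_id):
        # The first request of a session picks a worker; later ones reuse it,
        # so the session's in-image state is always found.
        if session_id not in self.assignments:
            self.assignments[session_id] = next(self._next_worker)
        return self.assignments[session_id]

    def add_worker(self, worker):
        # Scaling out: only sessions created afterwards can land on the new worker.
        self.workers.append(worker)
        self._next_worker = itertools.cycle(self.workers)


balancer = StickyBalancer(["image-1", "image-2"])
first = balancer.route("alice")
second = balancer.route("bob")
balancer.add_worker("image-3")
again = balancer.route("alice")   # still the image that holds alice's state
```

Because any worker can take any new session, growing the cluster only means registering another worker; existing sessions are untouched.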
http://dabbledb.com/ seems to scale quite well. Moreover, one can use GemStone GLASS to run Seaside.
In this interview Avi Bryant, the creator of Seaside and co-founder of DabbleDB, explains how they make it scale.
From what I understand:
each customer has its own Squeak image.
When a customer comes, Apache decides based on the user name which port to send the request to.
Based on the port, it starts the customer's Squeak image.
That way it can grow to an infinite number of servers.
I think this solution works for them because of the specifics of their application: customers don't need to share info between them, so there is no need for a centralized DB.
Anyway it is better to watch the interview rather than my half-made summary.
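The one-image-per-customer scheme summarised above might look roughly like this. It is a guess at the shape of the idea based only on that summary, not DabbleDB's actual code; `ImagePool`, the port numbers and the method names are all hypothetical, and starting an image is reduced to updating a set.

```python
class ImagePool:
    """Hypothetical: one image (process) per customer, started on demand."""

    def __init__(self, base_port=9000):
        self.base_port = base_port   # assumed starting port
        self.ports = {}              # customer -> port, fixed once assigned
        self.running = set()         # customers whose image is currently up

    def port_for(self, customer):
        if customer not in self.ports:
            self.ports[customer] = self.base_port + len(self.ports)
        return self.ports[customer]

    def route(self, customer):
        """Return the port for this customer, starting the image if needed."""
        port = self.port_for(customer)
        if customer not in self.running:
            self.running.add(customer)   # stand-in for launching the image
        return port

    def shut_down_idle(self, customer):
        # The image is stopped, but the customer keeps the same port.
        self.running.discard(customer)


pool = ImagePool()
acme_port = pool.route("acme")
globex_port = pool.route("globex")
pool.shut_down_idle("acme")
acme_port_again = pool.route("acme")   # restarted on demand, same port
```

Since customers share nothing, the mapping from customer to port (and host) is the only global state, which is why this layout can spread over more machines without a central database.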
Yes, Seaside scales down fantastically. A single developer can create and maintain complex applications for small groups very well.
[coming back to this after a few years]
This actually is much more important than scaling up. Computer speed still grows a lot, and 99% of all applications can now run on one machine. Speed of development, and especially maintenance now totally dominates TCO.
I would rephrase your question slightly to: does Seaside prevent/discourage you from creating applications that scale? I would say usually no. Seaside doesn't have a default way to store your data (just like PHP on its own doesn't, though Seaside gives you a few extra options), and my impression is that interacting with your data tends to be the biggest hurdle when it comes to scaling.
If you want to store your data in a monolithic SQL db, like with rails, you can do that. Or you can use an object database. Or you can use a separate object database for each user, or separate db for each project, or a separate db for each user and project. Or you can store everything in a series of flat files or you can just store your data as objects in your VM's memory.
And because of continuations you don't need to re-fetch your data out of your datastore on every webpage call. As with a desktop application, you can pull data out of your datastore when your user begins interacting with it, set the appropriate variables, and then use those variables between web calls until the user is done with the data, at which point you can update the datastore. When you don't share state you have to hit the datastore on every single web call.
Of course this doesn't mean scaling is free, it just means you have a larger domain in which to search for scaling solutions.
All that said, for many applications rails will scale much easier simply because there exist large hosting solutions for rails (and php for that matter) that will offer you a huge amount of resources without having to rent and setup a custom box.
These are just my impressions from reading and talking to people.
I was just reminded that there is a link on Pharo's success stories: a Seaside web application with up to 1000 concurrent users for a large health insurance in Argentina.
Pharo success stories
Issys Tracking:
Load balancing: Apache as a proxy/balancer (round robin with session affinity)
Server setup: 5 Pharo images on 3 different servers (Linux and Windows 2003)
GUI: Heavily AJAX-based. All code written in Smalltalk: Seaside 3.0, Seaside JQuery integration and JQWidgetBox.
Persistency: Glorp (OR mapper) and OpenDBX (DB client)
Databases: large PostgreSQL and MS SQL Server DBs
From the Wikipedia article, it's a total pig. Prior to that, it hadn't scaled to the point where I'd heard of it. :)

How to prevent hackers from scraping our database? [duplicate]

Possible Duplicate:
How do you stop scripters from slamming your website hundreds of times a second?
I am building a web application in RubyOnRails, which is based on a large body of data. The application makes for powerful navigation and intersection of the data, as well as a community model for adding more data.
In that respect one could compare it with StackOverflow.com: a big bunch of data, structured in a fairly simple way.
I intend to offer the content under a CreativeCommons license, but if the site "hits it off", I need to discourage copycats. My biggest fear is screen scraping scripters, not only leeching away the raw data, but also incurring huge usage peaks on my servers.
I wonder if RubyOnRails offers any way to throttle (obviously automated) requests, e.g. to reduce their response-time at the benefit of regular users. Perhaps this requires Apache or Phusion Passenger settings?
EDIT: My target is not to recognize user types, but to reduce responsiveness to overly active users, e.g. cap the number of requests handled per IP address per unit of time (?)
My suggestion would be to limit any easy iterative navigation of your website, which is the primary way I have seen harvesting programs work. Simply encrypting the ID numbers used as GET variables would make strip-mining your info more difficult. You can only try to make getting your information onerous. You won't be able to prevent it completely.
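One way to make IDs hard to iterate is sketched below. Note that it swaps in a keyed hash (HMAC) with a lookup table rather than true encryption, because that needs only the standard library; `SECRET_KEY`, `public_token`, `fetch` and the sample records are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"   # assumption: known only to the server


def public_token(record_id):
    """Derive a non-sequential token to expose in URLs instead of the raw ID."""
    digest = hmac.new(SECRET_KEY, str(record_id).encode(), hashlib.sha256)
    return digest.hexdigest()[:16]


records = {1: "first", 2: "second", 3: "third"}             # sample data
by_token = {public_token(rid): rid for rid in records}      # token -> real ID


def fetch(token):
    """Resolve a URL token; raw or guessed IDs do not match anything."""
    record_id = by_token.get(token)
    return None if record_id is None else records[record_id]
```

A scraper can no longer walk `?id=1`, `?id=2`, `?id=3`; it has to discover each token by following links, which is slower and easier to throttle. As the answer says, this makes harvesting onerous rather than impossible.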
You could present a captcha to the "overly active users", just like SO does when you edit too fast. That should effectively hinder automatic, spider-like scraping.
You might also want to look into using some Rack middleware to do rate limiting, like this recent article covered for doing API limiting (such as what you'd want at Twitter or similar).
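The core of such a per-IP rate limiter is small. The sketch below is a sliding-window counter written in Python as a language-neutral illustration; it is not Rack middleware or any library's API. The class name, the limits and the injectable clock are assumptions, and the IP addresses come from documentation ranges.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_s seconds."""

    def __init__(self, max_requests, window_s, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock                 # injectable so it can be tested
        self.hits = defaultdict(deque)     # ip -> timestamps of allowed requests

    def allow(self, ip):
        now = self.clock()
        hits = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                   # over the limit: reject, delay or captcha
        hits.append(now)
        return True


fake_now = [0.0]                           # fake clock for the demonstration
limiter = SlidingWindowLimiter(3, 10.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("203.0.113.7") for _ in range(5)]
```

In a real deployment the counters would live in a store shared by all application processes, and a rejected request could be slowed down or sent to a captcha instead of refused outright.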
I believe all you can do is put up hoops for the user to jump through. Ultimately there is no foolproof way to distinguish a regular user from a bot.

Resources