Related
I want to build an application that would handle CRM, client database, time tracking etc.
The app would be sold to about 100 clients. The thing is that every one of clients will have slightly different needs i.e. they want to integrate app with their own accounting software or have added functionality only for them and in their instance, they want only those modules, that they are using.
I'll need to manage 100 different instances of the same app, so I want it to be easy. For example if I'd fix a bug for one instance I want it to be easy to propagate this bugfix to all other instances and if I create a module for one client I want to easily add it to another client's instance if he would want to.
What is the best way to do this in this situation?
I would be really thankful if someone would point me to the right direction ;)
Generally this is handled by making things data-driven. Everybody gets the same code, but with different feature-flags which enable/disable various bits of the application.
Feature flags can be driven in a database, config files, environment variables, etc depending on your needs.
When we mention internal links for a website should the internal links be mentioned with the domain or with /. Which would be better from the SEO point of view. For example is my page is www.testdomain.com/about.htm, and I give an internal link to this page from another page, should I mention the internal link as
About
or
About
Which would suit SEO better? Thanks in advance.
From an SEO standpoint: no difference whatsoever.
From a maintenance standpoint: please go with About
Although there is no difference in both styles but i would say stick to the standard method. In this case i will say "About" (second one) is the right way..
There are a couple of really good reasons to code relative URLs
1) It is much easier and faster to code.
When you are a web developer and you're building a site and there thousands of pages, coding relative versus absolute URLs is a way to be more efficient. You'll see it happen a lot.
2) Staging environments
Another reason why you might see relative versus absolute URLs is some content management systems -- and SharePoint is a great example of this -- have a staging environment that's on its own domain. Instead of being example.com, it will be examplestaging.com. The entire website will basically be replicated on that staging domain. Having relative versus absolute URLs means that the same website can exist on staging and on production, or the live accessible version of your website, without having to go back in and recode all of those URLs. Again, it's more efficient for your web development team. Those are really perfectly valid reasons to do those things. So don't yell at your web dev team if they've coded relative URLS, because from their perspective it is a better solution.
Relative URLs will also cause your page to load slightly faster. However, in my experience, the SEO benefits of having absolute versus relative URLs in your website far outweigh the teeny-tiny bit longer that it will take the page to load. It's very negligible. If you have a really, really long page load time, there's going to be a whole boatload of things that you can change that will make a bigger difference than coding your URLs as relative versus absolute.
Page load time, in my opinion, not a concern here. However, it is something that your web dev team may bring up with you when you try to address with them the fact that, from an SEO perspective, coding your website with relative versus absolute URLs, especially in the nav, is not a good solution.
There are even better reasons to use absolute URLs
1) Scrapers
If you have all of your internal links as relative URLs, it would be very, very, very easy for a scraper to simply scrape your whole website and put it up on a new domain, and the whole website would just work. That sucks for you, and it's great for that scraper. But unless you are out there doing public services for scrapers, for some reason, that's probably not something that you want happening with your beautiful, hardworking, handcrafted website. That's one reason. There is a scraper risk.
2) Preventing duplicate content issues
But the other reason why it's very important to have absolute versus relative URLs is that it really mitigates the duplicate content risk that can be presented when you don't have all of these versions of your website resolving to one version. Google could potentially enter your site on any one of these four pages, which they're the same page to you. They're four different pages to Google. They're the same domain to you. They are four different domains to Google.
But they could enter your site, and if all of your URLs are relative, they can then crawl and index your entire domain using whatever format these are. Whereas if you have absolute links coded, even if Google enters your site on www. and that resolves, once they crawl to another page, that you've got coded without the www., all of that other internal link juice and all of the other pages on your website, Google is not going to assume that those live at the www. version. That really cuts down on different versions of each page of your website. If you have relative URLs throughout, you basically have four different websites if you haven't fixed this problem.
Again, it's not always a huge issue. Duplicate content, it's not ideal. However, Google has gotten pretty good at figuring out what the real version of your website is.
You do want to think about internal linking, when you're thinking about this. If you have basically four different versions of any URL that anybody could just copy and paste when they want to link to you or when they want to share something that you've built, you're diluting your internal links by four, which is not great. You basically would have to build four times as many links in order to get the same authority. So that's one reason.
3) Crawl Budget
The other reason why it's pretty important not to do is because of crawl budget. I'm going to point it out like this instead.
When we talk about crawl budget, basically what that is, is every time Google crawls your website, there is a finite depth that they will. There's a finite number of URLs that they will crawl and then they decide, "Okay, I'm done." That's based on a few different things. Your site authority is one of them. Your actual PageRank, not toolbar PageRank, but how good Google actually thinks your website is, is a big part of that. But also how complex your site is, how often it's updated, things like that are also going to contribute to how often and how deep Google is going to crawl your site.
It's important to remember when we think about crawl budget that, for Google, crawl budget cost actual dollars. One of Google's biggest expenditures as a company is the money and the bandwidth it takes to crawl and index the Web. All of that energy that's going into crawling and indexing the Web, that lives on servers. That bandwidth comes from servers, and that means that using bandwidth cost Google actual real dollars.
So Google is incentivized to crawl as efficiently as possible, because when they crawl inefficiently, it cost them money. If your site is not efficient to crawl, Google is going to save itself some money by crawling it less frequently and crawling to a fewer number of pages per crawl. That can mean that if you have a site that's updated frequently, your site may not be updating in the index as frequently as you're updating it. It may also mean that Google, while it's crawling and indexing, may be crawling and indexing a version of your website that isn't the version that you really want it to crawl and index.
So having four different versions of your website, all of which are completely crawlable to the last page, because you've got relative URLs and you haven't fixed this duplicate content problem, means that Google has to spend four times as much money in order to really crawl and understand your website. Over time they're going to do that less and less frequently, especially if you don't have a really high authority website. If you're a small website, if you're just starting out, if you've only got a medium number of inbound links, over time you're going to see your crawl rate and frequency impacted, and that's bad. We don't want that. We want Google to come back all the time, see all our pages. They're beautiful. Put them up in the index. Rank them well. That's what we want. So that's what we should do.
Wicket uses the Session heavily which could mean “large memory footprint” (as stated by some developers) for larger apps with lots of pages. If you were to explain to a bunch of CTOs from Fortune 500 that they have to adopt Apache Wicket for their large web application deployments and that their fears about Wicket problems with scaling are just bad assumptions; what would you argue?
PS:
The question concerns only
scaling.
Technical details and real world
examples are very welcomed.
IMO credibility for Apache Wicket in very large scale deployment is satisfied with the following URL: http://mobile.walmart.com View the source.
See also http://mexico.com, http://vegas.com, http://adscale.de, and look those domains up with alexa to see their ranking.
So, yes it is quite possible to build internet scale applications using Wicket. But whether or not you are using Wicket, Struts, SpringMVC, or just plain old JSPs: internet scale software development is hard. No framework can make that easy for you. No framework can give you software with a next-next-finish wizard that services 5M users.
Well, first of all, explain where the footprint comes from, and it is mainly the PageMap.
The next step would be to explain what a page map does, what is it for and what problems it solves (back buttons and popup dialogs for example). Problems, which would have to be solved manually, at similar memory costs but at a much bigger development cost and risk.
And finally, tell them how you can affect what goes in the page map, the secondary page cache and thus how the size can be kept under control.
Obviously you can also show them benchmarks, but probably an even better bet is to drop a line to Martijn Dashorst (although I believe he's reading this post anyway :)).
In any case, I'd try to put two points across:
There's nothing Wicket stores in memory which you wouldn't have to store in memory anyway. It's just better organised, easier to develop, keep consistent, and test.
Java itself means that you're carrying some inevitable excess baggage around all the time. If they are so worried about footprint, maybe Java isn't the language they want to use at all. There are hundreds of large traffic websites written in other languages, so that's a perfectly workable solution. The worst thing they can do is to go with Java, take on the excess baggage and then not use the advantages that come with an advanced framework.
Wicket saves the last N pages in the session. This is done to be able to load the page faster when it is needed. It is needed mostly in two cases - using browser back button or in Ajax applications.
The back button is clear, no need to explain, I think.
About Ajax - each ajax requests needs the current page (the last page in the session cache) to find a component in it and call its callback method, update some model, etc.
From their on the session size completely depends on your application code. It will be the same for any web framework.
The number of pages to cache (N above) is configurable, i.e. depending on the type of your application you may tweak it as your find appropriate. Even when there is no inmemory cache (N=0) the pages are stored in the disk (again configurable) and the page will be find again, just it will be a bit slower.
About some references:
http://fabulously40.com/ - social network with many users,
several education sites - I know two in USA and one in Netherlands. They also have quite a lot users,
currently I work on a project that expects to be used by several million users. Wicket 1.5 will be improved wherever we find hotspots.
Send this to your CTO ;-)
Seaside is known as "the heretical web framework". One of the points that make it heretical is that it has much shared state. That however is something which, in my current understanding, hinders easy scaling.
Ruby on rails on the other hand shares as less state as possible. It has been known to scale pretty well, even if it is dog slow compared to modern smalltalk vms. flickr uses php and has scaled to an extremly big infrastructure...
So has anybody some experience in the scaling of Seaside?
Ramon Leon shares some of his experience on upscaling seaside on his (excellent) blog. You can read very concrete ideas with sample code about configuring and tuning seaside.
Enjoy :-)
http://onsmalltalk.com/scaling-seaside-more-advanced-load-balancing-and-publishing
http://onsmalltalk.com/scaling-seaside-redux-enter-the-penguin
http://onsmalltalk.com/stateless-sitemap-in-seaside
Short answer:
you can scale Seaside applications like hell yah
Long answer:
In the IT domain, scaling is one thing but it has two dimensions:
horozontal
vertical
Almost everybody thought about scaling in the vertical dimension. That was until intel and friends reached some physical barriers and started to add cores to compensate the current impossibility of adding MHz.
That's when all we started to be more aware of scaling horizontally as the way to go.
Why am I telling you this?
Because Seaside is a smalltalk image running in a VM and that is roughly the same situation of a system in a server of a monocore processor.
Taking that as foundation, you scale web apps by making a cluster of servers. It's the natural thing to do, it's the fault tolerant thing to do, is the topologically intelligent thing to do, is the flexible thing to do, I guess you get the idea...
So, if for scaling, you do the same as intel & friends, you embrace the horizontal way. And it's even cheaper that the vertical way (that will lead you to IBM and Sun servers that are as good as expensive).
RoR applications are typically scaled horizontally. Google has countless cheap servers to do their thing. It works perfectly fine no matter how dramatized people want's to impress you throwing at you a bunch of forgettable twitter whales.
If they talk to you about that, you just be polite and hear what they say but remember this:
perfect is the enemy of the good
the unfinished perfect will never be as value as the good thing done
BTW, Amazon does something like that too (and it kind of couple geolocation so they enhance the chances of attending your requests with the cluster that is closest to your location).
On the other hand, the way Avi scaled dabbledb (Seaside based web application company bought by twitter) was using one vm per customer account (starting up and shutting down those on demand).
Having a lot of state in an image doesn't make scaling impossible nor complicated.
Just different.
The way to go is with a load balancer that uses sticky sessions so you can have one image attenting all the requests of an user session. You make things so any worker-image behind the load balancer can attend any user of a given app. And that's pretty much it.
To be able to do that you need to share the persistent objects among workers. All the users databases needs to be accessible by the workers at anytime and need to deal well with concurrency.
We designed airflowing scalable in that way.
It's economically convenient too because you can start with N very small (depending on the RAM of your first server) and increase it on demand until you reach the hardware limit.
Once you reach the hardware limit, you just add another host to the cluster and recofigure the balancer (and the access to the databases).
Simple, economic and elegant.
http://dabbledb.com/ seems to scale quite well. Moreover, one can use GemStone GLASS to run Seaside.
On this interview Avi Bryant the creator of Seaside and Co founder DabbleDB
Explains how they make it scale.
From what I understand:
each customer has it's own Squeak
Image.
When a customer comes Apache decides based on the user name which port to send it to.
Based on the port it starts the customer's Squeak Image.
That way it can grow to an infinite number of servers.
I think this solution works for them based on the specifics of their application each customer doesn't need to share info between them. So no need for o centralized DB.
Anyway it is better to watch the interview rather than my half-made summary.
Yes, Seaside scales down fantastically. A single developer can create and maintain complex applications for small groups very well.
[coming back to this after a few years]
This actually is much more important than scaling up. Computer speed still grows a lot, and 99% of all applications can now run on one machine. Speed of development, and especially maintenance now totally dominates TCO.
I would rephrase your question slightly to: does Seaside prevent/discourage you from creating applications that scale? I would say usually no. Seaside doesn't have a default way to store your data (just like php on its doesn't, though Seaside gives you a few extra options) and my impression is interacting with your data tends to be the biggest hurdle when it comes to scaling.
If you want to store your data in a monolithic SQL db, like with rails, you can do that. Or you can use an object database. Or you can use a separate object database for each user, or separate db for each project, or a separate db for each user and project. Or you can store everything in a series of flat files or you can just store your data as objects in your VM's memory.
And because of continuations you don't need re-fetch your data out of your datastore on every webpage call. As when you are using a desktop application you can pull data out of your datastore when your user begins interacting with it, set the appropriate variables, and then use those variables between webcalls until the user is done with the data at which point you can update the datastore. When you don't share state you have to hit the datastore on every single webcall.
Of course this doens't mean scaling is free, it just means you have a larger domain in which to search for scaling solutions.
All that said, for many applications rails will scale much easier simply because there exist large hosting solutions for rails (and php for that matter) that will offer you a huge amount of resources without having to rent and setup a custom box.
These are just my impressions from reading and talking to people.
I just reminded that there is link on Pharo's success stories : a Seaside Web application with up to 1000 concurrent users for a large health insurance in Argentina .
Pharo success stories
Issys Tracking:
Load balancing: Apache as a proxy/balancer (round robin with session
affinity)
Server setup: 5 Pharo images on 3 different servers (Linux and Windows 2003)
GUI: Heavily AJAX-based. All code written in Smalltak: Seaside 3.0, Seaside JQuery integration and JQWidgetBox.
Persistency: Glorp (OR mapper) and OpenDBX (DB client)
Databases: large PostgreSQL and MS SQL Server DBs
From the Wikipedia article, it's a total pig. Prior to that, it hadn't scaled to the point where I'd heard of it. :)
After much reading on ruby on rails and multiple database connections, it seems that I have found something that not that many folks do, at least not with ror. I am used to querying many different databases and schemas and pulling back the information either for a report or for one seamless page. So, a user doesn't have to log on to several different systems. I can create a page that has all the systems on one or two web pages.
Is that not a normal occurrence in the web and database driven design?
EDIT: Is this because most all my original code is in classic asp?
I really honestly think that most ORM designers don't seem to take the thought that users may want to access more than one database into account. This seems to be a pretty common limitation in the ORM universe.
Our client website runs across 3 databases, so I do this to. Actually, I'm condensing everything into views off of one central database which then connects to the others.
I never considered this to be "normal" behavior though. I would guess that most of the time you would be designing for one system and working against that.
EDIT: Just to elaborate, we use Linq to SQL for our data layer and we define the objects against the database views. This way we keep reports and application code working off the same data model. There is some extra work setting up the Linq entities, because you have to manually define primary keys and set up associations... however so far it has definitely proven worthwhile. We tried to do so with Entity Framework, but had a lot of trouble getting the relationships set up appropriately and had to give up. The funny thing is I had thought Entity Framework was supposed to be designed for more advanced scenarios like ours...
It is not uncommon to hit multiple databases during a single part of an application's workflow. However, in every instance that I have done it, this has been performed through several web service calls, which among other things wrap the databases in question.
I have not, to my knowledge, ever had a need to hit multiple databases directly at once and merge results into a single report.
I've seen this kind of architecture in corporate Portals- where lots of data is pulled in via different data sources. The whole point of a portal is to bring silo'd systems together- users might not want to be using lots of systems in isolation (especially if they have to sign into each one). In that sort of scenario it is normal, particularly if it is a large company that has expanded rapidly and has a large number of heterogenous systems.
In your case whether this is the right thing to do depends on why you have these seperate DBs.
With ORM's it may be a little difficult. However, it can be done. Pull the objects as needed from the various databases, then use them as a composite to create a new object that is the actual one that is desired. If you can skip the ORM part of the process, then you can directly query the databases and build your object directly.
Pulling data from two databases and compiling a report is not uncommon, but because cross-database queries cannot be optimized by the query engine of either database, OLTP systems typically use a single database, to keep the application performant.
If you build the system from the ground up, it is not advisable to do it this way. If you are working with a system you didn't design, there is no much choice and it is not uncommon (that is the difference between "organic" and "planned" grow).
Not counting master and various test instances, I hit nine databases on a regular basis. Yes, I inherited it, and yes, "Classic" ASP figures prominently. Of course, all the "brillant" designers of this mess are long gone. We're replacing it with things more sane as quickly as we safely can.
I would think that if you're building a new system, and keep adding databases and get to the point of two or three databases, it's probably time to re-think your design. OTOH, if you're aggregating data from multiple, disparate systems, then, no, it's not that strange. Depending on the timliness you need, and your budget for throwing hardware at the problem, and if your data is mostly static, this would be a good scenario for a "reporting server" that pulls the data down from the Live server periodically.