Scalability of LanguageTool

I'd like to scale the LanguageTool HTTP Server so it can handle a large number of user requests at a time and process very large texts. Which is the best approach to achieve this?

I'm the author of LanguageTool. The short answer is that you can scale it like any HTTP service: run several instances and put them behind a load balancer. Because the LanguageTool server is stateless, this should be easy. The longer answer requires more information: how long are the documents and in which languages are they? Do you need spell checking or only the features that go beyond spell checking? LanguageTool is much slower for some languages than for others. For example, English is quite slow, while most languages with a low number of rules (see https://languagetool.org/languages/) are faster.
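Since no instance keeps state, any instance can check any piece of a document. Here is a minimal sketch of that idea, not a definitive setup: it splits a long text on blank lines and assigns the chunks to instances in round-robin order. The instance URLs and the 5000-character chunk size are assumptions for illustration; `/v2/check` with `text` and `language` form parameters is the server's HTTP check endpoint. In production a load balancer would normally do the distribution for you.

```python
import itertools
import json
import urllib.parse
import urllib.request

# Hypothetical instance addresses; replace with your own servers.
INSTANCES = [
    "http://lt-1.internal:8081",
    "http://lt-2.internal:8081",
    "http://lt-3.internal:8081",
]


def split_into_chunks(text, max_chars=5000):
    """Group paragraphs (separated by blank lines) into chunks of at most
    max_chars characters. A single oversized paragraph becomes its own chunk."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def assign_round_robin(chunks, instances=INSTANCES):
    """Pair each chunk with an instance, cycling through the instance list."""
    return list(zip(itertools.cycle(instances), chunks))


def check_chunk(base_url, chunk, language="en-US"):
    """POST one chunk to a LanguageTool server and return its list of matches."""
    data = urllib.parse.urlencode({"text": chunk, "language": language}).encode()
    with urllib.request.urlopen(base_url + "/v2/check", data=data, timeout=30) as resp:
        return json.load(resp)["matches"]
```

Note that the offsets in each response are relative to the chunk that was sent, so they need to be shifted if you want positions in the whole document.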

Related

Is JSP not a good choice as a server side scripting language for consumer-scale multinational portals? e.g. for Facebook/Gmail scale or beyond?

The site/app would be AJAX-heavy and will have UI mashups. Would using an AJAX toolkit or a lot of JavaScript in the web page go against scale goals? In other words, would the extra bandwidth consumption due to JavaScript-heaviness affect the bottom line?
Sorry to ask two questions together, somehow they are related to overall SCALE goal at client tier.
Thanks in advance,
Deb
Sorry, but your question is unclear.
You first ask about JSP being a good choice, then talk about scalability. The two must be treated separately.
JSP is a server-side platform, just like PHP and ASP.NET. Whether it scales depends only on the design of your web application, and with a good design it's perfectly scalable. Actually, you can choose any of the listed platforms if your goal is scalability, given a few more hints.
Second, JavaScript and bandwidth consumption. If you have a good (and I mean good) AJAX toolkit, then I suppose most of the JS content is static. I mean that most functions and class libraries are stored in static JS files that don't change during the software's lifetime, and that's exactly what we want! The only parts that change are the page-level scripting and the XMLHTTP responses.
Now, all the libraries can be cached by browsers, dramatically reducing bandwidth consumption.
My hint is to use a static content domain, possibly powered by a Content Distribution Network.
This will take a lot of load off your JSP-busy servers in a scalable application.
Remember: your web application must be correctly designed to be scalable (don't rely on session variables, for example).
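The browser-caching point can be made concrete with response headers. This is only a sketch: `cache_headers` is a hypothetical helper, the extension list and the one-year `max-age` are illustrative assumptions, and it presumes that library files carry a version in their name so a long cache lifetime is safe.

```python
import os

# Extensions treated as static assets (an assumption for this sketch).
STATIC_EXTENSIONS = {".js", ".css", ".png", ".gif", ".jpg"}


def cache_headers(path):
    """Return HTTP caching headers for a request path."""
    extension = os.path.splitext(path)[1].lower()
    if extension in STATIC_EXTENSIONS:
        # Static libraries: let browsers and CDNs keep them for a year.
        return {"Cache-Control": "public, max-age=31536000"}
    # Page-level output and XMLHTTP responses must be revalidated each time.
    return {"Cache-Control": "no-cache"}
```

With headers like these, a toolkit file is downloaded once per browser and afterwards only the dynamic responses travel over the wire.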

What really is scaling?

I've heard people say that they've made a scalable web application.
What really is scaling?
What can be done by developers to make their application scalable?
What are the factors that are looked after by developers during scaling?
Any tips and tricks about scaling web applications with ASP.NET and SQL Server?
What really is scaling?
Scaling is the increase in capacity and/or usage of your application.
What do developers do to make their application scalable?
Either allow their applications to scale vertically or horizontally.
Horizontal scaling is about doing things in parallel.
Vertical scaling is about doing things faster. This typically means more powerful hardware.
Often when people talk about horizontal scalability the ideal is to have (near-)linear scalability. This means that if one $5k production box can handle 2,000 concurrent users then adding 4 more should handle 10,000 concurrent users. The closer it is to that figure the better.
The ideal for highly scalable apps is to have near-limitless near-linear horizontal scalability such that you can just plug in another box and your capacity increases by an expected amount with little or no diminishing returns.
Ideally redundancy is part of the equation too but that's typically a separate issue.
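The arithmetic behind "near-linear" can be written down directly. A small sketch using the figures from the example above (5 boxes at 2,000 users each); the function names and the measured figure of 8,500 users in the usage note are hypothetical:

```python
def ideal_capacity(boxes, users_per_box):
    """Capacity under perfectly linear horizontal scaling."""
    return boxes * users_per_box


def scaling_efficiency(measured_users, boxes, users_per_box):
    """Fraction of the ideal linear capacity actually achieved (1.0 = linear)."""
    return measured_users / ideal_capacity(boxes, users_per_box)
```

`ideal_capacity(5, 2000)` gives the 10,000 users mentioned above. If a load test on those five boxes topped out at, say, 8,500 users, `scaling_efficiency(8500, 5, 2000)` would report 85% of linear.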
The poster child for this kind of scalability is, of course, Google.
What are the factors that are looked after by developers during scaling?
How much scale should be planned for? There's no point spending time and money on a problem you'll never have;
Is it possible and/or economical to scale vertically? This is the preferred option as it is typically much, much cheaper (in the short term);
Is it worth the (often significant) cost to enable your application to scale horizontally? Distributed/multithreaded apps are significantly more difficult and expensive to write.
Any tips and tricks about scaling web applications...
Yes:
1. Don't worry about problems you'll never have;
2. Don't worry much about problems you're unlikely to have. Chances are things will have changed long before you have them;
3. Don't be afraid to throw away code and start again. Having automated tests makes this far easier; and
4. Think in terms of developer time being expensive.
(4) is a key point. You might have a poorly written app that will require $20,000 of hardware to essentially fix. Nowadays $20,000 buys a lot of power (64+GB of RAM, 4 quad core CPUs, etc), probably more than 99% of people will ever need. Is it cheaper just to do that or spend 6 months rewriting and debugging a new app to make it faster?
It's easily the first option.
So I'll add another item to my list: be pragmatic.
My 2c definition of "scalable" is a system whose throughput grows linearly (or at least predictably) with resources. Add a machine and get 2x throughput. Add another machine and get 3x throughput. Or, move from a 2p machine to a 4p machine, and get 2x throughput.
It rarely works linearly, but a well-designed system can approach linear scalability. Add $1 of HW and get 1 unit worth of additional performance.
This is important in web apps because the potential user base is roughly 1 billion people.
Contention for resources within the app, when it is subjected to many concurrent requests, is what causes scalability to suffer. The end result of such a system is that no matter how much hardware you use, you cannot get it to deliver more throughput. It "tops out". The HW-cost versus performance curve goes asymptotic.
For example, if there's a single app-wide in-memory structure that needs to be updated for each web transaction or interaction, that structure will become a bottleneck, and will limit scalability of the app. Adding more CPUs or more memory or (maybe) more machines won't help increase throughput - you will still have requests lining up to lock that structure.
Often in a transactional app, the bottleneck is the database, or a particular table in the database.
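One common way to model this "topping out" is Amdahl's law. The answer above does not name it, so treat this as a swapped-in illustration: if some fraction of every request has to be serialized (for example, time spent holding the lock on that shared structure), the speedup from adding workers is bounded by 1 / serial_fraction. The 10% serial fraction below is an assumption.

```python
def amdahl_speedup(serial_fraction, workers):
    """Speedup predicted by Amdahl's law for a given number of workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# With 10% of each request serialized, throughput gains flatten out quickly.
for workers in (1, 2, 10, 100, 1000):
    print(workers, round(amdahl_speedup(0.10, workers), 2))
```

However many workers you add, the speedup in this model approaches but never reaches 10x, which is the asymptotic curve described above.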
What really is scaling?
Scaling means accommodating increases in usage and data volume, and ideally the implementation should be maintainable.
What do developers do to make their application scalable?
Use a database, but cache as much as possible while accommodating user experience (possibly in the session).
Any tips and tricks about scaling web applications...
There are lots, but it depends on the implementation. What programming language(s), what database, etc. The question needs to be refined.
Scalable means that your app is prepared for (and capable of handling) future growth. It can handle higher traffic, more activity, etc. Making your site more scalable can entail various things. You may work on storing more in cache, rather than querying the database(s) unnecessarily. It may entail writing better queries, to keep connections to a minimum, and resources freed up.
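To illustrate the "cache instead of querying again" advice, here is a minimal sketch using Python's `functools.lru_cache`. The `query_database` function is a hypothetical stand-in for a real database call, and the cache size is arbitrary.

```python
import functools

query_count = 0  # counts how often the (simulated) database is hit


def query_database(user_id):
    """Stand-in for a real database call; hypothetical, for illustration."""
    global query_count
    query_count += 1
    return {"id": user_id, "name": f"user-{user_id}"}


@functools.lru_cache(maxsize=1024)
def get_user(user_id):
    """Return the user record, querying the database only on a cache miss."""
    return query_database(user_id)
```

Repeated calls to `get_user(1)` hit the database once. The hard part in a real application is invalidation: when the underlying row changes, the cached entry has to be dropped (here, `get_user.cache_clear()` would empty the whole cache).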
Resources:
Seattle Conference on Scalability (Video)
Improving .NET Application Performance and Scalability (Video)
Writing Scalable Applications with PHP
Scalability, on Wikipedia
Books have been written on this topic. An excellent one, which targets internet applications but describes principles and practices that can be applied in any development effort, is Scalable Internet Architectures.
May I suggest a "user-centric" definition:
Scalable applications provide a consistent level of experience to each user irrespective of the number of users.
For web applications this means 24/7 anywhere in the world. However, given the diversity of the available bandwidth around the world and developer's lack of control over its performance and availability, we may re-define it as follows:
Scalable web applications provide a consistent response time, measured at the server TCP port in use, irrespective of the number of requests.
To achieve this, the developer must avoid or remove all performance bottlenecks. Currently the most challenging issue is the scalability of distributed RDBMS systems.

How different is Amazon Simple DB from Apache CouchDB?

Other than the monetary aspects, how different is Amazon's SimpleDB from Apache's CouchDB in the following terms:
Interfacing with programming languages like Java, C++ etc
Performance and Scalability
Installation and maintenance
I'm a fairly heavy SimpleDB user (I'm the developer of http://www.backupsdb.com/) but am currently migrating some projects off SimpleDB and into Couch, so I guess I can see this from both sides now.
1. Interfacing with programming languages like Java, C++ etc
Easier with Couch, as you can talk to it very easily using JSON. SimpleDB is a bit more work, largely due to the complexity of signing each request for security and the lower-level access you get, which requires you to implement exponential backoff in case of busy signals, etc. You can get good libraries for SimpleDB in many languages now, though, and this takes the pain away in many respects.
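For readers who haven't had to write the retry logic themselves, this is roughly what it looks like. It is a generic sketch, not SimpleDB's client API: `ServiceBusy` is a hypothetical exception standing in for a "busy, try again" response, and the attempt count and base delay are arbitrary.

```python
import random
import time


class ServiceBusy(Exception):
    """Hypothetical error standing in for a 'busy, try again' response."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call request(); on ServiceBusy wait base_delay * 2**attempt (plus random
    jitter) and retry. The last failure is re-raised after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ServiceBusy:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The doubling delay gives an overloaded service room to recover, and the jitter keeps many clients from retrying in lockstep. The `sleep` parameter exists only so the waiting can be replaced in tests.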
2. Performance and Scalability
I don't have any benchmarks, but for my own use case, CouchDB outperforms SimpleDB. It's harder to scale though - SimpleDB is great at that, you chuck more at it and it autoscales around you.
There are lots of occasionally irritating limits in SimpleDB though, limits on the number of attributes, size of attributes, number of domains etc. The main annoyance for many applications is the attribute size limit which means you can't store large forum posts for example. The workaround is to offload those into something else such as S3, but it's a bit annoying at times. Obviously CouchDB doesn't have that issue and indeed the fact that you can attach large files to documents is one thing that particularly attracts me to it.
Scaling-wise, you should possibly also look at BigCouch, which gives you a distributed cluster and is closer to what you get with SDB.
3. Installation and Maintenance
I actually found it much easier with CouchDB. I suspect it depends on which library you need to use for SimpleDB, but when I was starting with it, the Amazon supplied libraries weren't very mature and the open source community ones had various issues that meant getting up and running and doing something serious with it took more time than I would have liked. I suspect this is much better now.
CouchDB was surprisingly easy to install and I love its web interface. Indeed, that would be my major criticism of SimpleDB: Amazon still don't have any form of web console for it, despite having web consoles for almost every other service. That's why we wrote the very basic BackupSDB, just so we could extract data in XML and run queries from a web browser. I'd like to have seen Amazon do something similar (but more powerful and better) by now, and have been very surprised that they haven't. There are lots of third-party Firefox plugins and some applications for it, though I have the impression that SimpleDB isn't that widely used; this is only a hunch really.
4. Other Observations
The biggest issue, I think, is that with SimpleDB you are entrusting all your data to a third party with no easy way of getting it out (you'll need to write something to do that), plus your costs keep gently rising. When you get to the point that the cost is comparable to a powerful dedicated database server, you kind of feel you'd get better value that way, but the migration headache is non-trivial by this point, as you'll have a large commitment to the cloud.
I started off as a huge Amazon evangelist, and for most things I still am, but when it comes to SDB, I feel it's a bit of a hobby project for Amazon, the way the Apple TV was for Steve Jobs.

What is Erlang's secret to scalability?

Erlang is getting a reputation for being untouchable at handling a large volume of messages and requests. I haven't had time to download and try to get inside Mr. Erlang's understanding of switching theory... so I'm wondering if someone can teach me (or point to a good instructional site.)
Say as a thought-experiment I wanted to port the Erlang ejabberd to a combination of Python and C, in a way that gave me the same speed and scalability. What structures or patterns would I have to understand and implement? (Does Python's Twisted already do this?)
How/why do functional languages (specifically Erlang) scale well? (for discussion of why)
http://erlang.org/course/course.html (for a tutorial chain)
As far as porting to other languages, a message passing system would be easy to do in most modern languages. Getting the functional style can be done in Python easily enough, although you wouldn't get the internal dispatching features of Erlang "for free". Stackless Python can replicate many of Erlang's concurrency features, although I can't speak to details as I haven't used it much. It does appear to be much more "explicit" (in that it requires you to define the concurrency in code in places where Erlang's design allows concurrency to happen internally).
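To show what "a message passing system" might look like in plain Python, here is a minimal actor-style sketch built on the standard `queue` and `threading` modules. The `Actor` class is hypothetical and is not how Erlang, Twisted or Stackless implement it.

```python
import queue
import threading


class Actor:
    """A worker with a private mailbox; other code interacts with it only by
    sending messages, never by touching its state directly."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Deliver a message to the mailbox without waiting for processing."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already queued, then shut the actor down."""
        self._mailbox.put(None)  # sentinel: no more messages
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            self._handler(message)
```

This also shows the "explicit" part: each actor here costs a full OS thread and the wiring is done by hand, whereas Erlang processes are far cheaper and scheduled by the runtime.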
Erlang is not only about scalability but mostly about
reliability
soft real-time characteristics (enabled by a soft real-time GC, which is possible because of immutability [no cycles], share-nothing processes and so on)
performance in concurrent tasks (cheap task switch, cheap process spawn, actors model, ...)
scalability - debatable in its current state, but rapidly evolving (it scales well to about 32 cores, which is better than most competitors, and should get better in the near future).
Another feature of Erlang that has an impact on scalability is its lightweight, cheap processes. Since processes have so little overhead, Erlang can spawn far more of them than most other languages. You get more bang for your buck with Erlang processes than many other languages give you.
I think the best fit for Erlang is network-bound applications: it makes communication between nodes much simpler, and things like heartbeat monitoring and automatic restart using supervisors are built into OTP.
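For anyone coming from Python, the supervisor idea can be approximated very roughly as "run the worker, and if it crashes, start it again up to some limit". The `supervise` function below is a hypothetical sketch of that one idea; it is not OTP's API and leaves out everything else a real supervisor does.

```python
def supervise(worker, max_restarts=3):
    """Run worker(); if it raises, start it again, up to max_restarts times.
    Returns (result, restarts). Re-raises once the restart budget is spent."""
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1
```

The point of having this built in is that worker code can simply crash on unexpected input instead of trying to handle every error itself.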

What weaknesses can be found in using Erlang?

I am considering Erlang as a potential for my upcoming project. I need a "Highly scalable, highly reliable" (duh, what project doesn't?) web server to accept HTTP requests, but not really serve up HTML. We have thousands of distributed clients (other systems, not users) that will be submitting binary data to central cluster of servers for offline processing. Responses would be very short, success, fail, error code, minimal data. We want to use HTTP since it is our best chance of traversing firewalls.
Given this limited information about the project, can you provide any weaknesses that might pop up using a technology like Erlang? For instance, I understand Erlang's text processing capabilities might leave something to be desired.
Your comments are appreciated.
Thanks.
This sounds like a perfect candidate for a language like Erlang. The scaling properties of the language are very good, but if you're worried about the data processing abilities, you shouldn't be. It's a very powerful language, with many libraries available for developers. It's an old language, and it's been heavily used/tested in the past, so everything you want to do has probably already been done to some degree.
Make sure you use Erlang version R11B5 or newer! Earlier versions of Erlang did not provide the ability to time out TCP sends. This lets stalled or malicious clients execute a DoS attack on your application by refusing to recv the data you send them, thus locking up the sending process.
See issue OTP-6684 from R11B5's release notes.
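The failure mode is not Erlang-specific, so here is the same idea sketched in Python rather than Erlang: a send that gives up when the peer stops reading. `send_with_timeout` is a hypothetical helper built on the standard `socket` module; the Erlang release notes referenced above remain the authority for the Erlang side.

```python
import socket


def send_with_timeout(sock, data, timeout_seconds):
    """Try to send all of data; return False if the peer does not accept it
    within timeout_seconds (for example because it never reads)."""
    sock.settimeout(timeout_seconds)
    try:
        sock.sendall(data)
        return True
    except socket.timeout:
        return False
```

When this returns False, the caller would normally close the connection, so a client that never reads cannot tie up the sender indefinitely.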
With Erlang the scalability and reliability are there, but your project definition doesn't outline what type of text processing you will need.
I think Erlang's main limitation might be finding experienced developers in your area. Do some research on the availability of Erlang architects and coders.
If you are going to teach yourself or have your developers learn it on the job, keep in mind that it is a very different way of coding and that, while the core documentation is good, a lot of people do wish there were more examples. Of course, the very active community easily makes up for that.
"I understand Erlang's text processing capabilities might leave something to be desired."
The starling project already provides basic Unicode support, and there is an EEP (Erlang Enhancement Proposal) currently in draft that aims to bring it into mainstream Erlang/OTP.
I encountered some problems with Redis read performance from Erlang. Here is my question. I tend to think the cause is the Erlang-written module, which has trouble processing large numbers of strings while communicating with Redis.

Resources