datomic and write scalability - scalability

http://www.datomic.com/faq.html tells that datomic is not designed for write scalability. So if there is a use case where write scalability is needed then datomic is not the right choice. Is this understanding right?
Further I understand that increasing number of transactors could help in getting higher write scalability. But that way if I want to achieve same write scalability as Cassandra then I have to use double the machines. Half for storage and half for transactors, thereby increasing the cost of cluster. As far as I understand this is not recommended by datomic (datomic advises to use more transactors for HA). What is groups/experts opinion and recommendation on this?

You are correct: Datomic is not optimized for write scalability. This doesn't mean that Datomic can't handle some fairly write-heavy operations—what it means is that write capacity can't be scaled infinitely.
It is incorrect to think that you can add additional transactors to get higher write scalability. The fundamental trade-off of Datomic's architecture is that it features a single transactor to serialize and broadcast changes, capping write capacity but providing exceptional read scalability. This is a good fit for the majority of projects, which do more reading than writing.
The only reason to run additional transactors is to provide high availability—all transactors except one will sit in a standby mode, ready to take over if the active transactor fails.
TLDR: You will not achieve the same write scalability as Cassandra, as Datomic and Cassandra are not architected to solve the same problems.

Related

What message storage is the best to use for ActiveMQ?

ActiveMq v 5.5 comes with default message storage configured as KahaDB. Does anyone use it in enterprise level solutions? Should it be replaced with MSSQL instead? And what benefits can each of them have?
The persistence mechanism should be based on your application's needs. A closely related concern is going to be failover/availability.
Speaking purely of the speed of message persistence, KahaDB is going to be the fastest; it's tuned specifically for patterns surrounding messaging (writing/reading/discarding). Were you to use something like MSSQL, even with journaling enabled, you're going to give up orders of magnitude (in mgs/sec) in efficiency. This setup works well if you are concerned with publishing high volumes of messages and are willing to leave message recovery up to an admin or some "invented-here" process.
So, why would you choose a different persistence mechanism? High availability.
Re: something like a relational database, it's probably something already available in your enterprise, meaning someone's (hopefully) gone through the effort of clustering and testing disaster recovery. This means you should be able to have a master/slave setup and your messages will be recoverable if a master were to go down. The slave will detect a loss of lock and start using the exact same message store. This setup is ideal if your performance threshold is much lower but you are extremely concerned about up-time and ensuring that you can always publish/subscribe messages.
Regardless, in a well-tuned system, we are talking >= hundreds msgs/sec so performance considerations are likely not going to be your first concern. Should performance really be that crucial, I'd consider looking at something like RabbitMQ, which definitely lends itself well to being extremely efficient at the cost of making high availability more complex to set up.
Here's a discussion on some of the failover options with ActiveMQ. I've settled on using a shared file master/slave setup with a KahaDB being shared on a SAN. Seems to be a nice middle-ground solution.

For distributed applications, which to use, ASIO vs. MPI?

I am a bit confused about this. If you're building a distributed application, which in some cases may perform parallel operations (although not necessarily mathematical), should you use ASIO or something like MPI? I take it MPI is a higher level than ASIO, but it's not clear where in the stack one would begin.
I know nothing about ASIO but from a quick Google it looks to me to be a lot lower level than MPI. For me the whole point of MPI is so that I can program against a higher level of abstraction from the messaging than, it seems, ASIO provides. Where you begin depends on your needs. For mine, parallelising scientific codes for high-performance, the obvious answer is MPI. I'm not sure I'd use it, or at least not sure it would be my default choice, if I were writing more general-purpose distributed, as opposed to parallel, applications. Well, actually, it probably would be my default choice to avoid learning another approach (most of which are less portable and less long-lived than MPI) but I'll admit it might not be the best choice if starting from an equal footing.
As far as I know MPI is currently incapable of handling the situation, when the new distributed nodes want to join the already started group. The problems also may occur if one of the nodes goes offline.
MPI does not reveal any network related machinery that is underneath. Thus if you would ever need something on the lower level -- you're in trouble. If you on the other hand do not aticipate such a need, then you'll save yourself a lot of time using MPI.

What really is scaling?

I've heard that people say that they've made a scalable web application..
What really is scaling?
What can be done by developers to make their application scalable?
What are the factors that are looked after by developers during scaling?
Any tips and tricks about scaling web applications with asp.net and sql server...
What really is scaling?
Scaling is the increasing in capacity and/or usage of your application.
What do developers do to make their application scalable?
Either allow their applications to scale vertically or horizontally.
Horizontal scaling is about doing things in parallel.
Vertical scaling is about doing things faster. This typically means more powerful hardware.
Often when people talk about horizontal scalability the ideal is to have (near-)linear scalability. This means that if one $5k production box can handle 2,000 concurrent users then adding 4 more should handle 10,000 concurrent users. The closer it is to that figure the better.
The ideal for highly scalable apps is to have near-limitless near-linear horizontal scalability such that you can just plug in another box and your capacity increases by an expected amount with little or no diminishing returns.
Ideally redundancy is part of the equation too but that's typically a separate issue.
The poster child for this kind of scalability is, of course, Google.
What are the factors that are looked after by developers during scaling?
How much scale should be planned for? There's no point spending time and money on a problem you'll never have;
Is it possible and/or economical to scale vertically? This is the preferred option as it is typically much, much cheaper (in the short term);
Is it worth the (often significant) cost to enable your application to scale horizontally? Distributed/multithreaded apps are significantly more difficult and expensive to write.
Any tips and tricks about scaling web applications...
Yes:
Don't worry about problems you'll never have;
Don't worry much about problems you're unlikely to have. Chances are things will have changed long before you have them;
Don't be afraid to throw away code and start again. Having automated tests makes this far easier; and
Think in terms of developer time being expensive.
(4) is a key point. You might have a poorly written app that will require $20,000 of hardware to essentially fix. Nowadays $20,000 buys a lot of power (64+GB of RAM, 4 quad core CPUs, etc), probably more than 99% of people will ever need. Is it cheaper just to do that or spend 6 months rewriting and debugging a new app to make it faster?
It's easily the first option.
So I'll add another item to my list: be pragmatic.
My 2c definition of "scalable" is a system whose throughput grows linearly (or at least predictably) with resources. Add a machine and get 2x throughput. Add another machine and get 3x throughput. Or, move from a 2p machine to a 4p machine, and get 2x throughput.
It rarely works linearly, but a well-designed system can approach linear scalability. Add $1 of HW and get 1 unit worth of additional performance.
This is important in web apps because the potential user base is ~1b people.
Contention for resources within the app, when it is subjected to many concurrent requests, is what causes scalability to suffer. The end result of such a system is that no matter how much hardware you use, you cannot get it to deliver more throughput. It "tops out". The HW-cost versus performance curve goes asymptotic.
For example, if there's a single app-wide in-memory structure that needs to be updated for each web transaction or interaction, that structure will become a bottleneck, and will limit scalability of the app. Adding more CPUs or more memory or (maybe) more machines won't help increase throughput - you will still have requests lining up to lock that structure.
Often in a transactional app, the bottleneck is the database, or a particular table in the database.
What really is scaling?
Scaling means accommodating increases in usage and data volume, and ideally the implementation should be maintainable.
What developers do to make their application scalable?
Use a database, but cache as much as possible while accommodating user experience (possibly in the session).
Any tips and tricks about scaling web applications...
There are lots, but it depends on the implementation. What programming language(s), what database, etc. The question needs to be refined.
Scalable means that your app is prepared for (and capable of handling) future growth. It can handle higher traffic, more activity, etc. Making your site more scalable can entail various things. You may work on storing more in cache, rather than querying the database(s) unnecessarily. It may entail writing better queries, to keep connections to a minimum, and resources freed up.
Resources:
Seattle Conference on Scalability (Video)
Improving .NET Application Performance and Scalability (Video)
Writing Scalable Applications with PHP
Scalability, on Wikipedia
Books have been written on this topic. An excellent one which targets internet applications but describes principles and practices that can be applied in any development effort is Scalable Internet Architectures
May I suggest a "User-Centric" definition;
Scalable applications provide a consistent level of experience to each user irrespective of the number of users.
For web applications this means 24/7 anywhere in the world. However, given the diversity of the available bandwidth around the world and developer's lack of control over its performance and availability, we may re-define it as follows:
Scalable web applications provide a consistent response time, measured at the server TCP port in use, irrespective of the number of requests.
To achieve this the developer must avoid or remove all performance bottle-necks. Currently the most challenging issue is the scalability of distributed RDBMS systems.

Evaluate software minimum requirements

Is there a way to evaluate the minimum requirements of a software? I mean, how can I discover, for example, the minimum amount of RAM that my application will need?
Thanks!
A profiler will not help you here. Neither will estimating the size of data structures.
A profiler can certainly tell you where your code is spending the most CPU time, but it will not tell you if you are missing performance targets - e.g. if your users will be happy, or unhappy with the performance of your application on any given system.
Simply computing the size of data structures, and how many may be allocated at any one time will not at all give you an accurate picture of memory usage over time. The reason is that memory usage is determined by many other factors including how much I/O your application does, what OS services your application uses, and most importantly the temporal nature of how your application uses memory.
The most effective way to understand minimum requirements is to
Make sure you have an effective way of measuring performance using metrics that are important to your user. the best metric is response time. Depending on your app, a rate such as throughput or operations per second may be applicable. Your measurements could be empirical (e.g. just try it) but that is least effective. This is best done with some kind of instrumentation. On windows, the choice is [ETW][1]. Other operating systems have other suitable mechanisms.
Have some kind of automated method of exercising your application. This will let you make repeated and reliable measurements.
Measure your application using various memory sizes and see where performance begins to suffer. This may also expose performance bugs that prevent your application from performing well. If you have access to platforms of various performance levels, use those as well. You didn't indicate what your app does, but testing on a netbook with 1GB of memory is great for many (not all) client applications.
You can do the same with the CPU and other components such as disk, networking or the GPU.
Also note that there is no simple answer here - doing an effective job at setting minimum requirements is real work. This is especially true if your application is participatory sensitive to one platform aspect or another.
There are other factors as well - for example, your app may run fine in one configuration until the user opens another application that may be memory hungry or a CPU pig. Users rarely only have one application open.
This means that in addition to specifying minimum requirements you must do an effective job in setting user expectations - that is explaining when your application will perform well, and when it won't, and what the factors are that impact performance.
[1]: http://msdn.microsoft.com/en-us/library/ms751538.aspxstrong text
Ideally, you'd decide on the minimum requirements of a piece of software based on your target audience, and then test your software during development on that configuration to ensure it delivers a satisfactory experience.
You can look at a system running your software and see how much memory is being consumed by your application, and use that to guide how much memory is being consumed. CPU is a little bit more complex - you could try to model your CPU requirements, but doing this accurately can be challenging.
But ultimately, you need to test your app on the base system you are targeting.
Given the data structures used by the application, estimate how much space they will take up in normal use. Using that estimation, set up a number of machines (virtual or physical) to test the estimate in different scenarios (i.e. different target operating systems, different virtual memory settings, etc).
Then measure the performance of the application in the different scenarios. Your minimum settings will be the machine that performs the least adequately while still being acceptable.
You could try using a performance profiler on your software while stress testing it.
You could use virtualization to repeatedly run a representative test suite with different amounts of RAM in the virtual machine...when the performance falls below acceptable levels due to swapping, you've found the memory requirement.

What is Erlang's secret to scalability?

Erlang is getting a reputation for being untouchable at handling a large volume of messages and requests. I haven't had time to download and try to get inside Mr. Erlang's understanding of switching theory... so I'm wondering if someone can teach me (or point to a good instructional site.)
Say as a thought-experiment I wanted to port the Erlang ejabberd to a combination of Python and C, in a way that gave me the same speed and scalability. What structures or patterns would I have to understand and implement? (Does Python's Twisted already do this?)
How/why do functional languages (specifically Erlang) scale well? (for discussion of why)
http://erlang.org/course/course.html (for a tutorial chain)
As far as porting to other languages, a message passing system would be easy to do in most modern languages. Getting the functional style can be done in Python easily enough, although you wouldn't get the internal dispatching features of Erlang "for free". Stackless Python can replicate much of Erlang's concurrency features, although I can't speak to details as I haven't used it much. If does appear to be much more "explicit" (in that it requires you to define the concurrency in code in places that Erlang's design will allow concurrency to happen internally).
Erlang is not only about scalability but mostly about
reliability
soft real-time characteristics (enabled by soft real-time GC which is possible because immutability [no cycles] and share nothing and so)
performance in concurrent tasks (cheap task switch, cheap process spawn, actors model, ...)
scalability - debatable in current state , but rapidly evolving (about 32 cores well, it is better than most competitors but should be better in near future).
Another of the features of erlang that have an impact on scalability is the the lightweight cheap processes. Since processes have so little overhead erlang can spawn far more of them than most other languages. You get more bang for your buck with erlang processes than many other languages give you.
I think the best choice for Erlang is Network bound applications - makes communication much simpler between nodes and things like heartbeat monitoring, auto restart using supervisor are built into OTP.

Resources