When I read about the definition of scalability on different websites, I came to understand that, in the context of CPUs and software, it means that as more CPUs are added, the performance of the software improves.
However, the description of scalability in the book "An Introduction to Parallel Programming" by Peter Pacheco is different, as follows:
"Suppose we run a parallel program with a fixed number of processes/threads and a fixed input size, and we obtain an efficiency E. Suppose we now increase the number of processes/threads that are used by the program. If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is scalable."
My question is: what is the proper definition of scalability? And if I am performing a scalability test on parallel software, which of the two definitions should I be looking at?
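Pacheco's definition can be made concrete with a toy cost model (all numbers and the cost formula are invented purely for illustration):

```python
# Toy cost model (invented for illustration): splitting n units of work
# over p threads costs n/p work plus a log2(p) communication term.
import math

def t_serial(n):
    return n

def t_parallel(n, p):
    return n / p + math.log2(p)

def efficiency(n, p):
    """E = T_serial / (p * T_parallel), as in Pacheco's definition."""
    return t_serial(n) / (p * t_parallel(n, p))

# Fixed problem size: efficiency drops as threads are added.
print([round(efficiency(64, p), 3) for p in (1, 2, 4, 8)])

# Growing n along with p keeps E roughly constant -- "scalable" in
# Pacheco's sense (in this model, n must grow slightly faster than p
# to hold E exactly constant).
print([round(efficiency(64 * p, p), 3) for p in (1, 2, 4, 8)])
```

If no rate of growth in n can hold E steady, the program is not scalable under this definition.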
Scalability is an application's ability to function correctly and maintain an acceptable user experience when used by a large number of clients.
Preferably, this ability should be achieved through elegant solutions in code, but where this isn't possible, the application's design must allow for horizontal growth using hardware (adding more computers, rather than increasing the performance of one computer).
Scalability is a concern which grows with the size of a business. Excellent examples are Facebook (video) and Dropbox (video). Also, here's a great explanation of various approaches to scalability from a session at Harvard.
Scalability also refers to the ability of a user interface to adapt to various screen sizes while maintaining the user experience.
First, I read this. But I would like to expand. To summarize:
When designing safety-critical systems, a designer has to evaluate some metrics to gain confidence that the system will work as expected. It is, in a way, a mathematical proof with low enough complexity to be accessible to a human being. It has to do with accountability, reliability, auditability, etc.
On the other hand, at this point, AI is a black box that seems to work very well, but most of the time we do not have a proof of its correctness (mainly because what goes on in the box is too complex to analyze); it is more like a statistical certainty:
We trained the system and the system performed well for all the tests.
So, some questions:
Q1. Do these two vague thoughts make sense nowadays?
Q2. Is it possible to use AI in safety-critical systems and be sure of its performance? Can we have certainty about the deterministic behavior of AI? Any references?
Q3. I guess there are already some companies selling safety-critical systems based on AI in the automotive realm for example. How do they manage to certify their products for such a restrictive market?
EDIT
About Q1: thanks to Peter, I realized that, for the automotive example, there are no requirements for total certainty. ASIL D, the most restrictive level for automotive systems, requires only an upper bound on the probability of failure, and the same holds for the other ISO 26262 levels. I would refine the question:
Q1. Is there any safety standard in system design, at any level/subcomponent, in any field/domain, that requires total certainty?
About Q2: Even if total certainty is not required, the question still holds.
About Q3: Now I understand how they would be able to achieve certification. Anyhow, any reference would be very welcome.
No solution or class of technology actually gets certified for safety-critical systems. When specifying the system, hazards are identified, requirements are defined to avoid or mitigate those hazards to an appropriate level of confidence, and evidence is provided that the design and then the implementation meet those requirements.

Certification is simply sign-off that, within the context of the particular system, appropriate evidence has been provided to justify a claim that the risk (the product of the likelihood of some event occurring and the adverse impact if that event occurs) is acceptably low.

At most, a set of evidence is provided or developed for a particular product (in your case an AI engine), which will be analysed in the context of other system components (for which evidence also needs to be obtained or provided) and the means of assembling those components into a working system. It is the system that will receive certification, not the technologies used to build it. The evidence provided with a particular technology or subsystem might well be reused, but it will be analysed in the context of the requirements for each complete system the technology or subsystem is used in.
This is why some technologies are described as "certifiable" rather than "certified". For example, some real-time operating systems (RTOS) have versions that are delivered with a pack of evidence that can be used to support acceptance of a system they are used in. However, those operating systems are not certified for use in safety critical systems, since the evidence must be assessed in context of the overall requirements of each total system in which the RTOS is employed.
Formal proof is advocated to provide the required evidence for some types of system or subsystems. Generally speaking, formal proof approaches do not scale well (complexity of the proof grows at least as fast as complexity of the system or subsystem) so approaches other than proof are often employed. Regardless of how evidence is provided, it must be assessed in the context of requirements of the overall system being built.
Now, where would an AI fit into this? If the AI is to be used to meet requirements related to mitigating or avoiding hazards, it is necessary to provide evidence that they do so appropriately in context of the total system. If there is a failure of the AI to meet those requirements, it will be necessary for the system as a whole (or other subsystems that are affected by the failure of the AI to meet requirements) to contain or mitigate the effects, so the system as a whole meets its complete set of requirements.
If the presence of the AI prevents delivery of sufficient evidence that the system as a whole meets its requirements, then the AI cannot be employed. This is equally true whether it is technically impossible to provide such evidence, or if real-world constraints prevent delivery of that evidence in context of the system being developed (e.g. constraints on available manpower, time, and other resources affecting ability to deliver the system and provide evidence it meets its requirements).
For a sub-system with non-deterministic behaviour, such as the learning of an AI, any inability to produce repeatable results over time will make it more difficult to provide the required evidence. The more gaps there are in the evidence provided, the more it is necessary to provide evidence that other parts of the system mitigate the identified hazards.
Generally speaking, testing on its own is considered a poor means of providing evidence. The basic reason is that testing can only establish the presence of a deficiency against requirements (if the test results demonstrate one) but cannot provide evidence of the absence of a deficiency (i.e. a system passing all its test cases provides no evidence about anything not tested for). The difficulty is providing a justification that the testing gives sufficient coverage of the requirements. This introduces the main obstacle to using an AI in a system with safety-related requirements: it is necessary for work at the system level to provide evidence that requirements are met, because it will be quite expensive to provide sufficient test-based evidence for the AI itself.
One strategy that is used at the system level is partitioning. The interaction of the AI with other sub-systems will be significantly constrained. For example, the AI will probably not directly interact with actuators that can cause a hazard, but will instead make requests to other subsystems. Then the burden of evidence is placed on how well the other subsystems meet requirements, including the manner they interact with actuators. As part of providing that evidence, the other subsystems may check all the data or requests from the AI, and ignore any that would cause an inappropriate actuation (or any other breach of overall system requirements). As a result of this, the AI itself may not actually meet any safety-related requirements at all - it might simply take information or provide information to other subsystems, and those other subsystems actually contribute more directly to meeting the overall system requirements. Given that the developers of an AI probably cannot provide all the needed evidence, it is a fair bet that system developers will try to constrain the effects an AI - if employed - can have on behaviour of the total system.
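The partitioning strategy can be sketched in a few lines (all names and limits below are invented for illustration): the AI proposes, and a small deterministic guard subsystem disposes.

```python
# Hypothetical sketch of the partitioning strategy: the AI only *requests*
# actuation; an independently-evidenced guard subsystem enforces the safety
# envelope before anything reaches the actuators.
MAX_SAFE_SPEED = 30.0   # invented limit, for illustration only
MAX_SAFE_STEER = 0.5    # radians; also invented

def guard(request):
    """Clip an AI request into the safety envelope.

    This small, deterministic function is the part that carries the safety
    evidence -- it can be exhaustively analysed even if the AI cannot.
    """
    speed = min(max(request.get("speed", 0.0), 0.0), MAX_SAFE_SPEED)
    steer = min(max(request.get("steer", 0.0), -MAX_SAFE_STEER), MAX_SAFE_STEER)
    return {"speed": speed, "steer": steer}

# An out-of-envelope AI request is clipped rather than actuated directly.
print(guard({"speed": 80.0, "steer": -2.0}))
```

The evidence burden then falls on the guard and the actuation path, not on the AI that generated the request.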
Another strategy is to limit the learning opportunities for the AI. For example, evidence will be provided with each training set - in the context of the AI itself - that the AI behaves in a predictable manner. That evidence will need to be provided in total every time the training set is updated, and then the analysis for the system as a whole will need to be redone. That is likely to be a significant undertaking (i.e. a long and expensive process) so the AI or its training sets will probably not be updated at a particularly high rate.
In the beginning, I would like to describe my current position and the goal that I would like to achieve.
I am a researcher dealing with machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and social network analysis, and have therefore gained some theoretical concepts useful for implementing machine learning algorithms and feeding in real data.
On simple examples the algorithms work well and the running time is acceptable, whereas big data are a problem when I try to run the algorithms on my PC. On the software side I have enough experience to implement any algorithm from articles, or to design my own, using whatever language or IDE (so far I have used Matlab, Java with Eclipse, .NET, ...), but so far I haven't had much experience with setting up infrastructure. I have started to learn about Hadoop, NoSQL databases, etc., but I am not sure which strategy would be best given my learning-time constraints.
The final goal is to be able to set up a working platform for analyzing big data, focused on implementing my own machine learning algorithms, and to put it all into production, ready to solve useful questions by processing big data.
As the main focus is on implementing machine learning algorithms, I would like to ask whether there is any existing running platform that offers enough CPU resources to feed in large data, upload my own algorithms, and simply process the data without having to think about distributed processing.
Whether or not such a platform exists, I would like to gain a big enough picture to be able to work in a team that could put into production a whole system tailored to specific customer demands. For example, a retailer would like to analyze daily purchases, so all the daily records have to be uploaded to some infrastructure capable of processing the data using custom machine learning algorithms.
To put all the above into a simple question: how do you design a custom data mining solution for real-life problems, with the main focus on machine learning algorithms, and put it into production, using existing infrastructure if possible and, if not, by designing a distributed system (using Hadoop or some other framework)?
I would be very thankful for any advice or suggestions about books or other helpful resources.
First of all, your question needs to define more clearly what you mean by Big Data.
Indeed, Big Data is a buzzword that may refer to various sizes of problem. I tend to define Big Data as the category of problems where the data size or the computation time is big enough for "the hardware abstractions to become broken", meaning that a single commodity machine cannot perform the computations without careful attention to computation and memory.
The scale threshold beyond which data becomes Big Data is therefore unclear and is sensitive to your implementation. Is your algorithm bounded by hard-drive bandwidth? Does it have to fit into memory? Have you tried to avoid unnecessary quadratic costs? Have you made any effort to improve cache efficiency, and so on?
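These questions lend themselves to quick back-of-envelope checks before reaching for a cluster; here is a sketch with purely invented numbers:

```python
# Back-of-envelope sketch: does the problem actually exceed one machine?
# All numbers below are illustrative assumptions, not measurements.
n_rows = 50_000_000          # records in the dataset
bytes_per_row = 200          # assumed average serialized row size
ram_bytes = 64 * 2**30       # a commodity machine with 64 GiB of RAM

data_bytes = n_rows * bytes_per_row
print(f"data: {data_bytes / 2**30:.1f} GiB, fits in RAM: {data_bytes < ram_bytes}")

# A quadratic step (e.g. materializing a full pairwise-distance matrix)
# is usually the real killer, not the raw data volume:
pairwise_bytes = n_rows**2 * 8   # one float64 per pair
print(f"naive pairwise matrix: {pairwise_bytes / 2**40:.0f} TiB")
```

Fifty million 200-byte rows fit comfortably in memory; the quadratic intermediate is what forces either a smarter algorithm or a distributed system.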
From several years of experience running medium-to-large-scale machine learning challenges (on up to 250 commodity machines), I strongly believe that many problems that seem to require a distributed infrastructure can actually be run on a single commodity machine if the problem is expressed correctly. For example, you mention large-scale data for retailers. I have been working on this exact subject for several years, and I have often managed to make all the computations run on a single machine, given a bit of optimisation. My company has been working on a simple custom data format that allows one year of all the data from a very large retailer to be stored within 50 GB, which means a single commodity hard drive could hold 20 years of history. You can have a look, for example, at: https://github.com/Lokad/lokad-receiptstream
From my experience, it is worth spending time optimizing the algorithm and memory use so that you can avoid resorting to a distributed architecture. Indeed, distributed architectures come with a triple cost. First, the strong knowledge requirements. Second, a large complexity overhead in the code. Finally, a significant latency overhead (with the exception of local multi-threaded distribution).
From a practitioner's point of view, being able to run a given data mining or machine learning algorithm in 30 seconds is one of the key factors for efficiency. I have noticed that when some computation, whether sequential or distributed, takes 10 minutes, my focus and efficiency tend to drop quickly, as it becomes much more complicated to iterate and test new ideas quickly. The latency overhead introduced by many of the distributed frameworks is such that you will inevitably be in this low-efficiency scenario.
If the scale of the problem is such that even with strong effort you cannot solve it on a single machine, then I strongly suggest resorting to off-the-shelf distributed frameworks instead of building your own. One of the best-known frameworks is the MapReduce abstraction, available through Apache Hadoop. Hadoop can be run on clusters of 10,000 nodes, probably much more than you will ever need. If you do not own the hardware, you can "rent" the use of a Hadoop cluster, for example through Amazon Elastic MapReduce.
Unfortunately, the MapReduce abstraction is not suited to all Machine Learning computations.
As far as Machine Learning is concerned, MapReduce is a rigid framework, and numerous cases have proved difficult or inefficient to adapt to it:
– The MapReduce framework is in itself related to functional programming. The Map procedure is applied to each data chunk independently. Therefore, the MapReduce framework is not suited to algorithms where the application of the Map procedure to some data chunks needs the results of the same procedure on other data chunks as a prerequisite. In other words, the MapReduce framework is not suited when the computations between the different pieces of data are not independent and impose a specific chronology.
– MapReduce is designed to provide a single execution of the map and of the reduce steps and does not directly provide iterative calls. It is therefore not directly suited to the numerous machine-learning problems implying iterative processing (Expectation-Maximisation (EM), Belief Propagation, etc.). Implementing these algorithms in a MapReduce framework means the user has to engineer a solution that organizes result retrieval and the scheduling of the multiple iterations, so that each map iteration is launched only after the reduce phase of the previous iteration has completed, and is fed with the results that reduce phase produced.
– Most MapReduce implementations have been designed to address production needs and robustness. As a result, the primary concern of the framework is to handle hardware failures and to guarantee the computation results. MapReduce efficiency is therefore partly lowered by these reliability constraints. For example, the serialization of computation results to hard disk turns out to be rather costly in some cases.
– MapReduce is not suited to asynchronous algorithms.
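The iteration problem above can be illustrated with a tiny in-process simulation of chained map/reduce rounds (this is not the Hadoop API; 1-D k-means stands in for an iterative algorithm):

```python
# Minimal in-process simulation of *iterative* MapReduce: each k-means
# iteration is one map phase (assign each point to its nearest centroid)
# plus one reduce phase (recompute centroid means), and the next map phase
# can only start once the previous reduce has finished -- exactly the
# scheduling burden described above.
from collections import defaultdict

def map_phase(points, centroids):
    # emit (centroid_index, point) pairs
    for x in points:
        idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        yield idx, x

def reduce_phase(pairs):
    # group by centroid index, then average each group
    groups = defaultdict(list)
    for idx, x in pairs:
        groups[idx].append(x)
    return [sum(v) / len(v) for _, v in sorted(groups.items())]

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]
for _ in range(5):                   # 5 chained map+reduce rounds
    centroids = reduce_phase(map_phase(points, centroids))
print(centroids)                     # converges to the two cluster means
```

In a real MapReduce deployment, each round is a separate job with its results serialized to disk, which is where the iteration overhead comes from.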
Criticism of the MapReduce framework has led to richer distributed frameworks in which more control and freedom are left to the framework user, at the price of more complexity for that user. Among these frameworks, GraphLab and Dryad (both based on Directed Acyclic Graphs of computations) are well known.
As a consequence, there is no "one size fits all" framework, just as there is no "one size fits all" data storage solution.
To start with Hadoop, you can have a look at the book Hadoop: The Definitive Guide by Tom White
If you are interested in how large-scale frameworks fit into Machine Learning requirements, you may be interested by the second chapter (in English) of my PhD, available here: http://tel.archives-ouvertes.fr/docs/00/74/47/68/ANNEX/texfiles/PhD%20Main/PhD.pdf
If you provide more insight about the specific challenge you want to deal with (type of algorithm, size of the data, time and money constraints, etc.), we could probably provide you with a more specific answer.
edit : another reference that could prove to be of interest : Scaling-up Machine Learning
I had to implement a couple of data mining algorithms to work with Big Data too, and I ended up using Hadoop.
I don't know if you are familiar with Mahout (http://mahout.apache.org/), which already has several algorithms ready to use with Hadoop.
Nevertheless, if you want to implement your own algorithm, you can still adapt it to Hadoop's MapReduce paradigm and get good results. This is an excellent book on how to adapt Artificial Intelligence algorithms to MapReduce:
Mining of Massive Datasets - http://infolab.stanford.edu/~ullman/mmds.html
This seems to be an old question. However, given your use case, the main frameworks focusing on machine learning in the Big Data domain are Mahout, Spark (MLlib), H2O, etc. To run machine learning algorithms on Big Data, though, you have to convert them into parallel programs based on the MapReduce paradigm. This is a nice article giving a brief introduction to the major (though not all) Big Data frameworks:
http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
I hope this will help.
Actually, I'm not expecting an answer to the specific question. I'm really wondering if there are any studies out there that might give some insight into usage patterns across the spectrum.
More precisely: Are there any published surveys on how much of the call stack programs typically use across different platforms, workloads, compilers, etc.?
EDIT: In response to some comments suggesting that the question is meaningless...
My own observations hint that stack utilisation follows something resembling an exponential distribution with a mean on the order of tens of bytes. I was hoping for some kind of indication of the stability of the mean along different dimensions. I.e., if I measured the stack consumption across a wide range of programs, would they exhibit a similar p.d.f., no matter how I group the results, or will, say, Linux programs consistently have bigger/smaller stacks, on average, than Windows programs, or statically-typed languages vs dynamically-typed languages, and so on?
Contrast this with, say, total RAM usage, which is influenced by the specifics of the problem at hand, in particular, the working set required by that program to efficiently carry out its duties. My hypothesis is that the distribution of stack utilisation will be relatively stable across a wide range of environments, and I simply want to know if that or a similar hypothesis has ever been confirmed or falsified.
(Note: I won't pretend that my observations are accurate, comprehensive or in any way scientific. That's why I'm here, asking the question.)
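For concreteness, here is a rough sketch of how such per-program samples can be taken, at least at the level of Python stack frames (frame counts, not bytes, and purely illustrative):

```python
# Rough sketch: sample Python call-stack depth (in frames, not bytes) at a
# point of interest by walking the frame chain from the current frame.
import sys

def stack_depth():
    depth = 0
    frame = sys._getframe()
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth

def recurse(n):
    # deepen the stack by n calls, then sample
    if n == 0:
        return stack_depth()
    return recurse(n - 1)

base = stack_depth()
print(base, recurse(10))   # the second sample is deeper by the recursive calls
```

Sampling depths like this across many programs and workloads would give the kind of empirical distribution the question is asking about, though only at the interpreter level, not in machine-stack bytes.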
I could interpret your question in one way: in Java, the default native stack size is 128k, with a minimum value of 1000 bytes, and the default Java stack size is 400k, again with a minimum of 1000 bytes. Of course, you can extend the sizes using the -ss and -oss parameters respectively.
More precisely : I don't understand your need for published surveys on stacks across platforms.
I always get this argument against RoR that it doesn't scale, but I never get an appropriate answer as to what that really means. So here is a novice asking: what the hell is this "scaling", and how do you measure it?
What the hell is this "scaling"...
As a general term, scalability means the responsiveness of a project to different kinds of demand. A project that scales well is one that doesn't have any trouble keeping up with requests for more of its services -- or, at the least, doesn't have to start turning away requests because it can't handle them.
It's often the case that simply increasing the size of a problem by an order of magnitude or two exposes weaknesses in the strategies that were used to solve it. When such weaknesses are exposed, it might be said that the solution to the problem doesn't "scale well".
For example, bogo sort is easy to implement, but as soon as you're sorting more than a handful of things, it starts taking a very long time to get the answer you want. It would be fair to say that bogo sort doesn't scale well.
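To make that concrete, here is a hypothetical sketch of bogo sort; its expected number of shuffles grows roughly factorially with input size, so it becomes unusable past a handful of elements:

```python
# Sketch of bogo sort: shuffle until sorted. The expected number of
# shuffles for n distinct elements is on the order of n!, which is why
# it "doesn't scale well".
import random

def bogo_sort(items, rng):
    """Shuffle items in place until sorted; return the shuffle count."""
    shuffles = 0
    while any(a > b for a, b in zip(items, items[1:])):
        rng.shuffle(items)
        shuffles += 1
    return shuffles

rng = random.Random(0)     # seeded so the demo is repeatable
for n in range(2, 7):
    print(n, bogo_sort(list(range(n, 0, -1)), rng))
```

Even between n=5 and n=6 the shuffle counts typically jump by an order of magnitude; a true scalability test would chart exactly this kind of growth curve.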
... and how you measure it?
That's a harder question to answer. In general, there aren't units associated with scalability; statements like "that system is N times as scalable as this one is" at best would be an apples-to-oranges comparison.
Scalability is most frequently measured by seeing how well a system stands up to different kinds of demand in test conditions. People might say a system scales well if, over a wide range of demand of different kinds, it can keep up. This is especially true if it stands up to demand that it doesn't currently experience, but might be expected to if there's a sudden surge in popularity. (Think of the Slashdot/Digg/Reddit effects.)
Scaling or scalability refers to how a project can grow or expand to respond to the demand:
http://en.wikipedia.org/wiki/Scalability
Scalability has a wide variety of uses as indicated by Wikipedia:
Scalability can be measured in various dimensions, such as:
Load scalability: The ability for a distributed system to easily expand and contract its resource pool to accommodate heavier or lighter loads. Alternatively, the ease with which a system or component can be modified, added, or removed to accommodate changing load.
Geographic scalability: The ability to maintain performance, usefulness, or usability regardless of expansion from concentration in a local area to a more distributed geographic pattern.
Administrative scalability: The ability for an increasing number of organizations to easily share a single distributed system.
Functional scalability: The ability to enhance the system by adding new functionality at minimal effort.
In one area where I work we are concerned with the performance of high-throughput and parallel computing as the number of processors is increased.
More generally, it is often found that increasing the problem size by (say) one or two orders of magnitude throws up a completely new set of challenges which are not easily predictable from the smaller system.
It is a term expressing the ability of a system to maintain its performance as it grows over time.
Ideally, what you want is a system that achieves linear scalability: by adding new units of resources, the system grows equally in its ability to perform.
For example: when three webapp servers can handle a thousand concurrent users, adding three more servers should let them handle double the amount, two thousand concurrent users in this case, and no less.
If a system does not have the property of linear scalability, there is a point where adding more resources, e.g. hardware, will not bring any additional benefit: as more and more servers are put to the task, the marginal performance gain converges to zero. In the above example, the additional benefit of each new server becomes smaller and smaller until it reaches zero.
Thus, scalability is the factor that tells you what you get as output for a given input. Its value range lies, in theory, between 0 and positive infinity. In practice, anything equal to 1 is most desirable.
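This diminishing-returns behaviour is captured by Amdahl's law: with a serial fraction s of the work, speedup(p) = 1 / (s + (1 - s)/p), which plateaus at 1/s no matter how many servers are added. A short sketch (the 5% serial fraction is an illustrative assumption):

```python
# Amdahl's law sketch: the marginal benefit of each extra server shrinks
# toward zero once the serial fraction of the work dominates.
def speedup(p, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

s = 0.05                            # assume 5% of the work cannot be parallelised
for p in (1, 2, 4, 8, 16, 64, 1024):
    print(p, round(speedup(p, s), 2))
# The ceiling is 1/s = 20x, however many servers are added.
```

Going from 1 to 2 servers nearly doubles throughput, while going from 64 to 1024 barely moves it, which is exactly the "additional benefit reaches zero" effect described above.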
Scalability refers to the ability of a system to accommodate a changing number of users. This can be an increasing or decreasing number of users, as we now try to plan our systems around cloud computing and rented computing time.
Think about what is involved in making an order entry system designed for 1,000 reps scale to accommodate 100,000 reps. What hardware needs to be added? What about the databases? In a nutshell, this is scalability.
Scalability of an application refers to how it is able to perform as the load on the application changes. This is often affected by the number of connected users, amount of data in a database, etc.
It is the ability for a system to accept an increased workload, more functionality, changing database, ... without impacting the original design or system.
G'day,
I was reading the item Quantify in the book "97 Things Every Software Architect Should Know" (sanitised Amazon link) and it got me wondering how to quantify scalability.
I have designed two systems for a major British broadcasting corporation that are used to:
detect the country of origin for incoming HTTP requests, or
determine the suitable video formats for a mobile phone's screen geometry and current connection type.
Both of the designs were required to provide scalability.
My designs for both systems are scalable horizontally behind caching load-balancing layers which are used to handle incoming requests for both of these services and distribute them across several servers which actually provide the service itself. Initial increases in service capacity are made by adding more servers behind the load-balance layer, hence the term horizontal scalability.
There is a limit to the scalability of this architecture, however, if the load-balancing layer starts having difficulty coping with the incoming request traffic.
So, is it possible to quantify scalability? Would it be an estimate of how many additional servers you could add to horizontally scale the solution?
I think this comes down to what scalability means in a given context and therefore the answer would be it depends.
I've seen scalability in requirements for things that simply didn't exist yet. For example, a new loan application tool that specifically called out needing to work on the iPhone and other mobile devices in the future.
I've also seen scalability used to describe potential expansion of more data centers and web servers in different areas of the world to improve performance.
Both examples above can be quantifiable if there is a known target for the future. But scalability may not be quantifiable if there really is no known target or plan which makes it a moving target.
I think it is possible in some contexts - for example scalability of a web application could be quantified in terms of numbers of users, numbers of concurrent requests, mean and standard deviation of response time, etc. You can also get into general numbers for bandwidth and storage, transactions per second, and recovery times (for backup and DR).
You can also often give numbers within the application domain - let's say the system supports commenting, you can quantify what is the order of magnitude of the number of comments that it needs to be able to store.
It is however worth bearing in mind that not everything that matters can be measured, and not everything that can be measured matters. :-)
The proper measure of scalability (not the simplest one;-) is a set of curves defining resource demanded (CPUs, memory, storage, local bandwidth, ...), and performance (e.g. latency) delivered, as the load grows (e.g. in terms of queries per second, but other measures such as total data throughput demanded may also be appropriate for some applications). Decision makers will typically demand that such accurate but complex measures be boiled down to a few key numbers (specific spots on some of the several curves), but I always try to negotiate for more-accurate as against simpler-to-understand measurements of such key metrics!-)
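As an illustration, such a latency-versus-load curve can be sketched with a simple M/M/1 queueing model (an assumed model, not a measurement): mean time in system is W = 1/(mu - lambda) for arrival rate lambda below service rate mu.

```python
# Sketch of a latency-vs-load curve using the M/M/1 queueing formula:
# mean time in system W = 1 / (mu - lam), valid only while lam < mu.
MU = 100.0        # assumed service capacity: 100 queries/second

def mean_latency(lam, mu=MU):
    if lam >= mu:
        return float("inf")      # saturated: the queue grows without bound
    return 1.0 / (mu - lam)

for lam in (10, 50, 90, 99, 100):
    print(f"{lam:>3} qps -> {mean_latency(lam) * 1000:8.1f} ms")
```

The curve stays nearly flat at low load and then blows up near saturation, which is why a single "queries per second" number without the accompanying latency curve understates what decision makers actually need to know.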
When I think of scalability I think of:
performance - how responsive the app needs to be for a given load
how large a load the app can grow into, and at what unit cost (if it's per server, include software, support, etc.)
how fast you can scale the app up and how much buffer you want over peak period usage (we can add 50% more bandwidth in 2-3 hours and require a 30% buffer over planned peak usage)
Redundancy is something else, but should also be included and considered.
"The system shall scale as to maintain a linear relationship of X for cost/user".
Here's one way:
"assume that a single processor can process 100 units of work per second..."
From http://www.information-management.com/issues/19971101/972-1.html