Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
Improve this question
G'day,
I was reading the item Quantify in the book "97 Things Every Software Architect Should Know" (sanitised Amazon link) and it got me wondering how to quantify scalability.
I have designed two systems for a major British broadcasting corporation that are used to:
detect the country of origin for incoming HTTP requests, or
determine the suitable video formats for a mobile phones's screen geometry and current connection type.
Both of the designs were required to provide scalability.
My designs for both systems are scalable horizontally behind caching load-balancing layers which are used to handle incoming requests for both of these services and distribute them across several servers which actually provide the service itself. Initial increases in service capacity are made by adding more servers behind the load-balance layer, hence the term horizontal scalability.
There is a limit to the scalability of this architecture however if the load balance layer starts having difficulty coping with the incoming request traffic.
So, is it possible to quantify scalability? Would it be an estimate of how many additional servers you could add to horizontally scale the solution?
I think this comes down to what scalability means in a given context and therefore the answer would be it depends.
I've seen scalability in requirements for things that simply didn't exist yet. For example, a new loan application tool that specifically called out needing to work on the iPhone and other mobile devices in the future.
I've also seen scalability used to describe potential expansion of more data centers and web servers in different areas of the world to improve performance.
Both examples above can be quantifiable if there is a known target for the future. But scalability may not be quantifiable if there really is no known target or plan which makes it a moving target.
I think it is possible in some contexts - for example scalability of a web application could be quantified in terms of numbers of users, numbers of concurrent requests, mean and standard deviation of response time, etc. You can also get into general numbers for bandwidth and storage, transactions per second, and recovery times (for backup and DR).
You can also often give numbers within the application domain - let's say the system supports commenting, you can quantify what is the order of magnitude of the number of comments that it needs to be able to store.
It is however worth bearing in mind that not everything that matters can be measured, and not everything that can be measured matters. :-)
The proper measure of scalability (not the simplest one;-) is a set of curves defining resource demanded (CPUs, memory, storage, local bandwidth, ...), and performance (e.g. latency) delivered, as the load grows (e.g. in terms of queries per second, but other measures such as total data throughput demanded may also be appropriate for some applications). Decision makers will typically demand that such accurate but complex measures be boiled down to a few key numbers (specific spots on some of the several curves), but I always try to negotiate for more-accurate as against simpler-to-understand measurements of such key metrics!-)
When I think of scalability I think of:
performance - how responsive the app needs to be for a given load
how large a load the app can grow into and at what unit cost (if its per server include software, support, etc)
how fast you can scale the app up and how much buffer you want over peak period usage (we can add 50% more bandwidth in 2-3 hours and require a 30% buffer over planned peak usage)
Redundancy is something else, but should also be included and considered.
"The system shall scale as to maintain a linear relationship of X for cost/user".
Here's one way:
"assume that a single processor can process 100 units of work per second..."
From http://www.information-management.com/issues/19971101/972-1.html
Related
When I read about the the definition of scalability on different websites. I came to know in context of CPU & software that it means that as the number of CPUs are added, the performance of the software improves.
Whereas, the description of scalability in the book on "An introduction to parallel programming by Peter Pacheco" is different which is as:
"Suppose we run a parallel program with a fixed number of processes/threads and a fixed input size, and we obtain an efficiency E. Suppose we now increase the number of processes/threads that are used by the program. If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is
scalable.
My question is what is the proper definition of scalability? and if I am performing a test for scalability of a parallel software, which definition among the two should I look be looking at?
Scalability is an application's ability to function correctly and maintain an acceptable user experience when used by a large number of clients.
Preferably, this ability should be achieved through elegant solutions in code, but where this isn't possible, the application's design must allow for horizontal growth using hardware (adding more computers, rather than increasing the performance of one computer).
Scalability is a concern which grows with the size of a business. Excellent examples are Facebook (video) and Dropbox (video). Also, here's a great explanation of various approaches to scalability from a session at Harvard.
Scalability also refers to the ability of a user interface to adapt to various screen sizes while maintaining the user experience.
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
Now I want to make app to show pic from server.But found it is so slowly to make the pics come out. Is there any way or code to make the http request be more faster?
Approach the problem from a different angle. Think it as an entropy/economy thing.
There is two distant point in the problem you are describing and they want to transfer a data between them. Lets say this is going to cost 100 units to realize in ideal conditions. And again lets assume it is not possible to further lower this cost. This cost is where the energy required to transfer is minimum
Now assuming that transfer rates are not under our control. Here are some theoretical "seemingly" improvements which are actually just different trade-off sets.
Forward Caching / Caching: Preload/download all images whenever possible so they will be ready when the user requests them. Install everything static with the first time installation.
Trade-off: You spent most of your 100 points on disk space and pre-process power this may make your app go little slower always but the performance will be great once they are loaded on disk. Effectiveness decrease as images you want changes frequently
Compression / Mapping: If your bottleneck is at transfer rates compress/map the images as much as you can so they will be transferred with low cost but when they arrive you will use much processor power once they arrive at the app.
Trade-off: CPU power is used a lot more then before but while in transfer they will be moving fast. Side that compresses uses a lot more memory and side that decompress also uses more memory and CPU. Only good if your images are slow moving because they are huge then you will see this trade-offs benefits. If you want to try this install 7z and check advanced zipping settings and try really huge maps.
Drawing algorithms: If your images are drawings instead of bitmaps/real pictures. Send only vector graphics format, change all your pictures(raster images to be technical) to vector images. Will greatly reduces the number of bytes to carry an image to app but it will need more processor power at the end.
Trade-off: Aside from not all pictures can be converted to vector graphics. you are going to be using more CPU and memory but there excellent libraries already built in which are very optimized. You can also think this as "mathematical compression" where a single infinite line can not be stored in any computer in universe a simple one line mathematical expression such as x = y + 1 will make it happen.
Write your own server: If you are sure the bottleneck is at the communication time with the service provider(in this case a http server which is most probably very efficient). Write you own server which answers you very quickly and starts sending the file. The likely hood of this improvements is to low to even talk about a trade off.
Never be forced to send repeating information: Design and tag your information and pictures in such a way. You will only send non-repeating chunks where the receiving side will store any chunk received to further improve its cache of information.
Trade-Off:Combination of 1,2 and 3 you this is just another form of disturbing 100 points of cost. You get the point.
BitTorrent Ideology:If the bottleneck is your servers bandwidth there is a simple formula to see if using your user's bandwidth is logical and have positive effects. Again it is probably only effective if your data set is very large.
Trade-Off: This is an interesting option to discuss about trade-off. You will be using your users bandwidth and proximity to compensate for your servers lack of bandwidth. It requires a little bit more CPU then conventional way of acquiring data(Maintain more TCP connections)
As a closing note: There can not be a function call that can improve and make the cost of transferring information from 100 points to 95 points. In current technology level it seems that that we are really close to effectively transfer. I mean compression, mapping and various other techniques are pretty mature including network transfer methodologies but there is always a room for improvement. For example currently we think sending data with the light speed is the absolute maximum as they are electrical signals but quantum entangled observation technique denies this limit where two entangled particles theoretically send and receive information in infinite speed across universe(??). Also If you can develop a better and more efficient way of compression that will be awesome.
Anyway, as you have asked a question which does not provide much information where we can talk. I would strongly recommend thinking like an engineer, creating a test environment pointing out to the main cause and attacking it with all you got. If you can define the problem with better mathematical expression or pointing out the bottleneck we can answer it better than generic information theory.
Also one last note. I am not saying information transfer is not going to be any more efficient it may go up %1000 percent tomorrow, I am just saying this field is pretty mature to get any huge improvements without working on mathematics and theory for years. Like any other field of research.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
In the beginning, I would like to describe my current position and the goal that I would like to achieve.
I am a researcher dealing with machine learning. So far have gone through several theoretical courses covering machine learning algorithms and social network analysis and therefore have gained some theoretical concepts useful for implementing machine learning algorithms and feed in the real data.
On simple examples, the algorithms work well and the running time is acceptable whereas the big data represent a problem if trying to run algorithms on my PC. Regarding the software I have enough experience to implement whatever algorithm from articles or design my own using whatever language or IDE (so far have used Matlab, Java with Eclipse, .NET...) but so far haven't got much experience with setting-up infrastructure. I have started to learn about Hadoop, NoSQL databases, etc, but I am not sure what strategy would be the best taking into consideration the learning time constraints.
The final goal is to be able to set-up a working platform for analyzing big data with focusing on implementing my own machine learning algorithms and put all together into production, ready for solving useful question by processing big data.
As the main focus is on implementing machine learning algorithms I would like to ask whether there is any existing running platform, offering enough CPU resources to feed in large data, upload own algorithms and simply process the data without thinking about distributed processing.
Nevertheless, such a platform exists or not, I would like to gain a picture big enough to be able to work in a team that could put into production the whole system tailored upon the specific customer demands. For example, a retailer would like to analyze daily purchases so all the daily records have to be uploaded to some infrastructure, capable enough to process the data by using custom machine learning algorithms.
To put all the above into simple question: How to design a custom data mining solution for real-life problems with main focus on machine learning algorithms and put it into production, if possible, by using the existing infrastructure and if not, design distributed system (by using Hadoop or whatever framework).
I would be very thankful for any advice or suggestions about books or other helpful resources.
First of all, your question needs to define more clearly what you intend by Big Data.
Indeed, Big Data is a buzzword that may refer to various size of problems. I tend to define Big Data as the category of problems where the Data size or the Computation time is big enough for "the hardware abstractions to become broken", which means that a single commodity machine cannot perform the computations without intensive care of computations and memory.
The scale threshold beyond which data become Big Data is therefore unclear and is sensitive to your implementation. Is your algorithm bounded by Hard-Drive bandwidth ? Does it have to feet into memory ? Did you try to avoid unnecessary quadratic costs ? Did you make any effort to improve cache efficiency, etc.
From several years of experience in running medium large-scale machine learning challenge (on up to 250 hundreds commodity machine), I strongly believe that many problems that seem to require distributed infrastructure can actually be run on a single commodity machine if the problem is expressed correctly. For example, you are mentioning large scale data for retailers. I have been working on this exact subject for several years, and I often managed to make all the computations run on a single machine, provided a bit of optimisation. My company has been working on simple custom data format that allows one year of all the data from a very large retailer to be stored within 50GB, which means a single commodity hard-drive could hold 20 years of history. You can have a look for example at : https://github.com/Lokad/lokad-receiptstream
From my experience, it is worth spending time in trying to optimize algorithm and memory so that you could avoid to resort to distributed architecture. Indeed, distributed architectures come with a triple cost. First of all, the strong knowledge requirements. Secondly, it comes with a large complexity overhead in the code. Finally, distributed architectures come with a significant latency overhead (with the exception of local multi-threaded distribution).
From a practitioner point of view, being able to perform a given data mining or machine learning algorithm in 30 seconds is one the key factor to efficiency. I have noticed than when some computations, whether sequential or distributed, take 10 minutes, my focus and efficiency tend to drop quickly as it becomes much more complicated to iterate quickly and quickly test new ideas. The latency overhead introduced by many of the distributed frameworks is such that you will inevitably be in this low-efficiency scenario.
If the scale of the problem is such that even with strong effort you cannot perform it on a single machine, then I strongly suggest to resort to on-shelf distributed frameworks instead of building your own. One of the most well known framework is the MapReduce abstraction, available through Apache Hadoop. Hadoop can be run on 10 thousands nodes cluster, probably much more than you will ever need. If you do not own the hardware, you can "rent" the use of a Hadoop cluster, for example through Amazon MapReduce.
Unfortunately, the MapReduce abstraction is not suited to all Machine Learning computations.
As far as Machine Learning is concerned, MapReduce is a rigid framework and numerous cases have proved to be difficult or inefficient to adapt to this framework:
– The MapReduce framework is in itself related to functional programming. The
Map procedure is applied to each data chunk independently. Therefore, the
MapReduce framework is not suited to algorithms where the application of the
Map procedure to some data chunks need the results of the same procedure to
other data chunks as a prerequisite. In other words, the MapReduce framework
is not suited when the computations between the different pieces of data are
not independent and impose a specific chronology.
– MapReduce is designed to provide a single execution of the map and of the
reduce steps and does not directly provide iterative calls. It is therefore not
directly suited for the numerous machine-learning problems implying iterative
processing (Expectation-Maximisation (EM), Belief Propagation, etc.). The
implementation of these algorithms in a MapReduce framework means the
user has to engineer a solution that organizes results retrieval and scheduling
of the multiple iterations so that each map iteration is launched after the reduce
phase of the previous iteration is completed and so each map iteration is fed
with results provided by the reduce phase of the previous iteration.
– Most MapReduce implementations have been designed to address production needs and
robustness. As a result, the primary concern of the framework is to handle
hardware failures and to guarantee the computation results. The MapReduce efficiency
is therefore partly lowered by these reliability constraints. For example, the
serialization on hard-disks of computation results turns out to be rather costly
in some cases.
– MapReduce is not suited to asynchronous algorithms.
The questioning of the MapReduce framework has led to richer distributed frameworks where more control and freedom are left to the framework user, at the price of more complexity for this user. Among these frameworks, GraphLab and Dryad (both based on Direct Acyclic Graphs of computations) are well-known.
As a consequence, there is no "One size fits all" framework, such as there is no "One size fits all" data storage solution.
To start with Hadoop, you can have a look at the book Hadoop: The Definitive Guide by Tom White
If you are interested in how large-scale frameworks fit into Machine Learning requirements, you may be interested by the second chapter (in English) of my PhD, available here: http://tel.archives-ouvertes.fr/docs/00/74/47/68/ANNEX/texfiles/PhD%20Main/PhD.pdf
If you provide more insight about the specific challenge you want to deal with (type of algorithm, size of the data, time and money constraints, etc.), we probably could provide you a more specific answer.
edit : another reference that could prove to be of interest : Scaling-up Machine Learning
I had to implement a couple of Data Mining algorithms to work with BigData too, and I ended up using Hadoop.
I don't know if you are familiar to Mahout (http://mahout.apache.org/), which already has several algorithms ready to use with Hadoop.
Nevertheless, if you want to implement your own Algorithm, you can still adapt it to Hadoop's MapReduce paradigm and get good results. This is an excellent book on how to adapt Artificial Intelligence algorithms to MapReduce:
Mining of Massive Datasets - http://infolab.stanford.edu/~ullman/mmds.html
This seems to be an old question. However given your usecase, the main frameworks focusing on Machine Learning in Big Data domain are Mahout, Spark (MLlib), H2O etc. However to run Machine Learning algorithms on Big Data you have to convert them to parallel programs based on Map Reduce paradigm. This is a nice article giving a brief introduction to major (not all) big Data frameworks:
http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
I hope this will help.
I always get this argument against RoR that it dont scale but I never get any appropriate answer wtf it really means? So here is a novice asking, what the hell is this " scaling " and how you measure it?
What the hell is this "scaling"...
As a general term, scalability means the responsiveness of a project to different kinds of demand. A project that scales well is one that doesn't have any trouble keeping up with requests for more of its services -- or, at the least, doesn't have to start turning away requests because it can't handle them.
It's often the case that simply increasing the size of a problem by an order of magnitude or two exposes weaknesses in the strategies that were used to solve it. When such weaknesses are exposed, it might be said that the solution to the problem doesn't "scale well".
For example, bogo sort is easy to implement, but as soon as you're sorting more than a handful of things, it starts taking a very long time to get the answer you want. It would be fair to say that bogo sort doesn't scale well.
... and how you measure it?
That's a harder question to answer. In general, there aren't units associated with scalability; statements like "that system is N times as scalable as this one is" at best would be an apples-to-oranges comparison.
Scalability is most frequently measured by seeing how well a system stands up to different kinds of demand in test conditions. People might say a system scales well if, over a wide range of demand of different kinds, it can keep up. This is especially true if it stands up to demand that it doesn't currently experience, but might be expected to if there's a sudden surge in popularity. (Think of the Slashdot/Digg/Reddit effects.)
Scaling or scalability refers to how a project can grow or expand to respond to the demand:
http://en.wikipedia.org/wiki/Scalability
Scalability has a wide variety of uses as indicated by Wikipedia:
Scalability can be measured in various dimensions, such as:
Load scalability: The ability for a distributed system to easily
expand and contract its resource pool
to accommodate heavier or lighter
loads. Alternatively, the ease with
which a system or component can be
modified, added, or removed, to
accommodate changing load.
Geographic scalability: The ability to maintain performance,
usefulness, or usability regardless of
expansion from concentration in a
local area to a more distributed
geographic pattern.
Administrative scalability: The ability for an increasing number of
organizations to easily share a single
distributed system.
Functional scalability: The ability to enhance the system by
adding new functionality at minimal
effort.
In one area where I work we are concerned with the performance of high-throughput and parallel computing as the number of processors is increased.
More generally it is often found that increasing the problem by (say) one or two orders of magnitude throws up a completely new set of challenges which are not easily predictable from the smaller system
It is a term for expressing the ability of a system to keep its performance as it grows over time.
Ideally what you want, is a system to reach linear scalability. It means that by adding new units of resources, the system equally grows in its ability to perform.
For example: It means, that when three webapp servers can handle a thousand concurrent users, that by adding three more servers, it can handle double the amount, two thousand concurrent users in this case and no less.
If a system does not have the property of linear scalability, there is a point where adding more resources, e.g. hardware, will not bring any additional benefit, performance, for instance, converges to zero: As more and more servers are put to the task. In the above example, the additional benefit of each new server becomes smaller and smaller until it reaches zero.
Thus, scalability is the factor that tells you what you get as output from a given input. It's value range lies between 0 and positive infinity, in theory. In practice, anything equal to 1 is most desirable...
Scalability refers to the ability for a system to accomodate a changing number of users. This can be an increasing or decreasing number of users as we now try to plan our systems around cloud computing and rented computing time.
Think about what is involved in making an order entry system designed for 1000 reps scale to accomodate 100,000 reps. What hardware needs to be added? What about the databases? In a nutshell, this is scalability.
Scalability of an application refers to how it is able to perform as the load on the application changes. This is often affected by the number of connected users, amount of data in a database, etc.
It is the ability for a system to accept an increased workload, more functionality, changing database, ... without impacting the original design or system.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
UPDATED: I'm asking this from a development perspective, however to illustrate, a canoical non-development example that comes to mind is that if it costs, say, $10,000
to keep a uptime rate of 99%, then it theoretically can cost $100,000 to keep a rate
of 99.9%, and possibly $1,000,000 to keep a rate of 99.99%.
Somewhat like calculus in approaching 0, as we closely approach 100%,
the cost can increase exponentially. Therefore, as a developer or PM, where do you decide
that the deliverable is "good enough" given the time and monetary constraints, e.g.: are you getting a good ROI at 99%, 99.9%,
99.99%?
I'm using a non-development example because I'm not sure of a solid metric for development. Maybe in the above example "uptime" could be replaced with "function point to defect ratio", or some such reasonable measure rate of bugs vs. the complexity of code. I would also welcome input regarding all stages of a software development lifecycle.
Keep the classic Project Triangle constraints in mind (quality vs. speed vs. cost). And let's assume that the customer wants the best quality you can deliver given the original budget.
There's no way to answer this without knowing what happens when your application goes down.
If someone dies when your application goes down, uptime is worth spending millions or even billions of dollars on (aerospace, medical devices).
If someone may be injured if your software goes down, uptime is worth hundreds of thousands or millions of dollars (industrial control systems, auto safety devices)
If someone looses millions of dollars if your software goes down, uptime is worth spending millions on (financial services, large e-commerce apps).
If someone looses thousands of dollars if your software goes down, uptime is worth spending thousands on (retail, small e-commerce apps).
If someone will swear at the computer and looses productivity while it reboots when your software goes down, then uptime is worth spending thousands on (most internal software).
etc.
Basically take (cost of going down) x (number of times the software will go down) and you know how much to spend on uptime.
The Quality vs Good Enough discussion I've seen has a practical ROI at 95% defect fixes. Obviously show stoppers / critical defects are fixed (and always there are the exceptions like air-plane autopilots etc, that need to not have so many defects).
I can't seem to find the reference to the 95% defect fixes, it is either in Rapid Development or in Applied Software Measurement by Caper Jones.
Here is a link to a useful strategy for attacking code quality:
http://www.gamedev.net/reference/articles/article1050.asp
The client, of course, would likely balk at that number and might say no more than 1 hour of downtime per year is acceptable. That's 12 times more stable. Do you tell the customer, sorry, we can't do that for $100,000, or do you make your best attempt, hoping your analysis was conservative?
Flat out tell the customer what they want isn't reasonable. In order to gain that kind of uptime, a massive amount of money would be needed, and realistically, the chances of reaching that percentage of uptime constantly just isn't possible.
I personally would go back to the customer and tell them that you'll provide them with the best setup with 100k and set up an outage report guideline. Something like, for every outage you have, we will complete an investigation as to why this outage happened, and how what we will do to make the chances of it happening again almost non existent.
I think offering SLAs is just a mistake.
I think the answer to this question depends entirely on the individual application.
Software that has an impact on human safety has much different requirements than, say, an RSS feed reader.
The project triangle is a gross simplification. In lots of cases you can actually save time by improving quality. For example by reducing repairs and avoiding costs in maintenance. This is not only true in software development.Toyota lean production proved that this works in manufacturing too.
The whole process of software development is far too complex to make generalizations on cost vs quality. Quality is a fuzzy concept that consists of multiple factors. Is testable code of higher quality than performant code? Is maintainable code of higher quality than testable code? Do you need testable code for an RSS reader or performant code? And for a fly-by-wire F16?
It's more productive to make informed desisions on a case-by-case basis. And don't be afraid to over-invest in quality. It's usually much cheaper and safer than under-investing.
To answer in an equally simplistic way..
..When you stop hearing from the customers (and not because they stopped using your product).. except for enhancement requests and bouquets :)
And its not a triangle, it has 4 corners - Cost Time Quality and Scope.
To expand on what "17 of 26" said, the answer depends on value to the customer. In the case of critical software, like aircrafct controller applications, the value to the customer of a high quality rating by whatever measure they use is quite high. To the user of an RSS feed reader, the value of high quality is considerably lower.
It's all about the customer (notice I didn't say user - sometimes they're the same, and sometimes they're not).
Chasing the word "Quality" is like chasing the horizon. I have never seen anything (in the IT world or outside) that is 100% quality. There's always room for improvement.
Secondly, "quality" is an overly broad term. It means something different to everyone and subjective in it's degree of implementation.
That being said, every effort boils down to what "engineering" means--making the right choices to balance cost, time and key characteristics (ie. speed, size, shape, weight, etc.) These are constraints.