Reasons for NOT scaling-up vs. -out? - scalability

As a programmer I make revolutionary findings every few years. I'm either ahead of the curve or behind it by about π in phase. One hard lesson I learned was that scaling OUT is not always better; quite often the biggest performance gains came when we regrouped and scaled up.
What reasons do you have for scaling out vs. up? Price, performance, vision, projected usage? If so, how did this work for you?
We once scaled out to several hundred nodes, serializing and caching the necessary data out to each node and running math processes on the records. Many, many billions of records needed to be (cross-)analyzed. It was the perfect business and technical case to employ scale-out. We kept optimizing until we processed about 24 hours of data in 26 hours of wallclock. Long story short, we leased a gigantic (for the time) IBM pSeries, put Oracle Enterprise on it, indexed our data, and ended up processing the same 24 hours of data in about 6 hours. A revolution for me.
So many enterprise systems are OLTP and the data are not sharded, yet the desire of many is to cluster or scale out. Is this a reaction to new techniques or to perceived performance?
Do applications in general today, or our programming mantras, lend themselves better to scale-out? Should we always take this trend into account in the future?

Because scaling up:
Is ultimately limited by the size of box you can actually buy.
Can become extremely cost-ineffective, e.g. a machine with 128 cores and 128 GB of RAM is vastly more expensive than 16 machines with 8 cores and 8 GB of RAM each.
Some things don't scale up well, such as I/O read operations.
By scaling out, if your architecture is right, you can also achieve high availability. A 128-core, 128 GB RAM machine is very expensive, but having a second redundant one is extortionate.
And also, to some extent, because that's what Google do.

Scaling out is best for embarrassingly parallel problems. It takes some work, but a number of web services fit that category (hence the current popularity). Otherwise you run into Amdahl's law, which means that to gain speed you have to scale up, not out. I suspect you ran into that problem. I/O-bound operations also tend to do well with scaling out, largely because waiting for I/O increases the fraction of the work that is parallelizable.
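For intuition, here is a minimal sketch of what Amdahl's law predicts (the 95% parallel fraction is an invented example): the serial part caps the speedup no matter how many nodes you add.

    # Amdahl's law: speedup from running the parallelizable fraction p on n workers.
    def amdahl_speedup(p, n):
        """p = parallelizable fraction of the work (0..1), n = number of workers."""
        return 1.0 / ((1.0 - p) + p / n)

    # Illustrative numbers: a job that is 95% parallelizable.
    for n in (2, 16, 128, 1024):
        print(n, round(amdahl_speedup(0.95, n), 1))
    # The speedup flattens out near 1 / (1 - 0.95) = 20x, no matter how many nodes you add.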

The blog post Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood has some interesting points to consider, such as software licensing and power costs.

Not surprisingly, it all depends on your problem. If you can easily partition it into subproblems that don't communicate much, scaling out gives trivial speedups. For instance, searching for a word in 1B web pages can be done by one machine searching 1B pages, or by 1M machines doing 1000 pages each without a significant loss in efficiency (so with a 1,000,000x speedup). This is called "embarrassingly parallel".
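A hedged sketch of that partitioning idea, scaled down to one machine with Python's multiprocessing (the corpus and the search word are placeholders): each worker searches only its own chunk, and the only communication is the tiny per-chunk counts at the end.

    from multiprocessing import Pool

    def count_hits(pages, word="example"):          # each worker sees only its chunk
        return sum(word in page for page in pages)

    if __name__ == "__main__":
        pages = ["...page text..."] * 100_000       # placeholder corpus
        chunks = [pages[i::8] for i in range(8)]    # partition into 8 independent chunks
        with Pool(8) as pool:
            total = sum(pool.map(count_hits, chunks))   # only the small counts are communicated
        print(total)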
Other algorithms, however, require much more intensive communication between the subparts. Your case requiring cross-analysis is a perfect example of where communication can drown out the performance gains of adding more boxes. In these cases, you'll want to keep communication inside a (bigger) box, going over high-speed interconnects, rather than something as 'common' as (10-)Gig-E.
Of course, this is a fairly theoretical point of view. Other factors, such as I/O, reliability, and ease of programming (one big shared-memory machine usually gives a lot fewer headaches than a cluster), can also have a big influence.
Finally, due to the (often extreme) cost benefits of scaling out using cheap commodity hardware, the cluster/grid approach has recently attracted much more (algorithmic) research. As a result, new ways of parallelizing have been developed that minimize communication and thus do much better on a cluster -- whereas common knowledge used to dictate that these types of algorithms could only run effectively on big-iron machines...


Is there any way to calculate DRAM access latency (cycles) from data size?

I need to calculate DRAM access latency using a given data size to be transferred between DRAM and SRAM.
The data is separated into "load size" and "store size", and the "number of iterations of load and store" is given.
I think there are many features I need to consider, like the first DRAM access latency, the latency to transfer one word, the address load latency, etc.
Is there some popular equation to get this from the given information?
Thank you in advance.
Your question has many parts; I think I can help better if I knew the ultimate goal. If it's simply to measure access latency:
If you are using an x86 processor, maybe the Intel Memory Latency Checker will help.
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.
If not x86, I think the gem5 simulator has what you are looking for; here is the main page, but more specifically for your needs, I think this config for gem5 will be the most helpful.
Now regarding a popular equation, the best I could find is this Carnegie Mellon paper that goes over my head: https://users.ece.cmu.edu/~omutlu/pub/chargecache_low-latency-dram_hpca16.pdf However, it looks like your main "features", as you put it, revolve around cores and memory channels. The equation from the paper:
Storage_bits = C * MC * Entries * (EntrySize_bits + LRU_bits)
is used to create a cache that will ultimately (the goal of ChargeCache) reduce access latency in DRAM. I'm sure this isn't the equation you are looking for, but just a piece of the puzzle. The LRU_bits relate to the cache this mechanism creates (in the memory controller; no DRAM modification is necessary).
EntrySize_bits is determined by the equation EntrySize_bits = log2(R) + log2(B) + log2(Ro) + 1, where
R, B, and Ro are the number of ranks, banks, and rows in DRAM, respectively.
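As a rough illustration only, here is how those two equations combine; every parameter value below is invented, and I'm assuming C and MC stand for the number of cores and memory channels, as the discussion above suggests.

    from math import log2

    # Invented example parameters -- substitute your own system's values.
    C, MC, entries, lru_bits = 8, 2, 128, 5
    R, B, Ro = 2, 8, 32768          # ranks, banks, rows (assumed figures)

    entry_size_bits = log2(R) + log2(B) + log2(Ro) + 1
    storage_bits = C * MC * entries * (entry_size_bits + lru_bits)
    print(entry_size_bits, storage_bits / 8 / 1024, "KiB")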
I was surprised to learn that highly charged rows (i.e., recently accessed ones) have a significantly lower access latency.
If this goes over your head as well, maybe the 2007 paper by Ulrich Drepper titled What Every Programmer Should Know About Memory will help you find the elements you need for your equation. I'm still working through this paper myself, and there are some dated references, but those depend on what CPU you're working with. Hope this helps; I look forward to being corrected on any of this, as I'm new to the topic.

If a computer can be Turing complete with one instruction, what is the purpose of having many instructions?

I understand the concept of a computer being Turing complete (having a MOV or SUBNEG command and therefore being able to "synthesize" other instructions). If that is true, what is the purpose of having hundreds of instructions like x86 has, for example? Is it to increase efficiency?
Yes.
Equally, any logical circuit can be made using just NANDs. But that doesn't make other components redundant. Crafting a CPU from NAND gates would be monumentally inefficient, even if that CPU performed only one instruction.
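To make that concrete, here is a toy Python sketch (gates modeled as functions, not how hardware is actually built) showing NOT, AND, and OR derived purely from NAND:

    def nand(a, b):
        return 0 if (a and b) else 1

    def not_(a):    return nand(a, a)
    def and_(a, b): return not_(nand(a, b))
    def or_(a, b):  return nand(not_(a), not_(b))

    # Truth-table check: every gate above is built from NAND alone.
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, not_(a), and_(a, b), or_(a, b))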
An OS or application has a similar level of complexity to a CPU.
You COULD compile it so it just used a single instruction. But you would just end up with the world's most bloated OS.
So, when designing a CPU's instruction set, the choice is a tradeoff between reducing CPU size and expense, which allows more instructions per second because they are simpler, and a smaller die that is easier to cool (RISC); and increasing the capabilities of the CPU, including instructions that take multiple clock cycles to complete, at the cost of making it larger and more cumbersome to cool (CISC).
This tradeoff is why math co-processors were a thing back in the 486 days. Floating point math could be emulated without the instructions. But it was much, much faster if it had a co-processor designed to do the heavy lifting on those floating point things.
Remember that a Turing Machine is generally understood to be an abstract concept, not a physical thing. It's the theoretical minimal form a computer can take that can still compute anything. Theoretically. Heavy emphasis on theoretically.
An actual Turing machine that did something so simple as decode an MP3 would be outrageously complicated. Programming it would be an utter nightmare as the machine is so insanely limited that even adding two 64-bit numbers together and recording the result in a third location would require an enormous amount of "tape" and a whole heap of "instructions".
When we say something is "Turing Complete" we mean that it can perform generic computation. It's a pretty low bar in all honesty, crazy things like the Game of Life and even CSS have been shown to be Turing Complete. That doesn't mean it's a good idea to program for them, or take them seriously as a computational platform.
In the early days of computing people would have to type in machine codes by hand. Adding two numbers together and storing the result is often one or two operations at most. Doing it in a Turing machine would require thousands. The complexity makes it utterly impractical on the most basic level.
As a challenge try and write a simple 4-bit adder. Then if you've successfully tackled that, write a 4-bit multiplier. The complexity ramps up exponentially once you move to things like 32 or 64-bit values, and when you try and tackle division or floating point values you're quickly going to drown in the outrageousness of it all.
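To get a feel for that exercise in software first, here is a minimal 4-bit ripple-carry adder modeled in Python (a sketch of the logic using only basic boolean operations); the 4-bit multiplier and anything wider are left as the challenge suggests.

    def full_adder(a, b, carry_in):
        """One-bit full adder built from basic boolean ops."""
        s = a ^ b ^ carry_in
        carry_out = (a & b) | (carry_in & (a ^ b))
        return s, carry_out

    def add4(x, y):
        """Add two 4-bit numbers bit by bit, rippling the carry."""
        carry, result = 0, 0
        for i in range(4):
            s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
            result |= s << i
        return result, carry            # 4-bit sum plus final carry-out

    print(add4(0b0111, 0b0101))         # (12, 0): 7 + 5 = 12, no overflow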
You don't tell the CPU which transistors to flip when you're typing in machine code, the instructions act as macros to do that for you, but when you're writing Turing Machine code it's up to you to command it how to flip each and every single bit.
If you want to learn more about CPU history and design there's a wealth of information out there, and you can even implement your own using transistor logic or an FPGA kit, where you can write it out using a hardware description language like Verilog.
The Intel 4004 chip was intended for a calculator so the operation codes were largely geared towards that. The subsequent 8008 built on that, and by the time the 8086 rolled around the instruction set had taken on that familiar x86 flavor, albeit a 16-bit version of same.
There's an abstraction spectrum here between defining the behaviour of individual bits (Turing Machine) and some kind of hypothetical CPU with an instruction for every occasion. RISC and CISC designs from the 1980s and 1990s differed in their philosophy here, where RISC generally had fewer instructions, CISC having more, but those differences have largely been erased as RISC gained more features and CISC became more RISC-like for the sake of simplicity.
The Turing Machine is the "absolute zero" in terms of CPU design. If you can come up with something simpler or more reductive you'd probably win a prize.

Machine Learning & Big Data [closed]

In the beginning, I would like to describe my current position and the goal that I would like to achieve.
I am a researcher dealing with machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and social network analysis, and have therefore gained some theoretical concepts useful for implementing machine learning algorithms and feeding in real data.
On simple examples the algorithms work well and the running time is acceptable, whereas big data represent a problem when trying to run the algorithms on my PC. Regarding the software, I have enough experience to implement whatever algorithm from articles or design my own using whatever language or IDE (so far I have used Matlab, Java with Eclipse, .NET...), but so far I haven't got much experience with setting up infrastructure. I have started to learn about Hadoop, NoSQL databases, etc., but I am not sure what strategy would be best, taking into consideration the learning time constraints.
The final goal is to be able to set up a working platform for analyzing big data, with a focus on implementing my own machine learning algorithms, and to put it all into production, ready for solving useful questions by processing big data.
As the main focus is on implementing machine learning algorithms, I would like to ask whether there is any existing running platform offering enough CPU resources to feed in large data, upload my own algorithms, and simply process the data without thinking about distributed processing.
Whether such a platform exists or not, I would like to gain a picture big enough to be able to work in a team that could put into production the whole system, tailored to specific customer demands. For example, a retailer would like to analyze daily purchases, so all the daily records have to be uploaded to some infrastructure capable of processing the data using custom machine learning algorithms.
To put all the above into a simple question: how do you design a custom data mining solution for real-life problems, with the main focus on machine learning algorithms, and put it into production, using existing infrastructure if possible and, if not, by designing a distributed system (using Hadoop or whatever framework)?
I would be very thankful for any advice or suggestions about books or other helpful resources.
First of all, your question needs to define more clearly what you mean by Big Data.
Indeed, Big Data is a buzzword that may refer to various sizes of problems. I tend to define Big Data as the category of problems where the data size or the computation time is big enough for "the hardware abstractions to become broken", which means that a single commodity machine cannot perform the computations without intensive care over computation and memory.
The scale threshold beyond which data become Big Data is therefore unclear and is sensitive to your implementation. Is your algorithm bound by hard-drive bandwidth? Does it have to fit into memory? Did you try to avoid unnecessary quadratic costs? Did you make any effort to improve cache efficiency, etc.?
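A back-of-envelope sketch of those questions, with invented figures, before reaching for a cluster:

    # Invented example figures -- replace with your own data set and machine.
    data_gb         = 200      # raw data size
    ram_gb          = 64       # memory on a single commodity machine
    disk_mb_per_sec = 150      # sequential hard-drive bandwidth

    fits_in_ram   = data_gb <= ram_gb
    full_scan_min = data_gb * 1024 / disk_mb_per_sec / 60

    print("fits in RAM:", fits_in_ram)
    print("one sequential pass: about %.0f minutes" % full_scan_min)
    # If a handful of sequential passes is enough, a single machine may well do.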
From several years of experience running medium large-scale machine learning challenges (on up to 250 commodity machines), I strongly believe that many problems that seem to require distributed infrastructure can actually be run on a single commodity machine if the problem is expressed correctly. For example, you mention large-scale data for retailers. I have been working on this exact subject for several years, and I often managed to make all the computations run on a single machine, provided a bit of optimisation. My company has been working on a simple custom data format that allows one year of all the data from a very large retailer to be stored within 50 GB, which means a single commodity hard drive could hold 20 years of history. You can have a look, for example, at: https://github.com/Lokad/lokad-receiptstream
From my experience, it is worth spending time trying to optimize the algorithm and memory use so that you can avoid resorting to a distributed architecture. Indeed, distributed architectures come with a triple cost. First, there are strong knowledge requirements. Second, they come with a large complexity overhead in the code. Finally, distributed architectures come with a significant latency overhead (with the exception of local multi-threaded distribution).
From a practitioner's point of view, being able to perform a given data mining or machine learning algorithm in 30 seconds is one of the key factors in efficiency. I have noticed that when some computations, whether sequential or distributed, take 10 minutes, my focus and efficiency tend to drop quickly, as it becomes much more complicated to iterate quickly and test new ideas. The latency overhead introduced by many of the distributed frameworks is such that you will inevitably be in this low-efficiency scenario.
If the scale of the problem is such that even with strong effort you cannot perform it on a single machine, then I strongly suggest resorting to off-the-shelf distributed frameworks instead of building your own. One of the most well-known frameworks is the MapReduce abstraction, available through Apache Hadoop. Hadoop can be run on a 10,000-node cluster, probably much more than you will ever need. If you do not own the hardware, you can "rent" the use of a Hadoop cluster, for example through Amazon Elastic MapReduce.
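To illustrate the abstraction itself (this is not Hadoop's actual API, just a minimal in-memory sketch), here is word count in the MapReduce style:

    from collections import defaultdict

    def map_phase(document):
        # Emit (key, value) pairs independently for each input chunk.
        return [(word, 1) for word in document.split()]

    def reduce_phase(pairs):
        # Group by key and aggregate the values.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["the cat sat", "the dog sat"]                       # toy input
    mapped = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(mapped))                                 # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}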
Unfortunately, the MapReduce abstraction is not suited to all Machine Learning computations.
As far as Machine Learning is concerned, MapReduce is a rigid framework and numerous cases have proved to be difficult or inefficient to adapt to this framework:
– The MapReduce framework is in itself related to functional programming. The Map procedure is applied to each data chunk independently. Therefore, the MapReduce framework is not suited to algorithms where applying the Map procedure to some data chunks requires the results of the same procedure on other data chunks as a prerequisite. In other words, the MapReduce framework is not suited when the computations between the different pieces of data are not independent and impose a specific chronology.
– MapReduce is designed to provide a single execution of the map and reduce steps and does not directly provide iterative calls. It is therefore not directly suited to the numerous machine-learning problems requiring iterative processing (Expectation-Maximisation (EM), Belief Propagation, etc.). Implementing these algorithms in a MapReduce framework means the user has to engineer a solution that organizes result retrieval and the scheduling of the multiple iterations, so that each map iteration is launched after the reduce phase of the previous iteration has completed and is fed with the results that phase produced.
– Most MapReduce implementations have been designed to address production needs and robustness. As a result, the primary concern of the framework is to handle hardware failures and to guarantee the computation results. MapReduce efficiency is therefore partly lowered by these reliability constraints. For example, the serialization of computation results to hard disk turns out to be rather costly in some cases.
– MapReduce is not suited to asynchronous algorithms.
The questioning of the MapReduce framework has led to richer distributed frameworks where more control and freedom are left to the framework user, at the price of more complexity for that user. Among these frameworks, GraphLab and Dryad (both based on Directed Acyclic Graphs of computations) are well known.
As a consequence, there is no "one size fits all" framework, just as there is no "one size fits all" data storage solution.
To start with Hadoop, you can have a look at the book Hadoop: The Definitive Guide by Tom White
If you are interested in how large-scale frameworks fit Machine Learning requirements, you may be interested in the second chapter (in English) of my PhD, available here: http://tel.archives-ouvertes.fr/docs/00/74/47/68/ANNEX/texfiles/PhD%20Main/PhD.pdf
If you provide more insight about the specific challenge you want to deal with (type of algorithm, size of the data, time and money constraints, etc.), we could probably provide you with a more specific answer.
Edit: another reference that could prove to be of interest: Scaling-up Machine Learning
I had to implement a couple of Data Mining algorithms to work with BigData too, and I ended up using Hadoop.
I don't know if you are familiar with Mahout (http://mahout.apache.org/), which already has several algorithms ready to use with Hadoop.
Nevertheless, if you want to implement your own algorithm, you can still adapt it to Hadoop's MapReduce paradigm and get good results. This is an excellent book on how to adapt artificial intelligence algorithms to MapReduce:
Mining of Massive Datasets - http://infolab.stanford.edu/~ullman/mmds.html
This seems to be an old question. However, given your use case, the main frameworks focusing on machine learning in the Big Data domain are Mahout, Spark (MLlib), H2O, etc. To run machine learning algorithms on Big Data you have to convert them to parallel programs based on the MapReduce paradigm. This is a nice article giving a brief introduction to the major (though not all) Big Data frameworks:
http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
I hope this will help.

What is Scaling?

I always get this argument against RoR that it doesn't scale, but I never get any appropriate answer as to what it really means. So here is a novice asking: what the hell is this "scaling", and how do you measure it?
What the hell is this "scaling"...
As a general term, scalability means the responsiveness of a project to different kinds of demand. A project that scales well is one that doesn't have any trouble keeping up with requests for more of its services -- or, at the least, doesn't have to start turning away requests because it can't handle them.
It's often the case that simply increasing the size of a problem by an order of magnitude or two exposes weaknesses in the strategies that were used to solve it. When such weaknesses are exposed, it might be said that the solution to the problem doesn't "scale well".
For example, bogo sort is easy to implement, but as soon as you're sorting more than a handful of things, it starts taking a very long time to get the answer you want. It would be fair to say that bogo sort doesn't scale well.
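For reference, here is a sketch of bogo sort in Python; its expected running time grows roughly with n·n!, which is why it stops being usable beyond a handful of items:

    import random

    def is_sorted(items):
        return all(a <= b for a, b in zip(items, items[1:]))

    def bogo_sort(items):
        # Shuffle until the list happens to come out sorted.
        while not is_sorted(items):
            random.shuffle(items)
        return items

    print(bogo_sort([3, 1, 2]))   # fine for 3 items; hopeless for 30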
... and how you measure it?
That's a harder question to answer. In general, there aren't units associated with scalability; statements like "that system is N times as scalable as this one is" at best would be an apples-to-oranges comparison.
Scalability is most frequently measured by seeing how well a system stands up to different kinds of demand in test conditions. People might say a system scales well if, over a wide range of demand of different kinds, it can keep up. This is especially true if it stands up to demand that it doesn't currently experience, but might be expected to if there's a sudden surge in popularity. (Think of the Slashdot/Digg/Reddit effects.)
Scaling or scalability refers to how a project can grow or expand to respond to the demand:
http://en.wikipedia.org/wiki/Scalability
Scalability has a wide variety of uses as indicated by Wikipedia:
Scalability can be measured in various dimensions, such as:
Load scalability: The ability for a distributed system to easily expand and contract its resource pool to accommodate heavier or lighter loads. Alternatively, the ease with which a system or component can be modified, added, or removed, to accommodate changing load.
Geographic scalability: The ability to maintain performance, usefulness, or usability regardless of expansion from concentration in a local area to a more distributed geographic pattern.
Administrative scalability: The ability for an increasing number of organizations to easily share a single distributed system.
Functional scalability: The ability to enhance the system by adding new functionality at minimal effort.
In one area where I work we are concerned with the performance of high-throughput and parallel computing as the number of processors is increased.
More generally, it is often found that increasing the problem size by (say) one or two orders of magnitude throws up a completely new set of challenges which are not easily predictable from the smaller system.
It is a term for expressing the ability of a system to keep its performance as it grows over time.
Ideally, what you want is a system that reaches linear scalability. It means that by adding new units of resources, the system grows equally in its ability to perform.
For example: if three webapp servers can handle a thousand concurrent users, then adding three more servers should let the system handle double the amount -- two thousand concurrent users in this case, and no less.
If a system does not have the property of linear scalability, there is a point where adding more resources, e.g. hardware, will not bring any additional benefit: the performance gain converges to zero as more and more servers are put to the task. In the above example, the additional benefit of each new server becomes smaller and smaller until it reaches zero.
Thus, scalability is the factor that tells you what you get as output from a given input. Its value range lies between 0 and positive infinity, in theory. In practice, anything equal to 1 is most desirable...
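A small sketch of that idea with made-up load-test numbers: divide the throughput you actually got by the throughput perfect linear scaling would predict, and linear scalability means the ratio stays at 1.

    # Made-up load-test results: servers -> concurrent users handled.
    measurements = {3: 1000, 6: 2000, 12: 3200}

    base_servers, base_users = 3, measurements[3]
    for servers, users in sorted(measurements.items()):
        ideal = base_users * servers / base_servers
        efficiency = users / ideal
        print(servers, "servers:", users, "users, efficiency %.2f" % efficiency)
    # 6 servers still scale linearly (1.00); at 12 the efficiency drops to 0.80.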
Scalability refers to the ability of a system to accommodate a changing number of users. This can be an increasing or decreasing number of users, as we now try to plan our systems around cloud computing and rented computing time.
Think about what is involved in making an order entry system designed for 1,000 reps scale to accommodate 100,000 reps. What hardware needs to be added? What about the databases? In a nutshell, this is scalability.
Scalability of an application refers to how it is able to perform as the load on the application changes. This is often affected by the number of connected users, amount of data in a database, etc.
It is the ability for a system to accept an increased workload, more functionality, changing database, ... without impacting the original design or system.

Quality vs. ROI - When is Good Enough, good enough? [closed]

UPDATED: I'm asking this from a development perspective; however, to illustrate, a canonical non-development example that comes to mind is that if it costs, say, $10,000 to keep an uptime rate of 99%, then it can theoretically cost $100,000 to keep a rate of 99.9%, and possibly $1,000,000 to keep a rate of 99.99%.
Somewhat like a limit in calculus, as we approach 100% the cost can increase exponentially. Therefore, as a developer or PM, where do you decide that the deliverable is "good enough" given the time and monetary constraints, e.g. are you getting a good ROI at 99%, 99.9%, 99.99%?
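To make the trade-off concrete, here is a quick calculation (using the hypothetical price points above) of what each extra nine actually buys in allowed downtime per year:

    # Hypothetical costs from the example above, paired with availability targets.
    targets = [(0.99, 10_000), (0.999, 100_000), (0.9999, 1_000_000)]

    minutes_per_year = 365 * 24 * 60
    for availability, cost in targets:
        downtime_min = (1 - availability) * minutes_per_year
        print("%.2f%% uptime: about %5.0f min downtime/year for $%s"
              % (availability * 100, downtime_min, format(cost, ",")))
    # Each extra nine costs roughly 10x more and removes about 90% of the remaining downtime.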
I'm using a non-development example because I'm not sure of a solid metric for development. Maybe in the above example "uptime" could be replaced with "function point to defect ratio", or some other reasonable measure of the rate of bugs vs. the complexity of the code. I would also welcome input regarding all stages of a software development lifecycle.
Keep the classic Project Triangle constraints in mind (quality vs. speed vs. cost). And let's assume that the customer wants the best quality you can deliver given the original budget.
There's no way to answer this without knowing what happens when your application goes down.
If someone dies when your application goes down, uptime is worth spending millions or even billions of dollars on (aerospace, medical devices).
If someone may be injured if your software goes down, uptime is worth hundreds of thousands or millions of dollars (industrial control systems, auto safety devices)
If someone loses millions of dollars if your software goes down, uptime is worth spending millions on (financial services, large e-commerce apps).
If someone loses thousands of dollars if your software goes down, uptime is worth spending thousands on (retail, small e-commerce apps).
If someone will swear at the computer and lose productivity while it reboots when your software goes down, then uptime is worth spending thousands on (most internal software).
etc.
Basically take (cost of going down) x (number of times the software will go down) and you know how much to spend on uptime.
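In code form, with placeholder numbers (the cost per outage and the outage frequency are the things you have to estimate for your own application):

    # Placeholder estimates -- the whole point is to plug in your own.
    cost_per_outage  = 25_000     # dollars lost each time the system goes down
    outages_per_year = 4          # expected failures per year with the current design

    uptime_budget = cost_per_outage * outages_per_year
    print("Spending up to $%s per year on uptime pays for itself." % format(uptime_budget, ","))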
The Quality vs. Good Enough discussion I've seen has a practical ROI at 95% defect fixes. Obviously show-stoppers / critical defects are fixed (and there are always exceptions, like airplane autopilots, etc., that need to have far fewer defects).
I can't seem to find the reference to the 95% defect fixes; it is either in Rapid Development or in Applied Software Measurement by Capers Jones.
Here is a link to a useful strategy for attacking code quality:
http://www.gamedev.net/reference/articles/article1050.asp
The client, of course, would likely balk at that number and might say no more than 1 hour of downtime per year is acceptable. That's 12 times more stable. Do you tell the customer, sorry, we can't do that for $100,000, or do you make your best attempt, hoping your analysis was conservative?
Flat out tell the customer that what they want isn't reasonable. In order to gain that kind of uptime, a massive amount of money would be needed, and realistically, consistently reaching that percentage of uptime just isn't possible.
I personally would go back to the customer and tell them that you'll provide them with the best setup for $100k and set up an outage report guideline. Something like: for every outage you have, we will complete an investigation as to why this outage happened and what we will do to make the chances of it happening again almost non-existent.
I think offering SLAs is just a mistake.
I think the answer to this question depends entirely on the individual application.
Software that has an impact on human safety has much different requirements than, say, an RSS feed reader.
The project triangle is a gross simplification. In lots of cases you can actually save time by improving quality, for example by reducing repairs and avoiding costs in maintenance. This is not only true in software development: Toyota's lean production proved that this works in manufacturing too.
The whole process of software development is far too complex to make generalizations on cost vs quality. Quality is a fuzzy concept that consists of multiple factors. Is testable code of higher quality than performant code? Is maintainable code of higher quality than testable code? Do you need testable code for an RSS reader or performant code? And for a fly-by-wire F16?
It's more productive to make informed decisions on a case-by-case basis. And don't be afraid to over-invest in quality. It's usually much cheaper and safer than under-investing.
To answer in an equally simplistic way..
..When you stop hearing from the customers (and not because they stopped using your product).. except for enhancement requests and bouquets :)
And it's not a triangle; it has 4 corners: Cost, Time, Quality, and Scope.
To expand on what "17 of 26" said, the answer depends on value to the customer. In the case of critical software, like aircraft controller applications, the value to the customer of a high quality rating, by whatever measure they use, is quite high. To the user of an RSS feed reader, the value of high quality is considerably lower.
It's all about the customer (notice I didn't say user - sometimes they're the same, and sometimes they're not).
Chasing the word "Quality" is like chasing the horizon. I have never seen anything (in the IT world or outside) that is 100% quality. There's always room for improvement.
Secondly, "quality" is an overly broad term. It means something different to everyone and is subjective in its degree of implementation.
That being said, every effort boils down to what "engineering" means: making the right choices to balance cost, time, and key characteristics (i.e. speed, size, shape, weight, etc.). These are constraints.
