How is scalability defined for parallel code?

I'm interested in finding out whether there is a formal definition of when a parallel code counts as scalable, or whether it is just a trendy word. If I measure the serial wall time as t_S and the parallel wall time on P processors as t(P), I can define the efficiency as E(P) = t_S / (P * t(P)). Is there a criterion for how the efficiency has to change with P (and with the problem size) for the code to be deemed scalable?

Scalable means that performance (the ability to handle increasingly large workloads) improves when you add machines or CPU cores (scale out vs. scale up). Serial code is thus not scalable; parallel code can be. Amdahl's law limits how scalable a system can be.
Scalability is often more important than efficiency. A scalable but inefficient system can handle more load just by adding hardware, whereas an efficient but unscalable system requires major code rework to handle larger loads.

See Amdahl's law and Gustafson's law for formal definitions of some scalability metrics.
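For intuition, here is a minimal Python sketch of the two classic scaling models, assuming a hypothetical code whose parallel fraction is f (the function names and the value f = 0.95 are purely illustrative):

def amdahl_speedup(f, p):
    """Fixed problem size (strong scaling): S(P) = 1 / ((1 - f) + f / P)."""
    return 1.0 / ((1.0 - f) + f / p)

def gustafson_speedup(f, p):
    """Problem size grows with P (weak scaling): S(P) = (1 - f) + f * P."""
    return (1.0 - f) + f * p

f = 0.95  # assumed parallel fraction of the code
for p in (1, 4, 16, 64):
    s = amdahl_speedup(f, p)
    print(f"P={p:3d}  strong-scaling speedup={s:6.2f}  efficiency E(P)={s / p:.2f}")

Under the fixed-size (Amdahl) view, E(P) inevitably decays as P grows; under the scaled-size (Gustafson) view, efficiency can stay roughly constant, which is the usual sense in which a code is called (weakly) scalable.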

Related

How to determine the number of rounds in a TFF context

In TFF, it is necessary to determine the number of rounds. So, to obtain optimal performance of our model, how can we know the optimal number of rounds?
TFF does not necessarily need you to specify the number of rounds for federated training beforehand. TFF is more about specifying the federated aspect of your computation (which you can essentially think of as specifying the communication), and considers actually "running" the rounds to be a system-level concern.
When you write TFF, generally you are writing at three levels (explanation of this statement here); the question you are asking (and every concern TFF considers a "system concern") is at the Python level. Since Python controls the actual invocation of your computation written in TFF, you can stop training with any criterion expressible in Python. E.g. if you want to monitor performance on a validation set and use that as a stopping criterion, this is entirely doable. If you have a tff.utils.IterativeProcess ip and an evaluation function eval_fn (see here for an example), this could be implemented as something like:
state = ip.initialize()                    # initial server state
while True:
    data = sample_client_data()            # sample a cohort of clients for this round
    state, metrics = ip.next(state, data)  # run one federated round
    eval_metrics = eval_fn(state)          # evaluate the current global model
    if condition(eval_metrics):            # any stopping criterion expressible in Python
        break
Abstractly: since Python drives the experiment process, you can stop whenever you want, based on any observable characteristic of the training procedure. Therefore you do not in fact need to know beforehand how many rounds you will be running.
A more direct answer to the original question is, I think, not quite achievable for the general case at this point in the history of FL; nobody (as far as I am aware) knows of reliable system-level settings for FL yet. This is not surprising; it is somewhat akin to knowing beforehand how many epochs one should specify in datacenter training, which tends to be quite problem-dependent. FL is similar in this regard. Practically speaking, my advice tends to be: monitor performance on a validation set, run for as long as you can, and keep the state of your highest-performing model on the validation set around. I think a more general answer than this may be quite difficult.
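As a concrete, hedged sketch of that last piece of advice (keep the best state seen on a validation set), building on the loop above; sample_client_data, the "accuracy" key and the PATIENCE value are illustrative assumptions, not TFF API:

state = ip.initialize()
best_state, best_score, rounds_since_best = state, float("-inf"), 0
PATIENCE = 10                                # hypothetical patience, in rounds

while rounds_since_best < PATIENCE:
    data = sample_client_data()
    state, metrics = ip.next(state, data)
    score = eval_fn(state)["accuracy"]       # assumes eval_fn returns a dict of metrics
    if score > best_score:
        best_state, best_score = state, score
        rounds_since_best = 0
    else:
        rounds_since_best += 1
# best_state now holds the highest-scoring model seen on the validation set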

Bayesian Optimization does not improve prediction accuracy

What could be the reason for this?
There is no guarantee that Bayesian optimization will provide optimal hyperparameter values; quoting from the definitive textbook Deep Learning, by Goodfellow, Bengio, and Courville (page 430):
Currently, we cannot unambiguously recommend Bayesian hyperparameter optimization as an established tool for achieving better deep learning results or for obtaining those results with less effort. Bayesian hyperparameter optimization sometimes performs comparably to human experts, sometimes better, but fails catastrophically on other problems. It may be worth trying to see if it works on a particular problem but is not yet sufficiently mature or reliable.
In other words, it is actually just a heuristic (like grid search), and what you report does not necessarily mean that you are doing something wrong or that there is a problem with the procedure to be corrected...
I would like to extend desertnaut's excellent answer with a small intuition about what can go wrong and how one can improve Bayesian optimization. Bayesian optimization usually relies on some form of distance (and correlation) computation between points (hyperparameter settings). Unfortunately, it is usually close to impossible to impose such a geometric structure on the parameter space. One of the important issues connected to this is the implicit assumption of a Lipschitz or linear dependency between the optimized value and the hyperparameters. To understand this in more detail, let us have a look at the
Integer(50, 1000, name="estimators")
parameter. Let us inspect how adding 100 estimators changes the behavior of the optimization problem. If we add 100 estimators to 50, we triple the number of estimators and probably increase the expressive power significantly. However, changing from 900 to 1000 should not be nearly as important. So if the optimization process starts with, say, 600 estimators as a first guess, it will notice that changing the number of estimators by roughly 50 does not change much, and it will effectively skip optimizing this hyperparameter (because it assumes a quasi-continuous, linear dependency). This can seriously harm the exploration process.
To overcome this issue it is better to use some sort of log-scaled distribution for this parameter. A similar trick is commonly applied to the learning_rate parameter.
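For instance, with scikit-optimize this can be expressed by giving the dimensions a log-uniform prior; a minimal sketch, assuming skopt is the library in use and with train_and_validate as a hypothetical objective function:

from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args

# Log-uniform priors: one step in the search space now corresponds to a
# multiplicative change rather than an additive one.
space = [
    Integer(50, 1000, prior="log-uniform", name="estimators"),
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
]

@use_named_args(space)
def objective(estimators, learning_rate):
    # Placeholder: return the validation loss of a model trained with these values.
    return train_and_validate(estimators, learning_rate)

result = gp_minimize(objective, space, n_calls=50, random_state=0)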

Why is the im2col + GEMM method more efficient than a direct implementation with SIMD in CNNs?

The convolutional layers are the most computationally intense parts of convolutional neural networks (CNNs). Currently, the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation needs to load and store the image data, and it also needs another memory block to hold the intermediate data.
If I need to optimize the convolutional implementation, I may choose a direct implementation with SIMD instructions. Such a method would not incur any memory operation overhead.
Yet the following article claims (near the end) that "the benefits from the very regular patterns of memory access outweigh the wasteful storage costs":
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
So I would like to understand the reason. Do the floating point operations require more instruction cycles? Or is the input image not that large, so that it stays resident in the cache and the memory operations never need to go out to DRAM, costing fewer cycles?
Cache-blocking a GEMM is possible so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.
Memory latency is very high compared to a CPU core clock cycle, but memory/cache bandwidth is not terrible these days with fast DDR4 or L3 cache hits. (But like I said, for a matmul with good cache blocking / loop tiling you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for an efficient matmul, not just transposing one matrix so its columns are sequential in memory.)
Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements, letting you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix was small enough to fit in L1d cache (32kiB).
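To make the memory-layout argument concrete, here is a minimal NumPy sketch of im2col for a single-channel image (no padding or stride handling; the shapes are illustrative). After the unrolling, every kernel application is a dot product over a contiguous column, which is exactly the sequential access pattern GEMM and SIMD want; the cols buffer is the extra storage the question mentions:

import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw patch of a single-channel image into one column.
    h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=x.dtype)
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = x[i:i + kh, j:j + kw].ravel()
            col += 1
    return cols

x = np.random.rand(8, 8).astype(np.float32)             # input image
kernels = np.random.rand(4, 3, 3).astype(np.float32)    # 4 output channels, 3x3 kernels
cols = im2col(x, 3, 3)                                   # extra buffer, shape (9, 36)
out = kernels.reshape(4, -1) @ cols                      # one GEMM: (4, 9) x (9, 36)
out = out.reshape(4, 6, 6)                               # back to the spatial layout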

How to differentiate between real improvement and random noise?

I am building an automatic translator in Moses. To improve its performance, I use log-linear weight optimisation. This technique has a random component, which can slightly affect the final result (but I do not know exactly how much).
Suppose that the current performance of the model is 25 BLEU.
Suppose now I modify the language model (e.g. change the smoothing), and I get a performance of 26 BLEU.
My question is: how can I know whether the improvement is due to the modification, or is just noise from the random component?
This is pretty much what statistics is all about. You can basically do one of two things (from the basic set of solutions; of course there are many more advanced ones):
try to measure/model/quantify the effect of the randomness. If you know what is causing it, you might be able to compute analytically how much it can affect your model. If an analytical solution is not possible, you can always train 20 models with the same data/settings, gather the results and estimate the noise distribution. Once you have this you can perform statistical tests to check whether the improvement is statistically significant (for example an ANOVA test).
a simpler approach (but more expensive in terms of data/time) is to simply reduce the variance by averaging. In short: instead of training one model (or evaluating a model once), which carries this hard-to-determine noise component, do it many times (10, 20) and average the results. This reduces the variance of the results in your analysis. It can (and should) be combined with the previous option: since you now have 20 results per run, you can again use statistical tests to see whether the two configurations are significantly different, as in the sketch below.
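A hedged sketch of that "repeat and test" idea, with made-up BLEU scores standing in for the results of repeated tuning runs:

import numpy as np
from scipy import stats

# Hypothetical BLEU scores from re-running the tuning several times per configuration.
baseline = np.array([25.1, 24.8, 25.3, 24.9, 25.2, 25.0])
modified = np.array([26.0, 25.7, 26.2, 25.9, 25.8, 26.1])

# Welch's two-sample t-test: is the mean BLEU gain larger than the tuning noise?
t_stat, p_value = stats.ttest_ind(modified, baseline, equal_var=False)
print(f"mean gain = {modified.mean() - baseline.mean():.2f} BLEU, p = {p_value:.4f}")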

Why are the elements of the matrix and vector types in the F# Powerpack mutable?

F# is often promoted as a functional language where data is immutable by default, yet the elements of the matrix and vector types in the F# PowerPack are mutable. Why is this?
Furthermore, why are sparse matrices implemented as immutable, as opposed to the normal matrices?
The standard array type ('T[]) in F# is also mutable. You're mostly correct -- F# is a functional language where data immutability is encouraged, but not required. Basically, F# allows you to write both mutable/imperative code and immutable/functional code; it's up to you to decide the best way to implement the code for your specific application.
Another reason for having mutable arrays and matrices is performance -- while it is possible to implement very fast algorithms with immutable types, users writing scientific computations usually care about one thing above all: achieving maximum performance. That being the case, it follows that the arrays and matrices should be mutable.
For truly high performance, mutability is required in one specific case: provided that your code is perfectly optimized and that you master everything it does, down to the cache (L1, L2) access pattern of your program, nothing beats a low-level, to-the-metal approach.
This happens mostly when you have one well-specified problem that stays constant for 20 years, i.e. mostly in scientific tasks.
As soon as you depart from this specific case, in 99.99% of cases the bottlenecks arise from having too low-level a representation (induced by a low-level language) in which you can't express the final, real-world optimization trade-offs of the problem at hand.
Bottom line: for performance, the following approach is (I think) the only way:
High-level / algorithmic optimization first
Once every high-level avenue has been explored, low-level optimization
As a consequence of that:
You should never optimize anything without FIRST measuring the impact: improvements should only be made if they yield enormous performance gains and/or do not degrade your domain logic.
If your problem is stable and well defined, you will eventually reach the point where you have no choice but to go low-level and play with memory/mutability.
