Guarantee Print Order After Parallelism - iOS

I have X cores doing unique work in parallel; however, their output needs to be printed in order.
Object {
    Data data
    int order
}
I've tried putting the objects in a min-heap after they finish their parallel work, but even that is too much of a bottleneck.
Is there any way I could have work done in parallel and guarantee the print order? Is there a known term for my problem? Have others encountered it before?

Is there any way I could have work done in parallel and guarantee the print order?
Needless to say, we design parallelized routines with a focus on efficiency, not on constraining the order of the calculations. The ordering should be imposed when the results are printed at the end, once everything is done. In fact, parallel routines often do their calculations conspicuously out of order (e.g., striding on each thread) to minimize threading and synchronization overhead.
The only question is how you structure the results to allow efficient storage and efficient, ordered retrieval. I often just use a mutable buffer or a pre-populated array. It’s very efficient in terms of both storage and retrieval. Or you can use a dictionary, too. It depends upon the nature of your Data. But I’d avoid the order property pattern in your result Object.
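To make that concrete, here is a minimal Swift sketch (my illustration, not code from the question, assuming Grand Central Dispatch and a hypothetical expensiveWork function): each iteration writes its result into the slot matching its index, and the ordered printing happens only after all the parallel work has finished.

import Foundation

// Stand-in for the real per-core computation (hypothetical).
func expensiveWork(_ n: Int) -> String {
    return "result for \(n)"
}

func computeAndPrintInOrder(inputs: [Int]) {
    // Pre-allocate one slot per unit of work; the index *is* the order.
    var results = [String?](repeating: nil, count: inputs.count)
    let lock = NSLock()   // protects `results` against concurrent writes

    DispatchQueue.concurrentPerform(iterations: inputs.count) { i in
        let value = expensiveWork(inputs[i])   // done in parallel, in any order
        lock.lock()
        results[i] = value
        lock.unlock()
    }

    // concurrentPerform returns only when every iteration is done,
    // so the buffer can now be printed front-to-back.
    for case let result? in results {
        print(result)
    }
}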
Just make sure you're using an optimized (release) build if you're using standard Swift collections, as this can have a material impact on performance.

Q : Is there a known term for my problem?
Yes, there is: a contradiction.
Definition of contradiction ...
2a: a proposition, statement, or phrase that asserts or implies both the truth and falsity of something // "... both parts of a contradiction cannot possibly be true ..." - Thomas Hobbes
2b: a statement or phrase whose parts contradict each other // "a round square is a contradiction in terms"
3a: logical incongruity
3b: a situation in which inherent factors, actions, or propositions are inconsistent or contrary to one another
(source: Merriam-Webster)
Computer science, having borrowed the terms { PARALLEL | SERIAL | CONCURRENT } from systems theory, respects the distinctive (and never overlapping) properties of each such class of operations, where:
[PARALLEL] orchestration of units-of-work implies that each and every work-unit a) starts, b) gets executed, and c) finishes at the same time, i.e. all enter and leave the [PARALLEL]-section at once and are processed at the very same time, not otherwise.
[SERIAL] orchestration of units-of-work implies that all work-units are processed in one static, known, particular order, each work-unit starting only after the previous one has finished, i.e. one-after-another, not otherwise.
[CONCURRENT] orchestration of units-of-work permits more than one unit-of-work to start whenever resources and system conditions permit (scheduler priorities obeyed), resulting in an unknown order of execution and unknown times of completion, as both depend on externalities (system conditions and the (non-)availability of the resources a particular work-unit needs).
Whereas there is an a-priori known, inherently embedded sense of ORDER in [SERIAL]-type processing (it is pre-wired into the orchestration code), no such meaning exists in [CONCURRENT] processing, where opportunistic scheduling makes any wished-for order a non-deterministic outcome of system state, skewed by the coincidence of all other externalities; and in true [PARALLEL] processing the wished-for order collapses to a single value by definition, as all work-units start, execute and finish at the same time, so each of them has no choice but to be both first and last simultaneously.
Q : Is there any way I could have work done in parallel and guarantee the print order?
No, not without intentionally or unknowingly violating the [PARALLEL] orchestration rules and re-introducing re-[SERIAL]-ising logic into the work-units, so as to imperatively enforce a wished-for ordering that is neither known to, nor natural for, the originally [PARALLEL] orchestration of the work-units (Python's GIL-monopolised, step-by-step execution is a common example of exactly such a re-serialising step).
Q : Have others encountered it before?
Yes. Since 2011, this or similar questions have reappeared here on Stack Overflow every semester, in growing numbers every year.

Related

Replicating trees between ACID RDBs using CRDTs

I'm interested in replicating "hierarchies" of data, say, similar to addresses.
Area
District
Sector
Unit
but you may have different pieces of data associated with each layer, so you may know the area of sectors but not of units, and you may know the population of a unit; basically it's not a homogeneous tree.
I know little about the replication of data beyond having brushed against Brewer's theorem (CAP) and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB into other ACID RDBs. Systemically, the system needs to eventually converge, and obviously each RDB will enforce its own locally consistent view, but any two nodes may not match at any given time (except 'eventually').
The simplest way to approach this is to simply store all the data in a single message from some designated leader and distribute it... like an overnight dump-and-load process, but that's too big.
So the next simplest thing (I thought) was that if something inside an area changes, I can export the complete set of data inside that area and load it into the nodes; that's still quite a coarse algorithm.
The next step was, if say an 'object' at any level changed, to send all the data on the path to that 'object', i.e. if something in a sector is amended, you would send the data associated with the sector, its parent the district, and its parent the area (with some sort of version stamp and, let's say, last update wins)... What I wanted to do was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which would potentially be created if it didn't exist).
Then I stumbled on CRDTs and thought... ah... I'm reinventing the wheel here, and the algorithms are allegedly easy in principle, but tricky to get correct in practice.
Are there standard, accepted patterns to do this sort of thing?
In my use case the hierarchies are quite shallow and there is only a single designated leader (at this time). I'm quite attracted to state-based CRDTs because then I can ignore ordering.
Simplicity is the key requirement.
Actually, it appears I've reinvented (in a very crude, naive way) the SHELF algorithm.
I'll write some code, see if I can get it to work, and try to understand what's going on.
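Since the question mentions version stamps with last-update-wins, here is a minimal Swift sketch of that idea expressed as a state-based CRDT (a last-writer-wins register); the types and example values are purely illustrative, and this is not the SHELF algorithm itself.

import Foundation

// A last-writer-wins register: one of the simplest state-based CRDTs.
struct LWWRegister<Value> {
    var value: Value
    var timestamp: Date
    var nodeID: String   // tie-breaker when two updates share a timestamp

    // Merging is commutative, associative and idempotent, so replicas
    // converge regardless of the order in which updates arrive.
    mutating func merge(_ other: LWWRegister<Value>) {
        if (other.timestamp, other.nodeID) > (timestamp, nodeID) {
            value = other.value
            timestamp = other.timestamp
            nodeID = other.nodeID
        }
    }
}

// Example: two replicas of a district's population figure.
var a = LWWRegister(value: 12_000, timestamp: Date(timeIntervalSince1970: 100), nodeID: "node-A")
let b = LWWRegister(value: 12_500, timestamp: Date(timeIntervalSince1970: 200), nodeID: "node-B")
a.merge(b)   // a.value is now 12_500: the later write wins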

Would there be a practical application for a more memory efficient boolean?

I've noticed that booleans occupy a whole byte, despite only needing 1 bit. I was wondering whether we could have something like
struct smartbool { char data; }
, which would store 8 booleans at once.
I am aware that it would take more time to retrieve the data, but would the tradeoff have a practical application in some scenarios?
Am I missing something about the memory usage of booleans?
Normally variables are aligned on word boundaries; memory use is balanced against efficiency of access. For one-off boolean variables it may not make sense to store them in a denser form.
If you do need a bunch of booleans you can use things like this BitSet data structure: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/BitSet.html.
There is a type of database index that stores booleans efficiently:
https://en.wikipedia.org/wiki/Bitmap_index. The less space an index takes up the easier it is to keep in memory.
There are already widely used data types that support multiple booleans: they are called integers. You can store and retrieve multiple booleans in an integral type using bitwise operations, screening out the bits you don't care about with a pattern of bits called a bitmask.
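As a concrete illustration (a Swift sketch with made-up names, rather than any standard library type), here is one way to pack eight flags into a single byte and use a bitmask to read or write an individual bit:

struct PackedBools {
    private var bits: UInt8 = 0              // 8 flags in a single byte

    func get(_ index: Int) -> Bool {
        precondition((0..<8).contains(index))
        return (bits >> index) & 1 == 1      // test one bit via shift and mask
    }

    mutating func set(_ index: Int, to newValue: Bool) {
        precondition((0..<8).contains(index))
        if newValue {
            bits |= (1 << index)             // set the bit
        } else {
            bits &= ~(1 << index)            // clear the bit
        }
    }
}

var flags = PackedBools()
flags.set(3, to: true)
print(flags.get(3))   // true
print(flags.get(0))   // false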
This sort of "packing" is certainly possible and sometimes useful, as a memory-saving optimization. Many languages and libraries provide a way to make it convenient, e.g. std::vector<bool> in C++ is meant to be implemented this way.
However, it should be done only when the programmer knows it will happen and specifically wants it. There is a tradeoff in speed: if bits are used, then setting / clearing / testing a specific bool requires first computing a mask with an appropriate shift, and setting or clearing it now requires a read-modify-write instead of just a write.
And there is a more serious issue in multithreaded programs. Languages like C++ promise that different threads can freely modify different objects, including different elements of the same array, without needing synchronization or causing a data race. For instance, if we have
bool a, b; // not atomic
void thread1() { /* reads and writes a */ }
void thread2() { /* reads and writes b */ }
then this is supposed to work fine. But if the compiler made a and b two different bits in the same byte, concurrent accesses to them would be a data race on that byte, and could cause incorrect behavior (e.g. if the read-modify-writes being done by the two threads were interleaved). The only way to make it safe would be to require that both threads use atomic operations for all their accesses, which are typically many times slower. And if the compiler could freely pack bools in this way, then every operation on a potentially shared bool would have to be made atomic, throughout the entire program. That would be prohibitively expensive.
So this is fine if the programmer wants to pack bools to save memory, is willing to take the hit to speed, and can guarantee that they won't be accessed concurrently. But they should be aware that it's happening, and have control over whether it does.
(Indeed, some people feel that having C++ provide this with vector<bool> was a mistake, since programmers have to know that it is a special exception to the otherwise general rule that vector<T> behaves like an array of T, and different elements of the vector can safely be accessed concurrently. Perhaps they should have left vector<bool> to work in the naive way, and given a different name to the packed version, similar to std::bitset.)

Wrapper for slow indicator for backtesting

If a technical indicator works very slowly, and I wish to include it in an EA (using iCustom()), is there some "wrapper" that could cache the indicator results to a file, based on the particular indicator inputs?
This way I could get a better speed next time when I backtest it using the same set of parameters, since the "wrapper" could read the result from file rather than recalculate the result from the indicator.
I heard that some developers did that for their needs in order to speed up backtesting, but as far as I know, there's no publicly available solution.
If I had to solve this problem, I would create a class with two fields (datetime and indicator value, or N buffers of the indicator), and a collection class similar to CArrayObj.mqh but with an option to apply binary search, or to start looking for an element from a specific index rather than from the very beginning of the array.
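For what it's worth, here is a rough sketch of that cache structure, written in Swift rather than MQL4 just to show the shape (one cached value per bar time, looked up with binary search instead of scanning from the start of the array); all names are illustrative.

import Foundation

struct IndicatorCacheEntry {
    let time: Date       // bar open time
    let value: Double    // cached indicator value for that bar
}

struct IndicatorCache {
    private var entries: [IndicatorCacheEntry] = []   // kept sorted by time

    // Binary search for the entry with the given bar time, if any.
    func lookup(time: Date) -> Double? {
        var low = 0
        var high = entries.count - 1
        while low <= high {
            let mid = (low + high) / 2
            if entries[mid].time == time { return entries[mid].value }
            if entries[mid].time < time { low = mid + 1 } else { high = mid - 1 }
        }
        return nil   // not cached yet; caller recalculates and stores
    }

    // Bars normally arrive in time order, so storing is usually just an append.
    mutating func store(time: Date, value: Double) {
        entries.append(IndicatorCacheEntry(time: time, value: value))
        entries.sort { $0.time < $1.time }   // cheap insurance if bars arrive out of order
    }
}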
Recent MT4 Builds added VERY restrictive conditions for Indicators
In the early years of MT4, things were not as restrictive as they are these days.
FACT#1: fileIO is ~10,000x to ~100,000x slower than memIO:
This means there is no benefit from "pre-caching" values to disk.
FACT#2: Processing Performance has a HARD CEILING:
All, yes ALL, Custom Indicators that are used in the MetaTrader4 Terminal (be it directly in the GUI, or indirectly via Template(s), or called via iCustom() calls, and in the Strategy Tester via .tpl + iCustom()) ALL THESE SHARE A SINGLE THREAD ...
FACT#3: Strategy Tester has the most demanding needs for speed:
Thus, eliminate all, indeed ALL, non-core indicators from the tester.tpl template and save it as "blank", to avoid any part of such non-core processing.
Next, re-design the Custom Indicator, where possible, so as to avoid any CPU-ops and MEM-allocations that are not necessary.
I remember a Custom Indicator design with deep convolutions that could be re-engineered to keep just a triangular sparse matrix with the necessary updates, which increased the speed of indicator processing by more than 10,000x, so code revision is the way.
So rather run a separate MetaTrader4 Terminal just for back-testing than wait for many hours due to the incompressible nature of numerical processing stuck in traffic-jam congestion inside the shared, single CustomIndicator thread, which no scheduling can improve.
FACT#4: the O/S can increase a process's priority:
Having got to this Devil's zone, it is common practice to spin up the PRIO of the StrategyTester MT4 process, up to "RealTime PRIO", using the O/S tools.
One may additionally "lock" this MT4 process onto certain CPU-core(s) and set up all other processes with adjacent CPU-core affinity, so that the two distinct groups of processes do not jump onto the other group's CPU-core(s). Hard, but if squeezing performance to the bleeding edge, this is a must.

Dynamic Process parameter adjustment in Semiconductor manufacturing data

I have process parameter data from semiconductor manufacturing, and the requirement is to suggest the best adjustments to make to the process parameters to get better yield, i.e. the best path to high yield. What machine learning / statistical models best suit this requirement?
Note: I have thought of using a decision tree, which can give us the best path to high yield.
I would like to know if any other methods could be more efficient.
The data looks like:
lotno x1 x2 x3 x4 x5 yield(%)
where <95% yield is coded as 0 and >95% as 1.
I'm not really sure of the question here, but as a former semiconductor process engineer, here is my perspective on the yield-improvement approach.
Process Development.
DOE: Typically, I would run structured DOEs to understand my process (#5). I would first identify "potential" 'factors' and run various "screening" experiments to identify statistical significance, the goal being to identify the most (and, for that matter, least) statistically significant factors. These are inherently simple experiments with a low number of "levels"; they don't target understanding of the curvature of the response surface, they just look for the magnitude of the response change versus each factor. Generally, I am most concerned with 'Process' factors, but it is important to recognize that the influence of variable inputs can come from more than just "machine knobs", for example. Variability can arise from 1) People, 2) Environment (moisture, temperature, etc.), 3) Consumables (used in the process), 4) Equipment (is 40 psi on this tool really 40 psi, and the same as 40 psi on a different tool?), and 5) Process variable settings.
With the most statistically significant factors, I would run more elaborate DOE using the major factors and analyze this data to develop a model. There are generally more 'levels' used here to allow for curvature insight of the response surface via the analysis. There are many types of well known standard experimental design structures here. And there is software such as JMP that is specifically set up to do this analysis.
From here, the idea would be to generate a model in the form of Response = F (Factors). That allows you to essentially optimize the response based upon these factors where the response is a reflection of your yield criteria.
From here, the engineer would typically execute confirmation runs with optimized factors to confirm optimized response.
Note that the software analysis typically allows the engineer to illuminate any run-order dependence. The execution of the DOE is typically performed in a randomized-cell fashion (each 'cell' is a set of conditions for the experiment). Similarly, the experiments include some level of repetition to gauge the 'repeatability' of the 'system'. This inclusion can be explicit (run the same cell twice), but there is also some level of repeatability inherent in the design, since you are running multiple cells, albeit at different settings. But generally, the experiment includes explicitly repeated cells.
And finally there is the concept of manufacturability, which includes constraints of time, cost, physical limits, equipment capability, etc. (The ideal process works great, but it takes 10 years, costs 1 million dollars and requires projected settings outsides the capability of the tool.)
Since you have manufacturing data, hopefully, you have the data that captures the other types of factors as well (1,2,3), so you should specifically analyze the data to try to identify such effects. This is typically done as A vs B comparisons. Person A vs B, Tool A vs B, Consumable A vs B, Consumable lot A vs B, Summer vs Winter, etc.
Basically, there are all sorts of comparisons you could envision here and check for statistically differences across two sets of populations.
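To make one such comparison concrete, here is a small Swift sketch (illustrative data and names) of the Welch t-statistic for two samples, e.g. yields from Tool A versus Tool B; looking the statistic up against a t-distribution for a p-value would normally be done in a statistics package.

import Foundation

// Welch's t-statistic for two independent samples with unequal variances.
func welchT(_ a: [Double], _ b: [Double]) -> Double {
    func meanAndVariance(_ x: [Double]) -> (mean: Double, variance: Double) {
        let m = x.reduce(0, +) / Double(x.count)
        let v = x.map { ($0 - m) * ($0 - m) }.reduce(0, +) / Double(x.count - 1)
        return (m, v)
    }
    let (ma, va) = meanAndVariance(a)
    let (mb, vb) = meanAndVariance(b)
    return (ma - mb) / (va / Double(a.count) + vb / Double(b.count)).squareRoot()
}

let toolA = [94.1, 95.3, 96.0, 94.8, 95.5]   // illustrative yield numbers
let toolB = [92.9, 93.4, 94.0, 93.1, 93.8]
print(welchT(toolA, toolB))                  // a large magnitude hints at a real A-vs-B difference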
A comment on response: What is the yield criteria? You should know this in order to formulate the model. For semiconductors, we have both line yield (process yield) but there is also device yield. I assume for your work, you are primarily concerned with line yield. So minimizing variability in the factors (from 1,2,3,4) to achieve the desired response (target response(s) with minimal variability) is the primary goal.
APC (Advanced Process Control).
In many cases, there is significant trending that results from whatever reason; crappy tool control (the tool heats up), crappy consumable (the target material wears, the polishing pad wears, the chemical bath gets loaded, whatever), and so the idea here is how to adjust the next batch/lot/wafer based upon the history of what came prior. Either improve the manufacturing to avoid/minimize this trending (run order dependence) or adjust process to accommodate it to achieve the desired response.
Time for lunch, hope this helps...if you post on the specific process module type, and even equipment and consumables, I might be able to provide more insight.

What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for(), apply(), and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these will all be calling R functions, so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops you always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check whether you have exhausted that storage, and bolt on another big chunk if so.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (as lapply() does) then that is where the performance gain can come from over apply(), say, which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a small manipulation of X and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p. 25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.
