What is the best way to implement Lambda-architecture batch_layer and serving_layer?

If I am building a project that applies the Lambda architecture, should I split the batch layer and the serving layer, i.e. program A does the batch layer's work and program B does the serving layer's? They would be physically independent but logically related, since program A could tell B to start working after A finishes its pre-computation work.
If so, how should I implement it? I am thinking about IPC. If IPC could help, what is the specific way to use it?
BTW, what does "batch view" mean exactly? Why and how does the serving layer index it?

What is the best way to implement Lambda Architecture batch_layer and serving_layer? That totally depends upon the specific requirements, system environment, and so on. I can address how to design Lambda Architecture batch_layer and serving_layer, though.
Incidentally, I was just discussing this with a colleague yesterday, and this answer is based on that discussion. I will explain it in three parts, and for the sake of this discussion let's say we are interested in designing a system that computes the most-read stories (a) of the day, (b) of the week, and (c) of the year:
Firstly, in a Lambda architecture it is important to divide the problem you are trying to solve with respect to time first and features second. So if you model your data as an incoming stream, then the speed layer deals with the 'head' of the stream, e.g. the current day's data, while the batch layer deals with the 'head' plus the 'tail', i.e. the master dataset.
Secondly, divide the features along these time-based lines. Some features can be computed from the 'head' of the stream alone, while other features require a wider breadth of data than the 'head', i.e. the master dataset. In our example, let's say we define the speed layer to compute one day's worth of data. Then the speed layer would compute the most-read stories (a) of the day in the so-called Speed View, while the batch layer would compute the most-read stories (a) of the day, (b) of the week, and (c) of the year in the so-called Batch View. Yes, there may appear to be a bit of redundancy, but hold on to that thought.
Thirdly, the serving layer responds to queries from clients by consulting the Speed View and the Batch View and merging the results accordingly. There will necessarily be some overlap between the results in the Speed View and the Batch View. No matter: this divide between speed and batch, among other benefits, lets us minimize exposure to risks such as (1) rolling out bugs, (2) delivering corrupt data, and (3) long-running batch processes. Ideally, issues will be caught in the Speed View and fixes will be applied before the Batch View is re-computed; if so, all is well and good.
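To make the merge concrete, here is a minimal sketch of how a serving layer might answer the 'most-read stories of the day' query from the two views. The view structures, bucket keys, and function names are assumptions for illustration only, not part of any particular framework.

```python
from collections import Counter

def most_read_today(batch_view, speed_view, top_n=10):
    """Merge per-hour read counts from the Batch View and the Speed View.

    Both views map an hour bucket (e.g. '2024-01-01T13') to a Counter of
    story_id -> reads. Because the two layers overlap in time, the batch
    result is preferred for any bucket it has already processed, and the
    speed result is used only for buckets the batch view has not reached yet.
    """
    merged = Counter()
    for bucket, counts in speed_view.items():
        if bucket not in batch_view:      # only the stream 'head' not yet absorbed
            merged.update(counts)
    for counts in batch_view.values():    # authoritative, pre-computed buckets
        merged.update(counts)
    return merged.most_common(top_n)
```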
In summary, no IPC needs to be used, since the layers are completely independent of each other: program A does not need to tell program B when to work. Instead, the system relies on some overlap of processing. For instance, if the batch layer computes its Batch View on a daily basis, then the speed layer needs to compute its Speed View for that day plus any additional time the batch processing may take. This extra time also needs to cover any downtime in the batch layer.
Hope this helps!
Notes:
Redundancy of the batch layer - it is necessary to have at least some redundancy in the batch layer since the serving layer must be able to provide a single cohesive view of results to queries. At the least, the redundancy may help avoid time-gaps in query responses.
Assessing which features are in the speed layer - this step will not always be as convenient as in the 'most read stories' example here. This is more of an art form.

Related

Inquiry of Drake simulator automatic differentiation

I have a question regarding the Drake simulator's automatic differentiation abilities. I have a paper coming out in a few months, and some of the feedback was that I didn't comment enough on automatic differentiation.
I am familiar with automatic differentiation but am unclear how exactly it works with physics simulators. As far as I'm aware, once you have constructed the graph, you can query it several times with a forward pass and calculate the partial derivatives of outputs with respect to inputs. In my head, querying such a graph should be computationally quick.
In the Drake simulator, once I load a scene, let's say a robot arm with a single free body (like a cube or cylinder), does it create a graph that you can query regardless of the state of the system? Or does the graph need to be reconstructed depending on the system state? For instance, would the same graph work both when the arm is in contact with the free body and when it is doing free-space motion?
There is this paper (https://arxiv.org/pdf/2202.13986.pdf) where they use Drake for contact-based manipulation tasks in Python. Their optimization takes significant time, and they claim it is down to Drake's automatic differentiation scheme. The only way I can see the derivatives over their trajectories taking so long is if a new graph needs to be constructed at each time step.
Is anyone from the Drake team able to comment on this? Or maybe even link me to a useful document on how Drake's automatic differentiation works? I have been unsuccessful in finding this information myself so far.
Drake uses Eigen's AutoDiffScalar instead of double to obtain derivatives from the same code we use for computation. That method does not build a graph at all but rather performs rote propagation of the chain rule through the computation, ending up finally with both the result and the partial derivatives of that result with respect to any chosen variables.
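For intuition, here is a tiny, graph-free, forward-mode autodiff sketch using dual numbers in Python. It is emphatically not Drake's or Eigen's implementation; it only shows how a value and its derivative can be carried through ordinary computation code together, which is the "rote propagation of the chain rule" described above.

```python
import math

class Dual:
    """A scalar carrying its value and its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule applied on the fly -- no expression graph is ever stored
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

# The same "simulation" code can run on plain floats or on Dual numbers.
def step(q):
    return sin(q) * q + 2.0 * q

out = step(Dual(0.3, 1.0))     # seed dq/dq = 1
print(out.value, out.deriv)    # result and d(step)/dq at q = 0.3
```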

Methods to Find 'Best' Cut-Off Point for a Continuous Target Variable

I am working on a machine learning scenario where the target variable is Duration of power outages.
The distribution of the target variable is severely right-skewed. (You can imagine that most power outages occur and are over with fairly quickly, but there are many, many outliers that can last much longer.) A lot of these power outages become less and less 'explainable' by the data as the durations get longer. They become, more or less, 'unique outages', where events are occurring on site that are not necessarily typical of other outages, and no data is recorded on the specifics of those events beyond what's already available for all the 'typical' outages.
This causes a problem when creating models: the unexplainable data mingles with the explainable part and skews the model's ability to predict.
I analyzed some percentiles to decide on a point that I considered to encompass as many outages as possible while the duration still seemed mostly explainable. This was somewhere around the 320-minute mark and contained about 90% of the outages.
This was completely subjective, though, and I suspect there must be some kind of procedure for determining a 'best' cut-off point for this target variable. Ideally, the procedure would be robust enough to weigh the trade-off of encompassing as much data as possible, rather than telling me to make my cut-off 2 hours and thus cutting out a significant number of customers, since the purpose of this is to provide an accurate Estimated Restoration Time to as many customers as possible.
FYI: the modeling methods that appear to be working best right now are random forests and conditional random forests. Methods I have tried in this scenario include multiple linear regression, decision trees, random forests, and conditional random forests. MLR was by far the least effective. :(
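(A minimal sketch of the kind of percentile check described above; the file and column names are assumptions for illustration.)

```python
import pandas as pd

outages = pd.read_csv("outages.csv")          # hypothetical outage data
durations = outages["duration_minutes"]

# Inspect a range of upper percentiles to see where a cut-off might sit.
for q in (0.80, 0.85, 0.90, 0.95):
    cutoff = durations.quantile(q)
    n_above = (durations > cutoff).sum()
    print(f"{q:.0%} percentile: cut-off {cutoff:.0f} min, {n_above} outages excluded")
```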
I have exactly the same problem! I hope someone more informed brings their knowledge. I wonder at what point a long duration becomes something that we want to discard rather than something we want to predict!
Also, I tried treating my data by log-transforming it, and the density plot shows a funny artifact on the left side of the distribution (because I only have integer durations, not floats). I think this helps; you should also log-transform the features that have similar distributions.
I finally thought that the solution should be stratified sampling or giving weights to features, but I don't know exactly how to implement that. My attempts didn't produce any good results. Perhaps my data is too stochastic!
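(A rough sketch of the log-transform idea above; the file and column names are assumptions. log1p is used so zero-minute durations don't break the transform.)

```python
import numpy as np
import pandas as pd

outages = pd.read_csv("outages.csv")          # hypothetical outage data

# log(1 + duration) pulls in the long right tail of the target;
# predictions on this scale are mapped back with expm1.
outages["log_duration"] = np.log1p(outages["duration_minutes"])

# ...fit any regressor on 'log_duration' instead of the raw target...
# predicted_minutes = np.expm1(model.predict(X_new))
```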

SGD weight space updates with asynchronous training

I'm looking at creative ways to speed up training time for my neural nets and maybe also reduce the vanishing gradient problem. I was considering breaking the net up onto different nodes, using classifiers on each node as backprop "boosters", and then stacking the nodes on top of each other with sparse connections between them (as many as I can get away with before Ethernet saturation makes it pointless). If I do this, I am uncertain whether I have to maintain some kind of state between nodes and train synchronously on the same example (which probably defeats the purpose of speeding up the process), OR whether I can simply train on the same data but asynchronously. I think I can, and that the weight space can still be updated and propagated down my sparse connections between nodes even if they are training on different examples, but I'm not certain. Can someone confirm this is possible, or explain why not?
It is possible to do what you suggest, however it is a formidable amount of work for one person to undertake. The most recent example that I'm aware of is the "DistBelief" framework, developed by a large research/engineering team at Google -- see the 2012 NIPS paper at http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf.
Briefly, the DistBelief approach partitions the units in a neural network so that each worker machine in a cluster is responsible for some disjoint subset of the overall architecture. Ideally the partitions are chosen to minimize the amount of cross-machine communication required (i.e., a min-cut through the network graph).
Workers perform computations locally for their part of the network, and then send updates to the other workers as needed for links that cross machine boundaries.
Parameter updates are handled by a separate "parameter server." The workers send gradient computations to the parameter server, and periodically receive updated parameter values from the server.
The entire setup runs asynchronously and works pretty well. Due to the async nature of the computations, the parameter values for a given computation might be "stale," but they're usually not too far off. And the speedup is pretty good.
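To make the parameter-server idea concrete, here is a heavily simplified, single-process sketch of asynchronous SGD in Python. Threads stand in for worker machines and the "server" is just a lock-protected weight vector; none of this is DistBelief code, and the linear model is only there to keep the example short.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the shared weights; workers push gradients and pull fresh copies."""
    def __init__(self, dim, lr=0.01):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.weights.copy()   # may already be slightly stale when used

    def push(self, grad):
        with self.lock:
            self.weights -= self.lr * grad

def worker(server, X, y, steps=2000):
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(X))
        w = server.pull()
        grad = (X[i] @ w - y[i]) * X[i]  # squared-error gradient, one example
        server.push(grad)                # no coordination with other workers

# Toy linear-regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w

server = ParameterServer(dim=5)
threads = [threading.Thread(target=worker, args=(server, X, y)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.weights)   # ends up close to [1, 2, 3, 4, 5] despite stale reads
```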

Neo4j partition

Is there a way to physically separate Neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted in Docker):
Match (b:User:Google)
This is the case:
I want to store data for several clients under Neo4j, hopefully lots of them.
Now, I'm not sure what the best design for that is, but it has to fulfill a few conditions:
No mixed data should be returned from a Cypher query (it's really hard to make sure that no developer will forget the ":Partition1" (for example) in a Cypher query).
The performance of one client shouldn't affect another client. For example, if one client has lots of data and another client has a small amount, or if a "heavy" query from one client is currently running, I don't want the "light" queries of another client to suffer from slow performance.
In other words, storing everything on one node will, I think, run into scalability problems at some point in the future when I have more clients.
BTW, is it common to have a few clusters?
Also, what's the advantage of partitioning over creating a different Label for each client? For example: Users_client_1, Users_client_2, etc.
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines and then serve many requests against those copies quickly, but it doesn't partition a really huge graph so that some of it is stored here, some other parts there, and the pieces are then connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations, and then needing to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is, how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote stolen):
Typically, graph partition problems fall under the category of NP-hard problems. Solutions to these problems are generally derived using heuristics and approximation algorithms.[3] However, uniform graph partitioning or a balanced graph partition problem can be shown to be NP-complete to approximate within any finite factor.[1] Even for special graph classes such as trees and grids, no reasonable approximation algorithms exist,[4] unless P=NP. Grids are a particularly interesting case since they model the graphs resulting from Finite Element Model (FEM) simulations. When not only the number of edges between the components is approximated, but also the sizes of the components, it can be shown that no reasonable fully polynomial algorithms exist for these graphs.
Not to leave you with too much doom and gloom: plenty of people have partitioned big graphs. Facebook and Twitter do it every day, so you can read about FlockDB on the Twitter side or avail yourself of the relevant Facebook research. But to summarize and cut to the chase: it depends on your data, and most people who partition design a custom partitioning strategy; it's not something the software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some sense: if you store a graph on top of Hadoop, and Hadoop already automagically scales across a cluster, then technically that is partitioning your graph for you, automagically. Cool, right? Well... cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly because all of those partitions have to be traversed -- exactly the performance situation you were trying to avoid by partitioning wisely in the first place.
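Since the usual answer is a custom strategy at the application layer, here is a minimal sketch of tenant-based routing with the official Neo4j Python driver: each client's data lives in its own Neo4j instance, and the application picks the connection by client ID. The URIs and credentials are placeholders, and this is one possible design rather than a built-in feature.

```python
from neo4j import GraphDatabase

# One Neo4j instance (or Docker container) per client -- a hand-rolled
# "partitioning" strategy handled entirely by the application.
CLIENT_URIS = {
    "client_1": "bolt://neo4j-client1:7687",
    "client_2": "bolt://neo4j-client2:7687",
}

drivers = {
    client: GraphDatabase.driver(uri, auth=("neo4j", "password"))
    for client, uri in CLIENT_URIS.items()
}

def get_user_names(client_id):
    """Run a query only against the instance holding this client's data."""
    with drivers[client_id].session() as session:
        result = session.run("MATCH (u:User) RETURN u.name AS name")
        return [record["name"] for record in result]

print(get_user_names("client_1"))   # never touches client_2's data
```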

What is Scaling?

I always get this argument against RoR that it doesn't scale, but I never get an appropriate answer about what that really means. So here is a novice asking: what the hell is this "scaling", and how do you measure it?
What the hell is this "scaling"...
As a general term, scalability means the responsiveness of a project to different kinds of demand. A project that scales well is one that doesn't have any trouble keeping up with requests for more of its services -- or, at the least, doesn't have to start turning away requests because it can't handle them.
It's often the case that simply increasing the size of a problem by an order of magnitude or two exposes weaknesses in the strategies that were used to solve it. When such weaknesses are exposed, it might be said that the solution to the problem doesn't "scale well".
For example, bogo sort is easy to implement, but as soon as you're sorting more than a handful of things, it starts taking a very long time to get the answer you want. It would be fair to say that bogo sort doesn't scale well.
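For the curious, here is a throwaway sketch of bogo sort; the point is only that its expected running time grows roughly factorially with input size, which is about as badly as an algorithm can scale.

```python
import random

def is_sorted(items):
    return all(items[i] <= items[i + 1] for i in range(len(items) - 1))

def bogo_sort(items):
    """Shuffle until sorted. The expected number of shuffles grows like n!,
    so going from 5 elements to 10 makes it tens of thousands of times slower."""
    while not is_sorted(items):
        random.shuffle(items)
    return items

print(bogo_sort([3, 1, 2]))   # fine for a handful of items, hopeless beyond that
```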
... and how you measure it?
That's a harder question to answer. In general, there aren't units associated with scalability; statements like "that system is N times as scalable as this one is" at best would be an apples-to-oranges comparison.
Scalability is most frequently measured by seeing how well a system stands up to different kinds of demand in test conditions. People might say a system scales well if, over a wide range of demand of different kinds, it can keep up. This is especially true if it stands up to demand that it doesn't currently experience, but might be expected to if there's a sudden surge in popularity. (Think of the Slashdot/Digg/Reddit effects.)
Scaling or scalability refers to how a project can grow or expand to respond to the demand:
http://en.wikipedia.org/wiki/Scalability
Scalability has a wide variety of uses as indicated by Wikipedia:
Scalability can be measured in various dimensions, such as:
Load scalability: The ability for a distributed system to easily expand and contract its resource pool to accommodate heavier or lighter loads. Alternatively, the ease with which a system or component can be modified, added, or removed to accommodate changing load.
Geographic scalability: The ability to maintain performance, usefulness, or usability regardless of expansion from concentration in a local area to a more distributed geographic pattern.
Administrative scalability: The ability for an increasing number of organizations to easily share a single distributed system.
Functional scalability: The ability to enhance the system by adding new functionality at minimal effort.
In one area where I work we are concerned with the performance of high-throughput and parallel computing as the number of processors is increased.
More generally, it is often found that increasing the problem size by (say) one or two orders of magnitude throws up a completely new set of challenges which are not easily predictable from the smaller system.
It is a term for expressing the ability of a system to keep its performance as it grows over time.
Ideally, what you want is a system that reaches linear scalability: by adding new units of resources, the system grows equally in its ability to perform.
For example, if three webapp servers can handle a thousand concurrent users, then adding three more servers should let the system handle double the amount, two thousand concurrent users in this case, and no less.
If a system does not have the property of linear scalability, there is a point where adding more resources, e.g. hardware, will not bring any additional benefit; the marginal performance gain converges to zero as more and more servers are put to the task. In the above example, the additional benefit of each new server would become smaller and smaller until it reaches zero.
Thus, scalability is the factor that tells you what you get as output for a given input. Its value range lies between 0 and positive infinity, in theory. In practice, anything equal to 1 is most desirable...
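As a toy illustration of that "factor", here is a small calculation of scaling efficiency from hypothetical throughput measurements (the numbers are made up): an efficiency near 1 means roughly linear scaling, and values sliding toward 0 mean each extra server buys less and less.

```python
# Hypothetical measurements: servers deployed -> concurrent users handled.
measurements = {1: 350, 3: 1000, 6: 1800, 12: 2600}

baseline_servers, baseline_users = 1, measurements[1]

for servers, users in sorted(measurements.items()):
    speedup = users / baseline_users
    efficiency = speedup / (servers / baseline_servers)   # 1.0 == perfectly linear
    print(f"{servers:2d} servers: {users:5d} users, scaling efficiency {efficiency:.2f}")
```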
Scalability refers to the ability of a system to accommodate a changing number of users. This can be an increasing or decreasing number of users, as we now try to plan our systems around cloud computing and rented computing time.
Think about what is involved in making an order-entry system designed for 1,000 reps scale to accommodate 100,000 reps. What hardware needs to be added? What about the databases? In a nutshell, this is scalability.
Scalability of an application refers to how it is able to perform as the load on the application changes. This is often affected by the number of connected users, amount of data in a database, etc.
It is the ability of a system to accept an increased workload, more functionality, a changing database, and so on, without impacting the original design or the system.

Resources