I have a pipeline that translates a bounded data source into a set of RPCs to a third-party system, and want to have a reasonable balance between batching requests for efficiency and enforcing a maximum batch size. Is GroupIntoBatches the appropriate transform to use in this case? Are there any concerns around efficiency in batch mode that I should be aware of?
Based on the unit tests, it appears that the "final" batch will be emitted for a bounded source (even if it doesn't make up a full batch), correct?
I think that GroupIntoBatches is a good approach for this use case. Keep in mind that this transform uses KV pairs, and the parallelism you can achieve will be limited by the number of keys. I suggest taking a look at this answer.
Regarding the batch size, yes, the batches may be smaller if there are not enough elements left to fill them. Take a look at the example in the Beam Python documentation.
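For illustration, here is a minimal sketch of that behavior (not the documentation's exact example; the key, values, and batch size are made up):

```python
# A minimal sketch: GroupIntoBatches over a small bounded source.
# The key, values, and batch size are made up for illustration.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([("key", i) for i in range(10)])  # KV pairs
        | "Batch" >> beam.GroupIntoBatches(batch_size=4)
        | "Print" >> beam.Map(print)
        # Emits ('key', [0, 1, 2, 3]), ('key', [4, 5, 6, 7]) and a final,
        # smaller batch ('key', [8, 9]).
    )
```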
GroupIntoBatches will work. If you're running a batch pipeline and don't have a natural key on which to group (making up a random one will often result in batches that are too small, or parallelism that is too small and can interact poorly with liquid sharding), you should consider using BatchElements instead, which can batch without keys and can be configured with either a fixed or dynamic batch size.
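A hedged sketch of what that could look like (the element values and batch bounds are illustrative):

```python
# BatchElements needs no keys; each downstream element is a list of inputs.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(range(10))
        | "Batch" >> beam.BatchElements(min_batch_size=2, max_batch_size=4)
        # Each element here is now a list of up to 4 inputs, which could be
        # turned into a single RPC per batch.
        | "SendRPC" >> beam.Map(print)
    )
```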
In InfluxDB 1.5, the /write path can accept multiple points in a POST request.
What is a reasonable maximum payload size for this? 100 points? 1,000? 10,000? More?
Since your question uses the word "should", and I assume that any way of sending metrics to InfluxDB uses /write under the hood, I feel that the official docs actually have a generalized answer to your question:
...This means that batching points together is required to achieve high throughput performance. (Optimal batch size seems to be 5,000-10,000 points per batch for many use cases.)
In addition to that, InfluxDB write capabilities are directly related to your hardware sizing.
Note that 10,000 is not an upper limit, just an official recommendation. I believe InfluxDB can process far more than that in a single batch. In the end, it is best to check empirically, particularly on your own hardware.
I had some problems with 25,000 points or more. The points were written by a small Python script from a pandas DataFrame, and the code closely followed the InfluxDB example (DataFrame to InfluxDB with Python).
It did not matter how many rows and columns were present; the error was reproducible based on the total number of points to be written.
It is better to stay below 20000 points per transfer to avoid exceptions.
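For illustration, a hedged sketch of chunked writes with the influxdb-python client; the host, database, measurement, and batch size are placeholders, and the 10,000-point batch size simply follows the recommendation quoted above:

```python
# Chunked writes with influxdb-python; all names and sizes are placeholders.
from influxdb import InfluxDBClient

BATCH_SIZE = 10_000

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# Points in the client's JSON point format (timestamps omitted for brevity).
points = [
    {
        "measurement": "cpu_load",
        "tags": {"host": f"server{i % 3}"},
        "fields": {"value": 0.5},
    }
    for i in range(100_000)
]

# Either chunk manually...
for start in range(0, len(points), BATCH_SIZE):
    client.write_points(points[start:start + BATCH_SIZE])

# ...or let the client chunk for you via its batch_size argument:
# client.write_points(points, batch_size=BATCH_SIZE)
```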
It's certainly possible to view a Dask graph at any stage while holding onto the object. However, once .compute() is called on a Dask object, there is an opportunity to apply additional optimizations to the Dask graph before running the computation. Any optimizations applied at this stage would impact how the computation is run, but this optimized graph would not necessarily be attached to a corresponding Dask object available to the user. Is there a way to also view the final Dask graph that was actually used for the computation?
The graph is not easily accessible after it has been submitted.
If you are using the distributed scheduler you can inspect the state there after submission, but it is no longer in a form that matches the traditional graph specification.
The best option I can think of is to optimize the graph yourself before computing and inspect the result. This isn't guaranteed to be exactly the same graph the scheduler ends up running, but it is likely close.
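A minimal sketch, assuming a dask.array workload (the same idea applies to dataframes and delayed objects):

```python
# dask.optimize applies the default graph optimizations up front, so the
# result should be close to (though not guaranteed identical to) what
# compute() ends up running.
import dask
import dask.array as da

x = da.random.random((1_000, 1_000), chunks=(250, 250))
y = (x + x.T).sum()

(y_opt,) = dask.optimize(y)                        # optimize without computing
y_opt.visualize(filename="optimized_graph.svg")    # requires graphviz
print(len(y_opt.__dask_graph__()), "tasks after optimization")
```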
I created a Neo4j 3 database that includes some test data, and also a small application that sends HTTP Cypher requests to Neo4j. These requests are always of the same type; actually, it's a query template that differs only in some attributes. I am interested in the performance of these statements.
I know that I can use PROFILE to get some information in the browser. But I want to execute a set of statements, e.g. 10 example queries, several times and calculate the average performance. Is there an easy way or a tool to do this, or do I have to write e.g. a Python script that collects these values? It does not have to be a big application; I just want to see some general performance metrics.
I don't think there is an out-of-the-box tool for benchmarking Neo4j yet. So your best option is to implement your own solution - but you have to be careful if you want to get results that are (to some degree) representative:
Check the docs on performance.
Give the Neo4j JVM sufficient time to warm up. This means that you'll want to run a warmup phase with the queries and discard their execution times.
Instead of using a client-server architecture, you can also opt to use Neo4j in embedded mode, which will give you a better idea of the query performance itself (without the overhead of the driver and the serialization/deserialization process). However, in this case you have to implement the benchmark on the JVM (in Java or possibly Jython).
Run each query multiple times. Do not use the arithmetic mean to summarize the results, as it is sensitive to outlier values (you can get high values for a number of reasons, e.g. if the OS scheduler starts some job in the background during a particular query execution).
A good paper on the topic, How not to lie with statistics: the correct way to summarize benchmark results, argues that you should use the geometric mean.
It is also common practice in performance experiments in computer science papers to use the median value. I tend to use this option - e.g. this figure shows the execution times of two simple SPARQL queries on in-memory RDF engines (Jena and Sesame), for their first executions and the median values of 5 consecutive executions. (A minimal Python sketch of the warmup-plus-median approach is included at the end of this answer.)
Note however, that Neo4j employs various caching mechanisms, so if you only run the same query multiple times, it will only need to compute the results on the first execution and following executions will use the cache - unless the database is updated between the query executions.
As a good approximation, you can design the benchmark to resemble your actual workload as closely as possible - in many cases, application-specific macrobenchmarks make more sense than microbenchmarks. So if each query will only be evaluated once by the application, it is perfectly acceptable to benchmark only the first evaluation.
(Bonus.) Another good read on the topic is The Benchmark Handbook - chapter 1 discusses the most important criteria for domain-specific benchmarks (relevance, portability, scalability and simplicity). These are probably not required for your benchmark, but they are nice to know.
I worked on a cross-technology benchmark considering relational, graph and semantic databases, including Neo4j. You might find some useful ideas or code snippets in the repository: https://github.com/FTSRG/trainbenchmark
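For reference, here is the minimal sketch of the warmup-plus-median approach mentioned above. It assumes the HTTP transactional endpoint of Neo4j 3.x; the URL, credentials, query, and run counts are placeholders to adjust to your setup:

```python
# Warmup phase (discarded) followed by measured runs summarized by the median.
import statistics
import time

import requests

URL = "http://localhost:7474/db/data/transaction/commit"
AUTH = ("neo4j", "password")                     # placeholder credentials
PAYLOAD = {"statements": [{"statement": "MATCH (n:Person) RETURN count(n)"}]}

WARMUP_RUNS = 10     # discarded, to let the JVM and page cache warm up
MEASURED_RUNS = 10

def run_once():
    start = time.perf_counter()
    response = requests.post(URL, json=PAYLOAD, auth=AUTH)
    response.raise_for_status()
    return time.perf_counter() - start

for _ in range(WARMUP_RUNS):
    run_once()

timings = [run_once() for _ in range(MEASURED_RUNS)]
print(f"median over {MEASURED_RUNS} runs: {statistics.median(timings) * 1000:.1f} ms")
```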
I know it's very specific to the environment running the code, but given that Dask calculates its execution plan in advance as a DAG, is there a way to understand how long that execution should take?
The progress bar is a great help once execution is running, but would it be possible to understand beforehand how long a series of operations should be expected to take?
Short Answer
No.
Explanation
The Dask scheduler just executes Python functions. It doesn't think about where they came from or the broader context of what they represent (for example, a dataframe join or matrix multiply). From its perspective it has just been asked to execute a graph of opaque function calls. This generality is a weakness (hard to perform high level analysis) but also Dask's main strength, because it can be applied to a broad variety of problems outside of any particular domain or specialty.
The distributed scheduler does maintain an exponentially weighted average of each function's duration, which could be used to create a runtime estimate for a task graph. I would search the scheduler.py file for task_duration if you're interested in building this.
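For illustration, a very rough sketch of the kind of estimate described above. This is not a Dask feature; the average task duration and worker count are made-up placeholders, and it ignores scheduling overhead and the structure of the graph entirely:

```python
# A naive back-of-the-envelope estimate, not a Dask feature. In practice you
# would measure the per-task duration (e.g. from the scheduler's task_duration
# statistics) rather than assume it.
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
y = (x @ x.T).mean()

n_tasks = len(y.__dask_graph__())        # number of tasks in the graph

ASSUMED_AVG_TASK_SECONDS = 0.01          # placeholder per-task duration
N_WORKERS = 4                            # placeholder parallelism

estimate = n_tasks * ASSUMED_AVG_TASK_SECONDS / N_WORKERS
print(f"{n_tasks} tasks, naive estimate: {estimate:.1f} s")
```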
If I am building a project applying the Lambda Architecture now, should I split the batch layer and the serving layer, i.e. have program A do the batch layer's work and program B do the serving layer's? They would be physically independent but logically related, since program A can tell B to work after A finishes its pre-compute work.
If so, would you please tell me how to implement it? I am thinking about IPC. If IPC could help, what would be the specific approach?
BTW, what does "batch view" mean exactly? Why and how does the serving layer index it?
What is the best way to implement the Lambda Architecture batch layer and serving layer? That totally depends on the specific requirements, system environment, and so on. I can address how to design the batch layer and serving layer, though.
Incidentally, I was just discussing this with a colleague yesterday, and this answer is based on that discussion. I will explain it in three parts, and for the sake of this discussion let's say we are interested in designing a system that computes the most read stories (a) of the day, (b) of the week, and (c) of the year:
Firstly, in a Lambda Architecture it is important to divide the problem you are trying to solve with respect to time first and features second. So if you model your data as an incoming stream, the speed layer deals with the 'head' of the stream, e.g. the current day's data, while the batch layer deals with the 'head' + 'tail', i.e. the masterset.
Secondly, divide the features along these time-based lines. For instance, some features can be computed using the 'head' of the stream alone, while other features require a wider breadth of data than the 'head', e.g. the masterset. In our example, let's say that we define the speed layer to compute one day's worth of data. Then the Speed Layer would compute the most read stories (a) of the day in the so-called Speed View, while the Batch Layer would compute the most read stories (a) of the day, (b) of the week, and (c) of the year in the so-called Batch View. Note that, yes, there may appear to be a bit of redundancy, but hold on to that thought.
Thirdly, the serving layer responds to queries from clients against the Speed View and the Batch View and merges the results accordingly. There will necessarily be overlap between the results from the Speed View and the Batch View. No matter: this divide of Speed vs. Batch, among other benefits, allows us to minimize exposure to risks such as (1) rolling out bugs, (2) corrupt data delivery, and (3) long-running batch processes. Ideally, issues will be caught in the Speed View and fixes applied prior to the Batch View re-compute; if so, then all is well and good.
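For illustration, a minimal sketch of such a merge in the serving layer. All names and data are hypothetical, and the overlap-resolution rule (prefer the Batch View where both views report the same story) is just one simple choice; the exact rule depends on how the layers' time windows are defined:

```python
# A hypothetical serving-layer merge for "most read stories of the day".
# Where both views report the same story, the Batch View (recomputed from the
# masterset) wins; stories only the Speed View has seen are taken from there.
from collections import Counter

def most_read_of_day(batch_view: dict, speed_view: dict, top_n: int = 10):
    merged = dict(speed_view)
    merged.update(batch_view)            # Batch View wins on overlapping stories
    return Counter(merged).most_common(top_n)

# Toy data standing in for the two precomputed views:
batch_view = {"story-a": 120, "story-b": 90}
speed_view = {"story-b": 95, "story-c": 40}
print(most_read_of_day(batch_view, speed_view, top_n=3))
```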
In summary, no IPC needs to be used, since the two programs are completely independent of each other. Program A does not need to communicate with program B. Instead, the system relies upon some overlap of processing. For instance, if program B computes its Batch View on a daily basis, then program A needs to compute the Speed View for the day plus any additional time that processing may take. This extra time needs to include any downtime in the batch layer.
Hope this helps!
Notes:
Redundancy of the batch layer - it is necessary to have at least some redundancy in the batch layer since the serving layer must be able to provide a single cohesive view of results to queries. At the least, the redundancy may help avoid time-gaps in query responses.
Assessing which features are in the speed layer - this step will not always be as convenient as in the 'most read stories' example here. This is more of an art form.