How to test Dataflow Pipeline with BigQuery - google-cloud-dataflow

I'd like to test my pipeline.
My pipeline extracts data from BigQuery, then stores it in GCS and S3.
Although there is some information about pipeline testing here,
https://cloud.google.com/dataflow/pipelines/testing-your-pipeline,
it does not cover the data model used when extracting data from BigQuery.
I found the following example for it, but it lacks comments, so it is a little difficult to understand.
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/test/java/com/google/cloud/dataflow/examples/cookbook/BigQueryTornadoesTest.java
Are there any good documents for testing my pipeline?

In order to properly integration-test your entire pipeline, please create a small amount of sample data stored in BigQuery. Also, please create a sample bucket/folder in S3 and GCS to store your output. Then run your pipeline as you normally would, using PipelineOptions to specify the test BQ table. You can use the DirectPipelineRunner if you want to run locally. It will probably be easiest to create a script which first runs the pipeline, then downloads the data from S3 and GCS and verifies you see what you expect.
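For example, a minimal sketch of such an options interface, assuming the 1.x DataflowJavaSDK (the interface name, option name, and default table below are made up for illustration):

import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

// Hypothetical options interface: the pipeline reads the table name from here
// instead of hard-coding it, so a test run can point at a small test table.
public interface MyPipelineOptions extends PipelineOptions {
  @Description("BigQuery table to read from, as project:dataset.table")
  @Default.String("my-project:prod_dataset.events")
  String getInputTable();
  void setInputTable(String value);
}

The test script can then pass something like --inputTable=my-project:test_dataset.events --runner=DirectPipelineRunner on the command line, build the options with PipelineOptionsFactory.fromArgs(args).withValidation().as(MyPipelineOptions.class), and have the pipeline read from options.getInputTable().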
If you want to just test your pipeline's transforms on some offline data, then please follow the WordCount example.
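For instance, in the style of the BigQueryTornadoesTest linked in the question, a commented sketch of a transform test might look like this (the ExtractNameFn class and the "name" field are made up for illustration; only DoFnTester, TableRow, and JUnit come from the SDK and its dependencies):

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.DoFnTester;
import java.util.List;
import org.junit.Assert;
import org.junit.Test;

public class ExtractNameFnTest {

  // Hypothetical DoFn that pulls a single field out of a BigQuery row.
  static class ExtractNameFn extends DoFn<TableRow, String> {
    @Override
    public void processElement(ProcessContext c) {
      c.output((String) c.element().get("name"));
    }
  }

  @Test
  public void testExtractNameFn() throws Exception {
    // Build in-memory TableRows instead of reading from BigQuery.
    TableRow row1 = new TableRow().set("name", "alice");
    TableRow row2 = new TableRow().set("name", "bob");

    // DoFnTester runs the DoFn over the given elements without a full pipeline.
    DoFnTester<TableRow, String> tester = DoFnTester.of(new ExtractNameFn());
    List<String> output = tester.processBatch(row1, row2);

    Assert.assertTrue(output.contains("alice"));
    Assert.assertTrue(output.contains("bob"));
  }
}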

In Google Cloud it is easy to create end-to-end tests using real resources like Pub/Sub topics and BigQuery tables.
By using the JUnit 5 extension model (https://junit.org/junit5/docs/current/user-guide/#extensions) you can hide the complexity of creating and deleting the required resources.
You can find a demo/seed here https://github.com/gabihodoroaga/dataflow-e2e-demo and a blog post here https://hodo.dev/posts/post-31-gcp-dataflow-e2e-tests/
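As a minimal sketch of such an extension, assuming the google-cloud-bigquery client library and illustrative dataset/table names, something like this creates a throwaway table before the tests in a class and drops it afterwards:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import org.junit.jupiter.api.extension.AfterAllCallback;
import org.junit.jupiter.api.extension.BeforeAllCallback;
import org.junit.jupiter.api.extension.ExtensionContext;

// Illustrative JUnit 5 extension: the dataset, table, and schema are made up.
public class BigQueryTableExtension implements BeforeAllCallback, AfterAllCallback {

  private static final TableId TABLE = TableId.of("e2e_test_dataset", "events");
  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

  @Override
  public void beforeAll(ExtensionContext context) {
    // Create the test table before any test in the class runs.
    Schema schema = Schema.of(Field.of("name", StandardSQLTypeName.STRING));
    bigquery.create(TableInfo.of(TABLE, StandardTableDefinition.of(schema)));
  }

  @Override
  public void afterAll(ExtensionContext context) {
    // Clean up the test table once all tests have finished.
    bigquery.delete(TABLE);
  }
}

A test class then just declares @ExtendWith(BigQueryTableExtension.class) and can assume the table exists while it runs.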

Related

Spring Cloud Data Flow : Sample application to have multiple inputs and outputs

Does anyone have a sample application that demonstrates Spring Cloud Data Flow with multiple inputs and outputs support?
These sample apps use multiple inputs and outputs.
The code from the acceptance tests provides an illustration of usage.
In Data Flow you start by adding only one app to the DSL with one of the required properties; then, on deployment, you remove any extra properties and provide all the properties needed to connect the various streams.
The properties refer to each app by its registered name.
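For reference, a minimal sketch of an app with multiple outputs using Spring Cloud Stream's functional style (the split() bean and the even/odd logic are made up for illustration; the two outputs are bound as split-out-0 and split-out-1):

import java.util.function.Function;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.core.publisher.Flux;
import reactor.util.function.Tuple2;
import reactor.util.function.Tuples;

@Configuration
public class SplitterConfiguration {

  // Illustrative function with one input binding and two output bindings:
  // even numbers go to the first output, odd numbers to the second.
  @Bean
  public Function<Flux<Integer>, Tuple2<Flux<Integer>, Flux<Integer>>> split() {
    return input -> {
      Flux<Integer> shared = input.publish().autoConnect(2);
      return Tuples.of(
          shared.filter(n -> n % 2 == 0),
          shared.filter(n -> n % 2 != 0));
    };
  }
}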

Conditional Flow - Spring Cloud Dataflow

In Spring Cloud Data Flow (Stream), what's the right generic way to implement a conditional flow based on a processor's output?
For example, in the case below, the execution should flow through different paths based on the product price emitted by a transformer.
(Example process diagram)
You can use the router-sink to dynamically decide where to send the data based on the upstream event-data.
There is simple SpEL support as well as more comprehensive support via a Groovy script, both of which can help with decision making and conditions. See the README.
If you foresee more complex conditions and process workflows, alternatively, you could build a custom processor that can do the dynamic routing to various downstream named-channel destinations. This approach can also help with unit/IT testing the business-logic standalone.
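As a rough sketch of that custom-processor option, assuming Spring Cloud Stream's StreamBridge and made-up destination names, a price threshold, and a Product payload type:

import java.util.function.Consumer;
import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class PriceRouterConfiguration {

  // Illustrative record standing in for the upstream event payload.
  public record Product(String name, double price) {}

  // Sends each product to one of two named destinations based on its price;
  // the destination names and the 100.0 threshold are made up for this sketch.
  @Bean
  public Consumer<Product> routeByPrice(StreamBridge streamBridge) {
    return product -> {
      String destination = product.price() > 100.0 ? "high-price" : "low-price";
      streamBridge.send(destination, product);
    };
  }
}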

Conditional iterations in Google cloud dataflow

I am looking at the opportunities for implementing a data analysis algorithm using Google Cloud Dataflow. Mind you, I have no experience with dataflow yet. I am just doing some research on whether it can fulfill my needs.
Part of my algorithm contains some conditional iterations, that is, continue until some condition is met:
PCollection data = ...
while (needsMoreWork(data)) {
    data = doAStep(data);
}
I have looked around in the documentation and, as far as I can see, I am only able to do "iterations" if I know the exact number of iterations before the pipeline starts. In that case my pipeline construction code can just create a sequential pipeline with a fixed number of steps.
The only "solution" I can think of is to run each iteration in separate pipelines, store the intermediate data in some database, and then decide in my pipeline construction whether or not to launch a new pipeline for the next iteration. This seems to be an extremely inefficient solution!
Are there any good ways to perform this kind of conditional iteration in Google Cloud Dataflow?
Thanks!
For the time being, the two options you've mentioned are both reasonable. You could even combine the two approaches. Create a pipeline which does a few iterations (becoming a no-op if needsMoreWork is false), and then have a main Java program that submits that pipeline multiple times until needsMoreWork is false.
We've seen this use case a few times and hope to address it natively in the future. Native support is being tracked in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50.
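A rough sketch of that resubmitting driver, where buildIterationPipeline() and needsMoreWork() are hypothetical placeholders for your own per-iteration step and termination check (e.g. inspecting a row count or marker file written by the previous run):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class IterativeDriver {

  public static void main(String[] args) {
    int iteration = 0;
    while (needsMoreWork(iteration)) {
      Pipeline p = buildIterationPipeline(args, iteration);
      // With the DirectPipelineRunner or BlockingDataflowPipelineRunner this
      // call returns only after the iteration has finished.
      p.run();
      iteration++;
    }
  }

  private static Pipeline buildIterationPipeline(String[] args, int iteration) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... add the transforms for one iteration, reading the previous
    // iteration's intermediate output and writing this iteration's output ...
    return p;
  }

  private static boolean needsMoreWork(int iteration) {
    // ... check the intermediate data produced by the previous iteration ...
    return iteration == 0; // placeholder: run a single iteration
  }
}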

How to compute custom metrics using Elasticsearch + Kibana?

It seems like a pretty easy question, but for some reason I still can't figure out how to solve it. I have an Elasticsearch cluster which is using the Twitter river to download tweets. I would like to implement a sentiment analysis module which takes each tweet and computes a score (+ve/-ve) etc. I would like the score to be computed for each of the existing tweets as well as for new tweets, and then visualized using Kibana.
However, I am not sure where should I place the call to this sentiment analysis module in the elastic search pipeline.
I have considered the option of modifying twitter river plugin but that will not work retrospectively.
Essentially, I need to answer two questions:
1) How to call Python/Java code while indexing a document so that I can modify the JSON accordingly.
2) How to use the same code to modify all the existing documents in ES.
If you don't want an external application to do the analysis before indexing the documents in Elasticsearch, the best way I guess is to write a plugin that does it. You can write a plugin that implements a custom analyzer that does the sentiment analysis. Then in the mapping define the fields you want to run your analyzer on.
See examples of analysis plugins -
https://github.com/barminator/elasticsearch-analysis-annotation
https://github.com/yakaz/elasticsearch-analysis-combo/
To run the analysis on all existing documents you will need to reindex them after defining the correct mapping.

External data source with specflow

I find entering data in a SpecFlow feature file very painful, especially when the data is repetitive and large. Can we use an external data source, like a spreadsheet, to enter this data and then use that external data source in the feature file?
It's theoretically possible, but probably so much effort that you wouldn't want to do it.
The problem is that the feature file is simply a human-readable form. When it is saved in Visual Studio it is parsed and converted into the feature.cs file, and that is the one that is compiled and used for testing.
So your process would become:
1. edit the spreadsheet
2. export to the feature file
3. get SpecFlow's VS plugin to convert it to feature.cs
4. run msbuild
5. run the tests via NUnit or similar
I wouldn't do this. Instead I'd focus on getting my tests to be better examples. It sounds like you are trying to exhaustively cover every possibility. Don't come up with examples to cover every possible case; instead cover as much logic as possible with fewer tests.
