According to the Beam website,
Often it is faster and simpler to perform local unit testing on your
pipeline code than to debug a pipeline’s remote execution.
For this reason, I want to use test-driven development for my Beam/Dataflow app that writes to Bigtable.
However, following the Beam testing documentation I hit an impasse: PAssert isn't useful, because the output PCollection contains org.apache.hadoop.hbase.client.Put objects, which don't override the equals method.
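For example (a small illustration with hypothetical keys and values, not from my actual pipeline), two Puts built from the same row key and cell still compare unequal, so PAssert cannot match expected and actual elements:
Put expected = new Put(Bytes.toBytes("row"));
expected.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("cq"), Bytes.toBytes("value"));
Put actual = new Put(Bytes.toBytes("row"));
actual.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("cq"), Bytes.toBytes("value"));
// Put inherits Object.equals, so identical mutations are still "different":
System.out.println(expected.equals(actual)); // false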
I can't get the contents of the PCollection to do validation on them either, since
It is not possible to get the contents of a PCollection directly - an
Apache Beam or Dataflow pipeline is more like a query plan of what
processing should be done, with PCollection being a logical
intermediate node in the plan, rather than containing the data.
So how can I test this pipeline, other than manually running it? I'm using Maven and JUnit (in Java since that's all the Dataflow Bigtable Connector seems to support).
The Bigtable Emulator Maven plugin can be used to write integration tests for this:
1. Configure the Maven Failsafe plugin and change your test case's suffix from *Test to *IT so it runs as an integration test.
2. Install the Bigtable Emulator in the gcloud SDK from the command line:
gcloud components install bigtable
Note that this required step is going to reduce code portability (e.g. will it run on your build system? On other devs' machines?), so I'm going to containerize it using Docker before deploying to the build system.
3. Add the emulator plugin to the pom per the README.
4. Use the HBase Client API and see the example Bigtable Emulator integration test to set up your session and table(s).
5. Write your test as normal per the Beam documentation, except that instead of using PAssert you actually call CloudBigtableIO.writeToTable and then use the HBase client to read the data back from the table to verify it.
Here's an example integration test:
package adair.example;
import static org.apache.hadoop.hbase.util.Bytes.toBytes;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.hamcrest.collection.IsIterableContainingInAnyOrder;
import org.junit.Assert;
import org.junit.Test;
import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
/**
* A simple integration test example for use with the Bigtable Emulator maven plugin.
*/
public class DataflowWriteExampleIT {

  private static final String PROJECT_ID = "fake";
  private static final String INSTANCE_ID = "fakeinstance";
  private static final String TABLE_ID = "example_table";
  private static final String COLUMN_FAMILY = "cf";
  private static final String COLUMN_QUALIFIER = "cq";

  private static final CloudBigtableTableConfiguration TABLE_CONFIG =
      new CloudBigtableTableConfiguration.Builder()
          .withProjectId(PROJECT_ID)
          .withInstanceId(INSTANCE_ID)
          .withTableId(TABLE_ID)
          .build();

  public static final List<String> VALUES_TO_PUT = Arrays
      .asList("hello", "world", "introducing", "Bigtable", "plus", "Dataflow", "IT");

  @Test
  public void testPipelineWrite() throws IOException {
    try (Connection connection = BigtableConfiguration.connect(PROJECT_ID, INSTANCE_ID)) {
      Admin admin = connection.getAdmin();
      createTable(admin);

      List<Mutation> puts = createTestPuts();

      // Use Dataflow to write the data--this is where you'd call the pipeline you want to test.
      Pipeline p = Pipeline.create();
      p.apply(Create.of(puts)).apply(CloudBigtableIO.writeToTable(TABLE_CONFIG));
      p.run().waitUntilFinish();

      // Read the data from the table using the regular HBase API for validation.
      ResultScanner scanner = getTableScanner(connection);
      List<String> resultValues = new ArrayList<>();
      for (Result row : scanner) {
        String cellValue = getRowValue(row);
        System.out.println("Found value in table: " + cellValue);
        resultValues.add(cellValue);
      }

      Assert.assertThat(resultValues,
          IsIterableContainingInAnyOrder.containsInAnyOrder(VALUES_TO_PUT.toArray()));
    }
  }

  private void createTable(Admin admin) throws IOException {
    HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(TABLE_ID));
    tableDesc.addFamily(new HColumnDescriptor(COLUMN_FAMILY));
    admin.createTable(tableDesc);
  }

  private ResultScanner getTableScanner(Connection connection) throws IOException {
    Scan scan = new Scan();
    Table table = connection.getTable(TableName.valueOf(TABLE_ID));
    return table.getScanner(scan);
  }

  private String getRowValue(Result row) {
    return Bytes.toString(row.getValue(toBytes(COLUMN_FAMILY), toBytes(COLUMN_QUALIFIER)));
  }

  private List<Mutation> createTestPuts() {
    return VALUES_TO_PUT
        .stream()
        .map(this::stringToPut)
        .collect(Collectors.toList());
  }

  private Mutation stringToPut(String cellValue) {
    String key = UUID.randomUUID().toString();
    Put put = new Put(toBytes(key));
    put.addColumn(toBytes(COLUMN_FAMILY), toBytes(COLUMN_QUALIFIER), toBytes(cellValue));
    return put;
  }
}
In Google Cloud you can easily do e2e testing of your Dataflow pipeline using real cloud resources like Pub/Sub topics and BigQuery tables.
By using the JUnit 5 Extension Model (https://junit.org/junit5/docs/current/user-guide/#extensions) you can create custom classes that handle the creation and deletion of the required resources for your pipeline.
You can find a demo/seed project here https://github.com/gabihodoroaga/dataflow-e2e-demo and a blog post here https://hodo.dev/posts/post-31-gcp-dataflow-e2e-tests/.
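For illustration, here is a minimal sketch of such an extension that manages a Pub/Sub topic around a test class. It assumes the google-cloud-pubsub client library; the project and topic names are hypothetical placeholders, and it is not taken from the linked demo project.
import com.google.cloud.pubsub.v1.TopicAdminClient;
import com.google.pubsub.v1.TopicName;
import org.junit.jupiter.api.extension.AfterAllCallback;
import org.junit.jupiter.api.extension.BeforeAllCallback;
import org.junit.jupiter.api.extension.ExtensionContext;

/** Creates a temporary Pub/Sub topic before the test class runs and deletes it afterwards. */
public class PubSubTopicExtension implements BeforeAllCallback, AfterAllCallback {

  // Hypothetical values: substitute your own project and a unique topic name.
  private static final String PROJECT_ID = "my-test-project";
  private static final String TOPIC_ID = "dataflow-e2e-" + System.currentTimeMillis();

  private final TopicName topicName = TopicName.of(PROJECT_ID, TOPIC_ID);

  @Override
  public void beforeAll(ExtensionContext context) throws Exception {
    // Create the resource the pipeline under test will read from or write to.
    try (TopicAdminClient client = TopicAdminClient.create()) {
      client.createTopic(topicName);
    }
  }

  @Override
  public void afterAll(ExtensionContext context) throws Exception {
    // Clean up so repeated test runs don't leak cloud resources.
    try (TopicAdminClient client = TopicAdminClient.create()) {
      client.deleteTopic(topicName);
    }
  }
}
A test class then opts in with @ExtendWith(PubSubTopicExtension.class) and can reference the same topic name in its pipeline options.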
Related
I have a shared library of my own where a 3rd party Java library gets 'grabbed' and used more or less as follows:
package mypackage

@Grab(group='some.group', module='some-module', version='1.0')
import some.group.some-module.Utils

@Singleton
class Grabber implements Serializable {
    def static void callExternalMethod(String params) {
        Utils.method(params); // I would like to see what this one prints to stdout
    }
}
The pipeline is more or less
@Library('mylibrary')
import mypackage.*

node {
    ...
    stage("run") {
        Grabber.callExternalMethod("some parameters")
    }
}
The @Grab works and the method is called, but I would like to include what Utils.method() normally prints to stdout in the Jenkins build output.
Is there a way to do this?
I am setting up a Jenkins Pipeline, which calls an external library containing a compare-XML function written in Groovy that utilises XMLUnit.
The function looks as follows:
import java.util.List
import org.custommonkey.xmlunit.*

// Gives you a list of all the differences.
@NonCPS
void call(String xmlControl, String xmlTest) throws Exception {
    String myControlXML = xmlControl
    String myTestXML = xmlTest

    DetailedDiff myDiff = new DetailedDiff(compareXML(myControlXML, myTestXML));
    List allDifferences = myDiff.getAllDifferences();
    assertEquals(myDiff.toString(), 0, allDifferences.size());
}
However, when running the pipeline in Jenkins it throws a java.io.NotSerializableException.
Checking Stack Overflow, it seemed like adding the @NonCPS annotation might help.
But sadly it did not make a difference.
What more could I try to resolve the java.io.NotSerializableException?
In the interest of providing a minimal example of my problem, I'm trying to implement a simple Beam job that takes in a String as a side input and applies it to a PCollection which is read from a csv file in Cloud Storage. The result is then output to a .txt file in Cloud Storage.
So far, I have tried: experimenting with PipelineResult.waitUntilFinish (as in p.run().waitUntilFinish()), altering the placement of the two p.run() calls, and simplifying as much as possible by just using a string as my side input, always with the same result. Searching on Stack Overflow and Google just led me to the PR on the Beam repo which implemented the error message.
SideInputTest.java:
public class SideInputTest {
  public static void main(String[] arg) throws IOException {
    // Build a pipeline to read in string
    DataflowPipelineOptions options1 = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options1.setRunner(DataflowRunner.class);
    Pipeline p = Pipeline.create(options1);

    // Build really simple side input
    PCollectionView<String> sideInputView = p.apply(Create.of("foo"))
        .apply(View.<String>asSingleton());

    // Run p
    p.run();

    // Build main pipeline to read csv data
    DataflowPipelineOptions options2 = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options2.setProject(PROJECT_NAME);
    options2.setStagingLocation(STAGING_LOCATION);
    options2.setRunner(DataflowRunner.class);
    Pipeline p2 = Pipeline.create(options2);

    p2.apply(TextIO.Read.from(INPUT_DATA))
        .apply(ParDo.withSideInputs(sideInputView).of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] rowData = c.element().split(",");
            String sideInput = c.sideInput(sideInputView);
            c.output(rowData[0] + sideInput);
          }
        }))
        .apply(TextIO.Write
            .to(OUTPUT_DATA));

    p2.run();
  }
}
Full stack trace:
Caused by: java.lang.NullPointerException: Unknown producer for value SingletonPCollectionView{tag=Tag<org.apache.beam.sdk.util.PCollectionViews$SimplePCollectionView.<init>:435#3d93cb799b3970be>} while translating step ParDo(Anonymous)
at org.apache.beam.runners.dataflow.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:1079)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.getProducer(DataflowPipelineTranslator.java:508)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateSideInputs(DataflowPipelineTranslator.java:926)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateInputs(DataflowPipelineTranslator.java:913)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.access$1100(DataflowPipelineTranslator.java:112)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translateSingleHelper(DataflowPipelineTranslator.java:863)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translate(DataflowPipelineTranslator.java:856)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$7.translate(DataflowPipelineTranslator.java:853)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.visitPrimitiveTransform(DataflowPipelineTranslator.java:415)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:486)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:481)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$400(TransformHierarchy.java:231)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:206)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:321)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.translate(DataflowPipelineTranslator.java:365)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:154)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:514)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:151)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:210)
at com.xpw.SideInputTest.main(SideInputTest.java:63)
Currently using org.apache.beam packages, version 0.6.0.
This code is taking a PCollectionView created in one pipeline (p.apply(Create.of("foo")).apply(View.<String>asSingleton());) and using it in another pipeline (p2).
PCollections and PCollectionViews belong to a particular pipeline, and reusing them in a different pipeline is not supported.
You can create an analogous PCollectionView in p2.
I'm also confused as to what your pipeline p is trying to accomplish: the only transform it has is creating the view, so there's no data being processed in it. I think you should get rid of p entirely and just use p2.
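For illustration, here is a minimal sketch of that single-pipeline version, reusing the question's own placeholders and the same 0.6.0-era API (a sketch of the intended structure, not a verified complete program):
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject(PROJECT_NAME);
options.setStagingLocation(STAGING_LOCATION);
options.setRunner(DataflowRunner.class);
Pipeline p2 = Pipeline.create(options);

// Build the side input view on the same pipeline that consumes it.
final PCollectionView<String> sideInputView =
    p2.apply(Create.of("foo")).apply(View.<String>asSingleton());

p2.apply(TextIO.Read.from(INPUT_DATA))
    .apply(ParDo.withSideInputs(sideInputView).of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String[] rowData = c.element().split(",");
        c.output(rowData[0] + c.sideInput(sideInputView));
      }
    }))
    .apply(TextIO.Write.to(OUTPUT_DATA));

p2.run();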
I am using Neo4j in embedded mode. For some operations on the database on the server, I am trying to execute a Groovy script. The Groovy script runs successfully without any error, but it does not create any new record when I check with the neo4j-community tool.
Script
/**
* Created by prabjot on 7/1/17.
*/
#Grab(group="org.neo4j", module="neo4j-kernel", version="2.3.6")
#Grab(group="org.neo4j", module="neo4j-lucene-index", version="2.3.6")
#Grab(group='org.neo4j', module='neo4j-shell', version='2.3.6')
#Grab(group='org.neo4j', module='neo4j-cypher', version='2.3.6')
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import org.neo4j.graphdb.Node
import org.neo4j.graphdb.Result
import org.neo4j.graphdb.Transaction
class Neo4jEmbeddedAccess {
public static void main(String[] args) {
def map=[:]
map.put("allow_store_upgrade","true")
map.put("remote_shell_enabled","true")
def db = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder("/opt/neo4j-community-3.0.4/data/databases/graph.db")
.setConfig(map)
.newGraphDatabase()
Transaction tx =db.beginTx()
Node person = db.createNode();
person.setProperty("name","prabjot")
print("id---->" + person.id);
Result result = db.execute("Match (country:Country) where id(country)=73 SET country.modified=true return country")
print(result)
tx.success();
println """starting embedded graph db
use bin/neo4j-shell from a new distribution to connect
we're keeping the graphdb open for 120 secs"""
db.shutdown()
}
Please help me find what I am doing wrong here. I have checked my db location and it is the same one I am using in both the script and the tool.
Thanks
You forgot tx.close(), which commits the transaction.
tx.success() only marks it as successful.
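For illustration, a minimal sketch of the corrected write section (plain try/finally, which works in both Java and the Groovy script above; db is the GraphDatabaseService already opened in the question):
Transaction tx = db.beginTx();
try {
    Node person = db.createNode();
    person.setProperty("name", "prabjot");
    db.execute("Match (country:Country) where id(country)=73 SET country.modified=true return country");
    tx.success();   // only marks the transaction as successful
} finally {
    tx.close();     // the commit (or rollback) actually happens here
}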
I want to use the dynamic scheduling feature of the Grails Quartz plugin.
I am running Grails 2.3.5 and the Quartz plugin (quartz:1.0.2).
I am able to persist the Quartz information to my MySQL database and I am able to run normal Quartz jobs.
The problem is scheduling tasks dynamically; I am not getting this to work.
Here is my setup and what I am trying to do:
I have a simple Job in "grails-app/tao/marketing/MarketingJob" which looks like this:
package tao.marketing

import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

class MarketingJob {

    static triggers = {}

    def execute(JobExecutionContext context) {
        try {
            def today = new Date()
            println today
        }
        catch (Throwable e) {
            throw new JobExecutionException(e.getMessage(), e);
        }
    }
}
Which I now try to schedule dynamically from a Service.
package tao

import grails.transaction.Transactional
import tao.marketing.CampaignSchedule
import tao.Person
import jobs.tao.marketing.*

class ScheduleService {

    def scheduleMarketingForPerson(CampaignSchedule campaignSchedule, Person person) {
        log.info("Schedule new Marketing for: " + person.last_name)
        campaignSchedule.scheduleActions.each {
            Date today = new Date();
            Date scheduleDate = today + it.afterXdays
            log.info("ScheduleAction: " + it.id + ": " + scheduleDate)
            MarketingJob.schedule(scheduleDate, ["scheduleActions.id": it.id, "person.apiKey": person.apiKey])
        }
    }
}
In my IDE (STS) MarketingJob cannot be found.
MarketingJob.schedule(scheduleDate, ["scheduleActions.id":it.id, "person.apiKey":person.apiKey])
How do I correctly import the MarketingJob?
Do I understand the dynamic scheduling feature correctly?
Could it be that your job is in package tao.marketing while your import is import jobs.tao.marketing.*? That is, the import starts with "jobs".
The problem I had was that in my STS IDE I didn't have the jobs directory marked as a code directory. Thanks for all your comments.