Classify a new image on a trained custom model in Deeplearning4j (convolutional network)

I'm new to Deeplearning4j. I already experimented with its word2vec functionality and everything was fine. But now I am a little bit confused about image classification. I was playing with this example:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/convolution/AnimalsClassification.java
I changed the "save" flag to true and my model is stored into model.bin file.
Now comes the problematic part (I am sorry if this sounds as silly question, maybe I am missing something really obvious here)
I created separate class called AnimalClassifier and its purpose is to load model from model.bin file, restore neural network from it and then classify single image using restored network. For this single image I created "temp" folder -> dl4j-examples/src/main/resources/animals/temp/ where I put picture of polar bear that was previously used in training process in AnimalsClassification.java (I wanted to be sure that image would be classified correctly - therefore I reused picture from "bear" folder).
This my code trying to classify polar bear:
protected static int height = 100;
protected static int width = 100;
protected static int channels = 3;
protected static int numExamples = 1;
protected static int numLabels = 1;
protected static int batchSize = 10;
protected static long seed = 42;
protected static Random rng = new Random(seed);
protected static int listenerFreq = 1;
protected static int iterations = 1;
protected static int epochs = 7;
protected static double splitTrainTest = 0.8;
protected static int nCores = 2;
protected static boolean save = true;
protected static String modelType = "AlexNet"; //
public static void main(String[] args) throws Exception {
    String basePath = FilenameUtils.concat(System.getProperty("user.dir"), "dl4j-examples/src/main/resources/");
    MultiLayerNetwork multiLayerNetwork = ModelSerializer.restoreMultiLayerNetwork(basePath + "model.bin", true);
    ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
    File mainPath = new File(System.getProperty("user.dir"), "dl4j-examples/src/main/resources/animals/temp/");
    FileSplit fileSplit = new FileSplit(mainPath, NativeImageLoader.ALLOWED_FORMATS, rng);
    BalancedPathFilter pathFilter = new BalancedPathFilter(rng, labelMaker, numExamples, numLabels, batchSize);
    InputSplit[] inputSplit = fileSplit.sample(pathFilter, 1);
    InputSplit analysedData = inputSplit[0];
    ImageRecordReader recordReader = new ImageRecordReader(height, width, channels);
    recordReader.initialize(analysedData);
    DataSetIterator dataIter = new RecordReaderDataSetIterator(recordReader, batchSize, 0, 4);
    while (dataIter.hasNext()) {
        DataSet testDataSet = dataIter.next();
        String expectedResult = testDataSet.getLabelName(0);
        List<String> predict = multiLayerNetwork.predict(testDataSet);
        String modelResult = predict.get(0);
        System.out.println("\nFor example that is labeled " + expectedResult + " the model predicted " + modelResult + "\n\n");
    }
}
After running this, I get an error:
java.lang.UnsupportedOperationException
at org.datavec.api.writable.ArrayWritable.toInt(ArrayWritable.java:47)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.getDataSet(RecordReaderDataSetIterator.java:275)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:186)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:389)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:52)
at org.deeplearning4j.examples.convolution.AnimalClassifier.main(AnimalClassifier.java:66)
Disconnected from the target VM, address: '127.0.0.1:63967', transport: 'socket'
Exception in thread "main" java.lang.IllegalStateException: Label names are not defined on this dataset. Add label names in order to use getLabelName with an id.
at org.nd4j.linalg.dataset.DataSet.getLabelName(DataSet.java:1106)
at org.deeplearning4j.examples.convolution.AnimalClassifier.main(AnimalClassifier.java:68)
I can see there is a method public void setLabels(INDArray labels) in MultiLayerNetwork.java, but I don't get how to use it (especially since it takes an INDArray as an argument).
I am also confused about why I have to specify the number of possible labels in the constructor of RecordReaderDataSetIterator. I would expect the model to already know which labels to use (shouldn't it automatically use the labels that were used during training?). I guess maybe I am loading the picture in a completely wrong way...
So to summarize, I would like to achieve simply the following:
restore network from model (this is working)
load image to be classified (also working)
classify this image using the same labels that were used during training (bear, deer, duck, turtle) (tricky part)
Thank you in advance for your help or any hints!

So, summarizing your multiple questions here:
A record for images is 2 entries in a collection. The second one is the label. The label index is relative to the kind of record you pass in.
The second part of your question:
Multiple entries can be a part of a dataset. The list refers to a label for an item at a particular row in the minibatch.
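To tie the pieces together: for a single image you can skip the record reader and iterator entirely. Below is a minimal sketch (not from the original answer; the file name and label list are assumptions, since ParentPathLabelGenerator derives labels from folder names in alphabetical order) that loads one image, applies the same 0-1 scaling as the training example, and maps the winning output index back onto the training labels:
//A minimal sketch, assuming the model was trained on 100x100x3 inputs scaled to [0,1]
MultiLayerNetwork network = ModelSerializer.restoreMultiLayerNetwork(basePath + "model.bin");
NativeImageLoader loader = new NativeImageLoader(100, 100, 3);
INDArray image = loader.asMatrix(new File(basePath + "animals/temp/polar_bear.jpg")); //hypothetical file name
new ImagePreProcessingScaler(0, 1).transform(image); //must match the normalization used during training
INDArray output = network.output(image);
int predictedIdx = Nd4j.argMax(output, 1).getInt(0);
List<String> labels = Arrays.asList("bear", "deer", "duck", "turtle"); //assumed alphabetical training label order
System.out.println("Predicted: " + labels.get(predictedIdx));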

Related

TinyYolo Deeplearning4j

I am using TinyYolo with deeplearning4j, having read through this tutorial http://emaraic.com/blog/yolo-custom-object-detector, but I am not quite sure whether I need more configuration to handle 720p images, as the images in this example are 416x416. Is there a hard requirement on that size? I am just struggling to fully understand some of the configuration, as noted:
private static final int CHANNELS = 3;
private static final int GRID_WIDTH = 13;
private static final int GRID_HEIGHT = 13;
private static final int CLASSES_NUMBER = 1;
private static final int BOXES_NUMBER = 5;
private static final double[][] PRIOR_BOXES = {{1.5, 1.5}, {2, 2}, {3,3}, {3.5, 8}, {4, 9}};//anchors boxes
private static final int BATCH_SIZE = 4;
private static final int EPOCHS = 50;
private static final double LEARNING_RATE = 0.0001;
private static final int SEED = 1234;
private static final double LAMBDA_COORD = 1.0;
private static final double LAMBDA_NO_OBJECT = 0.5;
I have used darkflow with my labels and images before and had some strong success. I wanted to use deeplearning4j to have more integration with a Java project I have, but I have struggled to import the models I created - they still work successfully with some of my Python code, but there seems to be some nuance around exportation.
If someone could shed some light on this: I believe there should be ways to handle 720p images, possibly by resizing. I know darknet and darkflow performed this action themselves. Also, if I do perform resizes, will my label annotation xml files then need changes?
Thanks for any help.
You need to resize your 720p input to TinyYOLO's expected input dimensions, 416x416:
resize(rgbaMat, resizedImage, new Size(tinyyolowidth, tinyyoloheight));
Ref: https://github.com/yptheangel/dl4j-android-demo/blob/master/app/src/main/java/com/yptheangel/dl4jandroid/yolo_objdetection/ObjDetection.java
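If you stay inside the DL4J image pipeline rather than calling OpenCV directly, NativeImageLoader can do the resize for you at load time. A small sketch (my own, with a hypothetical file name):
//NativeImageLoader scales whatever it loads (e.g. a 1280x720 frame) to the requested 416x416x3
NativeImageLoader loader = new NativeImageLoader(416, 416, 3);
INDArray input = loader.asMatrix(new File("frame_720p.jpg")); //hypothetical file name
new ImagePreProcessingScaler(0, 1).transform(input); //the referenced demo scales inputs to [0,1]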

FatalExecutionEngineError on accessing a pointer set with memcpy_s

See update 1 below for my guess as to why the error is happening.
I'm trying to develop an application with some C#/WPF and C++. I am having a problem on the C++ side, in a part of the code that involves optimizing an object using the GNU Scientific Library (GSL) optimization functions. I will avoid including any of the C#/WPF/GSL code in order to keep this question more generic, because the problem is within my C++ code.
For the minimal, complete and verifiable example below, here is what I have. I have a class Foo. And a class Optimizer. An object of class Optimizer is a member of class Foo, so that objects of Foo can optimize themselves when it is required.
The way GSL optimization functions take in external parameters is through a void pointer. I first define a struct Params to hold all the required parameters. Then I define an object of Params and convert it into a void pointer. A copy of this data is made with memcpy_s, and a member void pointer optimParamsPtr of the Optimizer class points to it, so the optimizer can access the parameters when it is run later. When optimParamsPtr is accessed by CostFn(), I get the following error:
Managed Debugging Assistant 'FatalExecutionEngineError' : 'The runtime
has encountered a fatal error. The address of the error was at
0x6f25e01e, on thread 0x431c. The error code is 0xc0000005. This error
may be a bug in the CLR or in the unsafe or non-verifiable portions of
user code. Common sources of this bug include user marshaling errors
for COM-interop or PInvoke, which may corrupt the stack.'
Just to ensure the validity of the void pointer I made, I call CostFn() at line 81 with the void * pointer passed as an argument to InitOptimizer(), and everything works. But at line 85, when the same CostFn() is called with optimParamsPtr pointing to the data copied by memcpy_s, I get the error. So I am guessing something is going wrong in the memcpy_s step. Does anyone have any ideas as to what?
#include "pch.h"
#include <iostream>
using namespace System;
using namespace System::Runtime::InteropServices;
using namespace std;
// An optimizer for various kinds of objects
class Optimizer // GSL requires this to be an unmanaged class
{
public:
double InitOptimizer(int ptrID, void *optimParams, size_t optimParamsSize);
void FreeOptimizer();
void * optimParamsPtr;
private:
double cost = 0;
};
ref class Foo // A class whose objects can be optimized
{
private:
int a; // An internal variable that can be changed to optimize the object
Optimizer *fooOptimizer; // Optimizer for a Foo object
public:
Foo(int val) // Constructor
{
a = val;
fooOptimizer = new Optimizer;
}
~Foo()
{
if (fooOptimizer != NULL)
{
delete fooOptimizer;
}
}
void SetA(int val) // Mutator
{
a = val;
}
int GetA() // Accessor
{
return a;
}
double Optimize(int ptrID); // Optimize object
// ptrID is a variable just to change behavior of Optimize() and show what works and what doesn't
};
ref struct Params // Parameters required by the cost function
{
int cost_scaling;
Foo ^ FooObj;
};
double CostFn(void *params) // GSL requires cost function to be of this type and cannot be a member of a class
{
// Cast void * to Params type
GCHandle h = GCHandle::FromIntPtr(IntPtr(params));
Params ^ paramsArg = safe_cast<Params^>(h.Target);
h.Free(); // Deallocate
// Return the cost
int val = paramsArg->FooObj->GetA();
return (double)(paramsArg->cost_scaling * val);
}
double Optimizer::InitOptimizer(int ptrID, void *optimParamsArg, size_t optimParamsSizeArg)
{
optimParamsPtr = ::operator new(optimParamsSizeArg);
memcpy_s(optimParamsPtr, optimParamsSizeArg, optimParamsArg, optimParamsSizeArg);
double ret_val;
// Here is where the GSL stuff would be. But I replace that with a call to CostFn to show the error
if (ptrID == 1)
{
ret_val = CostFn(optimParamsArg); // Works
}
else
{
ret_val = CostFn(optimParamsPtr); // Doesn't work
}
return ret_val;
}
// Release memory used by unmanaged variables in Optimizer
void Optimizer::FreeOptimizer()
{
if (optimParamsPtr != NULL)
{
delete optimParamsPtr;
}
}
double Foo::Optimize(int ptrID)
{
// Create and initialize params object
Params^ paramsArg = gcnew Params;
paramsArg->cost_scaling = 11;
paramsArg->FooObj = this;
// Convert Params type object to void *
void * paramsArgVPtr = GCHandle::ToIntPtr(GCHandle::Alloc(paramsArg)).ToPointer();
size_t paramsArgSize = sizeof(paramsArg); // size of memory block in bytes pointed to by void pointer
double result = 0;
// Initialize optimizer
result = fooOptimizer->InitOptimizer(ptrID, paramsArgVPtr, paramsArgSize);
// Here is where the loop that does the optimization will be. Removed from this example for simplicity.
return result;
}
int main()
{
Foo Foo1(2);
std::cout << Foo1.Optimize(1) << endl; // Use orig void * arg in line 81 and it works
std::cout << Foo1.Optimize(2) << endl; // Use memcpy_s-ed new void * public member of Optimizer in line 85 and it doesn't work
}
Just to reiterate, I need to copy the params to a member in the optimizer because the optimizer will run throughout the lifetime of the Foo object. So it needs to exist as long as the Optimizer object exists, and not just in the scope of Foo::Optimize().
/clr support needs to be selected in the project properties for the code to compile. Running on an x64 solution platform.
Update 1: While trying to debug this, I got suspicious of the way I get the size of paramsArg at line 109. It looks like paramsArgSize comes out as the size of int cost_scaling plus the size of the memory storing the address of FooObj, instead of the size of the memory storing FooObj itself. I realized this after stumbling across this answer to another post. I confirmed it by checking the value of paramsArgSize after adding some new dummy double members to the Foo class; as expected, the value of paramsArgSize doesn't change. I suppose this explains why I get the error. A solution would be to write code that correctly calculates the size of a Foo class object and use that as paramsArgSize instead of using sizeof, but that is turning out to be too complicated and is probably another question in itself. For example, how do you get the size of a ref class object? Anyway, hopefully someone will find this helpful.

How Flink and Beam SDKs handle windowing - Which is more efficient?

I am comparing the Apache Beam SDK with the Flink SDK for stream processing, in order to establish the cost/advantages of using Beam as an additional framework.
I have a very simple setup where a stream of data is read from a Kafka source and processed in parallel by a cluster of nodes running Flink.
From my understanding of how these SDKs work, the simplest way to process a stream of data window by window is:
Using Apache Beam (running on Flink):
1.1. Create a Pipeline object.
1.2. Create a PCollection of Kafka records.
1.3. Apply windowing function.
1.4. Transform pipeline to key by window.
1.5. Group records by key (window).
1.6. Apply whatever function is needed to the windowed records.
Using the Flink SDK
2.1. Create a Data Stream from a Kafka source.
2.2. Transform it into a Keyed Stream by providing a key function.
2.3. Apply windowing function.
2.4. Apply whatever function is needed to the windowed records.
While the Flink solution appears programmatically more succinct, in my experience it is less efficient at high volumes of data. I can only imagine that the overhead is introduced by the key extraction function, since this step is not required by Beam.
My question is: am I comparing like for like? Are these processes not equivalent? What could explain the Beam way being more efficient, since it uses Flink as a runner (and all the other conditions are the same)?
This is the code using the Beam SDK
PipelineOptions options = PipelineOptionsFactory.create();
//Run with Flink
FlinkPipelineOptions flinkPipelineOptions = options.as(FlinkPipelineOptions.class);
flinkPipelineOptions.setRunner(FlinkRunner.class);
flinkPipelineOptions.setStreaming(true);
flinkPipelineOptions.setParallelism(-1); //Pick this up from the user interface at runtime
// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(flinkPipelineOptions);
// Create a PCollection of Kafka records
PCollection<KafkaRecord<byte[], byte[]>> kafkaCollection = p.apply(KafkaIO.<Long, String>readBytes()
.withBootstrapServers(KAFKA_IP + ":" + KAFKA_PORT)
.withTopics(ImmutableList.of(REAL_ENERGY_TOPIC, IT_ENERGY_TOPIC))
.updateConsumerProperties(ImmutableMap.of("group.id", CONSUMER_GROUP)));
//Apply Windowing Function
PCollection<KafkaRecord<byte[], byte[]>> windowedKafkaCollection = kafkaCollection.apply(Window.into(SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1))));
//Transform the pipeline to key by window
PCollection<KV<IntervalWindow, KafkaRecord<byte[], byte[]>>> keyedByWindow =
windowedKafkaCollection.apply(
ParDo.of(
new DoFn<KafkaRecord<byte[], byte[]>, KV<IntervalWindow, KafkaRecord<byte[], byte[]>>>() {
@ProcessElement
public void processElement(ProcessContext context, IntervalWindow window) {
context.output(KV.of(window, context.element()));
}
}));
//Group records by key (window)
PCollection<KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>>> groupedByWindow = keyedByWindow
.apply(GroupByKey.<IntervalWindow, KafkaRecord<byte[], byte[]>>create());
//Process windowed data
PCollection<KV<IIntervalWindowResult, IPueResult>> processed = groupedByWindow
.apply("filterAndProcess", ParDo.of(new PueCalculatorFn()));
// Run the pipeline.
p.run().waitUntilFinish();
And this is the code using the Flink SDK
// Create a Streaming Execution Environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.setParallelism(6);
//Connect to Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", KAFKA_IP + ":" + KAFKA_PORT);
properties.setProperty("group.id", CONSUMER_GROUP);
DataStream<ObjectNode> stream = env
.addSource(new FlinkKafkaConsumer010<>(Arrays.asList(REAL_ENERGY_TOPIC, IT_ENERGY_TOPIC), new JSONDeserializationSchema(), properties));
//Key by id
stream.keyBy((KeySelector<ObjectNode, Integer>) jsonNode -> jsonNode.get("id").asInt())
//Set the windowing function.
.timeWindow(Time.seconds(5L), Time.seconds(1L))
//Process Windowed Data
.process(new PueCalculatorFn(), TypeInformation.of(ImmutablePair.class));
// execute program
env.execute("Using Flink SDK");
Many thanks in advance for any insight.
Edit
I thought I should add some indicators that may be relevant.
Network Received Bytes

taskmanager      Flink SDK        Beam
taskmanager.1    2,827,676,598    4,766,399,314
taskmanager.2    2,644,786,446    4,092,154,160
taskmanager.3    2,645,765,232    4,435,132,862
taskmanager.4    2,428,570,491    4,096,576,110
taskmanager.5    2,431,368,644    4,092,849,114
taskmanager.6    2,422,309,148    4,425,190,393

CPU Utilisation (Max)

taskmanager      Flink SDK    Beam
taskmanager.1    91.00%       72.0%
taskmanager.2    93.00%       52.0%
taskmanager.3    92.00%       71.0%
taskmanager.4    90.00%       56.0%
taskmanager.5    92.00%       26.0%
taskmanager.6    90.00%       40.0%
Beam seems to use a lot more networking, whereas Flink uses significantly more CPU. Could this suggest that Beam is parallelising the processing in a more efficient way?
Edit No2
I am pretty sure that the PueCalculatorFn classes are equivalent, but I will share the code here to see if any obvious discrepancies between the two processes become apparent.
Beam
public class PueCalculatorFn extends DoFn<KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>>, KV<IIntervalWindowResult, IPueResult>> implements Serializable {
private transient List<IKafkaConsumption> realEnergyRecords;
private transient List<IKafkaConsumption> itEnergyRecords;
@ProcessElement
public void processElement(DoFn<KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>>, KV<IIntervalWindowResult, IPueResult>>.ProcessContext c, BoundedWindow w) {
KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>> element = c.element();
Instant windowStart = Instant.ofEpochMilli(element.getKey().start().getMillis());
Instant windowEnd = Instant.ofEpochMilli(element.getKey().end().getMillis());
Iterable<KafkaRecord<byte[], byte[]>> records = element.getValue();
//Calculate Pue
IPueResult result = calculatePue(element.getKey(), records);
//Create IntervalWindowResult object to return
DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME.withZone(ZoneId.of("UTC"));
IIntervalWindowResult intervalWindowResult = new IntervalWindowResult(formatter.format(windowStart),
formatter.format(windowEnd), realEnergyRecords, itEnergyRecords);
//Return Pue keyed by Window
c.output(KV.of(intervalWindowResult, result));
}
private PueResult calculatePue(IntervalWindow window, Iterable<KafkaRecord<byte[], byte[]>> records) {
//Define accumulators to gather readings
final DoubleAccumulator totalRealIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
final DoubleAccumulator totalItIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
//Declare variable to store the result
BigDecimal pue = BigDecimal.ZERO;
//Initialise transient lists
realEnergyRecords = new ArrayList<>();
itEnergyRecords = new ArrayList<>();
//Transform the results into a stream
Stream<KafkaRecord<byte[], byte[]>> streamOfRecords = StreamSupport.stream(records.spliterator(), false);
//Iterate through each reading and add to the increment count
streamOfRecords
.map(record -> {
byte[] valueBytes = record.getKV().getValue();
assert valueBytes != null;
String valueString = new String(valueBytes);
assert !valueString.isEmpty();
return KV.of(record, valueString);
}).map(kv -> {
Gson gson = new GsonBuilder().registerTypeAdapter(KafkaConsumption.class, new KafkaConsumptionDeserialiser()).create();
KafkaConsumption consumption = gson.fromJson(kv.getValue(), KafkaConsumption.class);
return KV.of(kv.getKey(), consumption);
}).forEach(consumptionRecord -> {
switch (consumptionRecord.getKey().getTopic()) {
case REAL_ENERGY_TOPIC:
totalRealIncrement.accumulate(consumptionRecord.getValue().getEnergyConsumed());
realEnergyRecords.add(consumptionRecord.getValue());
break;
case IT_ENERGY_TOPIC:
totalItIncrement.accumulate(consumptionRecord.getValue().getEnergyConsumed());
itEnergyRecords.add(consumptionRecord.getValue());
break;
}
}
);
assert totalRealIncrement.doubleValue() > 0.0;
assert totalItIncrement.doubleValue() > 0.0;
//Beware of division by zero
if (totalItIncrement.doubleValue() != 0.0) {
//Calculate PUE
pue = BigDecimal.valueOf(totalRealIncrement.getThenReset()).divide(BigDecimal.valueOf(totalItIncrement.getThenReset()), 9, BigDecimal.ROUND_HALF_UP);
}
//Create a PueResult object to return
IWindow intervalWindow = new Window(window.start().getMillis(), window.end().getMillis());
return new PueResult(intervalWindow, pue.stripTrailingZeros());
}
@Override
protected void finalize() throws Throwable {
super.finalize();
RecordSenderFactory.closeSender();
WindowSenderFactory.closeSender();
}
}
Flink
public class PueCalculatorFn extends ProcessWindowFunction<ObjectNode, ImmutablePair, Integer, TimeWindow> {
private transient List<KafkaConsumption> realEnergyRecords;
private transient List<KafkaConsumption> itEnergyRecords;
@Override
public void process(Integer integer, Context context, Iterable<ObjectNode> iterable, Collector<ImmutablePair> collector) throws Exception {
Instant windowStart = Instant.ofEpochMilli(context.window().getStart());
Instant windowEnd = Instant.ofEpochMilli(context.window().getEnd());
BigDecimal pue = calculatePue(iterable);
//Create IntervalWindowResult object to return
DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME.withZone(ZoneId.of("UTC"));
IIntervalWindowResult intervalWindowResult = new IntervalWindowResult(formatter.format(windowStart),
formatter.format(windowEnd), realEnergyRecords
.stream()
.map(e -> (IKafkaConsumption) e)
.collect(Collectors.toList()), itEnergyRecords
.stream()
.map(e -> (IKafkaConsumption) e)
.collect(Collectors.toList()));
//Create PueResult object to return
IPueResult pueResult = new PueResult(new Window(windowStart.toEpochMilli(), windowEnd.toEpochMilli()), pue.stripTrailingZeros());
//Collect result
collector.collect(new ImmutablePair<>(intervalWindowResult, pueResult));
}
protected BigDecimal calculatePue(Iterable<ObjectNode> iterable) {
//Define accumulators to gather readings
final DoubleAccumulator totalRealIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
final DoubleAccumulator totalItIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
//Declare variable to store the result
BigDecimal pue = BigDecimal.ZERO;
//Initialise transient lists
realEnergyRecords = new ArrayList<>();
itEnergyRecords = new ArrayList<>();
//Iterate through each reading and add to the increment count
StreamSupport.stream(iterable.spliterator(), false)
.forEach(object -> {
switch (object.get("topic").textValue()) {
case REAL_ENERGY_TOPIC:
totalRealIncrement.accumulate(object.get("energyConsumed").asDouble());
realEnergyRecords.add(KafkaConsumptionDeserialiser.deserialize(object));
break;
case IT_ENERGY_TOPIC:
totalItIncrement.accumulate(object.get("energyConsumed").asDouble());
itEnergyRecords.add(KafkaConsumptionDeserialiser.deserialize(object));
break;
}
});
assert totalRealIncrement.doubleValue() > 0.0;
assert totalItIncrement.doubleValue() > 0.0;
//Beware of division by zero
if (totalItIncrement.doubleValue() != 0.0) {
//Calculate PUE
pue = BigDecimal.valueOf(totalRealIncrement.getThenReset()).divide(BigDecimal.valueOf(totalItIncrement.getThenReset()), 9, BigDecimal.ROUND_HALF_UP);
}
return pue;
}
}
And here is my custom deserialiser used in the Beam example.
KafkaConsumptionDeserialiser
public class KafkaConsumptionDeserialiser implements JsonDeserializer<KafkaConsumption> {
public KafkaConsumption deserialize(JsonElement jsonElement, Type type, JsonDeserializationContext jsonDeserializationContext) throws JsonParseException {
if(jsonElement == null) {
return null;
} else {
JsonObject jsonObject = jsonElement.getAsJsonObject();
JsonElement id = jsonObject.get("id");
JsonElement energyConsumed = jsonObject.get("energyConsumed");
Gson gson = (new GsonBuilder()).registerTypeAdapter(Duration.class, new DurationDeserialiser()).registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeDeserialiser()).create();
Duration duration = (Duration)gson.fromJson(jsonObject.get("duration"), Duration.class);
JsonElement topic = jsonObject.get("topic");
Instant eventTime = (Instant)gson.fromJson(jsonObject.get("eventTime"), Instant.class);
return new KafkaConsumption(Integer.valueOf(id != null?id.getAsInt():0), Double.valueOf(energyConsumed != null?energyConsumed.getAsDouble():0.0D), duration, topic != null?topic.getAsString():"", eventTime);
}
}
}
Not sure why the Beam pipeline you wrote is faster, but semantically it is not the same as the Flink job. Similar to how windowing works in Flink, once you assign windows in Beam, all following operations automatically take the windowing into account. You don't need to group by window.
Your Beam pipeline definition can be simplified as follows:
// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(flinkPipelineOptions);
// Create a PCollection of Kafka records
PCollection<KafkaRecord<byte[], byte[]>> kafkaCollection = ...
//Apply Windowing Function
PCollection<KafkaRecord<byte[], byte[]>> windowedKafkaCollection = kafkaCollection.apply(
Window.into(SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1))));
//Process windowed data
PCollection<KV<IIntervalWindowResult, IPueResult>> processed = windowedKafkaCollection
.apply("filterAndProcess", ParDo.of(new PueCalculatorFn()));
// Run the pipeline.
p.run().waitUntilFinish();
As for the performance, it depends on many factors but keep in mind that Beam is an abstraction layer on top of Flink. Generally speaking, I would be surprised if you saw increased performance with Beam on Flink.
Edit: Just to clarify further, you don't group on the JSON "id" field in the Beam pipeline, which you do in the Flink snippet.
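For illustration, a sketch (mine, not from the original answer) of keying the windowed Beam collection by the same JSON "id" field the Flink job uses, which would make the two pipelines semantically closer:
//Parse the payload with Jackson (the Flink snippet gets ObjectNode via JSONDeserializationSchema)
PCollection<KV<Integer, KafkaRecord<byte[], byte[]>>> keyedById = windowedKafkaCollection.apply(
    ParDo.of(new DoFn<KafkaRecord<byte[], byte[]>, KV<Integer, KafkaRecord<byte[], byte[]>>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws IOException {
            int id = new ObjectMapper().readTree(c.element().getKV().getValue()).get("id").asInt();
            c.output(KV.of(id, c.element()));
        }
    }));
//GroupByKey now groups per id *and* per window, mirroring Flink's keyBy + timeWindow
PCollection<KV<Integer, Iterable<KafkaRecord<byte[], byte[]>>>> groupedById =
    keyedById.apply(GroupByKey.<Integer, KafkaRecord<byte[], byte[]>>create());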
For what it's worth, if the window processing can be pre-aggregated via reduce() or aggregate(), then the native Flink job should perform better than it currently does.
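For illustration, a rough sketch (my own simplification, which computes only the PUE ratio and drops the per-record lists) of what pre-aggregation with aggregate() could look like; the window state shrinks to two running sums instead of a buffer of every ObjectNode:
stream.keyBy((KeySelector<ObjectNode, Integer>) jsonNode -> jsonNode.get("id").asInt())
    .timeWindow(Time.seconds(5L), Time.seconds(1L))
    .aggregate(new AggregateFunction<ObjectNode, double[], Double>() {
        @Override
        public double[] createAccumulator() {
            return new double[2]; //{realEnergy, itEnergy}
        }
        @Override
        public double[] add(ObjectNode value, double[] acc) {
            double energy = value.get("energyConsumed").asDouble();
            if (REAL_ENERGY_TOPIC.equals(value.get("topic").textValue())) {
                acc[0] += energy;
            } else {
                acc[1] += energy;
            }
            return acc;
        }
        @Override
        public Double getResult(double[] acc) {
            return acc[1] == 0.0 ? 0.0 : acc[0] / acc[1]; //PUE = real / IT, guarding division by zero
        }
        @Override
        public double[] merge(double[] a, double[] b) {
            a[0] += b[0];
            a[1] += b[1];
            return a;
        }
    });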
Many details, such as choice of state backend, serialization, checkpointing, etc. can also have a big impact on performance.
Is the same Flink being used in both cases -- i.e., same version, same configuration?

How to count the number of rows in a file in Dataflow in an efficient way?

I want to count the total number of rows in a file.
Please explain your code if possible.
String fileAbsolutePath = "gs://sourav_bucket_dataflow/" + fileName;
PCollection<String> data = p.apply("Reading Data From File", TextIO.read().from(fileAbsolutePath));
PCollection<Long> count = data.apply(Count.<String>globally());
Now I want to get the value.
There are a variety of sinks that you can use to get data out of your pipeline. https://beam.apache.org/documentation/io/built-in/ has a list of the current built-in IO transforms.
It sort of depends on what you want to do with that number. Assuming you want to use it in your future transformations, you may want to convert it to a PCollectionView object and pass it as a side input to other transformations.
PCollection<String> data = p.apply("Reading Data From File", TextIO.read().from(fileAbsolutePath));
PCollection<Long> count = data.apply(Count.<String>globally());
final PCollectionView<Long> view = count.apply(View.asSingleton());
A quick example to show you how to use the value as a side input:
data.apply(ParDo.of(new FuncFn(view)).withSideInputs(view));
Where:
class FuncFn extends DoFn<String,String>
{
    private final PCollectionView<Long> mySideInput;

    public FuncFn(PCollectionView<Long> mySideInput) {
        this.mySideInput = mySideInput;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException
    {
        Long count = c.sideInput(mySideInput);
        //other stuff you may want to do
    }
}
Hope that helps!
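If instead you just want the number out of the pipeline, any of the built-in sinks mentioned above will do. Here is a sketch (the output path is hypothetical) that writes the count to a text file:
//Format the singleton count as text and write it out (path is hypothetical)
count.apply(MapElements.via(new SimpleFunction<Long, String>() {
    public String apply(Long total) {
        return String.valueOf(total);
    }
})).apply(TextIO.write().to("gs://sourav_bucket_dataflow/row_count").withoutSharding());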
where "input" in line 1 is the input. This will work.
PCollection<Long> number = input.apply(Count.globally());
number.apply(MapElements.via(new SimpleFunction<Long, Long>()
{
    public Long apply(Long total)
    {
        System.out.println("Length is: " + total);
        return total;
    }
}));

Can I use setWorkerCacheMb in Apache Beam 2.0+?

My Dataflow job (using Java SDK 2.1.0) is quite slow and is going to take more than a day to process just 50GB. I just pull a whole table from BigQuery (50GB) and join it with one csv file from GCS (100+MB).
https://cloud.google.com/dataflow/model/group-by-key
I use sideInputs to perform the join (the latter approach in the documentation above), although I think using CoGroupByKey would be more efficient. However, I'm not sure that is the only reason my job is super slow.
I googled, and it looks like, by default, the side input cache is set to 100MB. I assume mine is slightly over that limit, so each worker continuously re-reads the sideInputs. To improve this, I thought I could use the setWorkerCacheMb method to increase the cache size.
However, it looks like DataflowPipelineOptions does not have this method and DataflowWorkerHarnessOptions is hidden. Just passing --workerCacheMb=200 in -Dexec.args results in:
An exception occured while executing the Java class.
null: InvocationTargetException:
Class interface com.xxx.yyy.zzz$MyOptions missing a property
named 'workerCacheMb'. -> [Help 1]
How can I use this option? Thanks.
My pipeline:
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> rows = p.apply("Read from BigQuery",
BigQueryIO.read().from("project:MYDATA.events"));
// Read account file
PCollection<String> accounts = p.apply("Read from account file",
TextIO.read().from("gs://my-bucket/accounts.csv")
.withCompressionType(CompressionType.GZIP));
PCollection<TableRow> accountRows = accounts.apply("Convert to TableRow",
ParDo.of(new DoFn<String, TableRow>() {
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
String line = c.element();
CSVParser csvParser = new CSVParser();
String[] fields = csvParser.parseLine(line);
TableRow row = new TableRow();
row = row.set("account_id", fields[0]).set("account_uid", fields[1]);
c.output(row);
}
}));
PCollection<KV<String, TableRow>> kvAccounts = accountRows.apply("Populate account_uid:accounts KV",
ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = c.element();
String uid = (String) row.get("account_uid");
c.output(KV.of(uid, row));
}
}));
final PCollectionView<Map<String, TableRow>> uidAccountView = kvAccounts.apply(View.<String, TableRow>asMap());
// Add account_id from account_uid to event data
PCollection<TableRow> rowsWithAccountID = rows.apply("Join account_id",
ParDo.of(new DoFn<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = c.element();
if (row.containsKey("account_uid") && row.get("account_uid") != null) {
String uid = (String) row.get("account_uid");
TableRow accRow = (TableRow) c.sideInput(uidAccountView).get(uid);
if (accRow == null) {
LOG.warn("accRow null, {}", row.toPrettyString());
} else {
row = row.set("account_id", accRow.get("account_id"));
}
}
c.output(row);
}
}).withSideInputs(uidAccountView));
// Insert into BigQuery
WriteResult result = rowsWithAccountID.apply(BigQueryIO.writeTableRows()
.to(new TableRefPartition(StaticValueProvider.of("MYDATA"), StaticValueProvider.of("dev"),
StaticValueProvider.of("deadletter_bucket")))
.withFormatFunction(new SerializableFunction<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
@Override
public TableRow apply(TableRow row) {
return row;
}
}).withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
p.run();
Historically my system has had two identifiers for users, a new one (account_id) and an old one (account_uid). Now I need to add the new account_id to our event data stored in BigQuery retroactively, because the old data only has the old account_uid. The accounts table (which holds the relation between account_uid and account_id) has already been converted to csv and stored in GCS.
The last func, TableRefPartition, just stores data into BQ's corresponding partition depending on each event's timestamp. The job is still running (2017-10-30_22_45_59-18169851018279768913) and the bottleneck looks like the Join account_id part.
The throughput of that part (xxx elements/s) goes up and down according to the graph. Also according to the graph, the estimated size of the sideInputs is 106MB.
If switching to CoGroupByKey improves performance dramatically, I will do so. I was just lazy, and thought using sideInputs would make it easier to handle event data which doesn't have account info as well.
Try one of:
1) setting the option using some code:
options.as(DataflowWorkerHarnessOptions.class).setWorkerCacheMb(500);
2) having your application register DataflowWorkerHarnessOptions with the PipelineOptionsFactory
3) having your own options class extend DataflowWorkerHarnessOptions
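For example, option 2 might look like this (a sketch; the register call must happen before the arguments are parsed):
//Register the hidden options interface so --workerCacheMb is recognized during parsing
PipelineOptionsFactory.register(DataflowWorkerHarnessOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
//Or set it programmatically, as in option 1
options.as(DataflowWorkerHarnessOptions.class).setWorkerCacheMb(500);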
There are a few things you can do to improve the performance of your code:
Your side input is a Map<String, TableRow>, but you're using only a single field in the TableRow - accRow.get("account_id"). How about making it a Map<String, String> instead, having the value be the account_id itself? That'll likely be quite a bit more efficient than the bulky TableRow object.
You could extract the value of the side input into a lazily initialized member variable in your DoFn, to avoid repeated invocations of .sideInput().
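Putting both suggestions together, a sketch (the variable names are mine) might look like:
//Build a slimmer side input: account_uid -> account_id as plain strings
PCollection<KV<String, String>> kvUidToId = accountRows.apply("Map uid to id",
    ParDo.of(new DoFn<TableRow, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            TableRow row = c.element();
            c.output(KV.of((String) row.get("account_uid"), (String) row.get("account_id")));
        }
    }));
final PCollectionView<Map<String, String>> uidToIdView = kvUidToId.apply(View.<String, String>asMap());
PCollection<TableRow> rowsWithAccountID = rows.apply("Join account_id",
    ParDo.of(new DoFn<TableRow, TableRow>() {
        //Lazily cached copy of the side input, fetched once per DoFn instance instead of per element
        private transient Map<String, String> uidToId;
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            if (uidToId == null) {
                uidToId = c.sideInput(uidToIdView);
            }
            TableRow row = c.element();
            String uid = (String) row.get("account_uid");
            if (uid != null && uidToId.containsKey(uid)) {
                row = row.set("account_id", uidToId.get(uid));
            }
            c.output(row);
        }
    }).withSideInputs(uidToIdView));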
That said, this performance is unexpected and we are investigating whether there's something else going on.
