I am trying to read a simple Parquet file into my Google Dataflow pipeline using the following code:
Read.Bounded<KV<Void, GenericData>> results = HadoopFileSource.readFrom("/home/avi/tmp/db_demo/simple.parquet", AvroParquetInputFormat.class, Void.class, GenericData.class);
This always triggers the following exception when running the pipeline:
IllegalStateException: Cannot find coder for class org.apache.avro.generic.GenericData
It seems this method inside HadoopFileSource can't resolve a coder for this type of class:
private <T> Coder<T> getDefaultCoder(Class<T> c) {
  if (Writable.class.isAssignableFrom(c)) {
    Class<? extends Writable> writableClass = (Class<? extends Writable>) c;
    return (Coder<T>) WritableCoder.of(writableClass);
  } else if (Void.class.equals(c)) {
    return (Coder<T>) VoidCoder.of();
  }
  // TODO: how to use registered coders here?
  throw new IllegalStateException("Cannot find coder for " + c);
}
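As an aside, the TODO marks exactly the gap: the method never consults a coder registry. Against the Apache Beam 2.x API (an assumption on my part; the contrib HadoopFileSource predates it), a fallback could look roughly like:
private <T> Coder<T> getDefaultCoder(Class<T> c) {
  try {
    // Ask the default registry for any coder registered for this type.
    return CoderRegistry.createDefault().getCoder(TypeDescriptor.of(c));
  } catch (CannotProvideCoderException e) {
    throw new IllegalStateException("Cannot find coder for " + c, e);
  }
}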
Any help will be appreciated.
Avi
This is a problem with the design of HadoopFileSource. I would suggest moving to Apache Beam (or Scio), which is the Apache "version" (and the future) of the Dataflow SDK. Once you are on Beam, you can do the following.
This example is in Scala (but you can easily translate it to Java):
HDFSFileSource.from(
  input,
  classOf[AvroParquetInputFormat[AvroSchemaClass]],
  AvroCoder.of(classOf[AvroSchemaClass]),
  new SerializableFunction[KV[Void, AvroSchemaClass], AvroSchemaClass]() {
    override def apply(e: KV[Void, AvroSchemaClass]): AvroSchemaClass =
      CoderUtils.clone(AvroCoder.of(classOf[AvroSchemaClass]), e.getValue)
  }
)
This uses an alternative version of from that accepts a coder.
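For reference, a rough Java translation of the same call might look like the sketch below (AvroSchemaClass stands in for your generated Avro class, and the HDFSFileSource.from overload is assumed to match the Scala signature above):
PCollection<AvroSchemaClass> records = pipeline.apply(
    Read.from(
        HDFSFileSource.from(
            input,
            AvroParquetInputFormat.class,
            AvroCoder.of(AvroSchemaClass.class),
            new SerializableFunction<KV<Void, AvroSchemaClass>, AvroSchemaClass>() {
              @Override
              public AvroSchemaClass apply(KV<Void, AvroSchemaClass> e) {
                // Clone the record out of the Hadoop KV so it is safe to reuse.
                return CoderUtils.clone(AvroCoder.of(AvroSchemaClass.class), e.getValue());
              }
            })));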
This question is a follow-on to the great answer to "Is there a way to upload jars for a dataflow job so we don't have to serialize everything?"
That answer made me realize: OK, what I want is injection with no serialization so that I can mock and test.
Our current method requires our apis/mocks to be serializable, BUT THEN I have to put static fields in the mock because it gets serialized and deserialized, creating a new instance that Dataflow uses.
My colleague pointed out that perhaps this needs to be a sink and that is treated differently? <- We may try that later and update, but we are not sure right now.
My desire is to replace the apis with mocks from the top during testing. Does someone have an example of this?
Here is our bootstrap code, which does not know whether it is running in production or inside a feature test. We test end-to-end results with no Apache Beam imports in our tests, meaning we can swap to any tech if we want to pivot and keep all our tests. Not only that, we catch far more integration bugs and can refactor without rewriting tests, since the contracts we test are customer ones we can't easily change.
public class App {
  private Pipeline pipeline;
  private RosterFileTransform transform;

  @Inject
  public App(Pipeline pipeline, RosterFileTransform transform) {
    this.pipeline = pipeline;
    this.transform = transform;
  }

  public void start() {
    pipeline.apply(transform);
    pipeline.run();
  }
}
Notice that everything we do is Guice-injection based, so the Pipeline may be a direct runner or not. I may need to modify this class to pass things through :( but anything that works for now would be great.
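To make that concrete, a hypothetical Guice module for this bootstrap could look like the sketch below (PipelineModule and its wiring are illustrative, not from our codebase; a feature test would override the bindings with mocks):
public class PipelineModule extends AbstractModule {
  @Override
  protected void configure() {
    // RosterFileTransform is assumed to have an @Inject-annotated constructor.
  }

  @Provides
  Pipeline providePipeline() {
    PipelineOptions options = PipelineOptionsFactory.create();
    // This defaults to the DirectRunner; production wiring would set
    // DataflowRunner, project, region, etc. on the options instead.
    return Pipeline.create(options);
  }

  public static void main(String[] args) {
    // Tests can use Modules.override(new PipelineModule()).with(mockModule).
    Guice.createInjector(new PipelineModule()).getInstance(App.class).start();
  }
}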
The function into which I am trying to get our api (and its mock and impl) with no serialization is this:
private class ValidRecordPublisher extends DoFn<Validated<PractitionerDataRecord>, String> {
  @ProcessElement
  public void processElement(@Element Validated<PractitionerDataRecord> element) {
    microServiceApi.writeRecord(element.getValue());
  }
}
I am not sure how to pass in microServiceApi in a way that avoids serialization. I would also be OK with delayed creation after deserialization, using a Guice Provider<T> field and provider.get(), if there is a solution there.
Solved, in such a way that mocks no longer need statics or serialization anymore, by one single class bridging the world of Dataflow (in prod and in test) like so:
NOTE: There is additional magic-ness we have at our company that passes headers through from service to service and through Dataflow; some of that is in there and you can ignore it (i.e. the RouterRequest request = Current.request();). Anyone else will have to pass projectId into getInstance each time.
public abstract class DataflowClientFactory implements Serializable {
  private static final Logger log = LoggerFactory.getLogger(DataflowClientFactory.class);

  public static final String PROJECT_KEY = "projectKey";

  private transient static Injector injector;
  private transient static Module overrides;
  private static int counter = 0;

  public DataflowClientFactory() {
    counter++;
    log.info("creating again(usually due to deserialization). counter=" + counter);
  }

  public static void injectOverrides(Module dfOverrides) {
    overrides = dfOverrides;
  }

  private synchronized void initialize(String project) {
    if (injector != null)
      return;

    /********************************************
     * The hardest part is this piece since this is specific to each Dataflow
     * so each project subclasses DataflowClientFactory.
     * This solution is the best ONLY in the fact of time crunch and it works
     * decently for end to end testing without developers needing fancy
     * wrappers around mocks anymore.
     ***/
    Module module = loadProjectModule();
    Module modules = Modules.combine(module, new OrderlyDataflowModule(project));

    if (overrides != null) {
      modules = Modules.override(modules).with(overrides);
    }

    injector = Guice.createInjector(modules);
  }

  protected abstract Module loadProjectModule();

  public <T> T getInstance(Class<T> clazz) {
    if (!Current.isContextSet()) {
      throw new IllegalStateException("Someone on the stack is extending DoFn instead of OrderlyDoFn so you need to fix that first");
    }
    RouterRequest request = Current.request();
    String project = (String) request.requestState.get(PROJECT_KEY);
    initialize(project);
    return injector.getInstance(clazz);
  }
}
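For context, usage could look roughly like the sketch below (MyProjectClientFactory, MyProjectModule, and MicroServiceApi are hypothetical names). The factory serializes trivially because the Injector lives in static transient fields and is rebuilt lazily on the worker:
// Hypothetical subclass: each Dataflow project supplies its own Guice module.
public class MyProjectClientFactory extends DataflowClientFactory {
  @Override
  protected Module loadProjectModule() {
    return new MyProjectModule();
  }
}

// Inside an OrderlyDoFn, resolve dependencies lazily at execution time:
// microServiceApi = clientFactory.getInstance(MicroServiceApi.class);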
I suppose this may not be what you're looking for, but your use case makes me think of using factory objects. They may depend on the pipeline options that you pass (i.e. your PipelineOptions object), or on some other configuration object.
Perhaps something like this:
class MicroserviceApiClientFactory implements Serializable {
  private final PipelineOptions options;

  MicroserviceApiClientFactory(PipelineOptions options) {
    this.options = options;
  }

  public MicroserviceApiClient getClient() {
    MySpecialOptions specialOpts = options.as(MySpecialOptions.class);
    if (specialOpts.getMockMicroserviceApi()) {
      return new MockedMicroserviceApiClient(...); // Or whatever
    } else {
      return new MicroserviceApiClient(specialOpts.getMicroserviceEndpoint()); // Or whatever parameters it needs
    }
  }
}
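For completeness, the MySpecialOptions interface assumed above could look roughly like this minimal sketch (the getter names mirror the calls in the factory; everything here is illustrative):
public interface MySpecialOptions extends PipelineOptions {
  @Description("If true, DoFns get a mocked microservice API client")
  @Default.Boolean(false)
  boolean getMockMicroserviceApi();
  void setMockMicroserviceApi(boolean value);

  @Description("Endpoint of the real microservice API")
  String getMicroserviceEndpoint();
  void setMicroserviceEndpoint(String value);
}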
And for your DoFns and any other execution-time objects that need it, you would pass the factory:
private class ValidRecordPublisher extends DoFn<Validated<PractitionerDataRecord>, String> {
  private final MicroserviceApiClientFactory msFactory;
  private transient MicroserviceApiClient microServiceApi;

  ValidRecordPublisher(MicroserviceApiClientFactory msFactory) {
    this.msFactory = msFactory;
  }

  @ProcessElement
  public void processElement(@Element Validated<PractitionerDataRecord> element) {
    if (microServiceApi == null) microServiceApi = msFactory.getClient();
    microServiceApi.writeRecord(element.getValue());
  }
}
This should allow you to encapsulate the mocking functionality into a single class that lazily creates your mock or your client at pipeline execution time.
Let me know if this matches what you want somewhat, or if we should try to iterate further.
I have no experience with Guice, so I don't know if Guice configurations can easily pass the boundary between pipeline construction and pipeline execution (serialization / submitting JARs / etc.).
Should this be a sink? Maybe: if you have an external service and you're writing to it, you can write a PTransform that takes care of it - but the question of how you inject various dependencies will remain.
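A minimal sketch of what such a PTransform wrapper could look like, reusing the factory idea above (all names are hypothetical):
class WriteToMicroservice
    extends PTransform<PCollection<Validated<PractitionerDataRecord>>, PDone> {
  private final MicroserviceApiClientFactory msFactory;

  WriteToMicroservice(MicroserviceApiClientFactory msFactory) {
    this.msFactory = msFactory;
  }

  @Override
  public PDone expand(PCollection<Validated<PractitionerDataRecord>> input) {
    // The DoFn resolves its client lazily at execution time, as shown above.
    input.apply(ParDo.of(new ValidRecordPublisher(msFactory)));
    return PDone.in(input.getPipeline());
  }
}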
I'm trying to aggregate (per key) a streaming data source in Apache Beam (via Scio) using a stateful DoFn (using @ProcessElement with @StateId ValueState elements). I thought this would be the most appropriate approach for the problem I'm trying to solve. The requirements are:
for a given key, records are aggregated (essentially summed) across all time - I don't care about previously computed aggregates, just the most recent
keys may be evicted from the state (state.clear()) based on certain conditions that I control
every 5 minutes, regardless of whether any new keys were seen, all keys that haven't been evicted from the state should be output
Given that this is a streaming pipeline and will be running indefinitely, using a combinePerKey over a global window with accumulating fired panes seems like it will keep increasing its memory footprint and the amount of data it needs to process over time, so I'd like to avoid it. Additionally, when testing this out, (maybe as expected) it simply appends the newly computed aggregates to the output along with the historical input, rather than using the latest value for each key.
My thought was that using a stateful DoFn would simply allow me to output all of the global state up until now(), but it seems this isn't a trivial solution. I've seen hints at using timers to artificially execute callbacks for this, as well as potentially using a slowly growing side input map (How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>) and somehow flushing this, but that would essentially require iterating over all values in the map rather than joining on it.
I feel like I might be overlooking something simple to get this working. I'm relatively new to many concepts of windowing and timers in Beam, looking for any advice on how to solve this. Thanks!
You are right that a stateful DoFn should help you here. This is a basic sketch of what you can do. Note that this only outputs the sum without the key. It may not be exactly what you want, but it should help you move forward.
class CombiningEmittingFn extends DoFn<KV<Integer, Integer>, Integer> {

  @TimerId("emitter")
  private final TimerSpec emitterSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @StateId("done")
  private final StateSpec<ValueState<Boolean>> doneState = StateSpecs.value();

  @StateId("agg")
  private final StateSpec<CombiningState<Integer, int[], Integer>> aggSpec =
      StateSpecs.combining(
          Sum.ofIntegers().getAccumulatorCoder(null, VarIntCoder.of()), Sum.ofIntegers());

  @ProcessElement
  public void processElement(ProcessContext c,
      @StateId("agg") CombiningState<Integer, int[], Integer> aggState,
      @StateId("done") ValueState<Boolean> doneState,
      @TimerId("emitter") Timer emitterTimer) throws Exception {
    if (SOME CONDITION) {
      // Evict this key: drop the aggregate and mark it done.
      aggState.clear();
      doneState.write(true);
    } else {
      aggState.add(c.element().getValue());
      emitterTimer.align(Duration.standardMinutes(5)).setRelative();
    }
  }

  @OnTimer("emitter")
  public void onEmit(
      OnTimerContext context,
      @StateId("agg") CombiningState<Integer, int[], Integer> aggState,
      @StateId("done") ValueState<Boolean> doneState,
      @TimerId("emitter") Timer emitterTimer) {
    Boolean isDone = doneState.read();
    if (isDone != null && isDone) {
      return;
    } else {
      context.output(aggState.read());
      // Set the timer to emit again
      emitterTimer.align(Duration.standardMinutes(5)).setRelative();
    }
  }
}
Happy to iterate with you on something that'll work.
@Pablo was indeed correct that a stateful DoFn and timers are useful in this scenario. Here is the code I was able to get working.
Stateful DoFn
// DomainState is a custom case class I'm using
type DoFnT = DoFn[KV[String, DomainState], KV[String, DomainState]]

class StatefulDoFn extends DoFnT {

  @StateId("key")
  private val keySpec = StateSpecs.value[String]()

  @StateId("domainState")
  private val domainStateSpec = StateSpecs.value[DomainState]()

  @TimerId("loopingTimer")
  private val loopingTimer: TimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME)

  @ProcessElement
  def process(
      context: DoFnT#ProcessContext,
      @StateId("key") stateKey: ValueState[String],
      @StateId("domainState") stateValue: ValueState[DomainState],
      @TimerId("loopingTimer") loopingTimer: Timer): Unit = {
    // ... logic to create key/value from potentially null values
    if (keepState(value)) {
      loopingTimer.align(Duration.standardMinutes(5)).setRelative()
      stateKey.write(key)
      stateValue.write(value)
      if (flushState(value)) {
        context.output(KV.of(key, value))
      }
    } else {
      stateValue.clear()
    }
  }

  @OnTimer("loopingTimer")
  def onLoopingTimer(
      context: DoFnT#OnTimerContext,
      @StateId("key") stateKey: ValueState[String],
      @StateId("domainState") stateValue: ValueState[DomainState],
      @TimerId("loopingTimer") loopingTimer: Timer): Unit = {
    // ... logic to create key/value checking for nulls
    if (keepState(value)) {
      loopingTimer.align(Duration.standardMinutes(5)).setRelative()
      if (flushState(value)) {
        context.output(KV.of(key, value))
      }
    }
  }
}
With the pipeline:
sc
  .pubsubSubscription(...)
  .keyBy(...)
  .withGlobalWindow()
  .applyPerKeyDoFn(new StatefulDoFn())
  .withFixedWindows(
    duration = Duration.standardMinutes(5),
    options = WindowOptions(
      accumulationMode = DISCARDING_FIRED_PANES,
      trigger = AfterWatermark.pastEndOfWindow(),
      allowedLateness = Duration.ZERO,
      // Only take the latest per key during a window
      timestampCombiner = TimestampCombiner.END_OF_WINDOW
    ))
  .reduceByKey(mostRecentEvent())
  .saveAsCustomOutput(TextIO.write()...)
I am working on a scripted Jenkins pipeline that needs to write a String with a certain encoding to a file, as in the following example:
class Logger implements Closeable {
  private final PrintWriter writer
  [...]

  Logger() {
    FileWriter fw = new FileWriter(file, true)
    BufferedWriter bw = new BufferedWriter(fw)
    this.writer = new PrintWriter(bw)
  }

  def log(String msg) {
    try {
      writer.println(msg)
      [...]
    } catch (e) {
      [...]
    }
  }
}
The above code doesn't work since PrintWriter is not serializable, so I know I have to prevent some of the code from being CPS-transformed. I don't have an idea how to do so, though, since as far as I know the @NonCPS annotation can only be applied to methods.
I know that one solution would be to move all output-related code into log(msg) and annotate that method, but this way I would have to create a new writer every time the method gets called.
Does someone have an idea on how I could fix my code instead?
Thanks in advance!
Here is a way to make this work using a log function that is defined in a shared library in vars\log.groovy:
import java.io.FileWriter
import java.io.BufferedWriter
import java.io.PrintWriter

// The annotated variable will become a private field of the script class.
@groovy.transform.Field
PrintWriter writer = null

void call( String msg ) {
  if( ! writer ) {
    def fw = new FileWriter(file, true)
    def bw = new BufferedWriter(fw)
    writer = new PrintWriter(bw)
  }
  try {
    writer.println(msg)
    [...]
  } catch (e) {
    [...]
  }
}
After all, scripts in the vars folder are instantiated as singleton classes, which is perfectly suited for a logger. This works even without a @NonCPS annotation.
Usage in pipeline is as simple as:
log 'some message'
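One caveat, since the question asks for a certain encoding: FileWriter always uses the platform default charset. A minimal sketch of constructing the writer with an explicit encoding (UTF-8 assumed here; Java syntax, which drops into the Groovy call() above unchanged):
import java.nio.charset.StandardCharsets;

// Append to the file with an explicit charset instead of the platform default.
FileOutputStream fos = new FileOutputStream(file, true);
OutputStreamWriter osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
PrintWriter writer = new PrintWriter(new BufferedWriter(osw));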
I am new to Dart programming. I am trying to figure out the proper way (what everyone does) to handle/guard those functions which require login. The following is my first trial:
$ vim login_sample.dart:
var isLoggedIn;

class LoginRequiredException implements Exception {
  String cause;
  LoginRequiredException(this.cause);
}

Function loginRequired(Function f) {
  if (!isLoggedIn) {
    throw new LoginRequiredException("Login is required.");
  }
  return f;
}

void secretPrint() {
  print("This is a secret");
}

void main(List<String> args) {
  if (args.length != 1) return;
  isLoggedIn = (args[0] == '1') ? true : false;
  try {
    loginRequired(secretPrint)();
  } on LoginRequiredException {
    print("Login is required!");
  }
}
Then run it with $ dart login_sample.dart 1 and $ dart login_sample.dart 2.
I am wondering if this is the recommended way to guard login-required functions or not.
Thank you very much for your help.
Edited:
My question is more about general programming style in Dart than about how to use a plugin. In Python, I just need to add a @login_required decorator in front of a function to protect it. I am wondering whether this decorator-function approach is recommended in Dart or not.
PS: All firebase/google/twitter/facebook etc... are blocked in my country.
I like the functional approach. I'd only avoid using globals; you can wrap it in a Context so you can mock them for tests, and use Futures as monads: https://dartpad.dartlang.org/ac24a5659b893e8614f3c29a8006a6cc
Passing the function around is not buying much value. In a typical larger Dart project using a framework, there will be some way to guard at a higher level than a function - such as an entire page or component/widget.
If you do want to guard at a per-function level, you first need to decide whether it should be the function or the call site that decides what needs to be guarded. In your example it is the call site making the decision. After that decision you can implement a throwIfNotAuthenticated and add a call at either the definition or the call site.
void throwIfNotAuthenticated() {
  if (!userIsAuthenticated) {
    throw new LoginRequiredException();
  }
}

// Function decides authentication is required:
void secretPrint() {
  throwIfNotAuthenticated();
  print('This is a secret');
}

// Call site decides authentication is required:
void main() {
  // do stuff...
  throwIfNotAuthenticated();
  anotherSecretMethod();
}
I am surprised that Dart does not have a built-in object-to-JSON and JSON-to-object mapper.
I read that we have to hand-code the mapping ourselves, which is not pleasant.
Anyway, although I have not thoroughly tested it for my use case, I found dart-exportable to be very helpful for half of my requirement.
Any suggested package for json to object decoding?
Your best option is to use the Smoke library.
It's a subset of the Mirrors functionality but has both a Mirrors-based and a Codegen-based implementation. It's written by the PolymerDart team, so it's as close to "Official" as we're going to get.
While developing, it'll use the Mirrors-based encoding/decoding; but for publishing you can create a small transformer that will generate code.
Seth Ladd created a code sample here, which I extended slightly to support child-objects:
import 'dart:mirrors';

import 'package:smoke/smoke.dart' as smoke;

abstract class Serializable {
  static fromJson(Type t, Map json) {
    // reflectClass (not reflectType) gives us a ClassMirror with newInstance.
    var classMirror = reflectClass(t);
    var obj = classMirror.newInstance(new Symbol(""), const []).reflectee;
    json.forEach((k, v) {
      if (v is Map) {
        var d = smoke.getDeclaration(t, smoke.nameToSymbol(k));
        smoke.write(obj, smoke.nameToSymbol(k), Serializable.fromJson(d.type, v));
      } else {
        smoke.write(obj, smoke.nameToSymbol(k), v);
      }
    });
    return obj;
  }

  Map toJson() {
    var options = new smoke.QueryOptions(includeProperties: false);
    var res = smoke.query(runtimeType, options);
    var map = {};
    res.forEach((r) => map[smoke.symbolToName(r.name)] = smoke.read(this, r.name));
    return map;
  }
}
Currently, there is no support for getting generic type information (e.g. to support List) in Smoke; however, I've raised an issue about this here:
https://code.google.com/p/dart/issues/detail?id=20584
Until this issue is implemented, a "good" implementation of what you want is not really feasible, but I'm hopeful it'll be implemented soon, because doing something as basic as JSON serialisation kinda hinges on it!
I haven't had the time to complete it yet, but dartson currently works using mirrors. However, a better solution would be using a transformer when compiling to JavaScript. https://pub.dartlang.org/packages/dartson