CombineFn Dataflow - Step not in order, creating null pointer - google-cloud-dataflow

I am a newbie in Dataflow, so please pardon me if I made a newbie mistake.
I recently started using Dataflow/Beam to process data from Pub/Sub, taking cloud-dataflow-nyc-taxi-tycoon as a starting point, but I upgraded it to SDK 2.2.0 to make it work with Bigtable. I simulate the input using an HTTP Cloud Function that sends a single data point to Pub/Sub so the Dataflow pipeline can ingest it, using the code below:
.apply("session windows on rides with early firings",
Window.<KV<String, TableRow>>into(
new GlobalWindows())
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1))
)
.accumulatingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply("group by", Combine.perKey(new LatestPointCombine()))
.apply("prepare to big table",
MapElements.via(new SimpleFunction<KV<String,TableRow>,TableRow >() {
#Override
public TableRow apply(KV<String, TableRow> input) {
TableRow tableRow = null;
try{
tableRow=input.getValue();
....
}
catch (Exception ex){
ex.printStackTrace();
}
return tableRow;
}
}))
.apply....
But it gives me an error at the "group by"/CombineFn phase, after "session windows on rides with early firings". Here are the logs from Stackdriver:
1. I create accumulator
2. I addinput
3. I mergeaccumulators
4. I extractoutput
5. I pre mutation_transform
6. W Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7. I mergeaccumulators
8. I create accumulator
9. E Uncaught exception:
10. E Execution of work for S0 for key db2871226f7cec594ebd976e6758ac7e failed. Will retry locally.
11. I Memory is used/total/max = 105/365/4949 MB, GC last/max = 1.00/1.00 %, #pushbacks=0, gc thrashing=false
12. I create accumulator
13. I addinput
14. I mergeaccumulators
15. I extractoutput
16. I pre mutation_transform
17. I mergeaccumulators
18. I create accumulator
19. E Uncaught exception:
20. E Execution of work for S0 for key db2871226f7cec594ebd976e6758ac7e failed. Will retry locally.
21. I create accumulator
...
My questions are:
A. What I don't understand is why, after step 4 (extractOutput), the mergeAccumulators method is called first (line 7) and createAccumulator is called later (line 8). Here is the mergeAccumulators method I wrote:
public RidePoint mergeAccumulators(Iterable<RidePoint> latestList) {
    //RidePoint merged = createAccumulator();
    RidePoint merged = new RidePoint();
    LOG.info("mergeaccumulators");
    for (RidePoint latest : latestList) {
        if (latest == null) {
            LOG.info("latestnull");
        } else if (merged.rideId == null || latest.timestamp > merged.timestamp) {
            LOG.info(latest.timestamp + " latest " + latest.rideId);
            merged = new RidePoint(latest);
        }
    }
    return merged;
}
B. It seems the data is null, and I don't know what caused it, but it reaches the end of the pipeline. The "session windows on rides with early firings" step shows 1 element added, but below that, the "group by" phase ... shows 52 elements added.
The detailed uncaught exception shown in the log looks like this:
(90c7caea3f5d5ad4): java.lang.NullPointerException: in com.google.codelabs.dataflow.utils.RidePoint in string null of string in field status of com.google.codelabs.dataflow.utils.RidePoint
org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:161)
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:308)
org.apache.beam.sdk.coders.Coder.encode(Coder.java:143)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:952)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1071)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run(StreamingDataflowWorker.java:841)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:67)
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:128)
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:159)
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:90)
org.apache.avro.reflect.ReflectDatumWriter.writeField(ReflectDatumWriter.java:191)
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:159)
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
org.apache.beam.sdk.coders.AvroCoder.encode(AvroCoder.java:308)
org.apache.beam.sdk.coders.Coder.encode(Coder.java:143)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillBag.persistDirectly(WindmillStateInternals.java:575)
com.google.cloud.dataflow.worker.WindmillStateInternals$SimpleWindmillState.persist(WindmillStateInternals.java:320)
com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillCombiningState.persist(WindmillStateInternals.java:952)
com.google.cloud.dataflow.worker.WindmillStateInternals.persist(WindmillStateInternals.java:216)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.flushState(StreamingModeExecutionContext.java:513)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:363)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1071)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:133)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run(StreamingDataflowWorker.java:841)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Your question has many parts to it. But to start, here are some recommendations for debugging:
Don't swallow exceptions. Currently in your "prepare to big table" logic you have: catch (Exception ex){ ex.printStackTrace(); }. This hides the exception and is causing null elements to be returned from the function. It's better to understand and fix the exception here rather than dealing with invalid data later.
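For example, a minimal sketch of the "prepare to big table" function that fails fast instead of swallowing the exception (the real per-row transformation would go where the comment is):
MapElements.via(new SimpleFunction<KV<String, TableRow>, TableRow>() {
    @Override
    public TableRow apply(KV<String, TableRow> input) {
        try {
            TableRow tableRow = input.getValue();
            // ... the rest of the per-row transformation ...
            return tableRow;
        } catch (Exception ex) {
            // Rethrowing makes the bundle fail visibly (and be retried) instead of
            // letting a null row flow downstream.
            throw new RuntimeException("Failed to prepare row for key " + input.getKey(), ex);
        }
    }
})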
Validate with the DirectRunner first. Make sure that your pipeline runs correctly on your machine using the Beam DirectRunner. This is the easiest way to understand and fix issues with the Beam model. You can run from the command line or from your favorite IDE and debugger. Then, if your pipeline works on the DirectRunner but not on Dataflow, you know that there is a Dataflow-specific issue.
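A minimal sketch of running the same pipeline locally with the DirectRunner (assuming the beam-runners-direct-java dependency is on the classpath; the class name is made up):
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LocalDebugMain {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        options.setRunner(DirectRunner.class); // or pass --runner=DirectRunner on the command line
        Pipeline pipeline = Pipeline.create(options);
        // ... apply the same transforms as in the Dataflow job ...
        pipeline.run().waitUntilFinish();
    }
}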
To touch on your specific questions:
A. What I don't understand is why, after step 4 (extractOutput), the
mergeAccumulators method is called first (line 7) and
createAccumulator is called later (line 8)
Your code uses Combine.perKey, which will group elements by key value. So each unique key will cause an accumulator to be created. Dataflow also applies a set of optimizations which can parallelize and reorder independent operations, which could explain what you're seeing.
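To make the lifecycle concrete, here is a minimal, hypothetical CombineFn (not your LatestPointCombine, just an illustrative max-timestamp combiner). The runner is free to create fresh accumulators and merge them with previously persisted partial results at any point, which is why "mergeaccumulators" can appear in the logs before another "create accumulator":
static class MaxTimestampFn extends Combine.CombineFn<Long, Long, Long> {
    @Override
    public Long createAccumulator() {
        return Long.MIN_VALUE; // "empty" accumulator
    }

    @Override
    public Long addInput(Long accumulator, Long input) {
        return Math.max(accumulator, input);
    }

    @Override
    public Long mergeAccumulators(Iterable<Long> accumulators) {
        // May be called with partial accumulators, including freshly created empty ones.
        long merged = Long.MIN_VALUE;
        for (Long acc : accumulators) {
            merged = Math.max(merged, acc);
        }
        return merged;
    }

    @Override
    public Long extractOutput(Long accumulator) {
        return accumulator;
    }
}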
B. It seems the data is null, and I don't know what caused it
The null values are likely those that hit an exception in your prepare to big table logic.
I'm not exactly sure what you mean by the output counts because I don't quite understand your pipeline topology. For example, your LatestPointCombine logic seems to output type RidePoint, but the "prepare to big table" function takes in a String. If you are still having trouble after following the suggestions above, you can post a Dataflow job_id and I can help investigate further.

Related

Are stored procedures in Cosmos DB automatically retried on conflict?

Stored procedures in Cosmos DB are transactional and run under snapshot isolation with optimistic concurrency control. That means write conflicts can occur, but they are detected and the transaction is rolled back.
If such a conflict occurs, does Cosmos DB automatically retry the stored procedure, or does the client receive an exception (maybe an HTTP 412 precondition failure?) and need to implement the retry logic itself?
I tried running 100 instances of a stored procedure in parallel that would produce a write conflict by reading a document (without setting _etag), waiting for a while, and then incrementing an integer property within that document (again without setting _etag).
In all trials so far, no errors occurred, and the result was as if the 100 runs were run sequentially. So the preliminary answer is: yes, Cosmos DB automatically retries running an SP on write conflicts (or perhaps enforces transactional isolation by some other means like locking), so clients hopefully don't need to worry about aborted SPs due to conflicts.
It would be great to hear from a Cosmos DB engineer how this is achieved: retry, locking or something different?
You're correct in that this isn't properly documented anywhere. Here's how an OCC check can be done in a stored procedure:
function storedProcedureWithEtag(newItem) {
    var context = getContext();
    var collection = context.getCollection();
    var response = context.getResponse();

    if (!newItem) {
        throw 'Missing item';
    }

    // update the item to set changed time
    newItem.ChangedTime = (new Date()).toISOString();

    var etagForOcc = newItem._etag;
    var upsertAccepted = collection.upsertDocument(
        collection.getSelfLink(),
        newItem,
        { etag: etagForOcc }, // <-- Pass in the etag
        function (err2, feed2, options2) {
            if (err2) throw err2;
            response.setBody(newItem);
        }
    );

    if (!upsertAccepted) {
        throw "Unable to upsert item. Id: " + newItem.id;
    }
}
Credit: https://peter.intheazuresky.com/2016/12/22/documentdb-optimistic-concurrency-in-a-stored-procedure/
The SDK does not retry on a 412. 412 failures are related to optimistic concurrency, and in those cases you are controlling the ETag that you are passing. It is expected that the user handles the 412 by reading the newest version of the document, obtaining the newer ETag, and retrying the operation with the updated value.
Example for V3 SDK
Example for V2 SDK
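Independent of SDK flavor, the read-modify-retry loop looks roughly like this sketch using the azure-cosmos Java v4 client (the item type, counter field, ids, and retry limit are made up for illustration; method names may differ in other SDK versions):
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;

public class OccRetry {
    // Reads the latest version of the item, applies the change, and replaces it
    // with an If-Match ETag; on a 412 it re-reads and tries again.
    static void incrementWithRetry(CosmosContainer container, String id, String pk, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            CosmosItemResponse<ToDoItem> read = container.readItem(id, new PartitionKey(pk), ToDoItem.class);
            ToDoItem item = read.getItem();
            item.counter += 1; // the modification we want to make

            CosmosItemRequestOptions options = new CosmosItemRequestOptions();
            options.setIfMatchETag(read.getETag()); // only succeed if nobody changed the item since the read

            try {
                container.replaceItem(item, id, new PartitionKey(pk), options);
                return; // success
            } catch (CosmosException e) {
                if (e.getStatusCode() != 412) {
                    throw e; // not a concurrency conflict, don't retry
                }
                // 412: someone else won the race; loop to re-read and retry
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " conflicting attempts");
    }

    static class ToDoItem {
        public String id;
        public int counter;
    }
}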

Parallel SQL queries

How does one run SQL queries with different column dimensions in parallel using dask? Below is my attempt:
import cx_Oracle
import pandas as pd
import dask
from dask.delayed import delayed
from dask.diagnostics import ProgressBar

ProgressBar().register()

con = cx_Oracle.connect(user="BLAH", password="BLAH", dsn="BLAH")

@delayed
def loadsql(sql):
    return pd.read_sql_query(sql, con)

results = [loadsql(x) for x in sql_to_run]
dask.compute(results)
df1 = results[0]
df2 = results[1]
df3 = results[2]
df4 = results[3]
df5 = results[4]
df6 = results[5]
However this results in the following error being thrown:
DatabaseError: Execution failed on sql: "SQL QUERY"
ORA-01013: user requested cancel of current operation
unable to rollback
and then shortly thereafter another error comes up:
MultipleInstanceError: Multiple incompatible subclass instances of TerminalInteractiveShell are being created.
sql_to_run is a list of different sql queries
Any suggestions or pointers?? Thanks!
Update 9.7.18
I think this is more a case of me not reading the documentation closely enough. Indeed, the con being outside the loadsql function was causing the problem. Below is the code change that seems to be working as intended now.
def loadsql(sql):
    con = cx_Oracle.connect(user="BLAH", password="BLAH", dsn="BLAH")
    result = pd.read_sql_query(sql, con)
    con.close()
    return result

values = [delayed(loadsql)(x) for x in sql_to_run]

# Multiprocessing version
import dask.multiprocessing
results = dask.compute(*values, scheduler='processes')
# My sample queries took 56.2 seconds

# Multithreaded version
import dask.threaded
results = dask.compute(*values, scheduler='threads')
# My sample queries took 51.5 seconds
My guess is that the Oracle client is not thread-safe. You could try running with processes instead (by using the multiprocessing scheduler, or the distributed one) if the conn object serialises, though this may be unlikely. More likely to work would be to create the connection within loadsql, so it gets remade for each call and the different connections hopefully don't interfere with one another.

Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?

Background
We have a pipeline that starts by receiving messages from PubSub, each with the name of a file. These files are exploded to line level, parsed to JSON object nodes and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to Table Rows and written to Big Query.
It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a backlog when many messages are sent at once. This means that lines associated with a PubSub message can take some time to arrive at the decoding service. As a result, PubSub was receiving no acknowledgement and resending the message. My first attempt to remedy this was adding an attribute to each PubSub message that is passed to the Reader using withIdAttribute(). However, on testing, this only prevented duplicates that arrived close together.
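(A minimal sketch of that kind of id-attribute deduplication with PubsubIO, where the subscription path, the pipeline variable, and the "uniqueId" attribute name are placeholders, not the real values:)
PCollection<String> lines = pipeline.apply("Read from PubSub",
    PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-subscription")
        // Beam/Dataflow uses this attribute value to drop redeliveries of the same message.
        .withIdAttribute("uniqueId"));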
My second attempt was to add a fusion breaker (example) after the PubSub read. This simply performs a needless GroupByKey and then ungroups, the idea being that the GroupByKey forces Dataflow to acknowledge the PubSub message.
The Problem
The fusion breaker discussed above works in that it prevents PubSub from resending messages, but I am finding that this GroupByKey outputs more elements than it receives: See image.
To try and diagnose this I have removed parts of the pipeline to get a simple pipeline that still exhibits this behavior. The behavior remains even when
PubSub is replaced by some dummy transforms that send out a fixed list of messages with a slight delay between each one.
The Writing transforms are removed.
All Side Inputs/Outputs are removed.
The behavior I have observed is:
Some number of the received messages pass straight through the GroupByKey.
After a certain point, messages are 'held' by the GroupByKey (presumably due to the backlog after the GroupByKey).
These messages eventually exit the GroupByKey (in groups of size one).
After a short delay (about 3 minutes), the same messages exit the GroupByKey again (still in groups of size one). This may happen several times (I suspect it is proportional to the time they spend waiting to enter the GroupByKey).
Example job id is 2017-10-11_03_50_42-6097948956276262224. I have not run the pipeline on any other runner.
The Fusion Breaker is below:
@Slf4j
public class FusionBreaker<T> extends PTransform<PCollection<T>, PCollection<T>> {

  @Override
  public PCollection<T> expand(PCollection<T> input) {
    return group(window(input.apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break in")))))
        .apply("Getting iterables after breaking fusion", Values.create())
        .apply("Flattening iterables after breaking fusion", Flatten.iterables())
        .apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break out")));
  }

  private PCollection<T> window(PCollection<T> input) {
    return input.apply("Windowing before breaking fusion", Window.<T>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .discardingFiredPanes());
  }

  private PCollection<KV<Integer, Iterable<T>>> group(PCollection<T> input) {
    return input.apply("Keying with random number", ParDo.of(new RandomKeyFn<>()))
        .apply("Grouping by key to break fusion", GroupByKey.create());
  }

  private static class RandomKeyFn<T> extends DoFn<T, KV<Integer, T>> {
    private Random random;

    @Setup
    public void setup() {
      random = new Random();
    }

    @ProcessElement
    public void processElement(ProcessContext context) {
      context.output(KV.of(random.nextInt(), context.element()));
    }
  }
}
The PassthroughLoggers simply log the elements passing through (I use these to confirm that elements are indeed repeated, rather than there being an issue with the counts).
I suspect this has something to do with windows/triggers, but my understanding is that elements should never be repeated when .discardingFiredPanes() is used, regardless of the windowing setup. I have also tried FixedWindows with no success.
First, the Reshuffle transform is equivalent to your Fusion Breaker, but has some additional performance improvements that should make it preferable.
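For illustration, a minimal sketch of the swap (assuming org.apache.beam.sdk.transforms.Reshuffle is available in your SDK version; the messages variable and Message type are placeholders):
// Instead of applying the custom FusionBreaker, break fusion with the built-in transform,
// which shuffles each element on a random key and then discards the key.
PCollection<Message> unfused = messages.apply("Break fusion", Reshuffle.viaRandomKey());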
Second, both counters and logging may see an element multiple times if it is retried. As described in the Beam Execution Model, an element at a step may be retried if anything that is fused into it is retried.
Have you actually observed duplicates in what is written as the output of the pipeline?

Query timeout in Neo4j 3.0.6

It looks like the previously working approach is deprecated now:
unsupported.dbms.executiontime_limit.enabled=true
unsupported.dbms.executiontime_limit.time=1s
According to the documentation, new variables are responsible for timeout handling:
dbms.transaction.timeout
dbms.transaction_timeout
At the same time, the new variables appear to be related to transactions.
The new timeout variables don't seem to work. They were set in neo4j.conf as follows:
dbms.transaction_timeout=5s
dbms.transaction.timeout=5s
A slow Cypher query isn't terminated.
Then a Neo4j plugin was added to model a slow query with a transaction:
#Procedure("test.slowQuery")
public Stream<Res> slowQuery(#Name("delay") Number Delay )
{
ArrayList<Res> res = new ArrayList<>();
try ( Transaction tx = db.beginTx() ){
Thread.sleep(Delay.intValue(), 0);
tx.success();
} catch (Exception e) {
System.out.println(e);
}
return res.stream();
}
The function served by the plugin is executed with the neoism Golang package, and the timeout isn't triggered either.
The timeout is only honored if your procedure code either invokes operations on the graph, like reading nodes and relationships, or explicitly checks whether the current transaction has been marked as terminated.
For the latter, see https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/master/src/main/java/apoc/util/Utils.java#L41-L51 as an example.
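As an illustration, a hypothetical variant of the slowQuery procedure above (reusing its db field and Res result class) that sleeps in short slices and touches the graph between them, so a guard-initiated termination is actually observed:
@Procedure("test.slowQueryInterruptible")
public Stream<Res> slowQueryInterruptible(@Name("delay") Number delay) {
    try (Transaction tx = db.beginTx()) {
        long remaining = delay.longValue();
        while (remaining > 0) {
            Thread.sleep(Math.min(remaining, 500));
            remaining -= 500;
            // Any graph read throws once the transaction has been marked as terminated,
            // so the procedure stops instead of sleeping through the timeout.
            try (ResourceIterator<Node> it = db.getAllNodes().iterator()) {
                it.hasNext();
            }
        }
        tx.success();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    return Stream.empty();
}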
According to the documentation, the transaction guard is interested in orphaned transactions only.
The server guards against orphaned transactions by using a timeout. If there are no requests for a given transaction within the timeout period, the server will roll it back. You can configure the timeout in the server configuration, by setting dbms.transaction_timeout to the number of seconds before timeout. The default timeout is 60 seconds.
I've not found a way to trigger the timeout for a query that isn't orphaned using native functionality.
@StefanArmbruster pointed in a good direction. The timeout-triggering functionality can be obtained by creating a wrapper function in a Neo4j plugin, as is done in APOC.

Huge performance drop in cassandra-orm after million records

I'm using the cassandra-orm plugin (cassandra-orm:0.4.5) to migrate clicks from a Postgres DB to Cassandra. (I know I could use a raw data import, but I want to make use of groupBy and the explicit indexes maintained by the plugin.)
The migration procedure is simple: I select a bunch of clicks from Postgres (via GORM) and then flush them to Cassandra. Every click is a new record, and a new object is created in Grails and saved in Cassandra. With 20 threads I was able to reach a throughput of 2,000 clicks/sec. After importing 5 million clicks, the performance started to degrade dramatically, down to 50 clicks/sec.
I did some profiling and found that 19 threads were waiting (parked) and one thread was performing a rehash on Groovy's AbstractConcurrentMapBase.
stack trace for waiting threads:
Name: pool-4-thread-2
State: WAITING on org.codehaus.groovy.util.ManagedConcurrentMap$Segment#5387f7af
Total blocked: 45,027 Total waited: 55,891
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:34)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:101)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.nosql.Click.getProperty(Click.groovy)
stack trace for rehash thread:
Name: pool-4-thread-11
State: RUNNABLE
Total blocked: 46,544 Total waited: 57,433
Stack trace:
org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.rehash(AbstractConcurrentMapBase.java:217)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:105)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.fma.nosql.Click.getProperty(Click.groovy)
After hours of debugging, I found that the issue is in the dynamic property "_cassandra_cluster_", which is added to all plugin-managed objects:
// cluster property (_cassandra_cluster_)
clazz.metaClass."${CLUSTER_PROP}" = null
This property is then internally saved in the ThreadManagedMetaBeanProperty instance2Prop map. When the dynamic property is accessed (def cluster = click._cassandra_cluster_), the click instance is saved to the instance2Prop map with a soft reference. So far so good: soft references can be garbage collected, right? However, there seems to be a bug in the ManagedConcurrentMap implementation, which disregards the garbage-collected elements and keeps rehashing and expanding the map (described here and here).
Workaround
Since the map is internally saved at the class level, the only working solution was to restart the server. Eventually I developed a dirty workaround that clears the zombie elements from the internal map. The following code runs in a separate thread:
public void rehashClickSegmentsIfNecessary() {
    ManagedConcurrentMap instanceMap = lookupInstanceMap(Click.class, "_cassandra_cluster_")
    if (instanceMap.fullSize() - instanceMap.size() > 50000) {
        // we have more than 50,000 zombie references in the map
        rehashSegments(instanceMap)
    }
}

private void rehashSegments(ManagedConcurrentMap instanceMap) {
    org.codehaus.groovy.util.ManagedConcurrentMap.Segment[] segments = instanceMap.segments
    for (int i = 0; i < segments.length; i++) {
        segments[i].lock()
        try {
            segments[i].rehash()
        } finally {
            segments[i].unlock()
        }
    }
}

private ManagedConcurrentMap lookupInstanceMap(Class clazz, String prop) {
    MetaClassRegistry registry = GroovySystem.metaClassRegistry
    MetaClassImpl metaClass = registry.getMetaClass(clazz)
    return metaClass.getMetaProperty(prop, false).instance2Prop
}
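For completeness, one hypothetical way to run that check periodically on its own thread with the plain JDK scheduler (the startZombieCleanup method name and the one-minute interval are arbitrary choices, not from the plugin):
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// In the same class as rehashClickSegmentsIfNecessary():
private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

public void startZombieCleanup() {
    // Re-check the instance map once a minute and rehash only when enough zombies have accumulated.
    cleaner.scheduleWithFixedDelay(this::rehashClickSegmentsIfNecessary, 1, 1, TimeUnit.MINUTES);
}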
Do you have any production experience with cassandra-orm or any other Grails plugin connecting to Cassandra?
