Dataflow concurrency error with ValueState - google-cloud-dataflow

My Beam 2.1 pipeline uses ValueState in a stateful DoFn. It runs fine with a single worker, but once scaling is enabled it fails with "Unable to read value from state" and the root exception below. Any ideas what could cause this?
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.dataflow.worker.KeyTokenInvalidException: Unable to fetch data due to token mismatch for key ��
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
at com.google.cloud.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:309)
at com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillValue.read(WindmillStateInternals.java:384)
... 16 more
Caused by: com.google.cloud.dataflow.worker.KeyTokenInvalidException: Unable to fetch data due to token mismatch for key ��
at com.google.cloud.dataflow.worker.WindmillStateReader.consumeResponse(WindmillStateReader.java:469)
at com.google.cloud.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:411)
at com.google.cloud.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:306)
... 17 more

I believe that exception should just be rethrown. It is thrown by the state mechanism to indicate that additional work on that key should not be performed, and will be automatically retried by the Dataflow runner.
These typically indicate that the work for that key should be performed on a different worker, so proceeding on the current worker wouldn't be helpful.
It may also be possible that misusing state -- storing the state object obtained for one key and attempting to use it for a different key -- leads to these errors. If that is the case, you may see more diagnostic messages in the worker or shuffler logs in Stackdriver Logging.
If neither retrying nor checking the logs and how you use the state objects helps, please provide a job ID demonstrating the problem.
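For illustration, here is a minimal per-key ValueState sketch (the class name, state id, and counting logic are made up for this example, not taken from your pipeline). The point is that the ValueState handle injected into @ProcessElement is scoped to the current element's key and should never be cached in a field and reused for other keys:

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical stateful DoFn that counts elements per key.
class CountPerKeyFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(ProcessContext c,
                             @StateId("count") ValueState<Long> count) {
    // 'count' is valid only for the current key and bundle; do not store it
    // in a field and reuse it for elements that belong to other keys.
    Long current = count.read();
    long updated = (current == null ? 0L : current) + 1L;
    count.write(updated);
    c.output(KV.of(c.element().getKey(), updated));
  }
}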

Related

QuickFIXJ 2.3 - onDisconnect(), onConnect()

I am using QuickFIX/J 2.3 as the initiator; the vendor party is the acceptor.
I have implemented SessionStateListener with the onConnectException and onDisconnect methods.
I have ResetOnLogon=Y in the configuration file.
How can I catch a specific exception such as EndOfStream and tell whether it occurred because of wrong session data, because the acceptor allows only one session at a time, or because of an invalid message sequence number?
At the moment, with ResetOnLogon=Y, the session keeps disconnecting and reconnecting internally until the sequence numbers match. I would like to log out manually on all other disconnects, except in this situation where it automatically matches the sequence numbers.
Thank you.
You actually cannot tell the scenarios you list apart most of the time.
For example, the counterparty normally will not tell you if you have wrong session data (I assume you mean a wrong SenderCompID or TargetCompID), because that would disclose information about their system. The same goes for the duplicate session.
Only in the case of a "sequence number too low" event will the counterparty normally send this information, in the Text field (tag 58) of the Logout message.
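As a rough sketch (the class name and the string matching are assumptions, not a guaranteed format), you can read the Text(58) field of an incoming Logout in your Application's fromAdmin callback and decide from there whether the disconnect is a sequence-number resync or something you want to act on:

import quickfix.ApplicationAdapter;
import quickfix.FieldNotFound;
import quickfix.IncorrectDataFormat;
import quickfix.IncorrectTagValue;
import quickfix.Message;
import quickfix.RejectLogon;
import quickfix.SessionID;
import quickfix.field.MsgType;
import quickfix.field.Text;

// Hypothetical initiator Application that inspects the Logout reason text.
public class MyInitiatorApplication extends ApplicationAdapter {

  @Override
  public void fromAdmin(Message message, SessionID sessionId)
      throws FieldNotFound, IncorrectDataFormat, IncorrectTagValue, RejectLogon {
    String msgType = message.getHeader().getString(MsgType.FIELD);
    if (MsgType.LOGOUT.equals(msgType) && message.isSetField(Text.FIELD)) {
      String reason = message.getString(Text.FIELD);
      // Counterparties often put something like "MsgSeqNum too low, expecting N"
      // here; other disconnect causes usually carry no reason at all.
      if (reason.toLowerCase().contains("too low")) {
        // sequence-number mismatch: let the engine resynchronise
      } else {
        // some other reported reason: decide whether to log out manually
      }
    }
  }
}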

CloudWatch Events Metrics - DeadLetterInvocations

The documentation says:
Metric: DeadLetterInvocations
Description: Measures the number of times a rule's target is not invoked in response to an event. This includes invocations that would result in triggering the same rule again, causing an infinite loop.
Valid Dimensions: RuleName
Units: Count
Can someone give a simple explanation of what the above description means in layman's terms?
I was also confused about dead-letter invocations before. Here is the correct explanation:
DeadLetterInvocations: invocations that failed temporarily and are being retried by the event rule itself. Only some events, such as those that fail due to a throttling or timeout error, are retried.
InvocationsSentToDlq: permanently failed invocations that are sent to the SQS dead-letter queue configured on the target.
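As an illustration of where the metric lives (the rule name "my-rule" and the time window are placeholders), this sketch pulls DeadLetterInvocations for one rule with the AWS SDK for Java v2, using the AWS/Events namespace and the RuleName dimension from the documentation quoted above:

import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class DeadLetterInvocationsCheck {
  public static void main(String[] args) {
    try (CloudWatchClient cw = CloudWatchClient.create()) {
      GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
          .namespace("AWS/Events")
          .metricName("DeadLetterInvocations")
          // The only valid dimension for this metric is RuleName.
          .dimensions(Dimension.builder().name("RuleName").value("my-rule").build())
          .startTime(Instant.now().minus(Duration.ofHours(1)))
          .endTime(Instant.now())
          .period(300)                 // 5-minute buckets
          .statistics(Statistic.SUM)   // the unit is Count, so Sum is the useful statistic
          .build();
      GetMetricStatisticsResponse response = cw.getMetricStatistics(request);
      response.datapoints().forEach(dp ->
          System.out.println(dp.timestamp() + ": " + dp.sum()));
    }
  }
}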

Program run with OpenMPI gives initialization error, with Intel MPI a memory leak

I am a graduate student (master's) and use an in-house code that uses MPI for my simulations. Earlier, I used OpenMPI on a supercomputer we had access to; since it shut down, I have been trying to switch to another supercomputer that has Intel MPI installed. The problem is that the same code that was working perfectly fine earlier now gives memory leaks after a set number of iterations (time steps). Since the code is relatively large and my knowledge of MPI is very basic, it is proving very difficult to debug.
So I installed OpenMPI onto this new supercomputer I am using, but it gives the following error message upon execution and then terminates:
Invalid number of PE
Please check partitioning pattern or number of PE
NOTE: The error message is repeated once for each of the nodes I used to run the case (here, 8). The code was compiled using mpif90 with -fopenmp for thread parallelisation.
There is, of course, no guarantee that running it with OpenMPI won't give the same memory leak, but I feel it is worth a shot, as it was running perfectly fine earlier.
PS: On Intel MPI, this is the error I got (compiled with mpiifort and -qopenmp):
Abort(941211497) on node 16 (rank 16 in comm 0): Fatal error in PMPI_Isend: Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2aba1cbc8060, count=4900, dtype=0x4c000829, dest=20, tag=0, MPI_COMM_WORLD, request=0x7ffec8586e5c) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
Abort(203013993) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2b38c479c060, count=4900, dtype=0x4c000829, dest=21, tag=0, MPI_COMM_WORLD, request=0x7fffc20097dc) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
[mpiexec#cx0321.obcx] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:357): write error (Bad file descriptor)
[mpiexec#cx0321.obcx] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:164): error sending cmd 15 to proxy
[mpiexec#cx0321.obcx] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:557): unable to send response downstream
[mpiexec#cx0321.obcx] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1576): unable to send abort rank to downstreams
[mpiexec#cx0321.obcx] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[mpiexec#cx0321.obcx] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1962): error waiting for event
I will be happy to provide the code in case somebody is willing to take a look at it. It is written using Fortran with some of the functions written in C. My research progress has been completely halted due to this problem and nobody at my lab has enough experience with MPI to resolve this.

What causes dask job failure with CancelledError exception

I have been seeing the error message below for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
After restarting the dask cluster it runs successfully.
If running a dask-cloudprovider ECSCluster or FargateCluster the concurrent.futures._base.CancelledError can result from a long-running step in computation where there is no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError error message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout.
The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.
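A minimal sketch of the constructor route, assuming your dask-cloudprovider version exposes FargateCluster under dask_cloudprovider.aws (the "2 hours" value is arbitrary):

# Raise the idle timeout so long, silent computations don't kill the scheduler.
# The same keyword works for ECSCluster; alternatively set the
# cloudprovider.ecs.scheduler_timeout key in the dask config.
from dask_cloudprovider.aws import FargateCluster
from distributed import Client

cluster = FargateCluster(scheduler_timeout="2 hours")
client = Client(cluster)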

JTA transaction timeout exception - weblogic 10.X

I changed the JTA transaction timeout from the admin console and set it to 300. Even after the change, it fails with "JTA transaction unexpectedly rolled back (maybe due to a timeout)" and:
weblogic.transaction.RollbackException: Transaction timed out after 181 seconds
To make sure my change (timeout value 300) was reflected for that domain, I checked the domain's config.xml, and it does show 300.
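For reference, the domain-level JTA timeout typically appears in config.xml roughly like this (a sketch; the surrounding elements are omitted):

<!-- DOMAIN_HOME/config/config.xml, domain-level JTA settings -->
<jta>
  <timeout-seconds>300</timeout-seconds>
</jta>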
My question is: is there any other place where I need to update the transaction timeout value, and do I need to restart the server?
Full stack trace from the server after the exception is below:
Caused by: org.springframework.transaction.UnexpectedRollbackException: JTA transaction unexpectedly rolled back (maybe due to a timeout); nested exception is weblogic.transaction.RollbackException: Transaction
timed out after 180 seconds
BEA1-160A800A149091F72E5E
at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1031)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:709)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:678)
at org.springframework.transaction.interceptor.TransactionAspectSupport.completeTransactionAfterThrowing(TransactionAspectSupport.java:359)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at $Proxy103.saveRegistryData(Unknown Source)
at gov.cms.pqri.arch.submission.registry.bean.RegDataAccessManager.persistRegistry(RegDataAccessManager.java:54)
... 14 more
Caused by: weblogic.transaction.RollbackException: Transaction timed out after 180 seconds
BEA1-160A800A149091F72E5E
at weblogic.transaction.internal.TransactionImpl.throwRollbackException(TransactionImpl.java:1818)
at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:333)
at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:227)
at weblogic.transaction.internal.TransactionManagerImpl.commit(TransactionManagerImpl.java:281)
at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1028)
... 22 more
After changing the Stuck Thread Max Time to 300 under Servers -> Configuration -> Tuning in the admin console, the value is picked up and it is working fine.
I have also come across this issue and resolved it. Since this is related to the JTA transaction, we need to increase the JTA timeout along with the Stuck Thread Max Time. Click JTA on the WebLogic console home page and increase the JTA timeout from the default of 30 to 300.
We met the same issue on WebLogic 12.1.2 (JTA transaction unexpectedly rolled back (maybe due to a timeout)). After investigating, we found the root cause of the problem. In my opinion it occurs when a huge dataset is processed in a transaction and an exception is thrown near the end of the process; JTA then rolls the data back as expected, but it does not give the details of the error. In our case it was mostly caused by a database integrity problem (e.g. trying to insert data into a column smaller than the data).
In summary, the best approach is to investigate the DB logs instead of increasing the Stuck Thread Max Time. Raising the thread max time can be a workaround, but it is not a proper solution for real enterprise systems.
This issue is also discussed in another Stack Overflow question and a Hibernate JIRA issue, where the suggested solution is:
This is the default behaviour of the WebLogic JTA implementation. To obtain the root exception you should set the system property weblogic.transaction.allowOverrideSetRollbackReason to true.
One solution is to add this line to /bin/setDomainEnv.cmd:
set JAVA_OPTIONS=%JAVA_OPTIONS% -Dweblogic.transaction.allowOverrideSetRollbackReason=true
I got my JTA timeouts increased by adding a jta.properties file to the config folder of my app with these lines:
com.atomikos.icatch.default_jta_timeout=600000
com.atomikos.icatch.max_timeout=600000
