Google Cloud Dataflow jobs failing, inaccessible jars & 410 gone errors - google-cloud-dataflow

A number of my Google Cloud Dataflow jobs failed yesterday, reporting internal errors that I have not seen before.
Here are two examples:
Job ID 2016-01-31_12_14_47-10166346951693629111 failed with the following error:
Jan 31, 2016, 10:15:25 PM
(bc20d8395f1f7459): Staged package jetty-servlet-9.2.10.v20150310-3EcW9gR7xsTM1TnqPH__rQ.jar at location 'gs://XXXXXXXXX/jetty-servlet-9.2.10.v20150310-3EcW9gR7xsTM1TnqPH__rQ.jar' is inaccessible.
and job ID 2016-01-31_12_22_11-15290010907236071290 failed with this error:
Jan 31, 2016, 11:13:58 PM
(56214ba1d51ca7d6): java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone { "code" : 500, "errors" : [ { "domain" : "global", "message" : "Backend Error", "reason" : "backendError" } ], "message" : "Backend Error" }
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289)
at com.google.cloud.dataflow.sdk.runners.worker.TextSink$TextFileWriter.close(TextSink.java:243)
at com.google.cloud.dataflow.sdk.util.common.worker.WriteOperation.finish(WriteOperation.java:100)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:254)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:191)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:144)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:180)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:161)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:148)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone { "code" : 500, "errors" : [ { "domain" : "global", "message" : "Backend Error", "reason" : "backendError" } ], "message" : "Backend Error" }
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357) ... 4 more
Was there any maintenance or other work occurring on the Dataflow service that might have caused these errors?
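For reference, the staged package named in the first error is one of the jars the SDK uploads to the job's staging location in Google Cloud Storage before workers start, and both error messages reference GCS operations. A minimal sketch of pinning the staging location explicitly, so that a failed job can simply be resubmitted, is shown below; the project ID and bucket name are placeholders, and it assumes the 1.x com.google.cloud.dataflow SDK that these stack traces come from.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class ResubmitJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setProject("my-project-id");                          // placeholder project
    // The SDK stages the classpath jars under this GCS prefix when the job is
    // submitted, so rerunning the job gives it a fresh chance to read them.
    options.setStagingLocation("gs://my-staging-bucket/staging"); // placeholder bucket
    options.setRunner(DataflowPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... build the same pipeline as before ...
    p.run();
  }
}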

Related

IOException from Serilog with SEQ? How to troubleshoot and solve?

I have a simple Windows Service with the following features:
.NET 6.0
RabbitMQ
Serilog
Serilog Seq, email, console, MSSql, Configuration...
Dependency Injection (Microsoft)
EntityFramework
HostBuilder
The service starts up fine: MQ queues are created and log data is coming in to both the console and Seq. A couple of seconds after startup (while the application is waiting on MQ messages), an IOException is thrown, which is posted to both the console and Seq. This is how the console looks:
[17:31:25.720 [Information] Myapp.Cloud.SharedObjectService.Program "Myapp.Cloud.SharedObjectService" microservice starting up.
[17:31:25.790 [Information] Myapp.Cloud.SharedObjectService.Program Database: Data Source=Uranus;Initial Catalog=MyappCloudDev;User ID=sa;Password=jaffa
[17:31:26.689 [Information] "Myapp.Cloud.SharedObjectService" all built.
[17:31:26.982 [Information] Myapp.Cloud.MQ.MQCloudProducer MQCloudProducer.Producer is connecting to MQ service "Myapp.Cloud.MQ" at "localhost"
[17:31:26.982 [Information] Myapp.Cloud.MQ.MQCloudConsumer MQCloudConsumer.Connecting to MQ service "Myapp.Cloud.MQ" at "localhost" with user """" and DestinationCode "SharedObjectService"
[17:31:27.143 [Information] Myapp.Cloud.MQ.MQCloudProducer Producer connected to MQ service "Myapp.Cloud.MQ".
[17:31:27.144 [Information] Myapp.Cloud.MQ.MQCloudConsumer MQCloudConsumer.Connecting to MQ service "Myapp.Cloud.MQ" at "localhost" with user """" and DestinationCode "SharedObjectService"
[17:31:27.144 [Information] Myapp.Cloud.SharedObjectService.BusinessLogicLayer.SharedObjectService "SharedObject" microservice started.
[17:31:36.504 [Information] Myapp.Cloud.MQ.MQCloudConsumer MQCloudConsumer.Connected to MQ service queue "Myapp.Cloud.MQ":"SendSharedObject" at "localhost".
[17:31:36.504 [Information] Myapp.Cloud.MQ.MQCloudConsumer MQCloudConsumer.Connected to MQ service queue "Myapp.Cloud.MQ":"OutputResponse" at "localhost".
FirstChanceException : System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ReceiveAsync(Socket socket, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.ReceiveAsync(Memory`1 buffer, SocketFlags socketFlags, Boolean fromNetworkStream, CancellationToken cancellationToken)
at System.Net.Sockets.NetworkStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
at System.Net.Http.HttpConnection.<CheckUsabilityOnScavenge>g__ReadAheadWithZeroByteReadAsync|44_0()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at System.Net.Http.HttpConnection.<CheckUsabilityOnScavenge>g__ReadAheadWithZeroByteReadAsync|44_0()
at System.Net.Http.HttpConnection.CheckUsabilityOnScavenge()
at System.Net.Http.HttpConnectionPool.<CleanCacheAndDisposeIfUnused>g__IsUsableConnection|115_2(HttpConnectionBase connection, Int64 nowTicks, TimeSpan pooledConnectionLifetime, TimeSpan pooledConnectionIdleTimeout)
at System.Net.Http.HttpConnectionPool.<CleanCacheAndDisposeIfUnused>g__ScavengeConnectionList|115_1[T](List`1 list, List`1& toDispose, Int64 nowTicks, TimeSpan pooledConnectionLifetime, TimeSpan pooledConnectionIdleTimeout)
at System.Net.Http.HttpConnectionPool.CleanCacheAndDisposeIfUnused()
at System.Net.Http.HttpConnectionPoolManager.RemoveStalePools()
at System.Net.Http.HttpConnectionPoolManager.<>c.<.ctor>b__11_0(Object s)
at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
at System.Threading.TimerQueue.FireNextTimers()
at System.Threading.TimerQueue.AppDomainTimerCallback(Int32 id)
at System.Threading.UnmanagedThreadPoolWorkItem.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()
--- End of stack trace from previous location ---
--- End of inner exception stack trace ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
FirstChanceException : System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
--- End of inner exception stack trace ---
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
It is only the AppDomain.CurrentDomain.FirstChanceException handler that catches this exception. Breaking on all errors or stepping does not show the exception or where it occurs. The exception only has a system stack trace, nothing related to the project code.
If I remove the Seq part of the appsettings file, the problem is gone.
The Serilog part of appsettings.json looks like this:
"serilog": {
  "Using": [ "Serilog.Sinks.MSSqlServer", "Serilog.Sinks.Email" ],
  "MinimumLevel": {
    "Default": "Information",
    "Override": {
      "Microsoft": "Warning",
      "System": "Warning"
    }
  },
  "Enrich": [ "FromLogContext", "WithMachineName", "WithProcessId" ],
  "WriteTo": [
    {
      "Name": "Console",
      "Args": {
        "outputTemplate": "[{Timestamp:HH:mm:ss.fff} [{Level}] {SourceContext} {Message}{NewLine}{Exception}",
        "theme": "Serilog.Sinks.SystemConsole.Themes.AnsiConsoleTheme::Code, Serilog.Sinks.Console"
      }
    },
    {
      "Name": "File",
      "Args": {
        "path": "/Logs/log.txt",
        "outputTemplate": "{Timestamp:G} {SourceContext} [{Level}] {Message}{NewLine:1}{Exception:1}",
        "formatter": "Serilog.Formatting.Json.JsonFormatter, Serilog",
        "fileSizeLimitBytes": 1000000,
        "rollOnFileSizeLimit": "true",
        "shared": "true",
        "flushToDiskInterval": 3
      }
    },
    {
      "Name": "Seq",
      "Args": {
        "serverUrl": "http://localhost:8081/"
      }
    },
    {
      "Name": "MSSqlServer",
      "Args": {
        "connectionString": "Data Source=x;Initial Catalog=MyAppClouddev;User ID=x;Password=x",
        "tableName": "Logs",
        "restrictedToMinimumLevel": "Fatal"
      }
    }
  ]
}
Serilog is called like this (structured logging):
_logger.LogInformation("OutputResponse received {#message}, {#messageGuid}, {#messageChainGuid}", args.Data, args.Data.MessageGuid, args.Data.MessageChainGuid);
Sometimes it might not just be string values that are logged, but whole objects.
How do you troubleshoot this?
EDIT: When turning off "just my code" I can see that the exception is thrown from within a Microsoft library, and the URL (with port) is the same as the Seq sink's serverUrl, which is why I think the problem is due to this sink. If I remove the Seq part of the configuration I do not get any exceptions, and the regular logging to the console works just fine.

Microsoft Graph Upload API: Frequent 504 Gateway Timeout Error

Problem:
I am trying to upload images to OneDrive and frequently get the error
504 Gateway Timeout (Unknown error).
API used:
PUT
https://graph.microsoft.com/v1.0/users/{userId}/drive/items/{rootFolderId}/{folderPath}/{fileName}:/content
Response:
504 Gateway Timeout
{
  "error": {
    "code": "UnknownError",
    "message": "",
    "innerError": {
      "request-id": "9709847a-36d4-42f2-90dd-4c37094caead",
      "date": "2018-05-16T12:18:37"
    }
  }
}
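A 504 on the simple PUT .../content upload is generally a transient, service-side timeout, so the usual mitigation is to retry with exponential backoff (and to use an upload session for larger files). Below is a minimal retry sketch using the JDK 11 HTTP client; the URL segments, access token, and file path are placeholders, not values from the question.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class GraphUploadWithRetry {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers; substitute real values.
    String url = "https://graph.microsoft.com/v1.0/users/USER_ID/drive/items/"
        + "ROOT_FOLDER_ID/folderPath/image.jpg:/content";
    String token = "ACCESS_TOKEN";
    Path file = Path.of("image.jpg");

    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Bearer " + token)
        .header("Content-Type", "application/octet-stream")
        .PUT(HttpRequest.BodyPublishers.ofFile(file))
        .build();

    // Retry 5xx responses (502/503/504) with exponential backoff.
    for (int attempt = 1; attempt <= 5; attempt++) {
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() < 500) {   // success or a non-retryable client error
        System.out.println(response.statusCode() + " " + response.body());
        return;
      }
      Thread.sleep((long) Math.pow(2, attempt) * 1000);  // 2s, 4s, 8s, ...
    }
    System.err.println("Upload still failing after retries");
  }
}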

Google dataflow workflow error

I have a batch job that reads from and writes to a storage bucket within the same project. I'm seeing this exception when it tries to write the output. Any idea?
(c1a5d1aff2d8459b): java.lang.RuntimeException: com.google.cloud.dataflow.sdk.util.UserCodeException: java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "No object name",
"reason" : "required"
} ],
"message" : "No object name"
}
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:160)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase$DoFnContext.outputWindowedValue(DoFnRunnerBase.java:288)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase$DoFnContext.outputWindowedValue(DoFnRunnerBase.java:284)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase$DoFnProcessContext$1.outputWindowedValue(DoFnRunnerBase.java:508)
at com.google.cloud.dataflow.sdk.util.GroupAlsoByWindowsViaIteratorsDoFn.processElement(GroupAlsoByWindowsViaIteratorsDoFn.java:123)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerLoggingParDoFn.processElement(DataflowWorkerLoggingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:55)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:221)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:284)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:220)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:170)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:172)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:159)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This is caused by specifying an incorrect path to TextIO.Write (missing the GCS bucket - an example correct path is gs://some-bucket/some-output-prefix whereas in this job it was specified as simply gs://some-output-prefix).
This should have been caught at pipeline construction time, before starting the workers. This is a bug in Apache Beam and Dataflow SDK's validation of GCS paths. I'm working on a fix at http://github.com/apache/beam/pull/2602, follow that PR for updates. – jkff 10 mins ago
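To make the fix concrete, here is a minimal sketch of the read and write steps in the 1.x SDK style that matches the stack trace above; the bucket and prefix are the example values from the comment, not paths from the actual job.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class WritePathExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://some-bucket/input*.txt"));

    // Correct: the output location names a bucket plus an object prefix.
    lines.apply(TextIO.Write.to("gs://some-bucket/some-output-prefix"));

    // Incorrect: "some-output-prefix" is parsed as the bucket name, leaving no
    // object name, which surfaces at write time as the 400 "No object name" error:
    // lines.apply(TextIO.Write.to("gs://some-output-prefix"));

    p.run();
  }
}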

What is causing CouchDB changes_reader_died

I'm trying to configure bidirectional filtered replication between two databases. Each database has a document in the _replicator database that sets up the replication towards the other database. Each database has the same _design document with the filter, and they use the same filter parameters.
Although I have been looking around the web, I didn't find the cause of my problem. I hope you guys can help me.
Log output :
[Fri, 26 Aug 2016 19:36:31 GMT] [error] [<0.22247.80>] ** Generic server <0.22247.80> terminating
** Last message in was {'EXIT',<0.22219.80>,changes_reader_died}
** When Server state == {state,<0.22219.80>,<0.22249.80>,20,
{httpdb,
"REPLACEDFORSECURITYREASONS",
nil,
[{"Accept","application/json"},
{"User-Agent","CouchDB/1.6.1"}],
30000,
[{socket_options,
[{keepalive,true},{nodelay,false}]}],
10,250,<0.22065.80>,20},
{httpdb,
"http:REPLACEDFORSECURITYREASONS",
nil,
[{"Accept","application/json"},
{"User-Agent","CouchDB/1.6.1"}],
30000,
[{socket_options,
[{keepalive,true},{nodelay,false}]}],
10,250,<0.22223.80>,20},
[],nil,nil,nil,
{rep_stats,0,0,0,0,0},
nil,nil,
{batch,[],0}}
** Reason for termination ==
** changes_reader_died
[Fri, 26 Aug 2016 19:36:31 GMT] [error] [<0.22243.80>] {error_report,<0.34.0>,
{<0.22243.80>,crash_report,
[[{initial_call,
{couch_replicator_worker,init,['Argument__1']}},
{pid,<0.22243.80>},
{registered_name,[]},
{error_info,
{exit,changes_reader_died,
[{gen_server,terminate,6,
[{file,"gen_server.erl"},{line,744}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}},
{ancestors,
[<0.22219.80>,couch_replicator_job_sup,
couch_primary_services,couch_server_sup,
<0.35.0>]},
{messages,[]},
{links,[<0.22244.80>]},
{dictionary,
[{last_stats_report,{1472,240191,741233}}]},
{trap_exit,true},
{status,running},
{heap_size,233},
{stack_size,27},
{reductions,158}],
[{neighbour,
[{pid,<0.22244.80>},
{registered_name,[]},
{initial_call,{erlang,apply,2}},
{current_function,
{couch_replicator_worker,queue_fetch_loop,5}},
{ancestors,[]},
{messages,[]},
{links,[<0.22243.80>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,610},
{stack_size,10},
{reductions,4}]}]]}}
[Fri, 26 Aug 2016 19:36:31 GMT] [error] [<0.22247.80>] {error_report,<0.34.0>,
{<0.22247.80>,crash_report,
[[{initial_call,
{couch_replicator_worker,init,['Argument__1']}},
{pid,<0.22247.80>},
{registered_name,[]},
{error_info,
{exit,changes_reader_died,
[{gen_server,terminate,6,
[{file,"gen_server.erl"},{line,744}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}},
{ancestors,
[<0.22219.80>,couch_replicator_job_sup,
couch_primary_services,couch_server_sup,
<0.35.0>]},
{messages,[]},
{links,[<0.22249.80>]},
{dictionary,
[{last_stats_report,{1472,240191,741235}}]},
{trap_exit,true},
{status,running},
{heap_size,233},
{stack_size,27},
{reductions,162}],
[{neighbour,
[{pid,<0.22249.80>},
{registered_name,[]},
{initial_call,{erlang,apply,2}},
{current_function,
{couch_replicator_worker,queue_fetch_loop,5}},
{ancestors,[]},
{messages,[]},
{links,[<0.22247.80>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,610},
{stack_size,10},
{reductions,4}]}]]}}
Best regards.
While doing a deeper analysis of the log, I found that there was an error logged just before the changes_reader_died.
The error was pretty explicit:
Fri, 26 Aug 2016 20:01:06 GMT] [info] [<0.2238.0>] Replication `"c48f6c26aa44689de43ee5ffaa18c7ad+continuous"` is using:
4 worker processes
a worker batch size of 500
20 HTTP connections
a connection timeout of 30000 milliseconds
10 retries per request
socket options are: [{keepalive,true},{nodelay,false}]
[Fri, 26 Aug 2016 20:01:06 GMT] [info] [<0.2204.0>] XXX.XXX.X.XX- - GET /akpaper/_changes?filter=global%2FbySite&IDSITE=MILLS2&feed=continuous&style=all_docs&since=0&heartbeat=10000 200
[Fri, 26 Aug 2016 20:01:06 GMT] [error] [<0.2204.0>] OS Process Error <0.204.0> :: {<<"compilation_error">>,
<<"Expression does not eval to a function. (ffunction(doc, req) { return doc._deleted || doc.IDSITE == req.query.IDSITE;})">>}
[Fri, 26 Aug 2016 20:01:06 GMT] [info] [<0.2204.0>] XXX.XXX.X.XX - - GET /akpaper/_changes?filter=global%2FbySite&IDSITE=MILLS2&feed=continuous&style=all_docs&since=0&heartbeat=10000 500
[Fri, 26 Aug 2016 20:01:06 GMT] [error] [<0.2204.0>] httpd 500 error response:
{"error":"compilation_error","reason":"Expression does not eval to a function. (ffunction(doc, req) { return doc._deleted || doc.IDSITE == req.query.IDSITE;})"}
All I did to correct the problem was to fix my filter function, which could not be evaluated.
For me, the "changes" parser was failing, hence the message changes_reader_died.
Looking at the replication doc, it seems that setting the Content-Type to JSON somehow makes it fail to parse the changes. Removing that line fixed it for me:
Bad:
{
  "_id": "repl/...",
  "_rev": "8-76d8a4dd87f5911b5dccb2527b2304",
  "source": {
    "url": "
    "headers": {
      "Authorization": "Basic FaBNwB...",
      "Content-Type": "application/json"
    }
  "target": ...
  ...
}
Good:
{
  "_id": "repl/...",
  "_rev": "8-76d8a4dd87f5911b5dccb2527b2304",
  "source": {
    "url": "
    "headers": {
      "Authorization": "Basic FaBNwB..."
    }
  "target": ...
  ...
}

Dataflow errors - "500 Internal Server Error" & "503 Service Unavailable"

We've started getting 500 and 503 errors in our pipelines when running them this morning. It looks like the client cannot get the job status once again.
46142 [main] WARN com.google.cloud.dataflow.sdk.runners.DataflowPipelineJob - There were problems getting current job status:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 500 Internal Server Error
{
"code" : 500,
"errors" : [ {
"domain" : "global",
"message" : "Internal error encountered.",
"reason" : "backendError"
} ],
"message" : "Internal error encountered.",
"status" : "INTERNAL"
}
1399601 [main] WARN com.google.cloud.dataflow.sdk.runners.DataflowPipelineJob - There were problems getting current job status:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
{
"code" : 503,
"errors" : [ {
"domain" : "global",
"message" : "The service is currently unavailable.",
"reason" : "backendError"
} ],
"message" : "The service is currently unavailable.",
"status" : "UNAVAILABLE"
}
What's the problem?
Job id: 2015-05-19_17_41_46-7486669477281046678
This was a client-side-only issue caused by a transient error, and it did not affect job submission. It should not be occurring anymore.
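Since the 500/503 responses above come from the client polling for the current job status rather than from the job itself, one generic mitigation while a transient backend error clears is to back off between polls. The sketch below is illustrative only; fetchJobStatus() is a hypothetical stand-in for whatever status call the client makes, not a Dataflow SDK method.

import java.io.IOException;

public class StatusPoller {
  // Hypothetical status call; stands in for the client's "get current job status"
  // request that was returning 500/503 above.
  interface StatusCall {
    String fetchJobStatus() throws IOException;
  }

  static String pollWithBackoff(StatusCall call) throws InterruptedException {
    long delayMillis = 1_000;
    for (int attempt = 1; attempt <= 8; attempt++) {
      try {
        return call.fetchJobStatus();
      } catch (IOException transientError) {
        // 500 "Internal error" / 503 "Service unavailable" are worth retrying;
        // wait progressively longer before the next poll, capped at one minute.
        System.err.println("Attempt " + attempt + " failed: " + transientError.getMessage());
        Thread.sleep(delayMillis);
        delayMillis = Math.min(delayMillis * 2, 60_000);
      }
    }
    return "UNKNOWN";  // give up after the retry budget is exhausted
  }
}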
