While running a batch job in Google Cloud Dataflow, I am experiencing an error during a particular step that uses Dataflow's service-based Shuffle feature. The error claims that a specific file is no longer present in the temporary job location I specified for this pipeline.
Here is the most relevant piece of the full stacktrace:
An exception was raised when trying to execute the workitem 2931621256965625980 : Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 490, in __init__
metadata = self._get_object_metadata(self._get_request)
File "/usr/local/lib/python3.7/site-packages/apache_beam/utils/retry.py", line 206, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 513, in _get_object_metadata
return self._client.objects.Get(get_request)
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py", line 1100, in Get
download=download)
File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
return self.ProcessHttpResponse(method_config, http_response, request)
File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
self.__ProcessHttpResponse(method_config, http_response, request))
File "/usr/local/lib/python3.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://www.googleapis.com/storage/v1/b/<CLOUD STORAGE PATH FOR TEMPORARY JOB FILES>%2F<DATAFLOW JOB NAME>.1571774420.011973%2Ftmp-626a66561e20e8b6-00000-of-00003.avro?alt=json>: response: <{'x-guploader-uploadid': 'AEnB2UrVuWRWrrcneEjgvuGSwYR82tBqDdVa727Ylo8tVW6ucnPdeNbE2A8DXf7mDYqKKP42NdJapXZLR1UbCjvJ8n7w2SOVTMGFsrcbywKD1K9yxMWez7k', 'content-type': 'application/json; charset=UTF-8', 'date': 'Tue, 22 Oct 2019 20:43:59 GMT', 'vary': 'Origin, X-Origin', 'cache-control': 'no-cache, no-store, max-age=0, must-revalidate', 'expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'pragma': 'no-cache', 'content-length': '473', 'server': 'UploadServer', 'status': '404'}>, content <{
"error": {
"code": 404,
"message": "No such object: <CLOUD STORAGE PATH FOR TEMPORARY JOB FILES>/<DATAFLOW JOB NAME>.1571774420.011973/tmp-626a66561e20e8b6-00000-of-00003.avro",
"errors": [
{
"message": "No such object: <CLOUD STORAGE PATH FOR TEMPORARY JOB FILES>/<DATAFLOW JOB NAME>.1571774420.011973/tmp-626a66561e20e8b6-00000-of-00003.avro",
"domain": "global",
"reason": "notFound"
}
]
}
}
Any advice on how to resolve this error would be appreciated. It seems to me that Dataflow created a temporary file during the course of the shuffle, that the file was deleted shortly afterward, and that sometime later Dataflow attempts to access that file again, causing this error.
After a lot of trial and error, I never really found an answer to this problem. The root cause is Dataflow's Shuffle service: it seems that if a particular shuffle step is very expensive, these kinds of intermittent connection issues eventually cause the job to error out.
I ultimately solved this problem by re-working the data set to cut the amount of shuffling required by about half. The Shuffle service now runs reliably for me.
Cloud Dataflow Shuffle is still an experimental feature; I am hoping that this kind of instability disappears as it matures.
You will need to add stagingLocation or gcpTempLocation to your pipeline options to resolve this error.
You can check here [1] for further details.
1 - https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service
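For illustration, here is a minimal sketch of setting those options programmatically with the Beam Java SDK (the option names in the answer match the Java SDK; the Python SDK used in the question exposes the equivalent --staging_location and --temp_location flags). The bucket paths below are placeholders, not values from the question.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchWithTempLocations {
  public static void main(String[] args) {
    // Parse any command-line flags, then set the staging and temp locations explicitly.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setStagingLocation("gs://my-bucket/staging"); // placeholder path
    options.setGcpTempLocation("gs://my-bucket/temp");    // placeholder path

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run();
  }
}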
Related
We have an excessively complex stored procedure in SQL Server (which I did not write and cannot change).
I created an Azure logic app, using a "Recurrence" (not a "Sliding Window") trigger, with an "Execute stored procedure (V2)" action, and set the "frequency" to run daily at 22:00 (10pm).
But when I run it manually just to test it (Run Trigger), it runs for between 9 1/2 and 10 1/2 minutes, then fails with a "Bad Gateway" / "Error 504" / "Timeout Expired" error (see the JSON at the bottom):
How do I increase the timeout? Is there a simple setting buried in Azure that I'm missing?
I never specified any gateway, and another, much simpler stored procedure with no gateway specification executed fine.
And lastly, I am an Azure neophyte.
Thank you.
{
"error": {
"code": 504,
"source": "logic-apis-eastus2.azure-apim.net",
"clientRequestId": "{I removed the client request Id before posting this}",
"message": "BadGateway",
"innerError": {
"status": 504,
"message": "Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.\r\nclientRequestId: {I removed the client request Id before posting this}",
"error": {
"message": "Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding."
},
"source": "sql-eus2.azconn-eus2.p.azurewebsites.net"
}
}
}
By binding a DLX to the queue through Spring AMQP, I am able to publish messages to the DLX when they cannot be processed due to insufficient information or other issues with the received message.
For instance, an invoice may be received with the billable hours missing and/or no employee ID present in it.
{
"invoiceId": "INV1234",
"hourRate": 18.5,
"hoursWorked": 0,
"employeeId": "EMP9900"
}
Because this request body is small, it is easy to understand what the issue is. However, we have request bodies of considerable length and 15-20 validation points.
The producer of the message needs to know what the issue was when the message is published back to it through the DLX.
I have the following two thoughts on how to address this requirement.
Option #1: Append the error information to the original message.
{
"message": {
"invoiceId": "INV1234",
"hourRate": 18.5,
"hoursWorked": 0,
"employeeId": "EMP9900"
},
"errors": [
{
"errorCode": "001",
"errorMessage": "Invalid no. of hours."
},
{
"errorCode": "002",
"errorMessage": "Employee not found in the system."
}
]
}
Option #2: Add an additional errors object in the headers.
Out of these two options, which is the better way of handling this requirement? And is there any built-in solution available in either spring-amqp or any other library?
See the documentation. The framework implements your #2 solution.
Starting with version 1.3, a new RepublishMessageRecoverer is provided, to allow publishing of failed messages after retries are exhausted.
...
The RepublishMessageRecoverer publishes the message with additional information in message headers, such as the exception message, stack trace, original exchange, and routing key. Additional headers can be added by creating a subclass and overriding additionalHeaders().
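For illustration, here is a minimal sketch of wiring a RepublishMessageRecoverer into a stateless retry interceptor. The exchange and routing key names ("invoice.dlx", "invoice.errors") and the extra header names are assumptions, not part of the question; the overridden additionalHeaders() shows where application-specific validation details could be attached on top of the framework's standard exception headers.

import java.util.Map;

import org.springframework.amqp.core.AmqpTemplate;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.rabbit.config.RetryInterceptorBuilder;
import org.springframework.amqp.rabbit.retry.RepublishMessageRecoverer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration
public class DlxRetryConfig {

  @Bean
  public RetryOperationsInterceptor retryInterceptor(AmqpTemplate amqpTemplate) {
    // Republish failed messages to a dead-letter exchange once retries are exhausted.
    // "invoice.dlx" and "invoice.errors" are hypothetical names; use your own topology.
    RepublishMessageRecoverer recoverer =
        new RepublishMessageRecoverer(amqpTemplate, "invoice.dlx", "invoice.errors") {
          @Override
          protected Map<String, Object> additionalHeaders(Message message, Throwable cause) {
            // Hypothetical application-specific headers carrying validation details.
            return Map.of("x-error-code", "002",
                "x-error-message", String.valueOf(cause.getMessage()));
          }
        };

    return RetryInterceptorBuilder.stateless()
        .maxAttempts(3)
        .recoverer(recoverer)
        .build();
  }
}

The resulting interceptor would then be added to the listener container factory's advice chain so that it applies to the @RabbitListener consuming the invoice queue.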
I am trying to migrate an on-premises TFS instance (DevOps Server 2019 Update 1.1) to DevOps Services using the migration tool. I have run the validate command and cleaned up its warnings, but the next command (prepare) fails mysteriously. The log file simply says:
[Error #11:18:19.488]
Exception Message: Request failed (type AadGraphTimeoutException)
Exception Stack Trace: at Microsoft.VisualStudio.Services.Identity.DataImport.AadIdentityMapper.ExecuteGraphRequest[T](Func`1 request)
at Microsoft.VisualStudio.Services.Identity.DataImport.AadIdentityMapper.GetAadTenantId()
at TfsMigrator.TfsMigratorCommandValidate.PopulateDataImportPropertiesOnContext()
at TfsMigrator.TfsMigratorCommandValidate.PopulateValidationItems(DataImportValidationContext context)
at TfsMigrator.TfsMigratorCommandValidate.RunValidations(Boolean validateFiles)
at TfsMigrator.TfsMigratorCommandPrepare.RunImpl()
at TfsMigrator.TfsMigratorCommand.Run()
A colleague pointed out this troubleshooting guidance from the docs, but a) we have about 10 users involved in TFS (~50 total active in local AD), so it is hard to believe we have so many users that it would time out, and b) I ran the Get-MsolUser troubleshooting commands and successfully queried AAD via Graph.
I ran the prepare command again with Fiddler Classic connected as a proxy and discovered a failing call to the Graph API. It looked like this:
Request (simplified headers):
POST https://graph.windows.net/xxxxxxxx-xxxx-xxxx-xxxx-0664e34adcbd/$batch?api-version=1.6 HTTP/1.1
Content-Type: multipart/mixed; boundary=batch_ea471df4-db73-403d-a172-a0955ddb1575
...
--batch_ea471df4-db73-403d-a172-a0955ddb1575
GET https://graph.windows.net/xxxxxxxx-xxxx-xxxx-xxxx-0664e34adcbd/tenantDetails?api-version=1.6 HTTP/1.1
...
--batch_ea471df4-db73-403d-a172-a0955ddb1575--
Response (body):
{
"odata.error": {
"code": "Authentication_Unauthorized",
"message": {
"lang": "en",
"value": "User was not found."
},
"requestId": "58c4cabc-dd67-4ce8-9735-134a7e0df60c",
"date": "2020-09-14T20:07:49"
}
}
So my question at this point is: are there any permissions (DevOps, Azure, Graph) that are missing? Are there any workarounds available? I did tag this question with Microsoft Graph API, but I believe the failing call uses the older Azure AD Graph API.
While syncing messages via the mailFolder delta endpoint, specific mailFolders provide #odata.deltaLink values that, when requested, return 503 errors.
This affects 4 of roughly 6500 mailFolders that we are currently syncing. Each of these 4 is in a different Office 365 tenant, and includes both default (e.g. "Sent Items") and custom folders.
#odata.nextLink works correctly. The mailFolder receiving new messages also doesn't fix the issue:
If the sync process is restarted, that is, the existing #odata.deltaLink is thrown away, following the #odata.nextLink chain will correctly return all messages (including ones created after the deltaLink was provided). However, the newly provided #odata.deltaLink, while different, will also return the error.
This issue affects both the 1.0 and beta versions of this endpoint.
All other mailFolders for the affected users work correctly.
This issue has existed since the very first time we attempted to sync the affected mailFolders. The first time we saw it was 2019-08-03 08:38 PM UTC, though the issue has likely existed longer than this.
The response does not include a Retry-After header.
All of our other 6500 mailFolders have correctly functioning message deltas, and we're able to correctly sync them with our code.
The issue can be reproduced via curl against an affected message delta URL:
curl -H "Authorization: Bearer $access_token" "$delta_url"
A sample of one of the specific error codes returned, with a real request id:
Request to:
https://graph.microsoft.com/v1.0/me/mailFolders/<mailFolderId>/messages/delta?$skiptoken=<skipToken>
{
"error": {
"code": "UnknownError",
"message": "Error while processing response.",
"innerError": {
"request-id": "a4441195-f469-47c8-bea3-cdeedef2e396",
"date": "2019-08-08T21:24:20"
}
}
}
This other potentially relevant header was on the response:
x-ms-ags-diagnostic: {"ServerInfo":{"DataCenter":"West US","Slice":"SliceC","Ring":"5","ScaleUnit":"003","RoleInstance":"AGSFE_IN_7","ADSiteName":"WUS"}}
When I execute a WorkItem, I get this error:
[07/18/2019 09:24:00] Error: Non-optional output [outputFile.dwg] is missing .
[07/18/2019 09:24:00] Error: An unexpected error happened during phase Publishing of job.
In the Activity I have the following code:
"outputFile": {
"zip": false,
"ondemand": false,
"verb": "put",
"description": "output file",
"localName": "outputFile.dwg",
"required": "true"
}
And in the WorkItem:
"outputFile": {
"url": "https://developer.api.autodesk.com/oss/v2/buckets/{{ TokenKey}}/objects/outputFile.dwg",
"headers": {
"Authorization": "Bearer {{ oAuthToken }}",
"Content-type": "application/octet-stream"
},
"verb": "put"
},
What might need to change?
The error says that "outputFile.dwg" was not generated. It is a non-optional (i.e. required) output, so this is an error. I suspect there's something wrong with your script. Look higher up in the report to see if you can find something that gives you a clue.
This is Qun Lu from the Forge Design Automation / AutoCAD team. The execution of your activity (with your input arguments as inputs) has to generate the expected result file, in your case "outputFile.dwg", so it can be uploaded using your URL. That should be done by your "Rota" command, or another AutoCAD built-in command that your script in the activity specifies. It appears that either your command (or your script in general) missed the step of saving the drawing as "outputFile.dwg", or your "PluginPrueba.dll" module did not load properly, hence the "Rota" command is not found. Can you give us the full report so we can check further? You can also ping me at qun.lu#autodesk.com. Thanks!