I need to read in a GCS file of 750K records.
For each record I need to compare it to a corresponding record in Google Datastore. If the record from the file does not match the record in Datastore, I need to update the Datastore record and enqueue a Taskqueue task.
The part I'm stuck on is launching this taskqueue task.
The only way seems to be via Google Cloud Tasks' HTTP API (https://cloud.google.com/tasks/docs/creating-http-target-tasks), but issuing an HTTP call from within a DoFn feels inefficient.
I looked into using Pub/Sub for the task since Dataflow has an adapter for it, but Pub/Sub can only be used in streaming pipelines.
Yes, Beam doesn't seem to have a dedicated IO connector for Cloud Tasks, so issuing the requests from inside a Beam DoFn (ideally via the Cloud Tasks client library rather than hand-rolled HTTP) appears to be the only option.
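For reference, here's a rough sketch of what that could look like with the Cloud Tasks Java client inside a DoFn; the project, location, queue name, and handler URL below are placeholders, and the element type is assumed to be a JSON string:

```java
import com.google.cloud.tasks.v2.CloudTasksClient;
import com.google.cloud.tasks.v2.HttpMethod;
import com.google.cloud.tasks.v2.HttpRequest;
import com.google.cloud.tasks.v2.QueueName;
import com.google.cloud.tasks.v2.Task;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.transforms.DoFn;

import java.nio.charset.StandardCharsets;

/** Enqueues one Cloud Tasks HTTP task per mismatched record. */
public class EnqueueTaskFn extends DoFn<String, String> {

  // Placeholder values -- substitute your own project, location, queue, and handler URL.
  private static final String PROJECT = "my-project";
  private static final String LOCATION = "us-central1";
  private static final String QUEUE = "record-updates";
  private static final String HANDLER_URL = "https://example.com/task-handler";

  private transient CloudTasksClient client;

  @Setup
  public void setup() throws Exception {
    // Create the client once per DoFn instance, not once per element.
    client = CloudTasksClient.create();
  }

  @ProcessElement
  public void processElement(@Element String recordJson, OutputReceiver<String> out) {
    Task task =
        Task.newBuilder()
            .setHttpRequest(
                HttpRequest.newBuilder()
                    .setUrl(HANDLER_URL)
                    .setHttpMethod(HttpMethod.POST)
                    .putHeaders("Content-Type", "application/json")
                    .setBody(ByteString.copyFrom(recordJson, StandardCharsets.UTF_8))
                    .build())
            .build();
    Task created = client.createTask(QueueName.of(PROJECT, LOCATION, QUEUE), task);
    out.output(created.getName());
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}
```

Creating the client in @Setup and closing it in @Teardown avoids opening a new connection per element, which is most of the perceived inefficiency of calling an external API from a DoFn.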
I have been trying to build a pipeline in Google Cloud Data Fusion where the data source is a third-party API endpoint. I have been unable to use the HTTP plugin successfully, but it has been suggested that I use Pub/Sub for the data ingest.
I've been trying to follow this tutorial as a starting point, but it doesn't help me with the very first step of the process: ingesting data from the API endpoint.
Can anyone provide examples of using Pub/Sub -- or any other viable method -- to ingest data from an API endpoint and send that data down to Data Fusion for transformation and ultimately to BigQuery?
I will also need to be able to dynamically modify the URI (e.g., date filter parameters) in the GET request in this pipeline.
In order to achieve the first step in the tutorial you are following,
Ingest CSV (Comma-separated values) data to BigQuery using Cloud Data Fusion,
you need to set up a functioning Pub/Sub system. This can be done via the command line, the console, or, best in your case, one of the client libraries. If you follow this tutorial you should end up with a working Pub/Sub topic and subscription.
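If it helps, here's a rough sketch (assuming Java, the google-cloud-pubsub client, and Java 11's HttpClient) of the kind of small fetch-and-publish job that sits in front of Data Fusion: it calls the third-party endpoint with a date query parameter, which also covers your dynamic-URI requirement, and publishes the response to the topic your Data Fusion Pub/Sub source reads from. The project, topic, and endpoint values are placeholders:

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.LocalDate;

public class ApiToPubSub {

  public static void main(String[] args) throws Exception {
    // Placeholder project/topic/endpoint -- replace with your own values.
    TopicName topic = TopicName.of("my-project", "api-ingest");
    String endpoint = "https://api.example.com/records?date=" + LocalDate.now();

    // Pull the payload from the third-party API; the date query parameter can be
    // computed per run, which is how the URI gets modified dynamically.
    HttpClient http = HttpClient.newHttpClient();
    HttpResponse<String> response =
        http.send(HttpRequest.newBuilder(URI.create(endpoint)).GET().build(),
            HttpResponse.BodyHandlers.ofString());

    // Publish the raw response body; the Data Fusion pipeline's Pub/Sub source
    // subscribes to this topic and handles the transformation from there.
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message =
          PubsubMessage.newBuilder()
              .setData(ByteString.copyFromUtf8(response.body()))
              .build();
      ApiFuture<String> messageId = publisher.publish(message);
      System.out.println("Published message " + messageId.get());
    } finally {
      publisher.shutdown();
    }
  }
}
```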
At that point you should be able to continue with the original tutorial.
I have an Azure Blob Storage container with some video files. I need to trigger a Jenkins pipeline whenever a file gets added to the container. I was thinking I could have a microservice in Azure Functions monitor the container and trigger Jenkins, but it would be great if I could do this directly, without an additional microservice.
Is there a way I can get Jenkins to trigger a pipeline based on my container? A plugin, a script, or something?
PS: I found this question, but I'm looking for something different.
You could trigger a build without parameters by setting up an event subscription on your storage account to call your Jenkins build endpoint. Since your build won't have parameters, your script would have to keep track of the blobs processed (assuming they are not deleted once processed).
But if you need build parameters then you would have to transform the payload coming from the blob event before calling the Jenkins API.
Though you mentioned that you wouldn't want to include another service for this, here are some options just in case, in increasing order of complexity:
1. If you have your Jenkins API behind an API gateway, like Azure APIM, you could transform the body before forwarding the request to Jenkins.
2. Use a simple Logic App to trigger on the event and then call the Jenkins API, passing the parameters extracted from the event as required.
3. Similar to what is mentioned in the other question you linked, Azure Functions (a sketch follows below).
If you don't have APIM (or something similar), Logic Apps are a great fit for this use case, with almost no code to write.
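In case you do go with option 3, a minimal Java sketch of such a function could look like the following; the Jenkins URL, job name, credentials, build parameter name, and the Gson-based payload parsing are all assumptions you'd adapt:

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.EventGridTrigger;
import com.microsoft.azure.functions.annotation.FunctionName;

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Fires on a BlobCreated event and triggers a parameterized Jenkins build. */
public class BlobCreatedFunction {

  // Placeholder Jenkins details -- replace with your own server, job, and credentials.
  private static final String JENKINS_JOB_URL =
      "https://jenkins.example.com/job/process-video/buildWithParameters";
  private static final String JENKINS_AUTH =
      Base64.getEncoder().encodeToString("user:api-token".getBytes(StandardCharsets.UTF_8));

  @FunctionName("OnBlobCreated")
  public void run(@EventGridTrigger(name = "event") String event, ExecutionContext context)
      throws Exception {
    // Event Grid BlobCreated events carry the blob URL under data.url.
    JsonObject data = JsonParser.parseString(event).getAsJsonObject().getAsJsonObject("data");
    String blobUrl = data.get("url").getAsString();
    context.getLogger().info("New blob: " + blobUrl);

    // Pass the blob URL to Jenkins as a build parameter.
    String url =
        JENKINS_JOB_URL + "?BLOB_URL=" + URLEncoder.encode(blobUrl, StandardCharsets.UTF_8);
    HttpRequest request =
        HttpRequest.newBuilder(URI.create(url))
            .header("Authorization", "Basic " + JENKINS_AUTH)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    context.getLogger().info("Jenkins responded with " + response.statusCode());
  }
}
```

Deployed behind an Event Grid subscription filtered to Microsoft.Storage.BlobCreated events, this forwards each new blob's URL to Jenkins as a build parameter, so nothing has to keep track of which blobs were already processed.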
I'm after the composed task execution listener that publishes events to middleware, basically the same behavior as documented for custom tasks here.
Is there any way to enable this feature for composed tasks run via the SCDF REST API?
Thanks
I am trying to execute multiple Spring Cloud Task jobs within the Spring Cloud Data Flow container on PCF. These jobs read a raw file from an HTTP source, then parse it and write the results to a MySQL database. These jobs are written in plain Java, not Spring Batch.
I have bound a MySQL database to the SCDF container on PCF. I believe Spring Cloud Task will use that MySQL database to store the execution status of these jobs. I want the actual output records to go into MySQL as well.
My question is: how will the output records for each of these jobs get stored in the MySQL database? Will it use a different schema for each of these parser jobs? If not, how can I configure it to do so?
Please share your thoughts if you have encountered this scenario.
Thanks!
Nilanjan
To orchestrate Tasks in SCDF, you'd have to supply an RDBMS, and it looks like you've already done that. The task repository is primarily used to persist Task executions as a historical record, so you can drill into the entire history of executions via the GUI/Shell.
You'd configure the task repository at the server level; see this CF server's manifest.yml sample (under the services: section) for reference.
My question is: how will the output records for each of these jobs get stored in the MySQL database?
If you'd like all the tasks to use the same datastore, it can be configured via the SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_TASK_SERVICES env var. Anything supplied via this property is automatically propagated to all the Task applications.
However, it is your responsibility to make sure the right database driver is on the classpath of your Task application. In your case, you'd need one of the MySQL drivers.
Will it use a different schema for each of these parser jobs?
That is up to your business requirements. Whether it is a different schema or a different set of tables, you'd have to determine what's needed and make sure it exists and is set up before binding the Task application via SPRING_CLOUD_DEPLOYER_CLOUDFOUNDRY_TASK_SERVICES.
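To make the "different set of tables" option concrete, here's a minimal sketch (the table and class names are hypothetical) of a plain-Java Spring Cloud Task that writes its parsed output into its own table in the bound MySQL database, alongside the TASK_EXECUTION tables that Spring Cloud Task manages:

```java
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.JdbcTemplate;

@SpringBootApplication
@EnableTask
public class ParserTaskApplication {

  public static void main(String[] args) {
    SpringApplication.run(ParserTaskApplication.class, args);
  }

  @Bean
  public CommandLineRunner parseAndStore(JdbcTemplate jdbc) {
    return args -> {
      // Hypothetical table dedicated to this parser job; create it up front
      // (e.g. via a schema script) in the same bound MySQL instance that
      // Spring Cloud Task uses for its execution metadata.
      String rawLine = "example,raw,record";       // stand-in for a line fetched over HTTP
      String parsedValue = rawLine.split(",")[2];  // stand-in for the real parsing logic
      jdbc.update(
          "INSERT INTO parser_a_records (raw_line, parsed_value) VALUES (?, ?)",
          rawLine, parsedValue);
    };
  }
}
```

Each parser job can then point at its own table (or its own schema via the JDBC URL in its binding), while the execution metadata still lands in the shared task repository.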
If not, how can I configure it to do so?
If you have to use a different datasource, you can supply a different MySQL binding for the Task application that includes your requirement-specific schema/table changes. Review this section to learn how autoconfiguration kicks in on PCF.
As an alternative, you can also selectively supply a different MySQL binding to each application; here are some docs on that.
In a streaming Dataflow pipeline, how can I dynamically change the bucket or the prefix of the data I write to Cloud Storage?
For example, I would like to store data in text or Avro files on GCS, but with a prefix that includes the processing hour.
Update: The question is invalid because there simply is no sink you can use in streaming Dataflow that writes to Google Cloud Storage.
Google Cloud Dataflow currently does not allow GCS sinks in streaming mode.