I just need to run a Dataflow pipeline on a daily basis, but the suggested solutions, like App Engine Cron Service, which requires building a whole web app, seem a bit too much.
I was thinking about just running the pipeline from a cron job in a Compute Engine Linux VM, but maybe that's far too simple :). What's the problem with doing it that way, why isn't anybody (besides me I guess) suggesting it?
This is how I did it using Cloud Functions, PubSub, and Cloud Scheduler
(this assumes you've already created a Dataflow template and it exists in your GCS bucket somewhere)
Create a new topic in PubSub. This will be used to trigger the Cloud Function.
Create a Cloud Function that launches a Dataflow job from a template. I find it easiest to just create this from the CF Console. Make sure the service account you choose has permission to create a Dataflow job. The function's index.js looks something like:
const { google } = require('googleapis');
exports.triggerTemplate = (event, context) => {
  // In this case the PubSub message payload and attributes are not used,
  // but they can be used to pass parameters needed by the Dataflow template.
  const pubsubMessage = event.data;
  console.log(Buffer.from(pubsubMessage, 'base64').toString());
  console.log(event.attributes);
  google.auth.getApplicationDefault(function (err, authClient, projectId) {
    if (err) {
      console.error('Error occurred: ' + err.toString());
      throw new Error(err);
    }
    const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });
    dataflow.projects.templates.create({
      projectId: projectId,
      resource: {
        parameters: {},
        jobName: 'SOME-DATAFLOW-JOB-NAME',
        gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
      }
    }, function (err, response) {
      if (err) {
        console.error("Problem running dataflow template, error was: ", err);
        // Stop here so an undefined response is not logged as a success.
        return;
      }
      console.log("Dataflow template response: ", response);
    });
  });
};
The package.json looks like
{
  "name": "pubsub-trigger-template",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "37.1.0",
    "@google-cloud/pubsub": "^0.18.0"
  }
}
Go to PubSub and the topic you created, and manually publish a message. This should trigger the Cloud Function and start a Dataflow job.
Use Cloud Scheduler to publish a PubSub message on a schedule:
https://cloud.google.com/scheduler/docs/tut-pub-sub
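The comment in the function above notes that the PubSub payload and attributes could carry parameters for the template. Here is a minimal sketch of that idea in Python, assuming the message data is base64-encoded JSON and that attributes should override payload values. `build_template_parameters` is a made-up helper name, not part of any Google library.

```python
import base64
import json


def build_template_parameters(event):
    """Merge a Pub/Sub event's JSON payload and attributes into one dict.

    Assumes `event` looks like {"data": <base64 str>, "attributes": {...}}.
    """
    params = {}
    data = event.get("data")
    if data:
        decoded = base64.b64decode(data).decode("utf-8")
        payload = json.loads(decoded)
        if not isinstance(payload, dict):
            raise ValueError("payload must be a JSON object")
        params.update(payload)
    # Attributes win over payload values (an arbitrary choice for this sketch).
    params.update(event.get("attributes") or {})
    # Template parameters are passed as strings, so normalise the values.
    return {key: str(value) for key, value in params.items()}
```

The resulting dict could then be passed as the `parameters` field of the launch request instead of the empty object.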
There's absolutely nothing wrong with using a cron job to kick off your Dataflow pipelines. We do it all the time for our production systems, whether it be our Java or Python developed pipelines.
That said, we are trying to wean ourselves off cron jobs and move toward using either AWS Lambda (we run multi-cloud) or Cloud Functions. Unfortunately, Cloud Functions don't have scheduling yet. AWS Lambdas do.
There is a FAQ answer to that question:
https://cloud.google.com/dataflow/docs/resources/faq#is_there_a_built-in_scheduling_mechanism_to_execute_pipelines_at_given_time_or_interval
You can automate pipeline execution by using Google App Engine (Flexible Environment only) or Cloud Functions.
You can use Apache Airflow's Dataflow Operator, one of several Google Cloud Platform Operators in a Cloud Composer workflow.
You can use custom (cron) job processes on Compute Engine.
The Cloud Function approach is described as "Alpha", and it's still true that Cloud Functions don't have scheduling (no equivalent to AWS CloudWatch scheduled events), only Pub/Sub messages, Cloud Storage changes, and HTTP invocations.
Cloud Composer looks like a good option. It is effectively a re-badged Apache Airflow, which is itself a great orchestration tool. Definitely not "too simple" like cron :)
You can use Cloud Scheduler to schedule your job as well. See my post:
https://medium.com/@zhongchen/schedule-your-dataflow-batch-jobs-with-cloud-scheduler-8390e0e958eb
Terraform script
data "google_project" "project" {}
resource "google_cloud_scheduler_job" "scheduler" {
  name     = "scheduler-demo"
  schedule = "0 0 * * *"
  # This needs to be us-central1 even if the app engine is in us-central.
  # You will get a resource not found error if just using us-central.
  region = "us-central1"
  http_target {
    http_method = "POST"
    uri         = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
    oauth_token {
      service_account_email = google_service_account.cloud-scheduler-demo.email
    }
    # need to encode the string
    body = base64encode(<<-EOT
    {
      "jobName": "test-cloud-scheduler",
      "parameters": {
        "region": "${var.region}",
        "autoscalingAlgorithm": "THROUGHPUT_BASED"
      },
      "environment": {
        "maxWorkers": "10",
        "tempLocation": "gs://zhong-gcp/temp",
        "zone": "us-west1-a"
      }
    }
    EOT
    )
  }
}
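To try the same request outside Terraform, the URL and JSON body can be assembled by hand. The sketch below mirrors the `templates:launch` call used in the scheduler job. The function name and the example values are placeholders, and actually sending the request would additionally need an OAuth access token.

```python
import json
from urllib.parse import urlencode


def build_launch_request(project_id, region, template_path, job_name,
                         parameters=None, environment=None):
    """Return (url, body) for a Dataflow templates:launch POST request."""
    base = (f"https://dataflow.googleapis.com/v1b3/projects/{project_id}"
            f"/locations/{region}/templates:launch")
    # The template location travels as the gcsPath query parameter.
    url = f"{base}?{urlencode({'gcsPath': template_path})}"
    body = {
        "jobName": job_name,
        "parameters": parameters or {},
        "environment": environment or {},
    }
    # Serialising with json.dumps guarantees syntactically valid JSON.
    return url, json.dumps(body)
```

Building the body from a dict rather than a hand-written string makes it harder to slip in a JSON syntax error that the API would reject.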
There is an official example of MassTransit with SQS. The "bus" is configured to use SQS (x.UsingAmazonSqs). The receive endpoint is an SQS queue, which in turn is subscribed to an SNS topic. However, there is no example of how to publish to SNS.
How to publish into SNS topic?
How to configure SQS/SNS to use http, since I develop against localstack?
AWS sdk version:
var cfg = new AmazonSimpleNotificationServiceConfig { ServiceURL = "http://localhost:4566", UseHttp = true };
Update:
After Chris's reference and some experiments with the configuration, I came up with the following for the 'localstack' SQS/SNS. This configuration executes without errors, and Worker gets called and publishes a message to the bus. However, the consumer class is not triggered, and the messages don't seem to end up in the queue (or rather the topic).
public static readonly AmazonSQSConfig AmazonSQSConfig = new AmazonSQSConfig { ServiceURL = "http://localhost:4566" };
public static AmazonSimpleNotificationServiceConfig AmazonSnsConfig = new AmazonSimpleNotificationServiceConfig {ServiceURL = "http://localhost:4566"};
...
services.AddMassTransit(x =>
{
x.AddConsumer<MessageConsumer>();
x.UsingAmazonSqs((context, cfg) =>
{
cfg.Host(new Uri("amazonsqs://localhost:4566"), h =>
{
h.Config(AmazonSQSConfig);
h.Config(AmazonSnsConfig);
h.EnableScopedTopics();
});
cfg.ReceiveEndpoint(queueName: "deal_queue", e =>
{
e.Subscribe("deal-topic", s =>
{
});
});
});
});
services.AddMassTransitHostedService(waitUntilStarted: true);
services.AddHostedService<Worker>();
Update 2:
When I look at the SNS subscriptions, I see that the first one, which was created and subscribed manually through the AWS CLI, has a correct Endpoint, while the second, which was created by the MassTransit library, has an incorrect one. How do I configure the Endpoint for the SQS queue?
$ aws --endpoint-url=http://localhost:4566 sns list-subscriptions-by-topic --topic-arn "arn:aws:sns:us-east-1:000000000000:deal-topic"
{
"Subscriptions": [
{
"SubscriptionArn": "arn:aws:sns:us-east-1:000000000000:deal-topic:c804da4a-b12c-4203-83ec-78492a77b262",
"Owner": "",
"Protocol": "sqs",
"Endpoint": "http://localhost:4566/000000000000/deal_queue",
"TopicArn": "arn:aws:sns:us-east-1:000000000000:deal-topic"
},
{
"SubscriptionArn": "arn:aws:sns:us-east-1:000000000000:deal-topic:b47d8361-0717-413a-92ee-738d14043a87",
"Owner": "",
"Protocol": "sqs",
"Endpoint": "arn:aws:sqs:us-east-1:000000000000:deal_queue",
"TopicArn": "arn:aws:sns:us-east-1:000000000000:deal-topic"
}
Update 3:
I've cloned the project and run some of its unit tests for the AmazonSQS bus configuration; the consumers don't seem to work.
When I list the subscriptions after the test run, I can tell that the Endpoints are incorrect.
...
{
"SubscriptionArn": "arn:aws:sns:us-east-1:000000000000:MassTransit_TestFramework_Messages-PongMessage:e16799c2-9dd3-458d-bc28-52a16d646de3",
"Owner": "",
"Protocol": "sqs",
"Endpoint": "arn:aws:sqs:us-east-1:000000000000:input_queue",
"TopicArn": "arn:aws:sns:us-east-1:000000000000:MassTransit_TestFramework_Messages-PongMessage"
},
...
Could it be that AmazonSQS for localstack has a major bug?
It's not clear how to use the library with 'localstack' SQS, or how to point it to the actual endpoint (QueueUrl) of an SQS queue.
Whenever Publish is called in MassTransit, messages are published to SNS. Those messages are then routed to receive endpoints as configured. There is no need to understand SQS or SNS when using MassTransit with Amazon SQS/SNS.
In MassTransit, you create consumers, those consumers consume message types, and MassTransit configures topics/queues as needed. Any of the samples using RabbitMQ, Azure Service Bus, etc. are easily converted to SQS by changing UsingRabbitMq to UsingAmazonSqs (and adding the appropriate NuGet package).
It looks like your configuration is set up properly to publish, but there are at least a few reasons I can think of why you are not receiving messages:
An issue with the current version of localstack. I had to use 0.11.2 - see Localstack with MassTransit not getting messages
You are publishing to a different topic. MassTransit will create the topic using the name of the message type. This may not match the topic you configured on the receive endpoint. You can change the topic name by configuring the topology - see How can I configure the topic name when using MassTransit SQS?
Your consumer is not configured on the receive endpoint - see the example below
public static readonly AmazonSQSConfig AmazonSQSConfig = new AmazonSQSConfig { ServiceURL = "http://localhost:4566" };
public static AmazonSimpleNotificationServiceConfig AmazonSnsConfig = new AmazonSimpleNotificationServiceConfig {ServiceURL = "http://localhost:4566"};
...
services.AddMassTransit(x =>
{
x.UsingAmazonSqs((context, cfg) =>
{
cfg.Host(new Uri("amazonsqs://localhost:4566"), h =>
{
h.Config(AmazonSQSConfig);
h.Config(AmazonSnsConfig);
});
cfg.ReceiveEndpoint(queueName: "deal_queue", e =>
{
e.Subscribe("deal-topic", s => {});
e.Consumer<MessageConsumer>();
});
});
});
services.AddMassTransitHostedService(waitUntilStarted: true);
services.AddHostedService<Worker>();
From what I see in the docs about Consumers, you should be able to add your consumer to the AddMassTransit configuration as in your original sample, but it didn't work for me.
I have a scenario where I'm using CodePipeline to deploy my cdk project from a tools account to several environment accounts.
The way my pipeline is deploying is by running cdk deploy from within a CodeBuild job.
My team has decided to use SSM Parameter Store to store configuration, and we ended up with some parameters living in the environment account, for example the VPC_ID (resources/vpc/id), which I can read at deployment time with ssm.StringParameter.valueForStringParameter.
However, other parameters are living in the tools account, such as the Account Ids from my environment accounts (environment/nonprod/account/id) and other Global Config. I'm having trouble fetching those values.
At the moment, the only way I could think of was to add an earlier step that reads all those values and loads them into the context values.
Is there a more elegant approach for this problem? I was hoping I could specify in which account to get the SSM values from. Any ideas?
Thank you.
As you already stated there is no native support for that. I am also using CodePipeline in cross-account deployments, so all the automation parameters or product specified parameters are stored in a secured account and CodePipeline deploys the resources using CloudFormation as an action provider.
Cross-account resolution of SSM parameters isn't supported, so in the end I added an extra step (stage) to my CodePipeline. It is nothing more than a CodeBuild project that runs a script in a containerized environment; the script then "syncs" the parameters from the automation account to the destination account.
As part of your pipeline, I would add a preliminary step to execute a Lambda. That Lambda can then execute whatever queries you wish to obtain whatever metadata/config that is required. The output from that Lambda can then be passed in to the CodeBuild step.
e.g. within the Lambda:
import * as AWS from 'aws-sdk';
import { CodePipelineEvent, Context } from 'aws-lambda';
export class ConfigFetcher {
  codepipeline = new AWS.CodePipeline();
  async fetchConfig(event: CodePipelineEvent, context: Context): Promise<void> {
    // Retrieve the Job ID from the Lambda action
    const jobId = event['CodePipeline.job'].id;
    // now get your config by executing whatever queries you need, even cross-account, via the SDK
    // we assume that the answer is in the variable someValue
    const params = {
      jobId: jobId,
      outputVariables: {
        MY_CONFIG: someValue,
      },
    };
    // now tell CodePipeline you're done
    await this.codepipeline.putJobSuccessResult(params).promise().catch(err => {
      console.error('Error reporting build success to CodePipeline: ' + err);
      throw err;
    });
    // make sure you have some sort of catch wrapping the above to post a failure to CodePipeline
    // ...
  }
}
const configFetcher = new ConfigFetcher();
exports.handler = async function fetchConfigMetadata(event: CodePipelineEvent, context: Context): Promise<void> {
  return configFetcher.fetchConfig(event, context);
};
Assuming that you create your pipeline using CDK, then your Lambda step will be created using something like this:
const fetcherAction = new LambdaInvokeAction({
actionName: 'FetchConfigMetadata',
lambda: configFetcher,
variablesNamespace: 'ConfigMetadata',
});
Note the use of variablesNamespace: we need to refer to this later in order to retrieve the values from the Lambda's output and insert them as env variables into the CodeBuild environment.
Now our CodeBuild definition, again assuming we create using CDK:
new CodeBuildAction({
  // ...
  environmentVariables: {
    MY_CONFIG: {
      type: BuildEnvironmentVariableType.PLAINTEXT,
      value: '#{ConfigMetadata.MY_CONFIG}',
    },
  },
});
We can call the variable whatever we want within CodeBuild, but note that ConfigMetadata.MY_CONFIG needs to match the namespace and output value of the Lambda.
You can have your lambda do anything you want to retrieve whatever data it needs - it's just going to need to be given appropriate permissions to reach across into other AWS accounts if required, which you can do using role assumption. Using a Lambda as a pipeline step will be a LOT faster than using a CodeBuild step in the pipeline, plus it's easier to change: if you write your Lambda code in Typescript/JS or Python, you can even use the AWS console to do in-place edits whilst you test that it executes correctly.
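As a rough illustration of the role-assumption part, here is a sketch in Python (which the answer mentions as an option for the Lambda). It assumes boto3-style STS and SSM clients are passed in by the caller; the helper names, the session name, and the error handling are choices made for this example, not an established pattern from the answer.

```python
def assume_role_credentials(sts_client, role_arn, session_name="config-fetcher"):
    """Return temporary credentials for role_arn as boto3 client keyword arguments."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def fetch_parameters(ssm_client, names):
    """Read SSM parameters by name, batching because GetParameters takes at most 10 names."""
    values = {}
    for start in range(0, len(names), 10):
        batch = names[start:start + 10]
        response = ssm_client.get_parameters(Names=batch, WithDecryption=True)
        missing = response.get("InvalidParameters", [])
        if missing:
            raise KeyError(f"parameters not found: {missing}")
        for parameter in response["Parameters"]:
            values[parameter["Name"]] = parameter["Value"]
    return values
```

With boto3 available, the pieces would be wired together roughly as `ssm = boto3.client("ssm", **assume_role_credentials(boto3.client("sts"), role_arn))` followed by `fetch_parameters(ssm, names)`, where `role_arn` names a role in the other account that the Lambda's execution role is allowed to assume.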
AFAIK there is no native way to achieve what you described. If there is a way, I'd like to know too. I believe you can use a CloudFormation custom resource backed by Lambda for this purpose.
You can pass parameters to the lambda request and get information back from the lambda response.
See https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources-lambda.html, https://www.2ndwatch.com/blog/a-step-by-step-guide-on-using-aws-lambda-backed-custom-resources-with-amazon-cfts/ and https://docs.aws.amazon.com/cdk/api/latest/docs/custom-resources-readme.html for more information.
This question is a year old, but a simpler method I found for retrieving parameters from your tools/deployment account is to specify them as env variables in your buildspec file. CodeBuild will always pull these from whatever account your job is running in (which in this question's scenario would be the tools account).
To pull parameters from your target environment accounts, it's best to use the CDK SSM approach suggested by the question author.
I created a Cloud Run client; however, I couldn't find a way to list a service that is deployed with Cloud Run on GKE (for Anthos).
Create the client:
try {
    HttpTransport httpTransport = new NetHttpTransport();
    JsonFactory jsonFactory = new JacksonFactory();
    // createScoped returns a new credentials object, so keep its result
    GoogleCredentials credential = GoogleCredentials.getApplicationDefault()
            .createScoped("https://www.googleapis.com/auth/cloud-platform");
    HttpRequestInitializer requestInitializer = new HttpCredentialsAdapter(credential);
    CloudRun.Builder builder = new CloudRun.Builder(httpTransport, jsonFactory, requestInitializer);
    return builder.setApplicationName(applicationName)
            .setRootUrl(cloudRunRootUrl)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}
try to list services:
services = cloudRun.namespaces().services()
.list("namespaces/default")
.execute()
.getItems();
My "hello" service is deployed on a GKE cluster under the namespace default. The above code doesn't work because the client always sees "default" as the project_id and complains about permissions. If I put the project_id rather than "default", the permission errors are gone, but no services are found.
I tried another project that does have fully managed Cloud Run services, and the same code returns results (with .list("namespaces/")).
How to access the service on GKE?
And my next question would be, how to programmatically create Cloud Run services on GKE?
Edit - for creating a service
As I couldn't figure out how to interact with Cloud Run on GKE, I took a step back to try the fully managed one. The following code to create a service fails, and the error message doesn't provide much useful insight. How can I make it work?
Service deployedService = null;
// Map<String,String> annotations = new HashMap<>();
// annotations.put("client.knative.dev/user-image","gcr.io/cloudrun/hello");
ServiceSpec spec = new ServiceSpec();
List<Container> containers = new ArrayList<>();
containers.add(new Container().setImage("gcr.io/cloudrun/hello"));
spec.setTemplate(new RevisionTemplate().setMetadata(new ObjectMeta().setName("hello-fully-managed-v0.1.0"))
.setSpec(new RevisionSpec().setContainerConcurrency(20)
.setContainers(containers)
.setTimeoutSeconds(100)
)
);
helloService.setApiVersion("serving.knative.dev/v1")
.setMetadata(new ObjectMeta().setName("hello-fully-managed")
.setNamespace("data-infrastructure-test-env")
// .setAnnotations(annotations)
)
.setSpec(spec)
.setKind("Service");
try {
deployedService = cloudRun.namespaces().services()
.create("namespaces/data-infrastructure-test-env",helloService)
.execute();
} catch (IOException e) {
e.printStackTrace();
response.add(e.toString());
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(response);
}
Error message I got:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"message" : "The request has errors",
"reason" : "badRequest"
} ],
"message" : "The request has errors",
"status" : "INVALID_ARGUMENT"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
And the base_url is: https://europe-west1-run.googleapis.com
Your question is quite detailed (and is about Java which I am no expert in) and there are actually too many questions in there (ideally, please ask only 1 question here). However, I'll try to answer a few things you asked:
First, Cloud Run (managed, and on GKE) both implement the Knative Serving API. I've explained this at https://ahmet.im/blog/cloud-run-is-a-knative/ In fact, Cloud Run on GKE is just the open source Knative components installed to your cluster.
And my next question would be, how to programmatically create Cloud Run services on GKE?
You will have a very hard time (if possible at all) using the Cloud Run API client libraries (e.g. new CloudRun above) because these are designed for *.googleapis.com endpoints.
The Knative API part of "Cloud Run on GKE" is actually just your Kubernetes (GKE) master API endpoint (which runs on an IP address, with a TLS certificate that isn't trusted by root CAs, but you can find the CA cert in the GKE GetCluster API call to verify the cert). The TLS part is why it's so hard to use the API Client libraries.
Knative APIs are just Kubernetes objects. So your best bet is one of these:
See whether the Kubernetes Java client (https://github.com/kubernetes-client/java) allows dynamic objects (the Go implementation does), and try to use that to create Knative CRDs.
Use kubectl apply.
Ask Knative Serving open source repository for help (they should be providing client libraries, maybe they're already there I'm not sure)
To program Cloud Run (managed) with the API Client Libraries, you need to explicitly override the API endpoint to the region e.g. us-central1-run.googleapis.com. (This is documented on each API call's REST API reference documentation.)
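As a small illustration of the regional override, the sketch below builds the URL for listing services in one region. The host pattern follows the example above; the `/apis/serving.knative.dev/v1/namespaces/...` path is my reading of the REST reference and should be checked there, and the function name is made up for this example.

```python
def cloud_run_services_url(region, project_id):
    """Build the regional list-services URL for Cloud Run (fully managed)."""
    if not region or not project_id:
        raise ValueError("region and project_id are required")
    # Each region has its own API host, e.g. us-central1-run.googleapis.com.
    root = f"https://{region}-run.googleapis.com"
    # For the managed service, the namespace is the project ID.
    return f"{root}/apis/serving.knative.dev/v1/namespaces/{project_id}/services"
```

With a client library, the equivalent step is setting the root URL to the regional host before building the client.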
I have written a blog post in detail (with sample code in Go) on how to create/update services on Cloud Run (managed) using the Knative Serving API here: https://ahmet.im/blog/gcloud-run-deploy/
If you want to see how gcloud run deploy works, and which APIs it calls, you can pass --log-http option to observe the request/response traffic.
As for the error you got, it seems like the error message isn't helpful, but it might be coming from anywhere (as you're trying to imitate Knative API in GCP client libraries). I recommend reading my blog posts and sample code in depth.
UPDATES: Our engineering team's looking at the issue, it appears that there's currently a bug not adding the "details" field to the error. That's being worked on.
In your case, we see the following errors from requests:
field: "spec.template.spec"
description: "Missing template spec."
This means you are not properly filling in the spec field as I have shown in my blog post and sample code.
field: "metadata.name"
description: "The revision name must be prefixed by the name of the enclosing Service or Configuration with a trailing -"
Make sure the name you are specifying adheres to the patterns specified in the API docs. Perhaps try to create that name manually in the UI or the gcloud CLI.
field: "api_version"
description: "Unsupported API version \'serving.knative.dev/v1\'. Expected \'serving.knative.dev/v1alpha1\'"
Do not use v1alpha1 API, use v1 directly.
We'll try to get the details to the error message, however it appears that you need to study the sample code I linked in my blog post more in detail:
https://github.com/GoogleCloudPlatform/cloud-run-button/blob/a52c7fbaae33a3e06c112206c7227a0ef9649647/cmd/cloudshell_open/deploy.go#L26-L112
The Java SDK is automatically generated from the public Cloud Run (fully managed) API. It does not support Cloud Run for Anthos.
(gcloud.run.deploy) The revision name must be prefixed by the name of the enclosing Service or Configuration with a trailing -revision name
The revision name should be at most 65 characters; keeping it within that limit resolves the problem in an automation pipeline. The revision name is the combination of the service name and the revision suffix and is created automatically by GCP, so keep the revision suffix short.
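A pipeline could check both rules before calling the deploy command. This is only a sketch: the helper names are invented, and the default length limit simply repeats the figure quoted above, so confirm the real limit in the Cloud Run documentation.

```python
def build_revision_name(service_name, suffix, max_length=65):
    """Compose "<service>-<suffix>" and check it against a length limit.

    max_length is an assumption taken from the text above, not a verified value.
    """
    name = f"{service_name}-{suffix}"
    if len(name) > max_length:
        raise ValueError(
            f"revision name is {len(name)} characters; limit is {max_length}")
    return name


def has_service_prefix(service_name, revision_name):
    """True when the revision name starts with "<service>-" plus a non-empty suffix."""
    prefix = f"{service_name}-"
    return revision_name.startswith(prefix) and len(revision_name) > len(prefix)
```

For example, `has_service_prefix("hello", "v2")` is false, which corresponds to the prefix error quoted above.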
I have an issue regarding Google Dataflow.
I'm writing a Dataflow pipeline which reads data from PubSub and writes to BigQuery, and it works.
Now I have to handle late data. I was following some examples on the internet, but it's not working properly. Here is my code:
pipeline.apply(PubsubIO.readStrings()
.withTimestampAttribute("timestamp").fromSubscription(Constants.SUBSCRIBER))
.apply(ParDo.of(new ParseEventFn()))
.apply(Window.<Entity> into(FixedWindows.of(WINDOW_SIZE))
// processing of late data.
.triggering(
AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(DELAY_SIZE))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(ALLOW_LATE_SIZE)
.accumulatingFiredPanes())
.apply(ParDo.of(new ParseTableRow()))
.apply("Write to BQ", BigQueryIO.<TableRow>write()...
Here is my pubsub message:
{
...,
"timestamp" : "2015-08-31T09:52:25.005Z"
}
When I manually push some messages (go to the PubSub topic and publish) whose timestamp is much older than ALLOW_LATE_SIZE, these messages are still passed through.
You should specify the allowed lateness formally using the "Duration" object as: .withAllowedLateness(Duration.standardMinutes(ALLOW_LATE_SIZE)), assuming you have set the value of ALLOW_LATE_SIZE in minutes.
You may check the documentation page for "Google Cloud Dataflow SDK for Java", specifically the "Triggers" sub-chapter.
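To make the expected behaviour concrete: in the Beam model, an element is dropped as late only once the watermark has passed the end of its window plus the allowed lateness. The following is a plain-Python illustration of that rule, not Beam code. The function names are invented for this sketch, and all times are integer milliseconds.

```python
def window_end(event_time, window_size):
    """End (exclusive) of the fixed window that contains event_time."""
    return event_time - (event_time % window_size) + window_size


def is_dropped(event_time, watermark, window_size, allowed_lateness):
    """True if the element arrives after its window has expired.

    The window expires once the watermark reaches window end + allowed lateness.
    """
    return watermark >= window_end(event_time, window_size) + allowed_lateness
```

Note that the comparison is against the watermark, not the wall clock, so an old timestamp alone does not guarantee that a message is dropped.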
We create service workers by
navigator.serviceWorker.register('sw.js', { scope: '/' });
We can create new Workers without an external file like this,
var worker = function() { console.log('worker called'); };
var blob = new Blob( [ '(' , worker.toString() , ')()' ], {
type: 'application/javascript'
});
var bloburl = URL.createObjectURL( blob );
var w = new Worker(bloburl);
With the approach of using blob to create ServiceWorkers, we will get a Security Error as the bloburl would be blob:chrome-extension..., and the origin won't be supported by Service Workers.
Is it possible to create a service worker without external file and use the scope as / ?
I would strongly recommend not trying to find a way around the requirement that the service worker implementation code live in a standalone file. There's a very important part of the service worker lifecycle, updates, that relies on your browser being able to fetch your registered service worker JavaScript resource periodically and do a byte-for-byte comparison to see if anything has changed.
If something has changed in your service worker code, then the new code will be considered the installing service worker, and the old service worker code will eventually be considered the redundant service worker as soon as all pages that have the old code registered are unloaded/closed.
While a bit difficult to wrap your head around at first, understanding and making use of the different service worker lifecycle states/events are important if you're concerned about cache management. If it weren't for this update logic, once you registered a service worker for a given scope once, it would never give up control, and you'd be stuck if you had a bug in your code/needed to add new functionality.
One hacky way is to have the same JavaScript file detect its context and act both as the ServiceWorker and as the script that registers it.
HTML
<script src="main.js"></script>
main.js
if (!this.document) {
  self.addEventListener('install', function(e) {
    console.log('service worker installation');
  });
} else {
  navigator.serviceWorker.register('main.js');
}
To avoid maintaining main.js as one big file, we could use:
if (!this.document) {
  // service worker js
  importScripts('sw.js');
} else {
  // load document.js by injecting a script tag
}
But it may turn out that using a separate sw.js file for the service worker is the better solution. The approach above is helpful if you want a single entry point to the scripts.