My CDK stack contains too many parameters (subnet IDs, API URLs) to specify on the command line, so I'd like to keep them in separate files like dev.properties or prod.json. Context values from cdk.json could be the way to go, but I don't see how to keep multiple parallel versions. Is there a way to apply parameters from a file, e.g. cdk deploy --parameters file:///dev.json?
You can use the cdk.json file for this. For example, my cdk.json looks like this:
{
  "app": "python3 app.py",
  "profile": "my-aws-profile",
  "context": {
    "@aws-cdk/core:enableStackNameDuplicates": "true",
    "aws-cdk:enableDiffNoFail": "true",
    "ENVIRONMENTS": {
      "prod": {
        "bucket_name": "my-prod-bucket-name"
      }
    }
  }
}
Then in my stack code
from aws_cdk import (core, aws_s3)


class MyStack(core.Stack):
    def __init__(self, scope: core.Construct, id: str, env) -> None:
        super().__init__(scope, id, env=env)

        environments = self.node.try_get_context("ENVIRONMENTS")
        environment = environments.get("prod")
        bucket_name = environment.get("bucket_name")

        my_bucket = aws_s3.Bucket(
            self,
            bucket_name,
            bucket_name=bucket_name
        )
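To avoid hard-coding "prod" in the stack, one option (a sketch, not part of the original answer; the "env" context key and the "dev" fallback are assumptions) is to choose the environment through a second context value passed on the command line, e.g. cdk deploy -c env=prod:
# app.py - illustrative sketch only
from aws_cdk import core

app = core.App()

# `cdk deploy -c env=prod` (or -c env=dev) picks the matching block from cdk.json
env_name = app.node.try_get_context("env") or "dev"
environments = app.node.try_get_context("ENVIRONMENTS") or {}
settings = environments.get(env_name, {})

# Hand the selected settings to the stack, e.g. as an extra constructor
# argument (hypothetical signature):
# MyStack(app, f"my-stack-{env_name}", settings=settings)

app.synth()
This keeps a single cdk.json with parallel dev/prod blocks instead of maintaining separate parameter files.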
I have a Dataflow template that I can use for a Dataflow job running as a service account of my choosing. I've actually used one of Google's provided samples: gs://dataflow-templates/latest/GCS_Text_to_BigQuery.
I now want to schedule this using Cloud Scheduler. I've set up my scheduler job like so:
When the scheduler job runs it errors with PERMISSION_DENIED:
{
  "insertId": "1kw7uaqg3tnzbqu",
  "jsonPayload": {
    "@type": "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished",
    "url": "https://dataflow.googleapis.com/v1b3/projects/project-redacted/locations/europe-west2/templates:launch?gcsPath=gs%3A%2F%2Fdataflow-templates%2Flatest%2FGCS_Text_to_BigQuery",
    "jobName": "projects/project-redacted/locations/europe-west2/jobs/aaa-schedule-dataflow-job",
    "status": "PERMISSION_DENIED",
    "targetType": "HTTP"
  },
  "httpRequest": {
    "status": 403
  },
  "resource": {
    "type": "cloud_scheduler_job",
    "labels": {
      "job_id": "aaa-schedule-dataflow-job",
      "project_id": "project-redacted",
      "location": "europe-west2"
    }
  },
  "timestamp": "2021-12-16T16:41:17.349974291Z",
  "severity": "ERROR",
  "logName": "projects/project-redacted/logs/cloudscheduler.googleapis.com%2Fexecutions",
  "receiveTimestamp": "2021-12-16T16:41:17.349974291Z"
}
I have no idea what permission is missing or what I need to grant in order to make this work and am hoping someone here can help me.
In order to reproduce the problem I have built a terraform configuration that creates the Dataflow job from the template along with all of its prerequisites and it executes successfully.
In that same terraform configuration I have created a Cloud Scheduler job that purports to execute an identical Dataflow job and it is that which fails with the error given above.
All this code is available at https://github.com/jamiet-msm/dataflow-scheduler-permission-problem/tree/6ef20824af0ec798634c146ee9073b4b40c965e0 and I have created a README that explains how to run it.
I figured it out: the service account needs to be granted roles/iam.serviceAccountUser on itself:
resource "google_service_account_iam_member" "sa_may_act_as_itself" {
service_account_id = google_service_account.sa.name
role = "roles/iam.serviceAccountUser"
member = "serviceAccount:${google_service_account.sa.email}"
}
roles/dataflow.admin is also required; roles/dataflow.worker isn't enough. I assume that's because dataflow.jobs.create is needed, and that permission is not included in roles/dataflow.worker (see https://cloud.google.com/dataflow/docs/concepts/access-control#roles for reference):
resource "google_project_iam_member" "df_admin" {
role = "roles/dataflow.admin"
member = "serviceAccount:${google_service_account.sa.email}"
}
Here is the commit with the required changes: https://github.com/jamiet-msm/dataflow-scheduler-permission-problem/commit/3fd7cabdf13d5465e01a928049f54b0bd486ed73
I have a StepFunctions state machine defined with JSON that is able to take ARNs of arbitrary ECS tasks (created elsewhere with CloudFormation), execute them and do some post-processing. The ECS task state takes the task definition ARN from the input; it boils down to this:
"States": {
"Run Fargate Task": {
"Type": "Task",
"Resource": "arn:${AWS::Partition}:states:::ecs:runTask.sync",
"Parameters": {
"LaunchType": "FARGATE",
"Cluster": "${ecsClusterArn}",
"TaskDefinition.$": "$.ecs_task_arn"
},
"ResultPath": null
}
I am migrating this stack to CDK using the aws_cdk.aws_stepfunctions Python module, but I can't find a way to pass the task ARN. I would like to do something like this:
import aws_cdk, constructs
from aws_cdk import aws_stepfunctions, aws_stepfunctions_tasks, aws_ecs


class StepsStack(aws_cdk.Stack):
    def __init__(self, scope: constructs.Construct, construct_id: str, *, cluster: aws_ecs.ICluster, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        state_machine = aws_stepfunctions.StateMachine(
            self,
            "MyStateMachine",
            definition=aws_stepfunctions_tasks.EcsRunTask(
                self,
                "RunFargateTask",
                cluster=cluster,
                launch_target=aws_ecs.LaunchType("FARGATE"),
                # doesn't work, aws_ecs.TaskDefinition is expected:
                task_definition=aws_stepfunctions.JsonPath.string_at("$.ecs_task_arn"),
            ),
        )
It looks like CDK expects me to know exactly which ECS tasks I'll be running. Is there a way to express this dynamic behaviour in CDK?
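Not from the original thread, but one possible direction: aws_stepfunctions.CustomState accepts an arbitrary state definition as raw ASL JSON, so the dynamic "TaskDefinition.$" path from the existing template can be kept verbatim. A rough sketch (the cluster ARN is passed in as a plain string; IAM permissions for ecs:RunTask and iam:PassRole would still have to be granted to the state machine role by hand, since CDK cannot infer them here):
# Sketch only, not a verified answer to the question above.
import aws_cdk
from aws_cdk import aws_stepfunctions as sfn


class DynamicTaskStack(aws_cdk.Stack):
    def __init__(self, scope, construct_id, *, cluster_arn: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # CustomState embeds the ASL as-is, including the runtime JsonPath.
        run_task = sfn.CustomState(
            self,
            "RunFargateTask",
            state_json={
                "Type": "Task",
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition.$": "$.ecs_task_arn",
                },
                "ResultPath": None,
            },
        )

        sfn.StateMachine(self, "MyStateMachine", definition=run_task)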
TLDR: I would like to run beam.io.BigQuerySource with a different query every month using the Dataflow API and templates. If that is not possible, can I pass a query to beam.io.BigQuerySource at runtime while still using the Dataflow API and templates?
I have a Dataflow batch data pipeline which reads a BigQuery table like below:
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--pro_id',
        dest='pro_id',
        type=str,
        default='xxxxxxxxxx',
        help='project id')
    parser.add_argument(
        '--dataset',
        dest='dataset',
        type=str,
        default='xxxxxxxxxx',
        help='bigquery dataset to read data from')
    args, pipeline_args = parser.parse_known_args(argv)
    project_id = args.pro_id
    dataset_id = args.dataset
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(argv=pipeline_args) as p:
        companies = (
            p
            | "Read from BigQuery" >> beam.io.Read(beam.io.BigQuerySource(query=query_bq(project_id, dataset_id),
                                                                          use_standard_sql=True))
        )
And the query parameter for beam.io.BigQuerySource is calculated by a function like this
from datetime import datetime


def query_bq(project, dataset):
    month = datetime.today().replace(day=1).strftime("%Y_%m_%d")
    query = (
        f'SELECT * FROM `{project}.{dataset}.data_{month}_json` '
        f'LIMIT 10'
    )
    return query
A couple of things to note here:
I want to run this data pipeline once a day.
The table id changes from month to month. For example, the table id for this month would be data_2020_06_01_json and for next month it would be data_2020_07_01_json; all of this is calculated by def query_bq(project, dataset) above.
I would like to automate the running of this batch pipeline with the Dataflow API, using a Cloud Function, a Pub/Sub event and Cloud Scheduler.
Here is the Cloud Function that gets triggered by Cloud Scheduler publishing an event to Pub/Sub every day:
import ast
import base64
from datetime import datetime

from googleapiclient.discovery import build


def run_dataflow(event, context):
    if 'data' in event:
        pubsub_message = base64.b64decode(event['data']).decode('utf-8')
        pubsub_message_dict = ast.literal_eval(pubsub_message)
        event = pubsub_message_dict.get("eventName")
        now = datetime.today().strftime("%Y-%m-%d-%H-%M-%S")
        project = 'xxx-xxx-xxx'
        region = 'europe-west2'
        dataflow = build('dataflow', 'v1b3', cache_discovery=False)
        if event == "run_dataflow":
            job = f'dataflow-{now}'
            template = 'gs://xxxxx/templates/xxxxx'
            request = dataflow.projects().locations().templates().launch(
                projectId=project,
                gcsPath=template,
                location=region,
                body={
                    'jobName': job,
                }
            )
            response = request.execute()
            print(response)
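As an aside (not part of the original question), the same templates().launch() call can also carry runtime parameters in its body; these map onto options declared with add_value_provider_argument in the template, such as the --query_bq option tried further below. A sketch, with an illustrative project/dataset in the query:
# Variant of the launch call above that passes runtime parameters
request = dataflow.projects().locations().templates().launch(
    projectId=project,
    gcsPath=template,
    location=region,
    body={
        'jobName': job,
        'parameters': {
            # 'query_bq' must be declared as a ValueProvider option in the template
            'query_bq': 'SELECT * FROM `my-project.my_dataset.data_2020_07_01_json` LIMIT 10',
        },
    }
)
response = request.execute()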
Here is the command I use to launch this data pipeline on Dataflow:
python main.py \
--setup_file ./setup.py \
--project xxx-xx-xxxx \
--pro_id xxx-xx-xxxx \
--dataset 'xx-xxx-xxx' \
--machine_type=n1-standard-4 \
--max_num_workers=5 \
--num_workers=1 \
--region europe-west2 \
--serviceAccount=xxx-xxx-xxx \
--runner DataflowRunner \
--staging_location gs://xx/xx \
--temp_location gs://xx/temp \
--subnetwork="xxxxxxxxxx" \
--template_location gs://xxxxx/templates/xxxxx
The problem I'm facing:
My query_bq function is called during compilation and creation of the Dataflow template, which is then uploaded to GCS; it does not get called at runtime. So whenever my Cloud Function launches the template, it always reads from the data_2020_06_01_json table, and the table in the query will stay the same even as we progress into July, August and so on. What I really want is for that query to change dynamically based on the query_bq function, so that in the future it reads from data_2020_07_01_json, data_2020_08_01_json and so on.
I have also looked into the template file generated and it looks like the query is hard-coded into the template after compilation. Here's a snippet
"name": "beamapp-xxxxx-0629014535-344920",
"steps": [
{
"kind": "ParallelRead",
"name": "s1",
"properties": {
"bigquery_export_format": "FORMAT_AVRO",
"bigquery_flatten_results": true,
"bigquery_query": "SELECT * FROM `xxxx.xxxx.data_2020_06_01_json` LIMIT 10",
"bigquery_use_legacy_sql": false,
"display_data": [
{
"key": "source",
"label": "Read Source",
"namespace": "apache_beam.runners.dataflow.ptransform_overrides.Read",
"shortValue": "BigQuerySource",
"type": "STRING",
"value": "apache_beam.io.gcp.bigquery.BigQuerySource"
},
{
"key": "query",
"label": "Query",
"namespace": "apache_beam.io.gcp.bigquery.BigQuerySource",
"type": "STRING",
"value": "SELECT * FROM `xxxx.xxxx.data_2020_06_01_json` LIMIT 10"
},
{
"key": "validation",
"label": "Validation Enabled",
"namespace": "apache_beam.io.gcp.bigquery.BigQuerySource",
"type": "BOOLEAN",
"value": false
}
],
"format": "bigquery",
"output_info": [
{
An alternative I've tried
I also tried the ValueProvider approach as described here: https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#pipeline-io-and-runtime-parameters
and added this to my code:
class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--query_bq', type=str)


user_options = pipeline_options.view_as(UserOptions)

p | "Read from BigQuery" >> beam.io.Read(beam.io.BigQuerySource(query=user_options.query_bq,
                                                                use_standard_sql=True))
And when I run this I get this error
WARNING:apache_beam.utils.retry:Retry with exponential backoff: waiting for 3.9023594566785924 seconds before retrying get_query_location because we caught exception: apitools.base.protorpclite.messages.ValidationError: Expected type <class 'str'> for field query, found SELECT * FROM `xxxx.xxxx.data_2020_06_01_json` LIMIT 10 (type <class 'apache_beam.options.value_provider.StaticValueProvider'>)
So I'm guessing beam.io.BigQuerySource does not accept ValueProviders
You cannot use ValueProviders with BigQuerySource, but in more recent versions of Beam you can use beam.io.ReadFromBigQuery, which supports them well.
You would do:
result = (p
          | beam.io.ReadFromBigQuery(query=options.input_query,
                                     ...))
You can pass value providers, and it has a lot of other utilities. Check out its documentation.
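For completeness, a fuller sketch of how the pieces could fit together (assumptions: the option keeps the --query_bq name from the question, a reasonably recent Beam SDK where ReadFromBigQuery accepts ValueProviders, and a placeholder GCS bucket for the export location):
# Sketch only - combines the question's UserOptions with ReadFromBigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved at runtime, so it can be set per template launch
        parser.add_value_provider_argument('--query_bq', type=str)


def run(argv=None):
    pipeline_options = PipelineOptions(argv)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    user_options = pipeline_options.view_as(UserOptions)

    with beam.Pipeline(options=pipeline_options) as p:
        (
            p
            | "Read from BigQuery" >> beam.io.ReadFromBigQuery(
                query=user_options.query_bq,
                use_standard_sql=True,
                gcs_location='gs://my-temp-bucket/bq_export',  # placeholder bucket
            )
        )
The Cloud Function can then supply the month-specific query through the template's parameters field at launch time instead of having it baked into the template.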
In a very simple Wildfly Swarm project, Swagger works fine and produces the expected swagger.json output. But as soon as a second Java package is added to the project, it doesn't work anymore.
Example (see https://github.com/pe-st/swagger42 example project):
the first commit consists of the project as generated by http://wildfly-swarm.io/ (containing one class HelloWorldEndpoint in the package ch.schlau.pesche.swagger42.rest)
the second commit adds minimal Swagger annotations and generates the following swagger.json:
{
  "swagger": "2.0",
  "info": {},
  "basePath": "/",
  "paths": {
    "/hello": {
      "get": {
        "summary": "Get the response",
        "description": "",
        "operationId": "doGet",
        "produces": [
          "text/plain"
        ],
        "parameters": [],
        "responses": {
          "default": {
            "description": "successful operation"
          }
        }
      }
    }
  }
}
the third commit adds an empty class in a second Java package ch.schlau.pesche.swagger42.core. Now the generated swagger.json looks like this:
{"swagger":"2.0","info":{},"basePath":"/"}
What has to be done to make Swagger work in projects like these?
https://wildfly-swarm.gitbooks.io/wildfly-swarm-users-guide/advanced/swagger.html
Create a file META-INF/swarm.swagger.conf and add the following entry:
packages:ch.schlau.pesche.swagger42.rest
At startup there is an info message such as
[org.wildfly.swarm.swagger] (main) WFSSWGR0004: Configure Swagger for deployment
demo.war with package ch.schlau.pesche.swagger42.core
or similar, showing which packages are scanned.
Temp solution: I exported the build definition JSON file from the TFS web interface, and for comparison I exported the build definition object from the API. The two JSON files look different, which is the problem. For now I will use the JSON from the API object.
I have TFS 2018 installed and I am importing an exported JSON build file like this:
var filePath = "builddefinition.json";
var buildDef = JsonConvert.DeserializeObject<BuildDefinition>(File.ReadAllText(filePath));
The file is imported successfully. However, the steps are not. Here is part of the exported JSON file showing the first step; I have 7 steps in total.
"process": {
"phases": [
{
"steps": [
{
"environment": {},
"enabled": true,
"continueOnError": false,
"alwaysRun": false,
"displayName": "Use NuGet 4.3.0",
"timeoutInMinutes": 0,
"condition": "succeeded()",
"refName": "NuGetToolInstaller1",
"task": {
"id": "2c645196a-524fd-4a402-92be8-d9d4837b7c5d",
"versionSpec": "0.*",
"definitionType": "task"
},
"inputs": {
"versionSpec": "4.3.0",
"checkLatest": "false"
}
},
{... more steps
However, if I get the build definition from the API, I get all the steps.
var buildDef = await buildClient.GetDefinitionAsync("MyProject", builddefid);
Any idea why the steps are not deserialized into the object when reading it from the JSON file?
The contents/format of the JSON files exported from the web interface and from the object API are different.
So you need to export/import the build definition JSON files in pairs: export from the web interface, then import through the web interface; export via the API, then import again via the API.
This article is for your reference: TFS 2015 clone/import/export build definition between team projects