Google Cloud Storage solution to join 2 CSV files in a bucket based on a common column

I need help/suggestions to implement the below use case.
I have 2 CSV files in a Google Cloud Storage bucket. I need to join these 2 files based on one common column and save the output file back into the Google Cloud Storage bucket.
I need to implement this using any Google Cloud solution (Cloud Dataflow with Beam Python, Cloud Functions, or any other Cloud solution). Since I am new to Google Cloud Platform, I would appreciate any help implementing this use case.
Looking forward to hearing from you.
Thanks in advance.

You have several ways to achieve this. If the result of the merge takes less than 1 GB and you want only one output file, you can do it like this:
Query the external CSV files from BigQuery (a federated query) and save the result in a temporary table, like this:
CREATE OR REPLACE EXTERNAL TABLE mydataset.table1
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/file1.csv'],
  skip_leading_rows = 1
)
CREATE OR REPLACE EXTERNAL TABLE mydataset.table2
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/file2.csv'],
  skip_leading_rows = 1
)
CREATE TABLE mydataset.newtable
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
) AS
SELECT *
FROM mydataset.table1 JOIN mydataset.table2 ON ....
Then, export the temp table mydataset.newtable to GCS.
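If you script that export step, a minimal sketch with the google-cloud-bigquery Python client could look like this (the destination path gs://mybucket/joined.csv is a placeholder I introduced, not part of the original answer):

from google.cloud import bigquery

client = bigquery.Client()

# Export the temporary join result as a single CSV back into the bucket.
# A single destination URI is fine here because the result is under 1 GB.
extract_job = client.extract_table(
    "mydataset.newtable",
    "gs://mybucket/joined.csv",  # hypothetical output path
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # block until the export job finishes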
Otherwise, you can use the solution that I describe in this article (that I wrote).
EDIT 1
You can use this sample workflow definition that does what you need.
- loadFile1:
    call: http.post
    args:
      url: https://bigquery.googleapis.com/bigquery/v2/projects/<projectID>/jobs
      auth:
        type: OAuth2
      body:
        configuration:
          query:
            query: CREATE OR REPLACE EXTERNAL TABLE mydataset.table1 OPTIONS (format = 'CSV', uris = ['gs://mybucket/file1.csv'], skip_leading_rows = 1)
            useLegacySql: false
- loadFile2:
    call: http.post
    args:
      url: https://bigquery.googleapis.com/bigquery/v2/projects/<projectID>/jobs
      auth:
        type: OAuth2
      body:
        configuration:
          query:
            query: CREATE OR REPLACE EXTERNAL TABLE mydataset.table2 OPTIONS (format = 'CSV', uris = ['gs://mybucket/file2.csv'], skip_leading_rows = 1)
            useLegacySql: false
- joinQuery:
    call: http.post
    args:
      url: https://bigquery.googleapis.com/bigquery/v2/projects/<projectID>/jobs
      auth:
        type: OAuth2
      body:
        configuration:
          query:
            query: CREATE TABLE mydataset.newtable OPTIONS( expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS SELECT * ......
            useLegacySql: false
    result: queryResult
- getState:
    call: http.get
    args:
      url: ${"https://bigquery.googleapis.com/bigquery/v2/projects/<projectID>/jobs/" + queryResult.body.jobReference.jobId}
      auth:
        type: OAuth2
    result: jobState
    next: testState
- testState:
    switch:
      - condition: ${jobState.body.status.state == "DONE"}
        next: extractData
    next: waitAndGetState
- waitAndGetState:
    call: sys.sleep
    args:
      seconds: 1
    next: getState
- extractData:
    call: http.post
    args:
      url: https://bigquery.googleapis.com/bigquery/v2/projects/<projectID>/jobs
      auth:
        type: OAuth2
      body:
        configuration:
          extract:
            destinationUri: gs://<YourBucket>/bq-extract.csv
            destinationFormat: CSV
            sourceTable:
              projectId: <projectID>
              datasetId: mydataset
              tableId: newtable
    result: extractResult
- returnOutput:
    return: ${extractResult}
Then, use Cloud Scheduler to call the Create Workflow Execution API directly, with an empty body {} and an OAuth2 authentication mode.
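If you would rather trigger the workflow programmatically (for example, while testing) instead of through Cloud Scheduler, a minimal sketch with the google-cloud-workflows Python client, where the region (us-central1) and workflow name (csv-join) are placeholders I introduced:

from google.cloud.workflows import executions_v1

client = executions_v1.ExecutionsClient()

# Same resource that Cloud Scheduler would target via the REST API;
# region and workflow name below are assumptions, not from the answer.
parent = "projects/<projectID>/locations/us-central1/workflows/csv-join"

# Empty argument, matching the empty {} body mentioned above.
execution = client.create_execution(
    parent=parent,
    execution=executions_v1.Execution(argument="{}"),
)
print(execution.name)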

Related

AWS CDK - inline IAM Policies with conflicting names are generated for different stacks using a shared role

I'm using the CDK to deploy several stacks, and one of the roles used is shared across multiple stacks. The constructs (e.g. CodeBuildAction) that use the role frequently attach the necessary permissions as an inline policy. However, even though CDK knows it is an "imported" role, the inline policy name that is generated is not unique across stacks, and therefore both CloudFormation stacks contain the same Policy resource and fight over its contents. (Neither stack contains the Role resource.)
import * as cdk from "@aws-cdk/core";
import * as iam from "@aws-cdk/aws-iam";

const sharedRoleArn = "arn:aws:iam::1111111111:role/MyLambdaRole";

const app = new cdk.App();
const stackOne = new cdk.Stack(app, "StackOne");
const roleRefOne = iam.Role.fromRoleArn(stackOne, "SharedRole", sharedRoleArn);

// Under normal circumstances, this is called inside constructs defined by AWS
// (like a CodeBuildAction that grants permission to access Artifact S3 buckets, etc)
roleRefOne.addToPrincipalPolicy(new iam.PolicyStatement({
  actions: ["s3:ListBucket"],
  resources: ["*"],
  effect: iam.Effect.ALLOW,
}));

const stackTwo = new cdk.Stack(app, "StackTwo");
const roleRefTwo = iam.Role.fromRoleArn(stackTwo, "SharedRole", sharedRoleArn);
roleRefTwo.addToPrincipalPolicy(new iam.PolicyStatement({
  actions: ["dynamodb:List*"],
  resources: ["*"],
  effect: iam.Effect.ALLOW,
}));
The following are fragments of the cloud assembly generated for the two stacks:
SharedRolePolicyA1DDBB1E:
  Type: AWS::IAM::Policy
  Properties:
    PolicyDocument:
      Statement:
        - Action: s3:ListBucket
          Effect: Allow
          Resource: "*"
      Version: "2012-10-17"
    PolicyName: SharedRolePolicyA1DDBB1E
    Roles:
      - MyLambdaRole
  Metadata:
    aws:cdk:path: StackOne/SharedRole/Policy/Resource

SharedRolePolicyA1DDBB1E:
  Type: AWS::IAM::Policy
  Properties:
    PolicyDocument:
      Statement:
        - Action: dynamodb:List*
          Effect: Allow
          Resource: "*"
      Version: "2012-10-17"
    PolicyName: SharedRolePolicyA1DDBB1E
    Roles:
      - MyLambdaRole
  Metadata:
    aws:cdk:path: StackTwo/SharedRole/Policy/Resource
You can see above that the aws:cdk:path values for the two policies are different, but they end up with the same name (SharedRolePolicyA1DDBB1E), which is used as the physical name of the inline policy attached to the MyLambdaRole role. (The same behavior occurs for stacks in separate "Apps" as well.)
There's no affordance for setting the PolicyName of the "default policy" generated for a role (or for choosing which policies a construct attaches permissions to). I could also make the shared role immutable (using { mutable: false } on fromRoleArn), but then I would need to reconstruct the potentially complicated policies a set of constructs would have given the role, and attach them myself.
I was able to work around the issue by templating the stack name into the imported role's "id", as in:
const stack = cdk.Stack.of(scope)
const role = iam.Role.fromRoleArn(scope, `${stack.stackName}SharedRole`, sharedRoleArn);
where I construct my role.
Is this expected behavior? Do I misunderstand something about imported resources with CDK? Is there a better alternative? (My understanding with the construct ids is that they are only intended to need to be unique within a given scope.)

Get SQS URL from within Serverless function?

I'm building a Serverless app that defines an SQS queue in the resources as follows:
resources:
  Resources:
    TheQueue:
      Type: "AWS::SQS::Queue"
      Properties:
        QueueName: "TheQueue"
I want to send messages to this queue from within one of the functions. How can I access the URL from within the function? I want to place it here:
const params = {
  MessageBody: 'message body here',
  QueueUrl: 'WHATS_THE_URL_HERE',
  DelaySeconds: 5
};
This is a great question!
I like to set the queue URL as an ENV var for my app!
So you've named the queue TheQueue.
Simply add this snippet to your serverless.yml file:
provider:
  name: aws
  runtime: <YOUR RUNTIME>
  environment:
    THE_QUEUE_URL: { Ref: TheQueue }
Serverless will automatically grab the queue URL from your CloudFormation and inject it into your ENV.
Then you can access the param as:
const params = {
  MessageBody: 'message body here',
  QueueUrl: process.env.THE_QUEUE_URL,
  DelaySeconds: 5
};
You can use the Get Queue URL API, though I tend to also pass it in to my function. The QueueUrl is the Ref value for an SQS queue in CloudFormation, so you can pretty easily get to it in your CloudFormation. This handy cheat sheet is really helpful for working with CloudFormation attributes and refs.
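If you do resolve it at runtime, here is a minimal sketch of that Get Queue URL call using Python and boto3 (assuming the queue name TheQueue from the question; the answers above use JS):

import boto3

sqs = boto3.client("sqs")

# GetQueueUrl resolves a queue name to its full URL at runtime.
queue_url = sqs.get_queue_url(QueueName="TheQueue")["QueueUrl"]

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody="message body here",
    DelaySeconds=5,
)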
I go a bit of a different route. I, personally, don't like storing information in environment variables when using Lambda, though I really like Aaron Stuyvenberg's solution. Therefore, I store information like this in AWS SSM Parameter Store.
Then in my code I just call for it when needed. Forgive my JS; it has been a while since I did it. I mostly do Python.
const AWS = require('aws-sdk');
const ssm = new AWS.SSM();

const myHandler = async (event, context) => {
  // getParameter returns { Parameter: { Value, ... } }
  const { Parameter } = await ssm.getParameter({ Name: 'some.name.of.parameter' }).promise();
  const params = {
    MessageBody: 'message body here',
    QueueUrl: Parameter.Value,
    DelaySeconds: 5
  };
};
There is probably some deconstruction of the returned data structure I got wrong, but this is roughly what I do. In Python I wrote a library that does all of this in one line.
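As a rough Python equivalent of the handler above, a sketch with boto3 and the same hypothetical parameter name:

import boto3

ssm = boto3.client("ssm")

# get_parameter returns {"Parameter": {"Value": ..., ...}}
queue_url = ssm.get_parameter(Name="some.name.of.parameter")["Parameter"]["Value"]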

Dynamically change servers URL

I have a web app that I deploy for multiple clients (let's say client1.com, client2.com and client3.com). I wrote the REST API documentation using the OpenAPI 3 (OA3) specification.
I have an index.yaml which looks like this:
openapi: 3.0.0
info:
  title: REST API
  description: REST API
  version: 0.0.1
servers:
  - url: http://client1.com/api
    description: Something ...
tags: ...
and a standard Swagger UI index.html:
// Begin Swagger UI call region
const ui = SwaggerUIBundle({
  url: "docs/index.yaml",
  dom_id: '#swagger-ui',
  deepLinking: true,
  presets: [
    SwaggerUIBundle.presets.apis,
    SwaggerUIStandalonePreset.slice(1)
  ],
  plugins: [
    SwaggerUIBundle.plugins.DownloadUrl
  ],
  layout: "StandaloneLayout"
});
// End Swagger UI call region
window.ui = ui;
My problem is that I need to change servers.url depending on where and for which client I deploy the app, so that each client can test the API on their own servers. I also don't want clients to see each other, or know about each other, in the REST API docs.
How can I "dynamically" change/set servers.url? One workaround is to copy index.yaml for each client and hardcode servers.url, but I'm sure there is a better way that I don't see/know about.
Edit #1: the suggested duplicate's answer does not help, because when I set servers to
servers:
  - url: /api
Swagger UI still points to http://localhost:8080/api/, but my app URL is http://localhost:8080/myapp/. I could set servers: - url: /myapp/api, but that is a hardcoded value in index.yaml and does not work for me. I need a "configurable" server URL.
But thanks anyway.
Edit #2: my current workaround is to process the YAML with server-side code. In index.yaml I have:
servers:
  - url: ##INSERT_SERVERS_TAG_HEREapi
and I replace ##INSERT_SERVERS_TAG_HERE with myapp.client.com/. But I'm still looking for a better solution.
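For what it's worth, that server-side substitution can be a few lines of code; a minimal sketch assuming a Python/Flask app (the route and file layout are hypothetical):

from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/docs/index.yaml")
def openapi_spec():
    # Serve the spec template with the per-deployment base URL filled in.
    with open("docs/index.yaml") as f:
        spec = f.read()
    # request.host_url is e.g. "http://client1.com/", so the placeholder
    # from the edit above expands to "http://client1.com/api".
    spec = spec.replace("##INSERT_SERVERS_TAG_HERE", request.host_url)
    return Response(spec, mimetype="text/yaml")

Pointing the SwaggerUIBundle url at this route instead of the static file keeps a single template per deployment.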

Rails Magento API with Savon - complex filters

I am trying to import orders from a Magento store into a Rails app using Savon and the Magento API. So far here is my code:
require 'savon'

client = Savon.client(wsdl: "http://mywebsite.com/api/v2_soap?wsdl")

session_id = client.call(:login,
  message: {
    username: "myapiuser",
    api_key: "myapipassword"
  }).body[:login_response][:login_return]

orders = client.call(:sales_order_list,
  message: {
    session_id: session_id,
    complex_filters: [{
      key: "created_at", operator: "gt", value: '2014-10-14 00:00:00'
    }]
  })
I need to use a complex filter to find orders created after a certain date, because if I try to pull all the orders at once it overloads the server. I tried using the complex filter above, but it still tries to pull all the orders. Am I passing the filter in an improper way? Any idea how to make this work?

Google Docs: Cannot export/download user's document using administrative access/impersonation (forbidden 403) in Python

I have read this thoroughly: https://developers.google.com/google-apps/documents-list/#using_google_apps_administrative_access_to_impersonate_other_domain_users
I have googled this to death.
So far I have been able to:
1. Authorise with:
   - clientLogin
   - OAuth tokens (using my domain key)
2. Retrieve document feeds for all users in the domain (authorised either way in #1)
I am using the "entry" from the feed to export/download documents, and I always get forbidden (403) for other users' documents that are not shared with the admin. The feed query I am using is like:
https://docs.google.com/feeds/userid@mydomain.com/private/full/?v=3
(I have tried with and without the ?v=3)
I have also tried adding the xoauth_requestor_id (which I have also seen in posts as xoauth_requestor), both on the URI and as a client property: client.xoauth_requestor_id = ...
Code fragments:
Client Login (using administrator credentials):
client.http_client.debug = cfg.get('HTTPDEBUG')
client.ClientLogin( cfg.get('ADMINUSER'), cfg.get('ADMINPASS'), 'HOSTED' )
OAuth:
client.http_client.debug = cfg.get('HTTPDEBUG')
client.SetOAuthInputParameters( gdata.auth.OAuthSignatureMethod.HMAC_SHA1, cfg.get('DOMAIN'), cfg.get('APPS.SECRET') )
oatip = gdata.auth.OAuthInputParams( gdata.auth.OAuthSignatureMethod.HMAC_SHA1, cfg.get('DOMAIN'), cfg.get('APPS.SECRET') )
oat = gdata.auth.OAuthToken( scopes = cfg.get('APPS.%s.SCOPES' % section), oauth_input_params = oatip )
oat.set_token_string( cfg.get('APPS.%s.TOKEN' % section) )
client.current_token = oat
Once the feed is retrieved:
# pathname eg whatever.doc
client.Export(entry, pathname)
# have also tried
client.Export(entry, pathname, extra_params={'v': 3})
# and tried
client.Export(entry, pathname, extra_params={'v': 3, 'xoauth_requestor_id': 'admin@mydomain.com'})
Any suggestions, or pointers as to what I am missing here?
Thanks
You were very close to having a correct implementation. In your example above, you had:
client.Export(entry, pathname, extra_params={'v': 3, 'xoauth_requestor_id': 'admin@mydomain.com'})
xoauth_requestor_id must be set to the user you're impersonating, not the admin. Also, what you need is 2-Legged OAuth 1.0a, with the xoauth_requestor_id set either in the token or in the client.
import gdata.docs.client
import gdata.gauth
import tempfile

# Replace with values from your Google Apps domain admin console
CONSUMER_KEY = ''
CONSUMER_SECRET = ''

# Set this to the user you're impersonating, NOT the admin user
username = 'userid@mydomain.com'

# mkstemp() returns a (file descriptor, path) tuple
fd, destination = tempfile.mkstemp()

token = gdata.gauth.TwoLeggedOAuthHmacToken(
    CONSUMER_KEY, CONSUMER_SECRET, username)

# Setting xoauth_requestor_id in the DocsClient constructor is not required
# because we set it in the token above, but I'm showing it here in case your
# token is constructed via some other mechanism and you need another way to
# set xoauth_requestor_id.
client = gdata.docs.client.DocsClient(
    auth_token=token, xoauth_requestor_id=username)

# Replace this with the resource your application needs
resource = client.GetAllResources()[0]
client.DownloadResource(resource, destination)
print 'Downloaded %s to %s' % (resource.title.text, destination)
Here is the reference in the source code to the TwoLeggedOAuthHmacToken class:
http://code.google.com/p/gdata-python-client/source/browse/src/gdata/gauth.py#1062
And here are the references in the source code that provide the xoauth_requestor_id constructor parameter (read these in order):
http://code.google.com/p/gdata-python-client/source/browse/src/atom/client.py#42
http://code.google.com/p/gdata-python-client/source/browse/src/atom/client.py#179
http://code.google.com/p/gdata-python-client/source/browse/src/gdata/client.py#136
