I have just started learning how to use some Google Cloud products and am currently working with Cloud Dataflow.
I decided to start by writing a simple program: it does nothing more than read from a BigQuery table and write the results to another table.
The job is failing.
Pipeline p = Pipeline.create(options);
PCollection<TableRow> data = p.apply(BigQueryIO.Read.named("test")
.fromQuery("select itemName from `Dataset.sampletable`").usingStandardSql());
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("category").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
data.apply(BigQueryIO.Write.named("Write").to("Dataset.dataflow_test")
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
p.run();
}
The error:
(6093f22a86dc3c25): Workflow failed. Causes: (6093f22a86dc389a): S01:test/DataflowPipelineRunner.BatchBigQueryIONativeRead+Write/DataflowPipelineRunner.BatchBigQueryIOWrite/DataflowPipelineRunner.BatchBigQueryIONativeWrite failed., (709b1cdded98b0f6): BigQuery creating dataset "_dataflow_temp_dataset_10172746300453557418" in project "1234-project" failed., (709b1cdded98b191): BigQuery execution failed., (709b1cdded98b22c): Error:
Message: IAM setPolicy failed for 42241429167:_dataflow_temp_dataset_10172746300453557418
HTTP Code: 400
I can imagine that writing immediately after reading might be the reason for the failure, so I would like to know a good way to solve this.
Given a Google Sheet which has data coming from BigQuery via the Sheets data connector for BigQuery (as described here), is it possible to force a refresh of this BigQuery data via the Sheets API? It is possible to do so through the Sheets UI (has a refresh button), but I want to do it via some other service invoking an API.
You can use Apps Script to interact with the data connector in your Spreadsheet.
Considerations
The Sheets API does not let you manipulate the DataSource. You can use Apps Script instead; it's relatively easy to set up a script that refreshes your BigQuery data.
Approach
If you want to trigger this operation from an external API, you will have to use Apps Script as a proxy. You can deploy your Apps Script as an API executable and trigger its functions from an external service.
Apps Script
/** @OnlyCurrentDoc */
function refresh() {
  var spreadsheet = SpreadsheetApp.getActive();
  // Navigate to the sheet/cell that holds the BigQuery data source table.
  spreadsheet.getRange('A1').activate();
  spreadsheet.setActiveSheet(spreadsheet.getSheetByName('Data Sheet 1'), true);
  // Allow data source execution, then refresh the connected BigQuery table.
  SpreadsheetApp.enableAllDataSourcesExecution();
  spreadsheet.getCurrentCell().getDataSourceTables()[0].refreshData();
}
Your External Service
// [...] OAuth and other setup...
public static void main(String[] args) throws IOException {
    // ID of the script to call. Acquire this from the Apps Script editor,
    // under Publish > Deploy as API executable.
    String scriptId = "ENTER_YOUR_SCRIPT_ID_HERE";
    Script service = getScriptService();

    // Create an execution request object.
    ExecutionRequest request = new ExecutionRequest()
            .setFunction("refresh");

    try {
        // Make the API request.
        Operation op = service.scripts().run(scriptId, request).execute();
        // [...] Error handling...
    } catch (IOException e) {
        // [...] Error handling...
    }
}
References:
BigQuery data connector
Apps Script API
import pandas as pd
from google.cloud import bigquery
import google.auth
# from google.cloud import bigquery
# Create credentials with Drive & BigQuery API scopes
# Both APIs must be enabled for your project before running this code
credentials, project = google.auth.default(scopes=[
'https://www.googleapis.com/auth/drive',
'https://www.googleapis.com/auth/spreadsheets',
'https://www.googleapis.com/auth/bigquery',
])
client = bigquery.Client(credentials=credentials, project=project)
# Configure the external data source and query job
external_config = bigquery.ExternalConfig('GOOGLE_SHEETS')
# Use a shareable link or grant viewing access to the email address you
# used to authenticate with BigQuery (this example Sheet is public)
sheet_url = (
'https://docs.google.com/spreadsheets'
'/d/1uknEkew2C3nh1JQgrNKjj3Lc45hvYI2EjVCcFRligl4/edit?usp=sharing')
external_config.source_uris = [sheet_url]
external_config.schema = [
bigquery.SchemaField('name', 'STRING'),
bigquery.SchemaField('post_abbr', 'STRING')
]
external_config.options.skip_leading_rows = 1 # optionally skip header row
table_id = 'BambooHRActiveRoster'
job_config = bigquery.QueryJobConfig()
job_config.table_definitions = {table_id: external_config}
# Get Top 10
sql = 'SELECT * FROM workforce.BambooHRActiveRoster LIMIT 10'
query_job = client.query(sql, job_config=job_config) # API request
top10 = list(query_job) # Waits for query to finish
print('There are {} states with names starting with W.'.format(
len(top10)))
The error I get is:
BadRequest: 400 Error while reading table: workforce.BambooHRActiveRoster, error message: Failed to read the spreadsheet. Errors: No OAuth token with Google Drive scope was found.
I can pull data in from a BigQuery table created from CSV upload, but when I have a BigQuery table created from a linked Google Sheet, I continue to receive this error.
I have tried to replicate the sample in Google's documentation (Creating and querying a temporary table):
https://cloud.google.com/bigquery/external-data-drive
You are authenticating as yourself, which is generally fine for BQ if you have the correct permissions. Using tables linked to Google Sheets often requires a service account. Create one (or have your BI/IT team create one), and then you will have to share the underlying Google Sheet with the service account. Finally, you will need to modify your python script to use the service account credentials and not your own.
The quick way around this is to use the BigQuery UI: run a select * from the Sheets-linked table, save the results to a new table, and query that new table directly in your Python script. This works well if it is a one-time upload/analysis. If the data in the Sheet will change regularly and you need to query it routinely, it is not a long-term solution.
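As a rough illustration of that workaround (a minimal sketch; the snapshot table name below is hypothetical, and it assumes the Sheets-linked query results were already saved to a native table from the BigQuery UI):
from google.cloud import bigquery

# The native table no longer points at the spreadsheet, so plain default
# credentials are enough; no Drive scope is required.
client = bigquery.Client()
sql = 'SELECT * FROM workforce.BambooHRActiveRoster_snapshot LIMIT 10'  # hypothetical snapshot table
df = client.query(sql).to_dataframe()
print(df.head())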
I solved the problem by adding the scopes to the credentials passed to the client.
from google.cloud import bigquery
import google.auth

# Request default credentials that carry both the Drive and BigQuery scopes.
credentials, project = google.auth.default(scopes=[
    'https://www.googleapis.com/auth/drive',
    'https://www.googleapis.com/auth/bigquery',
])
CLIENT = bigquery.Client(project=project, credentials=credentials)
Following https://cloud.google.com/bigquery/external-data-drive, this also works with a service account:
import pandas as pd
from google.oauth2 import service_account
from google.cloud import bigquery

SCOPES = ['https://www.googleapis.com/auth/drive',
          'https://www.googleapis.com/auth/bigquery']
SERVICE_ACCOUNT_FILE = 'mykey.json'
project = 'my-project-id'  # placeholder: your GCP project id

credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=SCOPES)
delegated_credentials = credentials.with_subject('myserviceaccount@domain.iam.gserviceaccount.com')
client = bigquery.Client(credentials=delegated_credentials, project=project)

sql = 'SELECT * FROM `myModel`'
DF = client.query(sql).to_dataframe()
You can try to update your application-default credentials from the command line with gcloud:
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/userinfo.email,https://www.googleapis.com/auth/drive,https://www.googleapis.com/auth/cloud-platform
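After re-authenticating with those scopes, a plain client should be able to read the Sheets-backed table without any explicit scope configuration in code (a minimal sketch, assuming a permanent Sheets-backed table named workforce.BambooHRActiveRoster exists, as in the question):
from google.cloud import bigquery

# Assumes the gcloud command above has been run, so the application-default
# credentials already carry the Drive scope.
client = bigquery.Client()
rows = client.query('SELECT * FROM workforce.BambooHRActiveRoster LIMIT 10').result()
for row in rows:
    print(dict(row))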
I have an issue regarding Google Dataflow.
I'm writing a Dataflow pipeline which reads data from Pub/Sub and writes to BigQuery, and that works.
Now I have to handle late data. I was following some examples on the internet, but it's not working properly. Here is my code:
pipeline.apply(PubsubIO.readStrings()
.withTimestampAttribute("timestamp").fromSubscription(Constants.SUBSCRIBER))
.apply(ParDo.of(new ParseEventFn()))
.apply(Window.<Entity> into(FixedWindows.of(WINDOW_SIZE))
// processing of late data.
.triggering(
AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(DELAY_SIZE))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(ALLOW_LATE_SIZE)
.accumulatingFiredPanes())
.apply(ParDo.of(new ParseTableRow()))
.apply("Write to BQ", BigQueryIO.<TableRow>write()...
Here is my pubsub message:
{
...,
"timestamp" : "2015-08-31T09:52:25.005Z"
}
When I manually push some messages (go to the Pub/Sub topic and publish) with a timestamp far older than the allowed lateness (ALLOW_LATE_SIZE), these messages are still passed through.
You should specify the allowed lateness explicitly using a Duration object, e.g. .withAllowedLateness(Duration.standardMinutes(ALLOW_LATE_SIZE)), assuming you have set the value of ALLOW_LATE_SIZE in minutes.
You can check the documentation for the Google Cloud Dataflow SDK for Java, specifically the "Triggers" section.
I just need to run a Dataflow pipeline on a daily basis, but suggested solutions like App Engine Cron Service, which requires building a whole web app, seem a bit too much.
I was thinking about just running the pipeline from a cron job on a Compute Engine Linux VM, but maybe that's far too simple :). What's the problem with doing it that way? Why isn't anybody (besides me, I guess) suggesting it?
This is how I did it using Cloud Functions, PubSub, and Cloud Scheduler
(this assumes you've already created a Dataflow template and it exists in your GCS bucket somewhere)
Create a new topic in PubSub. This will be used to trigger the Cloud Function.
Create a Cloud Function that launches a Dataflow job from a template. I find it easiest to just create this from the Cloud Functions console. Make sure the service account you choose has permission to create a Dataflow job. The function's index.js looks something like:
const google = require('googleapis');
exports.triggerTemplate = (event, context) => {
// in this case the PubSub message payload and attributes are not used
// but can be used to pass parameters needed by the Dataflow template
const pubsubMessage = event.data;
console.log(Buffer.from(pubsubMessage, 'base64').toString());
console.log(event.attributes);
google.google.auth.getApplicationDefault(function (err, authClient, projectId) {
if (err) {
console.error('Error occurred: ' + err.toString());
throw new Error(err);
}
const dataflow = google.google.dataflow({ version: 'v1b3', auth: authClient });
dataflow.projects.templates.create({
projectId: projectId,
resource: {
parameters: {},
jobName: 'SOME-DATAFLOW-JOB-NAME',
gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
}
}, function(err, response) {
if (err) {
console.error("Problem running dataflow template, error was: ", err);
}
console.log("Dataflow template response: ", response);
});
});
};
The package.json looks like
{
"name": "pubsub-trigger-template",
"version": "0.0.1",
"dependencies": {
"googleapis": "37.1.0",
"#google-cloud/pubsub": "^0.18.0"
}
}
Go to PubSub and the topic you created, and manually publish a message. This should trigger the Cloud Function and start a Dataflow job.
Use Cloud Scheduler to publish a PubSub message on schedule
https://cloud.google.com/scheduler/docs/tut-pub-sub
There's absolutely nothing wrong with using a cron job to kick off your Dataflow pipelines. We do it all the time for our production systems, whether they are our Java or our Python pipelines.
That said however, we are trying to wean ourselves off cron jobs, and move more toward using either AWS Lambdas (we run multi cloud) or Cloud Functions. Unfortunately, Cloud Functions don't have scheduling yet. AWS Lambdas do.
There is a FAQ answer to that question:
https://cloud.google.com/dataflow/docs/resources/faq#is_there_a_built-in_scheduling_mechanism_to_execute_pipelines_at_given_time_or_interval
You can automate pipeline execution by using Google App Engine (Flexible Environment only) or Cloud Functions.
You can use Apache Airflow's Dataflow Operator, one of several Google Cloud Platform Operators in a Cloud Composer workflow.
You can use custom (cron) job processes on Compute Engine.
The Cloud Functions approach is described as "Alpha" and it's still true that they don't have scheduling (no equivalent to an AWS CloudWatch scheduled event); the only triggers are Pub/Sub messages, Cloud Storage changes, and HTTP invocations.
Cloud Composer looks like a good option: effectively re-badged Apache Airflow, which is itself a great orchestration tool. Definitely not "too simple" like cron :)
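For reference, a minimal DAG sketch for that route (this assumes the apache-airflow-providers-google package is available in the Composer environment and that a Dataflow template is already staged in GCS; the bucket, project, and region values below are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

# Placeholder values -- replace with your own project, region, and template path.
with DAG(
    dag_id="daily_dataflow_job",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    DataflowTemplatedJobStartOperator(
        task_id="launch_dataflow_template",
        template="gs://YOUR_BUCKET/templates/your-template",
        project_id="your-project-id",
        location="us-central1",
        parameters={},  # template runtime parameters, if any
    )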
You can use Cloud Scheduler to schedule your job as well. See my post:
https://medium.com/@zhongchen/schedule-your-dataflow-batch-jobs-with-cloud-scheduler-8390e0e958eb
Terraform script
data "google_project" "project" {}
resource "google_cloud_scheduler_job" "scheduler" {
name = "scheduler-demo"
schedule = "0 0 * * *"
# This needs to be us-central1 even if the app engine is in us-central.
# You will get a resource not found error if just using us-central.
region = "us-central1"
http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
oauth_token {
service_account_email = google_service_account.cloud-scheduler-demo.email
}
# need to encode the string
body = base64encode(<<-EOT
{
"jobName": "test-cloud-scheduler",
"parameters": {
"region": "${var.region}",
"autoscalingAlgorithm": "THROUGHPUT_BASED",
},
"environment": {
"maxWorkers": "10",
"tempLocation": "gs://zhong-gcp/temp",
"zone": "us-west1-a"
}
}
EOT
)
}
}
We have been running TFS 2012 in-house for around 3 months, and in particular "processing of cubes" was working fine until 14/08. At that point it just stopped working (nothing was done on the server, or at least I haven't found any changes yet).
What we get in the Windows log looks like this:
Detailed Message: TF221122: An error occurred running job Full
Analysis Database Sync for team project collection or Team Foundation
server TEAM FOUNDATION. Exception Message: Failed to Process Analysis
Database 'Tfs_Analysis'. (type WarehouseException) Exception Stack
Trace: at
Microsoft.TeamFoundation.Warehouse.TFSOlapProcessComponent.ProcessOlap(AnalysisDatabaseProcessingType
processingType, WarehouseChanges warehouseChanges, Boolean
lastProcessingFailed, Boolean cubeSchemaUpdateNeeded) at
Microsoft.TeamFoundation.Warehouse.AnalysisDatabaseSyncJobExtension.RunInternal(TeamFoundationRequestContext
requestContext, TeamFoundationJobDefinition jobDefinition, DateTime
queueTime, String& resultMessage) at
Microsoft.TeamFoundation.Warehouse.WarehouseJobExtension.Run(TeamFoundationRequestContext
requestContext, TeamFoundationJobDefinition jobDefinition, DateTime
queueTime, String& resultMessage)
Inner Exception Details:
Exception Message: Errors in the high-level relational engine. The
following exception occurred while the managed IDbConnection interface
was being used: . Errors in the high-level relational engine. A
connection could not be made to the data source with the DataSourceID
of 'Tfs_AnalysisDataSource', Name of 'Tfs_AnalysisDataSource'. Errors
in the OLAP storage engine: An error occurred while the dimension,
with the ID of 'Dim Team Project', Name of 'Team Project' was being
processed. Errors in the OLAP storage engine: An error occurred while
the 'ProjectNodeSk' attribute of the 'Team Project' dimension from the
'Tfs_Analysis' database was being processed. Internal error: The
operation terminated unsuccessfully. Errors in the high-level
relational engine. The following exception occurred while the managed
IDbConnection interface was being used: . Errors in the high-level
relational engine. A connection could not be made to the data source
with the DataSourceID of 'Tfs_AnalysisDataSource', Name of
'Tfs_AnalysisDataSource'. Errors in the OLAP storage engine: An error
occurred while the dimension, with the ID of 'Dim Team Project', Name
of 'Team Project' was being processed. Errors in the OLAP storage
engine: An error occurred while the 'Project Node Type' attribute of
the 'Team Project' dimension from the 'Tfs_Analysis' database was
being processed. Errors in the high-level relational engine. The
following exception occurred while the managed IDbConnection interface
was being used: . Errors in the high-level relational engine. A
connection could not be made to the data source with the DataSourceID
of 'Tfs_AnalysisDataSource', Name of 'Tfs_AnalysisDataSource'. Errors
in the OLAP storage engine: An error occurred while the dimension,
with the ID of 'Dim Team Project', Name of 'Team Project' was being
processed. Errors in the OLAP storage engine: An error occurred while
the 'Is Deleted' attribute of the 'Team Project' dimension from the
'Tfs_Analysis' database was being processed. Errors in the high-level
relational engine. The following exception occurred while the managed
IDbConnection interface was being used: . Errors in the high-level
relational engine. A connection could not be made to the data source
with the DataSourceID of 'Tfs_AnalysisDataSource', Name of
'Tfs_AnalysisDataSource'. Errors in the OLAP storage engine: An error
occurred while the dimension, with the ID of 'Dim Team Project', Name
of 'Team Project' was being processed. Errors in the OLAP storage
engine: An error occurred while the 'Project Node Name' attribute of
the 'Team Project' dimension from the 'Tfs_Analysis' database was
being processed. Errors in the high-level relational engine. The
following exception occurred while the managed IDbConnection interface
was being used: . Errors in the high-level relational engine. A
connection could not be made to the data source with the DataSourceID
of 'Tfs_AnalysisDataSource', Name of 'Tfs_AnalysisDataSource'. Errors
in the OLAP storage engine: An error occurred while the dimension,
with the ID of 'Dim Team Project', Name of 'Team Project' was being
processed. Errors in the OLAP storage engine: An error occurred while
the 'Project Path' attribute of the 'Team Project' dimension from the
'Tfs_Analysis' database was being processed. Server: The current
operation was cancelled because another operation in the transaction
failed.
Warning: Parser: Out of line object 'Binding', referring to ID(s)
'Tfs_Analysis, Team System, FactCurrentWorkItem', has been specified
but has not been used. Warning: Parser: Out of line object 'Binding',
referring to ID(s) 'Tfs_Analysis, Team System, FactWorkItemHistory',
has been specified but has not been used.
...
So far:
- I've tried to force full processing of the cube, following the instructions from http://msdn.microsoft.com/en-us/library/ff400237(v=vs.100).aspx
- I've tried to "rebuild reporting" from "TFS admin console" -> "Application tier" -> "Reporting" -> "Start rebuild"
- Finally, I've also tried to process directly from SQL Server Management Studio: Tfs_Analysis -> Process
- I've checked the c:\olap\logs\msmdsrv file and didn't find any errors there
Besides that, we also tried to:
- restart the server
- restart just the services
Nothing of the above helped.
Our TFS is:
- hosted on one machine
- updated to Update 3 (right after setting it up)
- using three different domain accounts to host the TFS services, SQL, and Reporting Services; nothing has changed in the names/passwords of those accounts since installation, and I've verified that those accounts have access to the proper databases
Does anyone have a similar problem? Any ideas are really welcome.
I think the core error is "A connection could not be made to the data source with the DataSourceID of 'Tfs_AnalysisDataSource'". Check the data source settings, especially the connection string. Typical reasons are wrong connection protocol settings (so that the protocol configured for Analysis Services is not enabled on the relational engine), firewall problems, or authentication issues.
Had the same issue (also TFS 2012). Restarting the Analysis Service did the trick for me this time. Also check the account/password that 'Tfs_AnalysisDataSource' is using by clicking "Properties" in SQL Server Management Studio. I had a similar issue a while back when the passwords changed.