I'm trying to figure out how to load a CSV file from GCS into BigQuery. Pipeline below:
// Create the pipeline
Pipeline p = Pipeline.create(options);
// Create the PCollection from csv
PCollection<String> lines = p.apply(TextIO.read().from("gs://impression_tst_data/incoming_data.csv"));
// Transform into TableRow
PCollection<TableRow> row = lines.apply(ParDo.of(new StringToRowConverter()));
// Write table to BigQuery
row.apply(BigQueryIO.<TableRow>writeTableRows()
.to("project_id:dataset.table")
.withSchema(getSchema())
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED));
Here is the StringToRowConverter class I'm using in the ParDo to create a TableRow PCollection:
// StringToRowConverter
static class StringToRowConverter extends DoFn<String, TableRow> {
@ProcessElement
public void processElement(ProcessContext c) {
c.output(new TableRow().set("string_field", c.element()));
}
}
Looking at the staging files, it looks like this creates TableRows of JSON that lump the CSV into a single column named "string_field". If I don't define string_field in my schema, the job fails. When I do define string_field, it writes each row of the CSV into that column and leaves all the other columns defined in my schema empty. I know this is expected behavior.
So my question: How do I take this JSON output and write it into the schema? Sample output and schema below...
"string_field": "6/26/17 21:28,Dave Smith,1 Learning Drive,867-5309,etc"}
Schema:
static TableSchema getSchema() {
return new TableSchema().setFields(new ArrayList<TableFieldSchema>() {
// Compose the list of TableFieldSchema from tableSchema.
{
add(new TableFieldSchema().setName("Event_Time").setType("TIMESTAMP"));
add(new TableFieldSchema().setName("Name").setType("STRING"));
add(new TableFieldSchema().setName("Address").setType("STRING"));
add(new TableFieldSchema().setName("Phone").setType("STRING"));
add(new TableFieldSchema().setName("etc").setType("STRING"));
}
});
}
Is there a better way of doing this than using the StringToRowConverter?
I need to use a ParDo to create a TableRow PCollection before I can write it out to BQ. However, I'm unable to find a solid example of how to take in a CSV PCollection, transform to TableRow and write it out.
Yes, I am a noob trying to learn here. I'm hoping somebody can help me with a snippet or point me in the right direction on the easiest way to accomplish this. Thanks in advance.
The code in your StringToRowConverter DoFn should parse the string and produce a TableRow with multiple fields. Since each row is comma separated, this would likely involve splitting the string on commas, and then using your knowledge of the column order to do something like:
String inputLine = c.element();
// May need to make the line parsing more robust, depending on your
// files. Look at how to parse rows of a CSV using Java.
String[] split = inputLine.split(",");
// Also, you may need to handle errors such as not enough columns, etc.
TableRow output = new TableRow();
output.set("Event_Time", split[0]); // may want to parse the string
output.set("Name", split[1]);
...
c.output(output);
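For a concrete (if simplistic) version of that, here is a minimal sketch of the full DoFn, assuming exactly the five-column layout from your getSchema() and no real CSV quoting or timestamp handling:
// Sketch only: assumes the five columns declared in getSchema() and that no
// field contains embedded commas; use a real CSV parser for production files.
static class StringToRowConverter extends DoFn<String, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] split = c.element().split(",", -1);
    if (split.length < 5) {
      // Skip (or log / send to a dead-letter output) malformed lines
      // instead of failing the whole bundle.
      return;
    }
    c.output(new TableRow()
        .set("Event_Time", split[0]) // may need reformatting into a BigQuery-accepted TIMESTAMP
        .set("Name", split[1])
        .set("Address", split[2])
        .set("Phone", split[3])
        .set("etc", split[4]));
  }
}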
I am following the Snowflake Snowpark tutorial here:
https://quickstarts.snowflake.com/guide/getting_started_with_snowpark/index.html?index=..%2f..index&msclkid=f0b56761cf1011ecb976c58c0f8e2a64#9
It goes through how to use Snowpark to create a user-defined function, upload data programmatically, and then execute the UDF code against the data in Snowflake.
I am able to connect to Snowflake and execute the tutorial Scala code from the command line, but when I try to create the stored procedure in step 10 I get an error message: Package 'com.snowflake:snowpark:latest' is not supported.
Does anybody know how to resolve this so that I can create the stored procedure with my scala code?
I am using standard Snowflake on AWS and executing in a worksheet under the new interface. The connection settings mirror the programmatic settings that work for user, role, warehouse, database, and schema.
Here is the SQL code:
create or replace procedure discoverHappyTweets()
returns string
language scala
runtime_version=2.12
packages=('com.snowflake:snowpark:latest')
imports=('@snowpark_demo_udf_dependency_jars/ejml-0.23.jar','@snowpark_demo_udf_dependency_jars/slf4j-api.jar','@snowpark_demo_udf_dependency_jars/stanford-corenlp-3.6.0-models.jar','@snowpark_demo_udf_dependency_jars/stanford-corenlp-3.6.0.jar')
handler = 'UDFDemo.discoverHappyTweets'
target_path = '@snowpark_demo_udf_dependency_jars/discoverHappyTweets.jar'
as
$$
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.SaveMode.Overwrite
import com.snowflake.snowpark.types.{StringType, StructField, StructType}
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
// import org.apache.log4j.{Level, Logger}
/**
* Demonstrates how to use Snowpark to create user-defined functions (UDFs)
* in Scala.
*
* Before running the main method of this class, download the data and JAR files
* needed for the demo, then run the main method of the UDFDemoSetup class
* to upload those files to internal stages.
*/
object UDFDemo {
// The name of the internal stage for the demo data.
val dataStageName = "snowpark_demo_data"
// The name of the internal stage for the JAR files needed by the UDF.
val jarStageName = "snowpark_demo_udf_dependency_jars"
// The name of the file containing the dataset.
val dataFilePattern = "training.1600000.processed.noemoticon.csv"
/*
* Reads tweet data from the demo CSV file from a Snowflake stage and
* returns the data in a Snowpark DataFrame for analysis.
*/
def collectTweetData(session: Session): DataFrame = {
// Import names from the implicits object, which allows you to use shorthand
// to refer to columns in a DataFrame (e.g. `'columnName` and `$"columnName"`).
import session.implicits._
Console.println("\n=== Setting up the DataFrame for the data in the stage ===\n")
// Define the schema for the CSV file containing the demo data.
val schema = Seq(
StructField("target", StringType),
StructField("ids", StringType),
StructField("date", StringType),
StructField("flag", StringType),
StructField("user", StringType),
StructField("text", StringType),
)
// Read data from the demo file in the stage into a Snowpark DataFrame.
// dataStageName is the name of the stage that was created
// when you ran UDFDemoSetup earlier, and dataFilePattern is
// the pattern matching the files that were uploaded to that stage.
val origData = session
.read
.schema(StructType(schema))
.option("compression", "gzip")
.csv(s"#$dataStageName/$dataFilePattern")
// Drop all of the columns except the column containing the text of the tweet
// and return the first 100 rows.
val tweetData = origData.drop('target, 'ids, 'date, 'flag, 'user).limit(100)
Console.println("\n=== Retrieving the data and printing the text of the first 10 tweets")
// Display some of the data.
tweetData.show()
// Return the tweet data for sentiment analysis.
return tweetData
}
/*
* Determines the sentiment of the words in a string of text by using the
* Stanford NLP API (https://nlp.stanford.edu/nlp/javadoc/javanlp/).
*/
def analyze(text: String): Int = {
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment")
lazy val pipeline = new StanfordCoreNLP(props)
lazy val annotation = pipeline.process(text)
annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).forEach(sentence => {
lazy val tree = sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
return RNNCoreAnnotations.getPredictedClass(tree)
})
0
}
/*
* Creates a user-defined function (UDF) for sentiment analysis. This function
* registers the analyze function as a UDF, along with its dependency JAR files.
*/
def createUDF(session: Session): UserDefinedFunction = {
Console.println("\n=== Adding dependencies for your UDF ===\n")
// Register CoreNLP library JAR files as dependencies to support
// the UDF. The JAR files are already in the Snowflake stage named by
// jarStageName. The stage was created and JARs were uploaded when you ran
// the code in UDFDemoSetup.scala.
session.addDependency(s"#$jarStageName/stanford-corenlp-3.6.0.jar.gz")
session.addDependency(s"#$jarStageName/stanford-corenlp-3.6.0-models.jar.gz")
session.addDependency(s"#$jarStageName/slf4j-api.jar.gz")
session.addDependency(s"#$jarStageName/ejml-0.23.jar.gz")
Console.println("\n=== Creating the UDF ===\n")
// Register the analyze function as a UDF that analyzes the sentiment of
// text. Each value in the column that you pass to the UDF is passed to the
// analyze method.
val sentimentFunc = udf(analyze(_))
return sentimentFunc
}
/*
* Analyzes tweet data, discovering tweets with a happy sentiment and saving
* those tweets to a table in the database.
*/
def processHappyTweets(session: Session, sentimentFunc: UserDefinedFunction, tweetData: DataFrame): Unit = {
// Import names from the `implicits` object so you can use shorthand to refer
// to columns in a DataFrame (for example, `'columnName` and `$"columnName"`).
import session.implicits._
Console.println("\n=== Creating a transformed DataFrame that contains the results from calling the UDF ===\n")
// Call the UDF on the column that contains the content of the tweets.
// Create and return a new `DataFrame` that contains a "sentiment" column.
// This column contains the sentiment value returned by the UDF for the text
// in each row.
val analyzed = tweetData.withColumn("sentiment", sentimentFunc('text))
Console.println("\n=== Creating a transformed DataFrame with just the happy sentiments ===\n")
// Create a new DataFrame that contains only the tweets with happy sentiments.
val happyTweets = analyzed.filter('sentiment === 3)
Console.println("\n=== Retrieving the data and printing the first 10 tweets ===\n")
// Display the first 10 tweets with happy sentiments.
happyTweets.show()
Console.println("\n=== Saving the data to the table demo_happy_tweets ===\n")
// Write the happy tweet data to the table.
happyTweets.write.mode(Overwrite).saveAsTable("demo_happy_tweets")
}
/*
* Reads tweet data from a demo CSV, creates a UDF, then uses the UDF to
* discover the sentiment of tweet text.
*/
def discoverHappyTweets(session: Session): String = {
// Collect tweet data from the demo CSV.
val tweetData = collectTweetData(session)
// Register a user-defined function for determining tweet sentiment.
val sentimentFunc = createUDF(session)
// Analyze tweets to discover those with a happy sentiment.
val happyTweets = processHappyTweets(session, sentimentFunc, tweetData)
"Complete"
}
}
$$;
We have a Beam pipeline written in Java that we run on GCP Dataflow. It's very simple: it takes a SQL query as a PipelineOption, issues that SQL query against BigQuery, and for every row in the returned dataset constructs a message and puts it onto a pubsub topic.
import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* The {@code BigQueryEventReplayer} pipeline runs a supplied SQL query
* against BigQuery and sends the results one-by-one to PubSub.
* The query MUST return a column named 'json'; it is this column
* (and ONLY this column) that will be sent onward. The column must be a String type
* and should be valid JSON.
*/
public class BigQueryEventReplayer {
private static final Logger logger = LoggerFactory.getLogger(BigQueryEventReplayer.class);
/**
* Options for the BigQueryEventReplayer. See descriptions for more info
*/
public interface Options extends PipelineOptions {
#Description("SQL query to be run."
+ "An SQL string literal which will be run 'as is'")
#Required
ValueProvider<String> getBigQuerySql();
void setBigQuerySql(ValueProvider<String> value);
#Description("The name of the topic which data should be published to. "
+ "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
#Required
ValueProvider<String> getOutputTopic();
void setOutputTopic(ValueProvider<String> value);
#Description("The ID of the BigQuery dataset targeted by the event")
#Required
ValueProvider<String> getBigQueryTargetDataset();
void setBigQueryTargetDataset(ValueProvider<String> value);
#Description("The ID of the BigQuery table targeted by the event")
#Required
ValueProvider<String> getBigQueryTargetTable();
void setBigQueryTargetTable(ValueProvider<String> value);
#Description("The SourceSystem attribute of the event")
#Required
ValueProvider<String> getSourceSystem();
void setSourceSystem(ValueProvider<String> value);
}
/**
* Takes the data from the TableRow and prepares it for the PubSub, including
* adding attributes to ensure the payload is routed correctly.
*/
// We would rather use a SimpleFunction here but then we wouldn't be able
// to inject our value providers. So instead we hackishly make a nested class
public static class MapQueryToPubsub extends DoFn<TableRow, PubsubMessage> {
private final ValueProvider<String> targetDataset;
private final ValueProvider<String> targetTable;
private final ValueProvider<String> sourceSystem;
MapQueryToPubsub(
ValueProvider<String> targetDataset,
ValueProvider<String> targetTable,
ValueProvider<String> sourceSystem) {
this.targetDataset = targetDataset;
this.targetTable = targetTable;
this.sourceSystem = sourceSystem;
}
/**
* Entry point of DoFn for Dataflow.
*/
@ProcessElement
public void processElement(ProcessContext c) {
TableRow row = c.element();
if (!row.containsKey("json")) {
logger.warn("table does not contain column named 'json'");
}
Map<String, String> attributes = new HashMap<>();
attributes.put("sourceSystem", sourceSystem.get());
attributes.put("targetDataset", targetDataset.get());
attributes.put("targetTable", targetTable.get());
String json = (String) row.get("json");
c.output(new PubsubMessage(json.getBytes(), attributes));
}
}
/**
* Run the pipeline. This is the entrypoint for running 'locally'
*/
public static void main(String[] args) {
// Parse the user options passed from the command-line
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
run(options);
}
/**
* Run the pipeline. This is the entrypoint that GCP will use
*/
public static PipelineResult run(Options options) {
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("Read from BigQuery query",
BigQueryIO.readTableRows().fromQuery(options.getBigQuerySql()).usingStandardSql().withoutValidation()
.withTemplateCompatibility())
.apply("Map data to PubsubMessage",
ParDo.of(
new MapQueryToPubsub(
options.getBigQueryTargetDataset(),
options.getBigQueryTargetTable(),
options.getSourceSystem()
)
)
)
.apply("Write message to PubSub", PubsubIO.writeMessages().to(options.getOutputTopic()));
return pipeline.run();
}
}
The BigQuery data being queried is essentially a log of events. We have recently determined that the order in which we insert those events onto the pubsub topic is important. We can determine the correct order by using an ORDER BY in the query that we issue against BigQuery; however, we are skeptical as to whether that order will be respected when the data gets inserted onto the pubsub topic.
Our main concern is in this code:
pipeline.apply("Read from BigQuery query",
BigQueryIO.readTableRows().fromQuery(options.getBigQuerySql()).usingStandardSql().withoutValidation()
.withTemplateCompatibility())
that simple command manifests as a large composite step in the Dataflow UI.
There is a lot happening in that step (shuffles, etc.), and many of the sub-steps are themselves made up of multiple sub-steps. Moreover, one of the sub-steps is called "ReadFiles", which makes me think that perhaps Dataflow is writing the data to some sort of temporary file store. All in all, this leads me to doubt that an ORDER BY in the supplied SQL query will be preserved when the rows get published to pubsub.
Does beam/Dataflow offer any guarantee that the ORDER BY will be preserved in this scenario or am I going to have to introduce a sort into my pipeline to guarantee that the desired order is adhered to?
The BigQueryIO read basically consists of an export job that writes the query/table results to GCS as Avro, followed by a read from those files (plus some other steps). So it won't preserve the order, since the reads happen in parallel and there will be multiple threads reading chunks of the created file(s).
Generally speaking, distributed processing systems such as Dataflow (or Spark, etc.) don't preserve order and are bad at ordering, given the parallel nature of their work. Bear in mind that to sort elements you need to hold everything in a single worker.
In fact, even in BigQuery, an ORDER BY is quite a demanding operation.
It's hard to find workarounds for this, since these systems are not built for this type of task. I can think of adding a ROW_NUMBER, using that as the timestamp and adding a window, but this is quite use-case specific.
Also, PubSubIO won't preserve the order when publishing.
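To make that ROW_NUMBER idea a bit more concrete, here is a minimal sketch (not a Beam guarantee, just one way to let the consumer restore order). It assumes the query is changed to add something like ROW_NUMBER() OVER (ORDER BY event_time) AS seq, where both the ordering column and the seq alias are hypothetical names:
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: Dataflow still publishes in arbitrary order; the "seq" attribute only
// gives the subscriber enough information to re-establish the order on its side.
// (The sourceSystem/targetDataset/targetTable attributes from the original DoFn
// are omitted here for brevity.)
public class MapQueryToOrderedPubsub extends DoFn<TableRow, PubsubMessage> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow row = c.element();
    Map<String, String> attributes = new HashMap<>();
    // "seq" is the hypothetical ROW_NUMBER column added in the SQL query.
    attributes.put("seq", String.valueOf(row.get("seq")));
    String json = (String) row.get("json");
    c.output(new PubsubMessage(json.getBytes(StandardCharsets.UTF_8), attributes));
  }
}
A subscriber that needs strict ordering would then have to buffer and sort on the seq attribute before processing.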
I have an Apache Beam task that reads from a MySQL source using JDBC and it's supposed to write the data as it is to a BigQuery table. No transformation is performed at this point, that will come later on, for the moment I just want the database output to be directly written into BigQuery.
This is the main method trying to perform this operation:
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline p = Pipeline.create(options);
// Build the table schema for the output table.
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("phone").setType("STRING"));
fields.add(new TableFieldSchema().setName("url").setType("STRING"));
TableSchema schema = new TableSchema().setFields(fields);
p.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/db_name")
.withUsername("user")
.withPassword("pass"))
.withQuery("SELECT phone_number, identity_profile_image FROM scraper_caller_identities LIMIT 100")
.withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
return KV.of(resultSet.getString(1), resultSet.getString(2));
}
})
.apply(BigQueryIO.Write
.to(options.getOutput())
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)));
p.run();
}
But when I execute the template using maven, I get the following error:
Test.java:[184,6] cannot find symbol symbol: method
apply(com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.Bound)
location: class org.apache.beam.sdk.io.jdbc.JdbcIO.Read<com.google.cloud.dataflow.sdk.values.KV<java.lang.String,java.lang.String>>
It seems that I'm not passing BigQueryIO.Write the expected data collection and that's what I am struggling with at the moment.
How can I make the data coming from MySQL meet BigQuery's expectations in this case?
I think that you need to provide a PCollection<TableRow> to BigQueryIO.Write instead of the PCollection<KV<String,String>> type that the RowMapper is outputting.
Also, please use the correct column name and value pairs when setting the TableRow.
Note: I think that your KVs are the phone and url values (e.g. {"555-555-1234": "http://www.url.com"}), not the column name and value pairs (e.g. {"phone": "555-555-1234", "url": "http://www.url.com"})
See the example here:
https://beam.apache.org/documentation/sdks/javadoc/0.5.0/
Would you please give this a try and let me know if it works for you? Hope this helps.
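For illustration, here is a rough sketch of that missing conversion step, written against the Beam-style API (the exact builder names differ slightly between the old Dataflow SDK and Beam, so treat this as a guide rather than drop-in code). It assumes the KV holds (phone, url) as noted above and reuses the schema and options from your snippet:
// Sketch: keep the JdbcIO read exactly as you have it, but capture its output
// as a PCollection<KV<String, String>> instead of chaining the BigQuery write
// onto the JdbcIO builder.
PCollection<KV<String, String>> phoneAndUrl = p.apply(
    JdbcIO.<KV<String, String>>read()
        /* ...same DataSourceConfiguration, query, and RowMapper as above... */);

phoneAndUrl
    .apply("ToTableRow", ParDo.of(new DoFn<KV<String, String>, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(new TableRow()
            .set("phone", c.element().getKey())
            .set("url", c.element().getValue()));
      }
    }))
    .apply(BigQueryIO.writeTableRows()
        .to(options.getOutput())
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
The key point is that the BigQuery write is applied to the PCollection produced by the read, not chained onto the JdbcIO.Read builder itself, which is what the "cannot find symbol" error is complaining about.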
I am exporting some data to CSV with Dataflow, but beyond the data I want to add the column names as the first line of the output file, such as:
col_name1, col_name2, col_name3, col_name4 ...
data1.1, data1.2, data1.3, data1.4 ...
data2.1 ...
Is there any way to do this with the current API? (I searched around TextIO.Write but didn't find anything that seems relevant.) Or is there any way I could sort of "insert" the column names at the head of the to-be-exported PCollection and enforce that the data be written in order?
There is no built-in way to do that using TextIO.Write. PCollections are unordered, so it isn't possible to add an element to the front. You could write a custom BoundedSink which does this.
Custom sink APIs are now available if you want to be the brave one to craft a CSV sink. A current workaround, which builds up the output as a single string and outputs it all in finishBundle:
PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
private static final long serialVersionUID = 0;
String new_line = System.getProperty("line.separator");
String csv_header = "id, stuff1, stuff2, stuff3" + new_line;
StringBuilder csv_body = new StringBuilder().append(csv_header);
@Override
public void processElement(ProcessContext c) {
csv_body.append(c.element()).append(new_line);
}
@Override
public void finishBundle(Context c) throws Exception {
c.output(csv_body.toString());
}
})).apply(TextIO.Write.named("WriteData").to(options.getOutput()));
This will only work if your BIG output string fits in memory
As of Dataflow SDK version 1.7.0, you have the withHeader function in TextIO.Write.
So you can do this:
TextIO.Write.named("WriteToText")
.to("/path/to/the/file")
.withHeader("col_name1,col_name2,col_name3,col_name4")
.withSuffix(".csv"));
A new line character is automatically added to the end of the header.
As a followup question to the following question and answer:
https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
I'd like to confirm with the Google Dataflow engineering team (@jkff) whether the 3rd option proposed by Eugene is at all possible with Google Dataflow:
"have a ParDo that takes these keys and creates the BigQuery tables, and another ParDo that takes the data and streams writes to the tables"
My understanding is that a ParDo/DoFn processes each element individually, so how could we specify a table name (a function of the keys passed in from side inputs) when writing out from processElement of a ParDo/DoFn?
Thanks.
Updated with a DoFn, which is obviously not working since c.element().getValue() is not a PCollection.
PCollection<KV<String, Iterable<String>>> output = ...;
public class DynamicOutput2Fn extends DoFn<KV<String, Iterable<String>>, Integer> {
private final PCollectionView<List<String>> keysAsSideinputs;
public DynamicOutput2Fn(PCollectionView<List<String>> keysAsSideinputs) {
this.keysAsSideinputs = keysAsSideinputs;
}
@Override
public void processElement(ProcessContext c) {
List<String> keys = c.sideInput(keysAsSideinputs);
String key = c.element().getKey();
//the below is not working!!! How could we write the value out to a sink, be it gcs file or bq table???
c.element().getValue().apply(ParDo.of(new FormatLineFn()))
.apply(TextIO.Write.to(key));
c.output(1);
}
}
The BigQueryIO.Write transform does not support this. The closest thing you can do is to use per-window tables, and encode whatever information you need to select the table in the window objects by using a custom WindowFn.
If you don't want to do that, you can make BigQuery API calls directly from your DoFn. With this, you can set the table name to anything you want, as computed by your code. This could be looked up from a side input, or computed directly from the element the DoFn is currently processing. To avoid making too many small calls to BigQuery, you can batch up the requests using finishBundle().
You can see how the Dataflow runner does the streaming import here:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/BigQueryTableInserter.java
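As an illustration of that batching idea, here is a minimal sketch using the google-cloud-bigquery client library rather than the SDK-internal BigQueryTableInserter linked above (the dataset name, element type, and batch size below are all assumptions, not part of the original answer):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sketch: buffers elements during the bundle and streams them to a per-element
// table in finishBundle. Assumes each element is a KV of (tableName, row-as-map).
public class DynamicBigQueryWriteFn extends DoFn<KV<String, Map<String, Object>>, Void> {
  private static final int MAX_BUFFER_SIZE = 500; // flush in chunks to keep requests small
  private transient BigQuery bigquery;
  private transient List<KV<String, Map<String, Object>>> buffer;

  @StartBundle
  public void startBundle() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
    if (buffer.size() >= MAX_BUFFER_SIZE) {
      flush();
    }
  }

  @FinishBundle
  public void finishBundle() {
    flush();
  }

  private void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    // Group buffered rows by target table so each table gets a single insertAll call.
    Map<String, InsertAllRequest.Builder> requestsByTable = new HashMap<>();
    for (KV<String, Map<String, Object>> element : buffer) {
      requestsByTable
          .computeIfAbsent(element.getKey(),
              // "my_dataset" is a placeholder; substitute your own dataset.
              table -> InsertAllRequest.newBuilder(TableId.of("my_dataset", table)))
          .addRow(element.getValue());
    }
    for (InsertAllRequest.Builder request : requestsByTable.values()) {
      // In real code, inspect the InsertAllResponse for per-row errors and retry.
      bigquery.insertAll(request.build());
    }
    buffer.clear();
  }
}
This keeps the per-bundle request count low, at the cost of holding the buffered rows in worker memory until the bundle finishes.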