Using C# to send an Avro message to Azure Event Hub and then de-serialize it using Scala Structured Streaming in Databricks Runtime 7.2 / Spark 3.0

So I have been banging my head against this for the last couple of days. I am having trouble de-serializing an Avro payload that we are generating and sending into Azure Event Hub. We are attempting to do this with Databricks Runtime 7.2 Structured Streaming, using the newer from_avro method described here to de-serialize the body of the event message.
import org.apache.spark.eventhubs._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.avro._
import org.apache.avro._
import org.apache.spark.sql.types._
import org.apache.spark.sql.avro.functions._
val connStr = "<EventHubConnectionstring>"
val customEventhubParameters =
  EventHubsConf(connStr.toString())
    .setMaxEventsPerTrigger(5)
    //.setStartingPosition(EventPosition.fromStartOfStream)

val incomingStream = spark
  .readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()
  .filter($"properties".getItem("TableName") === "Branches")
val avroSchema = s"""{"type":"record","name":"Branches","fields":[{"name":"_src_ChangeOperation","type":["null","string"]},{"name":"_src_CurrentTrackingId","type":["null","long"]},{"name":"_src_RecordExtractUTCTimestamp","type":"string"},{"name":"ID","type":["null","int"]},{"name":"BranchCode","type":["null","string"]},{"name":"BranchName","type":["null","string"]},{"name":"Address1","type":["null","string"]},{"name":"Address2","type":["null","string"]},{"name":"City","type":["null","string"]},{"name":"StateID","type":["null","int"]},{"name":"ZipCode","type":["null","string"]},{"name":"Telephone","type":["null","string"]},{"name":"Contact","type":["null","string"]},{"name":"Title","type":["null","string"]},{"name":"DOB","type":["null","string"]},{"name":"TimeZoneID","type":["null","int"]},{"name":"ObserveDaylightSaving","type":["null","boolean"]},{"name":"PaySummerTimeHour","type":["null","boolean"]},{"name":"PayWinterTimeHour","type":["null","boolean"]},{"name":"BillSummerTimeHour","type":["null","boolean"]},{"name":"BillWinterTimeHour","type":["null","boolean"]},{"name":"Deleted","type":["null","boolean"]},{"name":"LastUpdated","type":["null","string"]},{"name":"txJobID","type":["null","string"]},{"name":"SourceID","type":["null","string"]},{"name":"HP_UseHolPayHourMethod","type":["null","boolean"]},{"name":"HP_HourlyRatePercent","type":["null","float"]},{"name":"HP_RequiredWeeksOfEmployment","type":["null","float"]},{"name":"rgUseSystemSettings","type":["null","boolean"]},{"name":"rgDutySplitBy","type":["null","int"]},{"name":"rgBasePeriodDate","type":["null","string"]},{"name":"rgFirstDayOfWeek","type":["null","int"]},{"name":"rgDutyStartOfDayTime","type":["null","string"]},{"name":"rgHolidayStartOfDayTime","type":["null","string"]},{"name":"rgMinimumTimePeriod","type":["null","int"]},{"name":"rgLoadPublicTable","type":["null","boolean"]},{"name":"rgPOTPayPeriodID","type":["null","int"]},{"name":"rgPOT1","type":["null","string"]},{"name":"rgPOT2","type":["null","string"]},{"name":"Facsimile","type":["null","string"]},{"name":"CountryID","type":["null","int"]},{"name":"EmailAddress","type":["null","string"]},{"name":"ContractSecurityHistoricalWeeks","type":["null","int"]},{"name":"ContractSecurityFutureWeeks","type":["null","int"]},{"name":"TimeLinkTelephone1","type":["null","string"]},{"name":"TimeLinkTelephone2","type":["null","string"]},{"name":"TimeLinkTelephone3","type":["null","string"]},{"name":"TimeLinkTelephone4","type":["null","string"]},{"name":"TimeLinkTelephone5","type":["null","string"]},{"name":"AutoTakeMissedCalls","type":["null","boolean"]},{"name":"AutoTakeMissedCallsDuration","type":["null","string"]},{"name":"AutoTakeApplyDurationToCheckCalls","type":["null","boolean"]},{"name":"AutoTakeMissedCheckCalls","type":["null","boolean"]},{"name":"AutoTakeMissedCheckCallsDuration","type":["null","string"]},{"name":"DocumentLocation","type":["null","string"]},{"name":"DefaultPortalAccess","type":["null","boolean"]},{"name":"DefaultPortalSecurityRoleID","type":["null","int"]},{"name":"EmployeeTemplateID","type":["null","int"]},{"name":"SiteCardTemplateID","type":["null","int"]},{"name":"TSAllowancesHeaderID","type":["null","int"]},{"name":"TSMinimumWageHeaderID","type":["null","int"]},{"name":"TimeLinkClaimMade","type":["null","boolean"]},{"name":"TSAllowancePeriodBaseDate","type":["null","string"]},{"name":"TSAllowancePeriodID","type":["null","int"]},{"name":"TSMinimumWageCalcMethodID","type":["null","int"]},{"name":"FlexibleShiftsHeaderID","type":["null","int"]},{"name":"SchedulingUseSystemSettings","type":["null","boolean"]},{"name":"MinimumRestPeriod","type":["null","int"]},{"name":"TSMealBreakHeaderID","type":["null","int"]},{"name":"ServiceTracImportType","type":["null","int"]},{"name":"StandDownDiaryEventID","type":["null","int"]},{"name":"ScheduledDutyChangeMessageTemplateId","type":["null","int"]},{"name":"ScheduledDutyAddedMessageTemplateId","type":["null","int"]},{"name":"ScheduledDutyRemovedMessageTemplateId","type":["null","int"]},{"name":"NegativeMessageResponsesPermitted","type":["null","boolean"]},{"name":"PortalEventsStandardLocFirst","type":["null","boolean"]},{"name":"ReminderMessage","type":["null","boolean"]},{"name":"ReminderMessageDaysBefore","type":["null","int"]},{"name":"ReminderMessageTemplateId","type":["null","int"]},{"name":"ScheduledDutyChangeMessageAllowReply","type":["null","boolean"]},{"name":"ScheduledDutyAddedMessageAllowReply","type":["null","boolean"]},{"name":"PayAlertEscalationGroup","type":["null","int"]},{"name":"BudgetedPay","type":["null","int"]},{"name":"PayAlertVariance","type":["null","string"]},{"name":"BusinessUnitID","type":["null","int"]},{"name":"APH_Hours","type":["null","float"]},{"name":"APH_Period","type":["null","int"]},{"name":"APH_PeriodCount","type":["null","int"]},{"name":"AveragePeriodHoursRuleId","type":["null","int"]},{"name":"HolidayScheduleID","type":["null","int"]},{"name":"AutomationRuleProfileId","type":["null","int"]}]}"""
val decoded_df = incomingStream
  .select(
    from_avro($"body", avroSchema).alias("payload")
  )

val query1 = (
  decoded_df
    .writeStream
    .format("memory")
    .queryName("read_hub")
    .start()
)
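As a side note, since the sink above is the memory sink with queryName "read_hub", any batches that do make it through can be inspected from another notebook cell with something like:
// The memory sink exposes the streamed results as an in-memory table named after the query
spark.table("read_hub").show(truncate = false)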
I have verified that the payload we are sending has a valid schema, that it contains data, and that it reaches the streaming job in the notebook. I am also able to write the same generated payload out to a .avro file and de-serialize it just fine with the normal batch .read.format("avro") method.
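For comparison, this is roughly the batch read that does work (the path here is just a placeholder):
// Batch read of the same Avro payload written out as a file succeeds
val batchDf = spark.read
  .format("avro")
  .load("/mnt/landing/branches_sample.avro") // placeholder path
batchDf.printSchema()
batchDf.show(5, truncate = false)
The streaming job, however, fails with the following stack trace, which states that the data is malformed: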
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:413)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:361)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:322)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:329)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45)
at org.apache.spark.sql.execution.collect.Collector$.callExecuteCollect(Collector.scala:118)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:69)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:480)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:396)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2986)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3692)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2953)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3684)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3682)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2953)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:835)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:581)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:276)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:274)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:71)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:581)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:231)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:276)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:274)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:71)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:199)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:193)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:346)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:259)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 4 times, most recent failure: Lost task 0.3 in stage 37.0 (TID 84, 10.139.64.5, executor 0): org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:438)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:657)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:660)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -40
at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:100)
... 16 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2478)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2427)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2426)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2426)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1131)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1131)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1131)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2678)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2625)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2613)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:917)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2313)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:382)
... 46 more
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$7(WriteToDataSourceV2Exec.scala:438)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:477)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:385)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:657)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:660)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -40
at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:100)
... 16 more
Tech
C# Azure Function v3 on .NET Core, generating the Avro payload using Avro 1.8.2
The Avro payload is serialized to a byte array using the generic writer (not the specific writer) and sent to Azure Event Hub
Databricks Runtime 7.2 / Spark 3.0
Databricks notebooks written in Scala
Databricks Structured Streaming notebook to de-serialize the Avro message and send it to a Delta Lake table
NOT using any of the following:
Event Hub Capture
Kafka
Schema registry

OK, so I just figured out what the issue was: it was in how we were generating the Avro message before sending it to Event Hub. In our serialization method we were using var writer = new GenericDatumWriter&lt;GenericRecord&gt;(schema); and an IFileWriter&lt;GenericRecord&gt; to write to a memory stream, and then just taking the byte array of that stream, as seen below.
public byte[] Serialize(DataCapture data)
{
    var schema = GenerateSchema(data.Schema);
    var writer = new GenericDatumWriter<GenericRecord>(schema);

    using (var ms = new MemoryStream())
    {
        using (IFileWriter<GenericRecord> fileWriter = DataFileWriter<GenericRecord>.OpenWriter(writer, ms))
        {
            foreach (var jsonString in data.Rows)
            {
                var record = new GenericRecord(schema);
                var obj = JsonConvert.DeserializeObject<JObject>(jsonString);
                foreach (var column in data.Schema.Columns)
                {
                    switch (MapDataType(column.DataTypeName))
                    {
                        case AvroTypeEnum.Boolean:
                            record.Add(column.ColumnName, obj.GetValue(column.ColumnName).Value<bool?>());
                            break;
                        // Map all other data types etc. ... removed to shorten example
                        default:
                            record.Add(column.ColumnName, obj.GetValue(column.ColumnName).Value<string>());
                            break;
                    }
                }
                fileWriter.Append(record);
            }
        }
        return ms.ToArray();
    }
}
What we actually should do is use var writer = new DefaultWriter(schema); and var encoder = new BinaryEncoder(ms);, then write each record with writer.Write(record, encoder); before returning the byte array of the stream.
public byte[] Serialize(DataCapture data)
{
    var schema = GenerateSchema(data.Schema);
    var writer = new DefaultWriter(schema);

    using (var ms = new MemoryStream())
    {
        var encoder = new BinaryEncoder(ms);
        foreach (var jsonString in data.Rows)
        {
            var record = new GenericRecord(schema);
            var obj = JsonConvert.DeserializeObject<JObject>(jsonString);
            foreach (var column in data.Schema.Columns)
            {
                switch (MapDataType(column.DataTypeName))
                {
                    case AvroTypeEnum.Boolean:
                        record.Add(column.ColumnName, obj.GetValue(column.ColumnName).Value<bool?>());
                        break;
                    // Map all other data types etc. ... removed to shorten example
                    default:
                        record.Add(column.ColumnName, obj.GetValue(column.ColumnName).Value<string>());
                        break;
                }
            }
            writer.Write(record, encoder);
        }
        return ms.ToArray();
    }
}
So, lesson learned: not all Avro memory streams converted to byte[] are the same. The from_avro method will only de-serialize Avro data that has been binary-encoded with the BinaryEncoder class, not data created with the IFileWriter (which produces an Avro object container file, header and all). If there is something I should be doing instead please let me know, but this fixed my issue. Hopefully my pain will spare others the same.
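One debugging aside: the FAILFAST error in the trace above hints at it, but in Spark 3.0 from_avro also accepts an options map, so you can temporarily switch the parse mode to PERMISSIVE and get nulls instead of a failed micro-batch while tracking down an encoding mismatch like this one. A minimal sketch, reusing the avroSchema string and incomingStream from the question:
import org.apache.spark.sql.avro.functions.from_avro
import scala.collection.JavaConverters._

// PERMISSIVE turns unparseable records into nulls instead of failing the batch;
// useful only for diagnosis, not a fix for the underlying encoding mismatch
val avroOptions = Map("mode" -> "PERMISSIVE").asJava
val decodedPermissive = incomingStream
  .select(from_avro($"body", avroSchema, avroOptions).alias("payload"))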

Related

Apache Beam IllegalArgumentException on Google Dataflow with message `Not expecting a splittable ParDoSingle: should have been overridden`

I am trying to write a pipeline which periodically checks a Google Storage bucket for new .gz files (which are actually compressed .csv files) and then writes those records to a BigQuery table. The following code was working in batch mode before I added the .watchForNewFiles(...) and .withMethod(STREAMING_INSERTS) parts; I am expecting it to run in streaming mode with those changes. However, I am getting an exception that I can't find anything related to on the web. Here is my code:
public static void main(String[] args) {
    DataflowDfpOptions options = PipelineOptionsFactory.fromArgs(args)
            //.withValidation()
            .as(DataflowDfpOptions.class);
    Pipeline pipeline = Pipeline.create(options);

    Stopwatch sw = Stopwatch.createStarted();
    log.info("DFP data transfer from GS to BQ has started.");

    pipeline.apply("ReadFromStorage", TextIO.read()
            .from("gs://my-bucket/my-folder/*.gz")
            .withCompression(Compression.GZIP)
            .watchForNewFiles(
                    // Check for new files every 30 seconds
                    Duration.standardSeconds(30),
                    // Never stop checking for new files
                    Watch.Growth.never()
            )
    )
            .apply("TransformToTableRow", ParDo.of(new TableRowConverterFn()))
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                    .to(options.getTableId())
                    .withMethod(STREAMING_INSERTS)
                    .withCreateDisposition(CREATE_NEVER)
                    .withWriteDisposition(WRITE_APPEND)
                    .withSchema(TableSchema)); //todo: use withJsonSchema(String json) method instead

    pipeline.run().waitUntilFinish();
    log.info("DFP data transfer from GS to BQ is finished in {} seconds.", sw.elapsed(TimeUnit.SECONDS));
}
/**
 * Creates a TableRow from a CSV line
 */
private static class TableRowConverterFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        String[] split = c.element().split(",");
        // Ignore the header line.
        // Since this runs in parallel, we can't guarantee that the first line passed to this method is the header.
        if (split[0].equals("Time")) {
            log.info("Skipped header");
            return;
        }
        TableRow row = new TableRow();
        for (int i = 0; i < split.length; i++) {
            TableFieldSchema col = TableSchema.getFields().get(i);
            // String is the most common type, so it goes in the first if clause as a small optimization.
            if (col.getType().equals("STRING")) {
                row.set(col.getName(), split[i]);
            } else if (col.getType().equals("INTEGER")) {
                row.set(col.getName(), Long.valueOf(split[i]));
            } else if (col.getType().equals("BOOLEAN")) {
                row.set(col.getName(), Boolean.valueOf(split[i]));
            } else if (col.getType().equals("FLOAT")) {
                row.set(col.getName(), Float.valueOf(split[i]));
            } else {
                // Simply try to write it as a String
                // todo: Consider other BQ data types.
                row.set(col.getName(), split[i]);
            }
        }
        c.output(row);
    }
}
And the stack trace:
java.lang.IllegalArgumentException: Not expecting a splittable ParDoSingle: should have been overridden
at org.apache.beam.repackaged.beam_runners_google_cloud_dataflow_java.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
at org.apache.beam.runners.dataflow.PrimitiveParDoSingleFactory$PayloadTranslator.payloadForParDoSingle(PrimitiveParDoSingleFactory.java:167)
at org.apache.beam.runners.dataflow.PrimitiveParDoSingleFactory$PayloadTranslator.translate(PrimitiveParDoSingleFactory.java:145)
at org.apache.beam.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:206)
at org.apache.beam.runners.core.construction.SdkComponents.registerPTransform(SdkComponents.java:86)
at org.apache.beam.runners.core.construction.PipelineTranslation$1.visitPrimitiveTransform(PipelineTranslation.java:87)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:668)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:660)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:660)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:660)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:660)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:660)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:458)
at org.apache.beam.runners.core.construction.PipelineTranslation.toProto(PipelineTranslation.java:59)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:165)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:684)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:173)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at com.diply.data.App.main(App.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
Here is my command to publish the job on Dataflow:
clean compile exec:java -Dexec.mainClass=com.my.project.App "-Dexec.args=--runner=DataflowRunner --tempLocation=gs://my-bucket/tmp --tableId=Temp.TestTable --project=my-project --jobName=dataflow-dfp-streaming" -Pdataflow-runner
I am using Apache Beam version 2.5.0. Here is the relevant section from my pom.xml:
<properties>
    <beam.version>2.5.0</beam.version>
    <bigquery.version>v2-rev374-1.23.0</bigquery.version>
    <google-clients.version>1.23.0</google-clients.version>
    ...
</properties>
Running the code with Dataflow 2.4.0 gives a more explicit error: java.lang.UnsupportedOperationException: DataflowRunner does not currently support splittable DoFn
However, this answer suggests that this has been supported since 2.2.0. That is indeed the case, and following that remark you need to add the --streaming option to your -Dexec.args to force the pipeline into streaming mode.
I tested this with the code I supplied in the comments, with both your pom and mine, and both (1) produce your error without --streaming and (2) run fine with --streaming.
You might want to open a Beam GitHub issue, since as far as I know this behavior is not officially documented anywhere.
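For reference, taking the launch command from the question, the only change would be appending --streaming inside -Dexec.args, something like:
clean compile exec:java -Dexec.mainClass=com.my.project.App "-Dexec.args=--runner=DataflowRunner --tempLocation=gs://my-bucket/tmp --tableId=Temp.TestTable --project=my-project --jobName=dataflow-dfp-streaming --streaming" -Pdataflow-runner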

How do I get a continuation token for a bulk INSERT on Azure Cosmos DB?

I want to upload a CSV file that represents 10k documents to be added to my Cosmos DB collection in a manner that's fast and atomic. I have a stored procedure like the following pseudo-code:
function createDocsFromCSV(csv_text) {
    function parse(txt) { /* ... parsing code here ... */ }

    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var docs_to_create = parse(csv_text);

    for (var ii = 0; ii < docs_to_create.length; ii++) {
        var accepted = collection.createDocument(collection.getSelfLink(),
            docs_to_create[ii],
            function(err, doc_created) {
                if (err) throw new Error('Error' + err.message);
            });
        if (!accepted) {
            throw new Error('Timed out creating document ' + ii);
        }
    }
}
When I run it, the stored procedure creates about 1,200 documents before timing out (and therefore rolling back and not creating any of them).
Previously I had success updating (rather than creating) thousands of documents in a stored procedure, using continuation tokens and this answer as guidance: https://stackoverflow.com/a/34761098/277504. But after searching the documentation (e.g. https://azure.github.io/azure-documentdb-js-server/Collection.html) I don't see a way to get continuation tokens from creating documents the way I do for querying documents.
Is there a way to take advantage of stored procedures for bulk document creation?
It's important to note that stored procedures have bounded execution: all operations must complete within the server-specified request timeout. If an operation does not complete within that time limit, the transaction is automatically rolled back.
To simplify development around this time limit, all CRUD (Create, Read, Update, and Delete) operations return a Boolean value that indicates whether the operation will complete. That Boolean can be used as a signal to wrap up execution and to implement a continuation-based model for resuming work (illustrated in the code sample below). For more details, please refer to the docs.
The bulk-insert stored procedure from the question can implement the continuation model by returning the number of documents successfully created.
Pseudo-code:
function createDocsFromCSV(csv_text, count) {
    function parse(txt) { /* ... parsing code here ... */ }

    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var docs_to_create = parse(csv_text);

    // Start from the index the client passed in, so a re-run resumes where the last run stopped
    for (var ii = count; ii < docs_to_create.length; ii++) {
        var accepted = collection.createDocument(collection.getSelfLink(),
            docs_to_create[ii],
            function(err, doc_created) {
                if (err) throw new Error('Error' + err.message);
            });
        if (!accepted) {
            // Out of time: report how many documents have been handled so far and stop
            response.setBody(ii);
            return;
        }
    }
    // All remaining documents were created
    response.setBody(docs_to_create.length);
}
Then you could check the returned count on the client side and re-run the stored procedure with that count as the parameter to create the remaining documents, until the returned count reaches the length of the parsed csv_text.
Hope this helps.

TypeError: JSON.stringify cannot serialize cyclic structures

I am using the camera plugin in my Ionic 2 app. It works fine on Android, but it throws an error on iOS after taking a picture with the camera plugin, while I am converting the data URI to a Blob. Could anyone suggest where I am going wrong?
This is the error in Xcode:
/www/build/polyfills.js:2:30128 ERROR: error JSON.stringify()ing argument: TypeError: JSON.stringify cannot serialize cyclic structures.
And this one in the console:
Uncaught (in promise): Error: InvalidCharacterError: DOM Exception 5
atob#[native code] dataURItoBlob
This is the code that converts the data URI to a Blob:
function dataURItoBlob(dataURI) {
    // convert base64/URL-encoded data component to raw binary data held in a string
    var byteString;
    if (dataURI.split(',')[0].indexOf('base64') >= 0)
        byteString = atob(dataURI.split(',')[1]);
    else
        byteString = unescape(dataURI.split(',')[1]);

    // separate out the mime component
    var mimeString = dataURI.split(',')[0].split(':')[1].split(';')[0];

    // write the bytes of the string to a typed array
    var ia = new Uint8Array(byteString.length);
    for (var i = 0; i < byteString.length; i++) {
        ia[i] = byteString.charCodeAt(i);
    }
    return new Blob([ia], {type: mimeString});
}

Writing to Google Cloud Storage from PubSub using Cloud Dataflow using windowing

I am receiving messages in Dataflow via Pub/Sub in streaming mode (which is required for my use case).
Each message should be stored in its own file in GCS.
Since unbounded collections are not supported by TextIO.Write, I tried to divide the PCollection into windows that contain one element each, and to write each window to Google Cloud Storage.
Here is my code:
public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.create()
            .as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class);
    options.setProject(PROJECT_ID);
    options.setStagingLocation(STAGING_LOCATION);
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);

    PubsubIO.Read.Bound<String> readFromPubsub = PubsubIO.Read.named("ReadFromPubsub")
            .subscription(SUBSCRIPTION);

    PCollection<String> streamData = pipeline.apply(readFromPubsub);
    PCollection<String> windowedMessage = streamData.apply(
            Window.<String>triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1))).discardingFiredPanes());

    windowedMessage.apply(TextIO.Write.to("gs://pubsub-outputs/1"));

    pipeline.run();
}
I still receive the same error I got before windowing:
The DataflowPipelineRunner in streaming mode does not support TextIO.Write.
What is the code for doing what I described above?
TextIO.Write works with bounded PCollections; for an unbounded stream you could write to GCS yourself with the Cloud Storage API.
You could do:
PipeOptions options = data.getPipeline().getOptions().as(PipeOptions.class);
data.apply(WithKeys.of(new SerializableFunction<String, String>() {
            public String apply(String s) { return "mykey"; }
        }))
        .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(options.getTimeWrite()))))
        .apply(GroupByKey.create())
        .apply(Values.<Iterable<String>>create())
        .apply(ParDo.of(new StorageWrite(options)));
You create a window together with a GroupByKey operation, and then you can write the grouped iterable to Storage. The processElement of StorageWrite:
PipeOptions options = c.getPipelineOptions().as(PipeOptions.class);
String date = ISODateTimeFormat.date().print(c.window().maxTimestamp());
String isoDate = ISODateTimeFormat.dateTime().print(c.window().maxTimestamp());
String blobName = String.format("%s/%s/%s", options.getBucketRepository(), date, options.getFileOutName() + isoDate);

BlobId blobId = BlobId.of(options.getGCSBucket(), blobName);
WriteChannel writer = storage.writer(BlobInfo.builder(blobId).contentType("text/plain").build());
for (Iterator<String> it = c.element().iterator(); it.hasNext();) {
    writer.write(ByteBuffer.wrap(it.next().getBytes()));
}
writer.close();

Error while converting from NT to RDF/XML format in Jena

What is the meaning of the following error message?
I am attempting to convert dogfood.nt to its RDF/XML representation; what does the StackOverflowError indicate?
<j.12:Person rdf:about="http://data.semanticweb.org/person/rich-keller">
<j.12:name>Rich Keller</j.12:name>
<rdfs:label>Rich Keller</rdfs:label>
<j.3:affiliation rdf:resource="http://data.semanticweb.org/organization/nasa-ames-research-center"/>
<j.4:holdsRole rdf:resource="http://data.semanticweb.org/conference/iswc/2005/pc-member-at-iswc2005-research-track"/>
</j.12:PersException in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4568)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4568)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4272)
at java.util.regex.Pattern$Curly.match(Pattern.java:4234)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Branch.match(Pattern.java:4602)
Following is the code snippet used:
Model model11 = ModelFactory.createDefaultModel();
InputStream is1 = FileManager.get().open("dogfood4.nt");
if (is1 != null) {
    model11.read(is1, null, "N-TRIPLE");
    model11.write(os1, "RDF/XML");
} else {
    System.err.println("cannot read file");
}
I am using the Semantic Web Dog Food N-Triples data.
