Error while reading data, error message: JSON table encountered too many errors, giving up. Rows - google-cloud-dataflow

I have two files and am doing an inner join using CoGroupByKey in Apache Beam.
When I write the rows to BigQuery, it gives me the following error.
RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_614_c4a563c648634e9dbbf7be3a56578b6d_2f196decc8984a0d83dee92e19054ffb failed. Error Result: <ErrorProto
location: 'gs://dataflow4bigquery/temp/bq_load/06bfafaa9dbb47338ad4f3a9914279fe/dotted-transit-351803.test_dataflow.inner_join/f714c1ac-c234-4a37-bf51-c725a969347a'
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'
reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']
-----------------code-----------------------
from apache_beam.io.gcp.internal.clients import bigquery
import apache_beam as beam


def retTuple(element):
    thisTuple = element.split(',')
    return (thisTuple[0], thisTuple[1:])


def jstr(cstr):
    import datetime
    left_dict = cstr[1]['dep_data']
    right_dict = cstr[1]['loc_data']
    for i in left_dict:
        for j in right_dict:
            id, name, rank, dept, dob, loc, city = ([cstr[0]] + i + j)
            json_str = {
                "id": id,
                "name": name,
                "rank": rank,
                "dept": dept,
                "dob": datetime.datetime.strptime(dob, "%d-%m-%Y").strftime("%Y-%m-%d").strip("'"),
                "loc": loc,
                "city": city,
            }
    return json_str


table_spec = 'dotted-transit-351803:test_dataflow.inner_join'
table_schema = 'id:INTEGER,name:STRING,rank:INTEGER,dept:STRING,dob:STRING,loc:INTEGER,city:STRING'
gcs = 'gs://dataflow4bigquery/temp/'

p1 = beam.Pipeline()

# Key the rows of each file by the id in their first column so they can be CoGrouped.
dep_rows = (
    p1
    | "Reading File 1" >> beam.io.ReadFromText('dept_data.txt')
    | 'Pair each employee with key' >> beam.Map(retTuple)  # {149633CM : [Marco,10,Accounts,1-01-2019]}
)

loc_rows = (
    p1
    | "Reading File 2" >> beam.io.ReadFromText('location.txt')
    | 'Pair each loc with key' >> beam.Map(retTuple)  # {149633CM : [9876843261,New York]}
)

results = (
    {'dep_data': dep_rows, 'loc_data': loc_rows}
    | beam.CoGroupByKey()
    | beam.Map(jstr)
    | beam.io.WriteToBigQuery(
        custom_gcs_temp_location=gcs,
        table=table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        additional_bq_parameters={'timePartitioning': {'type': 'DAY'}}
    )
)

p1.run().wait_until_finish()
I am running it on GCP using the Dataflow runner.
When I print the json_str string, the output is valid JSON.
E.g.:
{'id': '149633CM', 'name': 'Marco', 'rank': '10', 'dept': 'Accounts', 'dob': '2019-01-31', 'loc': '9204232778', 'city': 'New York'}
{'id': '212539MU', 'name': 'Rebekah', 'rank': '10', 'dept': 'Accounts', 'dob': '2019-01-31', 'loc': '9995440673', 'city': 'Denver'}
The schema I have defined is also correct.
But I am getting that error when loading it into BigQuery.

After doing some research, I finally solved it.
It was a schema error.
The id column values are like 149633CM.
I had given the data type of id as INTEGER, but when I tried to load the JSON with bq using --autodetect, bq marked the data type of id as STRING.
After that, I changed the data type of the id column to STRING in the schema in my code.
And it worked. The table was created and loaded.
But there is one thing I don't get: if the first 6 characters of the id column are digits, why does INTEGER not work while STRING does?

Because the data type is parsed from the whole field value, not just the first 6 characters. If you dropped the last 2 characters, you could use INTEGER.
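For reference, a minimal sketch of the corrected schema string from the fix described above; the only change relative to the pipeline code is that id is declared as STRING:

# Corrected schema: id is STRING because values such as '149633CM' contain
# letters and cannot be loaded into an INTEGER column.
table_schema = 'id:STRING,name:STRING,rank:INTEGER,dept:STRING,dob:STRING,loc:INTEGER,city:STRING'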

Related

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads a Pub/Sub message (currently just a single string key), creates a very short window, generates 3 integers keyed by this key via a beam.ParDo, and applies a simple Map that creates a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On Dataflow the first two steps print, but the final Map with the side input is never called, presumably because it somehow ends up in an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
from typing import Iterable

import apache_beam as beam

p = beam.Pipeline(options=pipeline_options)

pubsub_message = (
    p
    | beam.io.gcp.pubsub.ReadFromPubSub(
        subscription='projects/myproject/testsubscription')
    | 'SourceWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))


def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i


def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'


items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)


def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])


_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?
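For concreteness, a hedged sketch of one of the variants described above: re-windowing the side-input collection with the same fixed window and trigger before wrapping it in AsDict. It reuses pubsub_message, _create_info, _print_item and items from the code above, and is only a concrete rendering of what the question says was tried, not a confirmed fix.

# Sketch only, reusing names from the code above: give the side-input
# collection the same FixedWindows/trigger as the main input before AsDict.
windowed_info = (
    pubsub_message
    | 'CreateInfoWindowed' >> beam.Map(_create_info)
    | 'InfoWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))

_ = items | 'MapWithWindowedSideInput' >> beam.Map(
    _print_item, info_dict=beam.pvalue.AsDict(windowed_info))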

Apache Beam Python - SQL Transform with named PCollection Issue

I am trying to execute the code below, in which I am using a NamedTuple for the PCollection and SqlTransform to do a simple select.
As per the video link (4:06): https://www.youtube.com/watch?v=zx4p-UNSmrA
Instead of using PCOLLECTION in the SqlTransform query, named PCollections can also be provided, as below.
Code Block
import typing

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform


class EmployeeType(typing.NamedTuple):
    name: str
    age: int


beam.coders.registry.register_coder(EmployeeType, beam.coders.RowCoder)

p = beam.Pipeline()

pcol = (p
        | "Create" >> beam.Create([EmployeeType(name="ABC", age=10)])
                          .with_output_types(EmployeeType))

(
    {'a': pcol}
    | SqlTransform(""" SELECT age FROM a """)
    | "Map" >> beam.Map(lambda row: row.age)
    | "Print" >> beam.Map(print)
)

p.run()
However, the code block above errors out with the following error:
Caused by: org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.sql.validate.SqlValidatorException: Object 'a' not found
The Apache Beam SDK used is 2.35.0. Are there any known limitations in using named PCollections?
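No answer is recorded in this thread. For comparison, here is a minimal, self-contained sketch of the single-input form mentioned above, which refers to the input by the default table name PCOLLECTION rather than a dict of named PCollections; it is only the contrast case from the question, not a statement about whether the named form is supported in 2.35.0.

import typing

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform


class EmployeeType(typing.NamedTuple):
    name: str
    age: int


beam.coders.registry.register_coder(EmployeeType, beam.coders.RowCoder)

with beam.Pipeline() as p:
    pcol = (p
            | "Create" >> beam.Create([EmployeeType(name="ABC", age=10)])
                              .with_output_types(EmployeeType))
    (pcol
     # Default single-input table name instead of a dict of named inputs.
     | SqlTransform(""" SELECT age FROM PCOLLECTION """)
     | "Map" >> beam.Map(lambda row: row.age)
     | "Print" >> beam.Map(print))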

Apache Beam : How to return multiple outputs

In the function below, I want to return the important_col variable as well.
import apache_beam as beam
import pandas as pd
from apache_beam.options.pipeline_options import PipelineOptions


class FormatInput(beam.DoFn):
    def process(self, element):
        """Format the input to the desired shape."""
        df = pd.DataFrame([element], columns=element.keys())
        if 'reqd' in df.columns:
            important_col = 'reqd'
        elif 'customer' in df.columns:
            important_col = 'customer'
        elif 'phone' in df.columns:
            important_col = 'phone'
        else:
            raise ValueError(['Important columns not specified'])
        output = df.to_dict('records')
        return output


with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    clean_csv = (p
                 | 'Read input file' >> beam.dataframe.io.read_csv('raw_data.csv'))
    to_process = clean_csv | 'pre-processing' >> beam.ParDo(FormatInput())
In the above pipeline, I want to return the important_col variable from FormatInput.
Once I have that variable, I want to pass it as an argument to the next step in the pipeline.
I also want to dump to_process to a CSV file.
I tried the following, but none of them worked.
I converted to_process with to_dataframe and tried to_csv; I got an error.
I also tried to dump the PCollection to CSV, but I am not sure how to do that. I referred to the official Apache Beam documentation, but I don't find any documents similar to my use case.
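The thread above does not include an answer. As one possible pattern for emitting more than one output from a DoFn, here is a minimal, hedged sketch using Beam's tagged outputs (beam.pvalue.TaggedOutput); the tag names, the sample element, and the AsSingleton suggestion are illustrative assumptions, not the asker's actual solution.

import apache_beam as beam
import pandas as pd


class FormatInputWithTag(beam.DoFn):
    # Sketch: emit formatted records on the main output and the detected
    # column name on a separate tagged output (tag names are illustrative).
    def process(self, element):
        df = pd.DataFrame([element], columns=element.keys())
        if 'reqd' in df.columns:
            important_col = 'reqd'
        elif 'customer' in df.columns:
            important_col = 'customer'
        elif 'phone' in df.columns:
            important_col = 'phone'
        else:
            raise ValueError('Important columns not specified')
        yield beam.pvalue.TaggedOutput('important_col', important_col)
        for record in df.to_dict('records'):
            yield record


with beam.Pipeline() as p:
    rows = p | 'Create' >> beam.Create([{'customer': 'ABC', 'phone': '123'}])
    outputs = rows | 'pre-processing' >> beam.ParDo(FormatInputWithTag()).with_outputs(
        'important_col', main='records')
    to_process = outputs.records            # the formatted dicts
    important_cols = outputs.important_col  # the detected column names
    # important_cols can then feed a later step, e.g. as a side input via
    # beam.pvalue.AsSingleton(important_cols) when exactly one value is expected.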

Save last record according to a composite key. ksqlDB 0.6.0

I have a Kafka topic with the following data flow ( ksqldb_topic_01 ):
% Reached end of topic ksqldb_topic_01 [0] at offset 213
{"city":"Sevilla","temperature":20,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 214
{"city":"Madrid","temperature":5,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 215
{"city":"Sevilla","temperature":10,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 216
{"city":"Valencia","temperature":15,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 217
{"city":"Sevilla","temperature":15,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 218
{"city":"Madrid","temperature":20,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 219
{"city":"Valencia","temperature":15,"sensorId":"sensor02"}
% Reached end of topic ksqldb_topic_01 [0] at offset 220
{"city":"Sevilla","temperature":5,"sensorId":"sensor02"}
% Reached end of topic ksqldb_topic_01 [0] at offset 221
{"city":"Sevilla","temperature":5,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 222
I want to save in a table the latest value that arrives in the topic, for each city and sensorId.
In ksqlDB I create the following table:
CREATE TABLE ultimo_resgistro(city VARCHAR,sensorId VARCHAR,temperature INTEGER) WITH (KAFKA_TOPIC='ksqldb_topic_01', VALUE_FORMAT='json',KEY = 'sensorId,city');
DESCRIBE EXTENDED ULTIMO_RESGISTRO;
Name : ULTIMO_RESGISTRO
Type : TABLE
Key field : SENSORID
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : JSON
Kafka topic : ksqldb_topic_01 (partitions: 1, replication: 1)
Field | Type
-----------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
CITY | VARCHAR(STRING)
SENSORID | VARCHAR(STRING)
TEMPERATURE | INTEGER
-----------------------------------------
Checking that data is being processed:
select * from ultimo_resgistro emit changes;
+------------------+------------------+------------------+------------------+------------------+
|ROWTIME |ROWKEY |CITY |SENSORID |TEMPERATURE |
+------------------+------------------+------------------+------------------+------------------+
key cannot be null
Query terminated
The problem is that you need to set the key of the Kafka message correctly. You also cannot specify two fields in the KEY clause. Read more about this here
Here's an example of how to do it.
First up, load test data:
kafkacat -b kafka-1:39092 -P -t ksqldb_topic_01 <<EOF
{"city":"Madrid","temperature":20,"sensorId":"sensor03"}
{"city":"Madrid","temperature":5,"sensorId":"sensor03"}
{"city":"Sevilla","temperature":10,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":15,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":20,"sensorId":"sensor03"}
{"city":"Sevilla","temperature":5,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":5,"sensorId":"sensor02"}
{"city":"Valencia","temperature":15,"sensorId":"sensor02"}
{"city":"Valencia","temperature":15,"sensorId":"sensor03"}
EOF
Now in ksqlDB declare the schema over the topic - as a stream, because we need to repartition the data to add a key. If you control the producer to the topic then maybe you'd do this upstream and save a step.
CREATE STREAM sensor_data_raw (city VARCHAR, temperature DOUBLE, sensorId VARCHAR)
WITH (KAFKA_TOPIC='ksqldb_topic_01', VALUE_FORMAT='JSON');
Repartition the data based on the composite key.
SET 'auto.offset.reset' = 'earliest';
CREATE STREAM sensor_data_repartitioned WITH (VALUE_FORMAT='AVRO') AS
SELECT *
FROM sensor_data_raw
PARTITION BY city+sensorId;
Two things to note:
I'm taking the opportunity to reserialise into Avro; if you'd rather keep JSON throughout then just omit the WITH (VALUE_FORMAT='AVRO') clause.
When the data is repartitioned the ordering guarantees are lost, so in theory you may end up with events out of order after this.
At this point we can inspect the transformed topic:
ksql> PRINT SENSOR_DATA_REPARTITIONED FROM BEGINNING LIMIT 5;
Format:AVRO
1/24/20 9:55:54 AM UTC, Madridsensor03, {"CITY": "Madrid", "TEMPERATURE": 20.0, "SENSORID": "sensor03"}
1/24/20 9:55:54 AM UTC, Madridsensor03, {"CITY": "Madrid", "TEMPERATURE": 5.0, "SENSORID": "sensor03"}
1/24/20 9:55:54 AM UTC, Sevillasensor01, {"CITY": "Sevilla", "TEMPERATURE": 10.0, "SENSORID": "sensor01"}
1/24/20 9:55:54 AM UTC, Sevillasensor01, {"CITY": "Sevilla", "TEMPERATURE": 15.0, "SENSORID": "sensor01"}
1/24/20 9:55:54 AM UTC, Sevillasensor03, {"CITY": "Sevilla", "TEMPERATURE": 20.0, "SENSORID": "sensor03"}
Note that the key in the Kafka message (the second field, after the timestamp) is now set correctly, compared to the original data that had no key:
ksql> PRINT ksqldb_topic_01 FROM BEGINNING LIMIT 5;
Format:JSON
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Madrid","temperature":20,"sensorId":"sensor03"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Madrid","temperature":5,"sensorId":"sensor03"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":10,"sensorId":"sensor01"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":15,"sensorId":"sensor01"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":20,"sensorId":"sensor03"}
Now we can declare a table over the repartitioned data. Since I'm using Avro now I don't have to reenter the schema. If I was using JSON I would need to enter it again as part of this DDL.
CREATE TABLE ultimo_resgistro WITH (KAFKA_TOPIC='SENSOR_DATA_REPARTITIONED', VALUE_FORMAT='AVRO');
The table's key is implicitly taken from the ROWKEY, which is the key of the Kafka message.
ksql> SELECT ROWKEY, CITY, SENSORID, TEMPERATURE FROM ULTIMO_RESGISTRO EMIT CHANGES;
+------------------+----------+----------+-------------+
|ROWKEY |CITY |SENSORID |TEMPERATURE |
+------------------+----------+----------+-------------+
|Madridsensor03 |Madrid |sensor03 |5.0 |
|Sevillasensor03 |Sevilla |sensor03 |20.0 |
|Sevillasensor01 |Sevilla |sensor01 |5.0 |
|Sevillasensor02 |Sevilla |sensor02 |5.0 |
|Valenciasensor02 |Valencia |sensor02 |15.0 |
|Valenciasensor03 |Valencia |sensor03 |15.0 |
If you want to take advantage of pull queries (in order to get the latest value) then you need to go and upvote (or contribute a PR 😁) this issue.

How to set coder for Google Dataflow Pipeline in Python?

I am creating a custom Dataflow job in Python to ingest data from Pub/Sub to BigQuery. The table has many nested fields.
Where can I set the coder in this pipeline?
avail_schema = parse_table_schema_from_json(bg_out_schema)
coder = TableRowJsonCoder(table_schema=avail_schema)

with beam.Pipeline(options=options) as p:
    # Read the text from PubSub messages.
    lines = (p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name")
               | 'Map' >> beam.Map(coder))

    # transformed = lines | 'Parse JSON to Dict' >> beam.Map(json.loads)

    transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
        "Project:DataSet.Table",
        schema=avail_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
Error: Map can be used only with callable objects. Received TableRowJsonCoder instead.
In the code above, the coder is applied to the message read from Pub/Sub, which is text.
WriteToBigQuery works with both dictionaries and TableRows. json.loads emits a dict, so you can simply use its output to write to BigQuery without applying any coder. Note that the fields in the dictionary have to match the table schema.
To avoid the coder issue, I would suggest using the following code.
avail_schema = parse_table_schema_from_json(bg_out_schema)

with beam.Pipeline(options=options) as p:
    # Read the text from PubSub messages.
    lines = p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name")

    transformed = lines | 'Parse JSON to Dict' >> beam.Map(json.loads)

    transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
        "Project:DataSet.Table",
        schema=avail_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
