GCP Dataflow processes invalid data - google-cloud-dataflow

We have an API that acts as a proxy between clients and Google Pub/Sub: it basically receives a JSON body and publishes it to the topic. The message is then processed by Dataflow, which stores it in BigQuery. We also use a transform UDF to, for instance, convert a field value to upper case; it parses the JSON it receives and produces a new one.
The problem is the following. The number of bytes sent to the destination table is much less than the number sent to the deadletter table, and 99% of the error messages say that the sent JSON is invalid. And that's true: the payloadstring column contains distorted JSONs; they can be truncated, concatenated with other ones, or even both. I've added logs on the API side to see where the messages get corrupted, but neither the JSON bodies received by the API nor the ones it sends are invalid.
How can I debug this problem? Is there any chance that Pub/Sub or Dataflow corrupts messages? If so, what can I do to fix it?
UPD. By the way, we use the Google-provided template "Pub/Sub Topic to BigQuery".
UPD2. The API is written in Go, and we send the message simply by calling
res := p.topic.Publish(ctx, &pubsub.Message{Data: msg})
The res variable is then used for error logging. p here is a custom struct.
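A minimal sketch of how such a result is typically consumed with the cloud.google.com/go/pubsub client; the package and helper names here are illustrative rather than the actual API code:

package api

import (
    "context"
    "log"

    "cloud.google.com/go/pubsub"
)

// checkedPublish blocks on the publish result so that any server-side error
// surfaces in the logs instead of being lost if the result is never read.
func checkedPublish(ctx context.Context, topic *pubsub.Topic, msg []byte) {
    res := topic.Publish(ctx, &pubsub.Message{Data: msg})
    id, err := res.Get(ctx) // blocks until the server acknowledges the publish
    if err != nil {
        log.Printf("publish failed: %v", err)
        return
    }
    log.Printf("published message, server id: %s", id)
}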
The message we send is a JSON with 15 fields; just to be concise, I'll mock both it and the UDF.
Message:
{"MessageName":"Name","MessageTimestamp":123123123,...}
UDF:
function transform(inJson) {
  var obj;
  try {
    obj = JSON.parse(inJson);
  } catch (error) {
    throw 'parse JSON error: ' + error;
  }
  if (Object.keys(obj).length !== 15) {
    throw "Message is invalid";
  }
  if (!(obj.hasOwnProperty('MessageName') && typeof obj.MessageName === 'string' && obj.MessageName.length > 0)) {
    throw "MessageName is absent or invalid";
  }
  /*
    other fields check
  */
  obj.MessageName = obj.MessageName.toUpperCase();
  /*
    other fields transform
  */
  return JSON.stringify(obj);
}
UPD3:
Besides the corruption, I've noticed that every single message is duplicated at least once, and the duplicates are often truncated.
The problem started several days ago, when there was a massive increase in the number of messages; the volume is back to normal now, but the error is still there. The problem had been seen before, but it was a much rarer case.

The behavior you describe suggests that the data is corrupt before it gets to Pub/Sub or Dataflow.

I have performed a test, sending JSON messages containing 15 fields. Your UDF as well as the Dataflow template work fine, since I was able to insert the data into BigQuery.
Based on that, it seems your messages are already corrupted before they get to Pub/Sub. I suggest checking the messages once they arrive in Pub/Sub to see whether they have the correct format.
Please note that the message schema must match the BigQuery table schema.
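To check the messages as they arrive, you can attach a separate pull subscription to the same topic and dump what it receives. A minimal sketch, where "my-project" and "debug-sub" are placeholders for your project ID and that debugging subscription, flags every payload that is not valid JSON:

package main

import (
    "context"
    "encoding/json"
    "fmt"
    "log"

    "cloud.google.com/go/pubsub"
)

func main() {
    ctx := context.Background()
    client, err := pubsub.NewClient(ctx, "my-project")
    if err != nil {
        log.Fatal(err)
    }
    // "debug-sub" must be a pull subscription attached to the topic the API publishes to.
    sub := client.Subscription("debug-sub")
    err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
        if !json.Valid(m.Data) {
            fmt.Printf("corrupted payload: %q\n", m.Data)
        }
        m.Ack()
    })
    if err != nil {
        log.Fatal(err)
    }
}

If the payloads already look corrupted here, the problem is on the publishing side; if they look fine, the corruption happens later in the pipeline.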

Related

Assigning to GenericRecord the timestamp from inner object

Processing streaming events and writing files into hourly buckets is a challenge because of windowing, as some events from the incoming hour can fall into previous ones, and so on.
I've been digging around Apache Beam and its triggers, but I'm struggling to manage triggering on the timestamp, as follows...
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes()
This is what I've been doing so far: triggering one-minute windows no matter what the timestamp is. However, I would like to use the timestamp inside the object so that the trigger fires only for the elements that belong to the window.
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark
        .pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes()
The objects I'm dealing with have a timestamp field; however, it is a long, not an Instant.
"{ \"name\": \"timestamp\", \"type\": \"long\", \"logicalType\": \"timestamp-micros\" },"
Having my POJO class with that long field triggers nothing, but if I swap it for an Instant and recreate the object properly, the following error is thrown whenever a Pub/Sub message is read.
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Long
I've also been thinking of creating a kind of wrapper class around GenericRecord that contains a timestamp, but I would need to use just the GenericRecord part once it's ready to be written with FileIO to .parquet.
What other ways do I have to use watermark triggers?
EDIT: After @Anton's comments, I've tried the following.
.apply("Apply timestamps", WithTimestamps.of(
(SerializableFunction<GenericRecord, Instant>) item -> new Instant(Long.valueOf(item.get("timestamp").toString())))
.withAllowedTimestampSkew(Duration.standardSeconds(30)))
Even though it has been deprecated, this seems to pass through the pipeline, but the data is still not written (still getting discarded before writing, for some reason, by the previously shown trigger?).
I also tried the other mentioned approach using outputWithTimestamp, but due to the delay it prints the following error...
Caused by: java.lang.IllegalArgumentException: Cannot output with timestamp 2019-06-12T18:59:58.609Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-12T18:59:59.848Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
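For concreteness, a minimal sketch of the outputWithTimestamp approach that the error above refers to; the field name "timestamp" and the one-hour skew are assumptions, not values taken from the original pipeline:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Re-stamps each element with the event time embedded in the record so that
// the AfterWatermark trigger fires on it rather than on the Pub/Sub read time.
public class AssignEventTimeFn extends DoFn<GenericRecord, GenericRecord> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        long micros = (Long) c.element().get("timestamp"); // timestamp-micros logical type
        c.outputWithTimestamp(c.element(), new Instant(micros / 1000L));
    }

    @Override
    public Duration getAllowedTimestampSkew() {
        // Deprecated, but it is what relaxes the "allowed skew (0 milliseconds)"
        // check quoted in the error above.
        return Duration.standardHours(1);
    }
}

It would be applied with ParDo.of(new AssignEventTimeFn()) before the Window transform, so the watermark trigger sees the embedded timestamps.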

Empty KSQL Stream

I'm having a problem fetching data from a Kafka topic.
The topic contains objects, and I don't really know how to extract a particular field into the stream.
I'm sure that the topic exists.
I have:
Ticket object: {
  header object { storeID, storename, ... }
  body object { ... }
}
I want to put storeID in the stream:
create stream test (StoreID VARCHAR) with (KAFKA_TOPIC= 'output__tfrema',VALUE_FORMAT='AVRO');
I tried that example, but it gives me 0 rows; I expected it to return at least 10,000 records.
Thanks anyway.
If there are no results returned, it can be for several reasons.
Make sure you've run SET 'auto.offset.reset' = 'earliest'; so that you read all messages from the beginning of the topic.
Are there deserialisation errors in your KSQL server log?
You can read more here: https://www.confluent.io/blog/troubleshooting-ksql-part-1

Change trace log format in emqtt message broker

I am using emqtt message broker for mqtt.
I am not an Erlang developer and have zero knowledge of it.
I chose this Erlang-based broker after searching many open-source brokers online and hearing suggestions from people about the advantages of an Erlang-based server.
Now I am kind of stuck with the output of the emqttd_cli trace command.
It's not JSON, and if I use a Perl parser to convert it to JSON, I get delayed output.
I want to know in which file I could change the trace log output format.
I looked at the broker's trace code and found the file src/emqttd_protocol.erl. An exported function named trace/3 has the code that you need.
The second argument of this function, named Packet, holds the information about the data received and sent via the broker. You can fetch the required fields from it and format them however you want them printed.
Edit: sample modified code added.
trace(recv, Packet, ProtoState) ->
    PacketHeader = Packet#mqtt_packet.header,
    HostInfo = esockd_net:format(ProtoState#proto_state.peername),
    %% PacketInfo = {ClientId, Username, ClientIP, ClientPort, Payload, QoS, Retain}
    PacketInfo = {ProtoState#proto_state.client_id,
                  ProtoState#proto_state.username,
                  lists:nth(1, HostInfo),
                  lists:nth(3, HostInfo),
                  Packet#mqtt_packet.payload,
                  PacketHeader#mqtt_packet_header.qos,
                  PacketHeader#mqtt_packet_header.retain},
    ?LOG(info, "Data Received ~p", [PacketInfo], ProtoState);

DocuSign Connect update XML deserialization error

I have been using DocuSign SOAP and REST based API calls to create envelopes, and I am also using their Connect feature to update the recipient and envelope statuses for my clients.
I am getting a strange error parsing DocuSign Connect update for one client.
The error says "There is an error in XML document (1, 16174)".
Here is my code...
Dim sr As New StreamReader(Request.InputStream)
Dim xml As String = sr.ReadToEnd()
Dim reader As XmlReader = New XmlTextReader(New StringReader(xml))
Dim serializer As New XmlSerializer(GetType(DocuSignEnvelopeInformation), "http://www.docusign.net/API/3.0")
If Not serializer Is Nothing Then
    envelopeInfo = TryCast(serializer.Deserialize(reader), DocuSignEnvelopeInformation)
    Dim envid As String = envelopeInfo.EnvelopeStatus.EnvelopeID.ToString()
End If
I have tried a bunch of things, such as removing the XML declaration from the XML document, but it did not work. The strange thing is that the same code works for all of my other clients; this is the only client that is having issues. They have added close to 65 tags in the document to be signed, but I don't think the tags are causing the issue on their end, since I also tried removing them.
Please advise.
Minal
I have run into this issue before when there are unsupported characters in the tab values or in the PDF byte stream itself when it is decoded. I suspect that copying and pasting values into tabs from external programs like Word introduces invisible characters such as carriage returns and the like. You should validate your XML in its entirety.

REXML :: RuntimeError (entity expansion has grown too large)

After upgrading to Ruby-1.9.3-p392 today, REXML throws a RuntimeError when attempting to retrieve an XML response over a certain size - everything works fine and no error is thrown when receiving under 25 XML records, but once a certain XML response length threshold is reached, I get this error:
Error occurred while parsing request parameters.
Contents:
RuntimeError (entity expansion has grown too large):
/.rvm/rubies/ruby-1.9.3-p392/lib/ruby/1.9.1/rexml/text.rb:387:in `block in unnormalize'
I realize this was changed in the most recent Ruby version:
http://www.ruby-lang.org/en/news/2013/02/22/rexml-dos-2013-02-22/
As a quick fix, I've changed the size of REXML::Document.entity_expansion_text_limit to a larger number and the error goes away.
Is there a less risky solution?
This issue occurs when too much content is sent in the XML response.
To fix it, you need to restrict the data (< 10 KB) in each individual node (instead of sending the whole data, show truncated data and provide a separate link to view the full content).
The error is being raised from the file below:
ruby-2.1.2/lib/ruby/2.1.0/rexml/text.rb
# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  sum = 0
  string.gsub( /\r\n?/, "\n" ).gsub( REFERENCE ) {
    s = Text.expand($&, doctype, filter)
    if sum + s.bytesize > Security.entity_expansion_text_limit
      raise "entity expansion has grown too large"
    else
      sum += s.bytesize
    end
    s
  }
end
The limit in ruby-2.1.2/lib/ruby/2.1.0/rexml/text.rb defaults to 10240, which means 10 KB of data per node.
REXML also defaults to allowing only 10000 entity substitutions per document, so the maximum amount of text that can be generated by entity substitution is around 98 megabytes. (See https://www.ruby-lang.org/en/news/2013/02/22/rexml-dos-2013-02-22/ )
That sounds like a LOT of XML. Do you really need to get all of it? Maybe you can just request certain fields from the remote server. One option might be to try another XML parser (Nokogiri, for example). Another option might be to use something other than XML as a transport (JSON? Binary?).
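If switching parsers is an option, here is a minimal sketch with Nokogiri; response_body is a placeholder for the XML string you already receive:

require 'nokogiri'

# Nokogiri (libxml2) is not governed by REXML's entity_expansion_text_limit,
# so a large but well-formed response parses without this particular error.
doc = Nokogiri::XML(response_body)
raise "XML parse errors: #{doc.errors.join(', ')}" unless doc.errors.empty?
# then work with the document as usual, e.g. doc.xpath('//record')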
