Dataflow output parameterized type to avro file - google-cloud-dataflow

I have a pipeline that successfully outputs an Avro file as follows:
@DefaultCoder(AvroCoder.class)
class MyOutput_T_S {
  T foo;
  S bar;
  Boolean baz;
  public MyOutput_T_S() {}
}

@DefaultCoder(AvroCoder.class)
class T {
  String id;
  public T() {}
}

@DefaultCoder(AvroCoder.class)
class S {
  String id;
  public S() {}
}
...
PCollection<MyOutput_T_S> output = input.apply(myTransform);
output.apply(AvroIO.Write.to("/out").withSchema(MyOutput_T_S.class));
How can I reproduce this exact behavior, except with a parameterized output MyOutput<T, S> (where T and S are both Avro-codable using reflection)?
The main issue is that Avro reflection doesn't work for parameterized types. So based on these responses:
Setting Custom Coders & Handling Parameterized types
Using Avrocoder for Custom Types with Generics
1) I think I need to write a custom CoderFactory, but I am having difficulty figuring out exactly how this works (I'm having trouble finding examples). Oddly enough, a completely naive coder factory appears to let me run the pipeline and inspect proper output using DataflowAssert:
cr.registerCoder(MyOutput.class, new CoderFactory() {
  @Override
  public Coder<?> create(List<? extends Coder<?>> componentCoders) {
    Schema schema = new Schema.Parser().parse("{\"type\":\"record\","
        + "\"name\":\"MyOutput\","
        + "\"namespace\":\"mypackage\","
        + "\"fields\":[]}");
    return AvroCoder.of(MyOutput.class, schema);
  }

  @Override
  public List<Object> getInstanceComponents(Object value) {
    MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
    List components = new ArrayList();
    return components;
  }
});
While I can successfully assert against the output now, I expect this will not cut it for writing to a file. I haven't figured out how I'm supposed to use the provided componentCoders to generate the correct schema, and if I try to just shove the schema of T or S into the fields I get:
java.lang.IllegalArgumentException: Unable to get field id from class null
2) Assuming I figure out how to encode MyOutput, what do I pass to AvroIO.Write.withSchema? If I pass either MyOutput.class or the schema, I get type mismatch errors.

I think there are two questions (correct me if I am wrong):
How do I enable the coder registry to provide coders for various parameterizations of MyOutput<T, S>?
How do I write values of MyOutput<T, S> to a file using AvroIO.Write?
The first question is to be solved by registering a CoderFactory as in the linked question you found.
Your naive coder is probably allowing you to run the pipeline without issues because serialization is being optimized away. Certainly an Avro schema with no fields will result in those fields being dropped in a serialization+deserialization round trip.
But assuming you fill in the schema with the fields, your approach to CoderFactory#create looks right. I don't know the exact cause of the message java.lang.IllegalArgumentException: Unable to get field id from class null, but the call to AvroCoder.of(MyOutput.class, schema) should work for an appropriately assembled schema. If there is an issue with this, more details (such as the rest of the stack trace) would be helpful.
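For concreteness, here is a rough, untested sketch of what create could look like, assuming (on my part) that both component coders are AvroCoders, so their schemas can be reused instead of hard-coding an empty field list:

@Override
public Coder<?> create(List<? extends Coder<?>> componentCoders) {
  // Assumption: the component coders for T and S are AvroCoders, so their
  // schemas can be plugged in as the "foo" and "bar" fields of MyOutput.
  Schema fooSchema = ((AvroCoder<?>) componentCoders.get(0)).getSchema();
  Schema barSchema = ((AvroCoder<?>) componentCoders.get(1)).getSchema();
  Schema schema = Schema.createRecord("MyOutput", null, "mypackage", false);
  schema.setFields(Arrays.asList(
      new Schema.Field("foo", fooSchema, null, null),
      new Schema.Field("bar", barSchema, null, null),
      new Schema.Field("baz", Schema.create(Schema.Type.BOOLEAN), null, null)));
  return AvroCoder.of(MyOutput.class, schema);
}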
However, your override of CoderFactory#getInstanceComponents should return a list of values, one per type parameter of MyOutput. Like so:
@Override
public List<Object> getInstanceComponents(Object value) {
  MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
  return ImmutableList.of(myOutput.foo, myOutput.bar);
}
The second question can be answered using some of the same support code as the first, but otherwise is independent. AvroIO.Write.withSchema always explicitly uses the provided schema. It does use AvroCoder under the hood, but this is actually an implementation detail. Providing a compatible schema is all that is necessary - such a schema will have to be composed for each value of T and S for which you want to output MyOutput<T, S>.
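As an illustration (untested, and using the concrete T and S classes from the question as the parameterization), such a schema could be composed like this:

// Compose the schema for one concrete parameterization, MyOutput<T, S>.
Schema tSchema = AvroCoder.of(T.class).getSchema();
Schema sSchema = AvroCoder.of(S.class).getSchema();
Schema outputSchema = Schema.createRecord("MyOutput", null, "mypackage", false);
outputSchema.setFields(Arrays.asList(
    new Schema.Field("foo", tSchema, null, null),
    new Schema.Field("bar", sSchema, null, null),
    new Schema.Field("baz", Schema.create(Schema.Type.BOOLEAN), null, null)));
// Pass this composed schema (or its JSON form, outputSchema.toString()) to
// AvroIO.Write.withSchema rather than passing MyOutput.class.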

Related

Apply not applicable with ParDo and DoFn using Apache Beam

I am implementing a Pub/Sub to BigQuery pipeline. It looks similar to How to create read transform using ParDo and DoFn in Apache Beam, but here I already have a PCollection created.
I am following what is described in the Apache Beam documentation to implement a ParDo operation to prepare a table row using the following pipeline:
static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    PubsubMessage message = c.element();
    // Retrieve data from message
    String rawData = message.getData();
    Instant timestamp = new Instant(new Date());
    // Prepare TableRow
    TableRow row = new TableRow().set("message", rawData).set("ts_reception", timestamp);
    c.output(row);
  }
}
// Read input from Pub/Sub
pipeline.apply("Read from Pub/Sub", PubsubIO.readMessagesWithAttributes().fromTopic(topicPath))
    .apply("Prepare raw data for insertion", ParDo.of(new convertToTableRowFn()))
    .apply("Insert in Big Query", BigQueryIO.writeTableRows().to(BQTable));
I found the DoFn function in a gist.
I keep getting the following error:
The method apply(String, PTransform<? super PCollection<PubsubMessage>,OutputT>) in the type PCollection<PubsubMessage> is not applicable for the arguments (String, ParDo.SingleOutput<PubsubMessage,TableRow>)
I always understood that a ParDo/DoFn operation is an element-wise PTransform, am I wrong? I never got this type of error in Python, so I'm a bit confused about why this is happening.
You're right, ParDos are element-wise transforms and your approach looks correct.
What you're seeing is a compilation error. Something like this happens when the argument type of the apply() method inferred by the Java compiler doesn't match the type of the actual argument, e.g. convertToTableRowFn.
From the error you're seeing, it looks like Java infers that the second parameter for apply() is of type PTransform<? super PCollection<PubsubMessage>,OutputT>, while you're passing a subclass of ParDo.SingleOutput<PubsubMessage,TableRow> instead (your convertToTableRowFn). Looking at the definition of SingleOutput, your convertToTableRowFn is basically a PTransform<PCollection<? extends PubsubMessage>, PCollection<TableRow>>, and Java fails to use it in apply() where it expects PTransform<? super PCollection<PubsubMessage>,OutputT>.
What looks suspicious is that Java didn't infer the OutputT to PCollection<TableRow>. One reason it would fail to do so is if you have other errors. Are you sure you don't have other errors as well?
For example, looking at convertToTableRowFn, you're calling message.getData(), which doesn't exist when I try it and fails compilation there. In my case I need to do something like this instead: rawData = new String(message.getPayload(), Charset.defaultCharset()). Also, .to(BQTable) expects a string (e.g. a string representing the BQ table name) as an argument, and you're passing some unknown symbol BQTable (maybe it exists somewhere in your program, in which case this is not a problem in your case).
After I fix these two errors your code compiles for me, apply() is fully inferred and the types are compatible.
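For reference, a minimal sketch of the DoFn after those two fixes (keeping the question's class name and fields; the string passed to .to() is still up to you):

static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    PubsubMessage message = c.element();
    // getPayload() returns the message bytes; getData() did not compile, per the answer above.
    String rawData = new String(message.getPayload(), Charset.defaultCharset());
    Instant timestamp = new Instant(new Date());
    TableRow row = new TableRow().set("message", rawData).set("ts_reception", timestamp.toString());
    c.output(row);
  }
}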

log4j2 and custom key value using JSONLayout

I would like to add a String key and an Integer value to my log using Log4j2.
Is there a way to do it? When I added properties to the ThreadContext I was only able to add String:String keys and values, but that does not help: I have numbers that I need to present in Kibana (some graphs).
Thanks,
Kobi
The built-in GelfLayout may be useful.
It's true that the default ThreadContext only supports String:String key-values. The work done in LOG4J2-1648 allows you to use other types in ThreadContext:
Tell Log4j to use a ThreadContext map implementation that implements the ObjectThreadContextMap interface. The simplest way to accomplish this is by setting system property log4j2.garbagefree.threadContextMap to true.
The standard ThreadContext facade only has methods for Strings, so you need to create your own facade. The below should work:
public class ObjectThreadContext {
  public static boolean isSupported() {
    return ThreadContext.getThreadContextMap() instanceof ObjectThreadContextMap;
  }

  public static Object getValue(String key) {
    return getObjectMap().getValue(key);
  }

  public static void putValue(String key, Object value) {
    getObjectMap().putValue(key, value);
  }

  private static ObjectThreadContextMap getObjectMap() {
    if (!isSupported()) {
      throw new UnsupportedOperationException();
    }
    return (ObjectThreadContextMap) ThreadContext.getThreadContextMap();
  }
}
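A hypothetical usage sketch (MyService is an assumed class name, and it assumes the log4j2.garbagefree.threadContextMap property mentioned above is set so the ObjectThreadContextMap implementation is active):

private static final Logger LOGGER = LogManager.getLogger(MyService.class);

public void handleRequest() {
  if (ObjectThreadContext.isSupported()) {
    // Store an Integer value, not a String, so Kibana can graph it numerically.
    ObjectThreadContext.putValue("responseTimeMs", 142);
  }
  // With JsonLayout configured with properties="true", the thread context map
  // (including the Integer value) is included in the logged event.
  LOGGER.info("request handled");
}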
It is possible to avoid ThreadContext altogether by injecting key-value pairs from another source into the LogEvent. This is (briefly) mentioned under Custom Context Data Injectors (http://logging.apache.org/log4j/2.x/manual/extending.html#Custom_ContextDataInjector).
I found the default log4j2 implementation somewhat problematic for passing custom fields with values. In my opinion, current Java logging frameworks are not well suited for writing structured log events.
If you like hacks, you can check https://github.com/skorhone/gelfj-alt/tree/master/src/main/java/org/graylog2/log4j2 . It's a library written for GELF. One of the provided features is a layout (ExtGelfjLayout) that supports extracting custom fields from events (see FieldExtractor). But... in order to send such an event, you need to write your own logging facade on top of log4j2.

Using MySQL as input source and writing into Google BigQuery

I have an Apache Beam task that reads from a MySQL source using JDBC, and it's supposed to write the data as-is to a BigQuery table. No transformation is performed at this point; that will come later on. For the moment I just want the database output to be written directly into BigQuery.
This is the main method trying to perform this operation:
public static void main(String[] args) {
  Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
  Pipeline p = Pipeline.create(options);

  // Build the table schema for the output table.
  List<TableFieldSchema> fields = new ArrayList<>();
  fields.add(new TableFieldSchema().setName("phone").setType("STRING"));
  fields.add(new TableFieldSchema().setName("url").setType("STRING"));
  TableSchema schema = new TableSchema().setFields(fields);

  p.apply(JdbcIO.<KV<String, String>>read()
      .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
          "com.mysql.jdbc.Driver", "jdbc:mysql://host:3306/db_name")
          .withUsername("user")
          .withPassword("pass"))
      .withQuery("SELECT phone_number, identity_profile_image FROM scraper_caller_identities LIMIT 100")
      .withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
        public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
          return KV.of(resultSet.getString(1), resultSet.getString(2));
        }
      })
      .apply(BigQueryIO.Write
          .to(options.getOutput())
          .withSchema(schema)
          .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
          .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)));

  p.run();
}
But when I execute the template using maven, I get the following error:
Test.java:[184,6] cannot find symbol symbol: method
apply(com.google.cloud.dataflow.sdk.io.BigQueryIO.Write.Bound)
location: class org.apache.beam.sdk.io.jdbc.JdbcIO.Read<com.google.cloud.dataflow.sdk.values.KV<java.lang.String,java.lang.String>>
It seems that I'm not passing BigQueryIO.Write the expected data collection and that's what I am struggling with at the moment.
How can I make the data coming from MySQL meet BigQuery's expectations in this case?
I think that you need to provide a PCollection<TableRow> to BigQueryIO.Write instead of the PCollection<KV<String,String>> type that the RowMapper is outputting.
Also, please use the correct column name and value pairs when setting the TableRow.
Note: I think that your KVs are the phone and url values (e.g. {"555-555-1234": "http://www.url.com"}), not the column name and value pairs (e.g. {"phone": "555-555-1234", "url": "http://www.url.com"})
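For example (an untested sketch showing only the row-mapping part of the pipeline), the read could be parameterized as JdbcIO.<TableRow>read() and map each result row straight to a TableRow whose field names match the BigQuery schema:

.withRowMapper(new JdbcIO.RowMapper<TableRow>() {
  public TableRow mapRow(ResultSet resultSet) throws Exception {
    // Field names here must match the table schema ("phone", "url"),
    // while the result-set columns come from the SQL query in the question.
    return new TableRow()
        .set("phone", resultSet.getString("phone_number"))
        .set("url", resultSet.getString("identity_profile_image"));
  }
})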
See the example here:
https://beam.apache.org/documentation/sdks/javadoc/0.5.0/
Would you please give this a try and let me know if it works for you? Hope this helps.

IBM Integration Bus: How to read user defined node (Java) complex (table) property in Java extension code

I created a Java user-defined node in Integration Toolkit (9.0.0.1) and assigned it several properties. Two of the node properties are simple (of String type) and one property is complex (a table property with the predefined type of User-defined) that consists of another two simple properties.
By following the documentation I was able to read the two simple properties in my Java extension class (which extends MbNode and implements MbNodeInterface) by making getters and setters that match the names of the two simple properties. The documentation also states that getters and setters should return and set String values, whatever the real simple type of a property may be. Obviously, this would not work for my complex node property.
I was also able to read User Defined Properties that are defined at the message flow level by using CMP (Integration Bus API) classes, which was another thing that is impossible to do from a user-defined node without CMP. At one point I began to think that my complex property would be among the User Defined Properties (although UDPs are defined at the flow level and my property is defined at the custom node level), based on some other random documentation and some other forum discussion.
I finally deduced that the complex property should map to the MbTable type (as stated in that type's description), but I was not able to use it.
Does anyone know how to access user defined node's complex(table) property value from Java?
I recently started working with WebSphere Message Broker v 8.0.0.5 for one of my projects, and I was going to ask the same question until SO suggested your question, which answered mine. It might be a little late for this question, but it may help others with similar questions.
After many frustrating hours consulting IBM documentation, this is what I found following your thread:
You're correct about the properties being available as user-defined properties (UDP) but only at the node level.
According to the JavaDoc for MbTable class (emphasis added to call out the relevant parts):
MbTable is a complex data type which contains one or more rows of simple data types. Its structure is very similar to a standard java record set. It can not be constructed in a node but instead is returned by the getUserDefinedAttribute() on the MbNode class. Its primary use is in allowing complex attributes to be defined on nodes instead of the normal static simple types. It can only be used in the runtime if a version of the toolkit that supports complex properties is being used.
You have to call com.ibm.broker.plugin.MbNode.getUserDefinedAttribute which will return an instance of com.ibm.broker.plugin.MbTable. However, the broker runtime doesn't call any setter methods for the complex attributes during the node initialization process like it does for simple properties. Also, you cannot access the complex attributes in either the constructor or the setter methods of other simple properties in the node class. These are available only in the run or evaluate method.
The following is the decompiled method definition of com.ibm.broker.plugin.MbNode.getUserDefinedAttribute.
public Object getUserDefinedAttribute(String string) {
  Object object;
  String string2 = "addDynamicTerminals";
  if (Trace.isOn) {
    Trace.logNamedEntry((Object)this, (String)string2);
  }
  if ((object = this.getUDA(string)) != null && object.getClass() == MbElement.class) {
    try {
      MbTable mbTable;
      MbElement mbElement = (MbElement)object;
      object = mbTable = new MbTable(mbElement);
    }
    catch (MbException var4_5) {
      if (Trace.isOn) {
        Trace.logStackTrace((Object)this, (String)string2, (Throwable)var4_5);
      }
      object = null;
    }
  }
  if (Trace.isOn) {
    Trace.logNamedExit((Object)this, (String)string2);
  }
  return object;
}
As you can see it always returns an instance of MbTable if the attribute is found.
I was able to access the complex attributes with the following code in my node definition:
@Override
public void evaluate(MbMessageAssembly inAssembly, MbInputTerminal inTerminal) throws MbException {
  checkUserDefinedProperties();
}

/**
 * @throws MbException
 */
private void checkUserDefinedProperties() throws MbException {
  Object obj = getUserDefinedAttribute("geoLocations");
  if (obj instanceof MbTable) {
    MbTable table = (MbTable) obj;
    int size = table.size();
    int i = 0;
    table.moveToRow(i);
    for (; i < size; i++, table.next()) {
      String latitude = (String) table.getValue("latitude");
      String longitude = (String) table.getValue("longitude");
    }
  }
}
The documentation for declaring attributes for user-defined extensions in Java is surprisingly silent on this little bit of detail.
Please note that all the references and code are for WebSphere Message Broker v 8.0.0 and should be relevant for IBM Integration Bus 9.0.0.1 too.

ASP.NET MVC: dealing with Version field

I have a versioned model:
public class VersionedModel
{
public Binary Version { get; set; }
}
Rendered using
<%= Html.Hidden("Version") %>
it gives:
<input id="Version" name="Version" type="hidden" value=""AQID"" />
which looks a bit strange. Anyway, when the form is submitted, the Version field is always null.
public ActionResult VersionedUpdate(VersionedModel data)
{
...
}
How can I pass Version over the wire?
EDIT:
A naive solution is:
public ActionResult VersionedUpdate(VersionedModel data)
{
data.Version = GetBinaryValue("Version");
}
private Binary GetBinaryValue(string name)
{
return new Binary(Convert.FromBase64String(this.Request[name].Replace("\"", "")));
}
Related posts I found.
Link
Suggests turning 'Binary Version' into 'byte[] Version', but a commenter noticed:
The problem with this approach is that it doesn't work if you want to use the Table.Attach(modified, original) overload, such as when you are using a disconnected data context.
Link
Suggests a solution similar to my 'naive solution'
public static string TimestampToString(this System.Data.Linq.Binary binary)
{ ... }
public static System.Data.Linq.Binary StringToTimestamp(this string s)
{ ... }
http://msdn.microsoft.com/en-us/library/system.data.linq.binary.aspx
If you are using ASP.Net and use the SQL Server "timestamp" datatype for concurrency, you may want to convert the "timestamp" value into a string so you can store it (e.g., on a web page). When LINQ to SQL retrieves a "timestamp" from SQL Server, it stores it in a Binary class instance. So you essentially need to convert the Binary instance to a string and then be able to convert the string to an equivalent Binary instance.
The code below provides two extension methods to do this. You can remove the "this" before the first parameter if you prefer them to be ordinary static methods. The conversion to base 64 is a precaution to ensure that the resultant string contains only displayable characters and no escape characters.
public static string ConvertRowVersionToString(this Binary rowVersion) {
    return Convert.ToBase64String(rowVersion.ToArray());
}

public static Binary ConvertStringToRowVersion(this string rowVersion) {
    return new Binary(Convert.FromBase64String(rowVersion));
}
I think the problem with not seeing it in the bound model on form submission is that there is no Convert.ToBinary() method available to the model binder to reconstruct the data from a string into its binary representation. If you want to do this, I think you'll need to convert the value by hand. I'm going to guess that the value you are seeing is the Base64 encoding of the binary value -- the output of Binary.ToString(). In that case, you'll need to convert it back from Base64 to a byte array and pass that to the Binary() constructor to reconstitute the value.
Have you thought about caching the object server-side, instead? This could be a little tricky as well as you have to detach the object from the data context (I'm assuming LINQ) or you wouldn't be able to reattach it to a different data context. This blog entry may be helpful if you decide to go that route.
You may need to use binding to get a strongly-typed parameter to your action method.
Try rendering using:
<%=Html.Hidden("VersionModel.Version")%>
And defining your action method signature as:
public ActionResult VersionedUpdate([Bind(Prefix="VersionModel")] VersionedModel data)
{
...
}
This post http://forums.asp.net/p/1401113/3032737.aspx#3032737 suggests to use
LinqBinaryModelBinder from http://aspnet.codeplex.com/SourceControl/changeset/view/21528#338524.
Once registered
protected void Application_Start()
{
    ModelBinders.Binders.Add(typeof(Binary), new LinqBinaryModelBinder());
}
the binder will automatically deserialize the Version field
public ActionResult VersionedUpdate(VersionedModel data)
{ ... }
rendered this way:
<%= Html.Hidden("Version") %>
(See also http://stephenwalther.com/blog/archive/2009/02/25/asp.net-mvc-tip-49-use-the-linqbinarymodelbinder-in-your.aspx)
There are many ways, like here:
byte[] b = BitConverter.GetBytes(DateTime.Now.Ticks); // new byte[(DateTime.Now).Ticks];
_store.Version = new System.Data.Linq.Binary(b);
(make sure you exclude Version from binding),
but the best way is to let the DB handle it...
