How can I code nullable objects in Google Cloud Dataflow?

This post is intended to answer questions like the following:
Which built-in Coders support nullable values?
How can I encode nullable objects?
What about classes with nullable fields?
What about collections with null entries?

You can inspect the built-in Coders in the DataflowJavaSDK source.
Some of the default Coders do not support null values, often for efficiency. For example, DoubleCoder always encodes a double using 8 bytes; adding a bit to reflect whether the double is null would add a (padded) 9th byte to all non-null values.
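For intuition, the usual way to support null is to prefix the encoded value with a presence marker. Below is a minimal sketch of that idea in plain Java; it only illustrates the technique and is not the SDK's actual wire format:

// Sketch only: encode a possibly-null Double as a 1-byte marker,
// followed by the 8-byte value only when it is present.
static byte[] encodeNullableDouble(Double value) throws java.io.IOException {
  java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
  java.io.DataOutputStream out = new java.io.DataOutputStream(bytes);
  if (value == null) {
    out.writeByte(0);        // null: marker only
  } else {
    out.writeByte(1);        // present: marker first...
    out.writeDouble(value);  // ...then the 8 bytes DoubleCoder would write
  }
  out.flush();
  return bytes.toByteArray();
}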
It is possible to encode nullable values using the techniques outlined below.
We generally recommend using AvroCoder to encode classes. AvroCoder has support for nullable fields annotated with the org.apache.avro.reflect.Nullable annotation:
@DefaultCoder(AvroCoder.class)
class MyClass {
  @Nullable String nullableField;
}
See the TrafficMaxLaneFlow example for a more complete code sample.
AvroCoder also supports fields that include Null in a Union.
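For illustration, one way to declare such a union with the Avro reflect API is the org.apache.avro.reflect.Union annotation, with Void standing in for the null branch. This is a hedged sketch: the class and field names here are made up, and the exact behavior may depend on the Avro version bundled with the SDK:

import org.apache.avro.reflect.Union;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

@DefaultCoder(AvroCoder.class)
class MyUnionClass {
  // Declares the Avro union ["null", "string"]; Void represents the null branch.
  @Union({Void.class, String.class})
  Object maybeString;
}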
We recommend using NullableCoder to encode nullable objects themselves; it wraps the Coder for the underlying type and adds support for null values.
For example, consider the following working code:
PCollection<String> output =
    p.apply(Create.of(null, "test1", null, "test2", null)
        .withCoder(NullableCoder.of(StringUtf8Coder.of())));
Nested null fields/objects are supported by many coders, as long as the nested coder supports null fields/objects.
For example, the SDK should be able to infer a working coder using the default CoderRegistry for a List<MyClass> -- it should automatically use a ListCoder with a nested AvroCoder.
Similarly, a List<String> with possibly-null entries can be encoded with the Coder:
Coder<List<String>> coder = ListCoder.of(NullableCoder.of(StringUtf8Coder.of()));
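Attaching that coder when building the collection might look like the following sketch; the element values and the variable name lists are illustrative, p is the Pipeline from the earlier example, and the package names assume the 1.x DataflowJavaSDK layout:

import java.util.Arrays;
import java.util.List;
import com.google.cloud.dataflow.sdk.coders.ListCoder;
import com.google.cloud.dataflow.sdk.coders.NullableCoder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Lists whose entries may be null, encoded with ListCoder over NullableCoder.
PCollection<List<String>> lists =
    p.apply(Create.<List<String>>of(
            Arrays.asList("a", null, "b"),
            Arrays.asList("c", null))
        .withCoder(ListCoder.of(NullableCoder.of(StringUtf8Coder.of()))));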
Finally, in some cases Coders must be deterministic, e.g., the key used for GroupByKey. In AvroCoder, the @Nullable fields are coded deterministically as long as the Coder for the base type is itself deterministic. Similarly, using NullableCoder should not affect whether an object can be encoded deterministically.
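You can check this at pipeline-construction time with Coder.verifyDeterministic(). A sketch, assuming the 1.x DataflowJavaSDK packages and the MyClass annotated above:

import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.coders.Coder;

// verifyDeterministic() throws Coder.NonDeterministicException if MyClass
// cannot be encoded deterministically, meaning it should not be used as a
// GroupByKey key in that form.
Coder<MyClass> keyCoder = AvroCoder.of(MyClass.class);
try {
  keyCoder.verifyDeterministic();
} catch (Coder.NonDeterministicException e) {
  // Fall back to a different key representation or Coder.
}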

Related

Aliases for primitive types in AVRO

I want to define aliases/logical names for primitive types. But if I do that directly, as for PartyId in the example below, the Avro tools do not accept it:
protocol Test {
  string PartyId;
  record Header {
    PartyId partyId;
  }
}
Is that possible in IDL or the Avro schema language? How? As a workaround, I can define:
record PartyId {
  string _value;
}
Though this seems equivalent in the binary encoding, semantically (e.g. in the generated Java code) it is not the same: PartyId is a structured type, not a primitive type.
I can define custom names for enums and records, but it seems to me that Avro doesn't offer a means of aliasing primitive types.
Aliases in Avro are for remapping record namespaces.
Names exist for records and enums because these are actual objects, not primitives.
Just as you cannot refer to a Java String by another class name, Avro has no need for such a feature. You can write doc comments or name the fields appropriately to make it clear what each field is.

Deserialize missing values of type to certain value

I have made a wrapper type called Skippable<'a> (an F# discriminated union, not unlike Option) specifically meant for indicating which members should be excluded when serializing types:
type Skippable<'a> =
    | Skip
    | Serialize of 'a
I have functioning converters, but during deserialization I want missing JSON values to be deserialized to the Skip case of the DU (instead of null, as is currently happening).
I know of DefaultValueAttribute, but that only works with constant values, and besides I don't want to use an attribute on each and every Skippable-wrapped property in my DTOs.
Is it possible in some way to tell Newtonsoft.Json to populate missing values of a certain type (Skippable<'a>) with a certain value of that type (Skip)? Using converters, contract resolvers, or other methods?
Making Skippable a struct union is one way to do it, since then the default value (e.g. from Unchecked.defaultof) seems to be the first case, with any fields (none, in this case) at their default values.
[<Struct>]
type Skippable<'a> =
    | Skip
    | Serialize of 'a
// Unchecked.defaultof<Skippable<obj>> = Skip
This is part of the FSharp.JsonSkippable library, which lets you control, in a simple and strongly typed manner, whether to include a given property when serializing (and determine whether a property was included when deserializing), and moreover lets you control/determine exclusion separately from nullability.

Can a record have a nullable field?

Is it legal for a record to have a nullable field such as:
type MyRec = { startDate : System.Nullable<DateTime>; }
This example does build in my project, but is this good practice if it is legal, and what problems if any does this introduce?
It is legal, but F# encourages using option types instead:
type MyRec = { startDate : option<DateTime>; }
By using option you can easily pattern match against the value, transform option values (for example with Option.map), and use abstractions such as the Maybe monad (via Option.bind), whereas with Nullable you can't; moreover, only value types can be made nullable.
You will notice most F# functions (such as List.choose) work with options instead of nullables. Some language features like optional parameters are interpreted as the F# option type.
However, in some cases where you need to interoperate with C#, you may want to use Nullable.
When using LINQ to query a database, you may consider using the Linq.Nullable module and the Nullable operators.
F# does not allow types that are declared in F# to be null. However, if you're using types that are not defined in F#, you are still allowed to use null. This is why your code is still legal. This is needed for inter-operability, because you may need to pass null to a .NET library or accept it as a result.
But I would say it is not good practice unless your need is specifically interoperability. As others pointed out, you can use the option type. However, this doesn't create an optional record field whose value you can leave out when creating the record; to create a value of the record type, you still need to provide a value for the optional field.
Also, you can mark a type with the AllowNullLiteral attribute, and F# compiler would allow null as a value for that specific type, even if it is a type declared in F#. But AllowNullLiteral can't be applied to record types.
Oh and I almost forgot to mention: option types are NOT compatible with nullable types. Something that I kind of naively expected to just work (stupid me!). See this nice SO discussion for details.

How to closely recreate SOS.dll functionality in C# managed code? How does SOS.dll ObjSize and DumpObject work under the hood?

I have done a lot of research on this topic and am still stumped. I've previously asked this question on Stack Overflow and received a less than satisfactory response. This leads me to believe that this is a rather advanced topic needing a strong understanding of the CLR to answer. I'm hoping a guru can help me out.
This question is largely based on my previous posts found here and here.
I'm attempting to recreate some of the functionality of SOS.dll using reflection, specifically the ObjSize and DumpObject commands. I use reflection to find all the fields, and if a field is a primitive type I add the size of the primitive type to the overall size of the object. If the field is a value type, I recursively call the original method and walk down the reference tree until I've hit all primitive-type fields.
I'm consistently getting object sizes larger than the SOS.dll ObjSize command by a factor of two or so. One reason I've found is that my reflection code seems to be finding fields that SOS is ignoring. For example, in a Dictionary, SOS finds the following fields:
buckets
entries
count
version
freeList
freeCount
comparer
keys
values
_syncRoot
m_siInfo
However my reflection code finds all of the above and also finds:
VersionName
HashSizeName
KeyValuePairsName
ComparerName
A previous answer implied that these were constants and not fields. Are constants not held in memory? Should I be ignoring constants? I'm not sure which binding flags to use to get all fields other than constants also...
Also, I'm getting confused by the inconsistencies between the SOS ObjSize and DumpObject commands. I know DumpObject doesn't look at the size of the referenced types. However, when I call ObjSize on the dictionary mentioned above I get:
Dictionary - 532B
Then I call DumpObject on the Dictionary to get the memory addresses of its reference types. When I call ObjSize on its reference types I get:
buckets - 40
entries - 364
comparer - 12
keys - 492
(the rest are null or primitive)
Shouldn't the ObjSize of the top-level dictionary roughly be the sum of the ObjSizes of the fields within the dictionary? Why is reflection finding more fields than DumpObject? Any thoughts on why my reflection analysis is returning numbers larger than SOS.dll?
Also, I never got an answer to one of my questions asked in the thread linked above. I was asking whether or not I should ignore properties when evaluating the memory size of an object. The general consensus was to ignore them. However, I found a good example of a case where a property's backing field would not be included in the collection returned from Type.GetFields(). Looking under the hood of a String, you have the following:
Object contains Property named FirstChar
Object contains Property named Chars
Object contains Property named Length
Object contains Field named m_stringLength
Object contains Field named m_firstChar
Object contains Field named Empty
Object contains Field named TrimHead
Object contains Field named TrimTail
Object contains Field named TrimBoth
Object contains Field named charPtrAlignConst
Object contains Field named alignConst
The m_firstChar and m_stringLength fields are the backing fields of the properties FirstChar and Length, but the actual contents of the string are held in the Chars property. This is an indexed property that can be indexed to return all the chars in the String, but I can't find a corresponding field that holds the characters of the string.
Any thoughts on why that is? Or how to get the backing field of the indexed property? Should indexed properties be included in the memory size?
While this idea is interesting, I believe it is ultimately futile, as Reflection doesn't provide access to the actual storage of objects. Reflection lets you query types, but not their actual memory representation (which is an implementation detail of the CLR).
For reference types, the CLR itself adds internal fields to each instance (the method table pointer and the sync block). These are not surfaced by the Reflection APIs. Additionally, the CLR may apply padding or compacting to the storage of the fields depending on the definition of the type, which means the space a primitive-typed field occupies may not be consistent across different reference types. Reflection doesn't allow you to discover this either.
In short Reflection is unable to discover a lot of details needed to produce correct results.

What's the difference between an option type and a nullable type?

In F# mantra there seems to be a visceral avoidance of null, Nullable<T> and its ilk. In exchange, we are supposed to instead use option types. To be honest, I don't really see the difference.
My understanding of the F# option type is that it allows you to specify a type which can contain any of its normal values, or None. For example, an Option<int> allows all of the values that an int can have, in addition to None.
My understanding of the C# nullable types is that it allows you to specify a type which can contain any of its normal values, or null. For example, a Nullable<int> a.k.a int? allows all of the values that an int can have, in addition to null.
What's the difference? Do some vocabulary replacement with Nullable and Option, null and None, and you basically have the same thing. What's all the fuss over null about?
F# options are general; you can create Option<'T> for any type 'T.
Nullable<T> is a terrifically weird type; you can only apply it to structs, and though the Nullable type is itself a struct, it cannot be applied to itself. So you cannot create Nullable<Nullable<int>>, whereas you can create Option<Option<int>>. They had to do some framework magic to make that work for Nullable. In any case, this means that for Nullables, you have to know a priori whether the type is a class or a struct, and if it's a class, you need to just use null rather than Nullable. It's an ugly, leaky abstraction; its main value seems to be database interop, as I guess it's common to have "int, or no value" objects to deal with in database domains.
In my opinion, the .NET framework is just an ugly mess when it comes to null and Nullable. You can argue either that F# 'adds to the mess' by having Option, or that it rescues you from the mess by suggesting that you avoid null/Nullable altogether (except when absolutely necessary for interop) and focus on clean solutions with Options. You can find people with both opinions.
You may also want to see
Best explanation for languages without null
Because every .NET reference type can have this extra, meaningless value—whether or not it ever is null, the possibility exists and you must check for it—and because Nullable uses null as its representation of "nothing," I think it makes a lot of sense to eliminate all that weirdness (which F# does) and require the possibility of "nothing" to be explicit. Option<_> does that.
What's the difference?
F# lets you choose whether or not you want your type to be an option type and, when you do, encourages you to check for None and makes the presence or absence of None explicit in the type.
C# forces every reference type to allow null and does not encourage you to check for null.
So it is merely a difference in defaults.
Do some vocabulary replacement with Nullable and Option, null and None, and you basically have the same thing. What's all the fuss over null about?
As languages like SML, OCaml and Haskell have shown, removing null removes a lot of run-time errors from real code, to the extent that the original creator of null even describes it as his "billion dollar mistake".
The advantage of using option is that it makes explicit that a variable can contain no value, whereas nullable types leave it implicit. Given a definition like:
string GetValue(object arg);
the type system does not document whether the returned value can ever be null, or what will happen if arg is null. This means that repetitive checks need to be made at function boundaries to validate the assumptions of the caller and callee.
Along with pattern matching, code using option types can be statically checked to ensure both cases are handled; for example, the following code results in an incomplete-match warning:
let f (io: int option) =
    match io with
    | Some i -> i
As the OP mentions, there isn't much of a semantic difference between using the words optional or nullable when conveying optional types.
The problem with the built-in null system becomes apparent when you want to express non-optional types.
In C#, all reference types can be null. So, if we relied on the built-in null to express optional values, all reference types are forced to be optional ... whether the developer intended it or not. There is no way for a developer to specify a non-optional reference type (until C# 8).
So, the problem isn't with the semantic meaning of null. The problem is null is hijacked by reference types.
As a C# developer, I wish I could express optionality using the built-in null system. And that is exactly what C# 8 is doing with nullable reference types.
Well, one difference is that for a Nullable<T>, T can only be a struct, which reduces the use cases dramatically.
Also make sure to read this answer: https://stackoverflow.com/a/947869/288703
