Does the coder we select significantly affect performance? - google-cloud-dataflow

I'm having trouble understanding the purpose of "coders". My understanding is that we choose coders in order to "teach" dataflow how a particular object should be encoded in byte format and how equality and hash code should be evaluated.
By default, and perhaps by mistake, I tend to put the words "implements Serializable" on almost all my custom classes. This has the advantage that Dataflow tends not to complain. However, because some of these classes are huge objects, I'm wondering if performance suffers, and whether I should instead implement a custom coder in which I specify exactly which one or two fields are used to determine equality, hash code, etc. Does this make sense? Put another way, does creating a custom coder (which may only use one or two small primitive fields) instead of the default serializable coder improve performance for very large classes?

Java serialization is very slow compared to other forms of encoding, and can definitely cause performance problems. However, only serializing part of your object means that the rest of the object will be dropped when it is sent between processes.
Much better than using Serializable, and pretty much just as easy, is to use AvroCoder by annotating your classes with
@DefaultCoder(AvroCoder.class)
This will automatically deduce an Avro schema from your class. Note that this does not work for generic types, so you'll likely want to use a custom coder in that case.


Data Layer Convention

I am currently defining a data layer definition/convention that is to be used at a large organisation.
So every time someone defines new tags or collects some sort of information from a web page, they should follow the convention.
It covers variable naming, values, type description and when to use.
The convention is later to be used with GTM/Tealium iQ but it should be tool agnostic.
What is the best way, from a technical perspective, to define the data layer schema? I am thinking of Swagger or JSON Schema. Any thoughts?
It's important to define your data layer in a way that works for your organisation. That said, the best data layers have an easy-to-understand naming convention, are generally not nested, and contain good-quality data.
A good tag manager will be able to read your data layer in whatever format you would like, whether this is out of the box or a converter which runs before tag execution.
Here is Tealium's best practice:
https://community.tealiumiq.com/t5/Data-Layer/Data-Layer-Best-Practices/ta-p/15987

There seem to be a lot of ruby methods that are very similar, how do I pick which one to use?

I'm relatively new to Ruby, so this is a pretty general question. I have found through the Ruby Docs page a lot of methods that seem to do the exact same thing or very similar. For example chars vs split(' ') and each vs map vs collect. Sometimes there are small differences and other times I see no difference at all.
My question here is how do I know which is best practice, or is it just personal preference? I'm sure this varies from instance to instance, so if I can learn some of the more important ones to be cognizant of I would really appreciate that because I would like to develop good habits early.
I am a bit confused by your specific examples:
map and collect are aliases. They don't "do the exact same thing", they are the exact same thing. They are just two names for the same method. You can use whatever name you wish, or what reads best in context, or what your team has decided as a Coding Standard. The Community seems to have settled on map.
each and map/collect are completely different; there is no similarity there, apart from the general fact that they both operate on collections. map transforms a collection by mapping every element to a new element using a transformation operation. It returns a new collection (an Array, actually) with the transformed elements. each performs a side-effect for every element of the collection. Since it is only used for its side-effect, the return value is irrelevant (it might just as well return nil like Kernel#puts does; in languages like C, C++, Java, C♯, it would return void), but it is specified to always return its receiver.
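A minimal sketch of the difference described above, using a made-up array (the variable names are illustrative only): map and collect build a new Array from the block's results, while each is run for its side effect and hands back its receiver.

```ruby
numbers = [1, 2, 3]

# map (also available under the name collect) returns a new Array
# made of the block's return values; the receiver is left untouched.
doubled   = numbers.map     { |n| n * 2 }
collected = numbers.collect { |n| n * 2 }

# each is called for its side effect. Here the side effect is
# appending to `seen`; the block's return value is ignored.
seen = []
returned = numbers.each { |n| seen << n * 2 }

p doubled                   # [2, 4, 6]
p collected                 # [2, 4, 6]
p seen                      # [2, 4, 6]
p returned.equal?(numbers)  # true: each returns its receiver
```

So if you want the transformed values, reach for map; if you only want something to happen per element, reach for each.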
split splits a String into an Array of Strings based on a delimiter that can be either a Regexp (in which case you can also influence whether the delimiter itself gets captured in the output or ignored), a String, or nil (in which case the global default separator gets used). chars returns an Array with the individual characters (represented as Strings of length 1, since Ruby doesn't have a specific Character type). chars belongs together in a family with bytes and codepoints, which do the same thing for bytes and codepoints, respectively. split can only be used as a replacement for one of the methods in this family (chars), and split is much more general than that.
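As a rough illustration of the paragraph above (the sample strings are invented for the example, and the last split call assumes the global separator `$;` has been left at its default of nil):

```ruby
word = "caf\u00e9"  # "café", written with an escape so the file stays ASCII

# The chars / bytes / codepoints family views the same String three ways.
chars      = word.chars       # four one-character Strings
bytes      = word.bytes       # [99, 97, 102, 195, 169]
codepoints = word.codepoints  # [99, 97, 102, 233]

# split with an empty String delimiter happens to give the same result as chars.
same_as_chars = word.split("") == chars

# split with a Regexp delimiter; a capture group keeps the delimiters.
without_delims = "a,b;c".split(/[,;]/)    # ["a", "b", "c"]
with_delims    = "a,b;c".split(/([,;])/)  # ["a", ",", "b", ";", "c"]

# split with no argument (nil) falls back to splitting on whitespace.
words = "one two  three".split            # ["one", "two", "three"]

p chars.length, bytes, codepoints, same_as_chars
p without_delims, with_delims, words
```

The overlap between the two is limited to the empty-delimiter case; everything else split does has no counterpart in chars.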
So, in the examples you gave, there really isn't much similarity at all, and I cannot imagine any situation where it would be unclear which one to choose.
In general, you have a problem and you look for the method (or combination of methods) that solve it. You don't look at a bunch of methods and look for the problem they solve.
There'll typically be only one method that fits a specific problem. Larger problems can be broken down into different subproblems in different ways, so it is indeed possible that you may end up with different combinations of methods to solve the same larger problem, but for each individual subproblem, there will generally be only one applicable method.
When the documentation states that two methods do the same thing, it's just a matter of preference. To learn the details, you should always start with the Ruby API documentation.

Why does F# Set need IComparable?

So I am trying to use the F# Set as a hash table. But my element type doesn't implement the IComparable interface (although it implements IEquatable). I got an error saying the construction is not allowed because of the comparison constraint. After some further reading, I discovered that F# Set is implemented using a binary tree, which makes insertion cost O(log(n)). This looks weird to me: why is the Set structure designed this way?
Edit: So I learned that Set in F# is actually a SortedSet. And I guess the question becomes, why is Sorted Set somehow more preferable than a general Hash Set as an immutable/functional data structure?
There are two important points that should help you understand how sets in F# (and in functional languages in general) work and how they are used:
Implementing immutable hashtables (like .NET HashSet) is hard - when you remove or add elements, you want to avoid copying everything in the data structure and (as far as I know) there is no general way of doing that (you would end up copying too much, so it would be inefficient).
For this reason, most functional sets are implemented as some form of tree. Trees require comparison to keep their elements sorted. The nice property of balanced trees is that removing and adding elements does not have to copy everything in the tree, so even the worst-case scenario is reasonably efficient (though a mutable hashtable is still faster).
Now, F# is functional-first, which means that immutable structures are preferred, but it is perfectly fine to use mutable data structures (especially if you limit the usage to some well defined and restricted scope). For this reason, F# programmers often use Dictionary or HashSet, especially when this is only within the scope of a single function.

How to design a set of file readers and writers for different format

Digging into a legacy project (C++) that needs to be extended, I realized that there are about 40 reader/writer/parser classes. They are used to read and write various types of data (different objects) in different file formats (binary, hdf5, xml, text, ...); one type of object is typically bound to one or two file formats. Most of the classes have no knowledge of the others. Interfaces and inheritance were apparently unknown to the writer, as were design patterns.
It seems to me a horrendous mess. On the other hand, I am not exactly sure how to handle this situation. I will at least extract interfaces. I would also like to see if I can move common code into some parent classes, for example what is specific to an hdf5 reader/writer. I also thought that the abstract factory pattern could help, but the objects I get out of the readers are completely different.
How would you handle this situation? How would you design the classes? What design pattern would you use, if any? Would you keep the reading and writing parts split?
The Abstract Factory pattern isn't the right track. You usually only need interfaces if you anticipate multiple implementations for a given file type and want both to operate the same way.
Question: can one class be written to multiple file types? As in, object 'a' (of type Class A) potentially needs to be written to either/both XML or text formats?
If that is true, you need to decouple the classes from the readers/writers. Take a look at this question: What design pattern should I use for import/export?

Is it good practice to use a Dynamic Array in an object field?

I am refactoring some existing Delphi code into a class.
The current code uses a global variable defined as a dynamic array (array of byte). At initialization time the code figures out the size of the array and uses SetLength to allocate it. It is convenient both as the buffer to obtain the data and as the runtime container for later processing.
I want to move this variable as one of the object attributes.
But I am not sure if it is ok to maintain its type. Is it considered good practice?
The alternative I am considering is to transform it to a dynamic container like a TList. I would keep the very same code for obtaining the data, with a local dynamic array, but move it to the container for the rest of its lifespan. Is it worth the effort? I know that elegance always pays off at the end, but I don't really see the value of the effort at this moment. Any thoughts?
Dynamic arrays are great, but really only for fixed dimensions. If they have to grow, especially in single-record increments, this can cause eventual errors from the memory manager (and possible performance issues) since the array has to be reallocated and copied to the new, bigger destination. TList does at least have a 'growing' mechanism that is called less frequently.
I know that elegance always pays off at the end,
Is that so? Note that changing working code always includes the risk to break something. It must IMHO be decided in every situation if the gained elegance is worth the risk.
In your case, if you add and remove items during runtime, I would use a TList since it is much easier for these operations. If you just initialize the length once and the array is constant after initialization, you can just keep the dynamic array. There's definitely no "good practice" saying that you shouldn't use dynamic arrays.
