Converting ROOT Tree to HDF5

I have a TTree in ROOT with 1000 events and 15 variables associated with each of them. I would like to convert this in its entirety to an HDF5 dataset. How do I organise my data in HDF5 groups so that I can access it both by event number and by variable (for example, if I wanted all the data for the 'kinetic energy' variable across all events)? Note: I have already tried the root2hdf5 conversion tool, but it does not work for branches with arrays / compound datatypes.

You can try loading the TTree into a pandas DataFrame with root_pandas, which should work for array branches (I'm not sure about compound datatypes).
From there you can index by both event and variable, and use the regular pandas functionality to save in your favorite format, such as HDF5.
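For example, here is a minimal sketch of that workflow (the file name, tree name and the 'kinetic_energy' column are just placeholders; pandas needs the PyTables package for HDF5 support):

from root_pandas import read_root
import pandas as pd

# One row per event, one column per branch (placeholder file/tree names).
df = read_root("events.root", key="tree")

# Write the whole table to HDF5 via pandas/PyTables.
df.to_hdf("events.h5", key="events", mode="w")

# Read it back and slice either way.
df = pd.read_hdf("events.h5", key="events")
one_event = df.iloc[42]                 # all 15 variables for event 42
kinetic_energy = df["kinetic_energy"]   # one variable across all 1000 events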

Related

Read multiple files at runtime (dataflow template)

I am trying to build a Dataflow template.
The goal is to read a ValueProvider that will tell me which files to read.
Then, for each file, I need to read it and enrich the data with the object.
I have tried this:
p.apply(Create.of(options.getScheduleBatch()))
 .apply(ParDo.of(StringScheduleBatchToFileReceivedFn.of()))
 .apply(ParDo.of(new DoFn<FileReceived, PCollection<EventRow>>() {
     @ProcessElement
     public void process(ProcessContext c) {
         FileReceived fileReceived = c.element();
         Broker broker = configuration.getBroker(fileReceived.getBrokerId());
         PCollection<EventRow> eventRows = p
             .apply(TextIO.read().from(fileReceived.getUri()))
             .apply(ParDo.of(StringToEventRowFn.of(broker, fileReceived, options.getJobName())));
         c.output(eventRows);
     }
 }));
But I get the following error:
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.beam.sdk.values.PCollection.
I would love to find a better way than reading the files myself using the GCS client.
Do you have any tips?
Best regards
The problem:
You're trying to emit a PCollection as an output of your ParDo. This doesn't work.
Details:
PCollection is an abstraction that represents a potentially unbounded collection of elements. Applying a transformation to a PCollection gives you another PCollection. One of the transformations you can apply is a ParDo. ParDos perform element-wise transforms. When applying a ParDo you're expressing: "take this PCollection and make another one by converting all of its elements with that ParDo".
One of the things that makes the processing effective is the ability to execute everything in parallel, e.g. converting a lot of elements at once on multiple execution nodes (e.g. VMs/machines) by running the same ParDo on each of them against different elements. You can't explicitly control whether any specific transform will happen on the same execution node or another one; how to optimize this is part of the underlying system design. But to enable this, the runner must be able to pass elements around between execution nodes and persist them for aggregation. Beam supports this by requiring you to implement Coders for elements. A Coder is a serialization mechanism that allows Beam to convert an element (represented by a Java object) into a byte array which can then be passed to the next transformation (that can potentially happen on another machine) or to storage. For example, Beam needs to be able to encode the elements that you output from a ParDo. Beam knows how to serialize some types, but it cannot infer everything automatically; you have to provide coders for anything that cannot be inferred.
Your example looks like this: take some PCollection, and convert it into another PCollection by applying a ParDo to each element, where that ParDo transforms each input element into a PCollection. This means that as soon as an element gets processed by the ParDo you have to encode it and pass it to the next transformation. And the question here is: how do you encode and pass a (potentially unbounded) PCollection to the next transform, or persist it for aggregation?
Beam doesn't support this at the moment, so you will need to choose another approach.
In your specific case I am not sure whether, out of the box, Beam lets you simply take a stream of filenames and turn them into sub-pipelines for processing the lines in those files.
Workarounds:
A few approaches I can think of to bypass this limitation:
If your file names have a known pattern, you can specify the pattern in TextIO and it can read the new files as they arrive.
If they don't have a known pattern, you can potentially write another pipeline to rename the files so that they share a common name pattern, and then use that pattern in TextIO in another pipeline.
If feasible (e.g. the files fit in memory), you could probably read the file contents with the plain Java File API, split them into rows, and emit those rows from a single ParDo. Then you can apply the same StringToEventRowFn in the following ParDo (see the sketch after this list).
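The question uses the Java SDK, but here is a minimal sketch of that third idea with the Beam Python SDK, since the shape is the same (FileReceived, StringScheduleBatchToFileReceivedFn and StringToEventRowFn are stand-ins for the types in the question):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from collections import namedtuple

# Stand-in for the FileReceived type from the question.
FileReceived = namedtuple("FileReceived", ["uri", "broker_id"])

class ReadFileLines(beam.DoFn):
    # Reads each incoming file and emits (file_received, line) pairs.
    # Only sensible if each file comfortably fits in a worker's memory.
    def process(self, file_received):
        with FileSystems.open(file_received.uri) as handle:
            for line in handle.read().decode("utf-8").splitlines():
                yield (file_received, line)

# Sketch of how it would slot into the pipeline:
# with beam.Pipeline(options=options) as p:
#     (p
#      | beam.Create([schedule_batch])
#      | beam.ParDo(StringScheduleBatchToFileReceivedFn())  # yields FileReceived elements
#      | beam.ParDo(ReadFileLines())
#      | beam.ParDo(StringToEventRowFn(...)))               # per-line enrichment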
Hope this helps

What are buckets in terms of hash functions?

Looking at the book Mining of Massive Datasets, section 1.3.2 has an overview of hash functions. Without a computer science background this is quite new to me; Ruby was my first language, where a hash seems to be equivalent to Dictionary<object, object>, and I had never considered how this kind of data structure is put together.
The book mentions hash functions, as a means of implementing these dictionary data structures. This paragraph:
First, a hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they "randomize" hash-keys.
What exactly are buckets in terms of a hash function? It sounds like buckets are array-like structures, and that the hash function is some kind of algorithm / array-like-structure search that produces the same bucket number every time. What is inside this metaphorical bucket?
I've always read that JavaScript objects / Ruby hashes / etc. don't guarantee order. In practice I've found that the keys' order doesn't change (actually, I think with an older version of Mozilla's Rhino interpreter the JS object order DID change, but I can't be sure...).
Does that mean that hashes (Ruby) / objects (JS) ARE NOT resolved by these hash functions?
Does the word hashing take on different meanings depending on the level at which you are working with computers? i.e. it would seem that a Ruby hash is not the same as a C++ hash...
When you hash a value, any useful hash function generally has a smaller range than its domain. This means that out of a large list of input values (for example, all possible combinations of letters) it will output one of a smaller list of values (a number capped at a certain length). This means that more than one input value can map to the same output value.
When this is the case, the output values are referred to as buckets.
Consider the function f(x) = x mod 2
This generates the following outputs:
1 => 1
2 => 0
3 => 1
4 => 0
In this case there are two buckets (1 and 0), with a bunch of input values that fall into each.
A good hash function will fill all of these 'buckets' equally, and so enable faster searching. If you take the mod of any number, you get the bucket to look into, and thus have to search through fewer results than if you searched the whole input set, since each bucket holds fewer results than the whole set of inputs. In the ideal situation the hash is fast to calculate and there is only one result in each bucket; this lets a lookup take only as long as applying the hash function.
This is a simplified example of course, but hopefully you get the idea.
The concept of a hash function is always the same. It's a function that calculates some number to represent an object. The properties of this number should be:
it's relatively cheap to compute
it's as different as possible for all objects.
Let's give a really artificial example to show what I mean by this and why/how hashes are usually used.
Take all natural numbers. Now let's assume it's expensive to check if 2 numbers are equal.
Let's also define a relatively cheap hash function as follows:
hash = number % 10
The idea is simple, just take the last digit of the number as the hash. In the explanation you got, this means we put all numbers ending in 1 into an imaginary 1-bucket, all numbers ending in 2 in the 2-bucket etc...
Those buckets don't really exist as a data structure. They just make it easy to reason about the hash function.
Now that we have this cheap hash function we can use it to reduce the cost of other things. For example, we want to create a new data structure to enable cheap searching of numbers. Let's call this data structure a hashmap.
Here we actually put all the numbers with hash=1 together in a list/set/..., we put the numbers with hash=5 into their own list/set, etc.
If we then want to look up some number, we first calculate its hash value. Then we check the list/set corresponding to this hash and compare only the "similar" numbers to find the exact number we want. This means we only had to do a cheap hash calculation and then check 1/10th of the numbers with the expensive equality check.
Note here that we use the hash function to define a new datastructure. The hash itself isn't a datastructure.
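A tiny sketch of that idea in Python (the numbers are arbitrary):

def last_digit_hash(number):
    return number % 10

def build_buckets(numbers):
    # Ten buckets, one per possible last digit.
    buckets = {h: [] for h in range(10)}
    for n in numbers:
        buckets[last_digit_hash(n)].append(n)
    return buckets

def contains(buckets, number):
    # Only the numbers sharing this bucket are compared with the
    # "expensive" equality check, roughly 1/10th of all values.
    return any(candidate == number for candidate in buckets[last_digit_hash(number)])

buckets = build_buckets([3, 13, 25, 41, 55, 78, 91])
print(contains(buckets, 55))   # True
print(contains(buckets, 56))   # False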
Consider a phone book.
Imagine that you wanted to look for Donald Duck in a phone book.
It would be very inefficient to have to look at every page, and every entry on that page. So rather than doing that, we do the following:
We create an index
We create a way to obtain an index key from a name
For a phone book, the index goes from A-Z, and the function used to get the index key is simply taking the first letter of the surname.
In this case, the hashing function takes Donald Duck and gives you D.
Then you take D and go to the index where all the people with Surnames starting with D are.
That would be a very oversimplified way to put it.
Let me explain in simple terms. Buckets come into the picture when handling collisions using the chaining technique (open hashing, or closed addressing).
Here, each array entry corresponds to a bucket, and each non-empty array entry holds a pointer to the head of a linked list (the bucket is implemented as a linked list).
The hash function is used by the hash table to calculate an index into this array of buckets, from which the desired value can be found.
That is, when checking whether an element is in the hash table, the key is first hashed to find the correct bucket to look into. Then the corresponding linked list is traversed to locate the desired element.
Similarly, for any addition or deletion, hashing is used to find the appropriate bucket. The bucket is then checked for the presence/absence of the required element, and the element is added to or removed from the bucket by traversing the corresponding linked list.
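As an illustration, here is a small hash table with chaining in Python, where each bucket is a plain list standing in for the linked list described above:

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        # The array of buckets; each bucket holds the chain of stored keys.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # The hash function maps a key to an index into the bucket array.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket_for(key)
        if key not in bucket:        # traverse the chain before inserting
            bucket.append(key)

    def contains(self, key):
        return key in self._bucket_for(key)

    def remove(self, key):
        bucket = self._bucket_for(key)
        if key in bucket:
            bucket.remove(key)

table = ChainedHashTable()
table.add("duck")
table.add("mouse")
print(table.contains("duck"))   # True
table.remove("duck")
print(table.contains("duck"))   # False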

How can I merge several files on SPSS by variable label?

I have 48 .sav data sets containing the results of a monthly survey. I need to merge the cases of all common variables from them in order to come up with a 4-year aggregate. As I'm new to SPSS and not very proficient with syntax (although I can follow it), I would normally do this using Data > Merge Files > Add Cases, but most of these common variables have different variable names in each data set, as the questions are not always formulated in the same order and some questions only appear in one or two data sets.
However, the variable labels do not change from one data set to another. It would be great if someone knew a way to merge these data sets by variable label instead of variable name. Swapping variable names and variable labels would also do, as then I could use Data > Merge Files > Add Cases without problems.
Many thanks beforehand!
The merge procedures such as ADD FILES (Data > Merge Files > Add Cases) provide a capability to rename variables in the input files before merging. However, if there are a lot of variables to merge, this would get pretty tedious and error prone. Also, the dialog box supports only merging two files, while syntax allows up to 50.
Variable labels are generally not valid as variable names due to the typical presence of characters such as blanks and punctuation, and due to length restrictions. If you have a rule that could be used to turn labels into valid variable names, that could be automated; or, if the variables are always in the same order and present in all the files, they could be renamed to something like V1, V2, ...
The renaming could be done manually in syntax that you would craft for each file, or this could be done with a short Python program that you run on each file. I can write that for you if you provide details and, preferably, a sample dataset to test with (jkpeck AT gmail.com).
The Python code could loop over all the sav files in a directory and apply the renaming logic to each in one step.
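Here is a rough sketch of that idea, assuming the SPSS Python integration (the spss module) is available; the label-to-name rule and the directory are only placeholders, and duplicate labels would still need disambiguation:

import glob
import re
import spss

def label_to_name(label, index):
    # Placeholder rule: replace blanks/punctuation and keep names short enough.
    name = re.sub(r"\W+", "_", label).strip("_")[:60]
    if not name or not name[0].isalpha():
        name = "V%d_%s" % (index, name)
    return name

for path in glob.glob("/path/to/savfiles/*.sav"):        # placeholder directory
    spss.Submit('GET FILE="%s".' % path)
    renames = []
    for i in range(spss.GetVariableCount()):
        label = spss.GetVariableLabel(i)
        if label:
            renames.append("%s = %s" % (spss.GetVariableName(i), label_to_name(label, i)))
    if renames:
        spss.Submit("RENAME VARIABLES (%s)." % ") (".join(renames))
    spss.Submit('SAVE OUTFILE="%s".' % path.replace(".sav", "_renamed.sav"))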

Csv Type Provider convert to Json

I am using the Csv Type Provider to read data from a local csv file.
I want to export the data as JSON, so I am taking each row and serializing it using the Json.NET library with JsonConvert.SerializeObject(x).
The problem is that each row is modeled as a tuple, meaning that the column headers do not become property names when serializing. Instead I get Item1="..."; Item2="..." etc.
How can I export to Json without 'hand-rolling' a class/record type to hold the values and maintain the property names?
The type provider works by providing compile-time type safety. The actual code that is compiled maps (at compile time) the nice accessors to tupled values (for performance reasons, I guess). So at run time the JSON serializer sees only tuples.
AFAIK there is no way around hand-rolling records. (That is, unless we eventually get type providers that are allowed to take types as parameters, which would allow a Lift<T>-style provider, or unless the CSV type provider implementation is adjusted accordingly.)

Time series modeling in F# -- seq vs array vs vector vs list vs generic list

If I want to make a time series type in F# to hold stock prices, which basic type should I use? We need
Select a subset based on time index,
Calculate basic statistics for a subset like mean, STD or for several subsets like correlations,
Append item for new data and fast update statistics or technical indicators,
Do linear regression between time series, etc
I have read that arrays have better performance, seq has a smaller memory footprint, lists are better for adding items, and the F# vector is easier for certain math calculations. To balance all the trade-offs, how would you model a stock price time series in F#? Thanks.
As a concrete representation you can choose either an array or a list or some other .NET collection type. A sequence seq<'T> is an abstract type, and both arrays and lists are automatically also sequences; this means that when you write code that works with sequences, it will work with any concrete data type (array, list or any other .NET collection).
So, when writing data processing, you can use Seq by default (as it gives you great flexibility - it doesn't matter what concrete representation you use) and then optimize some operations to use the concrete representation (whatever that will be) if you need something to run faster.
Regarding the concrete representation - I think the crucial question is whether you want to add elements without changing the original data structure (immutable list or array used in an immutable way) or whether you want to mutate the data structure (e.g. use some mutable .NET collection).
If you need to add new items frequently then you can either use an immutable list (which supports appending elements to the front) or a mutable collection (an array won't do, as it cannot be resized).
If you're working on a more sophisticated system, I would recommend taking a look at ObservableCollection<T> (see MSDN). This is a collection that automatically notifies you when it is changed. In response to the notification, you could update your statistics (it also tells you which elements were added, so you don't need to recalculate everything). However, F# doesn't have any libraries for working with this type, so you'll need to write a lot of things yourself.
If you're adding data only rarely, or adding them in larger groups, you could use an array (and allocate a new array each time you add items). If you have only a relatively small number of items in the collection, you could use lists (where adding an item is easy).
For numerical calculations, the F# PowerPack (and types like vector) offer only a fairly limited set of features, so you may need to look at some third-party libraries. Extreme Optimization is a commercial library with some F# examples, and Math.NET is an open-source alternative.
Otherwise it is difficult to give any concrete advice - can you add some more details about your system? (e.g. how large the data set is, how many items need to be added and how often, etc.)
