Can I limit the number of rows read by avro-tools?

I am using avro-tools tojson file.avro to inspect a large Avro file. I am only interested in seeing a few examples, just to get a feeling for the data.
Is there an option for avro-tools tojson that limits the number of rows read?

No. That's not possible. See source code here.
But it should be easy enough to just add a limit to the code.
Or just use head to fake it 😅
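For example (assuming the default one-record-per-line output, i.e. no --pretty flag, so head simply stops after the first few records):
avro-tools tojson file.avro | head -n 5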

Yes, there is an option for avro-tools tojson that limits the number of rows read:
avro-tools tojson --head=<n> <filename>


How to replace Telegraf's default timestamp?

I use Telegraf to send some data from a database to InfluxDB at regular intervals, which works fine apart from one issue:
I need to replace Telegraf's auto-generated timestamp (which is the current time at the moment Telegraf reads the data to transmit) with a field from the data.
(To answer the "why?" question: so that the data I end up with in InfluxDB actually matches the time of the event I want to record.)
I would have thought there is some standard configuration parameter or an easy-to-find processor plugin that lets me replace the default timestamp with the content of a field, but I didn't find any.
It doesn't seem to me to be a very exotic request, and Telegraf's "Metric" does have a "SetTime" function, so I hope someone has already solved this and can answer.
You can use a starlark processor to accomplish this, at least if you are able to put the correct timestamp into a tag first.
Here is an example of how I use a Starlark processor to replace the timestamp of the measurement with the content of a tag that I have already populated with the correct timestamp in the input plugin:
[[processors.starlark]]
  source = '''
def apply(metric):
    if "nanoseconds" in metric.tags:
        metric.time = int(metric.tags["nanoseconds"])
    return metric
'''
Because no one had an idea so far, that feature isn't available in the plugins that come with Telegraf, and I had to solve this pretty urgently, I wrote a Telegraf processor plugin that does exactly what I needed.
(I will likely offer to contribute this to telegraf in the future, when I have a bit more time to breathe than right now.)
There is a way to do this for both JSON format input and CSV input.
Here's a link to the JSON format description and how to set timestamps based on a value that's in the payload (in a format you can specify).
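For the JSON data format that is configured on the input plugin itself; as a rough sketch (the file input, path, field name and layout below are placeholders, not taken from the question):
[[inputs.file]]
  files = ["/path/to/events.json"]
  data_format = "json"
  ## Take the metric time from a field in the payload instead of the read time.
  json_time_key = "event_time"
  json_time_format = "2006-01-02T15:04:05Z07:00"   # a Go reference-time layout
For CSV input the equivalent options are csv_timestamp_column and csv_timestamp_format.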
Use:
[[processors.date]]
  tag_key = "<name of the tag holding your timestamp>"
  date_format = "<a time format as used by Go>"

Best way to ingest a list of csv files with dataflow

I'm looking for a way to read from a list of csv files and convert each row into json format. Assuming I cannot get header names beforehand, I must ensure that each worker can read from the beginning of one csv file, otherwise we don't know the header names.
My plan is to use FileIO.readMatches to get ReadableFile as elements, and for each element, read the first line as header and combine header with each other line into json format. My questions are:
Is it safe to assume ReadableFile will always contain a whole file, not a partial file?
Will this approach require worker memory to be larger than file size?
Any other better approaches?
Thanks!
Yes, ReadableFile will always give you a whole file.
No. As you go through the file line-by-line, you first read one line to determine the columns, then you read each line to output the rows - this should work!
This seems like the right approach to me, unless you have few files that are very large (GBs, TBs). If you have at least a dozen or a few dozen files, you should be fine.
An extra tip - it may be convenient to insert an apply(Reshuffle.viaRandomKey()) in between your CSV parser and your next transform. This will allow you to shuffle the output of each file into multiple workers downstream - it will give you more parallelism downstream.
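To make that concrete, here is a minimal sketch using the Beam Python SDK (the Java FileIO.readMatches pipeline is analogous); the file pattern and the final sink are placeholders:
import csv
import io
import json

import apache_beam as beam
from apache_beam.io import fileio


class CsvFileToJson(beam.DoFn):
    """Reads one whole CSV file; the first line is used as the header."""

    def process(self, readable_file):
        # Each element is a ReadableFile covering a complete file, so the header
        # is always visible to the worker that processes it.
        with io.TextIOWrapper(readable_file.open(), encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader)
            for row in reader:  # streamed line by line, not held in memory
                yield json.dumps(dict(zip(header, row)))


with beam.Pipeline() as pipeline:
    (
        pipeline
        | fileio.MatchFiles("gs://my-bucket/input/*.csv")  # placeholder pattern
        | fileio.ReadMatches()
        | beam.ParDo(CsvFileToJson())
        | beam.Reshuffle()  # spread rows across workers downstream
        | beam.Map(print)   # replace with your real sink
    )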
Good luck! Feel free to ask follow up questions in the comments.

'Or' operator for bosun tags in an expression

I am writing a Bosun expression in order to get the number of 2xx responses in a service like:
ungroup(avg(q("sum:metric.name.hrsp_2xx{region=eu-west-1}", "1m", "")))
The above expression gives me the number of 2xx requests of the selected region (eu-west-1) in the last minute but I would like to get the number of 2xx requests that happened in 2 regions (eu-west-1 and eu-central-1).
This metric is tagged with region. I have 4 regions available.
I was wondering if it is possible to do an 'or' operation with the tags. Something like:
{region=or(eu-west-1,eu-central-1)}
I've checked the documentation but I'm not able to find anything to achieve this.
Since q() is specific to querying OpenTSDB, it uses the same syntax. The basic syntax for what you posted would be to use a pipe symbol: ungroup(avg(q("sum:metric.name.hrsp_2xx{region=eu-west-1|eu-central-1}", "1m", ""))).
If you have OpenTSDB version 2.2 support enabled in Bosun you can also use the more advanced filter features as documented in the OpenTSDB documentation (e.g. host=literal_or(web01|web02|web03)). The main advantage is that OpenTSDB 2.2 added the ability to aggregate a subset of tag values instead of all or nothing. The Graph page in Bosun also helps you generate the queries for OpenTSDB.
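For example, with the 2.2-style filters enabled, the query from the question could look roughly like this (a sketch, not verified against your setup):
ungroup(avg(q("sum:metric.name.hrsp_2xx{region=literal_or(eu-west-1|eu-central-1)}", "1m", "")))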

Delphi TStringList wrapper to implement on-the-fly compression

I have an application that stores many strings in a TStringList. The strings will be largely similar to one another, and it occurs to me that one could compress them on the fly - i.e. store a given string as a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified paths and filenames should compress very well.
Does anyone know of a TStringList descendant that implements this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries that are already similar to the one being added gives much better performance than having to look up a 'best fit' entry (needed for similarity compression) over the entire set.
Another thing you might want to read up on is 'ropes' - a conceptually different type than strings, which I suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope' representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start each string with the index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access speed will suffer greatly. In one simple implementation, I did a mod 16 on the index to establish the entry at which prefix reduction started, which gave me on average about 40% memory savings (this number is completely data-dependent, of course).
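A minimal sketch of that prefix-reduction idea, written in Python for brevity (a Delphi version would follow the same shape):
def prefix_reduce(strings, restart_every=16):
    """Store each string as (shared-prefix length, suffix) relative to the previous
    entry, restarting every N entries so reads never chase a long chain."""
    entries = []
    prev = ""
    for i, s in enumerate(strings):
        if i % restart_every == 0:
            entries.append((0, s))      # full string at each restart point
        else:
            n = 0
            limit = min(len(prev), len(s))
            while n < limit and prev[n] == s[n]:
                n += 1
            entries.append((n, s[n:]))  # prefix length + remaining suffix
        prev = s
    return entries

def expand(entries):
    """Rebuild the original strings from the reduced form."""
    prev = ""
    for n, suffix in entries:
        prev = prev[:n] + suffix
        yield prev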
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
I suppose you expect general flexibility from the list (including the delete operation); in that case I don't know of any out-of-the-box solution, but I'd suggest one of two approaches:
You split your strings into words and keep a separate, growing dictionary to reference the words, storing each entry internally as a list of indexes (a sketch follows below).
You implement something based on the zlib stream support available in Delphi, but operating on blocks that contain, for example, 10-100 strings. In this case you still have to decompress/recompress the complete block, but the "price" you pay is lower.
I don't think you really want to compress TStrings items in memory, because it is terribly inefficient. I suggest you look at the TStream implementations in the Zlib unit. Just wrap a regular stream in a TDecompressionStream on load and a TCompressionStream on save (you can even emit a gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile.

How to detect tabular data from a variety of sources

In an experimental project I am playing with I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.
My first thought was to write a long switch/case statement that checked for data separated by tabs, then another case for data separated by pipe symbols, then yet another case for data separated in some other way, etc. Now of course I realize that I would have to come up with a list of different things to detect - but I wondered if there was a more intelligent way of detecting these features than doing a relatively slow search for each type.
I realize this question isn't especially eloquently put so I hope it makes some sense!
Any ideas?
(no idea how to tag this either - so help there is welcomed!)
The only reliable scheme would be to use machine-learning. You could, for example, train a perceptron classifier on a stack of examples of tabular and non-tabular materials.
A mixed solution might be appropriate, i.e. one whereby you handled the most common/obvious cases with simple heuristics (handled in "switch-like" manner) as you suggested, and to leave the harder cases, for automated-learning and other types of classifier-logic.
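A minimal sketch of the learning route, assuming scikit-learn and a labelled sample set you would have to collect yourself (the toy examples below are placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline

# Placeholder training data: 1 = tabular, 0 = not tabular.
samples = ["a\tb\tc\n1\t2\t3", "id|name|price\n1|widget|9.99", "Just an ordinary sentence."]
labels = [1, 1, 0]

# Character n-grams capture punctuation and the general shape of the text.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    Perceptron(),
)
model.fit(samples, labels)
print(model.predict(["x,y,z\n1,2,3"]))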
This assumes that you do not already have defined types stored in the TSV.
A TSV file is typically
[Value1]\t[Value..N]\n
My suggestion would be to:
1. Count up all the tabs
2. Count up all of the new lines
3. Count the total tabs in the first row
4. Divide the total number of tabs by the tabs in the first row
If the division in step 4 leaves a remainder of 0, then you have a TSV candidate. From there you may want to do one of the following:
You can continue reading the data, ignoring lines with fewer or more than the predicted tabs per line
You can scan each line before reading to make sure all are consistent
You can read up to the line that does not fit the format and then throw an error
Once you have a good prediction of the number of tab-separated values per line, you can use a regular expression to parse out the values [as a group].
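A small sketch of that heuristic in Python (the final regex step is simplified to a split; the names are made up for illustration):
def looks_like_tsv(text):
    """Heuristic from above: the total tab count must divide evenly by the
    number of tabs in the first row."""
    lines = [line for line in text.splitlines() if line]
    if not lines:
        return False
    tabs_in_first_row = lines[0].count("\t")
    if tabs_in_first_row == 0:
        return False
    return text.count("\t") % tabs_in_first_row == 0

def parse_rows(text, strict=False):
    """Yields rows, skipping (or rejecting) lines that don't match the predicted width."""
    lines = [line for line in text.splitlines() if line]
    expected = lines[0].count("\t") + 1
    for line in lines:
        fields = line.split("\t")
        if len(fields) == expected:
            yield fields
        elif strict:
            raise ValueError("inconsistent row: %r" % line)
        # otherwise ignore the odd line, as in the first option above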
