Skipping header rows - is it possible with Cloud DataFlow? - google-cloud-dataflow

I've created a Pipeline, which reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (fields).
Is there any way to programmatically set the "number of header rows to skip", as you can when loading into BQ?

This is not currently possible. It sounds like there are two potential requests here:
Specifying presence and skip behavior for header lines for a BigQuery import.
Specifying that a GCS text source should skip a header line.
Future work on this is tracked in https://issues.apache.org/jira/browse/BEAM-123.
Also, in the meantime, you could add a simple filter to your pipeline to skip the header lines. Something like this:
PCollection<X> rows = ...;
PCollection<X> nonHeaders =
    rows.apply(Filter.by(new MatchIfNonHeader()));
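For reference, a minimal sketch of what MatchIfNonHeader could look like, assuming the rows are plain Strings read by TextIO; the "Field1,Field2" header check is a placeholder, not part of the original answer:
// Hypothetical predicate for Filter.by(...): returns true for rows to keep.
// Import SerializableFunction from your SDK (org.apache.beam.sdk.transforms in Beam,
// com.google.cloud.dataflow.sdk.transforms in the Dataflow SDK).
static class MatchIfNonHeader implements SerializableFunction<String, Boolean> {
  @Override
  public Boolean apply(String row) {
    // Placeholder header check; adapt it to your file's actual header line.
    return !row.startsWith("Field1,Field2");
  }
}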

Related

SSIS Export to CSV but start to write at line 3

I've tried googling for an answer but have had no luck.
What I need to do is write to a csv file that has a double header row. The columns I need as headers are in row 2 and I can set them using 'Header rows to skip'.
However, I need to start writing data at line 3 and this isn't currently happening.
What's happening is that the header in row 1 is being removed.
Everything else is fine with the package.
Any ideas?
You can trick your Flat File Target into producing the desired output if you can do without the comfort of automatically creating the column header names from your pipeline metadata:
On your Flat File Connection, uncheck the option "Column names in the first data row"
Open your Flat File Target "Advanced Editor" and switch to the Component Properties tab.
Find the Header property under Custom Properties and edit the text to contain the desired header data, ending with \r\n\r\n to produce the desired blank line between your header line and your data.
Setting Header to
column1;column2\r\n\r\n
your resulting file should look something like this:
column1;column2CRLF
CRLF
val11;val12CRLF
val21;val22CRLF

HDFS Flume sink - Roll by File

Is it possible for HDFS Flume sink to roll whenever a single file (from a Flume source, say Spooling Directory) ends, instead of rolling after certain bytes (hdfs.rollSize), time (hdfs.rollInterval), or events (hdfs.rollCount)?
Can Flume be configured so that a single file is a single event?
Thanks for your input.
Regarding your first question, it is not possible because the sink's logic is decoupled from the source's logic. A sink only sees the events it has to process from the channel; it does not know whether an event is the first or the last one belonging to a file.
Of course, you could try to create your own source (or extend an existing one) in order to add a header to the event with a value meaning "this is the last event". Then a custom sink could act on that header: for instance, if the header is not set, the events are not persisted but buffered in memory until the header is seen, and then all the information is persisted to the final backend as a batch. Another possibility is that the custom sink persists the data to a file until the header is seen; then the file is closed and another one is opened.
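As an illustration only: the proposal above is a custom source, but the header-tagging half of it can also be sketched as a Flume Interceptor. The "##EOF##" marker convention and the class name below are entirely hypothetical; a real custom source would know when a file ends and would not need an in-band marker.
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Tags events whose body starts with a (hypothetical) end-of-file marker, so a
// downstream custom sink can flush its buffer or roll its file when it sees
// the "lastEvent" header.
public class LastEventMarkerInterceptor implements Interceptor {
  @Override
  public void initialize() {}

  @Override
  public Event intercept(Event event) {
    String body = new String(event.getBody(), StandardCharsets.UTF_8);
    if (body.startsWith("##EOF##")) {  // placeholder end-of-file convention
      event.getHeaders().put("lastEvent", "true");
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {}

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new LastEventMarkerInterceptor();
    }

    @Override
    public void configure(Context context) {}
  }
}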
Regarding your second question, it depends on the source and its deserializer. The spooldir source behaves based on the deserializer parameter; by default its value is LINE, which means:
Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
But other custom Java classes can be configured, as said above; for instance, a deserializer for the whole file.
You can set hdfs.rollSize to a small number, combined with BlobDeserializer, to load the data file by file instead of combining files into blocks. This is really helpful when you have unsplittable binary files such as PDF or gz files.
This is the relevant part of the configuration:
#Set deserializer to BlobDeserializer and set the maximum blob size to be 1GB.
#Notice that the blobs have to fit in memory so this doesn't work for files that cannot fit in memory.
agent.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool.deserializer.maxBlobLength = 1000000000
#Set rollSize to 1024 to avoid combining multiple small files into one part.
agent.sinks.hdfsSink.hdfs.rollSize = 1024
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0
The answer to the question "Can Flume be configured so that a single file is a single event?" is yes.
You only have to configure the following property to be 1:
hdfs.rollCount = 1
I'm still looking for a solution to your first question, because sometimes the file is too big and needs to be split into several chunks.
You can use any event header in hdfs.path (https://flume.apache.org/FlumeUserGuide.html#hdfs-sink).
If you are using the Spooling Directory Source, you can have the file name put into the event headers by enabling fileHeader or basenameHeader (the header names are set by fileHeaderKey and basenameHeaderKey; see https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source). A configuration sketch follows.
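As an illustration only, reusing the agent, source, and sink names from the configuration above; the basename header and the HDFS path are assumptions:
#Put the original file's basename into a header on every event.
agent.sources.spool.basenameHeader = true
agent.sources.spool.basenameHeaderKey = basename
#Reference that header in the sink path so each input file ends up under its own directory.
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%{basename}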
Can Flume be configured so that a single file is a single event?
It could be, but it is not recommended. The underlying implementation (protobuf) limits event, and therefore file, sizes to 64 MB. Flume events are meant to be small due to its architecture and design (fault tolerance, etc.).

Need advice for INI parsing and validation

My constraints
A mandatory section
An optional section
A single-level section
Only one occurrence of a given option per section
Text values that can look like this:
Electric= yes6batteries
Electric= yes4battery
Electric= yes8solar_panel
Electric= yes
Thermal= no
Conditional options, for example:
Electric should not exist (or should be no) if Thermal= yes, but must exist if Thermal= no
I need to get the line number or the content of the error/conflict lines
I looked at ConfigObj but soon abandoned it because it is not validated for Python 3.
I started working with ConfigParser, but I'm not sure it will get me what I want.
So I'm asking what you would do in my place, or whether there is a library better suited to my needs.
TOML isn't exactly the INI format, but it looks almost like it.
There's a Python library for TOML, and it works with Python 3.

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with the examples in the docs, TextIO looks pretty close, but it seems designed to process each file line by line and does not allow me to read an entire file at once. Also, I don't see any way to get the name of the file being processed.
PCollectionTuple results = p.apply(TextIO.Read
    .from("gs://bucket/a/*.gz")
    .withCompressionType(TextIO.CompressionType.GZIP)
    .withCoder(MyJsonCoder.of()));
It looks like I need to write a custom IO reader, or some such? Any tips on the best place to start?
You are correct that, right now, none of the existing classes does exactly what you want. There are two reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where each String is a filename, using Create.of(filenames). Then apply a ParDo with a function that reads the given file.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it is less code and your use case does not require the full power of Source; a rough sketch of it follows.
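A minimal sketch of the first approach, assuming the pre-Beam Dataflow SDK (the exact IOChannelUtils/IOChannelFactory method names and signatures may differ across SDK versions); readWholeFiles and the KV<filename, contents> output shape are illustrative choices, not part of the answer:
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.zip.GZIPInputStream;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.common.io.ByteStreams;

// Reads every file matching the pattern as a single (filename, contents) pair.
static PCollection<KV<String, String>> readWholeFiles(Pipeline p, String pattern)
    throws IOException {
  // Expand the filepattern once, at pipeline-construction time.
  Collection<String> files = IOChannelUtils.getFactory(pattern).match(pattern);
  return p
      .apply(Create.of(files).withCoder(StringUtf8Coder.of()))
      .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
        @Override
        public void processElement(ProcessContext c) throws IOException {
          String filename = c.element();
          // Open and fully read the gzip-compressed file.
          try (InputStream in = new GZIPInputStream(Channels.newInputStream(
              IOChannelUtils.getFactory(filename).open(filename)))) {
            String contents = new String(ByteStreams.toByteArray(in), StandardCharsets.UTF_8);
            // Keep the filename next to the contents, since it is significant downstream.
            c.output(KV.of(filename, contents));
          }
        }
      }));
}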

Merging/appending multiple pcap files to an existing one without overwriting

I am using tshark to filter some packets based on Display/Read filters from one file into another.
I want to have one final output file, out.pcap, after executing multiple read filters over a number of files and combining them all into out.pcap.
I was trying to use mergecap, but it does not allow appending (combining) two files and storing the result in one of them without overwriting.
Is there any way to do this? I don't want to keep creating temporary files and merging them all together at the end.
This is not possible with any existing tool that I know of, although given the way the capture file format is laid out, it should be possible to write a new tool (or extend mergecap) to do this.
