Merging/appending multiple pcap files to an existing one without overwriting - wireshark

I am using tshark to filter some packets based on Display/Read filters from one file into another.
I want to end up with one final output file, out.pcap, after executing multiple read filters over a number of files and combining the results into out.pcap.
I was trying to use mergecap, but it does not allow appending (combining) two files and storing the result in one of them without overwriting it.
Is there any way to do this? I don't want to keep creating temporary files and merge them all together at the end.

This is not possible with existing tools as far as I know, although given the way the capture file format is laid out, it should be possible to write a new tool (or extend mergecap) to do this.

Related

Creating similar HTML reports for several data files

I need to analyze a dozen similarly formatted data files. I wish to generate a similar HTML report, containing some statistics and graphs which describe the data, for each file. One HTML report per file, the same graphs in each, just different numbers. For a single file, this is easy to do, for instance using the FsLab journal. Despite my best efforts, I haven't found any way to do this efficiently for many similar files (same format, different numbers).
If I have 10 files, I'd need to copy-paste the journal 10 times and change the line that defines which file to load in each copy. Then whenever I wish to add a new graph I'd need to edit all 10 files. This clearly cannot be the best way to do this.
I am willing to use other methods than the journal and other libraries than FsLab if they suit the problem better, but I'd expect there to be an easy solution for a basic thing like this.
This is something that is not very nicely supported by the FsLab Journals system, but you can definitely find some way to do this. One simple option I can think of would be to modify the build.fsx script for the journals so that it processes the script repeatedly and uses, e.g., an environment variable to specify the input file.
If you are using the standard template, look at the generateJournals function:
let generateJournals ctx =
  let builtFiles = Journal.processJournals ctx
  traceImportant "All journals updated."
  Journal.getIndexJournal ctx builtFiles
I think you should be able to modify it along the following lines:
let generateJournals ctx =
  // Iterate over all the inputs you want to process
  // ('inputFiles' is assumed to be a list of your data file paths)
  let builtFiles =
    inputFiles
    |> List.map (fun input ->
        // Set an environment variable to keep 'input'
        // (the variable name "JOURNAL_INPUT" is just an example)
        System.Environment.SetEnvironmentVariable("JOURNAL_INPUT", input)
        let built = Journal.processJournals ctx
        // Move the resulting files here, so that they do not
        // get overwritten by the next run
        built)
    |> List.last
  traceImportant "All journals updated."
  // Just return the journal you want to open first below
  Journal.getIndexJournal ctx builtFiles
Then in the journal, you should be able to use System.Environment.GetEnvironmentVariable to read the variable set in the build script.

Remove disconnected structures of compounds

I am uploading 3 different chemical files to my application, one at a time. Each file contains the SMILES of a compound, but the tag name is different. I am creating an IAtomContainer stream by reading the file. I want to remove the disconnected structures from the stream. Is there any way to remove them instead of manually checking the SMILES? I am using CDK 1.5.13.
ConnectivityChecker.isConnected(IAtomContainer);
This works; it returns a boolean value.
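For reference, a minimal sketch of how it could be used to drop disconnected structures from a stream of containers (the ConnectedFilter class and keepConnected method are just illustrative names, not CDK API):

import java.util.ArrayList;
import java.util.List;

import org.openscience.cdk.graph.ConnectivityChecker;
import org.openscience.cdk.interfaces.IAtomContainer;

public class ConnectedFilter {
    // Keep only the containers whose atoms form a single connected structure.
    public static List<IAtomContainer> keepConnected(Iterable<IAtomContainer> containers) {
        List<IAtomContainer> connected = new ArrayList<IAtomContainer>();
        for (IAtomContainer container : containers) {
            if (ConnectivityChecker.isConnected(container)) {
                connected.add(container);
            }
            // Alternatively, ConnectivityChecker.partitionIntoMolecules(container)
            // splits a disconnected container into its connected fragments.
        }
        return connected;
    }
}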

HDFS Flume sink - Roll by File

Is it possible for HDFS Flume sink to roll whenever a single file (from a Flume source, say Spooling Directory) ends, instead of rolling after certain bytes (hdfs.rollSize), time (hdfs.rollInterval), or events (hdfs.rollCount)?
Can Flume be configured so that a single file is a single event?
Thanks for your input.
Regarding your first question, it is not possible because the sink's logic is decoupled from the source's logic. A sink only sees events that have been put into the channel for it to process; it does not know whether an event is the first or the last one belonging to a file.
Of course, you could try to create your own source (or extend an existing one) in order to add a header to the event with a value meaning "this is the last event". Then, another custom sink could behave depending on such a header: for instance, if the header is not set, the events are not persisted but stored in memory until the header is seen; then all the information is persisted in the final backend as a batch. Another possibility is that the custom sink persists the data in a file until the header is seen; then the file is closed and another one is opened.
Regarding your second question, it depends on how the source turns the data into events. The spooldir source behaves based on the deserializer parameter; by default its value is LINE, which means:
Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
But other custom Java classes can be configured, as said above; for instance, a deserializer for the whole file.
You can set hdfs.rollSize to a small number combined with BlobDeserializer to load the data file by file instead of combining it into blocks. This is really helpful when you have unsplittable binary files such as PDF or gz files.
This is part of the configuration that is relevant:
#Set deserializer to BlobDeserializer and set the maximum blob size to be 1GB.
#Notice that the blobs have to fit in memory so this doesn't work for files that cannot fit in memory.
agent.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool.deserializer.maxBlobLength = 1000000000
#Set rollSize to 1024 to avoid combining multiple small files into one part.
agent.sinks.hdfsSink.hdfs.rollSize = 1024
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0
The answer to the question "Can Flume be configured so that a single file is a single event?" is yes.
You only have to configure the following property to be 1:
hdfs.rollCount = 1
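Note that the size- and time-based triggers can still close files before that (their defaults are not 0), so to roll strictly per event you would presumably also disable them, along the lines of the earlier snippet (the agent and sink names are just placeholders):

agent.sinks.hdfsSink.hdfs.rollCount = 1
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0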
I'm still looking for a solution to your first question, because sometimes the file is too big and it needs to be split into several chunks.
You can use any event headers in hdfs.path. ( https://flume.apache.org/FlumeUserGuide.html#hdfs-sink )
If you are using the Spooling Directory Source, you can enable putting the file name in the events using fileHeader or basenameHeader (with fileHeaderKey / basenameHeaderKey controlling the header names) ( https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source ).
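For example, something along these lines (reusing the agent, source, and sink names from the earlier snippet, with a made-up HDFS path) would write each file's events under a directory named after the original file:

agent.sources.spool.basenameHeader = true
agent.sources.spool.basenameHeaderKey = basename
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%{basename}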
Can Flume be configured so that a single file is a single event?
It could be, however it is not recommended. The underlying implementation (protobuf) limits file sizes to 64 MB. Flume events are meant to be small due to its architecture and design (fault tolerance, etc.).

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with the examples in the docs, TextIO looks pretty close, but it looks like it's designed to process each file line by line and does not allow me to read an entire file at once. Also, I don't see any way to get the filename that's being processed.
PCollectionTuple results = p.apply(TextIO.Read
    .from("gs://bucket/a/*.gz")
    .withCompressionType(TextIO.CompressionType.GZIP)
    .withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting filenames into a PCollection<String>, where each String is a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename (see the sketch below).
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.
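A rough sketch of the first approach with the Dataflow Java SDK might look like the following; the WholeFileExample class, the readWholeFiles method, and the gzip/whole-file reading details are illustrative assumptions rather than a drop-in solution:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.util.IOChannelFactory;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class WholeFileExample {
  public static PCollection<KV<String, String>> readWholeFiles(Pipeline p, String pattern)
      throws Exception {
    // Expand the filepattern at pipeline construction time.
    IOChannelFactory factory = IOChannelUtils.getFactory(pattern);
    List<String> files = new ArrayList<>(factory.match(pattern));

    // One element per filename; the ParDo then reads each file itself.
    return p.apply(Create.of(files))
        .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
          @Override
          public void processElement(ProcessContext c) throws Exception {
            String filename = c.element();
            IOChannelFactory f = IOChannelUtils.getFactory(filename);
            // Files are gzipped and each holds a single JSON record; read it whole.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Channels.newInputStream(f.open(filename))),
                StandardCharsets.UTF_8))) {
              StringBuilder contents = new StringBuilder();
              String line;
              while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
              }
              // Emit (filename, file contents); parse the JSON downstream.
              c.output(KV.of(filename, contents.toString()));
            }
          }
        }));
  }
}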

Add new values to XML dynamically

I have an XML file in my app's resources folder. I am trying to update that file with new dictionaries dynamically. In other words, I am trying to edit an existing XML file to add new keys and values to it.
First of all, can we edit a static XML file and add a new dictionary with keys and values to it? What is the best way to do this?
In general, you can read an XML file into a document object (choose your language), use methods to modify it (add your new dictionary), and (re-)write it back out to either the original XML file, or a new one.
That's straightforward ... just roll up the ol' sleeves and code it up.
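For instance, in Java with the standard DOM APIs a minimal sketch might look like this (the file name and element names are made up; also note that on some platforms an app's bundled resources are read-only, so the updated document may have to be written to a different, writable location):

import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AddDictEntry {
    public static void main(String[] args) throws Exception {
        // Load the existing XML file into a DOM document.
        File file = new File("data.xml");
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(file);

        // Append a new "dict" element with one key/value pair to the root element.
        Element dict = doc.createElement("dict");
        Element key = doc.createElement("key");
        key.setTextContent("newKey");
        Element value = doc.createElement("string");
        value.setTextContent("newValue");
        dict.appendChild(key);
        dict.appendChild(value);
        doc.getDocumentElement().appendChild(dict);

        // Write the modified document back out (to the same file, or a new one).
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(file));
    }
}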
The real problem comes in with formatting in the XML file before and after said additions.
If you are going to 'unix diff' the XML file before and after, then order is important. Some standard XML processors do better with order than others.
If the order changes behind the scenes, and is gratuitously propagated into your output file, you lose standard diffing advantages, such as some GUI differs and some SCM diffs (svn, cvs, etc.).
For example, browse to:
Order of XML attributes after DOM processing
They discuss that DOM loses order where SAX does not.
You can also write a custom XML 'diff'er (there may be such a tool off-the-shelf ... for example, check out http://diffxml.sourceforge.net/) that compares 2 XML documents tag-by-tag, attribute-by-attribute, etc.
Perhaps some standard XML-related tool such as XSLT will allow you to keep the formatting constant without changing tag or attribute order. You'd have to research that.
BTW, a related problem is the config (.ini) file problem ... many common processors flippantly announce that the write-order may not agree with the read-order.
