Get MIME type with Tika from part of a file - apache-tika

Is it possible with Tika to get the MIME Type or other meta data without loading the whole file?
I could write a script to pass only the first 1 MB; I am thinking of doing this to take some of the load off Tika and my server.

For container-based formats, Apache Tika needs the whole file to be sure of the type. Container formats include pretty much everything based on a zip file (Word .docx, OpenDocument .odt, iWork etc), anything based on the OLE2 format (Excel .xls, Hangul, MSI etc), and pretty much all multimedia formats. You can often take a good guess based on the filename and the container type, but to be sure you need to process the whole file to identify the contents and hence the file type.
For everything else, if Tika can detect the file type at all, then only the first few tens of KB are needed, often even only the first few hundred bytes. (It depends on the format in question; different formats keep their predictable signatures in different places.)
If you don't need Tika's very best detection guess, but can make do with slightly lower certainty (especially on container-based formats), then simply give Tika the start of the file, or tell Tika to use only the mime magic detector without any of the container-specific detectors. A sketch of what prefix-based magic detection looks like is below.
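For reference, Tika's own org.apache.tika.Tika facade exposes detect(...) overloads that accept a byte[] prefix (optionally with a filename), so you never have to read the whole file just to ask. As a language-neutral illustration of what magic-based detection does, and of why container formats defeat it, here is a minimal C++ sketch; the signatures are well-known magic numbers, but sniffMime and the 1 KB prefix size are just illustrative choices, not anything from Tika:

```cpp
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Illustrative magic-byte sniffer (not Tika itself): guess a MIME type
// from the first bytes of a file only.
std::string sniffMime(const std::vector<unsigned char>& prefix) {
    auto startsWith = [&](const char* magic, std::size_t n) {
        return prefix.size() >= n && std::memcmp(prefix.data(), magic, n) == 0;
    };
    if (startsWith("%PDF-", 5))             return "application/pdf";
    if (startsWith("\x89PNG\r\n\x1A\n", 8)) return "image/png";
    if (startsWith("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8))
        return "application/x-ole-storage";  // .xls, .doc, Hangul, MSI... ambiguous
    if (startsWith("PK\x03\x04", 4))
        return "application/zip";            // .zip, .docx, .odt... ambiguous
    return "application/octet-stream";
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<unsigned char> prefix(1024);  // a small prefix suffices for magic
    in.read(reinterpret_cast<char*>(prefix.data()), prefix.size());
    prefix.resize(static_cast<std::size_t>(in.gcount()));
    // A container-aware detector would now have to open the zip/OLE2
    // container and inspect its entries; hence the whole-file requirement.
    std::cout << sniffMime(prefix) << '\n';
}
```

The zip and OLE2 cases show the problem: from the prefix alone, a .docx and a plain .zip are indistinguishable, which is exactly why Tika's container detectors have to read further into the file.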

Related

Loading a text file into a TStringList with FireMonkey on OS X when the encoding is unknown

I am having a hard time loading a text file into a TStringList in FireMonkey on OS X when the encoding of the text file is not known.
When I just use list.LoadFromFile(filename), I get an encoding-related exception most of the time.
list.LoadFromFile(filename, TEncoding.Unicode) will also fail when the file is in ANSI, and vice versa.
There is no issue on Windows, where list.LoadFromFile(filename) just works, but not on OS X.
I can't specify the encoding, because it will be unknown (users provide the text files).
Any clue how I can get around this encoding issue when running the app on a Mac?
In general this is not possible. It is quite possible to create a single file that is valid when interpreted in all common encodings. This has been discussed many times, for instance: The Notepad file encoding problem, redux.
I'm assuming that you are working with files that do not contain byte order marks, BOMs. Obviously if your input files contained BOMs then you could simply check the BOM and be done.
With that assumption stated, the right solution to the problem, in a perfect world, is to know the encoding. Either pick a specific encoding which your program requires, or arrange for the user to tell you the encoding when they supply the file.
If, for whatever reason, you cannot do that, then the next best thing is to use heuristics to attempt to guess the encoding used. I'm not aware of any Pascal code to do this, but you should be able to put together something that works reasonably well. This answer gives an outline of a basic strategy: https://stackoverflow.com/a/20747074. A minimal sketch of that strategy follows.
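The question is about Delphi, but the outlined strategy ports directly. Here is a minimal sketch in C++, assuming you only need to distinguish BOM-marked files, UTF-8, and "something ANSI" (all names here are illustrative):

```cpp
#include <cstddef>
#include <vector>

enum class Guess { Utf8Bom, Utf16LeBom, Utf16BeBom, Utf8, Ansi };

// Check that the buffer is structurally valid UTF-8. (This sketch checks
// lead/continuation byte patterns only, not overlong encodings.)
static bool looksLikeUtf8(const std::vector<unsigned char>& b) {
    std::size_t i = 0;
    while (i < b.size()) {
        unsigned char c = b[i];
        std::size_t extra;
        if      (c < 0x80)           extra = 0;   // plain ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                        // invalid lead byte
        if (i + extra >= b.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((b[i + k] & 0xC0) != 0x80) return false;  // bad continuation
        i += extra + 1;
    }
    return true;
}

static Guess guessEncoding(const std::vector<unsigned char>& b) {
    // 1. Trust a BOM if one is present.
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Guess::Utf8Bom;
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Guess::Utf16LeBom;
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Guess::Utf16BeBom;
    // 2. A file that validates as UTF-8 almost certainly is UTF-8
    //    (pure ASCII also lands here, which is a safe answer either way).
    if (looksLikeUtf8(b)) return Guess::Utf8;
    // 3. Otherwise fall back to the platform's ANSI code page.
    return Guess::Ansi;
}
```

In Delphi terms, the result of guessEncoding would decide which TEncoding you pass to LoadFromFile.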

Matlab Parse Binary File

I am looking to speed up the reading of a data file which has been converted from binary to plaintext (it is my understanding that "binary" can mean a lot of different things; I do not know what type of binary file I have, just that it's a binary file). I looked into reading files quickly a while ago, and was informed that reading/parsing a binary file is faster than text. So, I would like to parse/read the binary file (the one that was converted to plaintext) in an effort to speed up the program.
I'm using Matlab for this project (I have a Matlab "program" that needs the data in the file). I guess I need some information on the different "types" of binary, but I really want information on how to read/parse said binary file (I know what I'm looking for in plaintext, so I imagine I'll need to convert that to binary, search the file, then pull the result out into plaintext). The file is a logfile, if that helps in any way.
Thanks.
There are several issues in what you are asking. First, you need to know the format of the file you are reading: if you can say "at position xx, I can expect to find data yy", that's what you need to know. In your question/comments you talk about searching for strings. You can also work much like with a text file: "when I find xxxx in the file, give me the following data up to the nth character, or up to the next yyyy".
You want to look at the documentation for fread. The documentation contains snippets of code that will get you started, but as I (and others) said, you need to know the format of your binary files. You can use a hex editor to ascertain some information if you are desperate, but what should be quicker is consulting the documentation for the program that outputs these files.
Regarding different "binary files": there is byte order, i.e. least significant byte first (little-endian) or last (big-endian). You really don't need to know about that for this work. There are also other platform-dependent issues which I am almost certain you don't need to know about (unless you are moving the binary files between Mac, PC and Unix machines). If you read almost to the bottom of the fread documentation, there is a section entitled "Reading Files Created on Other Systems" which talks about the issues and how to deal with them. A sketch of the fixed-offset idea is below.
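For comparison outside Matlab, here is the fixed-offset idea as a hedged C++ sketch; the file name and the offset are invented for illustration, and in Matlab the precision and machine-format arguments of fread play the role of readU32le:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>

// Read a little-endian uint32 at the current position. Assembling the
// value byte by byte avoids depending on the host's endianness.
static uint32_t readU32le(std::istream& in) {
    unsigned char b[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 4);
    return static_cast<uint32_t>(b[0])
         | static_cast<uint32_t>(b[1]) << 8
         | static_cast<uint32_t>(b[2]) << 16
         | static_cast<uint32_t>(b[3]) << 24;
}

int main() {
    std::ifstream in("data.bin", std::ios::binary);  // hypothetical file
    in.seekg(16);                          // "at position xx..." (invented offset)
    uint32_t recordCount = readU32le(in);  // "...I expect to find data yy"
    // From here, loop over fixed-size records using the same pattern.
    std::cout << "records: " << recordCount << '\n';
}
```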
One more comment: you say that "reading/parsing a binary file is faster than text". This is not true (or even if it is, odds are you won't notice the performance gain). In terms of development time, however, reading/parsing a text file will save you huge amounts of time.
The simple way to store data in a binary file is to use the 'save' command.
If you load from a saved variable it should be significantly faster than if you load from a text file.

Binary Serialized File - Delphi

I am trying to deserialize an old file format that was serialized in Delphi using binary serialization. I know nothing about the structure of the file except some very high-level records that are in it.
What steps would you take to solve this problem? Any tools etc?
A good hex editor, and use the gray matter to identify structures.
If you get a hint what kind of file it is, you can search for more specialized tools.
Running the Unix/Linux "file" command can be good too (*). See Barry's comment below for how it works. It can be a quick check for common filetypes like DBF, ZIP etc hidden behind a different extension.
(*) There are third-party builds for Windows, but they might lag behind in version. If you can do it on a recent *nix distro, that is advised.
The serialization process simply loops over all published properties and streams their values to the file. If you do not know the exact classes that were streamed to the file, you will have a very hard time (if not an impossible one) deserializing it.
A good hex editor is the first thing. If the file is read without buffering (e.g. read directly from a TFileStream), you can gain some information using ProcMon from Sysinternals: you can see exactly what data is read in what chunks, and thus determine more quickly where the boundaries are between the structures you have already identified.
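If you'd rather script the inspection than click through a hex editor, a throwaway dump helper is only a screenful. A C++ sketch (the 16-byte row format is just convention):

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

// Throwaway hex dump: offset, 16 hex bytes, ASCII column. Handy for
// noting down record boundaries and spotting embedded strings.
int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<unsigned char> buf(16);
    for (std::size_t off = 0; in; off += 16) {
        in.read(reinterpret_cast<char*>(buf.data()), buf.size());
        std::streamsize n = in.gcount();
        if (n <= 0) break;
        std::printf("%08zx  ", off);
        for (std::streamsize i = 0; i < 16; ++i) {
            if (i < n) std::printf("%02x ", static_cast<unsigned>(buf[i]));
            else       std::printf("   ");
        }
        std::printf(" |");
        for (std::streamsize i = 0; i < n; ++i)
            std::putchar(std::isprint(buf[i]) ? buf[i] : '.');
        std::printf("|\n");
    }
}
```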

efficient and flexible binary data parsing

I have an external device that spits out UDP packets of binary data, and software running on an embedded system that needs to read this data stream, parse it, and do something useful. The binary data gets logged to a file as well. I would like to write a parser that can take its input directly from either the UDP stream or a file, parse the data into a specific format, and then direct the output either to a file (e.g. a Matlab .dat file) or to another process that will do some real-time processing. Are there any resources that would help me with this, and what is the best way to go about it? I think it might make sense to use C++ streams, but I'm not familiar with creating custom output streams. Does this seem like a good approach, or is there a better way to go about it?
Thanks.
The beauty of binary data is that it is generally of a very fixed format.
A typical method of parsing it is to declare a structure that maps onto the received packets, and then to just use type-casts to read the fields as structure elements.
The beauty is that this requires no parsing.
You have to be careful about structure packing rules and endianness to make the structure map in exactly the same way. The C "offsetof" and "sizeof" macros are useful for emitting some debug info to check that your structure really maps to what you think it maps to.
Packing rules can typically be altered either by directives (such as #pragmas) or command-line options. Endianness you are stuck with: if it is different from what your embedded system uses, declare all the fields as bytes, or use something like the "ntoh" macros to do the byte swapping. A sketch of this approach follows.
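A minimal sketch of that approach, assuming an invented packet layout (SensorPacket and every field in it are made up for illustration; the real names, widths, and byte order come from the device's spec):

```cpp
#include <cstddef>      // offsetof
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>  // ntohs/ntohl (POSIX; on Windows use winsock2.h)

// Hypothetical packet layout.
#pragma pack(push, 1)   // disable padding so the struct maps the wire bytes
struct SensorPacket {
    uint16_t msg_id;       // big-endian on the wire (assumed)
    uint16_t payload_len;  // big-endian on the wire (assumed)
    uint32_t timestamp;    // big-endian on the wire (assumed)
    uint8_t  payload[64];
};
#pragma pack(pop)

// The offsetof/sizeof sanity checks mentioned above, done at compile time:
static_assert(offsetof(SensorPacket, timestamp) == 4, "unexpected packing");
static_assert(sizeof(SensorPacket) == 72, "unexpected packing");

// "No parsing": copy the datagram bytes over the struct, then byte-swap
// the multi-byte fields. (memcpy is the well-defined form of the type-cast;
// it avoids alignment problems and optimizes away.)
bool parsePacket(const uint8_t* data, std::size_t len, SensorPacket& out) {
    if (len < sizeof(SensorPacket)) return false;  // short datagram
    std::memcpy(&out, data, sizeof(SensorPacket));
    out.msg_id      = ntohs(out.msg_id);
    out.payload_len = ntohs(out.payload_len);
    out.timestamp   = ntohl(out.timestamp);
    return true;
}
```

Because parsePacket takes a plain byte pointer, it works unchanged whether the bytes came from a UDP recvfrom or were read back out of the log file, which covers both input sources in the question.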
The New Jersey Machine Code Toolkit is a scheme for decoding arbitrary binary patterns. It was originally designed for decoding instruction sets, but it ought to be just fine for decoding message formats. You provide a description of the binary format, and it synthesizes code to access the fields of that format (when valid). Thus you can refer to message fields using generated function calls rather than thinking about where a field is or how it is encoded.

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Are there any differences or advantages to using a binary file versus an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The Binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (neither is that bad, i.e. not orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable), and software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred over HTTP and HTTPS. However, sending XML is often appreciated because it is easier to trace.
Without having tested it: I guess the binary format would be quite a lot faster when reading and writing. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot be easily edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has an extension of .XML, it will not be saved in binary format over an existing dfXMLUTF8-format file, even with LoadFromFile/SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file as dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8-format file.
