Extract key-value pairs from various configuration files - parsing

I am attempting to build a program which restricts users from modifying certain configuration parameters across a variety of file types. For example, I would like to allow a user to upload an nginx.conf file or a preconfiguration file for the Linux OS, then be able to identify the key-value pairs in these files (which may use different delimiters), extract these KV pairs, and store them in a database somewhere.
As there is such a wide variety of config file structures out there, I was thinking along the lines of using an NLP library which could look for these KV pairs (as opposed to a function in my program based on standard delimiters). Is there anything that you've used before or would recommend? A Go library would be a bonus, as my program is currently written in Go.
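Before reaching for NLP, it may help to see how far a plain delimiter-based pass gets you. The sketch below is a minimal, hypothetical Go example (the function name and the delimiter priority are my own assumptions, not a library recommendation): it scans line by line, skips comments, and splits on the first of a few common separators. It deliberately ignores nested blocks such as nginx's server { ... } sections, which is exactly the sort of structure a real parser would have to handle.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// extractKV is a hypothetical helper: it scans a config file line by line,
// skips blanks and comments, and splits each line on the first of several
// common delimiters. Nested blocks and quoted values are out of scope.
func extractKV(sc *bufio.Scanner) map[string]string {
	kv := make(map[string]string)
	delims := []string{"=", ":", " "} // assumed delimiter priority
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue // skip blanks and comments
		}
		for _, d := range delims {
			if i := strings.Index(line, d); i > 0 {
				key := strings.TrimSpace(line[:i])
				val := strings.TrimSuffix(strings.TrimSpace(line[i+len(d):]), ";")
				kv[key] = val
				break
			}
		}
	}
	return kv
}

func main() {
	sample := "worker_processes 4;\nlog_level = warn\nlisten: 8080\n# a comment\n"
	pairs := extractKV(bufio.NewScanner(strings.NewReader(sample)))
	fmt.Println(pairs) // map[listen:8080 log_level:warn worker_processes:4]
}
```

For well-known formats (INI, YAML, TOML, JSON) dedicated Go parsers already exist, so a front end that sniffs the format and dispatches to the right parser may end up more robust than treating every file generically.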

Related

Importing or bypassing a complicated SDK

I'm writing a program (C#) to read, convert, display, adjust and output point cloud data.
I can make every part of the program except for one - I am required to read in a proprietary file format. The data is coming straight from a laser scanner and we cannot get any closer to the stream than what is output to the proprietary file in binary.
I have an SDK from the manufacturer/proprietor that is well outside my scope of ability to deal with.
Firstly, it is written in C++, which I can read and write to some degree, but it all appears incredibly complex (there are hundreds of header/source files).
Secondly, the SDK documentation says that I must create my solution (SLN) using CMake, which is also a nightmare for me.
Thirdly, the documentation is scarce and horrid.
Basically my question is this:
I know that after a certain amount of header information I should find thousands of lines of "lineref,x,y,z,r,g,b,time,intensity".
Can I bypass the SDK and find another way to read in this file type?
Or, must an SDK from the proprietor be used to interact with their file type due to some sort of encryption?
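Not an answer to the C#/SDK question as such, but a hedged first step that often settles the "is it encrypted or just packed?" question is simply to look at the bytes. The Go sketch below (the file name is hypothetical) prints a hex+ASCII view of the first 512 bytes: if the "lineref,x,y,z,r,g,b,time,intensity" records are stored as plain text you will see them in the right-hand column, and if not, the records are packed binary and the SDK or its documentation is probably unavoidable.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	// Hypothetical path; point this at your scanner's export file.
	f, err := os.Open("scan.dat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	buf := make([]byte, 512)
	n, err := f.Read(buf)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Hex+ASCII dump of the start of the file: readable point records mean
	// you can likely parse it yourself; gibberish means packed binary.
	fmt.Print(hex.Dump(buf[:n]))
}
```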

Building a 3D model Database - Preferably MySql/Postgres/MongoDB

I am currently working on creating a library of the 3D models created to date by our in-house 3D modellers using Unity3D/Maya/3ds Max, for further analysis and to keep track of each one.
So, my question is: how do I go about storing them?
Should I store them in a database, or use some kind of storage like AWS S3? They are to be stored in .fbx format. Once stored, I would want to perform operations on them, like viewing them online, downloading, etc.
Is there any other way, or some kind of best practice, to do the same?
According to Wikipedia, the .fbx format is represented as "binary or ASCII data". This means that you can hardly analyze it using SQL.
You will probably use a simple BLOB (or other binary type) column to store the model itself. This will allow you to control access to models, share them, add comments, store revisions, etc. For viewing or downloading you will pull the whole file content from the database and serve it as binary data.
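As a rough illustration of that BLOB approach (not a prescribed design), here is a minimal Go sketch. The table name, column names, and connection string are all assumptions, and the driver is the commonly used github.com/lib/pq. It stores an .fbx file in a bytea column and serves the stored bytes back over HTTP for download or online viewing.

```go
package main

import (
	"database/sql"
	"log"
	"net/http"
	"os"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works
)

// Assumed schema (illustrative only):
//   CREATE TABLE models (id SERIAL PRIMARY KEY, name TEXT, fbx BYTEA);

func storeModel(db *sql.DB, name, path string) error {
	data, err := os.ReadFile(path) // the whole .fbx file as one blob
	if err != nil {
		return err
	}
	_, err = db.Exec("INSERT INTO models (name, fbx) VALUES ($1, $2)", name, data)
	return err
}

func serveModel(db *sql.DB, w http.ResponseWriter, r *http.Request) {
	var data []byte
	err := db.QueryRow("SELECT fbx FROM models WHERE name = $1",
		r.URL.Query().Get("name")).Scan(&data)
	if err != nil {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	w.Header().Set("Content-Type", "application/octet-stream")
	w.Write(data) // hand the stored bytes to the viewer/downloader
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/models?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	http.HandleFunc("/model", func(w http.ResponseWriter, r *http.Request) { serveModel(db, w, r) })
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```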

Get MIME Type with Tika with part of a file

Is it possible with Tika to get the MIME type or other metadata without loading the whole file?
I could code a script to grab just the first 1 MB. I am thinking of doing this to take some of the load off Tika and my server.
For container-based formats, Apache Tika needs the whole file to be sure of the type. Container formats include pretty much everything based on a zip file (Word .docx, OpenDocument Format .odf, iWork etc.), anything based on the OLE2 format (Excel .xls, Hangul, MSI etc.), and pretty much all multimedia formats. You can often take a good guess based on the filename and the container type, but to be sure you need to process the whole file to identify the contents and hence the file type.
For everything else, if Tika can detect the file type, then only the first few tens of KB are needed, often even only the first few hundred bytes. (It depends on the format in question - different ones have their predictable signatures in different places.)
If you don't need Tika's very best detection guess, but can make do with slightly lower certainty (especially on container-based formats), then just give Tika the start of the file, or tell Tika to only use the mime magic detector without any of the container-specific detectors.
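Tika itself is a Java library; as a loose illustration of the "only hand the detector the start of the file" idea, here is a Go sketch using net/http.DetectContentType, which sniffs at most the first 512 bytes. It is explicitly not Tika, and it also demonstrates the container-format caveat above: a .docx is a zip container, so magic-byte sniffing alone reports application/zip. The file name is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	f, err := os.Open("report.docx") // hypothetical input file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Read only the first 512 bytes; DetectContentType never looks further.
	head := make([]byte, 512)
	n, _ := f.Read(head)

	// For a .docx this prints "application/zip": magic bytes identify the
	// container, but only full-content inspection (as Tika does) can tell
	// a .docx from an .odt from a plain .zip.
	fmt.Println(http.DetectContentType(head[:n]))
}
```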

Shipping 1.2 million records with an iPhone app

I have a data set of 1.2 million key-value pairs. The keys are strings (sequences of numbers up to 22 characters in length), and the values are strings.
What's the best way to ship this so a value can be looked up and retrieved quickly?
I suspect a plist is not the way to go for a data set this size.
I have the data set stored in two ways: a CSV file, and a MySQL database table with two columns. I'll go forward using whichever method gets the data into the app best.
Core Data and SQLite are two good options for dealing with very large data sets in iOS. It's not difficult to create a Core Data model for the kind of data you're talking about. You can then copy that model into a little command line program that you'll write to move the data into a Core Data store. You can then move the resulting data file into your iOS app's resources.
A third option, particularly useful if the data is likely to change often, is to build a web service that provides the data. I don't think this is what you're asking about, but it's something to consider if the data set is very large and/or subject to frequent change.
A collection of text files could work well. You can:
divide them into multiple files (e.g. by leading character range) - sketched after this list;
order pairs appropriately (e.g. by character number);
quickly and easily read portions incrementally as appropriate;
and balance resources between file reads and memory usage well.
Choosing the right encoding for the strings can also help (I'd start with UTF-8 if it's mostly ASCII).
If distribution size is also a concern, you could compress/decompress these files.
Or, if that sounds like too much parsing and reading to implement, you can take this approach but use a custom serialized class to represent subsets of the collection.
If you're using Objective-C types for storage and/or parsing, then it would be good to keep those files small. If you use C or C++, then it would help to profile the app.
Your data set will require up to 30 MB using an 8-bit-per-character single-byte encoding. One large file (again, ordered) which you mmap would also be worth considering; see -[NSData initWithContentsOfMappedFile:].
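Here is the bucketing idea sketched in Go rather than Objective-C (the bucket layout, file names, and tab delimiter are all assumptions made for illustration): pairs are split into one sorted file per leading key character, and a lookup only ever opens the single bucket that key can live in.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// writeBuckets splits the pairs into one sorted file per leading key byte,
// e.g. <dir>/1.txt holds every key starting with '1'.
func writeBuckets(pairs map[string]string, dir string) error {
	buckets := map[byte][]string{}
	for k, v := range pairs {
		buckets[k[0]] = append(buckets[k[0]], k+"\t"+v)
	}
	for b, lines := range buckets {
		sort.Strings(lines) // ordered pairs keep later scans cheap
		err := os.WriteFile(filepath.Join(dir, fmt.Sprintf("%c.txt", b)),
			[]byte(strings.Join(lines, "\n")), 0o644)
		if err != nil {
			return err
		}
	}
	return nil
}

// lookup opens only the one bucket the key can live in and scans it.
func lookup(dir, key string) (string, bool) {
	f, err := os.Open(filepath.Join(dir, fmt.Sprintf("%c.txt", key[0])))
	if err != nil {
		return "", false
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if k, v, ok := strings.Cut(sc.Text(), "\t"); ok && k == key {
			return v, true
		}
	}
	return "", false
}

func main() {
	dir, _ := os.MkdirTemp("", "buckets")
	_ = writeBuckets(map[string]string{"123": "alpha", "199": "beta", "456": "gamma"}, dir)
	fmt.Println(lookup(dir, "199")) // beta true
}
```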
My personal experience is with a plist file that has only thousands of records, and I can say that it isn't so fast. So my options for the amount of data you have are:
A database.
Or, if you have a sorting criterion for those keys and prefer plist files, split the data into many files and keep a reference dictionary with the start key of every file. For example, all keys that begin with 'abc' go in a.plist, etc.
(I don't know if it's the case with your app, but you could consider moving the data to a server and searching via a web service, especially if your data will grow.)
An SQLite file is probably your best bet. You can create it on the desktop using either the command-line sqlite3 tool or any SQLite GUI. Make sure you index the key column.
Import the csv file as described here: Importing csv files into sqlite
Then just add the database to your project/target. If you want to modify the database at runtime, however, you'll have to copy it out into your Documents or cache directories.
For an Objective-C wrapper around SQLite, I like FMDB.
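The desktop-side preparation step (CSV in, indexed SQLite file out) could look roughly like the Go sketch below; the file names are hypothetical and the driver used is the widely available github.com/mattn/go-sqlite3. On the device itself you would then read the bundled database with SQLite, FMDB, or Core Data as discussed above.

```go
package main

import (
	"database/sql"
	"encoding/csv"
	"io"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // cgo SQLite driver
)

func main() {
	// Hypothetical names: pairs.csv has two columns, key and value.
	in, err := os.Open("pairs.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	db, err := sql.Open("sqlite3", "pairs.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE kv (key TEXT, value TEXT)`); err != nil {
		log.Fatal(err)
	}
	// The index on key is what makes lookups fast on the device.
	if _, err := db.Exec(`CREATE INDEX kv_key ON kv(key)`); err != nil {
		log.Fatal(err)
	}

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	stmt, err := tx.Prepare("INSERT INTO kv (key, value) VALUES (?, ?)")
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Close()

	r := csv.NewReader(in)
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if _, err := stmt.Exec(rec[0], rec[1]); err != nil {
			log.Fatal(err)
		}
	}
	// A single transaction keeps the 1.2M-row import fast.
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```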

Binary Serialized File - Delphi

I am trying to deserialize an old file format that was serialized in Delphi; it uses binary serialization. I know nothing about the structure of the file except some very high-level records that are in it.
What steps would you take to solve this problem? Any tools etc?
A good hex editor, and use the gray matter to identify structures.
If you get a hint what kind of file it is, you can search for more specialized tools.
Running the Unix/Linux "file" command can be good too (*). It can be a quick check for common file types like DBF, ZIP, etc. hidden behind a different extension.
(*) There are 3rd-party builds for Windows, but they might lag behind in version. If you can do it on a recent *nix distro, it is advisable to do so.
The serialization process simply loops over all published properties and streams their values to the file. If you do not know the exact classes that were streamed to the file, you will have a very hard time deserializing it (if not an impossible one).
A good hex editor comes first. If the file is read without buffering (e.g. read directly from a TFileStream), you could gain some information by using ProcMon from Sysinternals: you can see exactly what data is read in what chunks, and thus determine more quickly where the boundaries are between the structures you have already identified.
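Alongside the hex editor, a quick strings-style scan is often useful, because class and property names in Delphi-streamed files tend to be stored as plain ASCII and stand out immediately. The Go sketch below (the file name is hypothetical) prints every run of four or more printable ASCII bytes together with its offset, much like the Unix strings -t x tool.

```go
package main

import (
	"fmt"
	"os"
)

// Print every run of printable ASCII of four or more bytes with its offset,
// which helps spot embedded class/property names when mapping out the file.
func main() {
	data, err := os.ReadFile("legacy.dat") // hypothetical file name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	start := -1
	for i, b := range data {
		printable := b >= 0x20 && b <= 0x7e
		if printable && start < 0 {
			start = i // a new run of printable bytes begins here
		}
		if (!printable || i == len(data)-1) && start >= 0 {
			end := i
			if printable {
				end = i + 1 // run reaches the end of the file
			}
			if end-start >= 4 {
				fmt.Printf("%08x  %s\n", start, data[start:end])
			}
			start = -1
		}
	}
}
```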
