File format for metadata of a data set - machine-learning

I would like to know about and compare as many file formats as possible that can store metadata for features, attributes, or fields.
The metadata of an attribute may include:
storage type: String, numeric, integer, datetime, etc.
scale type: Nominal, Ordinal, Interval, Ratio, etc.
datetime format to parse: "YYYY-mm-dd_HH:MM:SS", etc.
For example, ARFF and SAV can hold attribute metadata:
Text: ARFF of Weka
Binary: SAV, ZSAV of SPSS
I would very much appreciate it if you could let me know about other data formats so that I can compare them.

You need to search the internet some more; you missed a number of obvious choices, ranging from "abused" formats such as Python pickles, to generic formats such as JSON and YAML, to "big data" formats such as Apache Arrow, to scientific formats such as HDF4 and HDF5.
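To make the comparison concrete, here is a minimal sketch of the generic JSON route named above: the attribute metadata from the question (storage type, scale type, datetime format) kept in a JSON sidecar file next to the data set. The attribute names and the sidecar file name are made up for illustration.

import json

attribute_metadata = {
    "attributes": [
        {"name": "age", "storage_type": "integer", "scale_type": "ratio"},
        {"name": "color", "storage_type": "string", "scale_type": "nominal"},
        {"name": "visited_at", "storage_type": "datetime", "scale_type": "interval",
         "datetime_format": "YYYY-mm-dd_HH:MM:SS"},
    ]
}

# write the sidecar next to the data file, e.g. dataset.csv + dataset.meta.json
with open("dataset.meta.json", "w", encoding="utf-8") as f:
    json.dump(attribute_metadata, f, indent=2)

A YAML sidecar would carry exactly the same structure; the richer formats (ARFF, SAV, Arrow, HDF5) embed the equivalent information in the file header or schema instead.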

Related

How do I convert ICU-formatted strings into a TMX (Translation Memory eXchange) file?

I am attempting to aggregate multiple data sources and locales into a single TMX translation memory file.
I cannot seem to find any good documentation or existing tools on how converting into the TMX format might be achieved. These converters are the closest thing I have found, but they do not appear to be sufficient for handling ICU syntax.
Right now I have extracted my strings into JSON, which looks something like this:
{
  "foo_id": {
    "en": "This is a test",
    "fr": "Some translation"
  },
  "bar_id": {
    "en": "{count, plural, one{This is a singular} other{This is a test for count #}}",
    "fr": "{count, plural, one{Some translation} other{Some translation for count #}}"
  }
}
Based on how many translation vendors allow ICU formatting when submitting content and then export their TM as .tmx files, it feels like this must be a solved problem, but information seems scarce. Does anyone have experience with this? I am using formatjs to write the ICU strings.
Since TMX only really supports plain segments with simple placeholders (not plural forms), it's not easy to convert from ICU to TMX.
Support for ICU seems pretty patchy in translation tools, but there is another format which does a similar job and has better support: gettext .po. Going via .po to get to TMX might work (see the sketch after these steps):
Use the ICU2po tool to convert from ICU to .po format
Import the .po file into a TMS (e.g. Phrase) or a CAT tool (e.g. Trados)
Run the human/machine translation process
Export a TMX
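If ICU2po does not fit your pipeline, here is a rough sketch of what the first step can look like when done by hand with the polib library, starting from the JSON layout shown in the question. The plural ICU messages are passed through untouched (that is exactly the part TMX cannot represent), and the file names are assumptions.

import json
import polib

with open("strings.json", encoding="utf-8") as f:
    strings = json.load(f)

po = polib.POFile()
po.metadata = {"Content-Type": "text/plain; charset=UTF-8"}

for message_id, translations in strings.items():
    po.append(polib.POEntry(
        msgctxt=message_id,        # keep the original string ID as context
        msgid=translations["en"],  # source-language segment
        msgstr=translations["fr"], # target-language segment
    ))

po.save("translations.fr.po")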

Extracting data from invoices in PDF or image format

I am working on an invoice parser which extracts data from invoices in PDF or image format. It works on simple PDFs with non-tabular data, but for PDFs containing tables it produces a lot of output data to process. I have not been able to find a working generic solution for this. I have tried the following libraries:
Invoice2Data: It is based on templates. It has given fairly good results in JSON format so far, but template creation for complex PDFs containing dynamic tables is complicated.
Tabula: Table extraction is based on the coordinates of the table to be extracted. If the data in the table grows, the table length increases and the coordinates change, so in that case it gives wrong results.
Pdftotext: It converts any PDF to text, but in a format that needs a lot of parsing, which we do not want.
AWS Textract and Rossum Elis: They give all the data in JSON format, but if a table column contains multiple lines, JSON parsing becomes difficult. The JSON is also huge to parse.
Tesseract: Same as pdftotext; complex PDFs are not parseable.
Has anyone been able to parse complex PDF data, either with other tools or with a combination of the above libraries? Please help.
I am working on a similar business problem. Since invoices don't have a fixed format, you can't directly use any text-parsing method.
To solve this you have to use computer vision (deep learning) for field detection and pytesseract OCR for converting the detected regions into text. For a better understanding, here are the steps:
Convert the invoices to images and annotate the images with fields like address, amount, etc. using a tool like labelImg. (For better results, use 500-1000 invoices of different types.)
After generating the XML annotation files, train an object detection model such as YOLO or the TF Object Detection API.
The model will detect the fields and give you the coordinates of each region of interest (ROI).
Apply pytesseract OCR to the cropped ROI (see the sketch after these steps).
Finally, use a regex to validate the text in each extracted field and perform any manipulation/transformation that is necessary, then store the data in a CSV file or a database.
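Here is a minimal sketch of the last two steps, assuming the detector has already returned pixel bounding boxes; the field names, box coordinates, and regex are hypothetical placeholders.

import re
import pytesseract
from PIL import Image

image = Image.open("invoice_page_1.png")        # one invoice page rendered as an image
detections = {                                  # hypothetical output of the object detector
    "invoice_number": (1020, 80, 1480, 140),    # (left, top, right, bottom) in pixels
    "total_amount": (1100, 1850, 1500, 1920),
}

fields = {}
for name, box in detections.items():
    roi = image.crop(box)                       # cut out the region of interest
    fields[name] = pytesseract.image_to_string(roi).strip()

# validate/clean the extracted amount with a regex before storing it
match = re.search(r"\d+[.,]\d{2}", fields.get("total_amount", ""))
fields["total_amount"] = match.group(0) if match else None

print(fields)                                   # write to CSV or a database from here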
Hope my answer helps you!

PostgreSQL Storing Large JSON Strings (How to Best Optimize for Storage)

I have three tables, each with an attribute storing large JSON strings as text in a PostgreSQL database (on an AWS RDS instance). These JSON fields represent Fabric.js JavaScript objects for canvas drawing and manipulation. Many of them contain SVG representations of complex shapes, such as maps.
My initial assumption was that we could compress these fields whenever writing to the DB and decompress them whenever we need the JSON. I naively wrote a script for the conversion, using a bytea column type and zlib compression. I selected some records (using Rails ActiveRecord) to compare the byte size of the JSON with the byte size of the compressed JSON, confirming that the compressed version was usually about half the size. I wanted a proof of concept that compressing the JSON attribute on one table (notebook_question_answers) would save some space.
After running my script on all the records of that table, I was very surprised to find that compressing the JSON fields and converting them to a binary type resulted in ADDED storage usage in the database.
So I have two questions:
Does the bytea data type have additional overhead? There are many NULL values in this particular column, and at least 1-4 bytes are allocated for each binary entry. I read in the PostgreSQL documentation that the text type is optimized for text, so you should use it. If that is the case, why is there added storage when using compression?
I've started reading about the json vs jsonb PostgreSQL data types. As far as I understand, json stores the raw text and is parsed on use, while jsonb stores a decomposed binary form and offers optimized indexing for search. Would either of these be a good path for reducing storage when saving large JSON strings? What would you suggest?
EDIT:
The column in question was originally of data type text, and I changed it to bytea.
After running the following command, I now see that the table takes up less space than before: roughly a 30% reduction in total bytes.
VACUUM (VERBOSE, ANALYZE, FULL) notebook_question_answers;
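For reference, a minimal sketch of the compress-on-write comparison described above, using Python's zlib on a made-up stand-in for a Fabric.js canvas string (the real script used Rails/ActiveRecord):

import json
import zlib

# made-up stand-in for a large Fabric.js canvas JSON string
canvas_json = json.dumps({"objects": [{"type": "path", "path": "M 0 0 L 10 10"}] * 1000})

compressed = zlib.compress(canvas_json.encode("utf-8"), 6)
print(len(canvas_json), len(compressed))   # raw vs compressed byte counts

restored = zlib.decompress(compressed).decode("utf-8")
assert restored == canvas_json

One likely explanation for the surprising result: PostgreSQL already compresses large text values transparently via TOAST, so compressing them again client-side and storing bytea rarely shrinks the table, and the dead row versions left behind by the rewrite are only reclaimed by a VACUUM FULL, which matches the ~30% reduction seen in the edit.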

Fastest iOS data format for parsing

I need a data format which will allow me to reduce the parsing time to a minimum. In other words, I'm looking for a format with as little overhead as possible that can be parsed in the shortest amount of time.
I am building an application which will pull a lot of data from an API, parse it and display it to the user. So the format should be as small as possible so that the transmission is fast, and it should also be very efficient to parse. What are my options?
Here are a few formats that pop into my head:
XML (a lot of overhead and slow parsing IMO)
JSON (still too cumbersome)
MessagePack (looks interesting)
CSV (with a custom parser written in C)
Plist (fast parsing, a lot of overhead)
... any others?
So currently I'm looking at CSV the most. Any other suggestions?
As stated by Apple in the Property List Programming Guide, the binary plist representation should be the fastest:
Property List Representations
A property list can be stored in one of three different ways: in an
XML representation, in a binary format, or in an “old-style” ASCII
format inherited from OpenStep. You can serialize property lists in
the XML and binary formats. The serialization API with the old-style
format is read-only.
XML property lists are more portable than the binary alternative and
can be manually edited, but binary property lists are much more
compact; as a result, they require less memory and can be read and
written much faster than XML property lists. In general, if your
property list is relatively small, the benefits of XML property lists
outweigh the I/O speed and compactness that comes with binary property
lists. If you have a large data set, binary property lists, keyed
archives, or custom data formats are a better solution.
You just need to set the correct format flag, NSPropertyListBinaryFormat_v1_0, when creating or reading the property list. Just be sure that the data you want to put in the plist can be represented by this format.
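As an illustration only (Python's plistlib writes the same binary plist format; in the app itself you would go through NSPropertyListSerialization with NSPropertyListBinaryFormat_v1_0), here is a sketch comparing the XML and binary encodings of the same data:

import plistlib

payload = {"items": [{"id": i, "name": f"item-{i}"} for i in range(1000)]}

xml_bytes = plistlib.dumps(payload, fmt=plistlib.FMT_XML)
bin_bytes = plistlib.dumps(payload, fmt=plistlib.FMT_BINARY)
print(len(xml_bytes), len(bin_bytes))   # the binary form is noticeably smaller

roundtrip = plistlib.loads(bin_bytes)   # the format is auto-detected on read
assert roundtrip == payload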

Extensible toolkits or approaches to sniffing file formats from messy data?

Are there any frameworks out there that support file format sniffing using declarative, fuzzy schema and/or syntax definitions for the valid formats? I'm looking for something that can handle dirty or poorly formatted files, potentially across multiple versions of file format definitions/schemas, and that makes it easy to write rule- or pattern-based sniffers which make a best guess at file types based on introspection.
I'm looking for something declarative, allowing you to define formats descriptively, maybe a DSL, something like:
format A, v1.0:
    is tabular
    has an "id" and a "name" column
    may have a "size" column
        with integer values in the 1-10 range
    is tab-delimited
    usually ends in .txt or .tab
format A, v1.1:
    is tabular
    has an "id" column
    may have a "name" column
    may have a "size" column
        with integer values in the 1-10 range
    is tab- or comma-separated
    usually ends in .txt, .csv or .tab
The key is that the incoming files may be mis-formatted, either due to user error or to poorly implemented exports from other tools, and the classification may be non-deterministic. So this would need to support multiple, partial matches to format definitions, along with useful explanations. A simple voting scheme is probably enough to rank guesses (i.e. the more problems found, the lower the match score).
For example, given the above definitions, a comma-delimited "test.txt" file with an "id" column and an empty "size" column would result in a sniffer log something like:
Probably format A, v1.1
- but "size" column is empty
Possibly format A, v1.0
- but "size" column is empty
- but missing "name" column
- but is comma-delimited
The Sniffer functionality in the Python standard library (csv.Sniffer) is heading in the right direction, but I'm looking for something more general and extensible (and not limited to tabular data). Any suggestions on where to look for something like this?
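For what it's worth, a minimal sketch of the voting idea described above, built around csv.Sniffer for delimiter detection; the rule sets are hand-coded stand-ins for the hypothetical "format A" definitions rather than a declarative DSL, and column-content rules (such as the empty "size" check) are left out:

import csv
import io

# hand-coded stand-ins for the "format A" definitions above
FORMATS = {
    "format A, v1.0": {"required": {"id", "name"}, "delims": {"\t"}},
    "format A, v1.1": {"required": {"id"}, "delims": {"\t", ","}},
}

def sniff(text):
    dialect = csv.Sniffer().sniff(text)                       # guess the delimiter
    header = set(next(csv.reader(io.StringIO(text), dialect)))
    scored = []
    for name, spec in FORMATS.items():
        problems = [f'but missing "{col}" column' for col in sorted(spec["required"] - header)]
        if dialect.delimiter not in spec["delims"]:
            problems.append(f"but is {dialect.delimiter!r}-delimited")
        scored.append((len(problems), name, problems))
    for count, name, problems in sorted(scored):               # fewer problems = better match
        print(("Probably " if count == 0 else "Possibly ") + name)
        for problem in problems:
            print("  - " + problem)

sniff("id,size\n1,\n2,\n")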
First of all, I am glad I have found this question - I am thinking of something similar too (a declarative way to describe any file format and feed it, along with the file itself, to a tool that can verify the file).
What you are calling a "sniffer" is widely known as a "file carver", and this person is a big name in carving: http://en.wikipedia.org/wiki/Simson_Garfinkel
Not only has he developed an outstanding carver, he has also provided definitions of the different cases of incomplete files.
So, if you are working on a repair tool for some particular file format, check the aforementioned classification to find out how complex the problem is. For example, carving from an incompletely received data stream and carving from a disk image differ significantly. Carving from a fragmented disk image would be insanely more difficult, whereas padding a video file with meaningless data just to make it open in a video player is easy: you just have to provide the correct format.
Hope it helped.
Regards
