Apache Parquet: human-readable encoding present? - avro

This question is inspired by the Avro Json Encoding: https://avro.apache.org/docs/1.8.1/spec.html#Encodings
Avro specifies two serialization encodings: binary and JSON. Most applications will use the binary encoding, as it is smaller and faster.
But, for debugging and web-based applications, the JSON encoding may sometimes be appropriate.
Is there a similar way to store data in Parquet in a human-readable form, for debugging purposes?

Related

Why database file is a yml instead of rb

I was asked in a recent interview why the database file is a yml file instead of an rb file. Initially I was baffled by the question, and I tried to answer that we can serialize and deserialize the yml file, but that answer was not satisfactory. Can someone share their views on this?
As I noticed,
YAML is a superset of JSON. It is visually easier to scan and easy to read.
We can use "anchors" to reference other data in YAML, so it can handle relational data.
YAML is more robust about embedding other serialization formats such as JSON.
It avoids the unnecessary object creation that an .rb file would involve.
So configuration that consists only of key-value pairs is stored in YAML files.
In short, data and code should be separated for sanity. Functionally, data plays a different role from code. That's why we store data in a database, or serialize it to JSON or YAML.
Loading config from YAML is deserialization. It's human-readable, free of irrelevant language concerns, and if you want to migrate from an old codebase, it's much easier when your configuration is in a text format. YAML wins over JSON here because of readability.
Rails is built on the concept of separating layers by their logical function. MVC is designed for the same reason, and you can have a separate auth layer outside those three if necessary.

What is the difference between XML and RDF [duplicate]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I am unsure about the differences between JSON, XML and RDF.
I read on the internet:
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.
The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web.
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
So if I'm not mistaken, JSON is used for exchanging data, and XML is used for exchanging data too.
What are the main differences between these two?
RDF is used to describe resources on the Web and is based on an XML syntax. So can XML be used both for exchanging information and for defining new languages?
Can you give me some clarification?
EDIT:
What I understand is:
"Resource Description Framework" suggests that it provides a framework for describing resources. In a university exam I used RDF to describe the ontology of a company: I described the main components of a company and the relationships between them.
RDF is important for the semantic web because describing resources allows us to associate semantic meaning with them.
XML is a markup language. A markup language is a set of rules that describe the mechanisms of representation (structural, semantic or presentational) of a text (Wikipedia). For this reason it can be used to define the structure of the text of RDF, SOAP, etc.
You also say that it is used for data serialization.
JSON is only for data serialization. For serializing data, JSON and XML are similar, but with XML and XML Schema I can associate semantic meaning with data, or am I wrong?
XML started life as a document markup language. It has additionally been widely used to store (serialize) data structures in various computer languages and is the basis of SOAP-based web services.
JSON and YAML are designed to record data structures. YAML has been described as a superset of JSON. In practice I have found there is little practical difference, apart from the fact that YAML is simpler for humans to read and write. JSON is now more widely favoured by REST-based web services, due to its simplicity.
RDF is less a data format and more accurately described as a metadata data model. It is used to record information on the internet and is one of the building-block standards of the Semantic Web. RDF can be expressed in various different formats, for example XML and JSON. I can recommend the following link as an introduction:
https://github.com/JoshData/rdfabout/blob/gh-pages/intro-to-rdf.md
For some RDF examples and more discussion on this topic:
JSON to XML conversion
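To illustrate the point that RDF is a data model rather than a format, here is a minimal Python sketch. The company data, the `http://example.org/` namespace and the helper names are all hypothetical, and the JSON output is a plain ad hoc layout, not JSON-LD or any official RDF serialization. The same set of triples is written out in two syntaxes and queried without reference to either.

```python
import json

# Hypothetical example data: each statement is a (subject, predicate, object) triple.
EX = "http://example.org/"
triples = [
    (EX + "acme", EX + "hasDepartment", EX + "sales"),
    (EX + "sales", EX + "managedBy", EX + "alice"),
]

def to_ntriples(ts):
    """Serialize as N-Triples-style lines (all terms treated as IRIs)."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in ts)

def to_json(ts):
    """Serialize the same statements as a plain JSON list (not JSON-LD)."""
    return json.dumps([{"s": s, "p": p, "o": o} for s, p, o in ts], indent=2)

def objects(ts, subject, predicate):
    """Query the model itself, independently of any serialization."""
    return [o for s, p, o in ts if s == subject and p == predicate]
```

Calling `objects(triples, EX + "acme", EX + "hasDepartment")` returns the department IRI whichever serialization the data was stored in, which is the sense in which the model is separate from XML or JSON.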

Dealing with invalid characters from web scraping

I've written a web scraper to extract a large amount of information from a website using Nokogiri and Mechanize, which outputs a database seed file. Unfortunately, I've discovered there are a lot of invalid characters in the text on the source website, things like keppnisæfind, Scémario and Klätiring, which are preventing the seed file from running. The seed file is too large to go through with search and replace, so how can I go about dealing with this issue?
I think those are HTML characters; all you need to do is write functions that clean them up. How you do this depends on the programming platform.
Those are almost certainly UTF-8 characters; the words should look like keppnisæfind, Scémario and Klätiring. The web sites in question might be sending UTF-8 but not declaring that as their encoding, in which case you will have to force Mechanize to use UTF-8 for sites with no declared encoding. However, that might complicate matters if you encounter other web sites without a declared encoding and they send something besides UTF-8.
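As a rough illustration of the diagnosis above, the Python sketch below shows how UTF-8 text becomes garbled when its bytes are decoded with the wrong encoding, and one way to reverse that. Latin-1 as the wrong encoding is an assumption (the site or scraper may have used something else, such as Windows-1252), and `garble` and `repair` are hypothetical helper names. Fixing the encoding at fetch time, as suggested above, is preferable to repairing text afterwards.

```python
# The three words from the question, as they should look.
words = ["keppnisæfind", "Scémario", "Klätiring"]

def garble(text: str) -> str:
    """Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1 (assumed)."""
    return text.encode("utf-8").decode("latin-1")

def repair(text: str) -> str:
    """Reverse the wrong decoding; return the text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not garbled in the assumed way, so leave it alone.
        return text

for word in words:
    print(garble(word), "->", repair(garble(word)))
```

The `try`/`except` matters because text that was decoded correctly usually does not survive the reverse round trip, so `repair` leaves it as it is.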

Framework for building structured binary data parsers?

I have some experience with Pragmatic-Programmer-type code generation: specifying a data structure in a platform-neutral format and writing templates for a code generator that consume these data structure files and produce code that pulls raw bytes into language-specific data structures, does scaling on the numeric data, prints out the data, etc. The nice pragmatic(TM) ideas are that (a) I can change data structures by modifying my specification file and regenerating the source (which is DRY and all that) and (b) I can add additional functions that can be generated for all of my structures just by modifying my templates.
What I had used was a Perl script called Jeeves which worked, but it's general purpose, and any functions I wanted to write to manipulate my data I was writing from the ground up.
Are there any frameworks that are well-suited for creating parsers for structured binary data? What I've read of Antlr suggests that it's overkill. My current target languages of interest are C#, C++, and Java, if it matters.
Thanks as always.
Edit: I'll put a bounty on this question. If there are any areas that I should be looking at (keywords to search on) or other ways of attacking this problem that you've developed yourself, I'd love to hear about them.
You may also want to look at a relatively new project, Kaitai Struct, which provides a language for that purpose and also has a good IDE:
Kaitai.io
You might find ASN.1 interesting, as it provides an abstract way to describe the data you might be processing. If you use ASN.1 to describe the data abstractly, you need a way to map that abstract data to concrete binary streams, for which ECN (Encoding Control Notation) is likely the right choice.
The New Jersey Machine-Code Toolkit is actually focused on binary data streams corresponding to instruction sets, but I think that's a superset of just binary streams. It has very nice facilities for defining fields in terms of bit strings, and automatically generating accessors and generators of such. This might be particularly useful if your binary data structures contain pointers to other parts of the data stream.
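For a sense of what the specification-driven approach from the question looks like in miniature, here is a Python sketch using only the standard `struct` module. It is not how Kaitai Struct, ASN.1/ECN or the New Jersey toolkit work. The record layout in `SPEC`, the field names and the `make_parser` helper are all hypothetical. The parser is derived from a declarative field table, so changing the table changes the parser.

```python
import struct
from collections import namedtuple

# Hypothetical record layout: (field name, struct format code, scale factor).
SPEC = [
    ("msg_id", "H", 1),          # unsigned 16-bit integer
    ("temperature", "h", 0.1),   # signed 16-bit, stored in tenths of a degree
    ("pressure", "I", 1),        # unsigned 32-bit integer
]

def make_parser(name, spec, byte_order="<"):
    """Build a parse function and record size from the field specification."""
    layout = struct.Struct(byte_order + "".join(code for _, code, _ in spec))
    record = namedtuple(name, [field for field, _, _ in spec])

    def parse(raw: bytes):
        # Unpack the raw bytes, then apply each field's scale factor.
        values = layout.unpack(raw[:layout.size])
        return record(*(v * scale for v, (_, _, scale) in zip(values, spec)))

    return parse, layout.size

parse_sample, SAMPLE_SIZE = make_parser("Sample", SPEC)
```

Passing the eight bytes produced by `struct.pack("<HhI", 7, -125, 101325)` to `parse_sample` yields a `Sample` record with the temperature scaled to degrees. Adding a field means adding one row to `SPEC`, not editing parsing code by hand.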

Why is XML store not available in iOS?

From the Core Data Programming Guide:
iOS: The XML store is not available on iOS.
Why isn't this available? Is it because of the lack of certain XML classes or does it require too much processing power or RAM?
Apple would be the authoritative source for this, so we can only guess.
It’s probably because of two factors: XML stores are slower (as stated in the official documentation, mainly because of the need to parse XML and the lack of efficient algorithms and data structures for common database operations) and potentially use more disk space than SQLite stores (since data must be enclosed in tags and XML stores use a human-readable representation of data).
Edit: libxml2 is available on iOS so XML parsing functionality (or lack thereof) is certainly not the reason.
