The MessagePack specification describes the available data types and their binary representation. However, I can't find information on the "grammar" that has to be used for building valid msgpack structures (since it's not a textual but a binary format, the term "grammar" is probably not accurate). I'm especially wondering if there are any requirements regarding the "top-level" elements within a msgpack structure. It's basically a similar question to the one that occurred for JSON some time ago, although JSON's grammar is specified.
Using msgpack, is it OK to have primitive types (Int, Boolean, ...) at the top level, or does everything have to be encapsulated in a map/array? Is it possible to have multiple elements at the top level (e.g. two arrays, not nested, but "next to each other")?
This is for the case of calling Saxon from a Java application. I understand that Saxon can use XPath 3.1 to run queries against JSON files. A couple of questions on this:
Is there an example of how to do this? I've done searches and found lots of answers on details of doing this, but nothing on how to read in the file and perform queries. Is it the same as XML?
Is it possible to have a schema file for the JSON so returned values are correctly typed? If so, how?
Is XQuery also able to perform queries on JSON?
What version of Saxon supports this? (We are using 9.9.1.1 and want to know if we need to upgrade.)
Technically, you don't run queries against JSON files; you run them against the data structure that results from parsing a JSON file, which is a structure of maps and arrays. You can parse the JSON file using the parse-json() or json-doc() functions, and then query the result using operators that work on maps and arrays. Some of these (and examples of their use) are shown in the spec at
https://www.w3.org/TR/xpath-31/#id-maps-and-arrays
Googling for "query maps arrays JSON XPath 3.1" finds quite a lot of useful material. Or get Priscilla Walmsley's book: http://www.datypic.com/books/xquery/chapter24.html
Data types: the data types of string, number, and boolean that are intrinsic to JSON are automatically recognized by their form. There's no capability to do further typing using a schema.
XQuery is a superset of XPath, but as far as JSON/Maps/Arrays are concerned, I think the facilities in XPath and those in XQuery are exactly the same.
Saxon has added a bit of extra conformance and performance in each successive release. 9.9 is pretty complete in its coverage; 10.0 adds some optimizations (like a new internal data structure for maps whose keys are all strings, such as you get when you parse JSON). Changes in successive Saxon releases are described in copious detail at http://www.saxonica.com/documentation/index.html#!changes
What is semantic and syntactic interoperability in IoT, and what is the difference between them? I am reading papers, googling, etc. in order to understand what syntactic and semantic interoperability are in IoT and what the difference between them is, but I am really confused, either because my background in this field is too poor or because I cannot grasp the small (?) boundary between these two terms. Can you help with an example, or anything else that could help me?
Thank you...
Taking a very concrete example: LWM2M defines a syntactic standard and adds many semantic standards on top.
The syntactic standard defines how to transfer data, i.e. how strings, integers, floats, arrays, and structs are represented and transferred. This part of the standard does not care whether you transfer temperature data, smart meter data, parking sensor data, or anything else.
The semantic standard defines how, for example, a temperature sensor is represented. See the LWM2M Registry under ID 3303 for details. On that page you can find semantic standards for different domains.
Another view of the syntactic vs. semantic distinction: JSON defines a syntactic standard, while a specific JSON Schema file defining the JSON for a temperature sensor would provide a semantic standard.
If I want to train the Stanford Neural Network Dependency Parser for another language, there is a need for a "TreebankLanguagePack" (TLP), but the information about this TLP is very limited:
particularities of your treebank and the language it contains
I have my "treebank" in another language that follows the same format as the PTB, my data is in CoNLL format, and the dependency format follows Universal Dependencies (UD). Do I need this TLP?
As of the current CoreNLP release, the TreebankLanguagePack is used within the dependency parser only to 1) determine the input text encoding and 2) determine which tokens count as punctuation [1].
Your best bet for a quick solution, then, is probably to stick with the UD English TreebankLanguagePack. You should do this by specifying the property language as "UniversalEnglish" (whether you're accessing the dependency parser via code or command line). If you're using the dependency parser via the CoreNLP main entry point, this property key should be depparse.language.
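For example (a sketch based on the standard CoreNLP command line; the input file name is made up): java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -depparse.language UniversalEnglish -file input.txt. From code, set the same keys on the Properties object you pass to the StanfordCoreNLP constructor.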
Technical details
Two very subtle details follow. You probably don't need to worry about these if you're just trying to hack something together at first, but it's probably good to mention them so that you can avoid apocalyptic / head-smashing bugs in the future.
Evaluation and punctuation: If you do choose to stick with UniversalEnglish, be aware that there is a hack in the evaluation code that overrides the punctuation set for English parsing in particular. Any changes you make to punctuation in PennTreebankLanguagePack (the TLP used for the UniversalEnglish language) will be ignored! If you need to get around this, it should be enough to copy and paste the PennTreebankLanguagePack into your own codebase and name it something different.
Potential memory leak: When building parse results to be returned to the user, the dependency parser draws from a pool of cached GrammaticalRelation objects. This cache does not live-update. This means that if you have relations which aren't formally defined in the language you specified via the language property, they will lead to the instantiation of a new object whenever those relations show up in parser predictions. (This can be a big deal memory-wise if you happen to store the parse objects somewhere.)
[1]: Punctuation is excluded during evaluation. This is a standard "cheat" used throughout the dependency parsing literature.
aeson seems to take a somewhat simple-minded approach to parsing JSON: it parses a top-level JSON value (an object or array) to its own fixed representation and then offers facilities to help users convert that representation to their own. This approach works pretty well when JSON objects and arrays are small. When they're very large, things start to fall apart, because user code can't do anything until the JSON value has been completely read and parsed. This seems particularly unfortunate since JSON seems to be designed for recursive descent parsers; it looks like it should be fairly simple to allow user code to step in and say how each piece should be parsed. Is there a deep reason aeson and the earlier json library work this way, or should I try to make a new library for more flexible JSON parsing?
json-stream is a stream-based parser. This is a bit out of date (2015), but they took the benchmarks from aeson and compared the two libraries: aeson and json-stream performance comparison. There is one case where json-stream is significantly worse than aeson.
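To give a concrete feel for the streaming style, here is a minimal sketch using json-stream (the function names are from Data.JsonStream.Parser as I remember them, and big.json is a made-up file, so treat this as an approximation and check the package docs):

    import qualified Data.ByteString.Lazy as BL
    import Data.Aeson (Value)
    import Data.JsonStream.Parser (arrayOf, parseLazyByteString, value)

    -- Stream the elements of a huge top-level JSON array one at a time
    -- instead of materialising the whole document as a single aeson Value.
    main :: IO ()
    main = do
      input <- BL.readFile "big.json"
      let items = parseLazyByteString (arrayOf value) input :: [Value]
      mapM_ print (take 3 items)

The point is that the result list is produced incrementally as the input is consumed, so user code never has to wait for the whole document to be parsed, which is exactly the limitation of aeson that the question describes.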
If you just want a faster aeson (not streaming), haskell-sajson looks interesting. It wraps a performant C++ library in Haskell and returns Value from aeson.
Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable for parsing binary file formats, i.e. complex formats that involve conditional segments, out-of-order segments, etc.
Is this possible, or is there a similar, alternative package that does it? If not, what is the best way to parse binary file formats in Haskell?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on Hackage as well.
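To make the packet-parsing case concrete, here is a small attoparsec sketch for an invented layout (one tag byte, a two-byte big-endian length, then the payload; the format and names are made up for illustration):

    import qualified Data.Attoparsec.ByteString as A
    import qualified Data.ByteString as BS
    import Data.Bits (shiftL, (.|.))
    import Data.Word (Word16, Word8)

    -- Invented layout: 1-byte type tag, 2-byte big-endian length, payload.
    data Packet = Packet { pktType :: Word8, pktPayload :: BS.ByteString }
      deriving Show

    -- Assemble a big-endian 16-bit word from two bytes.
    word16be :: A.Parser Word16
    word16be = do
      hi <- A.anyWord8
      lo <- A.anyWord8
      pure (fromIntegral hi `shiftL` 8 .|. fromIntegral lo)

    packet :: A.Parser Packet
    packet = do
      tag <- A.anyWord8
      len <- word16be
      Packet tag <$> A.take (fromIntegral len)

    main :: IO ()
    main = print (A.parseOnly packet (BS.pack [0x01, 0x00, 0x03, 0xDE, 0xAD, 0xBE]))

Running it should print Right (Packet {pktType = 1, pktPayload = "\222\173\190"}), i.e. a tag of 1 followed by the three payload bytes.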
You might be interested in AttoParsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
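Here is a minimal sketch of that pattern for an invented two-alternative type (the type, tag values, and field layout are all made up for illustration):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Binary (Binary (..), decode, encode)
    import Data.Binary.Get (getByteString, getWord32le, getWord8)
    import Data.Binary.Put (putByteString, putWord32le, putWord8)
    import qualified Data.ByteString as BS
    import Data.Word (Word32)

    -- Invented union type: a one-byte discriminator says which alternative
    -- follows, and the string field is preceded by a 32-bit length field.
    data Entry
      = Number Word32
      | Name BS.ByteString
      deriving (Show, Eq)

    instance Binary Entry where
      get = do
        tag <- getWord8
        case tag of
          0 -> Number <$> getWord32le
          1 -> do
            len <- getWord32le
            Name <$> getByteString (fromIntegral len)
          _ -> fail "unknown tag"
      put (Number n) = do
        putWord8 0
        putWord32le n
      put (Name s) = do
        putWord8 1
        putWord32le (fromIntegral (BS.length s))
        putByteString s

    -- Round trip: encode to a lazy ByteString, then decode it back.
    main :: IO ()
    main = print (decode (encode (Name "demo")) :: Entry)

Note that decode simply throws an error on malformed input; if you want recoverable failures, decodeOrFail is the usual choice.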
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.