Resources for protocol patterns - communication

Can anybody recommend some resources that discuss the common patterns found in communication protocol design and programming?
For protocol design patterns, I'm thinking of things like how delimiters are used to mark the start/end of messages, using escaping so that a delimiter can appear within the message data, the pros/cons of various checksum methods, fixed-length vs. variable-length messages, the various ways of representing data within a message (e.g. plaintext, little/big endian binary encoding), and so forth.
For protocol programming patterns, I'm referring to techniques used when writing software to communicate using these protocols. As an example - for various protocol designs, what techniques exist to re-synchronize on a message boundary when the datastream is momentarily corrupted (e.g. looking for the next start-of-message delimiter if one is defined, sliding a window over the input buffer looking for a well-formed message), or when an existing conversation is picked up mid-stream?
For my projects I'm mostly concerned with communicating with hardware over unreliable links, such as RS-232, so error detection and message framing techniques are of particular interest to me, though a general treatment of the topic would be excellent.
Most of what I know about this topic was learned in an ad-hoc way from looking at existing protocols. I'd like to formalize my understanding a bit and hopefully pick up a few new techniques along the way, as well as have something that I can point to when training others to do this sort of work.

I found the early chapters of O'Reilly's BEEP: The Definitive Guide to be a good introduction to protocol design.

There are numerous resources... PPP being one: for serial & unreliable links, this protocol is often used.

Related

Does it make sense to interrogate structured data using NLP?

I know that this question may not be suitable for SO, but please let this question be here for a while. Last time my question was moved to cross-validated, it froze; no more views or feedback.
I came across a question that does not make much sense for me. How IFC models can be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for such type of data? I don't see NLP relevant at all.
In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
I did some searches to try to find the references you may have been referring to. The only item I found was Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques:
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
Edit:
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone. Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured query behavior for the various diverse file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be textual or textual tags in nearly all cases). This would almost be akin to a search engine "scraping" websites and looking for key words or phrases and synonyms to them, etc. The upside is they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.

Is there a standard notation for a 'switch' type BPM node?

Did not find a more suitable SE site, so asking this here.
In BPM, there are more-or-less standard ways of marking activities - like Start (green circle), Stop (red circle), Conditional (yellow diamond). An example I found is here.
Conditional is basically an 'if' in any programming language.
What I could not find any example of is a 'switch'.
I know that one could use multiple Conditionals, but I have to implement a user friendly way, so I'd like to know if there is any standard way of representing such a thing in BPM.
It depends on what you mean by "switch". If you want to switch process flow in one or another way you can use Complex Gateway. A Complex Gateway allows the modeler to specify any arbitrary rules needed to determine how to emit tokens or process arriving tokens.
BPMN Complex Gateways can be used for splitting or merging a complicated process flow scenario. However, it is not recommended practice as there are simpler and more practical inclusive and exclusive gateways. There is always better to use their combination. You can learn more about complex gateways here.

Sharing atoms between http-connected erlang clusters

We have a few non-erlang-connected clusters in our infrastructure and currently use term_to_binary to encode erlang terms for messages between the clusters. On the receiving side we use binary_to_term(Bin, [safe]) to only convert to existing atoms (should there be any in the message).
Occasionally (especially after starting a new cluster/stack), we run into the problem that there are partially known atoms encoded in the message, i.e. the sending cluster knows this atom, but the receiving does not. This can be for various reasons, most common is that the receiving node simply has not loaded a module containing some record definition. We currently employ some nasty work-arounds which basically amount to maintaining a short-ish list of potentially used atoms, but we're not quite happy with this error-prone approach.
Is there a smart way to share atoms between these clusters? Or is it recommended to not use the binary format for such purposes?
Looking forward to your insights.
I would think hard about why non-Erlang nodes are sending atom values in the first place. Most likely there is some adjustment that can be made to the protocol being used to communicate -- or most often there is simply not a real protocol defined and the actual protocol in use evolved organically over time.
Not knowing any details of the situation, there are two solutions to this:
Go deep and use an abstract serialization technique like ASN.1 or JSON or whatever, using binary strings instead of atoms. This makes the most sense when you have a largish set of well understood, structured data to send (which may wrap unstructured or opaque data).
Remain shallow and instead write a functional API interface for the processes/modules you will be sending to/calling first, to make sure you fully understand what your protocol actually is, and then back that up by making each interface call correspond to a matching network message which, when received, dispatches the same procedures an API function call would have.
The basic problem is the idea of non-Erlang nodes being able to generate atoms that the cluster may not be aware of. This is a somewhat sticky problem. In many cases the places where you are using atoms you can instead use binaries to similar effect and retain the same semantics without confusing the runtime. Its the difference between {<<"new_message">>, Data} and {new_message, Data}; matching within a function head works the same way, just slightly more noisy syntactically.

spell checker uses language model

I look for spell checker that could use language model.
I know there is a lot of good spell checkers such as Hunspell, however as I see it doesn't relate to context, so it only token-based spell checker.
for example,
I lick eating banana
so here at token-based level no misspellings at all, all words are correct, but there is no meaning in the sentence. However "smart" spell checker would recognize that "lick" is actually correctly written word, but may be the author meant "like" and then there is a meaning in the sentence.
I have a bunch of correctly written sentences in the specific domain, I want to train "smart" spell checker to recognize misspelling and to learn language model, such that it would recognize that even thought "lick" is written correctly, however the author meant "like".
I don't see that Hunspell has such feature, can you suggest any other spell checker, that could do so.
See "The Design of a Proofreading Software Service" by Raphael Mudge. He describes both the data sources (Wikipedia, blogs etc) and the algorithm (basically comparing probabilities) of his approach. The source of this system, After the Deadline, is available, but it's not actively maintained anymore.
One way to do this is via a character-based language model (rather than a word-based n-gram model). See my answer to Figuring out where to add punctuation in bad user generated content?. The problem you're describing is different, but you can apply a similar solution. And, as I noted there, the LingPipe tutorial is a pretty straightforward way of developing a proof-of-concept implementation.
One important difference - to capture more context, you may want to train a larger n-gram model than the one I recommended for punctuation restoration. Maybe 15-30 characters? You'll have to experiment a little there.

Comparison of OData and Semantic Web/Linked Data

I'm trying to get my head around two very different approaches to data sharing: OData and Semantic Web/Linked Data. Is there a good comparison of the two?
As I understand it, OData combines syndication/CRUD (AtomPub), serialisation formats (XML, JSON), a data model, a query language, and some semantics/conventions governing use of those existing technologies. It's primarily intended for exposing data from one system so that others can consume it.
Linked Data is a data model, a rigorous commitment to URIs, an (optional?) serialisation format (RDF/XML), but (correct me if I'm wrong) doesn't say anything about transport, CRUD, etc. It seems intended to allow inferencing across lots of little chunks of data drawn from a wide variety of sources. (Not something of major importance to us right now - we would be synchronising large slabs of data between a small number of sources, and wanting to preserve provenance information).
I'm interested in technologies for sharing data between certain data management platforms, some of which I work on directly. OData seems more appealing as it's very straightforward to explain to developers: implement this API, follow that Atom standard, serialise the data like this. We're already doing something very similar for one platform: sharing XML-serialised data on an Atom feed, with URL parameters used to filter.
By contrast, my past experiences working with RDF have given me an impression of brittle, opaque (massive slabs of RDF/XML), inaccessible (using SPARQL vs SQL) technology - but perhaps I'm confusing the experience of working with a triplestore like Jena with simply exposing an existing database via a linked data API.
Any pointers, comments etc on the differences and similarities between these two approaches in terms of scope, technologies, ease, future potential etc would be great.
I think discussing this in depth is not really what Stackoverflow is meant for, but just to give you some pointers to interesting discussions about differences and overlap:
Oh - it is data on the Web
Microsoft, OData and RDF
One of the key differences seems to be that OData has no means to link data from different sources to each other. Essentially, you're still stuck in a silo.
It might also be interesting to check out various attempts to convert data between the two approaches. See a.o. http://answers.semanticweb.com/questions/1298/has-anyone-written-a-mapping-from-odata-to-rdf .
OData may be easier, but its not better, by any means. SPARQL and RDF (forget RDF/XML, better to look at Turtle) satisfies everything in OData along with providing many more cutting edge features such as:
Federation Extensions
Linked Data
Reasoning and Inference (for the more brave)
Equally, the software supporting the standards is actually quite sophisticated. Most people interested in OData generally come from a Microsoft background, so take a look at dotNetRdf
Here's a comparison matrix:
http://uoccou.wordpress.com/2011/02/17/linked-data-odata-gdata-datarss-comparison-matrix/
Unfortunately the table formatting is pretty horrible, but the content is useful.

Resources