I'd like to make a parser for DNS records (e.g. what gets returned by dig), but I can't find a standard textual representation - as far as I can tell the RFCs just specify the wire format. However, the intro in https://tools.ietf.org/id/draft-daley-dnsxml-00.html implies that there is a standard format:
Historically, DNS Resource Records (RRs) have a presentation format and wire format. The presentation format is typically used to conveniently store DNS RRs in Human Readable Form.
Does anyone know if these presentation formats are defined anywhere?
The "zone file" format is standardized is standardized in section 5 of RFC1035
This is the standard text representation.
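For illustration, a couple of records in that presentation format look like this (the same shape dig prints; the address is from the documentation range, not example.com's real one):

example.com.    3600    IN    A     192.0.2.1
example.com.    3600    IN    MX    10 mail.example.com.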
But about this part:
I'd like to make a parser for DNS records (e.g. what gets returned by dig)
Do not write a parser for dig output. Whatever programming language you use, you will find libraries that perform DNS requests and return the results in proper structures, rather than making you parse the textual output of a command. You will then also be free of any particular textual representation of records.
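For example, in Python the dnspython library (just one option; most languages have an equivalent) hands you typed record objects directly, so there is nothing to parse:

import dns.resolver   # pip install dnspython (2.x API assumed; older versions use dns.resolver.query)

answers = dns.resolver.resolve("example.com", "MX")
for rr in answers:
    # each rr is a structured rdata object, not text
    print(rr.preference, rr.exchange)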
Could you please recommend some materials to learn more about EDI and its professional language, such as: Test ISA Qualifier: 14
Test ISA Sender ID:
Test GS Sender ID:
I am a total beginner and would like to learn more about this topic.
Also, which program could I use to convert an EDI message type to a different format (for instance from X12 to XML), from FDI to the AS2 communication method (not sure if you understand in this context)?
Thank you a lot for your response.
Kim
Your question is quite broad, so I'll try to just give some information that may help.
Most EDI exchanges tend to be undertaken by partnering with an EDI company rather than built in-house. This is partly because there's a lot of complexity involved, partly because standards aren't always publicly available, and partly because you often need to use a variety of transport mechanisms, including private networks. I note you've mentioned AS2, but again, you'd normally get a third party to either manage that for you, or to provide software to do so.
There are various EDI standards bodies. You've mentioned X12, which is most common if you're in North America, but their specifications have to be bought, and are quite expensive. However, you may be able to get a specification from your trading partner.
There are a number of proprietary products that will translate EDI formats to other formats (such as XML), but they usually require some expertise to use. If you have a common X12 format, and you wish to translate it to a common integration format (say an XML format defined by a commonly used accounting package - I'm not sure what "FDI" is), you may be able to find something off the shelf.
Increasingly, businesses wish to outsource their EDI to a managed service that looks after everything for them. The costs are not usually prohibitive, even for small traders.
My answer assumes you know a lot about XML; I apologize if that is not correct. But if you want to learn EDI and ANSI, you should learn XML, as its hierarchical structure fits well with ANSI formats.
Learning "Electronic Data Interchange" using ANSI X12 transaction sets which contain ISA and GS Segments (a segment is a variable length data structure) can begin with learning which industry uses which transaction sets and which version (https://www.edi2xml.com or https://x12.org). X12.org allows you to download sample ANSI files.
Materials Management uses specific ANSI X12 transactions (Purchase Orders and Invoices) which are different from the needs of a hospital business office or an insurance claim adjudication company which uses X12N HIPAA-mandated transaction sets (www.wpc-edi.com) in USA.
All ANSI segments are made up of "elements" - and "Qualifier" is an element within many segments. It denotes a specific value that identifies or qualifies the data being transferred - like a value that tells the receiver what type of insurance plan is being paid by the insurance company. A "Sender ID" is also an ANSI element - in the ISA and GS segments. It generally contains a numeric or alphanumeric value that identifies the EDI sender of the transaction - who may or may not be the originator of the information.
For most workplaces, a third-party software and/or vendor is generally used to send/receive the necessary transactions electronically.
I worked in healthcare for years, and I got started understanding the necessary ANSI transaction sets by asking the insurance companies for a copy of their specific Implementation Guide for a specific transaction(s). This may only be a document that shows the differences between their transactions and the HIPAA recommendations.
I have also found older, pre-HIPAA (before 1996) versions of ANSI transaction guides (developer's documentation) on the internet.
Once you have an understanding of which ANSI transaction sets are used in your industry, try to find the appropriate transaction sets, such as 837/835 for a hospital, or 850/855 for purchasing or a warehouse.
When you know which transactions are used for which purpose, and you understand its hierarchical structure, then try taking them apart using the programmer/developer documentation (Implementation Guide or Standard) you have found or purchased. Your trading partners may have documentation guides they will send you. If you have no "trading partners" yet, then look online or in book stores for documentation.
If you have any programming experience, the Implementation Guide or ANSI Standard documentation are the programmer's tools for understanding the transaction set function and segment layout.
If you don't have any programming skills, this would be a good project to learn some basic input and output logic - to convert an ANSI transaction file into a well-formed XML document of your design, or into a CSV or Tab-Delimited file.
With some basic input, output and data manipulation logic, you can convert any ANSI X12 file into a well-formed XML document - where ANSI segments are converted (mostly) to empty XML elements which contain Attributes that hold the ANSI Segment's data, like so:
For this ANSI stream:
ISA*00* *00* *ZZ*123456789012345*ZZ*123456789012346*080503*1705*>*00501*000010216*0*T*:~GS*HS*12345*12345*20080503*1705*20213*X*005010X279A1~ST*270*1235*005010X279A1~ (to end of file after the IEA segment)
Convert it to a flat file with CRLF characters after each tilde (~):
ISA*00* *00* (skipping elements to simplify) *00501*000010216*0*T*:~
GS*HS*12345*12345*20080503*1705*20213*X*005010X279A1~
ST*270*1235*005010X279A1~
(continue to end of file, after the IEA segment)
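A minimal Python sketch of this first, tilde-splitting step (assuming the whole interchange has been read into one string; '*' is the element separator and '~' the segment terminator):

raw_x12 = ("ISA*00* *00* *ZZ*123456789012345*ZZ*123456789012346*080503*1705"
           "*>*00501*000010216*0*T*:~GS*HS*12345*12345*20080503*1705*20213*X"
           "*005010X279A1~ST*270*1235*005010X279A1~")   # truncated sample from above

segments = [seg for seg in raw_x12.split("~") if seg.strip()]
flat_file = "\r\n".join(seg + "~" for seg in segments)   # one segment per line, CRLF after each tilde
print(flat_file)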
Then, convert the new ANSI flat file to an XML document with Attributes like:
<?xml version="1.0"?> (this must be your first line in the XML document)
<ISA a1="00" a2=" " a3="00" (skipping attributes to simplify) >
<GS a1="HS" a2="12345" a3="12345" a4="20080503" a5="1705" a6="20213" a7="X" a8= "005010X279A1" >
<ST a1="270" a2="1235" a3="005010X279A1" > (ISA, GS, ST are not empty elements)
(continue to the end of the file with an empty element for each remaining segment, e.g.:)
<SegmentID a1="..." a2="..." />
</ST> (closing XML tag for the ST element, converted from the ANSI "SE" segment)
</GS> (closing XML tag for the GS element, converted from the ANSI "GE" segment)
<ISA a1="1" a2="000010216" /> (Closing xml tag for the ISA element, convert the "IEA")
Like I said, the output is mostly made up of "empty" XML element tags, but the SE, GE and IEA ANSI segments must be converted to become the closing XML tags for the "ST", "GS" and "ISA" XML elements - since these three are NOT empty XML elements; they contain other XML elements.
There are many other interior ANSI segments which should also not become "empty" XML elements, since in the ANSI form they have a parent/child relationship with subordinate segments. Think of an insurance claim (not my example) containing a lot of patient demographic data and many medical charges - in ANSI these segments are children of a "CLM" segment. In XML, the "CLM" element would not be "empty"; it would contain child XML elements such as "NM1" and "SVC".
The first "CLM" XML tag would be closed, enclosing all its "children" elements - when the next "CLM" (a patient's insurance claim) is encountered. In ANSI, this "closing" is not apparent, but only signaled by the existence of a following "CLM" segment.
To know if you have created a well-formed XML document, try to browse and display the XML output file with a web browser. If there are errors in the XML, it will stop displaying at the error.
Once you have learned the data structure (layout) of the ANSI files, you can then learn to create a simple well-formed XML document, and with some "XSL" Transformation skills, to translate it into a browser-friendly displayable document.
<?xml version="1.0"?> must be the first line of your XML document so that
a browser can recognize and know what to do with the XML markup. It has no
equivalent in the ANSI file.
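And a rough Python sketch of the second step, turning those segments into the XML shown above. The ISA/GS/ST and SE/GE/IEA pairing comes from the example; the trailer segments and the attribute naming are simplifying assumptions, and real code would also XML-escape the element values:

OPENERS = {"ISA", "GS", "ST"}                       # become non-empty XML elements
CLOSERS = {"SE": "ST", "GE": "GS", "IEA": "ISA"}    # become the matching closing tags

segments = [
    "ISA*00* *00* *ZZ*123456789012345*ZZ*123456789012346*080503*1705*>*00501*000010216*0*T*:",
    "GS*HS*12345*12345*20080503*1705*20213*X*005010X279A1",
    "ST*270*1235*005010X279A1",
    # ... the detail segments of the transaction would go here ...
    "SE*2*1235",        # assumed trailers so the sketch nests and closes properly
    "GE*1*20213",
    "IEA*1*000010216",
]

lines = ['<?xml version="1.0"?>']                   # must be the first line of the XML document
for segment in segments:
    seg_id, *elements = segment.split("*")
    attrs = " ".join('a%d="%s"' % (i + 1, v) for i, v in enumerate(elements))
    if seg_id in OPENERS:
        lines.append("<%s %s>" % (seg_id, attrs))
    elif seg_id in CLOSERS:
        lines.append("</%s>" % CLOSERS[seg_id])     # trailer data is dropped; closing tags carry no attributes
    else:
        lines.append("<%s %s />" % (seg_id, attrs)) # every other segment becomes an empty element
print("\n".join(lines))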
I am in need of a data format which will allow me to reduce the time needed to parse it to a minimum. In other words I'm looking for a format with as little overhead as possible and being parseable in the shortest amount of time.
I am building an application which will pull a lot of data from an API, parse it and display it to the user. So the format should be as small as possible so that the transmission will be fast and also should be very efficient for parsing. What are my options?
Here are a few formats that pop into my head:
XML (a lot of overhead and slow parsing IMO)
JSON (still too cumbersome)
MessagePack (looks interesting)
CSV (with a custom parser written in C)
Plist (fast parsing, a lot of overhead)
... any others?
So currently I'm looking at CSV the most. Any other suggestions?
As stated by Apple in the Property List Programming Guide, the binary plist representation should be the fastest:
Property List Representations

A property list can be stored in one of three different ways: in an XML representation, in a binary format, or in an “old-style” ASCII format inherited from OpenStep. You can serialize property lists in the XML and binary formats. The serialization API with the old-style format is read-only.

XML property lists are more portable than the binary alternative and can be manually edited, but binary property lists are much more compact; as a result, they require less memory and can be read and written much faster than XML property lists. In general, if your property list is relatively small, the benefits of XML property lists outweigh the I/O speed and compactness that comes with binary property lists. If you have a large data set, binary property lists, keyed archives, or custom data formats are a better solution.
You just need to pass the correct format flag, NSPropertyListBinaryFormat_v1_0, when creating or reading the property list. Just be sure that the data you want to store in the plist can be represented by this format.
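If you want to compare the two representations outside of Cocoa, Python's plistlib reads and writes the same formats; a small sketch for comparison (the exact sizes will vary, the relative difference is the point):

import plistlib

data = {"name": "sample", "ids": list(range(1000))}
as_xml = plistlib.dumps(data, fmt=plistlib.FMT_XML)
as_binary = plistlib.dumps(data, fmt=plistlib.FMT_BINARY)
print(len(as_xml), len(as_binary))        # the binary plist is considerably smaller
assert plistlib.loads(as_binary) == data  # the format is auto-detected when reading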
Are there any frameworks out there to support file format sniffing using declarative, fuzzy schema and/or syntax definitions for the valid formats? I'm looking for something that can handle dirty or poorly formatted files, potentially across multiple versions of file format definitions/schemas, and make it easy to write rules- or pattern-based sniffers that make a best guess at file types based on introspection.
I'm looking for something declarative, allowing you define formats descriptively, maybe a DSL, something like:
format A, v1.0:
is tabular
has a "id" and "name" column
may have a "size" column
with integer values in 1-10 range
is tab-delimited
usually ends in .txt or .tab
format A, v1.1:
is tabular
has a "id" column
may have a "name" column
may have a "size" column
with integer values in 1-10 range
is tab- or comma-separated
usually ends in .txt, .csv or .tab
The key is that the incoming files may be mis-formatted, either due to user error or poorly implemented export from other tools, and the classification may be non-deterministic. So this would need to support multiple, partial matching to format definitions, along with useful explanations. A simple voting scheme is probably enough to rank guesses (i.e. the more problems found, the lower the match score).
For example, given the above definitions, a comma-delimited "test.txt" file with an "id" column and "size" column with no values would result in a sniffer log something like:
Probably format A, v1.1
- but "size" column is empty
Possibly format A, v1.0
- but "size" column is empty
- but missing "name" column
- but is comma-delimited
The Sniffer functionality in the Python standard library is heading in the right direction, but I'm looking for something more general and extensible (and not limited to tabular data). Any suggestions on where to look for something like this?
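For reference, this is roughly where the standard library gets you today, plus the kind of hand-rolled voting check I would rather express declaratively (a sketch only, using the example file described above):

import csv
import io

# a comma-delimited "test.txt": "id" and "size" columns, every "size" value empty
sample = "id,size\n1,\n2,\n3,\n"

dialect = csv.Sniffer().sniff(sample, delimiters=",\t")   # dialect detection is all Sniffer does
rows = list(csv.reader(io.StringIO(sample), dialect))
header, data = rows[0], rows[1:]

problems = []
if "name" not in header:
    problems.append('missing "name" column')
if "size" in header and all(not row[header.index("size")] for row in data):
    problems.append('"size" column is empty')
if dialect.delimiter == ",":
    problems.append("is comma-delimited")

print("Possibly format A, v1.0")
for problem in problems:
    print(" - but", problem)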
First of all, I am glad I have found this question - I am thinking of something similar too (a declarative solution to mark up any file format and feed it, along with the file itself, to a tool that can verify the file).
What you are calling a "sniffer" is more widely known as a "file carver", and this person is a big name in carving: http://en.wikipedia.org/wiki/Simson_Garfinkel
Not only has he developed an outstanding carver, he has also provided a classification of the different cases of incomplete files.
So, if you are working on a particular file format repair tool, check the aforementioned classification to find out how complex the problem is. For example, carving from an incompletely received data stream and carving from a disk image differ significantly. Carving from a fragmented disk image would be insanely more difficult, whereas padding a video file with meaningless data just to make it open in a video player is easy - you just have to provide the correct format.
Hope it helped.
Regards
Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable for parsing binary file formats - complex formats that involve conditional segments, out-of-order segments, etc.
Is there an ability to do this or a similar, alternative package that does this? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in AttoParsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.
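To make the "discriminator first, length-prefixed fields" layout concrete independently of any Haskell library, here is what reading one such record looks like with Python's struct module (the record layout itself is invented purely for illustration):

import io
import struct

def read_record(stream):
    # 1-byte tag discriminates the record type, 2-byte big-endian length prefixes the payload
    tag, length = struct.unpack(">BH", stream.read(3))
    return tag, stream.read(length)

buf = io.BytesIO(b"\x01\x00\x05hello\x02\x00\x02hi")
print(read_record(buf))   # (1, b'hello')
print(read_record(buf))   # (2, b'hi')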
I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The solution seems to be to:
1. Convert the character string into a sequence of bytes using the UTF-8 encoding
2. Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte
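In Python, for instance, urllib.parse.quote performs exactly those two steps in one call (a quick illustration with a made-up tag name):

from urllib.parse import quote, unquote

tag = "Zürich"                  # hypothetical tag
encoded = quote(tag, safe="")   # UTF-8 encode, then %HH-escape
print(encoded)                  # Z%C3%BCrich
print(unquote(encoded))         # Zürich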
However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering if Google is still smart enough to handle searches on URLs that contain encoded data - or if I should attempt to convert those non-English characters into their semi-ASCII counterparts (which might help with Latin-based languages)?
Firstly, search engines really don't care about the URLs. They help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam them, and no major search engine wants that. The allinurl: operator is merely a Google feature to help advanced users, not something that gets factored into organic rankings. Any benefit you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site -- and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central
Does that mean I should avoid rewriting dynamic URLs at all?

That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.
I personally don't believe it matters all that much, short of getting a little more click-through and helping users out. As far as Unicode goes, this works differently from what you might expect: the request goes to the percent-encoded destination, but the rendering engine must know how to decode it back into something visually appealing. Google will render (i.e. decode) encoded Unicode URLs properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
I wanted to show you an example of this; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
[Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget, like every browser, will just URL-encode the destination for you and then send the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.
Do you know what language everything will be in? Is it all Latin-based?
If so, then I would suggest building a sort of lookup table that converts UTF-8 to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and such, and when there is a collision or the character doesn't exist in your lookup table, it just falls back to %HH.
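A rough Python sketch of that idea, using Unicode decomposition as a stand-in for a hand-built lookup table (the function name and the sample strings are just for illustration):

import unicodedata
from urllib.parse import quote

def ascii_or_escape(text):
    out = []
    for ch in text:
        # strip diacritics where possible: Ź -> Z, é -> e, ...
        stripped = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
        out.append(stripped if stripped else quote(ch, safe=""))   # otherwise fall back to %HH
    return "".join(out)

print(ascii_or_escape("Źle"))       # Zle
print(ascii_or_escape("Գլխավոր"))   # no ASCII counterparts, so pure %HH escapes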