Background reading for parsing sloppy / quirky / "almost structured" data?

I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. i.e. various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc. so I need to use heuristics that guess how it was produced and apply different quirks modes, etc. It's frustrating, because I'm somewhat familiar with the theory and practice of parsing if things are well behaved, and there are nice parsing frameworks etc. out there, but the unreliability of the data has led me to write some very sloppy ad-hoc code. It's OK at the moment but I'm worried that as I expand it to process more variations and more complex data, things will get out of hand. So my question is:
Since there are a fair number of existing commercial products that do related things ("quirks modes" in web browsers, error interpretation in compilers, even natural language processing and data mining, etc.) I'm sure some smart people have put thought into this, and tried to develop a theory, so what are the best sources for background reading on parsing unprincipled data in as principled a manner as possible?
I realize this is somewhat open-ended, but my problem is that I think I need more background to even know what the right questions to ask are.

Given the choice between what you've proposed and fighting a hungry crocodile while covered in raw-beef-flavored marmalade and both hands tied behind my back, I'd choose the ...
Well, OK, on a more serious note: if you have data that doesn't abide by any "sane" structure, you have to study the data, find the frequencies of the quirks in it, and correlate them with the given context (i.e. how the data was generated).
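As a rough illustration of "finding frequencies of quirks", here is a minimal Python sketch that classifies each line by the first pattern it matches and counts the results. The three patterns and the sample lines are made up for the example; real ones would come out of studying your own data.

```python
import re
from collections import Counter

# Hypothetical line shapes; replace these with patterns found in the real data.
PATTERNS = {
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}\s+\S+"),
    "us_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}\s+\S+"),
    "key_value": re.compile(r"^\w+\s*[:=]\s*\S+"),
}

def classify(line):
    """Return the name of the first pattern that matches, or 'unknown'."""
    for name, pattern in PATTERNS.items():
        if pattern.match(line):
            return name
    return "unknown"

def quirk_frequencies(lines):
    """Count how often each line shape occurs, skipping blank lines."""
    return Counter(classify(line.strip()) for line in lines if line.strip())

sample = [
    "2011-03-04 widget",
    "3/4/2011 widget",
    "2011-03-05 gadget",
    "colour = red",
    "???",
]
print(quirk_frequencies(sample).most_common())
```

A high share of "unknown" lines for a given source is a hint that it needs its own quirks mode, or that the OCR step mangled it.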
Printing and then OCR-ing to get the data in is almost always going to lead to heartbreak. The company I work for employs a veritable army of people who manually read such documents and hand "code" (i.e. enter by hand) the data for known problematic OCR scenarios, or for documents where our customers detect that the original OCR failed.
As for leveraging "Parsing Frameworks" these tend to expect data that will always follow the grammar rules you've laid out. The data you've described has no such guarantees. If you go that route be prepared for unexpected - though not always obvious - failures.
By all means if there is any way possible to get the original data files, do so. Or if you can demand that those providing the data make their data come in a single well defined format, even better. (It might not be "YOUR" format, but at least it's a regular and predictable format you can convert from)

Related

How can I store large amounts of unorganized, related data and organize it as I receive it?

I'm writing a program for a client. The data they send us is essentially information from a relational database that got flattened, resulting in utterly gigantic comma-delimited text files that consist of extremely redundant information with only a few fields changing per line.
I am reading this into a typed dataset and essentially organizing the data I'm getting into the third normal form, which drastically cuts down on the sheer amount of redundancy. From there, I convert the data in the dataset to XML and send it off to another program to create forms and statements.
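For readers who want to see the shape of that normalization step, here is a minimal Python sketch (not the typed-dataset code described above). The column names and the two-table split are hypothetical; the point is only that repeated fields are stored once and checked for consistency while the file is streamed.

```python
import csv
import io

def normalize(rows):
    """Split flattened rows into a customers table and an orders table."""
    customers = {}  # customer_id -> customer attributes (stored once)
    orders = []     # one entry per input row, referencing the customer by id
    for row in rows:
        cid = row["customer_id"]
        attrs = {"name": row["customer_name"], "city": row["customer_city"]}
        # Constraint check: repeated customer data must agree with itself.
        if customers.setdefault(cid, attrs) != attrs:
            raise ValueError(f"conflicting data for customer {cid}")
        orders.append({"order_id": row["order_id"], "customer_id": cid,
                       "amount": float(row["amount"])})
    return customers, orders

# Made-up sample standing in for the client's comma-delimited file.
flat = io.StringIO(
    "customer_id,customer_name,customer_city,order_id,amount\n"
    "1,Acme,Boston,100,19.50\n"
    "1,Acme,Boston,101,5.00\n"
    "2,Globex,Denver,102,7.25\n"
)
customers, orders = normalize(csv.DictReader(flat))
print(len(customers), "customers,", len(orders), "orders")
```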
However, I'm wondering if there's a better way to go about this. It might not be as bad as I think it is, but I can't shake the feeling that there's a better, faster way to do this. The important thing is that the data is organized and easily understood, and that it is constraint-checked and validated before I convert it to XML.
Since none of the data needs to persist (in fact, it shouldn't), an actual RMDB didn't seem worth it if I was just going to end up clearing it after every use.
The program also needs to run in a myriad of environments; my workstation is Windows 7 64-bit, the testing server is Windows XP 32-bit, and the production server is Windows 7 64-bit or 32-bit depending on which server it's going on.
IMHO then I would start off with SQL Express - it's designed to work its way through those kinds of data volumes, and will adapt itself to the different platforms you're running; it's scalable to the bigger versions if necessary; and in SSMS you have a tool for easily examining intermediate results etc., and importing .csv is straightforward. And it's free.
For all the above reasons, I would give SQL Express a try and evaluate its real-world performance.
Going back to your original question, my opinion fwiw is that this is a reasonable approach; I don't think you're missing anything.

What's a succinct, useful and efficient way to store large time-series in F#?

I'm currently learning F# and I'm exploring using it to analyse financial time-series. Can anyone recommend a good data structure to store time-series data in?
F# offers a rich selection of native types and I'm looking for some simple combination that would provide an elegant, succinct and efficient solution.
I'm looking to store tick data, which consists of millions of records, each with a time stamp and several (~5-20) fields of numerical and textual data, with possible missing values.
My first thoughts are perhaps a sequence of tuples or records, but I was wondering if someone could kindly suggest something that has worked well in the real world.
EDIT:
A few extra points for clarification:
The common operations that I'm likely to require are:
Time based lookup - i.e. find the most recent data point at a given time
Time based joins
Appends
(Updates and deletes are going to be rare. )
I should make it clear I'm exploring using F# primarily as an interactive tool for research, with the ability to compile as a (really big) added bonus.
ANOTHER EDIT:
I should also have mentioned that my role, and my use of F# and this data, is purely within research, not development. The intention is that once we understand the data (and what we want to do with it) better, we can later specify tools that our developers would build, such as data warehouses, at which point we'd start using their data structures.
However, I am concerned that our models are computationally intensive, use a lot of memory and can't always be coded in a recursive manner, so we may end up having to query out large chunks anyway.
I should also say that I've always used Matlab or R for these sorts of tasks before but I'm now interested in F# as it offers that interactive, high level flexibility for Research but the same code can be used in production.
My apologies for not giving this context information at the start (It's my first question), I can see now that it helps people form their answers.
My thanks again to everyone that's taken the time to help me.
It really sounds like your data should be stored and queried in a relational database. Where is it currently stored? Loading millions of records with several fields into memory must be an expensive operation, and could leave you with stale data and difficulty persisting changes. You could then use the F# LINQ to SQL implementation (which I believe you can find in the Power Pack) to have F# expressions translated to SQL expressions.
Here's a link from Don Syme about LINQ Support in F# Power Pack: http://blogs.msdn.com/b/dsyme/archive/2009/10/23/a-quick-refresh-on-query-support-in-the-f-power-pack.aspx
The best choice of data structure depends upon what operations you want to do on it.
The simplest would be an array of structs. This has the advantages of fast random lookup, good space efficiency for an uncompressed representation and good locality. If there is sharing between substructures (like the strings) then intern them to make sure they get shared.
Alternatives might be a seq that is loaded from disk on demand, a singly-linked list that allows you to prepend elements quickly, or a balanced binary tree that allows operations like insertion at random locations efficiently.
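To make the "sorted array plus fast lookup" idea concrete, here is a small sketch of the time-based lookup from the question. It is written in Python rather than F#, and the class and method names (TickSeries, as_of) are invented for the example; the same binary-search idea should carry over to an F# array.

```python
from bisect import bisect_right

class TickSeries:
    """Append-mostly time series kept sorted by timestamp."""

    def __init__(self):
        self.times = []    # sorted timestamps
        self.records = []  # records[i] belongs to times[i]

    def append(self, time, record):
        """Add a record; appends are assumed to arrive in time order."""
        if self.times and time < self.times[-1]:
            raise ValueError("appends must be in time order")
        self.times.append(time)
        self.records.append(record)

    def as_of(self, time):
        """Return the most recent record at or before `time`, or None."""
        i = bisect_right(self.times, time)
        return self.records[i - 1] if i else None

series = TickSeries()
series.append(10, {"price": 100.0})
series.append(20, {"price": 101.5})
series.append(30, {"price": None})  # missing value
print(series.as_of(25))
```

Appends are amortized constant time and each lookup is a binary search, which matches the stated workload of lookups and appends with rare updates and deletes.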

Thoughts on minimize code and maximize data philosophy

I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?
Typically data-driven code is easier to read and maintain. I know I've seen cases where data-driven has been taken to the extreme and winds up very unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.
The pragmatic programmers remain in my mind the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished with very little space, and make it easy to make modifications.
A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income of $100,000 and above is taxed at 30%. If you were to write this all out in code, it'd look about as you suspect:
total_tax_burden(income) {
    if (income < 1000)
        return 0
    if (income < 10000)
        return .1 * (income - 1000)
    if (income < 100000)
        return 900 + .2 * (income - 10000)    // 900 = .1 * (10000 - 1000)
    return 18900 + .3 * (income - 100000)     // 18900 = 900 + .2 * (100000 - 10000)
}
Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.
But if it were data-driven, you could store this table in a configuration file:
1000:0
10000:10
100000:20
inf:30
Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If congress decides that 1000 brackets would be better, anyone could make the tables line up with the IRS tables, and be done with it, no code recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
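The "little tool" could look something like the following Python sketch. It assumes each line of the file means "upper bound of the bracket : percentage rate", which is my reading of the table above, and it taxes each slice of income at the rate of the bracket it falls in. The function names are just for illustration.

```python
def parse_brackets(text):
    """Parse 'upper_bound:percent' lines into a sorted list of (bound, rate)."""
    brackets = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        bound, percent = line.split(":")
        # float("inf") handles the open-ended top bracket.
        brackets.append((float(bound), float(percent) / 100))
    return sorted(brackets)

def total_tax_burden(income, brackets):
    """Tax each slice of income at the rate of the bracket it falls in."""
    tax, lower = 0.0, 0.0
    for upper, rate in brackets:
        if income <= lower:
            break
        tax += (min(income, upper) - lower) * rate
        lower = upper
    return tax

# In practice this text would be read from the configuration file.
TABLE = """
1000:0
10000:10
100000:20
inf:30
"""
brackets = parse_brackets(TABLE)
print(total_tax_burden(50000, brackets))
```

Adding a bracket is now a one-line change to the table; neither function needs to know how many brackets there are.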
And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:
#! /bin/bash
# $Id$
# Copyright (C) 2002-2007 Novell/SUSE
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation, version 2 of the
# License.
#=NAME open
#=DESCRIPTION
# Verify that the open syscall is correctly managed for confined profiles.
#=END
pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`
bin=$pwd
. $bin/prologue.inc
file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w
# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file
# PASS TEST (the file shouldn't exist, so open should create it
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file
# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file
# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file
# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file
# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file
# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file
It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.
In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.
For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.
This algorithm is written in C and reads a file that contains an input table provided by the management with the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.
Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.
The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.
cost = (base * ordered * exchange * ... + ... / ...)^13
Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.
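To show how small such an embedded expression parser can be, here is a sketch in Python rather than the C of the story. It leans on Python's own ast module to do the parsing and only walks a whitelisted subset of the tree; the function table and variable names are invented for the example, and Python spells exponentiation `**` rather than `^`.

```python
import ast
import math
import operator

# Whitelisted operators and functions; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
       ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
FUNCS = {"log": math.log, "tan": math.tan}

def evaluate(expr, variables):
    """Evaluate an arithmetic expression string against a dict of variables."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCS and not node.keywords):
            return FUNCS[node.func.id](*[walk(arg) for arg in node.args])
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval").body)

# The expression string plays the role of the management-supplied input file.
print(evaluate("base * ordered * exchange", {"base": 2.5, "ordered": 4, "exchange": 1.1}))
```

The input is still "just a string in a file", yet it already has variables, operators and function calls, which is exactly why the data/code line gets blurry.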
The store receives a large amount of complaints from clients that became brain-dead trying to estimate their expenses and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.
The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpreter instead. The input file now contains a Python function that polls a roomful of Fib(n) monkeys for the cost of large orders.
Question: Is this input file data?
From a strict technical point, there is nothing different. Both the table and the expression needed to be parsed before usage. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).
Yet now many people would argue that the input file just became code.
So what is the distinguishing feature that turns the input format from data into code?
Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built-in at compile time. And let's not forget that many applications still have icons - that most people would deem data - built in their executables.
Input format: This is the - in my opinion, naively - most common factor that people consider: "If it is in a programming language then it is code". Fine, C is code - you have to compile it after all. I would also agree that Python is also code - it is a full blown language. So why isn't XML/XSL code? XSL is a quite complex language in its own right - hence the L in its name.
In my opinion, neither of these two criteria is the actual distinguishing feature. I think that people should consider something else:
Maintainability: In short, if the user of the system has to hire a third party to supply the expertise needed to modify the behaviour of the system, then the system should be considered code-centric to a degree.
This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.
It also means that the distinction can be impacted by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there was some kind of third-party high-level AI tool that parses natural language and produces XML/Python/whatever, then the system becomes data-driven even for far more complex input.
A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.
On the other hand, a multi-billion international corporation usually has in its payroll a bunch of IT specialists and Web designers. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.
I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.
It should be noted that outsourcing blurs the lines even more. There are quite a few issues for which the current technology simply does not allow the solution to be approachable by the layman. In that case the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced.
That third party can be expected to employ a fair number of experts.
One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
It is often shortened to, "write stupid code that uses smart data."
Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:
How does the data-driven approach compare to a more traditional software development
process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big
advantage of the data-driven method is that so much knowledge is encoded in the data,
and new knowledge can be added just by collecting more data. But another advantage is
that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig’s spelling code. ...
Another issue is portability. If we wanted a Latvian spelling-corrector, the English
metaphone rules would be of little use. To port the data-driven correct algorithm to another
language, all we need is a large corpus of Latvian; the code remains unchanged.
He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple pages, again, where Grady Booch's book spent dozens without even finishing it.
"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.
I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.
In procedural, the distinction is marked, and we tend to think about data as something stored in a specific way, but even in procedural it is best to hide the data behind an API, or behind an object in OO.
A lookup(avalue) can be reimplemented in many different ways during its lifetime, as long as it starts as a function.
...All the time I design programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...
E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.
When I think of this philosophy which I agree with quite a bit, the first thing that comes to mind is code efficiency.
When I'm making code I know for sure it isn't always anything close to perfect or even fully knowledgeable. Knowing enough to get close to maximum efficiency out of a machine when it is needed and good efficiency the rest of the time (perhaps trading off for better workflow) has allowed me to produce high quality finished products.
Coding in a data driven way, you end up using code for what code is for. To go and 'outsource' every variable to files would be foolishly extreme, the functionality of a program needs to be in the program and the content, settings and other factors can be managed by the program.
This also allows for much more dynamic applications and new features.
If you have even a simple form of database, you are able to apply the same functionality to many states. You may also do all manner of creative things like changing the context of what your program is doing based on file header data or perhaps directory, file name or extension, though not all data is necessarily stored on a filesystem.
Finally keeping your code in a state where it is simply handling data puts you in a state of mind where you are closer to envisioning what is actually going on. This also keeps the bulk out of your code, greatly reducing bloatware.
I believe it makes code more maintainable, more flexible and more efficient aaaand I like it.
Thank you to the others for your input on this as well! I found it very encouraging.

Parsing text file (100+ MB) and sending data over network

I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way of parsing it and then updating the DB?
I would also need to identify certain tokens within this text file.
Please suggest what language should I use to code this requirement.
Thanks
All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.
A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory / array. However for really large files this is usually not practical.
Whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read in the file in chunks. Then run your parsing logic over the chunks and, when getting to the end of a chunk, read in the next. Be careful with the parsing logic: it can be tricky to handle a chunk that ends in a place your parser is not expecting, such as the middle of a record.
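Here is one way the chunk-boundary handling might look, sketched in Python. It assumes records are separated by a delimiter (newline by default) and carries any incomplete trailing piece over to the next chunk; iter_records and its parameters are names made up for the example.

```python
import io

def iter_records(stream, chunk_size=64 * 1024, delimiter="\n"):
    """Yield delimiter-separated records without loading the whole stream."""
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split(delimiter)
        # The last piece may be an incomplete record; hold it for the next chunk.
        leftover = parts.pop()
        yield from parts
    if leftover:
        yield leftover

# A tiny chunk size forces records to straddle chunk boundaries.
demo = io.StringIO("alpha\nbeta\ngamma")
print(list(iter_records(demo, chunk_size=4)))
```

For plain newline-delimited text in Python, iterating over the file object (`for line in f:`) already reads lazily, so a hand-rolled loop like this mostly pays off for other delimiters or fixed-size binary chunks.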
Additionally a double buffer system sometimes works well. Process one chunk and when you get near the end, you fill the other buffer with the next chunk. If your parsing is CPU intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first. Start with just getting the logic working before any performance optimizations.
Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file, there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be to pick one of the languages you already know, probably the one you know best.

Why is EDI still used, and how to deal with it?

Why is this archaic format still used in the face of easier-to-use technologies? Does it provide some benefit that I'm not seeing? It seems that a large amount of vendors still provide data only in this format, instead of something more manageable and easier to use such as XML; at the least it would make sense to me to offer both formats.
Also, what are some good ways to deal with and utilize EDI when you have no other choice but to use it? Something like BizTalk is out of the question as it's far too expensive. Are there any free/open source applications that make EDI easier to work with?
EDI is not that hard to understand once you familiarize yourself with the delimiters it uses. You might ask yourself as well why anyone would still be using CSV or tab-delimited data.
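To give a feel for those delimiters, here is a minimal Python sketch that splits an EDIFACT-style message using the default service characters: `'` ends a segment, `+` separates data elements, `:` separates components, and `?` is the release (escape) character. It assumes the defaults are in use (a UNA header can override them), the sample message is made up, and it is nowhere near a full parser.

```python
def split_escaped(text, separator, release="?"):
    """Split on `separator`, ignoring occurrences escaped by `release`."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape for later levels
            i += 2
            continue
        if ch == separator:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts

def unescape(text, release="?"):
    """Drop release characters, keeping the character each one protects."""
    out, i = [], 0
    while i < len(text):
        if text[i] == release and i + 1 < len(text):
            i += 1
        out.append(text[i])
        i += 1
    return "".join(out)

def parse_segments(message):
    """Return a list of (tag, elements); each element is a list of components."""
    segments = []
    for raw in split_escaped(message, "'"):
        raw = raw.strip()
        if not raw:
            continue
        elements = [[unescape(c) for c in split_escaped(e, ":")]
                    for e in split_escaped(raw, "+")]
        segments.append((elements[0][0], elements[1:]))
    return segments

# Illustrative message only, not taken from a real trading partner.
message = "UNH+1+ORDERS:D:96A:UN'BGM+220+PO?+123'"
for tag, elements in parse_segments(message):
    print(tag, elements)
```

The splitting is the easy part; the hard part, as other answers point out, is knowing what each trading partner actually puts in each element.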
The answer is probably that those formats are "domain specific languages" defined by committee and standardized in a certain industry, and that a lot of money has already been invested in supporting those formats. Where's the business case to throw that all out again?
One word: inertia. Developing the EDI formats by committee between various companies and organisations with different agendas was a nightmare (sad to say I have been there).
Asking them to abandon these, with yet another round of committees agreeing on web service API standards, is going to take even longer. How do you sell the idea of replacing one electronic format with another to a non-technical board? What possible business advantage does it give them? Originally the benefits of electronic exchange were clear, but the benefits of replacing one format with another are not. We're talking really big companies here.
You may be interested in the following project:
http://bots.sourceforge.net/en/index.shtml
Google code archive
A little information for all interested. EDI is basically a design by committee data exchange format that not only set out rules for data formatting (like XML), but also set out to define each document that could possibly ever be sent between 2 companies. So for any piece of data that could be exchanged between companies they came up with an exact definition of what was supposed to be in each of these documents. Of course, nobody could foresee every piece of data that 2 companies would want to exchange. So you end up with companies using fields that were defined for 1 thing, being used for some other piece of information.
What you end up with is an extremely convoluted data format, in which many of the people using it don't follow the standards, because they need to send custom information that the standard doesn't account for. So in the end, you still need to talk to each company you want to deal with and find out all the little idiosyncrasies of their implementation, just as you would have to do if you went to someone with a custom XML interface. Except that in the case of EDI, the format is hard to parse and even harder to write well, so you end up doing a whole bunch of work just to send a document, when doing the same kind of thing with a custom XML solution would have resulted in far fewer problems.
And switching to XML would give you what - a slightly easier to debug line format?
Generally you set it up and leave it, there isn't a lot of need to play with the raw EDI feed, certainly not enough to abandon the standard and start again.
There are lots of standards, like FAX that could be made more readable but no real pressing need to change them.
Because it's a formally established Standard (in fact a very large and comprehensive set of standards). And that's one of the claimed benefits of a standard - you won't need to change anything for a long time.
And to change it, it takes agreement between two or more (often thousands and thousands more) trading partners (including maybe all of your competitors) to agree.
EDI formats have much higher signal-to-noise ratios (because they were designed back when that was considered important.) Someone who knows and understands EDI will look at your XML and say "Where's the beef (data)?"
Very few developers write their own parsers. There are many good mappers available (and many legacy and enterprise apps come with them built in). So there's lots of relief available for your pain (including at least one Open Source app on SourceForge).
"If it ain't broke, don't fix it."
Most of these organisations are processing vast amounts of data using EDI, and aren't about to change to something more modern without a compelling reason. And making things easy for third-party developers doesn't usually qualify, sad to say.
IMHO there are several problems with EDIFACT.
It is not easy to parse or to generate an object model from it. This is probably not a big problem anymore, as there are now good systems around that do it for you, e.g. smooks.org
It is not easy to read. You get used to it, but XML is a lot easier to read
Validation isn't that easy (compare that to validating XML)
There are far too many different versions and flavours, D95B, D96B, D00A, D00B etc.
But I think the biggest problem is that everyone is using the standards differently. They use the same 'format' but the fields are defined differently. We use EDIFACT to send and receive messages from Container Terminals and they all have slight differences. They would e.g. all use a D95B CODECO but for some terminals a certain segment is mandatory while for another it is optional or even not allowed to be there. Then you have segments that are used the same but the content in it is different.
So to summarise it: It is a pain in the neck.
EDI is a very compact format and is often used to keep bandwidth usage in data exchanges as small as possible. The German customs offices for example use it in their ATLAS system to exchange a very high volume of data every day.
It is hard to parse and hard to read, but if the size of the resulting data matters, it can be a good choice and is supported by most of the bigger business applications.
Legacy Support
EDI is prolific in many industries. It would be prohibitively expensive to replace an already-working technology with a newer one.
Consider this: Walmart uses EDI to communicate with its vendors, stores, distribution chain, etc. I'm guessing they deal with tens of thousands of vendors. Every one of them has sunk thousands of dollars into EDI technology. If Walmart decided to switch over to XML, it's a decision that affects thousands of companies, not just Walmart.
This is true for any EDI user. After all, it's a standard used between trading partners.
I agree, EDI is a pain to work with. But 'back in the day', that's all we had.
Edifact is one of the best standards when it comes down to document interchange.
Most problems come from trading partners sending non-standardized documents.
Yes, it's a bit of an odd format, and it is tedious to work with if you don't know the ins and outs, but that goes for XML as well.
You really want XML over Edifact? Look at the bloated, hard to read XML standards peppol (pan-european public procurement online) is working on.
Yes it's working nice and dandy if you don't have any errors in the systems, troubleshooting edifacts is so much easier once you get used to the format than troubleshooting UBL documents.
You say you have $0.00 to use on the project?
You really should look into the amount of manual work done in your company and the cost savings EDI can offer. A cost-benefit analysis can be mighty handy.
What types of information can be exchanged via EDI?
A variety of types of business information exchange is available via EDI including:
- Booking information
- Bill of Lading information
- Invoicing
- Electronic Funds Transfer
- Arrival Notice Information
- Shipment Status Information
How would choosing EDI benefit my company?
- It streamlines the communications process between you and APL
- It eliminates the need to rekey data, thus eliminating errors and the need to recheck information
- It eliminates paper handling and the need for document storage
- It improves the turntime and the accuracy of your data
- It eliminates the need for faxing
One solution, although it will cost you, is to go to a company like ADX, which has tools you can use to convert EDI formats to more pleasing formats like CSV. Depending on the volume and type of transactions you are doing, this can be both affordable and a lot less stressful. I've used their products in the past, and while they are a bit of work to set up, they do work quite well, and are very stable. Because of the history of EDI, you could probably find hundreds of other companies that offer similar services.
EDI has been around since before XML. Apart from the fact that two parties can pre-negotiate an EDI format that works for them both, you must also consider the part played by the VAN (value-added network).
In some cases the VAN performs validation of the message, or even reads the message and performs actions on it, such as copying it to additional parties based on its content.
The only real reason to use EDI is that "that's the way it's always been done", and therefore there is a lot of existing infrastructure around to support it. Why switch to XML when there is no need? And who is to say XML won't be replaced by JSON, which will then be replaced by something else?
Another reason is that these are business messages such as orders, invoices, credit notes, etc., so there is a lot of financial worth in the transactions. They need to be secure, but perhaps more importantly they need to have end-to-end validation and verification as well as non-repudiation.
For example, I send you an order for half a million euros' worth of goods, you send me the goods, then I "lose" the order information and tell you I am not paying. The combination of the standards and the VANs makes this almost impossible, or at least leaves so much of an audit trail that the problem could be tracked. This is why the "Oh, let's use XML and the internet instead of EDIFACT and the VANs" efforts tend to fail. As someone else answered: inertia, but it is an inertia founded in a stable, effective, secure, reliable and well-understood system.
Doing it on the cheap is not always an option.
If it is any consolation, when I first implemented EDI in '87 there was virtually no software around, so I got the Interbridge tables and wrote my own parser for the UK TRADACOMS standard using Cognos software on an HP Mini, and it worked fine. Assuming you are trading with other EDI partners, the cost probably comes at the point of needing to use a VAN.
I've used EDI (ANSI X12 and EDIFACT) in 2 projects about Maritime Transport Logistics and found them to be very useful since most Ocean Carriers and Trading Partners accept them as the standard way of communication between their different systems.
So the EDI format is still used and will continue to be used, since it's an established standard, thousands of companies have developed systems around it, and replacing them is a really big deal.
I've had to use EDI as well and I agree. We used BizTalk to map it, which worked well. Many systems were built on EDI (well before XML).

Resources