Standard ECMA-376, Office Open XML File Formats
http://www.ecma-international.org/publications/standards/Ecma-376.htm
It specifies how a feature in a given document (say, a word made bold in a .docx file) corresponds to XML.
These mappings are spread over the specification.
Is there any tabular summary available for mapping a feature in a given document type (Word, PowerPoint, Excel) to the corresponding XML?
E.g.
DOCX features-xml mapping
Feature       XML
Bold          <w:r><w:rPr><w:b/>
Text border   <w:r><w:rPr><w:bdr/>
and so on.
Could you please recommend some materials to learn more about EDI and its professional terminology, such as:
Test ISA Qualifier: 14
Test ISA Sender ID:
Test GS Sender ID:
I am a total beginner and would like to learn more about this topic.
Also, which program could I use to convert an EDI message type into a different format (for instance, from X12 to XML), or to go from FDI to the AS2 communication method (not sure if that makes sense in this context)?
Thank you a lot for your response.
Kim
Your question is quite broad, so I'll try to just give some information that may help.
Most EDI exchanges tend to be undertaken by partnering with an EDI company, rather than self-built. This is partly because there's a lot of complexity involved, partly because standards aren't always publicly available, and partly because you often need to use a variety of transport mechanisms, including private networks. I note you've mentioned AS2, but again, you'd normally get a third party to either manage that for you or to provide software to do so.
There are various EDI standards bodies. You've mentioned X12, which is most common if you're in North America, but their specifications have to be bought, and are quite expensive. However, you may be able to get a specification from your trading partner.
There are a number of proprietary products that will translate EDI formats to other formats (such as XML), but they usually require some expertise to use. If you have a common X12 format, and you wish to translate it to a common integration format (say an XML format defined by a commonly used accounting package - I'm not sure what "FDI" is), you may be able to find something off the shelf.
Increasingly, businesses wish to outsource their EDI to a managed service that will look after everything for them. The costs are not usually prohibitive, even for small traders.
My answer assumes you know a lot about XML; I apologize if that is not correct. But if you want to learn EDI and ANSI, you should learn XML, as its hierarchical structure fits well with ANSI formats.
Learning "Electronic Data Interchange" using ANSI X12 transaction sets which contain ISA and GS Segments (a segment is a variable length data structure) can begin with learning which industry uses which transaction sets and which version (https://www.edi2xml.com or https://x12.org). X12.org allows you to download sample ANSI files.
Materials Management uses specific ANSI X12 transactions (Purchase Orders and Invoices), which are different from the needs of a hospital business office or an insurance claim adjudication company, which use the X12N HIPAA-mandated transaction sets (www.wpc-edi.com) in the USA.
All ANSI segments are made up of "elements", and a "Qualifier" is an element within many segments. It denotes a specific value that identifies or qualifies the data being transferred - like a value that tells the receiver what type of insurance plan is being paid by the insurance company. A "Sender ID" is also an ANSI element, found in the ISA and GS segments. It generally contains a numeric or alphanumeric value that identifies the EDI sender of the transaction, who may or may not be the originator of the information.
For most workplaces, a third-party software and/or vendor is generally used to send/receive the necessary transactions electronically.
I worked in healthcare for years, and I got started understanding the necessary ANSI transaction sets by asking the insurance companies for a copy of their specific Implementation Guide for a specific transaction(s). This may only be a document that shows the differences between their transactions and the HIPAA recommendations.
I have also found older, pre-HIPAA (before 1996) versions of ANSI transaction guides (developer's documentation) on the internet.
Once you have an understanding of which ANSI transaction sets are used in your industry, try to find the appropriate ones, such as 837/835 for a hospital or 850/855 for purchasing or a warehouse.
When you know which transactions are used for which purpose, and you understand its hierarchical structure, then try taking them apart using the programmer/developer documentation (Implementation Guide or Standard) you have found or purchased. Your trading partners may have documentation guides they will send you. If you have no "trading partners" yet, then look online or in book stores for documentation.
If you have any programming experience, the Implementation Guide or ANSI Standard documentation are the programmer's tools for understanding the transaction set function and segment layout.
If you don't have any programming skills, this would be a good project to learn some basic input and output logic - to convert an ANSI transaction file into a well-formed XML document of your design, or into a CSV or Tab-Delimited file.
With some basic input, output and data manipulation logic, you can convert any ANSI X12 file into a well-formed XML document - where ANSI segments are converted (mostly) to empty XML elements which contain Attributes that hold the ANSI Segment's data, like so:
For this ANSI stream:
ISA*00* *00* *ZZ*123456789012345*ZZ*123456789012346*080503*1705*>*00501*000010216*0*T*:~GS*HS*12345*12345*20080503*1705*20213*X*005010X279A1~ST*270*1235*005010X279A1~ (to end of file after the IEA segment)
Convert it to a flat file with CRLF characters after each tilde (~):
ISA*00* *00* (skipping elements to simplify) *00501*000010216*0*T*:~
GS*HS*12345*12345*20080503*1705*20213*X*005010X279A1~
ST*270*1235*005010X279A1~
(continue to end of file, after the IEA segment)
Then, convert the new ANSI flat file to an XML document with Attributes like:
<?xml version="1.0"?> (this must be your first line in the XML document)
<ISA a1="00" a2=" " a3="00" (skipping attributes to simplify) >
<GS a1="HS" a2="12345" a3="12345" a4="20080503" a5="1705" a6="20213" a7="X" a8= "005010X279A1" >
<ST a1="270" a2="1235" a3="005010X279A1" > (ISA, GS, ST are not empty elements)
(continue to end of file with empty elements like:)
<
<ST ... /> (closing xml tag for the ST element, convert the "SE")
<GS ... /> (closing xml tag for the GS element, convert the "GE")
<ISA a1="1" a2="000010216" /> (Closing xml tag for the ISA element, convert the "IEA")
Like I said, the output is mostly made up of "empty" XML element tags, but the SE, GE and IEA ANSI segments must be converted into the closing XML tags for the "ST", "GS" and "ISA" XML elements - since those three are NOT empty XML elements; they contain other XML elements.
There are many other interior ANSI segments that should also not become "empty" XML elements, since in the ANSI form they have a parent/child relationship with subordinate segments. Think of an insurance claim (not my example) containing a lot of patient demographic data and many medical charges - in ANSI those segments are children of a "CLM" segment. In XML, the "CLM" element would not be "empty"; it would contain child XML elements such as the "NM1" and "SVC".
The first "CLM" XML tag would be closed - enclosing all of its "child" elements - when the next "CLM" (the next patient's insurance claim) is encountered. In ANSI this "closing" is not explicit; it is only signaled by the appearance of the following "CLM" segment.
To know if you have created a well-formed XML document, try to browse and display the XML output file with a web browser. If there are errors in the XML, it will stop displaying at the error.
Once you have learned the data structure (layout) of the ANSI files, you can then learn to create a simple well-formed XML document, and with some "XSL" Transformation skills, to translate it into a browser-friendly displayable document.
<?xml version="1.0"?> must be the first line of your XML document so that
a browser can recognize and know what to do with the XML markup. It has no
equivalent in the ANSI file.
My case: a list of Danish-named students (with names including characters such as ü, æ, ø, å). Minimal working example:
CSV file:
Fornavn;Efternavn;Mobil;Adresse
Øjvind;Ørnenæb;87654321;Paradisæblevej 125, 5610 Åkirkeby
Süzette;Ågård;12345678;Ærøvej 123, 2000 Frederiksberg
In-browser neo4j-editor:
LOAD CSV WITH HEADERS FROM 'file:///path/to/file.csv' AS line FIELDTERMINATOR ";"
CREATE (:Elev {fornavn: line.Fornavn, efternavn: line.Efternavn, mobil: line.Mobil, adresse: line.Adresse})
Resulting in registrations like:
Neo4j browser screenshot, showing ?-characters where Danish/German characters are wanted.
My data come from a Learning Management System into Excel. When exporting as CSV from Excel, I can control the file encoding via the Save As dialogue box. I have tried exporting from Excel as "UTF-8" (which the Neo4j manual says it wants), "ISO Western European", "Windows Western European" and "Unicode", each to a separately named file, and adjusted the FROM 'file:///path/to/file.csv' clause accordingly.
Intriguingly, exactly the same misrepresentation results regardless of which (apparent?) file encoding I request from Excel when saving. When copy-pasting the names and addresses directly into the editor, I do not encounter the same problem.
Check Michael Hunger's blog post here which contains some tips, namely:
if you use non-ascii characters (umlauts, accents etc.) make sure to use the appropriate locale or provide the System property -Dfile.encoding=UTF8
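If the Excel export is not genuinely UTF-8, another option is to re-encode the CSV yourself before LOAD CSV reads it. Here is a minimal Java sketch; it assumes the export is actually Windows-1252 ("Windows-Western European"), and the file names and source charset are assumptions to adjust for your setup:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: re-save the Excel export as genuine UTF-8 before LOAD CSV.
public class ReencodeCsv {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));          // e.g. the CSV from Excel
        String text = new String(raw, Charset.forName("windows-1252")); // assumed source encoding
        Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
    }
}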
I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the Javadocs, but I cannot figure out how to get the output the way I described.
For POS, I tried using the LexicalizedParser, and it allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. Even to get it to generate these sub-optimal XML files, I had to write a script to get rid of all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and I looked at the corresponding Javadocs. I used a command similar to what is described, but I had to write a Python program to retrieve the stdout data resulting from analyzing each file and write it into a new file. The resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but it is a bit tricky, and there is no out-of-the-box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, in case you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.
On a very high level you have to do the following:
Load your trees with MemoryTreebank
Loop through all the trees and for each tree create a sentence CoreMap to which you add
a TokensAnnotation
a TreeAnnotation and the SemanticGraphCoreAnnotations
Create an Annotation object with a list containing the CoreMap objects for all sentences
Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
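A rough Java sketch of those steps might look like the following. The class name and output file name are mine, and depending on your CoreNLP version you may also need to set the sentence text, token indexes and character offsets, plus the SemanticGraphCoreAnnotations mentioned above, before dcoref and the XML writer are satisfied:

import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.MemoryTreebank;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;

public class PtbToCoreNlpXml {
    public static void main(String[] args) throws Exception {
        // Load the gold trees from a .mrg file instead of running the parser.
        MemoryTreebank treebank = new MemoryTreebank();
        treebank.loadPath(args[0]);                      // e.g. wsj_DDXX.mrg

        List<CoreMap> sentences = new ArrayList<>();
        for (Tree tree : treebank) {
            CoreMap sentence = new ArrayCoreMap();
            // Tokens (with POS tags) come straight from the tree's leaves.
            List<CoreLabel> tokens = tree.taggedLabeledYield();
            sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
            // Attach the gold parse tree.
            sentence.set(TreeCoreAnnotations.TreeAnnotation.class, tree);
            // Also set the SemanticGraphCoreAnnotations here, e.g. via the helper in
            // ParserAnnotatorUtils (check your CoreNLP version's Javadoc for its signature).
            sentences.add(sentence);
        }

        Annotation annotation = new Annotation("");
        annotation.set(CoreAnnotations.SentencesAnnotation.class, sentences);

        // Run only the downstream annotators; the earlier stages are pre-filled above.
        Properties props = new Properties();
        props.setProperty("annotators", "lemma,ner,dcoref");
        props.setProperty("enforceRequirements", "false");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(annotation);

        // Write the usual CoreNLP XML output.
        pipeline.xmlPrint(annotation, new FileOutputStream("wsj_DDXX.xml"));
    }
}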
I know how to convert a set of text or web-page files into an ARFF file using TextDirectoryLoader.
I want to know how to convert a single text file into an ARFF file.
Any help will be highly appreciated.
Please be more specific. Anyway:
If the text in the file corresponds to a single document (that is, a single instance), then all you need is to replace all newlines with the escape code \n so that the full text sits on a single line, and then manually format it as an ARFF with a single text attribute and a single instance.
If the text corresponds to several instances (e.g. documents), then I suggest writing a script to break it into several files and applying TextDirectoryLoader. If there is any specific formatting (e.g. instances are enclosed in XML tags), you can either do the same (by taking advantage of the XML format) or write a custom Loader class in WEKA to recognize your format and build an Instances object.
If you post an example, it would be easier to get a more precise suggestion.
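For the single-document case, here is a minimal plain-Java sketch of that manual formatting. The relation and attribute names are arbitrary, and the escaping simply follows the newline-to-\n idea above:

import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: wrap one text file into an ARFF with a single string attribute
// and a single instance. Usage: java TextFileToArff input.txt output.arff
public class TextFileToArff {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // Put the whole document on one line and escape the quoting characters ARFF uses.
        String oneLine = text.replace("\\", "\\\\").replace("'", "\\'")
                             .replace("\r", " ").replace("\n", "\\n");
        try (PrintWriter out = new PrintWriter(args[1], "UTF-8")) {
            out.println("@relation single_document");
            out.println();
            out.println("@attribute text string");
            out.println();
            out.println("@data");
            out.println("'" + oneLine + "'");
        }
    }
}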
I'm looking for a programmatic approach to positively identify fillable PDF forms among files that are not PDF forms.
The options that I believe are available are:
Parse the PDF code and content
Parse the file for signature identification with a hex-capable language such as Python
Parse the file with a hex-capable language such as Python to flag telltale signs
If the Catalog > AcroForm > Fields Array has at least one Dictionary element, the PDF is a fillable form. For bonus points, you could do some level of validation that the Field Dictionary is legit (validate Type and Subtype for example).
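One way to implement that check, assuming Apache PDFBox 2.x is available (the class and method names here are mine):

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

// Sketch of the check described above: a PDF counts as a fillable form if
// Catalog > AcroForm exists and contains at least one field.
public class IsFillableForm {
    public static boolean isFillable(File pdf) throws Exception {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
            return acroForm != null && !acroForm.getFields().isEmpty();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isFillable(new File(args[0])));
    }
}

For the bonus points mentioned above, you could extend isFillable to sanity-check each field's dictionary (its Type and Subtype entries) before counting it.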