I have an XML where I have several attributes for one node:
var
row : IXMLNode;
rowattr : IXMLAttr;
xml : IXMLDocument;
begin
xml := ConstructXMLDocument('xml');
SetNodeAttr(xml.DocumentElement, 'version', '1.0');
SetNodeAttr(xml.DocumentElement, 'encoding', 'UTF-8');
row := AppendNode(xml, 'Links');
rowattr:=xml.CreateAttribute('Link1');
rowattr.Value:='http:\\wwww.somelink1.com';
row.Attributes.SetNamedItem(rowattr);
rowattr:=xml.CreateAttribute('Link2');
rowattr.Value:='http:\\wwww.somelink2.com';
row.Attributes.SetNamedItem(rowattr);
rowattr:=xml.CreateAttribute('Link3');
rowattr.Value:='http:\\wwww.somelink3.com';
row.Attributes.SetNamedItem(rowattr);
XMLSaveToFile(xml, 'C:\Test1.xml', ofIndent);
end;
I wish to have every link on a separate line like this:
<xml version="1.0" encoding="UTF-8">
<Links
link1="http://www.somelink1.com"
link2="http://www.somelink2.com"
link3="http://www.somelink3.com"
/>
</xml>
OmniXML does not offer such fine grained control of output formatting. You could perhaps look to find an external XML pretty printer that will do what you need. Or you could even write your own XML library.
Before you go any further though I would like to make the point that XML was never intended to be read by humans. Its design makes no effort to be readable, and if you continue trying to make your XML as readable as possible, then you will be swimming against the tide. If you want a human readable structured file format then you might look instead at YAML which was designed with that in mind.
Another avenue to consider is the structure of the XML. Using node attributes to specify an array of values is a poor decision. Attributes are intended to be used with name/value mapping pairs. If you want to specify an array of values then you might do so like this:
<links>
<item>http://www.somelink1.com</item>
<item>http://www.somelink2.com</item>
<item>http://www.somelink3.com</item>
</links>
This is clearer than your XML and much easier to parse. Try writing code to parse your attributes and you will see what I mean.
Now, just to illustrate my point above, in YAML this would be:
Links:
- http://www.somelink1.com
- http://www.somelink2.com
- http://www.somelink3.com
Of course, all this is moot if somebody other than you is defining the format of the XML.
Related
I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with examples in the docs, TextIO looks pretty close, but it looks like its designed to process each file line-by-line and does not allow me to read the entire file at once. Also, I don't see any way to get the filename that's being processed?
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.
I'm trying to use different coders for the same class for two different scenarios:
Reading from JSON input files - using data = TextIO.Read.from(options.getInput()).withCoder(new Coder1())
Elsewhere in the job I want the class to be persisted using SerializableCoder using data.setCoder(SerializableCoder.of(MyClass.class)
It works locally, but fails when run in the cloud with
Caused by: java.io.StreamCorruptedException: invalid stream header: 7B227365.
Is it a supported scenario? The reason to do this in the first place is to avoid read/write of JSON format, and on the other hand make reading from input files more efficient (UTF-8 parsing is part of the JSON reader, so it can read from InputStream directly)
Clarifications:
Coder1 is my coder.
The other coder is a SerializableCoder.of(MyClass.class)
How does the system choose which coder to use? The two formats are binary incompatible, and it looks like due to some optimization, the second coder is used for data format which can only be read by the first coder.
Yes, using two different coders like that should work. (With the caveat that the coder in #2 will only be used if the system choses to persist 'data' instead of optimizing it into surround computations.)
Are you using your own Coders or ones provided by the Dataflow SDK? Quick caveat on TextIO -- because it uses newlines to encode element boundaries, you'll get into trouble if you use a coder that produces encoded values containing something that can be mistaken for a newline. You really should only use textual encodings within TextIO. We're hoping to make that clearer in the future.
I have an XML file in my app resources folder. I am trying to update that file with new dictionaries dynamically. In other words I am trying to edit an existing XML file to add new keys and values to it.
First of all can we edit a static XML file and add new dictionary with keys and values to it. What is the best way to do this.
In general, you can read an XML file into a document object (choose your language), use methods to modify it (add your new dictionary), and (re-)write it back out to either the original XML file, or a new one.
That's straightforward ... just roll up the ol' sleeves and code it up.
The real problem comes in with formatting in the XML file before and after said additions.
If you are going to 'unix diff' the XML file before and after, then order is important. Some standard XML processors do better with order than others.
If the order changes behind the scenes, and is gratuitously propagated into your output file, you lose standard diffing advantages, such as some gui differs, and some scm diffs (svn, cvs, etc.).
For example, browse to:
Order of XML attributes after DOM processing
They discuss that DOM loses order where SAX does not.
You can also write a custom XML 'diff'er (there may be such off-the-shelf ... for example check out 'http://diffxml.sourceforge.net/') that compares 2 XML documents tag-by-tag, attribute-by-attribute, etc.
Perhaps some standard XML-related tool such as XSLT will allow you to keep the formatting constant without changing tag or attribute order. You'd have to research that.
BTW, a related problem is the config (.ini) file problem ... many common processors flippantly announce that the write-order may not agree with the read-order.
I have a file that tells me what to do on runtime.
Notation is as below;
<Service name="Service2">
<Request>
<User value="admin">
<Pass value="1234">
</Request>
Is it possible to parse it with standard rules, without writing custom parser?
Thanks
As the above text are in file, you will read it as NSString. After reading the file you can either use some algorithm to fetch the value like, breaking the string into arrays separated by < and >
or, you can create your own parser.
As you dont want to form your custom parser, then you can use the first way to find the values, but that will be fixed for the above file contents.
I've loaded a XML document, and now I wish to run a XPath query to select a certain subset of the XML. The XML is
<?xml version="1.0"?>
<catalog xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications with
XML.</description>
</book>
</catalog>
and the procedure goes something like
procedure RunXPathQuery(XML: IXMLDOMDocument2; Query: string);
begin
XML.setProperty('SelectionLanguage', 'XPath');
NodeListResult := XML.documentElement.selectNodes(Query));
ShowMessage('Found (' + IntToStr(NodeListResult.length) + ') nodes.');
end;
Problem is: when I run the XPath query '/catalog' for the above XML, it returns (as expected) a nodelist of 1 element. However, if I remove :xsi from
<catalog xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'> and re-run the query, the nodelist returned is empty. If I remove the entire 'xmlns'-attribute, the resulting nodelist has, once again, 1 element.
So my question is this: what can I do to remedy this, i.e. how do I make MSXML return the correct number of instances (when running a XPath query), regardless of the namespace (or other attributes)?
Thanks!
See this link!
When you use <catalog xmlns='http://www.w3.org/2001/XMLSchema-instance'> then the whole node will be moved to a different (default) namespace. Your XPath isn't looking inside this other namespace so it can't find any data. With <catalog xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'> you're just declaring xsi as a different namespace. This would be a different namespace than the default namespace.
I can't test it right now but adding something like this might help:
XML.setProperty('SelectionNamespaces', 'xmlns=''http://www.w3.org/2001/XMLSchema-instance''');
Or maybe it doesn't. As I said, I can't test it right now.
Figured it out. It seems that my problem has been described here and here (and most likely a zillion other places, too).
The query /*[local-name()='catalog'] works for me.
Use:
document.setProperty('SelectionNamespaces', 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"')
/*[local-name()='catalog']
is a solution to your question.
But why would you want to ignore namespaces? They have been introduced to express something, e.g. to distinguish different types of catalogs. With your query, you can now select the content of any catalog in the world, but I assume you can only handle books. What will happen if you get a catalog of screws or cars instead?
The mentioned things about the prefix (xsi) is correct. If you remove the prefix, all elements are in that namespace (called default namespace then). But you can still deal with it.
In your code, give the namespace a prefix anyway. It needn't even match the original prefix:
XML.setProperty('SelectionNamespaces', "xmlns:xyz='http://www.w3.org/2001/XMLSchema-instance'");
The second thing is to adapt the XPath query. It must then be
/xyz:catalog
The original XML only declares the xsi namespace, but never makes use of it. In this case, you can remove it completely. If you want to use the namespace and you want it with prefixes, then rewrite your XML to
<?xml version="1.0"?>
<xsi:catalog xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>
<xsi:book id="bk101">
<xsi:author>Gambardella, Matthew</xsi:author>
<xsi:title>XML Developer's Guide</xsi:title>
<xsi:genre>Computer</xsi:genre>
<xsi:price>44.95</xsi:price>
<xsi:publish_date>2000-10-01</xsi:publish_date>
<xsi:description>An in-depth look at creating applications with
XML.</xsi:description>
</xsi:book>
</xsi:catalog>