Using Python 3.6 to parse XML, how can I determine if an XML tag contains no data?

I am trying to learn Python by writing a script that extracts data from multiple records in an XML file. I have found answers to most of my questions by searching the web, but I have not found a way to determine whether an XML tag contains no data before calling getElementsByTagName("tagname")[0].firstChild.data, which throws an AttributeError when no data is present. I realize that I could wrap the code in a try and handle the AttributeError, but I would rather know that the tag is empty before I try to extract the data and not have to handle the exception.
Here is an example of an XML file that contains two records, one with data in every tag and one with an empty tag.
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
<rec>
<name>ZYSRQPO</name>
<state>Washington</state>
<country>United States</country>
</rec>
<rec>
<name>ZYXWVUT</name>
<state></state>
<country>Mexico</country>
</rec>
</records>
Here is a sample of the code that I might use to extract the data:
from xml.dom import minidom
import sys
mydoc = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")
for rec in records:
    try:
        name = rec.getElementsByTagName("name")[0].firstChild.data
        state = rec.getElementsByTagName("state")[0].firstChild.data
        country = rec.getElementsByTagName("country")[0].firstChild.data
        print('{}\t{}\t{}'.format(name, state, country))
    except (AttributeError):
        print('AttributeError encountered in record {}'.format(name), file=sys.stderr)
        continue
When this file is processed, nothing is printed for the record named ZYXWVUT except the message that an exception was encountered. I would like a null value to be used for the state name and the rest of the information about this record to be printed. Is there a method I could use in an if statement to determine whether the tag contains no data, before accessing firstChild.data and hitting an error when no data is found?

from xml.dom import minidom
import sys
mydoc = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")
for rec in records:
    name = rec.getElementsByTagName("name")[0].firstChild.data
    state_node = rec.getElementsByTagName("state")[0]
    # An empty element such as <state></state> has no child nodes
    state = None if len(state_node.childNodes) == 0 else state_node.firstChild.data
    country = rec.getElementsByTagName("country")[0].firstChild.data
    print('{}\t{}\t{}'.format(name, state, country))
Or, if there is any chance that name and country are empty too:
from xml.dom import minidom
import sys
def get_node_data(node):
    if len(node.childNodes) == 0:
        result = None
    else:
        result = node.firstChild.data
    return result
mydoc = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")
for rec in records:
    name = get_node_data(rec.getElementsByTagName("name")[0])
    state = get_node_data(rec.getElementsByTagName("state")[0])
    country = get_node_data(rec.getElementsByTagName("country")[0])
    print('{}\t{}\t{}'.format(name, state, country))
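The helper above still assumes that every tag is present and that its first child is a text node. As a minimal sketch of a more defensive variant, assuming you also want a missing tag or a whitespace-only tag treated as empty, something like the following could work. The name get_tag_text and the inline SAMPLE document are made up for this illustration.

```python
from xml.dom import minidom

# Hypothetical sample: the second record has an empty <state> and no <country>.
SAMPLE = """<records>
<rec><name>ZYSRQPO</name><state>Washington</state><country>United States</country></rec>
<rec><name>ZYXWVUT</name><state></state></rec>
</records>"""

def get_tag_text(parent, tag, default=None):
    """Return the text of the first <tag> under parent, or default if missing or empty."""
    matches = parent.getElementsByTagName(tag)
    if not matches:
        # Tag is absent: indexing [0] would raise IndexError
        return default
    node = matches[0]
    # Join all text and CDATA children; an empty element has none
    text = "".join(
        child.data
        for child in node.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    ).strip()
    return text if text else default

doc = minidom.parseString(SAMPLE)
rows = [
    tuple(get_tag_text(rec, tag) for tag in ("name", "state", "country"))
    for rec in doc.getElementsByTagName("rec")
]
for row in rows:
    print("\t".join(str(value) for value in row))
```

Passing a different default (for example an empty string) changes what is reported for empty or missing tags without touching the loop.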

I tried reedcourty's second suggestion and found that it worked great. But I decided that I really did not want None to be returned if the element was empty. Here is what I came up with:
from xml.dom import minidom
import sys
def get_node_data(node):
    if len(node.childNodes) == 0:
        # Empty element: return a visible placeholder naming the tag
        result = '*->No ' + node.nodeName + '<-*'
    else:
        result = node.firstChild.data
    return result
# dataFileSpec holds the path to the XML file (defined elsewhere)
mydoc = minidom.parse(dataFileSpec)
records = mydoc.getElementsByTagName("rec")
for rec in records:
    name = get_node_data(rec.getElementsByTagName("name")[0])
    state = get_node_data(rec.getElementsByTagName("state")[0])
    country = get_node_data(rec.getElementsByTagName("country")[0])
    print('{}\t{}\t{}'.format(name, state, country))
When this is run against this XML:
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
<rec>
<name>ZYSRQPO</name>
<country>United States</country>
<state>Washington</state>
</rec>
<rec>
<name></name>
<country>United States</country>
<state>Washington</state>
</rec>
<rec>
<name>ZYXWVUT</name>
<country>Mexico</country>
<state></state>
</rec>
<rec>
<name>ZYNMLKJ</name>
<country></country>
<state>Washington</state>
</rec>
</records>
It produces this output:
ZYSRQPO Washington United States
*->No name<-* Washington United States
ZYXWVUT *->No state<-* Mexico
ZYNMLKJ Washington *->No country<-*
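For comparison, here is a rough sketch of the same placeholder idea using xml.etree.ElementTree from the standard library instead of minidom. This is a swapped-in alternative, not the approach used in the answers above. The helper name field and the inline SAMPLE are assumptions for the example. It relies on findtext returning an empty string for an element with no text and the given default for a missing element.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record sample; the second record has an empty <state>.
SAMPLE = """<records>
<rec><name>ZYSRQPO</name><country>United States</country><state>Washington</state></rec>
<rec><name>ZYXWVUT</name><country>Mexico</country><state></state></rec>
</records>"""

def field(rec, tag):
    """Return the tag's text, or a placeholder when the tag is empty or missing."""
    text = rec.findtext(tag, default="")
    return text.strip() or "*->No {}<-*".format(tag)

root = ET.fromstring(SAMPLE)
lines = [
    "\t".join(field(rec, tag) for tag in ("name", "state", "country"))
    for rec in root.iter("rec")
]
for line in lines:
    print(line)
```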

Related

Can't retrieve files from PubMed using Biopython

I am using this script to get data about COVID-19 from PubMed:
from Bio import Entrez
def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='20',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    return results
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results
if __name__ == '__main__':
    results = search('covid-19')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
        print("{}) {}".format(i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
I get results in the console, but what I want is to automatically download the articles as files, such as XML or text files. Any suggestions on how to do that? I searched for it but found nothing.
You can add this code at the end to save the results to a JSON file:
# write to file
import json
with open('file.json', 'w') as json_file:
    json.dump(papers, json_file)
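If the goal is an XML file rather than JSON, one possible approach is to write the raw response to disk instead of parsing it first. The sketch below is only an illustration under that assumption: save_raw is a made-up helper, and an in-memory io.BytesIO object with a made-up payload stands in for the network handle so the example runs without Biopython or a connection.

```python
import io
import tempfile
from pathlib import Path

def save_raw(handle, path):
    """Write everything readable from handle to path and return the byte count."""
    data = handle.read()
    if isinstance(data, str):
        # Some handles yield text rather than bytes
        data = data.encode("utf-8")
    Path(path).write_bytes(data)
    return len(data)

# Stand-in for the handle a network call would return (hypothetical payload).
fake_handle = io.BytesIO(b"<PubmedArticleSet></PubmedArticleSet>")
out_path = Path(tempfile.mkdtemp()) / "articles.xml"
written = save_raw(fake_handle, out_path)
print(written, "bytes written to", out_path)
```

In the script from the question, the handle returned by Entrez.efetch(db='pubmed', retmode='xml', id=ids) could presumably be passed to save_raw in place of fake_handle, rather than being handed to Entrez.read.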

replace all double quotes with nothing in csv file in BIML script

I am importing flatfile connections using BIML.
" is used around text and ; is used as delimiter.
However, in some of the files I see this:
;"this is valid text""";
There are double double quotes with nothing between them. If I edit the file and search and replace all double double quotes with nothing, the import runs well. So, is it possible in BIML to do this automatically, that is, to search for all instances of "" and replace them with nothing?
<#
string[] myFiles = Directory.GetFiles(path, extension);
string[] myColumns;
// Loop through the files
int TableCount = 0;
foreach (string filePath in myFiles)
{
TableCount++;
fileName = Path.GetFileNameWithoutExtension(filePath);
#>
<Package Name="stg_<#=prefix#>_<#=TableCount.ToString()#>_<#=fileName#>" ConstraintMode="Linear" AutoCreateConfigurationsType="None" ProtectionLevel="<#=protectionlevel#>" PackagePassword="<#=packagepassword#>">
<Variables>
<Variable Name="CountStage" DataType="Int32" Namespace="User">0</Variable>
</Variables>
<Tasks>
<ExecuteSQL ConnectionName="STG_<#=application#>" Name="SQL-Truncate <#=fileName#>">
<DirectInput>TRUNCATE TABLE <#=dest_schema#>.<#=fileName#></DirectInput>
</ExecuteSQL>
<Dataflow Name="DFT-Transport CSV_<#=fileName#>">
<Transformations>
<FlatFileSource Name="SRC_FF-<#=fileName#> " ConnectionName="FF_CSV-<#=Path.GetFileNameWithoutExtension(filePath)#>">
</FlatFileSource>
<OleDbDestination ConnectionName="STG_<#=application#>" Name="OLE_DST-<#=fileName#>" >
<ExternalTableOutput Table="<#=dest_schema#>.<#=fileName#>"/>
</OleDbDestination>
</Transformations>
</Dataflow>
</Tasks>
</Package>
<# } #>
Turns out I was looking in completely the wrong place for this.
I went to the part where the file is read and added .Replace("\"\"","")
myColumns = myFile.ReadLine().Replace("\"\"","").Replace(separator,"").Split(delimiter);
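To make the effect of that replacement concrete, here is a small sketch in Python (rather than the C# used inside the BIML script) applied to the sample line from the question. It also shows, for comparison, how a CSV parser that follows the usual quoting convention reads the untouched line: a doubled quote inside a quoted field is treated as one literal quote character. The variable names are made up for the example.

```python
import csv
import io

line = ';"this is valid text""";'

# Same idea as the .Replace call: drop every pair of adjacent double quotes.
stripped = line.replace('""', '')
print(stripped)

# A standard CSV reader treats "" inside a quoted field as an escaped quote.
parsed_original = next(csv.reader(io.StringIO(line), delimiter=';'))
parsed_stripped = next(csv.reader(io.StringIO(stripped), delimiter=';'))
print(parsed_original)
print(parsed_stripped)
```

Note that the blanket replacement also removes the quotes of legitimately empty quoted fields, which still yields an empty field, but it discards any quote character that was meant to be part of the data.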

Saxonica - .NET API - XQuery - XPDY0002: The context item for axis step root/descendant::xxx is absent

I'm getting the same error as in this question, but with XQuery:
SaxonApiException: The context item for axis step ./CLIENT is absent
When running from the command line, all is good. So I don't think there is a syntax problem with the XQuery itself. I won't post the input file unless needed.
The XQuery is displayed with a Console.WriteLine before the error appears:
----- Start: XQUERY:
(: FLWOR = For Let Where Order-by Return :)
<MyFlightLegs>
{
for $flightLeg in //FlightLeg
where $flightLeg/DepartureAirport = 'OKC' or $flightLeg/ArrivalAirport = 'OKC'
order by $flightLeg/ArrivalDate[1] descending
return $flightLeg
}
</MyFlightLegs>
----- End : XQUERY:
Error evaluating (<MyFlightLegs {for $flightLeg in root/descendant::FlightLeg[DepartureAirport = "OKC" or ArrivalAirport = "OKC"] ... return $flightLeg}/>) on line 4 column 20
XPDY0002: The context item for axis step root/descendant::FlightLeg is absent
I think that like the other question, maybe my input XML file is not properly specified.
I took the run method of the XQueryToStream class from samples/cs/ExamplesHE.cs.
The code is reproduced here for easy reference:
public class XQueryToStream : Example
{
    public override string testName
    {
        get { return "XQueryToStream"; }
    }
    public override void run(Uri samplesDir)
    {
        Processor processor = new Processor();
        XQueryCompiler compiler = processor.NewXQueryCompiler();
        compiler.BaseUri = samplesDir.ToString();
        compiler.DeclareNamespace("saxon", "http://saxon.sf.net/");
        XQueryExecutable exp = compiler.Compile("<saxon:example>{static-base-uri()}</saxon:example>");
        XQueryEvaluator eval = exp.Load();
        Serializer qout = processor.NewSerializer();
        qout.SetOutputProperty(Serializer.METHOD, "xml");
        qout.SetOutputProperty(Serializer.INDENT, "yes");
        qout.SetOutputStream(new FileStream("testoutput.xml", FileMode.Create, FileAccess.Write));
        Console.WriteLine("Output written to testoutput.xml");
        eval.Run(qout);
    }
}
I changed it to take the XQuery file name, the XML file name, and the output file name, and tried to make a static method out of it. (I had success doing the same with the XSLT processor.)
static void DemoXQuery(string xmlInputFilename, string xqueryInputFilename, string outFilename)
{
    // Create a Processor instance.
    Processor processor = new Processor();
    // Load the source document
    DocumentBuilder loader = processor.NewDocumentBuilder();
    loader.BaseUri = new Uri(xmlInputFilename);
    XdmNode indoc = loader.Build(loader.BaseUri);
    XQueryCompiler compiler = processor.NewXQueryCompiler();
    //BaseUri is inconsistent with Transform= Processor?
    //compiler.BaseUri = new Uri(xqueryInputFilename);
    //compiler.DeclareNamespace("saxon", "http://saxon.sf.net/");
    string xqueryFileContents = File.ReadAllText(xqueryInputFilename);
    Console.WriteLine("----- Start: XQUERY:");
    Console.WriteLine(xqueryFileContents);
    Console.WriteLine("----- End : XQUERY:");
    XQueryExecutable exp = compiler.Compile(xqueryFileContents);
    XQueryEvaluator eval = exp.Load();
    Serializer qout = processor.NewSerializer();
    qout.SetOutputProperty(Serializer.METHOD, "xml");
    qout.SetOutputProperty(Serializer.INDENT, "yes");
    qout.SetOutputStream(new FileStream(outFilename,
        FileMode.Create, FileAccess.Write));
    eval.Run(qout);
}
Also two questions regarding "BaseURI".
1. Should it be a directory name, or can it be same as the Xquery file name?
2. I get this compile error: "Cannot implicitly convert "System.Uri" to "String"" for the line:
compiler.BaseUri = new Uri(xqueryInputFilename);
It's exactly the same thing I did for XSLT which worked. But it looks like BaseUri is a string for XQuery, but a real Uri object for XSLT? Any reason for the difference?
You seem to be asking a whole series of separate questions, which are hard to disentangle.
Your C# code appears to be compiling the query
<saxon:example>{static-base-uri()}</saxon:example>
which bears no relationship to the XQuery code you supplied that involves MyFlightLegs.
The MyFlightLegs query uses //FlightLeg and is clearly designed to run against a source document containing a FlightLeg element, but your C# code makes no attempt to supply such a document. You need to add an eval.ContextItem = value statement.
Your second C# fragment creates an input document in the line
XdmNode indoc = loader.Build(loader.BaseUri);
but it doesn't supply it to the query evaluator.
A base URI can be either a directory or a file; resolving relative.xml against file:///my/dir/ gives exactly the same result as resolving it against file:///my/dir/query.xq. By convention, though, the static base URI of the query is the URI of the resource (e.g. the file) containing the source query text.
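The resolution behaviour described here is general URI resolution and is not specific to Saxon. As a quick illustration using Python's standard urljoin (chosen only for demonstration, with the example URIs from the paragraph above):

```python
from urllib.parse import urljoin

dir_base = "file:///my/dir/"           # base URI that names a directory
file_base = "file:///my/dir/query.xq"  # base URI that names the query file

# The last path segment of the base is dropped before the relative
# reference is appended, so both bases resolve to the same location.
resolved_from_dir = urljoin(dir_base, "relative.xml")
resolved_from_file = urljoin(file_base, "relative.xml")

print(resolved_from_dir)
print(resolved_from_file)
```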
Yes, there's a lot of inconsistency in the use of strings versus URI objects in the API design. (There's also inconsistency about the spelling of BaseURI versus BaseUri.) Sorry about that; you're just going to have to live with it.
Bottom-line solution, based on Michael Kay's response: I added this line of code after the exp.Load() call:
eval.ContextItem = indoc;
The indoc object created earlier represents the XML input file to be processed by the XQuery.

PHP XML DOM Document move sub-sub-nodes within sub-node

I have an xml like this:
<?xml version="1.0" encoding="UTF-8"?>
<OrderListResponse>
<OrderListResponseContainer>
<DateFrom>2018-07-01T00:00:00+00:00</DateFrom>
<DateTo>2018-07-19T00:00:00+00:00</DateTo>
<Page>1</Page>
<TotalNumberOfPages>4</TotalNumberOfPages>
<Orders>
<Order>
<OrderID>158772</OrderID>
<Customer>
<Name><![CDATA[John Smith]]></Name>
<StreetAddress><![CDATA[33, Sunset Boulevrd]]></StreetAddress>
</Customer>
<Delivery>
<Name><![CDATA[John Smith]]></Name>
<StreetAddress><![CDATA[47, Rodeo Drive]]></StreetAddress>
</Delivery>
<Billing>
<Name><![CDATA[John Smith]]></Name>
<StreetAddress><![CDATA[33, Sunset Boulevrd]]></StreetAddress>
</Billing>
<Payment>
<Module>paypal</Module>
<TransactionID/>
</Payment>
<DatePurchased>2018-07-01 16:30:42</DatePurchased>
<DateLastModified>2018-07-02 21:08:28</DateLastModified>
<CheckoutMessage><![CDATA[]]></CheckoutMessage>
<Status>cancelled</Status>
<Currency>EUR</Currency>
<Products>
<Product>
<MxpID>44237</MxpID>
<SKU>IRF 8707TR</SKU>
<Quantity>3</Quantity>
<Price>2.46</Price>
</Product>
</Products>
<Total>
<SubTotal>7.38</SubTotal>
<Shipping>2.7</Shipping>
<Cod>0</Cod>
<Insurance>0</Insurance>
<Tax>1.62</Tax>
<Total>11.7</Total>
</Total>
</Order>
<Order>...</Order>
</Orders>
</OrderListResponseContainer>
</OrderListResponse>
and, although there is surely a better way to do it, I built a routine like this to parse all orders:
$xmlDoc = new DOMDocument();
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->loadXML($response);
$xpath = new DOMXPath($xmlDoc);
$rootNode = $xpath->query('//OrderListResponseContainer/Orders')->item(0);
foreach($rootNode->childNodes as $node)
{
    foreach($node->childNodes as $subnode)
    {
        // Process User
        foreach($subnode->childNodes as $subsubnode)
        {
            foreach($subsubnode->childNodes as $subsubsubnode)
            {
                // Process Products and Sales
            }
        }
    }
}
**** ADDED ****
I use the nested loops to create one XML document for each product (each one contains details about the buyer, the item and the sale), and this XML is then passed to a stored procedure that generates the user/item/sale records. For several reasons I cannot bulk import users first, then items and then sales. While building the sale XML, however, I need some details from the Total node, and one way to get them is to move the Total node to the top of the XML, while keeping it inside the Order node.
**** ADDED ****
I need to access some Total subnodes before processing Products.
The only solution I found is to move the Total node to the beginning, but despite many attempts I have not succeeded.
The idea was to clone the Total node and insert the clone before the OrderID node.
The problem is that I need to work on subdocuments and select the node to clone from within a node itself, while all the examples I found clone the full DocumentElement.
Perhaps an easier solution can be achieved using XSLT?
Can anyone suggest a solution?
I don't completely understand what you are trying to say about the cloning part. Perhaps you can edit your question and clarify what you mean.
However, about accessing the Total nodes... you could simply use XPath for this as well.
$xmlDoc = new DOMDocument();
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->loadXML($response);
$xpath = new DOMXPath($xmlDoc);
// first, let's fetch all <Order> elements
$orders = $xpath->query('//OrderListResponseContainer/Orders/Order');
// loop through all <Order> elements
foreach( $orders as $order ) {
    /*
    There's all sorts of ways you could convert <Total> to something useful
    */
    // Example 1.
    // fetch <Total> that is a direct child (./) of our context node (second argument) $order
    $total = $xpath->query( './Total', $order )->item( 0 );
    // then do something like
    $subTotal = $total->getElementsByTagName( 'SubTotal' )->item( 0 );
    $shipping = $total->getElementsByTagName( 'Shipping' )->item( 0 );
    // ... etc. for each child node of <Total>
    // or perhaps simply convert it to a SimpleXMLElement
    $total = simplexml_import_dom( $total );
    var_dump( $total );
    // and then access the values like this:
    $total->SubTotal;
    $total->Shipping;
    // ... etc.
    // Example 2.1
    // fetch all children of <Total> into an array
    $total = [];
    foreach( $xpath->query( './Total/*', $order ) as $totalNode ) {
        $total[ $totalNode->nodeName ] = $totalNode->textContent;
    }
    var_dump( $total );
    // Example 2.2
    // fetch all children of <Total> into a stdClass object
    $total = new \stdClass;
    foreach( $xpath->query( './Total/*', $order ) as $totalNode ) {
        $total->{ $totalNode->nodeName } = $totalNode->textContent;
    }
    var_dump( $total );
    /*
    Now, after this you can create and process the Customer and Products data
    in a similar fashion as I've shown how to process the Total data above
    */
}
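The same idea, reading the Total values first and only then walking the products, without moving any nodes, can be sketched in Python with xml.etree.ElementTree. This is an illustration in a different language from the PHP above, using a trimmed-down sample made up from the question's XML. The variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical version of the order document from the question.
SAMPLE = """<OrderListResponse><OrderListResponseContainer><Orders>
<Order>
<OrderID>158772</OrderID>
<Products><Product><SKU>IRF 8707TR</SKU><Quantity>3</Quantity><Price>2.46</Price></Product></Products>
<Total><SubTotal>7.38</SubTotal><Shipping>2.7</Shipping><Total>11.7</Total></Total>
</Order>
</Orders></OrderListResponseContainer></OrderListResponse>"""

root = ET.fromstring(SAMPLE)
summary = []
for order in root.findall("./OrderListResponseContainer/Orders/Order"):
    # find("Total") matches only the direct child, not the nested <Total> inside it
    totals = {child.tag: child.text for child in order.find("Total")}
    # The totals are now available while each product is processed
    for product in order.findall("./Products/Product"):
        summary.append(
            (order.findtext("OrderID"), product.findtext("SKU"),
             totals["Shipping"], totals["Total"])
        )
print(summary)
```

Because the lookup is by path rather than by document order, the position of the Total node inside Order does not matter.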

reading xml with Linq

I cannot figure out how to get all the ItemDetails nodes in the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<AssessmentMetadata xmlns="http://tempuri.org/AssessmentMetadata.xsd">
<ItemDetails>
<ItemName>I1200</ItemName>
<ISC_Inactive_Codes>NS,NSD,NO,NOD,ND,NT,SP,SS,SSD,SO,SOD,SD,ST,XX</ISC_Inactive_Codes>
<ISC_StateOptional_Codes>NQ,NP</ISC_StateOptional_Codes>
</ItemDetails>
<ItemDetails>
<ItemName>I1300</ItemName>
<ISC_Inactive_Codes>NS,NSD,NO,NOD,ND,NT,SP,SS,SSD,SO,SOD,SD,ST,XX</ISC_Inactive_Codes>
<ISC_StateOptional_Codes>NQ,NP</ISC_StateOptional_Codes>
</ItemDetails>
<ItemDetails>
<ItemName>I1400</ItemName>
<ISC_Active_Codes>NC</ISC_Active_Codes>
<ISC_Inactive_Codes>NS,NSD,NO,NOD,ND,NT,SP,SS,SSD,SO,SOD,SD,ST,XX</ISC_Inactive_Codes>
<ISC_StateOptional_Codes>NQ,NP</ISC_StateOptional_Codes>
</ItemDetails>
</AssessmentMetadata>
I have tried a number of things. I think it might be a namespace issue, so this is my latest attempt:
var xdoc = XDocument.Load(asmtMetadata.Filepath);
var assessmentMetadata = xdoc.XPathSelectElement("/AssessmentMetadata");
You need to get the default namespace and use it when querying:
var ns = xdoc.Root.GetDefaultNamespace();
var query = xdoc.Root.Elements(ns + "ItemDetails");
You'll need to prefix every element name with the namespace. For example, the following query retrieves all ItemName values:
var itemNames = xdoc.Root.Elements(ns + "ItemDetails")
.Elements(ns + "ItemName")
.Select(n => n.Value);
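The same default-namespace pitfall shows up in other XML libraries. As a rough Python analogue (not LINQ), the sketch below uses xml.etree.ElementTree on a shortened, made-up version of the document from the question. The prefix am is an arbitrary name chosen for the example, and it only has to map to the namespace URI declared in the document.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the document from the question.
SAMPLE = """<AssessmentMetadata xmlns="http://tempuri.org/AssessmentMetadata.xsd">
<ItemDetails><ItemName>I1200</ItemName></ItemDetails>
<ItemDetails><ItemName>I1300</ItemName></ItemDetails>
</AssessmentMetadata>"""

# "am" is an arbitrary prefix for the lookup; the document itself uses no prefix.
NS = {"am": "http://tempuri.org/AssessmentMetadata.xsd"}

root = ET.fromstring(SAMPLE)

# Without the namespace nothing matches, mirroring the failed attempt in the question.
unqualified = root.findall("ItemDetails")

# With the namespace applied to every step, the elements are found.
item_names = [e.text for e in root.findall("am:ItemDetails/am:ItemName", NS)]
print(len(unqualified), item_names)
```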
