How to parse the text between the elements of an XML file in Python - xml-parsing

I want to parse an XML file like the following:
<book attr='1'>
<page number='1'>
<text> sss </text>
<text> <b>bb<i>sss<b></i></b></text>
<text> <i><b>sss</b></i></text>
<text><a herf='a'> sss</a></text>
</page>
<page number='2'>
<text> sss2 </text>
<text> <b>bb<i>sss2</i><b></text>
<text> <i><b>sss2</b></i></text>
<text><a herf='a'> sss2</a></text>
</page>
.......
</book>
I want to extract all the text inside each 'text' element, but the 'text' elements contain nested elements such as 'b', 'i' and 'a'.
I have tried the following code:
import xml.etree.ElementTree as ET

tree = ET.parse('book.xml')
root = tree.getroot()
for p in root.findall('page'):
    print(p.get('number'))
    for t in p.findall('text'):
        print(t.text)
But the result is:
1
sss
None
None
None
2
sss2
None
None
None
Actually, I want to extract all the text between <text> and </text> and join it into a sentence, like the following:
1
bb sss
sss
sss
sss
2
bb sss2
sss2
sss2
sss2
How can I parse the subelements inside the 'text' element? Thanks!

To parse the XML you can use BeautifulSoup. The text inside an element, including the text of its nested elements, can be obtained with the get_text() method:
data = '''<book attr='1'>
<page number='1'>
<text> sss </text>
<text> <b>bb<i>sss<b></i></b></text>
<text> <i><b>sss</b></i></text>
<text><a herf='a'> sss</a></text>
</page>
<page number='2'>
<text> sss2 </text>
<text> <b>bb<i>sss2</i><b></text>
<text> <i><b>sss2</b></i></text>
<text><a herf='a'> sss2</a></text>
</page>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
for page in soup.select('page[number]'):
    print(page['number'])
    for text in page.select('text'):
        print(text.get_text(strip=True, separator=' '))
Prints:
1
sss
bb sss
sss
sss
2
sss2
bb sss2
sss2
sss2
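If the XML is well-formed, the standard library's ElementTree can probably do the same job through Element.itertext(), which yields the text of an element and of everything nested inside it. Note that the sample in the question has unbalanced <b> tags, which a strict XML parser rejects. The sketch below therefore uses a corrected, shortened copy of the data (an assumption about the intended markup), and text_of is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the sample; the closing of the <b> tags
# is an assumption about what the original markup was meant to be.
data = """<book attr='1'>
<page number='1'>
<text> sss </text>
<text> <b>bb<i>sss</i></b></text>
<text> <i><b>sss</b></i></text>
<text><a herf='a'> sss</a></text>
</page>
</book>"""


def text_of(element):
    """Join every non-empty text fragment inside element with single spaces."""
    pieces = (piece.strip() for piece in element.itertext())
    return " ".join(piece for piece in pieces if piece)


root = ET.fromstring(data)
results = []
for page in root.findall("page"):
    print(page.get("number"))
    for text in page.findall("text"):
        results.append(text_of(text))
        print(results[-1])
```

For input that is not well-formed, a lenient parser such as the BeautifulSoup approach above remains the safer choice.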

Related

ColdFusion : String : Get Price From Inside 2 Points

I'm playing with the NOMICS API and get data in a string. But I'm having trouble getting just the Price:
This is part of the string from the METHOD=GET - which works fine..
"currency":"SHIB","platform_currency":"ETH","price":"0.000026199726","price_date":"2022-02-06T00:00:00Z","price_timestamp":"
I know that ,"price":" is the lead and then "," is the end...
But I can't seem to get just the 0.000026199726 from the middle- which is what I need.
<CFHTTP METHOD="Get"
URL="https://api.nomics.com/v1/currencies/ticker?key=#apikey#&ids=SHIB">
<cfset feedData = cfhttp.filecontent>
<cfset startpos = findNoCase(',"price":"', feedData)>
<cfset endpos = findNoCase('",', feedData)>
<cfset getdata = mid(feeddata,startpos,endpos-startpos)>
<b>#getdata#</b>
This errors because the length is negative:
The value of parameter 3 of the function Mid, which is now -191, must be a non-negative integer
This has to be an easy task. I must be using the wrong string function?
EDIT: Figured it out. The search for "," matched the first occurrence in the string, and there are many of them, so the end position came before the start position and the length went negative. The fix was to search for the structure that follows the price instead: ","price_date".
<cfset string = cfhttp.filecontent>
<cfset startpos = findNoCase('price":"', string)>
<cfset endpos = findNoCase('","price_date"', string)>
<cfset detdata = mid(string,startpos,endpos-startpos)>
<cfoutput>
start: #startpos#<br>
end: #endpos#<br>
data: #detdata#<br>
trimmed data: #trim(detdata)#<br>
trimmed data:
<br><b>#removechars(detdata,1,8)#</b><br><br>
</cfoutput>
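For comparison, the same idea can be sketched in Python: begin the search for the closing delimiter at the position where the value starts, so earlier occurrences of the delimiter are never matched. The feed string below is the fragment quoted in the question; lead, start and end are illustrative names, not part of any API.

```python
# Fragment of the API response quoted in the question.
feed = ('"currency":"SHIB","platform_currency":"ETH",'
        '"price":"0.000026199726","price_date":"2022-02-06T00:00:00Z"')

lead = '"price":"'
start = feed.find(lead)
if start == -1:
    raise ValueError("price field not found")
start += len(lead)           # index of the first character of the value
end = feed.find('"', start)  # first quote at or after the start of the value
price = feed[start:end]
print(price)
```

Because the end search is anchored at start, the slice length cannot go negative as long as a closing quote exists.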
I'll look at the JSON examples as well. Perhaps that will help with multiple pulls.
Excellent Folks : Thank you so much
<CFHTTP METHOD="Get"
URL="https://api.nomics.com/v1/currencies/ticker?key=#apikey#&ids=SHIB,BTC">
<cfset output = cfhttp.filecontent>
<cfoutput>
<cfset arrayOfStructs = deserializeJson(output)>
<cfloop array="#arrayOfStructs#" index="getpr">
<cfset Price = getpr.price />
<cfset TKID = getpr.id />
#tkid#: #price#<br>
</cfloop>
</cfoutput>
Spits out:
BTC: 43963.45841296
SHIB: 0.000033272664
Credit to Andrea/SOS
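The deserializeJson approach carries over to other languages. As a rough Python sketch, the response can be parsed with the standard json module instead of searching the string. The sample string is a hypothetical, trimmed-down response that keeps only the id and price fields shown in the output above; a real response contains many more keys.

```python
import json

# Hypothetical, trimmed-down response body: only the fields used here.
sample = (
    '[{"id":"BTC","price":"43963.45841296"},'
    '{"id":"SHIB","price":"0.000033272664"}]'
)

tickers = json.loads(sample)  # a list of dicts, one per currency
prices = {ticker["id"]: ticker["price"] for ticker in tickers}

for ticker_id, price in prices.items():
    print(f"{ticker_id}: {price}")
```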

React Native replace parts of string with tagged elements

I have a sentence for example:
"How now brown cow"
I want to find certain words within the sentence, for example 'now' and 'cow', and have them generated with different tags added around each word. For example:
<Text style={styles.text}>
How <TouchableHighlight><Text>now</Text></TouchableHighlight> brown <TouchableHighlight><Text>cow</Text></TouchableHighlight>
</Text>
But I can't see how to add these element tags to them. The basic code is below:
render() {
  var sentence = "How now brown cow";
  return (
    <TouchableHighlight onPress={() => this._pressRow(rowID)}>
      <View>
        <View style={styles.row}>
          <Text style={styles.text}>
            {sentence}
          </Text>
        </View>
      </View>
    </TouchableHighlight>
  );
},
Split your sentence using String.prototype.split() and iterate through the resulting array.
Make sure that you change the flexDirection to row; otherwise each element will be placed on its own line.
I added a space after each word. You may want a better solution so that the space is not added after the last element.
render() {
  const sentence = "How now brown cow";
  const words = sentence.split(' ');
  const wordsToHighlight = ['now', 'cow'];
  const renderWord = (word, index) => {
    if (wordsToHighlight.indexOf(word) > -1) return (<TouchableHighlight><Text>{word} </Text></TouchableHighlight>);
    return (<Text>{word} </Text>);
  };
  return (<View style={{flexDirection: 'row'}}>{React.Children.map(words, renderWord)}</View>);
}
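The splitting step, and one way to avoid the trailing space after the last word, can be sketched outside React. The snippet is in Python purely for illustration; tag_words is a hypothetical helper, and each pair it returns would correspond to one Text element, wrapped in a TouchableHighlight when the flag is true.

```python
def tag_words(sentence, words_to_highlight):
    """Return (text, highlighted) pairs; every word except the last keeps a trailing space."""
    words = sentence.split(" ")
    last = len(words) - 1
    return [
        (word if index == last else word + " ", word in words_to_highlight)
        for index, word in enumerate(words)
    ]


segments = tag_words("How now brown cow", {"now", "cow"})
for text, highlighted in segments:
    print(repr(text), highlighted)
```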

Parse Xml tags with attributes

I have this xml :
<document-display>
<name>
<entry lang="nl">nl Text</entry>
<entry lang="fr">fr Text</entry>
<entry lang="en">en Text</entry>
</name>
</document-display>
I would like to get the text according to the language.
I'm using XmlSlurper.
With my current code:
def parsedD = new XmlSlurper().parse(xml)
parsedD."document-display".name.entry.each {it.@lang == 'fr'}
I get a bad result, which is the concatenation of the three text contents:
nl Textfr Texten Text
Thanks for helping.
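Try:
parsedD.name.entry.find { it.@lang == 'fr' }?.text()
The object returned by parse() already represents the root <document-display> element, so the path starts at name. Also, each only iterates and returns the whole collection, whereas find returns the first entry that matches the closure.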
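For readers more familiar with Python, roughly the same lookup can be written with ElementTree, whose limited XPath support includes attribute predicates. This is a sketch for comparison only, not the Groovy API; fr_text is an illustrative name.

```python
import xml.etree.ElementTree as ET

xml = """<document-display>
<name>
<entry lang="nl">nl Text</entry>
<entry lang="fr">fr Text</entry>
<entry lang="en">en Text</entry>
</name>
</document-display>"""

# fromstring returns the root <document-display> element itself,
# so the search path starts at its child <name>.
root = ET.fromstring(xml)
entry = root.find("name/entry[@lang='fr']")
fr_text = entry.text if entry is not None else None
print(fr_text)
```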

Xml.parse returns null in Google Apps Script

I am trying to parse the XML, but the result is null.
Here is the xml:
<feed>
<title type="text">neymar</title>
<subtitle type="text">Bing Image Search</subtitle>
<id>https://api.datamarket.azure.com/Data.ashx/Bing/Search/Image?Query='neymar'&$top=2</id>
<rights type="text"/>
<updated>2013-05-13T08:45:02Z</updated>
<link rel="next" href="https://api.datamarket.azure.com/Data.ashx/Bing/Search/Image?Query='neymar'&$skip=2&$top=2"/>
<entry>
<id>https://api.datamarket.azure.com/Data.ashx/Bing/Search/Image?Query='neymar'&$skip=0&$top=1</id>
<title type="text">ImageResult</title>
<updated>2013-05-13T08:45:02Z</updated>
<content type="application/xml">
<m:properties>
<d:ID m:type="Edm.Guid">99cb00e9-c9bb-45ca-9776-1f51e30be398</d:ID>
<d:Title m:type="Edm.String">neymaer wallpaper neymar brazil wonder kid neymar wallpaper hd</d:Title>
<d:MediaUrl m:type="Edm.String">http://3.bp.blogspot.com/-uzJS8HW4j24/Tz3g6bNII_I/AAAAAAAAB1o/ExYxctnybUo/s1600/neymar-wallpaper-5.jpg</d:MediaUrl>
<d:SourceUrl m:type="Edm.String">http://insidefootballworld.blogspot.com/2012/02/neymar-wallpapers.html</d:SourceUrl>
<d:DisplayUrl m:type="Edm.String">insidefootballworld.blogspot.com/2012/02/neymar-wallpapers.html</d:DisplayUrl>
<d:Width m:type="Edm.Int32">1280</d:Width>
<d:Height m:type="Edm.Int32">800</d:Height>
<d:FileSize m:type="Edm.Int64">354173</d:FileSize>
<d:ContentType m:type="Edm.String">image/jpeg</d:ContentType>
<d:Thumbnail m:type="Bing.Thumbnail">
<d:MediaUrl m:type="Edm.String">http://ts3.mm.bing.net/th?id=H.5042206689331494&pid=15.1</d:MediaUrl>
<d:ContentType m:type="Edm.String">image/jpg</d:ContentType>
<d:Width m:type="Edm.Int32">300</d:Width>
<d:Height m:type="Edm.Int32">187</d:Height>
<d:FileSize m:type="Edm.Int64">12990</d:FileSize>
</d:Thumbnail>
</m:properties>
</content>
</entry>
<entry>
<id>https://api.datamarket.azure.com/Data.ashx/Bing/Search/Image?Query='neymar'&$skip=1&$top=1</id>
<title type="text">ImageResult</title>
<updated>2013-05-13T08:45:02Z</updated>
<content type="application/xml">
<m:properties>
<d:ID m:type="Edm.Guid">9a6b7476-643e-4844-a8da-a4b640a78339</d:ID>
<d:Title m:type="Edm.String">neymar jr 485x272 Neymar Show 2012 Hd</d:Title>
<d:MediaUrl m:type="Edm.String">http://www.sontransferler.com/wp-content/uploads/2012/07/neymar_jr.jpg</d:MediaUrl>
<d:SourceUrl m:type="Edm.String">http://www.sontransferler.com/neymar-show-2012-hd</d:SourceUrl>
<d:DisplayUrl m:type="Edm.String">www.sontransferler.com/neymar-show-2012-hd</d:DisplayUrl>
<d:Width m:type="Edm.Int32">1366</d:Width>
<d:Height m:type="Edm.Int32">768</d:Height>
<d:FileSize m:type="Edm.Int64">59707</d:FileSize>
<d:ContentType m:type="Edm.String">image/jpeg</d:ContentType>
<d:Thumbnail m:type="Bing.Thumbnail">
<d:MediaUrl m:type="Edm.String">http://ts1.mm.bing.net/th?id=H.4796985557255960&pid=15.1</d:MediaUrl>
<d:ContentType m:type="Edm.String">image/jpg</d:ContentType>
<d:Width m:type="Edm.Int32">300</d:Width>
<d:Height m:type="Edm.Int32">168</d:Height>
<d:FileSize m:type="Edm.Int64">4718</d:FileSize>
</d:Thumbnail>
</m:properties>
</content>
</entry>
</feed>
and here is the code:
var response = UrlFetchApp.fetch('https://api.datamarket.azure.com/Bing/Search/Image?Query=%27neymar%27&$top=2',options)
var resp = response.getContentText();
var ggg = Xml.parse(resp,false).getElement().getElement('entry').getElement('content').getElement('m:properties');
Logger.log(ggg);
How do I get element <d:MediaUrl m:type="Edm.String">?
Update: this still does not work:
var response = UrlFetchApp.fetch('https://api.datamarket.azure.com/Bing/Search/Image?Query=%27neymar%27&$top=2',options)
var text = response.getContentText();
var eleCont = Xml.parse(text,true).getElement().getElement('entry').getElement('content');
var eleProp = eleCont.getElement('hxxp://schemas.microsoft.com/ado/2007/08/dataservices/metadata','properties')
var medUrl= eleProp.getElement('hxxp://schemas.microsoft.com/ado/2007/08/dataservices','MediaUrl').getText()
Logger.log(medUrl)
Although the provider uses multiple namespaces (signified by the m: and d: prefixes on element names), you can ignore them when retrieving the data you're interested in.
Once you've called getElement() to get the root of the XML document, you can navigate through the rest using element names as properties. (Stop after var feed = ... in the debugger and explore feed; you'll find you have the entire XML document there.)
Try this:
var text = Xml.parse(resp,true);
var feed = text.getElement();
var urls = [];
for (var i in feed.entry) {
  // Collect the MediaUrl of each entry in turn.
  urls.push(feed.entry[i].content.properties.MediaUrl.Text);
}
Logger.log(urls);
This also works. Note that you have multiple entries in your response, and this example is going after the second of them:
var ggg = Xml.parse(resp,true)
.getElement()
.getElements('entry')[1]
.getElement('content')
.getElement('properties')
.getElement('MediaUrl')
.getText();
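For comparison, a namespace-qualified lookup of the same elements can be sketched in Python with ElementTree. The sample is a trimmed-down feed with placeholder URLs, and the xmlns declarations are an assumption, since the XML quoted in the question omits them; the two namespace URIs are the ones used in the question's update.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample; the xmlns declarations and example.com URLs are assumed.
feed_xml = """<feed xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
  <entry><content type="application/xml"><m:properties>
    <d:MediaUrl m:type="Edm.String">http://example.com/first.jpg</d:MediaUrl>
    <d:Thumbnail m:type="Bing.Thumbnail">
      <d:MediaUrl m:type="Edm.String">http://example.com/first-thumb.jpg</d:MediaUrl>
    </d:Thumbnail>
  </m:properties></content></entry>
  <entry><content type="application/xml"><m:properties>
    <d:MediaUrl m:type="Edm.String">http://example.com/second.jpg</d:MediaUrl>
  </m:properties></content></entry>
</feed>"""

# Prefix-to-URI map used to resolve the m: and d: prefixes in the paths.
NS = {
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

root = ET.fromstring(feed_xml)
# The path names direct children only, so the thumbnail's nested MediaUrl is skipped.
urls = [
    entry.find("content/m:properties/d:MediaUrl", NS).text
    for entry in root.findall("entry")
]
print(urls)
```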
References
Namespaces in XML 1.0
XmlElement methods referencing namespace, such as getElement(namespaceName, localName)
Other relevant StackOverflow questions. xml element name with colon, lots about XML namespaces

Someone is eating my <?xml version="1.0"?> returning an Excel XML in Asp.net MVC

I'm playing with VB XML literals to return an Excel XML file.
The problem is that the first line, containing <?xml version="1.0"?>, does not make it into the download.
Here is the code:
Public Class ReservasController
Function Test()
Response.Clear()
Response.AddHeader("Content-Disposition", "attachment; filename=test.xml")
Response.ContentType = "application/vnd.ms-excel"
Response.ContentEncoding = System.Text.Encoding.GetEncoding("utf-8")
Response.Write(GetXML())
''//This works:
''//Response.Write("<?xml version=""1.0""?>" + GetXML().ToString())
Response.End()
Return Nothing
End Function
The GetXML method is very simple:
Private Function GetXML()
Return <?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
<Author>Bizcacha Excel Generator</Author>
<LastAuthor>Bizcacha Excel Generator</LastAuthor>
<Created>20100101</Created>
<Company>Bizcacha</Company>
<Version>1</Version>
</DocumentProperties>
<ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
</ExcelWorkbook>
<Styles>
<Style ss:ID="Default" ss:Name="Normal">
<Alignment ss:Vertical="Bottom"/>
<Borders/>
<Font/>
<Interior/>
<NumberFormat/>
<Protection/>
</Style>
</Styles>
<Worksheet ss:Name="title">
<Table x:FullColumns="1" x:FullRows="1" ss:DefaultRowHeight="15">
<Column ss:Width="100"/>
<Column ss:Width="100"/>
<Row ss:AutoFitHeight="0">
<Cell ss:StyleID="Default"><Data ss:Type="String">Hello</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">World</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
End Function
End Class
Looking at it, could it have anything to do with returning a string that contains quotes? Why not replace those quotes with Chr(34), which is the ASCII code for the double quote character? For example:
Const DblQuote As String = Chr(34)
Private Function GetXML()
Return "<?xml version=" & DblQuote & "1.0" & DblQuote & "?>" & vbCrLf _
& "<?mso-application progid=" & DblQuote & "Excel.Sheet" & DblQuote & "?>" & vbCrLf _
and so on...
What do you think?
Hope this helps,
Best regards,
Tom.
Doing XML literals was a pain, so I moved back to string concatenation.
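A similar symptom can be reproduced in other XML libraries, where converting a document tree to a string leaves out the declaration unless it is requested. The Python sketch below shows this with ElementTree; it illustrates the general behaviour and is not the VB or ASP.NET API. The Workbook and Worksheet elements are a minimal stand-in for the document in the question.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the workbook document.
workbook = ET.Element("Workbook")
ET.SubElement(workbook, "Worksheet", Name="title")

# By default the UTF-8 serialization carries no XML declaration.
without_decl = ET.tostring(workbook, encoding="utf-8").decode("utf-8")

# Requesting it explicitly (Python 3.8+) prepends the declaration line.
with_decl = ET.tostring(
    workbook, encoding="utf-8", xml_declaration=True
).decode("utf-8")

print(without_decl)
print(with_decl)
```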
