Selecting an XML subset with Nokogiri - xml-parsing

For my first foray into Nokogiri and XML parsing, I need to extract a list of Code items (with their children) provided by a web service. The document looks like this:
<message:Structure xmlns:common="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/common" xmlns:structure="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message" xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message SDMXMessage.xsd">
<message:Header>
<message:ID>70c32d97-bbd5-44c7-8bff-50a63abe07eb</message:ID>
<message:Test>false</message:Test>
<message:Prepared>2021-08-26T11:37:37</message:Prepared>
<message:Sender id="CH1">
<message:Name>CH1</message:Name>
</message:Sender>
</message:Header>
<message:CodeLists>
<structure:CodeList id="CL_LEISTUNGSART" version="1.0" agencyID="CH1" isFinal="true">
<structure:Name xml:lang="de">Leistungsart</structure:Name>
<structure:Name xml:lang="fr">Type prestation</structure:Name>
<structure:Name xml:lang="it">Tipo di prestazione</structure:Name>
<structure:Code value="01">
</structure:Code>
<structure:Code value="02">
<structure:Description xml:lang="de">Reguläre Unterstützung mit Zielvereinbarung</structure:Description>
<structure:Description xml:lang="fr">Aide financière régulière avec contrat d'insertion</structure:Description>
<structure:Description xml:lang="it">Assistenza regolare con contratto d’inserimento</structure:Description>
<structure:Annotations>
</structure:Annotations>
</structure:Code>
<structure:Code value="03">
</structure:Code>
<structure:Code value="04">
</structure:Code>
<structure:Code value="05">
</structure:Code>
<structure:Code value="10">
</structure:Code>
<structure:Code value="21">
</structure:Code>
<structure:Code value="22">
</structure:Code>
<structure:Code value="23">
</structure:Code>
<structure:Code value="25">
</structure:Code>
<structure:Code value="26">
</structure:Code>
<structure:Code value="32">
</structure:Code>
<structure:Code value="33">
</structure:Code>
<structure:Code value="34">
</structure:Code>
<structure:Code value="35">
</structure:Code>
<structure:Code value="36">
</structure:Code>
<structure:Code value="37">
</structure:Code>
<structure:Code value="40">
</structure:Code>
<structure:Code value="50">
</structure:Code>
<structure:Annotations>
</structure:Annotations>
</structure:CodeList>
<structure:CodeList id="CL_LEISTUNGSART" version="2.0" agencyID="CH1">
</structure:CodeList>
</message:CodeLists>
</message:Structure>
The request is: select the one CodeList from the structure where version is the highest value and isFinal is true, and then read the Code elements (with their children).
I can select the CodeList elements from the structure namespace:
document.css("structure|CodeList")
but then I get lost when trying to evaluate attributes version and isFinal.
Your help will be greatly appreciated!

For reasons you'll see below, this is not really an answer, but it's too long for a comment and may help to some extent.
In terms of pure XPath, to get (for example) the French translation of the structure:Name child of the structure:CodeList node that meets your two requirements, the following expression
//structure:CodeList[@version=max(//*[@isFinal="true"]/number(@version))]/structure:Name[@xml:lang="fr"]/text()
would output
Type prestation
and similarly for other languages or, for example, structure:Description. Since the XML uses namespaces, with Nokogiri you'll have to use something like
doc.xpath('//structure:CodeList[@version=max(//*[@isFinal="true"]/number(@version))]/structure:Name[@xml:lang="fr"]/text()',
'structure' => "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure")
The problem is the use of the max() function in the expression. I can't test it myself right now, but max() is an XPath 2.0 function, and my understanding is that only XPath 1.0 is supported.
One way to possibly address the issue of support for later XPath versions is to take a look here.
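If max() is indeed unavailable, the selection can be done in the host language instead: fetch every CodeList, keep the ones with isFinal="true", and take the one with the highest version. Below is a minimal sketch of that idea using Python's standard library rather than Nokogiri. The SAMPLE document is a shortened, hypothetical stand-in for the real response, and the helper name latest_final_code_list is made up for illustration. It also assumes versions are plain decimals, as the number() call in the XPath above does.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the question's document.
NS = {"structure": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure"}

# Shortened, hypothetical stand-in for the web service response.
SAMPLE = """<root xmlns:structure="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure">
  <structure:CodeList id="CL_LEISTUNGSART" version="1.0" isFinal="true">
    <structure:Name xml:lang="fr">Type prestation</structure:Name>
    <structure:Code value="01"/>
    <structure:Code value="02"/>
  </structure:CodeList>
  <structure:CodeList id="CL_LEISTUNGSART" version="2.0"/>
</root>"""


def latest_final_code_list(root):
    """Return the isFinal="true" CodeList with the highest version, or None."""
    finals = [
        cl for cl in root.iterfind(".//structure:CodeList", NS)
        if cl.get("isFinal") == "true"
    ]
    # Assumption: versions are plain decimals such as "1.0" or "2.0".
    return max(finals, key=lambda cl: float(cl.get("version")), default=None)


root = ET.fromstring(SAMPLE)
best = latest_final_code_list(root)
# Each Code element keeps its children, so they can be walked from here.
codes = [c.get("value") for c in best.findall("structure:Code", NS)]
print(best.get("version"), codes)
```

The same filter-then-max approach should carry over to Ruby by filtering and sorting the node set that an XPath 1.0 query returns.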


Read binary data as Lua number with FFI

I have a file I opened as binary like this: local dem = io.open("testdem.dem", "rb")
I can read strings out of it just fine: print(dem:read(8)) -> HL2DEMO. After that, however, come a 4-byte little-endian integer and a 4-byte float (the docs for the file format don't specify the float's endianness, but since they don't call it little-endian as they do for the integer, I'll have to assume big).
This cannot be read out with read.
I am new to the LuaJIT FFI and am not sure how to read this out. Frankly, I find the documentation on this specific aspect of the FFI to be underwhelming, although I'm just a lua programmer and don't have much experience with C. One thing I have tried is creating a cdata, but I don't think I understand that:
local dem = io.open("testdem.dem", "rb")
print(dem:read(8))
local cd = ffi.new("int", 4)
ffi.copy(cd, dem:read(4), 4)
print(tostring(cd))
--[[Output
HL2DEMO
luajit: bad argument #1 to 'copy' (cannot convert 'int' to 'void *')
]]--
Summary:
Goal: Read integers and floats from binary data.
Expected output: A lua integer or float I can then convert to string.
string.unpack does this for Lua 5.3, but there are some alternatives for LuaJIT as well. For example, see this answer (and other answers to the same question).
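To make the byte layout concrete, here is a rough sketch of the same kind of unpacking, written in Python with the standard struct module rather than Lua. The format characters "<i" (little-endian int) and ">f" (big-endian float) play the role that format strings play for string.unpack. The bytes in data are made up for illustration; they are not taken from a real demo file, and the big-endian float is only the asker's guess.

```python
import struct

# Hypothetical bytes standing in for the start of the file: an 8-byte
# NUL-padded string, a 4-byte little-endian int, then a 4-byte float.
data = b"HL2DEMO\x00" + struct.pack("<i", 4) + struct.pack(">f", 1.5)

# The first 8 bytes are text; drop the trailing NUL padding.
magic = data[:8].rstrip(b"\x00").decode("ascii")

# "<" selects little-endian, ">" selects big-endian (an assumption here).
(number,) = struct.unpack_from("<i", data, 8)
(value,) = struct.unpack_from(">f", data, 12)

print(magic, number, value)
```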

Can rmarkdown return a value to a target

I find myself using rmarkdown/rnotebooks quite a bit to do exploratory analysis since I can combine code, prose and graphs. Many times, I'll write my entire predictive modeling approach and the model itself within markdown.
However, then I end up with forecast models embedded within rmarkdown, unlinked to a target within my drake_plan. Today, I save these to disk first, then read them back in to my plan using file_in or other similar approach.
My question is - can I have a markdown document return an object directly to a drake target?
Conceptually:
plan = drake_plan(
  dat = read_data(),
  model = analyze_data(dat)
)
analyze_data = function(dat){
  result = render(....)
  return(result)
}
This way - I can get my model directly into my drake target, but if I need to investigate the model, I can open up my markdown/HTML.
I recommend you include those models as targets in the plan, but what you describe is possible. R Markdown and knitr automatically run code chunks in the calling environment, so the variable assignments you make in the report are available.
library(drake)
library(tibble)
simulate <- function(n){
  tibble(x = rnorm(n), y = rnorm(n))
}
render_and_return <- function(input, output) {
  rmarkdown::render(input, output_file = output, quiet = TRUE)
  return_value # Assigned in the report.
}
lines <- c(
"---",
"output: html_document",
"---",
"",
"```{r show_data}",
"return_value <- head(readd(large))", # return_value gets assigned here.
"```"
)
writeLines(lines, "report.Rmd")
plan <- drake_plan(
  large = simulate(1000),
  subset = render_and_return(knitr_in("report.Rmd"), file_out("report.html")),
)
make(plan)
#> target large
#> target subset
readd(subset)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <dbl>
#> 1 1.30 -0.912
#> 2 -0.327 0.0622
#> 3 1.29 1.18
#> 4 -1.52 1.06
#> 5 -1.18 0.0295
#> 6 -0.985 -0.0475
Created on 2019-10-10 by the reprex package (v0.3.0)

How to get JJ and NN (adjective and Noun) from the triples generated StanfordDependencyParser with NLTK?

I got triples using the following code, but I want to get the nouns and adjectives from the triples. I have tried a lot but failed. I am new to NLTK and Python; any help?
from nltk.parse.stanford import StanfordDependencyParser
dp_prsr = StanfordDependencyParser(r'C:\Python34\stanford-parser-full-2015-04-20\stanford-parser.jar', r'C:\Python34\stanford-parser-full-2015-04-20\stanford-parser-3.5.2-models.jar', model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')
word=[]
s='bit is good university'
sentence = dp_prsr.raw_parse(s)
for line in sentence:
    print(list(line.triples()))
[(('university', 'NN'), 'nsubj', ('bit', 'NN')), (('university', 'NN'), 'cop', ('is', 'VBZ')), (('university', 'NN'), 'amod', ('good', 'JJ'))]
I want to get "university" and "good", and "BIT" and "university". I tried the following, but it didn't work:
for line in sentence:
    if (list(line.triples)).__contains__() == 'JJ':
        word.append(list(line.triples()))
print(word)
But I get an empty array. Any help, please?
Linguistically
What you're looking for when you look for triplets that contain a JJ and an NN is usually a noun phrase (NP) in a context-free grammar.
In dependency grammar, what you're looking for is a triplet whose arguments carry the JJ and NN POS tags. More specifically, you're looking for a constituent/branch that contains an adjectivally modified noun. In the StanfordDependencyParser output, that means looking for the predicate amod. (If the explanation above is confusing, it is advisable to read up on dependency grammar before proceeding; see https://en.wikipedia.org/wiki/Dependency_grammar.)
Note that the parser outputs triplets (arg1, pred, arg2), where argument 2 (arg2) depends on argument 1 (arg1) through the predicate (pred) relation; i.e. arg1 governs arg2 (see https://en.wikipedia.org/wiki/Government_(linguistics)).
Pythonically
Now to the code part of the answer. You want to iterate through a list of tuples (i.e. triplets), so the easiest solution is to unpack each tuple into variables as you iterate, then check for the conditions you need (see Find an element in a list of tuples):
>>> x = [(('university', 'NN'), 'nsubj', ('bit', 'NN')), (('university', 'NN'), 'cop', ('is', 'VBZ')), (('university', 'NN'), 'amod', ('good', 'JJ'))]
>>> for arg1, pred, arg2 in x:
...     word1, pos1 = arg1
...     word2, pos2 = arg2
...     if pos1.startswith('NN') and pos2.startswith('JJ') and pred == 'amod':
...         print((arg1, pred, arg2))
...
(('university', 'NN'), 'amod', ('good', 'JJ'))
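If the goal is the plain list of nouns and adjectives the question asks for, rather than the matching amod triple, the same unpacking idea can be extended to collect every word whose tag starts with NN or JJ. This is only a sketch: the helper name words_with_tags is made up for illustration, and the triples are copied from the question's output.

```python
# Triples copied from the question's parser output.
triples = [
    (("university", "NN"), "nsubj", ("bit", "NN")),
    (("university", "NN"), "cop", ("is", "VBZ")),
    (("university", "NN"), "amod", ("good", "JJ")),
]


def words_with_tags(triples, prefixes=("NN", "JJ")):
    """Collect unique words whose POS tag starts with one of the prefixes."""
    found = []
    for governor, _relation, dependent in triples:
        # Both ends of the dependency are (word, tag) pairs.
        for word, tag in (governor, dependent):
            if tag.startswith(prefixes) and word not in found:
                found.append(word)
    return found


print(words_with_tags(triples))
```

Matching on the prefix rather than the exact tag also picks up variants such as NNS or JJR.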

Are SRFI/41 and Racket/stream different?

in-range in Racket returns a stream. There are plenty of functions defined on streams in the racket/stream library. However, I can't use the stream-take function from srfi/41 on them. I wanted to execute
(stream-take 5 (in-range 10))
It complained: stream-take: non-stream argument.
(stream->list (stream-cons 10 (in-range 10)))
The above throws the following error:
stream-promise: contract violation;
given value instantiates a different structure type with the same name
expected: stream?
given: #<stream>
However:
(stream->list (stream-cons 10 stream-null)) ;; works
(stream->list (stream-cons 10 empty-stream)) ;; works
both work fine.
Does the above mean that streams from racket/stream and srfi/41 are incompatible? How can I take 10 items from a racket/stream stream without reinventing the wheel?
Racket 5.3.3
Yes, the kind of stream that (in-range 10) produces is different from srfi/41 streams. In general, you can't expect srfi/41 functions to work on all streams in Racket because a Racket "stream" is actually a generic datatype that dispatches to different method implementations (see gen:stream). In contrast, srfi/41 expects only its own datatype.
(stream-take should probably be added to racket/stream though)
If you want to take 10 items from racket/stream, use (for/list ([x some-stream] [e 10]) x).

In the UK, how do I find an address given the GPS coordinates?

Sorry, this is not a very well-defined question; I am thinking about an idea for a product, so I need to know what is possible...
Say I am standing at the front door of a house. Given the GPS coordinates from a smartphone, how can I find the address I am standing at?
Is GPS good enough for this?
How much does the data/service I need to use cost?
What other questions should I be asking about this?
GPS is limited to returning the latitude and longitude coordinates of your position.
To resolve these coordinates to an address, you would need to use an external data source. The act of converting a geographical coordinate to an address is often referred to as reverse geocoding.
There are some free reverse geocoding services, such as the one offered within the Google Maps API. However, make sure you read and understand the Terms of Use before using such a service.
As an example, you can do reverse geocoding with the Google Maps API using the following HTTP request:
Simple CSV:
http://maps.google.com/maps/geo?q=40.756041,-73.986939&output=csv&sensor=false
Returns:
200,8,"601-699 7th Ave, New York, NY 10036, USA"
More Complex XML:
http://maps.google.com/maps/geo?q=40.756041,-73.986939&output=xml&sensor=false
Returns:
<kml xmlns="http://earth.google.com/kml/2.0"><Response>
<name>40.756041,-73.986939</name>
<Status>
<code>200</code>
<request>geocode</request>
</Status>
<Placemark id="p1">
<address>601-699 7th Ave, New York, NY 10036, USA</address>
<AddressDetails Accuracy="8" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><Country><CountryNameCode>US</CountryNameCode><CountryName>USA</CountryName><AdministrativeArea><AdministrativeAreaName>NY</AdministrativeAreaName><SubAdministrativeArea><SubAdministrativeAreaName>New York</SubAdministrativeAreaName><Locality><LocalityName>New York</LocalityName><DependentLocality><DependentLocalityName>Manhattan</DependentLocalityName><Thoroughfare><ThoroughfareName>601-699 7th Ave</ThoroughfareName></Thoroughfare><PostalCode><PostalCodeNumber>10036</PostalCodeNumber></PostalCode></DependentLocality></Locality></SubAdministrativeArea></AdministrativeArea></Country></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7595131" south="40.7532178" east="-73.9835667" west="-73.9898620" />
</ExtendedData>
<Point><coordinates>-73.9869192,40.7560331,0</coordinates></Point>
</Placemark>
<Placemark id="p2">
<address>Times Sq - 42nd St Station, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>Times Sq - 42nd St Station</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7591946" south="40.7528994" east="-73.9838014" west="-73.9900966" />
</ExtendedData>
<Point><coordinates>-73.9869490,40.7560470,0</coordinates></Point>
</Placemark>
<Placemark id="p3">
<address>Times Square - 42nd Street</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>Times Square - 42nd Street</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7591476" south="40.7528524" east="-73.9838524" west="-73.9901476" />
</ExtendedData>
<Point><coordinates>-73.9870000,40.7560000,0</coordinates></Point>
</Placemark>
<Placemark id="p4">
<address>W 42 St - 7 Av, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>W 42 St - 7 Av</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7591446" south="40.7528494" east="-73.9839964" west="-73.9902916" />
</ExtendedData>
<Point><coordinates>-73.9871440,40.7559970,0</coordinates></Point>
</Placemark>
<Placemark id="p5">
<address>New Amsterdam Theatre, New York, NY 10036, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><Country><CountryNameCode>US</CountryNameCode><CountryName>USA</CountryName><AdministrativeArea><AdministrativeAreaName>NY</AdministrativeAreaName><Locality><LocalityName>New York</LocalityName><PostalCode><PostalCodeNumber>10036</PostalCodeNumber></PostalCode><AddressLine>New Amsterdam Theatre</AddressLine></Locality></AdministrativeArea></Country></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7593416" south="40.7530464" east="-73.9842484" west="-73.9905436" />
</ExtendedData>
<Point><coordinates>-73.9873960,40.7561940,0</coordinates></Point>
</Placemark>
<Placemark id="p6">
<address>W 42 St - 7 Av, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>W 42 St - 7 Av</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7594606" south="40.7531654" east="-73.9842484" west="-73.9905436" />
</ExtendedData>
<Point><coordinates>-73.9873960,40.7563130,0</coordinates></Point>
</Placemark>
<Placemark id="p7">
<address>Times Sq - 42nd St Station, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>Times Sq - 42nd St Station</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7589406" south="40.7526454" east="-73.9832194" west="-73.9895146" />
</ExtendedData>
<Point><coordinates>-73.9863670,40.7557930,0</coordinates></Point>
</Placemark>
<Placemark id="p8">
<address>W 42 St - Broadway, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>W 42 St - Broadway</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7588236" south="40.7525284" east="-73.9831654" west="-73.9894606" />
</ExtendedData>
<Point><coordinates>-73.9863130,40.7556760,0</coordinates></Point>
</Placemark>
<Placemark id="p9">
<address>7 Av - W 41 St, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"><AddressLine>7 Av - W 41 St</AddressLine></AddressDetails>
<ExtendedData>
<LatLonBox north="40.7586296" south="40.7523344" east="-73.9843024" west="-73.9905976" />
</ExtendedData>
<Point><coordinates>-73.9874500,40.7554820,0</coordinates></Point>
</Placemark>
</Response></kml>
Simply replace the value of the q parameter with your own latitude,longitude pair.
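Since one coordinate pair comes back with several candidate Placemark entries, a client still has to pick one. As a rough sketch of how the XML response above could be read, the following uses Python's standard library to list each candidate address with its Accuracy attribute. The RESPONSE string is a shortened copy of the sample above, and the helper name candidate_addresses is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces copied from the sample response above.
NS = {
    "kml": "http://earth.google.com/kml/2.0",
    "xal": "urn:oasis:names:tc:ciq:xsdschema:xAL:2.0",
}

# Shortened copy of the sample response (two placemarks only).
RESPONSE = """<kml xmlns="http://earth.google.com/kml/2.0"><Response>
<Status><code>200</code></Status>
<Placemark id="p1">
<address>601-699 7th Ave, New York, NY 10036, USA</address>
<AddressDetails Accuracy="8" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"/>
</Placemark>
<Placemark id="p2">
<address>Times Sq - 42nd St Station, New York, NY 10116, USA</address>
<AddressDetails Accuracy="9" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"/>
</Placemark>
</Response></kml>"""


def candidate_addresses(xml_text):
    """Return (accuracy, address) pairs, one per Placemark."""
    root = ET.fromstring(xml_text)
    results = []
    for mark in root.iterfind(".//kml:Placemark", NS):
        address = mark.findtext("kml:address", namespaces=NS)
        details = mark.find("xal:AddressDetails", NS)
        # Compare against None: an element without children is falsy.
        accuracy = int(details.get("Accuracy")) if details is not None else None
        results.append((accuracy, address))
    return results


for accuracy, address in candidate_addresses(RESPONSE):
    print(accuracy, address)
```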
Note that the free version of the Google Maps API has a limit of 15,000 requests per IP address per day. (Google Maps API FAQ)
If you are planning to heavily use Google's reverse geocoding services, you may want to consider using the Premier edition of the Google Maps API.
The Premier API automatically comes with "advanced geocoding capabilities with greater volume and speed", so the limitations of the standard API should be superseded by new quotas.
As an additional side-note, according to one unofficial source (dated April 2008), the cost for the Premier API starts at USD 10,000 per year.
You can use various free services, such as those provided by Google to reverse-geocode (the technical term) an address from some GPS coordinates. I strongly suggest having a play with their API, full documentation is available here:
http://code.google.com/apis/maps/documentation/services.html
Google's Geolocation Network Protocol might help; I think it worked for UK locations when I last tried it.
Btw, the Ordnance Survey just released their data under really loose licensing and I think they have a web service as well. You might want to give that a look.
OpenStreetMap (the free open-source map of the world created by volunteers) uses Nominatim, which seems to be very good. See
http://wiki.openstreetmap.org/wiki/Nominatim#Reverse_Geocoding_.2F_Address_lookup
