incorrect number of table entries using LuaXml - xml-parsing

I'm using LuaXml to convert an XML string received from the network into a Lua table, but I ran into two problems. Could anyone help point out what's wrong? Thanks!
1) xml.eval returns a table with 4 entries instead of 3. I expected 3 "preset" entries, but I get 4, with the extra one showing "presets".
2) I was hoping to use tbl.find("preset") to get the 3 "preset" entries before the for loop and then read the attributes of each entry, but tbl.find("preset") returns nil.
Here is the code:
xml = require("LuaXml")
buff = "\
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?> \
<presets> \
<preset name=\"preset1\" url=\"Load?name=testlist1\" id=\"1\"/> \
<preset name=\"preset2\" url=\"Load?name=testlist2\" id=\"2\"/> \
<preset name=\"preset3\" url=\"Load?name=testlist3\" id=\"3\"/> \
</presets>"
local tbl = xml.eval(buff)
for i in pairs(tbl) do
print("name: " .. tbl[i].name .. ", id: " .. tbl[i].id .. ", url: " .. tbl[i].url)
end

A bit of experimental poking suggests that LuaXml stores the tag name of the top-level XML document element at index 0 of the table, and then adds one entry for each direct child tag of that element at sequential numeric indices starting at 1.
So your output table is:
> for i=0,#tbl do print(i, type(tbl[i]), tbl[i]) end
0 string presets
1 table <preset url="Load?name=testlist1" name="preset1" id="1" />
2 table <preset url="Load?name=testlist2" name="preset2" id="2" />
3 table <preset url="Load?name=testlist3" name="preset3" id="3" />
This strikes me as a very odd way of handling things but that seems to be what it does.
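Given that layout, looping over the numeric indices starting at 1 skips the tag name stored at index 0. Here is a minimal sketch of that idea. It uses a hand-built table as a stand-in for the xml.eval result, so the stand-in data and the helper name describe_presets are assumptions for illustration. It also assumes attributes are readable as plain fields, as in the question's code.

```lua
-- Hand-built stand-in for the table described above: the tag name at index 0,
-- and one child table per <preset> element at indices 1..3.
local tbl = {
  [0] = "presets",
  { name = "preset1", url = "Load?name=testlist1", id = "1" },
  { name = "preset2", url = "Load?name=testlist2", id = "2" },
  { name = "preset3", url = "Load?name=testlist3", id = "3" },
}

-- Hypothetical helper: build one line per child entry, skipping index 0.
local function describe_presets(t)
  local lines = {}
  for i = 1, #t do
    local p = t[i]
    lines[#lines + 1] = "name: " .. p.name .. ", id: " .. p.id .. ", url: " .. p.url
  end
  return lines
end

for _, line in ipairs(describe_presets(tbl)) do
  print(line)
end
```

Using a numeric for loop (or ipairs) instead of pairs means the string at index 0 is never indexed like a table.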


How to change the HTML rendering of a Pandoc element?

I'm trying to customize the default HTML output of footnotes from an .odt file.
For example a file with a footnote like this:
Some text with a footnote¹
Will render the HTML output below:
<ol class="footnotes">
<li id="fn1" role="doc-endnote">
<p>Content of footnote number 1. ↩︎</p>
</li>
</ol>
Instead, I want a flat paragraph to be output, with a hardcoded number, like the following:
<p>1. Content of footnote number 1. ↩︎</p>
I've used parts of sample.lua from the Pandoc repo, but it is not working; the process is blocked by this error:
$ pandoc --lua-filter=my-filter.lua file.odt -o file.html
Error running filter my-filter.lua:
my-filter.lua:7: bad argument #1 to 'gsub' (string expected, got table)
stack traceback:
[C]: in function 'string.gsub'
my-filter.lua:7: in function 'Note'
Below is my attempted script. I guess I'm naively overlooking something obvious, or I've badly misunderstood how filters work.
-- Table to store footnotes, so they can be included at the end.
local notes = {}
function Note(s)
local num = #notes + 1
-- insert the back reference right before the final closing tag.
s = string.gsub(s,
'(.*)</', '%1 ↩</')
-- add a list item with the note to the note table.
table.insert(notes, '<p id="fn' .. num .. '">' .. num .. '. ' .. s .. '</p>')
-- return the footnote reference, linked to the note.
return '<a id="fnref' .. num .. '" href="#fn' .. num ..
'"><sup>' .. num .. '</sup></a>'
end
function Pandoc (doc)
local buffer = {}
local function add(s)
table.insert(buffer, s)
end
add(doc)
if #notes > 0 then
for _,note in pairs(notes) do
add(note)
end
end
return table.concat(buffer,'\n') .. '\n'
end
Update
Tweaking part of what @tarleb answered, I've now managed to modify the inline note reference link, but apparently the second function is not rendering the list of footnotes at the end of the document. What's missing?
local notes = pandoc.List{}
function Note(note)
local num = #notes + 1
-- add a list item with the note to the note table.
notes:insert(pandoc.utils.blocks_to_inlines(note.content))
-- return the footnote reference, linked to the note.
return pandoc.RawInline('html', '<a id="fnref' .. num .. '" href="#fn' .. num ..
'"><sup>' .. num .. '</sup></a>')
end
function Pandoc (doc)
doc.meta['include-after'] = notes:map(
function (content, i)
-- return a paragraph for each note.
return pandoc.Para({tostring(i) .. '. '} .. content)
end
)
return doc
end
The sample.lua is an example of a custom Lua writer, not a Lua filter. They can look similar, but are quite different. E.g., filter functions modify abstract document elements, while functions in custom writers generally expect strings, at least in the first argument.
A good way to go about this in a filter could be to place the custom rendering in the include-after metadata:
local notes = pandoc.List{}
function Pandoc (doc)
-- walk returns a modified copy, so assign the result back
doc.blocks = doc.blocks:walk {
Note = function (note)
notes:insert(pandoc.utils.blocks_to_inlines(note.content))
-- Raw HTML goes into a RawInline element
return pandoc.RawInline('html', 'footnote link HTML goes here')
end
}
doc.meta['include-after'] = notes:map(
function (content, i)
-- return a paragraph for each note.
return pandoc.Para({tostring(i) .. ' '} .. content)
end
)
return doc
end
After some trial and error I've managed to get a result that works as intended, though it is not perfect "stylistically".
Please read my commentary below mostly as an exercise; I'm trying to better understand how to use this great tool the way I wanted, not the way any reasonable person would in a productive way (or any way at all). ;)
What I'd like to improve:
I have to wrap the p elements in a div because, as of Pandoc 2.18, it is not possible to give attributes directly to a Paragraph. This is minor code bloat but acceptable.
I'd like to use a section element instead of a div to put all the notes at the end of the document (used in the Pandoc function), but I haven't found a way to create a RawBlock element and then add the note blocks to it.
I'm totally not proficient in Lua and have barely grasped a few concepts of how Pandoc works, so I'm pretty confident that what I've done below is not optimal. Suggestions are welcome!
-- working as of Pandoc 2.18
local notes = pandoc.List{}
function Note(note)
local num = #notes + 1
-- create a paragraph for the note content
local footNote = pandoc.Para(
-- Prefix content with number, ex. '1. '
{tostring(num) .. '. '} ..
-- paragraph accept Inline objects as content, Note content are Block objects
-- and must be converted to inlines
pandoc.utils.blocks_to_inlines(note.content) ..
-- append backlink
{ pandoc.RawInline('html', '<a class="footnote-back" href="#fnref' .. num .. '" role="doc-backlink"> ↩︎</a>')}
)
-- it's not possible to render paragraphs with attribute elements as of Pandoc 2.18
-- so wrap the footnote in a <div> with attributes and append the element to the list
notes:insert(pandoc.Div(footNote, {id = 'fn' .. num, role = 'doc-endnote'}))
-- return the inline body footnote reference, linked to the note.
return pandoc.RawInline('html', '<a id="fnref' .. num .. '" href="#fn' .. num ..
'"><sup>' .. num .. '</sup></a>')
end
function Pandoc (doc)
if #notes > 0 then
-- append collected notes to block list, the end of the document
doc.blocks:insert(
pandoc.Div(
notes:map(
function (note)
return note
end
),
-- attributes
{class = 'footnotes', role = 'doc-endnotes'}
)
)
end
return doc
end

how to find the index of a repeated character in lua string

Suppose you have a path like this:
/home/user/dev/project
I want to get the index of any / I choose,
for example the one before dev or the one before user.
I don't understand Lua string patterns; if there is good documentation for them, please link it.
There are several ways to do this. Perhaps the simplest is using the () pattern element which yields a match position combined with string.gmatch:
for index in ("/home/user/dev/project"):gmatch"()/" do
print(index)
end
which prints
1
6
11
15
as expected. Another way to go (which requires some more code) would be repeatedly invoking string.find, always passing a start index.
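The string.find approach mentioned above could look roughly like the following sketch. The helper name find_all is an assumption made for illustration. It passes true as the fourth argument so the search string is treated as plain text rather than as a pattern.

```lua
-- Hypothetical helper: collect every index at which `needle` occurs in `s`
-- by calling string.find repeatedly with an advancing start index.
local function find_all(s, needle)
  local positions = {}
  local start = 1
  while true do
    -- fourth argument `true` turns off pattern matching (plain search)
    local i = s:find(needle, start, true)
    if not i then break end
    positions[#positions + 1] = i
    start = i + 1
  end
  return positions
end

local slashes = find_all("/home/user/dev/project", "/")
print(table.concat(slashes, " ")) -- 1 6 11 15
print(slashes[3])                 -- 11, the slash before "dev"
```

Collecting the positions in a table makes it easy to pick the n-th slash afterwards.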
Assuming that you probably want to split a string by slashes, that's about as simple using string.gmatch:
for substr in ("/home/user/dev/project"):gmatch"[^/]+" do
print(substr)
end
(the pattern finds all substrings of nonzero, maximal length that don't contain a slash)
Documentation for patterns is here. You might want to have a look at the subsection "Captures".
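As a small illustration of captures, the position capture () can be combined with a normal capture to record the slash index in front of each path component. This is only a sketch, and the table name slash_before is an assumption for the example. If a component name occurs more than once, the later position overwrites the earlier one.

```lua
-- Map each path component to the index of the slash directly before it.
-- "()" captures the current position, "([^/]+)" captures the component name.
local slash_before = {}
for pos, name in ("/home/user/dev/project"):gmatch("()/([^/]+)") do
  slash_before[name] = pos
end

print(slash_before.user) -- 6
print(slash_before.dev)  -- 11
```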
There are many ways to do so.
It's also good to know that Lua attaches all string functions to the string datatype as methods.
That's what @LMD demonstrates by using : directly on a string.
My favorite place for experimenting with complicated or difficult things like patterns and their captures is the Lua standalone console, built with: make linux-readline
So let's play with the pattern '[%/\\][%u%l%s]+'
> _VERSION
Lua 5.4
> -- Lets set up a path
> path='/home/dev/project/folder with spaces mixed with one OR MORE Capitals in should not be ignored'
> -- I am curious /home exists so trying to have a look into
> os.execute('/bin/ls -Ah ' .. ('"%s"'):format(path:match('[%/\\][%u%l%s]+')));
knoppix koyaanisqatsi
> -- OK now lets see if i can capture the last folder with the $
> io.stdout:write(('"%s"\n'):format(path:match('[%/\\][%u%l%s]+$'))):flush();
"/folder with spaces mixed with one OR MORE Capitals in should not be ignored"
> -- Works too so now i want to know whats the depth is
> do local str, count = path:gsub('[%/\\][%u%l%s%_%-]+','"%1"\n') print(str) return count end
"/home"
"/dev"
"/project"
"/folder with spaces mixed with one OR MORE Capitals in should not be ignored"
4
> -- OK seems usefull lets check a windows path with it
> path='C:\\tmp\\Some Folder'
> do local str, count = path:gsub('[%/\\][%u%l%s]+','<%1>') print(str) return count end
C:<\tmp><\Some Folder>
2
> -- And that is what i mean with "many"
> -- But aware that only lower upper and space chars are handled
> -- So _ - and other chars has to be included by the pattern
> -- Like: '[%/\\][%u%l%s%_%-]+'
> path='C:\\tmp\\Some_Folder'
> do local str, count = path:gsub('[%/\\][%u%l%s%_%-]+','<%1>') print(str) return count end
C:<\tmp><\Some_Folder>
2
> path='C:\\tmp\\Some-Folder'
> do local str, count = path:gsub('[%/\\][%u%l%s%_%-]+','<%1>') print(str) return count end
C:<\tmp><\Some-Folder>
2

Parsing Micro Focus XML in COBOL variables

I have the following xml-structure that I want to parse in Cobol.
<LDO>
<OD>1</OD> //OD 1'st occurrence
<OLD>1</OLD> //OLD 1'st occurrence
<OLD>2</OLD> //OLD 2'nd occurrence
<OLD>3</OLD> //OLD 3'rd occurrence
<OD>2</OD> //OD 2'nd occurrence
<OLD>4</OLD> //OLD 4'th occurrence
</LDO>
As you can see, there are several OLD tags after each OD tag. What I want to do is read this XML file step by step and display its values in the following way:
1
1
2
3
2
4
READ xml-stream.
START xml-stream KEY IS OD.
*>check status
START xml-stream KEY IS OLD.
*> check stream status
PERFORM UNTIL EXIT
READ xml-stream next key is
old
IF stream-status = -7
EXIT PERFORM
END-IF
*> check stream status less than 0
display od-value
display old-value
But the od-value doesn't change when I execute the program. It returns the following values:
1
1
2
3
1
4
I want the second occurrence to return the value of the second OD element, not the first one.
I would appreciate some help achieving this.
You could use the "xml parse" syntax:
program-id. xp.
01 xdoc pic x(1024) value
" <LDO>" &
" <OD>1</OD>" &
" <OLD>1</OLD>" &
" <OLD>2</OLD>" &
" <OLD>3</OLD>" &
" <OD>2</OD>" &
" <OLD>4</OLD>" &
"</LDO>".
procedure division.
Xml parse xdoc processing procedure p
ON EXCEPTION
display 'XML document error 'XML-CODE
NOT ON EXCEPTION
display 'XML document successfully parsed'
END-XML
goback.
p.
Evaluate xml-event
When 'START-OF-ELEMENT'
When 'CONTENT-CHARACTERS'
exhibit named xml-text
When 'CONTENT-CHARACTER'
exhibit named xml-text
When 'END-OF-ELEMENT'
exhibit named xml-event
When other
exhibit named xml-event
End-evaluate
.
end program xp.

AppleScript parsing html from site

What I'm trying to do is to get the names of all TV shows on this Wikipedia page.
Ok, so I did this first:
property showsWebList : {}
tell application "Safari"
set loadDelay to 2 -- in seconds; test for your system
make new document at end of every document
set URL of document 1 to "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
delay loadDelay
set nrOfUls to do JavaScript "document.getElementById('mw-content-text').querySelectorAll('ul').length;" in document 1
set nrOfUls to nrOfUls - 1 as number
log nrOfUls
repeat with ws from 1 to nrOfUls
delay loadDelay
set nrOfLis to do JavaScript "document.getElementById('mw-content-text').getElementsByTagName('UL')[" & ws & "].querySelectorAll('li').length;" in document 1
set nrOfLis to nrOfLis - 1 as number
log nrOfLis
repeat with rs from 0 to nrOfLis
delay 0.3
set aShow to do JavaScript "document.getElementById('mw-content-text').getElementsByTagName('UL')[" & ws & "].getElementsByTagName('LI')[" & rs & "].getElementsByTagName('I')[0].getElementsByTagName('A')[0].innerHTML;" in document 1
if aShow is not "" or "missing value" then
copy aShow to end of showsWebList
end if
end repeat
end repeat
end tell
And this works exactly how I want it to. The problem is that it takes 15 minutes to finish, and the Safari document has to stay in front the whole time. So my thought was to grab the whole HTML and parse it. Not that easy. This is how my code looks now:
tell application "Safari"
make new document at end of every document
set URL of document 1 to "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
delay 4
set orgHTML to do JavaScript "document.getElementById('mw-content-text').innerHTML;" in document 1
set orgHTML to orgHTML as text
set readyText to my extractBetween(orgHTML, "<li><i><a ", "</a></i></li>")
log (item 0 of readyText)
set removeArray to my extractBetween(readyText, "href", ">")
set completeArray to {}
repeat with rt from 0 to (count readyText)
repeat with ra from 0 to (count removeArray)
if (item ra of removeArray) is in (item rt of readyText) then
set completeName to trim_line((item rt of readyText), (item ra of removeArray), 1)
set end of completeArray to completeName
end if
end repeat
end repeat
log completeArray
end tell
on extractBetween(SearchText, startText, endText)
set tid to AppleScript's text item delimiters -- save them for later.
set AppleScript's text item delimiters to startText -- find the first one.
set liste to text items of SearchText
set AppleScript's text item delimiters to endText -- find the end one.
set extracts to {}
repeat with subText in liste
if subText contains endText then
copy text item 1 of subText to end of extracts
end if
end repeat
set AppleScript's text item delimiters to tid -- back to original values.
return extracts
end extractBetween
on trim_line(this_text, trim_chars, trim_indicator)
-- 0 = beginning, 1 = end, 2 = both
set x to the length of the trim_chars
-- TRIM BEGINNING
if the trim_indicator is in {0, 2} then
repeat while this_text begins with the trim_chars
try
set this_text to characters (x + 1) thru -1 of this_text as string
on error
-- the text contains nothing but the trim characters
return ""
end try
end repeat
end if
-- TRIM ENDING
if the trim_indicator is in {1, 2} then
repeat while this_text ends with the trim_chars
try
set this_text to characters 1 thru -(x + 1) of this_text as string
on error
-- the text contains nothing but the trim characters
return ""
end try
end repeat
end if
return this_text
end trim_line
Not that smooth and not working. Somehow it seems like I can't get the items out of the list, because it doesn't see it as a list item. Can someone help me out?
Cheers
I would recommend a different approach. Download the source, and then just grab the title between tags. The whole script takes under two seconds. Start with:
property baseURL : "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
set rawHTML to do shell script "curl '" & baseURL & "'"
set preTag to "\" title=\"" -- " title="
set otid to AppleScript's text item delimiters
set AppleScript's text item delimiters to preTag
set rawList to text items of rawHTML
set nameList to {}
repeat with eachLine in rawList
set theOff to offset of ">" in eachLine
set thisName to text 1 thru (theOff - 2) of eachLine
-- add some error checking here to skip the opening non-title hits, and to fine-tune the precise title string
set nameList to nameList & return & thisName
end repeat
set AppleScript's text item delimiters to otid
return nameList
Add a little error checking, and tweak which preTag and postTag fits best.
I suggest you make use of a specialized 3rd-party tool for this task, which can greatly speed things up.
Here's a solution using the multi-platform web-scraping CLI xidel:
A shell command to demonstrate its brevity and speed (takes less than 1 sec. on my system) - extracts all show names from the page:
xidel -e '//*[@id="mw-content-text"]/ul/li/i/a' https://en.wikipedia.org/wiki/List_of_television_programs_by_name
An equivalent AppleScript snippet - be sure to fill in the path to where you place xidel on your system below:
set targetUrl to "https://en.wikipedia.org/wiki/List_of_television_programs_by_name"
set xPathExpr to "//*[@id=\"mw-content-text\"]/ul/li/i/a"
# Fill in the path to `xidel` on your system here:
set xidelPath to "/path/to/xidel"
# Perform scraping and convert result into an AppleScript list.
set showNames to paragraphs of ¬
(do shell script ¬
quoted form of xidelPath & " -e " & quoted form of xPathExpr & " " & ¬
quoted form of targetUrl)
Here's another solution, use javascript to get the names without any AppleScript loop.
The javascript script takes less than one second to get the names.
tell application "Safari"
make new document at end of every document with properties {URL:"http://en.wikipedia.org/wiki/List_of_television_programs_by_name"}
delay 2 -- in seconds; test for your system
set showsWebList to do JavaScript "var a=new Array();var ul=document.getElementById('mw-content-text').querySelectorAll('UL'); for (var i=1;i<ul.length;i++){li=ul[i].querySelectorAll('LI'); for (var j=0; j< li.length; j++){try {var t=li[j].getElementsByTagName('I')[0].getElementsByTagName('A')[0].innerText; a.push(t)} catch(e) {}}} a;" in document 1
end tell
curl/sed/perl solution:
do shell script "curl 'http://en.wikipedia.org/wiki/List_of_television_programs_by_name' | sed -n '/0-9/,/NewPP/p' | sed -n '/^<li/ s/^.*title=.\\([^\"]*\\).*$/\\1/p' | perl -n -mHTML::Entities -e ' ; print HTML::Entities::decode_entities($_);'"
Here is another solution, using a very simple awk script. If a line begins with <li><i>, the script removes the HTML tags (gsub) and prints the line. Then "every paragraph of" converts the return-separated output into a list.
set theURL to "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
every paragraph of (do shell script "curl " & theURL & " | awk '/^\\<li\\>\\<i\\>/{gsub(\"<[^>]*>\", \"\");print}'")

reading semi-formatted data

I'm totally new to AWK, however I think this is the best way to solve my problem and a good time to learn AWK.
I am trying to read a large data file that is created by a simulation program. The output is made to be readable by humans, so its formatting isn't very consistent. An example of the output is in this image
http://i.imgur.com/0kf8l.png
I need a way to find a line like "He 2 4686A -2.088 0.0071", by specifying the "He 2 4686A" part and get the following two numbers. The problem is the line "He 2 4686A -2.088 0.0071" can appear anywhere in the table.
I know how to find the entry "He 2 4686A", but I don't know which of the 4 columns it's in. So I don't know how to address the values that follow it.
A command that lets me just read the next two words, or tells me the location of the pattern once a match is found will both help.
/He 2 4686A/ finds the line
Ca A 3970A -0.900 0.1100 He 2 4686A -2.088 0.0071 S 3 18.67m -0.371 0.3721 Ar 4 444.7A -2.124 0.0066
Any help is appreciated.
The first step should be to bring what seem to be 4 columns of records into a 1-column format. Then it's easy with awk, because you can match on fields 1 to 3 and print fields 4 and 5, like:
echo "He 2 4686A -2.088 0.0071" | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
which gives
-2.088 0.0071
So, for me, the only challenge is to transform your data into a one-column format. From the picture that looks simple, because the columns seem to have a fixed width which you can count.
Assuming that your column width is 30 characters (difficult to tell from a picture, beware of tabs) and your data is in input_file, you could first "cut" the data into 4 columns and then pipe the output to another awk process:
awk '{
print substr($0,1,30)
print substr($0,31,30)
print substr($0,61,30)
print substr($0,91,30)
}' input_file | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
If you really just need the next two numbers behind an anchor then I would say the grep-solution from Costa is best for you, however this gives you the possibility to implement further logic...
If you're not dead set on using awk, grep would be the easiest way...
egrep -o "He 2 4686A \-?[0-9.]+ \-?[0-9.]+" output.txt
EDIT: The above would work only if the fields were separated by single spaces, which doesn't seem to be your case. To handle tabs and/or repeated spaces:
egrep -o "He[ \t]+2[ \t]+4686A[ \t]+\-?[0-9.]+[ \t]+\-?[0-9.]+" output.txt
