I am writing a rake task to convert an HTML string to JSON, using Nokogiri to parse the HTML string and build the JSON. Everything was going fine until I noticed that if I have inner text like
< 109
or
> 109
then Nokogiri returns "109" instead of "> 109" or "< 109".
If I have a string like
str = '<td>< 109</td>'
then
result = Nokogiri::XML(str)
will return
#(Document:0x115f8 {
name = "document",
children = [ #(Element:0x1160c { name = "td", children = [ #(Text " 109")] })]
})
and
result.children.children.to_s
will return " 109", but I need "< 109".
How can I get the desired result?
You could replace Nokogiri::XML with Nokogiri::HTML, which is more permissive with incorrect syntax:
Nokogiri::XML('<td>< 109</td>').children.last.text # => " 109"
Nokogiri::HTML('<td>< 109</td>').children.last.text # => "< 109"
This is broken HTML. If this is the only issue you are trying to solve, you can fix the HTML before parsing it by replacing each stray < with &lt;.
str = '<td>< 109</td>'
fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<')
=> "<td>&lt; 109</td>"
result = Nokogiri::XML(fixed_str)
=> #(Document:0x2ac1be2860cc { name = "document", children = [ #(Element:0x2ac1be282940 { name = "td", children = [ #(Text "< 109")] })] })
If there are > characters too:
fixed_str = str.gsub(/>< ([0-9]+)</, '>&lt; \1<').gsub(/>> ([0-9]+)</, '>&gt; \1<')
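The two chained gsub calls above could probably be folded into a single pass. Here is a minimal sketch in core Ruby only, so it runs without Nokogiri. The helper name escape_stray_comparisons and the ENTITIES table are made up for illustration. It assumes the stray character always sits directly after a tag's closing > and is followed by a space and a digit, as in the question.

```ruby
# Hypothetical helper, not part of Nokogiri.
ENTITIES = { "<" => "&lt;", ">" => "&gt;" }.freeze

def escape_stray_comparisons(html)
  # (?<=>)   the stray character must directly follow a tag's closing ">"
  # [<>]     the stray "<" or ">" itself (the only text that gets replaced)
  # (?= \d)  it must be followed by a space and a digit
  html.gsub(/(?<=>)[<>](?= \d)/, ENTITIES)
end

puts escape_stray_comparisons('<td>< 109</td>')  # <td>&lt; 109</td>
puts escape_stray_comparisons('<td>> 109</td>')  # <td>&gt; 109</td>
puts escape_stray_comparisons('<td> 109</td>')   # unchanged: <td> 109</td>
```

The escaped string can then be handed to Nokogiri::XML as before. Anything that does not fit the assumed shape (for example "<109" without the space) is left untouched by this sketch.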
Looking for a clean way to return the characters of a string if end_with? evaluates to true.
i.e.
s = "my_name"
name = s.end_with?("name")
puts name
>> "name"
My use case would look somewhat like this:
file_name = "some_pdf"
permitted_file_types = %w(image pdf)
file_type = file_name.end_with?(*permitted_file_types)
puts file_type
>> "pdf"
I would do this:
"my_name".scan(/name\z/)[0]
#=> "name"
"something".scan(/name\z/)[0]
#=> nil
Maybe use match with \z (end of string)?
string = "my_name"
suffix1 = "name"
suffix2 = "name2"
In Ruby >= 3.1
result1 = %r{#{suffix1}\z}.match(string)&.match(0)
# => "name"
result2 = %r{#{suffix2}\z}.match(string)&.match(0)
# => nil
In Ruby < 3.1
result1 = %r{#{suffix1}\z}.match(string)&.[](0)
# => "name"
result2 = %r{#{suffix2}\z}.match(string)&.[](0)
# => nil
Just for fun, a trick with tap:
string.tap { |s| break s.end_with?(suffix1) ? suffix1 : nil }
# => "name"
string.tap { |s| break s.end_with?(suffix2) ? suffix2 : nil }
# => nil
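One caveat with the regex variants above: a suffix interpolated straight into a pattern is read as regex syntax, so a suffix such as ".pdf" would also match "_pdf". A small sketch of guarding against that with Regexp.escape follows; the method name suffix_match is hypothetical.

```ruby
# Hypothetical helper: returns the suffix if the string ends with it, else nil.
def suffix_match(string, suffix)
  # Regexp.escape makes characters like "." match literally.
  string[/#{Regexp.escape(suffix)}\z/]
end

unescaped = ".pdf"

puts suffix_match("report.pdf", ".pdf").inspect  # ".pdf"
puts suffix_match("my_pdf", ".pdf").inspect      # nil
puts "my_pdf"[/#{unescaped}\z/].inspect          # "_pdf" (the "." matched "_")
```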
Ruby already has a String#end_with? method (it has been there since at least Ruby 1.9), but it returns a boolean, and you want the matched string to be returned. You're calling a method on an instance of String, so you apparently want to add a method to the String class.
class String # yeah, we're monkey patching!
  def ends_with?(str) # don't name it end_with?, that's already defined
    return str if self.end_with?(str)
  end
end
# now
s = "my_name"
s.ends_with?("name") #=> "name"
But I really wouldn't bother with all that... I'd just work with what Ruby provides:
s = "my_name"
str = "name"
result = str if s.end_with?(str)
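For the use case in the question, where there are several permitted suffixes, the same idea can be applied with Enumerable#find, which returns the first element for which the block is truthy, or nil if there is none. This is a sketch; the method name matching_suffix is made up for illustration.

```ruby
# Hypothetical helper: returns the first suffix the string ends with, or nil.
def matching_suffix(string, suffixes)
  suffixes.find { |suffix| string.end_with?(suffix) }
end

permitted_file_types = %w(image pdf)

puts matching_suffix("some_pdf", permitted_file_types).inspect  # "pdf"
puts matching_suffix("some_doc", permitted_file_types).inspect  # nil
```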
I'm currently working on an XML parser and I'm trying to use Lua's pattern matching tools but I'm not getting the desired result. Let's say I have this XML snippet:
<Parent>
<Child>
<Details>Text in Parent tag and Details child tag</Details>
<Division>Text in Parent tag and Division child tag</Division>
</Child>
</Parent>
I need to pull the Parent tag out into a table, followed by any child tags, and their corresponding text data. I already have the pattern for pulling the data figured out:
DATA = "<.->(.-)<"
Likewise for pulling tags individually:
TAGS ="<(%w+)>"
However like I mentioned, I need to differentiate between tags that are nested and tags that aren't. Currently the pattern that's getting the closest result I need is:
CHILDTAG= "<%w->.-<(%w-)>"
Which should print only "Child", but it prints "Division" as well, for a reason I can't comprehend. The idea behind the CHILDTAG pattern is that it captures a tag if and only if it has an enclosing tag. The ".-" is there to signify that there may or may not be a newline in between; however, I think that's completely wrong, because "\n-" doesn't work and that does signify a newline. I referred to the documentation and to:
https://www.fhug.org.uk/wiki/wiki/doku.php?id=plugins:understanding_lua_patterns
Simple XML parser (Named entities in XML are not supported)
local symbols = {lt = '<', gt = '>', amp = '&', quot = '"', apos = "'", nbsp = ' ', euro = '€', copy = '©', reg = '®'}
local function unicode_to_utf8(codepoint)
-- converts numeric unicode to string containing single UTF-8 character
local t, h = {}, 127
while codepoint > h do
local low6 = codepoint % 64
codepoint = (codepoint - low6) / 64
t[#t+1] = 128 + low6
h = 288067 % h
end
t[#t+1] = 254 - 2*h + codepoint
return string.char((table.unpack or unpack)(t)):reverse()
end
local function unescape(text)
return (
(text..'<![CDATA[]]>'):gsub('(.-)<!%[CDATA%[(.-)]]>',
function(not_cdata, cdata)
return
not_cdata
:gsub('%s', ' ')
--:gsub(' +', ' ') -- only for html
:gsub('^ +', '')
:gsub(' +$', '')
:gsub('&(%w+);', symbols)
:gsub('&#(%d+);', function(u) return unicode_to_utf8(tonumber(u)) end)
:gsub('&#[xX](%x+);', function(u) return unicode_to_utf8(tonumber(u, 16)) end)
..cdata
end
)
)
end
function parse_xml(xml)
local tag_stack = {}
local result = {find_child_by_tag = {}}
for text_before_tag, closer, tag, attrs, self_closer in xml
:gsub('^%s*<%?xml.-%?>', '') -- remove prolog ("?" is a pattern quantifier in Lua, so it is escaped as %?)
:gsub('^%s*<!DOCTYPE[^[>]+%[.-]>', '')
:gsub('^%s*<!DOCTYPE.->', '')
:gsub('<!%-%-.-%-%->', '') -- remove comments
:gmatch'([^<]*)<(/?)([%w_]+)(.-)(/?)>'
do
table.insert(result, unescape(text_before_tag))
if result[#result] == '' then
result[#result] = nil
end
if closer ~= '' then
local parent_pos, parent
repeat
parent_pos = table.remove(tag_stack)
if not parent_pos then
error("Closing unopened tag: "..tag)
end
parent = result[parent_pos]
until parent.tag == tag
local elems = parent.elems
for pos = parent_pos + 1, #result do
local child = result[pos]
table.insert(elems, child)
if type(child) == 'table' then
--child.find_parent = parent
parent.find_child_by_tag[child.tag] = child
end
result[pos] = nil
end
else
local attrs_dict = {}
for names, value in ('\0'..attrs:gsub('%s*=%s*([\'"])(.-)%1', '\0%2\0')..'\0')
:gsub('%z%Z*%z', function(unquoted) return unquoted:gsub('%s*=%s*([%w_]+)', '\0%1\0') end)
:gmatch'%z(%Z*)%z(%Z*)'
do
local last_attr_name
for name in names:gmatch'[%w_]+' do
name = unescape(name)
if last_attr_name then
attrs_dict[last_attr_name] = '' -- boolean attributes (such as "disabled" in html) are converted to empty strings
end
last_attr_name = name
end
if last_attr_name then
attrs_dict[last_attr_name] = unescape(value)
end
end
table.insert(result, {tag = tag, attrs = attrs_dict, elems = {}, find_child_by_tag = {}})
if self_closer == '' then
table.insert(tag_stack, #result)
end
end
end
for _, child in ipairs(result) do
if type(child) == 'table' then
result.find_child_by_tag[child.tag] = child
end
end
-- Now result is a sequence of upper-level tags
-- each tag is a table containing fields: tag (string), attrs (dictionary, may be empty), elems (array, may be empty) and find_child_by_tag (dictionary, may be empty)
-- attrs is a dictionary of attributes
-- elems is a sequence of elements (with preserving their order): tables (nested tags) or strings (text between <tag> and </tag>)
return result
end
Usage example:
local xml= [[
<Parent>
<Child>
<Details>Text in Parent tag and Details child tag</Details>
<Division>Text in Parent tag and Division child tag</Division>
</Child>
</Parent>
]]
xml = parse_xml(xml)
--> both these lines print "Text in Parent tag and Division child tag"
print(xml[1].elems[1].elems[2].elems[1])
print(xml.find_child_by_tag.Parent.find_child_by_tag.Child.find_child_by_tag.Division.elems[1])
What parsed xml looks like:
xml = {
find_child_by_tag = {Parent = ...},
[1] = {
tag = "Parent",
attrs = {},
find_child_by_tag = {Child = ...},
elems = {
[1] = {
tag = "Child",
attrs = {},
find_child_by_tag = {Details = ..., Division = ...},
elems = {
[1] = {
tag = "Details",
attrs = {},
find_child_by_tag = {},
elems = {[1] = "Text in Parent tag and Details child tag"}
},
[2] = {
tag = "Division",
attrs = {},
find_child_by_tag = {},
elems = {[1] = "Text in Parent tag and Division child tag"}
}
}
}
}
}
}
def coderay(text)
text.gsub(/\<pre( lang="(.+?)")?\>\<code( lang="(.+?)")?\>(.+?)\<\/code\>\<\/pre\>/m) do
lang = $4
text = CGI.unescapeHTML($5).gsub /\<code( lang="(.+?)")?\>|\<\/code\>/, ""
text = text.gsub('<br />', "\n")
text = text.gsub(/[\<]([\/])*([A-Za-z0-9])*[\>]/, '')
text = text.gsub('&gt;', ">")
text = text.gsub('&lt;', "<")
text = text.gsub('&nbsp;', " ")
text = text.gsub('&amp;', "&")
CodeRay.scan(text, lang).div(:css => :class)
end
end
The above generates this at the last closing "end":
</code(></code(></pre(>
Anyone knows why? I am using gems CodeRay 1.1.0 and RedCloth 4.2.9. Ruby version: 2.1.1 and Rails 3.2.19. RefineryCMS 2.1.3 and their blog engine.
I thought that this line was the cure but it is not:
text = CGI.unescapeHTML($5).gsub /\<code( lang="(.+?)")?\>|\<\/code\>/, ""
Edited:
This is in the show.html.erb file:
<%= raw (coderay(RedCloth.new(render 'post').to_html)) %>
I'm trying to have greentext support for my Rails imageboard (though it should be mentioned that this is strictly a Ruby problem, not a Rails problem)
basically, what my code does is:
1. chop up a post, line by line
2. look at the first character of each line. if it's a ">", start the greentexting
3. at the end of the line, close the greentexting
4. piece the lines back together
My code looks like this:
def filter_comment(c) #use for both OP's and comments
c1 = c.content
str1 = '<p class = "unkfunc">' #open greentext
str2 = '</p>' #close greentext
if c1 != nil
arr_lines = c1.split('\n') #split the text into lines
arr_lines.each do |a|
if a[0] == ">"
a.insert(0, str1) #add the greentext tag
a << str2 #close the greentext tag
end
end
c1 = ""
arr_lines.each do |a|
strtmp = '\n'
if arr_lines.index(a) == (arr_lines.size - 1) #recombine the lines into text
strtmp = ""
end
c1 += a + strtmp
end
c2 = c1.gsub("\n", '<br/>').html_safe
end
But for some reason, it isn't working! I'm having weird things where greentexting only works on the first line, and if you have greentext on the first line, normal text doesn't work on the second line!
Side note, which may or may not be your problem, without getting too in depth...
Try joining your array back together with join(), using a double-quoted "\n" so that it is a real newline:
c1 = arr_lines.join("\n")
I think the problem lies in how the lines are split into an array.
names = "Alice \n Bob \n Eve"
names_a = names.split('\n')
=> ["Alice \n Bob \n Eve"]
Note that the string was not split when \n was encountered.
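The reason is Ruby's quoting rules: inside single quotes, \n is not an escape sequence, so '\n' is a two-character string (a backslash followed by the letter n), whereas "\n" is a single newline character. A quick sketch to check this (the variable names are only for illustration):

```ruby
single = '\n'  # backslash + "n": two characters
double = "\n"  # one newline character

puts single.length                 # 2
puts double.length                 # 1
puts "a\nb".split(single).inspect  # ["a\nb"]    (nothing to split on)
puts "a\nb".split(double).inspect  # ["a", "b"]
```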
Now lets try this
names = "Alice \n Bob \n Eve"
names_a = names.split(/\n/)
=> ["Alice ", " Bob ", " Eve"]
Or use "\n" in double quotes (thanks to Eric's comment):
names = "Alice \n Bob \n Eve"
names_a = names.split("\n")
=> ["Alice ", " Bob ", " Eve"]
This time the string was split into an array. Now you can check each line and append the data you want.
Maybe this is what you want:
def filter_comment(c) # use for both OP's and comments
  c1 = c.content
  str1 = '<p class = "unkfunc">' # open greentext
  str2 = '</p>' # close greentext
  if c1 != nil
    arr_lines = c1.split(/\n/) # split the text into lines
    arr_lines.each do |a|
      if a[0] == ">"
        a.insert(0, str1) # add the greentext tag
        # Use a.insert if you want to keep the existing ">" after the tag: <p class = "unkfunc">>
        # Or else just assign a[0] = str1
        a << str2 # close the greentext tag
      end
    end
    c1 = arr_lines.join('<br/>')
    c2 = c1.html_safe
  end
end
Hope this helps..!!
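The four steps from the question (split into lines, check the first character, wrap, rejoin) can also be written without mutating the lines in place. This is only a sketch in plain Ruby: the greentext method and the two constants are hypothetical names, the Rails-only html_safe call is left out so that it runs outside Rails, and it assumes the content has already been HTML-escaped where that matters.

```ruby
# Hypothetical names, for illustration only.
GREENTEXT_OPEN  = '<p class="unkfunc">'
GREENTEXT_CLOSE = '</p>'

def greentext(content)
  return "" if content.nil?

  content.split("\n").map do |line|
    # Wrap only the lines that start with ">"; leave the others as they are.
    line.start_with?(">") ? "#{GREENTEXT_OPEN}#{line}#{GREENTEXT_CLOSE}" : line
  end.join("<br/>")
end

puts greentext(">quoted\nplain\n>quoted again")
# <p class="unkfunc">>quoted</p><br/>plain<br/><p class="unkfunc">>quoted again</p>
```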
I'm suspecting that your problem is with your CSS (or maybe HTML), not the Ruby. Did the resulting HTML look correct to you?
In Ruby on Rails, I have a number string made of 3 parts: prefix, counter, suffix.
In model Setup:
def self.number
prefix = setup.receipt_prefix.blank? ? "" : setup.receipt_prefix.to_s
counter = setup.receipt_counter.blank? ? "" : setup.receipt_counter+1
suffix = setup.receipt_suffix.blank? ? "" : setup.receipt_suffix.to_s
each individual string shows fine:
puts prefix
=> \#_
puts counter
=>
1234
puts suffix
=>
#$#s
but when I add the 3 strings together, an additional backslash appears:
prefix + counter + suffix
=>
\\#_1234\#$#s
How can I escape "#" and "\" when I add the 3 strings together? I want something like:
=>
\#_1234#$#s
any Ruby or Rails's helper I can use in the model?
thx~~
The string will look different depending on whether you look at the returned value (which the console shows in its inspect form) or print it with puts. See the following irb session.
>> a = "\\#_"
=> "\\#_"
>> puts a
\#_
=> nil
>> b = "1234"
=> "1234"
>> puts a + b
\#_1234
=> nil
>> a + b
=> "\\#_1234"
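The actual string value contains a single backslash. irb displays it as two because it shows the string's inspect form, in which a backslash is escaped as \\; puts prints the raw characters, so only one shows up.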
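A short sketch to confirm this outside irb. The sample values are taken from the question, and the .to_s on the counter is an addition here, since adding an Integer to a String raises a TypeError. It also shows where the extra backslash before "#$" in the question's output comes from: inspect escapes a "#" that is followed by "$", "@" or "{", because those would start interpolation in a double-quoted literal.

```ruby
prefix  = "\\#_"    # one backslash, then "#_"
counter = 1233 + 1
suffix  = '#$#s'    # single quotes: no interpolation, no escapes

number = prefix + counter.to_s + suffix

puts number                    # \#_1234#$#s
puts number.chars.count("\\")  # 1
puts prefix.inspect            # "\\#_"
puts suffix.inspect            # "\#$#s"
```

So nothing needs to be escaped or stripped: the stored value is already the desired one, and only its inspect form carries the extra backslashes.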