How to apply additional inline style to html tags in ruby? - ruby-on-rails

I have a html string. In that string I want to parse all <p> tags and apply additional inline style.
Additional Style: style="margin:0px;padding:0px;" or it could be something else
Case1:
input string: <p>some string</p>
output string: <p style="margin:0px;padding:0px;">some string</p>
Case2:
input string: <p style="text-align:right;" >some string</p>
output string: <p style="text-align:right;margin:0px;padding:0px;">some string</p>
Case3:
input string: <p align="justify">some string</p>
output string: <p style="margin:0px;padding:0px;" align="justify">some string</p>
Right now I am using regex like this
myHtmlString.gsub("<p", "<p style = \"margin:0px;padding:0px\"")
Which works fine except it removes previous styling. I am using Ruby (ROR).
I need help to tweak this a bit.

You can do this using Nokogiri, by setting [:style] on the relevant Nodes.
require "nokogiri"
inputs = [
'<p>some string</p>',
'<p style="text-align:right;" >some string</p>',
'<p align="justify">some string</p>'
]
inputs.each do |input|
noko = Nokogiri::HTML::fragment(input)
noko.css("p").each do |tag|
tag[:style] = (tag[:style] || "") + "margin:0px;padding:0px;"
end
puts noko.to_html
end
This will loop through all elements matching the css selector p, and set the style attribute like you want.
Output:
<p style="margin:0px;padding:0px;">some string</p>
<p style="text-align:right;margin:0px;padding:0px;">some string</p>
<p align="justify" style="margin:0px;padding:0px;">some string</p>

I recommend against using regex for this, as in general HTML can't be properly parsed by regex. That said, as long as your input data is consistent, regex will still work. You want to capture whatever content is already in a p element's style attribute using parentheses, then insert it in the replacement. Use the block form of gsub for this: a "#{$2}" inside a plain string argument is interpolated before gsub runs, so it would not hold the captured text.
myHtmlString.gsub(/<p( style="([^"]*)")?/) { "<p style=\"#{$2};margin:0px;padding:0px\"" }
Here's how the match pattern works:
/ #regex delimiter
<p #match start of p tag
( #open paren used to group, everything in this group gets saved in $1
 style=" #open style attribute
([^"]*) #group contents of style attribute (everything up to the closing quote), gets saved to $2
" #close style attribute
)? #question mark makes everything in the paren group optional
/ #regex delimiter
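As a rough sketch of how the three cases from the question could be handled with plain gsub and a block, here is a small helper. The names add_paragraph_style and EXTRA_STYLE are made up for illustration, and the code assumes double-quoted attributes and no ">" inside attribute values, so treat it as a starting point rather than a general HTML solution.

```ruby
EXTRA_STYLE = "margin:0px;padding:0px;"

# Hypothetical helper: appends extra CSS to every <p> tag's style attribute,
# creating the attribute when the tag does not have one yet.
def add_paragraph_style(html, extra = EXTRA_STYLE)
  html.gsub(/<p\b([^>]*)>/i) do
    attrs = Regexp.last_match(1)
    if attrs =~ /style\s*=\s*"([^"]*)"/i
      existing = Regexp.last_match(1).strip
      # Make sure the existing declarations end with a semicolon before appending
      existing += ";" unless existing.empty? || existing.end_with?(";")
      merged = attrs.sub(/style\s*=\s*"[^"]*"/i) { %(style="#{existing}#{extra}") }
      "<p#{merged.rstrip}>"
    else
      %(<p style="#{extra}"#{attrs.rstrip}>)
    end
  end
end

puts add_paragraph_style('<p>some string</p>')
puts add_paragraph_style('<p style="text-align:right;" >some string</p>')
puts add_paragraph_style('<p align="justify">some string</p>')
```

The \b after <p keeps the pattern from touching tags such as <pre>, and the block form means the captured attributes are read after each match rather than once before gsub starts.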

I ended up doing something like this, since I had to do it just before sending the email. I know this is not the best way to do it, but it is worth sharing here. The solutions given by @sgroves and @Dobert are really good and helpful.
But I didn't want to include Nokogiri, though I took the idea from the two solutions above. Thanks.
Here is my code (I am new to ROR so there is nothing fancy here; I used it in a HAML block)
myString.gsub!(/<p[^>]*>/) do |match|
match1 = match
style1_arr = match1.scan(/style=".*"/)
unless style1_arr.blank?
style1 = style1_arr.first.sub("style=", "").gsub(/\"/, "").to_s
style2 = style1 + "margin:0px;padding:0px;"
match2 = match1.sub(/style=".*"/, "style=\"#{style2.to_s}\"")
else
match2 = match1.sub(/<p/, "<p style = \"margin:0px;padding:0px;\"")
end
end
Now myString holds the updated string (notice the ! after gsub, which modifies the string in place).

Related

How to write method to csv not just string? Ruby csv gem

I need to put the text content from an html element to a csv file. With the ruby csv gem, it seems that the primary write method for wrapped Strings and IOs only converts a string even if an object is specified.
For example:
Searchresults = puts browser.divs(class: 'results_row').map(&:text)
csv << %w(Searchresults)
returns only "searchresults" in the csv file.
It seems like there should be a way to specify the text from the div element to be put and not just a literal string.
Edit:
Okay arieljuod and spickermann were right. Now I am getting text content from the div element output to the csv, but not all of it like when I output to the console. The div element "results_row" has two a elements with text content. It also has a child div "results_subrow" with a paragraph of text content that is not getting written to the csv.
HTML:
<div class="bodytag" style="padding-bottom:30px; overflow:visible">
<h2>Search Results for "serialnum3"</h2>
<div id="results_banner">
Products
<span>Showing 1 to 2 of 2 results</span>
</div>
<div class="pg_dir"></div>
<div class="results_row">
FUJI
50mm lens
<div class="results_subrow">
<p>more product info</p>
</div>
</div>
<div class="results_row">
FUJI
50mm lens
<div class="results_subrow">
<p>more product info 2</p>
</div>
</div>
<div class="pg_dir"></div>
My code:
search_results = browser.divs(class: 'results_row').map(&:text)
csv << search_results
I'm thinking that including the child div "results_subrow" in the locator will find what I am missing. Like:
search_results = browser.divs(class: 'results_row', 'results_subrow').map(&:text)
csv << search_results
%w[Searchresults] creates an array containing the word Searchresults. You probably want something like this:
# assign the array returned from `map` to the `search_results` variable
search_results = browser.divs(class: 'results_row').map(&:text)
# output the `search_results`. Note that the return value of `puts` is `nil`
# therefore something like `Searchresults = puts browser...` doesn't work
puts search_results
# append `search_results` to your csv
csv << search_results
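To make the difference concrete without a browser, here is a small sketch using only the csv standard library. The search_results array is a hypothetical stand-in for what browser.divs(class: 'results_row').map(&:text) might return; the strings in it are invented sample values.

```ruby
require "csv"

# Hypothetical stand-in for browser.divs(class: 'results_row').map(&:text)
search_results = [
  "FUJI 50mm lens more product info",
  "FUJI 50mm lens more product info 2"
]

# %w(Searchresults) is just the one-word array ["Searchresults"]
literal_row = CSV.generate { |csv| csv << %w(Searchresults) }

# Passing the variable writes its elements as the cells of a single row
single_row = CSV.generate { |csv| csv << search_results }

# Wrapping each text in its own array writes one row per result instead
one_row_each = CSV.generate do |csv|
  search_results.each { |text| csv << [text] }
end

puts literal_row, single_row, one_row_each
```

Whether one row with many cells or one row per result is the right shape depends on how the file will be read later.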

ujs + remotipart - when running jQuery `.html(...)` calls, the appended html becomes just text

I am using rails 5, with remotipart. The remotipart version is:
gem 'remotipart', github: 'mshibuya/remotipart'
(At the moment of this question, the current commit is 88d9a7d55bde66acb6cf3a3c6036a5a1fc991d5e).
When I submit a form that has `multipart: true` and `remote: true` but without sending any attached file, it works great. But when I send a file, it fails.
To illustrate the case, consider a response like this:
(function() {
modelErrors('<div class=\'alert alert-danger alert-dismissible\' role=\'alert\'>\n <div aria-label=\'Close\' class=\'close fade in\' data-dismiss=\'alert\'>\n <span aria-hidden>\n ×\n <\/span>\n <\/div>\n Code can\'t be blank\n<\/div>\n');
}).call(this);
When this response is executed, immediately after arriving (since it is js), the form looks as expected (in this case the validation error is legitimate, and the danger alert is rightly rendered with that text):
However, when I fill the file field, and repeat the exact same case (exact same validation error), the form looks quite different:
If you can guess, the contents are passed as text. The actual response being received from the server is a bit different:
<script type="text/javascript">try{window.parent.document;}catch(err){document.domain=document.domain;}</script>(function() {
modelErrors('<div class=\'alert alert-danger alert-dismissible\' role=\'alert\'>\n <div aria-label=\'Close\' class=\'close fade in\' data-dismiss=\'alert\'>\n <span aria-hidden>\n ×\n <\/span>\n <\/div>\n Code can\'t be blank\n<\/div>\n');
}).call(this);
This is the actual body of my response (it is not a bad copypaste, but the actual response as wrapped by remotipart).
The modalErrors function is quite dumb:
function modalErrors(html) {
$('#main-modal > .modal-dialog > .modal-content > .modal-body > .errors').html(html);
}
Comparing the jQuery-appended html chunks (I look for them in the browser's DOM inspector), they look like this:
Good:
<div class="alert alert-danger" role="alert">
<ul style="list-style-type: none">
<li>Code can't be blank</li>
</ul>
</div>
Bad:
Code can't be blank
What am I missing here? What I want is to allow my responses to append html content when needed.
The remotipart lib, which uses something named iframe-transport, unwraps the content later and executes the response as if it was fetched with ajax.
However, the tags are stripped from the response, so the response will be converted to something like this:
try{window.parent.document;}catch(err){document.domain=document.domain;}(function() {
modelErrors('\n \n \n ×\n \n \n Code can\'t be blank\n\n');
}).call(this);
The solution I found to interact with this library in a sane way is to define another helper, similar to j / escape_javascript:
module ApplicationHelper
JS_ESCAPE_MAP = {
'\\' => '\\\\',
'<' => '\\u003c',
'&' => '\\u0026',
'>' => '\\u003e',
"\r\n" => '\n',
"\n" => '\n',
"\r" => '\n',
'"' => '\\u0022',
"'" => "\\u0027"
}
JS_ESCAPE_MAP["\342\200\250".force_encoding(Encoding::UTF_8).encode!] = '&#x2028;'
JS_ESCAPE_MAP["\342\200\251".force_encoding(Encoding::UTF_8).encode!] = '&#x2029;'
def escape_javascript_with_inside_html(javascript)
if javascript
result = javascript.gsub(/(\\|\r\n|\342\200\250|\342\200\251|[\n\r<>&"'])/u) {|match| JS_ESCAPE_MAP[match] }
javascript.html_safe? ? result.html_safe : result
else
''
end
end
alias_method :jh, :escape_javascript_with_inside_html
end
In the views that may be rendered both for regular ajax requests and for remotipart requests (in different scenarios), just replace the j calls with jh calls. Example:
modalErrors('<%= j render 'shared/model_errors_alert', instance: @team %>')
was replaced with
modalErrors('<%= jh render 'shared/model_errors_alert', instance: @team %>')
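To see why this survives the tag stripping, here is a plain-Ruby sketch of the same escaping idea outside of Rails. The names HTML_IN_JS_ESCAPES and escape_html_for_js are invented for this illustration, and the html_safe handling and the U+2028/U+2029 entries of the helper above are left out.

```ruby
# Plain-Ruby sketch of the escaping idea (no Rails, so no html_safe handling)
HTML_IN_JS_ESCAPES = {
  "\\"   => "\\\\",
  "<"    => "\\u003c",
  ">"    => "\\u003e",
  "&"    => "\\u0026",
  "\r\n" => "\\n",
  "\n"   => "\\n",
  "\r"   => "\\n",
  '"'    => "\\u0022",
  "'"    => "\\u0027"
}.freeze

# Replaces every character that matters to HTML or to a JS string literal
# with a JS escape sequence, so no literal tags remain in the response body.
def escape_html_for_js(text)
  text.to_s.gsub(/\\|\r\n|[\n\r<>&"']/) { |match| HTML_IN_JS_ESCAPES[match] }
end

escaped = escape_html_for_js("<div class='alert'>\n</div>")
puts escaped
```

Because the escaped text contains no literal < or > characters, there is nothing left for a tag-stripping step to remove, and the JavaScript engine turns the \u003c and \u003e sequences back into angle brackets when the string literal is evaluated.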

How do I remove white space between HTML nodes?

I'm trying to remove whitespace from an HTML fragment between <p> tags
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
as you can see, there always is a blank space between the <p> </p> tags.
The problem is that the blank spaces create <br> tags when saving the string into my database.
Methods like strip or gsub only remove the whitespace in the nodes, resulting in:
<p>FooBar</p> <p>barbarbar</p> <p>bla</p>
whereas I'd like to have:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
I'm using:
Nokogiri 1.5.6
Ruby 1.9.3
Rails
UPDATE:
Occasionally there are child nodes of the <p> tags that cause the same problem: white space between them.
Sample Code
Note: the Code normally is in one Line, I reformatted it because it would be unbearable otherwise...
<p>
<p>
<strong>Selling an Appartment</strong>
</p>
<ul>
<li>
<p>beautiful apartment!</p>
</li>
<li>
<p>near the train station</p>
</li>
.
.
.
</ul>
<ul>
<li>
<p>10 minutes away from a shopping mall </p>
</li>
<li>
<p>nice view</p>
</li>
</ul>
.
.
.
</p>
How would I strip those white spaces as well?
SOLUTION
It turns out that I messed up using the gsub method and didn't further investigate the possibility of using gsub with regex...
The simple solution was adding
data = data.gsub(/>\s+</, "><")
It deleted whitespace between all different kinds of nodes... Regex ftw!
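A quick sketch of that one-liner applied to the sample from the question, wrapped in a helper whose name (strip_between_tags) is made up here. The second call illustrates a caveat: the space between inline elements is removed too.

```ruby
# Remove whitespace that sits directly between a closing ">" and the next "<"
def strip_between_tags(html)
  html.gsub(/>\s+</, "><")
end

data = "<p>Foo Bar</p> <p>bar bar bar</p>\n  <p>bla</p>"
puts strip_between_tags(data)

# Caveat: inline elements lose their separating space as well
inline = "<b>Foo</b> <i>Bar</i>"
puts strip_between_tags(inline)
```

Whitespace inside the text of a node is left alone, because the pattern only matches whitespace with a ">" on its left and a "<" on its right.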
This is how I'd write the code:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT
doc.search('p, ul, li').each { |node|
next_node = node.next_sibling
next_node.remove if next_node && next_node.text.strip == ''
}
puts doc.to_html
It results in:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
Breaking it down:
doc.search('p, ul, li')
looks for the <p>, <ul> and <li> nodes in the document. Nokogiri returns a NodeSet from search, which is empty if nothing matched. The code loops over the NodeSet, looking at each node in turn.
next_node = node.next_sibling
gets the pointer to the next node following the current node.
next_node.remove if next_node && next_node.text.strip == ''
next_node.remove removes next_node from the DOM if it isn't nil and its text is empty when stripped, in other words, if the node contains only whitespace.
There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.
A first solution can be to remove empty text nodes, a quick way to do this for your exact case can be:
require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.css('body').first.children.map{|node| node.to_s.strip}.compact.join
This won't work for nested elements as-is, but it should give you a good starting point.
UPDATE:
You can actually optimise a little with:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.children.map{|node| node.to_s.strip}.compact.join
Here are the tasks you are likely to need when dealing with unnecessary whitespace (including Unicode whitespace) in parsed output.
html = "<p>A paragraph.<em> </em> <br><br><em>
</em></p><p><em> </em>
</p><p><em>
</em><strong><em>\" Quoted Text \" </em></strong></p>
<ul><li><p>List 1</p></li><li><p>List 2</p></li><li><p>List 3 </p>
<p><br></p><p><br><em> </em><br>
A text content.<br><em><br>
</em></p></li></ul>"
doc = Nokogiri::HTML.fragment(html)
doc.traverse { |node|
# removes any whitespace node
node.remove if node.text.gsub(/[[:space:]]/, '') == ''
# replace multiple consecutive spaces with a single space
node.content = node.text.gsub(/[[:space:]]{2,}/, ' ') if node.text?
}
# Gives you the html without whitespace-only nodes (including <br>) and without runs of multiple spaces anywhere in the text
puts doc.to_html
# Gives text of html, concatenating li items with a space between them
# By default li items text are concatenated without the space
Nokogiri::HTML(doc.to_html).xpath('//text()').map(&:text).join(' ')
#Output
# "A paragraph. \" Quoted Text \" \n List 1 \n List 2 \n \n List 3 \n A text content. \n \n"
# To Remove newline character '\n'
Nokogiri::HTML(doc.to_html).xpath('//text()').map(&:text).join(' ').gsub(/\n+/,'')
#Output
# "A paragraph. \" Quoted Text \" List 1 List 2 List 3 A text content."
Note: If you are not using fragment in case of a complete html doc then you might have to replace traverse with other function like search.
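data.squish (from ActiveSupport) is way more readable, but note that it only collapses runs of whitespace into a single space and trims the ends, so a single space between tags remains.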

How do I trim all whitespace in HAML?

In my example I want to individually markup letters in the word "word"
%span.word
%span.w W
%span.o O
%span.r R
%span.d D
As it is, this produces html like
<span class="word">
<span class="w">W</span>
<span class="o">O</span>
<span class="r">R</span>
<span class="d">D</span>
</span>
As you'd expect this displays as
W O R D
But I want it to display as
WORD
How can I tell haml to remove all whitespace within the %span.word block?
%span.word
%span.w> W
%span.o> O
%span.r> R
%span.d> D
> (for whitespaces around a tag) and < (for whitespaces inside a tag) are used for whitespace removal.
http://haml.info/docs/yardoc/Haml/Options.html#remove_whitespace-instance_method
HAML allows you set a remove_whitespace option which will remove whitespace from all tags, if you don't want to litter your templates with < and > everywhere.

Getting attribute's value in Nokogiri to extract link URLs

I have a document which looks like this:
<div id="block">
<a href="http://google.com">link</a>
</div>
I can't get Nokogiri to get me the value of href attribute. I'd like to store the address in a Ruby variable as a string.
html = <<HTML
<div id="block">
<a href="http://google.com">link</a>
</div>
HTML
doc = Nokogiri::HTML(html)
doc.xpath('//div/a/@href')
#=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
Or if you want to be more specific about the div:
>> doc.xpath('//div[@id="block"]/a/@href')
=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
>> doc.xpath('//div[@id="block"]/a/@href').first.value
=> "http://google.com"
doc = Nokogiri::HTML(open("[insert URL here]"))
href = doc.css('#block a')[0]["href"]
The variable href is assigned the value of the "href" attribute of the <a> element inside the element with id 'block'. doc.css('#block a') returns a NodeSet containing the matching <a> elements; here there is only one. [0] selects that element (a Nokogiri::XML::Element), and ["href"] reads its href attribute, which is a string containing the url.
Having struggled with this question in various forms, I decided to write myself a tutorial disguised as an answer. It may be helpful to others.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
But there's a better way: in the above cases, the .compact is necessary because the searches return the "just a bookmark" element as well. We can use a more refined search to find just the elements that contain an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
And what if you want to find the text associated with a particular link?
Not a problem:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
a handy Nokogiri cheat sheet
a tutorial on parsing HTML with Nokogiri
interactively test CSS selector queries
doc = Nokogiri::HTML("HTML ...")
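# at_css returns the first matching element; css returns a NodeSet, which is indexed by position, not by attribute name
href = doc.at_css("div[id='block'] > a")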
result = href['href'] #http://google.com
data = '<html lang="en" class="">
<head>
<a href="https://example.com/9f40a.css" media="all" rel="stylesheet" /> link1</a>
<a href="https://example.com/4e5fb.css" media="all" rel="stylesheet" />link2</a>
<a href="https://example.com/5s5fb.css" media="all" rel="stylesheet" />link3</a>
</head>
</html>'
Here is my attempt for the above sample of HTML code:
doc = Nokogiri::HTML(data)
doc.xpath('//@href').map(&:value)
=> ["https://example.com/9f40a.css", "https://example.com/4e5fb.css", "https://example.com/5s5fb.css"]
document.at_css("#block a")["href"]
where document is the parsed Nokogiri HTML document.
