How to get text from <li> elements - ruby-on-rails

I have:
<ul>
<li>text1</li>
<li>text2 </li>
</ul>
Right now I get the text from <li> like this:
result = page.css(' ul li').text
The problem is, as a result I get a string with no spaces like
text1text2
I want it to be divided with <br>, like text1<br>text2<br>.
How do I do this?

From "Searching a XML/HTML Document":
methods xpath and css actually return a NodeSet, which acts very much
like an array, and contains matching nodes from the document.
So, if you want to concatenate all texts from all <li> tags, then you should work with the css method result as with a collection:
page.css('ul li') # selects all li tags and returns collection of Node objects
.map(&:text) # maps collection of li nodes into array of corresponding texts
.join('<br>') # concatenates all nodes texts into a single string with <br> separator
See: http://ruby.bastardsbook.com/chapters/html-parsing/

Related

How to write method to csv not just string? Ruby csv gem

I need to put the text content from an html element to a csv file. With the ruby csv gem, it seems that the primary write method for wrapped Strings and IOs only converts a string even if an object is specified.
For example:
Searchresults = puts browser.divs(class: 'results_row').map(&:text)
csv << %w(Searchresults)
returns only "searchresults" in the csv file.
It seems like there should be a way to specify the text from the div element to be put and not just a literal string.
Edit:
Okay arieljuod and spickermann were right. Now I am getting text content from the div element output to the csv, but not all of it like when I output to the console. The div element "results_row" has two a elements with text content. It also has a child div "results_subrow" with a paragraph of text content that is not getting written to the csv.
HTML:
<div class="bodytag" style="padding-bottom:30px; overflow:visible">
<h2>Search Results for "serialnum3"</h2>
<div id="results_banner">
Products
<span>Showing 1 to 2 of 2 results</span>
</div>
<div class="pg_dir"></div>
<div class="results_row">
FUJI
50mm lens
<div class="results_subrow">
<p>more product info</p>
</div>
</div>
<div class="results_row">
FUJI
50mm lens
<div class="results_subrow">
<p>more product info 2</p>
</div>
</div>
<div class="pg_dir"></div>
My code:
search_results = browser.divs(class: 'results_row').map(&:text)
csv << search_results
I'm thinking that including the child div "results_subrow" in the locator will find what I am missing. Like:
search_results = browser.divs(class: 'results_row', 'results_subrow').map(&:text)
csv << search_results
%w[Searchresults] creates an array containing the word Searchresults. You probably want something like this:
# assign the array returned from `map` to the `search_results` variable
search_results = browser.divs(class: 'results_row').map(&:text)
# output the `search_results`. Note that the return value of `puts` is `nil`
# therefore something like `Searchresults = puts browser...` doesn't work
puts search_results
# append `search_results` to your csv
csv << search_results
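To make the difference concrete, here is a small sketch using only Ruby's csv library. The strings in `search_results` are made-up stand-ins for what `browser.divs(class: 'results_row').map(&:text)` might return, and writing one row per result is an assumption about the desired layout.

```ruby
require 'csv'

# Hypothetical stand-in for the texts returned by the browser call.
search_results = [
  "FUJI\n50mm lens\nmore product info",
  "FUJI\n50mm lens\nmore product info 2"
]

# %w(...) builds an array of literal words; it never looks up a variable.
literal = %w(Searchresults)

csv_text = CSV.generate do |csv|
  csv << literal                                 # writes the word itself
  search_results.each { |text| csv << [text] }   # one row per result
end

# Reading the data back shows that multi-line fields survive the round trip.
rows = CSV.parse(csv_text)
puts rows.inspect
```

Fields containing newlines are quoted by the csv library, so a spreadsheet may show them as multi-line cells.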

parse nested li inside ul and ol

I have a scenario in which, when an li comes under a ul, I need to replace it with a dot (.), and when an li comes under an ol, I need to replace it with a number.
But the problem is-
1) It is not doing for nested li
2) It is appending at the same level. Same level means as soon as it finds li it will first add dot(.) and then it will add number.
What I want
1) Whenever li comes inside ul it should add dot(.).
2) Whenever li comes inside ol it should add a number.
data = "<ol>\n<li>Introduction\n<ol>\n<li>hyy ssss</li>\n</ol>\n</li>\n<li>Description</li>\n<li>Observation</li>\n<li>Results</li>\n<li>Summary</li>\n</ol>\n<ul>\n<li>Introduction</li>\n<li>Description\n<ul>\n<li>Observation\n<ul>\n<li>Results\n<ul>\n<li>Summary</li>\n</ul>\n</li>\n</ul>\n</li>\n</ul>\n</li>\n<li>Overview</li>\n</ul>\n<p>All the testing regarding bullet points would have been covered with the above content. Hence publishing this content will make an entry in in the selected page, cricket page and so on.</p>\n"
content = Nokogiri::HTML.parse(data)
content.at('ul').children.xpath("//li").each { |li| li.inner_html="\u2022 "+li.inner_html }
content.at('ol').children.xpath("//li").each_with_index { |li,index| li.inner_html="#{index} "+li.inner_html }
Perhaps you need this:
content.css('ol').reverse.each do |ol|
ol.css('> li').each_with_index { |li,index| li.inner_html="#{index + 1} "+li.inner_html }
end
content.css('ul > li').reverse.each { |li| li.inner_html="\u2022 "+li.inner_html }
puts content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<ol>
<li>1 Introduction
<ol>
<li>1 hyy ssss</li>
</ol>
</li>
<li>2 Description</li>
<li>3 Observation</li>
<li>4 Results</li>
<li>5 Summary</li>
</ol>
<ul>
<li>• Introduction</li>
<li>• Description
<ul>
<li>• Observation
<ul>
<li>• Results
<ul>
<li>• Summary</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>• Overview</li>
</ul>
</body></html>
Reason for the reverse:
Consider the dom:
<ul>
<li>Description
<ul>
<li>Observation</li>
</ul>
</li>
</ul>
When you call content.css('ul > li'), you get the nodes in the order [description, observation]. Without reverse, the snippet changes the description first, and assigning its inner_html re-creates its children, so the observation node inside content becomes a new object. The loop then changes the old observation node, which is no longer referenced anywhere in content. That's why I reversed the list and processed children before parents: the child is changed first, then the parent, so the parent picks up the child's change and no unreferenced node is left behind.
Suppose the description node's id is 1234 and the observation node's id is 2345. When you mutate the description, it changes itself and also replaces its child (2345). The new object ids might be 3456 and 4567 respectively. The iteration then changes 2345, but that has no effect because content now holds 3456 -> 4567.
Hope this makes sense.

How can I count list elements in unordered list? (RSpec/Capybara)

In my site I have this list:
<ul class="test">
<li class="social_1"></li>
<li class="social_2"></li>
<li class="social_3"></li>
<li class="social_3"></li>
</ul>
My question is: how can I count li in my ul class test
I have tried this:
my_ul = page.find("ul[class='test']")
my_ul.each do |li|
pp li['class']
end
but it doesn't work.
Is there any way to do something like what I coded above?
Assuming the ul has a parent element with id=parent, you can do it like this:
# all returns the matching li elements as a collection
list = find('#parent ul').all('li')
Now you can get the list size simply:
list.size
Having all the li elements in a collection also lets you collect the text of each li like this:
list = find('#parent ul').all('li').collect(&:text)
I'd advise using the new RSpec 3 syntax for counting elements with Capybara:
it "should have 4 li elements" do
expect(find('ul.test')).to have_selector('li', count: 4)
end
More information here: https://github.com/jnicklas/capybara#querying
Use page.all("ul.test li").size

How do I have an Angular.dart filtered list update automatically

I have an html template that filters a list by the column property of the objects of that list like so:
<ul>
<li card-view
card-id="state.card"
ng-repeat="state in ctrl.game.states | filter:{column:'backlog'} "
ng-include="cardview.html">
</li>
</ul>
If I modify the column property in one of the elements of that list, the display does not update.
How can I make that happen?
Here's one option that uses an imaginary placeholder tag and avoids the | filter by replacing it with an ng-if, but I hope someone has a better answer than this one.
<ul>
<xx ng-repeat="state in ctrl.game.states">
<li card-view
card-id="state.card"
ng-if="state.column == 'backlog'"
ng-include="cardview.html">
</li>
</xx>
</ul>
Doing the ng-if and ng-repeat on the same element didn't work.

How do I remove white space between HTML nodes?

I'm trying to remove whitespace from an HTML fragment between <p> tags
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
as you can see, there always is a blank space between the <p> </p> tags.
The problem is that the blank spaces create <br> tags when saving the string into my database.
Methods like strip or gsub only remove the whitespace in the nodes, resulting in:
<p>FooBar</p> <p>barbarbar</p> <p>bla</p>
whereas I'd like to have:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
I'm using:
Nokogiri 1.5.6
Ruby 1.9.3
Rails
UPDATE:
Occasionally there are child nodes of the <p> tags that cause the same problem: white space between them.
Sample Code
Note: the code is normally on one line; I reformatted it because it would be unreadable otherwise...
<p>
<p>
<strong>Selling an Appartment</strong>
</p>
<ul>
<li>
<p>beautiful apartment!</p>
</li>
<li>
<p>near the train station</p>
</li>
.
.
.
</ul>
<ul>
<li>
<p>10 minutes away from a shopping mall </p>
</li>
<li>
<p>nice view</p>
</li>
</ul>
.
.
.
</p>
How would I strip those white spaces as well?
SOLUTION
It turns out that I messed up using the gsub method and didn't further investigate the possibility of using gsub with regex...
The simple solution was adding
data = data.gsub(/>\s+</, "><")
It deleted whitespace between all different kinds of nodes... Regex ftw!
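A quick check of that regex in plain Ruby, first on the sample input and then on a made-up snippet with inline tags. The second input is an assumption added to show a side effect: the regex does not distinguish block elements from inline ones.

```ruby
data = '<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>'

# Remove any run of whitespace that sits between a closing ">" and an opening "<".
cleaned = data.gsub(/>\s+</, '><')
puts cleaned

# Hypothetical input: the space between inline elements is removed too,
# so "Foo Bar" would render as "FooBar".
inline = '<p><b>Foo</b> <i>Bar</i></p>'
inline_cleaned = inline.gsub(/>\s+</, '><')
puts inline_cleaned
```

If the markup only ever contains block-level tags next to each other, as in the question, the side effect does not matter.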
This is how I'd write the code:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT
doc.search('p, ul, li').each { |node|
next_node = node.next_sibling
next_node.remove if next_node && next_node.text.strip == ''
}
puts doc.to_html
It results in:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
Breaking it down:
doc.search('p, ul, li')
looks for only the <p>, <ul>, and <li> nodes in the document. Nokogiri returns a NodeSet from search, which is empty if nothing matched. The code loops over the NodeSet, looking at each node in turn.
next_node = node.next_sibling
gets the next sibling node following the current node, or nil if there is none.
next_node.remove if next_node && next_node.text.strip == ''
next_node.remove removes next_node from the DOM if it isn't nil and its text is empty when stripped; in other words, if the node contains only whitespace.
There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.
A first solution is to remove empty text nodes. A quick way to do this for your exact case is:
require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.css('body').first.children.map{|node| node.to_s.strip}.compact.join
This won't work for nested elements as-is, but it should give you a good starting point.
UPDATE:
You can actually optimise a little with:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.children.map{|node| node.to_s.strip}.compact.join
Here are the tasks you are likely looking for when dealing with unnecessary whitespace (including Unicode whitespace) in parsed output.
html = "<p>A paragraph.<em> </em> <br><br><em>
</em></p><p><em> </em>
</p><p><em>
</em><strong><em>\" Quoted Text \" </em></strong></p>
<ul><li><p>List 1</p></li><li><p>List 2</p></li><li><p>List 3 </p>
<p><br></p><p><br><em> </em><br>
A text content.<br><em><br>
</em></p></li></ul>"
doc = Nokogiri::HTML.fragment(html)
doc.traverse { |node|
# removes any whitespace node
node.remove if node.text.gsub(/[[:space:]]/, '') == ''
# replace multiple consecutive spaces with a single space
node.content = node.text.gsub(/[[:space:]]{2,}/, ' ') if node.text?
}
# Gives you HTML with whitespace-only nodes (including <br> and empty <em> elements) removed and runs of spaces in text collapsed
puts doc.to_html
# Gives text of html, concatenating li items with a space between them
# By default li items text are concatenated without the space
Nokogiri::HTML(doc.to_html).xpath('//text()').map(&:text).join(' ')
#Output
# "A paragraph. \" Quoted Text \" \n List 1 \n List 2 \n \n List 3 \n A text content. \n \n"
# To Remove newline character '\n'
Nokogiri::HTML(doc.to_html).xpath('//text()').map(&:text).join(' ').gsub(/\n+/,'')
#Output
# "A paragraph. \" Quoted Text \" List 1 List 2 List 3 A text content."
Note: if you are parsing a complete HTML document rather than a fragment, you might have to replace traverse with another method such as search.
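data.squish (from ActiveSupport) is more readable, but note that it collapses each run of whitespace into a single space and trims the ends; it does not remove the single space left between tags.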
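For readers without ActiveSupport loaded, here is a plain-Ruby approximation of what squish does (collapse whitespace runs to one space, strip the ends). The `squish_like` name is hypothetical, and the input string is a made-up variant of the sample with extra whitespace.

```ruby
# Rough stand-in for ActiveSupport's String#squish.
def squish_like(string)
  string.gsub(/[[:space:]]+/, ' ').strip
end

data = "<p>Foo  Bar</p>\n\n <p>bar bar bar</p>   <p>bla</p> "
squished = squish_like(data)
puts squished

# Removing the remaining space between tags still needs a separate step.
tight = squished.gsub(/>\s+</, '><')
puts tight
```

So squish is a convenient first pass for normalising whitespace, with the tag-boundary regex applied afterwards if adjacent tags must touch.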
