How to get all elements via CSS class - ruby-on-rails

I am trying to scrape this page using Nokogiri to get all the elements with the class name "teaser".
If I check the page with jQuery, I can see there are 25 elements:
$(".teaser").length => 25
However, when using Nokogiri, I only get the first teaser:
teasers = doc.css('.teaser')
teasers.count => 1
Where am I going wrong? How do I get all the teasers?

That document appears to have a load of null bytes in it for some reason, and this is causing Nokogiri/LibXML to assume the document has finished part way through.
You should be able to fix it by preprocessing the contents to remove the nulls. If page contains the text of the webpage:
page.gsub!(/\x00/, '')
Then use Nokogiri on page as before.
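As a rough sketch of the preprocessing step, here is the same idea in plain Ruby with no parser involved. The page string is a made-up stand-in for the real response body, and String#delete is used in place of gsub! simply because it returns a new string and also works on frozen input; either approach removes the nulls.

```ruby
# Hypothetical response body with stray null bytes between two teasers.
page = "<div class=\"teaser\">one</div>\x00\x00<div class=\"teaser\">two</div>"

# String#delete returns a copy without the listed characters; unlike gsub!,
# it never returns nil and it also works on frozen strings.
clean = page.delete("\x00")

puts page.count("\x00")   # null bytes in the raw body
puts clean.count("\x00")  # null bytes left after cleaning
```

The cleaned string is what you would then hand to Nokogiri.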

Related

Nokogiri isn't picking up direct descendant link tags in CSS selectors [closed]

I'm using the Nokogiri gem in a Ruby on Rails app and running into a weird issue. Here's the HTML tree I'm dealing with:
Using Nokogiri on the parent HTML doc, I can successfully traverse the tree like this:
y[0].css("div.postContainer.opContainer div.post.op")[0]['id']
# => "p25273352"
y[0].css("div.postContainer.opContainer div.post.op")[0].css(" > div").length
# => 3
y[0].css("div.postContainer.opContainer div.post.op")[0].css(" > blockquote").length
# => 1
However, when I attempt to do the same thing for the a or span tags, it can't find any direct descendants:
y[0].css("div.postContainer.opContainer div.post.op")[0].css(" > a").length
# => 0
y[0].css("div.postContainer.opContainer div.post.op")[0].css(" > span").length
# => 0
Feel like I must be missing something obvious here but can't figure it out. Any ideas?
I think you're making it too hard.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="a b">
  <div></div>
  <span>foo</span>
  <a>bar</a>
</div>
EOT
doc.at('div.a span').to_html # => "<span>foo</span>"
doc.at('div.a a').to_html # => "<a>bar</a>"
or:
doc.at('div.a > span').to_html # => "<span>foo</span>"
doc.at('div.a > a').to_html # => "<a>bar</a>"
I'd access the child span/a nodes using at rather than css("blah")[0] or css("blah").first:
From the tutorial:
If you know you're going to get only a single result back, you can use the shortcuts at_css and at_xpath instead of having to access the first element of a NodeSet.
@doc.css("dramas name").first # => "<name>The A-Team</name>"
@doc.at_css("dramas name") # => "<name>The A-Team</name>"
at is the generic version of at_css and at_xpath and accepts either CSS or XPath selectors.
There's a possibility the real HTML is mangled, which can cause problems locating a node when parsing. See the errors method. Never trust the browser's view of the HTML. Browsers will fix mangled HTML, basically rewriting it like Nokogiri will, but they might do it differently. Instead, always view your HTML using wget, curl or nokogiri at the command-line.
Feel like I must be missing something obvious here but can't figure it out. Any ideas?
JavaScript.
You have posted a screenshot from what looks like the Chrome console. JavaScript runs in the browser and can modify the DOM, adding or removing HTML elements.
Your Ruby code most likely isn't running in Chrome though. Check the page source (Ctrl+U) or dump the response you are getting with your Ruby HTTP client and make sure that the elements you are trying to get with Nokogiri are actually there.
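One way to do that check from Ruby is sketched below, using only the standard library. The helper names fetch_raw_html and class_in_source? are made up for this illustration, the URL in the usage comment is a placeholder, and the regex is only a crude test of the raw markup, not a substitute for a real parser.

```ruby
require 'net/http'
require 'uri'

# Fetch the HTML exactly as the server sends it, before any JavaScript runs.
# (Hypothetical helper; Net::HTTP.get does not follow redirects.)
def fetch_raw_html(url)
  Net::HTTP.get(URI(url))
end

# Crude check: does any class attribute in the raw markup list class_name?
def class_in_source?(html, class_name)
  html.scan(/class\s*=\s*["']([^"']*)["']/).flatten.any? do |value|
    value.split.include?(class_name)
  end
end

# Usage (placeholder URL, not from the question):
#   html = fetch_raw_html('https://example.com/some/thread')
#   puts class_in_source?(html, 'post')
```

If the class never shows up in the raw response, the elements are most likely being added by JavaScript after the page loads, and Nokogiri will never see them.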

React Component not rendered properly with Turbolinks in Rails 5.1

I have a very simple Rails app with a react component that just displays "Hello" in an existing div element in a particular page (let's say the show page).
When I load the related page using its URL, it works. I see Hello on the page.
However, when I'm on another page first (let's say the index page) and then go to the show page using Turbolinks, the component is not rendered unless I go back and forth again (going back to the index page and returning to the show page).
From then on, every time I go back and forth, the view is rendered two more times. Not just twice, but two more times each round (i.e. 2 times, then 4, then 6, etc.).
I know this because I log a message to the console at the same time as I set the content of the div.
In fact, I guess that going back to the index page still runs the component code without displaying anything, since the div element is not on the index page. But why in a cumulative manner?
The problems I want to solve are:
To get the code run on the first request of the show page
To block the code from running in other pages (including the index page)
To get the code run once on subsequent requests of the show page
Here are the exact steps and code I used (I'll try to be as concise as possible).
I have a Rails 5.1 app with react installed with:
rails new myapp --webpack=react
I then create a simple Item scaffold to get some pages to play with:
rails generate scaffold Item name
I just add the following div element in the Show page (app/views/items/show.html.erb):
<div id=hello></div>
Webpacker already generated a Hello component (hello_react.jsx), which I modified as follows in order to use the above div element. I changed the original 'DOMContentLoaded' event:
document.addEventListener('turbolinks:load', () => {
  console.log("DOM loaded..");
  var element = document.getElementById("hello");
  if(element) {
    ReactDOM.render(<Hello name="React" />, element)
  }
})
I then added the following webpack script tag at the bottom of the previous view (app/views/items/show.html.erb):
<%= javascript_pack_tag("hello_react") %>
I then run the Rails server and the webpack-dev-server using foreman start (installed by adding gem 'foreman' to the Gemfile). Here is the content of the Procfile I used:
web: bin/rails server -b 0.0.0.0 -p 3000
webpack: bin/webpack-dev-server --port 8080 --hot
And here are the steps to follow to reproduce the described behavior:
Load the index page using the URL http://localhost:3000/items
Click New Item to add a new item. Rails redirects to the item's show page at the URL localhost:3000/items/1. Here we can see the Hello React! message. It works well!
Reload the index page using the URL http://localhost:3000/items. The item is displayed as expected.
Reload the show page using the URL http://localhost:3000/items/1. The Hello message is displayed as expected with one console message.
Reload the index page using the URL http://localhost:3000/items
Click the Show link (this should go through Turbolinks). Neither the message nor the console message is shown.
Click the Back link (this should go through Turbolinks) to go to the index page.
Click the Show link again (this should go through Turbolinks). This time the message is displayed correctly. The console message, however, is shown twice.
From there, each time I go back to the index page and return to the show page, two more messages are logged to the console.
Note: Instead of using (and replacing) a particular div element, if I keep the original hello_react file, which appends a div element, this behavior is even more noticeable.
Edit: Also, if I change the link_to links to include data: {turbolinks: false}, it works well, just as if we had loaded the pages using the URLs in the browser address bar.
I don't know what I'm doing wrong..
Any ideas?
Edit: I put the code in the following repo if interested to try this:
https://github.com/sanjibukai/react-turbolinks-test
This is quite a complex issue, and I am afraid I don't think it has a straightforward answer. I will explain as best I can!
To get the code run on the first request of the show page
Your turbolinks:load event handler is not running because your code is run after the turbolinks:load event is triggered. Here is the flow:
User navigates to show page
turbolinks:load triggered
Script in body evaluated
So the turbolinks:load event handler won't be called (and therefore your React component won't be rendered) until the next page load.
To (partly) solve this you could remove the turbolinks:load event listener, and call render directly:
ReactDOM.render(
  <Hello name="React" />,
  document.body.appendChild(document.createElement('div'))
)
Alternatively you could use <%= content_for … %>/<%= yield %> to insert the script tag in the head. e.g. in your application.html.erb layout
…
<head>
…
<%= yield :javascript_pack %>
…
</head>
…
then in your show.html.erb:
<%= content_for :javascript_pack, javascript_pack_tag('hello_react') %>
In both cases, it is worth noting that any HTML you add to the page with JavaScript in a turbolinks:load block should be removed on turbolinks:before-cache to prevent duplication issues when revisiting pages. In your case, you might do something like:
var div = document.createElement('div')
ReactDOM.render(
  <Hello name="React" />,
  document.body.appendChild(div)
)
document.addEventListener('turbolinks:before-cache', function () {
  ReactDOM.unmountComponentAtNode(div)
})
Even with all this, you may still encounter duplication issues when revisiting pages. I believe this is to do with the way in which previews are rendered, but I have not been able to fix it without disabling previews.
To get the code run once on subsequent requests of the show page
To block the code from running in other pages (including the index page)
As I have mentioned above, including page-specific scripts dynamically can create difficulties when using Turbolinks. Event listeners in a Turbolinks app behave very differently from those in an app without Turbolinks, where each page gets a new document and therefore the event listeners are removed automatically. Unless you manually remove the event listener (e.g. on turbolinks:before-cache), every visit to that page will add yet another listener. What's more, if Turbolinks has cached that page, a turbolinks:load event will fire twice: once for the cached version, and once for the fresh copy. This is probably why you were seeing it rendered 2, 4, 6 times.
With this in mind, my best advice is to avoid adding page-specific scripts to run page-specific code. Instead, include all your scripts in your application.js manifest file, and use the elements on your page to determine whether a component gets mounted. Your example does something like this in the comments:
document.addEventListener('turbolinks:load', () => {
  var element = document.getElementById("hello");
  if(element) {
    ReactDOM.render(<Hello name="React" />, element)
  }
})
If this is included in your application.js, then any page with a #hello element will get the component.
Hope that helps!
I was struggling with a similar problem (the link_to helper method was changing the URL but the React content was not loaded; I had to refresh the page manually to load it properly). After some googling I found a simple workaround on this page.
<%= link_to "Foo", new_rabbit_path(@rabbit), data: { turbolinks: false } %>
Since this causes a full page refresh when the link is clicked, now my react pages are loaded properly. Maybe you will find it useful in your project as well :)
Based on what you said, I tested some code.
First, I simply pulled the ReactDOM.render call out of the listener, as you suggested in your first snippet.
This is a big step forward, since the message is no longer displayed elsewhere (like on the index page) but only on the show page, as wanted.
But something interesting happens on the show page. The message is no longer accumulated as appended div elements, which is good. In fact, it's displayed only once, as wanted. But the console message is displayed twice!?
I guess that something related to the caching mechanism is going on here, but since the message is supposed to be appended, why isn't it displayed twice like the console message?
Putting this issue aside, this seems to work, and I wonder why it's necessary in the first place to put the React rendering after the page is loaded (without Turbolinks there was the DOMContentLoaded event listener).
I guess that this has to do with unexpected rendering by JavaScript code that executes when some DOM elements are yet to be loaded.
Then, I tried your alternative way using <%= content_for … %>/<%= yield %>.
And as you expected, this gives mixed results and some weird behavior.
When I load the index page via the URL and then go to the show page using Turbolinks, it works!
The div message and the console message are each shown once.
Then if I go back (using Turbolinks), the div message is gone and I get the ".. unmounted.." console message, as wanted.
But from then on, whenever I return to the show page, neither the div nor the console message is displayed at all.
The only message that's displayed is the ".. unmounted.." console message whenever I go back to the index page.
Worse, if I load the show page using the URL, the div message is not displayed anymore!? The console message is displayed, but I get an error regarding the div element (Cannot read property 'appendChild' of null).
I have to admit that I have no idea what's happening here.
Lastly, I tried your final piece of advice and simply put the last code snippet in the HTML head.
Since this is JSX code, I don't know how to handle it within the Rails asset pipeline / file structure, so I put my javascript_pack_tag in the HTML head.
And indeed, this works well.
This time the code is executed everywhere, so it makes sense to use a page-specific element (as previously intended in the commented code).
The downside is that the code could get messy unless I put all page-specific code inside if statements that test for the presence of the page-specific element.
However, since Rails/Webpack has a good code structure, it should be easy to put page-specific code into page-specific jsx files.
Nevertheless, the benefit is that all the page-specific parts are rendered at the same time as the whole page, thus avoiding a display glitch that occurs otherwise.
I didn't raise this issue in the first place, but I would indeed like to know how to get page-specific content rendered at the same time as the whole page.
I don't know if this is possible when combining Turbolinks with React (or any other framework).
But in conclusion, I'll leave this question for later.
Thank you for your contribution, Dom.

Print web page with original look

I want to implement print functionality so that the user can print out the web form and use it as a paper form for the same purpose. Of course, I do not need the whole page header and footer to be printed, just the content of a div that takes up most of the page. I played around with print media CSS and managed to get the print result to look almost like the original page. But then I tried to print it in another browser (Chrome; before that I had tried Mozilla) and it was all messed up.
For the web form I use the CSS framework Twitter Bootstrap, and I had to override its CSS (in print media) for almost every element individually to get a normal look in the print result.
My question is: is there some way (framework/plugin) to print just what you see on the page, maybe as an image or something?
Any other suggestions are welcome.
Thanks.
If you are familiar with PHP you can try the PHP class files of TCPDF or those of FPDF.
Or there is also dompdf which renders HTML to PDF, but this will include more than just the information of one div.
And for further info here is a post on Stack where users are discussing which they think is best.

multi line tag in grails or html

With a grails app and from a local database, I'm returning some text in a xml format.
I can return it well formed in a <textarea></textarea> tag with the correct indenting (tabulation, line return,...etc.)
I want to go a bit further. In the text I'm returning, there are some <img/> tags and I'd like to replace those tag by the real images themselves.
I searched around and have found no solution so far. I understand that you can't add an image to a textarea (other than as a background), and if I choose a div tag, I won't have the indenting anymore (which makes it harder to read).
I was wondering if using a <g:textField/> or another tag from the Grails library would do the trick. And if so, how can I append them to a page using jQuery?
For example, how to append a <g:textField/> in jquery. It doesn't interpret it and I get this error
SyntaxError: missing ) after argument list [Break On This Error]...+doc).append("<input type="text" id="FTMAP_"+nb_sec+"" ...
And in my javascript file, I have
$("#FTM_"+doc).append("<g:textField id='FTMAP_"+nb_sec+"' ... />
Any possible solutions ?
EDIT
I did forget to mention that my final intentions are to be able to modify the text (tags included) and to have a nice and neat indentation so that it is the easiest possible for the end user.
You are asking a few different questions:
1. Can I use a single HTML tag to include images inside pre-formatted text.
No. You will have to parse the text and translate it into styled text yourself.
2. Is there a tag in the grails standard tags to accomplish this for me?
No.
3. How can I add grails tags from my javascript code.
Grails tags are processed on the server-side, and javascript is processed on the client. This means you cannot directly add grails tags via javascript.
There are a couple methods that can accomplish the same result, however:
You can set a javascript variable to the rendered content of a grails tag. This solution is good for data that is known at the time of the initial request.
var tagOutput = "${g.textField(/* etc */)}";
You can make an ajax request for the content to be added. Then your server-side grails code can render the tags you need. This is better for realtime data, or data that will be updated more than once on a single rendered page.

Slim lang inline lists not working

I'm converting some existing HTML files to Slim (https://github.com/stonean/slim) and using it for the first time but I'm having problems getting lists to work in compact form (meaning all on one line rather than indented below). The docs say:
Inline tags
Sometimes you may want to be a little more compact and inline the
tags.
ul
  li.first: a href="/a" A link
  li: a href="/b" B link
But when I try that I get this output in the browser:
a href="/b" B
With the rendered HTML looking like this in the source:
<li:>a href="/b" B link</li:>
Any ideas why this isn't working and how to fix it?
Your syntax is correct and the output for me (slim 1.3.0) is, as expected:
<ul><li class="first"><a href="/a">A link</a></li><li><a href="/b">B link</a></li></ul>
You should check your slim version and update appropriately.
