Parsing contents of paragraph elements with Nokogiri - ruby-on-rails

I'd like to know the proper way to parse a block of content with Nokogiri.
I have some documents to parse in which each main container is a <p>. The main pieces of information within each one are divided up, oddly, with <font> tags.
A typical <p> looks like the following (some have a lot more content, some a lot less):
<p>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class="">
October 10, 1990 - Maybe a Title
</font>-
<font size="4" class="">
Some long text here.
<font color="#66CC00" class="">
[Blah Blah, October 27, 1982 p. 2
]
</font>.
More content.
<font color="#00FF33" class="">[Another Source, 1971, issue 01/4]
</font>.
</font>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class=""><font size="4" color="#00CCAA" class="">
Another fantastic article.
[Some Source, October 4, p.6]
</font>
</font>
</font>
</font>
</p>
Essentially, the "font size" attribute is what sets each component apart in the article. The main things to extract are the FIRST <font size="5"...> tag (the article date and main title, if a title is given) and then the actual content.
At present I pull out all the paragraph chunks with: doc.xpath('//p').each do |node|
However, I am not sure whether I should pass each chunk through Nokogiri again to parse out its contents or just run it all through a regex. I was hoping for a small example of doing this "properly", presumably with an embedded XPath query inside the initial block that pulls the elements out. I assume there is a way to pull out the sub-components based on the font size demarcation, but I've not seen a specific example of this yet.

Does that help you get started?
>> doc.xpath('//p').each do |node|
.. puts node.xpath("font[@size='5']/font").first.content.strip
.. end #=> 0
October 10, 1990 - Maybe a Title
Build similar expressions for the other parts you need and you are done :-)
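For what it's worth, here is a minimal sketch of the same idea (path lookups relative to each <p>) that also pulls out the body text. It is written in Python with the standard library's ElementTree rather than Nokogiri, purely so it runs without installing anything. It assumes the markup parses as well-formed XML (the sample in the question does; real pages may not), and the names SAMPLE, squash and parse_paragraph are made up for illustration. The two path expressions are plain XPath, so they should carry over to node.xpath in Nokogiri.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample paragraph from the question (first article only).
SAMPLE = """<p>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class="">
October 10, 1990 - Maybe a Title
</font>-
<font size="4" class="">
Some long text here.
<font color="#66CC00" class="">[Blah Blah, October 27, 1982 p. 2]</font>.
More content.
</font>
</font>
</p>"""


def squash(text):
    """Collapse runs of whitespace so multi-line text fits on one line."""
    return " ".join(text.split())


def parse_paragraph(p):
    """Return (heading, body) for one <p>, using paths relative to that <p>."""
    heading = p.find("font[@size='5']/font")            # first nested <font>
    body = p.find("font[@size='5']/font[@size='4']")    # the size-4 block
    return (
        squash("".join(heading.itertext())) if heading is not None else None,
        squash("".join(body.itertext())) if body is not None else None,
    )


root = ET.fromstring(SAMPLE)  # here the root element is the <p> itself
for p in root.iter("p"):
    heading, body = parse_paragraph(p)
    print(heading)  # October 10, 1990 - Maybe a Title
    print(body)     # Some long text here. [Blah Blah, October 27, 1982 p. 2]. More content.
```

The point is that you query each paragraph node again with a relative path instead of falling back to a regex. Splitting the date from the title is then an ordinary string operation on the heading.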

Related

Trying to Add A Link to A Carousel

SOLVED
I simply did not have the closing a tag at the end. Sheesh....
I need help. I'm still a student, so I'm still learning. Any help would be appreciated. I'm lost.
I'm using Bootstrap 5 and adding a carousel (the "With captions" example; see https://getbootstrap.com/docs/5.1/components/carousel/ for the full code) to a page, with a link on the h5 in that slide.
After adding the link, the text went from white to blue, and it gives me this:
"Cannot GET /WilliamVest/%E2%80%9CKeyWestPhotoGallery.html"
I've checked this address NUMEROUS times. Even said it out loud to myself. The link IS in the same folder as the rest of the project.
So why won't it work?
I tried to google it and found "data-bs-target". Does that have anything to do with it? Like I said, I'm still new, so still learning.
****Someone pointed out I missed the quotes in the link. That did not fix it.
I added that quote. It was discolored, so I don't know where that came from, but I think it's because you were right about the quotation style.
I commented that whole line out and started over. I just wanted to see the difference.
So now the blue is gone from the lettering, but it won't even open another page to show an error. It does nothing.
This is my code snippet ---- KeyWestPhotoGallery.html is the link I'm trying to get to.
<!-- Carousel of Projects -->
<div id="carouselExampleCaptions" class="carousel slide" data-bs-ride="carousel">
<div class="carousel-indicators">
<button type="button" data-bs-target="#carouselExampleCaptions" data-bs-slide-to="0" class="active" aria-current="true" aria-label="Slide 1"></button>
<button type="button" data-bs-target="#carouselExampleCaptions" data-bs-slide-to="1" aria-label="Slide 2"></button>
<button type="button" data-bs-target="#carouselExampleCaptions" data-bs-slide-to="2" aria-label="Slide 3"></button>
</div>
<div class="carousel-inner">
<div class="carousel-item active">
<img src="Images/palmLighthouse1.jpg" class="d-block w-100" alt="Key West Lighthouse">
<div class="carousel-caption d-none d-md-block">
<h5 class="display-2"><a href=“KeyWestPhotoGallery.html>Key West Photo Gallery</a></h5>
<!-- <p>Some representative placeholder content for the first slide.</p> -->
</div>
You have:
<a href=“KeyWestPhotoGallery.html>
That is not the straight quote " used in HTML. It is a Unicode Left Double Quotation Mark (U+201C), a.k.a. an opening typographer's quote, which is a display character with no special meaning in HTML. The browser therefore treats it as part of the URL, which is why it shows up percent-encoded in that address as %E2%80%9C.
let path = "/WilliamVest/%E2%80%9CKeyWestPhotoGallery.html"
console.log(decodeURI(path))
Solutions include the following:
Remove it: <a href=KeyWestPhotoGallery.html> (Unquoted attribute-value syntax)
Or, replace it with a straight quote and add the missing closing quote: <a href="KeyWestPhotoGallery.html"> (Double-quoted attribute-value syntax)
Or, use single quotes: <a href='KeyWestPhotoGallery.html'> (Single-quoted attribute-value syntax)
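If you want to confirm which character ended up in the URL, a small check like the one below names it for you. This is only a sketch in Python using the standard library. PATH is the path from the error message, and non_ascii_chars is a made-up helper name.

```python
import unicodedata
from urllib.parse import unquote

# Path copied from the "Cannot GET" error message in the question.
PATH = "/WilliamVest/%E2%80%9CKeyWestPhotoGallery.html"


def non_ascii_chars(path):
    """Percent-decode a URL path and name every non-ASCII character in it."""
    decoded = unquote(path)  # unquote decodes the bytes as UTF-8 by default
    return [(ch, unicodedata.name(ch)) for ch in decoded if ord(ch) > 127]


for ch, name in non_ascii_chars(PATH):
    print(f"U+{ord(ch):04X} {name}")  # U+201C LEFT DOUBLE QUOTATION MARK
```

Anything this reports in a plain file name is a good hint that a "smart quote" from a word processor or chat app was pasted into the markup.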

A Few Links On My Webpage Do Not Work

On my newly created webpage, I have a column of links that all work except for the top two. I have double checked the code and all of it seems to be uniform. I am not too sure why it is not allowing me to click on the two links at the top, but the rest seem to be working perfectly. If anyone could take a crack at what the problem may be I would really appreciate it! I have pasted the section of the code that I thought might contain the problem! Thanks!
<!DOCTYPEhtml>
<html>
<head>
<title> Rockwell Utilities </title>
<link href="rockwell.css" rel="stylesheet" type="text/css"/>
</head>
<div class="pos_top"><div id="bubbles"><img src="/rockwell/sepiawater.jpg" alt="Water Drop" height="250" width="1000"/></div></div>
<div id="header"><div id="logo"><img src="/rockwell/Rockwellnewnewedit.png" alt="Rockwell Utilities" height="400" width="500" /></div></div>
</body>
<div id="wrapper">
<h1> Rockwell Utilities Welcomes You! </h1>
<div id="intro">
<p>Rockwell Utilities is your number one choice in water and sewage.<br> We provide service to the Lakemoor, Illinois area,
<br>and have since 2007.
</p>
<img src="/rockwell/award.png" alt="Illinois Department of Health" height="200" width="500"/>
<a href="http://www.idph.state.il.us/public/press12/2011_Fluoridation_award_list.pdf"> <p>Rockwell
Utilities wins Illinois Department of Health Community Water Fluoridation Award four consecutive years!</p>
</a>
</div>
<div id="navigation_sidebar">
<div id="navigation_links">
<p>
<img src="/rockwell/emailus.png" alt="Email Us"/>
<img src="/rockwell/paymybill.png"alt="Pay My Bill"/>
<img src="/rockwell/calendar.png"alt="Calendar"/>
<img src="/rockwell/notices.png"alt="Notices"/>
<img src="/rockwell/paymentop.png"alt="Payment Options"/>
<img src="/rockwell/ourrates.png"alt="Our Rates"/>
</p>
Your logo (Rockwellnewnewedit.png) is overlapping the two links.
Have you double-checked that email.html and payments.html are in the same directory as calendar.html, notices.html, paymenttop.html, and rates.html? If the bottom 4 work and the top 2 give you a 404, that is the most likely problem.
I don't know if this causes the problems you describe, but what catches my eye is that there is no space between the src attribute value and the alt attribute, as in:
<img src="/rockwell/paymybill.png"alt="Pay My Bill"/>
http://i46.tinypic.com/dpvgo7.png
Your picture is overlapping the two links in this image. My suggestion is to shrink the image and use the browser's developer tools to see what is wrong, because with them you can see what the picture is doing.

How do I remove a tag from enclosing tag

<td align="center" nowrap=""><img border="0" src="bus0.gif" /><font style="color:darkblue;">030-
FP</font><br />將到站</td>
For the above HTML, I'd like to remove the img and font tags as well as font tag's enclosed text using JSoup. How should I go about doing that?
Thanks!
Edit: I would like to remove the img and font tags, so the output would be
<td align="center" nowrap="">將到站</td>
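//Assuming that you have a Document variable called doc
String img = doc.select("img").attr("src"); // src of the first <img>
String font = doc.select("font").text(); // text inside the <font> tags
//To remove the tags (and the font's text) rather than read them:
doc.select("img, font").remove(); // add br to the selector to drop the line break too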
And take a closer look at the API. It takes about 10 mins to find this out.

simple html dom parser and <span>

I hope anyone can help me with this.
I have an html code like this:
<div id="v4-95"><div id="v4-96" class="pview rs-pview"><table cellpadding="0" cellspacing="2" class="grid"><tr><td width="33%" class="gallery"><a name="item19c368bcd6"></a><table cellpadding="0" cellspacing="10" class="gallery"><tr><td class="picture camera" width="100%" height="140"><div class="image" style="width: 140px;"><img alt="Item image" title="Item image" src="http://thumbs3.ebaystatic.com/m/mvOLm6Tv8Lid54uveSlY80A/140.jpg" border="0"></div></td></tr><tr><td><div class="mi"></div></td></tr><tr><td class="details"><div class="ttl g-std"><a id="src110652603606" _sp="p4634.c0.m14.l1262" r="1" href="http://www.ebay.co.uk/itm/SAMSUNG-LTN156AT02-15-6-LAPTOP-SCREEN-NEW-/110652603606?pt=UK_Computing_LaptopAccess_RL&hash=item19c368bcd6" target="_parent" title="SAMSUNG LTN156AT02 15.6" LAPTOP SCREEN NEW">SAMSUNG LTN156AT02 15.6" LAPTOP SCREEN NEW</a><img src="http://q.ebaystatic.com/aw/pics/s.gif" width="16" alt="This seller accepts PayPal" height="16" class="ii iippl"></div><div><table cellpadding="0" cellspacing="0" class="fixed"><tr><td><img src="http://q.ebaystatic.com/aw/pics/bin_15x54.gif" alt="Buy It Now" title="Buy It Now"></td><td><span class="bin g-b">£41.50</span></td></tr>
I can retrieve the title with this code:
$html = file_get_html('http://stores.ebay.co.uk/LCD-Kings/15-6-/_i.html?_fsub=886314010&_sid=73271570&_trksid=p4634.c0.m322');
foreach($html->find('a') as $element)
echo $element->title . '<br>';
But I don't understand how I can retrieve the £41.50 inside the span, or why there is a space in the class "bin g-b"...
Thanks for your help...
It has a space in the class because that element has two classes. One is called bin, the other is called g-b. I'm guessing g-b refers to Great Britain so the price may be the span that has the class bin.
You haven't provided all the HTML but there may be an outer element that you can search for (such as: a div with id product and then, within that, find the price in the span with class bin).
You should look up the documentation of your DOM parser and see what arguments it supports for find(). If it supports something like #product span.bin (or similar syntax), then you can select the span with that class.
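To illustrate the "two classes" point, here is a rough sketch in Python (standard library only, not the PHP parser from the question) that treats the class attribute as a space-separated list and collects the text of every span whose list includes bin. FRAGMENT is the tail of the HTML from the question, and PriceParser and find_prices are made-up names. It does not try to handle spans nested inside the price span.

```python
from html.parser import HTMLParser

# Tail of the HTML posted in the question.
FRAGMENT = (
    '<tr><td><img src="http://q.ebaystatic.com/aw/pics/bin_15x54.gif" '
    'alt="Buy It Now" title="Buy It Now"></td>'
    '<td><span class="bin g-b">£41.50</span></td></tr>'
)


class PriceParser(HTMLParser):
    """Collect the text of every <span> whose class list includes 'bin'."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        # class="bin g-b" means two classes: "bin" and "g-b".
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "bin" in classes:
            self._in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices[-1] += data


def find_prices(html):
    """Return the text of all spans carrying the 'bin' class."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(find_prices(FRAGMENT))  # ['£41.50']
```

A selector such as span.bin expresses the same test: it matches on one class name regardless of what other classes the element carries.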

retrieving text in selenium using xpath?

Hi, I have this HTML code:
<tr class="odd events" style="">
<tr>
<a title="Expand to see Actions" class="toggleEvent" name="actions_for_host_1" href="#"></a>
<td id="7" colspan="6">
<div>
<ul>
<li>Low Alarm at 2010/06/25 07:09 ( <span title="2010/06/25 14:09 (UTC)" class="time_helper">Pacific</span> )</li>
</ul>
<div></div>
</div>
</td>
</tr>
How can I retrieve the <li> value using xpath and selenium ruby, so I get back "Low Alarm at 2010/06/25 07:09 Pacific"?
I don't know what the time and date will be but I know the <li> tag will always start with Low Alarm
thanks
sorry i couldn't phrase the question better.
You need to use the getText command with a suitable locator to retrieve the text. There are a few possible locators that you could use; some suggestions are listed below:
XPath:
locate the li based on the child span with the class time_helper
//li[span[@class='time_helper']]
locate the li based on the text starting with 'Low Alarm'
//li[starts-with(text(), 'Low Alarm')]
locate the li based on the text containing 'Low Alarm'
//li[contains(text(), 'Low Alarm')]
CSS:
locate the li based on the text starting with 'Low Alarm'
css=li:contains(^Low Alarm)
locate the li based on the text containing 'Low Alarm'
css=li:contains(Low Alarm)
Is there a reason why you're specifically trying to use an xpath locator for this?
A CSS locator may do the job just as well, e.g.
css=li:contains('Low Alarm')
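One thing to keep in mind is that the text of the <li> also contains the parentheses around the span, so you would probably need to tidy the string afterwards to get exactly "Low Alarm at 2010/06/25 07:09 Pacific". The sketch below shows that clean-up on a cut-down copy of the markup. It is Python with the standard library's ElementTree rather than Selenium, and since ElementTree's XPath subset has no starts-with(), the prefix test is done in plain Python. FRAGMENT and low_alarm_texts are names made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# The <td> from the question, cut down to a well-formed fragment.
FRAGMENT = """<td id="7" colspan="6">
<div>
<ul>
<li>Low Alarm at 2010/06/25 07:09 ( <span title="2010/06/25 14:09 (UTC)" class="time_helper">Pacific</span> )</li>
</ul>
<div></div>
</div>
</td>"""


def low_alarm_texts(root):
    """Return the cleaned-up text of every <li> that starts with 'Low Alarm'."""
    results = []
    for li in root.iter("li"):
        # li.text is only the text before the first child element.
        if (li.text or "").lstrip().startswith("Low Alarm"):
            full = "".join(li.itertext())     # includes the <span> text
            full = re.sub(r"[()]", "", full)  # drop the parentheses
            results.append(" ".join(full.split()))
    return results


print(low_alarm_texts(ET.fromstring(FRAGMENT)))
# ['Low Alarm at 2010/06/25 07:09 Pacific']
```

With Selenium you would apply the same two string steps (remove the parentheses, collapse the whitespace) to whatever getText returns for the located <li>.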
