simple html dom parser and <span> - html-parsing

I hope someone can help me with this.
I have HTML like this:
<div id="v4-95"><div id="v4-96" class="pview rs-pview"><table cellpadding="0" cellspacing="2" class="grid"><tr><td width="33%" class="gallery"><a name="item19c368bcd6"></a><table cellpadding="0" cellspacing="10" class="gallery"><tr><td class="picture camera" width="100%" height="140"><div class="image" style="width: 140px;"><img alt="Item image" title="Item image" src="http://thumbs3.ebaystatic.com/m/mvOLm6Tv8Lid54uveSlY80A/140.jpg" border="0"></div></td></tr><tr><td><div class="mi"></div></td></tr><tr><td class="details"><div class="ttl g-std"><a id="src110652603606" _sp="p4634.c0.m14.l1262" r="1" href="http://www.ebay.co.uk/itm/SAMSUNG-LTN156AT02-15-6-LAPTOP-SCREEN-NEW-/110652603606?pt=UK_Computing_LaptopAccess_RL&hash=item19c368bcd6" target="_parent" title="SAMSUNG LTN156AT02 15.6" LAPTOP SCREEN NEW">SAMSUNG LTN156AT02 15.6" LAPTOP SCREEN NEW</a><img src="http://q.ebaystatic.com/aw/pics/s.gif" width="16" alt="This seller accepts PayPal" height="16" class="ii iippl"></div><div><table cellpadding="0" cellspacing="0" class="fixed"><tr><td><img src="http://q.ebaystatic.com/aw/pics/bin_15x54.gif" alt="Buy It Now" title="Buy It Now"></td><td><span class="bin g-b">£41.50</span></td></tr>
I can retrieve the title with this code:
$html = file_get_html('http://stores.ebay.co.uk/LCD-Kings/15-6-/_i.html?_fsub=886314010&_sid=73271570&_trksid=p4634.c0.m322');
foreach($html->find('a') as $element)
    echo $element->title . '<br>';
But I don't understand how I can retrieve the £41.50 inside the span, or why there is a space in the class "bin g-b"...
Thanks for your help...

It has a space in the class because that element has two classes. One is called bin, the other is called g-b. I'm guessing g-b refers to Great Britain so the price may be the span that has the class bin.
You haven't provided all the HTML but there may be an outer element that you can search for (such as: a div with id product and then, within that, find the price in the span with class bin).
You should look up the documentation of your DOM parser and see what arguments it supports for find(). If it supports something like #product span.bin (or similar syntax), then you can select the span with that class.
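To illustrate the point about multiple classes, here is a minimal sketch in Python rather than PHP, using only the standard library. It is not the Simple HTML DOM API; the class name BinPriceParser and the function extract_bin_prices are made up for this example. It splits the class attribute on whitespace and keeps the text of every span whose class list contains bin:

```python
from html.parser import HTMLParser


class BinPriceParser(HTMLParser):
    """Collects the text of every <span> whose class list contains 'bin'.

    Hypothetical helper for illustration only; nested spans are not handled.
    """

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_bin_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # class="bin g-b" means two separate classes: "bin" and "g-b"
            classes = (dict(attrs).get("class") or "").split()
            if "bin" in classes:
                self._in_bin_span = True
                self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_bin_span = False

    def handle_data(self, data):
        if self._in_bin_span:
            self.prices[-1] += data


def extract_bin_prices(html):
    """Return the stripped text of all span.bin elements in the markup."""
    parser = BinPriceParser()
    parser.feed(html)
    return [price.strip() for price in parser.prices]


sample = '<td><span class="bin g-b">£41.50</span></td>'
print(extract_bin_prices(sample))
```

The same idea in a selector-based parser would be a lookup along the lines of span.bin, if the parser supports class selectors.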

Related

Create photoswipe on UL>LI>FIGURE markup

In the PhotoSwipe docs the markup is div/figure/img, but I want different markup.
How do I go about "Creating an Array of Slide Objects" for this ul/li/figure/img markup? I know I need to somehow edit the "var initPhotoSwipeFromDOM = function(gallerySelector) {" function, but I do not know what changes I need to make.
This is my markup:
<ul class="my-gallery" itemscope itemtype="http://schema.org/ImageGallery">
<li>
<figure>
<a href="large-image.jpg" data-size="600x400">
<img src="small-image.jpg" itemprop="thumbnail"/>
</a>
<figcaption itemprop="caption description">Image caption</figcaption>
</figure>
</li>
</ul>
A related question I saw on the internet:
https://codedump.io/share/Hc9do6CIJgwH/1/how-do-i-get-photoswipe-to-recognize-entire-gallery-from-list-of-thumbnail-images
You must traverse the DOM correctly and pass the proper elements. I am not able to explain it; it's just about understanding how and which nodes are selected. Here's a gist: https://gist.github.com/TMMC/6ec51c46d9fa57e1fd6a480f0d5da86d. I had the same issue with exactly the same code. Look for comments starting with "make it works with".
Try ".my-gallery > li > figure > a" as gallerySelector
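As a language-neutral illustration of the traversal, here is a rough sketch in Python (not the PhotoSwipe JavaScript itself). It walks the ul/li/figure/a markup from the question and builds one slide object per link. The keys src, w, h, msrc and title are assumed to mirror the slide objects described in the PhotoSwipe docs, and GalleryParser and build_slides are hypothetical names:

```python
from html.parser import HTMLParser


class GalleryParser(HTMLParser):
    """Builds one slide dict per <a data-size="WxH"> found in the markup."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_link = False
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "data-size" in attributes:
            # data-size="600x400" carries the full-size image dimensions
            width, height = attributes["data-size"].split("x")
            self.items.append(
                {"src": attributes.get("href"), "w": int(width), "h": int(height)}
            )
            self._in_link = True
        elif tag == "img" and self._in_link:
            # the thumbnail inside the link
            self.items[-1]["msrc"] = attributes.get("src")
        elif tag == "figcaption" and self.items:
            self._in_caption = True
            self.items[-1]["title"] = ""

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        if self._in_caption:
            self.items[-1]["title"] += data


def build_slides(html):
    """Return the list of slide dicts for the given gallery markup."""
    parser = GalleryParser()
    parser.feed(html)
    return parser.items


markup = """
<ul class="my-gallery">
<li><figure>
<a href="large-image.jpg" data-size="600x400">
<img src="small-image.jpg" itemprop="thumbnail"/>
</a>
<figcaption itemprop="caption description">Image caption</figcaption>
</figure></li>
</ul>
"""
print(build_slides(markup))
```

The point is only that the slide data hangs off each figure's link, whatever elements wrap the figure; the DOM-walking code in initPhotoSwipeFromDOM has to be adjusted to reach those links through the extra li level.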

Recommended approach to implementing inline editing for a MVC grid please?

I am using MVC3, C#, Razor, EF4.1
I have implemented grids in their simplest form, i.e. Razor tables. At present, editing of record fields happens off page: the user clicks "Edit", the edit page appears, they fill in the data and save, which returns them to the main grid page.
I need an inline solution for cases where only 1 or 2 fields need updating. Typically the user would click either on the row or on an "Edit" link, and the row would change to "edit mode". They would then edit the data and click "Save", and the row would revert to read-only or the grid would refresh. Can you recommend a simple and robust solution for this? At present I am not considering third-party components such as Telerik Kendo UI grids, although in the near future I will no doubt upgrade to something like that. For now I want to keep it really simple.
Thoughts, wisdom, recommendations appreciated.
Many thanks.
EDIT:
Thanks all. I am going to give these suggestions a try.
Here is the simplest way of doing it; see the fiddle.
Save all your data using a JSON web service. You'll end up with either an array of cells or an array of arrays of cells. (Alternatively, you can put the JSON in a hidden input box.)
Use the $.data API and put all the information the server needs for saving in data attributes.
You'll end up with something as simple as:
var f = $('#myform'),
    t = $('table'),
    inputs = t.find('input'),
    b1 = $('button.save1'),
    b2 = $('button.save2'),
    ta = $('#save');
// update the stored 'val' data when an input's value changes
t.on('change', 'input', (e) => $(e.target).data('val', e.target.value));
// store everything in $.data/data-* attributes
b1.on('click', () => {
    var data = [];
    inputs.each((i, inp) => data.push($(inp).data()));
    ta.text(JSON.stringify(data));
});
// use $.serializeArray
b2.on('click', () => {
    var data = f.serializeArray();
    ta.text(JSON.stringify(data));
});
input {border : 1px solid #fff;margin:0; font-size:20px; }
input:focus { outline: 1px solid #eee; background-color:#eee; }
table { border : 1px solid #999; border-collapse:collapse;border-spacing:0; }
table td { padding:0; margin:0;border:1px solid #999; }
table th { background-color: #aaa; min-width:20px;border:1px solid #999; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<form name='myform' id='myform'>
<table>
<tr>
<th></th>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
<tr data-row="0">
<th>1</th>
<td><input type="text" data-row="0" data-col="0" data-val="a" value="a" name='data[0][0]'/></td>
<td><input type="text" data-row="0" data-col="1" data-val="b" value="b" name='data[0][1]'/></td>
<td><input type="text" data-row="0" data-col="2" data-val="c" value="c" name='data[0][2]'/></td>
</tr>
<tr data-row="1">
<th>2</th>
<td><input type="text" data-row="1" data-col="0" data-val="d" value="d" name='data[1][0]'/></td>
<td><input type="text" data-row="1" data-col="1" data-val="e" value="e" name='data[1][1]'/></td>
<td><input type="text" data-row="1" data-col="2" data-val="f" value="f" name='data[1][2]'/></td>
</tr>
<tr data-row="2">
<th>3</th>
<td><input type="text" data-row="2" data-col="0" data-val="g" value="g" name='data[2][0]' /></td>
<td><input type="text" data-row="2" data-col="1" data-val="h" value="h" name='data[2][1]' /></td>
<td><input type="text" data-row="2" data-col="2" data-val="i" value="i" name='data[2][2]' /></td>
</tr>
</table>
</form>
<div name="data" id="save" cols="30" rows="10"></div>
<button class='save1'>Save 1</button>
<button class='save2'>Save 2</button>
This assumes that you generate your table in the Razor view, so you don't need to load data into the table on the client. You are "loading" the data on the server and saving changes with the tiny JS snippet above.
You can also style the input cells in the table so they look different with and without focus, making the table look like an Excel spreadsheet (just the look, without the fancy Excel features).
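On the receiving side, the "Save 2" payload is a flat list of name/value pairs with names like data[0][1]. As a rough sketch of how a server could turn that back into rows and columns, here is a Python version (the question uses ASP.NET MVC, so this only shows the shape of the logic; cells_to_grid is a hypothetical helper name):

```python
import json
import re

# Matches field names such as "data[2][1]" and captures row and column.
CELL_NAME = re.compile(r"^data\[(\d+)\]\[(\d+)\]$")


def cells_to_grid(serialized):
    """Turn [{'name': 'data[r][c]', 'value': v}, ...] into a list of rows."""
    cells = {}
    for field in serialized:
        match = CELL_NAME.match(field["name"])
        if match is None:
            continue  # ignore fields that are not grid cells
        row, col = int(match.group(1)), int(match.group(2))
        cells[(row, col)] = field["value"]
    if not cells:
        return []
    n_rows = max(row for row, _ in cells) + 1
    n_cols = max(col for _, col in cells) + 1
    # Missing cells are filled with an empty string.
    return [[cells.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


payload = json.loads(
    '[{"name": "data[0][0]", "value": "a"},'
    ' {"name": "data[0][1]", "value": "b"},'
    ' {"name": "data[1][0]", "value": "d"}]'
)
print(cells_to_grid(payload))
```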
Well, in that case I suggest adding a div with a unique id to each grid row.
Then, on click of the edit button, use JavaScript to insert a row containing text boxes filled with the current values.
Using knockout.js is my preferred approach, and in my opinion, is simple to get started with but flexible enough to keep up with project demands.
Here are examples:
http://www.knockmeout.net/2011/03/guard-your-model-accept-or-cancel-edits.html
http://knockoutjs.com/examples/gridEditor.html
If you think this is for you then take an hour or two and go through the tutorials, it's well worth the time:
http://learn.knockoutjs.com/
I have implemented exactly what you are asking for, but I cannot assure you that it is robust. It definitely is not simple. Based on the article Get the Most out of WebGrid in ASP.NET MVC by Stuart Leeks I have created an MVC project which I have heavily modified with my own javascript. In the end I have come up with a solution that works but could be vastly improved. Took me at least a week to implement.
I wrote a tutorial on implementing an inline editable grid using MVC and Knockout.js, with source code:
http://www.anhbui.net/blog?id=kojs-1

How do I remove a tag from enclosing tag

<td align="center" nowrap=""><img border="0" src="bus0.gif" /><font style="color:darkblue;">030-
FP</font><br />將到站</td>
For the above HTML, I'd like to remove the img and font tags as well as font tag's enclosed text using JSoup. How should I go about doing that?
Thanks!
Edit: I would like to remove the img and font tags, so the output would be
<td align="center" nowrap="">將到站</td>
// Assuming that you have a Document variable called doc
doc.select("img, font").remove(); // removes both elements, including the font's enclosed text
doc.select("td > br").remove();   // the desired output has no <br />, so drop it as well
And take a closer look at the API. It takes about 10 mins to find this out.
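One detail worth seeing in isolation is that the text to keep (將到站) sits after the removed elements, so whatever removes them must not take the trailing text with it. A small sketch of that logic in Python's standard library (not jsoup; strip_children is a hypothetical helper, and the snippet is treated as well-formed XML for simplicity):

```python
import xml.etree.ElementTree as ET


def strip_children(parent, tags):
    """Remove direct children whose tag is in `tags`, keeping their tail text."""
    previous = None
    for child in list(parent):
        if child.tag in tags:
            tail = child.tail or ""
            if previous is None:
                # No kept sibling before this one: the text moves up to the parent.
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            previous = child


td = ET.fromstring(
    '<td align="center" nowrap=""><img border="0" src="bus0.gif" />'
    '<font style="color:darkblue;">030-FP</font><br />將到站</td>'
)
strip_children(td, {"img", "font", "br"})
print(ET.tostring(td, encoding="unicode"))
```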

Parsing contents of paragraph elements with Nokogiri

I'd like to know the proper way to parse a block of contents with Nokogiri:
I have some documents to parse where they originally contained a format where each main container was a <p>. The main pieces of information within each one are divided up, oddly, with <font> tags.
Effectively a stock sample of <p> contents contains the following and is a typical example (some have a lot more content, some a lot less):
<p>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class="">
October 10, 1990 - Maybe a Title
</font>-
<font size="4" class="">
Some long text here.
<font color="#66CC00" class="">
[Blah Blah, October 27, 1982 p. 2
]
</font>.
More content.
<font color="#00FF33" class="">[Another Source, 1971, issue 01/4]
</font>.
</font>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class=""><font size="4" color="#00CCAA" class="">
Another fantastic article.
[Some Source, October 4, p.6]
</font>
</font>
</font>
</font>
</p>
Essentially the "font size" attribute is what sets each component apart in the article. The main points to extract are the FIRST <font size ="5"... (that is the article date and main title, if a title is given) tags, then the actual content.
Presently I have all paragraph chunks coming out with: doc.xpath('//p').each do |node|
However, I am not sure if I should pass it through Nokogiri again to parse out its contents or if I should just run it all through a regex. I was hoping for a small example of doing this "properly", which I assume means using an embedded XPath query within the initial block to pull the elements out. I assume there is a way to pull out the sub-components based on the font size demarcation, but I have simply not seen a specific example of this yet.
Does that help you get started?
>> doc.xpath('//p').each do |node|
..   puts node.xpath("font[@size='5']/font").first.content.strip
.. end #=> 0
October 10, 1990 - Maybe a Title
Build similar expressions for the other parts you need and you are done :-)
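As one example of a "similar expression", here is a sketch of the same relative-path idea in Python's xml.etree.ElementTree rather than Nokogiri. It treats a shortened copy of the sample as well-formed XML, which real pages may not be. The function name extract_article is made up for this example, and the path for the body assumes it lives in the size 4 font nested inside the first size 5 font:

```python
import xml.etree.ElementTree as ET

sample = """<div><p>
<font size="5" color="#00CCAA">
<font color="#AAFF33">
October 10, 1990 - Maybe a Title
</font>-
<font size="4">
Some long text here.
<font color="#66CC00">[Blah Blah, October 27, 1982 p. 2]</font>.
More content.
</font>
</font>
</p></div>"""


def extract_article(paragraph):
    """Return (title, body) for one <p>, using paths relative to that node."""
    title = paragraph.find("font[@size='5']/font")
    body = paragraph.find("font[@size='5']/font[@size='4']")
    title_text = title.text.strip() if title is not None and title.text else ""
    # itertext() also collects the text of nested <font> tags; collapse whitespace.
    body_text = " ".join("".join(body.itertext()).split()) if body is not None else ""
    return title_text, body_text


root = ET.fromstring(sample)
for p in root.iter("p"):
    print(extract_article(p))
```

The key point carries over to Nokogiri: query relative to each paragraph node instead of re-parsing its contents or falling back to a regex.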

fetch pages [LWP] parse them [HTML::TokeParser] and store results [DBI]

I have a job that consists of three tasks:
Fetch pages
Parse HTML
Store data
And yes, this is a true Perl job!
I have to run a parser job over all 6000 sub-pages of a site in Switzerland (a governmental site, which has very good servers).
see http://www.educa.ch/dyn/79362.asp?action=search and
(if you do not see approx 6000 results - then do a search with .
A detail page looks like this:
Ecole nouvelle de la Suisse Romande
Ch. de Rovéréaz 20 Case postal 161
1000 Lausanne 12 Website
info@ensr.ch Tel:021 654 65 00
Fax:021 654 65 05
Another detail page shows this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - "><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></td><td width="20" class="popuphead" valign="middle"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></td></tr><tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile"> </div><div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Auseklis - Schule für lettische Sprache und Kultur</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Mutschellenstrasse 37</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8002 Zürich</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">latvia.yourworld.ch</div><div><img src="/0.gif" alt="" width="15" height="8">schorderet#inbox.lv</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">+41786488637</div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8"></div><div> </div></body></html>
I want to do this job with HTML::TokeParser or HTML::TreeBuilder::LibXML, but I have little experience with HTML::TreeBuilder::LibXML.
Which one would you prefer for this job? Note: I want to store the results in a MySQL DB. The best thing would be to store them immediately after parsing.
so we have three tasks:
Fetch pages
Parse HTML
Store data
First item: Use LWP::UserAgent to fetch. There are many examples in this forum of using that module to post data and get the resulting pages. BTW, we can use Mechanize instead if we prefer.
Second: Parse the page, e.g. with HTML::TokeParser or some other module, to get at only the data we need.
Third: Store the data straight away in a database. There is no need to take an intermediate step and write a temporary file.
Hmm, my real questions are about the first and second items: how to fetch and how to parse.
Hard to be too specific as your question is very general. I've retrieved pages using LWP and used TokeParser to extract data and store the output in a database many times. I haven't used Mech, but by all accounts it is simpler than LWP.
Creating a user agent using LWP can be as simple as:
my $ua = LWP::UserAgent->new();
You will need to consider things like redirects, proxies, and cookies or passwords, depending on your requirements.
To follow redirects:
$ua = LWP::UserAgent->new(
    requests_redirectable => ['GET', 'HEAD', 'POST' ]
);
To store cookies:
$ua->cookie_jar( {} );
To set up a proxy:
$ua->proxy("http", "http://localhost:8888"); # Fiddler
To add a password for authentication:
$ua->credentials( 'www.myhostingplace.com:443' , 'Realm' , 'userid', 'password');
To get content from a page for local processing:
my $url = 'http://www.someurl.com';
my $response = $ua->get($url);
if ( $response->is_error() ) {
    # Do some error stuff
}
my $content = $response->content();
To parse the content using TokeParser:
my $stream = HTML::TokeParser->new(\$content);
while ( my $t = $stream->get_token() ) {
    if ( $t->[0] eq 'S' and $t->[1] eq 'input' ) {
        if ( uc( $t->[2]{ 'name' } ) eq 'SEARCHVALUE' ) {
            my $data = $t->[2]{ 'value' };
            # Do something with data
        }
    }
}
The data is passed into TokeParser as a reference; I then walk through the stream using get_token. Each token is returned as an array reference, which you can examine to determine what you should do next.
In the above example I want to search for input tags with an attribute name of 'SEARCHVALUE' and then store the 'value' attribute. The HTML fragment might look something like this:
<input type="hidden" name="SEARCHVALUE" value="Spock" />
When I hit the start of the input tag ($t->[0] eq 'S' and $t->[1] eq 'input') I examine the "name" attribute of the tag ($t->[2]{ 'name' }) to see if it matches the value I am searching for; if it does, I store the value attribute of the tag ($t->[2]{ 'value' }) in a variable. I can then do whatever I like with the value, including storing it in a database.
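For readers who want to see the same token-walking logic outside Perl, here is a rough Python sketch with the standard library's html.parser. It is not HTML::TokeParser; InputValueFinder and find_input_values are hypothetical names. Each start tag is inspected, and for input tags the name attribute is compared case-insensitively before the value is kept:

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Collects the value attribute of <input> tags with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name.upper()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, including self-closing ones like <input ... />
        if tag != "input":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").upper() == self.wanted_name:
            self.values.append(attributes.get("value"))


def find_input_values(html, wanted_name):
    """Return the values of all matching inputs in document order."""
    finder = InputValueFinder(wanted_name)
    finder.feed(html)
    return finder.values


fragment = '<input type="hidden" name="SEARCHVALUE" value="Spock" />'
print(find_input_values(fragment, "SEARCHVALUE"))
```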
You can do a lot with TokeParser, and in some cases it can be simpler than using regular expressions to carve up the page, but it can also be a little challenging to get your head around. If you are trying to extract a simple pattern from the returned HTML content, then a regular expression can be just as good.
If you have a lot of this to do then I recommend "Perl and LWP" by Sean Burke from O'Reilly. It has been endlessly helpful for me in my web scraping endeavours.
Hope this helps you get started at least.
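To show how the three steps (fetch, parse, store) fit together without a temporary file, here is a minimal end-to-end sketch in Python rather than Perl. Several things are assumptions made for the sketch: SQLite in memory stands in for MySQL, the table school and its columns are invented, the first non-empty div is taken to be the school name, and fetch() is defined but not called so the example runs offline on a fragment modelled on the detail page above:

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class DivTextParser(HTMLParser):
    """Collects the non-empty text of each <div>, in document order."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._depth += 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            text = " ".join("".join(self._buffer).split())
            if text:
                self.lines.append(text)
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def fetch(url):
    """Step 1: download a page (the sample page declares iso-8859-1)."""
    with urlopen(url) as response:
        return response.read().decode("iso-8859-1")


def parse(html):
    """Step 2: reduce the page to its list of text lines."""
    parser = DivTextParser()
    parser.feed(html)
    return parser.lines


def store(connection, url, lines):
    """Step 3: write the parsed record straight into the database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS school (url TEXT, name TEXT, details TEXT)"
    )
    connection.execute(
        "INSERT INTO school VALUES (?, ?, ?)",
        (url, lines[0], "\n".join(lines[1:])),
    )
    connection.commit()


sample = (
    '<div class="leerzeile"><img src="/0.gif">Auseklis - Schule für lettische Sprache und Kultur</div>'
    '<div><img src="/0.gif">Mutschellenstrasse 37</div>'
    '<div><img src="/0.gif"></div>'
    '<div><img src="/0.gif">8002 Zürich</div>'
)
connection = sqlite3.connect(":memory:")
store(connection, "sample", parse(sample))
print(connection.execute("SELECT name, details FROM school").fetchall())
```

In a real run the loop would call fetch() for each of the detail URLs and pass the result through parse() and store(), ideally with a pause between requests to be polite to the server.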

Resources