fetch pages [LWP] parse them [HTML::TokeParser] and store results [DBI]


A triple job: I have a job that consists of three tasks:
Fetch pages
Parse HTML
Store data... And yes - this is a true Perl job!
I have to run a parser job on all 6000 sub-pages of a site in Switzerland (a governmental site, which has very good servers).
see http://www.educa.ch/dyn/79362.asp?action=search and
(if you do not see approx 6000 results - then do a search with .
A detailed page is like this:
Ecole nouvelle de la Suisse Romande
Ch. de Rovéréaz 20 Case postal 161
1000 Lausanne 12 Website
info#ensr.ch Tel:021 654 65 00
Fax:021 654 65 05
Another detail page shows this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - "><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></td><td width="20" class="popuphead" valign="middle"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></td></tr><tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile"> </div><div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Auseklis - Schule für lettische Sprache und Kultur</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Mutschellenstrasse 37</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8002 Zürich</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">latvia.yourworld.ch</div><div><img src="/0.gif" alt="" width="15" height="8">schorderet#inbox.lv</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">+41786488637</div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8"></div><div> </div></body></html>
I want to do this job with HTML::TokeParser or HTML::TreeBuilder::LibXML, but I have little experience with HTML::TreeBuilder::LibXML.
Which one would you prefer for this job? Note: I want to store the results in a MySQL DB. Ideally each record is stored immediately after parsing.
so we have three tasks:
Fetch pages
Parse HTML
Store data
First item: Use LWP::UserAgent to fetch. There are many examples in this forum of using that module to post data and get the resulting pages. BTW we can use Mechanize instead if we prefer.
Second: Parse the page, e.g. with HTML::TokeParser or some other module, to get at only the data we need.
Third: Store the data straight away into a database. There is no need to take an intermediate step and write a temporary file.
hmmm - the first and the second question - how to fetch and how to parse.

Hard to be too specific as your question is very general. I've retrieved pages using LWP and used TokeParser to extract data and store the output in a database many times. I haven't used Mech, but by all accounts it is simpler than LWP.
Creating a user agent using LWP can be as simple as:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
You will need to consider things like redirects, proxies, and cookies or passwords, depending on your requirements.
To follow re-directs:
$ua = LWP::UserAgent->new(
requests_redirectable => ['GET', 'HEAD', 'POST' ]
);
To store cookies:
$ua->cookie_jar( {} );
To set up a proxy:
$ua->proxy("http", "http://localhost:8888"); # Fiddler
To add a password for authentication:
$ua->credentials( 'www.myhostingplace.com:443' , 'Realm' , 'userid', 'password');
To get content from a page for local processing:
my $url = 'http://www.someurl.com';
my $response = $ua->get($url);
if ( $response->is_error() ) {
# Do some error stuff
}
my $content = $response->content();
To parse the content using TokeParser:
use HTML::TokeParser;
my $stream = HTML::TokeParser->new(\$content);
while ( my $t = $stream->get_token() ) {
if ( $t->[0] eq 'S' and $t->[1] eq 'input' ) {
if ( uc( $t->[2]{ 'name' } ) eq 'SEARCHVALUE' ) {
my $data = $t->[2]{ 'value' };
# Do something with data
}
}
}
The data is passed into TokeParser as a reference; I then walk through the stream using get_token. Each token comes back as an array reference which you can examine to determine what you should do next.
In the above example I want to search for input tags with an attribute name of 'SEARCHVALUE' and then store the 'value' attribute. The HTML fragment might look something like this:
<input type="hidden" name="SEARCHVALUE" value="Spock" />
When I hit the start of the input tag ($t->[0] eq 'S' and $t->[1] eq 'input') I examine the "name" attribute of the tag ($t->[2]{ 'name' }) to see if it matches the value I am searching for; if it does I store the value attribute of the tag ($t->[2]{ 'value' }) in a variable. I can then do whatever I like with the value, including storing it in a database.
You can do a lot with TokeParser, and in some cases it can be simpler than using regular expressions to carve up the page, but it can also be a little challenging to get your head around. If you are trying to extract a simple pattern from the returned HTML content then a regular expression can be just as good.
If you have a lot of this to do then I recommend "Perl and LWP" by Sean Burke from O'Reilly. It has been endlessly helpful for me in my web scraping endeavours.
Hope this helps you get started at least.

Related

How to retrieve simple xml from public Google spreadsheet

I am working on an Arduino device that needs to retrieve public data from a Google spreadsheet.
So far I have published the spreadsheet and I can access it at https://spreadsheets.google.com/feeds/cells/1uphj-Oq3Xt6ImHJdezAUEX4u41_w1NNMlZU4Flr6lc4/1/public/full?range=a11:c12 which can be opened in the browser or in the Arduino (I am working with a SIM800 module so it can work with HTTPS without problems).
The output of this is xml items like (I am not very into XML):
<entry>
<id>https://spreadsheets.google.com/feeds/cells/1uphj-Oq3Xt6ImHJdezAUEX4u41_w1NNMlZU4Flr6lc4/1/public/full/R12C11</id>
<updated>2018-04-30T05:31:51.590Z</updated>
<category scheme='http://schemas.google.com/spreadsheets/2006' term='http://schemas.google.com/spreadsheets/2006#cell'/>
<title type='text'>K12</title>
<content type='text'>12345</content>
<link rel='self' type='application/atom+xml' href='https://spreadsheets.google.com/feeds/cells/1uphj-Oq3Xt6ImHJdezAUEX4u41_w1NNMlZU4Flr6lc4/1/public/full/R12C11'/>
<gs:cell row='12' col='11' inputValue='12345' numericValue='12345.0'>12345</gs:cell>
One of them for every cell requested.
The thing is that here I can see too much unneeded/redundant information, for example, in "title" and "content" I get the same information as in "gs:cell", "updated" may actually be useful but "link" and "category" are completely disposable to me.
Since I will be working with an Arduino and a sim800 module (which cannot handle high data transfer speeds) making this as simple as possible will be great.
Probably there is a way to request this simplified in the HTTP call, maybe adding some parameters or changing "full" to something else.
Any help will be greatly appreciated

You want to retrieve the simpler response from range=a11:c12 of the spreadsheet ID 1uphj-Oq3Xt6ImHJdezAUEX4u41_w1NNMlZU4Flr6lc4. If my understanding is correct, how about retrieving values using Query Language? I think that there may be several methods. So please think of this as one of them.
Pattern 1: Retrieve response as HTML
https://docs.google.com/spreadsheets/d/1uphj-Oq3Xt6ImHJdezAUEX4u41_w1NNMlZU4Flr6lc4/gviz/tq?range=a11:c12&tqx=out:html
Result :
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>A11:C12</title>
</head>
<body>
<table border="1" cellpadding="2" cellspacing="0">
<tr style="font-weight: bold; background-color: #aaa;">
<td></td><td></td><td></td>
</tr>
<tr style="background-color: #f0f0f0">
<td>DataInCellA11</td><td>DataInCellB11</td><td>DataInCellC11</td>
</tr>
<tr style="background-color: #ffffff">
<td>DataInCellA12</td><td>DataInCellB12</td><td>DataInCellC12</td>
</tr>
</table>
</body>
</html>
Pattern 2: Retrieve response as CSV
https://docs.google.com/spreadsheets/d/1uphj-Oq3Xt6ImHJdezAUEX4u41_w1NNMlZU4Flr6lc4/gviz/tq?range=a11:c12&tqx=out:csv
Result :
"DataInCellA11","DataInCellB11","DataInCellC11"
"DataInCellA12","DataInCellB12","DataInCellC12"
Note :
In this case, the response cannot be retrieved as XML. There is no tqx=out:xml.
You can retrieve the values from the URLs above using curl or a browser.
If you want to retrieve values from other sheets, add the gid query parameter. In this sample, gid=od6, which refers to the 1st sheet, is omitted.
Reference :
Query Language Reference
If I misunderstand your question, I'm sorry.
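As a rough illustration of the CSV pattern above, here is a minimal JavaScript sketch that builds the gviz URL and splits the CSV reply into cells. The helper names buildGvizUrl and parseCsv are made up for this example, and the parser assumes no cell contains a line break; it is not meant as Arduino code, only as a way to see how little needs to be parsed.

```javascript
// Hypothetical helper: assemble the gviz URL from the answer's pattern.
// The values are assumed to be URL-safe already (as in the sample URLs).
function buildGvizUrl(spreadsheetId, range, format, gid) {
  let url = 'https://docs.google.com/spreadsheets/d/' + spreadsheetId +
            '/gviz/tq?range=' + range + '&tqx=out:' + format;
  if (gid !== undefined) {
    url += '&gid=' + gid; // only needed for sheets other than the first
  }
  return url;
}

// Split one CSV line into fields, honouring double-quoted fields
// and doubled quotes ("") inside them.
function parseCsvLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"';
        i++; // skip the second quote of the pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Parse a whole CSV body; assumes no line breaks inside cells.
function parseCsv(text) {
  return text.split(/\r?\n/)
             .filter(line => line.length > 0)
             .map(parseCsvLine);
}

const sample = '"DataInCellA11","DataInCellB11","DataInCellC11"\n' +
               '"DataInCellA12","DataInCellB12","DataInCellC12"\n';
console.log(parseCsv(sample));
```

On a constrained device the same idea applies: request out:csv and split each line on the quoted commas, rather than parsing the Atom feed.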

h:graphicImage value tag parses spaces as +

I have a file name that contains spaces: bw3 - Copy_1340627264571.jpg
and I use this name to load the image as follows:
<h:graphicImage value="/#{myBean.imageFolder}/#{image.name}" width="30" height="30" style="border:0;"/>
this is translated to:
<img width="30" height="30" style="border:0;" src="/MyAPP/image/bw3+-+Copy_1340627264571.jpg">
while if i tried to print the name in outputText, it's printed correctly:
<h:outputText value="#{image.name}"/>
this is translated to:
<span id="myForm:viewImagesTable:0:_t68">bw3 - Copy_1340627264571.jpg</span>
any ideas how to fix that ?

This seems to be a bug in <h:graphicImage>. Spaces in request URI should be URL-encoded as %20 using java.net.URI and spaces in request query string should be URL-encoded as + using java.net.URLEncoder. It seems that <h:graphicImage> encodes the entire URI using java.net.URLEncoder.
Better replace them yourself beforehand:
<h:graphicImage value="/#{myBean.imageFolder}/#{image.name.replace(' ', '%20')}" />
Or, much better, don't allow spaces in filenames at all. When it concerns uploaded files, replace them by _ or something before saving.
Note that this has nothing to do with EL, as your question tagging suggests.
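The difference between the two encodings is easy to see outside JSF. The sketch below uses JavaScript's built-in encodeURIComponent and URLSearchParams as stand-ins for path-style and form/query-style encoding; it only illustrates the rule described above and is not the code that h:graphicImage runs.

```javascript
const fileName = 'bw3 - Copy_1340627264571.jpg';

// Path-style percent-encoding: a space becomes %20.
const asPathSegment = encodeURIComponent(fileName);

// Form/query-style encoding (application/x-www-form-urlencoded):
// a space becomes +, which is only correct inside a query string.
const asQueryString = new URLSearchParams({ name: fileName }).toString();

console.log('/MyAPP/image/' + asPathSegment); // path form, spaces as %20
console.log('?' + asQueryString);             // query form, spaces as +
```

A + in the path part of a URL is a literal plus sign, which is why the image request for bw3+-+Copy_... does not find the file.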

simple html dom parser and <span>

I hope anyone can help me with this.
I have an html code like this:
<div id="v4-95"><div id="v4-96" class="pview rs-pview"><table cellpadding="0" cellspacing="2" class="grid"><tr><td width="33%" class="gallery"><a name="item19c368bcd6"></a><table cellpadding="0" cellspacing="10" class="gallery"><tr><td class="picture camera" width="100%" height="140"><div class="image" style="width: 140px;"><img alt="Item image" title="Item image" src="http://thumbs3.ebaystatic.com/m/mvOLm6Tv8Lid54uveSlY80A/140.jpg" border="0"></div></td></tr><tr><td><div class="mi"></div></td></tr><tr><td class="details"><div class="ttl g-std"><a id="src110652603606" _sp="p4634.c0.m14.l1262" r="1" href="http://www.ebay.co.uk/itm/SAMSUNG-LTN156AT02-15-6-LAPTOP-SCREEN-NEW-/110652603606?pt=UK_Computing_LaptopAccess_RL&hash=item19c368bcd6" target="_parent" title="SAMSUNG LTN156AT02 15.6" LAPTOP SCREEN NEW">SAMSUNG LTN156AT02 15.6" LAPTOP SCREEN NEW</a><img src="http://q.ebaystatic.com/aw/pics/s.gif" width="16" alt="This seller accepts PayPal" height="16" class="ii iippl"></div><div><table cellpadding="0" cellspacing="0" class="fixed"><tr><td><img src="http://q.ebaystatic.com/aw/pics/bin_15x54.gif" alt="Buy It Now" title="Buy It Now"></td><td><span class="bin g-b">£41.50</span></td></tr>
I can retrieve the title with this code:
$html = file_get_html('http://stores.ebay.co.uk/LCD-Kings/15-6-/_i.html?_fsub=886314010&_sid=73271570&_trksid=p4634.c0.m322');
foreach($html->find('a') as $element)
echo $element->title . '<br>';
But I don't understand how I can retrieve the £41.50 inside the span, or why there is a space in the class "bin g-b"...
thanks for help...

It has a space in the class because that element has two classes. One is called bin, the other is called g-b. I'm guessing g-b refers to Great Britain so the price may be the span that has the class bin.
You haven't provided all the HTML but there may be an outer element that you can search for (such as: a div with id product and then, within that, find the price in the span with class bin).
You should look up the documentation of your DOM parser and see what arguments it supports for find(). If it supports something like #product span.bin (or similar syntax) then you can select the span with that class.

ui:repeat using the same client id. c:foreach works fine

I know this may have something to do with the phase each one comes in at.
If I do this.
<ui:repeat id="repeatChart" varStatus="loop" value="#{viewLines.jflotChartList}" var="jflotChart">
<p:panel>
<jflot:chart height="300" width="925" dataModel="#{jflotChart.dataSet}" dataModel2="#{jflotChart.dataSet2}"
xmin="#{jflotChart.startDateString}"
xmax="#{jflotChart.endDateString}"
shadeAreaStart ="#{jflotChart.shadeAreaStart}"
shadeAreaEnd ="#{jflotChart.shadeAreaEnd}"
lineMark="#{jflotChart.wrapSpec.benchmark}" yMin="#{jflotChart.yMin}" yMax="#{jflotChart.yMax}" />
</p:panel>
<br />
</ui:repeat>
My code will not work. Debugging the javascript shows that the same id is generated for every iteration. I've tried putting loop.index to create an id and that gives me an error saying that id can't be blank.
If I exchange the ui:repeat for a c:forEach it works fine. Debugging the javascript shows that a new id is created for each iteration.
Here is my backing code(some of it).
<div id="#{cc.id}_flot_placeholder" style="width:#{cc.attrs.width}px;height:#{cc.attrs.height}px;">
<script type="text/javascript">
//<![CDATA[
$(function () {
var placeholder = $("##{cc.id}_flot_placeholder");
var overviewPlaceholder = $("##{cc.id}_flot_overview");
The id needs to be different so the javascript can render to the correct div. I've tried explicitly defining an id attribute and then passing that as the id in the client code. Like I said before that doesn't work. Thanks for any help.
**EDIT**
Here is my problem. I can't use the clientId in the div tag because of the colon character obviously. I have modified it in javascript but how would I get that value to the div. I can't get the div tag by id because I need to generate the id. I can't seem to do a document.write() either. I'm stuck at this point.
<composite:implementation>
<div id="#{cc.clientId}_flot_placeholder" style="width:400px;height:400px;">
<script type="text/javascript">
//<![CDATA[
$(function () {
var clientIdOld = '#{cc.clientId}';
var clientId = clientIdOld.replace(':', '_');
var placeholder = $('#'+clientId+'_flot_placeholder');
var overviewPlaceholder = $('#'+clientId+'_flot_overview');

I did a quick test in a local environment (Mojarra 2.0.4 on Tomcat 7.0.11). Using #{cc.clientId} gives you a unique ID back every time.
<ui:repeat value="#{bean.items}" var="item">
<cc:test />
</ui:repeat>
with
<cc:implementation>
<div id="#{cc.clientId}_foo">foo</div>
</cc:implementation>
Here's the generated HTML source:
<div id="j_idt6:0:j_idt7_foo">foo</div>
<div id="j_idt6:1:j_idt7_foo">foo</div>
<div id="j_idt6:2:j_idt7_foo">foo</div>
This should be sufficient for your functional requirement. You might only want to escape the default separator : or to replace it by a custom separator since it's a reserved character in CSS selectors.
Update: so you want to escape it, you should then replace : by \: and not by _.
var clientId = clientIdOld.replace(/:/g, '\\:');
(the /:/g is a regex which ensures that all occurrences will be replaced, and the double backslash is just there to escape the backslash itself in JS strings, as you normally do in Java strings)
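To see the effect without a browser, here is a small plain-JavaScript sketch. The function name escapeClientId is invented for this example, and the client ID is the one from the generated HTML above; the resulting string is what would be handed to $(...).

```javascript
// Hypothetical helper: escape every colon so the ID can be used
// inside a CSS/jQuery selector.
function escapeClientId(clientId) {
  return clientId.replace(/:/g, '\\:');
}

const clientIdOld = 'j_idt6:0:j_idt7';

// A string pattern replaces only the FIRST colon.
const firstOnly = clientIdOld.replace(':', '_');

// The regex with the g flag handles all of them.
const selector = '#' + escapeClientId(clientIdOld) + '_foo';

console.log(firstOnly); // one colon is still left in the ID
console.log(selector);  // every colon is preceded by a backslash
```

The div itself keeps its unescaped ID (j_idt6:0:j_idt7_foo); only the selector string needs the backslashes.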

Uncaught exception 'DOMException' with message 'Not Found Error'

Basically I'm writing a templating system for my CMS and I want to have a modular structure which involves people putting in tags like:
<module name="news" /> or <include name="anotherTemplateFile" /> which I then want to find in my php and replace with dynamic html.
Someone on here pointed me towards DOMDocument, but I've already come across a problem.
I'm trying to find all <include /> tags in my template and replace them with some simple html. Here is my template code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>CMS</title>
<include name="head" />
</head>
<body>
<include name="header" />
<include name="content" />
<include name="footer" />
</body>
</html>
And here is my PHP:
$template = new DOMDocument();
$template->load("template/template.tpl");
foreach( $template->getElementsByTagName("include") as $include ) {
$element = '<input type="text" value="'.print_r($include, true).'" />';
$output = $template->createTextNode($element);
$template->replaceChild($output, $include);
}
echo $template->saveHTML();
Now, I get the fatal error Uncaught exception 'DOMException' with message 'Not Found Error'.
I've looked this up and it seems that, because my <include /> tags aren't necessarily DIRECT children of $template, it's not replacing them.
How can I replace them independently of descent?
Thank you
Tom
EDIT
Basically I had a brainwave of sorts. If I do something like this for my PHP I see its trying to do what I want it to do:
$template = new DOMDocument();
$template->load("template/template.tpl");
foreach( $template->getElementsByTagName("include") as $include ) {
$element = '<input type="text" value="'.print_r($include, true).'" />';
$output = $template->createTextNode($element);
// this line is different:
$include->parentNode->replaceChild($output, $include);
}
echo $template->saveHTML();
However it only seems to change 1 occurrence in the <body> of my HTML... when I have 3. :/

This is a problem with your DOMDocument->load, try
$template->loadHTMLFile("template/template.tpl");
But you may need to give it a .html extension.
load() expects an XML file, while loadHTMLFile() expects an HTML file. Also, whenever you are using DOMDocument with HTML it is a good idea to call libxml_use_internal_errors(true); before the load call.
OKAY THIS WORKS:
$includes = array(); // collect the nodes first, then modify the document
foreach( $template->getElementsByTagName("include") as $include ) {
if ($include->hasAttributes()) {
$includes[] = $include;
}
//var_dump($includes);
}
foreach ($includes as $include) {
$include_name = $include->getAttribute("name");
$input = $template->createElement('input');
$type = $template->createAttribute('type');
$typeval = $template->createTextNode('text');
$type->appendChild($typeval);
$input->appendChild($type);
$name = $template->createAttribute('name');
$nameval = $template->createTextNode('the_name');
$name->appendChild($nameval);
$input->appendChild($name);
$value = $template->createAttribute('value');
$valueval = $template->createTextNode($include_name);
$value->appendChild($valueval);
$input->appendChild($value);
if ($include->getAttribute("name") == "head") {
$template->getElementsByTagName('head')->item(0)->replaceChild($input,$include);
}
else {
$template->getElementsByTagName("body")->item(0)->replaceChild($input,$include);
}
}
//$template->load($nht);
echo $template->saveHTML();

"However it only seems to change 1 occurrence in the <body> of my HTML... when I have 3. :/"
DOM NodeLists are ‘live’: when you remove an <include> element from the document (by replacing it), it disappears from the list. Conversely if you add a new <include> into the document, it will appear in your list.
You might expect this for a NodeList that comes from an element's childNodes, but the same is true of NodeLists that are returned by getElementsByTagName. It's part of the W3C DOM standard and occurs in web browsers' DOMs as well as PHP's DOMDocument.
So what you have here is a destructive iteration. Remove the first <include> (item 0 in the list) and the second <include>, previously item 1, becomes the new item 0. Now when you move on to the next item in the list, item 1 is what used to be item 2, causing you to only look at half the items.
PHP's foreach loop looks like it might protect you from that, but actually under the covers it's doing exactly the same as a traditional indexed for loop.
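The skipping can be reproduced with a toy model. The JavaScript sketch below does not use a real DOM: makeDocument and its methods are invented stand-ins, where the "live list" is simply recomputed on every access. It shows an indexed loop skipping an item, and the usual fix of copying the matches into a plain array before changing anything.

```javascript
// Toy stand-in for a document: a flat list of tag names (hypothetical API).
function makeDocument(tags) {
  const nodes = tags.slice();
  return {
    nodes,
    // "Live" list: every access re-scans the current nodes.
    getElementsByTagName(name) {
      const matches = () =>
        nodes.map((tag, pos) => ({ tag, pos })).filter(n => n.tag === name);
      return {
        get length() { return matches().length; },
        // Returns the position of the i-th match, or null if there is none.
        item(i) { const m = matches(); return i < m.length ? m[i].pos : null; }
      };
    },
    replace(pos, newTag) { nodes[pos] = newTag; }
  };
}

// Destructive iteration: the list shrinks while the index grows.
function replaceWhileIterating(doc) {
  const live = doc.getElementsByTagName('include');
  for (let i = 0; i < live.length; i++) {
    doc.replace(live.item(i), 'input');
  }
}

// Fix: take a static snapshot first, then modify the document.
function replaceFromSnapshot(doc) {
  const live = doc.getElementsByTagName('include');
  const snapshot = [];
  for (let i = 0; i < live.length; i++) snapshot.push(live.item(i));
  for (const pos of snapshot) doc.replace(pos, 'input');
}

const broken = makeDocument(['include', 'include', 'include']);
replaceWhileIterating(broken);
console.log(broken.nodes); // the middle include is skipped

const fixed = makeDocument(['include', 'include', 'include']);
replaceFromSnapshot(fixed);
console.log(fixed.nodes); // all three are replaced
```

Iterating the live list backwards, from the last index down to 0, is another common way to avoid the problem.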
I'd try to avoid creating a new templating language for PHP; there are already so many, not to mention PHP itself. Creating one out of DOMDocument is also going to be especially slow.
eta: In general regex replace would be faster, assuming a simple match pattern that doesn't introduce loads of backtracking. However if you are wedded to an XML syntax, regex isn't very good at parsing that. But what are you attempting to do, that can't already be done with PHP?
<?php function write_header() { ?>
<p>This is the header bit!</p>
<?php } ?>
<body>
...
<?php write_header(); ?>
...
</body>
