Jsoup get all links from a page - hyperlink

I'm implementing a web robot that has to get all the links from a page and select the needed ones. I got it all working, except I encountered a problem where a link is inside a "table" or a "span" tag.
Here's my code snippet:
Document doc = Jsoup.connect(url)
.timeout(TIMEOUT * 1000)
.get();
Elements elts = doc.getElementsByTag("a");
And here's the example HTML:
<table>
<tr><td></td></tr>
</table>
My code will not fetch such links. Using doc.select doesn't help either. My question is: how do I get all the links from the page?
EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; the HTML validator reports a tremendous number of errors. Could this cause problems?

In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup parsed it (you can simply output doc.toString()) and check whether the links are present there.
Tip: use select() instead of getElementsByX(); it's faster and more flexible.
Elements elts = doc.select("a");
Here's an overview about the Selector-API: http://jsoup.org/cookbook/extracting-data/selector-syntax

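Try this code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://test.com";
try {
    Document doc = Jsoup.connect(url).get();
    // Select only anchors that actually carry an href attribute.
    Elements links = doc.select("a[href]");
    // Iterate over whatever was found instead of assuming a fixed count.
    for (Element link : links) {
        // "abs:href" resolves relative links against the document's base URI.
        System.out.println("a= " + link.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}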

Related

Dart - markdown formatting after building to js

I am building a simple blog page where I wish to use markdown as the text format.
I have a working page when running in Dartium, but when I compile to JS the markdown is not formatted properly. I think only the paragraphs are missing; headers and lists work fine.
I'm displaying the blog post in a polymer element and reading in a simple file from the server. I have made a simple sample without polymer which seems to work fine but I haven't tried it on the production server.
The basic code is outlined below, any tips or a better way of doing this? I will eventually move the posts to a db as text but I'm open to suggestions for other ways of presenting blog posts with some simple formatting, thanks.
getPostsFromServer(){
String path = 'post1.md';
HttpRequest req = new HttpRequest();
req
..open('GET', path)
..onLoadEnd.listen((e) => printPost(req))
..send('');
}
void printPost(HttpRequest req){
var postdiv = $['article'];
if(req.status == 200){
var postText = req.responseText;
print(postText);
postdiv.innerHtml = markdownToHtml(postText);
}
else{
postdiv.innerHtml = 'Failed to load newsletter, sorry.';
}
}

How to parse article page links in rails with pismo and mechanize?

I'm trying to parse the links on multi-page articles so that I can click through them automatically and extract the whole article content. I'm using mechanize, following my last question and its helpful answer.
How can I search for pagination links? Each article may have a different link structure, like:
ZEITONLINE:
<a id="hp.article.bottom.paginierung.2" class="pn-forward pn-button" title="Vor" href="http://www.zeit.de/politik/ausland/2013-01/Syrien-Fotografie-Reportage/seite-2">Vorwärts</a>
ARSTECHNICA:
<span class="next">Next <span class="arrow">→</span></span>
IGN:
Next »
For the IGN link it's relatively simple to find the link, because it contains the link text Next. But what about the other links? I know it should be doable, because Pocket, Readability and Instapaper extract multi-page content.
Hope you can help me a bit.
I wrote a function for this:
function proccessURL($ParentURL, $URL) {
    $parse_url = parse_url($URL);
    $Parent_URL = parse_url($ParentURL);
    if (empty($parse_url['host'])) {
        $path = isset($parse_url['path']) ? $parse_url['path'] : '';
        if ($path === '') {
            // Query- or fragment-only link: keep the parent's path.
            $path = isset($Parent_URL['path']) ? $Parent_URL['path'] : '/';
        } elseif ($path[0] !== '/') {
            // Relative path: resolve it against the parent's directory.
            $parentPath = isset($Parent_URL['path']) ? $Parent_URL['path'] : '/';
            $path = substr($parentPath, 0, strrpos($parentPath, '/') + 1) . $path;
        }
        // Collapse "." and ".." segments.
        $segments = array();
        foreach (explode('/', $path) as $segment) {
            if ($segment === '..') {
                array_pop($segments);
            } elseif ($segment !== '.' && $segment !== '') {
                $segments[] = $segment;
            }
        }
        // Keep a trailing slash if the link had one.
        $trailing = (count($segments) > 0 && substr($path, -1) === '/') ? '/' : '';
        $parse_url['path'] = '/' . implode('/', $segments) . $trailing;
    } else {
        // Absolute link: its own scheme and host win over the parent's.
        if (isset($parse_url['scheme'])) {
            $Parent_URL['scheme'] = $parse_url['scheme'];
        }
        $Parent_URL['host'] = $parse_url['host'];
    }
    $path = isset($parse_url['path']) ? $parse_url['path'] : '';
    $query = !empty($parse_url['query']) ? '?' . $parse_url['query'] : '';
    $fragment = !empty($parse_url['fragment']) ? '#' . $parse_url['fragment'] : '';
    return $Parent_URL['scheme'] . '://' . $Parent_URL['host'] . $path . $query . $fragment;
}
This function resolves a link address against the URL of the page it was found on.
Example:
$CorrectLink=proccessURL("http://www.sepidarcms.ir/kernel/","../plugin/1.php");
The output is "http://www.sepidarcms.ir/plugin/1.php".
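For comparison, the same kind of resolution can be sketched in Java with the standard library's java.net.URI. This is only a minimal illustration, not part of the original answer: the class name LinkResolver and the method resolve are hypothetical, and it assumes both inputs are syntactically valid URIs (no spaces or other illegal characters).

```java
import java.net.URI;

// Hypothetical helper for illustration; the names are not from the original answer.
class LinkResolver {

    // Resolves a possibly relative href against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Same inputs as the PHP sample above.
        System.out.println(resolve("http://www.sepidarcms.ir/kernel/", "../plugin/1.php"));
        // prints http://www.sepidarcms.ir/plugin/1.php
    }
}
```

Absolute links pass through unchanged, so the helper can be applied to every href on a page without checking first whether it is relative.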
Now you can extract the URLs with preg_match_all:
$html = "Your HTML Str";
$URL = "Your HTML Page Link";
preg_match_all("/href=\"([^\"]*)\"/is", $html, $matches);
// $matches[1] holds the captured href values.
foreach ($matches[1] as $val) {
    $val = proccessURL($URL, $val);
    echo $val;
}
This code lists every href URL on the page, resolved to an absolute address.
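Listing every href still leaves the original question open: which of those links is the pagination link? One possible heuristic, sketched below in Java, keeps only anchors whose rel or class attribute contains "next" or "forward", or whose visible text starts with "next". The class name NextLinkFinder, the method findNextLinks and the hint words are assumptions chosen to fit the three examples in the question; a regex scan like this is fragile on real-world markup, so a proper HTML parser would be the sturdier choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical heuristic for illustration; the names and hint words are assumptions.
class NextLinkFinder {

    // Group 1 = the anchor's attribute text, group 2 = its inner HTML.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b([^>]*)>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Group 1 = the value of a double-quoted href attribute.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // A rel or class attribute whose value mentions "next" or "forward".
    private static final Pattern HINT = Pattern.compile(
            "(rel|class)\\s*=\\s*\"[^\"]*(next|forward)[^\"]*\"", Pattern.CASE_INSENSITIVE);

    // Returns the href of every anchor that looks like a "next page" link.
    static List<String> findNextLinks(String html) {
        List<String> result = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            if (!href.find()) {
                continue; // anchors without an href cannot be followed
            }
            // Strip nested tags (such as <span class="next">) to get the visible text.
            String text = anchor.group(2).replaceAll("<[^>]*>", "").trim().toLowerCase(Locale.ROOT);
            if (HINT.matcher(attrs).find() || text.startsWith("next")) {
                result.add(href.group(1));
            }
        }
        return result;
    }
}
```

The returned hrefs may still be relative, so each one would be resolved against the page URL before it is fetched.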

Append data to iterator in struts2

Hi friends, I am trying to create Facebook-like pagination in Struts2.
When the user reaches the end of the web page, I call the action class via JavaScript AJAX using the code below:
<script>
$(window).scroll(function() {
if ($(window).scrollTop() == $(document).height() - $(window).height()) {
console.log("Bottom reached");
var ul = $('.ullist');
var start = ul.children().length;
$.post("postImage.action?", { start: start }, function(session2) {
// Here I am getting json data
alert("inside class " + session2);
});
} else {
console.log("Bottom reached not");
}
});
</script>
The problem is that I already have a list rendered with an iterator. Please tell me how to append the new values to the iterator.
<s:iterator value="#session.list">
.......//here i already have data
</s:iterator>
You don't "append data to the iterator", you append DOM elements to the DOM.
You have two main options:
Return rendered HTML and append it at the end of the page (wherever is appropriate in your DOM), or...
Return JSON (or XML or whatever) and build the DOM dynamically on the client side.
You already have a JSP that renders the same type of information, so I'd re-use that chunk of JSP, return rendered HTML, and append it. That said, there are countless jQuery bottomless pagination examples and many plugins. I'd probably just pick one that gets you started, take it from there, and use whatever mechanism your choice uses.

Print all steps of asp:Wizard control

I have an asp:Wizard control in my web application. I need to be able to print at any step within the wizard, and print all steps up to that step, not just the current one.
I've added a print button to every step page and tried calling javascript:window.print(), but only the current step gets printed.
How do I get all the steps to print on one page?
I'd like to get this working in JavaScript first before I go down the PDF route. I've tried doing something like this:
protected void Page_Load(object sender, EventArgs e)
{
StringWriter sw = new StringWriter();
HtmlTextWriter tw = new HtmlTextWriter(sw);
this.WizardStep2.RenderControl(tw);
string wizardHtmlContent = sw.ToString().Replace("\r\n", "");
string printScript = @"function printDiv(printpage)
{
var headstr = '<html><head><title></title></head><body>';
var footstr = '</body>';
var newstr = printpage;
var oldstr = document.body.innerHTML;
document.body.innerHTML = headstr+newstr+footstr;
window.print();
document.body.innerHTML = oldstr;
return false;
}";
this.Page.ClientScript.RegisterStartupScript(this.GetType(), "PrentDiv", printScript, true);
this.Button1.Attributes.Add("onclick", "printDiv('" + wizardHtmlContent + "');");
}
and for the aspx:
<form id="form1" runat="server">
<div>
<asp:Wizard ID="Wizard1" runat="server">
<WizardSteps>
<asp:WizardStep ID="WizardStep1" runat="server" Title="Step 1">
step1
</asp:WizardStep>
<asp:WizardStep ID="WizardStep2" runat="server" Title="Step 2">
step2
</asp:WizardStep>
</WizardSteps>
</asp:Wizard>
<asp:Button ID="Button1" runat="server" Text="Button" />
</div>
</form>
But I'm getting a missing runat=server error on line 3 when I attempt to render the wizard control, so I think I may need to create a new window and then output the string before I print it, but I can't seem to get that working. Does anyone have any ideas?
I have found a solution to my problem. I didn't manage to accomplish it client side, but I've managed to solve it server side, which is better than going down the PDF route that I wanted to avoid.
I found a great article here:
Printing in ASP.NET
which I amended to print all steps of my wizard control in one go. Thanks for all your help.
The JavaScript print method you're already using will work if you put the wizard steps into a single page so they all render.
The other way, I guess, is simply to browse to each step and hit your print button.
The way I would do it is to use something like PDFsharp: give it the markup generated by each step and tell it to create a PDF page for each step's worth of markup. From there the user has a PDF document which they can view, save or print using their usual PDF viewer.
The problem is that the JavaScript method uses a DOM-based API call to ask the browser to print the page, which ultimately means you can only print the wizard step you're currently looking at. Using the PDF method means the user can preview the expected printout before printing, and you have more control over what's printed.
It does require a bit more code though.
PDFsharp can be found here: http://www.pdfsharp.net/Downloads.ashx
As you can see, it's free and open source.

How to extract the headline and content from a crawled web page / article?

I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.
You could try the Simple HTML DOM Parser. It sports a syntax to find specific elements similar to jQuery.
They have an example on how to scrape Slashdot:
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
foreach($html->find('div.article') as $article) {
$item['title'] = $article->find('div.title', 0)->plaintext;
$item['intro'] = $article->find('div.intro', 0)->plaintext;
$item['details'] = $article->find('div.details', 0)->plaintext;
$articles[] = $item;
}
print_r($articles);
