I'm trying to parse the links on multi-page articles so that I can automatically click through them and extract the whole article content. I'm using mechanize, following my last question and the helpful answer it received.
How can I search for pagination links? Each article may use a different link structure, for example:
ZEITONLINE:
<a id="hp.article.bottom.paginierung.2" class="pn-forward pn-button" title="Vor" href="http://www.zeit.de/politik/ausland/2013-01/Syrien-Fotografie-Reportage/seite-2">Vorwärts</a>
ARSTECHNICA:
<span class="next">Next <span class="arrow">→</span></span>
IGN:
Next »
For the IGN link it's relatively simple to find the link, because it contains the link text "Next". But what about the other links? I know it should be doable, because Pocket, Readability and Instapaper all extract multi-page content.
Hope you can help me a bit.
I wrote a function for this:
function proccessURL($ParentURL,$URL){
    $parse_url=parse_url($URL);
    $Parent_URL=parse_url($ParentURL);
    if(empty($parse_url['host'])){
        if(!empty($parse_url['scheme'])){
            // mailto:, javascript: and similar links have nothing to resolve
            return $URL;
        }
        // Relative link: scheme and host come from the parent page
        $scheme=$Parent_URL['scheme'];
        $host=$Parent_URL['host'];
        $parentPath=isset($Parent_URL['path'])?$Parent_URL['path']:"/";
        $path=isset($parse_url['path'])?$parse_url['path']:"";
        if($path===""){
            // Query- or fragment-only link: keep the parent's path
            $path=$parentPath;
            if(!isset($parse_url['query'])&&isset($Parent_URL['query'])){
                $parse_url['query']=$Parent_URL['query'];
            }
        }elseif($path[0]!="/"){
            // Resolve against the parent's directory (up to its last "/")
            $path=substr($parentPath,0,strrpos($parentPath,"/")+1).$path;
        }
        // Collapse "." and ".." segments
        $segments=array();
        foreach(explode("/",$path) as $segment){
            if($segment===".."){
                array_pop($segments);
            }elseif($segment!=="."&&$segment!==""){
                $segments[]=$segment;
            }
        }
        $trailing=(count($segments)>0&&substr($path,-1)=="/")?"/":"";
        $path="/".implode("/",$segments).$trailing;
    }else{
        // Absolute link: keep its own host and path
        $scheme=isset($parse_url['scheme'])?$parse_url['scheme']:$Parent_URL['scheme'];
        $host=$parse_url['host'];
        $path=isset($parse_url['path'])?$parse_url['path']:"";
    }
    $query=isset($parse_url['query'])?"?".$parse_url['query']:"";
    $fragment=isset($parse_url['fragment'])?"#".$parse_url['fragment']:"";
    return $scheme."://".$host.$path.$query.$fragment;
}
This function resolves a link's address against the page it appears on.
Sample:
$CorrectLink=proccessURL("http://www.sepidarcms.ir/kernel/","../plugin/1.php");
The output is "http://www.sepidarcms.ir/plugin/1.php"
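For comparison, the same kind of relative-link resolution can be sketched in Python, where the standard library's urllib.parse.urljoin does the work. This is only an illustration in a different language from the PHP function above; the helper name resolve_link is made up for the example.

```python
from urllib.parse import urljoin


def resolve_link(parent_url: str, link: str) -> str:
    """Resolve a possibly relative link against the page it was found on.

    resolve_link is a hypothetical helper name; urljoin is the real
    standard-library call that performs the resolution.
    """
    return urljoin(parent_url, link)


# Same inputs as the PHP sample above.
print(resolve_link("http://www.sepidarcms.ir/kernel/", "../plugin/1.php"))
```

Absolute links pass through unchanged, so the helper can be applied to every href found on a page.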
Now you can extract the URLs with preg_match_all:
$html="Your HTML Str";
$URL="Your HTML Page Link";
preg_match_all("/href=\"([^\"]*)\"/is", $html, $matches);
// $matches[1] holds the captured href values
foreach($matches[1] as $val){
    $val=proccessURL($URL,$val);
    echo $val;
}
This code lists all href URLs on the page as correctly resolved absolute addresses.
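For the pagination question itself, one possible heuristic is to look at each link's rel, class, id, title and text for words that suggest a next page. The sketch below is Python rather than PHP and uses only the standard library. The keyword set NEXT_HINTS and the names NextLinkFinder and find_next_links are assumptions made for illustration, as is the idea that the Ars Technica span sits inside an <a> element. Real sites will need their own tuning.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keyword list (English and German); extend it per site.
NEXT_HINTS = {"next", "forward", "vor", "vorwärts", "weiter"}


class NextLinkFinder(HTMLParser):
    """Collect <a href> links whose attributes or text hint at a next page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []
        self._href = None     # absolute href of the <a> currently open
        self._hinted = False  # True once the open <a> looks like "next"

    def _check(self, text):
        # Compare whole words so that "vor" does not match "favorite".
        words = set(re.findall(r"[^\W_]+", text.lower()))
        if words & NEXT_HINTS:
            self._hinted = True

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            self._hinted = False
        if self._href is not None:
            # Also inspects tags nested in the link, e.g. <span class="next">.
            keys = ("rel", "class", "id", "title")
            self._check(" ".join(attrs.get(k) or "" for k in keys))

    def handle_data(self, data):
        if self._href is not None:
            self._check(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self._hinted:
                self.candidates.append(self._href)
            self._href = None


def find_next_links(html, base_url):
    """Return absolute URLs of links that look like 'next page' links."""
    finder = NextLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.candidates


sample = """
<a class="pn-forward pn-button" title="Vor" href="seite-2">Vorwärts</a>
<a href="/page/2"><span class="next">Next <span class="arrow">→</span></span></a>
<a href="/about">About us</a>
"""
print(find_next_links(sample, "http://example.com/article/"))
```

Matching whole words keeps the number of false positives down, but a link such as "Next article" would still be picked up, so the candidates probably need a further check (for example, that the URL stays under the current article's path).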
What is the correct way to display the tag name, and a link to it, on the tag-specific page in MODX Revo using tagLister? E.g. a post has the tags Tag1, Tag2 and Tag3. When you click one of the tags, it takes you to the target resource, which displays all posts with that single tag. What code should I put in that target resource so that it shows the user has landed on that specific tag's page? I want to display the name and the link of that exact tag.
My tags target resource is the main blog resource. Here is the code:
<section>
[[The Code to Display the Tag name to put here]]
[[!getResourcesTag@Blog Pagination Hy?
&elementClass=`modSnippet`
&element=`getResources`
&tpl=`Blog Post on Blog Page`
&hideContainers=`0`
&pageVarKey=`page`
&parents=`[[*id]]`
&limit=`3`
&includeTVs=`1`
&includeContent=`1`
&cache=`0`
]]
<div class="PaginationContainer">
<span class="TotalPages">p [[+page]] (total. [[+pageCount]])</span>
<ul>
[[!+page.nav]]
</ul>
</div>
</section>
Is it possible at all?
I finally found a solution on the web.
If you have a better solution, please post it here.
The idea is to create a snippet that returns the tag, and to call that snippet wherever we want the tag shown.
Step by Step.
Step 1. Create a new snippet and give it a name, e.g. Tag Name.
Step 2. Paste the following code into the snippet's code field.
Snippet code:
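//-- Get all request string key/value pairs
$s = $_REQUEST;
//-- Return the tag only on tag listings (key=tags), escaped for safe output
if (isset($s['key'], $s['tag']) && $s['key'] == 'tags') {
    return htmlspecialchars($s['tag'], ENT_QUOTES, 'UTF-8');
} else {
    return false;
}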
Step 3. Call the snippet wherever you want the tag name to appear, e.g. [[!Tag Name]]
The tag name will then be shown on tag pages only.
Here is where I found it:
https://forums.modx.com/thread/11108/dynamically-generated-list-of-documents-that-are-tagged-with-categories?page=2#dis-post-397237
I'm trying to display dynamic URLs stored in my database, but the database only holds part of each URL. I tried something like this, but it's not working. Any ideas?
@foreach(var item in Model.dominios)
{
url
}
You don't need Html.DisplayFor() here. Build your URL and put it into the link:
@foreach(var item in Model.dominios)
{
    var linkUrl = string.Format("http://www.{0}.sss.com", item.subdom);
    <a href="@linkUrl">link text</a>
}
I'm implementing a web robot that has to get all the links from a page and select the ones it needs. I have it all working, except that I ran into a problem when a link is inside a "table" or a "span" tag.
Here's my code snippet:
Document doc = Jsoup.connect(url)
.timeout(TIMEOUT * 1000)
.get();
Elements elts = doc.getElementsByTag("a");
And here's the example HTML:
<table>
<tr><td></td></tr>
</table>
My code will not fetch such links. Using doc.select doesn't help either. My question is: how do I get all the links from the page?
EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; the HTML validator reports a tremendous number of errors. Could this cause problems?
In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup sees it (you can simply print doc.toString()).
Tip: use select() instead of getElementsByX(); it's faster and more flexible.
Elements elts = doc.select("a"); (edit)
Here's an overview about the Selector-API: http://jsoup.org/cookbook/extracting-data/selector-syntax
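Try this code:
// Requires imports: java.io.IOException, org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element, org.jsoup.select.Elements
String url = "http://test.com";
try {
    Document doc = Jsoup.connect(url).get();
    // Select every <a> that has an href, wherever it is nested (table, span, ...)
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        // "abs:href" resolves relative URLs against the document's base URI
        System.out.println("a= " + link.attr("abs:href"));
    }
} catch (IOException e) {
    e.printStackTrace();
}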
I have some tabs, and I want to say "if they are currently on the page that this tab refers to, make this a span. Otherwise, make this a link." In pseudo-razor, that would look like this:
@if(CurrentlyOnThisPage) {
<span>
} else {
<a>
}
Tab Content
@if(CurrentlyOnThisPage){
</span>
} else {
</a>
}
Razor (correctly) notes that I'm not closing my beginning tags, and so has trouble parsing this syntax. If the tab content was small, I could use Html.ActionLink, but I've got a few lines of stuff and I'd like to keep the benefits of the HTML editor rather than putting it all into a string. Is there any way to do this?
You can write the tags as literal text to prevent Razor from parsing them:
@:<span>
How about something like this?
@{
    var linkOrSpan = CurrentlyOnThisPage ? "span" : "a";
}
<@linkOrSpan><text>Tab Content</text></@linkOrSpan>
No errors about closing tags with this.
Looks a bit cleaner too, IMHO.
HTH
Or just write it out explicitly:
@if(CurrentlyOnThisPage)
{
    <span>tabcontent</span>
} else {
    <a>tabcontent</a>
}
I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end code since I started working on this crawler.
You could try the Simple HTML DOM Parser. It sports a syntax to find specific elements similar to jQuery.
They have an example on how to scrape Slashdot:
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
foreach($html->find('div.article') as $article) {
$item['title'] = $article->find('div.title', 0)->plaintext;
$item['intro'] = $article->find('div.intro', 0)->plaintext;
$item['details'] = $article->find('div.details', 0)->plaintext;
$articles[] = $item;
}
print_r($articles);
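When the markup is not known in advance, a cruder fallback is to take the first <h1> (or else the <title>) as the headline and the text of all <p> elements as the content. The following is a minimal sketch of that heuristic in Python with the standard library only. It is not how Simple HTML DOM works, and the names HeadlineAndText and extract_headline_and_body are invented for the example.

```python
from html.parser import HTMLParser

TRACKED = ("title", "h1", "p")  # the only tags this heuristic looks at


class HeadlineAndText(HTMLParser):
    """Gather the text found inside <title>, <h1> and <p> elements."""

    def __init__(self):
        super().__init__()
        self.texts = {tag: [] for tag in TRACKED}
        self._open = []  # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self._open.append(tag)
            self.texts[tag].append("")

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to the matching tag; this tolerates unclosed <p> tags.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        if self._open:
            # Text inside inline tags (<em>, <b>, ...) counts toward the
            # innermost tracked element that is still open.
            self.texts[self._open[-1]][-1] += data


def extract_headline_and_body(html):
    """Return (headline, body): first <h1> or <title>, and joined <p> text."""
    parser = HeadlineAndText()
    parser.feed(html)
    parser.close()
    clean = {
        tag: [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
        for tag, chunks in parser.texts.items()
    }
    headline = (clean["h1"] or clean["title"] or [""])[0]
    body = "\n\n".join(clean["p"])
    return headline, body


page = (
    "<html><head><title>Site - Story</title></head><body>"
    "<h1>Big <em>Story</em></h1><p>First para.</p>"
    "<div><p>Second <b>para</b>.</p></div></body></html>"
)
print(extract_headline_and_body(page))
```

Pages that put their text in <div> elements instead of paragraphs, or that use several <h1> elements, will defeat this, so site-specific selectors like the ones in the Slashdot example remain the more reliable route where the structure is known.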