I need a suggestion on the most effective way to list all links on a website. I am able to do it with either PHP or VB, and I have also tried Scrapy. My problem is that with the first two it is not enough to input the address of the website; I actually have to scrape the subsequent links in my code. With Scrapy I have tried to list all subsequent links on the page, but the spider never seems to finish crawling.
In other words, I need a way to input a website address and get back all links present on that website. I need this for a school project: I am planning a small study of the retailing industry, so I would need to list up to 20 000 results for a given website.
Any suggestions?
Scrapy is a perfect choice here. Use CrawlSpider with LinkExtractor.
The following spider follows and gathers all links on the website. Since OffsiteMiddleware is enabled by default, you will not get links from other domains.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
If you want to limit the number of links and stop the spider after getting n links, use the CloseSpider extension and set the CLOSESPIDER_ITEMCOUNT setting:
CLOSESPIDER_ITEMCOUNT
An integer which specifies a number of items. If the spider scrapes more than that amount of items and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by number of passed items.
In your case you can also use CLOSESPIDER_PAGECOUNT setting instead.
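For example, a minimal settings sketch for the 20 000-link target from the question (CLOSESPIDER_ITEMCOUNT and CLOSESPIDER_PAGECOUNT are standard Scrapy settings; the number is the question's own):
# settings.py
CLOSESPIDER_ITEMCOUNT = 20000    # close the spider after 20 000 scraped items
# CLOSESPIDER_PAGECOUNT = 20000  # alternative: close after 20 000 crawled pages
The same value can also be passed on the command line, e.g. scrapy crawl sample_spider -s CLOSESPIDER_ITEMCOUNT=20000.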
Hope that helps.
I am trying to find the correct template and id to use for a hotprint of an advanced pdf template of an Item Fulfillment.
The hotprint URL is as follows, with the id parameter (7600) being the one in question: https://system.na3.netsuite.com/app/accounting/print/hotprint.nl?regular=T&sethotprinter=T&id=7600&label=Packing%20Slip&printtype=packingslip&trantype=itemship&orgtrantype=TrnfrOrd&auxtrans=7605
For some reason only certain id=# values seem to affect the outcome, and the ids I have gotten to work for two different templates match neither the Custom Transaction Forms ID nor the Advanced PDF script id (for example, most ids resolve to template 1, while 168, 4954, and seemingly random other ids resolve to template 2). I am very confused about how NetSuite resolves the hotprint URL, as it normally doesn't include the template= part, though I have seen others use it for invoice print URLs.
The parameters at the end of the URL (the stuff after the ?) are used by NetSuite to control settings for the webpage which prints the PDFs for you.
In this case, &id=##### refers to the internal id of the document you are printing. You can see this by going to the document, right clicking, selecting inspect, and typing nlapiGetRecordId() into the console. When you click Print, you should see that same number after &id=#####.
&template=### refers to the template you are printing. If you go to Customization -> Forms -> Advanced PDF/HTML Templates, you'll notice a Script ID field in the table. If you substitute the correct Script ID in for the number in &template=###, you'll notice you generate the same PDF. This Script ID acts the same as the number that was previously there.
The reason you're seeing unusual results when you change those numbers is because you're mismatching a record with a template not built for it. So it won't print exactly right, but will sometimes execute anyways.
This sort of parameter scheme is similar to how Suitelets and Restlets work, so you may well run into it again in the future.
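To illustrate the scheme, here is a rough sketch (plain Python, not NetSuite code) showing how the hotprint URL from the question decomposes into those parameters; all values are taken from the question itself:
from urllib.parse import urlparse, parse_qs

url = ("https://system.na3.netsuite.com/app/accounting/print/hotprint.nl"
       "?regular=T&sethotprinter=T&id=7600&label=Packing%20Slip"
       "&printtype=packingslip&trantype=itemship&orgtrantype=TrnfrOrd&auxtrans=7605")

params = parse_qs(urlparse(url).query)
print(params["id"])     # ['7600'] -> internal id of the record being printed
print(params["label"])  # ['Packing Slip'] -> parse_qs un-escapes %20 for us
# A template override would appear as an extra "template" parameter in the query.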
EDIT: For those reading this in the future, please read the comments.
To customize a packing slip and return form:
If you are printing packing slips and need some customization, you can use a custom invoice form when printing packing slips. For example, you can customize an invoice form to hide the fulfilled item tax rate and amount, and the order total. Then, when you print the packing slip using the custom form through mass print, the packing slip shows the customized information.
I am using Yahoo pipes to make automated Twitter Searches using terms from the description fields of an RSS feed.
Pipes makes one search from each item in the feed. Each search returns a set of results, which are assigned as item.twitloop (all results).
I would like to replace the link in each item of the results with the link from the original query item.
So far I am only able to assign the original link to the first item in the results list rather than to each item.
http://pipes.yahoo.com/pipes/pipe.edit?_id=01f5f60eb8f3c22b45aa3708e5ae057a
Can anyone see where I'm going wrong?
The pipe isn't loading for me - perhaps you didn't set it as public? In any event, I have solved similar problems in the past by using the Loop module. You put the assignment into the loop (usually a string builder works well), and then have the Loop put that original link into item.link.
As you may know, if you have products that share a url key, the url key will have a digit appended to it:
i.e.
http://www.example.com/main-category/sub-category/product-name-6260.html
How do I find the source of that 6260 (which is one of the numbers appended to my URLs)? I tried the product id and the SKU; I cannot find the source of it. The reason I ask is that if I can find it, I can create a string replace function to flush it out of URLs before I echo them on certain product listing pages.
Thanks.
Before we get to the location in code where this happens, be advised you're entering a world of pain.
There's no simple rule as to how those numbers are generated. There are cases where it's the store ID, cases where it's the simple product ID, and cases where it's neither.
Even if there were such a rule, it's common for not-from-scratch Magento sites to contain custom functionality that changes this.
Ultimately, since Magento's human-readable/SEO-friendly URLs live in the core_url_rewrite table, it's possible for people to insert arbitrary text.
Warnings of doom aside, the Model you're looking for is Mage::getSingleton('catalog/url'). This contains most of the logic for generating Magento Catalog and product rewrites. All of these methods end by passing the request path through the getUnusedPath method.
#File: app/code/core/Mage/Catalog/Model/Url.php
public function getUnusedPath($storeId, $requestPath, $idPath)
{
    //...
}
This method contains the logic for creating a unique number on the end of the URL. Tracing it in its entirety is beyond the scope of a Stack Overflow post, but this line in particular is enlightening/disheartening.
$lastRequestPath = $this->getResource()
    ->getLastUsedRewriteRequestIncrement($match[1], $match[4], $storeId);
if ($lastRequestPath) {
    $match[3] = $lastRequestPath;
}
return $match[1]
    . (isset($match[3]) ? ($match[3]+1) : '1')
    . $match[4];
In particular, these two lines
$match[3] = $lastRequestPath;
//...
. (isset($match[3]) ? ($match[3]+1) : '1')
//...
In case it's not obvious, there are cases where Magento will automatically append a 1 to a URL, and then continue to increment it. This makes the generation of those URLs dependent on system state when they were generated — there's no simple rule.
Other lines of interest in this file are
if (strpos($idPath, 'product') !== false) {
    $suffix = $this->getProductUrlSuffix($storeId);
} else {
    $suffix = $this->getCategoryUrlSuffix($storeId);
}
This $suffix will be used on the end of the URL as well, so those methods are worth investigating.
If all you're trying to do is remove numbers from the URL, you might be better off with a regular expression or some explode/implode string jiggering.
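If you do go that route, here is a minimal sketch of the idea, assuming the appended number is always a trailing "-digits" group sitting right before the suffix (shown in Python for brevity; PHP's preg_replace accepts the same pattern):
import re

url = "http://www.example.com/main-category/sub-category/product-name-6260.html"
# Strip a "-<digits>" group immediately preceding the ".html" suffix.
clean = re.sub(r"-\d+(?=\.html$)", "", url)
print(clean)  # http://www.example.com/main-category/sub-category/product-name.html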
I have little to no idea why this works, but it worked for me; most probably because it makes the URLs unique again. Magento ver. 1.7.0.2 had suddenly started adding numbers as suffixes to my new products' names, even if their URL keys and names were different from the old products. On a hunch, I went to System -> Configuration -> Catalog -> Search Engine Optimizations -> Product URL Suffix and changed the default .html to -prod.html. I guess you could change it to any suffix you wanted. Then I re-indexed my website, refreshed the cache, and presto! All the numbers were gone from the product URLs. The product URLs now all have the format custom-product-name-prod.html. The canonical tag also shows custom-product-name-prod.html, so I'm doubly happy.
Don't know if it'll work for others, but I hope it does. Do note that I did have old and new products with duplicate URLs and that I had disabled old products before doing this procedure. So if you have 2 products with the same url key and both are enabled, then this may NOT work for you.
There's a problem: when you look at the product in the edit view, the link is correct, but on your store the URL is different.
Step 1: Edit the URL and add "ss" to the end of the custom URL. Press Save. (\product-namess.html)
Step 2: Go to Catalog -> URL Rewrite.
Step 3: Narrow down the criteria to include only the problem situation. Press Search.
Step 4: Download the JitBit Mouse Recorder / Macro Recorder. It's free.
Step 5: Edit the first rewrite. Press Delete and OK.
Step 6: Press Record on JitBit. Edit the first rewrite. Delete. OK. Press Stop.
Step 7: In the dropdown next to Play, select "X Times". Enter the number of records. Press OK.
Step 8: Watch the program delete all the records for you.
Step 9: Go to Catalog -> Manage Products. Remove the "ss" from the end of the custom URL. Do not check "create redirect". Fixed.
Mine had 279 rewrite records for that product, so it took about an hour.
I would like to scrape / collect all the links on a page under a specific class name,
e.g. HTML links such as "Agriculture (92)" and "Agriculture" rendered inside a div with the class name in question.
I have been toying with the following pieces of code:
List<?> links = page.getByXPath("//div[#class='generate']/#href");
OR
List<?> links = page.getAnchors();
System.out.println(links);
The getByXPath option returns null and the other option grabs all anchors. Is there a way to grab the links into a list?
This is a terrible XPath, but I was having issues narrowing it down. (I can look into a better XPath if necessary, but for now this one worked.)
List<?> links = page.getByXPath("/html/body/div[2]/div[2]/table/tbody/tr/td/table/tbody/tr[7]/td/table/tbody/tr/td/div/table/tbody/tr[2]/td/div/table/tbody/tr/td/table/tbody/tr/td/ul/li/a/#href").asList()
I'm not quite sure why it wasn't letting us grab the links by that class name.
Let me know how it works for you when you get the chance.
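For what it's worth, a quick way to sanity-check a class-based XPath outside HtmlUnit is to run it through Python's lxml; the markup below is hypothetical, standing in for the page in the question, and the point is that @href must be read from the anchors inside the div rather than from the div itself:
from lxml import html

# Hypothetical markup modeled on the question (the real hrefs are unknown).
page_source = """
<div class="generate">
    <a href="/categories/agriculture">Agriculture (92)</a>
</div>
"""

tree = html.fromstring(page_source)
# Select the href attributes of anchors inside the div, not the div's own href.
links = tree.xpath("//div[@class='generate']//a/@href")
print(links)  # ['/categories/agriculture']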
I've seen some websites highlight the search engine keywords you used, to reach the page. (such as the keywords you typed in the Google search listing)
How does it know what keywords you typed in the search engine? Does it examine the referrer HTTP header or something? Any available scripts that can do this? It might be server-side or JavaScript, I'm not sure.
This can be done either server-side or client-side. The search keywords are determined by looking at the HTTP Referer (sic) header. In JavaScript you can look at document.referrer.
Once you have the referrer, you check to see if it's a search engine results page you know about, and then parse out the search terms.
For example, Google's search results have URLs that look like this:
http://www.google.com/search?hl=en&q=programming+questions
The q query parameter is the search query, so you'd want to pull that out and un-URL-escape it, resulting in:
programming questions
Then you can search for the terms on your page and highlight them as necessary. If you're doing this server-side, you'd modify the HTML before sending it to the client. If you're doing it client-side, you'd manipulate the DOM.
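For the server-side variant, a minimal sketch of the whole flow in Python (the google.com/bing.com hostnames and the q parameter come from the explanation above; the helper names are made up for illustration):
import re
from urllib.parse import urlparse, parse_qs

def extract_search_terms(referrer):
    # Return the search query if the referrer is a known results page, else None.
    host = urlparse(referrer).hostname or ""
    if host.endswith(("google.com", "bing.com")):  # both use the q parameter
        return parse_qs(urlparse(referrer).query).get("q", [None])[0]
    return None

def highlight(html_text, term):
    # Naive highlighting: wraps matches in <mark>; real code must avoid
    # touching tag names and attributes.
    return re.sub(re.escape(term), lambda m: "<mark>" + m.group(0) + "</mark>",
                  html_text, flags=re.IGNORECASE)

terms = extract_search_terms("http://www.google.com/search?hl=en&q=programming+questions")
print(terms)  # programming questions (parse_qs also un-escapes the '+')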
There are existing libraries that can do this for you, like this one.
Realizing this is probably too late to make any difference...
Please, I beg you -- find out how to accomplish this and then never do it. As a web user, I find it intensely annoying (and distracting) when I come across a site that does this automatically. Most of the time it just ends up highlighting every other word on the page. If I need assistance finding a certain word within a page, my browser has a much more appropriate "find" function built right in, which I can use or not use at will, rather than having to reload the whole page to get it to go away when I don't want it (which is the vast majority of the time).
Basically, you...
Examine document.referrer.
Have a list mapping domains to the GET param that contains the search terms, e.g.:
var searchEnginesToGetParam = {
    'google.com': 'q',
    'bing.com': 'q'
};
Extract the appropriate GET param, and decodeURIComponent() it.
Parse the text nodes where you want to highlight the terms (see Replacing text with JavaScript).
You're done!