I would like to scrape/collect all the links on a page under a specific class name, e.g. this HTML:
Agriculture (92)
Agriculture
I have been toying with the following pieces of code:
List<?> links = page.getByXPath("//div[#class='generate']/#href");
OR
List<?> links = page.getAnchors();
System.out.println(links);
The getByXPath option returns null and the other option grabs all anchors. Is there a way to grab the links into a list?
This is a terrible XPath, but I was having issues narrowing it down. (I can look into a better XPath if necessary, but for now this one worked.)
List<?> links = page.getByXPath("/html/body/div[2]/div[2]/table/tbody/tr/td/table/tbody/tr[7]/td/table/tbody/tr/td/div/table/tbody/tr[2]/td/div/table/tbody/tr/td/table/tbody/tr/td/ul/li/a/#href").asList()
I'm not quite sure why it wouldn't let us grab it by that class name.
Let me know how it works for you when you get the chance
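For reference, here's a rough, untested sketch of the class-based approach. It assumes the anchors sit inside a div with class="generate", as the XPath in the question implies, and it selects the anchor elements themselves rather than the @href attribute nodes:

import java.util.ArrayList;
import java.util.List;

import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkCollector {

    // Collect the href of every anchor under a div with the given class.
    // "generate" is the class name taken from the XPath in the question.
    static List<String> collectLinks(HtmlPage page, String className) {
        List<String> hrefs = new ArrayList<String>();
        List<?> anchors = page.getByXPath("//div[@class='" + className + "']//a");
        for (Object node : anchors) {
            hrefs.add(((HtmlAnchor) node).getHrefAttribute());
        }
        return hrefs;
    }
}

Called as collectLinks(page, "generate"), it should return the same links the long absolute XPath produces, without depending on the table layout.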
Related
I need a suggestion on the most effective way to list all links in a website. I am able to do it with either PHP or VB, and I have tried to do it with Scrapy, but my problem is that with the first two it is not enough to input the address of the website; I actually have to scrape the subsequent links in my code. With Scrapy I have tried to list all subsequent links on the page, but the spider never seems to finish the crawl.
In other words, I need a way to input a website address and get back all the links present in that website. I need to do this for a school project (a small study of the retail industry), so I'd need to list up to 20,000 results for a given website.
Any suggestions?
Scrapy is a perfect choice here. Use CrawlSpider with LinkExtractor.
The following spider would follow and gather all links on the website. Since OffsiteMiddleware is enabled by default, you would not get links from other domains.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
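To run the spider and dump the collected links with Scrapy's feed exports, something along these lines should work from the project directory (the output file name is just an example):

scrapy crawl sample_spider -o links.csv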
If you want to limit the number of links and stop the spider after getting n links, use the CloseSpider extension and set the CLOSESPIDER_ITEMCOUNT setting:
CLOSESPIDER_ITEMCOUNT
An integer which specifies a number of items. If the spider scrapes more than that amount of items and those items are passed through the item pipeline, the spider will be closed with the reason closespider_itemcount. If zero (or not set), spiders won't be closed by the number of passed items.
In your case you can also use CLOSESPIDER_PAGECOUNT setting instead.
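For example, a minimal sketch you could drop into the project's settings.py (the 20,000 figure just mirrors the number mentioned in the question):

# Close the spider once this many items have passed through the item pipeline.
CLOSESPIDER_ITEMCOUNT = 20000

# Or close after this many responses have been crawled instead.
# CLOSESPIDER_PAGECOUNT = 20000

The same thing can be done per run on the command line, e.g. scrapy crawl sample_spider -s CLOSESPIDER_ITEMCOUNT=20000.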
Hope that helps.
I am using ColdFusion 9.0.1.
I have a new web site that uses Bikes.cfm and Makers.cfm as template pages. I need to be able to pass BikeID and MakerID to both of these pages, along with other variables. I don't want to use the actual page name in the URL, such as this:
MyDomain.com/Bikes.cfm?BikeID=1234&MakerID=1234
I want my URL to look more like this:
MyDomain.com/?BikeID=1234&MakerID=1234
I need to NOT specify the page name in the URL.
I want these two URLs to access different data:
MyDomain.com/?BikeID=1234&MakerID=1234 // goes to bike page
MyDomain.com/?MakerID=1234&BikeID=1234 // goes to maker page
So, if BikeID appears in the URL before MakerID, go to the Bikes.cfm page. If MakerID appears before BikeID, go the Makers.cfm page.
Is there an easy and existing method to arrange the URL keys in such a way to have them point to the appropriate page?
Should I just parse the URL as a list, determine the first ID, and go to the appropriate page? Is there a better way?
Any thoughts or hints or ideas would be appreciated.
UPDATE -- It certainly appears that using the order of parameters in a URL is a bad idea for the following reasons:
1) many programs append variables to the URL
2) some programs may reorder the variables
3) GoogleBot may not consider order relevant and will most likely not index the site correctly.
Thanks to everyone who pointed out, in a positive manner, that my approach was probably a bad idea and would not produce the results I wanted. Thanks to everyone who suggested alternate means to produce those results.
If anyone of you positive people would like to put your positive comment/advice as an answer, I'd be happy to accept it as the answer.
Despite my grave misgivings about the whole idea, here's how I would do it if I were forced to do so:
index.cfm:
<cfswitch expression="#ListFirst(cgi.query_string, '=')#">
    <cfcase value="BikeID">
        <cfinclude template="Bikes.cfm">
    </cfcase>
    <cfcase value="MakerID">
        <cfinclude template="Makers.cfm">
    </cfcase>
    <cfdefaultcase>
        <cfinclude template="Welcome.cfm">
    </cfdefaultcase>
</cfswitch>
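For example, with MyDomain.com/?BikeID=1234&MakerID=1234, cgi.query_string is BikeID=1234&MakerID=1234, so ListFirst(cgi.query_string, '=') evaluates to BikeID and Bikes.cfm is included; when MakerID comes first, Makers.cfm is included instead.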
Today I heard from a colleague that a search bot can index pages with sequential IDs.
Does that really happen?
As an example, check out these two URLs:
http://sample.com/myProduct?id=765
and
http://sample.com/myProduct?id=35d6eb6c-97f6-4cde-997c-ade657c285d3
So, if search bots can figure out that the product ID in my URL is sequential, they could possibly index other products up and down the sequence...
Have you ever heard anything like that ?
Whoever told you that is mistaken. Search engines will only index pages they know exist, so they won't keep changing the ID in those URLs just to see if they find anything. If you want those other pages to be indexed, you should use an HTML sitemap or an XML sitemap to tell the search engines where those pages are. Linking to them from other product pages is also a good idea.
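For example, a minimal XML sitemap in the sitemaps.org format might look like this, with one url entry per product page (the second entry is just an illustrative neighbouring ID):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://sample.com/myProduct?id=765</loc></url>
  <url><loc>http://sample.com/myProduct?id=766</loc></url>
</urlset>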
When creating tests for .Net applications, I can use the White library to find all elements of a given type. I can then write these elements to an Xml file, so they can be referenced and used for GUI tests. This is much faster than manually recording each individual element's info, so I would like to do the same for web applications using Selenium. I haven't been able to find any info on this yet.
I would like to be able to search for every element of a given type and save its information (location/XPath, value, and label) so I can write it to a text file later.
Here is the ideal workflow I'm trying to get to:
navigate_to_page(http://loginscreen.com)
log_in
open_account
button_elements = grab_elements_of_type(button) # this will return an array of XPaths and Names/IDs/whatever - some way of identifying each grabbed element
That code can run once, and I can then re-run it should any elements get changed, added, or removed.
I can then have another custom function iterate through the array, saving the info in a format I can use later easily; in this case, a Ruby class containing a list of constants:
LOGIN_BUTTON = "//div[1]/loginbutton"
EXIT_BUTTON = "//div[2]/exitbutton"
I can then write tests that look like this:
log_in # this will use the info that was automatically grabbed beforehand
current_screen.should == "Profile page"
Right now, every time I want to interact with a new element, I have to manually go to the page, select it, open it with XPather, and copy the XPath to whatever file I want my code to look at. This takes up a lot of time that could otherwise be spent writing code.
Ultimately what you're looking for is extracting the information you've recorded in your test into a reusable component.
Record your tests in Firefox using the Selenium IDE plugin.
Export your recorded test into a .cs file (assuming .NET since you mentioned White, but Ruby export options are also available).
Extract the XPath / CSS IDs and encapsulate them in reusable classes, using the PageObject pattern to represent each page.
Using the above technique, you only need to update your PageObject with updated locators instead of re-recording your tests.
Update:
You want to automate the record portion? Sounds awkward. Maybe you want to extract all the hyperlinks off a particular page and perform the same action on them?
You should use Selenium's object model to script against the DOM.
[Test]
public void GetAllHyperLinks()
{
    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl("http://youwebsite");

    ReadOnlyCollection<IWebElement> query =
        driver.FindElements(By.XPath("//yourxpath"));

    // iterate through the collection and access whatever you want
    // save it to a file, update a database, etc...
}
Update 2:
Ok, so I understand your concerns now. You're looking to get the locators out of a web page for future reference. The challenge is in constructing the locator!
There are going to be some challenges with constructing your locators, especially if there is more than one instance of an element, but you should be able to get far enough with the locator strategies Selenium supports.
For example, you could find all hyperlinks using the XPath "//a", and then use Selenium to construct a more specific locator for each one. You may have to customize it to suit your needs, but an example locator might use the CSS class and the text value of the hyperlink:
//a[contains(@class,'adminLink')][.='Edit']
// Selenium 2.0 syntax
[Test]
public void GetAllHyperLinks()
{
    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl("http://youwebsite");

    ReadOnlyCollection<IWebElement> query =
        driver.FindElements(By.XPath("//a"));

    foreach (IWebElement hyperLink in query)
    {
        string locatorFormat = "//a[contains(@class,'{0}')][.='{1}']";
        string locator = String.Format(locatorFormat,
            hyperLink.GetAttribute("class"),
            hyperLink.Text);

        // spit out the locator for reference.
    }
}
You're still going to need to associate the Locator to your code file, but this should at least get you started by extracting the locators for future use.
Here's an example of crawling links using Selenium 1.0 http://devio.wordpress.com/2008/10/24/crawling-all-links-with-selenium-and-nunit/
Selenium runs on the browser side; even if you can grab all the elements, there is no built-in way to save them to a file. As far as I know, Selenium is not designed for that kind of work.
Do you need to get the entire source of the page? If so, try the GetHtmlSource method:
http://release.seleniumhq.org/selenium-remote-control/0.9.0/doc/dotnet/html/Selenium.DefaultSelenium.GetHtmlSource.html
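A rough sketch of that approach with the Selenium RC .NET client referenced above (it assumes an RC server running on localhost:4444; the site URL and file name are placeholders):

// Grab the full page source via Selenium RC and write it out for later processing.
ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://youwebsite");
selenium.Start();
selenium.Open("/");
string html = selenium.GetHtmlSource();
System.IO.File.WriteAllText("page-source.html", html);
selenium.Stop();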
I've seen some websites highlight the search engine keywords you used to reach the page (such as the keywords you typed into the Google search listing).
How does it know what keywords you typed in the search engine? Does it examine the referrer HTTP header or something? Any available scripts that can do this? It might be server-side or JavaScript, I'm not sure.
This can be done either server-side or client-side. The search keywords are determined by looking at the HTTP Referer (sic) header. In JavaScript you can look at document.referrer.
Once you have the referrer, you check to see if it's a search engine results page you know about, and then parse out the search terms.
For example, Google's search results have URLs that look like this:
http://www.google.com/search?hl=en&q=programming+questions
The q query parameter is the search query, so you'd want to pull that out and un-URL-escape it, resulting in:
programming questions
Then you can search for the terms on your page and highlight them as necessary. If you're doing this server-side, you'd modify the HTML before sending it to the client. If you're doing it client-side, you'd manipulate the DOM.
There are existing libraries that can do this for you, like this one.
Realizing this is probably too late to make any difference...
Please, I beg you -- find out how to accomplish this and then never do it. As a web user, I find it intensely annoying (and distracting) when I come across a site that does this automatically. Most of the time it just ends up highlighting every other word on the page. If I need assistance finding a certain word within a page, my browser has a much more appropriate "find" function built right in, which I can use or not use at will, rather than having to reload the whole page to get it to go away when I don't want it (which is the vast majority of the time).
Basically, you...
Examine document.referrer.
Have a list mapping search engine domains to the GET param that contains the search terms:
var searchEnginesToGetParam = {
    'google.com': 'q',
    'bing.com': 'q'
};
Extract the appropriate GET param, and decodeURIComponent() it.
Parse the text nodes where you want to highlight the terms (see Replacing text with JavaScript).
You're done!
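Putting those steps together, a rough client-side sketch might look like the following. The helper name, the use of an anchor element as a URL parser, and the handling of '+' as a space are my own assumptions; the domain-to-param map comes from the steps above.

// Rough sketch: pull the search terms out of document.referrer, if any.
var searchEnginesToGetParam = {
    'google.com': 'q',
    'bing.com': 'q'
};

function getSearchTerms() {
    var referrer = document.referrer;
    if (!referrer) { return null; }

    var link = document.createElement('a'); // cheap URL parser
    link.href = referrer;

    for (var domain in searchEnginesToGetParam) {
        if (link.hostname.indexOf(domain) === -1) { continue; }
        var param = searchEnginesToGetParam[domain];
        // Pull the parameter out of the query string, e.g. "?hl=en&q=programming+questions"
        var match = link.search.match(new RegExp('[?&]' + param + '=([^&]*)'));
        if (match) {
            // '+' stands for a space in search query strings.
            return decodeURIComponent(match[1].replace(/\+/g, ' '));
        }
    }
    return null;
}

From there you would walk the text nodes and wrap the matched terms, as described in the parsing step above.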