JSoup: parse Twitter list

I want to parse a Twitter list (e.g. https://twitter.com/spdbt/lists/spd-bundestagsabgeordnete/members) using JSoup. My problem is that the page is dynamic, i.e. I only get the first 20 results from the page. Is there any way JSoup can fetch the whole page?
Currently, my code looks as follows:
Document doc = Jsoup.connect(listAdress).get();
Elements usernames = doc.select(".username.js-action-profile-name");
Elements realNames = doc.select(".fullname.js-action-profile-name");
// iterate over usernames and realNames and do something
Thanks in advance!

A workaround to achieve this:
Launch a browser with the above URL using Selenium.
Let the page load fully.
Get the page source using the Selenium method.
Pass this content to JSoup.
Parse it.
Logic:
WebDriver driver = new FirefoxDriver();
driver.get("https://twitter.com/spdbt/lists/spd-bundestagsabgeordnete/members");
// some logic to scroll, or do it manually
String pageContent = driver.getPageSource();
Document doc = Jsoup.parse(pageContent);
// from here, write your logic to get the required values
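For the scrolling step, one possible approach (a rough sketch, not tested against Twitter's current markup; the selectors are the ones from the question and may have changed) is to keep scrolling to the bottom with a JavascriptExecutor until the page height stops growing, then hand the final source to JSoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class TwitterListScraper {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new FirefoxDriver();
        driver.get("https://twitter.com/spdbt/lists/spd-bundestagsabgeordnete/members");

        JavascriptExecutor js = (JavascriptExecutor) driver;
        long lastHeight = (Long) js.executeScript("return document.body.scrollHeight;");
        while (true) {
            // scroll to the bottom so the next batch of members is loaded
            js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
            Thread.sleep(2000); // crude wait for the AJAX content to arrive
            long newHeight = (Long) js.executeScript("return document.body.scrollHeight;");
            if (newHeight == lastHeight) {
                break; // no more content was loaded
            }
            lastHeight = newHeight;
        }

        // hand the fully loaded page to JSoup and reuse the selectors from the question
        Document doc = Jsoup.parse(driver.getPageSource());
        Elements usernames = doc.select(".username.js-action-profile-name");
        Elements realNames = doc.select(".fullname.js-action-profile-name");
        // iterate over usernames and realNames and do something

        driver.quit();
    }
}

An explicit wait on the member count would be more robust than the fixed sleep, but this illustrates the idea.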

I finally solved the problem by using a Twitter library, but thanks for your help.

Related

Google Spreadsheet getting text with importxml

I've tried this and other versions to no avail. Can anyone help, please?
=IMPORTXML("http://performance.morningstar.com/fund/ratings-risk.action?t=MWTRX", "//*[#id='div_ratings_risk']/table/tbody/tr[4]/td[3]/text()")
As explained in the comments to your original question, the div element with the id div_ratings_risk is initially empty and does not contain a table.
So Google Spreadsheets cannot parse content that is not there yet and still needs to be loaded first.
The content (the table) you are trying to fetch into your Google spreadsheet is loaded dynamically with jQuery from another URL. You can find that URL using e.g. the Chrome developer tools and filtering for XHR requests.
If you parse the content directly from that URL it will work, so you need to point your formula at that URL and adapt your XPath like so:
=IMPORTXML("http://performance.morningstar.com/ratrisk/RatingRisk/fund/rating-risk.action?&t=XNAS:MWTRX&region=usa&culture=en-US&cur=&ops=clear&s=0P00001G5L&ep=true&comparisonRemove=null&benchmarkSecId=&benchmarktype=", "//table/tbody/tr[4]/td[3]/text()")

Extracting data using Mechanize gem. Parsing data with CSS header

I am trying to extract data from a website (http://oregonpinotnoirwine.com/) using Mechanize.
I am able to go to the website and select the search field, but I am not able to get the data. I am doing this in Ruby IRB.
require 'mechanize'
agent = Mechanize.new
agent.get("http://oregonpinotnoirwine.com/search.php")
form = agent.page.forms[0]
form["wineava"] = "Dundee Hills"
form.submit
Then I am trying to extract the list of all the wines on the website. In order to do that, I looked at the CSS of the website and decided to use .a, like:
agent.page.search(".a")
But that didn't return anything. However, when I type
agent.page.search(".")
It returns all the data from the website. Now I'm just trying different variations. When I type
agent.page.search("#wineava")
It returns the dropdown options from the site, but not the wine list.
So I was over-thinking this.
All the data I needed was in the dropdown menu. So after accessing the website through
agent.get("http://oregonpinotnoirwine.com/search.php")
I was able to get the data I needed with
agent.page.search("#winemaker")
But this method would not work if the items were not displayed in the dropdown menu, would it?

Twitter hashtag search

I am trying to search for all tweets with a given hashtag (for example #prayforjapan), using Titanium Appcelerator.
I have working code to search all tweets from a given user.
Now I'm trying to get all the tweets with #prayforjapan, but this isn't working.
I tried the following method, since I found it on here.
To search for the names, I use this URL:
var xhr = Ti.Network.createHTTPClient();
xhr.timeout = 1000000;
xhr.open("GET","http://api.twitter.com/1/statuses/user_timeline.json?screen_name="+screen_name);
For the hashtag search I tried the following code (it doesn't work, though):
var xhr = Ti.Network.createHTTPClient();
xhr.timeout = 1000000;
xhr.open("GET","http://search.twitter.com/search.json?q=prayforjapan");
Does anyone know what's wrong with this search, or which URL it should be?
Thanks!
Well, I played with it for a little bit and came up with this link format.
http://search.twitter.com/search?q=%23prayforjapan
How does that work?
%23prayforjapan is the same as #prayforjapan; %23 is the URL-encoded form of the # character.
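The encoding matters because a raw # in a URL starts the fragment, so everything after it never reaches the server. A small Java illustration of producing the encoded query (any language's URL-encoding helper does the same):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class HashtagQuery {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "#" is a reserved character; encode it so it survives the query string
        String query = URLEncoder.encode("#prayforjapan", "UTF-8"); // -> %23prayforjapan
        String url = "http://search.twitter.com/search.json?q=" + query;
        System.out.println(url);
    }
}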
Sorry, Twitter does not supply this, neither with the REST nor the Streaming API. They only provide partial results unless you pay for the "garden hose" or "firehose", both of which are very costly; the garden hose currently starts at about $6,000/month.

HTMLUnit collecting all links by class name

I would like to scrape / collect all the links on a page under a specific class name, e.g. HTML links such as "Agriculture (92)" and "Agriculture".
I have been toying with the following pieces of code:
List<?> links = page.getByXPath("//div[@class='generate']/@href");
OR
List<?> links = page.getAnchors();
System.out.println(links);
The getByXPath option returns null and the other option grabs all anchors. Is there a way to grab the links into a list?
This is a terrible XPath, but I was having issues narrowing it down. (I can look into a better XPath if necessary, but for now this one worked.)
List<?> links = page.getByXPath("/html/body/div[2]/div[2]/table/tbody/tr/td/table/tbody/tr[7]/td/table/tbody/tr/td/div/table/tbody/tr[2]/td/div/table/tbody/tr/td/table/tbody/tr/td/ul/li/a/@href");
I'm not quite sure why it wasn't letting us grab it by that class name.
Let me know how it works for you when you get the chance.
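For what it's worth, a class-based XPath normally does work in HtmlUnit as long as the class is on the element you select. A minimal sketch (the URL is hypothetical and the class name 'generate' is taken from the question, so the selector may need adjusting to the real markup):

import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkCollector {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        HtmlPage page = client.getPage("http://example.com/categories"); // hypothetical URL
        // anchors inside a div with class 'generate'; adjust if the class sits on the <a> itself
        List<?> nodes = page.getByXPath("//div[@class='generate']//a");
        for (Object node : nodes) {
            HtmlAnchor link = (HtmlAnchor) node;
            System.out.println(link.getHrefAttribute() + " -> " + link.asText());
        }
    }
}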

Is there a way to automatically grab all the elements on the page using Selenium?

When creating tests for .Net applications, I can use the White library to find all elements of a given type. I can then write these elements to an Xml file, so they can be referenced and used for GUI tests. This is much faster than manually recording each individual element's info, so I would like to do the same for web applications using Selenium. I haven't been able to find any info on this yet.
I would like to be able to search for every element of a given type and save its information (location/XPath, value, and label) so I can write it to a text file later.
Here is the ideal workflow I'm trying to get to:
navigate_to_page(http://loginscreen.com)
log_in
open_account
button_elements = grab_elements_of_type(button) # this will return an array of XPaths and Names/IDs/whatever - some way of identifying each grabbed element
That code can run once, and I can then re-run it should any elements get changed, added, or removed.
I can then have another custom function iterate through the array, saving the info in a format I can use later easily; in this case, a Ruby class containing a list of constants:
LOGIN_BUTTON = "//div[1]/loginbutton"
EXIT_BUTTON = "//div[2]/exitbutton"
I can then write tests that look like this:
log_in # this will use the info that was automatically grabbed beforehand
current_screen.should == "Profile page"
Right now, every time I want to interact with a new element, I have to manually go to the page, select it, open it with XPather, and copy the XPath to whatever file I want my code to look at. This takes up a lot of time that could otherwise be spent writing code.
Ultimately what you're looking for is extracting the information you've recorded in your test into a reusable component.
Record your tests in Firefox using the Selenium IDE plugin.
Export your recorded test into a .cs file (assuming .NET, since you mentioned White, but Ruby export options are also available).
Extract the XPath / CSS IDs and encapsulate them into reusable classes, using the PageObject pattern to represent each page.
Using the above technique, you only need to update your PageObject with updated locators instead of re-recording your tests.
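As a rough illustration of the PageObject idea (sketched here in Java with WebDriver; the locators are the ones from the question and the class name is made up):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

// A PageObject wraps the locators of one page so tests never hard-code XPaths.
public class LoginPage {
    // hypothetical locators, reusing the constants from the question
    private static final By LOGIN_BUTTON = By.xpath("//div[1]/loginbutton");
    private static final By EXIT_BUTTON = By.xpath("//div[2]/exitbutton");

    private final WebDriver driver;

    public LoginPage(WebDriver driver) {
        this.driver = driver;
    }

    public void logIn() {
        driver.findElement(LOGIN_BUTTON).click();
    }

    public void exit() {
        driver.findElement(EXIT_BUTTON).click();
    }
}

// In a test, only the PageObject changes when a locator changes:
// new LoginPage(driver).logIn();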
Update:
You want to automate the record portion? Sounds awkward. Maybe you want to extract all the hyperlinks off a particular page and perform the same action on them?
You should use Selenium's object model to script against the DOM.
[Test]
public void GetAllHyperLinks()
{
    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl("http://youwebsite");
    ReadOnlyCollection<IWebElement> query =
        driver.FindElements(By.XPath("//yourxpath"));
    // iterate through the collection and access whatever you want
    // save it to a file, update a database, etc...
}
Update 2:
Ok, so I understand your concerns now. You're looking to get the locators out of a web page for future reference. The challenge is in constructing the locator!
There are going to be some challenges with constructing your locators, especially if there is more than one instance of an element, but you should be able to get far enough using the CSS and XPath locators Selenium supports.
For example, you could find all hyperlinks using the XPath "//a" and then use Selenium to construct a locator for each one. You may have to customize the locator to suit your needs, but an example might use the class and text value of the hyperlink:
//a[contains(@class,'adminLink')][.='Edit']
// Selenium 2.0 syntax
[Test]
public void GetAllHyperLinks()
{
    IWebDriver driver = new FirefoxDriver();
    driver.Navigate().GoToUrl("http://youwebsite");
    ReadOnlyCollection<IWebElement> query =
        driver.FindElements(By.XPath("//a"));
    foreach (IWebElement hyperLink in query)
    {
        string locatorFormat = "//a[contains(@class,'{0}')][.='{1}']";
        string locator = String.Format(locatorFormat,
            hyperLink.GetAttribute("class"),
            hyperLink.Text);
        // spit out the locator for reference
    }
}
You're still going to need to associate the Locator to your code file, but this should at least get you started by extracting the locators for future use.
Here's an example of crawling links using Selenium 1.0 http://devio.wordpress.com/2008/10/24/crawling-all-links-with-selenium-and-nunit/
Selenium runs on the browser side; even if you can grab all the elements, there is no way to save them to a file from there. As far as I know, Selenium is not designed for that kind of work.
Do you need the entire source of the page? If so, try the GetHtmlSource method:
http://release.seleniumhq.org/selenium-remote-control/0.9.0/doc/dotnet/html/Selenium.DefaultSelenium.GetHtmlSource.html
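A minimal sketch of that approach using the Selenium RC client (shown in Java, where the equivalent call is getHtmlSource(); the host, port, and URL are placeholders), saving the source to a file so it can be parsed later:

import java.io.FileWriter;
import java.io.IOException;
import com.thoughtworks.selenium.DefaultSelenium;

public class PageSourceDump {
    public static void main(String[] args) throws IOException {
        DefaultSelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://yourwebsite");
        selenium.start();
        selenium.open("/");                      // navigate to the page you want to capture
        String html = selenium.getHtmlSource();  // full source of the current page
        selenium.stop();

        try (FileWriter out = new FileWriter("page-source.html")) {
            out.write(html);                     // save it for later inspection or parsing
        }
    }
}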
