Parse Stack Overflow page source code and get accepted answer - html-parsing

I am trying to write a function that takes an input URL of any Stack Overflow link, gets the source code of the page, parses it, gets the accepted answer, and also gets the answer with the most upvotes.
I am new to this and I don't know how to do this. This is what I've tried out. It just returns the first answer using jsoup.
protected void doHtmlParse(String url) {
    // TODO Auto-generated method stub
    Document doc;
    try {
        doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                .referrer("http://www.google.com")
                .get();
        Element answer = doc.select("td[class=answercell]").get(0);
        System.out.println("Answer is \n" + answer.toString());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
I only need to display the answer part, but it has to be the accepted answer. How do I approach this?

You don't really need to parse the HTML. Use their REST API instead.
Have a look.
Here's an example. Note the is_accepted attribute.
EDIT:
Well, after you've got the chosen answer through the API, you could do this:
String answer = document.getElementById("answer-"+id).outerHtml();
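A minimal sketch of the selection step described above, assuming the answers have already been fetched from the API and mapped to plain objects. The class names `AnswerPicker` and `Answer` are hypothetical, and the endpoint string reflects the Stack Exchange API v2.3 as far as I know, so check it against the API documentation. Fetching and JSON parsing are left out; only the `answer_id`, `score`, and `is_accepted` fields are modelled.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnswerPicker {

    /** Hypothetical holder for the three API fields used here. */
    static final class Answer {
        final long answerId;    // "answer_id" in the API response
        final int score;        // "score"
        final boolean accepted; // "is_accepted"

        Answer(long answerId, int score, boolean accepted) {
            this.answerId = answerId;
            this.score = score;
            this.accepted = accepted;
        }
    }

    private static final Pattern QUESTION_ID = Pattern.compile("/questions/(\\d+)");

    /** Extracts the numeric question id from a Stack Overflow question URL. */
    static long questionId(String url) {
        Matcher m = QUESTION_ID.matcher(url);
        if (!m.find()) {
            throw new IllegalArgumentException("No question id in URL: " + url);
        }
        return Long.parseLong(m.group(1));
    }

    /** Builds the answers endpoint (assumed URL shape, verify against the API docs). */
    static String answersEndpoint(long questionId) {
        return "https://api.stackexchange.com/2.3/questions/" + questionId
                + "/answers?site=stackoverflow&order=desc&sort=votes";
    }

    /** Returns the accepted answer, if the question has one. */
    static Optional<Answer> accepted(List<Answer> answers) {
        return answers.stream().filter(a -> a.accepted).findFirst();
    }

    /** Returns the answer with the highest score, if there is any answer. */
    static Optional<Answer> topVoted(List<Answer> answers) {
        return answers.stream().max(Comparator.comparingInt((Answer a) -> a.score));
    }

    public static void main(String[] args) {
        List<Answer> answers = Arrays.asList(
                new Answer(1L, 10, false),
                new Answer(2L, 3, true));
        System.out.println(answersEndpoint(questionId("https://stackoverflow.com/questions/12345/title")));
        System.out.println("accepted id: " + accepted(answers).get().answerId);
        System.out.println("top voted id: " + topVoted(answers).get().answerId);
    }
}
```

The id of whichever answer is picked can then be passed to the getElementById("answer-" + id) call shown above.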

I am now able to get the accepted answer with this code.
protected void doHtmlParse(String url) {
    Document doc;
    try {
        doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                .referrer("http://www.google.com")
                .get();
        Element answer = doc.select("div[class=answer accepted-answer]").first();
        if (answer == null) {
            // first() returns null when the question has no accepted answer
            System.out.println("No accepted answer found");
            return;
        }
        // Print the text of the answer cell inside the accepted answer
        Elements tds = answer.getElementsByTag("td");
        for (Element td : tds) {
            String className = td.attr("class");
            if (className.equals("answercell")) {
                System.out.println("\n\nAccepted answer is \n" + td.text());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Related

Jsoup always return timeout error

I have the following code to get the stock price from Yahoo Finance HK,
but it always returns a timeout error.
Please help.
public GetStockPriceFromWebOneByOne(String url){
    this.url = url;
}
private void setDataFromAAStock() throws IOException, InterruptedException{
    Document document = Jsoup.connect(url).ignoreHttpErrors(true).timeout(timeOut*1000).get(); // s
    //TimeUnit.SECONDS.sleep(2);
    Elements answerers = document.select("div.yfi_rt_quote_summary div.yfi_rt_quote_summary_rt_top.sigfig_promo_0 span.time_rtq_ticker");
    // Elements answerers = document.select(".content .inline_block.vat.float_l .boxForex .font26 .neg .arr_ud.arrow_d6");
    for (Element answerer : answerers) {
        //System.out.print(answerer.text()+"\n");
        price = answerer.text();
        // splitString(answerer.text());
    }
}
public String getDataFromAAStock() throws IOException, InterruptedException{
    setDataFromAAStock();
    return price;
}
I did not check against Yahoo Finance HK, but you should probably set a plausible browser userAgent string when connecting to it. See the docs.
Document document = Jsoup.connect(url)
        .ignoreHttpErrors(true)
        .timeout(timeOut*1000)
        .userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
        .get();
Addendum:
Of course, you can turn off the timeout altogether by using:
Document document = Jsoup.connect(url)
        .ignoreHttpErrors(true)
        .timeout(0) // zero means no timeout
        .get();
Did you look at the network traffic between the browser and the site with the browser developer tools? It might help you to analyze the underlying problem.
I would split the
Document document = Jsoup.connect(url).ignoreHttpErrors(true).timeout(timeOut*1000).get();
into
Connection connect = Jsoup.connect(url)
        .ignoreHttpErrors(true)
        .timeout(timeOut*1000)
        // use this for chrome
        .userAgent("Mozilla");
System.out.println("Connection made BEFORE document.");
Document document = connect.get();
System.out.println("Connection made AFTER document.");
I think the issue is with your Connection: it may need a .userAgent("Mozilla") set BEFORE you call .get().

Parse Error - When converting an XML string to a document

Been breaking my head to get this straight. Pretty simple though.. have not been able to figure out why. Any help would be very much appreciated.
Here is my XML file
<?xml version="1.0" encoding="UTF-8"?>
<User mode="Retrieve" simCardNumber=“9602875089237652" softwareVersion=“9" phoneManufacturer=“Nokia" phoneModel="I747" deviceId=“562372389498734" networkOperator=“Blu">
<Errors>
<Error number="404"/>
</Errors>
</User>
private static Document convertStringToDocument(String xmlStr) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try
    {
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The statement below fails and jumps to return null
        //Document doc = builder.parse( new InputSource(new StringReader(xmlStr)));
        // Added a replace call on the string to handle the strange-looking double quote
        // in the XML string. However, I still get the same error.
        Document doc = builder.parse( new InputSource(new StringReader(xmlStr.replace("“", "\'\""))));
        return doc;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
Check the quotes:
networkOperator=“Blu"
I don't know whether it is a paste error, but you used “ instead of " in your XML. The curly one is often used by rich text editors as an opening quote; you need to replace it with a straight quote for the XML to be parseable.
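If the curly quotes cannot be fixed at the source, one option is to normalize them before parsing. The following is a rough sketch using only JDK classes; the class name `QuoteNormalizingXmlParser` and its methods are hypothetical. Note that it replaces every curly double quote in the input, including any that legitimately appear in text content, so it only suits input where that is acceptable.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class QuoteNormalizingXmlParser {

    /** Replaces left (U+201C) and right (U+201D) curly double quotes with straight ones. */
    static String normalizeQuotes(String xml) {
        return xml.replace('\u201C', '"').replace('\u201D', '"');
    }

    /** Normalizes quotes, then parses the string into a DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(normalizeQuotes(xml))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers do not have to handle three checked exception types
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<User networkOperator=\u201CBlu\"><Errors><Error number=\"404\"/></Errors></User>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getAttribute("networkOperator"));
    }
}
```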
Ok this solution works. Thanks everyone for your time and support.
Document doc = null;
try
{
    DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    InputSource is = new InputSource();
    is.setCharacterStream(new StringReader(xmlStr));
    doc = db.parse(is);
} catch (Exception e) {
    e.printStackTrace();
}
return doc;

Get current URL with selenium

Hello
I would like to get the current URL of the page after clicking a link. When I click the link on the first page, it opens a new page (the second page), and I want the URL of that second page. But when I call getCurrentUrl(), the method returns the URL of the first page.
This is my code:
String att = driver.findElement(By.linkText("Lien 2")).getAttribute("href");
driver.findElement(By.linkText("Lien 2")).click(); // Opens the second page
driver.manage().timeouts().pageLoadTimeout(30, TimeUnit.SECONDS);
String act = driver.getCurrentUrl(); // Returns the URL of the first page, but I want the second
System.out.println("act "+act+" att "+att);
assertEquals(act, att);
Thanks very much for the help!
Instead of changing pageLoadTimeout(), try manually waiting for the URL to change (this is what I do in my own code, so I thought I'd answer).
Insert the function definition below into your code.
The function prints the current URL in a loop, so it's helpful for debugging any problems.
Change the value "30" in WebDriverWait(driver, 30) if you want a longer timeout.
Make sure to call the function as follows:
try {
    waitForUrl(driver, "the-url-you-want-to-wait-for");
} catch (Exception ex) {
    // handle exception
}
Function:
/**
 * Wait until the current page's URL changes to whatever you specify.
 * @param driver The WebDriver instance to poll.
 * @param Url The URL you want to wait for.
 */
protected static void waitForUrl(WebDriver driver, final String Url) throws Exception {
    try {
        (new WebDriverWait(driver, 30)).until(new ExpectedCondition<Boolean>() {
            public Boolean apply(WebDriver d) {
                System.out.println(d.getCurrentUrl());
                // Lower-case both sides so the comparison is case-insensitive
                return d.getCurrentUrl().toLowerCase().endsWith(Url.toLowerCase());
            }
        });
    }
    catch (TimeoutException e) {
        throw new Exception("Timeout Exception encountered while waiting for URL " + Url + ": \n" + e.getMessage());
    }
}

JSoup & URL non Latin Charsets

I am using the following implementation in a Java Servlet:
String url = "http://mydomain.com/test.php?myparam="+myname;
Document doc = null;
try {
    doc = Jsoup.connect(url).get();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Where myname is a String in UTF Charset.
For some reason the result received is not OK (unreadable chars).
Is there a way to force the URL in JSoup to be UTF as well?
Thanks
Try encoding the parameter value (not the whole URL, which would also escape the "://" and "?"):
url = "http://mydomain.com/test.php?myparam=" + URLEncoder.encode(myname, "UTF-8");
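As a small illustration of the encoding step, here is a sketch that percent-encodes only the query parameter value as UTF-8 and leaves the rest of the URL untouched. The class `QueryUrlBuilder` and its `buildUrl` method are hypothetical names; `URLEncoder.encode(String, String)` is the standard JDK call. Whether the result is readable on the other end also depends on the server decoding the parameter as UTF-8, which this sketch cannot guarantee.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryUrlBuilder {

    /**
     * Appends a single query parameter to baseUrl, encoding only the value.
     * Encoding the whole URL would also escape "://", "/" and "?".
     */
    static String buildUrl(String baseUrl, String paramName, String paramValue) {
        try {
            return baseUrl + "?" + paramName + "=" + URLEncoder.encode(paramValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        // Non-Latin characters become percent-encoded UTF-8 bytes; spaces become '+'
        System.out.println(buildUrl("http://mydomain.com/test.php", "myparam", "Jos\u00E9 \u00D1"));
    }
}
```

The resulting string can be passed to Jsoup.connect(url) as before.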

BlackBerry: Make an iterative HTTP GET request using Comms API

I want to store position coords (latitude, longitude) in a table in my MySQL DB querying a url in a way similar to this one: http://locationstore.com/postlocation.php?latitude=var1&longitude=var2 every ten seconds. PHP script works like a charm. Getting the coords in the device ain't no problem either. But making the request to the server is being a hard one. My code goes like this:
public class LocationHTTPSender extends Thread {
    public void run() {
        for (;;) {
            try {
                // fetch latest coordinates
                coords = this.coords();
                // reset url
                this.url="http://locationstore.com/postlocation.php";
                // create uri
                uri = URI.create(this.url);
                FireAndForgetDestination ffd = null;
                ffd = (FireAndForgetDestination) DestinationFactory.getSenderDestination
                        ("MyContext", uri);
                if(ffd == null)
                {
                    ffd = DestinationFactory.createFireAndForgetDestination
                            (new Context("MyContext"), uri);
                }
                ByteMessage myMsg = ffd.createByteMessage();
                myMsg.setStringPayload("doesnt matter");
                ((HttpMessage) myMsg).setMethod(HttpMessage.POST);
                ((HttpMessage) myMsg).setQueryParam("latitude", coords[0]);
                ((HttpMessage) myMsg).setQueryParam("longitude", coords[1]);
                ((HttpMessage) myMsg).setQueryParam("user", "1");
                int i = ffd.sendNoResponse(myMsg);
                ffd.destroy();
                System.out.println("Lets sleep for a while..");
                Thread.sleep(10000);
                System.out.println("woke up");
            } catch (Exception e) {
                // TODO Auto-generated catch block
                System.out.println("Exception message: " + e.toString());
                e.printStackTrace();
            }
        }
    }
}
I haven't run this code to test it, but I would be suspicious of this call:
ffd.destroy();
According to the API docs:
Closes the destination. This method cancels all outstanding messages,
discards all responses to those messages (if any), suspends delivery
of all incoming messages, and blocks any future receipt of messages
for this Destination. This method also destroys any persistable
outbound and inbound queues. If Destination uses the Push API, this
method will unregister associated push subscriptions. This method
should be called only during the removal of an application.
So, if you're seeing the first request succeed (at least sometimes), and subsequent requests fail, I would try removing that call to destroy().
See the BlackBerry docs example for this here
Ok so I finally got it running cheerfully. The problem was with the transport selection; even though this example delivered WAP2 (among others) as an available transport in my device, running the network diagnostics tool showed only BIS as available. It also gave me the connection parameters that I needed to append at the end of the URL (;deviceside=false;ConnectionUID=GPMDSEU01;ConnectionType=mds-public). The code ended up like this:
for (;;) {
    try {
        coords.refreshCoordinates();
        this.defaultUrl();
        this.setUrl(stringFuncs.replaceAll(this.getUrl(), "%latitude%", coords.getLatitude() + ""));
        this.setUrl(stringFuncs.replaceAll(this.getUrl(), "%longitude%", coords.getLongitude() + ""));
        cd = cf.getConnection(this.getUrl());
        if (cd != null) {
            try {
                HttpConnection hc = (HttpConnection)cd.getConnection();
                final int i = hc.getResponseCode();
                hc.close();
            } catch (Exception e) {
            }
        }
        // sleep
        Thread.sleep(15000);
    } catch (Exception e) {
    } finally {
        // close connections
        // set objects to null
    }
}
Thanks for your help @Nate, it's been very much appreciated.

Resources