Parsing a web page - parsing

How can I parse a web page that uses AJAX?
To be more specific: the website http://www.wordcount.org/main.php gives the rank of a word according to its usage.
For a given word, I want to retrieve its rank.
How can I get it?
This is extremely important. Thank you in advance.

That flash page calls the following URL to get the data: http://www.wordcount.org/dbquery.php?toFind=0&method=SEARCH%5FBY%5FINDEX
If you are using PHP, then do something like:
$url = 'http://www.wordcount.org/dbquery.php?toFind=0&method=SEARCH%5FBY%5FINDEX';
parse_str(file_get_contents($url), $dataArray);
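For anyone not on PHP, here is a rough Python sketch of the same idea. It assumes, as the parse_str call above implies, that the endpoint answers with a query-string-style body. The helper names and the sample body are made up for illustration; the real field names may differ.

```python
from urllib.parse import parse_qs
from urllib.request import urlopen

URL = "http://www.wordcount.org/dbquery.php?toFind=0&method=SEARCH%5FBY%5FINDEX"


def parse_response(body):
    """Decode a query-string-style body into a dict of single values."""
    # parse_qs returns a list of values per key; keep the first one.
    return {key: values[0] for key, values in parse_qs(body).items()}


def fetch_word_data(url=URL):
    """Download the response and decode it (needs network access)."""
    with urlopen(url) as response:
        return parse_response(response.read().decode("utf-8"))


# Hypothetical response body, only to show the decoding step.
sample = "word0=example&freq0=123"
print(parse_response(sample))
```

Call fetch_word_data() to hit the real endpoint; parse_response can be tried offline on any saved response.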

You can use http://htmlunit.sourceforge.net/. It simulates a browser's behaviour, leaving you with a current DOM to inspect. If you're using Java, it is straightforward; for any of the .NET languages, it works fine with IKVM.

Related

HtmlUnit read specific link information in between the <a> tags

I am connecting to a webpage using HtmlUnit and I want to read the information in between the tags. I will demonstrate using some code. Let's suppose I have the following link:
<a href="www.anypage.com">Hello!</a>
I would like to read the Hello! that's in between, preferably saved into a String variable. Here is the code essential for the task:
// Simulating a Chrome browser
WebClient webClient = new WebClient(BrowserVersion.CHROME);
// getPage needs a full URL including the protocol
HtmlPage loggedIn = webClient.getPage("http://random-page.com");
HtmlAnchor anchorLink = loggedIn.getAnchorByHref("/private-messages/inbox");
Now if I use anchorLink.toString() I get <a href="www.anypage.com"> from the previous example, but nothing about the characters in between the tags. I have gone through the API and can't seem to find anything useful. Any workarounds?
Would getTextContent() be what you are looking for?

What does "URL=" in a URL Mean?

Maybe a dumb question, but it's been bugging me recently. I see "URL=" inside a lot of URLs, such as this one:
http://www.tierraantigua.com/search-2?url=http%3A%2F%2Flink.flexmls.com%2Fwws30ham
What exactly is this used to do? Is it part of the iFrame functionality? I know the last part of the URL (after the URL=) is the part being displayed in the iFrame, but I'm unsure of why it is included in the primary URL as well.
Thanks!
The url you see here is just a standard query parameter with the name url and the encoded value http%3A%2F%2Flink.flexmls.com%2Fwws30ham, which decodes to http://link.flexmls.com/wws30ham. Most of the time it is used by the application to determine a redirect target or source information. Its meaning is entirely site-specific: the website developer can use it for anything.
From the PHP manual entry for $_GET:
An associative array of variables passed to the current script via the URL parameters.
$url = $_GET['url'];
echo $url; // http://link.flexmls.com/wws30ham ($_GET values are already URL-decoded)
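To see the decoding step on its own, here is a small Python sketch using only the standard library. It takes the example link from the question, pulls out the url parameter, and percent-decodes it.

```python
from urllib.parse import parse_qs, urlsplit

link = "http://www.tierraantigua.com/search-2?url=http%3A%2F%2Flink.flexmls.com%2Fwws30ham"

# Everything after the "?" is the query string.
query = urlsplit(link).query

# parse_qs splits the parameters and percent-decodes their values.
params = parse_qs(query)
target = params["url"][0]

print(target)
```

The %3A and %2F sequences are just the encoded forms of ":" and "/", so the value is an ordinary URL that the receiving page can load into its iframe or redirect to.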

URL for a link to Twitter for a specific tweet

I have some Javascript that uses Twitter API to get tweets. I parse the data and use jQuery to generate HTML for the DOM.
An aspect of what I want to display is a "View this tweet" link -- yeah, sorta sounds silly, but it allows a user to get a URL for a specific tweet.
I am generating an a tag with an href. The URL is of the form:
http://twitter.com/{twitter-user-id}/status/{tweet-status-id}
where the content in curly braces is actual data extracted from the tweet (no, I am not including the curly braces). For example:
http://twitter.com/Atechtrader/status/57432099984130050
What happens in operation is that this works for some tweets, but not others. For the ones that fail, the Twitter server responds with content that says the requested page does not exist.
Am I doing something wrong?
https://twitter.com/statuses/ID should work.
It will redirect to the needed status.
Unfortunately, all of the answers provided so far rely on an HTTP redirect.
The direct link is of the form: https://twitter.com/i/web/status/{tweet-status-id}
FYI: id_str is the field you need to use instead of id.
Take id_str from the tweet object and substitute it into
https://twitter.com/statuses/[id_str]
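A likely reason id fails for some tweets is that JavaScript stores every number as a 64-bit float, so tweet IDs above 2^53 get rounded to a neighbouring value. The Python sketch below mimics that with float() and then builds the link from id_str instead. The JSON payload is a hypothetical, trimmed-down tweet, not a real API response.

```python
import json

# Hypothetical, trimmed-down tweet payload; real responses carry many more fields.
raw = (
    '{"id": 57432099984130050, "id_str": "57432099984130050", '
    '"user": {"screen_name": "Atechtrader"}}'
)
tweet = json.loads(raw)

# float() stands in for a JavaScript number (IEEE 754 double).
as_js_number = int(float(tweet["id"]))
id_survives = as_js_number == tweet["id"]
print("numeric id survives the round trip:", id_survives)

# The string form is never rounded, so build the link from it.
url = "https://twitter.com/{}/status/{}".format(
    tweet["user"]["screen_name"], tweet["id_str"]
)
print(url)
```

IDs that happen to be exactly representable as a double come through unchanged, which would explain why the numeric id works for some tweets and not others.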
You can use a URL like:
http://twitter.com/itdoesnotmatter/status/[YOURID]
Twitter redirects based on the status ID, not the username.
It works for desktop and mobile.
You can use
'https://www.twitter.com/'+ user.screen_name+'/status/' + id_str
I've tried it and it works well:
- Web : https://twitter.com/statuses/ID
- Mobile && Web: https://twitter.com/User_ID/statuses/Tweet_ID
I hope that helps.

What is the simplest way to return different content types based on the extension in the URL?

I'd like to be able to change the extension of a URL and receive the model in a different format.
e.g. if
/products/list
returns an HTML page containing a list of products, then
/products/list.json
would return them in a JSON list.
Note: I like the simplicity of the ASP.NET MVC REST SDK; it only takes about 5 lines of code to hook in. However, it specifies the format as a query string parameter, i.e. /products/list?format=json, which is not what I want. I could tweak that code if there are no other options, but I don't want to reinvent the wheel!
I wrote a blog post that shows one possible example. It's a tiny bit complicated, but might work for your needs.
http://haacked.com/archive/2009/01/06/handling-formats-based-on-url-extension.aspx
You should be able to just use routes in conjunction with the REST SDK.
If you have the flexibility to drop Apache or something similar in front of your service, you can always use mod_rewrite to rewrite an external http://products/list.json into http://products/list?format=json which your framework can render more easily.
Instead of "list.json", you could choose "list/json" and use a route like
{controller}/{action}/{id}
Then ProductController.List would be called, with an ID parameter of "json". The .List() action then would decide whether or not to return an HTML view or JSON content.
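Whichever routing trick you pick, the core step is the same: split the format off the path, then let the action choose a renderer. Here is a framework-neutral Python sketch of that step. It is not ASP.NET MVC code, and the function names and the extension table are made up for illustration.

```python
import json
from html import escape
from posixpath import splitext

# Hypothetical mapping from URL extension to response content type.
CONTENT_TYPES = {
    "": "text/html",
    ".html": "text/html",
    ".json": "application/json",
}


def split_format(path):
    """Split '/products/list.json' into ('/products/list', 'application/json')."""
    base, ext = splitext(path)
    ext = ext.lower()
    if ext not in CONTENT_TYPES:
        raise ValueError("unsupported format: {!r}".format(ext))
    return base, CONTENT_TYPES[ext]


def render(products, content_type):
    """Render the same model either as JSON or as an HTML list."""
    if content_type == "application/json":
        return json.dumps(products)
    items = "".join("<li>{}</li>".format(escape(name)) for name in products)
    return "<ul>{}</ul>".format(items)


route, content_type = split_format("/products/list.json")
print(route, content_type)
print(render(["apple", "pear"], content_type))
```

Keeping the format lookup separate from the action means /products/list and /products/list.json resolve to the same route and differ only in the renderer.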

How does a website know the Google query I used to find it?

When I search for something such as "rearrange table columns in asp.net" on Google, and click the link to Wrox's forum site, the site greets me with a message such as "Your Google search for 'rearrange table columns in asp.net' brought you to Wrox Forum...".
How does a site know what query I typed into Google? And how could I add such an ability to my site?
It is parsing your query from the query parameters in the HTTP_REFERER server variable, which contains the URL you're coming from and is provided in your HTTP request.
It uses a header known as the "HTTP referrer". See http://en.wikipedia.org/wiki/HTTP_referrer
To use it in your site, you would need some kind of dynamic page generation, such as ASP / ASP.NET, PHP, or Perl. For example in Perl, you could do something like:
# Match a google.com referrer and capture the (still URL-encoded) q parameter
if (($ENV{HTTP_REFERER} // '') =~ /google\.com\/.*[?&]q=([^&]+)/) {
    print "Your google search of $1 brought you to this site";
}
WARNING: The code above is only an example and may not be correct or secure!
Like these guys are suggesting, it's the HTTP_REFERER header variable. The query is in the "q" key in the URL. So if you want to parse that, you can just sort out the querystring and URL decode the "q" variable.
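As a rough Python sketch of that parsing step, assuming the referrer actually carries a q parameter: the function name and the sample referrer URL below are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def search_query_from_referrer(referrer):
    """Return the decoded 'q' parameter of a Google referrer URL, or None."""
    if not referrer:
        return None
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    # Only trust referrers whose host is google.com or a subdomain of it.
    if host != "google.com" and not host.endswith(".google.com"):
        return None
    # parse_qs URL-decodes the values, turning "+" and %XX back into text.
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


# Hypothetical referrer, as it would arrive in the HTTP Referer header.
referrer = "http://www.google.com/search?hl=en&q=rearrange+table+columns+in+asp.net"
print(search_query_from_referrer(referrer))
```

The referrer is user-controlled input, so escape the value (for example with html.escape) before writing it into a page.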
It looks at the referrer header. Here is some fairly basic PHP code to do it.
