I want to fetch the content of a website. But to get the correct content, it is necessary to first change an HTML scroll/select field on the site.
Any ideas how to manage this with Xcode?
Thanks a lot!!
Lars
If you want to retrieve the HTML that you get after filling in an HTML form, you have to identify precisely what the series of requests looks like to fetch the data. And be careful, because it's often not as simple as just looking at the request that the form in question generates: unfortunately, it is sometimes a complex series of requests (e.g. retrieving the original HTML often also quietly retrieves some critical hidden form fields and/or cookies that must be sent back with the subsequent request).
Bottom line, to reverse engineer the required HTTP requests, you often have to pore over the HTML code and/or watch the requests with something like Charles. It often takes quite a bit of time to do this with complicated sites.
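If the site turns out to be a plain form POST, the request can often be replayed once the hidden fields and cookies are known. Here's a rough sketch of the flow in Python (the question is about Xcode, so treat this purely as pseudocode for the request sequence; the URL, field names, and values are made up):

import re
import requests  # assumes the third-party requests library is installed

# 1. Fetch the form page first so the session picks up any cookies the server sets.
session = requests.Session()
form_page = session.get("https://example.com/search")  # hypothetical URL

# 2. Pull out a hidden field the server expects back (the field name is made up).
token = re.search(r'name="csrf_token"\s+value="([^"]+)"', form_page.text)

# 3. Replay the form submission with the scroll/select field set to the desired value.
result = session.post(
    "https://example.com/search",
    data={
        "csrf_token": token.group(1) if token else "",
        "category": "books",  # the value the scroll field would have selected
    },
)
print(result.text[:500])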
Before you invest a lot of time here, though, you should first see if the web site provider's Terms of Service permit such usage. They often strictly prohibit this sort of practice. It's much better to contact the web site provider and see if they provide a web service to retrieve the data. That's far easier and will result in a far more robust interface for your app.
But if you're forced to parse the HTML programmatically, I'd refer you to How to Parse HTML on iOS on Ray Wenderlich's site.
Hi and thanks in advance for helping me with my question.
Is it possible to write a script that would extract the following information when provided with a Craigslist or Kijiji post, e.g. http://toronto.en.craigslist.ca/tor/atq/3346994296.html:
email address (default one provided by craigslist)
items in the post
address of poster
Items 1-3 above are information that can be obtained manually, but I would like to just input a posting or ad ID and be able to extract this info.
The short answer to this question is...
Yes, automatically extracting the info listed from web pages similar to the one provided as example can be done by a relatively simple script.
In general, this activity [of automatically extracting info from web pages] is known as Web Scraping, a particular form of Data Scraping.
There are both off-the-shelf products that can be used (with little or no programming involved; the desired pages and the desired fields within the pages are specified by way of configuration), as well as software libraries which can be used from scripting languages such as Python, or from languages such as Java, and which facilitate the parsing of HTML pages and, more generally, provide support for the various tasks associated with this activity.
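As a rough illustration of the library route, here is a minimal Python sketch using BeautifulSoup. The tag names, classes, and IDs below are placeholders, not the real Craigslist markup; you would need to inspect the actual page to find the right selectors:

import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

url = "http://toronto.en.craigslist.ca/tor/atq/3346994296.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# The selectors below are illustrative placeholders.
title = soup.find("h2", class_="postingtitle")
reply_link = soup.find("a", href=lambda h: h and h.startswith("mailto:"))
body = soup.find("section", id="postingbody")

print(title.get_text(strip=True) if title else "no title found")
print(reply_link["href"] if reply_link else "no reply address found")
print(body.get_text(strip=True)[:200] if body else "no body found")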
Aside from technical considerations, you need to assess the etiquette and legality of performing this kind of scraping. While some data and sites may be explicitly copyright-protected, it is always a good idea to perform big scraping jobs at low-traffic hours and to throttle the requests so as not to burden the host site unduly. Also, many sites provide an API or data dumps that supply the same info in a simpler and more controlled fashion.
I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting the movie rating from 'imdb.com'.
I'm currently using the Indy HTTP components to get the page, and I'm using StrUtils to parse the text, but what I can extract that way is limited.
I have found plain simple regexes to be highly intuitive and simple to use when dealing with well-built websites, and IMDB is a well-built website.
For example the movie rating on the IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then it looks for the > that terminates the DIV, and then captures everything until the following <. To create a new regex like this you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture, choose Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easy to identify! You'll usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS requires IDs or classes to be able to style the elements properly.
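For completeness, here is how that regex might be applied in a quick Python sketch (the same pattern can be used from Delphi's regex support; the HTML snippet here is made up to mimic the structure described above):

import re

# A made-up fragment mimicking the IMDB markup described above.
html = '<div class="star-box-giga-star"> 8.3 </div>'

match = re.search(r'star-box-giga-star[^>]*>([^<]*)<', html)
if match:
    print(match.group(1).strip())  # -> 8.3
else:
    print("rating not found")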
Processing an RSS feed is more convenient.
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
However, you can request that a new one be added by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
Use HTML Tidy to convert any HTML to valid XML and then use an XML parser, perhaps with XPath, or develop your own code (which is what I do).
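A minimal sketch of that approach in Python, assuming the third-party lxml library (its HTML parser is tolerant of broken markup, so for this sketch it stands in for the Tidy step; the XPath expression and the HTML fragment are illustrative):

from lxml import html  # third-party: pip install lxml

raw = """<html><body>
<div class="star-box-giga-star"> 8.3 </div>
</body></html>"""

# lxml's HTML parser cleans up the markup, then XPath queries the resulting tree.
tree = html.fromstring(raw)
rating = tree.xpath('//div[@class="star-box-giga-star"]/text()')
print(rating[0].strip() if rating else "not found")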
All the answers posted cover your generic question well. I usually follow a strategy similar to the one detailed by Cosmin. I use WinInet and regexes for most of my web extraction needs.
But let me add my two cents on the specific sub-question of extracting the IMDB rating. IMDBAPI.COM provides a query interface returning JSON, which is very handy for this type of search.
So a very simple command-line program for getting an IMDB rating would be...
program imdbrating;

{$APPTYPE CONSOLE}

uses
  htmlutils; // provides HttpGet and UrlEncode

// Crude extraction of the value of a "parm":"value" pair from the raw JSON text.
function ExtractJsonParm(parm, h: string): string;
var
  r: integer;
begin
  r := pos('"' + parm + '":', h);
  if r <> 0 then
    // skip past '"parm":"' and copy up to (but not including) the closing quote before the next comma
    result := copy(h, r + length(parm) + 4, pos(',', copy(h, r + length(parm) + 4, length(h))) - 2)
  else
    result := 'N/A';
end;

var
  h: string;
begin
  // query the JSON API with the movie title given on the command line
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating', h));
end.
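Assuming the htmlutils unit referenced above exposes HttpGet and UrlEncode as used, the program would be invoked from the command line with the movie title as its single argument, e.g. imdbrating "True Grit", and it prints the Rating field from the JSON response (or N/A if the field is missing).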
If the page you are crawling is valid XML, I use SimpleXML to extract the info. It works pretty well.
Resource:
Download link.
I understand what progressive enhancement is, I'm just fuzzy on some of the details in actually pulling it off. Of course, that could be because I'm looking at it in the wrong way. Let me try to explain my difficulty with a hypothetical:
ASP.NET MVC site. I have a view that has tabbed navigation. Each tab is for a movie category/genre and displays 5-10 links to movies in that category. The movie data is obtained through Netflix's OData.
My initial thought is to use Ajax to pull and parse the JSON from the proper OData GET requests when each tab is clicked. How would I provide a non-JavaScript version of that? Is it even possible?
For simpler requests where JSON isn't necessary - like, say, having a user log into the system - I see how I could simply set a cookie and dynamically change the page based on it to reflect the change. But what if I need to return and parse JSON? How do I provide an alternative?
The deal with progressive enhancement is that your server side must be fully capable of generating every last bit of HTML that appears in all of your pages. This is obvious, since otherwise (if JS is turned off) there will be no part of your application capable of doing said rendering.
Since the server side must know how to render everything, it doesn't make much sense to generate things (DOM elements/HTML) on the client side from JSON responses the server gives you. Why repeat yourself?
This brings us to the logical conclusion that when doing dynamic updates on the client, you need to get ready-made HTML from the server (since the rendering logic is over there) and insert it into the DOM as appropriate. You are then free to work on the newly inserted elements with jQuery and enhance them all you want.
So -- forget about parsing JSON on the client, otherwise you're locking yourself out of progressive enhancement. If you want to call a third party, have the server be your intermediary: call the server with all the necessary information for it to call the third party and get ready-made HTML back.
If you do this, then the server can of course provide non-JS versions of everything on your site with no problem. Total non-reliance on JS achieved.
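The idea is framework-agnostic, so here is a rough sketch of the pattern in Python/Flask rather than ASP.NET MVC (treat it as pseudocode: one action, full page for normal requests, a bare HTML fragment for AJAX requests; the route, helper, and template names are made up):

from flask import Flask, render_template, request

app = Flask(__name__)

def fetch_movies_for(genre):
    # placeholder for the server-side call to the OData service
    return [{"title": "Example movie", "genre": genre}]

@app.route("/movies/<genre>")
def movies(genre):
    items = fetch_movies_for(genre)
    # X-Requested-With is the header jQuery sends with its AJAX requests.
    if request.headers.get("X-Requested-With") == "XMLHttpRequest":
        # AJAX request: return only the ready-made HTML fragment for the tab body.
        return render_template("_movie_list.html", items=items)
    # No JS: return the full page, with the same fragment rendered inside it.
    return render_template("movies_page.html", genre=genre, items=items)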
There is no JSON without JS, by definition (JavaScript Object Notation). Without JS you won't make AJAX calls. Your pages will render as is, just like oldschool sites.
If you need to do this progressively, you will have to call the OData service server-side, provide .NET objects to the site in ViewData or your view model, and have your views/partials render them.
In ASP.NET MVC actions, the HttpContext available via the controller lets you call this.HttpContext.Request.IsAjaxRequest(), which can be used to test whether you want to return a view, just JSON data, or whatever type of ActionResult you want. This can be an excellent timesaver for building progressive-enhancement-style sites.
I'm looking to get the title of a webpage, a common feature of many IRC bots that I want to incorporate into an IRC client I'm writing for fun.
The method that I currently have working basically connects and sends a GET request for the entire webpage, then seeks out the title tags and reads what's in between them. For larger webpages this can be slower than I'd like. An additional problem I've noticed is that webpages with dynamic titles (such as some phpBB forums) will not return the title exactly as it would show in a browser, because I don't do any execution of JavaScript, etc.
It seems one way to get an accurate title is to dump the html into a browser control (such as the IE COM control) and pull the title, but this is just going to make it even more time consuming.
Is there a simple method I am unaware of?
In a word, no, not really.
I guess rather than downloading the whole document you could stream the HTTP file into your application and just stop downloading when you reach </title> - that would save you waiting for the whole HTML document to download.
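That streaming idea might look roughly like this in Python (illustrative only; a production version would need proper charset handling and more robust parsing):

import urllib.request

def fetch_title(url, chunk_size=1024):
    buf = b""
    with urllib.request.urlopen(url) as resp:
        # keep reading chunks only until the closing title tag shows up
        while b"</title>" not in buf.lower():
            chunk = resp.read(chunk_size)
            if not chunk:  # end of document, no title found
                break
            buf += chunk
    text = buf.decode("utf-8", errors="replace")
    start = text.lower().find("<title")
    if start == -1:
        return None
    start = text.find(">", start) + 1
    end = text.lower().find("</title>", start)
    return text[start:end].strip() if end != -1 else None

print(fetch_title("http://example.com"))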
However that doesn't help the situation if you need to read the title after it's been changed by some client-side javascript. As you say, the only way I can think of doing that is by using a browser control.
Trying to parse/scrape the course site for Memphis. The site is "https://spectrumssb2.memphis.edu/pls/PROD/bwckgens.p_proc_term_date". It appears to be some sort of JavaScript issue, or dynamic generation of the text. I can see the underlying DOM structure using LiveHTTPHeaders/Firefox, but not when I simply view the underlying source/text of the page.
Thoughts/Comments/Pointers would be appreciated...
Well, these days a site may be assembled in several steps. First the main structure is pulled in, and then, often based on the identity of the user, additional AJAX calls are executed. Your best bet is to sniff the HTTP traffic to see what kind of requests are issued between when the site is initially requested and when it's fully built.
Since you are using Firebug, you can get the HttpFox add-on, which gives you what you need.