Craigslist or Kijiji - Is it possible to extract a poster's email address? - craigslist

Hi and thanks in advance for helping me with my question.
Is it possible to write a script that extracts the following information when given a Craigslist or Kijiji post, e.g. http://toronto.en.craigslist.ca/tor/atq/3346994296.html:
1. email address (the default one provided by Craigslist)
2. items in the post
3. address of the poster
Items 1-3 above can be obtained manually, but I would like to just input a posting or ad ID and have this info extracted.

The short answer to this question is...
Yes, automatically extracting the listed info from web pages similar to the one provided as an example can be done by a relatively simple script.
In general, this activity [of automatically extracting info from web pages] is known as Web Scraping, a particular form of Data Scraping.
There are off-the-shelf products that require little or no programming (the desired pages, and the desired fields within them, are specified by way of configuration), as well as software libraries for languages such as Python or Java that facilitate parsing HTML pages and, more generally, support the various tasks associated with this activity.
Aside from technical considerations, you need to assess the etiquette and legality of performing this kind of scraping. Some data and sites may be explicitly copyright-protected, and it is always a good idea to run big scraping jobs during low-traffic hours and to throttle the requests so as not to burden the host site unduly. Also, many sites provide an API or data dumps that supply the same info in a simpler and more controlled fashion.
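As a rough illustration of the library route, here is a minimal sketch using only Python's standard html.parser module. It assumes the reply address appears as a mailto: link and that the page title describes the item; the sample markup, the PostParser class and the extract_post_info function are hypothetical and do not reflect the real structure of any Craigslist or Kijiji page.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collects the page title and any mailto: addresses from a post's HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.emails = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                self.emails.append(href[len("mailto:"):].split("?", 1)[0])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_post_info(html):
    """Return the title and mailto addresses found in one page of HTML."""
    parser = PostParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "emails": parser.emails}


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<html><head><title>Antique oak dresser - $150</title></head>
<body><a href="mailto:sale-abc123@example.org?subject=dresser">Reply</a></body></html>
"""

print(extract_post_info(SAMPLE_HTML))
```

When fetching many real pages, a pause between requests (for example time.sleep) is one simple way to apply the throttling mentioned above.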

Related

Website fetch, using NSURLSession and changing INPUT Field value on this site

I want to fetch the content of a website. But to get the correct content, it is necessary to change the value of an HTML input (scroll) field on the site.
Any idea how to manage this with Xcode?
Thanks a lot!!
Lars
If you want to retrieve the HTML that you get after filling in an HTML form, you have to identify precisely what series of requests is needed to fetch the data. Be careful, because it's often not as simple as looking at the single request that the form in question generates: sometimes it is a complex series of requests (e.g. retrieving the original HTML often also seamlessly retrieves some critical hidden form fields and/or cookies).
Bottom line, to reverse engineer the required HTTP requests, you often have to pore over HTML code and/or watch the requests with something like Charles. It often takes quite a bit of time to do this with complicated sites.
Before you invest a lot of time here, though, you should first see if the web site provider's Terms of Service permit such usage. They often strictly prohibit this sort of practice. It's much better to contact the web site provider and see if they provide a web service to retrieve the data. That's far easier and will result in a far more robust interface for your app.
But if you're forced to parse HTML programmatically, I'd refer you to How to Parse HTML on iOS on Ray Wenderlich's site.
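The hidden-field step described above can be sketched roughly as follows. The example is in Python rather than an Xcode language, purely to show the idea: read the form's existing input fields (hidden ones included), override the one you want to change, and build the body of the follow-up POST. The form markup, the field names csrf_token and city, and the helper build_form_body are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collects name/value pairs of every <input> element, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            self.fields[name] = attr.get("value") or ""


def build_form_body(form_html, overrides):
    """Merge the form's own fields with the values we want to submit."""
    parser = FormFieldParser()
    parser.feed(form_html)
    fields = dict(parser.fields)
    fields.update(overrides)
    return urlencode(fields)


# Hypothetical form, for illustration only.
SAMPLE_FORM = """
<form action="/search" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="city" value="">
</form>
"""

print(build_form_body(SAMPLE_FORM, {"city": "Toronto"}))
```

A real site may also require the cookies set by the first response, so the same session would have to be reused for the POST.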

Is it possible to make a posting to Craigslist through my own website?

What I am trying to do is allow users to make postings to Craigslist through my own website using PHP cURL. This is NOT an automated posting system; I just want users to be able to post to Craigslist and my website at the same time. So far, I've managed to log in using PHP, but I'm still not sure how to post the title, description, contact information, etc. I am not familiar with cURL.
Your question is kind of broad, so I'll answer broadly. Narrow down your question (or post a follow-up) so we can help you better.
Is it possible to make a posting to Craigslist through my own website?
It depends. There are two major ways, but most websites block these, so I suspect Craigslist does too.
1. Clientside
Your visitors become visitors of craigslist.
You take the form that you find on craigslist, and host it (the html code) on your site, but with the form 'action' pointed to theirs.
They'll probably block this, based on the Referer header, a session key or something similar.
2. Serverside
Your server acts as a client for Craigslist.
You host the form on your server, and the processing page as well. After you've captured all the input, your server acts as a client to Craigslist, using, for example, PHP cURL as you suggest.
Try option 1 first; if it doesn't work, start coding option 2. If you get stuck on a specific part, post a question and we'll help you further.
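To make option 2 a little more concrete, here is a minimal sketch of a server preparing a form POST on a user's behalf. It is written in Python with the standard urllib module rather than PHP cURL, and it only builds the request without sending it. The URL, the field names and the build_post_request helper are hypothetical; the fields a real site expects have to be read from its actual form.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, fields, referer=None):
    """Prepare (but do not send) a form POST on behalf of a user."""
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if referer:
        # Some sites check this header before accepting a form submission.
        headers["Referer"] = referer
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical target and fields, for illustration only.
request = build_post_request(
    "https://example.com/post",
    {"title": "Oak dresser", "description": "Good condition"},
    referer="https://example.com/post",
)
print(request.get_method(), request.full_url)
print(request.data)
```

Sending it would be a call to urllib.request.urlopen(request), typically through an opener that keeps the login cookies.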
There is an API available now to make automated posts (one or more) in one request.
http://www.craigslist.org/about/bulk_posting_interface
There are two caveats in your case:
It uses RSS as the request/response format.
Your users will need to provide their Craigslist user/pass (assuming they have an account).

Search Engine without crawling?

Is there a way to collect web content in order to use it in a search engine without passing by the web crawling phase? Any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (wikipedia, stackoverflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of their search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
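For the data-dump route, a minimal sketch of the indexing side might look like this. It assumes the dump has already been downloaded and reduced to a mapping from document id to plain text; the DUMP sample and the build_index and search helpers are hypothetical, and real dumps (XML, SQL, etc.) need their own parsing first.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical stand-in for text extracted from a data dump.
DUMP = {
    "doc1": "Web crawling collects pages by following links.",
    "doc2": "Data dumps let you index pages without crawling.",
}
index = build_index(DUMP)
print(search(index, "pages crawling"))
print(search(index, "dumps"))
```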
If you want to be updated with the latest content on pages, then you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use paid services like Superfeedr that make use of the same protocol.
Directly or indirectly, you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam a +/- system can be involved, to vote useful sites or tags up and useless ones down.
To avoid spammers mass voting SERPs you can weight votes by user reputation.
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns.
Other abuse patterns would have to be considered too.
Well, you got the point, I think.
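One way to read "weight votes by user reputation" is sketched below: each +1 or -1 vote is multiplied by the voter's reputation before summing. The reputation values, the default for unknown users and the weighted_score helper are assumptions made for illustration only.

```python
def weighted_score(votes, reputation, default_reputation=1.0):
    """Sum votes (+1 / -1), each scaled by the voter's reputation."""
    return sum(
        direction * reputation.get(user, default_reputation)
        for user, direction in votes
    )


# Hypothetical users and votes, for illustration only.
REPUTATION = {"alice": 10.0, "bob": 2.0}
VOTES = [("alice", +1), ("bob", -1), ("newcomer", -1)]

print(weighted_score(VOTES, REPUTATION))
```

With this scheme a handful of fresh accounts cannot easily outvote one established user, which is the anti-spam effect the answer is after.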
As spammers gradually discover weaknesses of traditional search engines (see Google bomb, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold start effect, and when the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/

Blocking a website from being indexed

I am wondering whether there is any (programmatic) way to prevent search engines from indexing the content of a website.
You can specify it in robots.txt:
User-agent: *
Disallow: /
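One way to check what these two lines mean for a well-behaved crawler is Python's standard urllib.robotparser module, which is roughly what a polite bot would consult before fetching a page. The is_allowed helper, the bot name ExampleBot and the URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical bot name and URL, for illustration only.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/page.html"))
```

Note that this only models crawlers that choose to obey robots.txt; nothing in the file itself enforces the rule.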
As the other answers already say, Robots.txt is the standard that every proper search engine adheres to. This should be enough in most cases.
If you really want to try to programmatically block malicious bots that do not listen to robots.txt, check out this question I asked a few months ago on how to tell bots apart from human visitors. You may find some good starting points there.
Create a robots.txt file for your site. For more info - see this link.
Most search engine bots identify themselves using a unique user agent.
You can block specific user agents using robots.txt
Here is a list of some user agents.
Since you did not mention a programming language, I'll answer from a PHP perspective: there is a WordPress plugin called Bad Behavior which does exactly what you are looking for. It is configurable via a script listing an array of user-agent strings. When an agent crawls your site, the plugin checks its user-agent string and ID, or its IP address, against that array and, depending on whether there is a match, either rejects or accepts the agent.
It might be worth your while to have a peek at the code to see how it is done from a programmer's perspective.
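The general idea of matching a request's user agent against a list can be sketched in a few lines. This is not the plugin's actual code, and it is in Python rather than PHP; the blocklist entries and the is_blocked helper are hypothetical.

```python
# Hypothetical substrings of unwanted user agents, for illustration only.
BLOCKED_AGENT_SUBSTRINGS = ["badbot", "evilscraper"]


def is_blocked(user_agent, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Return True if the User-Agent header contains a blocked substring."""
    ua = (user_agent or "").lower()
    return any(fragment in ua for fragment in blocked)


print(is_blocked("Mozilla/5.0 (compatible; BadBot/1.0)"))
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64)"))
```

A server would run such a check on each request and answer blocked agents with, say, a 403 response. Since the header is trivial to forge, this only stops bots that identify themselves honestly.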
If your language is not PHP and this does not give you what you are looking for, then I apologize for posting this answer.
Hope this helps,
Best regards,
Tom.

What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of asking DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
You can register to get access to the entire .com and .net zone files at Verisign
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link: url for each url in the list.
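Since the asker is learning Python, here is a minimal sketch of turning such a download into a list of URLs. It assumes the unzipped CSV has one rank,domain pair per line with no header row; the sample data and the domains_to_urls helper are made up, and the http:// prefix is just a starting guess for each site's home page.

```python
import csv
import io


def domains_to_urls(csv_text, limit=None):
    """Turn 'rank,domain' rows into home-page URLs, keeping file order."""
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        urls.append("http://" + row[1].strip() + "/")
        if limit is not None and len(urls) >= limit:
            break
    return urls


# Hypothetical sample in the assumed rank,domain layout.
SAMPLE = "1,example.com\n2,example.org\n3,example.net\n"

print(domains_to_urls(SAMPLE, limit=2))
```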
The modern terms are URI and URN; URL is the narrower, somewhat outdated one. I'd scan for sitemap files, which contain many addresses in one file, and study the classic text Spiders, Wanderers, Brokers, and Bots, as well as RFC 3986 (Appendix B, p. 50), which defines a URI regex.
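The sitemap idea can be sketched with the standard xml.etree.ElementTree module. The sketch assumes a plain urlset sitemap in the sitemaps.org 0.9 namespace; sitemap index files and gzip-compressed sitemaps are not handled, and the sample document and the urls_from_sitemap helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap, for illustration only.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE_SITEMAP))
```

Sites often advertise the location of their sitemap in robots.txt, which makes it a cheap first stop before any real crawling.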

Resources