What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be using Google's or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers for a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of querying DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python, but that really has nothing to do with the question.

$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

You can register to get access to the entire .com and .net zone files at Verisign.
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.

How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
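That top-1m.csv file (the wget link earlier in the thread) is just rank,domain per line, so turning it into URLs is trivial. A rough PHP sketch, assuming the CSV has already been unzipped locally:

// build a list of URLs from Alexa's rank,domain CSV
$urls = array();
foreach (file('top-1m.csv') as $line) {
    list(, $domain) = explode(',', trim($line), 2);
    $urls[] = 'http://' . $domain . '/';
}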

The modern terms are URI and URN; URL is the older, narrower one. I'd scan for sitemap files, which contain many addresses in one file, and study the classic texts on spiders, wanderers, brokers and bots, plus RFC 3986 (Appendix B, p. 50), which defines a URI-parsing regex.
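Sitemaps are easy to mine for URLs. A rough PHP sketch using SimpleXML (the sitemap URL is just an example, and fetching it directly requires allow_url_fopen):

// collect every <loc> entry from a standard sitemap.xml
$sm = simplexml_load_file('http://example.com/sitemap.xml');
$sm->registerXPathNamespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$urls = array();
foreach ($sm->xpath('//s:loc') as $loc) {
    $urls[] = (string) $loc;
}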

Related

What's the best service to use for filtering out spam/abuse/malware links for a link shortening webapp?

I have two services - Lincr and LinkBunch. Lincr is a plain jane URL shortening service, while LinkBunch lets you shorten multiple links into one link. I've had too much spam posted into the services, so I had to shut down Lincr. Now, even LinkBunch seems to be facing the same problem, and it's been disabled by my web host for that reason.
I can't keep shutting down sites like this because of bad links being posted, so I need a malware-filtering API that I can use to filter out the links as and when they are posted.
There are services that let me download an entire bunch of bad links to check against, but instead, I'd prefer doing a live API call on a per-link basis. What can I use for that?
Finally, what's the best malware filtering service out there?
Lincr is down. On LinkBunch, where is your Captcha?
On either site, do you limit the number of posts by IP? Do you use a delay in your response? What about using hidden fields to reduce spam (http://www.reviewmylife.co.uk/blog/2008/05/30/hidden-field-spam-trap-for-phpformmail/)?
I know I'm dodging the question a bit, but you should at least take basic anti-spam measures before resorting to API calls. Even APIs will still fail for newly-hacked / newly-spammed sites.
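The hidden-field ("honeypot") trap from that link is easy to reproduce. A rough sketch; the field name website_hp is just an example, and the field is hidden with CSS so a human never fills it in, while most form-spamming bots do:

<div style="display:none"><input type="text" name="website_hp" value=""></div>

// in the PHP handler: anything in the hidden field came from a bot
if (!empty($_POST['website_hp'])) {
    exit; // silently drop the submission
}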

Search Engine without crawling?

Is there a way to collect web content for use in a search engine without going through the web crawling phase? Any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no). :)
You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of its search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at DataSift as an example. There are lots more resources you could cleverly use to avoid or minimize crawling.
If you want to be updated with the latest content on pages, then you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use a paid service like Superfeedr that makes use of the same protocol.
Directly or indirectly, you have to crawl the web in order to get the content.
Well, if you don't want to crawl, you can follow a wiki-like approach, where users submit links to sites (with a title, description and tags), so a collaborative link collection can be built.
To avoid spam, a +/- system can be involved, to vote useful sites or tags up and useless ones down.
To avoid spammers mass-voting SERPs, you can weight votes by user reputation.
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns, and by considering other abuse patterns too.
Well, you get the point, I think.
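One possible weighting scheme, purely as an illustration (the formula is my own, not a standard):

// each vote counts more the higher the voter's reputation, but only logarithmically,
// so a handful of high-rep users can't dominate either
function weighted_score(array $votes) {
    $score = 0;
    foreach ($votes as $v) { // $v = array('value' => +1 or -1, 'reputation' => int)
        $score += $v['value'] * log(1 + $v['reputation']);
    }
    return $score;
}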
As spammers gradually discover the weaknesses of traditional search engines (see Google bombs, content-scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and while the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange have not been spammed to useless levels so far...
PS: http://xkcd.com/810/

Get country location of an IP with native PHP

Read on before you say this is a duplicate; it's not (as far as I could see).
I want to get the country code in PHP from the client.
Yes, I know you can do this using external sites or with the likes of "geoip_record_by_name", but I don't want to be dependent on an external site, and I can't install PEAR for PHP as I'm using shared Dreamhost hosting.
I thought I could just do something like this:
$output = shell_exec('whois '.$ip.' -H | grep country | awk \'{print $2}\'');
echo "<pre>$output</pre>";
But Dreamhost seems to have an old version of whois (4.7.5), so I get this error on a lot of IPs:
Unknown AS number or IP network. Please upgrade this program.
So unless someone knows how to get a binary of a newer version of whois onto Dreamhost, I'm stuck.
Or is there another way I could get the country code from the client who is loading the page?
Whois is just a client for the whois service, so technically you are still relying on an outside site. For the queries that fail, you could try falling back to another site for the query, such as hostip.info, who happen to have a decent API and seem friendly:
http://api.hostip.info/country.php?ip=4.2.2.2
returns
US
Good luck,
--jed
EDIT: @Mint Here is the link to the API on hostip.info: http://www.hostip.info/use.html
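A minimal sketch of that fallback, assuming allow_url_fopen is enabled (hostip's country.php returns the bare country code as plain text):

// hypothetical helper: ask hostip.info when the local whois lookup fails
function country_from_hostip($ip) {
    $resp = @file_get_contents('http://api.hostip.info/country.php?ip=' . urlencode($ip));
    return ($resp === false) ? null : trim($resp); // e.g. "US", or "XX" when hostip doesn't know
}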
MaxMind provide a free PHP GeoIP country lookup class (there is also a free country+city lookup one).
The bit you want is what is mentioned under "Pure PHP module". This doesn't require you to install anything, or be dependent on them, nor does it need any special PHP modules installed. Just save the GeoIP data file somewhere, then use their provided class to interact with it.
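If I remember the legacy API correctly, usage looks roughly like this (assumes geoip.inc and the GeoIP.dat country database have been uploaded next to the script):

// look up the visitor's country code with MaxMind's pure-PHP class
include_once 'geoip.inc';
$gi = geoip_open('GeoIP.dat', GEOIP_STANDARD);
echo geoip_country_code_by_addr($gi, $_SERVER['REMOTE_ADDR']); // e.g. "US"
geoip_close($gi);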
Can you just install a copy of whois into your home directory and pass the full path into shell_exec? That way you're not bound to their upgrade schedule.
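Something along these lines, with a hypothetical path to your own whois build (escapeshellarg added for safety):

$output = shell_exec('/home/youruser/bin/whois ' . escapeshellarg($ip) . ' -H | grep country | awk \'{print $2}\'');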
An alternative, somewhat extreme solution to your problem would be to:
Download the CSV format version of MaxMind's country database
Strip out the information you don't need from the CSV with a script and ...
... generate a standard PHP file which contains a data structure containing the IP address as the key and the country code as the value.
Include the resulting file in your usual project files and you now have a completely internal IP => country code lookup table.
The disadvantage is that you would need to regenerate the PHP file regularly from the latest version of the database. Also, it's a pretty nasty way of doing it in general, and performance might not be the best :)
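A rough sketch of that idea, assuming MaxMind's legacy country CSV layout (startIP, endIP, startNum, endNum, code, name); since the file is range-based, the generated table keys on ranges rather than single addresses:

// one-off generator: GeoIPCountryWhois.csv -> ip_country_table.php
// numbers are kept as strings so 32-bit PHP builds don't overflow
$in  = fopen('GeoIPCountryWhois.csv', 'r');
$php = "<?php\nreturn array(\n";
while (($row = fgetcsv($in)) !== false) {
    $php .= sprintf("  array('%s', '%s', '%s'),\n", $row[2], $row[3], $row[4]);
}
fclose($in);
file_put_contents('ip_country_table.php', $php . ");\n");

// lookup at runtime: a linear scan, so not fast, but completely self-contained
$table = include 'ip_country_table.php';
$n = sprintf('%u', ip2long($ip));
foreach ($table as $range) {
    if ($n >= $range[0] && $n <= $range[1]) { $country = $range[2]; break; }
}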
Consider ipcountryphp (my site, my code, my honour), as it provides a local, freely updated database. It's fast and fully self-contained, pluggable into anything running PHP 5.3 and SQLite3 or beyond. Very fast seeks and no performance penalties.
Enough with shameless self-promotion, let's get serious:
Relying on querying remote services in real time to get the visitor's country can become a major bottleneck for your site's functionality, depending on the response speed of the queried server. As a rule of thumb, you should never query external services for real-time site functionality (like page loading). Using APIs in the background is great, but when you need to query the country of each visitor before the page is rendered, you open yourself up to a world of pain. And do keep in mind you're not the only one abusing free services :)
So queries to third-party services stay in the background, while only local functionality that relies on no third party goes into the layers where users interact. Just my slightly performance-paranoid take on this :)
PS: The above-mentioned script I wrote has IPv6 support too.
Here is a site with a script I just used. The only problem is that you would probably need to regenerate the IP data yourself every now and then, which might be a pain, and that's why everyone is telling you to use an external API. But for me that wasn't a solution, as I was pulling around 50 IPs at once, which means I would probably get banned. So the solution was to use my own script or to save to a DB, but I was again pulling images from external sites. Anyway, here is the site I found the script on:
http://coding-talk.com/f29/country-flag-script-8882/
Here are a few:
http://api.hostip.info/get_html.php?ip=174.31.162.48&position=true
http://geoiplookup.net/geoapi.php?output=json&ipaddress=174.31.162.48
http://ip-api.com/json/174.31.162.48?callback=yourfunction
http://ipinfo.io/174.31.162.48
All return slightly different results.
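For example, the ip-api.com endpoint above returns JSON you can decode directly; a small sketch (the countryCode field name is from memory, so double-check it against the actual response):

// fetch and decode the JSON response for one IP
$resp = @file_get_contents('http://ip-api.com/json/' . urlencode($ip));
$data = ($resp === false) ? null : json_decode($resp, true);
$country = isset($data['countryCode']) ? $data['countryCode'] : null; // e.g. "US"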
Here is also one of them; just swap the IP for your variable:
http://api.codehelper.io/ips/?callback=codehelper_ip_callback&ip=143.3.87.193

Blocking a site from being indexed

I am wondering: is there any (programmatic) way to prevent any search engine from indexing the content of a website?
You can specify it in robots.txt:
User-agent: *
Disallow: /
As the other answers already say, robots.txt is the standard that every proper search engine adheres to. This should be enough in most cases.
If you really want to try to programmatically block malicious bots that do not listen to robots.txt, check out this question I asked a few months ago on how to tell bots apart from human visitors. You may find some good starting points there.
Create a robots.txt file for your site. For more info - see this link.
Most search engine bots identify themselves using a unique user agent.
You can block specific user agents using robots.txt
Here is a list of some user agents.
Since you did not mention a programming language, I'll give my input from a PHP perspective: there is a WordPress plugin called Bad Behavior that does exactly what you are looking for. It is configurable via a script listing an array of user-agent strings. Based on what the agent is crawling on your site, the plugin checks the user-agent string, ID, or IP address against that array and, if there's a match, either rejects or accepts the agent.
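The core idea is simple enough to sketch outside WordPress too; the bot strings below are only examples, not the plugin's actual list:

// reject requests whose User-Agent matches a known-bad bot string
$badBots = array('libwww-perl', 'HTTrack', 'WebCopier'); // example strings only
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($badBots as $bot) {
    if (stripos($agent, $bot) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}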
It might be worth your while to have a peek at the code to see how it is done from a programmer's perspective.
If you're using a language other than PHP and this doesn't satisfy what you are looking for, then I apologize for posting this answer.
Hope this helps,
Best regards,
Tom.

Is there a search engine, including an indexing bot, which can be used to build a special catalogue by feeding the bot certain properties?

Our application (C#/.NET) needs a lot of search queries. Google's 50,000-per-day policy is not enough. We need something that would crawl Internet websites by specific rules we set (e.g. country domains), gather URLs, text, keywords, and names of websites, and create our own internal catalogue so we wouldn't be limited to any massive external search engine like Google or Yahoo.
Is there any free open source solution we could use to install it on our server?
No point in re-inventing the wheel.
DataparkSearch might be the one you need. Or review this list of other Open Source Search Engines.

Resources