Typical crawling depth by search engines

When a site is crawled by a search engine (Google, Bing, etc.), what is the typical maximum depth the crawler will go into the site? By depth, I mean the number of hops from the homepage.

It depends on the overall rank of your site and the rank of the incoming links, especially if they aren't pointing at your homepage.
Crawlers for smaller search engines like blekko aren't going to go that far from the landing points of external links, unless your overall site is awesome or you have lots of links from awesome sites. We save our crawling and indexing energy for pages with higher rank, so if our estimate is that a page will have poor rank, we won't bother.
Google's crawler might crawl quite a distance even if you have only a poor inlink profile - but even Google knows about 10x more URLs than it actually crawls.
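A minimal sketch of this kind of crawl policy in Python - a depth-limited frontier that spends effort on the highest-estimated-rank pages first. Here estimate_rank and fetch_links are hypothetical stand-ins for a real rank model and page fetcher, and the thresholds are arbitrary:

    import heapq

    def crawl(seed_urls, estimate_rank, fetch_links, max_depth=10, min_rank=0.1):
        # Max-heap via negated rank: pages we expect to rank well get crawled first.
        frontier = [(-estimate_rank(u), 0, u) for u in seed_urls]
        heapq.heapify(frontier)
        seen = {u for _, _, u in frontier}
        crawled = []
        while frontier:
            neg_rank, depth, url = heapq.heappop(frontier)
            if -neg_rank < min_rank:   # estimated rank too poor: don't bother
                continue
            crawled.append(url)
            if depth >= max_depth:     # stop this many hops from the seeds
                continue
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-estimate_rank(link), depth + 1, link))
        return crawled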

If you want to crawl the whole web, then a depth of 19 is enough, because the whole web is covered within 19 hops. But if you want to crawl a specific domain or country, then a depth of 10 is quite enough.
I found this information in a paper that was used in developing Mercator.

Related

When searching a list of specific sites in Google's Programmable Search, YouTube dominates the results. Any way to balance it among the site list?

I've given Google's Programmable Search a list of sites; for now I'm starting with:
But when I search any query, I get pretty much only YouTube results, which I believe is because YouTube has the highest page rankings.
Is there any way to customize this to be more balanced among the different sites?
See below for more detail on what I'm thinking.
Possible Solution #1: Round-Robin
The top result from the first site is displayed, then the top result from the next site, and so on until every site has had its top result displayed, then it starts over with each site's second result.
Possible Solution #2: Site-to-Page Rank Ratio
Let's pretend that YouTube's site "rating" is a 10 and Reddit's is a 5. Now, in a list of search results, let's say that youtube.com/some-result has a rating of 8 and reddit.com/some-other-result has a rating of 6. In this case, the reddit.com result should display first, even though the youtube.com result has a higher absolute rating, because the Reddit page's page-to-site ratio is higher (6/5 = 1.2 versus 8/10 = 0.8).
I have no idea if #2 is even possible, but maybe it can serve as an illustration of what I'm looking for. I'd be plenty happy with a simple #1 round-robin approach; see the sketch below.
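Programmable Search itself doesn't expose a merge order like this, but if you fetch results per site (for example, one query per site) you can interleave them yourself. A minimal sketch of both ideas in Python; the result structure and ratings are made up for illustration:

    from itertools import chain, zip_longest

    def round_robin(results_by_site):
        # Solution #1: interleave ranked lists - each site's top result,
        # then each site's second result, and so on.
        rounds = zip_longest(*results_by_site.values())  # pads short lists with None
        return [r for r in chain.from_iterable(rounds) if r is not None]

    def ratio_rank(results, site_rating):
        # Solution #2: order by page rating / site rating, so a strong page
        # on a weaker site can outrank a middling page on a dominant site.
        return sorted(results,
                      key=lambda r: r["rating"] / site_rating[r["site"]],
                      reverse=True)

    print(round_robin({"youtube.com": ["yt1", "yt2"], "reddit.com": ["r1", "r2"]}))
    # ['yt1', 'r1', 'yt2', 'r2']

    results = [{"site": "youtube.com", "rating": 8},
               {"site": "reddit.com", "rating": 6}]
    print(ratio_rank(results, {"youtube.com": 10, "reddit.com": 5}))
    # reddit.com first: 6/5 = 1.2 beats 8/10 = 0.8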

Image search engine use

Is it possible to use an image search engine within an app? I have an idea to incorporate image search engines with the pictures that can be taken with the camera, and then have the app return info about whatever is recognized in the picture.
Google Goggles, Like.com (formerly Riya, now acquired by Google), and TinEye.com are some sites that offer visual search. I'm not sure they offer an API.
If you want to whip one up, it is, as you would expect, no trivial task. AFAIK there are no out-of-the-box solutions available, especially for your use case of taking an image and getting related information (known in the trade parlance as RST-invariant template matching - matching that is robust to rotation, scale, and translation) - you would need to budget a significant investment of time and money.
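If you do roll your own, the matching step can at least be prototyped with off-the-shelf tools. A rough sketch using OpenCV's ORB features, which tolerate rotation and some scale change; the thresholds here are arbitrary, and a real system would need an index over a large image collection rather than pairwise comparison:

    import cv2

    def looks_like_same_object(query_path, candidate_path, min_matches=25):
        orb = cv2.ORB_create(nfeatures=1000)
        img1 = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)
        _, des1 = orb.detectAndCompute(img1, None)
        _, des2 = orb.detectAndCompute(img2, None)
        if des1 is None or des2 is None:
            return False
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des1, des2)
        # Keep only tight descriptor matches; enough of them suggests the
        # candidate shows the same object as the camera shot.
        good = [m for m in matches if m.distance < 40]
        return len(good) >= min_matches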
We offer an image search engine for mobile app cameras - www.iqengines.com.

Reverse geocoding services

I'm working on a project that returns information based on the user's location. I also want to display the user's town as text (no map) so they can change it if it's not accurate.
If things go well, I hope this will become more than a small experiment, so can anyone recommend a good reverse-geocoding service with the fewest restrictions? I notice that Google and Yahoo limit the number of daily queries, along with other usage terms. I basically need to take a latitude and longitude and convert them to a city/town (which I presume cannot be done using the HTML5 Geolocation API alone).
Geocoda just launched a geocoding and spatial database service and offers up to 1K queries a month free, with paid plans starting at $49 for 25,000 queries/month. SimpleGeo just closed their Context API, so you may want to look at Geocoda or other alternatives.
You're correct, the browser geolocation API only provides coordinates.
I use SimpleGeo a lot and recommend them. They offer 10K queries a day free, then USD 0.25 per 1K calls after that. Their Context API is what you're going to want; it pretty much does what it says on the tin. It works server-side and client-side (without requiring you to draw a map, as Google does).
GeoNames can also do this and allows up to 30K "credits" a day; different queries expend different credit amounts. The free service has highly variable performance; the paid service is more consistent. I've used them in the past, but rarely do anymore because their data is difficult to deal with automatically - it's more "pure" but less meaningful to most people.
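For the basic lat/long-to-town lookup, GeoNames' findNearbyPlaceName service is straightforward to call. A minimal sketch assuming the requests library; you need to register a free GeoNames username, and the "demo" account shown here is heavily rate-limited:

    import requests

    def town_from_coords(lat, lng, username="demo"):
        resp = requests.get(
            "http://api.geonames.org/findNearbyPlaceNameJSON",
            params={"lat": lat, "lng": lng, "username": username},
            timeout=10,
        )
        resp.raise_for_status()
        places = resp.json().get("geonames", [])
        return places[0]["name"] if places else None

    # e.g. town_from_coords(37.4419, -122.1430) -> 'Palo Alto'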

How to collect millions of tweets?

I was browsing through fflick, a nicely made app on top of Twitter. How do they:
1. collect millions of tweets?
2. accurately (mostly) categorize tweets into positive or negative sentiments?
They probably collect millions of tweets by crawling Twitter with its API - probably searching the Streaming API for keywords related to films, or just searching their own timeline for what their followers have to say about films.
Don't know. Probably using some natural language processing techniques from good old AI textbooks. :-)
2) look for smileys - ;), :), :D, :(
A few places provide the latter as a service now. Check out ViralHeat and Evri:
http://www.viralheat.com/home/features
http://www.readwriteweb.com/archives/sentiment_analysis_is_ramping_up_in_2009.php
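A minimal sketch of both steps - the kind of keyword filtering you'd express as a Streaming API "track" query, and the smiley heuristic mentioned above. The keyword list and emoticon sets are made up for illustration:

    FILM_KEYWORDS = {"movie", "film", "trailer", "cinema"}
    POSITIVE = {":)", ":D", ";)", ":-)"}
    NEGATIVE = {":(", ":-("}

    def is_film_tweet(text):
        # Cheap stand-in for a Streaming API keyword filter.
        return any(w.strip(".,!?") in FILM_KEYWORDS for w in text.lower().split())

    def smiley_sentiment(text):
        # Naive emoticon heuristic: count positive vs. negative smileys.
        pos = sum(text.count(s) for s in POSITIVE)
        neg = sum(text.count(s) for s in NEGATIVE)
        return "positive" if pos > neg else "negative" if neg > pos else "neutral"

    tweets = ["Loved that movie :)", "worst film ever :(", "lunch was ok"]
    print([(t, smiley_sentiment(t)) for t in tweets if is_film_tweet(t)])
    # [('Loved that movie :)', 'positive'), ('worst film ever :(', 'negative')]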

Are there any better geolocation databases / technologies / services or has anyone done any work with improving the accuracy of existing systems?

I am working on integrating geolocation services into a website, and the best source of data I've found so far is MaxMind's GeoIP API with GeoLite City data. Even this data often seems questionable, though. For example, I am located in downtown Palo Alto, but it locates my IP in Portola Valley, which is about 7 miles away. Palo Alto has a population of 60k+, whereas Portola Valley has a population of less than 5k. I would think that if an IP originates somewhere around there, it would make more sense to assume it comes from the highly populated city, not the tiny one. I've also had it locate Palo Alto IPs completely across the country in Kentucky, etc.
Does anyone know of any better sources of data, or any tools/technologies/efforts to improve the accuracy of geolocation efforts? Commercial solutions are fine.
Where an IP comes up at the wrong end of the country, you probably won't find a better match elsewhere, because it's probably an ISP that uses one group of IPs for customers across a wide area. My favourite example is trains here in the UK, where the on-board wifi is identified as being in Sweden because it uses a satellite connection to an ISP in Sweden.
A commercial supplier may be able to afford to spend more time tracking down the hard cases, but in many cases there just won't be a good answer to give you. They may, however, give you a confidence factor to tell you when they're guessing. I've heard good things about Quova, though I've never used them.
Assuming you've got the best latitude and longitude you can get (or can afford), you're left dealing with cases where the database picks the closest city rather than a more likely larger city nearby. Unfortunately I don't have the code to hand, but I had some success using data from GeoNames to pick a "sensible" city near a point. They list lat/long and population, so you can do something like
ORDER BY ( Distance / LOG( Population ) )
You'd need to experiment with that to get the right level of bias towards larger cities, but I had it working quite nicely, taking the centre of a Google Maps view and displaying a heading like "Showing results near London..." that changed as you moved the map.
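A sketch of that scoring in Python, with a haversine distance and made-up city rows shaped like GeoNames data (name, lat/long, population); it reproduces the Palo Alto vs. Portola Valley case from the question:

    from math import radians, sin, cos, asin, sqrt, log

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two lat/long points, in km.
        dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
        a = (sin(dlat / 2) ** 2
             + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    def sensible_city(lat, lon, cities):
        # The ORDER BY Distance / LOG(Population) idea: nearness wins,
        # but population tips the balance towards bigger places.
        return min(cities, key=lambda c: haversine_km(lat, lon, c["lat"], c["lon"])
                                         / log(c["population"]))

    cities = [
        {"name": "Portola Valley", "lat": 37.384, "lon": -122.235, "population": 4500},
        {"name": "Palo Alto",      "lat": 37.442, "lon": -122.143, "population": 64000},
    ]
    print(sensible_city(37.44, -122.16, cities)["name"])  # 'Palo Alto'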
I am not sure if this will help, but here is a site that has done a pretty good job of IP mapping. Maybe you could ask them for help :) seomoz.org
A couple of sites I saw referenced recently for free GeoIP services are:
WIPmania
hostip.info
