Obviously, I think it's overkill for me to run a spider that will crawl the internet autonomously like Google's or Yahoo's.
So I am wondering: is there some way I can access a major search engine's database, instead of scraping them?
Google and Yahoo both have APIs:
http://code.google.com/apis/ajaxsearch/
http://developer.yahoo.com/search/
But like everybody else said, we need more info about what you're trying to do before we can help you.
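For example, here's a minimal Python sketch against the Google AJAX Search JSON endpoint linked above (field names per that API's docs at the time; the API has since been deprecated, so treat this as illustrative only):

```python
# Minimal sketch: querying the Google AJAX Search API over plain HTTP.
# Assumes the JSON endpoint documented at the link above; the exact
# response shape may change, so treat this as illustrative only.
import json
import urllib.parse
import urllib.request

def google_ajax_search(query):
    params = urllib.parse.urlencode({"v": "1.0", "q": query})
    url = "http://ajax.googleapis.com/ajax/services/search/web?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # Each result carries a title and URL, among other fields.
    return [(r["titleNoFormatting"], r["url"])
            for r in data["responseData"]["results"]]

if __name__ == "__main__":
    for title, url in google_ajax_search("stack overflow"):
        print(title, "->", url)
```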
Other than doing regular searches, no.
What are you trying to do?
Nope. You're already violating their terms of use by scraping them. That kind of information is carefully guarded, for obvious reasons.
I need to implement a direct-call feature via ViciDial in my web application.
Let me explain the flow to better explain what I need: a user would be able to log in to ViciDial via a webpage of my app, and could then call any number by entering it.
Is there any doc or wiki available for implementing this feature? Guidance in simple steps would also be appreciated.
Thanks in advance :)
We implemented something like that using a product called WombatDialer that offers good APIs and is quite easy to set up (well, easier than ViciDial).
See http://www.wombatdialer.com/manuals/WD_UserManual-chunked/ar01s08.html for an API reference.
Are there any ways to get information about different places (cities, mountains, rivers, etc.) via latitude/longitude?
I'm planning to use it in my Rails project.
Of course, it would be perfect to use information from Wikipedia. Any example of searching over Wikipedia via lat/lon?
Maybe some other technology/website/API?
You may also check the Geocoder gem to find an address by latitude and longitude. Then you can use the Wikipedia API to find articles, like this:
en.wikipedia.org/w/api.php?action=opensearch&search="place_name"&prop=info&format=xml
Or maybe wikilocation will help you.
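If you're calling it from code rather than the browser, here's a small sketch of that second step (place name to Wikipedia articles); it uses format=json instead of the xml shown above purely because it's easier to parse:

```python
# Minimal sketch of the second step: once you have a place name (e.g.
# from the Geocoder gem), look up matching Wikipedia articles via the
# opensearch action. The JSON response is a four-element array:
# [query, [titles...], [descriptions...], [urls...]].
import json
import urllib.parse
import urllib.request

def wiki_opensearch(place_name):
    params = urllib.parse.urlencode({
        "action": "opensearch",
        "search": place_name,
        "format": "json",
    })
    url = "https://en.wikipedia.org/w/api.php?" + params
    with urllib.request.urlopen(url) as resp:
        query, titles, descriptions, urls = json.load(resp)
    return list(zip(titles, urls))

if __name__ == "__main__":
    print(wiki_opensearch("Mont Blanc"))
```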
Check out the OpenStreetMap API: http://wiki.openstreetmap.org/wiki/API They tend to discourage read-only queries, but you can probably find someone else's API for that data.
Also check out http://www.gisgraphy.com/ They have a free webservice here, http://www.gisgraphy.com/documentation/user-guide.htm#geolocwebservice, that looks like it has what you are looking for.
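As one example of "someone else's API" over OSM data, here's a sketch against the public Nominatim reverse-geocoding service (respect its usage policy: send a descriptive User-Agent and keep request volume low):

```python
# Minimal sketch of a reverse-geocode lookup against OSM data via the
# Nominatim web service. The User-Agent string is a placeholder; use
# something that identifies your app.
import json
import urllib.parse
import urllib.request

def reverse_geocode(lat, lon):
    params = urllib.parse.urlencode({"lat": lat, "lon": lon, "format": "json"})
    url = "https://nominatim.openstreetmap.org/reverse?" + params
    req = urllib.request.Request(url, headers={"User-Agent": "my-rails-helper/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data.get("display_name")

if __name__ == "__main__":
    print(reverse_geocode(45.8326, 6.8652))  # somewhere on Mont Blanc
```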
Sorry for the bad title and description, but I was wondering if there is any way I could search/list products from other sites (say Express, American Eagle) from a web app I create, even if the site doesn't have an API.
Thanks
Sure. How do you think Google and every other search engine does it? They just spider the sites and index the contents. The devil, of course, is in the details. But it's certainly possible to do.
I don't think so, unless you only want to fetch some data from a certain HTML page; then you can use some regular expressions. But searching their database is not possible if you don't have the ability to connect to it, either directly or via some API.
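To make the "fetch some data from a page" idea concrete, here's a sketch; the URL and CSS classes are hypothetical, since every site's markup differs (inspect the real HTML first, and check the site's terms of use):

```python
# Minimal scraping sketch in the spirit of both answers above: fetch a
# product listing page and pull out names/prices.
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def scrape_products(url):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(html, "html.parser")
    products = []
    # Hypothetical markup: <div class="product"><span class="name">...</span>
    #                      <span class="price">...</span></div>
    for item in soup.select("div.product"):
        name = item.select_one("span.name")
        price = item.select_one("span.price")
        if name and price:
            products.append((name.get_text(strip=True),
                             price.get_text(strip=True)))
    return products

if __name__ == "__main__":
    for name, price in scrape_products("http://example.com/products"):
        print(name, price)
```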
Is there a way to collect web content in order to use it in a search engine without going through the web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way (see the sketch after this answer). It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of their search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at DataSift as an example. There are lots more resources you could cleverly use to avoid or minimize crawling.
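To make the data-dump option above concrete, here's a toy sketch of building a partial inverted index over documents you already have locally, no crawling involved. A real dump (e.g. Wikipedia's XML) would need a streaming parser, but the indexing step is the same idea:

```python
# Toy sketch: an inverted index (term -> set of doc ids) plus AND-search.
import re
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping doc_id -> raw text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-search: return docs containing every query term."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

if __name__ == "__main__":
    docs = {1: "Stack Overflow data dump", 2: "Wikipedia XML dump"}
    idx = build_index(docs)
    print(search(idx, "dump wikipedia"))  # -> {2}
```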
If you want to be updated with the latest content on pages, then you can use something like the PubSubHubbub protocol to get push notifications for subscribed links (see the sketch below).
Or use paid services like Superfeedr that make use of the same protocol.
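A rough sketch of what a PubSubHubbub subscription request looks like; the hub here is the well-known reference hub, while the callback and topic URLs are placeholders. The hub verifies the subscription by GETting your callback with a hub.challenge you must echo back before it starts pushing updates:

```python
# Minimal sketch of a PubSubHubbub subscription request (form-encoded POST
# to the hub advertised by the feed). URLs below are placeholders.
import urllib.parse
import urllib.request

HUB_URL = "https://pubsubhubbub.appspot.com/"   # example public hub
CALLBACK = "https://example.com/push-callback"  # your endpoint (placeholder)
TOPIC = "http://example.com/feed.xml"           # feed you want updates for

def subscribe():
    body = urllib.parse.urlencode({
        "hub.mode": "subscribe",
        "hub.callback": CALLBACK,
        "hub.topic": TOPIC,
        "hub.verify": "async",
    }).encode()
    req = urllib.request.Request(HUB_URL, data=body)
    with urllib.request.urlopen(req) as resp:
        # 202 Accepted means the hub will verify the subscription asynchronously.
        print(resp.status)

if __name__ == "__main__":
    subscribe()
```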
Directly or indirectly, you have to crawl the web in order to get the content.
Well, if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with a title, description and tags). So a collaborative link collection can be built.
To avoid spam, a +/- system can be involved, letting users vote useful sites or tags up and useless ones down.
To avoid spammers mass-voting SERPs, you can weight votes by user reputation (see the sketch below).
User reputation can be gained by submitting useful sites, or somehow by tracing usage patterns.
And by considering other abuse patterns too.
Well, you got the point, I think.
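Here's a toy sketch of the reputation-weighted voting idea; the log weighting is an arbitrary choice, just to show the shape:

```python
# Toy sketch: each vote counts in proportion to the voter's reputation,
# so a swarm of fresh spam accounts moves a site's score far less than
# a few trusted users.
import math

def vote_weight(reputation):
    # Diminishing returns: weight grows with the log of reputation.
    return math.log10(max(reputation, 1) + 1)

def site_score(votes):
    """votes: list of (direction, voter_reputation), direction is +1/-1."""
    return sum(direction * vote_weight(rep) for direction, rep in votes)

if __name__ == "__main__":
    trusted = [(+1, 5000), (+1, 1200)]
    spam_swarm = [(-1, 1)] * 20              # 20 low-reputation downvotes
    print(site_score(trusted + spam_swarm))  # still positive
```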
As spammers gradually discover the weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and when the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/
I am trying to find a way to track and produce reports for my site (out of interest). Does anyone know of any articles/projects etc. that show how you can:
1) Track pages / unique visitors, etc.
2) Track 1) relative to timestamp, etc.
in ASP.NET MVC or just ASP.NET?
P.S. I know Google Analytics etc. is available, but I'm looking to create some basic stats for myself, out of interest in how web analytics work.
There are a couple of good ways to try and determine unique visitors; none of them is exact (which is why different analytics packages will report different numbers).
The first is to use a cookie. Create a cookie for the user for each time frame that you want to track uniques, so you could create one that expires in a day and one that expires in a month. You can then use both of those to track how many unique daily/monthly visitors you have. Of course this is not perfect since people can clear or refuse cookies, but it is pretty accurate.
The other way is to track uniques using a combination of the IP address and User-Agent of the requesting user. This is probably slightly less accurate, since if a company has a good IT group, lots of internal users will have the same User-Agent, and since they are all coming from the same internal network, they could have the same IP address.
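Here's a language-agnostic sketch of both techniques (in ASP.NET you'd read Request.Cookies, Request.UserHostAddress and Request.UserAgent, and write Response.Cookies, but the logic is the same):

```python
# Sketch of both approaches: a visitor is "new" for a period if they
# carry no cookie for it; the IP + User-Agent hash is the cookieless
# fallback (less accurate, as the answer above notes).
import hashlib
import uuid

DAY, MONTH = 60 * 60 * 24, 60 * 60 * 24 * 30  # cookie max-age values

def check_unique(request_cookies, period_name):
    """Returns (is_new_visitor, cookie_to_set or None)."""
    if period_name in request_cookies:
        return False, None
    # New visitor for this period: issue a cookie with a random id.
    # The caller sets max-age to DAY or MONTH as appropriate.
    return True, (period_name, uuid.uuid4().hex)

def fingerprint(ip, user_agent):
    """Cookieless fallback: hash IP + User-Agent into a visitor key.
    Users behind one corporate NAT can collide."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode()).hexdigest()

if __name__ == "__main__":
    cookies = {}                      # first request: no cookies yet
    is_new, cookie = check_unique(cookies, "uniq_daily")
    print(is_new)                     # True -> count a daily unique
    print(fingerprint("10.0.0.7", "Mozilla/5.0"))
```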
If you are interested in reading more about the different methods there is a great article about it here: http://www.google.com/support/urchin45/bin/answer.py?answer=28325
I blogged about a simple ASP.NET tracking module.
You can check it out here:
http://ilkeraksu.com/post/2009/07/14/Very-very-simple-But-very-very-efficient-Aspnet-Tracking-module.aspx
I would recommend using Google Analytics instead of reinventing the wheel. All you have to do is stick a bit of JavaScript in your master page and you're done.
You can check Piwik out. It's an open-source web analytics platform written using PHP and MySQL.
You can find a great article at http://www.codeproject.com/KB/aspnet/PageTracking.aspx
which is an upgraded version of http://www.15seconds.com/Issue/021119.htm
with the help of a SessionTracker class that runs in Application_PreRequestHandlerExecute, mails reports on session end, and has a lot of useful tips.
Thanks to Wayne Plourde for all that stuff.