Search Engine Ping Services? - search-engine

I'm looking for a good ping service like pingomatic.com, but for a general website (not necessarily a blog).
Any recommendations?
Thanks in advance!

Oh I just came across a massive list of services:
http://readymadeweb.com/2010/01/01/242-ways-to-ping-how-to-stay-on-search-engine-radar/

Search Engine Ping is an SEO tool that aims to get your website, blog, or affiliate links ranked quickly and highly in search engines.
I recommend this service:
http://statstool.com/search-engine-ping/
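For background, most of these ping services (Ping-O-Matic included) accept the weblogUpdates.ping XML-RPC call, so you can also ping a service directly. Here is a minimal Python sketch, assuming http://rpc.pingomatic.com/ as the XML-RPC endpoint; check the service's documentation for the real URL and parameters.

import xmlrpc.client

# Endpoint is an assumption; substitute the ping service's documented XML-RPC URL.
server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")

# weblogUpdates.ping takes the site name and the site URL.
result = server.weblogUpdates.ping("My Example Site", "http://www.example.com/")
print(result)  # typically a struct with an error flag and a message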

Related

What's the best service to use for filtering out spam/abuse/malware links for a link shortening webapp?

I have two services - Lincr and LinkBunch. Lincr is a plain jane URL shortening service, while LinkBunch lets you shorten multiple links into one link. I've had too much spam posted into the services, so I had to shut down Lincr. Now, even LinkBunch seems to be facing the same problem, and it's been disabled by my web host for that reason.
I can't keep shutting down sites like this because of bad links being posted, so I need a malware-filtering API that I can use to filter out the links as and when they are posted.
There are services that let me download an entire bunch of bad links to check against, but instead, I'd prefer doing a live API call on a per-link basis. What can I use for that?
Finally, what's the best malware filtering service out there?
Lincr is down. On LinkBunch, where is your Captcha?
On either site, do you limit the number of posts by IP? Do you use a delay in your response? What about using hidden fields to reduce spam (http://www.reviewmylife.co.uk/blog/2008/05/30/hidden-field-spam-trap-for-phpformmail/)?
I know I'm dodging the question a bit, but you should at least take basic anti-spam measures before resorting to API calls. Even APIs will still fail for newly-hacked / newly-spammed sites.
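To make the "live API call on a per-link basis" part concrete: one widely used option is Google's Safe Browsing Lookup API. Below is a rough Python sketch assuming the v4 threatMatches:find endpoint and request shape (verify both against the current documentation), combined with a hidden-field spam trap like the one linked above; the form field name and API key are placeholders.

import json
import urllib.request

SAFE_BROWSING_KEY = "YOUR_API_KEY"  # placeholder

def looks_like_spam(form):
    # Honeypot: a field hidden with CSS that humans leave empty but bots tend to fill in.
    return bool(form.get("website_url_confirm", "").strip())

def url_is_flagged(url):
    # Per-link lookup against Google Safe Browsing v4 (check endpoint/fields against current docs).
    body = {
        "client": {"clientId": "linkbunch-example", "clientVersion": "1.0"},
        "threatInfo": {
            "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
            "platformTypes": ["ANY_PLATFORM"],
            "threatEntryTypes": ["URL"],
            "threatEntries": [{"url": url}],
        },
    }
    req = urllib.request.Request(
        "https://safebrowsing.googleapis.com/v4/threatMatches:find?key=" + SAFE_BROWSING_KEY,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        matches = json.loads(resp.read().decode("utf-8") or "{}")
    return bool(matches.get("matches"))  # an empty response body means no known threat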

Search Engine without crawling?

Is there a way to collect web content for use in a search engine without going through a web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (wikipedia, stackoverflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. duckduckgo uses yahoo's boss api (and now yahoo uses bing...) as part of their search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
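As a toy illustration of the "index a data dump instead of crawling" idea, here is a minimal in-memory inverted index over a directory of already-downloaded documents; the directory layout and tokenization are simplified assumptions, not how a production engine would do it.

import os
import re
from collections import defaultdict

def build_index(dump_dir):
    # Map each term to the set of documents (file names) that contain it.
    index = defaultdict(set)
    for name in os.listdir(dump_dir):
        with open(os.path.join(dump_dir, name), encoding="utf-8", errors="ignore") as f:
            for term in re.findall(r"[a-z0-9]+", f.read().lower()):
                index[term].add(name)
    return index

def search(index, query):
    # AND query: intersect the posting sets of every query term.
    postings = [index.get(t, set()) for t in re.findall(r"[a-z0-9]+", query.lower())]
    return set.intersection(*postings) if postings else set()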
If you want to be updated with the latest content on pages, then you can use something like pubsubhubbub protocol to get push notifications for subscribed links.
Or use paid services like superfeedr that make use of the same protocol.
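For reference, a subscription in that protocol (since standardized as WebSub) is just a form-encoded POST to a hub. A rough Python sketch follows; the topic and callback URLs are placeholders, the hub shown is Google's public PubSubHubbub hub, and a real subscriber also has to answer the GET verification challenge the hub sends to the callback.

import urllib.parse
import urllib.request

data = urllib.parse.urlencode({
    "hub.mode": "subscribe",
    "hub.topic": "http://example.com/feed.xml",           # feed you want push updates for
    "hub.callback": "http://my-server.example/websub",    # must be publicly reachable
}).encode("utf-8")

# The hub typically replies 202 Accepted, verifies the callback, then starts pushing updates to it.
urllib.request.urlopen("https://pubsubhubbub.appspot.com/", data=data)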
Directly or indirectly, you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam a +/- system can be involved, to vote useful sites or tags up and useless ones down.
To avoid spammers mass voting SERPs you can weight votes by user reputation.
User reputation can be gained by submitting useful sites. Or somehow tracing usage patterns.
And considering other abuse patterns too.
Well, you got the point, I think.
As spammers gradually discover weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and when the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange have not been spammed to useless levels so far...
PS: http://xkcd.com/810/

Blocking a site from being indexed

I am wondering whether there is any (programming) way to block search engines from indexing the content of a website.
You can specify it in robots.txt
User-agent: *
Disallow: /
As the other answers already say, robots.txt is the standard that every proper search engine adheres to. This should be enough in most cases.
If you really want try to programmatically block malicious bots that do not listen to robots.txt, check out this question I asked a few months ago on how to tell bots apart from human visitors. You may find some good starting points there.
Create a robots.txt file for your site. For more info - see this link.
Most search engine bots identify themselves using a unique user agent.
You can block specific user agents using robots.txt
Here is a list of some user agents.
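For example, to shut out one named crawler while leaving everything else alone (Googlebot here is just a sample user agent; substitute whichever bots you want to block):
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: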
Since you did not mention a programming language, I'll give my input from a PHP perspective: there is a WordPress plugin called Bad Behavior which does exactly what you are looking for. It is configurable via a code script listing an array of user-agent strings. Based on what the agent is crawling on your site, the plugin checks the user-agent string (or IP address) against that array and either rejects or accepts the request.
It might be worth your while to have a peek at the code to see how it is done from a programmer's perspective.
If your language is not PHP and this does not satisfy what you are looking for, then I apologize for posting this answer.
Hope this helps,
Best regards,
Tom.
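If PHP/WordPress is not an option, the same idea (match the request's user agent or IP against a deny list and refuse it) is only a few lines in any language. Here is a minimal WSGI-style sketch in Python; the deny-list entries are made-up examples.

BAD_AGENTS = ("badbot", "evilscraper")   # user-agent substrings you want to refuse
BAD_IPS = {"203.0.113.7"}                # example address from a documentation range

def app(environ, start_response):
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    ip = environ.get("REMOTE_ADDR", "")
    if ip in BAD_IPS or any(bad in agent for bad in BAD_AGENTS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]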

What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of asking DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
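A few lines of Python turn that download into a URL list, assuming the usual rank,domain layout of top-1m.csv inside the zip:

import csv
import io
import zipfile

with zipfile.ZipFile("top-1m.csv.zip") as z:
    with z.open("top-1m.csv") as f:
        rows = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
        urls = ["http://" + domain for rank, domain in rows]

print(len(urls), urls[:5])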
You can register to get access to the entire .com and .net zone files at Verisign
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query "link:" followed by each URL in the list.
The modern terms are URI and URN; URL is the older, narrower term. I'd scan for sitemap files, which contain many addresses in one file, and study the classic texts on spiders, wanderers, brokers, and bots, as well as RFC 3986 (Appendix B, p. 50), which gives a regex for parsing URIs.
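Sitemap scanning is easy to prototype. Here is a small Python sketch that pulls the <loc> entries out of a sitemap.xml; the URL is a placeholder, and real sitemaps may be gzipped or split into sitemap index files that point to further sitemaps.

import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Placeholder URL; robots.txt often lists the real sitemap location in a Sitemap: line.
with urllib.request.urlopen("http://www.example.com/sitemap.xml") as resp:
    tree = ET.parse(resp)

urls = [loc.text for loc in tree.iter(NS + "loc")]
print(urls[:10])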

ASP.NET Tracking Code & Unique Visitors

I am trying to find a way to track visits and produce reports for my site (out of interest). Does anyone know of any articles/projects etc. that show how to:
track pages / unique visitors etc.
track (1) relative to a timestamp etc.
in ASP.NET MVC or plain ASP.NET?
P.S. I know Google Analytics etc. is available, but I'm looking to create some basic stats for myself, out of interest in how web analytics work.
There are a couple of good ways to try to determine unique visitors; none of them are exact (which is why different analytics tools will report different numbers).
The first is to use a cookie. Create a cookie for the user for each time frame that you want to track uniques, so you could create one that expires in a day and one that expires in a month. You can then use both of those to track how many unique daily/monthly visitors you have. Of course this is not perfect since people can clear or refuse cookies, but it is pretty accurate.
The other way is to track uniques using a combination of the IP address and User Agent of the requesting user. This is probably slightly less accurate, since if a company has a good IT group, lots of internal users will have the same User Agent, and since they are all coming from the same internal network, they could also have the same IP address.
If you are interested in reading more about the different methods there is a great article about it here: http://www.google.com/support/urchin45/bin/answer.py?answer=28325
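The question is about ASP.NET, but the counting logic itself is framework-agnostic. Here is a rough Python sketch of the two approaches above (a daily cookie, falling back to an IP + User Agent hash); the cookie name and in-memory storage are made-up simplifications.

import hashlib
from datetime import datetime, timedelta

daily_uniques = set()  # in practice, persist this per day in a database

def record_hit(cookies, ip, user_agent):
    # Prefer the cookie if the browser kept it; otherwise hash IP + User Agent as a fallback key.
    key = cookies.get("daily_uid")
    new_cookie = None
    if key is None:
        key = hashlib.sha1((ip + "|" + user_agent).encode("utf-8")).hexdigest()
        # The caller should set this cookie on the response, expiring in a day,
        # so repeat visits within the day are not counted twice.
        new_cookie = ("daily_uid", key, datetime.utcnow() + timedelta(days=1))
    daily_uniques.add(key)
    return new_cookie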
I blogged about a simple ASP.NET tracking module. You can check it out here:
http://ilkeraksu.com/post/2009/07/14/Very-very-simple-But-very-very-efficient-Aspnet-Tracking-module.aspx
I would recommend using Google Analytics instead of reinventing the wheel. All you have to do is stick a bit of JavaScript in your master page and you're done.
You can check Piwik out. It's an open-source web analytics tool written using PHP and MySQL.
You can find a great article at http://www.codeproject.com/KB/aspnet/PageTracking.aspx, which is an upgraded version of http://www.15seconds.com/Issue/021119.htm. It uses a Session Tracker class that runs in Application_PreRequestHandlerExecute, mails reports on session end, and includes a lot of useful tips. Thanks to Wayne Plourde for all that stuff.
