I have a collection of URLs that I want to estimate the age of. Let me phrase the question this way:
How can I estimate the earliest point in time at which a request for a URL would have succeeded (say, HTTP status code 200 for a GET request)?
One approach I'm considering: perhaps Google (or some other crawler) offers a publicly available way, preferably an API, to get the timestamp of its first visit to a URL.
I know how to get the age of Google's cached version, e.g.: webcache.googleusercontent.com/search?q=cache:stackoverflow.com. However, because the cached versions are updated rather frequently, this isn't very useful.
Not possible in a reliable way. (Well, unless you have all access log files of the servers you are interested in.)
Internet Archive’s Wayback Machine shows the first time it crawled a webpage. Of course it may take time until their bots find and crawl a page for the first time, so most indexed pages probably are much older.
Also note: as soon as the crawler is blocked (e.g., via robots.txt), the history/copies will be removed (from the FAQ):
When a URL has been excluded at direct owner request from being archived, that exclusion is retroactive and permanent.
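The first-crawl lookup can be scripted against the Wayback Machine's CDX server. This is a minimal sketch, assuming the public endpoint http://web.archive.org/cdx/search/cdx with its url, output, fl and limit parameters, and assuming that captures come back oldest first with a header row. The function names are made up for illustration. The result is only the first time the archive saw the URL, so the page itself may be older.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"  # assumed public endpoint


def build_cdx_query(url):
    """Build a CDX query asking only for the timestamp of the first capture."""
    params = {"url": url, "output": "json", "fl": "timestamp", "limit": "1"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_first_capture(rows):
    """Convert the JSON rows to a datetime; rows[0] is assumed to be the header."""
    if len(rows) < 2:
        return None  # the archive has no capture of this URL
    return datetime.strptime(rows[1][0], "%Y%m%d%H%M%S")


def first_capture(url):
    """Network call: fetch and parse the earliest capture for `url`."""
    with urlopen(build_cdx_query(url), timeout=30) as response:
        body = response.read().decode("utf-8").strip()
    return parse_first_capture(json.loads(body) if body else [])
```

Calling first_capture("stackoverflow.com") would return a datetime, or None if the archive holds no copy (for example after a retroactive exclusion).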
Related
Let's say someone has posted a resource at https://this-site.com/files/pdfs/some_file_name.pdf
Another resource is then posted under the same path, with a file name we don't know: https://this-site.com/files/pdfs/another_unique_resource98237219.pdf
Is it possible to detect when a new PDF is posted to this location? Or would we have to know more about the backend infrastructure? Keeping in mind that:
None of the other pieces of the URL are valid paths, in other words https://this-site.com/files/pdfs and https://this-site.com/files both return 404 errors.
The names of the files are unique and do not follow a specific pattern.
If this is not possible, what are other ways you might inspect the request/response infrastructure to look for resources posted to that URL?
My first suggestion is to check whether another page lists the resources available on the website, assuming of course that the website owner actually provides such a page.
The other method would be effectively brute-forcing all the URLs under that path. You will need to collect some SOCKS proxies for your crawlers, to distribute the requests among multiple IP addresses; otherwise the server will probably block your IP address. If you can determine the minimum and maximum number of characters in the file names (not the pattern, just the length), this operation can be drastically optimized.
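To see how much the length bounds matter, it helps to count the candidate names before sending a single request. This is only a back-of-the-envelope sketch: the character set below is an assumption (lowercase letters, digits and underscore), and the real names may use a different one.

```python
import string

# Assumed character set; the real file names may use other characters.
ALPHABET = string.ascii_lowercase + string.digits + "_"


def candidate_count(min_len, max_len, alphabet_size=len(ALPHABET)):
    """Number of distinct names whose length lies in [min_len, max_len]."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


# Compare an unbounded-ish guess with a tight length window.
print(candidate_count(1, 12))
print(candidate_count(8, 8))
```

Even with a tight window the count grows exponentially with the name length, so this only becomes practical for short names.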
Currently I'm using this link forwarding structure:
bit.ly/{some_hash} > example.com/s/{ID} > example.com/blog/full-seo-optimized-url/
Because the ID of the blog entry never changes but the URL might (e.g. to fix a spelling mistake), I'm forwarding my bit.ly short URLs to a special subpage (/s/{ID}) that looks up the full URL in the database and forwards the visitor to the blog entry.
Pros:
If the URL changes, the bit.ly short link will still work and forward to the updated url
(Possible) Cons:
Might be seen as spammy method (hiding target link)?
Might be violating any terms of service?
... ?
Therefore I would like to know whether this is a good and proper way, or whether I should remove the step in the middle.
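For reference, the middle step described above boils down to a lookup like the following. This is a sketch in Python, not the actual site code: the dictionary stands in for the database, the ID 42 is invented, and the 302 status is an assumption (the question does not say which redirect status is used).

```python
# Hypothetical stand-in for the database table mapping blog IDs to current slugs.
BLOG_SLUGS = {42: "full-seo-optimized-url"}


def resolve_short_path(path, slugs=BLOG_SLUGS):
    """Map '/s/{ID}' to a (status, location) pair for the redirect response."""
    prefix = "/s/"
    if not path.startswith(prefix):
        return 404, None
    try:
        blog_id = int(path[len(prefix):].strip("/"))
    except ValueError:
        return 404, None  # the ID part is not a number
    slug = slugs.get(blog_id)
    if slug is None:
        return 404, None  # unknown blog entry
    # 302 (temporary) is an assumption made here because the target may change.
    return 302, "/blog/" + slug + "/"
```

If the slug is corrected in the database, the same /s/42 link resolves to the new location without touching the bit.ly link.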
Those redirects make for a slower user experience, and each one that is followed loses some of the PageRank passed on to the destination.
I'd avoid doing it where possible.
There are URL shorteners out there that let you directly edit the destination which would avoid your need for the middle redirect.
You also want to avoid changing the destination URL: other people will not use your fancy redirects when they link to you, so you will lose PageRank every time the URL changes.
I'm building a service where people get notified (by email) when someone follows a link with the format www.domain.com/this_is_a_hash. The people who use this service can share the link in different places, like Twitter, Tumblr, Facebook and more...
The main problem I'm having is that as soon as the link is shared on any of these platforms, a lot of requests for www.domain.com/this_is_a_hash hit my server. Each time one of these requests arrives, a notification is sent to the owner of this_is_a_hash, and of course this is not what I want. I only want notifications when real people visit the resource.
I found a very interesting article here that talks about the huge amount of request a server receives when posting to twitter...
So what I need is a way to keep search engines from hitting the "resource" URL, www.mydomain.com/this_is_a_hash.
Any idea? I'm using rails 3.
Thanks!
If you don’t want these pages to be indexed by search engines, you could use a robots.txt to block these URLs.
User-agent: *
Disallow: /
(That would block all URLs for all user-agents. You may want to add a folder to block only those URLs inside of it. Or you could add the forbidden URLs dynamically as they get created, however, some bots might cache the robots.txt for some time so they might not recognize that a new URL should be blocked, too.)
It would, of course, only hold back those bots that are polite enough to follow the rules of your robots.txt.
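To check what a polite bot would conclude from such a file, Python's standard urllib.robotparser can evaluate the rules locally. In this sketch the /n/ folder and the ExampleBot name are made up; it shows the folder variant mentioned above rather than blocking everything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: only the folder holding the notification links is blocked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /n/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A link inside the blocked folder versus an ordinary page.
blocked = parser.can_fetch("ExampleBot", "http://www.domain.com/n/this_is_a_hash")
allowed = parser.can_fetch("ExampleBot", "http://www.domain.com/about")
print(blocked, allowed)  # False True
```

This only tells you what a well-behaved crawler should do; it does not stop anything that ignores robots.txt.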
If your users would copy&paste HTML, you could make use of the nofollow link relationship type:
<a href="http://www.domain.com/this_is_a_hash" rel="nofollow">cute cat</a>
However, this would not be very effective, as even some of those search engines that support this link type still visit the pages.
Alternatively, you could require JavaScript to be able to click the link, but that’s not very elegant, of course.
But I assume they only copy&paste the plain URL, so this wouldn’t work anyway.
So the only chance you have is to decide if it’s a bot or a human after the link got clicked.
You could check for user-agents. You could analyze the behaviour on the page (e.g. how long it takes for the first click). Or, if it’s really important to you, you could force the users to enter a CAPTCHA to be able to see the page content at all. Of course you can never catch all bots with such methods.
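The user-agent check could look roughly like this. It is written in Python to show the logic only (the question uses Rails), the marker list is illustrative and far from complete, and treating a missing header as a bot is an assumption.

```python
# Substrings that appear in the User-Agent of many crawlers and link-preview
# fetchers. Illustrative only; real lists are much longer.
BOT_MARKERS = ("bot", "crawler", "spider", "facebookexternalhit", "slurp")


def looks_like_bot(user_agent):
    """Guess from the User-Agent header whether a request is automated."""
    if not user_agent:
        return True  # assumption: no header at all is treated as automated
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def should_notify(user_agent):
    """Send the owner a mail only when the visitor does not look like a bot."""
    return not looks_like_bot(user_agent)
```

A bot that sends a browser-like User-Agent passes this check, so it reduces the noise rather than eliminating it.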
You could use analytics on the pages, like Piwik. They try to differentiate users from bots, so that only users show up in the statistics. I’m sure most analytics tools provide an API that would allow sending out mails for each registered visit.
On the webmaster's Q and A site, I asked the following:
https://webmasters.stackexchange.com/questions/42730/how-does-indeed-com-make-it-to-the-top-of-every-single-search-for-every-single-c
But, I would like a little more information about this from a development perspective.
If you search Google for anything job related, for example, Gastonia Jobs (City + jobs), then, in addition to their search results dominating the first page of Google, you get a URL structure back that looks like this:
indeed.com/l-Gastonia,-NC-jobs.html
I am assuming that the L stands for location in the URL structure. If you search for an industry-related job, or a job with a specific company name, you will get back something like the following (Microsoft jobs):
indeed.com/q-Microsoft-jobs.html
With just over 40,000 cities in the USA I thought, ok, maybe it's possible they looped through them and created a page for every single one. That would not be hard for a computer. But then obviously the site is dynamic as each of those pages has 10000s of results and paginated by 10. The q above obviously stands for query. The locations I can understand, but they cannot possibly have created a web page for every single query combination, could they?
OK, it gets a tad weirder. I wanted to see if they had a sitemap, so I typed "indeed.com sitemap.xml" into Google and got this result:
indeed.com/q-Sitemap-xml-jobs.html
Again, I searched for "indeed.com url structure" and, as I mentioned in the other post on Webmasters, I got back:
indeed.com/q-change-url-structure-l-Arkansas.html
Is indeed.com somehow using programming to create a web page on the fly based on my search input into Google? If not, how are they able to have a static page for millions and millions of possible query combinations, have them paginate dynamically, and then have all of those dominate Google's first page of results (although that last question may be best left to the Webmasters Q&A)?
Does the JavaScript in the page somehow interact with the URL?
It's most likely not a bunch of pages. The "actual" page might be http://indeed.com/?referrer=google&searchterm=jobs%20in%20washington. The site then cleverly exposes a human-readable URL through URL rewriting, fetches the jobs in the database that match the query, and voilà...
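To make the rewrite idea concrete, here is a guess at how paths like the ones in the question could be turned back into search parameters. This is not Indeed's actual code; the rules are inferred only from the example URLs (q- for query, l- for location, an optional -jobs suffix).

```python
def parse_pretty_path(path):
    """Translate '/q-...-jobs.html' style paths into a dict of search terms."""
    slug = path.lstrip("/")
    if not slug.endswith(".html"):
        return None
    slug = slug[:-len(".html")]
    if slug.endswith("-jobs"):
        slug = slug[:-len("-jobs")]
    params = {}
    if slug.startswith("q-"):
        # A query may be followed by a location, separated by '-l-'.
        query, sep, location = slug[2:].partition("-l-")
        params["q"] = query.replace("-", " ")
        if sep:
            params["l"] = location.replace("-", " ")
    elif slug.startswith("l-"):
        params["l"] = slug[2:].replace("-", " ")
    else:
        return None
    return params


print(parse_pretty_path("/q-Microsoft-jobs.html"))
print(parse_pretty_path("/q-change-url-structure-l-Arkansas.html"))
```

With something like this behind a single handler, any text can appear in the URL and still produce a page, which would explain results such as q-Sitemap-xml-jobs.html.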
I could be dead wrong, of course. Truth be told, the technical side of it can probably be solved in a multitude of ways. For instance, every time a job is added to the site, all the pages that should match that job might be generated, producing an enormous number of pages for Google to crawl.
This is a great question, but it remains unanswered. First, a basic Google search for
site:indeed.com
returns over 120 million results; second, a query such as "product manager new york" ranks #1 in the results. These pages are evidently pre-generated: the copy cached by the search engine (sometimes several days earlier) shows different results from a live query on the site.
Easy: the pages on Indeed, or on any other job search site, are created dynamically when Google's search bot crawls them. Here is another site, http://jobuzu.co.uk, which I run and which works much like Indeed.
PHP is your friend here. Indeed doesn't just use a standard database; look into Sphinx and Solr, as they offer full-text search with better performance than MySQL and the like.
They also make clever use of rel="canonical" and thorough internal linking:
http://www.indeed.com/find-jobs.jsp
Notice that all the pages that actually rank can be found from that direct internal link structure.
I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. Another thought is to use Google's or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get any results.
I have also thought about asking DNS servers for a list of URLs, but DNS servers can limit or throttle my requests, or even ban me altogether. My knowledge of querying DNS servers is quite limited at the moment, so I don't know whether this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
You can register to get access to the entire .com and .net zone files at Verisign
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link: url for each url in the list.
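Once the zip file is downloaded, turning it into seed URLs is a few lines of Python. The sketch assumes the archive contains one member named top-1m.csv whose rows look like rank,domain; both the member name and the row layout are assumptions about the download.

```python
import csv
import io
import zipfile


def read_top_sites(zip_bytes, member="top-1m.csv", limit=None):
    """Yield domains from a zipped CSV whose rows are assumed to be 'rank,domain'."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            for index, row in enumerate(reader):
                if limit is not None and index >= limit:
                    break
                if len(row) >= 2:
                    yield row[1]  # second column is the domain
```

Reading the downloaded file in binary mode and passing its bytes to read_top_sites(..., limit=1000) would give the first thousand domains, which can then be prefixed with http:// to seed a crawler.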
The modern terms are URI and URN; URL is the narrower, older term. I'd scan for sitemap files, which contain many addresses in one file, and study the classic texts on spiders, wanderers, brokers and bots, as well as RFC 3305 (on URI, URL and URN terminology) and RFC 3986 (Appendix B, p. 50), which defines a regex for parsing URIs.
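Pulling the addresses out of a sitemap is straightforward with the standard library. This sketch assumes the file follows the sitemaps.org 0.9 schema, where every address sits in a loc element; fetching the file is left out, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org 0.9 files (both urlset and sitemapindex).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE))
```

For a sitemap index, the returned values are the addresses of further sitemap files, which can be fed through the same function.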