Is it possible to detect uniquely-named files if you have the URL path?

Let's say someone has posted a resource at https://this-site.com/files/pdfs/some_file_name.pdf
Another resource, whose name we don't know, is then posted under the same path: https://this-site.com/files/pdfs/another_unique_resource98237219.pdf
Is it possible to detect when a new PDF is posted to this location? Or would we have to know more about the backend infrastructure? Keeping in mind that:
None of the other pieces of the URL are valid paths, in other words https://this-site.com/files/pdfs and https://this-site.com/files both return 404 errors.
The names of the files are unique and do not follow a specific pattern.
If this is not possible, what are other ways you might inspect the request/response infrastructure to look for resources posted to that URL?

My first suggestion is to look for another page that lists the resources available on the website, assuming, of course, that the site owner actually provides such a page.
The other method is effectively brute-forcing all the URLs under that path. You will need to collect some SOCKS proxies for your crawlers so the requests are distributed among multiple IP addresses; otherwise the server will probably block your IP address. If you can determine the minimum and maximum length of the file names (not the pattern, just the length), the operation can be drastically reduced.
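To make the brute-force idea concrete, here is a minimal sketch assuming the requests library with SOCKS support (pip install requests[socks]); the proxy address, character set and length bounds are made-up placeholders, not values taken from the question:

import itertools
import string
import requests

BASE = "https://this-site.com/files/pdfs/"
PROXIES = {
    "http": "socks5://127.0.0.1:1080",   # hypothetical SOCKS proxy
    "https": "socks5://127.0.0.1:1080",
}
ALPHABET = string.ascii_lowercase + string.digits + "_"

def probe(min_len=1, max_len=4):
    """Issue HEAD requests for every candidate name; yield the hits."""
    for length in range(min_len, max_len + 1):
        for chars in itertools.product(ALPHABET, repeat=length):
            url = BASE + "".join(chars) + ".pdf"
            resp = requests.head(url, proxies=PROXIES, timeout=10)
            if resp.status_code == 200:
                yield url

for hit in probe():
    print(hit)

Note the combinatorics: with 37 characters there are 37^n candidates per name length n, which is why narrowing the length bounds matters so much.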

What are URL codes called?

I came across a website with a blog post explaining how to clear the cache for web development purposes. My personal favourite is appending /? to the end of a web address in the URL bar.
Are there any more little codes like that? If so, what are they, and where can I find a cheat sheet?
Appending /? may work for some URLs, but not for all.
It works if the server/site is configured in a way that, for example, http://example.com/foo and http://example.com/foo/? deliver the same document. But this is not the case for all servers/sites, and the defaults can be changed anyway.
There is no name for this. You just manipulate the canonical URL, hoping to craft a URL that points to the same document, without getting redirected.
Other common variants?
I’d expect appending ? would work even more often than /? (both, of course, only work if the URL has no query component already).
http://example.com/foo
http://example.com/foo?
You’ll also find sites that allow any number of additional slashes where only one slash used to be.
http://example.com/foo/bar
http://example.com/foo////bar
Not sure if it affects the cache, but specifying the domain as a fully-qualified domain name (FQDN), by adding a dot after the TLD, works for many sites, too.
http://example.com/foo
http://example.com./foo
Some sites might not have case-sensitive paths.
http://example.com/foo
http://example.com/fOo
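For illustration, here is a small sketch that generates several of these variants for a given URL; whether any of them actually serves the same document depends entirely on the server:

from urllib.parse import urlsplit, urlunsplit

def variants(url):
    """Generate common same-document URL variants: trailing ?, /?, FQDN dot, extra slashes."""
    scheme, netloc, path, query, frag = urlsplit(url)
    out = [url]
    if not query:
        out.append(url + "?")
        out.append(url.rstrip("/") + "/?")
    host = netloc.split(":")[0]
    if host and not host.endswith("."):
        # fully-qualified form: trailing dot after the TLD
        out.append(urlunsplit((scheme, netloc.replace(host, host + ".", 1), path, query, frag)))
    if path.count("/") > 1:
        # extra slashes between path segments
        doubled = "/" + "//".join(p for p in path.split("/") if p)
        out.append(urlunsplit((scheme, netloc, doubled, query, frag)))
    return out

print(variants("http://example.com/foo/bar"))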

Will using multiple URL redirects hurt my URL?

Currently I'm using this link forwarding structure:
bit.ly/{some_hash} > example.com/s/{ID} > example.com/blog/full-seo-optimized-url/
Because the ID of the blog post never changes but the URL might (e.g. a spelling mistake gets fixed), I'm forwarding my bit.ly short URLs to a special subpage (/s/{ID}) that looks up the full URL in the database and forwards the visitor to the blog entry.
Pros:
If the URL changes, the bit.ly short link will still work and forward to the updated URL
(Possible) Cons:
Might be seen as spammy method (hiding target link)?
Might be violating any terms of service?
... ?
Therefore I would like to know whether this is a good and proper approach, or whether I should remove the step in the middle.
Those redirects make for a slower user experience, and each redirect loses some of the PageRank being sent to the destination.
I'd avoid doing it where possible.
There are URL shorteners out there that let you directly edit the destination which would avoid your need for the middle redirect.
You also want to avoid changing the destination URL, as other people will not use your fancy redirects and you will lose PageRank every time it changes.
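If you do keep the middle step, it can be as simple as a 301 redirect keyed on the ID. A minimal sketch using Flask (the framework choice and the dictionary standing in for the database are assumptions for illustration, not what the question's site actually runs):

from flask import Flask, abort, redirect

app = Flask(__name__)

# stand-in for the database table mapping blog IDs to their current URLs
URLS = {42: "https://example.com/blog/full-seo-optimized-url/"}

@app.route("/s/<int:post_id>")
def short_link(post_id):
    target = URLS.get(post_id)
    if target is None:
        abort(404)
    # 301 so clients and crawlers treat the destination as permanent
    return redirect(target, code=301)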

Heuristics for lightweight ways to tell if two HTML pages are the same "page"

I know similar questions have already been asked, but I want to know whether there is some code/package, or some ideas, on how to tell whether two URLs are the same page.
For motivation, assume that what I want to do is write a Chrome extension that tells you how many of your Facebook friends visited a link.
Of course simply comparing URLs won't work, as some URL parameters might be critical while others are not; e.g. google.com?query=help is not the same page as google.com?query=idea because the query parameter is critical, while google.com?referrer=facebook is the same as google.com?referrer=twitter (I am, of course, making up these examples).
Also, comparing the pages' content is not guaranteed to work, since there may be randomized parts ("related stories") or user-specific content (a headline like "Hi Noam, we haven't seen you in a while").
Of course, I am not looking for a foolproof method, just something that works on most normal-behaving sites.
Any good recommendations of packages (any language) or ideas on how to do this?
Any standard distance metric on string comparisons will give you a score for various URLs' contents. Presumably, more similar contents will score better than less similar ones, so rank the results and compare.
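For example, difflib from the Python standard library gives such a score directly (a sketch only; fetching, normalization and the threshold are left to you, and the example URLs are placeholders):

from difflib import SequenceMatcher
from urllib.request import urlopen

def content_similarity(url_a, url_b):
    """Return a 0..1 ratio of how similar the two responses' bodies are."""
    with urlopen(url_a) as a, urlopen(url_b) as b:
        html_a = a.read().decode("utf-8", "replace")
        html_b = b.read().decode("utf-8", "replace")
    return SequenceMatcher(None, html_a, html_b).ratio()

# e.g. treat anything above roughly 0.9 as "probably the same page"
print(content_similarity("https://example.com/a", "https://example.com/b"))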
There is no way to make sure that two pages are the same. There may be user-specific content (login buttons for some users, a personal greeting for others), advertising, or browser-specific content (CSS3 for Chrome, CSS2 for Opera, a drive-by-download exploit for IE6 users :)).
The same resource may be available under different URLs (/article/4-funny-ways-to-encrypt-your-shellcode-123456 or /article.php?id=123456). There might be two domains for the same content (www.domain.com and domain.com, maybe even domain.co.uk). You could get some clues from the Last-Modified header, which might contain the file modification date, though for dynamic content it can also contain the generation date. There may be an ETag header that contains a hash of the underlying resource, at least in Ruby on Rails when correctly implemented, which is not often the case.
So the only thing left you can really do is compare the pages and calculate some metrics. I would consider the domain, IP address and page content for the comparison, with a higher weight on IP address and domain (or parts of the domain). That way you can calculate certain probabilities, but there is no way to be sure two pages are the same.
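A sketch of that weighted-metric idea; the weights are arbitrary assumptions, and the content score reuses the string-ratio approach from the previous answer:

import socket
from difflib import SequenceMatcher
from urllib.parse import urlsplit

def same_page_score(url_a, html_a, url_b, html_b):
    """Combine domain, IP and content similarity into a single 0..1 score."""
    host_a, host_b = urlsplit(url_a).hostname, urlsplit(url_b).hostname
    domain_score = 1.0 if host_a == host_b else 0.0
    try:
        ip_score = 1.0 if socket.gethostbyname(host_a) == socket.gethostbyname(host_b) else 0.0
    except socket.gaierror:
        ip_score = 0.0
    content_score = SequenceMatcher(None, html_a, html_b).ratio()
    # heavier weight on domain and IP, as suggested above
    return 0.35 * domain_score + 0.35 * ip_score + 0.30 * content_score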

Hide website filenames in URL

I would like to hide the page name in the URL and display only the domain name or parts of the path.
For example:
I have a website called "MyWebSite". The URL is localhost:8080/mywebsite/welcome.xhtml, and I would like to display only "localhost:8080/mywebsite/".
However if the page is at, for example, localhost:8080/mywebsite/restricted/restricted.xhtml then I would like to display localhost:8080/mywebsite/restricted/.
I believe this can be done in the web.xml file.
I believe that you want URL rewriting. Check out this link: http://en.wikipedia.org/wiki/Rewrite_engine - there are many approaches to URL rewriting, and you need to decide which one is appropriate for you. Some of the approaches do make use of the web.config file.
You can do this in several ways. The one I see most is to have a "front door" called a rewrite engine that parses the URL dynamically to internally redirect the request, without exposing details about how that might happen as you would see if you used simple query strings, etc. This allows the URL you specify to be digested into a request for a master page with specific content, instead of just looking up a physical page at that location to serve.
The StackExchange sites do this so that you can link to a question in a semi-permanent fashion (and thus can use search engines with crawlers that log these URLs) without them having to have a real page in the file system for every question that's ever been asked (we're up to 9,387,788 questions as of this one).
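As a language-agnostic illustration of that "front door" idea (a toy Python WSGI app, not the web.xml configuration the question asks about): a single entry point inspects the path and decides what to render, so no physical file has to exist at that location.

from wsgiref.simple_server import make_server

def application(environ, start_response):
    """Toy rewrite front door: every request hits this single entry point."""
    path = environ.get("PATH_INFO", "/")
    if path.startswith("/mywebsite/restricted/"):
        body = b"restricted area"      # would internally render restricted.xhtml
    elif path.startswith("/mywebsite/"):
        body = b"welcome page"         # would internally render welcome.xhtml
    else:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

make_server("localhost", 8080, application).serve_forever()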

What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. The other thought would be to use Google's or Yahoo's Search API, but that's not really a great solution as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers for a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of querying DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
You can register to get access to the entire .com and .net zone files at Verisign.
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
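For instance, the downloaded top-1m.csv.zip can be turned into a URL list in a few lines (a sketch assuming the file sits in the current directory and uses the rank,domain CSV layout):

import csv
import io
import zipfile

# top-1m.csv.zip comes from the wget command shown above
with zipfile.ZipFile("top-1m.csv.zip") as zf:
    with zf.open("top-1m.csv") as fh:
        reader = csv.reader(io.TextIOWrapper(fh, encoding="utf-8"))
        urls = ["http://" + domain for _rank, domain in reader]

print(len(urls), urls[:5])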
The modern terms are URI and URN; URL is the narrower, older term. I'd scan for sitemap files, which list many addresses in a single file, study the classic texts on spiders, wanderers, brokers and bots, and see RFC 3986 (Appendix B, p. 50), which defines a regex for parsing URIs.
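Scanning a sitemap is also only a few lines; a sketch assuming the conventional /sitemap.xml location and the standard sitemap namespace (the example site is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(site):
    """Fetch the site's /sitemap.xml and return the <loc> entries it lists."""
    with urllib.request.urlopen(site.rstrip("/") + "/sitemap.xml") as resp:
        tree = ET.parse(resp)
    return [loc.text for loc in tree.findall(".//sm:loc", SITEMAP_NS)]

print(sitemap_urls("https://example.com"))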
