How to get subpages of a URL without knowing them?

I'd like to know all subpages of a certain URL. For example, I have the URL example.com. There might exist the subpages example.com/home, example.com/help, and so on. Is it possible to get all such subpages without knowing their exact names?
I thought I could handle this problem with a web crawler, but it only crawls pages that are linked from the page itself.
I hope you understand my problem and can help me with it.
Thank you!

To answer your question: yes. Scrapy "crawl" spiders work by setting rules, which can be configured to do exactly what you're trying to do. When in doubt, always go to the docs!
Couple things to note:
You can create a crawl spider the same way you create a generic spider:
scrapy genspider -t crawl nameOfSpider website.com
With a crawl spider, you then have to set rules to basically tell Scrapy where and where not to go; how's your regex?!
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']  # PART 1: domain restriction
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(allow=(r'.*',)), callback='parse_item'),  # PART 2: callback
    )

    def parse_item(self, response):
        # do something with each page the spider visits
        yield {'url': response.url}
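Once the rules are in place, you run it like any other spider (the output file name here is just an example):

scrapy crawl example.com -o pages.json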
Now, I copied and pasted this from the official docs and changed it up to look like what you need, but I haven't checked the code, so yeah... the logic is there though.
This works by getting ALL the links that it can see, depending on the rules you set, and doing something with each link.
You want to restrict all other domains except the one you're scraping.
In the example I set the wildcard to literally accept every and any page in the domain... once you figure out the structure of a website, you can use logic to build out what you need, as in the sketch below.
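For instance, a more restrictive rule set might look like this (the path patterns are hypothetical, purely to show the idea):

rules = (
    # only follow the /help/ and /home/ sections, never /login
    Rule(LinkExtractor(allow=(r'/help/', r'/home/'), deny=(r'/login',)),
         callback='parse_item'),
)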
You should take a look at the docs more often though. I have been using scrapy for about 6-7 years and I still find myself going back to the man pages!

No, you can’t.
The way you describe the situation, the website intends those URLs to be secret.
Any way to find such URLs would be a security exploit that should be reported to the website owners right away so they can fix it.

Related

What are URL codes called?

I came across a website with a blog post teaching how to clear the cache for web development purposes. My personal favourite is to append /? to the end of a web address in the URL bar.
Are there any more little codes like that? If so, what are they, and where can I find a cheat sheet?
Appending /? may work for some URLs, but not for all.
It works if the server/site is configured in a way that, for example, http://example.com/foo and http://example.com/foo/? deliver the same document. But this is not the case for all servers/sites, and the defaults can be changed anyway.
There is no name for this. You just manipulate the canonical URL, hoping to craft a URL that points to the same document, without getting redirected.
Other common variants?
I’d expect appending ? would work even more often than /? (both, of course, only work if the URL has no query component already).
http://example.com/foo
http://example.com/foo?
You’ll also find sites that allow any number of additional slashes where only one slash used to be.
http://example.com/foo/bar
http://example.com/foo////bar
Not sure if it affects the cache, but specifying the domain as an FQDN, by adding a dot after the TLD, would work for many sites, too.
http://example.com/foo
http://example.com./foo
Some sites might not have case-sensitive paths.
http://example.com/foo
http://example.com/fOo
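A toy Python sketch that generates the variants above (purely illustrative; it assumes the URL has no query component already):

def cache_bust_variants(url):
    """Return equivalent-looking URLs that may bypass a cached copy."""
    variants = [url + '?', url + '/?']  # empty query string variants
    scheme, rest = url.split('://', 1)
    host, _, path = rest.partition('/')
    variants.append(scheme + '://' + host + './' + path)  # trailing-dot FQDN
    return variants

print(cache_bust_variants('http://example.com/foo'))
# ['http://example.com/foo?', 'http://example.com/foo/?', 'http://example.com./foo']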

How would I redirect URLs in this way?

So, this is a little bit more of a high-level question. I'm not necessarily looking for specifics, but more for the general tools and technologies I need to use. I'm really new to website hosting and development.
I want to redirect a domain, say something.com, to something.squarespace.com. How would I go about doing this so that the following occurs:
The address bar never shows the URL something.squarespace.com.
When a user clicks a link on the site that goes to a local page on something.squarespace.com (so, say, something.squarespace.com/page1), the address bar says something.com/page1.
something.com currently points to a shared-hosting Apache web server. I would like to maintain access to the files and email on that server. If I can't get to the files, that's fine, but the email is crucial.
I know this is a lot to ask - but if anyone can help me out with some advice on this I'll be very thankful!
Thanks.
I think you should take a look at the concept of URL rewriting; that's how you can achieve what you're asking for in points 1 and 2 (see the sketch below). As for point 3, I couldn't understand what you mean exactly.
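Since something.com already points at an Apache server, one way to get this behaviour is a server-side proxy rule. A hedged sketch, assuming mod_rewrite and mod_proxy are enabled (many shared hosts restrict these); the [P] flag proxies the request on the server, so the address bar keeps showing something.com:

RewriteEngine On
# fetch every path from Squarespace server-side; the browser never sees a redirect
RewriteRule ^/?(.*)$ https://something.squarespace.com/$1 [P,L]

Email is unaffected, because only HTTP traffic is proxied; mail delivery is governed by the domain's MX records. Note that the proxied site may still emit absolute links to something.squarespace.com, which you would have to rewrite as well.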

How do I block my Rails app from being hit by bots?

I'm not even sure I'm using the right terminology, or whether these are actually bots. I didn't want to use the word 'spam' because it's not as if I have comments or posts that are being created/spammed. It looks more like something is making the same repeated request to my domain, which is what made me think it was some kind of bot.
I've opened my first Rails app to the 'public', which is really a small group of users, <50 currently. That was last Friday. I started having performance issues today, so I looked at the log, and I see tons of these RoutingErrors:
ActionController::RoutingError (No route matches "/portalApp/APF/pages/business/util/whichServer.jsp" with {:method=>:get}):
They are filling up the log, and I'm assuming this is causing the slowdown. Note the .jsp at the end; this is a Rails app, so I've got no URLs remotely like this in my app. I mean, I don't even have /portalApp, so I don't know where this is coming from.
This is hosted at DreamHost, and I chatted with one of their support people, who suggested a couple of sites that detail using .htaccess to block things. But it looks like you need to know the IP or domain that the requests are coming from, which I don't.
How can I block this? How can I find the IP or domain from the request? Any other suggestions?
Follow up info:
After looking at the access logs, it looks like it's not a bot. Maybe I'm not reading the logs right, but there are valid URL requests (generated from within my Flex app) coming from the same IP. So now I'm wondering if it's some kind of plugin generating the requests, but I really don't know. Now I'm wondering if it's possible to block a certain URL request based on a pattern, but I suppose that's a separate question.
Old question, but for people who are still looking for alternatives, I suggest checking out Kickstarter's rack-attack gem. It allows not only blacklisting and whitelisting, but also throttling.
This page seems to offer some good advice:
Here
The section on blocking by user agent may be something you could look at implementing. Is there any way you can get the user agent of the bot from your logs? If so, look for the unique part of the user agent that identifies the bot and add the following to .htaccess, replacing the relevant bits:
# set the env var bad_bot when the User-Agent matches "SpammerRobot"
BrowserMatchNoCase SpammerRobot bad_bot
# then refuse any request that carries that env var
Order Deny,Allow
Deny from env=bad_bot
It's explained at that link in more detail and, of course, if you can't get the user agent from your logs, then this will be of no use to you!
You can also update your public/robots.txt file to allow/disallow robots.
http://www.robotstxt.org/wc/robots.html
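For example, a minimal public/robots.txt that asks all compliant crawlers to stay away entirely (note this only deters well-behaved bots; it won't stop probe requests like those in your logs):

User-agent: *
Disallow: /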

How can I display short URLs without a file extension?

I've looked around but wasn't able to find what I was looking for. I'm looking for a way to automatically create short URLs displayed in the browser, without using a URL shortener. Basically, I would like to re-create something like this:
idzr.org/1ptb
I upload screenshots to my server with "GrabUp" on a regular basis, but it creates rather long URLs, for example:
/2523e3c90d60f08e952215424e7c5d99.png
It's a bit annoying having to shorten them each time.
I have seen this method a lot lately with pretty much any file type, including HTML files. If this has been discussed already, I'm sorry for posting it again. I just seem to be stuck.
Thanks in advance for any help & advice!
I don't know what web server you use, but in general you need two things:
You write a rewrite rule so the short path maps to the real file
-- .htaccess for Apache, or the equivalent for IIS
You push the content to the user through your code, because without a file extension the browser doesn't know what kind of content it is getting from the web server
-- use the HTTP Content-Type header (MIME type)
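A hedged .htaccess sketch for Apache (assumes mod_rewrite is enabled and the uploads keep their long hashed names): requests without an extension are mapped back to the real .png, so you can hand out the short form.

RewriteEngine On
# if the requested path is not an existing file...
RewriteCond %{REQUEST_FILENAME} !-f
# ...serve hex-named screenshots with their .png extension added back
RewriteRule ^([0-9a-f]+)$ $1.png [L]

With this in place, /2523e3c90d60f08e952215424e7c5d99 serves the .png above; truly short codes like idzr.org/1ptb additionally need a lookup (a script or database) that maps each code to the stored file name.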

Why would I put ?src= in a link?

I feel dumb for not knowing this, but I see a lot of links in web pages and instead of this:
<a href="http://foo.com/">
...they use this:
<a href="http://foo.com/?src=bar.com">
Now I understand that the ?src= is telling the destination site that this referral is coming from bar.com, but I don't understand why this needs to be called out explicitly. Can anyone shed some light on it for me? Is this something I need to include in my program-generated links?
EDIT: OK, sorry, I'm not being clear enough. I understand the GET syntax, with a question mark and parameters separated by ampersands. I'm wondering what this special src parameter is. Why would one site link to another and tack an src parameter on the end, even though there's no indication that the destination site uses it normally?
For example, on this page hover your mouse over the screenshot. The link URL is http://moms4mom.com/?src=stackexchangesites
But moms4mom.com is our site. Passing the src parameter does nothing, so why include it?
There are a few reasons the src parameter is used explicitly. In general, it is easier and more reliable to trust a query string to determine the referrer than to trust the Referer header, since the latter is often broken, deliberately or not. Browsers, on the other hand, almost never mangle the query string in a URL, since, unlike the Referer header, it is usually essential for pages to function. Besides, the Referer header is often sent without any deliberate action on the part of the linking site, which some users dislike.
The reason (I do it) is that popular analytics tools sometimes make it easier to filter on query strings than on referrers.
There is no standard for the src parameter. Each site has its own, and it's usually up to the site that receives the link to define how it wants to read it (as it's usually that site that's going to pay for the click).
The second one is a dynamic link: a URL that a server-side language (like ASP or PHP) interprets as an instruction, as with those Google URLs. But I have never used this site (foo.com), so I don't know much about this parameter.
Depending on how the site processes its URL, you may or may not need to include the ?... information.
This is passed to the website, and the server can process it just like form input. Some sites require this - and build their navigation off a single page, using nothing but the "extra" stuff passed afterwards. If you're generating a link to a site like that, it will be required.
In other cases, this is just used to pass extra, unrequired info (such as advertising, tracking info, etc)... In those cases, you can leave it off.
Unfortunately, there's no way to know without trying whether you can remove the "extra" bits from the URL.
After reading some of your comments - I'll also say:
There is nothing special about the "src" field in a query string. The server is free to use it any way it wishes. Unless you know specific info about the server, you cannot assume it can be left out.
The part after the ? is the query string. Different sites use it for different things; it is usually used for passing information to the server-side code for that URL, but it can also be used in JavaScript.
For more info see Query String
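As a minimal illustration with Python's standard library (values taken from the examples above):

from urllib.parse import urlencode, urlparse, parse_qs

# reading the parameter on the receiving side
url = 'http://moms4mom.com/?src=stackexchangesites'
print(parse_qs(urlparse(url).query)['src'])  # ['stackexchangesites']

# tagging an outbound link the same way
print('http://foo.com/?' + urlencode({'src': 'bar.com'}))  # http://foo.com/?src=bar.com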
