Find all the web pages in a domain and its subdomains - url

I am looking for a way to find all the web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in that domain and in all of its subdomains (e.g., cs.uoregon.edu).
I have been looking at Nutch, and I think it can do the job. But it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans a web page for URLs that belong to the same domain. Furthermore, it seems that Nutch saves the linkdb in a serialized format. How can I read it? I tried Solr, and it can read Nutch's collected data. But I don't think I need Solr, since I am not performing any searches. All I need are the URLs that belong to a given domain.
Thanks

If you're familiar with Ruby, consider using Anemone. It's a wonderful crawling framework. Here is sample code that works out of the box:
require 'anemone'

urls = []
Anemone.crawl(site_url) do |anemone|
  anemone.on_every_page do |page|
    urls << page.url
  end
end
https://github.com/chriskite/anemone
Disclaimer: You need to use a patch from the issues to crawl subdomains and you might want to consider adding a maximum page count.
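As a rough sketch of how the crawl results could answer the original question (every URL plus the list of subdomains), assuming the patched Anemone and its stock :depth_limit option -- the start URL and the limit are just placeholders:

require 'anemone'

urls = []
# :depth_limit is a stock Anemone option; it bounds the crawl, since stock
# Anemone has no hard page-count limit.
Anemone.crawl("http://www.uoregon.edu/", :depth_limit => 5) do |anemone|
  anemone.on_every_page do |page|
    urls << page.url              # page.url is a URI object
  end
end

# Distinct hosts seen during the crawl; with the subdomain patch applied this
# is the list of subdomains (e.g. cs.uoregon.edu).
puts urls.map { |u| u.host }.uniq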

The easiest way to find all subdomains of a given domain is to ask the DNS administrators of the site in question to provide you with a DNS zone transfer or their zone files; if there are any wildcard DNS entries in the zone, you'll also have to get the configurations (and potentially code) of the servers that respond to requests on the wildcard DNS entries. Don't forget that portions of the domain name space might be handled by other DNS servers -- you'll have to get data from them all.
This is especially complicated because HTTP servers might have different handling for requests to different names baked into their server configuration files or into the application code running on the servers, and that application code may even perform database lookups to determine what to do with a given name. FTP does not provide for name-based virtual hosting, and whatever other services you're interested in may or may not provide name-based virtual hosting protocols.
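If the administrators do grant you a transfer, you can pull the zone programmatically; here is a minimal Ruby sketch using the dnsruby gem (ns1.uoregon.edu is an assumed name server, and the AXFR will only succeed if it is permitted for your address):

require 'dnsruby'

# Ask the zone's authoritative name server for a full zone transfer (AXFR).
zt = Dnsruby::ZoneTransfer.new
zt.transfer_type = Dnsruby::Types.AXFR
zt.server = 'ns1.uoregon.edu'          # assumed name server; use the domain's real NS records

records = zt.transfer('uoregon.edu')   # nil or an error if the transfer is refused
(records || []).each { |rr| puts rr.name }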

Related

I am confused about some URLs that I see on the Internet

Please tell me the difference between "someSite.com/something" and "something.someSite.com". Are they equivalent? As an amateur programmer, I know how to do the former; I think I may need to learn network administration to be able to do the latter.
It's usually referred to as a subdomain. This is the oversimplified version:
You have a DNS server that converts the domain to an IP. That DNS server also handles subdomains. Usually www is synonymous with the base domain itself. You can have more subdomains as well, like sub.domain.something.someSite.com/something
You can make them resolve to the same or different IPs, depending on their purpose.
Even if they resolve to the same IP, the web server at that IP receives a request that includes the original domain name, so on that same IP the server can give different responses for each domain. This is usually the case with small hosting packages, as they can have thousands of domains on a single IP, all serving different websites for different clients.
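To make the "same IP, different responses per name" point concrete, here is a tiny sketch of name-based virtual hosting in plain Rack (the host names and response bodies are made up for illustration):

# config.ru -- run with `rackup`
require 'rack'

# One server process looks at the Host header of each request and answers
# differently, even though every name resolves to the same IP.
app = lambda do |env|
  host = Rack::Request.new(env).host.to_s.downcase
  body = case host
         when 'somesite.com', 'www.somesite.com' then 'the main site'
         when 'something.somesite.com'           then 'the subdomain site'
         else                                         'unknown host'
         end
  [200, { 'Content-Type' => 'text/plain' }, [body]]
end

run app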
From a technical point of view, someSite.com/something is a path (a file) on the server, while something.someSite.com is a subdomain, which could point to a completely different web server.
In most cases, the two variants do give you identical content, because both of them are mapped server-side to the same page.

Identifying unique browsers in Rails

I am building a multiplayer site in Rails, and I need a reliable way of blocking computers from using the site. For example, someone has been banned from the site multiple times, and there should be a way to block their computer from using the site.
I would find a way to do this with IP addresses, but they're not always static, are they? I am also asking about identifying browsers so that, if there were unauthorised access from a different browser or machine, I could implement a security feature to make sure the account belonged to them. I would also need to store these locations as trusted locations in a database, hashed.

How to redirect users to a subdomain based on their current location?

I am using Rails 3.1.1 and I would like to redirect users, for example, from the U.S.A. to the proper subdomain us.site.com (which is hosted on the same server as site.com). I know that I can locate a user by his/her IP address, but how can I then redirect him/her to the proper subdomain? Is there a technique/gem to geo-locate user IPs and then handle the redirection?
P.S.: Maybe, for performance reasons, I should use middleware...
https://rubygems.org/gems/rack-geoipcity is a rack middleware gem I've published which you could use, or just use the GeoIP gem in your controllers.
With rack-geoipcity, you would query the X-headers it adds and make a decision based on those. Something like:
if headers['X_GEOIP_COUNTRY_CODE'] == "IN"
  redirect "/india"
end
though I don't currently use Rails, so it might be slightly different.
There are plenty of other gems to choose from if you don't fancy using the MaxMind db.
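For the original us.site.com case, here is a minimal sketch of driving the redirect from a Rails 3.1 before_filter -- the header key is the one rack-geoipcity is described as adding above, and the subdomain/host are illustrative:

class ApplicationController < ActionController::Base
  before_filter :redirect_us_visitors

  private

  def redirect_us_visitors
    return if request.subdomain.present?   # already on us.site.com (or another subdomain)
    # Header key assumed from the rack-geoipcity answer above; inspect request.env in your app.
    if request.env['X_GEOIP_COUNTRY_CODE'] == 'US'
      redirect_to "#{request.protocol}us.site.com#{request.fullpath}"
    end
  end
end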
One approach that I've used happily in the past is to perform the geolocation lookup via DNS before users even connect to the service; that way they automatically connect to the server nearest them, you get cheap and easy load balancing plus the ability to take servers out of active use as needed, and individual sites can go down without affecting the others.
OFTC uses a self-written oftcdns tool to point users at their nearest servers. During the time I was an administrator on the OFTC network, this tool was a drastic improvement over running a simpler BIND-based DNS server, which provided no geo-location features and made bringing servers in and out of rotation more complicated.
Wikipedia uses PowerDNS with a geobackend to provide their geo-IP services. PowerDNS is definitely a well-tested, high-demand-capable tool.

Is there a way I can block access to my website by country?

I would like to block access to my web site from some countries. The reason is that I suspect they are stealing the information from the site and copying it for their own sites.
Is there a way I can block access from certain countries, or even better, redirect users accessing my site from these countries to a very plain web page that makes it look like the site is under construction?
Note that my site uses MVC3. I am looking for a .NET solution, or some IIS solution if that's not possible.
You could set up the free GeoLite Country database on your server, check each visitor's IP address (the HTTP remote address) against it, and then decide what to do.
Another way would be to reverse-lookup the IP addresses, but then again, which country is a visitor with a hostname ending in .net from?
Finally, be aware that there are free proxy servers out there, so if someone really wants to fake his "country", he easily can.
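The question asks for a .NET/IIS solution, but to illustrate the GeoLite lookup itself, here is a sketch using the Ruby geoip gem (the database path and blocked-country list are placeholders; the same check translates to any stack):

require 'geoip'   # gem install geoip; also download the free GeoLite Country .dat file

BLOCKED_COUNTRIES = %w[XX YY]              # placeholder ISO 3166 two-letter codes
geo = GeoIP.new('/path/to/GeoIP.dat')      # path to the GeoLite Country database

def blocked?(geo, remote_ip)
  country = geo.country(remote_ip)         # look the visitor's IP up in the database
  BLOCKED_COUNTRIES.include?(country.country_code2)
end

# In a request handler: send the visitor to the "under construction" page
# when blocked?(geo, request_ip) is true.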

How to serve multiple sites with nginx/passenger?

I have different websites/applications built with Rails, which have different domain names. I want to serve them from one server with Nginx/Passenger. I have tried some techniques, but I cannot make them work; basically, I have very little information about this.
So, I can serve the different websites/applications on different ports. But how can I make people see application "AAA" if they are coming from aaa.com and application "BBB" if they are coming from bbb.com?
Phusion Passenger's documentation has a passage on this here, section 3.2: http://www.modrails.com/documentation/Users%20guide%20Nginx.html
Basically, you can set up virtual hosts that point to different applications on the same web server/app server pair.
You can also do rewrites or forwarding purely through nginx configuration, if the above doesn't work.
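As a rough sketch of what those virtual hosts look like in the Nginx configuration (the application paths are made up; passenger_enabled is what hands each host's requests to its own Rails app):

# inside the http { } block of nginx.conf -- paths are illustrative
server {
  listen 80;
  server_name aaa.com www.aaa.com;
  root /var/www/aaa/public;        # the AAA Rails app's public/ directory
  passenger_enabled on;
}

server {
  listen 80;
  server_name bbb.com www.bbb.com;
  root /var/www/bbb/public;        # the BBB Rails app's public/ directory
  passenger_enabled on;
}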
