I have seen on various websites that forum posts and the like all have different URLs that make it look like they are in different directories, but I am sure they cannot make a different directory for each post.
If you look at this website: https://oc.tc/forums/topics/5181a374ba6087261f000c59
The number at the end (5181a374ba6087261f000c59) changes for each post, and it looks like each one is a different directory, but I am sure it is not!
Could you please explain how they do this?
Thanks in advance!
Rob
Use an Apache .htaccess file to handle the rewrite to your PHP script.
What they're doing there is treating the 518.... value as a parameter. Their site interprets the request as http://oc.tc/forums/topics/{post}. It would be the same as doing something like http://oc.tc/forums/topics?post=5181a374ba6087261f000c59 (an example to show the idea).
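For instance, here's a rough sketch in Python (just to illustrate the idea; I don't know that site's actual stack) showing that both URL styles carry the same post id, the server just reads it from a different place:

from urllib.parse import urlparse, parse_qs

# Both styles carry the same post id; only where it sits differs.
path_style  = "http://oc.tc/forums/topics/5181a374ba6087261f000c59"
query_style = "http://oc.tc/forums/topics?post=5181a374ba6087261f000c59"

post_from_path  = urlparse(path_style).path.rsplit("/", 1)[-1]   # last path segment
post_from_query = parse_qs(urlparse(query_style).query)["post"][0]
assert post_from_path == post_from_query  # same id either way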
Those sites use a technique called URL rewriting, done in Apache with mod_rewrite.
What they do is convert URL requests like this:
http://site.com/products/categoryA/myawesomeproduct
To something internal like:
http://site.com/?query=products/categoryA/myawesomeproduct
Then they process the rest in PHP. You can learn how to do it, with examples, at the following link: http://roshanbh.com.np/2008/03/url-rewriting-examples-htaccess.html
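As a rough illustration (a Python sketch of what the PHP side does; the parameter name and URL are just the example above), the backend simply unpacks that one query parameter back into segments:

from urllib.parse import urlparse, parse_qs

# The rewritten request arrives with the original path packed into a
# single 'query' parameter; the script splits it apart again.
internal = "http://site.com/?query=products/categoryA/myawesomeproduct"
raw = parse_qs(urlparse(internal).query)["query"][0]
section, category, product = raw.split("/")
print(section, category, product)  # products categoryA myawesomeproduct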
Edit:
A full guide to redirects here: http://httpd.apache.org/docs/2.0/misc/rewriteguide.html
I'd like to know all the subpages of a certain URL. E.g., I have the URL example.com. There might exist the subpages example.com/home, example.com/help, and so on. Is it possible to get all such subpages without knowing their exact names?
I thought I could handle this problem with a web crawler, but it only crawls pages that are linked from the pages themselves.
I hope you understand my problem and can help me with it.
Thank you!
To answer your question: yes. Scrapy "crawl" spiders work by setting rules, and those rules can be set up to do exactly what you're trying to do. When in doubt, always go to the docs!
Couple things to note:
You can create a crawl spider the same way you create the generic spider:
scrapy genspider -t crawl nameOfSpider website.com
With a crawl spider, you then have to set rules to basically tell Scrapy where and where not to go; how's your regex?!
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):  # run with: scrapy crawl example.com
    name = 'example.com'
    allowed_domains = ['example.com']  # PART 1: Domain Restriction
    start_urls = ['http://www.example.com']
    rules = (
        # PART 2: Call Back -- follow every link, hand each page to parse_item
        Rule(LinkExtractor(allow=('.*',)), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}  # record every URL the crawl finds
Now, I copied and pasted this from the official docs and changed it up to fit your case, but I haven't run the code, so yeah... the logic is there though.
This works by getting ALL the links it can see, depending on the rules you set, and then doing something with each link.
You want to restrict all other domains but the one you're scraping.
In the example I set the wildcard to literally accept every page in the domain... once you figure out the structure of a website, you can use logic to build out what you need.
You should take a look at the docs more often though. I have been using Scrapy for about 6-7 years and I still find myself going back to the man pages!
No, you can’t.
The way you describe the situation, the website intends those desired URLs to be secret.
Any way to find such URLs would be a security exploit that should be reported to the website owners right away so they can fix it.
I use a website that has a URL like....
https://wwws.something.com/overview.event
I have never seen a period used in a URL like this before.
I cannot find anything on Google or Stack Overflow where anyone describes this.
What does it mean? How is it used?
To clarify, it is the "overview.event" part that I am confused about.
The days when a URL was a path to a file on the server are gone. Now HTTP servers use rewriting (like mod_rewrite in Apache) to map URLs to files with the proper parameters.
Old PHP sites had URLs like www.myblog.com/page.php?page=1, where page.php was an actual file and ?page=1 was a GET argument used by the PHP interpreter.
Some people decided that pages look nicer and are more readable if we do something like www.myblog.com/page/1, but there is no problem doing www.myblog.com/page.1 either.
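As a sketch (Python here just for illustration; the pattern and names are hypothetical), the rewrite layer only has to map either pretty form back to the old query-string form:

import re

LEGACY = "/page.php?page={num}"  # the old-style URL being hidden

def rewrite(path: str) -> str:
    # Accept either /page/1 or /page.1 and map both to the legacy URL.
    m = re.fullmatch(r"/page[/.](\d+)", path)
    return LEGACY.format(num=m.group(1)) if m else path

print(rewrite("/page/1"))  # -> /page.php?page=1
print(rewrite("/page.1"))  # -> /page.php?page=1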
The URL just means whatever we want it to mean!
See the information on Wikipedia: http://en.wikipedia.org/wiki/Uniform_resource_locator
You can have a URL like http://www.example.com/image.jpg and serve a GIF image, a simple page, or a video...
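A quick Python sketch of that point (a hypothetical server, runnable with the standard library): the extension in the URL doesn't have to match what's actually served; the Content-Type header decides.

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/image.jpg":
            self.send_response(200)
            self.send_header("Content-Type", "text/html")  # not a JPEG!
            self.end_headers()
            self.wfile.write(b"<h1>Surprise: a web page</h1>")
        else:
            self.send_error(404)

# Visiting http://localhost:8000/image.jpg returns an HTML page from a
# .jpg-looking URL.
HTTPServer(("localhost", 8000), Handler).serve_forever()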
I was visiting the site asos.com the other day. If you search 'tshirt' on their site, the resulting URL is 'http://www.asos.com/search/tshirt?q=tshirt'. Does anyone know which technique they use to make it seem like they generate a page called 'tshirt' on the fly, one that basically accepts any term?
Also, if you select a product the URL becomes something like 'http://www.asos.com/ralph_lauren/polo/product.aspx'. I know they don't have a file and folder for every brand and item, so how is it possible for the browser to follow this URL?
I'm not looking for any code, just a hint on what to google for more information.
Hope this doesn't sound too ignorant!
Many Regards,
Andreas
In most cases, this sort of functionality (often called clean URLs, user-friendly URLs, or spider-friendly URLs) is achieved through server-side rewrites that point all requests of a specific known structure to a single backend script for processing.
Now, these specific URLs you mention are not, in my opinion, the best examples of clean URLs. I will, however, give you an example of how such a clean URL might be achieved using Apache mod_rewrite (since Apache is so popular).
Take for example a URL like http://somedomain.com/product/ralph_lauren/polo
You might be able to do something like this in mod_rewrite
RewriteEngine On
RewriteRule /?product/(.*)/(.*) /product.php?cat=$1&subcat=$2 [L]
This would silently (to the end user) rewrite any incoming request of the structure /product/*/* to a script called /product.php, passing the second and third parts of the URL as cat and subcat parameters to be evaluated by the script.
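If it helps, here is the same dispatch idea sketched in Python (illustrative only, not asos.com's actual code): one handler receives whatever matched the clean-URL pattern.

import re

def route(path: str) -> str:
    # Mirror of the rewrite rule: /product/<cat>/<subcat> -> one handler.
    m = re.fullmatch(r"/product/([^/]+)/([^/]+)", path)
    if m:
        cat, subcat = m.groups()
        return f"product page for cat={cat}, subcat={subcat}"
    return "404 Not Found"

print(route("/product/ralph_lauren/polo"))  # one script, many URLs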
I'm not sure I understand what you are asking, but in the example you cited it's using a query string, which is everything after the '?' in the URL.
The backend server uses the variables passed in the query string to determine what to return to you.
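For example (a Python sketch, just to show what reading the query string means):

from urllib.parse import urlparse, parse_qs

url = "http://www.asos.com/search/tshirt?q=tshirt"
print(parse_qs(urlparse(url).query))  # {'q': ['tshirt']} -- the search term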
I'm not even sure I'm using the right terminology, or whether this is actually bots or not. I didn't want to use the word 'spam' because it's not like I have comments or posts that are being created/spammed. It looks more like something is making the same repeated request to my domain, which is what made me think it was some kind of bot.
I've opened my first Rails app to the 'public', which is really a small group of users, <50 currently. That was last Friday. I started having performance issues today, so I looked at the log and I see tons of these RoutingErrors:
ActionController::RoutingError (No route matches "/portalApp/APF/pages/business/util/whichServer.jsp" with {:method=>:get}):
They are filling up the log and I'm assuming this is causing the slowdown. Note the .jsp at the end; this is a Rails app, so I've got no URLs remotely like this anywhere. I mean, I don't even have the /portalApp part, so I don't know where this is coming from.
This is hosted at Dreamhost, and I chatted with one of their support people, who suggested a couple of sites that detail using .htaccess to block things. But it looks like you need to know the IP or domain the requests are coming from, which I don't.
How can I block this? How can I find the IP or domain from the request? Any other suggestions?
Follow up info:
After looking at the access logs, it looks like it's not a bot. Maybe I'm not reading the logs right, but there are valid URL requests (generated from within my Flex app) coming from the same IP. So now I'm wondering if it's some kind of plugin generating the requests, but I really don't know. Now I'm wondering if it's possible to block a certain URL request based on a pattern, but I suppose that's a separate question.
Old question, but for people who are still looking for alternatives, I suggest checking out Kickstarter's rack-attack gem. It allows not only blacklisting and whitelisting, but also throttling.
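rack-attack itself is Ruby, but here is a rough sketch (in Python, with made-up limits and pattern) of the two ideas it combines: blocking requests whose path matches a pattern, and throttling an IP that sends too many requests.

import re
import time
from collections import defaultdict

BLOCKED = re.compile(r"\.jsp$")   # e.g. drop the stray .jsp probes
LIMIT, WINDOW = 100, 60           # at most 100 requests per minute
hits = defaultdict(list)

def allow(ip: str, path: str) -> bool:
    if BLOCKED.search(path):
        return False              # blacklisted pattern
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]  # drop old hits
    hits[ip].append(now)
    return len(hits[ip]) <= LIMIT # throttle anything noisier than that

print(allow("1.2.3.4", "/portalApp/APF/pages/business/util/whichServer.jsp"))  # False
print(allow("1.2.3.4", "/valid/route"))                                        # True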
This page seems to offer some good advice:
Here
The section on blocking by user agent may be something you could look at implementing. Is there any way you can get the bot's user agent from your logs? If so, look for the unique part of the user agent that identifies the bot, and add the following to .htaccess, replacing the relevant bits:
BrowserMatchNoCase SpammerRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
It's covered at that link in more detail, and of course, if you can't get the user agent from your logs then this will be of no use to you!
You can also update your public/robots.txt file to allow/disallow robots.
http://www.robotstxt.org/wc/robots.html
I've looked around but wasn't able to find what I was looking for. I'm looking for a way to automatically create short URLs displayed in the browser, without using a URL-shortener service. Basically I would like to re-create something like this:
idzr.org/1ptb
I upload screenshots to my server with "GrabUp" on a regular basis, but it creates rather long URLs, for example:
/2523e3c90d60f08e952215424e7c5d99.png
It's a bit annoying having to shorten them each time.
I have seen this method a lot lately with pretty much any kind of file, including HTML files. If this has been discussed already, I'm sorry for posting it again. I just seem to be stuck.
Thanks in advance for any help & advice!
I don't know what webserver you use, but in general:
You write a rewrite rule
-- .htaccess for Apache, or the equivalent for IIS
You push the content to the user through your own code, because the browser can't tell from the URL what content it will get from the web server
-- set the MIME type via the HTTP Content-Type header
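To make that concrete, here is a hypothetical sketch in Python of the lookup side (a rewrite rule would send /<code> to a script like this; the id and mapping are made up for illustration):

import string

ALPHABET = string.digits + string.ascii_lowercase  # base-36 codes

def encode(n: int) -> str:
    # Turn a numeric database id into a short code like '9ix'.
    out = ""
    while True:
        n, r = divmod(n, len(ALPHABET))
        out = ALPHABET[r] + out
        if n == 0:
            return out

# e.g. upload row id 12345 gets the short code '9ix'
uploads = {encode(12345): "2523e3c90d60f08e952215424e7c5d99.png"}
print(uploads)  # {'9ix': '2523e3c90d60f08e952215424e7c5d99.png'}

# The script then reads the mapped file and sends it with the proper
# Content-Type (e.g. image/png), exactly as the second point above says.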