We are currently capturing the requested URL when someone gets redirected to our 404 page. However, this does not allow us to see reports on things like broken images. Is it possible to get this information into SiteCatalyst, for example by taking the URL of every server request that received a 404 response and storing it in a variable? What would be a sensible way to go about this? I Googled and couldn't find anything.
I want to be able to pull a report on every broken URL reference on a site and the page it happened on...
You can configure your web server (say, Apache) to redirect a 404 error to a specific web page, for example:
ErrorDocument 404 /my_path/not_found.html
Then you can handle the dispatch inside not_found.html with embedded JavaScript.
Here's how to configure Apache to redirect this error request:
http://httpd.apache.org/docs/2.2/custom-error.html
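For the SiteCatalyst part, here is a minimal sketch of the embedded JavaScript that 404 page could run, assuming the standard s_code.js object (s) is already loaded and that prop1/prop2 are free slots (the variable choices are placeholders):

s.pageType = "errorPage";          // "errorPage" is SiteCatalyst's convention for the built-in Pages Not Found report
s.prop1 = document.location.href;  // the URL that returned the 404 (an internal ErrorDocument keeps the original URL)
s.prop2 = document.referrer;       // the page the broken link was on, when a referrer is sent
s.t();                             // send the image request

Note that this only runs when a browser actually renders the error page, so 404s on embedded resources such as broken images won't be captured this way.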
I have defined a location for the page in the XML:
<error-page>
<error-code>404</error-code>
<location>/faces/public/error-page-not-found.xhtml</location>
</error-page>
but I want the URL to look like the one below:
faces/{variable}/public/error-page-not-found.xhtml
where the value of the variable will change according to different situations
This question is a bit subjective, but in general HTTP errors are handled on the server: most of the time by the server-side scripting language, and occasionally by the HTTP server software directly.
For example, the Apache HTTP server allows rewrites, so you can request a page at example.com/123 even though there is no "123" file there. In the code that handles that request you would determine whether a resource exists for it; if not, your server-side scripting code (PHP, ColdFusion, Perl, ASP.NET, etc.) would need to return an HTTP 404. That server code is where you would put a small snippet such as the one you have above.
You would not need to redirect to an error page; you would simply respond with an HTTP 404 status and whatever XML you use to notify the visitor that there is nothing there. HTTP server software such as Apache can't really produce code (it can only reference or rewrite some file to be used for certain requests).
Generally speaking, if you have a website that uses a database you'd do the following (a rough sketch in PHP follows the list)...
Parse the requested URL so you can determine what the visitor asked for.
Determine whether a resource should be retrieved for that request (e.g. make a query to the database).
Once you know whether a resource is available, either show the resource (e.g. a member's profile) or serve the appropriate HTTP status (401: not signed in at all; 403: signed in but not authorized, where no increase in privileges will grant permission; 404: not found; etc.) and display the corresponding content.
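A rough PHP sketch of those three steps, assuming a single front controller receives every request (the database DSN, credentials, and the pages table are placeholders):

<?php
// Hypothetical front controller; table and column names are illustrative only.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);   // 1. what did the visitor request?

$pdo  = new PDO('mysql:host=localhost;dbname=example', 'db_user', 'db_pass');
$stmt = $pdo->prepare('SELECT title, body FROM pages WHERE slug = ?');
$stmt->execute([trim($path, '/')]);                         // 2. is there a resource for it?
$page = $stmt->fetch(PDO::FETCH_ASSOC);

if ($page === false) {                                      // 3. nothing found: serve a real 404
    http_response_code(404);
    echo '<h1>Not Found</h1>';
    exit;
}

echo '<h1>' . htmlspecialchars($page['title']) . '</h1>' . $page['body'];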
I would highly recommend that you read about Apache rewrites and PHP, especially its $_SERVER array (e.g. <?php print_r($_SERVER);?>). You'd use Apache to rewrite all requests to a single file, so even if visitors request /1, /a, /about, /contact/, etc., they all get processed by one PHP file where you first determine what the requested URL is. There are tons of questions here and elsewhere on the web that will give you a good quick jump start on handling all that, such as this: Redirect all traffic to index.php using mod_rewrite. If you do not know how to set up a local HTTP web server, I highly recommend looking into XAMPP; it's what I started out with years ago. Good luck!
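For the Apache side, the usual mod_rewrite setup from that linked question looks roughly like this in an .htaccess file: existing files and directories are served directly, and everything else goes to index.php.

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index.php [QSA,L]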
I'm trying to reproduce an exception my rails site generates whenever a specific crawler hits a certain page:
ActionView::Template::Error: incompatible character encodings: ASCII-8BIT and UTF-8
The page takes GET parameters. When I visit the page with the same GET parameters with my browser, everything renders correctly.
The IP of the crawler is always EU-based (my site is US-based), and one of the user agents is:
Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)
Looking at the HTTP headers sent, the only difference I see between my browser requests and the crawler's is it includes HTTP_ACCEPT_CHARSET, whereas mine does not:
-- HTTP_ACCEPT_CHARSET: utf-8,iso-8859-1;q=0.7,*;q=0.6
I tried setting this header in my request but I couldn't reproduce the error. Are there HTTP header params that can change how Rails renders? Are there any other settings I can try to reproduce this?
That's not a browser; it's an automated crawler. In fact, if you follow the link in the user agent you get the following explanation:
The Grapeshot crawler is an automated robot that visits pages to examine and analyse the content, in this sense it is somewhat similar to the robots used by the major search engine companies.
Unless the crawler is submitting a POST request (which is really unlikely, as crawlers tend to follow links via GET and not to issue POST requests), it means the crawler is somehow injecting some information into your request which causes your controller to crash.
The most common cause is a malformed query string. Check the query string associated with the request: it likely contains a non-UTF-8 encoded character that is read by your controller and is somehow crashing it.
It's also worth inspecting the stack trace of the exception (either in the Rails logs or using a third-party app such as Bugsnag) to determine which component of your stack is causing the exception, then reproduce, test, and fix it.
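As a hedged way to see the same class of failure locally, you can mix a UTF-8 string with a binary string containing a non-UTF-8 byte in a Rails (or plain Ruby) console; this is roughly what happens when a malformed query-string value gets interpolated into a UTF-8 template:

binary = "caf\xE9".force_encoding("ASCII-8BIT")  # \xE9 is the ISO-8859-1 byte for e-acute, not valid UTF-8
utf8   = "título"                                # an ordinary UTF-8 string containing a non-ASCII character
utf8 + binary
# => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

In a view the same mix surfaces wrapped in ActionView::Template::Error. Sending the raw byte from outside (for example, requesting the page with ?q=%E9 via curl) is one way to test whether your particular Rails version lets such a byte reach the view.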
I have a web application that I need to load test using LoadRunner. When I record the website using VuGen it works fine and there is no application bug. But when I try to replay the script, the script fails after login while navigating to the next page, say, Transaction. At the end of the log, I receive this error:
Action.c(252): Error -26612: HTTP Status-Code=500 (Internal Server Error)
for "http://rob.com/common/transaction
Please help me to resolve this error.
LoadRunner generates HTTP requests just as your browser does, so this error is the same error you would get if you went to that URL using your browser. Error code 500 is a generic server error that is returned when there is no better (more specific) error to return.
Most likely the login process requires some form of authentication which is protected against a replay attack by using some form of token. It is up to you to capture this token using Correlations in LoadRunner and replay it as the server expects. The Correlation Studio in VuGen should detect and identify the token for you, but since authentication methods vary it is sometimes impossible to do this automatically and you will have to create a manual correlation. Please consult the product documentation for more details on how to do it. If your website is publicly available online then post its URL and I will try to record the script on my machine.
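A rough VuGen sketch of that manual correlation, assuming the login page embeds the token in a hidden form field; the boundaries, the csrf_token field name, the form fields, and the /login URL are placeholders to be taken from your own recording:

// Register the capture before the request whose response contains the token.
web_reg_save_param("AuthToken",
                   "LB=name=\"csrf_token\" value=\"",
                   "RB=\"",
                   LAST);

web_url("login_page", "URL=http://rob.com/login", LAST);

// Replay the login with the freshly captured token instead of the hard-coded recorded value.
web_submit_data("login",
                "Action=http://rob.com/login",
                "Method=POST",
                ITEMDATA,
                "Name=username",   "Value=vuser1",      ENDITEM,
                "Name=password",   "Value=secret",      ENDITEM,
                "Name=csrf_token", "Value={AuthToken}", ENDITEM,
                LAST);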
Thanks,
Boris.
Most common reasons
You are not checking each request for a valid result and are treating a 200 HTTP status as an assumed correct step without examining the content of what is being returned. As a result, when the data being returned is incorrect you are not branching the code to handle the exception. Go one or two steps beyond where your business process has come off the rails with an assumed success and you will have a 500 status message for an out-of-context action occurring 100% of the time (see the content-check sketch below).
Missed dynamic element. Record three times. Compare the code. Address the changing components.
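A minimal sketch of such a content check in VuGen, registered before the step it should validate; the text "Transaction Summary" is a placeholder for something that only appears on the genuine page:

// Fail the step when the expected text is missing, instead of trusting the HTTP status alone.
web_reg_find("Text=Transaction Summary",
             "Fail=NotFound",
             LAST);

web_url("transaction", "URL=http://rob.com/common/transaction", LAST);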
I've got a Rails 3.2 app where I'm trying to debug a weird ssl problem.
I'm page caching throughout my app and to maintain dynamic content I am making ajax requests to update certain aspects.
All of the pages that are cached are supposed to be requested by http only. All but two of them are redirecting to http when an ssl request is made.
The problem is that these two pages, Artists and the Blog, are failing to redirect back to http, and the ajax requests to refresh content are getting canceled. To the best of my knowledge, they are getting canceled because the browser sees plain http as a different site, and you can't make ajax requests to a different site under ssl.
Setting up a local signed certificate has not helped. In development these two pages are acting appropriately. I'm also using AWS ELB where the ssl terminates at the load balancer and goes to port 80 and that seems to also be working appropriately.
I could force just these two pages to redirect to http every time, but I'd much rather get to the bottom of this.
I am using ssl_requirement to do the app level redirects.
I'm looking for ideas on how this could be happening. I've combed my codebase and I can't find anything at the app level that would be making this happen. I don't think my Apache vhost is perfect, but there's nothing pertaining to just these two pages. Anyone got a clue where in the stack this could be occurring?
Edit:
I finally realized that since the pages are fully cached, the request only hits Apache and never the application, where it would get redirected. This makes me question why the Ajax requests are getting canceled. The Ajax requests are to the same domain but not encrypted. Shouldn't that just show up as an 'insecure content' warning? I'm using jQuery's getScript to load the dynamic content.
Well, this will go down in the history of stupid mistakes but I thought I'd leave it up to help others googling.
The simple answer is that my ajax calls were not being allowed over an ssl connection. If your ssl_requirement has the method ssl_allowed, make sure that the ajax action you're calling to load the dynamic content is allowed over ssl. Otherwise the request will be canceled or will get a 302 redirect.
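For anyone hitting the same thing, a minimal sketch of what that looks like with the ssl_requirement plugin; the controller and action names are placeholders for whatever your getScript calls hit:

class ArtistsController < ApplicationController
  include SslRequirement

  # let these AJAX endpoints respond over both http and https, so a page
  # served under ssl can still load its dynamic content without a redirect
  ssl_allowed :recent_activity, :sidebar
end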
I have a little problem with Googlebot. I have a server running Windows Server 2009; the system is called Workcube and it runs on ColdFusion. There is a built-in error reporter, so I receive every error message, and most of them concern Googlebot trying to go to a false link which doesn't exist. The links look like this:
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=282&HIERARCHY=215.005&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=145&HIERARCHY=200.003&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=123&HIERARCHY=110.006&brand_id=xxblpflyevlitojg
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=1&HIERARCHY=100&brand_id=xxblpflyevlitojg
Of course, a value like brand_id=hoyrrolmwdgldah or brand_id=xxblpflyevlitojg is false. I don't have any idea what the problem could be. I need advice! Thank you all for the help! ;)
You might want to verify your site with Google Webmaster Tools, which will report the URLs it finds that error out.
Your logs are also valid, but you need to verify that it really is Googlebot hitting your site and not someone spoofing their User Agent.
Here are instructions to do just that: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
Essentially you need to do a reverse DNS lookup and then a forward DNS lookup after you receive the host from the reverse lookup.
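For example, with the host command (the IP below is the illustrative Googlebot address from that post; output abridged):

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

The reverse lookup should give a name ending in googlebot.com (or google.com), and the forward lookup of that name should return the original IP; if either check fails, it isn't really Googlebot.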
Once you've verified it's the real Googlebot, you can start troubleshooting. You see, Googlebot won't request URLs that it hasn't naturally seen before, meaning it shouldn't be making direct object-reference requests. I suspect it's a rogue bot with a Googlebot User Agent, but if it's not, you might want to look through your site to see if you're accidentally linking to those pages.
Unfortunately you posted the full URLs, so even if you clean up your site, Googlebot will see the links from Stack Overflow and continue to crawl them because they'll be in its crawl queue.
I'd suggest 301 redirecting these URLs to someplace that makes sense to your users. Otherwise I would 404 or 410 these pages so Google knows to remove them from its index.
In addition, if these are pages you don't want indexed, I would suggest adding the path to your robots.txt file so Googlebot can't continue to request more of these pages.
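As a hedged sketch, using the wildcard syntax Googlebot supports and the bogus brand_id values from your logs (adjust the patterns so they don't catch legitimate product listings):

User-agent: Googlebot
Disallow: /*brand_id=hoyrrolmwdgldah
Disallow: /*brand_id=xxblpflyevlitojg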
Unfortunately there's no really good way of telling Googlebot never to crawl these URLs again. You can always go into Google Webmaster Tools and request that the URLs be removed from the index, which may stop Googlebot from crawling them again, but that doesn't guarantee it.