What browser settings can cause server encoding issues? - ruby-on-rails

I'm trying to reproduce an exception my rails site generates whenever a specific crawler hits a certain page:
ActionView::Template::Error: incompatible character encodings: ASCII-8BIT and UTF-8
The page takes GET parameters. When I visit the page with the same GET parameters in my browser, everything renders correctly.
The IP of the crawler is always EU-based (my site is US-based), and one of the user agents is:
Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)
Looking at the HTTP headers sent, the only difference I see between my browser's requests and the crawler's is that the crawler includes HTTP_ACCEPT_CHARSET, whereas mine does not:
-- HTTP_ACCEPT_CHARSET: utf-8,iso-8859-1;q=0.7,*;q=0.6
I tried setting this header in my own request, but I couldn't reproduce the error. Are there HTTP headers that can change how Rails renders? Are there any other settings I can try to reproduce this?

That's not a browser; it's more likely an automated crawler. In fact, if you follow the link in the user agent you get the following explanation:
The Grapeshot crawler is an automated robot that visits pages to examine and analyse the content, in this sense it is somewhat similar to the robots used by the major search engine companies.
Unless the crawler is submitting a POST request (which is really unlikely, as crawlers tend to follow links via GET rather than issue POSTs), it means the crawler is somehow injecting some information into your request that causes your controller to crash.
The most common cause is a malformed query string. Check the query string associated with the request: it likely contains a non-UTF-8-encoded character that is read by your controller and ends up crashing it.
It's also worth inspecting the stack trace of the exception (either in the Rails logs or via a third-party app such as Bugsnag) to determine which component of your stack is raising it, then reproduce, test, and fix it.
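If you want to try reproducing it, one rough sketch (not the asker's actual setup - the localhost URL and the q parameter name are made up) is to replay the request with a percent-encoded byte sequence that isn't valid UTF-8, along with the crawler's headers:

import requests

# Hypothetical local endpoint and parameter name; point this at the real page.
url = "http://localhost:3000/some/page"

# %FE%FF decodes to raw bytes that are not valid UTF-8. In older Rails versions
# such a parameter can reach the view as a binary (ASCII-8BIT) string and raise
# "incompatible character encodings: ASCII-8BIT and UTF-8" when interpolated
# into a UTF-8 ERB template.
malformed_query = "q=%FE%FF"

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; "
                  "+http://www.grapeshot.co.uk/crawler.php)",
    "Accept-Charset": "utf-8,iso-8859-1;q=0.7,*;q=0.6",
}

response = requests.get(url + "?" + malformed_query, headers=headers)
print(response.status_code)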

Related

POST Request is Displaying as GET Request During Replay in JMeter

I have a JMeter script where, during replay, a POST request is displayed as a GET request and the parameters in the request are not sent to the server. Because of this, correlations fail at this request.
One of the parameters in the request is ViewState, which contains a very large number of characters. Is this large parameter value causing the issue? How should I proceed?
Most probably you're sending a malformed request, so instead of properly responding to the POST the server is redirecting you somewhere (most probably to the login page).
Use the View Results Tree listener in HTML or Browser mode to see which page you're actually hitting.
As for the ViewState, "so many characters" is not the problem; the problem is that these are not random characters. ViewState is used for client-side state management, and if you fail to provide the proper value you won't be able to move further, so you need to design your test as follows (see the sketch at the end of this answer):
Open first page
Extract ViewState using a suitable Post-Processor
Open second page, passing the ViewState from the first page along with the other parameters
More information: ASP.NET Login Testing with JMeter
Also don't forget to add HTTP Cookie Manager to your Test Plan
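Outside of JMeter, the same extract-and-resubmit pattern looks roughly like this in Python (a sketch only - the URL and the UserName/Password fields are hypothetical, though __VIEWSTATE and __EVENTVALIDATION are the standard ASP.NET WebForms hidden fields):

import re
import requests

# A Session plays the role of the HTTP Cookie Manager: it carries cookies between steps.
session = requests.Session()

# Step 1: open the first page (hypothetical URL).
first = session.get("https://example.com/Login.aspx")

# Step 2: extract ViewState (and EventValidation) from the hidden form fields,
# just as a Regular Expression Extractor would in JMeter.
viewstate = re.search(r'id="__VIEWSTATE" value="([^"]*)"', first.text).group(1)
eventvalidation = re.search(r'id="__EVENTVALIDATION" value="([^"]*)"', first.text).group(1)

# Step 3: open the second page, passing the extracted values along with the other parameters.
second = session.post(
    "https://example.com/Login.aspx",
    data={
        "__VIEWSTATE": viewstate,
        "__EVENTVALIDATION": eventvalidation,
        "UserName": "test",      # hypothetical form fields
        "Password": "secret",
    },
)
print(second.status_code)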
What I understand is that the request may be getting redirected. This usually happens when the server expects a unique request. If you recorded the request, you may be using older headers that carry stale cookie information. Check your headers and then reconstruct the request.
Make sure you are not using old cookies anywhere; remove the cookie part from the HTTP Header Manager everywhere.

IE using Negotiate authorization instead of Basic

I initially asked this question, which shows that I'm seeing MVC errors about missing POST values. I was unable to reproduce it - I still can't reproduce it on demand - but I did get the error myself in IE11, and I got a clue...
I have an application in IIS7.5 running with Basic authentication only. I look in Fiddler, and normally all transactions have Authorization: Basic xxxxx as expected. The body contains POST values as expected, and Content-Length is correct.
When I experienced this problem, I found that every single request (GETs and POSTs, including static content) was now showing Authorization: Negotiate xxxxx in Fiddler, with an empty body and zero Content-Length, even when I submitted a POST object via jQuery AJAX and IE's dev tools showed the real POST body (which of course means IE is lying - not the first time). It gets a 401 response, and then a new request occurs with Basic, but also with an empty POST body, which means ASP.NET throws an error about missing parameter values.
Other web applications on the same top-level domain do use Windows authentication instead of Basic, and my suspicion is that the user goes to one of these sites, and IE becomes confused and thinks that my application should use Windows authentication as well - but I can't reproduce that every time. I have reproduced it twice, but out of a dozen or so times of doing the same thing over and over, so I'm not finding a way to make it reproduce every time.
I don't know why the POST body would get emptied, even if it does switch over and try to do WinAuth instead of basic - but that's when the problem occurs, so I'm sure it's related.
Any ideas on how to prevent IE from getting confused and using Negotiate, or at least how to detect and gracefully handle this on the server? I've only seen it in IE, but I can't be sure it's IE-only.
(Fiddler captures: a normal POST, and the exact same POST after the problem starts occurring.)
EDIT
Here's an interesting edit - I just saw a new symptom. This time, all GET requests are coming in with no Authorization header at all; the response comes back with a 401 for Basic, and the GET is then re-done properly with Basic. But the POSTs are going through normally, with Basic on the first try. I don't know what started this happening, but it's a similar symptom of the same problem.

How to Determine if "200 OK" is Really a Misconfigured ASP.NET MVC Custom 404 Page?

tl;dr: Misconfigured ASP.NET MVC servers return "200 OK" when they should 404.
I'm building a list of tech employer career page links. I am flummoxed to find it quite common that such companies have open positions listed on their sites, but no links to them. That is, if you visit www.example.com, nowhere on the homepage - sometimes, nowhere on the whole website - can a link to www.example.com/jobs be found.
To get around that, after manually indexing a few hundred sites, I made a list of common URL paths:
/careers
/careers/
/careers.html
/jobs.aspx
I have written a straightforward Python script that, when given a list of company homepages, uses pycurl - a wrapper around libcURL - to attempt HTTP HEAD requests for each (homepage, urlpath) pair:
http://www.example.com/careers
http://www.example.com/jobs
http://www.example.net/careers
http://www.example.net/jobs
This mostly works.
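The asker's script isn't shown; a minimal sketch of that kind of HEAD-request check with pycurl (the homepages and paths below are placeholders) could look like this:

import pycurl

# Placeholder inputs; the real script reads these from a list of company homepages.
homepages = ["http://www.example.com", "http://www.example.net"]
paths = ["/careers", "/careers/", "/careers.html", "/jobs.aspx"]

def head_status(url):
    """Issue an HTTP HEAD request and return the response code (None on failure)."""
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)          # HEAD request: headers only, no body
    c.setopt(pycurl.FOLLOWLOCATION, True)  # follow redirects to the final page
    c.setopt(pycurl.TIMEOUT, 10)
    try:
        c.perform()
        return c.getinfo(pycurl.RESPONSE_CODE)
    except pycurl.error:
        return None
    finally:
        c.close()

for homepage in homepages:
    for path in paths:
        url = homepage + path
        print(url, head_status(url))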
However, there is what I gather to be a common misconfiguration problem with ASP.NET MVC which results in custom 404 pages producing a 200 response code while displaying the custom "Not Found" page. For example:
http://www.microsoft.com/bill-gates-is-the-spawn-of-satan.html
Yes, that's right folks: Microsoft misconfigured their own server. :-D
If you use Firefox's web developer tools you can see that the above link produces a 200 OK instead of a 404 Not Found.
I expect this is a common problem for anyone who deals with scraping or robots: is there a straightforward programmatic way that I could tell that the above link should produce a 404 instead of a 200?
In my particular case, a modestly unsatisfactory solution would be to notice when none of my candidate links produces a 404 and then emit a "can't find" output. In such cases I manually Google the careers pages:
http://www.google.com/search?q=site:microsoft.com+careers
My goal for the near term is to partially automate the discovery of the links for my tech employer index. I expect that fully automating it would be intractable; I hope to automate the easy stuff.
I don't know of any way, from the client end, to know that a page is invalid when the server is explicitly telling the client that the page is valid. The second-best solution I can come up with is to grep for common text that is usually displayed on such pages, such as "sorry" and "not found". This will, of course, do nothing for you if the custom error page is actually a redirect to a completely valid page such as the home page.
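One way to strengthen that heuristic (a sketch only - the phrases and the 0.9 similarity threshold are guesses, and the probe trick won't catch every soft 404) is to also fetch a path that almost certainly doesn't exist and compare it with the candidate page:

import difflib
import uuid
from urllib.parse import urlsplit
import requests

NOT_FOUND_PHRASES = ("not found", "sorry", "page you requested", "doesn't exist")

def looks_like_soft_404(candidate_url):
    """Heuristic: a 200 response that contains typical 'not found' wording, or that
    resembles the site's response to a path that should not exist."""
    resp = requests.get(candidate_url, timeout=10)
    if resp.status_code != 200:
        return False  # a real error code; nothing to second-guess

    body = resp.text.lower()
    if any(phrase in body for phrase in NOT_FOUND_PHRASES):
        return True

    # Probe a random path on the same host; a well-behaved server returns 404 here.
    parts = urlsplit(candidate_url)
    probe = requests.get(f"{parts.scheme}://{parts.netloc}/{uuid.uuid4().hex}", timeout=10)

    # If the probe also comes back 200 and the two bodies are nearly identical,
    # the candidate is almost certainly the same custom error page.
    if probe.status_code == 200:
        return difflib.SequenceMatcher(None, resp.text, probe.text).ratio() > 0.9
    return False

print(looks_like_soft_404("http://www.example.com/careers"))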

Finding 404 errors logged in database: '../:/0'

All the errors that occur in our web application are logged to a database, and I'm finding a 404 error that has occurred hundreds of times in the last month. The page users are attempting to access is "https://companysite.com/applicationsite/:/0"
The application is a classic ASP site with some ASP.NET MVC 3 included through iframes, although judging by the URL this error appears to be occurring on the classic ASP side.
I've done a search through the entire codebase (classic and .NET) for the string ":/0" but I'm not seeing anything. I'm at a loss as to how this error is occurring. It is happening too often and for too many users to be intentional.
Would anyone happen to know why users are getting this error? Unfortunately I only have the database logs, so I'm not really sure how to reproduce it, nor do I know how users are coming across it.
I would suspect that someone (outside of your site) is hitting that URL, which does not exist.
It could simply be that a spider has that URL indexed and is trying to crawl it. Or maybe that is a path to some application that has a vulnerability and someone is testing to see if you are running that application.
Try logging the IP address the request is coming from, along with the User-Agent. If it is a web crawler, you should be able to tell from the User-Agent.
You could also block the IP address from accessing your site.
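If the database log (or the IIS log) records the client IP and User-Agent, a quick way to see whether this is one crawler or many real users is to group the offending requests by those two fields. A rough sketch, assuming a hypothetical CSV export with ip, user_agent, and url columns:

import csv
from collections import Counter

# Hypothetical CSV export of the logged errors, one row per request.
hits = Counter()
with open("logged_404s.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["url"].endswith("/:/0"):
            hits[(row["ip"], row["user_agent"])] += 1

# A handful of IPs with bot-like User-Agents suggests a crawler or probe;
# many distinct IPs with ordinary browser User-Agents suggests a real bug.
for (ip, user_agent), count in hits.most_common(10):
    print(count, ip, user_agent)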

Google bot, false links

I have a little problem with Googlebot. I have a server running Windows Server 2009 with a system called Workcube, which runs on ColdFusion. It has a built-in error reporter, so I receive every error message, and in particular the errors concern Googlebot trying to go to false links that don't exist. The links look like this:
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=282&HIERARCHY=215.005&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=145&HIERARCHY=200.003&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=123&HIERARCHY=110.006&brand_id=xxblpflyevlitojg
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=1&HIERARCHY=100&brand_id=xxblpflyevlitojg
Of course, values like brand_id=hoyrrolmwdgldah or brand_id=xxblpflyevlitojg are false. I don't have any idea what the problem could be - I need advice! Thank you all for the help! ;)
You might want to verify your site with Google Webmaster Tools, which will report the URLs it finds that error out.
Your logs are also valid, but you need to verify that it really is Googlebot hitting your site and not someone spoofing their User Agent.
Here are instructions to do just that: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
Essentially you need to do a reverse DNS lookup and then a forward DNS lookup after you receive the host from the reverse lookup.
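A minimal sketch of that reverse-then-forward check in Python (the IP below is a placeholder for an address from your logs; the googlebot.com/google.com suffix check follows Google's published guidance):

import socket

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the host is under googlebot.com or google.com,
    then forward-resolve that host and confirm it maps back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))  # placeholder address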
Once you've verified it's the real Googlebot you can start troubleshooting. You see, Googlebot won't request URLs that it hasn't naturally seen before, meaning it shouldn't be making direct object reference requests. I suspect it's a rogue bot with a Googlebot User Agent, but if it's not, you might want to look through your site to see whether you're accidentally linking to those pages.
Unfortunately you posted the full URLs, so even if you clean up your site, Googlebot will see the links from Stack Overflow and continue to crawl them because they'll be in its crawl queue.
I'd suggest 301 redirecting these URLs to someplace that makes sense to your users. Otherwise I would 404 or 410 these pages so Google knows to remove them from its index.
In addition, if these are pages you don't want indexed, I would suggest adding the path to your robots.txt file so Googlebot can't continue to request more of these pages.
Unfortunately there's no really good way of telling Googlebot never to crawl these URLs again. You can always go into Google Webmaster Tools and request that the URLs be removed from the index, which may stop Googlebot from crawling them again, but that doesn't guarantee it.
