Why do Envato Marketplace have utf8=✓ in their search url's - url

On all of the Envato Market places like AudioJungle, ActiveDen and GraphicRiver.
When you use the search bar the url it goes to contains utf8=✓
http://audiojungle.net/search?utf8=✓&term=dfg
Why do they need to declare the charset?
Why would they use a tick instead of true or 1?
I would expect ✓ to be less browser safe?

Older versions of IE have a problem with proper Unicode handling when submitting forms under certain circumstances. Including a character which must definitely be encoded in one of the Unicode encodings is a hack to trigger the proper behaviour. Just including it fixes the bug, "utf8=✓" is a tongue-in-cheek implementation for this fix.

Related

Any problems with using a period in URLs to delimiter data?

I have some easy to read URLs for finding data that belongs to a collection of record IDs that are using a comma as a delimiter.
Example:
http://www.example.com/find:1%2C2%2C3%2C4%2C5
I want to know if I change the delimiter from a comma to a period. Since periods are not a special character in a URL. That means it won't have to be encoded.
Example:
http://www.example.com/find:1.2.3.4.5
Are there any browsers (Firefox, Chrome, IE, etc) that will have a problem with that URL?
There are some related questions here on SO, but none that specific say it's a good or bad practice.
To me, that looks like a resource with an odd query string format.
If I understand correctly this would be equal to something like:
http://www.example.com/find?id=1&id=2&id=3&id=4&id=5
Since your filter is acting like a multi-select (IDs instead of search fields), that would be my guess at a standard equivalent.
Browsers should not have any issues with it, as long as the application's route mechanism handles it properly. And as long as you are not building that query-like thing with an HTML form (in which case you would need JS or some rewrites, ew!).
May I ask why not use a more standard URL and querystring? Perhaps something that includes element class (/reports/search?name=...), just to know what is being queried by find. Just curious, I knows sometimes standards don't apply.

URL containing non-visual characters

My crawler engine seems to have a problem with a specific customer's site.
At that site, there are redirects to URLs that look like this:
http://example.com/dir/aaa$0081 aaa.php
(Showing the URL as non-encoded, with $0081 being two bytes represented using HEX.)
Now, this is when inspecting the buffer returned after using the WinInet Windows API call HttpQueryInfo, so the two bytes actually represent a WideChar at this point.
Now, I can see that e.g. $0081 is a non-visual control character:
Latin-1 Supplement (Unicode block)
The problem is that if I use the URL "as-is" (URL encoded) for future requests to the server, it responds with 400 or 404. (On the other hand, is it removed entirely, it works and the server delivers the correct page and response...)
I suspect that FireFox/IE/etc. is stripping non-visible controls characters in URLs before making the HTTP requests... (At least IEHTTPHeaders and FF Live HTTP Headers addins don't show any non-visible characters.)
I was wondering if anyone can point to a standard for this? For what I can see non-visible chracters should not be found in URLs, so I am thinking a solution might be (in this and future cases) that I remove these. But it is not a topic that seems widely discussed on the net.
In the example given, $0081 is just five Ascii characters. But if you mean that this is just what it looks like and you have (somehow) inferred that the actual URL contains U+0081, then what should happen, and does happen at least on Firefox, is that it is %-encoded (“URL encoded”) as %C2%81 (formed by %-encoding the two bytes of the UTF-8 encoded form of U+0081. Firefox shows this as empty in its address bar, since U+0081 is control character, but the server actually gets %C2%81 and must take it from there.
I have no idea of where the space comes from, but a URL must not contain a space, except as %-encoded (%20).
The relevant standard is Internet-standard STD 66, URI Generic Syntax. (Currently RFC 3986. Beware: people still often refer to older RFCs as “standard” in this issue.)

#! as opposed to just # in a permalink

I'm designing a permalink system and I just noticed that Twitter and Hipmunk both prefix their permalinks with #!. I was wondering why this is, and if the exclamation point in particular is there for a reason. Wouldn't #/ work just as well, since they're no doubt using a framework that lets them redirect queries to certain templates with a regex URL parser?
http://www.hipmunk.com/#!BOS.SEA,Dec15.Jan02
http://twitter.com/#!/dozba
My only guess is it's because browsers use # to link to an anchor element. Is this why the exclamation point is appended?
This is done to make an "AJAX" page crawlable [by google] for indexing -- It does not affect the other well-defined semantics of the fragment identifier at all!
See Making AJAX Applications Crawlable: Getting Started
Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler. The search results will show the original URL.
I am sure other search-engines are also following this lead/protocol.
Happy coding.
Also, It is actually perfectly valid, at least per HTML5, to have an element with an ID of "!foo" so the
reasoning in the post is invalid. See the article "The id attribute just got more classy":
HTML5 gets rid of the additional restrictions on the id attribute. The only requirements left — apart from being unique in the document — are that the value must contain at least one character (can’t be empty), and that it can’t contain any space characters.
My guess is that both pages use this in their JavaScript to differ between # (a link to an anchor) and their custom #! which loads some additional content using Ajax.
In that case pretty much everything else would work after the # sign.

What is the _snowman param in Ruby on Rails 3 forms for?

In Ruby on Rails 3 (currently using Beta 4), I see that when using the form_tag or form_for helpers there is a hidden field named _snowman with the value of ☃ (Unicode \x9731) showing up.
So, what is this for?
This parameter was added to forms in order to force Internet Explorer (5, 6, 7 and 8) to encode its parameters as unicode.
Specifically, this bug can be triggered if the user switches the browser's encoding to Latin-1. To understand why a user would decide to do something seemingly so crazy, check out this google search. Once the user has put the web-site into Latin-1 mode, if they use characters that can be understood as both Latin-1 and Unicode (for instance, é or ç, common in names), Internet Explorer will encode them in Latin-1.
This means that if a user searches for "Ché Guevara", it will come through incorrectly on the server-side. In Ruby 1.9, this will result in an encoding error when the text inevitably makes its way into the regular expression engine. In Ruby 1.8, it will result in broken results for the user.
By creating a parameter that can only be understood by IE as a unicode character, we are forcing IE to look at the accept-charset attribute, which then tells it to encode all of the characters as UTF-8, even ones that can be encoded in Latin-1.
Keep in mind that in Ruby 1.8, it is extremely trivial to get Latin-1 data into your UTF-8 database (since nothing in the entire stack checks that the bytes that the user sent at any point are valid UTF-8 characters). As a result, it's extremely common for Ruby applications (and PHP applications, etc. etc.) to exhibit this user-facing bug, and therefore extremely common for users to try to change the encoding as a palliative measure.
All that said, when I wrote this patch, I didn't realize that the name of the parameter would ever appear in a user-facing place (it does with forms that use the GET action, such as search forms). Since it does, we will rename this parameter to _e, and use a more innocuous-looking unicode character.
This is here to support Internet Explorer 5 and encourage it to use UTF-8 for its forms.
The commit message seen here details it as follows:
Fix several known web encoding issues:
Specify accept-charset on all forms. All recent browsers, as well as
IE5+, will use the encoding specified
for form parameters
Unfortunately, IE5+ will not look at accept-charset unless at least one
character in the form's values is not
in the page's charset. Since the
user can override the default
charset (which Rails sets to UTF-8),
we provide a hidden input containing
a unicode character, forcing IE to
look at the accept-charset.
Now that the vast majority of web input is UTF-8, we set the inbound
parameters to UTF-8. This will
eliminate many cases of incompatible
encodings between ASCII-8BIT and
UTF-8.
You can safely ignore params[:_snowman]
In short, you can safely ignore this parameter.
Still, I am not sure why we're supporting old technologies like Internet Explorer 5. It seems like a very non-Ruby on Rails decision if you ask me.

How good is the Rails sanitize() method?

Can I use ActionView::Helpers::SanitizeHelper#sanitize on user-entered text that I plan on showing to other users? E.g., will it properly handle all cases described on this site?
Also, the documentation mentions:
Please note that sanitizing
user-provided text does not guarantee
that the resulting markup is valid
(conforming to a document type) or
even well-formed. The output may still
contain e.g. unescaped ’<’, ’>’, ’&’
characters and confuse browsers.
What's the best way to handle this? Pass the sanitized text through Hpricot before displaying?
Ryan Grove's Sanitize goes a lot farther than Rails 3 sanitize. It ensures the output HTML is well-formed and has three built-in whitelists:
Sanitize::Config::RESTRICTED
Allows only very simple inline formatting markup. No links, images, or block elements.
Sanitize::Config::BASIC
Allows a variety of markup including formatting tags, links, and lists. Images and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto protocols, and a attribute is added to all links to mitigate SEO spam.
Sanitize::Config::RELAXED Allows an even wider variety of markup than BASIC, including images and tables. Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images are limited to HTTP and HTTPS. In this mode, is not added to links.
Sanitize is certainly better than the "h" helper. Instead of escaping everything, it actually allows the html tags that you specify. And yes, it does prevent cross-site scripting because it removes javascript from the mix entirely.
In short, both will get the job done. Use "h" when you don't expect anything other than plaintext, and use sanitize when you want to allow some, or you believe people may try to enter it. Even if you disallow all tags with sanitize, it'll "pretty up" the code by removing them instead of escaping them as "h" does.
As for incomplete tags: You could run a validation on the model that passes html-containing fields through hpricot, but I think this is overkill in most applications.
The best course of action depends on two things:
Your rails version (2.x or 3.x)
Whether your users are supposed to enter any html at all on the input or not.
As a general rule, I don't allow my users to input html - instead I let them input textile.
On rails 3.x:
User input is sanitized by default. You don't have to do anything, unless you want your users to be able to send some html. In that case, keep reading.
This railscast deals with XSS attacks on rails 3.
On rails 2.x:
If you don't allow any html from your users, just protect your output with the h method, like this:
<%= h post.text %>
If you want your users to send some html: you can use rails' sanitize method or HTML::StathamSanitizer

Resources