I found that some characters such as "<", ">", and "&" in a tweet are automatically escaped by Twitter.
For example, tweeting the content <>& results in &lt;&gt;&amp;, even though the content is serialized as a JSON object.
I've searched a lot but found no official document describing this behavior.
Where can I get a full list of escape characters?
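If the goal is simply to recover the original text on the client, one option is to decode the entities after parsing the JSON. The following is a minimal sketch in Python, assuming the escaped text uses standard HTML entities; the function name is hypothetical, and html.unescape decodes every named entity it knows, not only the three mentioned above, so it is not a statement of which characters Twitter actually escapes.

```python
import html


def unescape_tweet_text(text: str) -> str:
    """Turn HTML entities such as &lt; &gt; &amp; back into literal characters."""
    return html.unescape(text)


if __name__ == "__main__":
    # Hypothetical tweet text as it might arrive in a JSON payload.
    print(unescape_tweet_text("1 &lt; 2 &amp;&amp; 3 &gt; 2"))
```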
Related
We are using the Google Search Appliance in our application. We have added non-English characters under Frontend->Keymatch. When we search from our site, a "no results found" error page is displayed.
Please suggest how to fix this issue.
I guess that by non-English characters you mean accented characters.
GSA can return keymatch results for query terms with accented characters if the query is encoded properly. Encode your query and add the ie request parameter with the appropriate encoding value.
You can read more about ie and other request parameters here.
Including a sample of the keymatch entries you configured would have made it easier to help.
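As an illustration of "encode the query and add the ie parameter", here is a minimal sketch in Python. The host name and the function name are hypothetical, and only the q and ie parameters mentioned above are set; a real appliance request will usually need further parameters.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, input_encoding: str = "UTF-8") -> str:
    """Percent-encode the query in the given encoding and declare it via ie."""
    params = {"q": query, "ie": input_encoding}
    # The same encoding is used for the bytes and for the ie value,
    # so the server can decode the query the way it was encoded.
    return base_url + "?" + urlencode(params, encoding=input_encoding)


if __name__ == "__main__":
    # gsa.example.com is a placeholder host, not a real appliance.
    print(build_search_url("http://gsa.example.com/search", "café"))
    print(build_search_url("http://gsa.example.com/search", "café", "latin1"))
```

The point of the sketch is that the bytes behind the percent-escapes and the declared ie value have to agree; a mismatch between the two is one plausible cause of accented keymatches not being found.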
I have noticed that Google does not encode all special characters in the query part of the URL. For example:
Placing this string in Google's search: !@#$%^&*()
Yields this URL: https://www.google.com/#q=!%40%23%24%25^%26*()
Notice that !, ^, *, (, and ) are not encoded.
Some of the characters such as : or < are considered unsafe or reserved, yet Google doesn't encode them.
Can someone explain why Google does this, and if they have a reference document as to exactly what characters get encoded and which don't?
Thanks for any help!
As documented here:
Some characters are not safe to use in a URL without first being
encoded. Because a Google search request is made by using an HTTP URL,
the search request must follow URL conventions, including character
encoding, where necessary.
The HTTP URL syntax defines that only alphanumeric characters, the
special characters $-_.+!*'(), and the reserved characters ;/?:#=& can
be used as values within an HTTP URL request. Since reserved
characters are used by the search engine to decode the URL, and some
special characters are used to request search features, then all
non-alphanumeric characters used as a value to an input parameter must
be URL-encoded.
To URL-encode a string:
Replace space characters with a "+" character.
Replace each non-alphanumeric character by its hexadecimal ASCII value,
in the format of a "%" character followed by two hexadecimal digits.
(Such an ASCII value may be referred to as an escape code.)
Some input parameters require that the values passed to Google search are double-URL-encoded. This requirement means that you must apply the URL encoding to the string twice in succession to generate the final value.
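The two-step procedure quoted above, and the double encoding, can be sketched in Python as follows. The function names are hypothetical, and encoding non-ASCII characters through their UTF-8 bytes is an assumption of this sketch, since the quoted text only speaks of ASCII values. Python's own urllib.parse.quote_plus is close but leaves a few punctuation characters such as - and . unescaped, so a hand-written loop is used here to follow the quoted rule literally.

```python
def url_encode(value: str) -> str:
    """Encode as described: space -> '+', other non-alphanumerics -> %XX."""
    out = []
    for byte in value.encode("utf-8"):  # assumption: non-ASCII goes through UTF-8
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def double_url_encode(value: str) -> str:
    """Apply the URL encoding twice in succession."""
    return url_encode(url_encode(value))


if __name__ == "__main__":
    print(url_encode("50% off"))
    print(double_url_encode("a b"))
```

Note that in the second pass the "+" produced for a space is itself encoded as %2B, and every "%" becomes %25, which is why a double-encoded value has to be decoded twice on the receiving side.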
In developing an iOS app containing a twitter client, I must allow for user generated hashtags (which may be created elsewhere within the app, not just in the tweet body).
I would like to ensure any such hashtags are valid for twitter, so I would like to error check the entered value for invalid characters. Bear in mind that users may be from non-English speaking countries.
I am aware of the usual limitations, such as not beginning a hashtag with a number, and no special punctuation characters, but I was wondering if there is a known list of all additional characters that are technically allowed within hashtags (i.e. international characters).
Karl, as you've rightly pointed out, any word in any language can be a valid twitter hashtag (as long as it meets a number of basic criteria). As such what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for twitter.
I believe what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependent on your locale and would match all characters in modern typography that can appear as part of a word.
You didn't specify what language you are writing your app in, so I can't help you with a language specific implementation. However, the basic approach would be as follows:
Check if any of the bracket expressions or character classes already support Unicode character ranges in your language. If yes, then use them.
Check if there is a regex modifier that can enable Unicode character range support for your language.
Most modern languages implement regular expressions in a fairly similar way and a lot of them borrow heavily from Perl, so I hope the following two examples will put you on the right track:
Perl:
Use POSIX bracket expressions (e.g. [[:alpha:]], [[:alnum:]], [[:digit:]], etc.) as they give you greater control over the characters you want to match, compared to character classes (e.g. \w).
Use /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; and hence, for example, \w will match any of the more than 100,000 word characters in Unicode.
See Perl documentation for more info:
http://perldoc.perl.org/perlre.html#Character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
Ruby:
Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
See Ruby documentation for more info:
http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Classes
Examples:
Given a list of hashtags, the following regex will match all hashtags that start with a word character (inc. international word characters) followed by at least one other word character, a number or an underscore:
m/^#[[:alpha:]][[:alnum:]_]+$/u # Perl
/^#[[:alpha:]][[:alnum:]_]+$/ # Ruby
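For comparison, here is a rough Python equivalent of the same rule. Python's re module does not support POSIX bracket expressions, so this sketch uses the Unicode-aware string methods str.isalpha and str.isalnum instead; the function name is hypothetical, and the rule it checks is the one from the regexes above, not Twitter's own parser.

```python
def is_valid_hashtag(tag: str) -> bool:
    """'#', then a letter, then at least one letter, digit or underscore."""
    if len(tag) < 3 or tag[0] != "#":
        return False
    first, rest = tag[1], tag[2:]
    # str.isalpha / str.isalnum are Unicode-aware in Python 3.
    return first.isalpha() and all(ch.isalnum() or ch == "_" for ch in rest)


if __name__ == "__main__":
    for candidate in ["#foo", "#日本語", "#café_2020", "#1abc", "#foo[bar", "#a"]:
        print(candidate, is_valid_hashtag(candidate))
```

One limitation of this sketch: str.isalpha does not count combining marks as letters, so words in scripts that rely on them may be rejected and would need extra handling.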
Twitter allows letters, numbers, and underscores.
I checked this by generating tweets via their API. For example, tweeting
Hash tag test #foo[bar
resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.
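The cut-off behaviour described above can be imitated with a simple pattern. This is only an approximation in Python, assuming that a hashtag is "#" followed by a run of word characters; the pattern and function name are hypothetical, and unlike Twitter it would also accept a tag that starts with a digit.

```python
import re

# '#' followed by one or more word characters (letters, digits, underscore).
# In Python 3, \w is Unicode-aware for str patterns.
HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtags(text: str) -> list:
    """Return the hashtag-like runs in the text, cut at the first other character."""
    return HASHTAG_PATTERN.findall(text)


if __name__ == "__main__":
    print(extract_hashtags("Hash tag test #foo[bar"))
```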
Well, for starters you can't use a # in the hashtag (##hash).
The guidelines below are being quoted from Twitter's help center:
People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
Hashtagged words that become very popular are often Trending Topics.
Example: In the Tweet below, #eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.
Using hashtags correctly:
If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
Use hashtags only on Tweets relevant to the topic.
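Just want to add that, in addition to alphanumeric characters and underscores, you can apparently use a dash-like character in a Twitter hashtag, as in #COVIDー19. (The character shown there is ー, U+30FC, the Katakana-Hiragana prolonged sound mark, rather than an em dash.)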
Only letters and numbers are allowed to be part of a hashtag. If a character other than these follows the leading # and a letter or number, the hashtag will be cut off at this point.
I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.
I had to solve the same problem in Go.
It seems that [[:alpha:]] only matches the English (ASCII) alphabet there, so this syntax cannot be used for characters from other languages.
Instead, I was able to use \p{L} for this purpose.
My test with \p{L} is here.
* Arabic, Hebrew, Hindi, etc. have not been confirmed yet.
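One way to explore the unconfirmed scripts is to look at Unicode general categories directly, since \p{L} means "any character whose category starts with L". The sketch below does this in Python rather than Go (Python's re module has no \p{L}), using unicodedata; the function name is hypothetical. Because general categories are defined by Unicode and not by the language, a \p{L} class would be expected to behave similarly, but that is not verified here.

```python
import unicodedata


def all_in_categories(word: str, prefixes: tuple = ("L",)) -> bool:
    """True if every character's Unicode general category starts with a prefix."""
    return bool(word) and all(
        unicodedata.category(ch).startswith(prefixes) for ch in word
    )


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"             # shalom
    arabic = "\u0633\u0644\u0627\u0645"             # salaam
    hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # namaste
    print(all_in_categories(hebrew), all_in_categories(arabic))
    # The Hindi word contains combining marks (category Mn), so letters alone
    # are not enough; allowing the M categories as well accepts it.
    print(all_in_categories(hindi), all_in_categories(hindi, ("L", "M")))
```

For scripts such as Devanagari this suggests that a letters-only class is too strict, and that marks would have to be allowed in addition to letters.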
In a TextArea, I am using the ' character but it is not displaying properly. Instead, it is displaying something like this: –.
How do I get the ' character to display properly?
You are probably not using the ASCII apostrophe (') but some non-ASCII punctuation mark, such as the correct punctuation apostrophe (’). The problem arises because your HTML document is (probably) UTF-8 encoded but the browser interprets it as windows-1252 encoded. If the encoding is not declared in HTTP headers, adding the tag <meta charset=utf-8> into the head part would help. For general advice on encodings, see the W3C page Character encodings.
The textarea element is meant for user input. For presenting your content, other elements (possibly styled with CSS) are usually a better choice. However, the encoding issue is the same.
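The mismatch described above is easy to reproduce outside the browser. The following Python sketch encodes the punctuation apostrophe as UTF-8 and then decodes the bytes as windows-1252 (cp1252 in Python), which is what a browser does when it guesses the wrong encoding; the function names are hypothetical.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then decode the bytes as windows-1252 (the wrong guess)."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the wrong guess by reversing the two steps."""
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_as_cp1252("\u2019")  # U+2019 RIGHT SINGLE QUOTATION MARK
    print(garbled)          # three characters instead of one
    print(repair(garbled))  # the original apostrophe again
```

The single character becomes three because its UTF-8 form is three bytes, and windows-1252 maps each byte to its own character. The repair step only works when all the bytes involved are defined in windows-1252, so declaring the encoding correctly remains the real fix.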
What’s the difference between a URL Encode and an HTML Encode?
HTML Encoding escapes special characters in strings used in HTML documents to prevent confusion with HTML elements, for example changing
"<hello>world</hello>"
to
"&lt;hello&gt;world&lt;/hello&gt;"
URL Encoding does a similar thing for string values in a URL like changing
"hello+world = hello world"
to
"hello%2Bworld+%3D+hello+world"
urlEncode replaces special characters with characters that can be understood by web browsers/web servers for the purpose of addressing... hence URL. For instance, spaces are replaced with %20, ' with %27, and so on.
See these references:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
http://www.degraeve.com/reference/urlencoding.php
HtmlEncode replaces special characters with character strings that are recognised by the HTML engine itself when rendering the content of the page: for example, & becomes &amp;, < becomes &lt;, and > becomes &gt;. This prevents the HTML engine from interpreting these characters as parts of the HTML markup, so they are rendered as plain text.
See this reference:
http://msdn.microsoft.com/en-us/library/ms525347.aspx
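To see the two encodings side by side, here is a short sketch using Python's standard library as a stand-in for the HtmlEncode and urlEncode functions discussed in this thread. html.escape and urllib.parse.quote_plus are the Python counterparts chosen for this illustration; the wrapper names are hypothetical, and other platforms may differ in details such as which punctuation they leave unescaped.

```python
import html
from urllib.parse import quote_plus


def html_encode(text: str) -> str:
    """Escape characters that HTML would otherwise treat as markup."""
    return html.escape(text)


def url_encode(text: str) -> str:
    """Escape characters for use as a value in a URL query string."""
    return quote_plus(text)


if __name__ == "__main__":
    print(html_encode("<hello>world</hello>"))
    print(url_encode("hello+world = hello world"))
```

The same input gives very different output under the two schemes, which is why each has to be applied in its own context: HTML encoding for text placed in a page, URL encoding for values placed in a URL.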
Both HTML and URLs are essentially very constrained languages. As languages, they give meaning to specific keywords or operators. For both of these languages, though, keywords are almost always single characters. For example
HTML: > and <
URL: / and :
When using each language, though, it is possible to use these characters in a way that is not meant to carry their special meaning. For instance, this post contains a > character. I do not want it to be interpreted as HTML, just text.
This is where Encode and Decode methods come into play. These methods will respectively take a string and convert any of the characters that would otherwise be treated as keywords into an escaped form which will not be interpreted as part of the language.
For instance: Passing > into HtmlEncode will return &gt;
HTMLEncode and URLEncode deal with invalid characters in HTML and URLs, or more accurately, characters that need to be specially written to be interpreted correctly. For example, in HTML the < and > characters are used to indicate tags. Thus, if you wanted to write a math formula, something like 1+1 < 2+2, the '<' would normally be interpreted as the beginning of a tag. HTMLEncoding turns this character into "&lt;", which is the encoded representation of the less-than sign. URLEncoding does the same, but for URLs, for which the special characters are different, although there is some overlap.
I don't know what language you are working in, but the PHP manual for example provides good explanations.
URLEncode
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits and
spaces encoded as plus (+) signs. It
is encoded the same way that the
posted data from a WWW form is
encoded, that is the same way as in
application/x-www-form-urlencoded
media type. This differs from the
RFC 1738 encoding (see rawurlencode())
in that for historical reasons, spaces
are encoded as plus (+) signs.
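The distinction between the two PHP functions has a close analogue in Python, shown here only as an illustration: urllib.parse.quote_plus encodes spaces as plus signs in the form-encoding style, while urllib.parse.quote uses %20. The wrapper names are hypothetical, and the exact set of characters left unescaped is not identical between PHP and Python.

```python
from urllib.parse import quote, quote_plus


def form_style(value: str) -> str:
    """Form-style encoding: spaces become '+'."""
    return quote_plus(value)


def percent_style(value: str) -> str:
    """Percent-only encoding: spaces become %20."""
    return quote(value, safe="")


if __name__ == "__main__":
    print(form_style("hello world & more"))
    print(percent_style("hello world & more"))
```

Form-style encoding is the usual choice for query strings and posted form data, while the percent-only style is safer for path segments, where a literal "+" is not read as a space.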