What characters are allowed in twitter hashtags? - twitter

In developing an iOS app containing a twitter client, I must allow for user generated hashtags (which may be created elsewhere within the app, not just in the tweet body).
I would like to ensure any such hashtags are valid for twitter, so I would like to error check the entered value for invalid characters. Bear in mind that users may be from non-English speaking countries.
I am aware of the usual limitations, such as not beginning a hashtag with a number, and no special punctuation characters, but I was wondering if there is a known list of all additional characters that are technically allowed within hashtags (i.e. international characters).

Karl, as you've rightly pointed out, any word in any language can be a valid twitter hashtag (as long as it meets a number of basic criteria). As such what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for twitter.
I believe, what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependant on your locale and would match all characters in the modern typography that can appear as part of a word.
You didn't specify what language you are writing your app in, so I can't help you with a language specific implementation. However, the basic approach would be as follows:
Check if any of the bracket expressions or character classes already support Unicode character ranges in your language. If yes, then use them.
Check if there is regex modifier that can enable Unicode character range support for your language.
Most modern languages implement regular expressions in a fairly similar way and a lot of them borrow heavily from Perl, so I hope the following two example will put you on the right track:
Perl:
Use POSIX bracket expressions (eg: [[:alpha:]], [[:allnum:]], [[:digit:]], etc) as they give you greater control over the characters you want to match, compared to character classes (eg: \w).
Use /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; and hence, for example, \w will match any of the more than 100,000 word characters in Unicode.
See Perl documentation for more info:
http://perldoc.perl.org/perlre.html#Character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
Ruby:
Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
See Ruby documentation for more info:
http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Classes
Examples:
Given a list of hashtags, the following regex will match all hashtags that start with a word character (inc. international word characters) followed by at least one other word character, a number or an underscore:
m/^#[[:alpha:]][[:alnum:]_]+$/u # Perl
/^#[[:alpha:]][[:alnum:]_]+$/ # Ruby

Twitter allows letters, numbers, and underscores.
I checked this by generating tweets via their API. For example, tweeting
Hash tag test #foo[bar
resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.

Well, for starters you can't use a # in the hashtag (##hash).
The guidelines below are being quoted from Twitter's help center:
People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
Hashtagged words that become very popular are often Trending Topics.
Example: In the Tweet below, #eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.
Using hashtags correctly:
If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
Use hashtags only on Tweets relevant to the topic.

Just want to add that in addition to alphanumeric characters and underscore, you can apparently use em dash in a Twitter hashtag like #COVIDー19.

Only letters and numbers are allowed to be part of a hashtag. If a character other than these follows the leading # and a letter or number, the hashtag will be cut off at this point.
I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.

I had the same issue to implement in golang.
It seems allowed chars with [[:alpha:]] is only English-alphabet and could not use this syntax for other language characters.
Instead, I could use \p{L} for this purpose.
My test with \p{L} is here.
* Arabic, Hebrew, Hindi...etc is not confirmed yet.

Related

If it is valid that Wikipedia uses Chinese characters (and other unicode characters) in URL

On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes
https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81)
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is because I would like to put let's say words/definitions of a few different languages onto a webpage, and I would like to make the url show the actual word used in the language. So in english it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to make it into the encoding thing.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
Strings of data octets within a URI are represented as characters. *Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde.[14] Octets represented by any other character must be percent-encoded.
Not all Unicode characters can be used in URIs. Characters that aren't supported can still be encoded using Percent Encoding. You can see the non-ascii characters in the URL field because your browser chooses to display them that way, the actual HTTP requests are done using the encoded strings.

How can I a add arabic character in my topic url in blogger?

I tried to use a Custom permanent link to my blog. But I can't use arabic character. I tried also to use utf-8 code but the "%" is invalid in blogger. What can I do please?
www.online.com/يكتمل
www.online.com/%D9%8A%D9%83%D8%AA%D9%85%%D9%84
thanks in advance
It is currently not possible to achieve what you want to do. According to the blog post on the official blog -
At present, the characters allowed in a custom URL are limited to:
a-z, A-Z, 0-1. The only special characters available are underscore,
dash, and period.
The Labels field has no such limitations and allows for using encoded characters (for example - https://tests-blogger.blogspot.com/search/label/%D9%8A%D9%83%D8%AA%D9%85%D9%84)

MALLET default token not remove bracket

In Java Mallet, the default token should be one or more characters in [A-Za-z] according to their website. However, when I have a text such as:
lower(location select testing) top
It thinks "lower(location" is one word. But default token should be all letter words. How can I deal with this situation?
The documentation had not been updated for the most recent version of Mallet, thank you for pointing this out. Here's a current version:
As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which is valid for all Unicode letters, and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms. Note that this expression also implicitly drops one- and two-letter words. Other options include:
For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.
To include short words, use \p{L}+ (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters possibly including punctuation).

Regular expression for hashtag text with multi language support

I have a texts like #sample_123 , #123_sample , #_sample123 so i have to use regular expression to check the text contains only alphanumeric and underscore and also i want to support multi languages.
Currently i am using regular expression like (#)([:alpha:]+) but it detects only #sample( eg: #sample_123). So, Can any one please suggest the correct regular expression to fix out this problem.
you can use:
^#(\d|\w|_)+$
Debuggex Demo
This would validate any words that start with an hash and contains only alpha numeric characters or underscore. Of course there are no restrictions on how many characters after the hashtag there should be, so for example, a hashtag like #_ is considered valid, if this is not the wanted behavior please be more detailed on the constraints you want.

Is it necessary to use — and – in XHTML or HTML5?

It seems that it is best to use the & escape, instead of simply typing the ampersand (&).
However, should we be using X/HTML character entity references for dashes and other common typographical characters when writing blog posts on CMSs like WordPress or hard-coding websites by hand?
For example:
– is an en dash (–)
— is an em dash (—)
What is the risk if we do not?
Why is the hyphen (-) never written as - but simply typed directly from the keyboard in HTML? (Assuming that it the hyphen, and not a minus sign.)
The W3C released an official response about when to use and when not to use character escapes which you can find here. As they are also the group that is in charge of the HTML specification, I think it's best to follow their advice.
From the section "When to Use Escapes"
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.
< (<)
> (>)
& (&)
They also mention using characters that might not be supported in the current encoding.
From the section "When Not to Use Escapes"
It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.
http://www.w3.org/International/questions/qa-escapes
Those entities are there to help you, the author, with characters not usually typable on your average keyboard. (The em dash is an example —, as well as © and ).
You only need to escape those characters that have meaning in (X)HTML < > and &.

Resources