Does AWS Lex support emoji conversations & intent ? - amazon-lex

I've tried building an intent that targets :thumbsup: and also its Unicode representation, U+1F44D.
How can I build an intent around emojis?

Unfortunately :thumbsup: and U+1F44D will be invalid.
An utterance can consist only of Unicode characters, spaces, and valid
punctuation marks. Valid punctuation marks are: periods for
abbreviations, underscores, apostrophes, and hyphens.
You have to handle the emoji before sending the text to Lex. For example, if you receive :thumbsup:, send thumbsup to Lex instead, and it can match that word against an intent.
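The preprocessing step described above could look roughly like the following sketch. The mapping table and the function name normalize_utterance are hypothetical; only the idea of replacing the emoji with a plain word before calling Lex comes from the answer.

```python
# Hypothetical mapping from emoji (or :shortcode: text) to plain words
# that the bot's sample utterances would contain.
EMOJI_TO_WORD = {
    "\U0001F44D": "thumbsup",   # the emoji character itself (U+1F44D)
    ":thumbsup:": "thumbsup",   # the shortcode form some chat clients send
}


def normalize_utterance(text):
    """Replace known emoji with words and collapse extra whitespace."""
    for token, word in EMOJI_TO_WORD.items():
        text = text.replace(token, f" {word} ")
    return " ".join(text.split())


cleaned = normalize_utterance("sounds good \U0001F44D")
print(cleaned)
```

The cleaned string is what you would then pass as the input text of your Lex runtime call; unknown emoji are left untouched here, so you may want to strip or map those as well.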

Related

MALLET default token not remove bracket

In Java Mallet, the default token should be one or more characters in [A-Za-z] according to their website. However, when I have a text such as:
lower(location select testing) top
It treats "lower(location" as one word, but according to the documentation the default token should consist of letters only. How can I deal with this situation?
The documentation had not been updated for the most recent version of Mallet, thank you for pointing this out. Here's a current version:
As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which is valid for all Unicode letters, and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms. Note that this expression also implicitly drops one- and two-letter words. Other options include:
For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.
To include short words, use \p{L}+ (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters possibly including punctuation).
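To see why "lower(location" survives as a single token, here is a small Python emulation of the two expressions quoted above. Python's built-in re module does not support \p{L} or \p{P}, so this sketch uses unicodedata categories instead; the function names are hypothetical, and it assumes the pattern is applied as successive non-overlapping matches.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")   # like \p{L}


def is_punct(ch):
    return unicodedata.category(ch).startswith("P")   # like \p{P}


def default_style_tokens(text):
    """Emulates \\p{L}[\\p{L}\\p{P}]+\\p{L}: letter, 1+ letters/punctuation, letter."""
    tokens = []
    for keep, group in groupby(text, key=lambda c: is_letter(c) or is_punct(c)):
        if not keep:
            continue
        run = "".join(group)
        letters = [i for i, c in enumerate(run) if is_letter(c)]
        # A match needs a first and a last letter at least two positions apart.
        if letters and letters[-1] - letters[0] >= 2:
            tokens.append(run[letters[0]:letters[-1] + 1])
    return tokens


def letter_tokens(text):
    """Emulates \\p{L}+: maximal runs of letters only."""
    return ["".join(g) for keep, g in groupby(text, key=is_letter) if keep]


sample = "lower(location select testing) top"
print(default_style_tokens(sample))
print(letter_tokens(sample))
```

Because "(" is a punctuation character, the default expression accepts it inside a token, while the letters-only expression splits on it.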

What characters are allowed in twitter hashtags?

In developing an iOS app containing a twitter client, I must allow for user generated hashtags (which may be created elsewhere within the app, not just in the tweet body).
I would like to ensure any such hashtags are valid for twitter, so I would like to error check the entered value for invalid characters. Bear in mind that users may be from non-English speaking countries.
I am aware of the usual limitations, such as not beginning a hashtag with a number, and no special punctuation characters, but I was wondering if there is a known list of all additional characters that are technically allowed within hashtags (i.e. international characters).
Karl, as you've rightly pointed out, any word in any language can be a valid twitter hashtag (as long as it meets a number of basic criteria). As such what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for twitter.
I believe what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependent on your locale and would match all characters in modern typography that can appear as part of a word.
You didn't specify what language you are writing your app in, so I can't help you with a language specific implementation. However, the basic approach would be as follows:
Check if any of the bracket expressions or character classes already support Unicode character ranges in your language. If yes, then use them.
Check if there is a regex modifier that can enable Unicode character range support for your language.
Most modern languages implement regular expressions in a fairly similar way, and a lot of them borrow heavily from Perl, so I hope the following two examples will put you on the right track:
Perl:
Use POSIX bracket expressions (e.g. [[:alpha:]], [[:alnum:]], [[:digit:]], etc.) as they give you greater control over the characters you want to match, compared to character classes (e.g. \w).
Use /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; and hence, for example, \w will match any of the more than 100,000 word characters in Unicode.
See Perl documentation for more info:
http://perldoc.perl.org/perlre.html#Character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
Ruby:
Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
See Ruby documentation for more info:
http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Classes
Examples:
Given a list of hashtags, the following regex will match all hashtags that start with a word character (inc. international word characters) followed by at least one other word character, a number or an underscore:
m/^#[[:alpha:]][[:alnum:]_]+$/u # Perl
/^#[[:alpha:]][[:alnum:]_]+$/ # Ruby
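For comparison, a rough Python counterpart of the two patterns above might look like this. Python's re module has no POSIX bracket expressions, so the sketch assumes [^\W\d_] as a stand-in for [[:alpha:]] and \w for [[:alnum:]_]; these are close approximations rather than exact equivalents, and is_valid_hashtag is a hypothetical helper name.

```python
import re

# "#", then one letter, then at least one more word character.
# In Python 3, \w is Unicode-aware for str patterns by default.
HASHTAG_RE = re.compile(r"#[^\W\d_]\w+")


def is_valid_hashtag(tag):
    """Return True if the whole string looks like a hashtag under this pattern."""
    return HASHTAG_RE.fullmatch(tag) is not None


for candidate in ["#caf\u00e9", "#\u65e5\u672c\u8a9e", "#1abc", "#foo[bar"]:
    print(candidate, is_valid_hashtag(candidate))
```

As with the Perl and Ruby versions, this checks the shape of the string only; it does not guarantee that Twitter will treat it identically.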
Twitter allows letters, numbers, and underscores.
I checked this by generating tweets via their API. For example, tweeting
Hash tag test #foo[bar
resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.
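The cut-off behaviour observed in that test can be approximated as follows. This is only a sketch of the observation (letters, numbers, and underscores), not Twitter's actual parser, and recognized_hashtags is a hypothetical name.

```python
import re


def recognized_hashtags(tweet):
    """Return the parts of the text that would be treated as hashtags,
    assuming a hashtag is '#' followed by word characters only."""
    return re.findall(r"#\w+", tweet)


print(recognized_hashtags("Hash tag test #foo[bar"))
```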
Well, for starters you can't use a # in the hashtag (##hash).
The guidelines below are being quoted from Twitter's help center:
People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
Hashtagged words that become very popular are often Trending Topics.
Example: In the Tweet below, #eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.
Using hashtags correctly:
If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
Use hashtags only on Tweets relevant to the topic.
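Just want to add that, in addition to alphanumeric characters and underscores, you can apparently use a dash-like character in a Twitter hashtag, as in #COVIDー19. Note that the character in this example is not a true em dash (U+2014) but the Katakana-Hiragana prolonged sound mark (U+30FC), which Unicode classifies as a letter.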
Only letters and numbers are allowed to be part of a hashtag. If a character other than these follows the leading # and a letter or number, the hashtag will be cut off at this point.
I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.
I had the same issue when implementing this in Go.
It seems that [[:alpha:]] matches only the English (ASCII) alphabet there, so this syntax cannot be used for characters from other languages.
Instead, I could use \p{L} for this purpose.
My test with \p{L} is here.
* Arabic, Hebrew, Hindi, etc. are not confirmed yet.
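One way to check the unconfirmed scripts is to look at Unicode general categories directly. The sketch below is in Python rather than Go and uses unicodedata as a stand-in for \p{L} and \p{M}; the helper names are hypothetical. It suggests that Arabic and Hebrew words consist of letters only, while a Hindi word can also contain combining marks, so a letters-only class would not cover it.

```python
import unicodedata


def all_letters(word):
    """True if every character is in a Unicode letter category (like \\p{L})."""
    return all(unicodedata.category(ch).startswith("L") for ch in word)


def letters_or_marks(word):
    """True if every character is a letter or a combining mark (like [\\p{L}\\p{M}])."""
    return all(unicodedata.category(ch)[0] in ("L", "M") for ch in word)


arabic = "\u0645\u0631\u062d\u0628\u0627"   # Arabic greeting ("marhaba")
hebrew = "\u05e9\u05dc\u05d5\u05dd"         # Hebrew greeting ("shalom")
hindi = "\u0939\u093f\u0902\u0926\u0940"    # the word "Hindi" in Devanagari

for word in (arabic, hebrew, hindi):
    print(word, all_letters(word), letters_or_marks(word))
```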

IOS Localizing Push Notifications

When I was prototyping push notifications, I had a PHP script that pushed a non-localized string to the iPhone, and there I could also include emoji symbols.
I have now made a web service using ASP.NET, and I localize the push notifications on the iOS side.
But I can't seem to get emoji symbols to work now. I've tried every combination of Unicode escape sequences in the localized string, and also tried sending the emoji as an argument in the push notification and having it included in the localized string with "%#" and "%C", but with no luck.
I'm stuck at the moment, so any tip to put me back on track is very much appreciated.
Stefan
According to the JSON RFC, characters that are not part of the "Basic Multilingual Plane" can be escaped using a UTF-16 surrogate pair.
For example the emoji symbol "THUMBS UP SIGN" is Unicode codepoint U+1F44D, and the UTF-16 surrogate pair for this is
0xD83D, 0xDC4D
The JSON Unicode escape sequence would be
\ud83d\udc4d
If you include this in the alert part of the push notification payload, the symbol will be correctly displayed (I have tested it).
But you can also use UTF-8, for the "THUMBS UP SIGN" this would be the bytes
0xF0, 0x9F, 0x91, 0x8D
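The numbers above can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic, not server code from the answer; the helper utf16_surrogates and the example alert text are assumptions, while the "aps"/"alert" keys follow the usual push payload layout.

```python
import json


def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


thumbs_up = chr(0x1F44D)
high, low = utf16_surrogates(0x1F44D)
print(hex(high), hex(low))

# json.dumps escapes non-ASCII characters by default, producing the
# surrogate-pair escape sequence inside the JSON text.
payload = json.dumps({"aps": {"alert": "Nice " + thumbs_up}})
print(payload)

# The same character as raw UTF-8 bytes.
utf8_bytes = thumbs_up.encode("utf-8")
print([hex(b) for b in utf8_bytes])
```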

Are Latin encoded characters considered URL safe?

Having read this post, I'm aware that web safe characters are outlined in this document. The specs do not make clear, however, if Latin encoded characters are part of the unreserved list. For example: ç and õ.
I don't see why those characters would not be included in the unreserved list. But, that said, I'm yet to see any URLs that contain such characters.
Relevant question: Assuming I can use such characters in my URL, should I?
My URLs will be generated from user input. Should I keep titles with such characters, or substitute them? For example, ç becomes c, and so on.
My readers' native language is Portuguese, but I'm not sure if they will care about these characters in the page's friendly URL.
The RFC you linked specifically mentions ASCII as the character set for URIs:
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII].
That would make characters outside of ASCII not safe, as far as the RFC is concerned.
Of course, this is all before IDN existed. There is an RFC that specifies how conversions between ASCII and Unicode on the URL should occur.
You can use any characters you want: any character outside the ASCII range is percent-encoded, that is, its bytes are written as %XX octets, which makes the URI transportable.
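Both options from the question, percent-encoding the characters or substituting them, can be sketched in Python. The sample title and the strip_accents helper are assumptions for illustration; quote and unquote come from the standard urllib.parse module and use UTF-8 by default.

```python
import unicodedata
from urllib.parse import quote, unquote


def strip_accents(text):
    """Decompose accented letters and drop the combining marks (e.g. ç becomes c)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


title = "cora\u00e7\u00e3o"      # Portuguese word with ç and ã

encoded = quote(title)           # option 1: keep the characters, percent-encoded
slug = strip_accents(title)      # option 2: substitute plain ASCII letters

print(encoded)
print(unquote(encoded) == title)
print(slug)
```

The percent-encoded form round-trips to the original title, whereas the substituted slug loses the accents permanently, which is the trade-off the question is asking about.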

Encoding minimum characters in POST request: is it safe or not?

I came across an approach to encode just the following 4 characters in the POST parameter's value: # ; & +. What problems can it cause, if any?
Personally I dislike such hacks. The reason why I'm asking about this one is that I have an argument with its inventor.
Update. To clarify, this question is about encoding parameters in the POST body and not about escaping POST parameters on the server side, e. g. before feeding them into shell, database, HTML page or whatever.
From rfc1738 (if you're using application/x-www-form-urlencoded encoding to transfer data):
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs. The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.
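A concrete problem with encoding only # ; & + is that "%" itself stays unencoded, so a value that already contains something resembling a percent escape is changed when the server decodes the body. The sketch below assumes the body is parsed as application/x-www-form-urlencoded; minimal_encode is a hypothetical reconstruction of the four-character scheme from the question.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical version of the scheme in the question: only these four
# characters are percent-encoded, everything else is sent as-is.
MINIMAL_TABLE = {"#": "%23", ";": "%3B", "&": "%26", "+": "%2B"}


def minimal_encode(value):
    return "".join(MINIMAL_TABLE.get(ch, ch) for ch in value)


original = "a%26b"   # literal text typed by the user, including the percent sign

# The four-character scheme leaves "%" alone, so the server decodes %26 to "&".
body_minimal = "v=" + minimal_encode(original)
decoded_minimal = parse_qs(body_minimal)["v"][0]

# Full form encoding escapes "%" as %25 and round-trips correctly.
body_full = urlencode({"v": original})
decoded_full = parse_qs(body_full)["v"][0]

print(decoded_minimal)
print(decoded_full)
```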
Escaping metacharacters is usually (always?) done to prevent injection attacks. Different systems have different metacharacters, so each needs its own way of preventing injections. Different systems have different ways of escaping characters. Some systems don't need to escape characters, since they have different channels for control and data (e.g. prepared statements). Additionally, the filtering is usually best performed when the data is introduced to a system.
The biggest problem is that escaping only those four characters won't provide complete protection. SQL, HTML and shell injection attacks are still possible after filtering the four characters you mention.
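Consider this: $sql = "DELETE FROM articles WHERE id='" . $_POST['id'] . "';";
And you enter in the form: 1' OR '10
It then becomes this: $sql = "DELETE FROM articles WHERE id='1' OR '10';";
None of the characters in that input is among the four being encoded, and the resulting condition can match every row.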
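The injection above, and the prepared-statement alternative mentioned earlier in this answer, can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the articles table layout and the helper names are made up for the demonstration, and other databases may evaluate the injected condition differently.

```python
import sqlite3


def make_db():
    """Create an in-memory database with three example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO articles (id, title) VALUES (?, ?)",
        [(1, "a"), (2, "b"), (3, "c")],
    )
    return conn


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]


user_input = "1' OR '10"   # contains none of # ; & +

# Unsafe: the input is pasted into the SQL text and changes the condition.
unsafe = make_db()
unsafe.execute("DELETE FROM articles WHERE id='" + user_input + "'")
remaining_unsafe = count_rows(unsafe)

# Safer: the input travels as a bound parameter and is treated as data only.
safe = make_db()
safe.execute("DELETE FROM articles WHERE id = ?", (user_input,))
remaining_safe = count_rows(safe)

print(remaining_unsafe, remaining_safe)
```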
