This time I have one question: how do I search through the YouTube API?
English search terms work perfectly, but other languages (Arabic, Korean, etc.) don't... T_T
http://gdata.youtube.com/feeds/api/videos?q=(SEARCH_WORD)&start-index=1&max-results=3&v=2
=> my access code...
I'd like to search in Arabic or Korean. Please comment with anything that might help.
I need your help, guys.
Have a nice day!
Please try
http://gdata.youtube.com/feeds/api/videos?q=%EA%B0%95%EB%82%A8%EC%8A%A4%ED%83%80%EC%9D%BC&start-index=1&max-results=3&v=2
or
http://gdata.youtube.com/feeds/api/videos?q=\uac15\ub0a8\uc2a4\ud0c0\uc77c&start-index=1&max-results=3&v=2
in your browser.
Cheers
The more general concept underlying the other answer is that the q parameter (all parameters, really) accepts only valid Unicode characters; if the string you're searching on is not Unicode (i.e., it is in some other character encoding), its code points will be misinterpreted as Unicode, and the search effectively runs on random characters (returning no results).
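As a concrete illustration in Ruby (the feed URL is the one from the question; the search term is just an example), you can percent-encode the UTF-8 query before building the request:

require 'uri'

query   = "강남스타일"   # a UTF-8 search term
encoded = URI.encode_www_form_component(query)
# => "%EA%B0%95%EB%82%A8%EC%8A%A4%ED%83%80%EC%9D%BC"
url = "http://gdata.youtube.com/feeds/api/videos?q=#{encoded}&start-index=1&max-results=3&v=2"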
We are using the Google Search Appliance product in our application. We have added non-English characters in Frontend -> KeyMatch. When we search from our site, a "no results found" error page is displayed.
Kindly suggest how we can fix this issue.
I guess by "non-English characters" you mean accented characters.
GSA can return KeyMatch results for query terms with accented characters if the query is encoded properly. Encode your query and add the ie request parameter with the appropriate encoding value.
You can read more about ie and other request parameters here.
It would have helped if you had included a sample KeyMatch entry as you configured it; that would have made it easier to assist you.
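For example, a UTF-8 encoded query with the ie (input encoding) and oe (output encoding) parameters set accordingly might look like this (hostname, collection, and frontend names are placeholders):

http://gsa.example.com/search?q=caf%C3%A9&ie=UTF-8&oe=UTF-8&site=default_collection&client=default_frontend&output=xml_no_dtd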
I am facing an issue in one of my Rails projects.
My users table contains names with special characters, and I want those users to show up in search results when the search is done with plain characters.
Example: suppose I have a user named "Noël Nocciolo" (please note the diaeresis on the e), and I want that user to be found if I pass "Noel Nocciolo" as a parameter.
Can anyone tell me how to handle these cases, because nobody knows how to type an "e with two dots"?
I am using Postgres as my database.
Regards,
Karan
You can create a separate field, "indexed_name", for searching and fill it with ASCII characters only.
Then preprocess the query string with .gsub('ë', 'e') (and map any other non-ASCII character to its ASCII analog) and search with the processed query.
I believe there is a more elegant way to convert an arbitrary string to its ASCII analog; I've just given you the direction.
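A rough Rails sketch of that approach (this assumes you add an indexed_name column; the model and method names are just for illustration, and transliterate comes from ActiveSupport):

class User < ApplicationRecord
  before_save :set_indexed_name

  # ASCII-fold the incoming query the same way the stored column was folded.
  def self.search_by_name(query)
    folded = ActiveSupport::Inflector.transliterate(query).downcase
    where("indexed_name ILIKE ?", "%#{folded}%")
  end

  private

  def set_indexed_name
    self.indexed_name = ActiveSupport::Inflector.transliterate(name).downcase
  end
end

With that in place, User.search_by_name("Noel Nocciolo") should match a user stored as "Noël Nocciolo".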
.parameterize or ActiveSupport::Inflector.transliterate will probably be acceptable for your use case.
"àáâãäå".parameterize
=> "aaaaaa"
However, it won't handle ligatures such as ﬃ, so for that you'll need:
"àáâãÀﬃ".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/,'').to_s
=> "aaaaAffi"
In developing an iOS app containing a Twitter client, I must allow for user-generated hashtags (which may be created elsewhere within the app, not just in the tweet body).
I would like to ensure any such hashtags are valid for twitter, so I would like to error check the entered value for invalid characters. Bear in mind that users may be from non-English speaking countries.
I am aware of the usual limitations, such as not beginning a hashtag with a number, and no special punctuation characters, but I was wondering if there is a known list of all additional characters that are technically allowed within hashtags (i.e. international characters).
Karl, as you've rightly pointed out, any word in any language can be a valid Twitter hashtag (as long as it meets a number of basic criteria). As such, what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for Twitter.
I believe what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependent on your locale and would match all characters in modern typography that can appear as part of a word.
You didn't specify what language you are writing your app in, so I can't help you with a language specific implementation. However, the basic approach would be as follows:
Check whether the bracket expressions or character classes in your language already support Unicode character ranges. If yes, use them.
Check whether there is a regex modifier that can enable Unicode character range support in your language.
Most modern languages implement regular expressions in a fairly similar way, and a lot of them borrow heavily from Perl, so I hope the following two examples will put you on the right track:
Perl:
Use POSIX bracket expressions (e.g. [[:alpha:]], [[:alnum:]], [[:digit:]], etc.) as they give you greater control over the characters you want to match, compared to character classes (e.g. \w).
Use the /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; hence, for example, \w will match any of the more than 100,000 word characters in Unicode.
See Perl documentation for more info:
http://perldoc.perl.org/perlre.html#Character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
Ruby:
Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
See Ruby documentation for more info:
http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Classes
Examples:
Given a list of hashtags, the following regex will match all hashtags that start with a word character (including international word characters) followed by at least one more word character, number, or underscore:
m/^#[[:alpha:]][[:alnum:]_]+$/u # Perl
/^#[[:alpha:]][[:alnum:]_]+$/ # Ruby
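For instance, here is a quick Ruby sketch applying the Ruby version of that regex (the hashtag strings are made up):

hashtags = ["#Trending", "#강남스타일", "#1direction", "#foo[bar"]
valid = hashtags.select { |tag| tag =~ /^#[[:alpha:]][[:alnum:]_]+$/ }
# => ["#Trending", "#강남스타일"]   (the one starting with a digit and the one with "[" are rejected)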
Twitter allows letters, numbers, and underscores.
I checked this by generating tweets via their API. For example, tweeting
Hash tag test #foo[bar
resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.
Well, for starters you can't use a # in the hashtag (##hash).
The guidelines below are being quoted from Twitter's help center:
People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
Hashtagged words that become very popular are often Trending Topics.
Example: In the Tweet below, @eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.
Using hashtags correctly:
If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
Use hashtags only on Tweets relevant to the topic.
Just want to add that, in addition to alphanumeric characters and underscores, you can apparently use the katakana-hiragana prolonged sound mark (ー, which looks like a dash) in a Twitter hashtag, as in #COVIDー19.
Only letters and numbers are allowed to be part of a hashtag. If any other character appears after the leading # and the letters or numbers that follow it, the hashtag will be cut off at that point.
I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.
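A rough Ruby sketch of that cut-off behavior (using the letters-and-numbers rule described above; adapt the character class if you also want to allow underscores):

def hashtag_portion(body)
  body[/\A[[:alnum:]]+/].to_s   # keep everything up to the first character that isn't a letter or number
end

hashtag_portion("foo[bar")   # => "foo"  (Twitter would cut the hashtag off here)
hashtag_portion("COVID19")   # => "COVID19"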
I had the same issue when implementing this in golang.
It seems the characters allowed by [[:alpha:]] cover only the English alphabet, so this syntax could not be used for characters from other languages.
Instead, I could use \p{L} for this purpose.
My test with \p{L} is here.
* Arabic, Hebrew, Hindi, etc. are not confirmed yet.
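For what it's worth, the same \p{L} / \p{N} syntax is also supported by Ruby's regex engine; a quick sketch (not the golang code, just to show the character classes):

/\A#\p{L}[\p{L}\p{N}_]*\z/.match?("#강남스타일")   # => true
/\A#\p{L}[\p{L}\p{N}_]*\z/.match?("#foo[bar")     # => false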
Is there a library where I can simply call a method on a string to find out if it is non-English? I'm trying to save only English strings, and the incoming stream of strings has plenty of non-English ones in it.
You can try to use linguo.
"your string".lang
# will return "en" for english strings
Disclaimer: I'm the creator of this gem.
You can use the Google Translate API with the RailsBridge for it - http://code.google.com/apis/gdata/articles/gdata_on_rails.html
Not that I'm aware of... but you could load this list into an array (http://www.langmaker.com/wordlist/basiclex.htm) and then match the string's words against it. Decide on some percentage as "good enough" and go from there.
You could even use a Bayesian algorithm here to mark those words as "good" and learn from there, but that might be overkill.
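A rough Ruby sketch of the word-list idea (the file name and the 80% threshold are arbitrary assumptions; the list is one word per line):

require 'set'

english_words = File.readlines("basic_english_words.txt").map { |w| w.strip.downcase }.to_set

def mostly_english?(str, wordlist, threshold = 0.8)
  words = str.downcase.scan(/[a-z']+/)
  return false if words.empty?
  hits = words.count { |w| wordlist.include?(w) }
  hits.fdiv(words.size) >= threshold
end

mostly_english?("the cat sat on the mat", english_words)  # => true, assuming those words are in the list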
Why do you need to encode URLs? Is there a good reason why you have to change every space in the GET data to %20?
Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
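To illustrate in Ruby (the key/value pairs are made up), the standard form-encoding helper escapes the ampersand inside a value so it can't be mistaken for a pair separator:

require 'uri'

URI.encode_www_form(q: "fish & chips", page: "1")
# => "q=fish+%26+chips&page=1"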
From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.
Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, i.e.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
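If you're building such a link in Ruby, for example, ERB::Util.url_encode gives you the %20 form:

require 'erb'

ERB::Util.url_encode("funny cat plays piano")
# => "funny%20cat%20plays%20piano"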
Let's break down your question.
Why do you need to encode a URL?
A URL is composed of only a limited set of characters: digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
So does that mean we cannot use any other characters?
The answer to this question is "yes". But wait a minute, there is a hack, and the hack is URL encoding, or percent-encoding. So if you want to transmit any character that is not a member of the set mentioned above (digits, letters, and those special characters), then you need to encode it. And that is why we need to encode a space as "%20".
OK, so is that all there is to URL encoding? No, there is a lot more to it, but I'm not going to turn this into a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (credit goes to that writer).
Well, you do so because every browser needs to know how the string that makes up the URL is encoded. Converting the space to %20, etc., makes that URL/URI portable. The original string could be Latin-1 or it could be Unicode; it needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1