Passing tokenized strings to Google Translate API - parsing

I've written a program which translates strings by passing them to Google's Cloud Translate API. The majority of these strings are simple, as in:
I am a string! Please translate me!
But some of them are tokenized, as in:
Page {1} of {2}
What is the correct way to pass these tokenized strings to the API? If I pass in the tokenized strings as they are, bugs start to creep in. For example, if I pass in Page {1} of {2} and ask Google to translate it into Italian, it returns:
Pagina 1 di 2}
Is there a recommended method for passing such tokenized strings to Google's API? If there is such a method, would it stop bugs such as the above from cropping up?
Note: I am aware of the following workaround: Page <span>{1}</span> of <span>{2}</span>. However, since the strings I'm working with sometimes include snippets of HTML, that doesn't look like a viable long-term strategy.
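For reference, here is a minimal sketch of the kind of call involved, assuming the google-cloud-translate v2 Python client (credential setup via GOOGLE_APPLICATION_CREDENTIALS omitted):
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("Page {1} of {2}", target_language="it")
print(result["translatedText"])
# Observed behaviour as described above: the placeholder braces come back
# mangled, e.g. "Pagina 1 di 2}"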

Related

Convert ID from speech to text without spaces

I'm using the Google Cloud Speech API with IBM Voice Gateway in order to interact with a VoiceBot over the phone.
If I say an identifier containing letters and numbers through the phone, Google Cloud Speech converts it into a string with spaces. For example, if I say "A1B2C3", it will convert it into the following string: "a 1 b 2 c 3".
Do you know if there is a way to avoid these unwanted spaces?
Thanks for your help!
Lucas
I don't see any way in which you can eliminate the spaces from the API response itself. What you can do is experiment with the available features, as this is probably your best chance of getting recognition output closer to what you are looking for.
For example, you can provide some hint phrases echoing your use case, indicate that the audio is a phone call, or use an enhanced model (although for the latter to be available you first need to opt in to data logging).
Honestly though, for your case it might be better to post-process the returned string (e.g. with a simple "a 1 b 2 c 3".replace(' ', '') ).
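As a rough sketch of that post-processing idea (plain Python, assuming the transcript arrives as a single string; the hint-phrase and model settings mentioned above live in the recognition request and are not shown here):
import re

def squash_spelled_out_id(transcript):
    # Collapse runs of single alphanumeric characters that the recognizer
    # separated with spaces: "a 1 b 2 c 3" -> "A1B2C3".
    # Longer words in the transcript are left untouched.
    return re.sub(r'\b\w(?: \w\b)+',
                  lambda m: m.group(0).replace(' ', '').upper(),
                  transcript)

squash_spelled_out_id("my reference is a 1 b 2 c 3 thanks")
# => "my reference is A1B2C3 thanks"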

Searching soft signs with normal character

I am facing an issue in one of my Rails projects.
My users database contains names with special characters, and I want them to show up in search results when searching with plain characters.
Example: let's suppose I have a user whose name is "Noël Nocciolo" (note the soft sign on the e), and I want that record to be found if I pass "Noel Nocciolo" as a parameter.
Can anyone tell me how to handle these cases, because nobody knows how to type an "e with two dots"?
And I am using Postgres as my database.
Regards,
Karan
You can create a separate field, "indexed_name", for searching and fill it with ASCII characters only.
Then you have to preprocess the query string with .gsub('ë', 'e') (and likewise map any other non-ASCII character to its ASCII analog) and search with this processed query.
I believe there is a more elegant way to convert any string to its ASCII analog; I have just given you the general direction.
.parameterize or ActiveSupport::Inflector.transliterate will probably be acceptable for your use case.
"àáâãäå".parameterize
=> "aaaaaa"
However, it won't handle ligatures such as ﬃ, so for that you'll need:
"àáâãÀﬃ".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/,'').to_s
=> "aaaaAffi"

What should I do with emails using charset ansi_x3.110-1983?

My application is parsing incoming emails. I try to parse them as well as possible, but every now and then I get one with puzzling content. This time it is an email that looks to be ASCII, but the specified charset is ansi_x3.110-1983.
My application handles it correctly by defaulting to ASCII, but it throws a warning which I'd like to stop receiving, so my question is: what is ansi_x3.110-1983 and what should I do with it?
According to this page on the IANA's site, ANSI_X3.110-1983 is also known as:
iso-ir-99
CSA_T500-1983
NAPLPS
csISO99NAPLPS
Of those, only the name NAPLPS seems interesting or informative. If you can, consider getting in touch with the people sending those mails. If they're really using Prodigy in this day and age, I'd be amazed.
The IANA site also has a pointer to RFC 1345, which contains a description of the bytes and the characters that they map to. Compared to ISO-8859-1, the control characters are the same, as are most of the punctuation, all of the numbers and letters, and most of the remaining characters in the first 7 bits.
You could probably use the table in the RFC to write a tool that maps the characters over, if someone hasn't written one already. To be honest, it may be easier to simply ignore the warnings about the unusual character set, given that the character mapping is close enough to what is expected anyway.
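If you do want to handle it programmatically rather than special-casing the warning, here is a rough sketch of that "close enough" approach in Python (the fallback policy is an assumption, not a real ANSI_X3.110 decoder):
import codecs

def decode_body(raw_bytes, declared_charset):
    # ANSI_X3.110-1983 agrees with ASCII for letters, digits and most
    # punctuation, so fall back to a permissive single-byte decode when
    # Python has no codec registered under the declared name.
    try:
        codecs.lookup(declared_charset)
    except LookupError:
        declared_charset = 'latin-1'  # every byte maps to some character
    return raw_bytes.decode(declared_charset, errors='replace')

decode_body(b'Hello, world', 'ansi_x3.110-1983')
# => 'Hello, world'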

Standardizing "character set ranges" as internationally defined values

Let's say I have a field which accepts A-Z, a-z, 0-9. If I'm trying to communicate to someone, via documentation or API creation, "what" my code can accept, I HAVE to say:
A-Z,a-z,0-9
Now, to my mind, this is restrictive and error-prone.
Compare that to what I'm proposing.
Suppose A-Z, a-z, 0-9 was allocated the "code" ANSI456.
When I'm communicating that to someone, I can say that my code accepts ANSI456. If someone else was developing a check, there would be no confusion about what my code can or cannot accept.
To those who will suggest just specifying character ranges, please note that what I'm envisioning will handle scenarios where even this is defined as a valid "code":
0-9, +, -, *, /
In fact, if it's done properly, we could have a site automatically generate code in various languages to accommodate the different "codes".
Okay, I KNOW there are effectively infinite values, e.g.:
a-z
is different from
a-l,n-z
And these would have two different codes in this "system".
I'm not proposing a HUMAN-moderated system; it can be a completely automatic but systematic way of generating these "codes".
There already is such a standard, although it doesn't have the word "standard" in its name. It is called Perl 5 compatible regular expressions, and it is used in Perl 5, Java, JavaScript, libpcre and many other contexts.
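In other words, the character class itself is the shareable "code". A short illustration using Python's re module (whose class syntax is PCRE-compatible for this purpose):
import re

# "A-Z,a-z,0-9" becomes the character class [A-Za-z0-9];
# "0-9, +, -, *, /" becomes [0-9+*/-] (hyphen placed last so it is literal).
ALNUM = re.compile(r'\A[A-Za-z0-9]+\Z')
CALC = re.compile(r'\A[0-9+*/-]+\Z')

bool(ALNUM.match('Page42'))   # => True
bool(ALNUM.match('Page 42'))  # => False, space is not in the class
bool(CALC.match('3+4*2'))     # => True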

Allowing non-English (ASCII) characters in the URL for SEO?

I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The solution seems to be to:
1. Convert the character string into a sequence of bytes using the UTF-8 encoding.
2. Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte.
However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering if Google is still smart enough to handle searches in URLs that contain encoded data, or if I should attempt to convert those non-English characters into their semi-ASCII counterparts (which might help with Latin-based languages)?
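(For reference, Python's urllib.parse.quote implements roughly that two-step procedure, i.e. UTF-8 encode and then percent-encode anything outside the unreserved set; the tag name below is made up:)
from urllib.parse import quote

'/tags/id/' + quote('grüne-äpfel')
# => '/tags/id/gr%C3%BCne-%C3%A4pfel'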
Firstly, search engines really don't care about the URLs. They help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam, and no major search engine wants that. The allinurl: operator is merely a feature of Google to help advanced users, not something that gets factored into organic rankings. Any benefit you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site, and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central:
Does that mean I should avoid rewriting dynamic URLs at all? That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.
I personally don't believe it matters all that much beyond getting a little more click-through and helping users out. As far as Unicode goes, here is how it works: the request goes to the percent-encoded destination, and the rendering engine must know how to handle this if it wants to decode it back into something visually appealing. Google will render (i.e. decode) such encoded URLs properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
I wanted to show you an example of this; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
[Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget, like every other client, will just URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.
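To make the same point from the other direction, the percent-encoded path in that capture decodes straight back to the Armenian title (Python's standard library shown; any URL library will do the same):
from urllib.parse import unquote

unquote('/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB')
# => '/wiki/Գլխավոր_Էջ'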
Do you know what language everything will be in? Is it all Latin-based?
If so, then I would suggest building a sort of lookup table that converts UTF-8 to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and such; when there is a collision or the character doesn't exist in your lookup table, it just uses %HH.
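A rough sketch of that idea in Python, using Unicode decomposition as the "lookup table" and falling back to percent-encoding when transliteration would lose the whole string (the function name and fallback rule are assumptions):
import unicodedata
from urllib.parse import quote

def slugify(text):
    # Strip combining marks where possible: 'Ź' -> 'Z', 'Noël' -> 'Noel'.
    decomposed = unicodedata.normalize('NFKD', text)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # If transliteration dropped everything (e.g. Cyrillic or CJK text),
    # fall back to plain percent-encoding of the original.
    if not ascii_only.strip():
        return quote(text)
    return ascii_only

slugify('Noël')       # => 'Noel'
slugify('Գլխավոր')    # => '%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80'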
