How to convert a formatted string into plain text

How to convert a formatted string into plain text - ruby-on-rails

User copy paste and send data in following format: "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"
I need to convert it into plain txt (we can say ascii chars) like 'jovy debbie'
It comes in different font and format:
ex:
'𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔'
'𝙶𝚎𝚟𝚒𝚎𝚕𝚢𝚗 𝙽𝚒𝚌𝚘𝚕𝚎 𝙻𝚞𝚖𝚋𝚊𝚐'
Any Help will be Appreciated, I already refer other stack overflow question but no luck :(

Those letters are from the Mathematical Alphanumeric Symbols block.
Since they have a fixed offset to their ASCII counterparts, you could use tr to map them, e.g.:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".tr("𝕒-𝕫", "a-z")
#=> "jovy debbie"
The same approach can be used for the other styles, e.g.
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"
This gives you full control over the character mapping.
Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".unicode_normalize(:nfkc)
#=> "jovy debbie"
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
#=> "Jenica Dugos"

Related

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try add commas to the expression, everything is getting stripped. I got the aboves from somewhere else and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks

Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means not when inside the character class brackets [] which you are using, and then you can just add the comma to the list. The decimal needs to be escaped with a backslash because in regular expressions they are a special character that means "match anything".
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub!(/[^0-9,\.]/, '')
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/

You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.

You can use this:
rate = rate.gsub!(/[^0-9\.\,]/g,'')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

URL Escape in Uppercase

I have a requirement to escape a string with url information but also some special characters such as '<'.
Using cl_http_utility=>escape_url this translates to '%3c'. However due to our backend webserver, it is unable to recognize this as special character and takes the value literally. What it does recognize as special character is '%3C' (C is upper case). Also if one checks http://www.w3schools.com/tags/ref_urlencode.asp it shows the value with all caps as the proper encoding.
I guess my question is is there an alternative to cl_http_utility=>escape_url that does essentially the same thing except outputs the value in upper case?
Thanks.

Use the string function.
l_escaped = escape( val = l_unescaped
format = cl_abap_format=>e_url ).
Other possible formats are e_url_full, e_uri, e_uri_full, and a bunch of xml/json stuff too. The string function escape is documented pretty well, demo programs and all.

Mime encoded headers with extra '=' (==?utf-8?b?base64string?=)

This might be a silly question but... here it goes!
I wrote my own MIME parser in native C++. It's a nightmare with the encodings! It was stable for the last 3 months or so but recently I noticed this Subject: header.
Subject: =?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?===?UTF-8?B?YWNpb25lcw==?=
which should decode to this:
Subject: Oficina de Información, Iniciativas y Reclamaciones
The problem is there is one extra = (equal) in there which I can't figure out binding the two (why 2?) encoded elements which I don't understand why are separated. In theory the format should be: =?charset?encoding?encoded_string?= but found another subject that starts with two =.
==?UTF-8?B?blahblahlblah?=
How should I handle the extra =?
I could replace ==? with =? (which I am) before doing anything (and it works)... but I'm wondering if there's any kind of spec regarding this so I don't hack my way into proper functionality.
PS: How much I hate these relic protocols! All text communications should be UTF-8 and XML :)

In MIME headers encoded words are used (RFC 2047 Section 2.).
... (why 2?)
To overcome 75 encoded word limit, which is there because of 78 line length limit (or to use 2 different encodings like Chinese and Polish for example).
RFC 2047:
An 'encoded-word' may not be more than 75 characters long,
including 'charset', 'encoding', 'encoded-text', and delimiters.
If it is desirable to encode more text than will fit in an
'encoded-word' of 75 characters, multiple 'encoded-word's
(separated by CRLF SPACE) may be used.
Here's the example from RFC2047 (note there is no '=' in between):
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
Your subject should be decoded as:
"Oficina de Información, Iniciativas y Reclam=aciones"
mraq answer is incorrect. Soft line breaks apply to 'Quoted Printable' Content-Transfer-Encoding only, which can be used in MIME body.

It is called the "Soft Line Break" and it is the heritage of the SMTP protocol.
Quoting page 20 of RFC2045
(Soft Line Breaks) The Quoted-Printable encoding
REQUIRES that encoded lines be no more than 76
characters long. If longer lines are to be encoded
with the Quoted-Printable encoding, "soft" line breaks
must be used. An equal sign as the last character on a
encoded line indicates such a non-significant ("soft")
line break in the encoded text.
And also Wikipedia on Quoted-printable
A soft line break consists of an "=" at the end of an encoded line,
and does not appear as a line break in the decoded text.

From what I can see in the MIME RFC double equal signs are not valid input (for encoding), but keep in mind you could interpret the first equal sign as what it is and then use the following stuff for decoding. But seriously, those extra equal signs look like artifacts, maybe from an incorrect encoder.

list of garbage characters like â€™

I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with â€™. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.
thanx

Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as â€™. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.

Question reminder:
"...I noticed characters like '’' is replaced with â€™... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using php, you can generate these characters and their equivalence. Working with all 1,111,998 Unicode points or 109,449 Utf8 symbols is impractical. You may use the ASCII range in the following loop between &#128 and &#258 or another range that is more relevant to your context.
<?php
for ($i=128; $i<258; $i++)
$tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>"Garbage"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257 + (seldom) &#8129 to &#8246.
In order for the "garbage" symbols to display, the html page charset must be set to iso-1 or whichever other charset that caused the problem in the first place. They will not show if the charset is set to utf-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
.
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.

.gsub erroring with non-regular character 194

I've seen this posted a couple of times but none of the solutions seem to work for me so far...
I'm trying to remove a spurious Â character from a string...
e.g.
"myÂstring here Â$100"
..but it should be my string here $100
I've tried:
string.gsub(/\194/,'')
string.gsub(194.chr,'')
string.delete 194.chr
All of these still leave the Â intact..
Any thoughts?

By default, Rails supports UTF-8.
You can use your favorite editor to write a gsub call using the proper character you want to replace, as in:
"myÂstring here Â$100".gsub(/Â/,"")
If this does not work as well, you might be having an encoding error somewhere on your stack, probably on your HTML document. Try running rails console, extract somehow that string (if it comes from the Model, try to perform a find on the containing class) and run the gsub. It won't solve your problem, but you'll get a clue to where exactly the problem may lie.

Looks like a character encoding problem to me. For every Unicode code point in the range U+0080..U+00BF inclusive, the UTF-8 encoding is a two-byte sequence, 0xC2 (194 decimal) and the numeric value the code point. For example, a non-breaking space--U+00A0--becomes 0xC2 0xA0. Was there another extra character in there, that you already removed?
At any rate, gsub(/\194/,'') is wrong. \nnn is supposed to be an octal escape, but the number is in its decimal form. 194 in octal is \302.

"myÂstring here Â$100".gsub("Â","") # "mystring here $100"
Is that what you meant?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart