Is there a way to dumb down text from Unicode to ASCII? - mapping

What I need is something like, for each ASCII character, a list of equivalent Unicode characters.
The problem is that programs like Microsoft Excel and Word insert non-ASCII double-quotes, single-quotes, dashes, etc. when people type into documents. I want to store this text in a database field of type "varchar", which requires single-byte characters.
For the sake of storing ASCII (single-byte) text, some of those Unicode characters could be considered equivalent to or similar enough to a particular ASCII character that replacing the Unicode character with the equivalent ASCII character would be fine.
I would like a simple function like MapToASCII, that would convert Unicode text to an ASCII equivalent, allowing me to specify a replacement character for any Unicode characters that are not similar to any ASCII character.

The Win32 API WideCharToMultiByte can be used for this conversion (Unicode to ANSI). Use CP_ACP as the first parameter. Something like that would likely be better than trying to build your own mapping function.
Edit: At the risk of sounding like I am trying to promote this as a solution against the OP's wishes, it seems worth pointing out that this API does much (all?) of what is being asked for. The goal is to map (I think) a Unicode string as much as possible to "ANSI" (where ANSI may be something of a moving target in this case). An additional requirement is to be able to specify some alternative character for those that cannot be mapped. The following example does this. It "converts" a Unicode string to char and uses an underscore (second-to-last parameter) for those characters that cannot be converted.
char ac[64];    /* output buffer for the converted string */
int ret;
size_t i;
ret = WideCharToMultiByte( CP_ACP, 0, L"abc個חあЖdef", -1,
                           ac, sizeof( ac ), "_", NULL );
for ( i = 0; i < strlen( ac ); i++ )
    printf( "%c %02x\n", ac[i], (unsigned char)ac[i] );

A highly relevant question is here: Replacing unicode punctuation with ASCII approximations
Although the answer there is insufficient, it gave me an idea. I could map each of the Unicode code points in the Basic Multilingual Plane (plane 0) to an equivalent ASCII character, if one exists. The following C# code will help by creating an HTML form in which you can type a replacement character for each value.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.IO;

namespace UnicodeCharacterCategorizer
{
    class Program
    {
        static void Main(string[] args)
        {
            string output_filename = "output.htm"; // set a filename if not specifying one through the command line
            Dictionary<UnicodeCategory,List<char>> category_character_sets = new Dictionary<UnicodeCategory,List<char>>();
            foreach (UnicodeCategory c in Enum.GetValues(typeof(UnicodeCategory)))
                category_character_sets.Add( c, new List<char>() );
            for (int i = 0; i <= 0xFFFF; i++)
            {
                if (i >= 0xD800 && i <= 0xDFFF) continue; // Skip ranges reserved for high/low surrogate pairs.
                char c = (char)i;
                UnicodeCategory category = char.GetUnicodeCategory( c );
                category_character_sets[category].Add( c );
            }
            StringBuilder file_data = new StringBuilder( @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html xmlns=""http://www.w3.org/1999/xhtml""><head><title>Unicode Category Character Sets</title><style>.categoryblock{border:3px solid black;margin-bottom:10px;padding:5px;} .characterblock{display:inline-block;border:1px solid grey;padding:5px;margin-right:5px;} .character{display:inline-block;font-weight:bold;background-color:#ffeeee} .numericvalue{color:blue;}</style></head><body><form id=""charactermap"">" );
            foreach (KeyValuePair<UnicodeCategory,List<char>> entry in category_character_sets)
            {
                file_data.Append( @"<div class=""categoryblock""><h1>" + entry.Key.ToString() + ":</h1><br />" );
                foreach (char c in entry.Value)
                {
                    string hex_value = ((int)c).ToString( "x" );
                    file_data.Append( @"<div class=""characterblock""><span class=""character"">&#x" + hex_value + @";<br /><span class=""numericvalue"">" + hex_value + @"</span><br /><input type=""text"" name=""r_" + hex_value + @""" /></div>" );
                }
                file_data.Append( "</div>" );
            }
            file_data.Append( "</form></body></html>" );
            File.WriteAllText( output_filename, file_data.ToString(), Encoding.Unicode );
        }
    }
}
Specifically, that code will generate an HTML form containing all characters in the BMP, along with input text boxes named after the hex values prefixed with "r_" (r is for "replacement value"). If this were ported over to an ASP.NET page, additional code could be written to pre-populate replacement values as much as possible:
with their own value if already ASCII, or
with Unicode normalized FormD or FormKD decomposed equivalents, or
a single ASCII value for an entire category (e.g. all "punctuation initial" characters with an ASCII double quote)
You could then go through manually and make adjustments, and it probably wouldn't take as long as you'd think. There are only 63,488 code points to consider (the BMP minus the surrogate range), and large chunks of entire categories can probably be dismissed as "not even close to anything ASCII". So, I'm going to build this map and function.
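In the meantime, here is a rough sketch of what such a MapToASCII could look like, using the FormKD decomposition mentioned above for the easy cases and a caller-supplied replacement character for everything else. Only the method name comes from the question; the rest is an assumption, and the hand-built table described above would override whatever this heuristic gets wrong (curly quotes and dashes, for example, have no compatibility decomposition and would come out as the replacement character until the table maps them to " and -).

using System.Globalization;
using System.Text;

static class AsciiMapper
{
    // Sketch only: decompose with FormKD, drop combining marks, and fall back to
    // 'replacement' for anything that still is not ASCII. A lookup table built from
    // the HTML form above can then override individual characters.
    public static string MapToASCII(string input, char replacement = '_')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormKD))
        {
            if (c <= 0x7F)
                sb.Append(c);                  // already ASCII
            else if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;                      // combining accents are simply dropped
            else
                sb.Append(replacement);        // nothing close enough in ASCII
        }
        return sb.ToString();
    }
}

With this, "Résumé" comes back as "Resume", while a curly quote or an en dash becomes "_" until an explicit mapping is added for it.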

Related

Objective-C some special char uncontrollably changing

I have a string that includes some special characters (like é, â, î, ı, etc.). When I use substring on this string, I encounter inconsistent results: some special characters change uncontrollably.
You are assuming that these are all characters:
[newword substringWithRange:NSMakeRange(0,1)];
[newword substringWithRange:NSMakeRange(1,1)];
[newword substringWithRange:NSMakeRange(2,1)];
[newword substringWithRange:NSMakeRange(3,1)];
// and so on...
In other words, you believe that:
A location always falls at the start of a character.
A character always has length 1.
Both assumptions are wrong. Please read the Characters and Grapheme Clusters chapter of Apple's String Programming Guide.
Your é happens to have length 2, because it is a base letter e followed by a combining diacritical accent. If you want it to have length 1, you need to normalize the string before you use it. Call precomposedStringWithCanonicalMapping and use the resulting string.
Example and proof (in Swift, but it won't matter, as I use NSString throughout):
let s = "é,â,î,ı" as NSString
let c = s.substring(with: NSRange(location: 0, length: 1)) // e
let s2 = s.precomposedStringWithCanonicalMapping as NSString
let c2 = s2.substring(with: NSRange(location: 0, length: 1)) // é
You're treating a Unicode string like a sequence of bytes. Unicode code points outside the ASCII range are encoded as multiple bytes in UTF-8, so by stripping out the part responsible for the accent above the letter, such as this one: https://www.compart.com/en/unicode/U+0301, you are changing the text.
UTF-8 is variable-width, so treating it as raw bytes may give weird results. I would suggest using something more Unicode-aware, such as ICU (International Components for Unicode).
Now imagine you have a byte sequence like this:
0x65 0x00
e NUL
Now you have a UTF-8 string with one code point and a null terminator. Now say you want to add an accent to that e. How would you do that? You could use a combining Unicode code point to modify the e, so now the string is:
0x65 0xCC 0x81 0x00
e U+0301 NUL
where U+0301 (Combining Acute Accent) is encoded as the two-byte sequence 0xCC 0x81 and makes the e accented.
Edit: This answer assumes UTF-8 encoding, which is likely a bad assumption, but whether the string is UTF-8, UTF-16, or any other encoding with combining characters, it illustrates why you may have mysteriously disappearing accents. While this may well be UTF-16, for the sake of simplicity let's pretend we live in a world where life is just slightly better because everyone only uses UTF-8 and UTF-16 doesn't exist.
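For the concrete bytes behind this explanation, here is a small standalone illustration. It is C# rather than Objective-C (to match the examples in the main question); nothing about it is platform-specific.

using System;
using System.Text;

class AccentBytesDemo
{
    static void Main()
    {
        string decomposed  = "e\u0301";   // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
        string precomposed = "\u00E9";    // LATIN SMALL LETTER E WITH ACUTE

        // UTF-8 bytes: 65 CC 81 for the decomposed form, C3 A9 for the precomposed one.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(decomposed)));    // 65-CC-81
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(precomposed)));   // C3-A9

        // Taking "the first character" of the decomposed form silently drops the accent;
        // normalizing to the precomposed form first keeps it intact.
        Console.WriteLine(decomposed.Substring(0, 1));                                    // e
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC).Substring(0, 1)); // é
    }
}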
To address the comment (this has less to do with the question, but it's some fun trivia) and to give some fun details about the NS/CF/Swift runtimes, bridging, constant CF strings, and other fun stuff like that: the representation of the actual string in memory is implementation-defined and can vary (even for constant strings; trust me, I know, I fixed the ELF implementation of them in Clang for CoreFoundation a few days ago). Anyway, here's some code:
CF_INLINE CFStringEncoding __CFStringGetSystemEncoding(void) {
if (__CFDefaultSystemEncoding == kCFStringEncodingInvalidId) (void)CFStringGetSystemEncoding();
return __CFDefaultSystemEncoding;
}
CFStringEncoding CFStringFileSystemEncoding(void) {
if (__CFDefaultFileSystemEncoding == kCFStringEncodingInvalidId) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI || DEPLOYMENT_TARGET_WINDOWS
__CFDefaultFileSystemEncoding = kCFStringEncodingUTF8;
#else
__CFDefaultFileSystemEncoding = CFStringGetSystemEncoding();
#endif
}
return __CFDefaultFileSystemEncoding;
}
Code like this appears throughout CoreFoundation/Foundation/SwiftFoundation. (Yes, you never know which sort of NSString you're actually holding; they usually pretend to be the same thing, but under the hood, depending on how you got the object, you may be holding one of three variations of it.)
This is why code like the following exists: NS/CF(Constant)/Swift strings have implementation-defined internal representations.
if (((encoding & 0x0FFF) == kCFStringEncodingUnicode) && ((encoding == kCFStringEncodingUnicode) || ((encoding > kCFStringEncodingUTF8) && (encoding <= kCFStringEncodingUTF32LE)))) {
If you want consistent behavior you have to encode the string using a specific fixed encoding instead of relying on the internal representation.

PDFBox 2.0: Overcoming dictionary key encoding

I am extracting text from PDF forms with Apache PDFBox 2.0.1, pulling out the details of AcroForm fields. From a radio button field I dig up the appearance dictionary. I'm interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive BeanShell):
field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
ap = annot.getAppearance();
keys = ap.getCOSObject().getDictionaryObject("N").keySet();
keyList = new ArrayList(keys.size());
for (cosKey : keys) {keyList.add(cosKey.getName());}
print(String.join("|", keyList));
}
The output is
Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off
The question mark blotches should be Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
This sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf
Changing the assumed encoding
PDFBox's interpretation of the encoding of bytes in names (only names can be used as dictionary keys in PDFs) takes place in BaseParser.parseCOSName() when reading the name from the source PDF:
/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}
As you can see, after reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8 encoded. To change this, therefore, you have to patch this PDFBox class and replace the charset named at the bottom.
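To see what that means for this file, here is a small standalone illustration (C#, not PDFBox code; the byte values assume the key "Nynäshamn" was written as ISO-8859-1 text, which is what RUPS reports for this document):

using System;
using System.Text;

class NameDecodingDemo
{
    static void Main()
    {
        // Raw name bytes as they would appear in the PDF: "Nynäshamn" in ISO-8859-1,
        // where 'ä' is the single byte 0xE4.
        byte[] raw = { 0x4E, 0x79, 0x6E, 0xE4, 0x73, 0x68, 0x61, 0x6D, 0x6E };

        // What the parser above does: assume UTF-8. 0xE4 does not start a valid
        // UTF-8 sequence here, so it decodes to the replacement character U+FFFD.
        Console.WriteLine(Encoding.UTF8.GetString(raw));                       // Nyn�shamn

        // Decoding with the encoding the file actually used recovers the key.
        Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(raw));  // Nynäshamn
    }
}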
Is PDFBox correct here?
According to the specification, when treating a name object as text
the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.
(section 7.3.5 Name Objects, ISO 32000-1)
BaseParser.parseCOSName() implements just that.
PDFBox's implementation is not completely correct, though, since the very act of interpreting the name as a string when there is no need to is already wrong:
name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text
Thus, PDF libraries should handle names as byte arrays as long as possible and only find a string representation when it is explicitly required, and only then the recommendation above (to assume UTF-8) should play a role. The specification even indicates where this may cause trouble:
PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.
Another situation becomes apparent in the document at hand: if the sequence of bytes is not valid UTF-8, it is still a valid name. But such names are changed by the method above; any unparsable byte or subsequence is replaced by the Unicode replacement character '�'. Thus, different names may collapse into a single one.
Another issue is that when writing back a PDF, PDFBox is not acting symmetrically but instead interprets the String representation of the name (which has been retrieved as a UTF-8 interpretation if read from a PDF) using pure US_ASCII, cf. COSName.writePDF(OutputStream):
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;
        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
                current >= 'a' && current <= 'z' ||
                current >= '0' && current <= '9' ||
                current == '+' ||
                current == '-' ||
                current == '_' ||
                current == '#' ||
                current == '*' ||
                current == '$' ||
                current == ';' ||
                current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}
Thus, any interesting Unicode character is replaced with the US_ASCII default replacement character which I assume to be '?'.
So it is quite fortunate that PDF names most often do merely contain ASCII characters... ;)
Historically
According to the implementation notes from the PDF 1.4 reference,
In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.
Thus, the sample document at hand seems to follow conventions from Acrobat 4, i.e. from the last century.
Source code excerpts are from PDFBox 2.0.0 but at first glance do not seem to have been changed in 2.0.1 or the development trunk.
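That last-century convention is straightforward to express in code once you have the raw name bytes (which, as noted above, PDFBox discards after decoding, so in practice this logic belongs inside the patched parseCOSName rather than in a post-processing step). A hedged sketch of the heuristic in C#; the ISO-8859-1 fallback is an assumption based on this particular document, and none of this is PDFBox API:

using System.Text;

static class NameText
{
    // Heuristic from the PDF 1.4 implementation note: a name that does not conform
    // to UTF-8 is instead interpreted in a host/legacy encoding.
    public static string DecodeNameBytes(byte[] raw)
    {
        var strictUtf8 = new UTF8Encoding(false, true);  // no BOM, throw on invalid bytes
        try
        {
            return strictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("ISO-8859-1").GetString(raw);
        }
    }
}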

Parsing RTF non-breaking space

I am making a simple parser from RTF to HTML.
I have the following raw RTF:
who\\~nursed\\~and
According to the RTF specification \~ is the keyword for a non-breaking space.
The end of a keyword is marked by a Delimiter which is defined as follows:
A space. This serves only to delimit a control word and is ignored in subsequent processing.
A numeric digit or an ASCII minus sign (-), which indicates that a numeric parameter is associated with the control word. The subsequent digital sequence is then delimited by any character other than an ASCII digit (commonly another control word that begins with a backslash). The parameter can be a positive or negative decimal number. The range of the values for the number is nominally –32768 through 32767, i.e., a signed 16-bit integer. A small number of control words take values in the range −2,147,483,648 to 2,147,483,647 (32-bit signed integer). These control words include \binN, \revdttmN, \rsidN related control words and some picture properties like \bliptagN. Here N stands for the numeric parameter. An RTF parser must allow for up to 10 digits optionally preceded by a minus sign. If the delimiter is a space, it is discarded, that is, it’s not included in subsequent processing.
Any character other than a letter or a digit. In this case, the delimiting character terminates the control word and is not part of the control word. Such as a backslash “\”, which means a new control word or a control symbol follows.
As I understand it, the highlighted part above is the rule used in this particular instance. But if that is the case, then my parser would read until the ~ sign and conclude that, since this is not a letter or a digit, it is not part of the keyword.
This currently results in the following output:
who~nursed~and
I have the following code for reading a keyword:
public GetKeyword(index: number): KeywordSet {
    var keywordarray: string[] = [];
    var valuearray: string[] = [];
    index++;
    while (index < this.m_input.length) {
        var remainint = this.m_input.substr(index);
        //Keep going until we hit a delimiter
        if (this.m_input[index] == " ") {
            index++;
            break;
        } else if (this.IsNumber(this.m_input[index])) {
            valuearray.push(this.m_input[index]);
        } else if (this.IsDelimiter(this.m_input[index])) {
            break;
        } else keywordarray.push(this.m_input[index]);
        index++;
    }
    var value: number = null;
    if (valuearray.length > 0) value = parseInt(valuearray.join(""));
    var keywordset = new KeywordSet(keywordarray.join(""), index, value);
    return keywordset;
}

private IsDelimiter(char: string): boolean {
    if (char == "*" || char == "'") return false;
    return !this.IsLetterOrDigit(char);
}
When GetKeyword() reaches "~" it recognises it as a delimiter, and stops reading, resulting in an empty keyword as return value.
I do not have an AST constructed for this. Don't think it is necessary for this?
The quote in your question describes the syntax of an entity called a control word, but \~ is actually a different entity, called a control symbol. Control symbols have a different syntax:
Control Symbol
A control symbol consists of a backslash followed by a single, non-alphabetical character. For example, \~ (backslash tilde) represents a non-breaking space. Control symbols do not have delimiters, i.e., a space following a control symbol is treated as text, not a delimiter.
See page 9 of Rich Text Format (RTF) Specification, version 1.9.1.
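In the tokenizer this means looking at the character immediately after the backslash before entering the control-word loop. A rough sketch of that branch, written in C# like the examples in the main question rather than the question's TypeScript (ReadControlToken is an invented name; the logic ports directly):

// Sketch only: control symbols are a backslash plus one non-alphabetic character and
// have no delimiter; control words are letters, an optional signed numeric parameter,
// and a delimiter (a trailing space is consumed and discarded).
static string ReadControlToken(string input, ref int index)
{
    index++;                                    // skip the backslash
    char c = input[index];

    if (!char.IsLetter(c))
    {
        index++;
        return "\\" + c;                        // e.g. "\~", the non-breaking space
    }

    int start = index;
    while (index < input.Length && char.IsLetter(input[index]))
        index++;                                // control word name
    if (index < input.Length && input[index] == '-')
        index++;                                // optional minus sign
    while (index < input.Length && char.IsDigit(input[index]))
        index++;                                // optional numeric parameter
    string word = input.Substring(start, index - start);

    if (index < input.Length && input[index] == ' ')
        index++;                                // a space delimiter is discarded

    return "\\" + word;
}

With that split, \~ in who\~nursed\~and comes back as a one-character control symbol and the text nursed is left untouched, instead of the tilde being treated as a delimiter that ends an empty keyword.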

ASP MVC 3 - Export to CSV method including junk characters not in database

Below is the (crude) method I'm using to export the contents of a table into a CSV. I came up with this on the fly; however, the data in the table was loaded from an Excel spreadsheet created by a SharePoint site. I do not know whether that conversion process or my method is the cause, but a large number of these characters: Â are being imported into the cells.
Also, a large number of records are having their fields split into two rows instead of just one. This is my first attempt at exporting to a CSV programmatically (as opposed to using Excel), so any help would be greatly appreciated.
Controller Method
public ActionResult ExportToCsv()
{
    using (StringWriter writer = new StringWriter())
    {
        var banks = db.BankListMaster.Include(b => b.BankListAgentId).ToList();
        writer.WriteLine("Bank Name, EPURL, AssociatedTPMBD, Tier, FixedLifeMasterSAF, VariableLifeMasterSAF, FixedLifeSNY, VariableLifeMasterSNY, SpecialNotes, WelcomeLetterReq, " +
            "BackOfficeNotification, LinkRepsToDynamics, RelationshipCode, INDSGC, PENSGC, LicensingContract, MiscellaneousNotes, ContentTypeID1, CreatedBy, MANonresBizNY, Attachment");
        foreach (var item in banks)
        {
            writer.Write(item.BankName + ",");
            if (String.IsNullOrWhiteSpace(item.EPURL))
            {
                writer.Write(item.EPURL + ",");
            }
            else
            {
                writer.Write(item.EPURL.Trim() + ",");
            }
            writer.Write(item.AssociatedTPMBD + ",");
            writer.Write(item.Tier + ",");
            writer.Write(item.LicensingContract + ",");
            writer.Write(item.MiscellaneousNotes + ",");
            writer.Write(item.ContentTypeID1 + ",");
            writer.Write(item.CreatedBy + ",");
            writer.Write(item.MANonresBizNY + ",");
            writer.Write(item.Attachment);
            writer.Write(writer.NewLine);
        }
        return File(new System.Text.UTF8Encoding().GetBytes(writer.ToString().Replace("Â", "")), "text/csv", "BankList.csv");
    }
}
CSV is a file format that's poorly specified. Several important things aren't specified:
The field separator. Even though it's called "comma separated", Excel will use the semicolon sometimes (depending on your locale!).
The encoding (UTF-8, ISO-8859-1, ANSI/Windows-1252 etc.)
The kind of newlines (CR, LF or CR LF).
Whether all fields have to be quoted with double quotes or just the ones containing the field separator, the line separator, blanks etc.
Whether white space is trimmed from unquoted fields.
Whether newlines are allowed within quoted fields (Excel allows them).
How double quotes are escaped if they are part of the field content (normally they are doubled)
Excel is usually the reference for a valid CSV format. But even Excel chooses the field separator and the encoding depending on your locale.
In your case, the encoding is most likely the main problem. You write UTF-8, but the consumer treats it as ISO-8859-1 or ANSI. That is why the character Â keeps appearing: its byte value (0xC2) is used in UTF-8 to introduce a two-byte sequence, and a consumer reading the file byte-by-byte as ISO-8859-1/ANSI displays that lead byte as Â. Change the encoding to fix the Â.
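A minimal illustration of that effect (a standalone C# sketch; the sample string is made up):

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // A non-breaking space (U+00A0) is encoded in UTF-8 as the two bytes 0xC2 0xA0.
        byte[] utf8 = Encoding.UTF8.GetBytes("price:\u00A0100");

        // A consumer that assumes ISO-8859-1 (or Windows-1252) reads each byte as its own
        // character, so the 0xC2 lead byte shows up as 'Â'.
        string misread = Encoding.GetEncoding("ISO-8859-1").GetString(utf8);
        Console.WriteLine(misread);   // price:Â 100  (the 0xA0 byte renders as a non-breaking space)
    }
}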
As the next step, properly quote the text fields, i.e. add double quotes at the start and at the end and double all double quotes within the field.
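For the quoting, a small helper along these lines is usually enough (a sketch; Escape is an invented name and is not part of the controller above):

// Quote one CSV field: wrap it in double quotes and double any embedded double quotes,
// so commas, quotes and line breaks inside the value survive as part of the field.
private static string Escape(object value)
{
    string s = value == null ? "" : value.ToString();
    return "\"" + s.Replace("\"", "\"\"") + "\"";
}

Each field is then written as writer.Write(Escape(item.BankName) + ","); quoting the fields also keeps values that contain line breaks from being read back as two separate records, which is the likely cause of the rows splitting in two.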

Detect ASCII codes for Asian double-byte / Cyrillic character sets?

Is it possible to detect if a character belongs to an Asian double-byte or Cyrillic character set? Perhaps by specific code ranges? I've googled, but I'm not finding anything at first glance.
There's an RSS feed I'm tapping into that has the locale set as 'en-gb', but there are some Asian double-byte characters in the feed itself which I need to handle differently. I'm just not sure how to detect them, since the locale metadata is incorrect, and I do not have access to correct the public feed.
If your RSS feed uses UTF-8, which it probably does, just check whether the character value is greater than 255.
A quick Google search suggests that you might want to look at String.charCodeAt.
I don't know ActionScript, but I would expect a code snippet to look something like this:
var stringToTest : String;
for (var i : Number = 0; i < stringToTest.length; i++) {
    if (stringToTest.charCodeAt(i) > 255) {
        // Do something to your double-byte character here
    } else {
        // You have a plain ASCII character here
    }
}
I hope this helps!
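On the "specific code ranges" part of the question: Unicode assigns Cyrillic and the common CJK ideographs to well-known blocks, so once you have a code point a simple range check works. A sketch in C# (only the two most common blocks are shown; the boundaries come from the Unicode block definitions, and the feed handling itself would stay in ActionScript):

// Rough block checks; extend with further ranges (Hiragana U+3040–U+309F,
// Katakana U+30A0–U+30FF, Hangul syllables U+AC00–U+D7AF, ...) as needed.
static bool IsCyrillic(char c)
{
    return c >= '\u0400' && c <= '\u04FF';     // Cyrillic block
}

static bool IsCjkIdeograph(char c)
{
    return c >= '\u4E00' && c <= '\u9FFF';     // CJK Unified Ideographs block
}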

Resources