How to Validate a Character Compatible with MacRoman - ios

Is there a simple way (a function, a method...) of validating a character that a user types to see if it's compatible with Mac OS Roman? I've read a few dozen topics to find out why an iOS application crashes in reference to CGContextShowTextAtPoint. I guess an application can crash if it tries to draw on an image a string (i.e. ©) containing a character that is not included in the Mac OS Roman set. There are 256 characters in this set. I wonder if there's a better way other than matching the selected character one by one with those 256 characters?
Thank you

You might give https://developer.apple.com/library/mac/#documentation/graphicsimaging/conceptual/drawingwithquartz2d/dq_text/dq_text.html a closer read.
You can draw any encoding using CGContextShowGlyphsAtPoint instead of CContextShowTextAtPoint so you can tell it what the encoding is. If the user types it then you'll be getting the string as an NSString which is a Unicode string underneath. Probably the easiest is going to be to get the utf8 encoding of that user entered string via NSString's UTF8String method.
If you really want to stick with the very limited MacRoman for some reason, then use NSString's cStringUsingEncoding: passing in NSMacOSRomanStringEncoding to get a MacRoman string. Read the documentation on this in NSString though. Will return null if the user string can't be encoded in MacRoman losslessly. As it discusses you can use dataUsingEncoding:allowLossyConversion: and canBeConvertedToEncoding: to check. Read the cautions in the Discussion for cStringUsingEncoding: about about lifecycle of the returned strings though. getCString:maxLength:encoding: might end up being a better choice for you. All discussed in the class documentation for NSString.

This doesn't directly answer the question but this answer may be a solution to your problem.
If you have an NSString, instead of using CGContextShowTextAtPoint, you can do:
[someStr drawAtPoint:somePoint withFont:someFont];
where someStr is an NSString containing any Unicode characters a user can type, somePoint is a CGPoint, and someFont is the UIFont to use to render the text.

Related

Cannot get expected result for Spring4D cryptography examples

The Spring4D library has cryptography classes, however I cannot get them to work as expected. I'm probably using them incorrectly, however lack of any examples makes it difficult.
For example on the website https://quickhash.com/hash-sha256-online, I can hash the word "test" to generate the following hash:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
Using the Spring4D library, the following code produces a different hash:
CreateSHA256.ComputeHash('test').ToString;
results in:
9EFEA1AEAC9EDA04A892885A65FDAE0E6D9BE8C9FC96DA76D31B929262E12B1D
Upper/lower case aside, it is a different hash altogether. I know must be doing something wrong, but again there's no examples of use so I'm stuck on how to do this.
Hashing algorithms operate on binary data, typically represented using byte arrays.
Unfortunately, both of the resources you have used offer the ability to hash text. In order to hash text, you first need to convert from text to binary. To do so requires a choice of encoding. And neither method makes it clear what that choice is.
When I use this Delphi code:
LowerCase(CreateSHA256.ComputeHash(TEncoding.UTF8.GetBytes('test')).ToString)
I get the same hash as appears in your question.
I urge you never to attempt to encrypt/hash text and instead regard these operations as operating on binary. Always use an explicit encoding and then encrypt/hash the array of bytes that the encoding produced.
I've picked the UTF-8 encoding here, because it is a full Unicode encoding, and tends to be efficient in terms of space. However, I don't think your online encoder uses UTF-8. In fact I've no idea what encoding it uses, it is unclear on the matter. This is of course the same old issue of text being different from binary.
In my opinion it is a design flaw of the Delphi library that you use that it allows you to hash text without an explicit choice of encoding. If this library must offer a function that hashes text, then it should require the caller to supply an extra TEncoding parameter.
There is no conversion going on internally so it hashes the UnicodeString which is at least 2 bytes per character.
If you want the same result as on the page you have to use UTF8Encode or directly pass as AnsiString.
However I tried some strings that contained different unicode characters and the page returned a different result. So I am not quite sure how they treat the strings there. I guess it's a codepage thing.
Edit: If you use this page http://www.xorbin.com/tools/sha256-hash-calculator it generates the same hash as TSHA256 with UTF8Encode.
Which type of string are you using? Do you use AnsiString or WideString (Unicode string). Delphi 2009 and Newer are using WideString by default.
Why is string type inportant? All hasging algorithm operates on raw bytes data so it is omportant if each character of your string is stored in one Byte of memory (AnsiString) or multiple Bytes of memory (WideString).

How to identify if a TBytes array may safely convert to AnsiString, string or UTF8String?

Given a TBytes array, can we identify if the array may convert to AnsiString, String or UTF8String without losing any characters?
What you appear to be asking to do is impossible. You seem to have a byte array of unknown provenance, that may be encoded as ANSI, UTF-8 or UTF-16. You are hoping to be able to determine which encoding is correct.
This is impossible because there exist byte arrays that are valid in all three of those encodings, and that represent different strings in each encoding. Raymond Chen shows a nice clean example here: The Notepad file encoding problem, redux.
You can use heuristic algorithms to attempt to guess the encoding, an example of which is IsTextUnicode. But any such approach is by necessity not robust.

CGI::unescape can't handle unescaping "wymiana+teflon%F3w"?

I am working on data imported from legacy database into sqlite for development, legacy database has a lot of url encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I noticed yet, there may be more as I didn't do any testing yet), the letter "ó". For instance, using unescapeHTML on string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1) but '\xF3' is not a valid UTF-8 string, that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-1 and what should be UTF-8. You might have a mix of both ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
Send everything through through your CGI::unescape_html(CGI::unescape(...)) converter.
Wrap that in an exception handler to trap the inevitable problems.
Stash the problem strings off to the side somewhere.
Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they makes sense.
Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML

Avoid backslash-encoding utf8 characters when deriving Read (and Show) in Haskell

I'm having trouble parsing utf8 characters into Text when deriving a Read instance. For example, when I run the following in ghci...
> import Data.Text
> data Message = Message Text deriving (Read, Show)
> read ("Message \"→\"") :: Message
Message "\8594"
Can I do anything to keep my text inside Message utf-8 encoded? I.e. The result should be...
Message "→"
(P.S. I already receive my serialized messages as Text, but currently need to unpack to a String in order to call read. I'd love to avoid this...)
EDIT: Ah sorry, answers rightly point out that it's show not read which converts to "\8594" - is there a way to show and convert back to Text again without the backslash encoding?
To the best of my knowledge, the internal encoding used by Text (which is actually UTF-16) is consistent and not exposed directly. If you want UTF-8, you can decode/encode a Text value as appropriate. Similarly, it doesn't make sense to talk about an encoding for String, because that's just a list of Char, where each Char is a unicode code point.
Most likely, it's only the Show instance for Text displaying things differently here.
Also, keep in mind that (by consistent convention in standard libraries) read and show are expected to behave as (de-)serialization functions, with a "serialized" format that, interpreted as a Haskell expression, describes a value equivalent to the one being (de-)serialized. As such, the slash encoding with ASCII text is often preferred for being widely supported and unambiguous. If you want to display a Text value with the actual code points, show isn't what you want.
I'm not entirely clear on what you want to do with the Text--using show directly is exactly what you're trying to avoid. If you want to display text in a terminal window that's going to dictate the encoding, and you want the stuff defined in Data.Text.IO. If you need to convert to a specific encoding for whatever other reason, Data.Text.Encoding will give you an encoded ByteString (emphasis on "byte", not "string"--a ByteString is a sequence of raw bytes, not a string of characters).
If you just want to convert from Text to String and back to Text... what's wrong with the slash encoding? show is not really intended for pretty-printing output for users to read, despite many people's initial expectations otherwise.

Delphi XE - should I use String or AnsiString?

I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places.
I want to convert all other libraries to Unicode, but for this specific library I think it will be better to stick with ANSI. The advantage is the memory requirement as in some cases I load very large TXT files (containing ONLY Ansi characters). The disadvantage might be that I have to do lots and lots of typecasts when I make those libraries to interact with normal (unicode) libraries.
There are some general guidelines to show when is good to convert to Unicode and when to stick with Ansi?
The problem with general guidelines is that something like this can be very specific to a person's situation. Your example here is one of those.
However, for people Googling and arriving here, some general guidelines are:
Yes, convert to Unicode. Don't try to keep an old app fully using AnsiStrings. The reason is that the whole VCL is Unicode, and you shouldn't try to mix the two, because you will convert every time you assign a Unicode string to an ANSI string, and that is a lossy conversion. Trying to keep the old way because it's less work (or some similar reason) will cause you pain; just embrace the new string type, convert, and go with it.
Instead of randomly mixing the two, explicitly perform any conversions you need to, once - for example, if you're loading data from an old version of your program you know it will be ANSI, so read it into a Unicode string there, and that's it. Ever after, it will be Unicode.
You should not need to change the type of your string variables - string pre-D2009 is ANSI, and in D2009 and alter is Unicode. Instead, follow compiler warnings and watch which string methods you use - some still take an AnsiString parameter and I find it all confusing. The compiler will tell you.
If you use strings to hold bytes (in other words, using them as an array of bytes because a character was a byte) switch to TBytes.
You may encounter specific problems for things like encryption (strings are no longer byte/characters, so 'character' for 'character' you may get different output); reading text files (use the stream classes and TEncoding); and, frankly, miscellaneous stuff. Search here on SO, most things have been asked before.
Commenters, please add more suggestions... I mostly use C++Builder, not Delphi, and there are probably quite a few specific things for Delphi I don't know about.
Now for your specific question: should you convert this library?
If:
The values between A and U are truly only ever in this range, and
These values represent characters (A really is A, not byte value 65 - if so, use TBytes), and
You load large text files and memory is a problem
then not converting to Unicode, and instead switching your strings to AnsiStrings, makes sense.
Be aware that:
There is an overhead every time you convert from ANSI to Unicode
You could use UTF8String, which is a specific type of AnsiString that will not be lossy when converted, and will still store most text (Roman characters) in a single byte
Changing all the instances of string to AnsiString could be a bit of work, and you will need to check all the methods called with them to see if too many implicit conversions are being performed (for performance), etc
You may need to change the outer layer of your library to use Unicode so that conversion code or ANSI/Unicode compiler warnings are not visible to users of your library
If you convert to Unicode, sets of characters (can't remember the syntax, maybe if 'S' in MySet?) won't work. From your description of characters A to U, I could guess you would like to use this syntax.
My recommendation? Personally, the only reason I would do this from the information you've given is the memory use, and possibly performance depending on what you're doing with this huge amount of A..Us. If that truly is significant, it's both the driver and the constraint, and you should convert to ANSI.
You should be able to wrap up the conversion at the interface between this unit and its clients. Use AnsiString internally and string everywhere else and you should be fine.
In general only use AnsiString if it is important that the Chars are single bytes, Otherwise the use of string ensures future compatibility with Unicode.
You need to check all libraries anyway because all Windows API functions in Delhpi XE replaced by their unicode-analogues, etc. If you will never use UNICODE you need to use Delphi 7.
Use AnsiString explicitly everywhere in this unit and then you'll get compiler warning errors (which you should never ignore) for String to AnsiString conversion errors if you happen to access the routines incorrectly.
Alternately, perhaps preferably depending on your situation, simply convert everything to UTF8.
Stick with Ansi strings ONLY if you do not have the time to convert the code properly. The use of Ansi strings is really only for backward compatibility - to my knowledge C# does not have an equiavalent to Ansi strings. Otherwise use the standard Unicode strings. If you have a look on my web-site I have a whole strings routines unit (about 5,000 LOC) that works with both Delphi 2007 (non-Uniocde) and XE (Unicode) with only "string" interfaces and contains almost all of the conversion issues you might face.

Resources