Process unicode string in C and Objective C - ios

I wrote a C function that reads the characters of a user-input string. Because the string is user input, it can contain any Unicode character. An Objective-C method receives the user-input NSString, converts it to NSData, and passes that data to the C function for processing. The C function searches for these symbol characters: *, [, ], _; it ignores all other characters. Every time it finds one of the symbols, it processes it and then calls an Objective-C method, passing the location of the symbol.
C code:
typedef void (* callback)(void *context, size_t location);
void process(const uint8_t *data, size_t length, callback cb, void *context)
{
size_t i = 0;
while (i < length)
{
if (data[i] == '*' || data[i] == '[' || data[i] == ']' || data[i] == '_')
{
int valid = 0;
//do something, set valid = 1
if (valid)
cb(context, i);
}
i++;
}
}
Objective C code:
//a C function declared in .m file
void mycallback(void *context, size_t location)
{
[(__bridge id)context processSymbolAtLocation:location];
}
- (void)processSymbolAtLocation:(NSInteger)location
{
NSString *result = [self.string substringWithRange:NSMakeRange(location, 1)];
NSLog(@"%@", result);
}
- (void)processUserInput:(NSString*)string
{
self.string = string;
//convert string to data
NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
//pass data to C function
process(data.bytes, data.length, mycallback, (__bridge void *)(self));
}
The code works fine if the input string contains only English characters. If it contains composed character sequences, multibyte characters or other Unicode characters, the result string in the processSymbolAtLocation method is not the expected symbol.
How do I convert the NSString object to NSData correctly? How do I get the correct location?
Thanks!

Your problem is that you start off with a UTF-16 encoded NSString and produce a sequence of UTF-8 encoded bytes. The number of code units required to represent a string in UTF-16 may differ from the number required to represent it in UTF-8, so the offsets in your two forms may not match, as you have found out.
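To illustrate the mismatch, here is a minimal sketch that maps a byte offset in UTF-8 data to the index of the corresponding UTF-16 code unit. The function name is made up for this example, and it assumes the input is valid UTF-8 and that the offset points at the first byte of a character (which is always true for ASCII symbols such as * or _).

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a byte offset into valid UTF-8 data to the index of the matching
 * UTF-16 code unit. byte_offset must point at the first byte of a character.
 * (Hypothetical helper, not part of the original code.) */
size_t utf16_index_for_utf8_offset(const uint8_t *utf8, size_t byte_offset)
{
    size_t units = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        uint8_t b = utf8[i];
        if ((b & 0xC0) == 0x80)
            continue;   /* continuation byte: belongs to an earlier character */
        /* A 4-byte sequence (lead byte 0xF0 or above) encodes a code point
         * above U+FFFF, which UTF-16 stores as a surrogate pair. */
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For "é*" the UTF-8 bytes are C3 A9 2A, so the * sits at byte offset 2 but at UTF-16 index 1, which is the index NSString expects.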
Why are you using C to scan the string for matches in the first place? You might want to look at NSString's rangeOfCharacterFromSet:options:range: method which you can use to find the next occurrence of character from your set.
If you need to use C then convert your string into a sequence of UTF-16 words and use uint16_t on the C side.
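A rough sketch of what the UTF-16 variant of the C side could look like. The names process_utf16, callback16, scan_result and record_location are invented for this example, and the "valid" check from the question is left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*callback16)(void *context, size_t location);

/* Scan UTF-16 code units and report the index of every symbol.
 * The symbols are all ASCII, and no surrogate code unit (0xD800..0xDFFF)
 * can equal an ASCII value, so a plain comparison is safe. */
void process_utf16(const uint16_t *units, size_t length, callback16 cb, void *context)
{
    for (size_t i = 0; i < length; i++) {
        uint16_t u = units[i];
        if (u == '*' || u == '[' || u == ']' || u == '_')
            cb(context, i);
    }
}

/* Example callback for illustration: counts calls and remembers the last index. */
struct scan_result { size_t count; size_t last; };

void record_location(void *context, size_t location)
{
    struct scan_result *r = context;
    r->count++;
    r->last = location;
}
```

On the Objective-C side the code units could be copied out with getCharacters:range: into a unichar buffer. The index reported to the callback is then a UTF-16 index, so it can be passed straight to substringWithRange:.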
HTH

Related

Objective-C how to convert a keystroke to ASCII character code?

I need to find a way to convert an arbitrary character typed by a user into an ASCII representation to be sent to a network service. My current approach is to create a lookup dictionary and send the corresponding code. After creating this dictionary, I see that it is hard to maintain and determine if it is complete:
__asciiKeycodes[@"F1"] = @(112);
__asciiKeycodes[@"F2"] = @(113);
__asciiKeycodes[@"F3"] = @(114);
//...
__asciiKeycodes[@"a"] = @(97);
__asciiKeycodes[@"b"] = @(98);
__asciiKeycodes[@"c"] = @(99);
Is there a better way to get ASCII character code from an arbitrary key typed by a user (using standard 104 keyboard)?
Objective-C has the base C primitive data types, so there is a little trick you can use. Store the keystroke in a char, then cast it to an int. In C, converting a char to an int yields that character's ASCII value. Here's a quick example:
char character = 'a';
NSLog(@"a = %d", (int)character);
console output = a = 97
To go the other way around, cast an int to a char:
int asciiValue = 97;
NSLog(@"97 = %c", (char)asciiValue);
console output = 97 = a
Alternatively, you can do a direct conversion within initialization of your int or char and store it in a variable.
char asciiToCharOf97 = (char)97; //Stores 'a' in asciiToCharOf97
int charToAsciiOfA = (int)'a'; //Stores 97 in charToAsciiOfA
This seems to work for most keyboard keys, not sure about function keys and return key.
NSString* input = @"abcdefghijklkmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*()_+[]\{}|;':\"\\,./<>?~ ";
for(int i = 0; i<input.length; i ++)
{
NSLog(@"Found (at %i): %i",i , [input characterAtIndex:i]);
}
Use stringWithFormat call and pass the int values.

how to read chinese from pdf in ios correctly

Here is what I have done, but the text comes out garbled. Thanks in advance.
1. Use CGPDFStringCopyTextString to get the text from the PDF.
2. Encode the NSString to char*:
NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
const char *char_content = [self.currentData cStringUsingEncoding:enc];
Below is how I get the currentData:
void arrayCallback(CGPDFScannerRef inScanner, void *userInfo)
{
BIDViewController *pp = (__bridge BIDViewController*)userInfo;
CGPDFArrayRef array;
bool success = CGPDFScannerPopArray(inScanner, &array);
for(size_t n = 0; n < CGPDFArrayGetCount(array); n += 1)
{
if(n >= CGPDFArrayGetCount(array))
continue;
CGPDFStringRef string;
success = CGPDFArrayGetString(array, n, &string);
if(success)
{
NSString *data = (__bridge NSString *)CGPDFStringCopyTextString(string);
[pp.currentData appendFormat:@"%@", data];
}
}
}
- (IBAction)press:(id)sender {
table = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
self.currentData = [NSMutableString string];
CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pagerf);
CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, (__bridge void *)(self));
bool ret = CGPDFScannerScan(scanner);
}
According to the Mac Developer Library
CGPDFStringCopyTextString returns a CFString object that represents a PDF string as a text string. The PDF string is given as a CGPDFString which is a series of bytes—unsigned integer values in the range 0 to 255; thus, this method already decodes the bytes according to some character encoding.
It is given none explicitly, so it assumes one encoding type, most likely the PDFDocEncoding or the UTF-16BE Unicode character encoding scheme which are the two encodings that may be used to represent text strings in a PDF document outside the document’s content streams, cf. section 7.9.2.2 Text String Type and Table D.1, Annex D in the PDF specification.
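As a small illustration of how those two text string encodings are told apart: per the section cited above, a UTF-16BE text string starts with the byte order marker 254 255 (0xFE 0xFF), and anything else is treated as PDFDocEncoding. The function name below is made up for this sketch. It only applies to text strings outside content streams, so it does not solve the content stream case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when the raw bytes of a PDF text string begin with the
 * UTF-16BE byte order marker (0xFE 0xFF). Without the marker, a text string
 * outside a content stream is read as PDFDocEncoding.
 * (Hypothetical helper for illustration.) */
bool pdf_text_string_is_utf16be(const uint8_t *bytes, size_t length)
{
    return length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF;
}
```

The raw bytes of a CGPDFStringRef can be read with CGPDFStringGetBytePtr and CGPDFStringGetLength if you want to inspect them before any decoding happens.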
Now you have not told us from where you received your CGPDFString. I assume, though, that you received it from inside one of the document’s content streams. Text strings there, on the other hand, can be encoded with any imaginable encoding. The encoding used is given by the embedded data of the font the string is to be displayed with.
For more information on this you may want to read CGPDFScannerPopString returning strange result and have a look at PDFKitten.

iOS - XML to NSString conversion

I'm using NSXMLParser to parse XML in my app and I'm having a problem with the encoding type. For example, here is one of the feeds coming in. It looks similar to this:
\U2026Some random text from the xml feed\U2026
I am currently using the encoding type:
NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
Which encoding type am I supposed to use to convert \U2026 into an ellipsis (…)?
The answer here is you're screwed. They are using a non-standard encoding for XML, but what if they really want the literal \U2026? Let's say you add a decoder to handle all \UXXXX and \uXXXX encodings. What happens when another feed wants the data to be the literal \U2026?
Your first choice and best bet is to get this feed fixed. If they need to encode data, they need to use proper HTML entities or numeric references.
As a fallback, I would isolate the decoder from the XML parser. Don't create a non-conforming XML parser just because you're getting non-conforming data. Have a post-processor that is only run on the offending feed.
If you must have a decoder, then there is more bad news. There is no built-in decoder; you will need to find a category online or write one yourself.
After some poking around, I think Using Objective C/Cocoa to unescape unicode characters, ie \u1234 may work for you.
Alright, here's a snippet of code that should work for any escape with four hex digits, i.e. any code point in the Basic Multilingual Plane:
NSString *stringByUnescapingUnicodeSymbols(NSString *input)
{
NSMutableString *output = [NSMutableString stringWithCapacity:[input length]];
// get the UTF8 string for this string...
const char *UTF8Str = [input UTF8String];
while (*UTF8Str) {
if (*UTF8Str == '\\' && tolower(*(UTF8Str + 1)) == 'u')
{
// skip the next 2 chars '\' and 'u'
UTF8Str += 2;
// make sure we only read 4 chars
char tmp[5] = { UTF8Str[0], UTF8Str[1], UTF8Str[2], UTF8Str[3], 0 };
long unicode = strtol(tmp, NULL, 16); // remember that Unicode is base 16
[output appendFormat:@"%C", (unichar)unicode];
// move on with the string (making sure we dont miss the end of the string
for (int i = 0; i < 4; i++) {
if (*UTF8Str == 0)
break;
UTF8Str++;
}
}
else
{
if (*UTF8Str == 0)
break;
[output appendFormat:@"%c", *UTF8Str];
}
UTF8Str++;
}
return output;
}
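For comparison, here is a plain C sketch of the same idea that works on UTF-8 bytes directly. The function names are invented for this example. It assumes escapes have exactly four hex digits, it does not combine surrogate-pair escapes into one character, and it copies every other byte unchanged so existing multibyte characters survive.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Write code point cp (U+0000..U+FFFF) as UTF-8 into out; returns bytes written. */
static size_t encode_utf8_bmp(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Copy UTF-8 input to out, replacing each \uXXXX or \UXXXX escape (exactly
 * four hex digits) with the UTF-8 bytes of that code point. out must hold at
 * least strlen(in) + 1 bytes: an escape is 6 bytes long and expands to at
 * most 3, so the output is never longer than the input. */
void unescape_unicode(const char *in, char *out)
{
    while (*in) {
        /* && short-circuits, so no byte past the terminator is ever read */
        if (in[0] == '\\' && (in[1] == 'u' || in[1] == 'U') &&
            isxdigit((unsigned char)in[2]) && isxdigit((unsigned char)in[3]) &&
            isxdigit((unsigned char)in[4]) && isxdigit((unsigned char)in[5])) {
            char hex[5] = { in[2], in[3], in[4], in[5], 0 };
            out += encode_utf8_bmp(strtoul(hex, NULL, 16), out);
            in += 6;
        } else {
            *out++ = *in++;   /* ordinary byte or incomplete escape */
        }
    }
    *out = '\0';
}
```

The output buffer could then be turned back into an NSString with stringWithUTF8String:. As said above, I would only run this on the offending feed.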
You should simply replace the literal '\U2026' with the actual ellipsis character, then encode the string to NSData with NSUTF8StringEncoding.

MD5 with ASCII Char

I have a string
wDevCopyright = [NSString stringWithFormat:@"Copyright: %c 1995 by WIRELESS.dev, Corp Communications Inc., All rights reserved.",0xa9];
and to munge it I call
-(NSString *)getMD5:(NSString *)source
{
const char *src = [source UTF8String];
unsigned char result[CC_MD5_DIGEST_LENGTH];
CC_MD5(src, strlen(src), result);
return [NSString stringWithFormat:
#"%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
result[0], result[1], result[2], result[3],
result[4], result[5], result[6], result[7],
result[8], result[9], result[10], result[11],
result[12], result[13], result[14], result[15]
]; //ret;
}
Because of the 0xa9, src = [source UTF8String] does not produce the bytes I expect, so the resulting hash does not match the one computed on other platforms.
I tried to encode the char with NSASCIIStringEncoding but it broke the code.
How do I call CC_MD5 with a string that has ASCII characters and get the same hash as in Java?
Update to code request:
Java
private static char[] kTestASCII = {
169
};
System.out.println("\n\n>>>>> msg## " + (char)0xa9 + " " + (char)169 + "\n md5 " + md5(new String(kTestASCII), false)); //unicode = false
Result >>>>> msg## \251 \251
md5 a252c2c85a9e7756d5ba5da9949d57ed
ObjC
char kTestASCII [] = {
169
};
NSString *testString = [NSString stringWithCString:kTestASCII encoding:NSUTF8StringEncoding];
NSLog(@">>>> objC msg## int %d char %c md5: %@", 0xa9, 169, [self getMD5:testString]);
Result >>>> objC msg## int 169 char © md5: 9b759040321a408a5c7768b4511287a6
** As stated earlier - without the 0xa9 the hashes in Java and ObjC are the same. I am trying to get the hash for 0xa9 the same in Java and ObjC
Java MD5 code
private static char[] kTestASCII = {
169
};
md5(new String(kTestASCII), false);
/**
* Compute the MD5 hash for the given String.
* @param s the string to add to the digest
* @param unicode true if the string is unicode, false for ascii strings
*/
public synchronized final String md5(String value, boolean unicode)
{
MD5();
MD5.update(value, unicode);
return WUtilities.toHex(MD5.finish());
}
public synchronized void update(String s, boolean unicode)
{
if (unicode)
{
char[] c = new char[s.length()];
s.getChars(0, c.length, c, 0);
update(c);
}
else
{
byte[] b = new byte[s.length()];
s.getBytes(0, b.length, b, 0);
update(b);
}
}
public synchronized void update(byte[] b)
{
update(b, 0, b.length);
}
//--------------------------------------------------------------------------------
/**
* Add a byte sub-array to the digest.
*/
public synchronized void update(byte[] b, int offset, int length)
{
for (int n = offset; n < offset + length; n++)
update(b[n]);
}
/**
* Add a byte to the digest.
*/
public synchronized void update(byte b)
{
int index = (int)((count >>> 3) & 0x03f);
count += 8;
buffer[index] = b;
if (index >= 63)
transform();
}
I believe that my issue is with using NSData withEncoding as opposed to a C char[] or the Java byte[]. So what is the best way to roll my own bytes into a byte[] in objC?
The character you are having problems with, ©, is the Unicode COPYRIGHT SIGN (U+00A9). The correct UTF-8 encoding of this character is the byte sequence 0xc2 0xa9.
You are attempting, however, to convert from the single-byte sequence 0xa9, which is not a valid UTF-8 encoding of any character. See table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404 . Since this is not a valid UTF-8 byte sequence, stringWithCString is converting your input to the Unicode REPLACEMENT CHARACTER (U+FFFD). When this character is then encoded back into UTF-8, it yields the byte sequence 0xef 0xbf 0xbd. The MD5 of this sequence is 9b759040321a408a5c7768b4511287a6 as reported by your Objective-C example.
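If you want to check these byte sequences yourself, here is a small sketch of a UTF-8 encoder for a single code point. The function name is invented for this example and it is not what NSString does internally, just the standard bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8. Returns the number of bytes written
 * to out (1 to 4), or 0 for values that are not encodable code points.
 * (Hypothetical helper for illustration.) */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+00A9 this way gives the two bytes C2 A9, and U+FFFD gives EF BF BD. A lone A9 byte never comes out of this function because it is a continuation byte, which is why it cannot stand on its own in UTF-8.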
Your Java example yields an MD5 of a252c2c85a9e7756d5ba5da9949d57ed, which simple experimentation shows is the MD5 of the byte sequence 0xa9, which I have already noted is not a valid UTF-8 representation of the desired character.
I think we need to see the implementation of the Java md5() method you are using. I suspect it is simply dropping the high bytes of every Unicode character to convert to a byte sequence for passing to the MessageDigest class. This does not match your Objective-C implementation where you are using a UTF-8 encoding.
Note: even if you fix your Objective-C implementation to match the encoding of your Java md5() method, your test will need some adjustment because you cannot use stringWithCString with the NSUTF8StringEncoding encoding to convert the byte sequence 0xa9 to an NSString.
UPDATE
Having now seen the Java implementation using the deprecated getBytes method, my recommendation is to change the Java implementation, if at all possible, to use a proper UTF-8 encoding.
I suspect, however, that your requirements are to match the current Java implementation, even if it is wrong. Therefore, I suggest you duplicate the bad behavior of Java's deprecated getBytes() method by using NSString getCharacters:range: to retrieve an array of unichars, then manually create an array of bytes by taking the low byte of each unichar.
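The low-byte step could look roughly like this in plain C. The function name is made up for this sketch, and it assumes you have already copied the unichars (16-bit code units) out of the NSString.

```c
#include <stddef.h>
#include <stdint.h>

/* Keep only the low 8 bits of each UTF-16 code unit, mirroring what Java's
 * deprecated String.getBytes(int, int, byte[], int) does. out must have room
 * for count bytes. (Hypothetical helper for illustration.) */
void low_bytes_of_utf16(const uint16_t *units, size_t count, uint8_t *out)
{
    for (size_t i = 0; i < count; i++)
        out[i] = (uint8_t)(units[i] & 0xFF);
}
```

The resulting byte array is what you would pass to CC_MD5 instead of the UTF8String bytes. Note that this is lossy: any character above U+00FF is truncated, exactly as in the Java code.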
stringWithCString requires a null terminated C-String. I don't think that kTestASCII[] is necessarily null terminated in your Objective-C code. Perhaps that is the cause of the difference.
Try:
char kTestASCII [] = {
169,
0
};
Thanks to GBegan's explanation - here is my solution
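// collect one byte per character; s is the NSString to hash
NSMutableData *bytes = [NSMutableData dataWithCapacity:[s length]];
for(NSUInteger i = 0; i < [s length]; i++){
// keep only the low byte of each unichar, like Java's deprecated getBytes
unsigned char lowByte = (unsigned char)([s characterAtIndex:i] & 0xFF);
[bytes appendBytes:&lowByte length:1];
}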

Find Character String In Binary Data

I have a binary file I've loaded using an NSData object. Is there a way to locate a sequence of characters, 'abcd' for example, within that binary data and return the offset without converting the entire file to a string? Seems like it should be a simple answer, but I'm not sure how to do it. Any ideas?
I'm doing this on iOS 3 so I don't have -rangeOfData:options:range: available.
I'm going to award this one to Sixteen Otto for suggesting strstr. I went and found the source code for the C function strstr and rewrote it to work on a fixed length Byte array--which incidentally is different from a char array as it is not null terminated. Here is the code I ended up with:
- (Byte*)offsetOfBytes:(Byte*)bytes inBuffer:(const Byte*)buffer ofLength:(int)len;
{
Byte *cp = bytes;
Byte *s1, *s2;
if ( !*buffer )
return bytes;
int i = 0;
for (i=0; i < len; ++i)
{
s1 = cp;
s2 = (Byte*)buffer;
while ( *s1 && *s2 && !(*s1-*s2) )
s1++, s2++;
if (!*s2)
return cp;
cp++;
}
return NULL;
}
This returns a pointer to the first occurrence of bytes, the thing I'm looking for, in buffer, the byte array that should contain bytes.
I call it like this:
// data is the NSData object
const Byte *bytes = [data bytes];
Byte* index = [self offsetOfBytes:tag inBuffer:bytes ofLength:[data length]];
Convert your substring to an NSData object, and search for those bytes in the larger NSData using rangeOfData:options:range:. Make sure that the string encodings match!
On iPhone, where that isn't available, you may have to do this yourself. The C function strstr() will give you a pointer to the first occurrence of a pattern within the buffer (as long as neither contain nulls!), but not the index. Here's a function that should do the job (but no promises, since I haven't tried actually running it...):
- (NSUInteger)indexOfData:(NSData*)needle inData:(NSData*)haystack
{
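// use byte pointers; a void pointer cannot be indexed
const unsigned char* needleBytes = [needle bytes];
const unsigned char* haystackBytes = [haystack bytes];
// a needle longer than the haystack cannot match; returning early also
// avoids unsigned underflow in the loop bound below
if ([needle length] > [haystack length]) return NSNotFound;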
// walk the length of the buffer, looking for a byte that matches the start
// of the pattern; we can skip (|needle|-1) bytes at the end, since we can't
// have a match that's shorter than needle itself
for (NSUInteger i=0; i < [haystack length]-[needle length]+1; i++)
{
// walk needle's bytes while they still match the bytes of haystack
// starting at i; if we walk off the end of needle, we found a match
NSUInteger j=0;
while (j < [needle length] && needleBytes[j] == haystackBytes[i+j])
{
j++;
}
if (j == [needle length])
{
return i;
}
}
return NSNotFound;
}
This runs in something like O(nm), where n is the buffer length, and m is the size of the substring. It's written to work with NSData for two reasons: 1) that's what you seem to have in hand, and 2) those objects already encapsulate both the actual bytes, and the length of the buffer.
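The same search can also be written as a plain C function on raw buffers. This is only a sketch with a made-up name. It uses memcmp for the inner comparison and returns SIZE_MAX where the method above returns NSNotFound.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the offset of the first occurrence of needle in haystack, or
 * SIZE_MAX when there is none. Lengths are passed explicitly, so embedded
 * zero bytes are handled, unlike with strstr().
 * (Hypothetical helper for illustration.) */
size_t find_bytes(const uint8_t *haystack, size_t haystack_len,
                  const uint8_t *needle, size_t needle_len)
{
    if (needle_len > haystack_len)
        return SIZE_MAX;
    /* i + needle_len <= haystack_len keeps every comparison inside the buffer */
    for (size_t i = 0; i + needle_len <= haystack_len; i++) {
        if (memcmp(haystack + i, needle, needle_len) == 0)
            return i;
    }
    return SIZE_MAX;
}
```

With NSData you would call it with [data bytes] and [data length] for the haystack. It has the same O(nm) worst case as the version above.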
If you're using Snow Leopard, a convenient way is the new -rangeOfData:options:range: method in NSData that returns the range of the first occurrence of a piece of data. Otherwise, you can access the NSData's contents yourself using its -bytes method to perform your own search.
I had the same problem.
I solved it doing the other way round, compared to the suggestions.
first, I reformat the data (assume your NSData is stored in var rawFile) with:
NSString *ascii = [[NSString alloc] initWithData:rawFile encoding:NSASCIIStringEncoding];
Now you can easily search for strings like 'abcd', or whatever you want, by passing the ascii string to an NSScanner. Maybe this is not really efficient, but it works until the -rangeOfData method becomes available on iPhone as well.
