How to read Chinese from a PDF in iOS correctly

Here is what I have done, but the extracted text comes out garbled. Thanks in advance.
1. Use CGPDFStringCopyTextString to get the text from the PDF.
2. Encode the NSString to char*:
NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
const char *char_content = [self.currentData cStringUsingEncoding:enc];
Below is how I get currentData:
void arrayCallback(CGPDFScannerRef inScanner, void *userInfo)
{
    BIDViewController *pp = (__bridge BIDViewController*)userInfo;
    CGPDFArrayRef array;
    bool success = CGPDFScannerPopArray(inScanner, &array);
    // nothing to read if the operand was not an array
    if(!success)
        return;
    for(size_t n = 0; n < CGPDFArrayGetCount(array); n += 1)
    {
        CGPDFStringRef string;
        success = CGPDFArrayGetString(array, n, &string);
        if(success)
        {
            // the Copy function returns an owned CFString; hand ownership to ARC
            NSString *data = (__bridge_transfer NSString *)CGPDFStringCopyTextString(string);
            [pp.currentData appendFormat:@"%@", data];
        }
    }
}
- (IBAction)press:(id)sender {
    table = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
    CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
    self.currentData = [NSMutableString string];
    CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pagerf);
    CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, (__bridge void *)(self));
    bool ret = CGPDFScannerScan(scanner);
    // release the Core Graphics objects created above
    CGPDFScannerRelease(scanner);
    CGPDFContentStreamRelease(contentStream);
}

According to the Mac Developer Library
CGPDFStringCopyTextString returns a CFString object that represents a PDF string as a text string. The PDF string is given as a CGPDFString which is a series of bytes—unsigned integer values in the range 0 to 255; thus, this method already decodes the bytes according to some character encoding.
No encoding is passed to it explicitly, so it has to assume one: most likely PDFDocEncoding or UTF-16BE, the two encodings that may be used to represent text strings in a PDF document outside the document's content streams (cf. section 7.9.2.2 Text String Type and Table D.1, Annex D in the PDF specification).
Now you have not told us from where you received your CGPDFString. I assume, though, that you received it from inside one of the document’s content streams. Text strings there, on the other hand, can be encoded with any imaginable encoding. The encoding used is given by the embedded data of the font the string is to be displayed with.
For more information on this, you may want to read "CGPDFScannerPopString returning strange result" and have a look at PDFKitten.
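To make the two text-string encodings above concrete: per the PDF specification, a text string that begins with the byte-order marker 0xFE 0xFF is UTF-16BE, and otherwise it is read as PDFDocEncoding. Below is a minimal C sketch of that check. The function name is made up for illustration, and the check says nothing about strings inside content streams, where the font decides the encoding.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if a PDF text string (one found outside
 * the content streams) starts with the UTF-16BE byte-order marker
 * 0xFE 0xFF, and 0 otherwise, in which case it would be PDFDocEncoding. */
int pdf_text_string_is_utf16be(const unsigned char *bytes, size_t length)
{
    return length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF;
}
```

For strings taken from a content stream, such as the operands of Tj and TJ, this test is not enough; the bytes have to be mapped through the encoding of the current font.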

Related

Process unicode string in C and Objective C

I wrote a C function to read characters in a user-input string. Because the string is user input, it can contain any Unicode characters. An Objective-C method receives the user-input NSString, converts it to NSData and passes this data to the C function for processing. The C function searches for these symbol characters: *, [, ], _; it doesn't care about any other characters. Every time it finds one of the symbols, it processes it and then calls an Objective-C method, passing the location of the symbol.
C code:
typedef void (* callback)(void *context, size_t location);
void process(const uint8_t *data, size_t length, callback cb, void *context)
{
    size_t i = 0;
    while (i < length)
    {
        if (data[i] == '*' || data[i] == '[' || data[i] == ']' || data[i] == '_')
        {
            int valid = 0;
            //do something, set valid = 1
            if (valid)
                cb(context, i);
        }
        i++;
    }
}
Objective C code:
//a C function declared in .m file
void mycallback(void *context, size_t location)
{
    [(__bridge id)context processSymbolAtLocation:location];
}
- (void)processSymbolAtLocation:(NSInteger)location
{
    NSString *result = [self.string substringWithRange:NSMakeRange(location, 1)];
    NSLog(@"%@", result);
}
- (void)processUserInput:(NSString*)string
{
    self.string = string;
    //convert string to data
    NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
    //pass data to C function
    process(data.bytes, data.length, mycallback, (__bridge void *)(self));
}
The code works fine if the input string contains only English characters. If it contains composed character sequences, multibyte characters or other Unicode characters, the result string in the processSymbolAtLocation method is not the expected symbol.
How do I convert the NSString object to NSData correctly? How do I get the correct location?
Thanks!

Your problem is that you start off with a UTF-16 encoded NSString and produce a sequence of UTF-8 encoded bytes. The number of code units required to represent a string in UTF-16 may not be equal to that number required to represent it in UTF-8, so the offsets in your two forms may not match - as you have found out.
Why are you using C to scan the string for matches in the first place? You might want to look at NSString's rangeOfCharacterFromSet:options:range: method which you can use to find the next occurrence of character from your set.
If you need to use C, then convert your string into a sequence of UTF-16 code units and use uint16_t on the C side.
HTH
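As a rough illustration of the uint16_t suggestion, here is a minimal C sketch that scans UTF-16 code units instead of UTF-8 bytes, so the reported indices line up with NSString character indices. The function name and the output-array interface are assumptions made for this example, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scans UTF-16 code units for '*', '[', ']' and '_'.
 * Stores up to max_out matching indices in out and returns the total
 * number of matches. Indices count UTF-16 code units, the same unit
 * NSString uses for its character indices. */
size_t find_symbols_utf16(const uint16_t *units, size_t count,
                          size_t *out, size_t max_out)
{
    size_t found = 0;
    for (size_t i = 0; i < count; i++) {
        uint16_t u = units[i];
        if (u == '*' || u == '[' || u == ']' || u == '_') {
            if (found < max_out)
                out[found] = i;
            found++;
        }
    }
    return found;
}
```

On the Objective-C side the code units could be obtained with -getCharacters:range:, which fills a unichar buffer. Surrogate halves lie in the range 0xD800 to 0xDFFF, so they can never be mistaken for one of the four ASCII symbols.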

Obfuscating a number(in a string) Objective C

I'm using the following code to obfuscate a passcode for a test app of mine.
- (NSString *)obfuscate:(NSString *)string withKey:(NSString *)key
{
    // Create data object from the string
    NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
    // Get pointer to data to obfuscate
    char *dataPtr = (char *) [data bytes];
    // Get pointer to key data
    char *keyData = (char *) [[key dataUsingEncoding:NSUTF8StringEncoding] bytes];
    // Points to each char in sequence in the key
    char *keyPtr = keyData;
    int keyIndex = 0;
    // For each character in data, xor with current value in key
    for (int x = 0; x < [data length]; x++)
    {
        // Replace current character in data with
        // current character xor'd with current key value.
        // Bump each pointer to the next character
        *dataPtr = *dataPtr++ ^ *keyPtr++;
        // If at end of key data, reset count and
        // set key pointer back to start of key value
        if (++keyIndex == [key length])
            keyIndex = 0, keyPtr = keyData;
    }
    return [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
}
This works like a charm with all strings, but I've run into a bit of a problem comparing the following results:
NSLog([self obfuscate:@"0000" withKey:@"maki"]); //Returns 0]<W
NSLog([self obfuscate:@"0809" withKey:@"maki"]); //Returns 0]<W
As you can see, the two strings with numbers in, while different, return the same result! What's gone wrong in the code I've attached to produce the same result for these two numbers?
Another example:
NSLog([self obfuscate:@"8000" withKey:@"maki"]); //Returns 8U4_
NSLog([self obfuscate:@"8290" withKey:@"maki"]); //Returns 8U4_ as well
I may be misunderstanding the concept of obfuscation, but I was under the impression that each unique string returns a unique obfuscated string!
Please help me fix this bug/glitch
Source of Code: http://iosdevelopertips.com/cocoa/obfuscation-encryption-of-string-nsstring.html

The problem is your last line. You create the new string with the original, unmodified data object.
You need to create a new NSData object from the modified dataPtr bytes.
NSData *newData = [NSData dataWithBytes:dataPtr length:data.length];
return [[NSString alloc] initWithData:newData encoding:NSUTF8StringEncoding];
But you have some bigger issues.
The call to bytes returns a constant, read-only pointer to the bytes in the NSData object. You should NOT be modifying that data.
The result of your XOR on the character data could, in theory, be a byte stream that is no longer a valid UTF-8 encoded string, in which case initWithData:encoding: returns nil.
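One way to sidestep both issues is to leave the input untouched and hex-encode the XOR result, so the output is always printable text. The following C sketch takes that approach. The function name and the choice of hex output are assumptions for illustration, not what the original code does.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* XORs data with a repeating key and returns the result as a malloc'd
 * lowercase hex string (the caller frees it), or NULL on failure.
 * The inputs are only read, never modified. */
char *xor_to_hex(const char *data, const char *key)
{
    size_t data_len = strlen(data);
    size_t key_len = strlen(key);
    if (key_len == 0)
        return NULL;
    char *hex = malloc(data_len * 2 + 1);
    if (hex == NULL)
        return NULL;
    for (size_t i = 0; i < data_len; i++) {
        unsigned char b = (unsigned char)data[i] ^ (unsigned char)key[i % key_len];
        /* two hex digits plus the terminator written by snprintf */
        snprintf(hex + i * 2, 3, "%02x", b);
    }
    hex[data_len * 2] = '\0';
    return hex;
}
```

With the key "maki", the inputs "0000" and "0809" give "5d515b59" and "5d595b50" respectively, so distinct inputs of equal length stay distinct under the same key.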

The obfuscation algorithm that you have selected is based on XORing the data and the "key" values together. Generally, this is not very strong. Moreover, since XOR is symmetric, the results are very prone to producing duplicates.
Although your implementation is currently broken, fixing it would not be of much help in preventing the algorithm from producing identical results for different data: it is relatively straightforward to construct key/data pairs that produce the same obfuscated string - for example,
[self obfuscate:@"0123" withKey:@"vwxy"]
[self obfuscate:@"pqrs" withKey:@"6789"]
will produce identical results "FFJJ", even though both the strings and the keys look sufficiently different.
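The collision can be checked with a few lines of plain C. This is only a sketch of the XOR step on raw bytes, with a made-up function name; it writes to a separate output buffer and assumes a non-empty key.

```c
#include <stddef.h>

/* XORs each byte of data with the repeating key and writes the result
 * to out, which must have room for data_len bytes.
 * key_len must be greater than zero. */
void xor_bytes(const unsigned char *data, size_t data_len,
               const unsigned char *key, size_t key_len,
               unsigned char *out)
{
    for (size_t i = 0; i < data_len; i++)
        out[i] = data[i] ^ key[i % key_len];
}
```

Feeding it "0123" with key "vwxy" and "pqrs" with key "6789" yields the bytes of "FFJJ" in both cases, because '0' ^ 'v' and 'p' ^ '6' are both 0x46, and so on for the other positions.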
If you would like to "obfuscate" your strings in a cryptographically strong way, use a salted secure hash algorithm: it will produce very different results for even slightly different strings.

iOS - XML to NSString conversion

I'm using NSXMLParser for parsing XML in my app and am having a problem with the encoding type. For example, here is one of the feeds coming in. It looks similar to this:
\U2026Some random text from the xml feed\U2026
I am currently using the encoding type:
NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
Which encoding type am I supposed to use for converting \U2026 into an ellipsis (…)?

The answer here is you're screwed. They are using a non-standard encoding for XML, but what if they really want the literal \U2026? Let's say you add a decoder to handle all \UXXXX and \uXXXX encodings. What happens when another feed wants the data to be the literal \U2026?
Your first choice and best bet is to get this feed fixed. If they need to encode data, they need to use proper HTML entities or numeric references.
As a fallback, I would isolate the decoder from the XML parser. Don't create a non-conforming XML parser just because you're getting non-conforming data. Have a post-processor that runs only on the offending feed.
If you must have a decoder, then there is more bad news. There is no built-in decoder; you will need to find a category online or write one yourself.
After some poking around, I think "Using Objective C/Cocoa to unescape unicode characters, ie \u1234" may work for you.

Alright, here's a snippet of code that should work for any Unicode code point that fits in four hex digits:
NSString *stringByUnescapingUnicodeSymbols(NSString *input)
{
    NSUInteger length = [input length];
    NSMutableString *output = [NSMutableString stringWithCapacity:length];
    NSUInteger i = 0;
    while (i < length) {
        unichar c = [input characterAtIndex:i];
        // an escape is '\', then 'u' or 'U', then exactly 4 hex digits
        if (c == '\\' && i + 6 <= length)
        {
            unichar u = [input characterAtIndex:i + 1];
            BOOL valid = (u == 'u' || u == 'U');
            // make sure we only read 4 chars, and that all are hex digits
            char tmp[5] = { 0 };
            for (NSUInteger k = 0; valid && k < 4; k++) {
                unichar h = [input characterAtIndex:i + 2 + k];
                valid = (h < 128 && isxdigit((int)h));
                tmp[k] = (char)h;
            }
            if (valid)
            {
                // remember that Unicode escapes are base 16
                unichar decoded = (unichar)strtol(tmp, NULL, 16);
                [output appendString:[NSString stringWithCharacters:&decoded length:1]];
                i += 6; // skip the whole escape
                continue;
            }
        }
        // anything else, including non-ASCII text, is copied through unchanged
        [output appendString:[NSString stringWithCharacters:&c length:1]];
        i++;
    }
    return output;
}
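For comparison, the same unescaping idea can be sketched in plain C on a UTF-8 buffer, for example as a post-processing step on the raw feed data before it reaches the parser. The function names are made up for this example, and it only handles four-digit escapes; it does not combine surrogate pairs.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encodes one code point in the range U+0000..U+FFFF as UTF-8.
 * Returns the number of bytes written (1 to 3). */
static size_t encode_utf8_bmp(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Replaces every \uXXXX or \UXXXX escape in a UTF-8 string with the
 * UTF-8 encoding of that code point. Returns a malloc'd string that
 * the caller frees, or NULL if allocation fails. */
char *unescape_unicode(const char *in)
{
    size_t len = strlen(in);
    /* a 6-byte escape becomes at most 3 bytes, so the output never grows */
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    size_t o = 0;
    size_t i = 0;
    while (i < len) {
        /* short-circuit evaluation stops at the terminator, so no read
         * goes past the end of the input */
        if (in[i] == '\\' && (in[i + 1] == 'u' || in[i + 1] == 'U')
            && isxdigit((unsigned char)in[i + 2])
            && isxdigit((unsigned char)in[i + 3])
            && isxdigit((unsigned char)in[i + 4])
            && isxdigit((unsigned char)in[i + 5])) {
            char hex[5] = { in[i + 2], in[i + 3], in[i + 4], in[i + 5], 0 };
            o += encode_utf8_bmp(strtoul(hex, NULL, 16), out + o);
            i += 6;
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return out;
}
```

Incomplete escapes such as \u12 are copied through as they are, which keeps the function from guessing at malformed input.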

You should simply replace the literal '\U2026' with the character it stands for (an ellipsis), then encode the string to NSData with NSUTF8StringEncoding.

How do i ignore illegal characters when parsing an rss feed using nsxmlparser?

When using NSXMLParser (indirectly through Michael Waterfall's MWFeedParser library)
and parsing the following RSS feed:
http://qdb.us/qdb.xml?action=latest
NSURL *feedURL = [NSURL URLWithString:@"http://qdb.us/qdb.xml?action=random"];
self.feedParser = [[MWFeedParser alloc] initWithFeedURL:feedURL];
self.feedParser.delegate = self;
self.feedParser.feedParseType = ParseTypeFull; // Parse feed info and all items
self.feedParser.connectionType = ConnectionTypeAsynchronously;
[self.feedParser parse];
I get back an error about an invalidly formatted XML document, which appears to be caused by an illegal character in the feed.
http://validator.w3.org/check?uri=http%3A%2F%2Fqdb.us%2Fqdb.xml%3Faction%3Dlatest&charset=utf-8&doctype=Inline&group=0&user-agent=W3C_Validator%2F1.1
I've tried changing the document's encoding from ISO-8859-1 to UTF-8 but the problem still occurs.
How do I identify the illegal character, and how do I make sure that parsing the RSS feed doesn't fall over when it encounters these illegal characters?
References: (links I've already investigated)
HTML character decoding in Objective-C / Cocoa Touch
https://stackoverflow.com/users/106244/michael-waterfall

I don't know how to make NSXMLParser ignore illegal characters, but you could use a regex to remove them before parsing. Alternatively, I suggest using KissXML instead of NSXMLParser, which may cope with illegal characters; see "How To Choose The Best XML Parser for Your iPhone Project".

I found something like this while parsing EPG Data grabbed from the REST API of my Enigma2 receiver. In this case one service was pushing EPGInfo with the illegal character 0x05.
I have implemented a cleanup method for incoming NSData. This is the poor man's way to filter these 0x05 bytes from the NSData I receive from NSURLSession before passing it to the parser:
-(NSData *)DataCleaned:(NSData *)data {
    NSData *clean = nil;
    const char *old = (const char *)data.bytes;
    char *flt = (char *)calloc( data.length, sizeof( char ) );
    NSInteger cnt = 0;
    for( NSInteger i = 0; i < data.length; i++ ) {
        if ( old[i] != 0x05 )
            flt[cnt++] = old[i];
    }
    clean = [NSData dataWithBytes:flt length:cnt];
    free( flt );
    return clean;
}
In my case, this solved the problem. Of course, this requires loading the whole response into NSData before parsing it.
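The same filter can be generalised beyond 0x05. XML 1.0 allows no characters below 0x20 other than tab (0x09), line feed (0x0A) and carriage return (0x0D), so a C sketch along the following lines drops every other control byte. The function name is made up, and the sketch assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, where bytes below 0x20 never occur inside a multibyte character.

```c
#include <stddef.h>

/* Copies in to out, dropping bytes that XML 1.0 does not allow as
 * characters: everything below 0x20 except tab, line feed and
 * carriage return. out must have room for len bytes.
 * Returns the number of bytes kept. */
size_t strip_illegal_xml_bytes(const unsigned char *in, size_t len,
                               unsigned char *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D)
            out[kept++] = c;
    }
    return kept;
}
```

The cleaned buffer and the returned length could then be wrapped in an NSData and handed to the parser, as in the method above.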

Find Character String In Binary Data

I have a binary file I've loaded using an NSData object. Is there a way to locate a sequence of characters, 'abcd' for example, within that binary data and return the offset without converting the entire file to a string? Seems like it should be a simple answer, but I'm not sure how to do it. Any ideas?
I'm doing this on iOS 3 so I don't have -rangeOfData:options:range: available.
I'm going to award this one to Sixteen Otto for suggesting strstr. I went and found the source code for the C function strstr and rewrote it to work on a fixed length Byte array--which incidentally is different from a char array as it is not null terminated. Here is the code I ended up with:
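- (Byte*)offsetOfBytes:(const Byte*)bytes inBuffer:(const Byte*)buffer ofLength:(int)len
{
    // bytes is the null-terminated pattern; buffer is the data to search (len bytes)
    const Byte *end = buffer + len;
    if ( !*bytes )
        return (Byte*)buffer;
    for (const Byte *cp = buffer; cp < end; ++cp)
    {
        const Byte *s1 = cp;
        const Byte *s2 = bytes;
        // compare until the pattern ends, the buffer ends, or a byte differs
        while ( *s2 && s1 < end && *s1 == *s2 )
            s1++, s2++;
        // reaching the pattern's terminator means every byte matched
        if (!*s2)
            return (Byte*)cp;
    }
    return NULL;
}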
This returns a pointer to the first occurrence of bytes, the thing I'm looking for, in buffer, the byte array that should contain bytes.
I call it like this:
// data is the NSData object
const Byte *bytes = [data bytes];
Byte* index = [self offsetOfBytes:tag inBuffer:bytes ofLength:[data length]];

Convert your substring to an NSData object, and search for those bytes in the larger NSData using rangeOfData:options:range:. Make sure that the string encodings match!
On iPhone, where that isn't available, you may have to do this yourself. The C function strstr() will give you a pointer to the first occurrence of a pattern within the buffer (as long as neither contains nulls!), but not the index. Here's a function that should do the job (but no promises, since I haven't tried actually running it...):
- (NSUInteger)indexOfData:(NSData*)needle inData:(NSData*)haystack
{
    // use byte pointers: a void* cannot be indexed
    const uint8_t* needleBytes = [needle bytes];
    const uint8_t* haystackBytes = [haystack bytes];
    NSUInteger needleLength = [needle length];
    NSUInteger haystackLength = [haystack length];
    // an empty or too-long needle cannot match; this check also keeps the
    // unsigned subtraction below from wrapping around
    if (needleLength == 0 || needleLength > haystackLength)
        return NSNotFound;
    // walk the length of the buffer, looking for a byte that matches the start
    // of the pattern; we can skip (|needle|-1) bytes at the end, since we can't
    // have a match that's shorter than needle itself
    for (NSUInteger i=0; i < haystackLength-needleLength+1; i++)
    {
        // walk needle's bytes while they still match the bytes of haystack
        // starting at i; if we walk off the end of needle, we found a match
        NSUInteger j=0;
        while (j < needleLength && needleBytes[j] == haystackBytes[i+j])
        {
            j++;
        }
        if (j == needleLength)
        {
            return i;
        }
    }
    return NSNotFound;
}
This runs in something like O(nm), where n is the buffer length, and m is the size of the substring. It's written to work with NSData for two reasons: 1) that's what you seem to have in hand, and 2) those objects already encapsulate both the actual bytes, and the length of the buffer.
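The same O(nm) search can also be sketched in plain C on raw byte buffers, using memcmp for the inner comparison. The function name and the BYTES_NOT_FOUND constant are made up for this example. Because both lengths are passed explicitly, embedded zero bytes are not a problem.

```c
#include <stddef.h>
#include <string.h>

/* Sentinel for "no match", playing the role NSNotFound plays above. */
#define BYTES_NOT_FOUND ((size_t)-1)

/* Returns the offset of the first occurrence of needle in haystack,
 * or BYTES_NOT_FOUND if there is none. */
size_t find_bytes(const unsigned char *haystack, size_t haystack_len,
                  const unsigned char *needle, size_t needle_len)
{
    /* an empty or too-long needle cannot match */
    if (needle_len == 0 || needle_len > haystack_len)
        return BYTES_NOT_FOUND;
    for (size_t i = 0; i + needle_len <= haystack_len; i++) {
        if (memcmp(haystack + i, needle, needle_len) == 0)
            return i;
    }
    return BYTES_NOT_FOUND;
}
```

With an NSData object, the haystack pointer and length would come from its bytes and length methods.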

If you're using Snow Leopard, a convenient way is the new -rangeOfData:options:range: method in NSData that returns the range of the first occurrence of a piece of data. Otherwise, you can access the NSData's contents yourself using its -bytes method to perform your own search.

I had the same problem.
I solved it the other way round, compared to the suggestions.
First, I convert the data to a string (assume your NSData is stored in the variable rawFile) with:
NSString *ascii = [[NSString alloc] initWithData:rawFile encoding:NSASCIIStringEncoding];
Now you can easily search for strings like 'abcd', or whatever you want, by passing the ascii string to an NSScanner. Maybe this is not really efficient, but it works until the -rangeOfData method is available on iPhone as well.
