How to search for non-ASCII characters (Cyrillic) in a PDF using Quartz PDF? - ios

I've hit a problem searching for Cyrillic (or any other non-ASCII) characters in a PDF using CGPDFScanner. The code I am using is similar to the code from the Random Ideas blog that has been mentioned on SO. For Cyrillic PDFs the scanner output is complete garbage that can't be decoded into anything meaningful, while English characters in the same PDFs are found perfectly. So the Cyrillic text is evidently encoded somehow, and we can't work out how to decode it properly.
What are we missing here?
Thanks in advance to anyone who can shed any light on the subject.

Have you tried pushing that string through a different encoding? When I look at NSString.h, I see something suspiciously labelled "cyrillic" which also has "Adobe" on the same line :) (i.e., try NSWindowsCP1251StringEncoding)
enum {
NSASCIIStringEncoding = 1, /* 0..127 only */
NSNEXTSTEPStringEncoding = 2,
NSJapaneseEUCStringEncoding = 3,
NSUTF8StringEncoding = 4,
NSISOLatin1StringEncoding = 5,
NSSymbolStringEncoding = 6,
NSNonLossyASCIIStringEncoding = 7,
NSShiftJISStringEncoding = 8, /* kCFStringEncodingDOSJapanese */
NSISOLatin2StringEncoding = 9,
NSUnicodeStringEncoding = 10,
NSWindowsCP1251StringEncoding = 11, /* Cyrillic; same as AdobeStandardCyrillic */
NSWindowsCP1252StringEncoding = 12, /* WinLatin1 */
NSWindowsCP1253StringEncoding = 13, /* Greek */
NSWindowsCP1254StringEncoding = 14, /* Turkish */
NSWindowsCP1250StringEncoding = 15, /* WinLatin2 */
NSISO2022JPStringEncoding = 21, /* ISO 2022 Japanese encoding for e-mail */
NSMacOSRomanStringEncoding = 30,
NSUTF16StringEncoding = NSUnicodeStringEncoding, /* An alias for NSUnicodeStringEncoding */
NSUTF16BigEndianStringEncoding = 0x90000100, /* NSUTF16StringEncoding encoding with explicit endianness specified */
NSUTF16LittleEndianStringEncoding = 0x94000100, /* NSUTF16StringEncoding encoding with explicit endianness specified */
NSUTF32StringEncoding = 0x8c000100,
NSUTF32BigEndianStringEncoding = 0x98000100, /* NSUTF32StringEncoding encoding with explicit endianness specified */
NSUTF32LittleEndianStringEncoding = 0x9c000100 /* NSUTF32StringEncoding encoding with explicit endianness specified */
};

You might have to dig deeper into the Apple spec and headers on this. Add NSLog lines (and post the output here) showing what the scanner finds for the normal PDF and for the Cyrillic ones. There are lots of possibilities; perhaps the text is in a different encoding and you need to convert the string you have using that encoding. I'm sure there is a way to list all the operators in the table, to see whether there are extra ones in your Cyrillic PDF. Also, this might help, as it is a very similar problem to the one you're trying to solve; it points to a library that is better tuned to scanning too.
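If the logged bytes do turn out to be Windows-1251 (the encoding behind NSWindowsCP1251StringEncoding), the mapping is easy to check by hand: 0xC0..0xFF are the letters U+0410..U+044F, and 0xA8/0xB8 are Ё/ё. The following is a minimal C sketch of that check, not the scanner's own API; the function names are hypothetical, and it only helps if the PDF's font really uses this encoding rather than a custom one.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one Windows-1251 byte to a Unicode code point.
 * This sketch only covers ASCII, the main Cyrillic block and the
 * letters Ё/ё; every other byte becomes U+FFFD. */
static uint32_t cp1251_to_codepoint(uint8_t b)
{
    if (b < 0x80) return b;                           /* ASCII */
    if (b >= 0xC0) return 0x0410u + (uint32_t)(b - 0xC0); /* А..я */
    if (b == 0xA8) return 0x0401u;                    /* Ё */
    if (b == 0xB8) return 0x0451u;                    /* ё */
    return 0xFFFDu;                                   /* not handled here */
}

/* Convert CP1251 bytes to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written (excluding the NUL),
 * or 0 if the output buffer is too small. */
static size_t cp1251_to_utf8(const uint8_t *in, size_t in_len,
                             char *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint32_t cp = cp1251_to_codepoint(in[i]);
        if (cp < 0x80) {
            if (n + 1 >= out_cap) return 0;
            out[n++] = (char)cp;
        } else if (cp < 0x800) {
            if (n + 2 >= out_cap) return 0;
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            if (n + 3 >= out_cap) return 0;
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (n >= out_cap) return 0;
    out[n] = '\0';
    return n;
}
```

Feeding it the bytes CF F0 E8 E2 E5 F2 should give the UTF-8 for "Привет"; if the scanner's bytes decode to nonsense this way, the font is probably using some other encoding.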

Related

How to set the Eddystone-URL to a country specific expansion?

I can make my BLE device broadcast as an Eddystone beacon. It is broadcasting an Eddystone-URL frame with "http://www.cypress.com". Now I want to change that URL to one with a country-specific extension, e.g. "---.com.tr".
Here is the GitHub source for the Eddystone protocol. It does not give any clue about using special URL expansions. Do you have any idea how I can implement this?
Here is the code snippet from my project:
cyBle_discoveryData.advData[13] = 0x00; /* URL scheme- http://www. */
cyBle_discoveryData.advData[14] = 0x63; /* Encoded URL - 'c' */
cyBle_discoveryData.advData[15] = 0x79; /* Encoded URL - 'y' */
cyBle_discoveryData.advData[16] = 0x70; /* Encoded URL - 'p' */
cyBle_discoveryData.advData[17] = 0x72; /* Encoded URL - 'r' */
cyBle_discoveryData.advData[18] = 0x65; /* Encoded URL - 'e' */
cyBle_discoveryData.advData[19] = 0x73; /* Encoded URL - 's' */
cyBle_discoveryData.advData[20] = 0x73; /* Encoded URL - 's' */
cyBle_discoveryData.advData[21] = 0x00; /* Expansion - .com */
/* ADV packet length */
cyBle_discoveryData.advDataLen = 22;
Understand that the special expansion codes for extensions like .com are useful to save bytes, but are completely optional. You can also simply put in the bytes of the extension like this:
...
cyBle_discoveryData.advData[20] = 0x73; /* Encoded URL - 's' */
cyBle_discoveryData.advData[21] = 0x07; /* Expansion - ".com" (0x00 expands to ".com/", with a trailing slash) */
cyBle_discoveryData.advData[22] = 0x2e; /* Encoded URL - '.' */
cyBle_discoveryData.advData[23] = 0x74; /* Encoded URL - 't' */
cyBle_discoveryData.advData[24] = 0x72; /* Encoded URL - 'r' */
/* ADV packet length */
cyBle_discoveryData.advDataLen = 25;
So while there is no special code for .tr, you can simply put the ASCII bytes for it in the advertisement.
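To avoid hand-counting bytes, the URL portion can be generated. Below is a minimal sketch of an encoder for the scheme byte plus encoded URL, following the Eddystone-URL tables (scheme prefixes 0x00..0x03, expansions 0x00..0x0D, where the first seven end in a slash). The function name is hypothetical, and it fills only the URL bytes, not the surrounding advertisement fields or their length bytes. Note that a ".com" in the middle of a host needs 0x07, because 0x00 expands to ".com/".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eddystone-URL scheme prefixes; the index is the scheme byte. */
static const char *const kSchemes[4] = {
    "http://www.", "https://www.", "http://", "https://"
};

/* Eddystone-URL expansion codes; the index is the code byte.
 * The slash-terminated forms come first so they win over the
 * shorter forms when both match. */
static const char *const kExpansions[14] = {
    ".com/", ".org/", ".edu/", ".net/", ".info/", ".biz/", ".gov/",
    ".com",  ".org",  ".edu",  ".net",  ".info",  ".biz",  ".gov"
};

/* Encode url into out as [scheme byte][encoded URL bytes].
 * Returns the number of bytes written, or 0 if the scheme is not
 * supported or out is too small. */
static size_t eddystone_encode_url(const char *url, uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    size_t s;

    for (s = 0; s < 4; s++) {
        size_t len = strlen(kSchemes[s]);
        if (strncmp(url, kSchemes[s], len) == 0) { url += len; break; }
    }
    if (s == 4 || out_cap == 0) return 0;
    out[n++] = (uint8_t)s;

    while (*url != '\0') {
        size_t e;
        if (n >= out_cap) return 0;
        for (e = 0; e < 14; e++) {
            size_t len = strlen(kExpansions[e]);
            if (strncmp(url, kExpansions[e], len) == 0) { url += len; break; }
        }
        if (e < 14)
            out[n++] = (uint8_t)e;       /* one byte instead of the suffix */
        else
            out[n++] = (uint8_t)*url++;  /* plain ASCII character */
    }
    return n;
}
```

For "http://www.cypress.com.tr" this produces 12 bytes: 0x00, the seven letters of "cypress", 0x07, then '.', 't', 'r'. Passing an 18-byte buffer is a reasonable cap, assuming the usual limit of one scheme byte plus at most 17 encoded URL bytes.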

base64 encoding with ios8/ios9 api without line length limit

How can I Base64-encode a string with the iOS 8 and iOS 9 APIs without a line length limit?
I'm preparing some custom basic authentication and I need to encode the credentials according to the standard, which says:
The resulting string is then encoded using the RFC2045-MIME variant of Base64, except not limited to 76 char/line
Older iOS versions had the NSData method base64Encoding, but it is now deprecated, and instead I have:
- (NSString *)base64EncodedStringWithOptions:
(NSDataBase64EncodingOptions)options NS_AVAILABLE(10_9, 7_0);
and the options are:
typedef NS_OPTIONS(NSUInteger, NSDataBase64EncodingOptions) {
// Use zero or one of the following to control the maximum line length after which a line ending is inserted. No line endings are inserted by default.
NSDataBase64Encoding64CharacterLineLength = 1UL << 0,
NSDataBase64Encoding76CharacterLineLength = 1UL << 1,
// Use zero or more of the following to specify which kind of line ending is inserted. The default line ending is CR LF.
NSDataBase64EncodingEndLineWithCarriageReturn = 1UL << 4,
NSDataBase64EncodingEndLineWithLineFeed = 1UL << 5,
} NS_ENUM_AVAILABLE(10_9, 7_0);
So I can choose a line length of 64 or 76. Base64 encoding for basic authentication doesn't have a line length limit, so how should I approach this?
If you don't choose any options, line endings are not added.
[myData base64EncodedStringWithOptions:0]; // One long string
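For reference, "one long string" here just means standard Base64 with padding and no line separators at all. The sketch below is a plain C illustration of that output format, useful for checking what the server expects; it is not the Foundation API, and the function name is hypothetical.

```c
#include <stddef.h>

/* Base64-encode len bytes from in into out as one unbroken,
 * NUL-terminated line (no CR/LF is ever inserted).
 * Returns the encoded length, or 0 if out is too small. */
static size_t base64_encode(const unsigned char *in, size_t len,
                            char *out, size_t out_cap)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t need = 4 * ((len + 2) / 3);
    size_t o = 0;

    if (out_cap < need + 1) return 0;

    for (size_t i = 0; i < len; i += 3) {
        /* Pack up to three input bytes into a 24-bit group. */
        unsigned long v = (unsigned long)in[i] << 16;
        if (i + 1 < len) v |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < len) v |= in[i + 2];

        /* Emit four 6-bit symbols, padding short final groups with '='. */
        out[o++] = tbl[(v >> 18) & 0x3F];
        out[o++] = tbl[(v >> 12) & 0x3F];
        out[o++] = (i + 1 < len) ? tbl[(v >> 6) & 0x3F] : '=';
        out[o++] = (i + 2 < len) ? tbl[v & 0x3F] : '=';
    }
    out[o] = '\0';
    return o;
}
```

Encoding the credentials "user:pass" this way gives "dXNlcjpwYXNz", which is the value that would follow "Basic " in the Authorization header.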

iOS: how to decode text that is ISO 6937 encoded?

I'm implementing TV parser software.
For Czech broadcasts, I found that the EPG is encoded in ISO 6937.
I used the following API to decode the bytes into an NSString:
- (instancetype)initWithBytes:(const void *)bytes length:(NSUInteger)len encoding:(NSStringEncoding)encoding;
but I can't find the matching encoding constant.
CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingISOLatin2) does not give the correct result.
Can anyone help me?
If the encoding you mentioned is not working, try one of the following encodings:
CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingMacCentralEurRoman)
or
NSISOLatin2StringEncoding
Also check this post.
Edit
Here is the list of encodings supported on iOS 8.3. Try responsibly.
NSASCIIStringEncoding = 1, /* 0..127 only */
NSNEXTSTEPStringEncoding = 2,
NSJapaneseEUCStringEncoding = 3,
NSUTF8StringEncoding = 4,
NSISOLatin1StringEncoding = 5,
NSSymbolStringEncoding = 6,
NSNonLossyASCIIStringEncoding = 7,
NSShiftJISStringEncoding = 8, /* kCFStringEncodingDOSJapanese */
NSISOLatin2StringEncoding = 9,
NSUnicodeStringEncoding = 10,
NSWindowsCP1251StringEncoding = 11, /* Cyrillic; same as AdobeStandardCyrillic */
NSWindowsCP1252StringEncoding = 12, /* WinLatin1 */
NSWindowsCP1253StringEncoding = 13, /* Greek */
NSWindowsCP1254StringEncoding = 14, /* Turkish */
NSWindowsCP1250StringEncoding = 15, /* WinLatin2 */
NSISO2022JPStringEncoding = 21, /* ISO 2022 Japanese encoding for e-mail */
NSMacOSRomanStringEncoding = 30,
NSUTF16StringEncoding = NSUnicodeStringEncoding, /* An alias for NSUnicodeStringEncoding */
NSUTF16BigEndianStringEncoding = 0x90000100, /* NSUTF16StringEncoding encoding with explicit endianness specified */
NSUTF16LittleEndianStringEncoding = 0x94000100, /* NSUTF16StringEncoding encoding with explicit endianness specified */
NSUTF32StringEncoding = 0x8c000100,
NSUTF32BigEndianStringEncoding = 0x98000100, /* NSUTF32StringEncoding encoding with explicit endianness specified */
NSUTF32LittleEndianStringEncoding = 0x9c000100 /* NSUTF32StringEncoding encoding with explicit endianness specified */

Display 5-digit base unicode character from the Entypo font

I'm using the Entypo font in my iPhone app, but it works only for some characters. I'm not able to display icons that have five-digit Unicode values.
I found some information on the web saying that this is due to the UTF encoding supported on iOS (and in other languages too) and that five-digit Unicode values should be split into two values.
But I have not found a clear how-to description or a code sample.
My code to display an Entypo symbol is something like this:
myLabel.text = [NSString stringWithUTF8String:"\u25B6"];
myLabel.font = [UIFont fontWithName:@"Entypo" size:200];
If I replace the Unicode value with "\u1F342", which is the leaf icon in the Entypo font, an invalid character is displayed.
If you have already encountered this issue, perhaps you could help me save some time.
Thanks very much
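If you check the Unicode page for that character, you'll see that its UTF-8 encoding is 0xF0 0x9F 0x8D 0x82, and that's what you should be using. Since \u takes exactly four hex digits, the bytes have to be written with \x escapes (or the whole code point as \U0001F342):
myLabel.text = [NSString stringWithUTF8String:"\xF0\x9F\x8D\x82"];
Note: totally untested.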
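The byte values quoted above, and the "split into two values" mentioned in the question, can both be derived from the code point. Here is a minimal C sketch of the two calculations: the UTF-8 bytes, and the UTF-16 surrogate pair that an NSString stores internally for code points above U+FFFF. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not
 * a valid scalar value (a surrogate, or above U+10FFFF). */
static size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* lone surrogate */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Split a code point above U+FFFF into its UTF-16 surrogate pair.
 * Returns 1 on success, 0 if cp does not need (or cannot have) a pair. */
static int utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF) return 0;
    cp -= 0x10000;
    *high = (uint16_t)(0xD800 + (cp >> 10));
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 1;
}
```

For U+1F342 this yields the UTF-8 bytes F0 9F 8D 82 and the surrogate pair D83C DF42, while U+25B6 from the question fits in three UTF-8 bytes (E2 96 B6) and needs no surrogates.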
After several searches I finally found a solution that is easy to use in both cases: symbols encoded with up to 4 hex digits and symbols with more than 4.
I defined an NSString category as follows:
#import "NSString+Extension.h"
@implementation NSString (Extension)
/**
* Convert a Unicode symbol to a string which can be used directly as text, for instance in a label view for which the right font has been specified.
*
* The method can be used in both cases:
* . the symbol is defined as a const char * with a maximum of 4 hex digits. In this case the first parameter must be 0 and the second is used. Example: NSString *symbolString = [NSString symbolStringfromUnicode:0 orChar:"\uE766"];
* . the symbol is defined as an integer in hexadecimal notation. It can have either fewer or more than 4 digits. In this case, only the first parameter is used. Example: NSString *prefixSymbol = [NSString symbolStringfromUnicode:0x1F464 orChar:nil];
*
* @param symbolUnicode symbol to convert, defined as int
* @param symbolChar symbol to convert, defined as const char *
*
*/
+ (NSString *)symbolStringfromUnicode:(int)symbolUnicode orChar:(const char *)symbolChar
{
NSString *symbolString;
if (symbolUnicode == 0) {
symbolString = [NSString stringWithUTF8String:symbolChar];
}
else {
int unicode = symbolUnicode;
symbolString = [[NSString alloc] initWithBytes:&unicode length:sizeof(unicode) encoding:NSUTF32LittleEndianStringEncoding];
}
return symbolString;
}
@end

iOS UTF-8 Label String

I have a UTF-8 encoded string that I want to display in a label.
When I set a breakpoint and examine the variable holding the string, all looks good. However, when I try to output it to the log or to the label, I get Latin-encoded text.
I have tried almost every suggestion on SO and beyond, but I just cannot get the string to display properly.
Here is my code:
NSString *rawString = [NSString stringWithFormat:@"%@",m_value];
const char *utf8String = [rawString UTF8String];
NSLog (@"%@", [NSString stringWithUTF8String:utf8String]);
NSLog (@"%s", utf8String);
NSLog (@"%@", rawString);
self.resultText.text = [NSString stringWithUTF8String:utf8String];
m_value is an NSString, and in the debug window, it also displays the correct encoding.
m_value NSString * 0x006797b0 @"鄧樂愚..."
NSObject NSObject
isa Class 0x3bddd8f4
[0] Class
I am using the iOS 6.1 SDK.
OK, if m_value is a const char * containing a UTF-8 string, you have to use this method:
- (id)initWithUTF8String:(const char *)bytes
NSString *correctString = [[NSString alloc] initWithUTF8String:m_value];
It's incorrect to pass a const char * to the %@ format specifier, because %@ expects an object (NSObject); doing so will always be incorrect and can crash the app.
When I want to show Khmer text in a label, I use the font 'Hanuman.ttf'. This is the code I use:
UIFont *font = [UIFont fontWithName:@"Hanuman" size:20.0f];
self.nameLabel.text = [NSString stringWithFormat:@"%@",itemName];
self.nameLabel.font = font;
I don't know whether this will help you or not, but this is what I did before!
So I finally managed to get to the bottom of this.
The m_value NSString was being set by a third-party library whose source I had no access to. Even though the value of this variable was displayed correctly (i.e. showing the Chinese characters) in the debug panel, the string was actually encoded with NSMacOSRomanStringEncoding.
I was able to determine this by copying the output into TextWrangler, and flipping encodings until I found the one that translated correctly into UTF-8.
Then to fix in Objective-C, I first translated the NSString to a const char:
const char *macString = [bxr.m_value cStringUsingEncoding:NSMacOSRomanStringEncoding];
Then converted back to an NSString:
NSString *utf8String = [[NSString alloc]initWithCString:macString encoding:NSUTF8StringEncoding];
+1 to @Vitaly_S and @iphonic, whose answers eventually led me to this solution. For anyone else who stumbles across this: it seems that as of Xcode 4.6.1, the debug window cannot be trusted to render strings correctly, but you can rely on the NSLog output.
Assuming your variable m_value is an NSData, you can try the following:
self.resultText.text = [[NSString alloc] initWithData:m_value encoding:NSISOLatin1StringEncoding];
There are many other encodings available; you can try them too:
NSASCIIStringEncoding /* 0..127 only */
NSNEXTSTEPStringEncoding
NSJapaneseEUCStringEncoding
NSUTF8StringEncoding
NSISOLatin1StringEncoding
NSSymbolStringEncoding
NSNonLossyASCIIStringEncoding
NSShiftJISStringEncoding /* kCFStringEncodingDOSJapanese */
NSISOLatin2StringEncoding
NSUnicodeStringEncoding
NSWindowsCP1251StringEncoding /* Cyrillic; same as AdobeStandardCyrillic */
NSWindowsCP1252StringEncoding /* WinLatin1 */
NSWindowsCP1253StringEncoding /* Greek */
NSWindowsCP1254StringEncoding /* Turkish */
NSWindowsCP1250StringEncoding /* WinLatin2 */
NSISO2022JPStringEncoding /* ISO 2022 Japanese encoding for e-mail */
NSMacOSRomanStringEncoding
NSUTF16StringEncoding /* An alias for NSUnicodeStringEncoding */
NSUTF16BigEndianStringEncoding /* NSUTF16StringEncoding encoding with explicit endianness specified */
NSUTF16LittleEndianStringEncoding /* NSUTF16StringEncoding encoding with explicit endianness specified */
NSUTF32StringEncoding
NSUTF32BigEndianStringEncoding /* NSUTF32StringEncoding encoding with explicit endianness specified */
NSUTF32LittleEndianStringEncoding /* NSUTF32StringEncoding encoding with explicit endianness specified */

Resources