I'm saving a file with Unicode (Korean) characters in its name, and I'm storing the file name in memory for bookkeeping in my app. The file is saved fine; what's bothering me is the way the name is given back to me by the OS.
If I make an fcntl(fd, F_GETPATH, charArray) call on the file's fd, the filename returned is different from the one I get by listing the directory contents. I did some research on the filenames returned in the two cases and found that the first is in the precomposed Hangul Syllables form (filename length in bytes: 18) and the second is in the decomposed Hangul Jamo form (length in bytes: 36). iOS works seamlessly with both forms; if I do a localizedCompare: on the two names it reports them as equal.
For bookkeeping I store the path to the file (including its name) and the length of that path. When a request comes in, I do a quick compare on these two attributes and return a handle to the file only if they match. The problem is that when the file is stored, the fcntl call gives me the path in Hangul Syllables, but when the user requests it back, the runtime gives me the path in Hangul Jamo. Since I stored the path in Hangul Syllables, the app thinks it's a different file that was somehow not created by the user and shows an 'invalid file' popup.
Visually the Korean text looks the same in both forms; the only difference is in the byte representation.
다른 문서.docx - Hangul Syllables - returned during file creation via the fd
다른 문서.docx - Hangul Jamo - returned by the OS at runtime and also when I list the directory contents.
char *fileName1 = "다른 문서.docx"; // Hangul Syllables
NSLog(@"fileName1:%s length:%lu", fileName1, strlen(fileName1));
char *fileName2 = "다른 문서.docx"; // Hangul Jamo
NSLog(@"fileName2:%s length:%lu", fileName2, strlen(fileName2));
If you run the above code you can see that the names differ in their memory footprint. Any idea how/why iOS changes the filename from one form to the other at runtime? An explanation of why localizedCompare: treats the two names as equal would also be great.
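For what it's worth, the two spellings are canonically equivalent under Unicode normalization: NFC yields the precomposed Hangul Syllables form and NFD the decomposed Jamo form (HFS+ famously stores file names in a decomposed form, which would explain the directory listing). A minimal sketch of the difference using Python's unicodedata; on iOS, NSString's precomposedStringWithCanonicalMapping / decomposedStringWithCanonicalMapping perform the same conversions:
import unicodedata

name = "다른 문서.docx"
nfc = unicodedata.normalize("NFC", name)  # precomposed Hangul Syllables
nfd = unicodedata.normalize("NFD", name)  # decomposed Hangul Jamo

print(len(nfc.encode("utf-8")))  # 18 -- the length F_GETPATH reported
print(len(nfd.encode("utf-8")))  # 36 -- the length the directory listing reported
print(nfc == nfd)                # False: different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are normalized
Normalizing every path to one form before storing and comparing would make the bookkeeping immune to whichever form the OS hands back.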
I am building an APNonce setter tool with the aid of Siguza's v0rtex exploit, and so far I have most of the offsets I need, but the zone_map offset seems to be wrong no matter what I do.
What I tried:
I decrypted the kernel and loaded it into IDA on macOS. I searched the strings for zone_map and found nothing relevant.
I had a bit of luck when I searched for zone_init, but the xref I followed didn't lead anywhere.
My device is an iPod touch 5 running iOS 9.3.5. The offset I found is 0xffffffff0070d1aa4, but it panics the kernel, so it's not correct.
The ZONE_MAP offset isn't very easy to find, but I will detail a method below that should work.
One caveat: I used Hopper for this instead of IDA Pro. The demo version of Hopper will do, though.
Step 1: Decrypt your kernelcache. Make sure it's actually decrypted, otherwise all the remaining steps will fail. You can use Decrypt0r for this. You know you're good when Decrypt0r spews output like the following:
Enter key for /Users/geosn0w/Desktop/kernelcache.release.n78: 87aa19c72db6e662d6c3dbcf74da88026fda5a66469baa7e271725918133cd2f
Enter key IV for /Users/geosn0w/Desktop/kernelcache.release.n78: 2692e6004e6240aab57f2affa0daedc0
[DEBUG] Opening /Users/geosn0w/Desktop/kernelcache.release.n78
Parsed TYPE element
Parsed DATA element
Parsed SEPO element
Parsed KBAG element
Parsed KBAG element
File opened successfully
Setting Img3 Key and IV
Fetching KBAG element from image
Found KBAG element in image
KBAG Type = 256, State = 1
Decrypting Img3 file
Fetching DATA element from image
Found DATA element in image
Setting keys to decrypt with
Performing decryption...
magic = 0x706d6f63
Image compressed, decompressing
signature = 0x706d6f63
compression_type = 0x73737a6c
Found LZSS compression type
Found output file listed as /Users/geosn0w/Desktop/kernelcache.release.n78.dec
Image claims it's decrypted, dump raw data
Closing Img3 file
/Users/geosn0w/Desktop/kernelcache.release.n78.dec copied to the root of IPSW folder
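If you want a quick sanity check before spending time in the disassembler, the panic string used in Step 3 below should already be visible in the decrypted cache. A short Python sketch (the path is hypothetical; this finds a raw file offset, not the ZONE_MAP address, and only proves the decryption worked):
# Hypothetical path -- point this at your own decrypted kernelcache.
with open("kernelcache.release.n78.dec", "rb") as f:
    blob = f.read()

off = blob.find(b"zone_init: kmem_suballoc failed")
print(hex(off) if off != -1 else "not found -- the cache is probably still encrypted")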
Step 2: Open Hopper Disassembler and pop the decrypted kernelcache file inside. The kernel is huge, so give it time to analyze it; it can take a few minutes.
Step 3: Once the kernel file has been analyzed, navigate to the Strings tab in Hopper and search for zone_init: kmem_suballoc failed.
Step 4: Double-click the single result, then double-click the DATA XREF: sub_XXXXXXXX subroutine cross-reference.
Step 5: If you did all of the above, you will land in a subroutine containing something like ; :lower16:(0x803bde69 - 0x80036856), "\\\"zone_init: kmem_suballoc failed\\\"", CODE XREF=sub_80032808+6204. Double-click the CODE XREF=sub_XXXXXXXX part at the far right.
Step 6: Your offset is the first QWORD at the location you jumped to. In my case it was 0x8003684a, and that's the offset for ZONE_MAP.
I am opening .txt files, but when they are loaded in Xojo, weird characters like these (’, â€ک) show up.
I've tried DefineEncoding and ConvertEncoding, but it still doesn't seem to work.
output.text = output.text.DefineEncoding(Encodings.WindowsANSI)
output.text = output.text.ConvertEncoding(Encodings.UTF8)
You may have to define the encoding already at load time, not afterwards; otherwise the loading step produces UTF-8 characters that your posted code then mangles further. So pass the encoding to the Read function, or load the data as a binary file rather than a text file.
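For what it's worth, ’ is exactly what the UTF-8 bytes of a right single quote (U+2019) look like when they are mis-read as Windows ANSI, which is why defining the encoding after the fact is too late. A quick sketch of the round trip (Python here, purely to illustrate the bytes):
raw = "’".encode("utf-8")    # b'\xe2\x80\x99'
print(raw.decode("cp1252"))  # ’  -- the bytes mis-read as Windows ANSI
print(raw.decode("utf-8"))   # ’      -- the same bytes with the encoding declared up front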
I need some help with the PHPExcel library. Everything works great and I'm successfully exporting my SQL query to an Excel5 file. I need to give this file to a transport company so they can automatically collect information about packages; unfortunately, the generated Excel file has a stray character between each letter of the cell text, and when the file is imported these characters have to be deleted manually.
If I open the Excel file in Excel, everything is fine and I see COMPANY NAME. If I open it with Notepad++, I see the cell values this way: C(NUL)O(NUL)M(NUL)P(NUL)A(NUL)N(NUL)Y N(NUL)A(NUL)M(NUL)E
If I open the file again with Excel and save it, then reopen it with Notepad++, I see COMPANY NAME.
So I do not understand why, every time I create an Excel file using PHPExcel, every letter of every word is interleaved with (nul).
So how do I prevent the generated Excel file from including (nul) between every letter?
Also, the files generated from the PHPExcel samples are filled with (nul) too, and if you open and save them, the (nul) is gone.
Any help would be appreciated, thanks.
What is the (nul)? 0x00? char(0)?
OK, here is the example:
error_reporting(E_ALL);
ini_set('display_errors', TRUE);
ini_set('display_startup_errors', TRUE);
date_default_timezone_set('Europe/London');
if (PHP_SAPI == 'cli')
die('Disponibile solo su browser');
require_once dirname(__FILE__) . '/Classes/PHPExcel.php';
$objPHPExcel = new PHPExcel();
$objPHPExcel->getProperties()->setCreator("Solidus")
->setLastModifiedBy("Solidus")
->setTitle("Import web")
->setSubject("Import File")
->setDescription("n.a")
->setKeywords("n.a")
->setCategory("n.a");
$objPHPExcel->setActiveSheetIndex(0)
->setCellValueExplicit("A1", "COMPANY")
->setCellValue('A2', 'SAMSUNG');
$objPHPExcel->getActiveSheet()->setTitle('DDT');
$objPHPExcel->setActiveSheetIndex(0);
header('Content-Type: application/vnd.ms-excel');
header('Content-Disposition: attachment;filename="TEST.xls"');
header('Cache-Control: max-age=0');
header('Cache-Control: max-age=1');
header('Cache-Control: private',false);
$objWriter = PHPExcel_IOFactory::createWriter($objPHPExcel, 'Excel5');
ob_end_clean();
$objWriter->save('php://output');
As you can see from this little example, the script creates an Excel5 file with two cells: A1 = COMPANY, A2 = SAMSUNG.
When I send this file to the transport company and they import it into their system, as you can see from the picture, there is a weird character between each letter.
I noticed that every time I open the generated Excel5 file with Notepad++ I get:
S(nul)A(nul)M(nul)S(nul)U(nul)N(nul)G
If I save the file with Excel and then open it again with Notepad++ I get:
SAMSUNG
and this file is OK for the transport company.
So my question is: how do I keep the generated file from containing this (nul) character between each letter?
I found the solution myself, so I'll explain it in case anyone else runs into this problem:
there is no way to change how the Excel file is encoded by PHPExcel,
so I figured out that the problem was in reading the file. I ran some simulations and reproduced the problem: every time I read the file and put the result into inputs I get weird characters:
C�O�M�P�A�N�Y�
If I set the output encoding as follows:
$excel->setOutputEncoding('UTF-8');
the file loads fine, so the problem was not in creating the Excel file but in reading it.
If I print the variable with echo I get: "COMPANY",
if I put the variable in an input as its value I get: "C�O�M�P�A�N�Y�"
Setting the output encoding solves the problem, but I would like to know why there is a difference when the variable is used as an input's value. Thanks.
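A likely explanation for both symptoms: BIFF (the Excel5 .xls format) stores text as UTF-16LE (sometimes compressed to one byte per character), and for plain ASCII, UTF-16LE is literally letter, NUL, letter, NUL, ... -- exactly what Notepad++ shows. A reader that treats those bytes as a single-byte encoding keeps the NULs, while one that decodes them as UTF-16 and re-encodes to UTF-8 (presumably what setOutputEncoding('UTF-8') arranges) drops them. A minimal sketch of the byte layout, assuming nothing about PHPExcel internals:
text = "SAMSUNG"
raw = text.encode("utf-16-le")
print(raw)                      # b'S\x00A\x00M\x00S\x00U\x00N\x00G\x00'
print(raw.decode("utf-16-le"))  # 'SAMSUNG' -- each NUL is the high byte of a 16-bit char
print(raw.decode("latin-1"))    # 'S\x00A\x00...' -- mis-decoded, the NULs survive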
I am trying to download a source code file from a web site. This works fine for small files, but a couple of larger ones get truncated.
The example below should return a file of 146,135 bytes, but returns one of 141,194 bytes with a status of 200.
I have tried WinHttp.WinHttpRequest.5.1 as well, but both seem to truncate at the same point.
I have also found quite a few people with similar problems, but have not been able to find a solution.
require('luacom')
http = luacom.CreateObject('MSXML2.ServerXMLHTTP')
http:Open("GET","http://www.family-historian.co.uk/wp-content/plugins/forced-download2/download.php?path=/wp-content/uploads/formidable/tatewise/&file=Map-Life-Facts3.fh_lua&id=190",true)
http:Send()
http:WaitForResponse(30)
print('Status: '..http.Status)
print('----------------------------------------------------------------')
headers = http:GetAllResponseHeaders()
data = http.Responsetext
print('Data Size = '..#data)
print('----------------------------------------------------------------')
print(headers)
I finally worked out what was going on, so I will post it here for others.
To avoid the truncation I needed to use ResponseBody rather than ResponseText. What appears to be happening is that the file is sent in binary format; ResponseText ends up with the same number of bytes as ResponseBody, but interpreted as UTF-8, so each special character in the file (two bytes in UTF-8) pushes one byte off the end, and that many bytes are dropped from the end of ResponseText. I am not sure at what level the "mistake" in the length is made, but the way to avoid it is to use ResponseBody.
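The character-count versus byte-count mismatch behind this is easy to reproduce outside COM; a tiny illustrative sketch (Python, not the luacom API):
body = "café" * 1000              # 'é' takes two bytes in UTF-8
print(len(body))                  # 4000 characters
print(len(body.encode("utf-8")))  # 5000 bytes: sizing a buffer in characters loses one byte per multi-byte character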
When parsing a PDF, given a string (popped from the Tj or TJ operator callbacks) with the Identity-H encoding, how do you map that string to a Unicode (say UTF-8) representation?
If I need a CMap for this, how do I create (or retrieve) and apply the CMap?
You'll probably have to parse the font data itself. Identity-H just means "use the bytes as raw glyph indexes into the given font". That's why you MUST embed fonts when using Identity-H... different versions of the same font need not have the same glyph order.
There's example code on how to do this sort of thing in several different open source projects. iText, for example (yes, I'm biased).
You mentioned a CMap. Identity-H fonts can have a CMap but aren't required to have one. The /ToUnicode entry will be a stream containing a CMap, as defined in Adobe's CMap and CIDFont Files specification. They aren't all that complex:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TTX+0)
/Ordering (T42UV)
/Supplement 0
>> def
/CMapName /TTX+0 def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
80 beginbfrange
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
<0033><0033><0050>
<0034><0034><0051>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<0038><0038><0055>
<0039><0039><0056>
<003a><003a><0057>
<003b><003b><0058>
<003c><003c><0059>
<003d><003d><005a>
<0065><0065><00c9>
<00c8><00c8><00c1>
<00cb><00cb><00cd>
<00cf><00cf><00d3>
<00d2><00d2><00da>
<00e2><00e2><0160>
<00e4><00e4><017d>
<00e9><00e9><00dd>
<00fd><00fd><010c>
<0104><0104><0104>
<0106><0106><010e>
<0109><0109><0118>
<010b><010b><011a>
<0115><0115><0147>
<011b><011b><0158>
<0121><0121><0164>
<0123><0123><016e>
<01a0><01a0><0116>
<01b2><01b2><012e>
<01cb><01cb><016a>
<01cf><01cf><0172>
<022c><022c><0401>
<023b><023b><0411>
<023c><023c><0412>
<023d><023d><0413>
<023e><023e><0414>
<023f><023f><0415>
<0240><0240><0416>
<0241><0241><0417>
<0242><0242><0418>
<0243><0243><0419>
<0244><0244><041a>
<0245><0245><041b>
<0246><0246><041c>
<0247><0247><041d>
<0248><0248><041e>
<0249><0249><041f>
<024a><024a><0420>
<024b><024b><0421>
<024c><024c><0422>
<024d><024d><0423>
<024e><024e><0424>
<024f><024f><0425>
<0250><0250><0426>
<0251><0251><0427>
<0252><0252><0428>
<0253><0253><0429>
<0254><0254><042a>
<0255><0255><042b>
<0256><0256><042c>
<0257><0257><042d>
<0258><0258><042e>
<0259><0259><042f>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
Wow. That particular CMap is horribly inefficient. A bfrange starts from parameter 1, goes up to and includes parameter 2, and maps values starting at parameter 3 (continuing until there is nothing left to map).
For example:
<0003><0003><0020>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0027><0027><0044>
<0028><0028><0045>
<0029><0029><0046>
<002a><002a><0047>
<002b><002b><0048>
<002c><002c><0049>
<002d><002d><004a>
<002e><002e><004b>
<002f><002f><004c>
<0030><0030><004d>
<0031><0031><004e>
<0032><0032><004f>
could be represented as
<0003><0003><0020>
<0024><0032><0041>
A quick Google search turned up the CMap/CID font spec.
There are also beginbfchar/endbfchar sections, which take just two parameters (source and destination values, no ranges). There are CID-based versions as well, at which point you need access to Adobe's character ID tables; those ship with Acrobat/Reader installations, though Reader may need to be prodded into downloading the various language packs. There is plenty more, and you really ought to read that spec to find out about it.
There are multiple ways this data may be encoded (some using CMaps). You can also have custom encodings (http://www.jpedal.org/PDFblog/2011/04/understanding-the-pdf-file-format-%E2%80%93-custom-font-encodings/). You also need to understand CID fonts (http://www.jpedal.org/PDFblog/2011/03/understanding-the-pdf-file-format-%E2%80%93-what-are-cid-fonts/).
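To make the "apply the CMap" part concrete, here is a minimal sketch (Python) that extracts the simple <lo> <hi> <dst> bfrange triples from a ToUnicode CMap like the one above and uses them to decode an Identity-H string. It deliberately ignores beginbfchar pairs, array destinations, and multi-code-point destinations, so treat it as a starting point rather than a full implementation:
import re

def parse_bfranges(cmap_text):
    # Collect <lo> <hi> <dst> triples into a glyph-code -> character map.
    mapping = {}
    for lo, hi, dst in re.findall(
            r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>",
            cmap_text):
        lo, hi, dst = int(lo, 16), int(hi, 16), int(dst, 16)
        for code in range(lo, hi + 1):  # destinations increment along the range
            mapping[code] = chr(dst + code - lo)
    return mapping

def decode_identity_h(raw, mapping):
    # Identity-H strings are sequences of big-endian 2-byte codes.
    codes = (raw[i] << 8 | raw[i + 1] for i in range(0, len(raw), 2))
    return "".join(mapping.get(c, "\ufffd") for c in codes)

With the sample CMap above, 0x0024 maps to U+0041 and 0x0025 to U+0042, so decode_identity_h(b"\x00\x24\x00\x25", parse_bfranges(cmap_text)) yields "AB".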