I have a requirement in which conversion of xml to text which is fixed. The DFDL Test Serialize model is padding character if length is less than specified length of field. When running flow after mapping padding character is not showing. If in xml field value is hello after padding should be helloAAAAAAAAA following is dfdl source and mapping source.
Related
I am reading an Excel file (see syntax below) where some of the fields are text mixed with numbers. The problem is that SPSS reads some of these fields as numeric instead of string and then the text is deleted.
I assume this happens in cases where a large part of the first rows are empty or with a numeric value and then it defines the variable as numeric.
How can this be avoided?
GET DATA
/TYPE=XLSX
/FILE='M:\MyData.xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0
/HIDDEN IGNORE=YES.
When you use the get data command, the subcommand /DATATYPEMIN PERCENTAGE=95.0 tells SPSS that if up to 5% of the values in the field do not conform to the selected format it's still ok. So in order to avoid cases where only very few values are text and the field is read as number, you have to correct the subcommand to:
/DATATYPEMIN PERCENTAGE=100
I have one PDF and I am trying to scan PDF using CGPDFScanner.
While scanning the pdf, when the word "file" is encountered, the CGPDFStringGetBytePtr API returns "\x02le". PDF is having Type1 font and no ToUnicodeMapping(CMap). Encoding dictionary is not present in the PDF hence using NSUTF8String encoding. However I have tried with all NSMacOSRomanStringEncoding, NSASCIIStringEncoding but had no luck.
What can be the problem?
Thanks.
The code \x02 corresponds to 'fi' string. The 'fi' sequence is drawn using a ligature this is why you have only one character code.
The correspondence between the code and the string is done in the font encoding. The font encoding contains a /Differences array that specifies the mapping between code \x02 and the sequence 'fi'
We are using Jfreechart along with iText for generating pdf reports. For Japanese, we realized that in the rendered content for the graph legend, characters don't have any spaces between them. They basically overlap which makes it hard to read.
Do we need to use any special encoding?
Attached are images for expected and actual(generated by jfreechart), in that order
Below is a snippet of the graph generated with the legend
According to the PDF specification, a CIDFont dictionary contains an optional dictionary called DW and an optional array called W. DW is the default width for glyphs. If not set, it defaults to 1000.
The W array describes individual widths for characters in the font (if not specified they default to the value of DW). For many Japanese fonts, I've seen the value set to lower than 1000, but in this case it might be too low.
You can take a look at these values using Acrobat's "preflight>browse internal structure" tool. If these seem off, you make be using the wrong encoding. Setting encoding to "UniJIS-UCS2-H" should help resolve this issue.
First off this solution doesn't work for ligatures:
Convert or Print CGPDFStringRef string
I'm reading text from a PDF and trying to convert it to a NSString. I can get a byte array of text using Apple's CGPDFScanner in the form of a CGPDFString. The "fi" ligature character is giving me trouble. When I look at my byte array in the debugger I see a '\f'
So for simplicity sake lets say that I have this char:
unsigned char myLigatureFromPDF = '\f';
Ultimately I'd like to convert it to this (the unicode value for the "fi" ligature):
unichar whatIWant = 0xFB01;
This is my failed attempt (I copied this from PDFKitten btw):
const char str[] = {myLigatureFromPDF, '\0'};
NSString* stringEncodedLigature = [NSString stringWithCString:str encoding:NSUTF8StringEncoding];
unichar encodedLigature = [stringEncodedLigature characterAtIndex:0];
If anyone can tell me how to do this that would be great
Also, as a side note how does the debugger interpret the unencoded byte array, in other words when I hover over the array how does it know to show a '\f'
Thanks!
Every PDF parser is limited in its capabilities by one single important point of the PDF specifications: characters in literal strings are encoded as bytes or words, but the encoding does not need to be included in the file.
For example, if a subset of a font is included where the code "1" corresponds to the image (character glyph) of an "h" and the code "2" maps to a glyph "a", the string (\1\2\1\2) will show "haha", as expected. But if the PDF contains no further information on how the glyphs in that font correspond to Unicode, there is no way for a string decoder to find out the correct character codes for "glyph #1" and "glyph #2".
It seems your test PDF does contain that information -- else, how could it infer the correct characters for "regular" characters? -- but in this case, the "regular" characters were simply not remapped to other binary codes, for convenience. Also, again for convenience, the glyph for the single character "fi" was remapped to "0x0C" in the original font (or in the subset that got included into your file). But, again, if the file does not contain a translation table between character codes and Unicode values, there is no way to retrieve the correct code.
The above is true for all PDFs and strings. If the font definition in the PDF contains an encoding, your string extraction method should use it; if the PDF contains a /ToUnicode table for the font, again, your method should use it. If it contains neither, you get the literal string contents (and, presumably, you are not informed which method was used and how reliable it is).
As a final footnote: in TeX and LaTeX fonts, ligatures are mapped to lower ASCII codes (as well as a smattering of other non-ASCII codes, such as the curly quotes). It seems you are reading a PDF that was created through TeX here -- but that can only be inferred from this particular encoding. Also, even if you know in advance that the PDF was generated through TeX, it's not guaranteed that it does use this particular encoding, as the decision to translate or not translate is at the discretion of the PDF generator, not TeX itself.
I am using PDFKitten for searching strings within PDF documents with highlighting of the results. FastPDFKit or any other commercial library is no option so i sticked to the most close one for my requirements.
As you can see in the screenshot i searched for the string "in" which is always correctly highlighted except the last one. I got a more complex PDF document where the highlighted box for "in" is nearly 40% wrong.
I read the whole syntax and checked the issues tracker but except line height problems i found nothing regarding the width calculation. For the moment i dont see any pattern where the calculation goes or could be wrong and i hope that maybe someone else had a close problem to mine.
My current expectation is that the coordinates and character width is wrong calculated somewhere in the font classes or RenderingState.m. The project is very complex and maybe someone of you had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten when calculating the width of characters whose character identifier does not coincide with its unicode character code.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a unicode string.
Thus, in spite of the name of the variable, cidString is not a sequence of character identifiers but instead of unicode chars. Nonetheless its entries are used as argument of didScanCharacter which in Scanner is implemented to forward the position by the character width: It is using the value as parameter of widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So, if CID and unicode character code don't coincide, the wrong character widths is determined and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12 which is way different from its Unicode code 0xfb01.
I would propose PDFKitten to be enhanced to also define a didScanCID method in StringDetector which in appendPDFString should be called next to didScanCharacter for each processed character forwarding its CID. Scanner then should make use of this new method instead to calculate the width to forward its cursor.
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types) in spite of the comment expect the argument to be a unicode code after all...
(Sorry if I used the wrong vocabulary here or there, I'm a 'Java guy... :))