I have a PDF in which some characters on page 8, such as dashes and double quotes, have a width of 0.
It has Times-Roman font.
I have tried to find the width using AFM files for Times-Roman font but had no luck.
How can I find the width of such characters?
Thanks.
Times Roman is one of the Standard 14 Fonts. These fonts oftentimes are exceptions in the PDF specification concerning required data, e.g.
Widths array (Required except for the standard 14 fonts;...
(Table 111 – Entries in a Type 1 font dictionary - ISO 32000-1)
The section on these fonts explains where to get the information instead:
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
(section 9.6.2.2 - Standard Type 1 Fonts (Standard 14 Fonts) - ISO 32000-1)
Having clicked through the Adobe sites, it looks like the font metrics are currently available at ftp://ftp.adobe.com/pub/adobe/type/
The OP clarified his problems in a comment:
I have tried Adobe's Font metric file to get width value char \x93,\x94,\x96,\x97,\x98. However these values are not present in AFM file. How can I find widths of these values?
First of all you have to look up the meaning of those values.
You mentioned that the problem occurs on page 8 and that the font is Times-Roman. On page 8 there are the Times-Roman fonts F28 and F46, and there is a Times-Bold font F43. The other fonts are Courier and CMSY10 (TeX Computer Modern Symbol?). F28, F43, and F46 have the same Encoding entry:
163 0 obj
<</Differences
[0/.notdef 1/dotaccent/fi/fl/fraction/hungarumlaut/Lslash/lslash/ogonek/ring 10/.notdef 11/breve/minus 13/.notdef 14/Zcaron/zcaron/caron/dotlessi/dotlessj/ff/ffi/ffl 22/.notdef 30/grave/quotesingle/space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright/parenleft/parenright/asterisk/plus/comma/hyphen/period/slash/zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon/less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright/asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde 127/.notdef 130/quotesinglbase/florin/quotedblbase/ellipsis/dagger/daggerdbl/circumflex/perthousand/Scaron/guilsinglleft/OE 141/.notdef 147/quotedblleft/quotedblright/bullet/endash/emdash/tilde/trademark/scaron/guilsinglright/oe 157/.notdef 159/Ydieresis 160/.notdef 161/exclamdown/cent/sterling/currency/yen/brokenbar/section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot/hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior/acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine/guillemotright/onequarter/onehalf/threequarters/questiondown/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/ydieresis]
/Type/Encoding>>
endobj
You are looking for \x93,\x94,\x96,\x97,\x98, i.e. (decimal) 147, 148, 150, 151, 152. According to the Encoding above, especially this Differences section:
147/quotedblleft/quotedblright/bullet/endash/emdash/tilde
that means quotedblleft, quotedblright, endash, emdash, and tilde. Searching for those names in the font metrics file one gets:
C 170 ; WX 444 ; N quotedblleft ; B 43 433 414 676 ;
C 186 ; WX 444 ; N quotedblright ; B 30 433 401 676 ;
C 177 ; WX 500 ; N endash ; B 0 201 500 250 ;
C 208 ; WX 1000 ; N emdash ; B 0 201 1000 250 ;
C 196 ; WX 333 ; N tilde ; B 1 532 331 638 ;
So here are the metrics, especially the widths, of your characters.
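The lookup above (character code, then glyph name via /Differences, then width via the AFM file) can be sketched in Python. This is only an illustration: the helper names `code_to_name` and `name_to_width` are made up for this example, only the relevant slice of the Differences array is included, and the AFM text is limited to the five lines quoted above. The WX values are in units of 1/1000 of the font size.

```python
# Sketch: resolve character codes to widths via glyph names.
# DIFFERENCES is only the relevant slice of the /Differences array;
# AFM_LINES holds the five Times-Roman metric lines quoted above.
DIFFERENCES = [147, "quotedblleft", "quotedblright", "bullet",
               "endash", "emdash", "tilde"]

AFM_LINES = """\
C 170 ; WX 444 ; N quotedblleft ; B 43 433 414 676 ;
C 186 ; WX 444 ; N quotedblright ; B 30 433 401 676 ;
C 177 ; WX 500 ; N endash ; B 0 201 500 250 ;
C 208 ; WX 1000 ; N emdash ; B 0 201 1000 250 ;
C 196 ; WX 333 ; N tilde ; B 1 532 331 638 ;
"""


def code_to_name(differences):
    """Expand a Differences array: a number sets the current code,
    each following name is assigned to consecutive codes."""
    mapping, code = {}, 0
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            mapping[code] = item
            code += 1
    return mapping


def name_to_width(afm_text):
    """Read 'N <name>' and 'WX <width>' from each AFM character line."""
    widths = {}
    for line in afm_text.splitlines():
        fields = {}
        for part in line.split(";"):
            tokens = part.split()
            if tokens:
                fields[tokens[0]] = tokens[1:]
        if "N" in fields and "WX" in fields:
            widths[fields["N"][0]] = int(fields["WX"][0])
    return widths


names = code_to_name(DIFFERENCES)
widths = name_to_width(AFM_LINES)
for code in (0x93, 0x94, 0x96, 0x97, 0x98):
    print(hex(code), names[code], widths[names[code]])
```

Note that the C values in the AFM file (170, 186, ...) are codes in the font's built-in encoding, which is why searching the AFM file for 147 to 152 directly finds nothing; the glyph name is the link between the two.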
I am trying to print special characters on a Zebra Printer (é, à, Ô).
So far, I've tried solutions found on StackOverflow (like this one: Print characters with an acute in ZPL). With that one, the special characters do print correctly, but the font is big and the printer unrolls a few inches of paper before actually printing.
I've read Zebra Programming doc but I can't seem to make it work.
Also, it does not look at all like the code I have so far :
T 0 3 40 0 ^FDHimudit\82
T 0 3 40 30
T 0 3 40 60
T 0 3 40 90 Déroulage Réduction Öyster
T 0 3 40 120 Règle ÀAA ÂA
SETFF 100 2.5 FORM PRINT
I'm writing a function in Octave to easily add particles on an image, but I have a problem.
function [ out ] = enparticle( mainImg, particleNames, particleData, frames, fpp, sFrame, eFrame )
%particleData format:
% [ p1Xline p1StartHeight p1EndHeight;
% p2Xline p2StartHeight p2EndHeight;
% p3Xline p3StartHeight p3EndHeight;
% ... ]
%particleNames format:
% [ p1Name;
% p2Name;
% p3Name;
% ... ]
pAmount = size(particleData, 1);
for i= 1:pAmount
tmp = particleNames(i,:)
[ pIMG pMAP pALPHA ] = imread( tmp );
end
end
When I run this simple code with
enparticle( "ffield.png", [ "p_dot.png"; "p_star.png"; "p_dot.png" ], [ 100 50 100; 200 50 100; 300 50 100 ], 30, 10, 5, 25 )
I get this written in console
tmp = p_dot.png
error: imread: unable to find file p_dot.png
error: called from
imageIO at line 71 column 7
imread at line 106 column 30
enparticle at line 24 column 23
When I try to imread() the file this way, Octave thinks that there is no file with that name. But there actually is one, in the same folder as the script file.
The most curious thing is that, when I change
tmp = particleNames(i,:)
to
tmp = particleNames(:,:)
and Octave assigns all the names to tmp as an array, it magically finds all the files with the passed names.
But that's not how I want it to work, because then all the files would be replaced, or merged, or something similar during image processing.
The reason I'm doing it this way is that I want to put every frame (of image and alpha) separately into a cell array later.
I don't have a clue what I'm doing wrong here, and I can't find anything on Google either :(
The code:
filenames = [ "p_dot.png"; "p_star.png"; "p_dot.png" ]
does not do what you think it does. This will create a 2 dimensional
array of characters. See
octave> size (filenames)
ans =
3 10
Note that it lists 10 columns. Take a look at your
filenames and you will notice that they are of different
lengths: two have length 9 and one has length 10. But this is just
like a numeric matrix, the only difference being that you have ASCII
characters. And like a numeric matrix, all rows must have the same length.
What is happening, is that the shortest rows get padded with
whitespace. You can confirm this by checking the ascii decimal code
of your filenames:
octave> int8 (filenames)
ans =
112 95 100 111 116 46 112 110 103 32
112 95 115 116 97 114 46 112 110 103
112 95 100 111 116 46 112 110 103 32
Note how the first and third row end in '32'. In ASCII, that's the
space character (see the wikipedia article about
ASCII which has the tables)
So the problem is not that imread does not find a file named
'p_dot.png', the problem is that it does not find a file named
'p_dot.png '.
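The padding effect is not specific to Octave and can be reproduced in a few lines of Python. This is only a sketch of what a fixed-width character matrix does to the names; `str.ljust` stands in here for the automatic padding and is not how Octave implements it.

```python
# Sketch: mimic a character matrix, where every row has the same width
# and shorter rows are padded with spaces on the right.
names = ["p_dot.png", "p_star.png", "p_dot.png"]
width = max(len(n) for n in names)
rows = [n.ljust(width) for n in names]

print([len(r) for r in rows])        # every row now has the same length
print([ord(c) for c in rows[0]])     # the last code is 32, the space
print(repr(rows[0]))                 # the name a file lookup would receive
print(rows[0].rstrip() == "p_dot.png")
```

Stripping the trailing whitespace recovers the real name, but storing the names in a container of separate strings avoids the padding in the first place.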
You should not be using character arrays for this. Instead, use a
cell array of char arrays, like so:
filenames = {"p_dot.png", "p_star.png", "p_dot.png"}
for i = 1:numel (filenames)
fname = filenames{i};
[pIMG, pMAP, pALPHA] = imread (fname);
## do some stuff
endfor
With a Postscript driver (Xerox, Canon, HP, all), when I create a PS file, for example when I print the test page in the printer properties, I get :
OK :
The view of the result is correct (with GSview for example)
Not OK :
The file size is too big, more than 4 MB.
When I edit the file, I see one big image (doNimage). I think this is the reason for the large file size.
The example file : https://drive.google.com/open?id=0B9bet657DEU5alV6WFZZdDFjMmc
I'm on Windows 10, similar problem with Windows server 2012 r2.
I let the configuration of the driver by default.
Anyone has an idea ?
Thanks a lot.
Regards.
I don't understand your problem, the file you posted a link to contains text. Here's an example:
360 4485 M <202530360E0F1102381030100D100B0824152D30103102020C302A1E19181B1E1730132E28301530132D3B02230B2A2E22081308>[46 16 28 70 18 42 44 44 54 32 28 32 36 32 25 39 65 40 40 28 32 44 44 44 18 28 53 45 20 47 38 45
40 28 34 40 40 28 40 28 34 40 18 44 44 25 53 40 16 39 34 0]xS
M is a moveto and xS uses the xshow operator to draw the glyphs represented by the character codes in the hexstring, using the values in the array to modify the width of each glyph.
If you were expecting to see ASCII character codes you are going to be sadly disappointed. The file uses an incrementally downloaded subset TrueType font, so the character codes are defined as they are encountered: the first glyph used is given character code 1, the second character code 2, and so on.
Even without that, using ASCII would limit the languages that could be supported. Back in the 1980s that maybe didn't seem like a problem, but it's a long time since that was considered acceptable.
If you were expecting to be able to modify the text by editing it in a text editor, forget it. PostScript is a programming language, and the output of a PostScript printer driver is a machine-generated program. It's a lengthy process for a skilled user of the language to decipher what the program is doing. The program is not amenable to alteration; if there's a fault in the output, correct the original document and recreate the PostScript program from the original.
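To make the structure of that line concrete, here is a small Python sketch that splits the start of the hex string into character codes and pairs each code with its value from the array. Only the first eight bytes and the first eight array values from the example are used, and the variable names are made up for illustration.

```python
# Sketch: pair each character code in the hex string with the
# corresponding value from the array passed to xshow.
hex_string = "202530360E0F1102"             # first 8 bytes of the example
advances = [46, 16, 28, 70, 18, 42, 44, 44]  # first 8 array values

codes = list(bytes.fromhex(hex_string))      # one code per byte
pairs = list(zip(codes, advances))

for code, advance in pairs:
    print(f"code {code:3d} (0x{code:02X}) -> advance {advance}")

# Reading the bytes as ASCII gives nothing meaningful, because the codes
# are assigned by the subset font, not by any standard encoding.
print(repr(bytes.fromhex(hex_string).decode("latin-1")))
```

Mapping these codes back to readable text would require the glyph assignments from the font subset embedded in the same file.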
PostScript is not an editable format.
Thanks all for your response. I see I was not very clear in my question.
Here is the state :
With the PS driver, on a windows server 2008, I get this file :
http://expirebox.com/download/0bb511565377e8b74eead67641fe7f68.html
Inside the file I can see the text "Page de test d\222imprimante"
On a Windows server 2012 R2 :
http://expirebox.com/download/60fa957cba97c82bbcd5c0e975825b52.html
I can't see any text. It's a printer page test too.
I need to see text because I'll print document with code inside. Code for a printer to identify page type. (for example a white page for the tray n° 1, yellow page for tray 2)
KenS: I understand your point. But why does the same driver produce different files?
I checked whether it really is the same. The only difference I see is the OS: one is x86, the other x64.
Thanks.
Regards.
I am trying to extract the text of page 5 in pdf.
The PDF has a font, YLJAAA+CMSY10, which has no mappings (CMap) or even encodings (default encoding or /Differences). While extracting text, after the string "tetex package", CGPDFScanner returns the character "\x15", which occurs many times. When this character is encountered, the current font is the font mentioned above, which provides nothing for extracting text from the PDF string.
What is this \x15 character?
Thanks.
I found 2 (not "many") occurrences of this:
[ (\025) ] TJ
which is a number in octal – this is the number that is \x15 in hexadecimal.
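As a quick check of that equivalence, the following Python sketch decodes octal escapes in the body of a PDF literal string. The function name `decode_octal_escapes` is hypothetical, and the sketch deliberately ignores the other escape sequences (such as \n or \() that a real PDF string parser has to handle.

```python
import re


def decode_octal_escapes(pdf_literal: str) -> bytes:
    """Decode backslash + 1-3 octal digits into single bytes.
    Other PDF escape sequences are not handled in this sketch."""
    out = bytearray()
    pos = 0
    for m in re.finditer(r"\\([0-7]{1,3})", pdf_literal):
        out += pdf_literal[pos:m.start()].encode("latin-1")
        out.append(int(m.group(1), 8) & 0xFF)
        pos = m.end()
    out += pdf_literal[pos:].encode("latin-1")
    return bytes(out)


raw = decode_octal_escapes(r"\025")
print(raw, raw[0], hex(raw[0]))   # octal 025 is decimal 21, hex 0x15
```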
The font definition for "YLJAAA+CMSY10" in the PDF carries no special encoding, so it has the default encoding for "CMSY" ("Computer Modern Symbol"):
114 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont 210 0 R % -> "/YLJAAA+CMSY10"
/FirstChar 0
/FontDescriptor 211 0 R
/LastChar 127
/Widths 204 0 R
>>
211 0 obj
<<
/Ascent 750
/CapHeight 683
/CharSet (/bullet/greaterequal/arrowright/arrowdblright/element/negationslash/backslash/radical)
/Descent 0
/Flags 4
/FontBBox [ -29 -960 1116 775 ]
/FontFile 205 0 R
/FontName 210 0 R % -> '/YLJAAA+CMSY10'
/ItalicAngle -14
/StemV 85
/XHeight 430
>>
endobj
In itself, this still says nothing definitive: a PDF producer may reorder glyphs and encodings at will (as long as it does the same with the embedded font). Assuming the font set is not reordered, checking a random list of CMxx encodings shows that the character code 0x15 could well be GREATER-THAN OR EQUAL TO (Unicode U+2265).
Acrobat agrees; inspecting the font in the PDF shows that character code 21 (decimal) is named 'GREATER-THAN OR EQUAL' and looks like it as well.
Hi, I have a directory of files, and each file has text strings in multiple languages over multiple lines. Using grepwin I would like to extract all the English text strings and save them into another text file. Typically, in each file the English text is inside a Switch/Case condition like this:
Default //English
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
Break
Case _French
bitmap 8 20 "bmp5/warning.bmp"
LTEXT 5 1 11 "Surcharge clé USB!"
LTEXT 45 30 13 "Surcharge clé USB"
Break
Since all the English text is always between 'Default' and 'Break', I want to use those two keywords as the delimiters. Finally, all the text between the two keywords needs to be saved out to another text file.
Can anyone help at all.
Thanks guys
Through grep.
$ grep -oPz '(?s)(?<=\n|^)Default\b[^\n]*\n\K.*?(?=\nBreak\b)' file
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
To save the result to another file, use the output redirection operator.
grep -oPz '(?s)(?<=\n|^)Default\b[^\n]*\n\K.*?(?=\nBreak\b)' infile > outfile
From man grep
-P, --perl-regexp PATTERN is a Perl regular expression
-z, --null-data a data line ends in 0 byte, not newline
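If grep with -P is not available (for example on Windows), roughly the same extraction can be sketched in Python with the standard `re` module. This is an illustrative equivalent and not a translation of the exact grep pattern: it anchors on lines starting with Default and captures everything up to the line before the next Break. The sample text is taken from the question; the function name `english_blocks` is made up for this example.

```python
import re

SAMPLE = '''Default //English
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
Break
Case _French
bitmap 8 20 "bmp5/warning.bmp"
LTEXT 5 1 11 "Surcharge clé USB!"
LTEXT 45 30 13 "Surcharge clé USB"
Break
'''

# A line starting with "Default", the rest of that line, then a lazy
# capture of everything up to (not including) a line starting with "Break".
BLOCK = re.compile(r"^Default\b[^\n]*\n(.*?)(?=\nBreak\b)",
                   re.MULTILINE | re.DOTALL)


def english_blocks(text):
    """Return the text of every Default ... Break block."""
    return BLOCK.findall(text)


blocks = english_blocks(SAMPLE)
print(blocks[0])
```

To process a directory, the same function could be applied to the contents of each file and the results appended to one output file.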