I am trying to extract the text of page 5 of a PDF.
The PDF has a font, YLJAAA+CMSY10, which has no mappings (CMap) or even encodings (default encoding or /Differences). While extracting text, after the string "tetex package", CGPDFScanner returns the character "\x15", which occurs many times. When this character is encountered, the current font is the one mentioned above, which provides nothing for decoding the PDF string.
What is this \x15 character?
Thanks.
I found 2 (not "many") occurrences of this:
[ (\025) ] TJ
where \025 is a number in octal: octal 25 is decimal 21, which is \x15 in hexadecimal.
The font definition for "YLJAAA+CMSY10" in the PDF carries no special encoding, so it has the default encoding for "CMSY" ("Computer Modern Symbol"):
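As a quick check of that conversion, here is a minimal Python sketch (the variable name is illustrative only):

```python
# The PDF string (\025) uses an octal escape; parse the digits as base 8.
code = int("025", 8)

print(code)       # decimal value of the character code
print(hex(code))  # the same value in hexadecimal
```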
114 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont 210 0 R % -> "/YLJAAA+CMSY10"
/FirstChar 0
/FontDescriptor 211 0 R
/LastChar 127
/Widths 204 0 R
>>
211 0 obj
<<
/Ascent 750
/CapHeight 683
/CharSet (/bullet/greaterequal/arrowright/arrowdblright/element/negationslash/backslash/radical)
/Descent 0
/Flags 4
/FontBBox [ -29 -960 1116 775 ]
/FontFile 205 0 R
/FontName 210 0 R % -> '/YLJAAA+CMSY10'
/ItalicAngle -14
/StemV 85
/XHeight 430
>>
endobj
In itself, this still says nothing definitive: a PDF producer may reorder glyphs and encodings at will (as long as it does the same with the embedded font). Assuming the font is not reordered, checking a random list of CMxx encodings shows that the character code 0x15 could well be GREATER-THAN OR EQUAL TO (Unicode U+2265).
Acrobat agrees; inspecting the font in the PDF shows that character code 21 (decimal) is named 'GREATER-THAN OR EQUAL' and looks like it as well.
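If you need to turn such codes into Unicode yourself, one option is a small lookup table. The sketch below is only an assumption-laden illustration: the code positions are taken from the usual CMSY10 layout (not from this PDF), it covers just a few of the glyphs named in /CharSet, and `CMSY10_SUBSET` and `decode_cmsy` are hypothetical names. It is wrong for any PDF whose producer reordered the embedded font.

```python
# Assumed subset of the usual CMSY10 layout: char code -> (glyph name, Unicode).
# Only valid if the PDF producer did not reorder the embedded font.
CMSY10_SUBSET = {
    0x0F: ("bullet", "\u2022"),
    0x15: ("greaterequal", "\u2265"),
    0x21: ("arrowright", "\u2192"),
    0x29: ("arrowdblright", "\u21d2"),
    0x32: ("element", "\u2208"),
    0x70: ("radical", "\u221a"),
}


def decode_cmsy(raw: bytes) -> str:
    """Map each byte to its assumed Unicode character; unknown codes become U+FFFD."""
    return "".join(CMSY10_SUBSET.get(b, ("", "\ufffd"))[1] for b in raw)


print(decode_cmsy(b"\x15"))
```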
I am trying to print special characters on a Zebra Printer (é, à, Ô).
So far, I've tried solutions found on StackOverflow (like this one: Print characters with an acute in ZPL). With that particular one, the special characters do print correctly, but the font is big and the printer unrolls a few inches of paper before actually printing.
I've read Zebra Programming doc but I can't seem to make it work.
Also, it does not look at all like the code I have so far :
T 0 3 40 0 ^FDHimudit\82
T 0 3 40 30
T 0 3 40 60
T 0 3 40 90 Déroulage Réduction Öyster
T 0 3 40 120 Règle ÀAA ÂA
SETFF 100 2.5 FORM PRINT
I'm looking for the exact specifications of this file format. Anyone got a link? Or want to comment?
I have spent the better part of the day searching, yet I keep getting directed back to the GIMP online user-manual. It says "look at a .gpl file and you will see it is easy" to build manually with a text editor. I don't actually have GIMP, but I see examples online. Yep, easy. EXCEPT:
• What meaning do the color names ultimately have? Are they purely semantic, or does a program rely on them? If the latter, then what if there are two (2) or more colors with the same name?
• What does the "Columns" line do?
I've seen examples that have no "Columns" line.
I've seen examples that have values of 0, 4, and 16; yet this does not in any way that I can see correspond to the color data. I see 3 columns of decimal-sRGB values, and an optional 4th column with the color-name; seems I remember the example with "Columns 4" had no color names, only the 3 RGB columns.
• Do columns of RGB values need to "line up"? Or will the following example from my output algorithm work? (from the Crayola palette):
159 129 112 Beaver
253 124 110 Bittersweet
0 0 0 Black
172 229 238 Blizzard Blue
31 117 254 Blue
162 162 208 Blue Bell
102 153 204 Blue Gray
13 152 186 Blue Green
• Does this format accept sRGBA colors? And if so, how is the "A" value defined (0-1, 0%-100%, 0-127, 0-255, etc.?) (seems I remember when creating .png files with PHP, the "A" value was 7-bit)?
• How exactly do you add comments / metadata?
Today I see an example that says lines that begin with # are comments, or anything after a # on a line is a comment. Yesterday I thought (maybe I'm confused) I saw an example that said that comment lines begin with ;
• Is any other data-format supported?
Originally I thought the text-line just before the color-data that I see in every example indicated the format: "#" signifying decimal-sRGB; until today when I see that is just a blank-line comment.
• What line ending character(s) can / must I use?
\n
\r
• What character-encodings can I use? ASCII only? ¿UTF-8 ☺ with extended ♪♫ charset (¡hopefully!)?
• Anything I'm missing? Any other options available?
Here is an example from http://gimpchat.com/viewtopic.php?f=8&t=3375#
GIMP Palette
Name: bugslife_final.png-10
Columns: 16
#
191 180 180 Index 0
163 158 157 Index 1
145 136 132 Index 2
130 125 112 Index 3
… … …
56 50 49 Index 29
41 38 38 Index 30
23 23 23 Index 31
242 245 213 Index 32
227 232 181 Index 33
210 217 147 Index 34
195 204 118 Index 35
… … …
0 0 0 Index 251
0 0 0 Index 252
0 0 0 Index 253
0 0 0 Index 254
0 0 0 Index 255
Aloha!
Looking at the source code:
Columns is just an indication for display in the palette editor
Comments must start with a #. In non-empty lines that don't, the first three tokens are parsed as numbers
There is no alpha support
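The rules above can be sketched as a small reader. This is a minimal Python sketch, not GIMP's own parser: the function name `parse_gpl` is hypothetical, and the handling of the "GIMP Palette", "Name:" and "Columns:" lines is an assumption based on the example file in the question.

```python
def parse_gpl(text):
    """Parse GIMP palette text into (name, columns, colors)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "GIMP Palette":
        raise ValueError("missing 'GIMP Palette' header")
    name, columns, colors = None, 0, []
    for line in lines[1:]:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped.startswith("Name:"):
            name = stripped[len("Name:"):].strip()
        elif stripped.startswith("Columns:"):
            columns = int(stripped[len("Columns:"):])  # display hint only
        else:
            # First three tokens are R, G, B; the rest (if any) is the color name.
            parts = stripped.split(None, 3)
            r, g, b = (int(p) for p in parts[:3])
            label = parts[3] if len(parts) > 3 else ""
            colors.append((r, g, b, label))
    return name, columns, colors


sample = (
    "GIMP Palette\n"
    "Name: demo\n"
    "Columns: 16\n"
    "#\n"
    "191 180 180 Index 0\n"
    "  0   0   0 Black\n"
    "172 229 238 Blizzard Blue\n"
)
print(parse_gpl(sample))
```

Because the line is split on runs of whitespace, the RGB columns do not need to line up in this sketch, and a color name may contain spaces.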
I'm writing a function in Octave to easily add particles on an image, but I have a problem.
function [ out ] = enparticle( mainImg, particleNames, particleData, frames, fpp, sFrame, eFrame )
%particleData format:
% [ p1Xline p1StartHeight p1EndHeight;
% p2Xline p2StartHeight p2EndHeight;
% p3Xline p3StartHeight p3EndHeight;
% ... ]
%particleNames format:
% [ p1Name;
% p2Name;
% p3Name;
% ... ]
pAmount = size(particleData, 1);
for i= 1:pAmount
tmp = particleNames(i,:)
[ pIMG pMAP pALPHA ] = imread( tmp );
end
end
When I run this simple code with
enparticle( "ffield.png", [ "p_dot.png"; "p_star.png"; "p_dot.png" ], [ 100 50 100; 200 50 100; 300 50 100 ], 30, 10, 5, 25 )
I get this written in console
tmp = p_dot.png
error: imread: unable to find file p_dot.png
error: called from
imageIO at line 71 column 7
imread at line 106 column 30
enparticle at line 24 column 23
When I try to imread() a file this way, Octave thinks that there is no file with that name. But there actually is one, in the same folder as the script file.
The most curious thing is that, when I change
tmp = particleNames(i,:)
to
tmp = particleNames(:,:)
and Octave assigns all the names to tmp as an array, it magically finds all the files with the passed names.
But that's not the way I want it to work, because all the files would then be replaced, or merged, or something similar during image processing.
The reason I'm trying to do it this way is that I want to put every frame (of image and alpha) separately into a cell array later.
I don't have any clue what I'm doing wrong here, and I can't find anything about it on Google either :(
The code:
filenames = [ "p_dot.png"; "p_star.png"; "p_dot.png" ]
does not do what you think it does. This will create a 2 dimensional
array of characters. See
octave> size (filenames)
ans =
3 10
Of note is that it lists 10 columns. Take a look at your
filenames and you will notice that they are of different
lengths: two have length 9 and one has length 10. But this is just
like a numeric matrix; the only difference is that you have ASCII
characters. And like a numeric matrix, all rows must have the same length.
What is happening, is that the shortest rows get padded with
whitespace. You can confirm this by checking the ascii decimal code
of your filenames:
octave> int8 (filenames)
ans =
112 95 100 111 116 46 112 110 103 32
112 95 115 116 97 114 46 112 110 103
112 95 100 111 116 46 112 110 103 32
Note how the first and third row end in '32'. In ASCII, that's the
space character (see the wikipedia article about
ASCII which has the tables)
So the problem is not that imread does not find a file named
'p_dot.png', the problem is that it does not find a file named
'p_dot.png '.
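The same check can be done outside Octave. Here is a minimal sketch in Python (not Octave) that decodes the first row of the table above and makes the padding visible; the variable names are illustrative only.

```python
# First row of the int8 (filenames) output shown above.
row = [112, 95, 100, 111, 116, 46, 112, 110, 103, 32]

name = bytes(row).decode("ascii")
print(repr(name))           # repr() makes the trailing space visible
print(repr(name.rstrip()))  # the file name without the padding
```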
You should not be using character arrays for this. Instead, use a
cell array of char arrays, like so:
filenames = {"p_dot.png", "p_star.png", "p_dot.png"}
for i = 1:numel (filenames)
fname = filenames{i};
[pIMG, pMAP, pALPHA] = imread (fname);
## do some stuff
endfor
I have a pdf which contains some character like dashes and double quotes on page number 8 having width 0.
It has Times-Roman font.
I have tried to find the width using AFM files for Times-Roman font but had no luck.
How can I find width of such characters?
Thanks.
Times Roman is one of the Standard 14 Fonts. These fonts oftentimes are exceptions in the PDF specification concerning required data, e.g.
Widths array (Required except for the standard 14 fonts;...
(Table 111 – Entries in a Type 1 font dictionary - ISO 32000-1)
The section on these fonts explains where to get the information instead:
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
(section 9.6.2.2 - Standard Type 1 Fonts (Standard 14 Fonts) - ISO 32000-1)
Having clicked through the Adobe sites, it looks like the font metrics are currently available at ftp://ftp.adobe.com/pub/adobe/type/
The OP clarified his problems in a comment:
I have tried Adobe's Font metric file to get width value char \x93,\x94,\x96,\x97,\x98. However these values are not present in AFM file. How can I find widths of these values?
First of all you have to look up the meaning of those values.
You mentioned that the problem occurs on page number 8 and that the font is Times-Roman. On page 8 there are the Times-Roman fonts F28 and F46, and there is a Times-Bold font F43. The other fonts are Courier and CMSY10 (TeX Computer Modern Symbol?). F28, F43, and F46 have the same Encoding entry:
163 0 obj
<</Differences
[0/.notdef 1/dotaccent/fi/fl/fraction/hungarumlaut/Lslash/lslash/ogonek/ring 10/.notdef 11/breve/minus 13/.notdef 14/Zcaron/zcaron/caron/dotlessi/dotlessj/ff/ffi/ffl 22/.notdef 30/grave/quotesingle/space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright/parenleft/parenright/asterisk/plus/comma/hyphen/period/slash/zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon/less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright/asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde 127/.notdef 130/quotesinglbase/florin/quotedblbase/ellipsis/dagger/daggerdbl/circumflex/perthousand/Scaron/guilsinglleft/OE 141/.notdef 147/quotedblleft/quotedblright/bullet/endash/emdash/tilde/trademark/scaron/guilsinglright/oe 157/.notdef 159/Ydieresis 160/.notdef 161/exclamdown/cent/sterling/currency/yen/brokenbar/section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot/hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior/acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine/guillemotright/onequarter/onehalf/threequarters/questiondown/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/ydieresis]
/Type/Encoding>>
endobj
You are looking for \x93,\x94,\x96,\x97,\x98, i.e. (decimal) 147, 148, 150, 151, 152. According to the Encoding above, especially this Differences section:
147/quotedblleft/quotedblright/bullet/endash/emdash/tilde
that means quotedblleft, quotedblright, endash, emdash, and tilde. Searching for those names in the font metrics file one gets:
C 170 ; WX 444 ; N quotedblleft ; B 43 433 414 676 ;
C 186 ; WX 444 ; N quotedblright ; B 30 433 401 676 ;
C 177 ; WX 500 ; N endash ; B 0 201 500 250 ;
C 208 ; WX 1000 ; N emdash ; B 0 201 1000 250 ;
C 196 ; WX 333 ; N tilde ; B 1 532 331 638 ;
So here are the metrics, especially the widths, of your characters.
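The two-step lookup (character code to glyph name via /Differences, then glyph name to width via the AFM file) can be sketched as follows. This is a minimal Python sketch under stated assumptions: `afm_widths` is a hypothetical helper, the `differences` dictionary is hand-copied from the 147.. run of the Encoding above, and `afm_sample` contains only the five metric lines quoted above rather than a complete AFM file.

```python
def afm_widths(afm_text):
    """Return {glyph name: WX width} from the character-metric lines of AFM text."""
    widths = {}
    for line in afm_text.splitlines():
        if not line.startswith("C "):
            continue  # not a character-metric line
        fields = {}
        for part in line.split(";"):
            tokens = part.split()
            if len(tokens) >= 2:
                fields[tokens[0]] = tokens[1:]
        if "N" in fields and "WX" in fields:
            widths[fields["N"][0]] = float(fields["WX"][0])
    return widths


# Hand-copied from the /Differences run starting at code 147.
differences = {147: "quotedblleft", 148: "quotedblright", 149: "bullet",
               150: "endash", 151: "emdash", 152: "tilde"}

afm_sample = """\
C 170 ; WX 444 ; N quotedblleft ; B 43 433 414 676 ;
C 186 ; WX 444 ; N quotedblright ; B 30 433 401 676 ;
C 177 ; WX 500 ; N endash ; B 0 201 500 250 ;
C 208 ; WX 1000 ; N emdash ; B 0 201 1000 250 ;
C 196 ; WX 333 ; N tilde ; B 1 532 331 638 ;
"""

widths = afm_widths(afm_sample)
for code in (0x93, 0x94, 0x96, 0x97, 0x98):
    glyph = differences[code]
    print(hex(code), glyph, widths[glyph])
```

Note that the `C` number in the AFM line is the code in the font's built-in encoding, which is why the lookup goes through the glyph name instead of the character code from the PDF.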
Hi, I have a directory of files, and each file contains text strings in multiple languages spread over multiple lines. Using grepWin, I would like to extract all the English text strings and save them into another text file. Typically, in each file the English text is inside a Switch/Case condition like this:
Default //English
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
Break
Case _French
bitmap 8 20 "bmp5/warning.bmp"
LTEXT 5 1 11 "Surcharge clé USB!"
LTEXT 45 30 13 "Surcharge clé USB"
Break
Since all the English text is always between 'Default' and 'Break', I want to use those two keywords as the delimiters. Finally, all the text between the two keywords needs to be saved out to another text file.
Can anyone help at all.
Thanks guys
Through grep.
$ grep -oPz '(?s)(?<=\n|^)Default\b[^\n]*\n\K.*?(?=\nBreak\b)' file
bitmap 8 20 "bmp5/warning.bmp"
Ltext 5 1 11 "USB Device Overload"
LText 85 20 13 "USB"
Ltext 50 33 13 "Device Overload!"
To save the result to another file, use the output redirection operator.
grep -oPz '(?s)(?<=\n|^)Default\b[^\n]*\n\K.*?(?=\nBreak\b)' infile > outfile
From man grep
-P, --perl-regexp PATTERN is a Perl regular expression
-z, --null-data a data line ends in 0 byte, not newline
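If grep with -P is not available, the same idea can be sketched in Python. This is a rough equivalent rather than a drop-in replacement for the command above: `extract_english` is a hypothetical helper name, and the pattern assumes that 'Default' starts a line and that 'Break' sits on its own line, as in the sample.

```python
import re

# Capture everything after the "Default ..." line up to the line before "Break".
PATTERN = re.compile(r"^Default\b[^\n]*\n(.*?)(?=\nBreak\b)",
                     re.MULTILINE | re.DOTALL)


def extract_english(text):
    """Return the body of every Default ... Break block in text."""
    return [m.group(1) for m in PATTERN.finditer(text)]


sample = (
    'Default //English\n'
    'bitmap 8 20 "bmp5/warning.bmp"\n'
    'Ltext 5 1 11 "USB Device Overload"\n'
    'Break\n'
    'Case _French\n'
    'LTEXT 5 1 11 "Surcharge"\n'
    'Break\n'
)
for block in extract_english(sample):
    print(block)
```

To process a whole directory, the function could be applied to the contents of each file and the results appended to one output file.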