Parsing linearized pdf xref table - parsing

I have pdf having start as:
%PDF-1.7
%‚„œ”
69 0 obj
<</Linearized 1/L 3937432/O 71/E 2811072/N 9/T 3935937/H [ 996 498]>>
endobj
xref
69 35
0000000016 00000 n
0000001494 00000 n
0000001593 00000 n
0000002065 00000 n
........................
and at the end I have:
0003929147 00000 n
0003929283 00000 n
0003929352 00000 n
0003929458 00000 n
0003935743 00000 n
trailer
<</Size 69/ID[<00E23EA222C14F40B1305A98D798C27F><F53AB532FC064AB39459DBD6BAF21DD6>]>>
startxref
11
Now if I tried to fetch startxref at 11 then I get to „œ” string...which seems wrong, how do I go to actual xrefstart ("xref"),
Can Any body help?

Your xref table byte offset is wrong. It is not 11.
startxref
11
If you fixed this, you can access the xref reading properly

Related

how to use seq2seq to decode concatenated string

Am trying to decode a concatenated String like below ...
SQCB7A750BATWE SQ CB 7 A 750 B A T WE
PT05A1219PY023 PT 05 A 12 19 P Y 023
PT55A1019PX02 PT 55 A 10 19 P X 02
PT33SE2215SW023 PT 33 SE 22 15 S W 023
PT05A2216PW023(LC) PT 05 A 22 16 P W 023 (LC)
am looking for a smarter way rather than hard-coded rules as the input will have variations(number of characters and digits), I came across SEQ2SEQ model and I want to know if it's possible to use it in such problem
I already followed some tutorials to get a taste of it, but the results weren't even close
it also seems there are 2 approaches character level and word level as per this tutorial
Character level:
Input sentence: SQCACA333BA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
-
Input sentence: SQCAAC152DA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
am still trying to implement the word level, but I'd like to know if the problem can be solved using this approach (seq2seq)

The DHT in JPEG does not contain the actual huffman code, how come?

The DHT contains 16 bytes that just contains count of how many values were encoded with huffman code of each length from 1 bit all the way to 16 bits. After this, it contains the actual values that were encoded, all these value are 8 bits in size.
Q: Why is huffman code not stored, how does decoder derive the codes?
Q: If there are say 4 values that have huffman code of 3 bits long, we shall write them as 4 bytes. Does it matter what order they are in or they have to be in ascending or descending order? I do know that the values must be in order such that the values with 1 bit huffman code are then followed by values with 2 bit huffman code e.t.c.
Q: I have used jpegsnoop to look at huffman table of different files, some made in MS paint and some were downloaded. I find that they all have the SAME table.
Here are the DHT entries I got from JPEG snoop:
Destination ID = 1
Class = 1 (AC Table)
Codes of length 01 bits (000 total):
Codes of length 02 bits (002 total): 00 01
Codes of length 03 bits (001 total): 02
Codes of length 04 bits (002 total): 03 11
Codes of length 05 bits (004 total): 04 05 21 31
Codes of length 06 bits (004 total): 06 12 41 51
Codes of length 07 bits (003 total): 07 61 71
Codes of length 08 bits (004 total): 13 22 32 81
Codes of length 09 bits (007 total): 08 14 42 91 A1 B1 C1
Codes of length 10 bits (005 total): 09 23 33 52 F0
Codes of length 11 bits (004 total): 15 62 72 D1
Codes of length 12 bits (004 total): 0A 16 24 34
Codes of length 13 bits (000 total):
Codes of length 14 bits (001 total): E1
Codes of length 15 bits (002 total): 25 F1
Codes of length 16 bits (119 total): 17 18 19 1A 26 27 28 29 2A 35 36 37 38 39 3A 43
44 45 46 47 48 49 4A 53 54 55 56 57 58 59 5A 63
64 65 66 67 68 69 6A 73 74 75 76 77 78 79 7A 82
83 84 85 86 87 88 89 8A 92 93 94 95 96 97 98 99
9A A2 A3 A4 A5 A6 A7 A8 A9 AA B2 B3 B4 B5 B6 B7
B8 B9 BA C2 C3 C4 C5 C6 C7 C8 C9 CA D2 D3 D4 D5
D6 D7 D8 D9 DA E2 E3 E4 E5 E6 E7 E8 E9 EA F2 F3
F4 F5 F6 F7 F8 F9 FA
Total number of codes: 162
And
Destination ID = 1
Class = 0 (DC / Lossless Table)
Codes of length 01 bits (000 total):
Codes of length 02 bits (003 total): 00 01 02
Codes of length 03 bits (001 total): 03
Codes of length 04 bits (001 total): 04
Codes of length 05 bits (001 total): 05
Codes of length 06 bits (001 total): 06
Codes of length 07 bits (001 total): 07
Codes of length 08 bits (001 total): 08
Codes of length 09 bits (001 total): 09
Codes of length 10 bits (001 total): 0A
Codes of length 11 bits (001 total): 0B
Codes of length 12 bits (000 total):
Codes of length 13 bits (000 total):
Codes of length 14 bits (000 total):
Codes of length 15 bits (000 total):
Codes of length 16 bits (000 total):
Total number of codes: 012
The AC table compresses RRRRSSSS that contain zero-run length and AC coefficient magnitude while the DC table compresses SSSS. Thus, I think that the AC table must contain total of 255 entries (exlcuded 0) while the DC table must be 15 entries (excluded 0). However, neither of the tables contain this many total number of codes. WHY?
Q: Why is huffman code not stored, how does decoder derive the codes?
The reason the Huffman tables is defined as they are rather than with the actual codes is that it is much smaller and simpler to encode that way. PNG uses a similar but different method.
Keep in mind that to store the Huffman codes in the JPEG stream you would need to include both the length and the code itself. The result would be much larger than encoding a count of lengths.
Q: If there are say 4 values that have huffman code of 3 bits long, we shall write them as 4 bytes. Does it matter what order they are in or they have to be in ascending or descending order?
If the Huffman code has 3 bits, it is written as three bits to the JPEG stream. The codes are generated in ascending order.
Q: I have used jpegsnoop to look at huffman table of different files, some made in MS paint and some were downloaded. I find that they all have the SAME table.
The encoder is being lazy and using a fixed Huffman table. There is a sample Huffman table in the JPEG standard that they often use. To generate optimal Huffman codes, the encoder must make two passes over the data. With a preset table, the encoder only needs to make one pass.
F.1.2.1.2 and F.1.2.2.1 of the JPEG Specification explain why the Huffman tables are not fully populated. For baseline encoding DC difference values are limited to 11 bits (table F.1) and AC values are limited to 10 bits (table F.2).
Since DC Huffman symbols only need SSSS values from 0 to 11 their Huffman trees need only 12 codes as you've reported.
AC Huffman symbols have a prefix zero run count from 0 to 15. With 11 bit sizes that works out to 16 * 11 = 176 symbols. However, they don't include the symbols 0x10, 0x20, ... 0xE0 because they are redundant. They encode a run of 1, 2, ... 14 zeros followed by a 0 value. If an encoder has, say, 7 zero values followed by a 3 bit value it can encode that as 0x73. There would be no point encoding it with two symbols 0x60;0x03.
Ignoring those 14 useless symbols we end up with 162 codes as you have reported.
By the way, the 0xF0 (ZRL) value is needed because there isn't a symbol that can express a run of 16 zeros followed by a value thus is cannot be merged.
I don't know why the JPEG spec limits the DC and AC values to a certain number of bits. I speculate that the extra precision would have no effect or is typically thrown away by quantization. Or maybe it has to do with the mathematics of the Inverse Discrete Cosine Transform. Keep in mind that these Huffman encoded values are (quantized) coefficients for the IDCT and are only indirectly related to the continuous tone RGB output.
The Huffman encoding is almost fully determined by the relative frequencies of all 256 symbols (except tiebreaker rules). This means you can choose many, many formats to encode those relative frequencies; the most simple one would be to simply store all these frequencies. The receiver can then rebuild the encoding from that order.
Background: the two least frequent characters of a Huffman encoding share the same (long) prefix, and differ only in the last bit. This combination is then assigned a joint frequency (sum of both combinations), which is used recursively to determine the prefix. Finally, you end up with two sets, one holding X characters and the other holding 256-X characters. The first set has prefix 0 and the second set has prefix 1.
Yes, that's arbitrary, you could swap those 0 and 1, and have a similar table and the same compression ratio - a 0 is just as long as a one. That's why you have detailed rules (e.g. most common set gets 0, tiebreaker is first byte in set)
Back to encoding. You want to store these relative frequencies efficiently, as we're using compression here. Now, as I pointed out, when we have codes suffix-0 and suffix-1, they're both equally long (namely the suffix length plus one). So we know from the fact that there are 119 unique 16 bits codes that there are 60 unique prefixes with length 15. Calculating backwards, we also know that there are two unique symbols with length 15, total 62, so there must be 31 prefixes of length 14. We can backtrack again to prefixes of length 1.
Again, it's necessary to point out that we don't know here the exact values of those prefixes, and the matching symbols. This depends on the tiebreaker rules, as pointed out, but those rules are fixed for JPEG.
JPEG does have a bit of a special case: Huffman codes for very rare symbols should be longer than 16 bits. That's inconvenient, so in building the table you don't choose the two sets with the least combined frequency if either of them already has a long suffix - combining those two subsets would just make the suffix even longer. You see this with all the 16-bit codes in the example: most should have had longer codes in a pure Huffman encoding.
I think the worst case is if the most frequent character appears 50%, the next 25%, etc. You'll get codes 0, 10, 110, 1110 etcetera. That's unary counting, which is indeed optimal for that case, but the longest code is now 256 bits. You'd need a document with 2^256 bytes to have a frequency of 2^-256, though.

COBOL table structure and output

I'm trying to userstand how this works:
WORKING-STORAGE SECTION.
01 PAY-TABLE.
05 PAY-VALUES OCCURS 25 TIMES PIC 9(3)V99.
01 WORKING-VALUES.
05 SUB PIC 9(2) VALUE ZERO.
PROCEDURE DIVISION.
INITIALIZE-ROUTINE.
INITIALIZE PAY-TABLE
MAINLINE-ROUTINE.
PERFORM LOAD-TABLE
VARYING SUB FROM 1
BY 1
UNTIL SUB>10.
DISPLAY-ROUTINE.
DISPLAY PAY-VALUE (SUB)
LOAD-TABLE.
MOVE SUB TO PAY-VALUE (SUB).
I have this code as a review code in the book. There are several questions according to this code, with answers, however, I don't really understand why should it be exactly this answer.
The DISPLAY statement will display the value _____ on the monitor.
Answer: unknown
When the value of SUB in the PERFORM statement is equal to 9, the value in PAY-VALUE (SUB), before the PERFORM executes, will be equal to:
Answer:10
When the LOAD-TABLE routine has executed for the last time, the value in PAY-VALUE (25) will be:
Answer: 26
I've tried to read tutorials about tables but still don't understand how this example works.
Note that much of this depends on the environment -- some COBOL compilers behave differently and there are also variances between platforms. (I've had a lot of fun with this porting code from HP3000 COBOL to Linux NetCOBOL.) But generally speaking...
1) The book may say the answer is "unknown" because compilers may initialize things differently. INITIALIZE PAY-TABLE may set each instance of PAY-VALUE to zero or it may set PAY-TABLE (the 125 byte character field that is comprised of the 25 PAY-VALUEs) to spaces. What it should be, however, (as in the former case,) is zero.
2) This is incorrect. Again, there is the possibility of variance (due to the reason mentioned in #1) but it should be zero. Because nothing has happened to PAY-VALUE yet (before the PERFORM for SUB=9), it would still be the value to which it had been initialized.
3) As in #2, with the same caveats, the answer should be zero. Because the LOAD-TABLE paragraph is only performed until SUB becomes greater than 10, PAY-VALUE (25) will never change from its initialized value.
Note also that the program is poorly written -- without a STOP RUN. at the end of the MAINLINE-ROUTINE, the program will continue on and execute the DISPLAY-ROUTINE and LOAD-TABLE paragraphs one last time.
There are also, a number of typos, most notably the definition of PAY-VALUES (with an S) and the usage of PAY-VALUE (no S).
Here is a sample run (I added displays before and after the MOVE statement in LOAD-TABLE and after the PERFORM LOAD-TABLE statement has completed. Note the extra displays.)
Before move -- Sub: 01 PV: 00000
After move -- PV: 00100
Before move -- Sub: 02 PV: 00000
After move -- PV: 00200
Before move -- Sub: 03 PV: 00000
After move -- PV: 00300
Before move -- Sub: 04 PV: 00000
After move -- PV: 00400
Before move -- Sub: 05 PV: 00000
After move -- PV: 00500
Before move -- Sub: 06 PV: 00000
After move -- PV: 00600
Before move -- Sub: 07 PV: 00000
After move -- PV: 00700
Before move -- Sub: 08 PV: 00000
After move -- PV: 00800
Before move -- Sub: 09 PV: 00000
After move -- PV: 00900
Before move -- Sub: 10 PV: 00000
After move -- PV: 01000
Completed Perform of LOAD-TABLE.
PAY-VALUE(25): 00000
00000
Before move -- Sub: 11 PV: 00000
After move -- PV: 01100
Bear in mind that the PAY-VALUE values show as hundreds (e.g., 00100 instead of 1) because the field is defined as 9(03)V99 meaning there is an implied decimal place and two digits to the right of it.
Hope this helps!

Print Bitmap to ESC/POS printer

I am trying to print to a ESC/POS compatible printer and am struggling to get my head around GS v 0. I have just connected from a Mac and sending commands is hex via CoolTerm.
The docs say ...
GS v 0 m xL xH yL yH d1....dk
-----------------------------------------------------
[Name] Print raster bit image
[Format] ASCII GS v 0 m xL xH yL yH d1....dk
Hex 1D 76 30 m xL xH yL yH d1....dk
Decimal 29 118 48 m xL xH yL yH d1....dk
[Range] 0≤xL≤48, xH=0; 0≤yL≤255, yH=0; 0≤d≤255
k=(xL+xH×256)×(yL+yH×256)(k≠0)
[Description] Selects Raster bit-image mode. The value of m selects the mode, as follows:
+------+------------+----------------------------+---------------------------+
| m | MODE | Vertical Dot Density | Horizontal Dot density |
+------+------------+----------------------------+---------------------------+
|0, 48 | Normal | 200 DPI | 200 DPI |
+------+------------+----------------------------+---------------------------+
|1, 49 |Double-width| 200 DPI | 100 DPI |
+------+-------------+---------------------------+---------------------------+
|2, 50 |Double-height| 100 DPI | 200 DPI |
+------+-------------+---------------------------+---------------------------+
|3, 51 | Quadruple | 100 DPI | 100 DPI |
+------+-------------+---------------------------+---------------------------+
• xL, xH, select the number of data bits ( xL+ xH × 256) in the horizontal direction for the bit image.
• yL, yH, select the number of data bits ( yL+ yH × 256) in the vertical direction for the bit image.
• This command has no effect in all print modes (character size, emphasized, double-strike, upside-down, underline, white/black reverse printing, etc.) for raster bit image.
• The part of bit image that exceeds the printable area will not be printed.
• d indicates the bit-image data. Set time a bit to 1 prints a dot and setting it to 0 does not print a dot.
So from this I deduce I need to send the following in HEX
1D 76 30 30 20 00 00 01
Does the image data now follow this, and do I have to send a message saying the image has ended?
I remember there is <ESC>K (not GS) command to print 8 lines of pixels. See ESC commands for details. After K must be sent 2 bytes - number of data bytes and the real data. But it is not guaranteed whether every printer will support this. What is the brand and model of your?
The docs say:
Hex 1D 76 30 m xL xH yL yH d1....dk
with m being 0x48..0x51, so you can't send
1D 76 30 30 20 00 00 01
but rather, for instance
1D 76 30 48 20 00 00 01
I don't think it makes much sense that xH and yL are 0.

Unexpected behavior of io:fread in Erlang

This is an Erlang question.
I have run into some unexpected behavior by io:fread.
I was wondering if someone could check whether there is something wrong with the way I use io:fread or whether there is a bug in io:fread.
I have a text file which contains a "triangle of numbers"as follows:
59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...
There is a single space between each pair of numbers and each line ends with a carriage-return new-line pair.
I use the following Erlang program to read this file into a list.
-module(euler67).
-author('Cayle Spandon').
-export([solve/0]).
solve() ->
{ok, File} = file:open("triangle.txt", [read]),
Data = read_file(File),
ok = file:close(File),
Data.
read_file(File) ->
read_file(File, []).
read_file(File, Data) ->
case io:fread(File, "", "~d") of
{ok, [N]} ->
read_file(File, [N | Data]);
eof ->
lists:reverse(Data)
end.
The output of this program is:
(erlide#cayle-spandons-computer.local)30> euler67:solve().
[59,73,41,52,40,9,26,53,6,3410,51,87,86,8161,95,66,57,25,
6890,81,80,38,92,67,7330,28,51,76,81|...]
Note how the last number of the fourth line (34) and the first number of the fifth line (10) have been merged into a single number 3410.
When I dump the text file using "od" there is nothing special about those lines; they end with cr-nl just like any other line:
> od -t a triangle.txt
0000000 5 9 cr nl 7 3 sp 4 1 cr nl 5 2 sp 4 0
0000020 sp 0 9 cr nl 2 6 sp 5 3 sp 0 6 sp 3 4
0000040 cr nl 1 0 sp 5 1 sp 8 7 sp 8 6 sp 8 1
0000060 cr nl 6 1 sp 9 5 sp 6 6 sp 5 7 sp 2 5
0000100 sp 6 8 cr nl 9 0 sp 8 1 sp 8 0 sp 3 8
0000120 sp 9 2 sp 6 7 sp 7 3 cr nl 3 0 sp 2 8
0000140 sp 5 1 sp 7 6 sp 8 1 sp 1 8 sp 7 5 sp
0000160 4 4 cr nl 8 4 sp 1 4 sp 9 5 sp 8 7 sp
One interesting observation is that some of the numbers for which the problem occurs happen to be on 16-byte boundary in the text file (but not all, for example 6890).
I'm going to go with it being a bug in Erlang, too, and a weird one. Changing the format string to "~2s" gives equally weird results:
["59","73","4","15","2","40","0","92","6","53","0","6","34",
"10","5","1","87","8","6","81","61","9","5","66","5","7",
"25","6",
[...]|...]
So it appears that it's counting a newline character as a regular character for the purposes of counting, but not when it comes to producing the output. Loopy as all hell.
A week of Erlang programming, and I'm already delving into the source. That might be a new record for me...
EDIT
A bit more investigation has confirmed for me that this is a bug. Calling one of the internal methods that's used in fread:
> io_lib_fread:fread([], "12 13\n14 15 16\n17 18 19 20\n", "~d").
{done,{ok,"\f"}," 1314 15 16\n17 18 19 20\n"}
Basically, if there's multiple values to be read, then a newline, the first newline gets eaten in the "still to be read" part of the string. Other testing suggests that if you prepend a space it's OK, and if you lead the string with a newline it asks for more.
I'm going to get to the bottom of this, gosh-darn-it... (grin) There's not that much code to go through, and not much of it deals specifically with newlines, so it shouldn't take too long to narrow it down and fix it.
EDIT^2
HA HA! Got the little blighter.
Here's the patch to the stdlib that you want (remember to recompile and drop the new beam file over the top of the old one):
--- ../erlang/erlang-12.b.3-dfsg/lib/stdlib/src/io_lib_fread.erl
+++ ./io_lib_fread.erl
## -35,9 +35,9 ##
fread_collect(MoreChars, [], Rest, RestFormat, N, Inputs).
fread_collect([$\r|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\r|More]);
fread_collect([$\n|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\n|More]);
fread_collect([C|More], Stack, Rest, RestFormat, N, Inputs) ->
fread_collect(More, [C|Stack], Rest, RestFormat, N, Inputs);
fread_collect([], Stack, Rest, RestFormat, N, Inputs) ->
## -55,8 +55,8 ##
eof ->
fread(RestFormat,eof,N,Inputs,eof);
_ ->
- %% Don't forget to count the newline.
- {more,{More,RestFormat,N+1,Inputs}}
+ %% Don't forget to strip and count the newline.
+ {more,{tl(More),RestFormat,N+1,Inputs}}
end;
Other -> %An error has occurred
{done,Other,More}
Now to submit my patch to erlang-patches, and reap the resulting fame and glory...
Besides the fact that it seems to be a bug in one of the erlang libs I think you could (very) easily circumvent the problem.
Given the fact your file is line-oriented I think best practice is that you process it line-by-line as well.
Consider the following construction. It works nicely on an unpatched erlang and because it uses lazy evaluation it can handle files of arbitrary length without having to read all of it into memory first. The module contains an example of a function to apply to each line - turning a line of text-representations of integers into a list of integers.
-module(liner).
-author("Harro Verkouter").
-export([liner/2, integerize/0, lazyfile/1]).
% Applies a function to all lines of the file
% before reducing (foldl).
liner(File, Fun) ->
lists:foldl(fun(X, Acc) -> Acc++Fun(X) end, [], lazyfile(File)).
% Reads the lines of a file in a lazy fashion
lazyfile(File) ->
{ok, Fd} = file:open(File, [read]),
lazylines(Fd).
% Actually, this one does the lazy read ;)
lazylines(Fd) ->
case io:get_line(Fd, "") of
eof -> file:close(Fd), [];
{error, Reason} ->
file:close(Fd), exit(Reason);
L ->
[L|lazylines(Fd)]
end.
% Take a line of space separated integers (string) and transform
% them into a list of integers
integerize() ->
fun(X) ->
lists:map(fun(Y) -> list_to_integer(Y) end,
string:tokens(X, " \n")) end.
Example usage:
Eshell V5.6.5 (abort with ^G)
1> c(liner).
{ok,liner}
2> liner:liner("triangle.txt", liner:integerize()).
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
And as a bonus, you can easily fold over the lines of any (lineoriented) file w/o running out of memory :)
6> lists:foldl( fun(X, Acc) ->
6> io:format("~.2w: ~s", [Acc,X]), Acc+1
6> end,
6> 1,
6> liner:lazyfile("triangle.txt")).
1: 59
2: 73 41
3: 52 40 09
4: 26 53 06 34
5: 10 51 87 86 81
6: 61 95 66 57 25 68
7: 90 81 80 38 92 67 73
8: 30 28 51 76 81 18 75 44
Cheers,
h.
I noticed that there are multiple instances where two numbers are merged, and it appears to be at the line boundaries on every line starting at the fourth line and beyond.
I found that if you add a whitespace character to the beginning of every line starting at the fifth, that is:
59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...
The numbers get parsed properly:
39> euler67:solve().
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
It also works if you add the whitespace to the beginning of the first four lines, as well.
It's more of a workaround than an actual solution, but it works. I'd like to figure out how to set up the format string for io:fread such that we wouldn't have to do this.
UPDATE
Here's a workaround that won't force you to change the file. This assumes that all digits are two characters (< 100):
read_file(File, Data) ->
case io:fread(File, "", "~d") of
{ok, [N] } ->
if
N > 100 ->
First = N div 100,
Second = N - (First * 100),
read_file(File, [First , Second | Data]);
true ->
read_file(File, [N | Data])
end;
eof ->
lists:reverse(Data)
end.
Basically, the code catches any of the numbers which are the concatenation of two across a newline and splits them into two.
Again, it's a kludge that implies a possible bug in io:fread, but that should do it.
UPDATE AGAIN The above will only work for two-digit inputs, but since the example packs all digits (even those < 10) into a two-digit format, that will work for this example.

Resources