Examples of Deflate Compression - huffman-code

I am interested in learning about the deflate compression algorithm, particularly how it is represented in a data stream, and feel that I would greatly benefit from some extra examples (e.g. the compression of a short string of text, or the decompression of a compressed chunk).
I am continuing to study some resources I have found (ref1, ref2, ref3), but these do not have many examples of how the actual compression looks as a data stream.
If I could get a few examples of how some strings look before and after being compressed, and an explanation of the relationship between them, that would be fantastic.
Also, if there are other resources that I should be looking at, please add those.

You can compress example data with gzip or zlib and use infgen to disassemble and examine the resulting compressed data. infgen also has an option to see the detail in the dynamic headers.

+1 for infgen, but here's a slightly more detailed answer.
You can look at the before and after using gzip and any hex viewer. For example, xxd is included on most Linux distros. I've included both the raw hex output (not that interesting until you understand the format) and infgen's output.
hello hello hello hello (triggers static Huffman coding, which infgen calls "fixed", like most short strings).
~ $ echo -n "hello hello hello hello" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 cb48 cdc9 c957 ...........H...W
00000010: c840 2701 e351 3d8d 1700 0000 .@'..Q=.....
~ $ echo -n "hello hello hello hello" | gzip | ./infgen/a.out -i
! infgen 2.4 output
!
gzip
!
last
fixed
literal 'hello h
match 16 6
end
!
crc
length
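To relate the hex dump to infgen's listing, it can help to split the gzip member into its three parts. The following is a minimal Python sketch, assuming the bytes are copied correctly from the xxd output above and that the FLG byte is 0 (so the header is exactly 10 bytes with no optional fields). The variable names are my own.

```python
import struct
import zlib

# Bytes copied from the xxd dump of the "hello hello hello hello" example.
gz = bytes.fromhex(
    "1f8b0800000000000003"   # 10-byte gzip header
    "cb48cdc9c957c8402701"   # raw DEFLATE data (what infgen disassembles)
    "e3513d8d17000000"       # trailer: CRC-32 and ISIZE, both little-endian
)

# Header fields: magic, compression method, flags, mtime, extra flags, OS.
magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", gz[:10])

# Slicing at fixed offsets is only valid because flags == 0 here.
payload = gz[10:-8]
crc, isize = struct.unpack("<II", gz[-8:])

# wbits=-15 asks zlib for raw DEFLATE with no gzip or zlib wrapper.
text = zlib.decompress(payload, wbits=-15)

print(hex(magic), method, flags, os_id)
print(payload.hex())
print(text, isize, crc == zlib.crc32(text))
```

The 10 payload bytes are all there is to the "last / fixed / literal / match / end" listing; everything else in the dump is gzip framing.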
\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1 (triggers stored, i.e. uncompressed, mode)
~ $ echo -ne "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 010f 00f0 ffff ................
00000010: fefd fcfb faf9 f8f7 f6f5 f4f3 f2f1 c6d3 ................
00000020: 157e 0f00 0000 .~....
~ $ echo -ne "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8\xf7\xf6\xf5\xf4\xf3\xf2\xf1" | gzip | ./infgen/a.out -i
! infgen 2.4 output
!
gzip
!
last
stored
data 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241
end
!
crc
length
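The stored block is simple enough to pick apart directly. Here is a small Python sketch, assuming the DEFLATE bytes below are the ones between the 10-byte gzip header and the 8-byte trailer in the dump above:

```python
import struct

# DEFLATE bytes from the stored example (gzip header and trailer removed).
block = bytes.fromhex("010f00f0ff" "fffefdfcfbfaf9f8f7f6f5f4f3f2f1")

bfinal = block[0] & 1          # bit 0: set on the last block
btype = (block[0] >> 1) & 3    # bits 1-2: 0 means stored
# The remaining bits of the first byte are padding to the next byte boundary.

# LEN and NLEN are 16-bit little-endian; NLEN is the one's complement of LEN.
length, nlength = struct.unpack("<HH", block[1:5])
data = block[5:5 + length]

print(bfinal, btype, length, hex(nlength))  # 1 0 15 0xfff0
print(list(data))
```

The printed list matches infgen's "data 255 254 ... 241" line, and the first line corresponds to "last" and "stored".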
abaabbbabaababbaababaaaabaaabbbbbaa (triggers dynamic Huffman coding)
~ $ echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip | xxd
00000000: 1f8b 0800 0000 0000 0003 1dc6 4901 0000 ............I...
00000010: 1040 c0ac a37f 883d 3c20 2a97 9d37 5e1d .@.....=< *..7^.
00000020: 0c6e 2934 9423 0000 00 .n)4.#...
~ $ echo -n "abaabbbabaababbaababaaaabaaabbbbbaa" | gzip | ./infgen/a.out -i -d
! infgen 2.4 output
!
gzip
!
last
dynamic
count 260 7 18
code 1 4
code 2 1
code 4 4
code 16 4
code 17 4
code 18 2
zeros 97
lens 1 2
zeros 138
zeros 19
lens 4
repeat 3
lens 2
zeros 3
lens 2 2 2
! litlen 97 1
! litlen 98 2
! litlen 256 4
! litlen 257 4
! litlen 258 4
! litlen 259 4
! dist 0 2
! dist 4 2
! dist 5 2
! dist 6 2
literal 'abaabbba
match 4 7
match 3 9
match 5 6
literal 'aaa
match 5 5
literal 'b
match 4 1
literal 'aa
end
!
crc
length
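The "count" and "code" lines come straight from the start of the dynamic block. The following Python sketch reads just that part of the header, assuming the first nine DEFLATE bytes are copied correctly from the dump above. BitReader and ORDER are names I made up for illustration; the order of the code length symbols is the one given in RFC 1951.

```python
# First nine bytes of the DEFLATE data from the dynamic example.
data = bytes.fromhex("1dc6490100001040c0")

class BitReader:
    """Reads fields least-significant bit first, as DEFLATE headers require."""
    def __init__(self, buf):
        self.buf, self.pos = buf, 0

    def read(self, n):
        value = 0
        for i in range(n):
            byte = self.buf[self.pos >> 3]
            value |= ((byte >> (self.pos & 7)) & 1) << i
            self.pos += 1
        return value

# Order in which the code length code lengths are stored (RFC 1951, 3.2.7).
ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]

r = BitReader(data)
bfinal, btype = r.read(1), r.read(2)   # "last", "dynamic"
hlit = r.read(5) + 257                 # number of literal/length codes
hdist = r.read(5) + 1                  # number of distance codes
hclen = r.read(4) + 4                  # number of code length codes

# Each code length code length is a plain 3-bit number; zero means unused.
code_lengths = {}
for sym in ORDER[:hclen]:
    n = r.read(3)
    if n:
        code_lengths[sym] = n

print(bfinal, btype, hlit, hdist, hclen)  # 1 2 260 7 18
print(sorted(code_lengths.items()))
```

The sorted pairs line up with infgen's "code 1 4", "code 2 1", "code 4 4", "code 16 4", "code 17 4" and "code 18 2". Decoding the "lens", "zeros" and "repeat" lines that follow would need a Huffman table built from these lengths, which this sketch does not attempt.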
I found that infgen's output was still not enough detail to fully understand the format. I walk through decompressing all three examples bit by bit, by hand, in detail on my blog.
For concepts, in addition to RFC 1951 (DEFLATE), which is pretty good, I would recommend Feldspar's conceptual overview of Huffman codes and LZ77 in DEFLATE.
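To give a flavour of what decoding by hand involves, here is a minimal Python sketch of an inflater for a single fixed-Huffman block, enough to decode the first example. It is an illustration under stated assumptions, not a full implementation: it handles exactly one block, does no error checking, and the names (BitReader, fixed_symbol, inflate_fixed) are my own. The tables are the length and distance tables from RFC 1951.

```python
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0] * 8 + [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4 + [0]
DIST_BASE = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
             257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
             8193, 12289, 16385, 24577]
DIST_EXTRA = [0, 0, 0, 0] + [n for n in range(1, 14) for _ in range(2)]

class BitReader:
    def __init__(self, buf):
        self.buf, self.pos = buf, 0

    def bit(self):
        b = (self.buf[self.pos >> 3] >> (self.pos & 7)) & 1
        self.pos += 1
        return b

    def bits(self, n):
        """Header fields and extra bits: least-significant bit first."""
        return sum(self.bit() << i for i in range(n))

    def code(self, n, start=0):
        """Huffman codes: most-significant bit first."""
        for _ in range(n):
            start = (start << 1) | self.bit()
        return start

def fixed_symbol(r):
    """Decode one literal/length symbol using the fixed code of RFC 1951."""
    c = r.code(7)
    if c <= 0b0010111:               # 7-bit codes: symbols 256-279
        return 256 + c
    c = r.code(1, c)
    if c <= 0b10111111:              # 8-bit codes: literals 0-143
        return c - 0b00110000
    if c <= 0b11000111:              # 8-bit codes: symbols 280-287
        return 280 + c - 0b11000000
    c = r.code(1, c)                 # 9-bit codes: literals 144-255
    return 144 + c - 0b110010000

def inflate_fixed(data):
    r = BitReader(data)
    bfinal, btype = r.bits(1), r.bits(2)
    if btype != 1:
        raise ValueError("this sketch only handles fixed Huffman blocks")
    out = bytearray()
    while True:
        sym = fixed_symbol(r)
        if sym < 256:                # literal byte
            out.append(sym)
        elif sym == 256:             # end of block
            return bytes(out)
        else:                        # length/distance pair
            i = sym - 257
            length = LENGTH_BASE[i] + r.bits(LENGTH_EXTRA[i])
            d = r.code(5)            # fixed distance codes are 5 bits each
            dist = DIST_BASE[d] + r.bits(DIST_EXTRA[d])
            for _ in range(length):  # byte by byte, so overlapping copies work
                out.append(out[-dist])

print(inflate_fixed(bytes.fromhex("cb48cdc9c957c8402701")))
# b'hello hello hello hello'
```

The input is the 10 DEFLATE bytes from the first example. The decoder sees seven literals ('hello h'), then length symbol 267 with one extra bit (length 16) and distance code 4 with one extra bit (distance 6), then the end-of-block symbol, which is the same sequence infgen reports.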

Related

How to filter the number of results [migrated]

This question was migrated from Stack Overflow because it can be answered on Super User.
Migrated 19 days ago.
I use grep -E '^[ 0-9]{6}$' to grab lines of exactly 6 characters (digits or spaces) in files.
It returns:
71 051
17 293
017299
862610
Is it possible to extract only the first two occurrences?
If possible, formatted like this for the example above: "71051-17293"?
Two options to make grep return at most two lines:
$ grep -Em2 '^[ 0-9]{6}$'
71 051
17 293
$ grep -E '^[ 0-9]{6}$' | head -n2
71 051
17 293
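Neither command produces the "71051-17293" form asked for. As one possible way to get it, here is a short Python sketch that applies the same pattern, keeps the first two matches, drops the spaces and joins the results with a dash. The function name and sample lines are made up for illustration.

```python
import re
from itertools import islice

# Same character class and length as the grep pattern in the question.
PATTERN = re.compile(r"[ 0-9]{6}")

def first_matches(lines, limit=2):
    """Return up to `limit` matching lines, spaces removed, joined by '-'."""
    stripped = (line.rstrip("\n") for line in lines)
    hits = (line for line in stripped if PATTERN.fullmatch(line))
    return "-".join(line.replace(" ", "") for line in islice(hits, limit))

sample = ["71 051\n", "header\n", "17 293\n", "017299\n", "862610\n"]
print(first_matches(sample))  # 71051-17293
```

Because islice stops after two hits, the rest of the input is never scanned, similar in spirit to grep -m2.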

Convert hex dump to binary file using xxd

I have a hex file from a memory dump, created by reading a memory area 32768 bytes in size.
... ... ....
80001174: aaaa aaaa aaaa aaaa aaaa aaaa 27bd ff80 | ............'...
80001184: afa1 0004 afa2 0008 afa3 000c afa4 0010 | ................
80001194: afa5 0014 afa6 0018 afa7 001c afa8 0020 | ...............
800011a4: afa9 0024 afaa 0028 afab 002c afac 0030 | ...$...(...,...0
800011b4: afad 0034 afae 0038 afaf 003c afb0 0040 | ...4...8...<...@
800011c4: afb1 0044 afb2 0048 afb3 004c afb4 0050 | ...D...H...L...P
800011d4: afb5 0054 afb6 0058 afb7 005c afb8 0060 | ...T...X...\...`
800011e4: afb9 0064 afbc 0070 afbe 0078 afbf 007c | ...d...p...x...|
800011f4: 3c1a 8045 275a 1024 0340 f809 0000 0000 | <..E'Z.$.@......
80001204: 8fa1 0004 8fa2 0008 8fa3 000c 8fa4 0010 | ................
I tried to convert the hex dump to a binary file using the command
xxd -r -p test.log file.bin
but the resulting binary file is 40968 bytes, which exceeds the expected size of 32768 bytes. What could be wrong with the conversion?
It seems that your hex file is not a plain hex file, as it also lists the offsets.
xxd -r -p does the correct conversion only on a plain hex file without any offsets in it. The excess is about what the offsets alone would add if their digits were read as data: 8 hex digits per 16-byte row is 4 extra bytes per row, or 8192 extra bytes over the 2048 rows of a 32768-byte dump, against the 8200 extra bytes you see.
If you can generate the hex file again, please do it in plain format; otherwise, you can open the file in Notepad++ and delete the entire address column.
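If regenerating or hand-editing the file is not practical, the offset and ASCII columns can also be stripped programmatically. The following is a rough Python sketch, assuming every data row looks like the excerpt in the question (offset, colon, groups of four hex digits, then a "|" before the ASCII column) and that rows are contiguous and in order. The function name is made up.

```python
import re

# offset ':' then one or more 4-digit hex groups, then the '|' separator.
ROW = re.compile(r"^[0-9a-fA-F]+:\s+((?:[0-9a-fA-F]{4}\s+)+)\|")

def dump_to_bytes(lines):
    """Keep only the hex column of each data row and convert it to bytes."""
    out = bytearray()
    for line in lines:
        m = ROW.match(line)
        if not m:
            continue  # skip "..." and anything else that is not a data row
        out += bytes.fromhex(m.group(1))  # fromhex ignores the spaces
    return bytes(out)

sample = [
    "80001174: aaaa aaaa aaaa aaaa aaaa aaaa 27bd ff80 | ............'...",
    "... ... ....",
    "80001184: afa1 0004 afa2 0008 afa3 000c afa4 0010 | ................",
]
data = dump_to_bytes(sample)
print(len(data), data[12:18].hex())  # 32 27bdff80afa1
```

For the real file, passing the open test.log file object to dump_to_bytes and writing the result to file.bin in binary mode should give 16 bytes per row, with the offsets ignored rather than interpreted.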

Informix - Locked DB due to lock created by cancelled session?

I attempted to run a script to generate a table in my Informix database, but the script was missing a newline at EOF, so I think Informix had problems reading it and the script hung, doing nothing. I had to kill the script and add the newline to the file, so now the script runs fine, except that it does not create the table because of a lock created when I killed the script abruptly.
I am new to this, so sorry for the dumb question. The IBM page does not have a clear and simple explanation of how to clean this up.
So, my question is: how do I release the locks so I can continue working on my script?
admin_proyecto@li1106-217 # onstat -k
IBM Informix Dynamic Server Version 12.10.FC9DE -- On-Line (CKPT REQ) -- Up 9 ds
Blocked:CKPT
Locks
address wtlist owner lklist type tbz
44199028 0 44ca6830 0 HDR+S
44199138 0 44cac0a0 0 HDR+S
441991c0 0 44cac0a0 4419b6f0 HDR+IX
44199358 0 44ca44d0 0 S
441993e0 0 44ca44d0 44199358 HDR+S
4419ac50 0 44cac0a0 441991c0 HDR+X
4419aef8 0 44ca44d0 441993e0 HDR+IX
4419b2b0 0 44ca79e0 0 S
4419b3c0 0 44ca82b8 0 S
4419b6f0 0 44cac0a0 44199138 HDR+X
4419b998 0 44ca8b90 0 S
4419bdd8 0 44ca44d0 4419aef8 HDR+X
12 active, 20000 total, 16384 hash buckets, 0 lock table overflows
On my "toy" systems I usually point LTAPEDEV to a directory:
LTAPEDEV /usr/informix/dumps/motor_003/backups
Then, when Informix blocks because all of its logical logs are full, I manually run ontape -a to back up the used logical logs to files and free them for reuse.
For example, here I have an Informix instance blocked due to no more logical logs available:
$ onstat -l
IBM Informix Dynamic Server Version 12.10.FC8DE -- On-Line (CKPT REQ) -- Up 00:18:58 -- 213588 Kbytes
Blocked:CKPT
Physical Logging
Buffer bufused bufsize numpages numwrits pages/io
P-1 0 64 1043 21 49.67
phybegin physize phypos phyused %used
2:53 51147 28085 240 0.47
Logical Logging
Buffer bufused bufsize numrecs numpages numwrits recs/pages pages/io
L-1 13 64 191473 12472 6933 15.4 1.8
Subsystem numrecs Log Space used
OLDRSAM 191470 15247376
HA 3 132
Buffer Waiting
Buffer ioproc flags
L-1 0 0x21 0
address number flags uniqid begin size used %used
44d75f88 1 U------ 47 3:15053 5000 5 0.10
44b6df68 2 U---C-L 48 3:20053 5000 4986 99.72
44c28f38 3 U------ 41 3:25053 5000 5000 100.00
44c28fa0 4 U------ 42 3:53 5000 2843 56.86
44d59850 5 U------ 43 3:5053 5000 5 0.10
44d598b8 6 U------ 44 3:10053 5000 5 0.10
44d59920 7 U------ 45 3:30053 5000 5 0.10
44d59988 8 U------ 46 3:35053 5000 5 0.10
8 active, 8 total
On the online log I have:
$ onstat -m
04/23/18 18:20:42 Logical Log Files are Full -- Backup is Needed
So I manually issue the command:
$ ontape -a
Performing automatic backup of logical logs.
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000041
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000042
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000043
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000044
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000045
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000046
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000047
File created: /usr/informix/dumps/motor_003/backups/informix003.ifx.marqueslocal_3_Log0000000048
Do you want to back up the current logical log? (y/n) n
Program over.
If I check the status of the logical logs again:
$ onstat -l
IBM Informix Dynamic Server Version 12.10.FC8DE -- On-Line -- Up 00:23:42 -- 213588 Kbytes
Physical Logging
Buffer bufused bufsize numpages numwrits pages/io
P-2 33 64 1090 24 45.42
phybegin physize phypos phyused %used
2:53 51147 28091 36 0.07
Logical Logging
Buffer bufused bufsize numrecs numpages numwrits recs/pages pages/io
L-1 0 64 291335 15878 7023 18.3 2.3
Subsystem numrecs Log Space used
OLDRSAM 291331 22046456
HA 4 176
address number flags uniqid begin size used %used
44d75f88 1 U-B---- 47 3:15053 5000 5 0.10
44b6df68 2 U-B---- 48 3:20053 5000 5000 100.00
44c28f38 3 U---C-L 49 3:25053 5000 3392 67.84
44c28fa0 4 U-B---- 42 3:53 5000 2843 56.86
44d59850 5 U-B---- 43 3:5053 5000 5 0.10
44d598b8 6 U-B---- 44 3:10053 5000 5 0.10
44d59920 7 U-B---- 45 3:30053 5000 5 0.10
44d59988 8 U-B---- 46 3:35053 5000 5 0.10
8 active, 8 total
The logical logs are now marked as backed up (the B in the flags column) and can be reused, and the Informix instance is no longer blocked (the Blocked:CKPT line is gone).
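To spot this condition without reading the flags by eye, the logical log rows of onstat -l can be scanned for logs that are used but not yet backed up. This is a small Python sketch under my own assumptions: it is given only the log rows (eight whitespace-separated fields, as in the listings above), and it reads U as used, B as backed up and C as the current log. The function name is hypothetical.

```python
def logs_needing_backup(rows):
    """Return the numbers of logs that are used, not backed up, not current."""
    pending = []
    for row in rows:
        fields = row.split()
        # Expect: address number flags uniqid begin size used %used
        if len(fields) != 8 or not fields[1].isdigit():
            continue
        number, flags = int(fields[1]), fields[2]
        if "U" in flags and "B" not in flags and "C" not in flags:
            pending.append(number)
    return pending

before = [
    "44d75f88 1 U------ 47 3:15053 5000 5 0.10",
    "44b6df68 2 U---C-L 48 3:20053 5000 4986 99.72",
    "44c28f38 3 U------ 41 3:25053 5000 5000 100.00",
]
after = [
    "44d75f88 1 U-B---- 47 3:15053 5000 5 0.10",
    "44b6df68 2 U-B---- 48 3:20053 5000 5000 100.00",
    "44c28f38 3 U---C-L 49 3:25053 5000 3392 67.84",
]
print(logs_needing_backup(before))  # [1, 3]
print(logs_needing_backup(after))   # []
```

An empty result after ontape -a corresponds to the second listing, where every log except the current one carries the B flag.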

How to merge two files without matching rows

I want to combine two files that are very different without any row matching:
File 1 (1000+ rows):
M03558 203 5 23464 CTTGTA
M03559 205 3 1096 CTTGTQ
M03560 209 12 1956 CTTGTW
M03561 304 5 2347 CTTGTK
...
File 2 (a table of 3 rows):
A 12 34 78 0.3
B 13 35 79 0.3
C 14 36 80 0.5
Desired outcome:
M03558 203 5 23464 CTTGTA A 12 34 78 0.3
M03559 205 3 1096 CTTGTQ B 13 35 79 0.3
M03560 209 12 1956 CTTGTW C 14 36 80 0.5
M03561 304 5 2347 CTTGTK
...
Is there any way to achieve that in bash, perl, python or R, please?
On Linux you can use the paste command:
paste -d " " file1 file2 > outfile
If, instead of a space separating the two merged records, you want a tab character (which is also paste's default delimiter), then:
paste -d "\t" file1 file2 > outfile
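Since the question also mentions Python, here is a rough equivalent as a sketch. It mirrors what paste does when one file is shorter: the missing side is treated as empty, so the delimiter is still written. The function name is my own.

```python
from itertools import zip_longest

def paste(lines_a, lines_b, delimiter=" "):
    """Join corresponding lines; the shorter input is padded with ''."""
    return [
        f"{a}{delimiter}{b}"
        for a, b in zip_longest(lines_a, lines_b, fillvalue="")
    ]

file1 = [
    "M03558 203 5 23464 CTTGTA",
    "M03559 205 3 1096 CTTGTQ",
    "M03560 209 12 1956 CTTGTW",
    "M03561 304 5 2347 CTTGTK",
]
file2 = ["A 12 34 78 0.3", "B 13 35 79 0.3", "C 14 36 80 0.5"]

for line in paste(file1, file2):
    print(line)
```

Rows past the end of the shorter file end with the delimiter, as they do with paste; calling rstrip() on each line removes it if that matters.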

Too many files when starting a new JHipster project

I am following the tutorial from Matt at:
http://jhipster.github.io/video-tutorial/
When I run cloc . I see far more files than I would expect:
$ cloc .
66717 text files.
20401 unique files.
24466 files ignored.
http://cloc.sourceforge.net v 1.60 T=128.46 s (115.7 files/s, 15523.0 lines/s)
--------------------------------------------------------------------------------
Language files blank comment code
--------------------------------------------------------------------------------
Javascript 13322 222956 357190 1266221
HTML 676 6984 1047 44885
CSS 76 1883 932 22029
Java 262 3548 1854 15641
XML 53 3383 1395 11307
LESS 79 1388 1546 7269
C/C++ Header 18 1032 300 5109
YAML 190 221 346 3466
CoffeeScript 47 783 699 2467
make 58 417 523 1271
Bourne Shell 31 234 202 1097
Maven 1 12 34 824
Perl 2 87 170 584
DTD 1 179 177 514
SASS 5 42 25 273
C++ 4 43 26 260
IDL 6 38 0 167
Bourne Again Shell 3 28 36 140
D 6 0 0 118
Scala 1 16 7 118
JavaServer Faces 3 3 0 109
Smarty 6 17 30 91
DOS Batch 1 24 2 64
Python 1 7 7 36
XSLT 1 5 0 32
C# 2 3 1 27
ASP.Net 2 5 0 23
C 1 7 4 23
OCaml 1 5 15 6
Lisp 1 0 0 6
PowerShell 1 2 2 4
Lua 1 0 0 2
--------------------------------------------------------------------------------
SUM: 14862 243352 366570 1384183
--------------------------------------------------------------------------------
Why is that?
In total it is 610 MB!
It seems there are a lot of node modules:
$ du -h -d1
584M ./node_modules
24K ./gulp
26M ./src
64K ./.mvn
610M .
Is this correct?
And what do I need to add to source control?
Thanks
This is normal. Most of those files are NPM dependencies, as you mentioned.
The generated .gitignore should already be configured properly and will ignore node_modules.

Resources