How to detect recurrent patterns in a text file - parsing

To simplify a function with many terms, I would like a program that searches a file for recurring patterns and arranges them in a ranking list. I imagine this is an elaborate process, but I'm sure people have built something like this before.
An example of a text:
sin(t1)*cos(t1)*t1+t1-sin(t1)*sin(t1-pi)
This should give me an output like the following (minimum pattern length: 2 characters):
6x: "t1"
4x: "(t1"
3x: "n(t1"
3x: "sin"
3x: "sin("
2x: "sin(t1)"
etc.
Does this problem have a name (which I don't know)? Is there a known algorithm that could solve the problem for me?

I have written a small program with Qt that fulfills the task. The approach is brute force: try every substring. For my real problem it will probably take a few days to run, because the text files are very large.
If I have the following text as input ("text.txt"):
sin(t1)*cos(t1)*t1+t1-sin(t1)*sin(t1-pi)
With the parameters length 2-5 and minimum occurrence 3, I get the following result:
t1 6
(t 4
(t1 4
si 3
in 3
n( 3
1) 3
)* 3
sin 3
in( 3
n(t 3
t1) 3
1)* 3
sin( 3
in(t 3
n(t1 3
(t1) 3
t1)* 3
sin(t 3
in(t1 3
(t1)* 3
Code:
#include <QCoreApplication>
#include <QDebug>
#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <QVector>

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    const int minchar = 2;      // smallest pattern length
    const int maxchar = 5;      // largest pattern length
    const int min_occur = 3;    // minimum number of occurrences to report

    // read the whole file into one string (newlines are dropped)
    QString wholefile;
    QFile file("text.txt");
    if (!file.open(QIODevice::ReadOnly)) {
        qDebug() << "error reading file";
        return 1;
    }
    QTextStream in(&file);
    while (!in.atEnd())
        wholefile.append(in.readLine());
    file.close();

    // collect every distinct substring of length minchar..maxchar
    QStringList allpatterns;
    for (int i = minchar; i <= maxchar; i++) {
        for (int pos = 0; pos + i <= wholefile.length(); pos++) {
            const QString pattern = wholefile.mid(pos, i);
            if (!allpatterns.contains(pattern))
                allpatterns.append(pattern);
        }
    }

    // count each pattern in the whole file and keep the ones above the threshold
    QVector<int> strcnt(allpatterns.length());
    int maximum_cnt = 0;
    QStringList interestingpatterns;
    int nr_of_patterns = 0;
    for (int i = 0; i < allpatterns.length(); i++) {
        const QString str = allpatterns.at(i);
        strcnt[nr_of_patterns] = wholefile.count(str);
        if (strcnt[nr_of_patterns] >= min_occur) {
            if (strcnt[nr_of_patterns] > maximum_cnt)
                maximum_cnt = strcnt[nr_of_patterns];
            interestingpatterns.append(str);
            nr_of_patterns++;
        }
    }

    /* display result, most frequent patterns first */
    QFile file2("out.txt");
    if (!file2.open(QIODevice::WriteOnly | QIODevice::Text))
        qDebug() << "error writing file";
    QTextStream out(&file2);
    for (int current_max = maximum_cnt; current_max >= min_occur; current_max--) {
        for (int i = 0; i < interestingpatterns.length(); i++) {
            if (strcnt[i] == current_max) {
                const QString str = interestingpatterns.at(i);
                qDebug() << str << strcnt[i];
                out << str << " " << strcnt[i] << "\n";
            }
        }
    }
    file2.close();
    return 0;   // no event loop is needed, so a.exec() is not called
}
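For comparison, here is a sketch (my own, not the program above) of the same counting done with a QHash, so every substring is counted in a single pass per length instead of doing a linear QStringList::contains() lookup plus a full QString::count() rescan for each pattern; file name and parameters are kept the same:
#include <QDebug>
#include <QFile>
#include <QHash>
#include <QString>
#include <QTextStream>

int main()
{
    const int minchar = 2, maxchar = 5, min_occur = 3;

    QFile file("text.txt");
    if (!file.open(QIODevice::ReadOnly))
        return 1;
    const QString wholefile = QTextStream(&file).readAll();

    // one hash lookup per window instead of a linear search plus a full rescan
    QHash<QString, int> counts;
    for (int len = minchar; len <= maxchar; ++len)
        for (int pos = 0; pos + len <= wholefile.length(); ++pos)
            ++counts[wholefile.mid(pos, len)];

    for (auto it = counts.constBegin(); it != counts.constEnd(); ++it)
        if (it.value() >= min_occur)
            qDebug() << it.key() << it.value();
    return 0;
}
Sorting the (count, pattern) pairs afterwards gives the ranking list. For very large files, the textbook structure for enumerating repeated substrings is a suffix array or suffix tree, which may be the name you are looking for.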

Related

Is a Fortran subroutine with a dummy argument specified size array thread safe?

The following code compiles in gfortran, with a warning about large_array being larger than the limit for a stack variable, stating that the array will be moved to static memory and is therefore not threadsafe:
subroutine stack_size_warning
  implicit none
  real :: large_array(65536)
  print *, large_array
end subroutine stack_size_warning
This subroutine however compiles with no errors or warnings, and I can call it with n values larger than 65536 without issue, at least in simple cases.
subroutine no_warning(n)
  implicit none
  integer :: n
  real :: automatic_array(n)
  print *, automatic_array
end subroutine no_warning
Is this second array threadsafe? Where is the memory allocated for automatic_array in this second subroutine? Is the memory allocated and deallocated on every call making it slower than if it was on the stack or if a preallocated array was passed in as a dummy argument?
I wrote the following program to test three scenarios: a subroutine with a small array on the stack, another with a large array over the stack limit and thus stored in static memory, and a third where a dummy argument specifies the size of an array defined inside the routine.
Here is that program:
program main
  implicit none
  call small
  call large
  call automatic(65536)
end program main

subroutine small
  implicit none
  real :: small_array(10)
  small_array = 1.
  print *, small_array
end subroutine small

subroutine large
  implicit none
  real :: large_array(65536)
  large_array = 1.
  print *, large_array
end subroutine large

subroutine automatic(n)
  implicit none
  integer :: n
  real :: automatic_array(n)
  automatic_array = 1.
  print *, automatic_array
end subroutine automatic
Following steve's recommendation, I compiled with a tree dump as follows:
gfortran array_dim_test.f90 -o array_dim_test -fdump-tree-original
The full dump is at the end, but to summarize what I see: the automatic subroutine has a try/finally block. In the try block, a call to malloc allocates the memory, and in the finally block the memory is freed. So this memory is apparently allocated and deallocated on the heap with every call to the subroutine. That makes intuitive sense, since the array lives only in the subroutine and its size is only known from the call, but it is interesting to see the explicit calls in the tree dump. This would appear to be thread-safe then, but perhaps also not the most efficient thing to do if the routine is called many times with the same array size, since memory is allocated and deallocated on every call.
Here is the tree dump:
__attribute__((fn spec (". w ")))
void automatic (integer(kind=4) & restrict n)
{
void * restrict D.3964;
integer(kind=8) ubound.0;
integer(kind=8) size.1;
real(kind=4)[0:D.3961] * restrict automatic_array;
integer(kind=8) D.3961;
bitsizetype D.3962;
sizetype D.3963;
try
{
ubound.0 = (integer(kind=8)) *n;
size.1 = NON_LVALUE_EXPR <ubound.0>;
size.1 = MAX_EXPR <size.1, 0>;
D.3961 = size.1 + -1;
D.3962 = (bitsizetype) (sizetype) NON_LVALUE_EXPR <size.1> * 32;
D.3963 = (sizetype) NON_LVALUE_EXPR <size.1> * 4;
D.3964 = (void * restrict) __builtin_malloc (MAX_EXPR <(unsigned long) (size.1 * 4), 1>);
automatic_array = (real(kind=4)[0:D.3961] * restrict) D.3964;
{
integer(kind=8) D.3940;
D.3940 = ubound.0;
{
integer(kind=8) S.2;
S.2 = 1;
while (1)
{
if (S.2 > D.3940) goto L.1;
(*automatic_array)[S.2 + -1] = 1.0e+0;
S.2 = S.2 + 1;
}
L.1:;
}
}
{
struct __st_parameter_dt dt_parm.3;
dt_parm.3.common.filename = &"array_dim_test.f90"[1]{lb: 1 sz: 1};
dt_parm.3.common.line = 27;
dt_parm.3.common.flags = 128;
dt_parm.3.common.unit = 6;
_gfortran_st_write (&dt_parm.3);
{
integer(kind=8) D.3944;
struct array01_real(kind=4) parm.4;
D.3944 = ubound.0;
parm.4.span = 4;
parm.4.dtype = {.elem_len=4, .rank=1, .type=3};
parm.4.dim[0].lbound = 1;
parm.4.dim[0].ubound = D.3944;
parm.4.dim[0].stride = 1;
parm.4.data = (void *) &(*automatic_array)[0];
parm.4.offset = -1;
_gfortran_transfer_array_write (&dt_parm.3, &parm.4, 4, 0);
}
_gfortran_st_write_done (&dt_parm.3);
}
}
finally
{
__builtin_free ((void *) automatic_array);
}
}
__attribute__((fn spec (". ")))
void large ()
{
static real(kind=4) large_array[65536];
{
integer(kind=8) S.5;
S.5 = 1;
while (1)
{
if (S.5 > 65536) goto L.2;
large_array[S.5 + -1] = 1.0e+0;
S.5 = S.5 + 1;
}
L.2:;
}
{
struct __st_parameter_dt dt_parm.6;
dt_parm.6.common.filename = &"array_dim_test.f90"[1]{lb: 1 sz: 1};
dt_parm.6.common.line = 19;
dt_parm.6.common.flags = 128;
dt_parm.6.common.unit = 6;
_gfortran_st_write (&dt_parm.6);
{
struct array01_real(kind=4) parm.7;
parm.7.span = 4;
parm.7.dtype = {.elem_len=4, .rank=1, .type=3};
parm.7.dim[0].lbound = 1;
parm.7.dim[0].ubound = 65536;
parm.7.dim[0].stride = 1;
parm.7.data = (void *) &large_array[0];
parm.7.offset = -1;
_gfortran_transfer_array_write (&dt_parm.6, &parm.7, 4, 0);
}
_gfortran_st_write_done (&dt_parm.6);
}
}
__attribute__((fn spec (". ")))
void small ()
{
real(kind=4) small_array[10];
{
integer(kind=8) S.8;
S.8 = 1;
while (1)
{
if (S.8 > 10) goto L.3;
small_array[S.8 + -1] = 1.0e+0;
S.8 = S.8 + 1;
}
L.3:;
}
{
struct __st_parameter_dt dt_parm.9;
dt_parm.9.common.filename = &"array_dim_test.f90"[1]{lb: 1 sz: 1};
dt_parm.9.common.line = 12;
dt_parm.9.common.flags = 128;
dt_parm.9.common.unit = 6;
_gfortran_st_write (&dt_parm.9);
{
struct array01_real(kind=4) parm.10;
parm.10.span = 4;
parm.10.dtype = {.elem_len=4, .rank=1, .type=3};
parm.10.dim[0].lbound = 1;
parm.10.dim[0].ubound = 10;
parm.10.dim[0].stride = 1;
parm.10.data = (void *) &small_array[0];
parm.10.offset = -1;
_gfortran_transfer_array_write (&dt_parm.9, &parm.10, 4, 0);
}
_gfortran_st_write_done (&dt_parm.9);
}
}
__attribute__((fn spec (". ")))
void MAIN__ ()
{
small ();
large ();
{
static integer(kind=4) C.3993 = 65536;
automatic (&C.3993);
}
}
__attribute__((externally_visible))
integer(kind=4) main (integer(kind=4) argc, character(kind=1) * * argv)
{
static integer(kind=4) options.11[7] = {2116, 4095, 0, 1, 1, 0, 31};
_gfortran_set_args (argc, argv);
_gfortran_set_options (7, &options.11[0]);
MAIN__ ();
return 0;
}

not correct num histogram

I'm trying to make a toString method that prints out a histogram showing how often each character of the alphabet is used in a string. The most frequent character has to be 60 #s long, with the rest of the characters then scaled to match.
My issue is with the equation that scales the rest of the letters to the correct length for the histogram. My current equation is (myArray[i]/max) * 60, but I'm getting really weird results.
If I put in "hello world" to be analyzed, l would be the most commonly occurring letter, seen 3 times. So l should have 60 #s for the histogram, h should have 20, o should have 40, etc. Instead I'm getting results like d : 10
e : 10
h : 10
l : 360
o : 20
r : 10
w : 10
Sorry for how sloppy this is right now, I'm just trying to figure out what's going on.
public class LetterCounter {
    private static int[] alphabetArray;
    private static String input;

    /**
     * Constructor for objects of class LetterCounter
     */
    public LetterCounter()
    {
        alphabetArray = new int[26];
    }

    public void countLetters(String input) {
        this.input = input;
        this.input.toLowerCase();
        //String s= input;
        //s.toLowerCase();
        for ( int i = 0; i < input.length(); i++ ) {
            char ch= input.charAt(i);
            if (ch >= 97 && ch <= 122){
                alphabetArray[ch-'a']++;
            }
        }
    }

    public void getTotalCount() {
        for (int i = 0; i < alphabetArray.length; i++) {
            if(alphabetArray[i]>=0){
                char ch = (char) (i+97);
                System.out.println(ch +" : "+alphabetArray[i]);
            }
        }
    }

    public void reset() {
        for (int i =0; i<alphabetArray.length; i++) {
            if(alphabetArray[i]>=0){
                alphabetArray[i]=0;
                char ch = (char) (i+97);
                System.out.println(ch +" : "+alphabetArray[i]);
            }
        }
    }

    public String toString() {
        String s = "";
        int max = alphabetArray[0];
        int markCounter = 0;
        for(int i =0; i<alphabetArray.length; i++) {
            //finds the largest number of occurences for any letter in the string
            if(alphabetArray[i] > max) {
                max = alphabetArray[i];
            }
        }
        for(int i =0; i<alphabetArray.length; i++) {
            //trying to scale the rest of the characters down here
            if(alphabetArray[i] > 0) {
                markCounter = (alphabetArray[i] / max) * 60;
                char ch = (char) (i+97);
                System.out.println(ch +" : "+alphabetArray[i] + markCounter);
            }
        }
        for (int i = 0; i < alphabetArray.length; i++) {
            //prints the whole alphabet, total number of occurences for all chars
            if(alphabetArray[i]>=0){
                char ch = (char) (i+97);
                System.out.println(ch +" : "+alphabetArray[i]);
            }
        }
        return s;
    }
}
There are many, many problems with your code, but let's go through them one by one.
First of all, your print statement is simply misleading. Change it to
System.out.println(ch +" : "+alphabetArray[i] + " " + markCounter);
and you will see
d : 1 0
e : 1 0
h : 1 0
l : 3 60
o : 2 0
r : 1 0
w : 1 0
As you can see, the counters are correct (1, 1, 1, 3, 2, 1, 1). But your scaling doesn't work:
1 / 3 --> 0 ... and 0 * 60 is still 0
3 / 3 --> 1 ... and 1 * 60 is 60
And of course, when you don't print a space, the 1 and 0 run together as 10, and the 3 and 60 as 360.
Thus, to get correct scaling, just change it to:
markCounter = alphabetArray[i] * 60 / max;
Other things worth mentioning:
You are overriding toString(), so you should put @Override in front of that method.
toLowerCase() returns a new string in lower case; calling it without assigning the result back to your string simply throws away the lower-casing.
toString() shouldn't print to the console. The whole idea is that you put all the information into the string that you return. In other words, in the end you do something like System.out.println(someLetterCounter.toString()).
Your code is extremely low-level. You don't have to iterate arrays using an index; you can write for (int letter : alphabetArray) instead.
You might want to read about Map. If you used a Map<Character, Integer>, where the key represents the different characters and the value represents a counter for each character, you could throw out most of your code and come up with a solution that requires only a few lines of code!
(And seriously: because of all these issues, debugging your code was really much harder than it needed to be.)
countLetters also has some issues. You cannot convert a String to lower case by just calling
this.input.toLowerCase();
because String is immutable in Java. You have to assign it, like:
this.input = input.toLowerCase();
Another problem is that you are using the input variable from the parameter instead of this.input, which holds the lower-case string. You can make countLetters work this way:
public void countLetters(String input) {
    this.input = input.toLowerCase();
    for ( int i = 0; i < this.input.length(); i++ ) {
        char ch= this.input.charAt(i);
        if (ch >= 97 && ch <= 122) {
            alphabetArray[ch-'a']++;
        }
    }
}

Integer overflow in Fibonacci number

I was solving this CodeChef problem on Fibonacci numbers. It says the number can have up to 1000 digits, so why doesn't it cause integer overflow in the tester's solution when the input is scanned and stored in an unsigned long long int? I can't understand how the solution works. Below are the problem and the tester's solution.
The Head Chef has been playing with Fibonacci numbers for long. He has learnt several tricks related to Fibonacci numbers. Now he wants to test his chefs in the skills.
A fibonacci number is defined by the recurrence:
f(n) = f(n-1) + f(n-2) for n > 2
and f(1) = 0
and f(2) = 1.
Given a number A, determine if it is a fibonacci number.
Input
The first line of the input contains an integer T denoting the number of test cases. The description of T test cases follows.
The only line of each test case contains a single integer A denoting the number to be checked.
Output
For each test case, output a single line containing "YES" if the given number is a fibonacci number, otherwise output a single line containing "NO".
Constraints
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
Example
Input:
3
3
4
5
Output:
YES
NO
YES
Tester's solution:
#include <iostream>
#include <cstdio>
#include <algorithm>
#include <set>
#include <cstring>
using namespace std;

int const mx = 6666;
set <unsigned long long> f;
unsigned long long fib[mx + 10];
char s[mx + 1];

int main(){
    // freopen("input.txt", "r", stdin);
    // freopen("output.txt", "w", stdout);
    fib[0] = 0;
    fib[1] = 1;
    f.insert(1);
    f.insert(0);
    int i;
    for (i = 2; i <= mx; i++){
        fib[i] = fib[i - 1] + fib[i - 2];
        f.insert(fib[i]);
    }
    int tc;
    cin >> tc;
    while (tc--){
        unsigned long long n = 0, ten = 10;
        cin >> s;
        int len = strlen(s);
        for (i = 0; i < len; i++){
            char q = s[i];
            unsigned long long a = q - '0';
            n = n * ten + a;
        }
        if (f.find(n) == f.end()) printf("NO\n");
        else printf("YES\n");
    }
    return 0;
}
From cplusplus.com you will see that:
ULLONG_MAX: Maximum value for an object of type unsigned long long int is 18446744073709551615 (2^64 - 1) or greater.
The actual value depends on the particular system and library implementation, but shall reflect the limits of these types in
the target platform.
The above information is just to let you know that it is a BIG number. But the reason there is no visible overflow problem is not the limit I mentioned.
Most probably, the judge's input file does not contain any input that would cause an overflow.
And it is still possible to construct such an input while satisfying the constraints:
1 ≤ T ≤ 1000
1 ≤ number of digits in A ≤ 1000
The sum of number of digits in A in all test cases <= 10000.
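To handle the full constraint range without relying on the judge's data, one could keep the Fibonacci numbers as decimal strings and compare the input against them directly. The following is only a sketch of that idea (my own assumption about a safe approach, not the tester's code):
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
using namespace std;

// add two non-negative decimal numbers given as strings
string addDec(const string& a, const string& b) {
    string r;
    int carry = 0, i = a.size() - 1, j = b.size() - 1;
    while (i >= 0 || j >= 0 || carry) {
        int s = carry;
        if (i >= 0) s += a[i--] - '0';
        if (j >= 0) s += b[j--] - '0';
        r.push_back('0' + s % 10);
        carry = s / 10;
    }
    reverse(r.begin(), r.end());
    return r;
}

int main() {
    // precompute all Fibonacci numbers with up to 1000 digits, as strings
    set<string> fib = {"0", "1"};
    string a = "0", b = "1";
    while (b.size() <= 1000) {
        string c = addDec(a, b);
        fib.insert(c);
        a = b;
        b = c;
    }
    int t;
    cin >> t;
    while (t--) {
        string n;
        cin >> n;
        cout << (fib.count(n) ? "YES" : "NO") << "\n";
    }
    return 0;
}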

Memory bandwidth measurement with memset, memcpy

I am trying to understand the performance of memory operations with memcpy/memset. I measure the time needed for a loop containing memset and memcpy. See the attached code (it is in C++11, but in plain C the picture is the same). It is understandable that memset is faster than memcpy. But that is more or less the only thing I understand... The biggest question is:
Why is there such a strong dependence on the number of loop iterations?
The application is single-threaded! And the CPU is an AMD FX(tm)-4100 Quad-Core Processor.
And here are some numbers:
memset: iters=1 0.0625 GB in 0.1269 s : 0.4927 GB per second
memcpy: iters=1 0.0625 GB in 0.1287 s : 0.4857 GB per second
memset: iters=4 0.25 GB in 0.151 s : 1.656 GB per second
memcpy: iters=4 0.25 GB in 0.1678 s : 1.49 GB per second
memset: iters=16 1 GB in 0.2406 s : 4.156 GB per second
memcpy: iters=16 1 GB in 0.3184 s : 3.14 GB per second
memset: iters=128 8 GB in 1.074 s : 7.447 GB per second
memcpy: iters=128 8 GB in 1.737 s : 4.606 GB per second
The code:
/*
  -- Compilation and run:
  g++ -O3 -std=c++11 -o mem-speed mem-speed.cc && ./mem-speed
  -- Output example:
*/
#include <cstdio>
#include <cstdint>     // uint64_t
#include <cstdlib>     // atoi
#include <chrono>
#include <memory>
#include <string.h>
using namespace std;

const uint64_t _KB=1024, _MB=_KB*_KB, _GB=_KB*_KB*_KB;

std::pair<double,char> measure_memory_speed(uint64_t buf_size,int n_iters)
{
    // without returning something from the buffers, the compiler will optimize memset() and memcpy() calls
    char retval=0;
    unique_ptr<char[]> buf1(new char[buf_size]), buf2(new char[buf_size]);
    auto time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memset(buf1.get(),123,buf_size);
        retval += buf1[0];
    }
    auto t1 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);
    time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memcpy(buf2.get(),buf1.get(),buf_size);
        retval += buf2[0];
    }
    auto t2 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);
    printf("memset: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
           n_iters,n_iters*buf_size/double(_GB),(double)t1.count()/1e9, n_iters*buf_size/double(_GB) / (t1.count()/1e9) );
    printf("memcpy: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
           n_iters,n_iters*buf_size/double(_GB),(double)t2.count()/1e9, n_iters*buf_size/double(_GB) / (t2.count()/1e9) );
    printf("\n");
    double avr = n_iters*buf_size/_GB * (1e9/t1.count()+1e9/t2.count()) / 2;
    retval += buf1[0]+buf2[0];
    return std::pair<double,char>(avr,retval);
}

int main(int argc,const char **argv)
{
    uint64_t n=64;
    if( argc==2 )
        n = atoi(argv[1]);
    for( int i=0; i<=10; i++ )
        measure_memory_speed(n*_MB,1<<i);
    return 0;
}
Surely this is just down to the instruction cache loading, so the code runs faster after the first iteration, and the data cache speeding up access for the memset/memcpy on further iterations. The cache memory is inside the processor, so it doesn't have to fetch the data from, or write it to, the slower external memory as often, and therefore runs faster.
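One way to check that explanation (a small sketch of my own, not part of the original benchmark) is to touch the buffer once before timing, so first-iteration effects are excluded from the measured passes; if the caching explanation holds, the per-pass bandwidth should come out roughly flat:
#include <chrono>
#include <cstdio>
#include <cstring>
#include <memory>
using namespace std;

int main() {
    const size_t buf_size = 64 * 1024 * 1024;   // 64 MB, the post's default block size
    unique_ptr<char[]> buf(new char[buf_size]);
    memset(buf.get(), 0, buf_size);             // warm-up pass: touch every page once, untimed
    volatile char sink = 0;                     // keep the stores observable
    for (int pass = 0; pass < 4; pass++) {
        auto t0 = chrono::high_resolution_clock::now();
        memset(buf.get(), 123, buf_size);
        sink += buf[0];
        auto ns = chrono::duration_cast<chrono::nanoseconds>(
                      chrono::high_resolution_clock::now() - t0).count();
        // bytes per nanosecond is numerically GB (1e9 bytes) per second
        printf("pass %d: %.3f GB/s\n", pass, double(buf_size) / double(ns));
    }
    return (int)sink;
}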

Huffman's encoding and decoding

I have to build a compressor based on the Huffman algorithm.
So far, I managed to create the tree with the frequencies of each character and generate a representation with a smaller number of bits for each character.
It is something like this, for the phrase "good this sugarplum":
'o' 000, ' ' 001, 't' 0100, 'r' 0101, 'p' 0110, 'm' 0111, 'l' 1000, 'i' 1001, 'h' 1010, 'd' 1011, 'a' 1100, 'u' 1101, 'g' 1110, 's' 1111
The problem I'm having now is finding a way to save the tree in the compressed file, so I can rebuild it and then decompress the file.
Any suggestions?
I did some research but found it difficult to understand, so if you can explain in detail, I would appreciate it.
The code I used to read the frequencies from the file is:
int main (int argc, char *argv[])
{
    int i;
    TipoSentinela *sentinela;
    TipoLista *no = NULL;
    Arv *arvore, *arvore2, *arvore3;
    int *repete = (int *) calloc (256, sizeof(int));

    if (argc == 2)
    {
        in = load_base(argv[1]);
        le_dados_arquivo (repete);   // read the frequencies from the file
        sentinela = cria_lista ();   // create a marker for the tree node list
        for (i = 0; i < 256; i++)
        {
            if (repete[i] > 0 && i != 0)
            {
                arvore = arv_cria (Cria_info (i, repete[i])); // create a tree node with the character i and its frequency in the file
                no = inicia_lista (arvore, no, sentinela);    // create the list of tree nodes
            }
        }
        Ordena (sentinela);   // sort the tree node list by the frequencies
        for (Seta_primeiro(sentinela); Tamanho_lista(sentinela) != 1; Move_marcador(sentinela))
        {
            Seta_primeiro(sentinela);        // put the marker on the first element of the list
            no = Retorna_marcador(sentinela);
            arvore2 = Retorna_arvore (no);   // return the tree represented by the list marker
            Move_marcador(sentinela);        // move the marker to the next element
            arvore3 = Retorna_arvore (Retorna_marcador (sentinela)); // return the tree represented by the list marker
            arvore = Cria_pai (arvore2, arvore3);     // create a tree node that will contain both arvore2 and arvore3
            Insere_arvoreFinal (sentinela, arvore);   // insert the node at the end of the list
            Remove_arvore (sentinela);                // remove the node arvore2 from the list
            Remove_arvore (sentinela);                // remove the node arvore3 from the list
            Ordena (sentinela);                       // sort the list again
        }
        out = load_out(argv[1]);   // open the output file
        Codificacao (arvore);      // generate the code for each node of the tree
        rewind(in);
        int c;                     // int, not char, so the EOF check is reliable
        while ((c = fgetc(in)) != EOF)
        {
            arvore2 = Procura_info (arvore, c);           // search for the character c in the tree
            if (arvore2 != NULL)
                imprimebit(Retorna_codigo(arvore2), out); // write the code to the file
        }
        fclose(in);
        fclose(out);
        free(repete);
        arvore = arv_libera (arvore);
        Libera_Lista(sentinela);
    }
    return 0;
}

// bit_counter and cur_byte are global variables
void write_bit (unsigned char bit, FILE *f)
{
    static int k = 0;   // declare the type; implicit int is not valid in modern C
    if (k != 0)
    {
        if (++bit_counter == 8)
        {
            fwrite(&cur_byte, 1, 1, f);
            bit_counter = 0;
            cur_byte = 0;
        }
    }
    k = 1;
    cur_byte <<= 1;
    cur_byte |= ('0' != bit);
}

// aux is the code of a character in the tree
void imprimebit(char *aux, FILE *f)
{
    int i;
    if (aux == NULL)
        return;
    for (i = 0; i < strlen(aux); i++)
    {
        write_bit(aux[i], f);   // write the bits of the code to the file
    }
}
With this, I can write the code of all characters in the output file, but I can't see a way to store the tree too.
You don't need to send the tree. Just send the lengths. Then establish a consistent algorithm to convert the lengths to codes on both ends. The resulting consistent code is called a "canonical" Huffman code. You sort the codes by length, and within each length, sort by the symbol. Then assign codes starting at 0. So you would end up with (_ means space):
_ 000
o 001
a 0100
d 0101
g 0110
h 0111
i 1000
l 1001
m 1010
p 1011
r 1100
s 1101
t 1110
u 1111
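As an illustration of that assignment rule, here is a small sketch of my own (not code from the question or the answer): sort by (length, symbol) and shift the running code left whenever the code length grows. With the lengths from the table above, it reproduces exactly those codes.
#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

struct Entry { unsigned char sym; int len; unsigned code; };

void assign_canonical(vector<Entry>& v) {
    // sort by code length, then by symbol value
    sort(v.begin(), v.end(), [](const Entry& a, const Entry& b) {
        return a.len != b.len ? a.len < b.len : a.sym < b.sym;
    });
    unsigned code = 0;
    int prev_len = v.empty() ? 0 : v[0].len;
    for (auto& e : v) {
        code <<= (e.len - prev_len);   // pad with zeros when the length grows
        e.code = code++;
        prev_len = e.len;
    }
}

int main() {
    // code lengths for "good this sugarplum", taken from the table above
    vector<Entry> v = {
        {' ',3,0},{'o',3,0},{'a',4,0},{'d',4,0},{'g',4,0},{'h',4,0},{'i',4,0},
        {'l',4,0},{'m',4,0},{'p',4,0},{'r',4,0},{'s',4,0},{'t',4,0},{'u',4,0}
    };
    assign_canonical(v);
    for (const auto& e : v) {
        // print the code as e.len bits, most significant bit first
        printf("'%c' ", e.sym);
        for (int b = e.len - 1; b >= 0; b--) putchar('0' + ((e.code >> b) & 1));
        putchar('\n');
    }
    return 0;
}
The decoder only needs the lengths (one per symbol, 0 for unused symbols) to run the same procedure and rebuild identical codes, which is why the tree itself never has to be stored.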
I did find a way to store the code of each character.
For example:
I write the tree starting at the root and going down, left first, then right.
So, if my tree was something like
         0
       /   \
      0     1
     / \   / \
   'a' 'b' 'c' 'd'
The header of my file would be something like this:
001[8 bits from 'a']1[8 bits from b]01[8 bits from c]1[8 bits from d]
With this, I would be able to rebuild my tree.
My problem now is reading the header of the file bit by bit, to know in which direction I have to create a new node.
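For that last step, here is a sketch of the idea with hypothetical helper names (an assumption based on the header format described above and the MSB-first packing used by write_bit): read one bit; a 1 means a leaf follows, so read 8 more bits for the character; a 0 means an internal node, so rebuild the left subtree recursively and then the right one.
#include <cstdio>

struct Node {
    unsigned char ch;
    Node *left;
    Node *right;
};

// read the next single bit from f, buffering one byte at a time
// (MSB first, matching the order in which write_bit packs bits into cur_byte)
int read_bit(FILE *f) {
    static int buf = 0, count = 0;
    if (count == 0) {
        buf = fgetc(f);
        if (buf == EOF) return -1;   // no EOF handling by callers in this sketch
        count = 8;
    }
    count--;
    return (buf >> count) & 1;
}

// rebuild the tree from the pre-order header: 1 = leaf (followed by 8 bits of
// the character), 0 = internal node (left subtree first, then right subtree)
Node *read_tree(FILE *f) {
    Node *n = new Node{0, nullptr, nullptr};
    if (read_bit(f) == 1) {
        for (int i = 0; i < 8; i++)
            n->ch = (unsigned char)((n->ch << 1) | read_bit(f));
    } else {
        n->left = read_tree(f);
        n->right = read_tree(f);
    }
    return n;
}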
