Unable to do parameter sharing in Torch between [sub]networks - lua

I am trying to share the parameters of the encoder/decoder sub-networks of one architecture with the encoder/decoder of a different architecture. This is necessary for my problem, since at test time it takes a lot of computation (and time) to do a forward pass on the original architecture and then extract the decoder results. However, what I noticed is that although I explicitly asked for parameter sharing when doing clone(), the parameters are not shared, and each architecture keeps its own parameters while training.
I show the difference between the results of the two architectures via some print() statements, by forward-propagating some test vectors through the encoders and decoders of both architectures (you can also compare their weights).
So I wonder, can anyone help me find out what I'm doing wrong when sharing the parameters?
Below I post a simplified version of my code:
require 'nn'
require 'nngraph'
require 'cutorch'
require 'cunn'
require 'optim'
input = nn.Identity()()
encoder = nn.Sequential():add(nn.Linear(100, 20)):add(nn.ReLU(true)):add(nn.Linear(20, 10))
decoder = nn.Sequential():add(nn.Linear(10, 20)):add(nn.ReLU(true)):add(nn.Linear(20, 100))
code = encoder(input)
reconstruction = decoder(code)
outsideCode = nn.Identity()()
decoderCloned = decoder:clone('weight', 'bias', 'gradWeight', 'gradBias')
outsideReconstruction = decoderCloned(nn.JoinTable(1)({code, outsideCode}))
dumbNet = nn.Sequential():add(nn.Linear(100, 10))
codeRecon = dumbNet(outsideReconstruction)
input2 = nn.Identity()()
encoderTestTime = encoder:clone('weight', 'bias', 'gradWeight', 'gradBias')
decoderTestTime = decoder:clone('weight', 'bias', 'gradWeight', 'gradBias')
codeTest = encoderTestTime(input2)
reconTest = decoderTestTime(codeTest)
gMod = nn.gModule({input, outsideCode}, {reconstruction, codeRecon})
gModTest = nn.gModule({input2}, {reconTest})
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
-- Okay, the module has been created. Now it's time to do some other stuff
params, gParams = gMod:getParameters()
numParams = params:nElement()
memReqForParams = numParams * 5 * 4 / 1024 / 1024 -- Convert to MBs
-- If enough memory on GPU, move stuff to the GPU
if memReqForParams <= 1000 then
gMod = gMod:cuda()
gModTest = gModTest:cuda()
criterion1 = criterion1:cuda()
criterion2 = criterion2:cuda()
params, gParams = gMod:getParameters()
end
-- Data
Data = torch.rand(200, 100):cuda()
Data[Data:gt(0.5)] = 1
Data[Data:lt(0.5)] = 0
fakeCodes = torch.rand(400, 10):cuda()
config = {learningRate = 0.001}
state = {}
-- Start training
print ("\nEncoders before training: \n\tgMod's Encoder: " .. gMod:get(2):forward(torch.ones(1, 100):cuda()):sum() .. "\n\tgModTest's Encoder: " .. gModTest:get(2):forward(torch.ones(1, 100):cuda()):sum())
print ("\nDecoders before training: \n\tgMod's Decoder: " .. gMod:get(3):forward(torch.ones(1, 10):cuda()):sum() .. "\n\tgModTest's Decoder: " .. gModTest:get(3):forward(torch.ones(1, 10):cuda()):sum())
gMod:training()
for i=1, Data:size(1) do
local opfunc = function(x)
if x ~= params then
params:copy(x)
end
gMod:zeroGradParameters()
recon, outsideRecon = unpack(gMod:forward({Data[{{i}}], fakeCodes[{{i}}]}))
err = criterion1:forward(recon, Data[{{i}}])
df_dw = criterion1:backward(recon, Data[{{i}}])
errFake = criterion2:forward(outsideRecon, fakeCodes[{{i*2-1, i * 2}}])
df_dwFake = criterion2:backward(outsideRecon, fakeCodes[{{i*2-1, i * 2}}])
errorGrads = {df_dw, df_dwFake}
gMod:backward({Data[{{i}}], fakeCodes[{{i*2-1, i * 2}}]}, errorGrads)
return err, gParams
end
x, reconError = optim.adam(opfunc, params, config, state)
end
print ("\n\nEncoders after training: \n\tgMod's Encoder: " .. gMod:get(2):forward(torch.ones(1, 100):cuda()):sum() .. "\n\tgModTest's Encoder: " .. gModTest:get(2):forward(torch.ones(1, 100):cuda()):sum())
print ("\nDecoders after training: \n\tgMod's Decoder: " .. gMod:get(3):forward(torch.ones(1, 10):cuda()):sum() .. "\n\tgModTest's Decoder: " .. gModTest:get(3):forward(torch.ones(1, 10):cuda()):sum())

I got the solution to this problem with the help of fmassa on a GitHub issue I had opened for it here. One can use nn.Container to resolve the parameter-sharing issue as follows:
container = nn.Container()
container:add(gMod)
container:add(gModTest)
params, gradParams = container:getParameters()
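Note that getParameters() should be called only once, after everything (the graphs, the clones, the :cuda() conversion) is set up, because it re-flattens the parameter storages. With the container, the single params/gradParams pair covers both graphs, and the optimizer has to step on that vector; a minimal sketch of how the training code above changes (opfunc should now return gradParams rather than the old gParams):
-- the optimizer steps on the container's shared flat vector inside the loop:
x, reconError = optim.adam(opfunc, params, config, state)
-- sanity check: both graphs should now report identical sums
print(gMod:get(3):forward(torch.ones(1, 10):cuda()):sum())
print(gModTest:get(3):forward(torch.ones(1, 10):cuda()):sum())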

Related

Lua - Dynamically generated look up table & values

I'm trying to create a table that is dynamically populated with the on/off status of a set of switches I have. The following is where I've got stuck, as it always returns nothing/nil:
-- retrieved http.request status of 4 binary switches
local P61v = 1
local P62v = 0
local P63v = 1
local P64v = 0
-- the following table should allow us to look up the status of all associated lights by their plug names P61, P62, P63 etc.
local LookupTable = {
P61 = P61v,
P62 = P62v,
P63 = P63v,
P64 = P64v
}
local x = LookupTable[P62]
print(x)
In LookupTable[P62] the expression P62 evaluates to nil, resulting in LookupTable[nil], which resolves to nil.
What you are looking for is either
LookupTable.P62
-- or --
LookupTable['P62']
which are equivalent ways of expressing the same thing.
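For example, a small self-contained illustration of the difference:
local P62v = 0
local LookupTable = { P62 = P62v }
print(LookupTable[P62])    -- P62 is an undefined variable (nil), so this prints nil
print(LookupTable["P62"])  -- prints 0
print(LookupTable.P62)     -- also prints 0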
I created a script that defines both values, and this is what I ended up with:
local PowerResponse = "P61=0,P62=1,P63=0,P64=0"
local P61, _, P61v, P62, _, P62v, P63, _, P63v, P64, _, P64v = PowerResponse:match("(%w*)(=)(%d),(%w*)(=)(%d),(%w*)(=)(%d),(%w*)(=)(%d)")
local PowerStatusTable = {
P61 = P61v,
P62 = P62v,
P63 = P63v,
P64 = P64v
}
--local x = PowerStatusTable[P62]
for k, v in pairs(PowerStatusTable) do
luup.variable_set("urn:upnp-net:serviceId:IPPower1", k, v, deviceID)
end
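If the number of switches or their names can change, a gmatch loop is a bit more flexible than one long match pattern; note that the captured values are strings, so tonumber is needed if you want numeric status values. A sketch along those lines:
local PowerResponse = "P61=0,P62=1,P63=0,P64=0"
local PowerStatusTable = {}
-- capture each "name=digit" pair and store it with a numeric value
for name, value in PowerResponse:gmatch("(%w+)=(%d)") do
  PowerStatusTable[name] = tonumber(value)
end
for k, v in pairs(PowerStatusTable) do
  print(k, v)  -- e.g. P62  1
end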

How to use prepare_analogy_questions and check_analogy_accuracy functions in text2vec package?

The following code:
library(text2vec)
text8_file = "text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip ("text8.zip", files = "text8")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
RcppParallel::setThreadOptions(numThreads = 4)
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10, learning_rate = .25)
word_vectors_main = glove_model$fit_transform(tcm, n_iter = 20)
word_vectors_context = glove_model$components
word_vectors = word_vectors_main + t(word_vectors_context)
causes an error when calling prepare_analogy_questions:
qlst <- prepare_analogy_questions("questions-words.txt", rownames(word_vectors))
> Error in (function (fmt, ...) :
invalid format '%d'; use format %s for character objects
The file questions-words.txt is from the word2vec sources: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt
This was a small bug in informational message formatting (after the introduction of futile.logger). I just fixed it and pushed it to GitHub.
You can install the updated version of the package with devtools::install_github("dselivanov/text2vec").

What to do when Seq2Seq network repeats words over and over in output?

So, I've been working on a project for a while. We have very little data, and I know the results would become much better if we were able to put together a much larger dataset. That aside, my issue at the moment is that when I feed in an input sentence, my outputs currently look like this:
contactid contactid contactid contactid
A single word is focused on and repeated over and over again. What can I do to overcome this hurdle?
Things I've tried:
Double-checked that I was appending start/stop tokens and made sure the tokens were properly placed at the top of their vocab files; I am sharing vocab.
I found something saying it could be due to poor word embeddings. To that end I checked with TensorBoard, and sure enough PCA showed a very dense cluster of points. Seeing that, I grabbed Facebook's public pre-trained word vectors and loaded them in as the embedding. I trained again, and this time the TensorBoard PCA showed a much better picture.
Switched my training scheduler from basic to SampledScheduling, to occasionally replace a training output with the ground truth.
Switched my decoder to use the beam search decoder; I figured this might give more robust responses if the word choices were close together in the intermediate feature space.
My perplexity is definitely decreasing steadily.
Here is my dataset preparation code:
class ModelInputs(object):
"""Factory to construct various input hooks and functions depending on mode """
def __init__(
self, vocab_files, batch_size,
share_vocab=True, src_eos_id=1, tgt_eos_id=2
):
self.batch_size = batch_size
self.vocab_files = vocab_files
self.share_vocab = share_vocab
self.src_eos_id = src_eos_id
self.tgt_eos_id = tgt_eos_id
def get_inputs(self, file_path, num_infer=None, mode=tf.estimator.ModeKeys.TRAIN):
self.mode = mode
if self.mode == tf.estimator.ModeKeys.TRAIN:
return self._training_input_hook(file_path)
if self.mode == tf.estimator.ModeKeys.EVAL:
return self._validation_input_hook(file_path)
if self.mode == tf.estimator.ModeKeys.PREDICT:
if num_infer is None:
raise ValueError('If performing inference must supply number of predictions to be made.')
return self._infer_input_hook(file_path, num_infer)
def _prepare_data(self, dataset, out=False):
prep_set = dataset.map(lambda string: tf.string_split([string]).values)
prep_set = prep_set.map(lambda words: (words, tf.size(words)))
if out == True:
return prep_set.map(lambda words, size: (self.vocab_tables[1].lookup(words), size))
return prep_set.map(lambda words, size: (self.vocab_tables[0].lookup(words), size))
def _batch_data(self, dataset, src_eos_id, tgt_eos_id):
batched_set = dataset.padded_batch(
self.batch_size,
padded_shapes=((tf.TensorShape([None]), tf.TensorShape([])), (tf.TensorShape([None]), tf.TensorShape([]))),
padding_values=((src_eos_id, 0), (tgt_eos_id, 0))
)
return batched_set
def _batch_infer_data(self, dataset, src_eos_id):
batched_set = dataset.padded_batch(
self.batch_size,
padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])),
padding_values=(src_eos_id, 0)
)
return batched_set
def _create_vocab_tables(self, vocab_files, share_vocab=False):
if vocab_files[1] is None and share_vocab == False:
raise ValueError('If share_vocab is set to false must provide target vocab. (src_vocab_file, \
target_vocab_file)')
src_vocab_table = lookup_ops.index_table_from_file(
vocab_files[0],
default_value=UNK_ID
)
if share_vocab:
tgt_vocab_table = src_vocab_table
else:
tgt_vocab_table = lookup_ops.index_table_from_file(
vocab_files[1],
default_value=UNK_ID
)
return src_vocab_table, tgt_vocab_table
def _prepare_iterator_hook(self, hook, scope_name, iterator, file_path, name_placeholder):
if self.mode == tf.estimator.ModeKeys.TRAIN or self.mode == tf.estimator.ModeKeys.EVAL:
feed_dict = {
name_placeholder[0]: file_path[0],
name_placeholder[1]: file_path[1]
}
else:
feed_dict = {name_placeholder: file_path}
with tf.name_scope(scope_name):
hook.iterator_initializer_func = \
lambda sess: sess.run(
iterator.initializer,
feed_dict=feed_dict,
)
def _set_up_train_or_eval(self, scope_name, file_path):
hook = IteratorInitializerHook()
def input_fn():
with tf.name_scope(scope_name):
with tf.name_scope('sentence_markers'):
src_eos_id = tf.constant(self.src_eos_id, dtype=tf.int64)
tgt_eos_id = tf.constant(self.tgt_eos_id, dtype=tf.int64)
self.vocab_tables = self._create_vocab_tables(self.vocab_files, self.share_vocab)
in_file = tf.placeholder(tf.string, shape=())
in_dataset = self._prepare_data(tf.contrib.data.TextLineDataset(in_file).repeat(None))
out_file = tf.placeholder(tf.string, shape=())
out_dataset = self._prepare_data(tf.contrib.data.TextLineDataset(out_file).repeat(None))
dataset = tf.contrib.data.Dataset.zip((in_dataset, out_dataset))
dataset = self._batch_data(dataset, src_eos_id, tgt_eos_id)
iterator = dataset.make_initializable_iterator()
next_example, next_label = iterator.get_next()
self._prepare_iterator_hook(hook, scope_name, iterator, file_path, (in_file, out_file))
return next_example, next_label
return (input_fn, hook)
def _training_input_hook(self, file_path):
input_fn, hook = self._set_up_train_or_eval('train_inputs', file_path)
return (input_fn, hook)
def _validation_input_hook(self, file_path):
input_fn, hook = self._set_up_train_or_eval('eval_inputs', file_path)
return (input_fn, hook)
def _infer_input_hook(self, file_path, num_infer):
hook = IteratorInitializerHook()
def input_fn():
with tf.name_scope('infer_inputs'):
with tf.name_scope('sentence_markers'):
src_eos_id = tf.constant(self.src_eos_id, dtype=tf.int64)
self.vocab_tables = self._create_vocab_tables(self.vocab_files, self.share_vocab)
infer_file = tf.placeholder(tf.string, shape=())
dataset = tf.contrib.data.TextLineDataset(infer_file)
dataset = self._prepare_data(dataset)
dataset = self._batch_infer_data(dataset, src_eos_id)
iterator = dataset.make_initializable_iterator()
next_example, seq_len = iterator.get_next()
self._prepare_iterator_hook(hook, 'infer_inputs', iterator, file_path, infer_file)
return ((next_example, seq_len), None)
return (input_fn, hook)
And here is my model:
class Seq2Seq():
def __init__(
self, batch_size, inputs,
outputs, inp_vocab_size, tgt_vocab_size,
embed_dim, mode, time_major=False,
enc_embedding=None, dec_embedding=None, average_across_batch=True,
average_across_timesteps=True, vocab_path=None, embedding_path='./data_files/wiki.simple.vec'
):
embed_np = self._get_embedding(embedding_path)
if not enc_embedding:
self.enc_embedding = tf.contrib.layers.embed_sequence(
inputs,
inp_vocab_size,
embed_dim,
trainable=True,
scope='embed',
initializer=tf.constant_initializer(value=embed_np, dtype=tf.float32)
)
else:
self.enc_embedding = enc_embedding
if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
if not dec_embedding:
embed_outputs = tf.contrib.layers.embed_sequence(
outputs,
tgt_vocab_size,
embed_dim,
trainable=True,
scope='embed',
reuse=True
)
with tf.variable_scope('embed', reuse=True):
dec_embedding = tf.get_variable('embeddings')
self.embed_outputs = embed_outputs
self.dec_embedding = dec_embedding
else:
self.dec_embedding = dec_embedding
else:
with tf.variable_scope('embed', reuse=True):
self.dec_embedding = tf.get_variable('embeddings')
if mode == tf.estimator.ModeKeys.PREDICT and vocab_path is None:
raise ValueError('If mode is predict, must supply vocab_path')
self.vocab_path = vocab_path
self.inp_vocab_size = inp_vocab_size
self.tgt_vocab_size = tgt_vocab_size
self.average_across_batch = average_across_batch
self.average_across_timesteps = average_across_timesteps
self.time_major = time_major
self.batch_size = batch_size
self.mode = mode
def _get_embedding(self, embedding_path):
model = KeyedVectors.load_word2vec_format(embedding_path)
vocab = model.vocab
vocab_len = len(vocab)
return np.array([model.word_vec(k) for k in vocab.keys()])
def _get_lstm(self, num_units):
return tf.nn.rnn_cell.BasicLSTMCell(num_units)
def encode(self, num_units, num_layers, seq_len, cell_fw=None, cell_bw=None):
if cell_fw and cell_bw:
fw_cell = cell_fw
bw_cell = cell_bw
else:
fw_cell = self._get_lstm(num_units)
bw_cell = self._get_lstm(num_units)
encoder_outputs, bi_encoder_state = tf.nn.bidirectional_dynamic_rnn(
fw_cell,
bw_cell,
self.enc_embedding,
sequence_length=seq_len,
time_major=self.time_major,
dtype=tf.float32
)
c_state = tf.concat([bi_encoder_state[0].c, bi_encoder_state[1].c], axis=1)
h_state = tf.concat([bi_encoder_state[0].h, bi_encoder_state[1].h], axis=1)
encoder_state = tf.contrib.rnn.LSTMStateTuple(c=c_state, h=h_state)
return tf.concat(encoder_outputs, -1), encoder_state
def _train_decoder(self, decoder_cell, out_seq_len, encoder_state, helper):
if not helper:
helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
self.embed_outputs,
out_seq_len,
self.dec_embedding,
0.3,
)
# helper = tf.contrib.seq2seq.TrainingHelper(
# self.dec_embedding,
# out_seq_len,
# )
projection_layer = layers_core.Dense(self.tgt_vocab_size, use_bias=False)
decoder = tf.contrib.seq2seq.BasicDecoder(
decoder_cell,
helper,
encoder_state,
output_layer=projection_layer
)
return decoder
def _predict_decoder(self, cell, encoder_state, beam_width, length_penalty_weight):
tiled_encoder_state = tf.contrib.seq2seq.tile_batch(
encoder_state, multiplier=beam_width
)
with tf.name_scope('sentence_markers'):
sos_id = tf.constant(1, dtype=tf.int32)
eos_id = tf.constant(2, dtype=tf.int32)
start_tokens = tf.fill([self.batch_size], sos_id)
end_token = eos_id
projection_layer = layers_core.Dense(self.tgt_vocab_size, use_bias=False)
emb = tf.squeeze(self.dec_embedding)
decoder = tf.contrib.seq2seq.BeamSearchDecoder(
cell=cell,
embedding=self.dec_embedding,
start_tokens=start_tokens,
end_token=end_token,
initial_state=tiled_encoder_state,
beam_width=beam_width,
output_layer=projection_layer,
length_penalty_weight=length_penalty_weight
)
return decoder
def decode(
self, num_units, out_seq_len,
encoder_state, cell=None, helper=None,
beam_width=None, length_penalty_weight=None
):
with tf.name_scope('Decode'):
if cell:
decoder_cell = cell
else:
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(2*num_units)
if self.mode != estimator.ModeKeys.PREDICT:
decoder = self._train_decoder(decoder_cell, out_seq_len, encoder_state, helper)
else:
decoder = self._predict_decoder(decoder_cell, encoder_state, beam_width, length_penalty_weight)
outputs = tf.contrib.seq2seq.dynamic_decode(
decoder,
maximum_iterations=20,
swap_memory=True,
)
outputs = outputs[0]
if self.mode != estimator.ModeKeys.PREDICT:
return outputs.rnn_output, outputs.sample_id
else:
return outputs.beam_search_decoder_output, outputs.predicted_ids
def prepare_predict(self, sample_id):
rev_table = lookup_ops.index_to_string_table_from_file(
self.vocab_path, default_value=UNK)
predictions = rev_table.lookup(tf.to_int64(sample_id))
return tf.estimator.EstimatorSpec(
predictions=predictions,
mode=tf.estimator.ModeKeys.PREDICT
)
def prepare_train_eval(
self, t_out,
out_seq_len, labels, lr,
train_op=None, loss=None
):
if not loss:
weights = tf.sequence_mask(
out_seq_len,
dtype=t_out.dtype
)
loss = tf.contrib.seq2seq.sequence_loss(
t_out,
labels,
weights,
average_across_batch=self.average_across_batch,
)
if not train_op:
train_op = tf.contrib.layers.optimize_loss(
loss,
tf.train.get_global_step(),
optimizer='SGD',
learning_rate=lr,
summaries=['loss', 'learning_rate']
)
return tf.estimator.EstimatorSpec(
mode=self.mode,
loss=loss,
train_op=train_op,
)
This type of repetition is called "text degeneration".
There is a great paper from 2019 that analyses this phenomenon: The Curious Case of Neural Text Degeneration by Ari Holtzman et al. from the Allen Institute for Artificial Intelligence.
The repetition may come from the type of text search (text sampling) on the decoder side. Many people implement this simply as the most probable next word proposed by the model (argmax over the softmax of the last layer) or by so-called beam search. In fact, beam search is the industry standard today.
This is an example of beam search output from the paper:
Continuation (BeamSearch, b=10):
"The unicorns were able to communicate with each other, they said unicorns. a statement that the unicorns. Professor of the Department of Los Angeles, the most important place the world to be recognition of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the world to be a of the…
As you can see, there is a great amount of repetition.
According to the paper, this curious case may be explained by the fact that each repeated sequence of words has a higher probability than the sequence without the next repetition.
The paper proposes some workarounds based on how the decoder samples the next word. It definitely requires more study, but this is the best explanation we have today.
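To make that concrete, here is a minimal, illustrative sketch of top-k sampling from a single next-word distribution (shown with plain torch tensors rather than your TensorFlow graph; probs and k are made-up values, and this is not the paper's exact nucleus-sampling procedure). Instead of always taking the argmax, you keep the k most probable words and sample among them:
require 'torch'
-- probs: next-word probabilities from the decoder's softmax (made-up values)
local probs = torch.Tensor({0.50, 0.20, 0.15, 0.10, 0.05})
local k = 3
local sorted, idx = torch.sort(probs, 1, true)  -- sort descending
local topk = sorted:narrow(1, 1, k)             -- keep the k most probable words
topk:div(topk:sum())                            -- renormalize
local pick = torch.multinomial(topk, 1)[1]      -- sample one of them
local wordIndex = idx[pick]                     -- map back to a vocabulary index
print(wordIndex)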
The other possibility is that your model still needs more training. In many cases I have faced similar behaviour when I had a big training set and the model still couldn't generalise well over the whole diversity of the data. To test this hypothesis, try training on a smaller dataset and see whether it generalises (produces meaningful results).
But even if your model generalises well enough, that doesn't mean you won't ever face the repetition pattern. Unless you change the sampling strategy of the decoder, it is a common scenario.
If you train on small data, try to decrease the number of parameters, e.g. the number of neurons in each layer.
In my experience, when the network outputs one word all the time, a significant decrease of the learning rate helps.

Unusual behavior of image saving and loading in torch7

I noticed an unusual behavior with torch7. I only know a little about torch7, so I don't know how this behavior can be explained or corrected.
I am using the CIFAR-10 dataset. I simply fetched the data for one image from CIFAR-10 and saved it as an image in my directory. When I loaded that saved image back, it was different.
Here is my code -
require 'image'
i1 = testData.data[2] --fetching data from CIFAR-10
image.save("1.png", i1) --saving the data as an image
i2 = image.load("1.png") --loading the saved image
if(i1 == i2) then --checking whether image1 (i1) and image2 (i2) are the same
print("same")
end
Is this behavior expected? I thought PNG was supposed to be lossless.
If so, how can this be corrected?
Code for loading the CIFAR-10 dataset:
-- load dataset
trainData = {
data = torch.Tensor(50000, 3072),
labels = torch.Tensor(50000),
size = function() return trsize end
}
for i = 0,4 do
local subset = torch.load('cifar-10-batches-t7/data_batch_' .. (i+1) .. '.t7', 'ascii')
trainData.data[{ {i*10000+1, (i+1)*10000} }] = subset.data:t()
trainData.labels[{ {i*10000+1, (i+1)*10000} }] = subset.labels
end
trainData.labels = trainData.labels + 1
local subset = torch.load('cifar-10-batches-t7/test_batch.t7', 'ascii')
testData = {
data = subset.data:t():double(),
labels = subset.labels[1]:double(),
size = function() return tesize end
}
testData.labels = testData.labels + 1
testData.data = testData.data:reshape(10000,3,32,32)
The == operator compares the pointers of the two tensors, not their contents:
a = torch.Tensor(3, 5):fill(1)
b = torch.Tensor(3, 5):fill(1)
print(a == b)
> false
print(a:eq(b):all())
> true
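As for why the loaded image differs at all: assuming the CIFAR-10 tensors here hold raw values in the 0-255 range while image.save/image.load work with values in [0, 1] (and a PNG round trip quantizes to 8 bits), a more meaningful check than == is to put both tensors on the same scale and compare element-wise with a small tolerance. A sketch under those assumptions:
require 'image'
local i1 = testData.data[2]:clone():div(255)   -- bring the raw 0-255 data into [0, 1]
image.save("1.png", i1)
local i2 = image.load("1.png"):typeAs(i1)
-- allow a tolerance of one 8-bit quantization step
local maxDiff = (i1 - i2):abs():max()
print(maxDiff, maxDiff <= 1/255)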

Decompressing LZW in Lua [duplicate]

Here is the pseudocode for Lempel-Ziv-Welch compression:
pattern = get input character
while ( not end-of-file ) {
K = get input character
if ( <<pattern, K>> is NOT in
the string table ){
output the code for pattern
add <<pattern, K>> to the string table
pattern = K
}
else { pattern = <<pattern, K>> }
}
output the code for pattern
output EOF_CODE
I am trying to code this in Lua, but it is not really working. Here is the code I modeled after an LZW function in Python, but I am getting an "attempt to call a string value" error on line 8.
function compress(uncompressed)
local dict_size = 256
local dictionary = {}
w = ""
result = {}
for c in uncompressed do
-- while c is in the function compress
local wc = w + c
if dictionary[wc] == true then
w = wc
else
dictionary[w] = ""
-- Add wc to the dictionary.
dictionary[wc] = dict_size
dict_size = dict_size + 1
w = c
end
-- Output the code for w.
if w then
dictionary[w] = ""
end
end
return dictionary
end
compressed = compress('TOBEORNOTTOBEORTOBEORNOT')
print (compressed)
I would really like some help either getting my code to run, or helping me code the LZW compression in Lua. Thank you so much!
Assuming uncompressed is a string, you'll need to use something like this to iterate over it:
for i = 1, #uncompressed do
local c = string.sub(uncompressed, i, i)
-- etc
end
There's another issue on line 10; .. is used for string concatenation in Lua, so this line should be local wc = w .. c.
You may also want to read this with regard to the performance of string concatenation. Long story short, it's often more efficient to keep each element in a table and return it with table.concat().
You should also take a look here to download the source for a high-performance LZW compression algorithm in Lua...
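Putting the two fixes together, here is a minimal corrected sketch of the compress function; it follows the pseudocode above, seeds the dictionary with all single-byte strings, and collects the output codes in a table (how you encode or emit those codes afterwards is up to you):
function compress(uncompressed)
  local dict_size = 256
  local dictionary = {}
  -- seed the dictionary with all single-byte strings
  for i = 0, 255 do
    dictionary[string.char(i)] = i
  end
  local w = ""
  local result = {}
  for i = 1, #uncompressed do
    local c = string.sub(uncompressed, i, i)
    local wc = w .. c
    if dictionary[wc] ~= nil then
      w = wc
    else
      table.insert(result, dictionary[w])  -- output the code for w
      dictionary[wc] = dict_size           -- add wc to the dictionary
      dict_size = dict_size + 1
      w = c
    end
  end
  if w ~= "" then
    table.insert(result, dictionary[w])    -- output the code for the final pattern
  end
  return result
end

compressed = compress('TOBEORNOTTOBEORTOBEORNOT')
print(table.concat(compressed, " "))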
