TFF: How to define the tff.simulation.ClientData.from_clients_and_fn function? - tensorflow-federated

In the federated learning context, one classmethod that should work is tff.simulation.ClientData.from_clients_and_fn. Here, if I pass a list of client_ids and a function which returns the appropriate dataset when given a client id, I will have a fully functional ClientData.
I think one approach for defining that function is to construct a Python dict which maps client IDs to tf.data.Dataset objects; I could then define a function which takes a client id, looks up the dataset in the dict, and returns it.
So I defined the function as below, but I think it is wrong. What do you think?
list = ["0", "1", "2"]
tab = {"0": ds, "1": ds, "2": ds}

def create_tf_dataset_for_client_fn(id):
    return ds

source = tff.simulation.ClientData.from_clients_and_fn(list, create_tf_dataset_for_client_fn)
I suppose here that the 3 clients all have the same dataset, 'ds'.

Creating a dict of (client_id, dataset) key-value pairs is a reasonable way to set up a tff.simulation.ClientData. Indeed, the code in the question will result in all clients having the same dataset, since ds is returned for every value of the parameter id. One thing to watch out for when pre-constructing a dict of datasets is that it may require loading the entire contents of the data into memory (which may fail for large datasets).
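For illustration, here is a minimal sketch of that lookup-based variant with toy in-memory datasets; the per-client values are made up, and the exact classmethod name can differ across TFF versions:
import tensorflow as tf
import tensorflow_federated as tff

# Toy per-client datasets; in practice these would be built from real data.
datasets_by_id = {
    '0': tf.data.Dataset.from_tensor_slices([1, 2, 3]),
    '1': tf.data.Dataset.from_tensor_slices([4, 5]),
    '2': tf.data.Dataset.from_tensor_slices([6]),
}

def create_tf_dataset_for_client_fn(client_id):
    # Look up this client's dataset instead of returning the same ds for everyone.
    return datasets_by_id[client_id]

source = tff.simulation.ClientData.from_clients_and_fn(
    list(datasets_by_id.keys()), create_tf_dataset_for_client_fn)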
Alternatively, constructing the dataset on-demand could reduce memory usage. One example might be to have a dict of (client_id, file path) key-value pairs. Something like:
dataset_paths = {
    'client_0': '/tmp/A.txt',
    'client_1': '/tmp/B.txt',
    'client_2': '/tmp/C.txt',
}

def create_tf_dataset_for_client_fn(id):
    path = dataset_paths.get(id)
    if path is None:
        raise ValueError(f'No dataset for client {id}')
    return tf.data.TextLineDataset(path)

source = tff.simulation.ClientData.from_clients_and_fn(
    list(dataset_paths.keys()), create_tf_dataset_for_client_fn)
This is similar to the approach used in tff.simulation.FilePerUserClientData. It may be useful to look at the code of that class as an example.
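For reference, a rough sketch of that style follows; I am assuming here that the constructor takes a mapping of client ids to file paths plus a function that parses one file into a tf.data.Dataset, so double-check the signature in your TFF release:
def parse_client_file(path):
    # Hypothetical parser: one integer per line in each client's file.
    return tf.data.TextLineDataset(path).map(
        lambda line: tf.strings.to_number(line, out_type=tf.int32))

client_data = tff.simulation.FilePerUserClientData(dataset_paths, parse_client_file)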

TensorFlowFederated: Passing tensor to tff.federated_computation

I have trialled the TFF tutorial (MNIST) on my single machine and now I am trying to perform a multi-machine process using MNIST data.
Clearly, I cannot use create_tf_dataset_for_client so I have used GRPC to learn how to pass data from one machine to another.
My scenario is that Server will dispatch the initial model (with zeroes) to all the participating clients where the model will run on local data. Each client will dispatch the new weights to the server that will perform federated_mean.
I was thinking of using tff.learning.build_federated_averaging_process where I could hopefully customise the next function (2nd argument) but I failed... I am not even sure if we use this approach to send the model and get the weights back from remote clients.
Then I thought I could use tff.federated_mean under the @tff.federated_computation decorator. However, since weights are arrays and I have a list of them (as I have a number of clients), I am unable to understand how to create a tff.FederatedType that points to that list of lists. Any help from someone who has modelled federation on a distributed dataset would be handy.
Regards,
Dev.
TFF computations are designed to be platform/runtime agnostic; a single computation can be executed by several different backends.
TFF's type system can be helpful here in reasoning about how data is expected to flow in your computation. See the custom federated algorithms part 1 tutorial for an intro to how TFF thinks about types.
The result of build_federated_averaging_process expects an argument of datasets which are placed at clients; for a dataset of element type T, in TFF's usual notation this would be denoted {T*}@C. This signature in particular is agnostic with respect to how the datasets arrive at the clients, or indeed how the clients themselves are represented.
Materializing the data and representing the clients is really the job of the runtime. TFF provides a few so-called native options here.
For example, in the local Python runtime clients are represented by threads on your local machine. Datasets are simply eager tf.data.Dataset objects, and the threads pull data from the datasets during training.
In the remote Python runtime, clients are represented by (threads on) remote workers, so that a single remote worker could be running more than one client. In this case, as you note, data must be materialized on the remote worker in order to train.
There are several options for accomplishing this.
One, TFF will actually handle serialization and deserialization of eager datasets across this RPC connection for you, so you could use the identical pattern of specifying data as in the local runtime, and it should "just work". This pattern actually got significantly better in March of 2021, via the use of tf.raw_ops.DatasetToGraphV2.
Perhaps better mapping to the concepts of federated computation, however, is the use of some library functions to simply instantiate the datasets on the workers.
Suppose you have an iterative process ip, which accepts a state and data argument, where data is of type {T*}@C. Suppose further we have a TFF computation get_dataset_for_client_id, which accepts a string and returns a dataset of the appropriate type (i.e., its TFF type signature is tf.string -> T*).
Then we can compose these two computations into another:
@tff.federated_computation(STATE_TYPE, tff.FederatedType(tf.string, tff.CLIENTS))
def new_next(state, client_ids):
    datasets_on_clients = tff.federated_map(get_dataset_for_client_id, client_ids)
    return ip.next(state, datasets_on_clients)
new_next now requires the controller to only specify the ids of clients on which to train, and delegates responsibility for pointing to a data store to whoever is representing the clients.
This pattern I think is likely what you want; TFF provides some helpers like the dataset_computation attribute on tff.simulation.ClientData and tff.simulation.compose_dataset_computation_with_iterative_process, which will more or less perform the wiring we did above for you.
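As a rough sketch of that wiring (assuming client_data is a tff.simulation.ClientData and iterative_process came from tff.learning.build_federated_averaging_process; the names are illustrative):
# Compose the per-client dataset construction with the training process.
trainer = tff.simulation.compose_dataset_computation_with_iterative_process(
    client_data.dataset_computation, iterative_process)

state = trainer.initialize()
# next() now takes client ids rather than materialized datasets; whichever runtime
# is representing the clients is responsible for building each client's dataset.
state, metrics = trainer.next(state, ['client_0', 'client_1'])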
Let's do this step by step. Please let us know if the explanation below answers your question.
Let's start with an example of TF (non-federated, just local) code that takes a dataset and does something with it, say add numbers:
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
@tff.tf_computation(tff.SequenceType(tf.int32))
def process_data(ds):
    return ds.reduce(np.int32(0), lambda x, y: x + y)
This code takes a dataset of integer numbers at input, and returns a single integer with the sum at output.
You can confirm this by looking at the type signature, like this:
str(process_data.type_signature)
You should see this:
(int32* -> int32)
So, process_data takes a set of integers, and returns an integer.
Now, using TFF's federated operators we can create a federated computation that does this on multiple clients, like this:
@tff.federated_computation(tff.FederatedType(tff.SequenceType(tf.int32), tff.CLIENTS))
def process_data_on_clients(federated_ds):
    return tff.federated_map(process_data, federated_ds)
If you look at the type signature of this new computation (just like above), you will see this:
({int32*}@CLIENTS -> {int32}@CLIENTS)
It means process_data_on_clients takes a federated set of integers (one set per client), and returns a federated integer (one integer with the sum on each client).
What happens in the above is that, the TF logic in process_data will be executed once on each client. This is how the federated_map operator works.
Now, process_data_on_clients is a little bit like the iterative process you are working with. It wants you to provide a federated dataset as an argument.
Let's see how we can make one by following the same pattern as above.
Here's some TF code that creates a single dataset with integers; say you supply an integer n and want to create a dataset with the numbers from 1 up to n, i.e., {1, 2, ..., n}:
@tff.tf_computation(tf.int32)
def make_data(n):
    return tf.data.Dataset.range(tf.cast(n, tf.int64)).map(lambda x: tf.cast(x + 1, tf.int32))
This is obviously a silly example, you could do something more along the lines of what you need (e.g., read data from a file specified by a name, etc.).
And here's what its type signature looks like:
(int32 -> int32*)
You can see the similarity to process_data.
And, just like with processing data, here's how we can make data on all clients by using the federated_map operator:
@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_data_on_clients(federated_n):
    return tff.federated_map(make_data, federated_n)
This is the type signature:
({int32}@CLIENTS -> {int32*}@CLIENTS)
Great, so make_data_on_clients takes a federated integer (that tells us how many data items to produce on each client), and returns a federated dataset, just like what process_data_on_clients wants.
You can check that the two work together as intended:
federated_n = [2, 3, 4]
federated_ds = make_data_on_clients(federated_n)
result = process_data_on_clients(federated_ds)
result
You should get the sums 1+2, 1+2+3, and 1+2+3+4 on the 3 clients involved in this computation (note there were 3 numbers in the federated integer above, so there are 3 clients here):
[<tf.Tensor: shape=(), dtype=int32, numpy=3>,
<tf.Tensor: shape=(), dtype=int32, numpy=6>,
<tf.Tensor: shape=(), dtype=int32, numpy=10>]
Note that all TF code you have seen so far, including both dataset creation and dataset reduce, were being executed on the clients (using federated_map).
Now, you can put the two together:
@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_and_process_data_on_clients(federated_n):
    federated_ds = make_data_on_clients(federated_n)
    return process_data_on_clients(federated_ds)
And now, you can invoke the make and process data combo in one shot:
make_and_process_data_on_clients(federated_n)
Again, all TF code here is executing on clients, just like in the above.
So where does this leave you?
Going back to Keith's explanation, the iterative process you got from TFF wants a federated dataset at input, just like process_data_on_clients.
The function get_dataset_for_client_id in Keith's example is like our make_data in that it is assumed to contain TensorFlow code that you want to run on each client to physically construct a dataset on that client.
In our silly example, the dataset construction logic used range, but it can be anything. For example, you could load data on each client from the same local file my_data, or use a custom TF op, or whatever other means. Just like in our example, you can pass parameters to that function to give you more centralized control (similarly to what we did above with the federated integer).
The code snippet new_next in Keith's example is just like our make_and_process_data_on_clients, in that it combines two federated computations: one that makes federated data on clients (supplied by you, just as discussed here), and one that processes that data (from tff.learning, the iterative process).
Does this help?
If it's still unclear, I would recommend trying the examples I included above on your distributed setup, since you already have one. You could inject some TF print ops into that code to confirm that the TF code you wrote is executing on the client machines in your system.
Once you get that part, it's a simple tweak to replace the silly dataset construction logic in make_data with one that loads a dataset on each client from whatever local data source you are using.
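For example, a hedged sketch of such a tweak (the file format here, one integer per line in a local text file, is just an assumption for illustration):
@tff.tf_computation(tf.string)
def make_data_from_file(path):
    # Read one integer per line from a local file on the client.
    return tf.data.TextLineDataset(path).map(
        lambda line: tf.strings.to_number(line, out_type=tf.int32))

@tff.federated_computation(tff.FederatedType(tf.string, tff.CLIENTS))
def make_data_on_clients_from_files(federated_paths):
    return tff.federated_map(make_data_from_file, federated_paths)
Its type signature would be ({string}@CLIENTS -> {int32*}@CLIENTS), so it can feed process_data_on_clients, or an iterative process, in exactly the same way as make_data_on_clients.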
EDITS:
Re: how to print, any TensorFlow code that appears in the body of a @tff.tf_computation is executed in eager mode, and you can use standard TensorFlow mechanisms such as tf.print to print from within TensorFlow.
tensorflow.org/api_docs/python/tf/print
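For instance, a minimal sketch along the lines of the earlier process_data example (the message text is arbitrary):
@tff.tf_computation(tff.SequenceType(tf.int32))
def process_data_with_print(ds):
    # This should print once per client that executes the computation.
    tf.print('process_data running on a client')
    return ds.reduce(np.int32(0), lambda x, y: x + y)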
On how to configure a multi-machine system with multiple worker nodes, see the Kubernetes tutorial. Note that the machine that drives the process connects to worker nodes, not the other way round.
https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes

How does one dynamically add new parameters to optimizers in Pytorch?

I was going through this post in the pytorch forum, and I also wanted to do this. The original post removes and adds layers, but I think my situation is not that different. I also want to add layers or more filters or word embeddings. My main motivation is that the AI agent does not know the whole vocabulary/dictionary in advance because it's large. I prefer strongly (for the moment) not to do character-by-character RNNs.
So what will happen for me is when the agent starts a forward pass it might find new words it has never seen and will need to add them to the embedding table (or perhaps add new filters before it starts the forward pass).
So what I want to make sure is:
embeddings are added correctly (at the right time, when a new computation graph is made) so that they are updatable by the optimizer
no issues with stored info of past parameters, e.g. if it's using some sort of momentum
How does one do this? Any sample code that works?
Just to add an answer to the title of your question: "How does one dynamically add new parameters to optimizers in Pytorch?"
You can append params at any time to the optimizer:
import torch
import torch.optim as optim

model = torch.nn.Linear(2, 2)

# Initialize optimizer (SGD here, since Adam does not accept a momentum argument)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Create the extra parameters and append them as a new param group
extra_params = torch.nn.Parameter(torch.randn(2, 2))
optimizer.param_groups.append({'params': [extra_params]})

# then you can print your `extra_params`
print("extra params", extra_params)
print("optimizer params", optimizer.param_groups)
That is a tricky question, as I would argue that the answer is "depends", in particular on how you want to deal with the optimizer.
Let's start with your specific problem - an embedding. In particular, you are asking how to add embeddings to allow for a larger vocabulary dynamically. My first advice is that if you have a good sense of an upper bound on your vocabulary size, make the embedding large enough to cope with it from the beginning, as this is more efficient, and you will need the memory eventually anyway. But this is not what you asked. So - to dynamically change your embedding, you'll need to overwrite your old one with a new one, and inform your optimizer of the change. You can simply do that whenever you run into an exception with your old embedding, in a try ... except block. This should roughly follow this idea:
# from within whichever module owns the embedding
# remember the already trained weights
old_embedding_weights = self.embedding.weight.data
# create a new embedding of the new size
self.embedding = nn.Embedding(new_vocab_size, embedding_dim)
# initialize the values for the new embedding. this is random, but you might want to use something like GloVe
new_weights = torch.randn(new_vocab_size, embedding_dim)
# as your old values may have been updated, you want to retrieve these updated values
new_weights[:old_vocab_size] = old_embedding_weights
self.embedding.weight.data.copy_(new_weights)
However, you should not do this for every single new word you receive, as this copying takes time (and a whole lot of memory, as the embedding exists twice for a short time; if you're nearly out of memory, just make your embedding large enough from the start). So instead, increase the size dynamically by a couple of hundred slots at a time.
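A rough sketch of that chunked growth, where _resize_embedding is a hypothetical helper wrapping the snippet above, token_id is some incoming token index, and the growth size is arbitrary:
GROWTH = 256
if token_id >= self.embedding.num_embeddings:
    # Grow in chunks so the expensive copy happens rarely, not once per new word.
    self._resize_embedding(self.embedding.num_embeddings + GROWTH)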
Additionally, this first step already raises some questions:
How does my respective nn.Module know about the new embedding parameter?
The __setattr__ method of nn.Module takes care of that.
Second, why don't I simply change my parameter? That's already pointing towards some of the problems of changing the optimizer: pytorch internally keeps references by object ID. This means that if you change your object, all these references will point towards a potentially incompatible object, as its properties have changed. So we should simply create a new parameter instead.
What about other nn.Parameters or nn.Modules that are not embeddings? These you treat the same. You basically just instantiate them, and attach them to their parent module. The __setattr__ method will take care of the rest. So you can do so completely dynamically ...
Except, of course, the optimizer. The optimizer is the only other thing that "knows" about your parameters except for your main model-module. So you need to let the optimizer know of any change.
And this is tricky, if you want to be sophisticated about it, and very easy if you don't care about keeping the optimizer state. However, even if you want to be sophisticated about it, there is a very good reason why you probably should not do this anyways. More about that below.
Anyways, if you don't care, a simple
# simply overwrite your old optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
will do. If you do care, however, and want to transfer your old state, you can do so the same way that you store, and later load, parameters and optimizer states from disk: using the .state_dict() and .load_state_dict() methods. This, however, only works with a twist:
# extract the state dict from your old optimizer
old_state_dict = optimizer.state_dict()
# create a new optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
new_state_dict = optimizer.state_dict()
# the old state dict will have references to the old parameters, in state_dict['param_groups'][xyz]['params'] and in state_dict['state']
# you now need to find the parameter mismatches between the old and new state dicts
# if your optimizer has multiple param groups, you need to loop over them, too (I use xyz as a placeholder here; mostly you'll only have one anyway, so just replace xyz with 0)
new_pars = [p for p in new_state_dict['param_groups'][xyz]['params'] if p not in old_state_dict['param_groups'][xyz]['params']]
old_pars = [p for p in old_state_dict['param_groups'][xyz]['params'] if p not in new_state_dict['param_groups'][xyz]['params']]
# then you remove all the outdated ones from the state dict
for pid in old_pars:
    old_state_dict['state'].pop(pid)
# and add a new state for each new parameter to the state:
for pid in new_pars:
    old_state_dict['param_groups'][xyz]['params'].append(pid)
    old_state_dict['state'][pid] = { ... }  # your new state def here, depending on your optimizer
However, here's the reason why you should probably never update your optimizer like this, but should instead re-initialize from scratch and just accept the loss of state information: when you change your computation graph, you change forward and backward computation for all parameters along your computation path (if you do not have a branching architecture, this path will be your entire graph). More specifically, this means that the input to your functions (=layer/nn.Module) will be different if you change some function (=layer/nn.Module) applied earlier, and the gradients will change if you change some function (=layer/nn.Module) applied later. That in turn invalidates the entire state of your optimizer. So if you keep your optimizer's state around, it will be a state computed for a different computation graph, and will probably end up in catastrophic behavior on the part of your optimizer if you try to apply it to a new computation graph. (I've been there ...)
So - to sum it up: I'd really recommend trying to keep it simple, only changing a parameter as conservatively as possible, and not touching the optimizer.
If you want to customize initial params:
from itertools import chain
import torch.nn as nn
import torch.optim as optim

l1 = nn.Linear(3, 3)
l2 = nn.Linear(2, 3)
optimizer = optim.SGD(chain(l1.parameters(), l2.parameters()), lr=0.01, momentum=0.9)
The key is that the first parameter of the constructor accepts an iterator.

How to create training and test DataSetIterators in deeplearning4j?

I am building a recurrent neural network with deeplearning4j and I need to create the training and test data sets.
All the examples provided in the documentation and the example code, use a CSVSequenceRecordReader to read CSV files.
Then a DataSetIterator is created with the SequenceRecordReaderDataSetIterator constructor and fed into the MultiLayerNetwork.fit() or the MultiLayerNetwork.evaluate() method (depending on whether it's a training or test data set iterator).
However, in my case, the data set I have is not stored in a CSV file. I access it online through a third-party library and pre-process it to obtain List<Data> and List<Labels> objects.
How can I:
1) create the DataSetIterator from my two lists?
2) split the DataSetIterator in a training set and a test set?
Edit:
I think my question is too broad. Let me try to narrow it down.
I have started to read this article which uses a very simple approach to create a data set:
It creates two INDArrays and builds a DataSet from them using the DataSet(INDArray first, INDArray second) constructor.
Training the data works using network.fit(dataSet);, but I can't evaluate it while training, as the evaluate method requires a data set iterator, not a data set.
Moreover, from what I understand, using this approach also means that there is only one huge data set, no mini batches.
I also guess that I could create mini batches from this big data set by using the batchBy(int num) method. But this method returns a list of data sets, and not a data set iterator... iterateWithMiniBatches() does return a data set iterator, but when I looked at the source file, it returns null and is deprecated. Then I tried to see if there is an implementation of DataSetIterator I could use, but there are a lot of them. I tried the BaseDataSetIterator, but it does not take a DataSet as a constructor parameter but a DataSetFetcher... yet another layer.
Is there somewhere an example that shows how to create a data set without using the default record readers? Or should I just create my own implementation of a record reader?
1)
MultiLayerNetwork.evaluate() accepts ListDataSetIterator as a parameter
If you have a List<Data> object you can first map it into a double[] featureVector and a double[] labelVector and then create a ListDataSetIterator like this
INDArray x = Nd4j.create(featureVector, new int[]{featureVector.length/numberOfFeatures, numberOfFeatures}, 'c');
INDArray y = Nd4j.create(labelVector, new int[]{labelVector.length/numberOfLabels, numberOfLabels}, 'c');
final DataSet allData = new DataSet(x,y);
final List<DataSet> list = allData.asList();
ListDataSetIterator iterator = new ListDataSetIterator(list);
For 2) you should just create two separate iterators, one for training, one for testing.
You can then evaluate your net with net.evaluate(testIterator);

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding, but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant: you can indeed feed a constant to change its value, but is the intent clear then, or would that be confusing? I doubt it would be clear.
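As a small illustrative sketch (TF 1.x style, with made-up values), a placeholder_with_default evaluates to its default unless you explicitly feed it:
x = tf.placeholder_with_default(tf.constant(1.0), shape=[])
y = x * 2.0
with tf.Session() as sess:
    print(sess.run(y))                      # 2.0, uses the default
    print(sess.run(y, feed_dict={x: 3.0}))  # 6.0, the fed value overrides the default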
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
A simple example is to define two placeholders a and b, and then an operation on them, like below.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a and b are not initialized and contain no data, because they were defined as placeholders.
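To actually produce a value, you feed the placeholders when you run the graph; a minimal sketch (TF 1.x sessions, made-up numbers):
with tf.Session() as sess:
    print(sess.run(adder_node, feed_dict={a: 3.0, b: 4.5}))        # 7.5
    print(sess.run(adder_node, feed_dict={a: [1, 3], b: [2, 4]}))  # [3. 7.]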
Another approach to do the same is to define variables with tf.Variable; in this case you have to provide an initial value when you declare it, and then initialize it by calling:
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks:
Performance-wise, you need to do one extra step of calling the initializer (these variables are, however, updatable).
In some cases you do not know the initial values for these variables, so you have to define them as placeholders.
Conclusion:
Use tf.Variable for trainable variables such as weights (W) and biases (B) for your model, or when initial values are required in general.
tf.placeholder allows you to create operations and build the computation graph without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow to do all sorts of tricks and optimizations, like distributed, platform-independent calculations, graph interoperability, GPU computations, etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go, you use placeholders. This way, as Ahmed said, a placeholder is a sort of promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's:
import numpy as np
import tensorflow as tf

# define graph to do matrix multiplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    # generate some data for our graph
    data_x = np.random.randint(0, 10, size=[5, 5])
    data_y = np.random.randint(0, 10, size=[5, 5])
    # do the work
    result = session.run(model, feed_dict={x: data_x, y: data_y})
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data
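For instance, a rough sketch of the input-pipeline alternative, where data enters the graph through tf.data rather than placeholders (TF 1.x style, toy values):
# Build a small pipeline that batches a toy array.
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10, dtype=np.float32))
next_batch = dataset.batch(2).repeat().make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(next_batch))  # first batch: [0. 1.]
    print(sess.run(next_batch))  # next batch: [2. 3.]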

Worse sin: side effects or passing massive objects?

I have a function inside a loop inside a function. The inner function acquires and stores a large vector of data in memory (as a global variable... I'm using "R" which is like "S-Plus"). The loop loops through a long list of data to be acquired. The outer function starts the process and passes in the list of datasets to be acquired.
for (dataset in list_of_datasets) {
    for (datachunk in dataset) {
        <process datachunk>
        <store result? as vector? where?>
    }
}
I programmed the inner function to store each dataset before moving to the next, so all the work of the outer function occurs as side effects on global variables... a big no-no. Is this better or worse than collecting and returning a giant, memory-hogging vector of vectors? Is there a superior third approach?
Would the answer change if I were storing the data vectors in a database rather than in memory? Ideally, I'd like to be able to terminate the function (or have it fail due to network timeouts) without losing all the information processed prior to termination.
Use variables in the outer function instead of global variables. This gets you the best of both approaches: you're not mutating global state, and you're not copying a big wad of data. If you have to exit early, just return the partial results.
(See the "Scope" section in the R manual: http://cran.r-project.org/doc/manuals/R-intro.html#Scope)
Remember your Knuth. "Premature optimization is the root of all programming evil."
Try the side effect free version. See if it meets your performance goals. If it does, great, you don't have a problem in the first place; if it doesn't, then use the side effects, and make a note for the next programmer that your hand was forced.
It's not going to make much difference to memory use, so you might as well make the code clean.
Since R has copy-on-modify for variables, modifying the global object will have the same memory implications as passing something up in return values.
If you store the outputs in a database (or even in a file) you won't have the memory use issues, and the data will be incrementally available as it is created, rather than just at the end. Whether it's faster with the database depends primarily on how much memory you are using: is the reduction in garbage collection going to pay for the cost of writing to disk?
There are both time and memory profilers in R, so you can see empirically what the impacts are.
FYI, here's a full sample toy solution that avoids side effects:
outerfunc <- function(names) {
    templist <- list()
    for (aname in names) {
        templist[[aname]] <- innerfunc(aname)
    }
    templist
}

innerfunc <- function(aname) {
    retval <- NULL
    if ("one" %in% aname) retval <- c(1)
    if ("two" %in% aname) retval <- c(1, 2)
    if ("three" %in% aname) retval <- c(1, 2, 3)
    retval
}

names <- c("one", "two", "three")
name_vals <- outerfunc(names)
for (name in names) assign(name, name_vals[[name]])
I'm not sure I understand the question, but I have a couple of solutions.
Inside the function, create a list of the vectors and return that.
Inside the function, create an environment and store all the vectors inside of that. Just make sure that you return the environment in case of errors.
in R:
help(environment)

# You might do something like this:
outer <- function(datasets) {
    # create the return environment
    ret.env <- new.env()
    for (set in datasets) {
        tmp <- inner(set)
        # check for errors however you like here. You might have inner return a list, and
        # have the list contain an error component
        assign(set, tmp, envir = ret.env)
    }
    return(ret.env)
}

# The inner function might be defined like this
inner <- function(dataset) {
    # I don't know what you are doing here, but let's pretend you are reading a data file
    # that is named by dataset
    filedata <- read.table(dataset, header = T)
    return(filedata)
}
leif
Third approach: inner function returns a reference to the large array, which the next statement inside the loop then dereferences and stores wherever it's needed (ideally with a single pointer store and not by having to memcopy the entire array).
This gets rid of both the side effect and the passing of large datastructures.
It's tough to say definitively without knowing the language/compiler used. However, if you can simply pass a pointer/reference to the object that you're creating, then the size of the object itself has nothing to do with the speed of the function calls. Manipulating this data down the road could be a different story.
