As part of my Go tutorial, I'm writing a simple program that counts words across multiple files. I have a few goroutines for processing files; each builds a map[string]int recording how many occurrences of each word were found. Each map is then sent to a reducer goroutine, which aggregates the values into a single map. Sounds pretty straightforward and looks like a perfect (map-reduce) task for Go!
I have around 10k documents with 1.6 million unique words. What I found is that memory usage grows quickly and steadily while the code runs, and I run out of memory about halfway through processing (12GB box, 7GB free). So yes, it uses gigabytes for this small data set!
Trying to figure out where the problem lies, I found that the reducer collecting and aggregating the data is to blame. Here is the code:
func reduceWords(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input {
        for w, c := range wordMap {
            total[w] += c
        }
    }
    output <- len(total)
}
If I remove the map from the sample above, memory stays within reasonable limits (a few hundred megabytes). What I found, though, is that taking a copy of the string also solves the problem, i.e. the following sample doesn't eat up my memory:
func reduceWords(input chan map[string]int, output chan int) {
    total := make(map[string]int)
    for wordMap := range input {
        for w, c := range wordMap {
            copyW := make([]byte, len(w)) // <-- take a copy here!
            copy(copyW, w)
            total[string(copyW)] += c
        }
    }
    output <- len(total)
}
Is it possible that a wordMap instance is not being collected after each iteration when I use the value directly? (As a C++ programmer I have limited intuition when it comes to GC.) Is this desired behaviour? Am I doing something wrong? Should I be disappointed with Go, or rather with myself?
Thanks!
What does your code look like that turns files into strings? I would look for a problem there. If you are converting large blocks (whole files maybe?) to strings, and then slicing those into words, then you are pinning the entire block if you save any one word. Try keeping the blocks as []byte, slicing those into words, and then converting words to the string type individually.
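A minimal sketch of that fix in Go: keep the block as []byte, split it into []byte words, and convert each word to string individually, so each map key copies only its own bytes instead of pinning the whole block.

```go
package main

import (
	"bytes"
	"fmt"
)

// countWords keeps the file contents as []byte and converts each word
// to string individually, so every map key owns only its own bytes
// rather than keeping the entire block alive.
func countWords(block []byte) map[string]int {
	counts := make(map[string]int)
	for _, word := range bytes.Fields(block) {
		counts[string(word)]++ // string() copies just this word
	}
	return counts
}

func main() {
	block := []byte("the quick brown fox jumps over the lazy dog")
	fmt.Println(countWords(block)["the"]) // 2
}
```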
Related
Is it possible to have a function that returns the last x lines of a file? The function would take parameters defining how far from the end we want to start reading (measured in lines) and how many lines should be returned from that position:
get_lines_file_end(IoDevice, LineNumberPositionFromEnd, LineCount) ->
Example:
We have a file with 30 lines, numbered 0-29
get_lines_file_end(IoDevice, -10, 10) // will return lines 20-29
get_lines_file_end(IoDevice, -20, 10) // will return lines 10-19
The problem is that file:position/2 only lets me seek by a given number of bytes, not by lines.
Purpose:
View a large log file (hundreds of MB) page by page, starting from the last "page".
Erlang implements a REST API that is consumed by a JavaScript web front end.
The function will be used to view whole log files page by page, where a page is x lines of text. No processing of the log files, or extraction of information from them, is needed.
Thanks
Two points to be made:
To make this efficient you must create metadata about your text file contents to amortize the work involved. This way you can directly skip to the bits you need by seeking using file:position/2 after you have created this metadata.
If this is your use case then you should be partitioning the work differently. The huge text files should either be broken down into smaller text files, or (more likely) you shouldn't be using text files at all. Depending on what your goal is (which you haven't mentioned; I strongly suspect this is an X-Y problem) you probably don't want text at all, but rather want to know something represented by the text. It may be a good idea to keep the raw text around somewhere just in case, but for actual processing of the data it is almost certainly a better idea to create symbolic data that (much more briefly) represents whatever you find interesting about the data, and store that in a database where seeking, scanning, indexing and doing whatever other things you might want are natural operations.
To build metadata about the files, you will need to do something analogous to:
1> {ok, Data} = file:read_file("TheLongDarkTeaTimeOfTheSoul.txt").
{ok,<<"Douglas Adams. The Long Dark Tea-Time of the Soul\r\n\r\n"...>>}
2> LineEnds = binary:matches(Data, <<"\r\n">>).
[{49,2},
{51,2},
{53,2},
{...}|...]
And then save LineEnds somewhere separately as metadata about the file itself. With this, seeking within the file data is elementary (as in, use file:position/2 with the offset at linebreak X, or at length(LineEnds) - X, or whatever).
But this is still silly.
If you want to hop around within log files, and especially if you want to be able to locate patterns within them, count certain aspects of them, etc. then you would almost certainly do better reading them into a database like Postgres line by line, counting the line numbers as you go. At that point, pagination becomes a trivial issue.
Log files, however, are usually full of the sort of data that is best represented by symbols, not actual text, and it is probably an even better idea to tokenize the log file. Consider the case of access log files. A repeating number of visitors access from a finite number of access points (IPs, or devices, or whatever) an arbitrary number of times. Each aspect of this can be separately indexed and compared rather trivially within a database. The tokenization itself is rather trivial as well. Not only is this solution much faster when it comes to later analysis of the data, but it lends itself naturally to answering otherwise very-difficult-to-answer questions about the contents of the data in a very straightforward and familiar manner. ...And you don't even have to lose any of the raw data, or intermediate stages of processing (which may all be independently useful in different ways).
Also... note that all of the above work can be made parallel very easily in Erlang. Whatever your computing resource situation is, writing a solution that best leverages your hardware is certainly within grasp (assuming you have enough total data that this is even an issue).
Just like many "How to do X with data Y?" questions, the real answer is always going to revolve around "What is your goal regarding the data and why?"
You can use the file:read_line/1 function to read lines, discarding those that don't match your range:
get_lines(File, From) when From > 0 ->
    get_lines(File, file:read_line(File), From, 1).

get_lines(_File, eof, _From, _Current) ->
    [];
get_lines(File, {ok, _Line}, From, Current) when Current < From ->
    get_lines(File, file:read_line(File), From, Current + 1);
get_lines(File, {ok, Line}, From, Current) ->
    [Line | get_lines(File, file:read_line(File), From, Current + 1)];
get_lines(_IoDevice, Error, _From, _Current) ->
    Error.
Take for example the following code:
for i := (myStringList.Count - 1) downto 0 do begin
  dataList := SplitString(myStringList[i], #9);
  x := StrToFloat(dataList[0]);
  y := StrToFloat(dataList[1]);
  z := StrToFloat(dataList[2]);
  //Do something with these variables
  myOutputRecordArray[i] := {SomeFunctionOf}(x, y, z);
  //Free used list item
  myStringList.Delete(i);
end;
//Free Memory
myStringList.Free;
How would you parallelise this using, for example, the OmniThreadLibrary? Is it possible? Or does it need to be restructured?
I'm calling myStringList.Delete(i); at each iteration as the StringList is large and freeing items after use at each iteration is important to minimise memory usage.
Simple answer: You wouldn't.
More involved answer: the last thing you want to do in a parallelized operation is modify shared state, such as this Delete call. Since it's not guaranteed that the individual tasks will finish in order (and in fact it's highly likely that they won't at least once, with that probability approaching 100% very quickly as you add tasks to the total workload), trying to do something like that is playing with fire.
You can either destroy the items as you go and do it serialized, or do it in parallel, finish faster, and destroy the whole list. But I don't think there's any way to have it both ways.
You can cheat. Setting the string value to an empty string will free most of the memory and will be thread safe. At the end of the processing you can then clear the list.
Parallel.ForEach(0, myStringList.Count - 1).Execute(
  procedure (const index: integer)
  var
    dataList: TStringDynArray;
    x, y, z: Single;
  begin
    dataList := SplitString(myStringList[index], #9);
    x := StrToFloat(dataList[0]);
    y := StrToFloat(dataList[1]);
    z := StrToFloat(dataList[2]);
    //Do something with these variables
    myOutputRecordArray[index] := {SomeFunctionOf}(x, y, z);
    //Free used list item
    myStringList[index] := '';
  end);
myStringList.Clear;
This code is safe because we are never writing to a shared object from multiple threads. You need to make sure that all of the variables you use that would normally be local are declared in the threaded block.
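The same discipline sketched in Go (the sum x+y+z is a hypothetical stand-in for {SomeFunctionOf}): fan the indices out to goroutines and let each one write only to its own slot of the output, clearing its own input entry as it goes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// processAll parses each tab-separated line in parallel. Every goroutine
// writes only to its own index of the result slice, so no shared object
// is ever written from multiple threads and no locking is needed.
func processAll(lines []string) []float64 {
	results := make([]float64, len(lines))
	var wg sync.WaitGroup
	for i := range lines {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fields := strings.Split(lines[i], "\t")
			x, _ := strconv.ParseFloat(fields[0], 64)
			y, _ := strconv.ParseFloat(fields[1], 64)
			z, _ := strconv.ParseFloat(fields[2], 64)
			results[i] = x + y + z // stand-in for {SomeFunctionOf}(x, y, z)
			lines[i] = ""          // drop the input, like myStringList[index] := ''
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(processAll([]string{"1\t2\t3", "4\t5\t6"})) // [6 15]
}
```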
I'm not going to attempt to show how to do what you originally asked, because it is a bad idea that will not lead to improved performance, even assuming you deal with the many and various data races in your proposed parallel implementation.
The bottleneck here is the disk I/O. Reading the entire file into memory, and then processing the contents is the design choice that is leading to your memory problems. The correct way to solve this problem is to use a pipeline.
Step 1 of the pipeline takes as input the file on disk. The code here reads chunks of the file and then breaks those chunks into lines. These lines are the output of this step. The entire file is never in memory at one time. You'll have to tune the size of the chunks that you read.
Step 2 takes as input the strings the step 1 produced. Step 2 consumes those strings and produces vectors. Those vectors are added to your vector list.
Step 2 will be faster than step 1 because I/O is so expensive. Therefore there's nothing to be gained by trying to optimise either step with parallel algorithms. Even on a uniprocessor machine this pipelined implementation could be faster than a non-pipelined one.
I am a Delphi programmer.
In a program I have to generate two-dimensional arrays with different "branch" lengths.
They are very big and the operation takes a few seconds (annoying).
For example:
var
  a: array of array of Word;
  i: Integer;
begin
  SetLength(a, 5000000);
  for i := 0 to 4999999 do
    SetLength(a[i], Diff_Values);
end;
I am aware of the command SetLength(a, dim1, dim2), but it is not applicable here. Setting a minimum value (> 0) for dim2 and continuing from there doesn't work either, because the minimum of dim2 is 0 (some "branches" can be empty).
So, is there a way to make it fast? Not just by 5..10% but really FAST...
Thank you.
When dealing with a large amount of data, there's a lot of work that has to be done, and this places a theoretical minimum on the amount of time it can be done in.
For each of 5 million iterations, you need to:
Determine the size of the "branch" somehow
Allocate a new array of the appropriate size from the memory manager
Zero out all the memory used by the new array (SetLength does this for you automatically)
Step 1 is completely under your control and can possibly be optimized. 2 and 3, though, are about as fast as they're gonna get if you're using a modern version of Delphi. (If you're on an old version, you might benefit from installing FastMM and FastCode, which can speed up these operations.)
The other thing you might do, if appropriate, is lazy initialization. Instead of trying to allocate all 5 million arrays at once, just do the SetLength(a, 5000000); at first. Then when you need to get at a "branch", first check if its length = 0. If so, it hasn't been initialized, so initialize it to the proper length. This doesn't save time overall, in fact it will take slightly longer in total, but it does spread out the initialization time so the user doesn't notice.
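Lazy initialization can be sketched like this in Go (branchLen is a hypothetical stand-in for however Diff_Values is computed per row): only the spine is allocated up front, and a branch is allocated the first time it is touched.

```go
package main

import "fmt"

// LazyMatrix allocates only the spine up front; a branch is allocated
// the first time it is accessed. branchLen is a hypothetical stand-in
// for however the per-row size is determined.
type LazyMatrix struct {
	rows      [][]uint16
	branchLen func(row int) int
}

func NewLazyMatrix(n int, branchLen func(int) int) *LazyMatrix {
	return &LazyMatrix{make([][]uint16, n), branchLen} // one allocation
}

// Branch initializes a row on first access (length 0 means
// "not yet initialized", just as in the answer above).
func (m *LazyMatrix) Branch(row int) []uint16 {
	if m.rows[row] == nil {
		if n := m.branchLen(row); n > 0 {
			m.rows[row] = make([]uint16, n)
		}
	}
	return m.rows[row]
}

func main() {
	m := NewLazyMatrix(5000000, func(row int) int { return row % 4 })
	fmt.Println(len(m.Branch(3))) // 3
}
```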
If your initialization is already as fast as it will get, and your situation is such that lazy initialization can't be used here, then you're basically out of luck. That's the price of dealing with large amounts of data.
I just tested your exact code, with a constant for Diff_Values, timed it using GetTickCount() for rudimentary timing. If Diff_Values is 186 it takes 1466 milliseconds, if Diff_Values is 187 it fails with Out of Memory. You know, Out of Memory means Out of Address Space, not really Out of Memory.
In my opinion you're allocating so much data you run out of RAM and Windows starts paging, that's why it's slow. On my system I've got enough RAM for the process to allocate as much as it wants; And it does, until it fails.
Possible solutions
The obvious one: Don't allocate that much!
Figure out a way to allocate all the data in one contiguous block of memory: this helps with address space fragmentation. It is similar to how a two-dimensional array with fixed-size "branches" is allocated, but since your "branches" have different sizes, you'll need to figure out a different mathematical formula, based on your data.
Look into other data structures, possibly ones that cache to disk (to break the 2GB address space limit).
In addition to Mason's points, here are some more ideas to consider:
If the branch lengths never change after they are allocated, and you have an upper bound on the total number of items that will be stored in the array across all branches, then you might be able to save some time by allocating one huge chunk of memory and divvying up the "branches" within that chunk yourself. Your array would become a 1 dimensional array of pointers, and each entry in that array points to the start of the data for that branch. You keep track of the "end" of the used space in your big block with a single pointer variable, and when you need to reserve space for a new "branch" you take the current "end" pointer value as the start of the new branch and increment the "end" pointer by the amount of space that branch requires. Don't forget to round up to dword boundaries to avoid misalignment penalties.
This technique will require more use of pointers, but it offers the potential of eliminating all the heap allocation overhead, or at least replacing the general purpose heap allocation with a purpose-built very simple, very fast suballocator that matches your specific use pattern. It should be faster to execute, but it will require more time to write and test.
This technique will also avoid heap fragmentation and reduces the releasing of all the memory to a single deallocation (instead of millions of separate allocations in your present model).
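A sketch of that suballocator in Go (the branch sizes and the Word-sized element type are assumptions): one big backing array, with each "branch" carved out by advancing an "end" index, exactly as described above. Go's slice mechanics handle the alignment bookkeeping that the Delphi version would do by rounding to dword boundaries.

```go
package main

import "fmt"

// carve allocates one backing array for all branches and slices it up,
// replacing millions of small heap allocations with a single one.
func carve(branchLens []int) [][]uint16 {
	total := 0
	for _, n := range branchLens {
		total += n
	}
	backing := make([]uint16, total) // one allocation for everything
	branches := make([][]uint16, len(branchLens))
	end := 0 // the "end" pointer into the big block
	for i, n := range branchLens {
		branches[i] = backing[end : end+n]
		end += n
	}
	return branches
}

func main() {
	b := carve([]int{3, 0, 5, 2}) // assumed per-branch sizes
	b[2][4] = 42
	fmt.Println(b[2][4], len(b[0])) // 42 3
}
```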
Another tip to consider: if the first thing you always do with each newly allocated array "branch" is assign data into every slot, then you can eliminate step 3 in Mason's list: you don't need to zero out the memory if you're going to immediately assign real data into it. This will cut your memory write operations in half.
Assuming you can fit the entire data structure into a contiguous block of memory, you can do the allocation in one shot and then take over the indexing.
Note: Even if you can't fit the data into a single contiguous block of memory, you can still use this technique by allocating multiple large blocks and then piecing them together.
First off form a helper array, colIndex, which is to contain the index of the first column of each row. Set the length of colIndex to RowCount+1. You build this by setting colIndex[0] := 0 and then colIndex[i+1] := colIndex[i] + ColCount[i]. Do this in a for loop which runs up to and including RowCount. So, in the final entry, colIndex[RowCount], you store the total number of elements.
Now set the length of a to be colIndex[RowCount]. This may take a little while, but it will be quicker than what you were doing before.
Now you need to write a couple of indexers. Put them in a class or a record.
The getter looks like this:
function GetItem(row, col: Integer): Word;
begin
  Result := a[colIndex[row] + col];
end;
The setter is obvious. You can inline these access methods for increased performance. Expose them as an indexed property for convenience to the object's clients.
You'll want to add some code to check for validity of row and col. You need to use colIndex for the latter. You can make this checking optional with {$IFOPT R+} if you want to mimic range checking for native indexing.
Of course, this is a total non-starter if you want to change any of your column counts after the initial instantiation!
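The same scheme sketched in Go: colIndex holds the prefix sums of the column counts, and a flat slice plus two small accessors stands in for the jagged array.

```go
package main

import "fmt"

// Jagged stores a ragged matrix in one flat slice, indexed via prefix
// sums: colIndex[i] is the index of the first element of row i, and the
// final entry is the total number of elements.
type Jagged struct {
	colIndex []int
	data     []uint16
}

func NewJagged(colCounts []int) *Jagged {
	colIndex := make([]int, len(colCounts)+1)
	for i, c := range colCounts {
		colIndex[i+1] = colIndex[i] + c
	}
	return &Jagged{colIndex, make([]uint16, colIndex[len(colCounts)])}
}

func (j *Jagged) Get(row, col int) uint16 {
	return j.data[j.colIndex[row]+col]
}

func (j *Jagged) Set(row, col int, v uint16) {
	j.data[j.colIndex[row]+col] = v
}

func main() {
	m := NewJagged([]int{2, 0, 3}) // three rows with 2, 0 and 3 columns
	m.Set(2, 1, 7)
	fmt.Println(m.Get(2, 1)) // 7
}
```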
I have an algorithm where I create two bi-dimensional arrays like this:
type
  TPtrMatrixLine = array of byte;
  TCurMatrixLine = array of integer;
  TPtrMatrix = array of TPtrMatrixLine;
  TCurMatrix = array of TCurMatrixLine;
function x;
var
  PtrsMX: TPtrMatrix;
  CurMx: TCurMatrix;
begin
  { Try to allocate RAM }
  SetLength(PtrsMX, RowNr+1, ColNr+1);
  SetLength(CurMx, RowNr+1, ColNr+1);
  for all rows do
    for all cols do
      FillMatrixWithData; <------- CPU-intensive task; it could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix, but sometimes it can go as high as 25000x6000, so for the two matrices together I need something like 146.5 + 586.2 = 732.7MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot of RAM, since it uses integers to store the data. Unfortunately, the stored values may have a sign, so I cannot use Word instead of integer. SmallInt is too small (my values are bigger than SmallInt allows, but smaller than Word's maximum). I hope that any other way to implement this will not add a lot of overhead, since processing a matrix with so many lines/columns already takes a lot of time. In other words, I hope that decreasing the memory requirements will not increase the processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in a TList. Sounds very good. Obviously there will be no problem allocating such small memory blocks. But I am afraid it will have a gigantic impact on speed. I currently use
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix = array of array of integer (because of the way the data is placed in memory). So, breaking the array into independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.
If the absolute values of the elements of CurMx fit in a Word, then you can store them in a Word and use another array of Boolean for the sign. That saves 1 byte per element.
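That packing might look like this in Go (a sketch; the names are made up): magnitudes in a uint16 slice, signs in a parallel bool slice, 3 bytes of payload per element instead of 4.

```go
package main

import "fmt"

// SignedWords stores values whose magnitude fits in 16 bits but which
// may be negative: 2 bytes of magnitude plus 1 byte of sign per element.
type SignedWords struct {
	mag []uint16
	neg []bool
}

func NewSignedWords(n int) *SignedWords {
	return &SignedWords{make([]uint16, n), make([]bool, n)}
}

func (s *SignedWords) Set(i, v int) {
	if v < 0 {
		s.neg[i] = true
		v = -v
	} else {
		s.neg[i] = false
	}
	s.mag[i] = uint16(v)
}

func (s *SignedWords) Get(i int) int {
	v := int(s.mag[i])
	if s.neg[i] {
		return -v
	}
	return v
}

func main() {
	s := NewSignedWords(4)
	s.Set(0, -40000) // too big for SmallInt, negative so Word won't do
	s.Set(1, 65535)
	fmt.Println(s.Get(0), s.Get(1)) // -40000 65535
}
```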
Have you considered to manually allocate the data structure on the heap?
...and measured how this will affect the memory usage and the performance?
Using the heap might actually increase speed and reduce memory usage, because you can avoid the whole array being copied from one memory segment to another (e.g. if your FillMatrixWithData is declared with a non-const open array parameter).
I have a function inside a loop inside a function. The inner function acquires and stores a large vector of data in memory (as a global variable... I'm using "R" which is like "S-Plus"). The loop loops through a long list of data to be acquired. The outer function starts the process and passes in the list of datasets to be acquired.
for (dataset in list_of_datasets) {
    for (datachunk in dataset) {
        <process datachunk>
        <store result? as vector? where?>
    }
}
I programmed the inner function to store each dataset before moving to the next, so all the work of the outer function occurs as side effects on global variables... a big no-no. Is this better or worse than collecting and returning a giant, memory-hogging vector of vectors? Is there a superior third approach?
Would the answer change if I were storing the data vectors in a database rather than in memory? Ideally, I'd like to be able to terminate the function (or have it fail due to network timeouts) without losing all the information processed prior to termination.
Use variables in the outer function instead of global variables. This gets you the best of both approaches: you're not mutating global state, and you're not copying a big wad of data. If you have to exit early, just return the partial results.
(See the "Scope" section in the R manual: http://cran.r-project.org/doc/manuals/R-intro.html#Scope)
Remember your Knuth. "Premature optimization is the root of all evil."
Try the side effect free version. See if it meets your performance goals. If it does, great, you don't have a problem in the first place; if it doesn't, then use the side effects, and make a note for the next programmer that your hand was forced.
It's not going to make much difference to memory use, so you might as well make the code clean.
Since R has copy-on-modify for variables, modifying the global object will have the same memory implications as passing something up in return values.
If you store the outputs in a database (or even in a file) you won't have the memory use issues, and the data will be incrementally available as it is created, rather than only at the end. Whether it's faster with the database depends primarily on how much memory you are using: will the reduction in garbage collection pay for the cost of writing to disk?
There are both time and memory profilers in R, so you can see empirically what the impacts are.
FYI, here's a full sample toy solution that avoids side effects:
outerfunc <- function(names) {
    templist <- list()
    for (aname in names) {
        templist[[aname]] <- innerfunc(aname)
    }
    templist
}

innerfunc <- function(aname) {
    retval <- NULL
    if ("one" %in% aname) retval <- c(1)
    if ("two" %in% aname) retval <- c(1, 2)
    if ("three" %in% aname) retval <- c(1, 2, 3)
    retval
}
names <- c("one","two","three")
name_vals <- outerfunc(names)
for (name in names) assign(name, name_vals[[name]])
I'm not sure I understand the question, but I have a couple of solutions.
Inside the function, create a list of the vectors and return that.
Inside the function, create an environment and store all the vectors inside of that. Just make sure that you return the environment in case of errors.
in R:
help(environment)
# You might do something like this:
outer <- function(datasets) {
# create the return environment
ret.env <- new.env()
for(set in dataset) {
tmp <- inner(set)
# check for errors however you like here. You might have inner return a list, and
# have the list contain an error component
assign(set, tmp, envir=ret.env)
}
return(ret.env)
}
#The inner function might be defined like this
inner <- function(dataset) {
# I don't know what you are doing here, but lets pretend you are reading a data file
# that is named by dataset
filedata <- read.table(dataset, header=T)
return(filedata)
}
leif
Third approach: the inner function returns a reference to the large array, which the next statement inside the loop then dereferences and stores wherever it's needed (ideally with a single pointer store, not by memcopying the entire array).
This gets rid of both the side effect and the passing of large datastructures.
It's tough to say definitively without knowing the language/compiler used. However, if you can simply pass a pointer/reference to the object that you're creating, then the size of the object itself has nothing to do with the speed of the function calls. Manipulating this data down the road could be a different story.