Efficiently Reorganize or Reference Large Data in MATLAB - memory

I am currently bringing large (tens of GB) data files into Matlab using memmapfile. The file I'm reading in is structured with several fields describing the data that follows it. Here's an example of how my format might look:
m.format = { 'uint8' [1 1024] 'metadata'; ...
'uint8' [1 500000] 'mydata' };
m.repeat = 10000;
So, I end up with a structure m where one sample of the data is addressed like this:
single_element = m.data(745).mydata(26);
I want to think of this data as a matrix of, from the example, 10,000 x 500,000. Indexing individual items in this way is not difficult though somewhat cumbersome. My real problem arises when I want to access e.g. the 4th column of every row. MATLAB will not allow the following:
single_column = m.data(:).mydata(4);
I could write a loop to slowly piece this whole thing into an actual matrix (I don't care about the metadata by the way), but for data this large it's hard to overemphasize how prohibitively slow that will be... not to mention the fact that it will double the memory required. Any ideas?

Simply map it to a matrix:
m.format = { 'uint8' [1024 500000] 'x' };
m.Data(1).x will be you data matrix.

Related

Does xarray.Dataset.to_array() load the array into memory and how efficiently sample mini batches from an xarray?

I am currently trying to load a big multi-dimensional array (>5 GB) into a python script. Since I use the array as training data for a machine learning model, it is important to efficiently load the data in mini batches but avoid loading the whole data set in memory once.
My idea was to use the xarray library.
I load the data set with X=xarray.open_dataset("Test_file.nc"). To the best of my knowledge, this command does not load the data set in memory - so far, so good. However, I want to convert X to an array with the command X=X.to_array().
My first question is: Does X=X.to_array() load it into memory or not?
If that is done, I wonder how to best load minibatches in memory. The shape of the array is (variable,datetime,x1_position,x2_position). I want to load minibatches per datetime, which would lead to:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[:,ind]
The other approach would be to transpose the array before with X.transpose("datetime","variable","x1_position","x2_position") and then sample via:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[ind,:]
My second question is:
Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
My first question is: Does X=X.to_array() load it into memory or not?
xarray makes use of dask to chunk (load) parts of the data into memory. You can compare X through
X = xarray.open_dataset("Test_file.nc")
# or
X = xarray.open_dataset("Test_file.nc",
chunks={'datetime':1, 'x1_position':x1_count, 'x2_position':x2_count})
and see (print(X)) the differences between loaded datasets, or specify the chunks accordingly.
The latter way means chunking (load) only one datetime slice data into memory. I don't think you need X=X.to_array() but you can also compare the results after to_array(). My experience is that to_array() does not change the actual chunking (loading) but just the view of the data.
My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
I think one goal of xarray is to let users forget the details of the underlying implementation (based on numpy). Transposing may only modify the view rather than the underlying structure of the data. There certainly are some efficiency differences between the two indexing ways, depending on which one is accessing data along contiguous memory. But such difference would not be overhead. Feel free to use both.

How do I speedup adding two big vectors of tuples?

Recently, I am implementing an algorithm from a paper that I will be using in my master's work, but I've come across some problems regarding the time it is taking to perform some operations.
Before I get into details, I just want to add that my data set comprehends roughly 4kk entries of data points.
I have two lists of tuples that I've get from a framework (annoy) that calculates cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two of that lists with the same names (first value of the tuple) in it, but two different cosine similarities. What I have to do is to sum the cosines from both lists, and then order the array and get the top-N highest cosine values.
My issue is: is taking too long. My actual code for this implementation is as following:
def topN(self, user, session):
upref = self.m2vTN.get_user_preference(user)
spref = self.sm2vTN.get_user_preference(session)
# list of tuples 1
most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
# list of tuples 2
most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
# concat both lists and add into a dict
d = defaultdict(int)
for l, v in (most_ss + most_su):
d[l] += v
# convert the dict into a list, and then sort it
_list = list(d.items())
_list.sort(key=lambda x: x[1], reverse=True)
return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads but I'm not sure if it will make it faster. Getting the lists is not the problem here, but the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy-indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.
Further, if you're going to combine all of the distances from two such calculation, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector-models, you can supply a False-ish topn value to just get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)

Fortran entries of array change seemingly at random

I have been working with a FORTRAN program. I have noticed seemingly random changes in a 1D matrix I'm working with. It is a matrix of 4000 integers. Values are added to the matrix one by one, starting with index 1 and iterating by 1 for each added value. The matrix does not get fully "filled", currently only 100 values are placed into the matrix. So one would expect that the first 100 entries of the matrix will be non-zero (all added values are non-zero) and the remaining 3900 entries will be 0. However, several of the entries of the matrix end up being large negative numbers, but I'm certain that no portion of my code touches these entries.
What could be causing this issue? I'm sorry but I can't post the code for you all to work with.
The code has several other large matrices, taking up a total of ~100 MB of space. Could this potentially be a memory issue?
Thanks!
You have to initialize your array, otherwise it will almost always contain garbage. This would do it:
array = 0.0e0 ! real array
or
array = 0.0e0 ! double precision
or
array = 0 ! integer
A "matrix" is two-dimensional; your array is one-dimensional.
Things do not change unless you ask them to change.
FORTRAN does not initialize variables other than (as I recall) in a labeled COMMON. As such, they are guaranteed to start out with garbage values. Try initializing your data with a DATA statement. If you have to initialize a labeled COMMON, you will have to supply a BLOCK DATA subprogram.

looping on flattened fortran matrices

I have a bit of code that looks like this:
DO I=0,500
arg1((I*54+1):(I*54+54)) = premultz*sinphi(I+1)
ENDDO
In short, I have an array premultz of dimension 54. I have an array sinphi of dimension 501. I want to take the first value of of sinphi times all the entries of premultz and store it in the first 54 entries of arg1, then the second value of of sinphi times all the entries of premultz and store it in the second54 entries of arg1, and so on.
These are flattened matrices. I have flattened them in the interest of speed, as one of the primary goals of this project is very fast code.
My question is this: is there a more efficient way of coding this sort of calculation in Fortran90? I know that Fortran has a lot of nifty array operations that can be done that I'm not fully aware of.
Thanks in advance.
This expression, if I've got things right, ought to create arg1 in one statement
arg1 = reshape(spread(premultz,dim=2,ncopies=501)*&
&spread(sinphi,dim=1,ncopies=54),[1,54*501])
I've hardwired the dimensions here, that may or may not suit your purposes. The inner expression generates the outer product of premultz and sinphi, which is then reshaped into a vector. You may find you need to reshape the transpose of the outer product, I haven't checked things very carefully.
However, based on my experience with this sort of clever use of Fortran's array intrinsics I doubt that this, or most other clever uses of Fortran's array intrinsics, will outperform the straightforward loop implementation you already have. For many of these operations the compiler is going to generate copies of arrays, and copying data is relatively expensive. Of course, this is an assertion you may want to test.
I'll leave it to you to decide if the one-liner is more comprehensible than the loops. Sometimes the expressivity of the array syntax comes at an acceptable cost in performance, sometimes it doesn't.

How to display the results of multiple comparisons

If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns, or two panes, such as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying differences between 2 files is relatively easy, as I mentioned WinMerge, whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document that have been created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 comparisons (n * (n-1) / 2)
I have a different view than some of those who provided Answers--i.e., that you need to further specify the problem. The abstraction level is about right. Further specification would make the problem easier, but the solution less useful.
A couple of years ago, i saw a graphic on ProgrammableWeb--it compared the results from a search on Yahoo with the results from the same search on Google. There's a lot of information to covey: some results are in both sets, some in just one, and the common results will have different positions in the respective engine's results, which somehow has to be shown.
I like the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points as well as python code i used to generate it:
from matplotlib import pyplot as PLT
xvals = NP.array([(2,3), (5,7), (8,6), (1.5,1.8), (3.0,3.8), (5.3,5.2),
(3.7,4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile( NP.array([5,3]), [10,1] )
fig = PLT.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, "-", lw=3, color='b')
ax1.plot(x, y2, "-", lw=3, color='b')
for a, b in zip(xvals, yvals) : ax1.plot(a,b,'-o',ms=8,mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order in the Results List). For other types of data, you might have to assign a score to each data point--a similarity metric might i suppose (in a sense, that's actually what the search result order is, an distance from the top of the list)
Since there has been so much work into displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then using whatever you want to show a diff between those text formats.
But you should tell us more about your data sets!
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter, you should specify what type your data is and what you wish to bring out in the comparison.
Depending on the nature of the data/comparison you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e. fine grain or gross comparison?
Examples:
Visualizing a comparison of unordered data could just be plotting the two histograms of your sets (i.e. distributions):
image source
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out visual complexity, it's a great resource for interesting visualization.

Resources