Clean way to loop over a masked list in Julia - foreach

In Julia, I have a list of the neighbors of a location stored in all_neighbors[loc]. This lets me loop over the neighbors conveniently with the syntax for neighbor in all_neighbors[loc], which leads to readable code such as the following:
active_neighbors = 0
for neighbor in all_neighbors[loc]
    if cube[neighbor] == ACTIVE
        active_neighbors += 1
    end
end
Astute readers will see that this is nothing more than a reduction. Because I'm just counting active neighbors, I figured I could do this in a one-liner using the count function. However,
# This does not work
active_neighbors = count(x->x==ACTIVE, cube[all_neighbors[loc]])
does not work because the all_neighbors mask doesn't get interpreted correctly as simply a mask over the cube array. Does anyone know the cleanest way to write this reduction? An alternative solution I came up with is:
active_neighbors = count(x->x==ACTIVE, [cube[all_neighbors[loc][k]] for k = 1:length(all_neighbors[loc])])
but I really don't like this because it's even less readable than what I started with. Thanks for any advice!

This should work:
count(x -> cube[x] == ACTIVE, all_neighbors[loc])
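The key point is that count iterates over the neighbor indices themselves and looks up cube one neighbor at a time inside the predicate, the same scalar indexing the explicit loop uses, so there is no need to form cube[all_neighbors[loc]] at all. A minimal self-contained sketch with toy data (not from the question):
const ACTIVE = 1
cube = [1, 0, 1, 1, 0]                              # toy state array
all_neighbors = [[2, 3], [1, 3], [1, 2, 4], [3, 5], [4]]  # toy neighbor lists
loc = 3
count(n -> cube[n] == ACTIVE, all_neighbors[loc])   # cube[1] and cube[4] are ACTIVE, so == 2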

Related

Performing an "online" linear interpolation

I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in MATLAB, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated MATLAB-specific/built-in functions.
My implementation initially seems OK, but when checking my work against MATLAB's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset that has already been fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
    kk = kk+1; %update the data counter
    %perform online interpolation
    if cc<length(qp)-1
        if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
            y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
        end
        if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
            kk = kk+1;
        else %otherwise, update the query point to ensure it's in between the samples for the next iteration
            cc = cc + 1;
            qpi = qp(cc);
            %It is possible that if the change in x_data is greater than the resolution of the query
            %points, an update like the above won't work. In this case, we must lag the data
            if qpi<x_data(kk),
                kk = kk-1;
            end
        end
    end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
    %linearly interpolate the function value y(x) over the query point qp
    yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which indexes the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1). (You can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists; I simply find starting at kk=1 more intuitive.)
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test that x_data(kk+1) exists, nor that x_data(1)<qp(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk = 1; % indexes through the data
for cc = 1:numel(qp)
    % advance kk to where we can interpolate
    % (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
    % but needs to be adjusted if this is not ensured prior to the loop)
    while x_data(kk+1) < qp(cc)
        kk = kk + 1;
    end
    % perform online interpolation
    y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
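If you would rather not restrict qp up front, one possible sketch of the extra bounds tests mentioned above (my own variation, not part of this answer's code) is to skip query points that fall outside the sampled range, leaving those entries of y_interp as NaN:
% Variation (sketch): keep the original qp and skip out-of-range query points.
y_interp = qp*nan;
kk = 1;
for cc = 1:numel(qp)
    if qp(cc) < x_data(1) || qp(cc) > x_data(end)
        continue; % outside the data range: leave y_interp(cc) as NaN
    end
    while x_data(kk+1) < qp(cc)
        kk = kk + 1; % safe: qp(cc) <= x_data(end), so kk+1 never runs past the data
    end
    y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end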

knitr_out, file_out and vis_drake_graph usage in R drake

I'm trying to understand how to use knitr_out, file_out and vis_drake_graph properly in drake.
I have two questions.
Q1: Usage of knitr_out and file_out to create markdown reports
While code like this works correctly for one of my smaller projects:
make_hyp_data_aggregated_report <- function() {
  render(
    input = knitr_in("rmd/hyptest-is-data-being-aggregated.Rmd"),
    output_file = file_out("~/projectname/reports/01-hyp-test.html"),
    quiet = TRUE
  )
}
plan <- drake_plan(
  ...
  ...
  hyp_data_aggregated_report = make_hyp_data_aggregated_report()
  ...
  ...
)
Almost identical code in my larger project (with 10+ reports) doesn't work quite right: the reports get built, but the knitr_in objects don't get displayed as blue squares in the graph produced by drake::vis_drake_graph().
Both projects use drake::loadd(...) within the markdown to get the objects from the cache.
Is there some code in vis_drake_graph that removes these squares once the graph gets busy?
Q2: file_out objects in vis_drake_graph
Is there a way to display the file_out objects themselves as circles/squares in vis_drake_graph?
Q3: packages showing up in vis_drake_graph
Is there a way to avoid vis_drake_graph from printing the packages explicitly? (Basically anything with the ::)
Q1
Every literal file path needs its own knitr_in() or file_out(). If you have one function with one knitr_in(), even if you use the function multiple times, that still only counts as one file path. I recommend writing these keywords at the plan level, e.g.
plan <- drake_plan(
  r1 = render(knitr_in("report1.Rmd"), output_file = file_out("report1.html")),
  r2 = render(knitr_in("report2.Rmd"), output_file = file_out("report2.html")),
  r3 = render(knitr_in("report3.Rmd"), output_file = file_out("report3.html"))
)
Q2
They should appear unless you set show_output_files = FALSE in vis_drake_graph().
Q3
No, but if it's any consolation, I do regret the decision to track namespaced functions and objects at all in drake. drake's approach is fundamentally suboptimal for tracking packages, and I plan to get rid of it if there ever comes time for a round of breaking changes. Otherwise, there is no way to get rid of it except vis_drake_graph(targets_only = TRUE), which also gets rid of all the imports in the graph.
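For reference, a quick sketch of the two settings mentioned above (sketch only; depending on your drake version, the graph functions may also take a drake_config() object as their first argument):
vis_drake_graph(show_output_files = TRUE)  # file_out() files drawn as nodes (the default)
vis_drake_graph(targets_only = TRUE)       # drop all imports, including anything with ::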

How to share the same index among multiple dask arrays

I'm trying to build a dask-based IPython application that holds a meta-class consisting of some sub-dask-arrays (all shaped (n_samples, dim_1, dim_2, ...)), and that should be able to slice the sub-dask-arrays via its getitem operator.
In the getitem method, I call the da.Array.compute method (the code is still in its very early state) so that I can iterate over batches of the sub-arrays.
import dask.array as da

class MetaClass(object):
    ...
    def __getitem__(self, inds):
        new_m = MetaClass()
        inds = inds.compute()
        for name, var in vars(self).items():
            if isinstance(var, da.Array):
                try:
                    setattr(new_m, name, var[inds])
                except Exception as e:
                    print(e)
            else:
                setattr(new_m, name, var)
        return new_m

# Here I construct the meta-class to work with some directory.
m = MetaClass('/my/data/...')
# m.type is one of the sub-dask-arrays
m2 = m[m.type == 2]
It works as expected and I get the sliced arrays, but memory consumption is huge; I assume that behind the scenes dask is copying the index for each sub-dask-array.
My question is: how do I achieve the same result without using so much memory?
(I tried not to "compute" the inds in getitem, but then I get arrays with NaN shapes, which cannot be iterated, and iteration is a must for the application.)
I have been thinking about three possible solutions, and I'd be happy to be advised which of them is the "right" one for me (or to be pointed to another solution I haven't thought of):
1. Use a Dask DataFrame, though I'm not sure how to fit multidimensional dask arrays into one (I would really appreciate some help, or even a link that explains how to deal with multidimensional arrays in dd).
2. Forget about the MetaClass entirely and use one dask array with a structured dtype (something like [("type",int,(1,)),("images",np.uint8,(1000,1000))]); again, I'm not familiar with this and would really appreciate some help (I tried to google it, but it's a bit complicated).
3. Share the index as a global inside the calling function (getitem) using property and its getter mechanism (https://docs.python.org/2/library/functions.html#property). The big downside here is that I lose the types of the arrays (a big drawback for representation and for everything that needs anything but the data itself).
Thanks in advance!!!
You can use map_blocks on the sub-arrays with a shared function that holds the indices in memory.
Here is an example:
import numpy as np

# `inds` is the shared boolean mask over the sample (first) axis; it is held
# in memory once and sliced per block inside bool_mask.
def bool_mask(arr, block_info=None):
    from_ind, to_ind = block_info[0]["array-location"][0]
    return arr[inds[from_ind:to_ind]]

def getitem(var):
    original_chunks = var.chunks[0]
    # cumulative chunk boundaries along the first axis
    tmp_inds = np.cumsum([0] + list(original_chunks))
    from_inds = tmp_inds[:-1]
    to_inds = tmp_inds[1:]
    # each output chunk keeps as many rows as the mask has True entries in that chunk
    new_chunks_0 = np.array(list(map(lambda f, t: inds[f:t].sum(), from_inds, to_inds)))
    new_chunks = tuple([tuple(new_chunks_0.tolist())] + list(var.chunks[1:]))
    return var.map_blocks(bool_mask, dtype=var.dtype, chunks=new_chunks)
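A rough usage sketch (the array and mask names are made up, not from the question), assuming inds is a boolean NumPy mask over the sample axis shared by all sub-arrays:
import dask.array as da
import numpy as np

images = da.random.random((100, 16, 16), chunks=(10, 16, 16))  # stand-in sub-array
inds = np.random.rand(100) > 0.5     # shared boolean mask, held in memory once
masked = getitem(images)             # lazy: the mask is applied block by block
print(masked.compute().shape)        # (inds.sum(), 16, 16)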

Initialising hist function in Julia for use in a loop

When I use a loop, variables need to be initialised before the loop in order to be accessible outside of it. For example:
Y = Array{Int}()
for i = 1:10
    Y = i
end
Since I have initialised Y before entering the loop, I can access it later by typing
Y
If I had not initialised it before entering the loop, typing Y would not have returned anything.
I want to extend this functionality to the output of the 'hist' function. I don't know how to set up the empty hist output before the loop. The only workaround I have found is below.
yHistData = [hist(DataSet[1],Bins)]
for j = 2:NumberOfLayers
    yHistData = [yHistData; hist(DataSet[j],Bins)]
end
Now when I access this later on by simply typing
yHistData
I get the correct values returned to me.
How can I initialise this hist data before entering the loop without defining it using the first value of the list I'm iterating over?
This can be done with a loop as follows:
yHistData = []
for j = 1:NumberOfLayers
    push!(yHistData, hist(DataSet[j], Bins))
end
push! modifies the array by adding the specified element to the end. This increases code speed because we do not need to create copies of the array all the time. This code is nice and simple, and runs faster than yours. The return type, however, is now Array{Any, 1}, which can be improved.
Here I have typed the array so that the performance when using this array in the future is better. Without typing the array, the performance is sometimes better and sometimes worse than your code, depending on NumberOfLayers.
yHistData = Tuple{FloatRange{Float64},Array{Int64,1}}[]
for j = 1:NumberOfLayers
    push!(yHistData, hist(DataSet[j], Bins))
end
Assuming length(DataSet) == NumberOfLayers, we can use anonymous functions to simplify the code even further:
yHistData = map(data -> hist(data, Bins), DataSet)
This solution is short, easy to read, and very fast on Julia 0.5. However, 0.5 is not yet released; on 0.4, the currently released version, this approach will be slower.
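An equivalent comprehension (my own variation on the same idea) reads just as well:
yHistData = [hist(data, Bins) for data in DataSet]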

Project Euler #3 Ruby Solution - What is wrong with my code?

This is my code:
def is_prime(i)
  j = 2
  while j < i do
    if i % j == 0
      return false
    end
    j += 1
  end
  true
end

i = (600851475143 / 2)
while i >= 0 do
  if (600851475143 % i == 0) && (is_prime(i) == true)
    largest_prime = i
    break
  end
  i -= 1
end

puts largest_prime
Why is it not returning anything? Is it too large of a calculation to go through all the numbers? Is there a simple way of doing it without utilizing the Ruby prime library (that defeats the purpose)?
All the solutions I found online were too advanced for me, does anyone have a solution that a beginner would be able to understand?
"premature optimization is (the root of all) evil". :)
Here you go right away for the factor that is both (1) the biggest and (2) prime. How about first finding all the factors, prime or not, and then taking the last (biggest) of them that is prime? Once that works, we can start optimizing it.
A factor a of a number n is such that there exists some b (we assume a <= b to avoid duplication) with a * b = n. But that means that for a <= b we also have a*a <= a*b == n.
So, for each b = n/2, n/2-1, ..., the potential corresponding factor is known automatically as a = n / b; there's no need to test a for divisibility at all ... and perhaps you can figure out which of the a values don't have to be tested for primality either.
Lastly, if p is the smallest prime factor of n, then the prime factors of n are p and all the prime factors of n / p. Right?
Now you can complete the task.
update: you can find more discussion and a pseudocode of sorts here. Also, search for "600851475143" here on Stack Overflow.
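As a rough sketch of that last observation (repeatedly dividing out the smallest factor; this is entirely my own code, not part of this answer's hints):
# Keep dividing n by its smallest remaining factor; once no factor f with
# f*f <= n divides n, whatever is left of n is its largest prime factor.
def largest_prime_factor(n)
  f = 2
  while f * f <= n
    if n % f == 0
      n /= f
    else
      f += 1
    end
  end
  n
end

puts largest_prime_factor(600851475143)  # => 6857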
I'll address not so much the answer, but how YOU can pursue the answer.
The most elegant troubleshooting approach is to use a debugger to get insight as to what is actually happening: How do I debug Ruby scripts?
That said, I rarely use a debugger -- I just stick in puts here and there to see what's going on.
Start with adding puts "testing #{i}" as the first line inside the loop. While the screen I/O will be a million times slower than a silent calculation, it will at least give you confidence that it's doing what you think it's doing, and perhaps some insight into how long the whole problem will take. Or it may reveal an error, such as the counter not changing, incrementing in the wrong direction, overshooting the break conditional, etc. Basic sanity check stuff.
If that doesn't set off a lightbulb, go deeper and puts inside the if statement. No revelations yet? Next puts inside is_prime(), then inside is_prime()'s loop. You get the idea.
Also, there's no reason in the world to start with 600851475143 during development! 17, 51, 100 and 1024 will work just as well. (And don't forget edge cases like 0, 1, 2, -1 and such, just for fun.) These will all complete before your finger is off the enter key -- or demonstrate that your algorithm truly never returns and send you back to the drawing board.
Use these two approaches and I'm sure you'll find your answers in a minute or two. Good luck!
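Putting both suggestions together, a quick instrumented test run (a sketch only, reusing the question's is_prime and a small test number) might look like:
n = 1024            # small test number first, per the advice above
i = n / 2
while i >= 1 do
  puts "testing #{i}"                      # sanity check: is i moving as expected?
  if (n % i == 0) && is_prime(i)
    puts "largest prime factor of #{n} is #{i}"
    break
  end
  i -= 1
end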
Do you know you can solve this with one line of code in Ruby?
require 'prime'
Prime.prime_division(600851475143).flatten.max
=> 6857
