dask equivalent of df.loc[df.index.intersection(mylabels)]

When I run df.loc[mylabels] in dask I get a warning with the link to
Warning Starting in 0.21.0, using .loc or [] with a list with one or more missing labels, is deprecated, in favor of .reindex *
This page also says:
Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.
In [106]: labels = [1, 2, 3]
In [107]: s.loc[s.index.intersection(labels)]
Out[107]:
1 2
2 3
dtype: int64
Dask indexes do not have an intersection method.
So what is the recommended way to achieve the above effect in dask?
The problem with df.loc[mylabels] is that mylabels contains items not in df.index.

For now it looks like you should continue calling df.loc[labels].
It looks like things have changed upstream and probably dask.dataframe needs to follow a bit. I recommend submitting a bug report to https://github.com/dask/dask/issues/new

reduce_max function in tensorflow

>>> boxes = tf.random_normal([ 5])
>>> with s.as_default():
... s.run(boxes)
... s.run(keras.backend.argmax(boxes,axis=0))
... s.run(tf.reduce_max(boxes,axis=0))
...
array([ 0.37312034, -0.97431135, 0.44504794, 0.35789603, 1.2461706 ],
dtype=float32)
3
0.856236
Why am I getting 0.8562? I expected the value to be 1.2461, since 1.2461 is the largest, right?
I get the correct answer if I use tf.constant.
But I do not get the correct answer when using random_normal.
A new boxes tensor is regenerated each time you call s.run() with random_normal, so your three results come from three different samples. If you want consistent results, you should call s.run() only once.
result = s.run([boxes,keras.backend.argmax(boxes,axis=0),tf.reduce_sum(boxes,axis=0)])
print(result[0])
print(result[1])
print(result[2])
#print
[ 0.69957364 1.3192859 -0.6662426 -0.5895929 0.22300807]
1
0.9860319
In addition, the code should be given in text format rather than picture format.
TensorFlow is different from numpy because TF only uses symbolic operations. That means when you instantiate the random_normal, you don't get numeric values, but a symbolic normal distribution, so each time you evaluate it, you get different numbers.
Each time you operate with this distribution, with any other operation, you are getting different numbers, and that explains the results you see.
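The re-sampling behaviour described above can be illustrated without TensorFlow. The following is only a plain-Python analogy: the RandomNormal class and its evaluate method are hypothetical stand-ins for a symbolic random op and s.run(), not part of any TensorFlow API.

```python
import random


class RandomNormal:
    """Hypothetical stand-in for a symbolic random op.

    It stores no numbers, only the recipe for drawing them.
    """

    def __init__(self, size, seed=None):
        self.size = size
        self._rng = random.Random(seed)

    def evaluate(self):
        # Every evaluation draws a fresh sample, like a separate s.run() call.
        return [self._rng.gauss(0.0, 1.0) for _ in range(self.size)]


boxes = RandomNormal(5, seed=0)

# Two separate evaluations see two different samples.
first = boxes.evaluate()
second = boxes.evaluate()

# Evaluating once and deriving everything from that one sample is consistent.
sample = boxes.evaluate()
largest = max(sample)
position = sample.index(largest)
```

Here largest and position always describe the same sample, which is the effect of passing all three tensors to a single s.run() call.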

How do I sum the product of two values, across multiple objects in Rails?

Imagine I have a portfolio p that has 2 stocks port_stocks. What I want to do is run a calculation on each port_stock, and then sum up all the results.
[60] pry(main)> p.port_stocks
=> [#<PortStock:0x00007fd520e064e0
id: 17,
portfolio_id: 1,
stock_id: 385,
volume: 2000,
purchase_price: 5.9,
total_spend: 11800.0>,
#<PortStock:0x00007fd52045be68
id: 18,
portfolio_id: 1,
stock_id: 348,
volume: 1000,
purchase_price: 9.0,
total_spend: 9000.0>]
[61] pry(main)>
So, in essence, using the code above I would like to do this:
ps = p.port_stocks.first #(`id=17`)
first = ps.volume * ps.purchase_price # 2000 * 5.9 = 11,800
ps = p.port_stocks.second #(`id=18`)
second = ps.volume * ps.purchase_price # 1000 * 9.0 = 9,000
first + second = 19,800
I want to simply get 19,800. Ideally I would like to do this in a very Ruby way.
If I were simply summing up all the values in 1 total_spend, I know I could simply do: p.port_stocks.map(&:total_spend).sum and that would be that.
But not sure how to do something similar when I am first doing a math operation on each object, then adding up all the products from all the objects. This should obviously work for 2 objects or 500.
The best way of doing this using Rails is to pass a block to sum, such as the following:
p.port_stocks.sum do |port_stock|
port_stock.volume * port_stock.purchase_price
end
That uses the method dedicated to totalling figures, and tends to be very fast and efficient - particularly when compared to manipulating the data ahead of calling a straight sum without a block.
A quick benchmark here typically shows it performing ~20% faster than the obvious alternatives.
I've not been able to test, but give that a try and it should resolve this for you.
Let me know how you get on!
Just a quick update as you also mention the best Ruby way, sum was introduced in 2.4, though on older versions of Ruby you can use reduce (also aliased to inject):
p.port_stocks.reduce(0) do |sum, port_stock|
sum + (port_stock.volume * port_stock.purchase_price)
end
This isn't as efficient as sum, but thought I'd give you the options :)
You are right to use Array#map to iterate through all stocks, but instead of summing all total_spend values, you could calculate the product for each stock. After that, you sum all the results and you're done:
p.port_stocks.map{|ps| ps.volume * ps.purchase_price}.sum
Or you could use Enumerable#reduce like SRack did. This would return the result with one step/iteration.

How should I get the values from a Folsom "slide" histogram?

I'm trying out Folsom for metrics generation in Erlang.
I've created a histogram (slide), how can I get the values out of it? I'm using
(test@SebMaynardSL2)1> folsom:start().
(test@SebMaynardSL2)2> MyMetric = "mymetric".
(test@SebMaynardSL2)3> folsom_metrics:new_histogram(MyMetric, slide).
And tried putting some values in it:
(test@SebMaynardSL2)4> [ folsom_metrics:notify({MyMetric, V}) || V <- lists:seq(1, 10) ].
But getting the values out (with folsom_metrics:get_metric_value/1) seems to return the results in rather a strange order:
(test@SebMaynardSL2)5> folsom_metrics:get_metric_value(MyMetric).
[4,5,8,9,3,10,2,7,6,1]
If I wait a while (60 seconds, the default slide window time) and do it again, I don't necessarily end up with a metric value in the same order.
How should I get the values out of Folsom to use for (say) graph generation? I did consider putting {now(), V} instead of just V in my notify, and then sorting the returned result set by the first tuple value, but it seems odd that the results are coming back (or rather, are getting written) in a weird order, and Folsom is keeping track of the time of events anyway (to make it "slide").
This is using Folsom 0.7.4, with Erlang R16B
Thanks!
Rather oddly, after a fresh clone and checking out tag 0.7.4 again, running the commands from the question gives the results in the correct order:
(test@SebMaynardSL2)5> folsom_metrics:get_metric_value(MyMetric).
[1,2,3,4,5,6,7,8,9,10]
Perhaps this wasn't an issue after all. No idea why it was generating such oddities the other day.
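If the order ever does come back scrambled, the workaround considered in the question (tag each value when it is recorded, then sort on the tag) is straightforward. The sketch below shows the idea in Python rather than Erlang; the tag helper is hypothetical, and it uses a sequence counter where the question proposed now().

```python
import itertools
import random

_sequence = itertools.count()


def tag(value):
    # Hypothetical helper: pair each value with a monotonically increasing tag.
    return (next(_sequence), value)


# Record ten values, then shuffle to simulate storage that loses insertion order.
stored = [tag(v) for v in range(1, 11)]
random.shuffle(stored)

# Sorting on the tag restores the order in which the values were recorded.
ordered = [value for _, value in sorted(stored)]
```

A counter avoids ties between values recorded within the same clock tick, which a timestamp alone would not guarantee.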

matlab indexing into nameless matrix [duplicate]

For example, if I want to read the middle value from magic(5), I can do so like this:
M = magic(5);
value = M(3,3);
to get value == 13. I'd like to be able to do something like one of these:
value = magic(5)(3,3);
value = (magic(5))(3,3);
to dispense with the intermediate variable. However, MATLAB complains about Unbalanced or unexpected parenthesis or bracket on the first parenthesis before the 3.
Is it possible to read values from an array/matrix without first assigning it to a variable?
It actually is possible to do what you want, but you have to use the functional form of the indexing operator. When you perform an indexing operation using (), you are actually making a call to the subsref function. So, even though you can't do this:
value = magic(5)(3, 3);
You can do this:
value = subsref(magic(5), struct('type', '()', 'subs', {{3, 3}}));
Ugly, but possible. ;)
In general, you just have to change the indexing step to a function call so you don't have two sets of parentheses immediately following one another. Another way to do this would be to define your own anonymous function to do the subscripted indexing. For example:
subindex = @(A, r, c) A(r, c); % An anonymous function for 2-D indexing
value = subindex(magic(5), 3, 3); % Use the function to index the matrix
However, when all is said and done the temporary local variable solution is much more readable, and definitely what I would suggest.
There was a good blog post on Loren on the Art of MATLAB a couple of days ago with a couple of gems that might help. In particular, using helper functions like:
paren = @(x, varargin) x(varargin{:});
curly = @(x, varargin) x{varargin{:}};
where paren() can be used like
paren(magic(5), 3, 3);
would return
ans = 13
I would also surmise that this will be faster than gnovice's answer, but I haven't checked (Use the profiler!!!). That being said, you also have to include these function definitions somewhere. I personally have made them independent functions in my path, because they are super useful.
These functions and others are now available in the Functional Programming Constructs add-on which is available through the MATLAB Add-On Explorer or on the File Exchange.
How do you feel about using undocumented features:
>> builtin('_paren', magic(5), 3, 3) %# M(3,3)
ans =
13
or for cell arrays:
>> builtin('_brace', num2cell(magic(5)), 3, 3) %# C{3,3}
ans =
13
Just like magic :)
UPDATE:
Bad news, the above hack doesn't work anymore in R2015b! That's fine, it was undocumented functionality and we cannot rely on it as a supported feature :)
For those wondering where to find this type of thing, look in the folder fullfile(matlabroot,'bin','registry'). There's a bunch of XML files there that list all kinds of goodies. Be warned that calling some of these functions directly can easily crash your MATLAB session.
At least in MATLAB 2013a you can use getfield like:
a=rand(5);
getfield(a,{1,2}) % etc
to get the element at (1,2)
Unfortunately, syntax like magic(5)(3,3) is not supported by MATLAB. You need to use temporary intermediate variables. You can free up the memory after use, e.g.
tmp = magic(3);
myVar = tmp(3,3);
clear tmp
Note that if you compare running times with the standard way (assign the result and then access entries), they are practically the same.
subs=@(M,i,j) M(i,j);
>> for nit=1:10;tic;subs(magic(100),1:10,1:10);tlap(nit)=toc;end;mean(tlap)
ans =
0.0103
>> for nit=1:10,tic;M=magic(100); M(1:10,1:10);tlap(nit)=toc;end;mean(tlap)
ans =
0.0101
In my opinion, the bottom line is: MATLAB does not have pointers, and you have to live with it.
It could be more simple if you make a new function:
function [ element ] = getElem( matrix, index1, index2 )
element = matrix(index1, index2);
end
and then use it:
value = getElem(magic(5), 3, 3);
Your initial notation is the most concise way to do this:
M = magic(5); %create
value = M(3,3); % extract useful data
clear M; %free memory
If you are doing this in a loop you can just reassign M every time and ignore the clear statement as well.
To complement Amro's answer, you can use feval instead of builtin. There is no difference, really, unless you try to overload the operator function:
BUILTIN(...) is the same as FEVAL(...) except that it will call the
original built-in version of the function even if an overloaded one
exists (for this to work, you must never overload
BUILTIN).
>> feval('_paren', magic(5), 3, 3) % M(3,3)
ans =
13
>> feval('_brace', num2cell(magic(5)), 3, 3) % C{3,3}
ans =
13
What's interesting is that feval seems to be just a tiny bit quicker than builtin (by ~3.5%), at least in Matlab 2013b, which is weird given that feval needs to check if the function is overloaded, unlike builtin:
>> tic; for i=1:1e6, feval('_paren', magic(5), 3, 3); end; toc;
Elapsed time is 49.904117 seconds.
>> tic; for i=1:1e6, builtin('_paren', magic(5), 3, 3); end; toc;
Elapsed time is 51.485339 seconds.

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
# Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. Thus my advice is more general, so feel free to clarify and I will try to give a more concrete response.
1) I see that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person only checks 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame with 0s, so what you need to do is to get 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to indexes in z variable
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (j in 1:responses) { #j, so the outer loop's i is not reused
e[j,] <- ifelse(c %in% b[[j]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.
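For comparison, the same transformation can be sketched outside R. The Python below is only an illustration under stated assumptions: expand_multiselect is a hypothetical helper, rows are assumed to be dictionaries of strings (as csv.DictReader would yield), and multi-select columns are assumed to end in _M as in the question. Like the R loop above, it writes 0 rather than a blank for an unanswered question.

```python
def expand_multiselect(rows, suffix="_M"):
    """Expand space-delimited multi-select columns into 0/1 indicator columns."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Collect the distinct response codes seen in each multi-select column.
    codes = {
        col: sorted({int(tok) for row in rows for tok in row[col].split()})
        for col in columns
        if col.endswith(suffix)
    }
    expanded = []
    for row in rows:
        out = {}
        for col in columns:
            if col in codes:
                chosen = {int(tok) for tok in row[col].split()}
                # One indicator column per observed code, e.g. Q2_M_88.
                for code in codes[col]:
                    out[f"{col}_{code}"] = 1 if code in chosen else 0
            else:
                out[col] = row[col]
        expanded.append(out)
    return expanded


# The two example rows from the question.
rows = [
    {"Q1": "6", "Q2_M": "1 2 88", "Q3": "3", "Q4_M": "3 5 99"},
    {"Q1": "6", "Q2_M": "", "Q3": "3", "Q4_M": "1 2"},
]
result = expand_multiselect(rows)
```

The indicator columns are emitted in place of the original _M column, so the header order matches the target layout in the question.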
