Google sheets, summing over a formula? - google-sheets

Apologies if this is a little unclear, or this question has been asked already. It's a little difficult to explain, but I've bolded my question - it's essentially about shortening formulas.
I'm running a payment plan waterfall model. My code works, but, it's er, you know...
IF($K2=Q$1,Assumptions!$E$56*($N2+$O2),IF(AND($K2<Q$1,$M2>Q$1),Assumptions!$E$56*$P2/$L2,0)) + IF(edate($K2,1)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,1)<Q$1,edate($M2,1)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,2)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,2)<Q$1,edate($M2,2)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,3)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,3)<Q$1,edate($M2,3)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,4)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,4)<Q$1,edate($M2,4)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,5)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,5)<Q$1,edate($M2,5)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,6)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,6)<Q$1,edate($M2,6)>Q$1),Assumptions!$F$56*$P2/$L2,0))
...pretty long.
Essentially what's going on is: the assumption is that, when we launch a product, we sell, say, 80% in the first month and 2.5% in every subsequent month until we reach 100%.
I'd like the 80% and the 2.5% to be variables (listed as Assumptions!$E$56 and Assumptions!$F$56) here.
Obviously a little long. But I noticed that after the first IF clause, the subsequent ones are actually identical, the only difference being the number inside edate(__,2), edate(__,3)...
So my question is - can this code be tidied up into some sort of for loop? Python would make it pretty simple to increment over the variable edate(__,i) and sum over i = 1:6.

Sure, there is. Looping is usually emulated with SEQUENCE(N), which produces a vertical array of the numbers 1 to N, somewhat like your Python range. You can then operate on that array inside an ArrayFormula.
In your case, you end up with two terms: your initial term using $E, and all the looped stuff using $F. I see 6 terms, so I will use Sequence(6):
=IF(
  $K2=Q$1,
  Assumptions!$E$56*($N2+$O2),
  IF(
    AND(
      $K2<Q$1,
      $M2>Q$1
    ),
    Assumptions!$E$56*$P2/$L2,
    0
  )
) + ArrayFormula(SUM(
  IF(
    edate($K2,SEQUENCE(6))=Q$1,
    Assumptions!$F$56*($N2+$O2),
    IF(
      (edate($K2,SEQUENCE(6))<Q$1)*
      (edate($M2,SEQUENCE(6))>Q$1),
      Assumptions!$F$56*$P2/$L2,
      0
    )
  )
))
And if you want, you can give your Assumption values names using named ranges.
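Since the question mentions Python, here is a minimal sketch of the loop that the SEQUENCE formula emulates. The function names (edate, term, waterfall_cell) and parameter names are hypothetical; upfront stands for $N2+$O2 and spread for $P2/$L2, and the month-shifting helper only approximates the behaviour of the sheet's EDATE.

```python
import calendar
from datetime import date


def edate(start: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def term(start, end, col, rate, upfront, spread):
    """One IF(...) block: the upfront amount in the start month,
    the spread amount strictly between start and end, otherwise 0."""
    if start == col:
        return rate * upfront
    if start < col < end:
        return rate * spread
    return 0.0


def waterfall_cell(k, m, col, first_rate, later_rate, upfront, spread, n_later=6):
    """First term uses first_rate; the n_later shifted terms use later_rate."""
    total = term(k, m, col, first_rate, upfront, spread)
    for i in range(1, n_later + 1):  # plays the role of SEQUENCE(6)
        total += term(edate(k, i), edate(m, i), col, later_rate, upfront, spread)
    return total
```

Calling waterfall_cell once per (row, column) pair corresponds to filling the formula across the grid, with col playing the role of the header date in Q$1.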


Recurrence Relation tree method

I am currently having issues with figuring out some recurrence stuff, and since I have midterms about it coming up soon I could really use some help and maybe an explanation of how it works.
So I basically have pseudocode for solving the Tower of Hanoi
TOWER_OF_HANOI(n, FirstRod, SecondRod, ThirdRod)
    if n == 1
        move disk from FirstRod to ThirdRod
    else
        TOWER_OF_HANOI(n-1, FirstRod, ThirdRod, SecondRod)
        move disk from FirstRod to ThirdRod
        TOWER_OF_HANOI(n-1, SecondRod, FirstRod, ThirdRod)
And provided I understand how to write the relation (which, honestly I'm not sure I do...) it should be T(n) = 2T(n-1)+Ɵ(n), right? I sort of understand how to make a tree with fractional subproblems, but even then I don't fully understand the process that would give you the end solution of Ɵ(n) or Ɵ(n log n) or whatnot.
Thanks for any help, it would be greatly appreciated.
Let T(n) be the time complexity. Then T(n) = T(n-1) + T(n-1) + 1 = 2T(n-1) + 1. Why "+1" and not "+n"? Because "move disk from FirstRod to ThirdRod" costs you only one move.
For T(n) = 2T(n-1) + 1, the recursion tree looks exactly like this:
https://www.quora.com/What-is-the-complexity-of-T-n-2T-n-1-+-C (You might find it helpful, the image is neat.) C is a constant; it means the cost per operation. In the case of Tower of Hanoi, C = 1.
Sum the cost of each level: level i of the tree has 2^i nodes, each costing 1, so with T(1) = 1 the total is 1 + 2 + 4 + ... + 2^(n-1) = 2^n - 1, which is exponential (expensive). Therefore, the solution of this recurrence is Ɵ(2^n).
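The closed form can be checked empirically. The following is a small sketch, not part of the original answer: it transcribes the pseudocode into Python (the rod labels "A", "B", "C" are arbitrary) and returns the list of moves, so the count can be compared with 2^n - 1.

```python
def hanoi_moves(n, first="A", second="B", third="C"):
    """Return the moves that transfer n disks from `first` to `third`."""
    if n == 1:
        return [(first, third)]
    return (
        hanoi_moves(n - 1, first, third, second)    # n-1 disks: first -> second
        + [(first, third)]                          # largest disk: first -> third
        + hanoi_moves(n - 1, second, first, third)  # n-1 disks: second -> third
    )


for n in range(1, 6):
    # The two numbers printed on each line should agree.
    print(n, len(hanoi_moves(n)), 2**n - 1)
```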

Project Euler #3 Ruby Solution - What is wrong with my code?

This is my code:
def is_prime(i)
  j = 2
  while j < i do
    if i % j == 0
      return false
    end
    j += 1
  end
  true
end
i = (600851475143 / 2)
while i >= 0 do
  if (600851475143 % i == 0) && (is_prime(i) == true)
    largest_prime = i
    break
  end
  i -= 1
end
puts largest_prime
Why is it not returning anything? Is it too large of a calculation going through all the numbers? Is there a simple way of doing it without utilizing the Ruby prime library(defeats the purpose)?
All the solutions I found online were too advanced for me, does anyone have a solution that a beginner would be able to understand?
"premature optimization is (the root of all) evil". :)
Here you go right away for the (1) biggest, (2) prime, factor. How about finding all the factors, prime or not, and then taking the last (biggest) of them that is prime. When we solve that, we can start optimizing it.
A factor a of a number n is such that there exists some b (we assume a <= b to avoid duplication) that a * b = n. But that means that for a <= b it will also be a*a <= a*b == n.
So, for each b = n/2, n/2-1, ... the potential corresponding factor is known automatically as a = n / b, and there's no need to test a for divisibility at all ... and perhaps you can figure out which of the a's don't have to be tested for primality as well.
Lastly, if p is the smallest prime factor of n, then the prime factors of n are p and all the prime factors of n / p. Right?
Now you can complete the task.
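One way to read the last hint, sketched in Python rather than Ruby (the function name is made up for illustration): keep dividing out the smallest factor found so far. Any factor found this way is necessarily prime, and the search can stop once p*p exceeds what is left of n, because of the a*a <= n observation above.

```python
def largest_prime_factor(n):
    """Divide out the smallest factor repeatedly; the last one left is the largest prime factor."""
    largest = None
    p = 2
    while p * p <= n:
        if n % p == 0:
            largest = p   # p is prime: all smaller factors were already divided out
            n //= p
        else:
            p += 1
    if n > 1:             # whatever remains is itself a prime factor
        largest = n
    return largest


print(largest_prime_factor(600851475143))  # 6857
```

Because n shrinks every time a factor is found, the loop never has to go anywhere near n/2, which is why this finishes instantly where the original countdown does not.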
update: you can find more discussion and a pseudocode of sorts here. Also, search for "600851475143" here on Stack Overflow.
I'll address not so much the answer, but how YOU can pursue the answer.
The most elegant troubleshooting approach is to use a debugger to get insight as to what is actually happening: How do I debug Ruby scripts?
That said, I rarely use a debugger -- I just stick in puts here and there to see what's going on.
Start with adding puts "testing #{i}" as the first line inside the loop. While the screen I/O will be a million times slower than a silent calculation, it will at least give you confidence that it's doing what you think it's doing, and perhaps some insight into how long the whole problem will take. Or it may reveal an error, such as the counter not changing, incrementing in the wrong direction, overshooting the break conditional, etc. Basic sanity check stuff.
If that doesn't set off a lightbulb, go deeper and puts inside the if statement. No revelations yet? Next puts inside is_prime(), then inside is_prime()'s loop. You get the idea.
Also, there's no reason in the world to start with 600851475143 during development! 17, 51, 100 and 1024 will work just as well. (And don't forget edge cases like 0, 1, 2, -1 and such, just for fun.) These will all complete before your finger is off the enter key -- or demonstrate that your algorithm truly never returns and send you back to the drawing board.
Use these two approaches and I'm sure you'll find your answers in a minute or two. Good luck!
Do you know you can solve this with one line of code in Ruby (plus a require)?
require 'prime'
Prime.prime_division(600851475143).map(&:first).max
=> 6857

Moving Average across Variables in Stata

I have a panel data set for which I would like to calculate moving averages across years.
Each year is a variable for which there is an observation for each state, and I would like to create a new variable for the average of every three year period.
For example:
P1947=rmean(v1943 v1944 v1945), P1947=rmean(v1944 v1945 v1946)
I figured I should use a foreach loop with the egen command, but I'm not sure about how I should refer to the different variables within the loop.
I'd appreciate any guidance!
This data structure is quite unfit for purpose. Assuming an identifier id you need to reshape, e.g.
reshape long v, i(id) j(year)
tsset id year
Then a moving average is easy. Use tssmooth or just generate, e.g.
gen mave = (L.v + v + F.v)/3
or (better)
gen mave = 0.25 * L.v + 0.5 * v + 0.25 * F.v
More on why your data structure is quite unfit: Not only would calculation of a moving average need a loop (not necessarily involving egen), but you would be creating several new extra variables. Using those in any subsequent analysis would be somewhere between awkward and impossible.
EDIT I'll give a sample loop, while not moving from my stance that it is poor technique. I don't see a reason behind your naming convention whereby P1947 is a mean for 1943-1945; I assume that's just a typo. Let's suppose that we have data for 1913-2012. For means of 3 years, we lose one year at each end.
forval j = 1914/2011 {
    local i = `j' - 1
    local k = `j' + 1
    gen P`j' = (v`i' + v`j' + v`k') / 3
}
That could be written more concisely, at the expense of a flurry of macros within macros. Using unequal weights is easy, as above. The only reason to use egen is that it doesn't give up if there are missings, which the above will do.
FURTHER EDIT
As a matter of completeness, note that it is easy to handle missings without resorting to egen.
The numerator
(v`i' + v`j' + v`k')
generalises to
(cond(missing(v`i'), 0, v`i') + cond(missing(v`j'), 0, v`j') + cond(missing(v`k'), 0, v`k'))
and the denominator
3
generalises to
!missing(v`i') + !missing(v`j') + !missing(v`k')
If all values are missing, this reduces to 0/0, or missing. Otherwise, if any value is missing, we add 0 to the numerator and 0 to the denominator, which is the same as ignoring it. Naturally the code is tolerable as above for averages of 3 years, but either for that case or for averaging over more years, we would replace the lines above by a loop, which is what egen does.
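The numerator/denominator idea is language-independent. As a rough illustration in Python (not Stata; the function names are hypothetical and None stands in for Stata's missing value), a centered moving average that ignores missings could be sketched as:

```python
def mean_ignoring_missing(values):
    """Numerator adds 0 for a missing value; denominator counts the non-missing ones."""
    numerator = sum(0 if v is None else v for v in values)
    denominator = sum(v is not None for v in values)
    # All missing corresponds to 0/0, which is reported as missing (None).
    return numerator / denominator if denominator else None


def moving_average(series, half_width=1):
    """Centered moving average; half_width=1 gives 3-year windows and drops one year at each end."""
    result = []
    for t in range(half_width, len(series) - half_width):
        window = series[t - half_width : t + half_width + 1]
        result.append(mean_ignoring_missing(window))
    return result
```

Raising half_width averages over more years without any change to the missing-value handling, which is the generalisation to a loop mentioned above.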
There is a user written program that can do that very easily for you. It is called mvsumm and can be found through findit mvsumm
xtset id time
mvsumm observations, stat(mean) win(t) gen(new_variable) end

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (i.e. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. My advice is therefore more general, so feel free to clarify and I will try to give a more concrete response.
1) I see that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separators in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person only checks 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame of 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in the z variable
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
  vec <- as.vector(dat[,i]) #turn into vector
  b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
  c <- sort(unique(unlist(b))) #which values were used
  newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
  e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
  colnames(e) <- newcolnames
  #next loop looks for responses and puts indicators in the correct places
  for (j in 1:responses) { #j rather than i, so the column index i is not shadowed
    e[j,] <- ifelse(c %in% b[[j]],1,0)
  }
  dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.
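For comparison, the same three steps (find the _M columns, collect the codes used, fill in indicators) can be sketched outside R. This is an illustrative Python version under assumptions: the function name is hypothetical, rows are dicts of strings as csv.DictReader would produce, and, like the R loop above, an empty entry yields 0s rather than blanks.

```python
def expand_multiselect(rows, suffix="_M"):
    """Replace each space-delimited multi-select column with 0/1 indicator columns."""
    if not rows:
        return []
    multi_cols = [col for col in rows[0] if col.endswith(suffix)]
    # Codes actually used in each multi-select column, sorted numerically.
    codes = {
        col: sorted({int(v) for row in rows for v in row[col].split()})
        for col in multi_cols
    }
    expanded = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col in codes:
                chosen = {int(v) for v in value.split()}
                for code in codes[col]:
                    new_row[f"{col}_{code}"] = 1 if code in chosen else 0
            else:
                new_row[col] = value
        expanded.append(new_row)
    return expanded


rows = [
    {"Q1": "6", "Q2_M": "1 2 88", "Q3": "3", "Q4_M": "3 5 99"},
    {"Q1": "6", "Q2_M": "", "Q3": "3", "Q4_M": "1 2"},
]
for new_row in expand_multiselect(rows):
    print(new_row)
```

The indicator columns are placed where the original column was, so the header order matches the desired layout in the question.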

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of forty zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, because SHA1 is cryptographically strong, it would also be computationally infeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel, be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
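A minimal sketch of the sentinel-plus-check idea, using Python's standard hashlib. The names EMPTY_SHA1 and checked_sha1 are made up for illustration, and the sketch assumes the empty string is excluded from the input domain. It raises an exception rather than using assert, so the check is not stripped in optimized runs.

```python
import hashlib

# SHA-1 of the empty string, used here as the sentinel value.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def checked_sha1(data: bytes, sentinel: str = EMPTY_SHA1) -> str:
    """Hash data, refusing any input whose digest collides with the sentinel."""
    digest = hashlib.sha1(data).hexdigest()
    if digest == sentinel:
        raise ValueError("input hashes to the sentinel value")
    return digest


print(EMPTY_SHA1)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

The same function works with the all-zero string as the sentinel; only the default argument changes.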
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0xEFCDAB89 & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us (a leftrotate 5) + f + e + k = 0x9fb498b3, so w[0] = 0 - 0x9fb498b3 = 0x604b674d (mod 2^32), etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at least 32 other input values.
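The first-round arithmetic in this example can be checked directly. The sketch below uses the standard SHA-1 initial constants and 32-bit modular arithmetic; the helper name rotl is an assumption. The partial sum (a leftrotate 5) + f + e + k comes out as 0x9fb498b3, so the w[0] that cancels it is its negation modulo 2^32.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32


def rotl(x, n):
    """Circular left rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK


# Standard SHA-1 initial state and first-round constant.
h0, h1, h2, h3, h4 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0
k = 0x5A827999

a, b, c, d, e = h0, h1, h2, h3, h4
f = (b & c) | ((~b & MASK) & d)
partial = (rotl(a, 5) + f + e + k) & MASK
w0 = (-partial) & MASK  # value of w[0] that makes temp zero in round 0

print(hex(f), hex(partial), hex(w0))
```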
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
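To make the condition concrete, here is a small sketch for the single-block case, where the h values going into the final addition are the standard initial constants. It only computes which working-variable values would be required; whether any message actually drives the 80 rounds to those values is exactly the open question. The names H_INIT and finalize are hypothetical.

```python
MASK = 0xFFFFFFFF  # additions are modulo 2^32

# Standard SHA-1 initial state (h0..h4).
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def finalize(h, working):
    """The end-of-block step: h_i = h_i + working_i (mod 2^32)."""
    return tuple((hi + wi) & MASK for hi, wi in zip(h, working))


# Working variables (a, b, c, d, e) that would make every output word zero.
required = tuple((-hi) & MASK for hi in H_INIT)

print([hex(v) for v in required])
print(finalize(H_INIT, required))
```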
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People don't seem to realize just how improbable it is (if all humanity focused all of its current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.
