How to predict on which day a value will be reached? (google-sheets)

I have a table like the following:
Day | Number
1 | 2
2 | 2.5
3 | 3.5
4 | 5
5 | 7
6 | 8
7 | 10
8 | 11
9 | 13
10 | 15
11 | 12
Most of the time the numbers increase alongside the days, but sometimes they decrease instead. I would like to know on which day Number will reach a certain value. For example, I would like to know on which day the value of Number will reach 100.
Is this possible? I've tried using the FORECAST function, but when I lower the value of Number and increase the number of days, the predicted day decreases, whereas, if I understand correctly, it should increase instead.

Step 1. Use FORECAST
You need to lock the first cells in a FORECAST formula:
=FORECAST(A17,B$2:B16,A$2:A16)
The $ sign locks row 2, so the start of each range stays fixed while the end grows as the formula is filled down.
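Filled down one row, for example, the formula becomes =FORECAST(A18,B$2:B17,A$2:A17), so each prediction uses all the data above it while the start of the ranges stays anchored at row 2.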
Step 2. Find the position of a day in a forecast
Please try:
=INDEX(sort(A2:B,2,0), MATCH(C1,sort(B2:B,1,0),-1), 1)
where C1 = 100 (your number)
The formula will sort the range twice and will work even if the number does not grow as the days increase.

With your target Number in B13, please try in A13:
=FORECAST(B13,A$2:A12,B$2:B12)
You don't show the formula you tried, so what was wrong with it is just a guess, but maybe your xs and ys were 'the wrong way round': to predict a day from a number, the days must be supplied as the known ys and the numbers as the known xs.

I would use the SLOPE function to calculate the day; the slope is the amount Number increases per day. Enter the target number in C2, and when C2 is edited this script calculates the day on which that number is reached. In F2 put =SLOPE(B2:B12,A2:A12). The custom menu has a reset.
function onOpen() {
  SpreadsheetApp.getActiveSpreadsheet().addMenu('Reset', [
    { name: 'Reset', functionName: 'reset' }
  ]);
}

function onEdit(e) {
  var sheet = e.range.getSheet().getName();
  var cell = e.range.getA1Notation();
  var s = e.source.getSheetByName("Sheet1");
  var find = s.getRange("C2").getValue();  // target number
  var slope = s.getRange("F2").getValue(); // from =SLOPE(B2:B12,A2:A12)
  if (sheet == "Sheet1" && cell == "C2") {
    // Step forward one day at a time until the trend line reaches the target.
    var answer = 0;
    for (var i = 0; i < 100; i++) {
      answer = answer + 1;
      var test = Math.round(slope * answer);
      if (test >= find) {
        s.getRange("D2").setValue(answer);
        break;
      }
    }
  }
}

function reset() {
  var s = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Sheet1");
  s.getRange("C2").setValue(0);
  s.getRange("D2").setValue(0);
}
Here is a test spreadsheet you can copy and try. https://docs.google.com/spreadsheets/d/1inmd2Lc1aOfVL7POKye6p2OQHsloJryISgdqNMowW90/edit?usp=sharing


influxdb: calculating duration of boolean events?

I have data in an influxdb database from a door sensor. This is a boolean sensor (either the door is open (value is false) or it is closed (value is true)), and the table looks like:
name: door
--------------
time                 value
1506026143659488953  true
1506026183699139512  false
1506026751433484237  true
1506026761473122666  false
1506043848850764808  true
1506043887602743375  false
I would like to calculate how long the door was open in a given period of time. The ELAPSED function gets me close, but I'm not sure how to either (a) restrict it to only those intervals for which the initial value is false, or (b) identify "open" intervals from the output of something like select elapsed(value, 1s) from door.
I was hoping I could do something like:
select elapsed(value, 1s), first(value) from door
But that doesn't get me anything useful:
name: door
--------------
time                 elapsed  first
0                             true
1506026183699139512  40
1506026751433484237  567
1506026761473122666  10
1506043848850764808  17087
1506043887602743375  38
I was hoping for something more along the lines of:
name: door
--------------
time                 elapsed  first
1506026183699139512  40       true
1506026751433484237  567      false
1506026761473122666  10       true
1506043848850764808  17087    false
1506043887602743375  38       true
Short of extracting the data myself and processing it in e.g. python, is there any way to do this via an influxdb query?
I came across this problem as well: I wanted to sum the durations of time for which a flag is on, which is pretty common in signal processing with time-series libraries, but InfluxDB just doesn't seem to support that very well. I tried integral() with a flag value of 1, but it just didn't seem to give me correct values. In the end, I resorted to calculating the intervals in my data source, publishing them as a separate field in InfluxDB, and summing them up. It works much better that way.
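For reference, the last step then reduces to a plain sum over the precomputed field. A minimal sketch, assuming the data source publishes each closing event with a hypothetical field open_seconds holding the interval length:

-- open_seconds is assumed to be precomputed by the data source,
-- not calculated by InfluxDB
SELECT sum("open_seconds") FROM "door" WHERE time >= now() - 7d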
This is the closest I have found so far:
https://community.influxdata.com/t/storing-duration-in-influxdb/4669
The idea is to store the boolean event as 0 or 1 and to record each state change with two entries, one unit of time apart. It would look something like this:
name: door
--------------
time                 value
1506026143659488953  1
1506026183699139511  1
1506026183699139512  0
1506026751433484236  0
1506026751433484237  1
1506026761473122665  1
1506026761473122666  0
1506043848850764807  0
1506043848850764808  1
1506043887602743374  1
1506043887602743375  0
It should then be possible to use a query like this:
SELECT integral(value) FROM "door" WHERE time > x and time < y
I'm new to influx so let me know if this is a bad way of doing things today. I also haven't tested the example I've written here.
I had this same problem. After running into this wall with InfluxDB and finding no clean solutions here or elsewhere, I ended up switching to TimescaleDB (PostgreSQL-based) and solving it with a SQL window function, using lag() to calculate the delta to the previous time value.
For the OP's dataset, a possible solution looks like this:
SELECT
  "time",
  ("time" - lag("time") OVER (ORDER BY "time")) / 1000000000 AS elapsed,
  value AS first
FROM door
ORDER BY 1
OFFSET 1; -- omit the initial row, whose lag() is NULL
Input:
CREATE TEMPORARY TABLE "door" (time bigint, value boolean);
INSERT INTO "door" VALUES
(1506026143659488953, true),
(1506026183699139512, false),
(1506026751433484237, true),
(1506026761473122666, false),
(1506043848850764808, true),
(1506043887602743375, false);
Output:
time | elapsed | first
---------------------+---------+-------
1506026183699139512 | 40 | f
1506026751433484237 | 567 | t
1506026761473122666 | 10 | f
1506043848850764808 | 17087 | t
1506043887602743375 | 38 | f
(5 rows)

Creating numbered variable names using the foreach command

I have a list of variables for which I want to create a list of numbered variables. The intent is to use these with the reshape command to create a stacked data set. How do I keep them in order? For instance, with this code
local ct = 1
foreach x in q61 q77 q99 q121 q143 q165 q187 q209 q231 q253 q275 q297 q306 q315 q324 q333 q342 q351 q360 q369 q378 q387 q396 q405 q414 q423 {
    gen runs`ct' = `x'
    local ct = `ct' + 1
}
when I use the reshape command it generates an order as
runs1 runs10 runs11 ... runs2 runs22 ...
rather than the desired
runs01 runs02 runs03 ... runs26
Preserving the order is necessary in this analysis. I'm trying to add a leading zero to all ct values less than 10 when assigning variable names.
Generating a series of identifiers with leading zeros is a documented and solved problem: see e.g. here.
local j = 1
foreach v in q61 q77 q99 q121 q143 q165 q187 q209 q231 q253 q275 q297 q306 q315 q324 q333 q342 q351 q360 q369 q378 q387 q396 q405 q414 q423 {
    local J : di %02.0f `j'
    rename `v' runs`J'
    local ++j
}
Note that I used rename rather than generate. If you are going to reshape the variables afterwards, the labour of copying the contents is unnecessary. Indeed the default float type for numeric variables used by generate could in some circumstances result in loss of precision.
I note that there may also be a solution with rename groups.
All that said, it's hard to follow your complaint about what reshape does (or does not) do. If you have a series of variables like runs* the most obvious reshape is a reshape long and for example
clear
set obs 1
gen id = _n
foreach v in q61 q77 q99 q121 q143 {
    gen `v' = 42
}
reshape long q, i(id) j(which)
list
+-----------------+
| id which q |
|-----------------|
1. | 1 61 42 |
2. | 1 77 42 |
3. | 1 99 42 |
4. | 1 121 42 |
5. | 1 143 42 |
+-----------------+
works fine for me; the column order information is preserved and no use of rename was needed at all. If I want to map the suffixes to 1 up, I can just use egen, group(), as sketched below.
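A minimal sketch of that last step (the variable name newid is just for illustration):

* map the suffixes 61, 77, 99, ... to 1, 2, 3, ... in sort order
egen newid = group(which)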
So, that's hard to discuss without a reproducible example. See
https://stackoverflow.com/help/mcve for how to post good code examples.

Fill missing data by interpolation in Google Spreadsheet

I have a Google Spreadsheet with the following data:
     A           B       D
1    Date        Weight  Computation
2    2015/12/09          =B2*2
3    2015/12/10  65      =B3*2
4    2015/12/11          =B4*2
5    2015/12/12          =B5*2
6    2015/12/14  62      =B6*2
7    2015/12/15          =B7*2
8    2015/12/16  61      =B8*2
9    2015/12/17          =B9*2
I want to graph the weight w.r.t. date, and/or use it with other columns that compute other quantities off the weight. However, you will notice that there are some missing entries. What I want is another column based on the Weight column, with the missing values interpolated and filled in. E.g.:
     A           B       C        D
1    Date        Weight  WeightI  Computation
2    2015/12/09          65       =C2*2   # use first known value
3    2015/12/10  65      65       =C3*2
4    2015/12/11          64       =C4*2   # =(62-65)/3*(1)+65
5    2015/12/12          63       =C5*2   # =(62-65)/3*(2)+65
6    2015/12/14  62      62       =C6*2
7    2015/12/15          61.5     =C7*2   # =(61-62)/2*(1)+62
8    2015/12/16  61      61       =C8*2
9    2015/12/17          61       =C9*2   # use the last known value
In column C, values between two known points are filled in using linear interpolation.
I believe this is a really simple and common use case, so I am sure it's a trivial thing to do, but I am unable to find a solution using built-in functions. I don't have much experience with spreadsheets either. I have spent hours experimenting with =INDEX, =MATCH, =VLOOKUP, =LINEST, =TREND etc., but I am not able to come up with something from the examples. The only solution I could get to work was a custom function using Google Apps Script. Though my solution works, it executes very slowly, and my spreadsheet is huge.
Any pointers or solutions?
You might want to use FORECAST, for which it may be more convenient first to separate out the dates you have readings for from those you don't (and rearrange later). So with just three readings, say:
     A           B
1    10/12/2015  65
2    14/12/2015  62
3    16/12/2015  61
and the dates for which values are required on the left below:
6    09/12/2015  65.6
7    11/12/2015  64.3
8    12/12/2015  63.6
9    15/12/2015  61.5
10   17/12/2015  60.2
The formula giving rise to 65.6 in B6 (and copied down from there to suit) is:
=forecast(A6,$B$1:$B$3,$A$1:$A$3)
This is not calculated in quite the way you show but may be considered slightly more accurate, in particular by extrapolating the missing end values, rather than just repeating their nearest available value.
Having calculated the values, you would probably want to reassemble the data in date order. So I suggest copying B6:B10, using Edit > Paste special > Paste values only to paste over the top, and then sorting to suit.
A chart (not reproduced here) compared the results above (blue) with those in the OP (green) and marked the given data points.
Found a solution that satisfies most of my requirements:
Used =FILTER() to first remove blank rows where data is not available (thanks for a tip from "pnuts").
Used =MATCH() to look up two consecutive rows from the filtered table. In my case I was able to use this function because column A is sorted and has no repetitions.
Then used the line formula to interpolate the values.
So the output becomes:
     A           B       C           D        E
1    Date        Weight  FDdate      FWeight  IWeight
2    2015/05/09          2015/05/10  65.00    #N/A
3    2015/05/10  65.00   2015/05/13  62.00    65.00
4    2015/05/11          2015/05/15  61.00    64.00
5    2015/05/12                               63.00
6    2015/05/13  62.00                        62.00
7    2015/05/14                               61.50
8    2015/05/15  61.00                        61.00
9    2015/05/16                               61.00
10   2015/05/17                               61.00
Where cells C2 and D2 have the following range formula (minor note: the following formulas could of course be combined if columns A and B are adjacent):
C2 =FILTER($A$2:$A$10, NOT(ISBLANK($B$2:$B$10)))
D2 =FILTER($B$2:$B$10, NOT(ISBLANK($B$2:$B$10)))
Cells E2 through E10 contain the following line interpolation formula: [y = y1 + (y2 - y1) / (x2 - x1) * (x - x1)]:
E2 =(INDEX($D:$D, MATCH($A2, $C:$C, 1), 1))
+(INDEX($D:$D, MATCH($A2, $C:$C, 1) + 1, 1)
- INDEX($D:$D, MATCH($A2, $C:$C, 1), 1))
/(INDEX($C:$C, MATCH($A2, $C:$C, 1) + 1, 1)
- INDEX($C:$C, MATCH($A2, $C:$C, 1), 1))
*(INDEX($C:$C, MATCH($A2, $C:$C, 1), 1) - $A2) * -1
What this solution does not work for is when the first cell B2 does not have a value, in which case the formula results in #N/A. All this would have been much more efficient if we had something like =INTERPOLATE_LINE( A2, $A$2:$A$10, $B$2:$B$10 ) in Google Sheets, but unfortunately it does not exist. Please correct me if I have missed it in my reading of the supported functions.
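For what it's worth, a custom function along those lines can be sketched in Apps Script. This is an illustration only: INTERPOLATE_LINE is not a built-in, the sketch assumes the x-range is sorted ascending with no blanks, and as a custom function it may be just as slow as the OP's own attempt.

function INTERPOLATE_LINE(x, xs, ys) {
  // Keep only the rows that actually have a y value.
  var pts = [];
  for (var i = 0; i < xs.length; i++) {
    if (ys[i][0] !== "" && ys[i][0] !== null) {
      pts.push([xs[i][0], ys[i][0]]);
    }
  }
  if (pts.length === 0) return null;
  // Outside the known range, repeat the nearest known value.
  if (x <= pts[0][0]) return pts[0][1];
  if (x >= pts[pts.length - 1][0]) return pts[pts.length - 1][1];
  // Find the bracketing pair and interpolate linearly between them.
  for (var j = 1; j < pts.length; j++) {
    if (x <= pts[j][0]) {
      var x1 = pts[j - 1][0], y1 = pts[j - 1][1];
      var x2 = pts[j][0], y2 = pts[j][1];
      return y1 + (y2 - y1) * (x - x1) / (x2 - x1);
    }
  }
}

It would then be used exactly as wished: =INTERPOLATE_LINE(A2, $A$2:$A$10, $B$2:$B$10).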
I found a solution which satisfies the requirements completely. I used a separate sheet so I could break up the calculation into pieces.
Create a new sheet. Enter the following formulas into Cells A2-F2, and then copy them down the page.
Cell A2: Copy your weight data into the first column. (In this example, the sheet name is Daily Record and the weights are recorded in column D.)
='Daily Record'!D2
Cell B2: Find the most recent recorded weight.
=INDEX(FILTER(A$2:A2,A$2:A2 <> ""),COUNT(FILTER(A$2:A2,A$2:A2 <> "")),1)
Cell C2: Count the number of days since the most recent weigh-in.
=IF(A2<>"",0,IF(ROW(C2)<3,0,C1+1))
Cell D2: Find the next recorded weight (from the current date or later.)
=IFERROR(INDEX(FILTER(A2:A,A2:A <> ""),1,1),"")
Cell E2: Count the number of days until the next weigh-in.
=IF(A2<>"",0,IF(E3="","",E3+1))
Cell F2: Calculate the interpolated weight.
=IF(A2 <> "", A2, IF(D2 = "", "", B2 + (D2-B2)*C2/(C2+E2)))
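As a quick check of that formula, take the row for 2015/12/12 in the question's sample data: the most recent recorded weight is B = 65, the next is D = 62, with C = 2 rows since that weigh-in and E = 1 row until the next, giving 65 + (62 - 65) * 2 / (2 + 1) = 63, which matches the interpolated value in the question.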

Generating means of a variable using dummy variables & foreach in Stata

My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. an HD dummy variable for hard drives, which takes the value 1 when variable X represents an HD, etc.). I have a list of over 40 variables (two of them representing X and Y; the rest are dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation Number of obs = 5
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
Tax | 7.1 2.537716 1.154172 15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.
Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
list
collapse (mean) tax, by(categ)
list
To take the results to Excel you can try export excel or putexcel.
Run help collapse and help export for details.
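For instance, a minimal sketch of the export step, run after the collapse above (the file name is an assumption):

* writes the categ and tax columns, with a header row, to a hypothetical file
export excel using "mean_tax_by_categ.xlsx", firstrow(variables) replace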
Edit
Because you insist, below is an example that gives the same result using loops. I assume the same data input as before. Some testing using this example database with expand 1000000 shows that speed is virtually the same. But almost surely, you (including your future you) and your readers will prefer collapse. It is much clearer, cleaner and more concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .

quietly {
    foreach part of local parts {
        summarize tax if categ == "`part'", meanonly
        replace mtax = r(mean) if categ == "`part'"
    }
}

bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you start getting a hold of it, you will find that many things done with loops elsewhere can be made loop-less in Stata. In many cases, the latter style will be preferred. See the corresponding help files using help <command>, and if you are not familiar with saved results (e.g. r(mean)), type help return.
A supplement to Roberto's excellent answer: after collapse, you will need a loop to export the results to Excel, one file per category.
levelsof categ, local(levels)
foreach x of local levels {
    export excel using "`x'.xlsx" if categ == "`x'", firstrow(variables) replace
}
I prefer to use numeric codes for variables such as your category variable; I then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable:
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
    export excel using "`:label (categ) `x''.xlsx" if categ == `x', replace
}
The #delim ; command makes it possible to easily list each code on a separate line. The "label" function in the export statement is an extended macro function that inserts a value label into the file name.

Obtaining the quantity and proportion in SPSS 21

I have the following data in a .sav file:
CODE | QUANTITY
------|----------
A | 1
B | 4
C | 1
F | 3
B | 3
D | 12
D | 5
I need to obtain the number of codes with a quantity <= 3, and their proportion, as a percentage, of the total number of cases, presented like this:
<= 3 | PERCENTAGE
------|----------
4 | 57 %
All of this using SPSS syntax.
I would first convert the quantity value to a 0-1 variable, and then aggregate by code to the mean. This produces a nice second dataset to make a table. Example below.
data list free / Code (A1) Quantity (F2.0).
begin data
A 1
B 4
C 1
F 3
B 3
D 12
D 5
end data.
*convert to 0-1.
compute QuantityB3 = (Quantity LE 3).
*Aggregate.
DATASET DECLARE AggQuant.
AGGREGATE
/OUTFILE='AggQuant'
/BREAK=Code
/QuantityB3 = MEAN(QuantityB3).
I don't know how your question was migrated here, and I don't have enough reputation here to add screenshots, which would help a lot. Anyhow, the procedure for your desired output is given below.
Go to Transform -> Count Values within Cases; a dialog box opens. Enter the name of a new variable, say "New", under Target Variable, then go to Define Values. In the new dialog, check the radio button "Range, LOWEST through value", enter 3 in the box below it, press Add, press Continue, and press OK. A new variable named "New" is created. Now go to Analyze -> Descriptive Statistics -> Frequencies; in the dialog, move the "New" variable into Variable(s), press Statistics, check Percentile(s), enter 100 in the box, press Add, then Continue and OK. You get the desired results.
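Since the question asks for syntax, those dialogs paste to roughly the following (a sketch, assuming the variable names from the example data):

* flag cases with Quantity <= 3, then tabulate the flag.
COUNT New = Quantity (LOWEST THRU 3).
FREQUENCIES VARIABLES=New.

The frequency table for New then shows the count of flagged cases (4) and their percentage of the total (57.1%).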
