How to remove gross errors from a time series? - time-series

I have a long time series of 5-minute water level data from wells. The series contains measurement errors that are clearly visible in time-series plots.
water level time series plot
head(data)
# A tibble: 229,120 x 4
   date                 temp P_comp_m alt_m
   <dttm>              <dbl>    <dbl> <dbl>
 1 2016-06-10 11:50:00  21.8     1.09 1008.
 2 2016-06-10 11:55:00  21.2     1.07 1008.
 3 2016-06-10 12:00:00  21.1     1.06 1008.
 4 2016-06-10 12:05:00  21.1     1.05 1008.
 5 2016-06-10 12:10:00  21.9     1.05 1008.
 6 2016-06-10 12:15:00  21.8     1.04 1008.
 7 2016-06-10 12:20:00  21.7     1.03 1008.
 8 2016-06-10 12:25:00  21.6     1.03 1008.
 9 2016-06-10 12:30:00  21.5     1.02 1008.
10 2016-06-10 12:35:00  21.5     1.01 1008.
# ... with 229,110 more rows
Due to the volume of data I wish to automate the data cleaning process. Currently, I am removing spurious data manually with R tidyverse tools.
library(tidyverse)   # dplyr verbs and the pipe
library(lubridate)   # as_datetime()

data[between(data$date,
             as_datetime("2016-11-27 17:00:00"),
             as_datetime("2016-11-29 01:50:00")), ] <- data %>%
  filter(between(date, as_datetime("2016-11-27 17:00:00"),
                 as_datetime("2016-11-29 01:50:00"))) %>%
  mutate(temp = NA,      # temperature column
         P_comp_m = NA,  # pressure
         alt_m = NA)     # altitude
Can anyone provide suggestions to automate the task?

You can automate the task if you can articulate the criteria that make a data point "spurious". Alternatively, you can automate part of it: manually pick the time windows in which you consider the data spurious, put them into a vector or list, and set up a process that automatically removes those data points from the data based on that manually created list.
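A minimal sketch of that second approach in R, using the column names and the 2016-11-27 to 2016-11-29 window from the question (the bad_windows table and its second window are hypothetical, purely for illustration):

library(tidyverse)
library(lubridate)

# Manually curated list of known-bad time windows
# (first window taken from the question; second is a made-up example)
bad_windows <- tribble(
  ~start,                             ~end,
  as_datetime("2016-11-27 17:00:00"), as_datetime("2016-11-29 01:50:00"),
  as_datetime("2017-03-02 08:00:00"), as_datetime("2017-03-02 12:00:00")
)

# TRUE for every row whose timestamp falls inside any listed window
is_bad <- map2(bad_windows$start, bad_windows$end,
               ~ between(data$date, .x, .y)) %>%
  reduce(`|`)

# Keep the timestamps, blank out the measurements inside the flagged windows
data_clean <- data %>%
  mutate(across(c(temp, P_comp_m, alt_m),
                ~ if_else(is_bad, NA_real_, .x)))

New windows then only need to be appended to bad_windows; the rest of the pipeline stays unchanged.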

Related

Does XGBoost Regressor handle missing timesteps?

I have a dataframe of daily item sales: the goal is forecasting future sales for good warehouse supply planning. I'm using XGBoost as the regressor.
date                 qta      prezzo  year  day  dayofyear  month  week  dayofweek  festivo
2014-01-02 00:00:00  6484.8   1       2014  2    2          1      1     3          1
2014-01-03 00:00:00  5300     1       2014  3    3          1      1     4          1
2014-01-04 00:00:00  2614.9   1.1     2014  4    4          1      1     5          1
2014-01-07 00:00:00  114.3    1.1     2014  7    7          1      2     1          0
2014-01-09 00:00:00  11490    1       2014  9    9          1      2     3          0
The date is also the index of my dataframe. Qta is the label (the dependent variable) and all the others are the features.
As you can see it's daily sampling, but some days are missing (e.g. the 5th, 6th, and 8th).
Could this be a problem when fitting and predicting future days?
Am I supposed to fill the missing days with qta = 0?

Arrange downloaded data in a more useful way in Google Sheets

We currently have fixed report data that we can only manipulate after download; to simplify, it looks like this:
raw report data extracted to google sheets
a b c
1 Start Date Time Adhering to Schedule (Hours) Time Not Adhering to Schedule (Hours)
2 Employee: A Supervisor: X
3 5/4/2022 7.65 1.35
4 5/5/2022 8.12 0.88
5 5/6/2022 6.95 2.05
6 5/9/2022 8.7 0.3
7 5/10/2022 7.45 1.55
8 5/11/2022 8.63 0.37
9 5/12/2022 8.08 0.92
10 5/13/2022 6.13 0.13
11 Totals: 61.71 7.55
12 Employee: B Supervisor: X
13 5/1/2022 3.8 0.27
14 5/2/2022 6.72 2.28
15 5/3/2022 6.1 2.9
16 5/4/2022 8.43 0.57
17 5/5/2022 5.85 0.53
18 5/10/2022 6.13 2.87
19 5/11/2022 0 1.5
20 5/12/2022 2 1.5
21 5/13/2022 1.75 1.75
22 Totals: 40.78 14.17
I would like some help in constructing a new sheet via formulas so that it rearranges the raw data as follows:
desired output
a b c d e
1 EMPLOYEE SUPERVISOR Start Date Time Adhering to Schedule (Hours) Time Not Adhering to Schedule (Hours)
2 A X 04/05/22 7.65 1.35
3 A X 05/05/22 8.12 0.88
4 A X 06/05/22 6.95 2.05
5 A X 09/05/22 8.70 0.30
6 A X 10/05/22 7.45 1.55
7 A X 11/05/22 8.63 0.37
8 A X 12/05/22 8.08 0.92
9 A X 13/05/22 6.13 0.13
10 B X 01/05/22 3.80 0.27
11 B X 02/05/22 6.72 2.28
12 B X 03/05/22 6.10 2.90
13 B X 04/05/22 8.43 0.57
14 B X 05/05/22 5.85 0.53
15 B X 10/05/22 6.13 2.87
16 B X 11/05/22 0.00 1.50
17 B X 12/05/22 2.00 1.50
18 B X 13/05/22 1.75 1.75
It probably needs some combination of QUERY(), ARRAYFORMULA(), TRANSPOSE() and/or INDEX(), but I can't quite figure it out. I need some help to get started on the right track. The dates and data for each employee are dynamic, so the formula producing the desired result needs to adjust to that as well.
Thanks!
Edit: adding a sample sheet for reference :) https://docs.google.com/spreadsheets/d/1m_FCGcnXvnEiMZ8X4K1eEsMljORWV4V1Yq_81vFnx4Y/edit?usp=sharing
Global solution
in E1
={ArrayFormula(if(A1:A="Totals:",,{
substitute(lookup(row(A1:A),row(A1:A)/if(ISNUMBER(A1:A),0,1),A1:A),"Employee: ",""),
substitute(lookup(row(A1:A),row(A1:A)/if(ISNUMBER(A1:A),0,1),C1:C),"Supervisor: ","")
})),Arrayformula(if(ISNUMBER(A1:A),{A1:A,B1:B,C1:C},))}
In 3 steps (3 arrayformulas),
try in H1
=arrayformula(if(left(A1:A,6)="Totals",,if(left(A1:A,8)="Employee",{B1:B,D1:D,E1:E,E1:E,E1:E},{E1:E,E1:E,A1:A,B1:B,C1:C})))
then, back in F1 to complete all rows with employee and supervisor
=ArrayFormula({lookup(row(H:H),row(H:H)/if(H:H<>"",1,0),H:H),lookup(row(I:I),row(I:I)/if(I:I<>"",1,0),I:I)})
finally, if you want to reduce the presentation, in M1
=query(F:L,"select F,G,J,K,L where J is not null",0)

Averaging a Data Series in a Google Sheet to a single entry per period regardless of the number of samples in the larger period?

I have a small data set of ~200 samples taken over twenty years, with two columns of data that sometimes have multiple entries per period (i.e. age or date). When I go to plot it, even though the data spans 20 years, the graph heavily reflects the number of samples in each period rather than the period itself. For example, during age 23 there may be 2 or 3 samples, 1 for age 24, 20 for age 25, and 10 for age 35; the number of samples depends entirely on the need for additional data at the time, so there is simply no consistency to the sample rate.
How do I get a Max or an Average/Max for a period (age) and ensure there is only one entry per period in the sheet (about one entry per year), without having to create a separate sheet full of separate queries and charting off of that?
What I have tried in Google Sheets (where my data is) is choosing "aggregate" on the chart's x-series (the age period), which helps flatten the graph a bit but doesn't reduce the series.
A read-only link to the spreadsheet is HERE for reference.
Data Looking something like this:
3/27/2013 36.4247 2.5 29.3
4/10/2013 36.4630 1.8 42.8
4/15/2013 36.4767 2.2 33.9
5/2/2013 36.5233 2.2 33.9
5/21/2013 36.5753 1.91 39.9
5/29/2013 36.5973 1.94 39.2
7/29/2013 36.7644 1.98 38.3
10/25/2013 37.0055 1.7 45.6
2/28/2014 37.3507 1.85 50 41.3
6/1/2014 37.6055 1.98 38 38.1
12/1/2014 38.1068 37
6/1/2015 38.6055 2.18 34 33.9
12/11/2015 39.1342 3.03 23 23.1
12/14/2015 39.1425 3.18 22 21.9
12/15/2015 39.1452 3.44 20 20.0
12/17/2015 39.1507 3.61 19 18.9
12/21/2015 39.1616 3.62 19 18.8
12/23/2015 39.1671 3.32 21 20.8
12/25/2015 39.1726 3.08 23 22.7
12/28/2015 39.1808 3.12 22 22.4
12/29/2015 39.1836 2.97 24 23.7
12/30/2015 39.1863 3.57 19 19.1
12/31/2015 39.1890 3.37 20 20.5
1/1/2016 39.1918 3.37 20 20.5
1/3/2016 39.1973 2.65 27 27.0
1/4/2016 39.2000 2.76 26 25.8
try:
=QUERY(SORTN(SORT({YEAR($A$6:$A), B6:B}, 1, 0, 2, 0), 9^9, 2, 1, 1),
"where Col1 <> 1899")
demo spreadsheet
and build a chart from there

Google Sheets IMPORTHTML removes zeros and treats commas as decimals

I'm trying to import a table where commas are the thousands separator;
for example, 32,100 means 32100, but it is being treated as 32.1 instead.
This is a similar table (first one / top left):
https://en.wikipedia.org/wiki/Demographics_of_the_world
imgur for screenshots:
https://imgur.com/a/hJR9tox
I want it to say:
Year million
1500 458
1600 580
1700 682
1750 791
1800 978
1850 1262
1900 1650
1950 2521
1999 5978
2008 6707
2011 7000
2015 7350
2018 7600
2020 7750
But it comes out as:
Year million
1500 458
1600 580
1700 682
1750 791
1800 978
1850 1,262
1900 1,65
1950 2,521
1999 5,978
2008 6,707
2011 7
2015 7,35
2018 7,6
2020 7,75
This is the function I'm using:
=IMPORTHTML("https://en.wikipedia.org/wiki/Demographics_of_the_world"; "table"; 1)
I have also tried using this function:
=IMPORTXML("https://en.wikipedia.org/wiki/Demographics_of_the_world"; "//*[@id='mw-content-text']/div/table[1]/tbody")
But that returns the following, which is extremely hard to understand and still removes the zeros:
World Population[1][2] Yearmillion 1500458 1600580 1700682 1750791 1800978 18501,262 19001,65 19502,521 19995,978 20086,707 20117 20157,35 20187,6 20207,75
Other things I have tried:
forcing it to always print three decimals, but that won't work since it adds digits to the end of every number.
The main and easiest solution is to change your spreadsheet's locale setting to one that uses the comma as the thousands separator.
As an alternative, if changing this setting is really not a possibility, you could create a script that uses URLFetchApp to retrieve the page's contents and parse the values, taking into consideration the usage of the comma as the thousands separator.

Heroku, Oink and R14 errors. Can one line of code need 70MB of memory?

I've been somewhat concerned at the number of R14 errors I'm getting on Heroku recently.
I don't know if this has anything to do with using Unicorn. Or having recently installed New Relic or Logentries. I really can't work it out.
I have "installed" Oink and have just received the following analysis but have no read idea how to fully understand what it's trying to tell me.
---- MEMORY THRESHOLD ----
THRESHOLD: 0 MB
-- SUMMARY --
Worst Requests:
1. Nov 13 02:53:51, 70836 KB, messages#getmessagecount
2. Nov 13 02:03:04, 65836 KB, messages#getmessagecount
3. Nov 13 02:21:46, 60236 KB, messages#getmessagecount
4. Nov 13 01:32:47, 6328 KB, messages#deletemessage
5. Nov 13 01:33:43, 6328 KB, locations#sendprofiles
6. Nov 13 01:32:56, 6328 KB, messages#deletemessage
7. Nov 13 01:32:58, 6328 KB, messages#deletemessage
8. Nov 13 01:32:49, 6328 KB, messages#deletemessage
9. Nov 13 01:47:46, 5300 KB, messages#getmessagecount
10. Nov 13 03:09:56, 5300 KB, messages#getmessagecount
Worst Actions:
9, messages#deletemessage
7, messages#getmessagecount
1, locations#sendprofiles
1, photos#photodatarequest
1, messages#getmessages
Aggregated Totals:
Action                      Max    Mean   Min   Total  Number of requests
messages#getmessagecount  70836   29814   464  208700  7
messages#deletemessage     6328    3016   180   27144  9
locations#sendprofiles     6328    6328  6328    6328  1
photos#photodatarequest     460     460   460     460  1
messages#getmessages        300     300   300     300  1
I'm concerned, as a layman, that my messages#getmessagecount action is eating a LOT of memory. Is that what the above means?
If so ... the routine is simply:
def getmessagecount
  @messagecount = Message.where(recipient: current_user, messageSysMessCode: 0, messageAdminMessage: false).count
end
And I have no idea how this could be "leaking" memory.
The graph of memory usage on Heroku over the last day looks like:
I'm using Ruby 2.1.4 and Rails 4.1.7 if that's any help. I'm using two Web dynos and one Worker.
Oh ... and my delete message routine is:
def deletemessage
  @message = Message.where(recipient_id: current_user.id, id: params[:messageID]).first
  if @message
    @message.delete
    @code = "OK"
  else
    @code = "Couldn't delete message"
  end
end
This is killing my performance (if that's the right way to put it) every 3 hours or so. I have no idea why memory usage is ramping up every 10 minutes (which I infer, hopefully correctly, from reading the graph). 10 minutes might be significant, as I have an iPhone app polling the getmessagecount routine every 10 minutes with a single test device. I can only wonder what will happen if 10 copies of the app (or thousands) start hitting the server.
Any help would be very deeply appreciated.
