I am trying to measure the periodicity strength of a specific time in time series data when a period (e.g., 1 day, 7 days) is given.
For example,
|       | AM 10:00 | 10:30 | 11:00 |
|-------|----------|-------|-------|
| DAY 1 | A        | A     | B     |
| DAY 2 | A        | B     | B     |
| DAY 3 | A        | B     | B     |
| DAY 4 | A        | A     | B     |
| DAY 5 | A        | A     | B     |
If the period is 1 day, AM 10:00 and 11:00 have the highest periodicity strength in this data because the values at both of those times are consistent across days.
Are there any popular methods or research for doing this?
There is a lot of existing research on finding periodic patterns in time series, but I can't find work on measuring the periodicity strength of a specific time when a period is given.
Please share your knowledge. Thanks.
What you are looking for is something called cyclic association rules. I've linked to the paper that was originally written by researchers at Bell Labs.
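If you just need a simple baseline while you look into that literature, one (unofficial) way to score a single time slot is to measure how concentrated its value distribution is across cycles of the given period, for example with normalized entropy. A minimal Python sketch, assuming categorical values as in the example table:

from collections import Counter
from math import log

def periodicity_strength(values):
    # `values` holds the observations at one time slot, one per cycle.
    # Returns 1.0 when every cycle has the same value, and less otherwise.
    counts = Counter(values)
    n = len(values)
    if len(counts) == 1:
        return 1.0
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return 1.0 - entropy / log(len(counts))

# One call per column of the example table (period = 1 day):
print(periodicity_strength(["A", "A", "A", "A", "A"]))  # 10:00 -> 1.0
print(periodicity_strength(["A", "B", "B", "A", "A"]))  # 10:30 -> close to 0
print(periodicity_strength(["B", "B", "B", "B", "B"]))  # 11:00 -> 1.0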
I am trying to create a small "app" using Tasker on my Android phone that is supposed to track my work hours and over/under-time. I have managed to get Tasker to send timestamps at the start/end of each workday and write them to a Google Sheet, so it gets recorded like:
| A          | B     | C     | D      | E | F      |
|------------|-------|-------|--------|---|--------|
| 2020-01-29 | 07:24 | 16:33 | 00:09  |   | -02:51 |
| 2020-01-30 | 07:00 | 12:00 | -03:00 |   |        |
Where the "D" column is the difference between ordinary workhours (8) and actually registred hours.
The "F" column should summarize the "D" column and show the sum of all values.
The data in the three first columns are beeing sent correctly but I cant figure out how to set up formulas so that the values for the "D" column is added and and same thing with the cell in the "F" column. I have been trying to change to different formats and tried creating my own formats to but do not understand how to get it to work.
I'm getting a different result than you in D1. I wonder if you're also accounting for a lunch hour (so subtract 9 instead of 8), but these formulas worked for me:
in Column D: =(C1-B1)-(8/24)
in Cell F1: =sum(D1:D2)
Column D and Cell F1 are formatted as Time > Duration.
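To check the arithmetic on that difference: 16:33 − 07:24 = 9:09, so the 8-hour formula above gives 1:09 for the first row; the 0:09 shown in the question's table is what you get by subtracting 9 hours instead, which is why I suspect a lunch hour is being accounted for.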
Here's the result:
I know how to do this using a custom function/script, but I am wondering if it can be done with a built-in formula.
I have a list of tasks with a start date and end date. I want to calculate the actual # of working days (NETWORKDAYS) spent on all the tasks.
- Task days may overlap, so I can't just total the # of days spent on each task.
- There may be gaps between tasks, so I can't just find the difference between the first start and last end.
For example, let's use these:
| Task Name | Start Date | End Date | NETWORKDAYS |
|:---------:|------------|------------|:-----------:|
| A | 2019-09-02 | 2019-09-04 | 3 |
| B | 2019-09-03 | 2019-09-09 | 5 |
| C | 2019-09-12 | 2019-09-13 | 2 |
| D | 2019-09-16 | 2019-09-17 | 2 |
| E | 2019-09-19 | 2019-09-23 | 3 |
Here it is visually:
Now:

- If you total the NETWORKDAYS you'll get 15.
- If you calculate NETWORKDAYS between 2019-09-02 and 2019-09-23 you get 16.

But the actual duration is 13:

- A and B overlap a bit.
- There is a gap between B and C.
- There is a gap between D and E.
If I were to write a custom function, I would basically take all the dates, sort them, find overlaps and remove them, and account for gaps (a sketch of this logic appears after the answer below).
But I am wondering if there is a way to calculate the actual duration using built-in formulas?
sure, why not:
=ARRAYFORMULA(COUNTA(IFERROR(QUERY(UNIQUE(TRANSPOSE(SPLIT(CONCATENATE("×"&
SPLIT(REPT(INDIRECT("B1:B"&COUNTA(B1:B))&"×",
NETWORKDAYS(INDIRECT("B1:B"&COUNTA(B1:B)), INDIRECT("C1:C"&COUNTA(B1:B)))), "×")+
TRANSPOSE(ROW(INDIRECT("A1:A"&MAX(NETWORKDAYS(B1:B, C1:C))))-1)), "×"))),
"where Col1>4000", 0))))
I'm struggling with detecting anomalies in time series sensor data. My data looks like this:
| Timestamp           | Temperature |
|---------------------|-------------|
| 2018-04-01 10:00:00 | 19.00       |
| 2018-04-01 11:00:00 | 21.00       |
| 2018-04-01 12:00:00 | 22.00       |
I'm also able to provide a label, but this label isn't very accurate:
| Timestamp           | Temperature | IsBroken |
|---------------------|-------------|----------|
| 2018-04-01 10:00:00 | 19.00       | 0        |
| 2018-04-01 11:00:00 | 21.00       | 0        |
| 2018-04-01 12:00:00 | 01.00       | 1        |
I can also provide data from other sensors in the region, like humidity sensors, or the average temperature in the region.
I have found many resources about algorithms, but I don't know how to approach this technically. Can somebody help me, or at least push me in the right direction?
The goal is to detect whether a sensor is broken in future sensor data, based on results from the past.
Outlier and anomaly detection is a broad topic. If you are looking for something easy to understand yet powerful try an isolation forest (link). This algorithm should be able to find days where the sensors reported some unusual combination of values.
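If you want to try that quickly, here is a minimal sketch with scikit-learn's IsolationForest; the column names come from the question, but the feature choice is just an assumption to get started, and the noisy IsBroken label is better used to sanity-check the flags than to train on:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy frame with the columns from the question; in practice load the full history.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2018-04-01 10:00:00",
                                 "2018-04-01 11:00:00",
                                 "2018-04-01 12:00:00"]),
    "Temperature": [19.0, 21.0, 1.0],
})

# Simple features: the raw reading plus the hour of day, so a reading of 1 degree
# at noon can look unusual; other sensors (humidity, regional average) would be
# added as further columns.
X = np.column_stack([df["Temperature"], df["Timestamp"].dt.hour])

model = IsolationForest(contamination=0.1, random_state=0).fit(X)
df["flagged"] = model.predict(X)  # -1 = anomalous, 1 = normal
print(df)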
We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of the descriptions. It is a very interesting problem to solve, but the machine learning approach we have adopted is currently giving very low accuracy. If you can suggest a lateral approach, it will help our project a lot.
Input description
+-----+----------------------------------------------+
| ID | description |
+-----+----------------------------------------------+
| 1 |delta t17267-ss ara 17 series shower trim ss |
| 2 |delta t14438 chrome lahara tub shower trim on |
| 3 |delta t14459 trinsic tub/shower trim |
| 4 |delta t17497 cp cassidy tub/shower trim only |
| 5 |delta t14497-rblhp cassidy tub & shower trim |
| 6 |delta t17497-ss cassidy 17 series tub/shower |
+-----+----------------------------------------------+
Description in Database
+---+-----------------------------------------------------------------------------------------------------+
|ID | description |
+---+-----------------------------------------------------------------------------------------------------+
| 1 | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial |
| 2 | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 3 | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 4 | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential|
| 5 | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze |
| 6 | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential |
+---+-----------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which causes a huge issue.
2. There are around 2 million records in the database, but when we search within a specific manufacturer the search space is reduced to a few hundred.
3. The record with ID 1 in “Input description” is the same product as the record with ID 1 in “Description in Database” (we know that from a manual check).
4. We use a random forest to train and predict.
Current approach
1. We tokenize the descriptions.
2. Remove stopwords.
3. Add abbreviation information.
4. For each record pair we calculate scores from different string metrics (Jaccard, Sorensen-Dice, cosine), and the average of all these scores is also calculated (a minimal sketch of such a score appears after the table below).
5. Then we calculate the score for the manufacturer ID using the Jaro-Winkler metric.
6. If there are 5 records of a manufacturer in “Input description” and 10 records for that manufacturer in the database, the total is 50 record pairs, i.e. 10 pairs per record, which results in scores that are very close to each other. We consider the top 4 record pairs from each set of 10; where more than one record pair has a similar score, we keep all of them.
7. We arrive at the following training data set format.
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| ISMatch | Description average score | Manufacturer ID score | Jaccard score of description | SorensenDice | cosine(3) |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| 1       | 1:0.19                    | 2:0.88                | 3:0.12                       | 4:0.21       | 5:0.23    |
| 0       | 1:0.14                    | 2:0.66                | 3:0.08                       | 4:0.16       | 5:0.17    |
| 0       | 1:0.14                    | 2:0.68                | 3:0.08                       | 4:0.15       | 5:0.19    |
| 0       | 1:0.14                    | 2:0.58                | 3:0.08                       | 4:0.16       | 5:0.16    |
| 0       | 1:0.12                    | 2:0.55                | 3:0.08                       | 4:0.14       | 5:0.14    |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
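To make the per-pair description score in step 4 concrete, a minimal token-set Jaccard similarity looks like this (the project's real tokenization, stopword removal and abbreviation handling will differ):

def jaccard(a: str, b: str) -> float:
    # Token-set Jaccard similarity between two descriptions.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

print(jaccard("delta t17267-ss ara 17 series shower trim ss",
              "delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial"))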
We train on the above dataset. When we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach;
we planned to use TF-IDF, but initial investigation suggests it also may not improve the accuracy by much.
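Since TF-IDF is being considered anyway, it may be worth trying it on character n-grams rather than words before ruling it out, so that model numbers like "t14438" and differing abbreviations still produce partial matches. A small scikit-learn sketch, with data abbreviated from the tables above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

inputs = ["delta t17267-ss ara 17 series shower trim ss",
          "delta t14438 chrome lahara tub shower trim on"]
catalog = ["delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial",
           "delta monitor 14 lahara tub and shower trim 2 gpm 1 handle chrome plated residential"]

# Character n-grams are more forgiving than word tokens when the two sides
# use different abbreviations and model-number formats.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
vec.fit(inputs + catalog)

sims = cosine_similarity(vec.transform(inputs), vec.transform(catalog))
for i, row in enumerate(sims):
    best = row.argmax()
    print(f"input {i + 1} -> catalog {best + 1} (cosine {row[best]:.2f})")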
I have a time series dataset with quarterly observations, which I want to collapse to an annual series. For that, I need to transform my date variable first.
It looks like
. list date in 1/5
+--------+
| date |
|--------|
1. | 1991q1 |
2. | 1991q2 |
3. | 1991q3 |
4. | 1991q4 |
5. | 1992q1 |
+--------+
Hence, to collapse, I want date (or date2) to be 1991, 1991, 1991, 1991, 1992 etc.
Once I have that, I could use collapse or tscollapse to turn my dataset into annual data.
// create some example data
. clear all
. set obs 5
obs was 0, now 5
. gen date = 123 + _n
. format date %tq
// create the yearly date
. gen date2 = yofd(dofq(date))
// admire the result
. list
+----------------+
| date date2 |
|----------------|
1. | 1991q1 1991 |
2. | 1991q2 1991 |
3. | 1991q3 1991 |
4. | 1991q4 1991 |
5. | 1992q1 1992 |
+----------------+
Another way is just to remember that years and quarters are just integers. A little consultation of the documentation and a little fiddling around yield
. gen Y = 1960 + floor(Q/4)
as a conversion rule to get years from Stata quarterly dates. Formatting year as a yearly date is then permissible but superfluous.
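As a quick check of that rule: Stata stores quarterly dates as quarters since 1960q1, so 1991q1 is stored as 124 (which is why the example above starts at 123 + 1), and 1960 + floor(124/4) = 1960 + 31 = 1991.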