I'm struggling with detecting anomalies in time series sensor data. My data looks like this:
| Timestamp | Temperature |
| 2018-04-01 10:00:00 | 19.00 |
| 2018-04-01 11:00:00 | 21.00 |
| 2018-04-01 12:00:00 | 22.00 |
I'm also able to provide a label, but this label isn't very accurate:
| Timestamp | Temperature | IsBroken |
| 2018-04-01 10:00:00 | 19.00 | 0 |
| 2018-04-01 11:00:00 | 21.00 | 0 |
| 2018-04-01 12:00:00 | 01.00 | 1 |
I can also provide data from other sensors in the region (humidity sensors, etc.), or the average temperature in the region.
I found so many resources about algorithms but I don't know how to solve this technically. Can somebody help me or at least push me in the right direction?
The goal is to detect whether a sensor is broken or not in future sensor data, based on results from the past.
Outlier and anomaly detection is a broad topic. If you are looking for something easy to understand yet powerful, try an isolation forest (link). This algorithm should be able to find days where the sensors reported some unusual combination of values.
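For a concrete starting point, scikit-learn ships an IsolationForest implementation. A minimal sketch on data shaped like yours (the "region_avg" column is only an example of an extra feature you could add, not something from your post):

    # Minimal isolation-forest sketch with scikit-learn (assumed available).
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({
        "temperature": [19.0, 21.0, 22.0, 1.0],
        "region_avg":  [20.0, 21.0, 22.0, 21.0],  # e.g. average temperature in the region
    })

    # contamination is set high here only because the toy data has 4 rows;
    # tune it to the share of broken readings you expect.
    model = IsolationForest(contamination=0.25, random_state=0)
    model.fit(df)

    df["anomaly"] = model.predict(df)  # -1 = flagged as anomalous, 1 = normal
    print(df)

You can then compare the -1/1 flags against your (noisy) IsBroken labels to sanity-check the contamination setting.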
I'm trying to count the number of items that meet at least one criterion, but my current formula counts 2 instead of 1 when an item meets two criteria at the same time.
Consider the following example:
Article | Rate 1 | Rate 2 | Rate 3 | Language
1 | 12% | 54% | 6% | English
2 | 65% | 55% | 34% | English
3 | 59% | 12% | 78% | French
4 | 78% | 8% | 47% | English
5 | 12% | 11% | 35% | English
How do you count the number of articles in English with at least one success rate over 50%?
Right now my formula counts 4 instead of 3, because article 2 is counted twice. (I'm on Google Sheets.)
Thank you for your help.
Best,
Assuming that data is in columns A:E, you could use:
=COUNT(FILTER(A2:A6, E2:E6="English", (B2:B6>=0.5)+(C2:C6>=0.5)+(D2:D6>=0.5)))
=SUMPRODUCT(--(E2:E6="english"), SIGN((B2:B6>0.5)+(C2:C6>0.5)+(D2:D6>0.5)))
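Both formulas avoid the double count the same way: the three rate checks are added together per row, so an article that passes two criteria yields 2, but FILTER treats any non-zero sum as TRUE and SIGN() collapses any positive sum to 1, so each qualifying English article contributes exactly one to the count.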
We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of its description. It is a very interesting problem to solve, but the machine learning approach we have adopted is currently giving very low accuracy. If you can suggest a different, more lateral approach, it will help our project a lot.
Input description
+----+------------------------------------------------+
| ID | description                                    |
+----+------------------------------------------------+
| 1  | delta t17267-ss ara 17 series shower trim ss   |
| 2  | delta t14438 chrome lahara tub shower trim on  |
| 3  | delta t14459 trinsic tub/shower trim           |
| 4  | delta t17497 cp cassidy tub/shower trim only   |
| 5  | delta t14497-rblhp cassidy tub & shower trim   |
| 6  | delta t17497-ss cassidy 17 series tub/shower   |
+----+------------------------------------------------+
Description in Database
+----+--------------------------------------------------------------------------------------------------+
| ID | description                                                                                      |
+----+--------------------------------------------------------------------------------------------------+
| 1  | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial                      |
| 2  | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential           |
| 3  | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential          |
| 4  | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential  |
| 5  | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze                             |
| 6  | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential     |
+----+--------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which causes a huge issue.
2. There are around 2 million records in the database, but when we search for a specific manufacturer the search space gets reduced to a few hundred.
3. The record in "Input description" with ID 1 is the same as the record in "Description in Database" with ID 1 (we know that from manual inspection).
4. We use a random forest model for prediction.
Current approach
1. We tokenize the description.
2. Remove stopwords.
3. Add abbreviation information.
4. For each record pair we calculate scores from different string metrics (Jaccard, Sørensen–Dice, cosine) and also the average of all these scores.
5. Then we calculate the score for the manufacturer ID using the Jaro–Winkler metric.
6. If there are 5 records for a manufacturer in "Input description" and 10 records for that manufacturer in the database, the total is 50 record pairs, i.e. 10 pairs per input record, and their scores are very close to each other. We keep the top 4 record pairs from each set of 10; where more than one pair has the same score, we keep all of them.
7. We arrive at the following training data set format.
+---------+---------------------------+-----------------------+-------------------------------+--------------+-----------+
| IsMatch | Description average score | Manufacturer ID score | Jaccard score of description  | SorensenDice | Cosine(3) |
+---------+---------------------------+-----------------------+-------------------------------+--------------+-----------+
| 1       | 1:0.19                    | 2:0.88                | 3:0.12                        | 4:0.21       | 5:0.23    |
| 0       | 1:0.14                    | 2:0.66                | 3:0.08                        | 4:0.16       | 5:0.17    |
| 0       | 1:0.14                    | 2:0.68                | 3:0.08                        | 4:0.15       | 5:0.19    |
| 0       | 1:0.14                    | 2:0.58                | 3:0.08                        | 4:0.16       | 5:0.16    |
| 0       | 1:0.12                    | 2:0.55                | 3:0.08                        | 4:0.14       | 5:0.14    |
+---------+---------------------------+-----------------------+-------------------------------+--------------+-----------+
We train on the above dataset. When we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach.
We planned to use TF-IDF, but initial investigation suggests it may not improve the accuracy by much either.
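For what it's worth, a common baseline for this kind of fuzzy matching is TF-IDF over character n-grams plus cosine similarity, which tends to cope better with model codes like "t17497-ss" than word-level tokens. A minimal sketch (assuming scikit-learn; the example strings are taken from the question, trimmed of the ®/™ symbols):

    # Baseline fuzzy matcher: TF-IDF over character n-grams + cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    inputs = [
        "delta t17267-ss ara 17 series shower trim ss",
        "delta t14497-rblhp cassidy tub & shower trim",
    ]
    catalog = [
        "delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial",
        "delta monitor 14 cassidy tub and shower trim 2 gpm venetian bronze",
    ]

    # Character 3-grams within word boundaries cope better with codes such as "t17267-ss".
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    vectorizer.fit(inputs + catalog)

    sims = cosine_similarity(vectorizer.transform(inputs), vectorizer.transform(catalog))
    for i, row in enumerate(sims):
        best = row.argmax()
        print(f"input {i} -> catalog {best} (score {row[best]:.2f})")

You could use these cosine scores either as an extra feature for the random forest or as a first-pass ranking within each manufacturer's candidate set.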
I am trying to measure the periodicity strength of a specific time in time series data when a period (e.g., 1 day, 7 days) is given.
For example,
      | AM 10:00 | 10:30 | 11:00 |
DAY 1 |    A     |   A   |   B   |
DAY 2 |    A     |   B   |   B   |
DAY 3 |    A     |   B   |   B   |
DAY 4 |    A     |   A   |   B   |
DAY 5 |    A     |   A   |   B   |
If the period is 1 day, AM 10:00 and 11:00 have the highest periodicity strength in this data, because the values at those times are consistent across days.
Are there any popular methods or research for doing this?
There is a lot of existing research on finding periodic patterns in time series, but I can't find work on measuring the periodicity strength of a specific time when a period is given.
Please share your knowledge. Thanks.
What you are looking for is something called cyclic association rules. I've linked to the paper, which was originally written by researchers at Bell Labs.
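As a quick baseline before you dig into the paper, you can also score each time slot directly by how concentrated its values are across periods, for example the share of periods on which the slot's most frequent value occurs. This is just a simple consistency measure, not the cyclic-association-rules method; a sketch in Python on the toy data from the question:

    # Periodicity-strength baseline: for each time slot, compute the share of days
    # on which the most frequent (modal) value occurs. 1.0 means the slot is
    # perfectly consistent across periods.
    from collections import Counter

    slots = ["10:00", "10:30", "11:00"]
    days = [
        ["A", "A", "B"],
        ["A", "B", "B"],
        ["A", "B", "B"],
        ["A", "A", "B"],
        ["A", "A", "B"],
    ]

    for j, slot in enumerate(slots):
        values = [day[j] for day in days]
        modal_count = Counter(values).most_common(1)[0][1]
        print(f"{slot}: strength = {modal_count / len(values):.2f}")
    # 10:00 -> 1.00, 10:30 -> 0.60, 11:00 -> 1.00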
I'm rolling the following:
Rails 3.2.9
Highcharts
State Machine
I've got an irregular set of data that represents the change of state of hundreds of linux boxes. Each box checks into a central ping server every two minutes.
Every time a device heartbeats, the ping server checks if the device's current state is offline and if so, changes the state to online and sets the heartbeat table's online col to true and inserts the time this happened.
On the ping server, we have a cron that runs a rake task every 5 minutes. This finds all devices whose last heartbeat is older than 5 minutes.
If it discovers a device is offline, it sets the device state to offline and writes a row to the heartbeat table with the time of the last heartbeat and a 0.
We've been doing this for a while and it seems like an efficient way to store the uptime data without creating a row for 500 devices every 5 minutes.
The table looks a little like this:
+---------------------+--------+--------+
| created_at | dev_id | online |
+---------------------+--------+--------+
| 2012-10-08 16:29:16 | 2345 | 0 |
| 2012-11-21 16:40:22 | 2345 | 1 |
| 2012-11-03 19:15:00 | 2345 | 0 |
| 2012-11-08 09:15:01 | 2345 | 1 |
| 2012-11-08 09:18:03 | 2345 | 0 |
| 2012-11-09 17:57:22 | 2345 | 1 |
| 2012-12-09 13:57:23 | 2345 | 0 |
| 2012-12-09 14:57:25 | 2345 | 1 |
| 2012-12-09 15:00:30 | 2345 | 0 |
| 2012-12-09 15:57:31 | 2345 | 1 |
| 2012-12-09 16:07:35 | 2345 | 0 |
| 2012-12-09 16:37:38 | 2345 | 1 |
| 2012-12-09 17:57:40 | 2345 | 0 |
+---------------------+--------+--------+
Following Ryan Bates's fantastic Railscast on Highcharts, I can create a line graph of this data with irregular intervals.
The chart and data series
Following this example:
http://www.highcharts.com/demo/spline-irregular-time
And using a data series something like this:
= #devices.heartbeats.map { |o| o.online == true ? 1 : 0 }
It was plotting the line graph pretty nicely.
Where I'm stuck
The graph finishes at the last time it checked in and I need the graph to show a point at Now. In Ryan's example, he maps a zero to a date if there's no value. I can't translate this part.
I'm trying to achieve a graph like the stacked bar chart below, but I can't get the data into the right shape.
http://www.highcharts.com/demo/bar-stacked
How can I format my query so I get the data until Now as well as each individual point so I can create such a graph?
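One way to get the series to run up to the present is to append a synthetic point at the current time that repeats the last known state. A rough sketch of that transformation (shown in Python purely to illustrate the data shape; the same logic can live in your Rails controller or rake task, and the two sample rows come from the table above, assumed to be in UTC):

    # Turn state-change rows into Highcharts-style [timestamp_ms, 0_or_1] points
    # and append a synthetic point at "now" carrying the last known state.
    from datetime import datetime, timezone

    rows = [
        ("2012-12-09 16:37:38", 1),
        ("2012-12-09 17:57:40", 0),
    ]

    def to_ms(ts: str) -> int:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)

    series = [[to_ms(ts), online] for ts, online in rows]
    if series:
        now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
        series.append([now_ms, series[-1][1]])  # carry the last state forward to now

    print(series)

With the series padded like this, each consecutive pair of points also gives you an interval (start, end, state), which is the shape you need for a stacked uptime/downtime bar.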
I resolved the previous question, but I have another issue. I would like to analyse a Likert-scale questionnaire that is measured 1 to 5 (agree, strongly agree, etc.). I tried many ways but couldn't combine all the results. Do you have any idea how to analyze Likert-scale data?
Can anybody help us define the following type of question in SPSS Variable View?
(It looks like an array question; users give non-unique answers and can enter free text.)
QUESTION 1:
Allows a table of text inputs
+--------+-------+--------+----------+
|        | Speed | Design | Accuracy |
+--------+-------+--------+----------+
| Google |       |        |          |
+--------+-------+--------+----------+
| Yahoo  |       |        |          |
+--------+-------+--------+----------+
| Bing   |       |        |          |
+--------+-------+--------+----------+
I had the same problem. Fortunately, it is easy to solve ;)
If you have your data in the table, you have to "restructure" it (Menu - Data - Restructure). This option allows you to create multidimensional variables. You can find tutorials on YouTube for data restructuring.
In your case, you have to do it manually. You just repeat your identifying variable according to the number of Likert-scale questions. Let's assume you have 3 questions for "Speed", 3 questions for "Design", and 3 questions for "Accuracy". Your table should look like this:
+--------+--------+--------+--------+---------+---------+-----+
|        | Speed1 | Speed2 | Speed3 | Design1 | Design2 | ... |
+--------+--------+--------+--------+---------+---------+-----+
| Google |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Yahoo  |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Bing   |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
You can restructure the data later to perform statistical analysis.
In the case of repeated measurement (e.g. you asked your Likert scale question in the same company 3 times over time), your table might look like this:
+--------+--------+--------+--------+---------+---------+-----+
|        | Speed1 | Speed2 | Speed3 | Design1 | Design2 | ... |
+--------+--------+--------+--------+---------+---------+-----+
| Google |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Google |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Google |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Yahoo  |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Yahoo  |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Yahoo  |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
| Bing   |        |        |        |         |         |     |
+--------+--------+--------+--------+---------+---------+-----+
...
I hope it helped!
Best,
Eugene
I am not sure I know what you are asking, but I believe you are looking for some guidance as to what the "dataset" might need to look like. If you run the following syntax, you should get a better idea of how I would structure it.
DATA LIST LIST (",") / browser (A30) type (A30) score.
BEGIN DATA
Google, Speed, 123
Yahoo, Speed, 34
Bing, Design, 23
Google, Accuracy, 231
Yahoo, Design, 12
END DATA.
Likert scale data should be analyzed using non-parametric methods. There are two ways to handle this:
1) Rank the cases and then perform an ANOVA on the ranked values.
2) Perform a Kruskal-Wallis test on the Likert scale data.
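For reference, if you ever need to reproduce the Kruskal-Wallis test outside SPSS, the equivalent in Python/scipy looks like this (the group names and 1-5 responses below are made up for illustration):

    # Kruskal-Wallis test on Likert responses for three hypothetical groups.
    from scipy.stats import kruskal

    google = [4, 5, 3, 4, 4]
    yahoo = [3, 2, 3, 4, 2]
    bing = [2, 3, 2, 1, 3]

    stat, p = kruskal(google, yahoo, bing)
    print(f"H = {stat:.2f}, p = {p:.3f}")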
Regards