Time series binary classfication [closed] - machine-learning

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
Problem:
I have a dataset about hedge fund. It contains monthly hedge fund returns and some financial metrics. I calculated metrics for every month from 2010 to 2019 December. (2889 monthly data) I want to binary classification and predict hedge funds' class basis on these metrics for next month. I want make prediction for T+1 from T time. And i want use random forest and other classifiers(Decision Tree,KNN,SVM,logistic regression). I know this dataset is time series problem, how do i convert this to machine learning problem.
I am open to your suggestions and advisories as to what method or approach should be followed in modeling, feature engineering and editing this data set.
Additional Questions:
1)How do I make a data split when using this data for training and test ? 0,80-0,20?. Is there any other method of validation you can recommend?
2)some funds are added to the data later, so not all funds have data of equal length, for example, the "AEB" fund established in 2015 has no data before 2015. There are a few such funds, do they cause problems, or is it better to delete them and remove them from the dataset? I have a total of 27 different fund data.
3)In addition, I have changed the tickers/names of the hedge funds to numeric ID, is it possible to do dummy encoding, would it be better for performance?
Sample Dataset:
Date | Fund Name / Ticker | sharpe | sortino | beta | alpha | target |
------------|--------------------|--------|---------|-------|-------|--------|--
31.03.2010 | ABC | -0,08 | 0,025 | 0,6 | 0,13 | 1 |
31.03.2010 | DEF | 0,41 | 1,2 | 1,09 | 0,045 | 0 |
31.03.2010 | SDF | 0,03 | 0,13 | 0,99 | -0,07 | 1 |
31.03.2010 | CBD | 0,71 | -0,05 | 1,21 | 0,2 | 1 |
30.04.2010 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.04.2010 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.04.2010 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.04.2010 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
: | : | : | : | : | : | : |
: | : | : | : | : | : | : |
30.12.2019 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 1 |
30.12.2019 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.12.2019 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.12.2019 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
30.12.2019 | FGF | 1,45 | 0,98 | -0,03 | 0,55 | 1 |
30.12.2019 | AEB | 0,25 | 1,22 | 0,17 | -0,44 | 0 |
My Idea and First Try:
I modeled one example. I used a method like this, I shifted(-1) back the target variable. So each line was shown the class in which the fund was located in the following month.I did it because of this, I want to predict the next month before that month starts. Predict to T+1 from T.But this model gave a very poor result.(%43)
view of this model dataset:
Date | Fund Name / Ticker | sharpe | sortino | beta | alpha | target |
------------|--------------------|--------|---------|-------|-------|--------|--
31.03.2010 | ABC | -0,08 | 0,025 | 0,6 | 0,13 | 1 |
31.03.2010 | DEF | 0,41 | 1,2 | 1,09 | 0,045 | 0 |
31.03.2010 | SDF | 0,03 | 0,13 | 0,99 | -0,07 | 1 |
31.03.2010 | CBD | 0,71 | -0,05 | 1,21 | 0,2 | 1 |
30.04.2010 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.04.2010 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.04.2010 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.04.2010 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
: | : | : | : | : | : | : |
: | : | : | : | : | : | : |
30.12.2019 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.12.2019 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.12.2019 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 1 |
30.12.2019 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
30.12.2019 | FGF | 1,45 | 0,98 | -0,03 | 0,55 | 0 |
30.12.2019 | AEB | 0,25 | 1,22 | 0,17 | -0,44 | ? |

There are many approaches out there that you can find. Time series are challenging and its okay to have poor results at the beginning. I advise you do the following:
Add some lags as additional columns in your dataset. You want to predict t+1 and you have t, so try to also compute t-1,t-2, t-3, etc.
In order to know the best number of t-x that you can have, try to do ACF and PACF plots and see the first lags that appear in the in the shaded region
Lags might boost your accuracy
try to normalize/standardize your data when modeling
Try to see if your time series is a random walk, if it is, there are many recent papers that try to tackle the problem of random walk prediction
If your dataset is big enough, try to use some neural networks like LSTM, RNN, GANs, etc that might be better than the shallow models you have mentioned
I really advise you to see the tutorials of Jason Brownlee on Time Series here Jason is super intelligent and you can always add comments to his tutorials. He is also responsive!!

Related

Neo4j Cypher: How to optimize a NOT EXISTS Query when cardinality is high

The below query takes over 1 second & consumer about 7 MB when cardinality b/w users to posts is about 8000 (one user views about 8000 posts). It is difficult to scale this due to high & linearly growing latencies & memory consumption. Is there a possibility to model this differently and/or optimise the query?
Query
PROFILE MATCH (u:User)-[:CREATED]->(p:Post) WHERE NOT (:User{ID: 2})-[:VIEWED]->(p) RETURN p.ID
Plan
| Plan | Statement | Version | Planner | Runtime | Time | DbHits | Rows | Memory (Bytes) |
+-----------------------------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 4.1" | "COST" | "INTERPRETED" | 1033 | 3721750 | 10 | 6696240 |
+-----------------------------------------------------------------------------------------------------------+
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| Operator | Details | Estimated Rows | Rows | DB Hits | Cache H/M | Memory (Bytes) | Ordered by |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +ProduceResults#neo4j | `p.ID` | 2158 | 10 | 0 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Projection#neo4j | p.ID AS `p.ID` | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Filter#neo4j | u:User | 2158 | 10 | 10 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +Expand(All)#neo4j | (p)<-[anon_15:CREATED]-(u) | 2158 | 10 | 20 | 0/0 | | |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +AntiSemiApply#neo4j | | 2158 | 10 | 0 | 0/0 | | |
| |\ +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +Expand(Into)#neo4j | (anon_47)-[anon_61:VIEWED]->(p) | 233 | 0 | 3695819 | 0/0 | 6696240 | anon_47.ID ASC |
| | | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| | +NodeUniqueIndexSeek#neo4j | UNIQUE anon_47:User(ID) WHERE ID = $autoint_0 | 8630 | 8630 | 17260 | 0/0 | | anon_47.ID ASC |
| | +-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
| +NodeByLabelScan#neo4j | p:Post | 8630 | 8630 | 8631 | 0/0 | | |
+------------------------------+-----------------------------------------------+----------------+------+---------+-----------+----------------+----------------+
Yes, this can be improved.
First, let's understand what this is doing.
First, it starts with a NodeByLabelScan. That makes sense, there's no avoiding that.
But then, for every node of the label (the following executes PER ROW!), it matches to user 2, and expands all :VIEWED relationships from user 2 to see if any of them is the post for that particular row.
Can you see why this is inefficient? There are 8630 post nodes according to the PROFILE plan, so user 2 is looked up by index 8630 times, and their :VIEWED relationships are expanded 8630 times. Why 8630 times? Because this is happening per :Post node.
Instead, try this:
MATCH (:User{ID: 2})-[:VIEWED]->(viewedPost)
WITH collect(viewedPost) as viewedPosts
MATCH (:User)-[:CREATED]->(p:Post)
WHERE NOT p IN viewedPosts
RETURN p.ID
This changes things up a bit.
First it matches to user 2's viewed posts (the lookup and expansion is performed only once), then those viewed posts are collected.
Then it will do a label scan, and filter such that the post isn't in the collection of viewed posts.

How to interpret and use Emokit data?

I am using EmoKit (https://github.com/openyou/emokit) to retrieve data. The sample data looks like as follows:
+========================================================+
| Sensor | Value | Quality | Quality L1 | Quality L2 |
+--------+----------+----------+------------+------------+
| F3 | -768 | 5672 | None | Excellent |
| FC5 | 603 | 7296 | None | Excellent |
| AF3 | 311 | 7696 | None | Excellent |
| F7 | -21 | 296 | Nothing | Nothing |
| T7 | 433 | 104 | Nothing | Nothing |
| P7 | 581 | 7592 | None | Excellent |
| O1 | 812 | 7760 | None | Excellent |
| O2 | 137 | 6032 | None | Excellent |
| P8 | 211 | 5912 | None | Excellent |
| T8 | -51 | 6624 | None | Excellent |
| F8 | 402 | 7768 | None | Excellent |
| AF4 | -52 | 7024 | None | Excellent |
| FC6 | 249 | 6064 | None | Excellent |
| F4 | 509 | 5352 | None | Excellent |
| X | -2 | N/A | N/A | N/A |
| Y | 0 | N/A | N/A | N/A |
| Z | ? | N/A | N/A | N/A |
| Batt | 82 | N/A | N/A | N/A |
+--------+----------+----------+------------+------------+
|Packets Received: 3101 | Packets Processed: 3100 |
| Sampling Rate: 129 | Crypto Rate: 129 |
+========================================================+
Are these values in micro-volts? If so, how can these be more than 200 microvolts? The EEG data is in the range of 0-200 microvolts. Or does this require some kind of processing? If so what?
As described in the frequently asked questions of emokit, :
What unit is the data I'm getting back in? How do I get volts out of it?
One least-significant-bit of the fourteen-bit value you get back is 0.51 microvolts. See the specification for more details.
Looking for the details in the specification (via archive.org), we find the following for the "Emotiv EPOC Neuroheadset":
Resolution | 14 bits 1 LSB = 0.51μV (16 bit ADC,
| 2 bits instrumental noise floor discarded)
Dynamic range (input referred) | 8400μV (pp)
As a validation we can check that for a 14 bits linear ADC, the 8400 microvolts (peak-to-peak) would be divided in steps of 8400 / 16384 or approximately 0.5127 microvolts.
For the Epoc+, the comparison chart indicates a 14-bit and a 16-bit version (with a +/- 4.17mV dynamic range or 8340 microvolts peak-to-peak). The 16-bit version would then have raw data steps of 8340 / 65536 or approximately 0.127 microvolts. If that is what you are using, then the largest value of 812 you listed would correspond to 812 * 0.127 = 103 microvolts.

Dynamically lookup and get average using arrayformula

I'm trying to get the average of all rows containing data in my SourceSheet, which need to be matched with the Fish ID A1:F1 in Sheet1 and A2:A5 in SourceSheet. I want to do this by using ARRAYFORMULA() since Sheet1!A2:A5 is dynamic and may contain other values from time to time.
So far, I've only managed the lookup-and-average-part by:
=AVERAGE(ARRAYFORMULA(HLOOKUP($A2,SourceSheet!A:F,ROW(Items!A$2:F),FALSE)))
How do i achieve the result (see below) with out copying this formula down all rows? Thanks in advance!
Source Data (SourceSheet)
+------+--------+-----+--------+---------+---------+
| tuna | mullet | cod | salmon | herring | catfish |
+------+--------+-----+--------+---------+---------+
| 4 | 3 | 5 | 5 | 5 | 3 |
| 5 | 3 | 3 | 1 | 3 | 2 |
| 5 | 4 | 4 | 4 | 4 | 4 |
| 1 | 2 | 1 | 2 | 3 | 1 |
| 3 | 2 | 2 | 2 | 3 | 2 |
| 4 | 2 | 4 | 2 | 3 | 3 |
| 4 | 2 | 2 | 1 | 2 | 1 |
| 4 | 3 | 4 | 3 | 5 | 4 |
| 3 | 4 | 4 | 2 | 5 | 1 |
| 4 | 3 | 4 | 1 | 2 | 2 |
| 2 | 1 | 3 | 1 | 1 | 1 |
| 2 | 4 | 3 | 2 | 2 | 2 |
| 5 | 3 | 5 | 4 | 5 | 2 |
| 4 | 2 | 4 | 2 | 3 | 2 |
| 2 | 4 | 4 | 3 | 4 | 2 |
| 5 | 4 | 5 | 5 | 3 | 2 |
| 3 | 1 | 3 | 3 | 4 | 2 |
+------+--------+-----+--------+---------+---------+
What I'm trying to achieve: (Sheet1)
+---------+---------+
| | Average |
+---------+---------+
| mullet | 2.76 |
| salmon | 2.75 |
| herring | 2.73 |
| catfish | 2.64 |
+---------+---------+
Formula
=ArrayFormula(
MMULT(
hlookup(A2:A5,SourceSheet!A:F,
TRANSPOSE(FILTER(ROW(SourceSheet!A2:A),LEN(SourceSheet!A2:A))),FALSE),
n(ISNUMBER((FILTER(ROW(SourceSheet!A2:A),LEN(SourceSheet!A2:A)))))
)
/
count(SourceSheet!A2:A)
)
Explanation
AVERAGE is an aggregate function. It could take several values separated by commas, one array or several arrays of values as arguments and returns a single value. This kind of functions if are used on an array formulas will return a single value, so when it's require to return an array of values it's required to use other methods.
In this case, MMULT is used to return the sum of values of each row.
See also
Using CHOOSE and CONCATENATE with ARRAYFORMULA in Google Sheets

Read non delimited asciif file Apache Pig Latin

I'm trying to read a text file in Apache Pig Latin that has non-delimited ascii comprising each row. That is, each column in that row begins and ends at a specific position in the row.
Sample definition:
+--------+----------------+--------------+
| Column | Start Position | End Position |
+--------+----------------+--------------+
| A | 1 | 6 |
+--------+----------------+--------------+
| B | 8 | 11 |
+--------+----------------+--------------+
| C | 13 | 15 |
+--------+----------------+--------------+
Sample Data:
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| s | a | m | p | l | e | | d | a | t | a | | | h | i |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| d | u | d | e | | | | hi | | | | | b | r | o |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
Expected Output:
sample, data, hi
dude, hi, bro
How do I read this in Pig? PigStorage doesn't seem flexible enough to allow positional delimiting, only string delimiting (comma, tab, etc..).
Looks like Apache provides a loader for this specific use case:
LOAD 'data.txt' USING org.apache.pig.piggybank.storage.FixedWidthLoader('1-6, 8-11, 13-15', 'SKIP_HEADER') AS (a, b, c);
https://pig.apache.org/docs/r0.16.0/api/

InfluxDB select different time between two row having same a field value

I have a table like this on InfluxDB:
+---------------+-----------------+--------+--------------------------+
| time | sequence_number | action | session_id |
+---------------+-----------------+--------+--------------------------+
| 1433322591220 | 270001 | delete | 556d85bfe26c3b3864617605 |
| 1433322553324 | 250001 | delete | 556d88e4e26c3b3b83c99d32 |
| 1433241828472 | 230001 | create | 556d88e4e26c3b3b83c99d32 |
| 1433241023633 | 80001 | create | 556d85bfe26c3b3864617605 |
| 1433239305306 | 70001 | create | 556d7f09e26c3b34e872b2ba |
+---------------+-----------------+--------+--------------------------+
Now I want to find the time range from a session be created to deleted, that means get time where action=delete minus time where action=create if they have same session_id

Resources