Problem:
I have a dataset of hedge funds. It contains monthly hedge fund returns and some financial metrics. I calculated the metrics for every month from 2010 to December 2019 (2889 monthly observations). I want to do binary classification and predict each hedge fund's class for the next month based on these metrics, i.e. predict T+1 from time T. I want to use random forest and other classifiers (decision tree, KNN, SVM, logistic regression). I know this dataset is a time series problem; how do I convert it into a machine learning problem?
I am open to your suggestions and advice on which method or approach to follow for modeling, feature engineering and preparing this data set.
Additional Questions:
1) How should I split this data into training and test sets? 0.80/0.20? Is there any other validation method you can recommend?
2) Some funds were added to the data later, so not all funds have histories of equal length. For example, the "AEB" fund, established in 2015, has no data before 2015. There are a few such funds. Do they cause problems, or is it better to remove them from the dataset? I have data for 27 different funds in total.
3) In addition, I have replaced the tickers/names of the hedge funds with numeric IDs. Is it possible to use dummy encoding instead, and would it be better for performance?
Sample Dataset:
Date | Fund Name / Ticker | sharpe | sortino | beta | alpha | target |
------------|--------------------|--------|---------|-------|-------|--------|--
31.03.2010 | ABC | -0,08 | 0,025 | 0,6 | 0,13 | 1 |
31.03.2010 | DEF | 0,41 | 1,2 | 1,09 | 0,045 | 0 |
31.03.2010 | SDF | 0,03 | 0,13 | 0,99 | -0,07 | 1 |
31.03.2010 | CBD | 0,71 | -0,05 | 1,21 | 0,2 | 1 |
30.04.2010 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.04.2010 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.04.2010 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.04.2010 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
: | : | : | : | : | : | : |
: | : | : | : | : | : | : |
30.12.2019 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 1 |
30.12.2019 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.12.2019 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.12.2019 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
30.12.2019 | FGF | 1,45 | 0,98 | -0,03 | 0,55 | 1 |
30.12.2019 | AEB | 0,25 | 1,22 | 0,17 | -0,44 | 0 |
My Idea and First Try:
I modeled one example. I shifted the target variable back by one month (shift(-1)), so that each row shows the class the fund belongs to in the following month. I did this because I want to predict the next month before that month starts, i.e. predict T+1 from T. But this model gave a very poor result (43%).
View of this model's dataset:
Date | Fund Name / Ticker | sharpe | sortino | beta | alpha | target |
------------|--------------------|--------|---------|-------|-------|--------|--
31.03.2010 | ABC | -0,08 | 0,025 | 0,6 | 0,13 | 1 |
31.03.2010 | DEF | 0,41 | 1,2 | 1,09 | 0,045 | 0 |
31.03.2010 | SDF | 0,03 | 0,13 | 0,99 | -0,07 | 1 |
31.03.2010 | CBD | 0,71 | -0,05 | 1,21 | 0,2 | 1 |
30.04.2010 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.04.2010 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.04.2010 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 0 |
30.04.2010 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
: | : | : | : | : | : | : |
: | : | : | : | : | : | : |
30.12.2019 | ABC | 0,05 | -0,07 | 0,41 | 0,04 | 0 |
30.12.2019 | DEF | 0,96 | 0,2 | 1,09 | 1,5 | 0 |
30.12.2019 | SDF | -0,06 | 0,23 | 0,13 | 0,23 | 1 |
30.12.2019 | CBD | 0,75 | -0,01 | 0,97 | -0,06 | 1 |
30.12.2019 | FGF | 1,45 | 0,98 | -0,03 | 0,55 | 0 |
30.12.2019 | AEB | 0,25 | 1,22 | 0,17 | -0,44 | ? |
There are many approaches you can find out there. Time series are challenging, and it's okay to have poor results at the beginning. I advise you to do the following:
Add some lags as additional columns in your dataset. You want to predict t+1 and you have t, so also compute t-1, t-2, t-3, etc.
To find a good number of lags, look at ACF and PACF plots: lags whose bars extend beyond the shaded confidence band are significant, and the first lag that falls inside the shaded region is a reasonable cut-off.
Lags might boost your accuracy.
Try to normalize/standardize your data when modeling.
Check whether your time series is a random walk. If it is, there are many recent papers that try to tackle the problem of random walk prediction.
If your dataset is big enough, try some neural networks such as LSTM, RNN, GANs, etc., which might do better than the shallow models you have mentioned.
I really recommend Jason Brownlee's tutorials on time series. Jason is very knowledgeable, you can always comment on his tutorials, and he is responsive.
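The lag idea, together with a split that respects time order, can be sketched in plain Python. This is only a minimal illustration, not a full pipeline: the rows for 2010-03 and 2010-04 reuse the sharpe and target values of ABC and DEF from the sample table, the later months are made-up values, and the helper names `build_samples` and `time_split` are hypothetical.

```python
from collections import defaultdict

# (month, fund, sharpe, target); months after 2010-04 are made-up values
rows = [
    ("2010-03", "ABC", -0.08, 1),
    ("2010-03", "DEF", 0.41, 0),
    ("2010-04", "ABC", 0.05, 0),
    ("2010-04", "DEF", 0.96, 0),
    ("2010-05", "ABC", 0.10, 1),
    ("2010-05", "DEF", 0.30, 1),
    ("2010-06", "ABC", 0.20, 0),
    ("2010-06", "DEF", 0.50, 1),
]


def build_samples(rows, n_lags=1):
    """Per fund: features are the metric at t, t-1, ..., label is the target at t+1."""
    by_fund = defaultdict(list)
    for month, fund, sharpe, target in rows:
        by_fund[fund].append((month, sharpe, target))

    samples = []
    for fund, history in by_fund.items():
        history.sort()  # "YYYY-MM" strings sort chronologically
        # Need n_lags earlier months and one later month for the label
        for i in range(n_lags, len(history) - 1):
            features = [history[i - k][1] for k in range(n_lags + 1)]
            label = history[i + 1][2]
            samples.append((history[i][0], fund, features, label))
    return samples


def time_split(samples, first_test_month):
    """Train on earlier months, test on later ones (no random shuffling)."""
    train = [s for s in samples if s[0] < first_test_month]
    test = [s for s in samples if s[0] >= first_test_month]
    return train, test


samples = build_samples(rows, n_lags=1)
train, test = time_split(samples, "2010-05")
print(len(samples), len(train), len(test))
```

Because lags and labels are built fund by fund, a fund with a shorter history simply contributes fewer samples and does not have to be dropped. With pandas, the same per-fund shifting is usually written with `groupby(...).shift(...)`.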
I'm trying to read a text file in Apache Pig Latin in which each row is non-delimited ASCII. That is, each column in a row begins and ends at a specific position in the row.
Sample definition:
+--------+----------------+--------------+
| Column | Start Position | End Position |
+--------+----------------+--------------+
| A | 1 | 6 |
+--------+----------------+--------------+
| B | 8 | 11 |
+--------+----------------+--------------+
| C | 13 | 15 |
+--------+----------------+--------------+
Sample Data:
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| s | a | m | p | l | e | | d | a | t | a | | | h | i |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
| d | u | d | e | | | | hi | | | | | b | r | o |
+---+---+---+---+---+---+---+----+---+----+----+----+----+----+----+
Expected Output:
sample, data, hi
dude, hi, bro
How do I read this in Pig? PigStorage doesn't seem flexible enough to allow positional delimiting, only string delimiting (comma, tab, etc.).
Looks like Apache provides a loader for this specific use case:
-- Assumes piggybank.jar has already been REGISTERed; the alias name "data" is arbitrary.
data = LOAD 'data.txt' USING org.apache.pig.piggybank.storage.FixedWidthLoader('1-6, 8-11, 13-15', 'SKIP_HEADER') AS (a, b, c);
https://pig.apache.org/docs/r0.16.0/api/
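To make the column specification concrete, here is a small Python sketch of what positional splitting with 1-based, inclusive ranges does. It is not Pig code and does not reproduce the loader's implementation. The two input strings are reconstructed from the sample data table, assuming "hi" in the second row occupies positions 8-9. The helper name `split_fixed_width` is hypothetical.

```python
# Column ranges from the sample definition: 1-based, inclusive
COLUMNS = [(1, 6), (8, 11), (13, 15)]


def split_fixed_width(line, columns=COLUMNS):
    """Cut each column out by position and trim the padding spaces."""
    return [line[start - 1:end].strip() for start, end in columns]


rows = ["sample data  hi", "dude   hi   bro"]
for row in rows:
    print(", ".join(split_fixed_width(row)))
```

The two printed lines match the expected output in the question.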
I recently upgraded Neo4j from 2.1.7 to 2.2.5. I found that the query
Match (c:C) where id(c) = 111 with c Match (p:I{id: c.id}) return count(p)
worked fine in 2.1.7, but performs very poorly in 2.2.5 (it takes 100 times longer). I have all the indexes that are needed.
I modified the query to
Match (c:C) where id(c) = 111 with c.id as c_id Match (p:I{id: c_id}) return count(p)
and after this it works fine in 2.2.5.
These two queries have different profiles, but I'm not very experienced with profiling.
UPDATED
One more strange thing is that if I use EXPLAIN instead of PROFILE, it works fast.
neo4j-sh (?)$ PROFILE Match (c:C) where id(c) = 10563822 with c Match (i:I{id: c.id}) return count(i);
==> +----------+
==> | count(i) |
==> +----------+
==> | 4551 |
==> +----------+
==> 1 row
==> 18257 ms
==>
==> Compiler CYPHER 2.2
==>
==> Planner COST
==>
==> EagerAggregation
==> |
==> +Filter(0)
==> |
==> +CartesianProduct
==> |
==> +Filter(1)
==> | |
==> | +NodeByIdSeek
==> |
==> +NodeByLabelScan
==>
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==> | EagerAggregation | 26 | 1 | 0 | count(i) | |
==> | Filter(0) | 652 | 4551 | 2522988 | c, i | i.id == c.id |
==> | CartesianProduct | 6521 | 1261494 | 0 | c, i | |
==> | Filter(1) | 0 | 1 | 1 | c | c:C |
==> | NodeByIdSeek | 1 | 1 | 1 | c | |
==> | NodeByLabelScan | 1261494 | 1261494 | 1261495 | i | :I |
==> +------------------+---------------+---------+---------+-------------+-------------------------+
==>
==> Total database accesses: 3784485
neo4j-sh (?)$ PROFILE Match (c:C) where id(c) = 10563822 with c.id as c_id Match (i:I{id: c_id}) return count(i);
==> +----------+
==> | count(i) |
==> +----------+
==> | 4551 |
==> +----------+
==> 1 row
==> 64 ms
==>
==> Compiler CYPHER 2.2
==>
==> Planner COST
==>
==> EagerAggregation
==> |
==> +Apply
==> |
==> +Projection
==> | |
==> | +Filter
==> | |
==> | +NodeByIdSeek
==> |
==> +NodeIndexSeek
==>
==> +------------------+---------------+------+--------+-------------+---------------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +------------------+---------------+------+--------+-------------+---------------------+
==> | EagerAggregation | 1 | 1 | 0 | count(i) | |
==> | Apply | 1 | 4551 | 0 | c, c_id, i | |
==> | Projection | 0 | 1 | 1 | c, c_id | c.id |
==> | Filter | 0 | 1 | 1 | c | c:C |
==> | NodeByIdSeek | 1 | 1 | 1 | c | |
==> | NodeIndexSeek | 1 | 4551 | 4552 | i | :I(id) |
==> +------------------+---------------+------+--------+-------------+---------------------+
==>
==> Total database accesses: 4555
I don't know enough about Neo4j internals to say why your query is slower in more recent versions (the CartesianProduct step looks like a red flag), but here is a logically equivalent query that seems like it should be much faster:
START c = node(111)
MATCH (p:I { id: c.id })
RETURN count(p)
Here is the profile:
+------------------+------+--------+----------------------------------------------------------+-----------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+----------------------------------------------------------+-----------------------+
| ColumnFilter | 1 | 0 | count(p) | keep columns count(p) |
| EagerAggregation | 1 | 0 | INTERNAL_AGGREGATE51b25e53-027d-439b-9046-c1a2a6b0fe70 | |
| Filter | 0 | 0 | c, p | p.id == c.id |
| NodeById | 0 | 0 | c, p | Literal(List(111)) |
| NodeByLabel | 0 | 1 | p | :I |
+------------------+------+--------+----------------------------------------------------------+-----------------------+
NOTE: This should be considered a temporary workaround, as START has been deprecated, and I do not know how long this kind of usage will continue to be supported.
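The two plans in the question differ in how much work they do. The slow plan scans every :I node and then filters on i.id == c.id, while the fast plan looks up matching nodes through the :I(id) index. The following Python sketch only illustrates that difference in work on toy data. It is not how Neo4j is implemented, and the function names and numbers are made up.

```python
def scan_and_filter(i_nodes, wanted_id):
    """Like NodeByLabelScan + Filter: every node is visited and compared."""
    checks = 0
    matches = 0
    for node_id in i_nodes:
        checks += 1
        if node_id == wanted_id:
            matches += 1
    return matches, checks


def index_seek(index, wanted_id):
    """Like NodeIndexSeek: only the matching entries are touched."""
    hits = index.get(wanted_id, [])
    return len(hits), len(hits)


# Toy data: 1000 "I" nodes, 5 of which have id 42
i_nodes = [42] * 5 + list(range(1000, 1995))

# Build a simple id -> positions index
index = {}
for pos, node_id in enumerate(i_nodes):
    index.setdefault(node_id, []).append(pos)

print(scan_and_filter(i_nodes, 42))  # (matches, nodes checked)
print(index_seek(index, 42))         # (matches, entries touched)
```

Both functions find the same number of matches, but the scan touches every node while the index lookup touches only the matches. The real profiles show the same pattern: 3784485 versus 4555 database accesses.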
Are these two Cypher statements identical?
//first
match (a)-[r]->(b),b-[r2]->c
//second
match (a)-[r]->(b)
match b-[r2]->c
The two Cypher statements are NOT identical. We can show this by using the PROFILE command, which shows you how the Cypher engine performs a query.
In the following examples, the queries all end with RETURN a, c, since you cannot have a bare MATCH clause.
As the profiles below show, the first query has a NOT(r == r2) filter that the second query does not. This is because Cypher makes sure that the result of a single MATCH clause does not bind the same relationship to more than one relationship variable.
First query
profile match (a)-[r]->(b),b-[r2]->c return a,c;
==> +-----------------------------------------------+
==> | a | c |
==> +-----------------------------------------------+
==> | Node[1]{name:"World"} | Node[0]{name:"World"} |
==> +-----------------------------------------------+
==> 1 row
==> 2 ms
==>
==> Compiler CYPHER 2.3
==>
==> Planner COST
==>
==> Runtime INTERPRETED
==>
==> Projection
==> |
==> +Filter
==> |
==> +Expand(All)(0)
==> |
==> +Expand(All)(1)
==> |
==> +AllNodesScan
==>
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Projection | 1 | 1 | 0 | a, b, c, r, r2 | a; c |
==> | Filter | 1 | 1 | 0 | a, b, c, r, r2 | NOT(r == r2) |
==> | Expand(All)(0) | 1 | 2 | 4 | a, b, c, r, r2 | (b)-[r2:]->(c) |
==> | Expand(All)(1) | 2 | 2 | 8 | a, b, r | (b)<-[r:]-(a) |
==> | AllNodesScan | 6 | 6 | 7 | b | |
==> +----------------+---------------+------+--------+----------------+----------------+
==>
Second query
profile match (a)-[r]->(b) match b-[r2]->c return a,c;
==> +-----------------------------------------------+
==> | a | c |
==> +-----------------------------------------------+
==> | Node[1]{name:"World"} | Node[1]{name:"World"} |
==> | Node[1]{name:"World"} | Node[0]{name:"World"} |
==> +-----------------------------------------------+
==> 2 rows
==> 2 ms
==>
==> Compiler CYPHER 2.3
==>
==> Planner COST
==>
==> Runtime INTERPRETED
==>
==> Projection
==> |
==> +Expand(All)(0)
==> |
==> +Expand(All)(1)
==> |
==> +AllNodesScan
==>
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
==> +----------------+---------------+------+--------+----------------+----------------+
==> | Projection | 1 | 2 | 0 | a, b, c, r, r2 | a; c |
==> | Expand(All)(0) | 1 | 2 | 4 | a, b, c, r, r2 | (b)-[r2:]->(c) |
==> | Expand(All)(1) | 2 | 2 | 8 | a, b, r | (b)<-[r:]-(a) |
==> | AllNodesScan | 6 | 6 | 7 | b | |
==> +----------------+---------------+------+--------+----------------+----------------+
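The effect of the NOT(r == r2) filter can be imitated with a short Python sketch. The graph used here is an assumption: a self-loop on node 1 plus a relationship from node 1 to node 0. It was chosen because it reproduces the row counts in the two profiles. The actual test graph is not shown in the answer. The function name `two_hop_paths` is hypothetical.

```python
# Assumed graph: (rel_id, start_node, end_node)
RELS = [(0, 1, 1), (1, 1, 0)]


def two_hop_paths(rels, unique_rels):
    """Enumerate (a)-[r]->(b)-[r2]->(c) and return the (a, c) pairs.

    unique_rels=True mimics a single MATCH clause, where r and r2
    may not be the same relationship.
    """
    rows = []
    for r_id, a, b in rels:
        for r2_id, b2, c in rels:
            if b2 != b:
                continue  # r2 must start where r ends
            if unique_rels and r_id == r2_id:
                continue  # the NOT(r == r2) filter
            rows.append((a, c))
    return rows


single_match = two_hop_paths(RELS, unique_rels=True)
two_matches = two_hop_paths(RELS, unique_rels=False)
print(single_match)  # [(1, 0)]
print(two_matches)   # [(1, 1), (1, 0)]
```

With two separate MATCH clauses the self-loop can be used for both r and r2, which yields the extra row.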
data set:
neo4j-sh (?)$ START n = node(*) MATCH n-[r]-m RETURN n,r,m;
==> +---------------------------------------------+
==> | n | r | m |
==> +---------------------------------------------+
==> | Node[1]{} | (2)-[1:KNOWS]->(1) | Node[2]{} |
==> | Node[1]{} | (3)-[2:KNOWS]->(1) | Node[3]{} |
==> | Node[2]{} | (2)-[1:KNOWS]->(1) | Node[1]{} |
==> | Node[2]{} | (3)-[0:KNOWS]->(2) | Node[3]{} |
==> | Node[3]{} | (3)-[0:KNOWS]->(2) | Node[2]{} |
==> | Node[3]{} | (3)-[2:KNOWS]->(1) | Node[1]{} |
==> +---------------------------------------------+
==> 6 rows
==>
==> 0 ms
cypher query:
neo4j-sh (0)$ start x=node(1,2,3),y=node(1,2,3) match x-[r]-y return id(x),id(y) order by id(x) desc;
==> +---------------+
==> | id(x) | id(y) |
==> +---------------+
==> | 1 | 2 |
==> | 1 | 3 |
==> | 2 | 1 |
==> | 3 | 1 |
==> +---------------+
==> 4 rows
In fact, nodes 2 and 3 are linked, so why are no rows returned for that pair?
How can I get them returned?
URL: http://console.neo4j.org/?id=qwdh4p
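For reference, the rows one would expect from an undirected match over the data set above can be enumerated with a small Python sketch. It only restates the expectation in the question and does not explain why the shell omits the rows. The relationship tuples are copied from the data set output, and the helper name `undirected_pairs` is hypothetical.

```python
# Relationships from the data set: (rel_id, start_node, end_node)
RELS = [(1, 2, 1), (2, 3, 1), (0, 3, 2)]
NODES = [1, 2, 3]


def undirected_pairs(nodes, rels):
    """Every (x, y) joined by a relationship, ignoring its direction."""
    pairs = set()
    for _, start, end in rels:
        if start in nodes and end in nodes:
            pairs.add((start, end))
            pairs.add((end, start))
    return sorted(pairs)


expected = undirected_pairs(NODES, RELS)
returned = [(1, 2), (1, 3), (2, 1), (3, 1)]  # rows from the query output
missing = [pair for pair in expected if pair not in returned]
print(missing)  # [(2, 3), (3, 2)]
```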