Read non-delimited ASCII file in Apache Pig Latin - parsing

I'm trying to read a text file in Apache Pig Latin that has non-delimited ASCII text making up each row. That is, each column in a row begins and ends at a specific character position.
Sample definition:
+--------+----------------+--------------+
| Column | Start Position | End Position |
+--------+----------------+--------------+
| A      | 1              | 6            |
+--------+----------------+--------------+
| B      | 8              | 11           |
+--------+----------------+--------------+
| C      | 13             | 15           |
+--------+----------------+--------------+
Sample Data:
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
| s | a | m | p | l | e |   | d | a | t  | a  |    |    | h  | i  |
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
| d | u | d | e |   |   |   | h | i |    |    |    | b  | r  | o  |
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+
Expected Output:
sample, data, hi
dude, hi, bro
How do I read this in Pig? PigStorage doesn't seem flexible enough to allow positional delimiting, only string delimiting (comma, tab, etc.).

Looks like Apache provides a loader for this specific use case:
LOAD 'data.txt' USING org.apache.pig.piggybank.storage.FixedWidthLoader('1-6, 8-11, 13-15', 'SKIP_HEADER') AS (a, b, c);
https://pig.apache.org/docs/r0.16.0/api/

Related

How to add Slim to Rails statistics (stats) for code statistics?

I have searched and experimented but couldn't figure out how to add Slim templates to the views statistics of rails stats. It counts only .erb templates, but I want .slim files to be counted as well, since these are views too.
% bin/rails stats
+----------------------+--------+--------+---------+---------+-----+-------+
| Name                 |  Lines |    LOC | Classes | Methods | M/C | LOC/M |
+----------------------+--------+--------+---------+---------+-----+-------+
| Controllers          |   3245 |   1634 |      57 |     218 |   3 |     5 |
| Helpers              |    186 |    149 |       0 |      18 |   0 |     6 |
| Jobs                 |     34 |     20 |       2 |       2 |   1 |     8 |
| Models               |    879 |    541 |      25 |      77 |   3 |     5 |
| Mailers              |     85 |     53 |       3 |       6 |   2 |     6 |
| Channels             |     46 |     28 |       3 |       4 |   1 |     5 |
| Views                |      0 |      0 |       0 |       0 |   0 |     0 |
+----------------------+--------+--------+---------+---------+-----+-------+
I could add an extra rule for something like "Slim views", but this would count the .erb templates in views too.

Time series binary classification [closed]

Problem:
I have a dataset about hedge funds. It contains monthly hedge fund returns and some financial metrics. I calculated these metrics for every month from 2010 to December 2019 (2889 monthly records). I want to do binary classification and predict each hedge fund's class for the next month based on these metrics, i.e. predict T+1 from data at T. I want to use random forest and other classifiers (decision tree, KNN, SVM, logistic regression). I know this dataset is a time series problem; how do I convert it into a machine learning problem?
I am open to suggestions and advice on what method or approach to follow for modeling, feature engineering, and preparing this dataset.
Additional Questions:
1) How do I split this data into training and test sets? 0.80/0.20? Is there any other validation method you can recommend?
2) Some funds were added to the data later, so not all funds have data of equal length; for example, the "AEB" fund, established in 2015, has no data before 2015. There are a few such funds. Do they cause problems, or is it better to delete them and remove them from the dataset? I have data for 27 different funds in total.
3) In addition, I have changed the tickers/names of the hedge funds to numeric IDs. Is it possible to do dummy encoding, and would it be better for performance?
Sample Dataset:
Date       | Fund Name / Ticker | sharpe | sortino | beta  | alpha | target |
-----------|--------------------|--------|---------|-------|-------|--------|
31.03.2010 | ABC                | -0,08  | 0,025   | 0,6   | 0,13  | 1      |
31.03.2010 | DEF                | 0,41   | 1,2     | 1,09  | 0,045 | 0      |
31.03.2010 | SDF                | 0,03   | 0,13    | 0,99  | -0,07 | 1      |
31.03.2010 | CBD                | 0,71   | -0,05   | 1,21  | 0,2   | 1      |
30.04.2010 | ABC                | 0,05   | -0,07   | 0,41  | 0,04  | 0      |
30.04.2010 | DEF                | 0,96   | 0,2     | 1,09  | 1,5   | 0      |
30.04.2010 | SDF                | -0,06  | 0,23    | 0,13  | 0,23  | 0      |
30.04.2010 | CBD                | 0,75   | -0,01   | 0,97  | -0,06 | 1      |
:          | :                  | :      | :       | :     | :     | :      |
:          | :                  | :      | :       | :     | :     | :      |
30.12.2019 | ABC                | 0,05   | -0,07   | 0,41  | 0,04  | 1      |
30.12.2019 | DEF                | 0,96   | 0,2     | 1,09  | 1,5   | 0      |
30.12.2019 | SDF                | -0,06  | 0,23    | 0,13  | 0,23  | 0      |
30.12.2019 | CBD                | 0,75   | -0,01   | 0,97  | -0,06 | 1      |
30.12.2019 | FGF                | 1,45   | 0,98    | -0,03 | 0,55  | 1      |
30.12.2019 | AEB                | 0,25   | 1,22    | 0,17  | -0,44 | 0      |
My Idea and First Try:
I modeled one example. I shifted the target variable back by one row (shift(-1)), so each row shows the class the fund was in during the following month. I did this because I want to predict the next month before that month starts: predict T+1 from T. But this model gave a very poor result (43%).
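A minimal pandas sketch of that shift, assuming a long-format table with one row per fund per month (the column names date, ticker and target are placeholders for illustration, not the real schema):

import pandas as pd

# Toy frame in the same shape as the question: one row per fund per month.
df = pd.DataFrame({
    "date":   pd.to_datetime(["2010-03-31", "2010-04-30", "2010-03-31", "2010-04-30"]),
    "ticker": ["ABC", "ABC", "DEF", "DEF"],
    "sharpe": [-0.08, 0.05, 0.41, 0.96],
    "target": [1, 0, 0, 0],
})

df = df.sort_values(["ticker", "date"])
# shift(-1) within each fund: the row at month T now carries the class observed at T+1.
df["target_next"] = df.groupby("ticker")["target"].shift(-1)
# The most recent month of each fund has no known T+1 label yet.
df = df.dropna(subset=["target_next"])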
A view of the dataset for this model:
Date       | Fund Name / Ticker | sharpe | sortino | beta  | alpha | target |
-----------|--------------------|--------|---------|-------|-------|--------|
31.03.2010 | ABC                | -0,08  | 0,025   | 0,6   | 0,13  | 1      |
31.03.2010 | DEF                | 0,41   | 1,2     | 1,09  | 0,045 | 0      |
31.03.2010 | SDF                | 0,03   | 0,13    | 0,99  | -0,07 | 1      |
31.03.2010 | CBD                | 0,71   | -0,05   | 1,21  | 0,2   | 1      |
30.04.2010 | ABC                | 0,05   | -0,07   | 0,41  | 0,04  | 0      |
30.04.2010 | DEF                | 0,96   | 0,2     | 1,09  | 1,5   | 0      |
30.04.2010 | SDF                | -0,06  | 0,23    | 0,13  | 0,23  | 0      |
30.04.2010 | CBD                | 0,75   | -0,01   | 0,97  | -0,06 | 1      |
:          | :                  | :      | :       | :     | :     | :      |
:          | :                  | :      | :       | :     | :     | :      |
30.12.2019 | ABC                | 0,05   | -0,07   | 0,41  | 0,04  | 0      |
30.12.2019 | DEF                | 0,96   | 0,2     | 1,09  | 1,5   | 0      |
30.12.2019 | SDF                | -0,06  | 0,23    | 0,13  | 0,23  | 1      |
30.12.2019 | CBD                | 0,75   | -0,01   | 0,97  | -0,06 | 1      |
30.12.2019 | FGF                | 1,45   | 0,98    | -0,03 | 0,55  | 0      |
30.12.2019 | AEB                | 0,25   | 1,22    | 0,17  | -0,44 | ?      |
There are many approaches out there that you can find. Time series are challenging, and it's okay to have poor results at the beginning. I advise you to do the following:
Add some lags as additional columns in your dataset. You want to predict t+1 and you have t, so also compute t-1, t-2, t-3, etc. (see the sketch after this list).
To pick a good number of lags t-x, make ACF and PACF plots and look at which of the first lags fall outside the shaded confidence region.
Lags might boost your accuracy.
Try to normalize/standardize your data when modeling.
Check whether your time series is a random walk; if it is, there are many recent papers that tackle the problem of random walk prediction.
If your dataset is big enough, try neural networks such as LSTMs, RNNs, GANs, etc., which might do better than the shallow models you mentioned.
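As a rough illustration of the lag idea, here is a sketch under the same assumed long-format schema as the snippet above (hypothetical column names; statsmodels is only used for the ACF/PACF plots):

import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def add_lags(df: pd.DataFrame, cols, n_lags=3):
    """Append t-1 ... t-n copies of `cols` as extra feature columns, per fund."""
    out = df.sort_values(["ticker", "date"]).copy()
    for col in cols:
        for lag in range(1, n_lags + 1):
            out[f"{col}_lag{lag}"] = out.groupby("ticker")[col].shift(lag)
    # The first n_lags months of each fund have incomplete history.
    return out.dropna()

# Example: add 3 lags of each metric before fitting the classifiers.
# features = add_lags(df, ["sharpe", "sortino", "beta", "alpha"], n_lags=3)

# ACF/PACF plots of one fund's series can guide the choice of n_lags.
# plot_acf(df.loc[df["ticker"] == "ABC", "sharpe"].dropna())
# plot_pacf(df.loc[df["ticker"] == "ABC", "sharpe"].dropna())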
I really recommend Jason Brownlee's tutorials on time series. He is very knowledgeable, you can always add comments to his tutorials, and he is responsive!

Joining three tables in psql and keeping results according to group membership

I am using psql and have joined three tables A, B, and C, starting from table A.
For example, the resulting table is as follows:
+----+------+------+------+
| pk | a_id | b_id | c_id |
+----+------+------+------+
|  1 |    5 |   12 |   16 |
|  2 |    5 |    7 |    8 |
|  3 |    5 |    6 |   21 |
|  4 |    8 |   12 |   16 |
|  5 |    8 |    3 |    9 |
|  6 |    9 |   11 |   32 |
|  7 |    9 |    8 |    2 |
+----+------+------+------+
I am trying to create c_id relations over a_id. In a_id there are three groups: [5, 8, 9]. For example, c_id=16 has a relation to a_id=[5, 8], so c_id=[8, 21, 9, 32] must be protected via a_id=[5, 8]. The resulting table should look as follows:
+----+------+------+------+
| pk | a_id | b_id | c_id |
+----+------+------+------+
|  1 |    5 |   12 |   16 |
|  2 |    5 |    7 |    8 |
|  3 |    5 |    6 |   21 |
|  4 |    8 |   12 |   16 |
|  5 |    8 |    3 |    9 |
+----+------+------+------+
How can I write such a condition in the join statement?
After the join, you can write this query. I created your result table directly (as res) and then wrote a SQL query against it:
SELECT *
FROM res
WHERE a_id IN (SELECT DISTINCT a_id
               FROM res
               WHERE c_id = 16);

Retrieve last n rows based on one numeric column in Google Sheets

My data looks like this:
+---------------+-----+-----+------+-----+-----+
| Serial Number | LSL | LCL | DATA | UCL | USL |
+---------------+-----+-----+------+-----+-----+
|             1 |   1 |   3 |  2.3 |   7 |   9 |
|             2 |   1 |   3 |  3.1 |   7 |   9 |
|             3 |   1 |   3 |  2.7 |   7 |   9 |
|             4 |   1 |   3 |  4.9 |   7 |   9 |
|             5 |   1 |   3 |    5 |   7 |   9 |
|             6 |   1 |   3 |    3 |   7 |   9 |
|             7 |   1 |   3 |   10 |   7 |   9 |
|             8 |   1 |   3 |  7.8 |   7 |   9 |
|             9 |   1 |   3 |      |   7 |   9 |
|            10 |   1 |   3 |  6.8 |   7 |   9 |
|            11 |   1 |   3 |   10 |   7 |   9 |
|            12 |   1 |   3 |  3.9 |   7 |   9 |
|            13 |   1 |   3 | 11.3 |   7 |   9 |
|            14 |   1 |   3 |      |   7 |   9 |
|            15 |   1 |   3 |      |   7 |   9 |
|            16 |   1 |   3 |      |   7 |   9 |
|            17 |   1 |   3 |      |   7 |   9 |
|            18 |   1 |   3 |      |   7 |   9 |
|            19 |   1 |   3 |      |   7 |   9 |
|            20 |   1 |   3 |      |   7 |   9 |
+---------------+-----+-----+------+-----+-----+
I want to query the last 7 rows of data where the DATA column is not empty. I am trying to achieve something like this:
+----+---+---+------+---+---+
|  7 | 1 | 3 |   10 | 7 | 9 |
|  8 | 1 | 3 |  7.8 | 7 | 9 |
|  9 | 1 | 3 |      | 7 | 9 |
| 10 | 1 | 3 |  6.8 | 7 | 9 |
| 11 | 1 | 3 |   10 | 7 | 9 |
| 12 | 1 | 3 |  3.9 | 7 | 9 |
| 13 | 1 | 3 | 11.3 | 7 | 9 |
+----+---+---+------+---+---+
But currently I am only able to get the last 7 rows, which look like this:
+---------------+-----+-----+------+-----+-----+
| Serial Number | LSL | LCL | DATA | UCL | USL |
+---------------+-----+-----+------+-----+-----+
|            14 |   1 |   3 |      |   7 |   9 |
|            15 |   1 |   3 |      |   7 |   9 |
|            16 |   1 |   3 |      |   7 |   9 |
|            17 |   1 |   3 |      |   7 |   9 |
|            18 |   1 |   3 |      |   7 |   9 |
|            19 |   1 |   3 |      |   7 |   9 |
|            20 |   1 |   3 |      |   7 |   9 |
+---------------+-----+-----+------+-----+-----+
The formula I used is:
=SORT(QUERY(Sheet1!A7:F,"order by A desc limit 7"),1,1)
This formula does not incorporate the condition that the last row of the DATA column must not be empty. Is there a way to achieve what I am looking for?
Assuming your serial numbers are consecutive and sorted as such:
=QUERY(A:F,"Select * where A >= "&ARRAYFORMULA(INDEX(SORT(A2:F,1,false),MATCH(true,ISNUMBER(INDEX(SORT(A2:F,1,false),,4)),0),1))-6&" limit 7")
Breakdown:
=QUERY(A:F,"Select * where A >= "
//index used to find the first serial number with a number in the data column
&ARRAYFORMULA(INDEX(
//reverse order
SORT(A2:F,1,false),
//find first number in data column of reversed data
MATCH(true,ISNUMBER(
//get fourth column (data column) to check for numbers
INDEX(SORT(A2:F,1,false),,4)
//minus 6 so you can get the 6 rows above and the row found
),0),1))-6
//get the first 7 rows from the serial number that matches.
&" limit 7")
EDIT
After our conversation:
If your first column is a date and your dates are consecutive with no duplicates, you can use this:
=QUERY(A:F,"Select * where A >= date '"&TEXT(INDEX(SORT(A2:F,1,false),MATCH(true,ISNUMBER(INDEX(SORT(A2:F,1,false),,4)),0),1)-6,"yyyy-mm-dd")&"' limit 7")
Breakdown:
=QUERY(A:F,"Select * where A >= date
//date tells query that it's looking for a date value
'"&TEXT(INDEX(SORT(A2:F,1,false),MATCH(true,ISNUMBER(INDEX(SORT(A2:F,1,false),,4)),0),1)-6,
"yyyy-mm-dd")&"' limit 7"))
//text formats the date in the way that query requires: yyyy-mm-dd
Based on description rather than example:
=query(query(A:F,"where D is not NULL order by A desc limit 7"),"order by Col1")
("last 7 rows data where the DATA column is not empty.")

Dynamically lookup and get average using arrayformula

I'm trying to get the average of all rows containing data in my SourceSheet, matched against the fish names in SourceSheet!A1:F1 for each fish listed in Sheet1!A2:A5. I want to do this using ARRAYFORMULA(), since Sheet1!A2:A5 is dynamic and may contain other values from time to time.
So far, I've only managed the lookup-and-average-part by:
=AVERAGE(ARRAYFORMULA(HLOOKUP($A2,SourceSheet!A:F,ROW(Items!A$2:F),FALSE)))
How do I achieve the result (see below) without copying this formula down all rows? Thanks in advance!
Source Data (SourceSheet)
+------+--------+-----+--------+---------+---------+
| tuna | mullet | cod | salmon | herring | catfish |
+------+--------+-----+--------+---------+---------+
|    4 |      3 |   5 |      5 |       5 |       3 |
|    5 |      3 |   3 |      1 |       3 |       2 |
|    5 |      4 |   4 |      4 |       4 |       4 |
|    1 |      2 |   1 |      2 |       3 |       1 |
|    3 |      2 |   2 |      2 |       3 |       2 |
|    4 |      2 |   4 |      2 |       3 |       3 |
|    4 |      2 |   2 |      1 |       2 |       1 |
|    4 |      3 |   4 |      3 |       5 |       4 |
|    3 |      4 |   4 |      2 |       5 |       1 |
|    4 |      3 |   4 |      1 |       2 |       2 |
|    2 |      1 |   3 |      1 |       1 |       1 |
|    2 |      4 |   3 |      2 |       2 |       2 |
|    5 |      3 |   5 |      4 |       5 |       2 |
|    4 |      2 |   4 |      2 |       3 |       2 |
|    2 |      4 |   4 |      3 |       4 |       2 |
|    5 |      4 |   5 |      5 |       3 |       2 |
|    3 |      1 |   3 |      3 |       4 |       2 |
+------+--------+-----+--------+---------+---------+
What I'm trying to achieve: (Sheet1)
+---------+---------+
|         | Average |
+---------+---------+
| mullet  |    2.76 |
| salmon  |    2.75 |
| herring |    2.73 |
| catfish |    2.64 |
+---------+---------+
Formula
=ArrayFormula(
  MMULT(
    HLOOKUP(A2:A5, SourceSheet!A:F,
      TRANSPOSE(FILTER(ROW(SourceSheet!A2:A), LEN(SourceSheet!A2:A))), FALSE),
    N(ISNUMBER(FILTER(ROW(SourceSheet!A2:A), LEN(SourceSheet!A2:A))))
  )
  /
  COUNT(SourceSheet!A2:A)
)
Explanation
AVERAGE is an aggregate function. It can take several values separated by commas, one array, or several arrays of values as arguments, and it returns a single value. Functions of this kind, when used in an array formula, still return a single value, so when an array of results is required, other methods have to be used.
In this case, MMULT is used to return the sum of the values in each row: the matrix of looked-up values is multiplied by a column vector of ones (the N(ISNUMBER(...)) part), and each row sum is then divided by the row count to give the average.
See also
Using CHOOSE and CONCATENATE with ARRAYFORMULA in Google Sheets
