R - how to exclude pennystocks from environment before calculating adjusted stock returns - environment

Within my current research I'm trying to find out, how big the impact of ad-hoc sentiment on daily stock returns is.
Calculations functioned quite well and results also are plausible.
The calculations until now with quantmod package and yahoo financial data look like below:
getSymbols(c("^CDAXX",Symbols) , env = myenviron, src = "yahoo",
from = as.Date("2007-01-02"), to = as.Date("2016-12-30")
Returns <- eapply(myenviron, function(s) ROC(Ad(s), type="discrete"))
ReturnsDF <- as.data.table(do.call(merge.xts, Returns))
# adjust column names
colnames(ReturnsDF) <- gsub(".Adjusted","",colnames(ReturnsDF))
ReturnsDF <- as.data.table(ReturnsDF)
However, to make it more robust towards noisy influence of pennystock data I wonder, how its possible to exclude stocks that once in the time period go below a certain value x, let's say 1€.
I guess, the best thing would be to exclude them before calculating the returns and merge the xts object results or even better, before downloading them with the getSymbols command.
Has anybody an idea how this could work best? Thanks in advance.

Try this:
build a price frame of the Adj. closing prices of your symbols
(I use the PF function of the quantmod add-on package qmao which has lots of other useful functions for this type of analysis. (install.packages("qmao", repos="http://R-Forge.R-project.org”))
check by column if any price is below your minimum trigger price
select only columns which have no closings below the trigger price
To stay more flexible I would suggest to take a sub period - let’s say no price below 5 during the last 21 trading days.The toy example below may illustrate my point.
I use AAPL, FB and MSFT as the symbol universe.
> symbols <- c('AAPL','MSFT','FB')
> getSymbols(symbols, from='2018-02-01')
[1] "AAPL" "MSFT" "FB"
> prices <- PF(symbols, silent = TRUE)
> prices
AAPL MSFT FB
2018-02-01 167.0987 93.81929 193.09
2018-02-02 159.8483 91.35088 190.28
2018-02-05 155.8546 87.58855 181.26
2018-02-06 162.3680 90.90299 185.31
2018-02-07 158.8922 89.19102 180.18
2018-02-08 154.5200 84.61253 171.58
2018-02-09 156.4100 87.76771 176.11
2018-02-12 162.7100 88.71327 176.41
2018-02-13 164.3400 89.41000 173.15
2018-02-14 167.3700 90.81000 179.52
2018-02-15 172.9900 92.66000 179.96
2018-02-16 172.4300 92.00000 177.36
2018-02-20 171.8500 92.72000 176.01
2018-02-21 171.0700 91.49000 177.91
2018-02-22 172.5000 91.73000 178.99
2018-02-23 175.5000 94.06000 183.29
2018-02-26 178.9700 95.42000 184.93
2018-02-27 178.3900 94.20000 181.46
2018-02-28 178.1200 93.77000 178.32
2018-03-01 175.0000 92.85000 175.94
2018-03-02 176.2100 93.05000 176.62
Let’s assume you would like any instrument which traded below 175.40 during the last 6 trading days to be excluded from your analysis :-) .
As you can see that shall exclude AAPL and FB.
apply and the base function any applied(!) to a 6-day subset of prices will give us exactly what we want. Showing the last 3 days of prices excluding the instruments which did not meet our condition:
> tail(prices[,apply(tail(prices),2, function(x) any(x < 175.4)) == FALSE],3)
FB
2018-02-28 178.32
2018-03-01 175.94
2018-03-02 176.62

Related

spaCy: optimizing tokenization

I'm currently trying to tokenize a text file where each line is the body text of a tweet:
"According to data reported to FINRA, short volume percent for $SALT clocked in at 39.19% on 12-29-17 http://www.volumebot.com/?s=SALT"
"#Good2go #krueb The chart I posted definitely supports ng going lower. Gobstopper' 2.12, might even be conservative."
"#Crypt0Fortune Its not dumping as bad as it used to...."
"$XVG.X LOL. Someone just triggered a cascade of stop-loss orders and scooped up morons' coins. Oldest trick in the stock trader's book."
The file is 59,397 lines long (a day's worth of data) and I'm using spaCy for pre-processing/tokenization. It's currently taking me around 8.5 minutes and I was wondering if there were any way of optimising the following code to be quicker as 8.5 minutes seems awfully long for this process:
def token_loop(path):
store = []
files = [f for f in listdir(path) if isfile(join(path, f))]
start_time = time.monotonic()
for filename in files:
with open("./data/"+filename) as f:
for line in f:
tokens = nlp(line.lower())
tokens = [token.lemma_ for token in tokens if not token.orth_.isspace() and token.is_alpha and not token.is_stop and len(token.orth_) != 1]
store.append(tokens)
end_time = time.monotonic()
print("Time taken to tokenize:",timedelta(seconds=end_time - start_time))
return store
Although it says files, it's currently only looping over 1 file.
Just to note, I only need this to tokenize the content; I don't need any extra tagging etc.
It sounds like you haven't optimised the pipeline yet. You'll get a significant speed up from disabling the pipeline components you don't need, like so:
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
This should get you down to about the two-minute mark, or better, on its own.
If you need a further speed up, you can look at multi-threading using nlp.pipe. Docs for multi-threading are here:
https://spacy.io/usage/processing-pipelines#section-multithreading
You can use nlp.pipe(all_lines) instead of nlp(line) for a faster processing
see Spacy's documentation - https://spacy.io/usage/processing-pipelines

What is the correct way to set StopLoss and TakeProfit in OrderSend() in MetaTrader4 EA?

I'm trying to figure out if there is a correct way to set the Stop Loss (SL) and Take Profit (TP) levels, when sending an order in an Expert Advisor, in MQL4 (Metatrader4). The functional template is:
OrderSend( symbol, cmd, volume, price, slippage, stoploss, takeprofit, comment, magic, expiration, arrow_color);
So naturally I've tried to do the following:
double dSL = Point*MM_SL;
double dTP = Point*MM_TP;
if (buy) { cmd = OP_BUY; price = Ask; SL = ND(Bid - dSL); TP = ND(Ask + dTP); }
if (sell) { cmd = OP_SELL; price = Bid; SL = ND(Ask + dSL); TP = ND(Bid - dTP); }
ticket = OrderSend(SYM, cmd, LOTS, price, SLIP, SL, TP, comment, magic, 0, Blue);
However, there are as many variations as there are scripts and EA's. So far I have come across these.
In the MQL4 Reference in the MetaEditor, the documentation say to use:
OrderSend(Symbol(),OP_BUY,Lots,Ask,3,
NormalizeDouble(Bid - StopLoss*Point,Digits),
NormalizeDouble(Ask + TakeProfit*Point,Digits),
"My order #2",3,D'2005.10.10 12:30',Red);
While in the "same" documentation online, they use:
double stoploss = NormalizeDouble(Bid - minstoplevel*Point,Digits);
double takeprofit = NormalizeDouble(Bid + minstoplevel*Point,Digits);
int ticket=OrderSend(Symbol(),OP_BUY,1,price,3,stoploss,takeprofit,"My order",16384,0,clrGreen);
And so it goes on with various flavors, here, here and here...
Assuming we are interested in a OP_BUY and have the signs correct, we have the options for basing our SL and TP values on:
bid, bid
bid, ask
ask, ask
ask, bid
So what is the correct way to set the SL and TP for a BUY?
(What are the advantages or disadvantages of using the various variations?)
EDIT: 2018-06-12
Apart a few details, the answer is actually quite simple, although not obvious. Perhaps because MT4 only show Bid prices on the chart (by default) and not both Ask and Bid.
So because: Ask > Bid and Ask - Bid = Slippage, it doesn't matter which we choose as long as we know about the slippage. Then depending on what price you are following on the chart, you may wish to decide on using one over the other, adding or subtracting the Slippage accordingly.
So when you use the measure tool to get the Pip difference of currently shown prices, vs your "exact" SL/TP settings, you need to keep this in mind.
So to avoid having to put the Slippage in my code above, I used the following for OP_BUY: TP = ND(Bid + dTP); (and the opposite for OP_SELL.)
If you buy, you OP_BUY at Ask and close (SL, TP) at Bid.
If you sell, OP_SELL operation is made at Bid price, and closes at Ask.
Both SL and TP should stay at least within STOP_LEVEL * Point() distance from the current price to close ( Bid for buy, Ask for sell).
It is possible that STOP_LEVEL is zero - in such cases ( while MT4 accepts the order ) the Broker may reject it, based on its own algorithms ( Terms and Conditions may call it a "floating Stoplevel" rule or some similar Marketing-wise "re-dressed" term ).
It is adviced to send an OrderSend() request with zero values of SL and TP and modify it after you see that the order was sent successfully. Sometimes it is not required, sometimes that is even mandatory.
There is no difference between the two links you gave us: you may compute SL and TP and then pass them into the function or compute them based on OrderOpenPrice() +/- distance * Point().
So what is the correct way to set the SL and TP for a BUY ?
There is no such thing as "The Correct Way", there are rules to meet
Level 0: Syntax is to meet the call-signature ( the easiest one )
Level 1: all at Market XTO-s have to meet the right level of the current Price +/- slippage, make sure to repeat a RefreshRates()-test as close to the PriceDOMAIN-levels settings, otherwise they get rejected from the Broker side ( blocking one's trading engine at a non-deterministic add-on RTT-latency ) + GetLastError() == 129 | ERR_INVALID_PRICE
Level 2: yet another rules get set from Broker-side, in theire respective Service / Product definition in [ Trading Terms and Conditions ]. If one's OrderSend()-request fails to meet any one of these, again, the XTO will get rejected, having same adverse blocking effects, as noted in Level 1.
Some Brokers do not allow some XTO situations due to their T&C, so re-read such conditions with a due care. Any single of theirs rule, if violated, will lead to your XTO-instruction to get legally rejected, with all adverse effects, as noted above. Check all rules, as you will not like to see any of the following error-states + any of others, restricted by your Broker's T&C :
ERR_LONG_POSITIONS_ONLY_ALLOWED Buy orders only allowed
ERR_TRADE_TOO_MANY_ORDERS The amount of open and pending orders has reached the limit set by the broker
ERR_TRADE_HEDGE_PROHIBITED An attempt to open an order opposite to the existing one when hedging is disabled
ERR_TRADE_PROHIBITED_BY_FIFO An attempt to close an order contravening the FIFO rule
ERR_INVALID_STOPS Invalid stops
ERR_INVALID_TRADE_VOLUME Invalid trade volume
...
..
.
#ASSUME NOTHING ; Is the best & safest design-side (self)-directive

Scikit-learn: How to extract features from the text?

Assume I have an array of Strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I'd like to extract from this description features like:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I prepare the pre-defined known features first? Like
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
I am not sure that I need to use CountVectorizer and TfidfVectorizer here, it's more appropriate to have DictVictorizer, but how can I make dicts with keys extracting values from the entire string?
is it possible with scikit-learn's Feature Extraction? Or should I make my own .fit(), and .transform() methods?
UPDATE:
#sergzach, please review if I understood you right:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...'...]
for d in data:
for brand in brands:
if brand in d:
# ok brand is found
for model in models:
if model in d:
# ok model is found
So creating N-loops per each feature? This might be working, but not sure if it is right and flexible.
Yes, something like the next.
Excuse me, probably you should correct the code below.
import re
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...'...]
features = {
'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
# and other features
}
cat_data = [] # your categories which you should convert into numbers
not_found_columns = []
for line in data:
line_cats = {}
for col, features in features.iteritems():
for i, feature in enumerate(features):
found = False
if re.findall(feature, line.lower(), flags=re.UNICODE) != []:
line_cats[col] = i + 1 # found numeric category in column. For ex., for dell it's 2, for acer it's 5.
found = True
break # current category is determined by a first occurence
# cycle has been end but feature had not been found. Make column value as default not existing feature
if not found:
line_cats[col] = 0
not_found_columns.append((col, line))
cat_data.append(line_cats)
# now we have cat_data where each column is corresponding to a categorial (index+1) if a feature had been determined otherwise 0.
Now you have column names with lines (not_found_columns) which was not found. View them, probably you forgot some features.
We can also write strings (instead of numbers) as categories and then use DV. In result the approaches are equivalent.
Scikit Learn's vectorizers will convert an array of strings to an inverted index matrix (2d array, with a column for each found term/word). Each row (1st dimension) in the original array maps to a row in the output matrix. Each cell will hold a count or a weight, depending on which kind of vectorizer you use and its parameters.
I am not sure this is what you need, based on your code. Could you tell where you intend to use this features you are looking for? Do you intend to train a classifier? To what purpose?

Baffled: SPSS (2265) Unrecognized or invalid variable format

A year ago, we analyzed with SPSS 22 some data with 100+ variables on 5 lines. We used the GUI and laboriously entered variable names and output formats. This year, we are using SPSS 23 after a mandatory upgrade. We have similar data, and want to use a syntax file instead. We copied the GET DATA output from last year, made a few changes, and ran. No deal. We get the notorious and almost completely unhelpful error message in the title. (It continues "The format is invalid. For numeric formats, the width or decimals value may be invalid." Not line number, Not indication of the problem).
We are not using big numbers. We are not using macros, as in this SO question. We tried replacing F1.0 with N1. There are no ','s in the file (hence, no F3,1-like typos). I have searched the web. Does anyone know what else the problem might be?
The failing GET DATA statement, with filename and middle elided.
GET DATA /TYPE=TXT
/FILE="E: ... .txt"
/ENCODING='UTF8'
/DELCASE=VARIABLES 123
/DELIMITERS="\t"
/ARRANGEMENT=DELIMITED
/FIRSTCASE=1
/IMPORTCASE=ALL
/VARIABLES=
ID A4
Group A2
Quality A2
V4 A5
oarea F4.1
oallarea F4.1
olthmean F5.3
olthmax F5.3
...
x N1
o N1
S N1
Z N1
w N1.
F5.5 was not valid. Fixing that and program ran.

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so i know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible reponse. However, if you munging your dataset to a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you trying to do respectively what your reasons are for coding like this. Thus my advice is more general – so just feel to clarify and I will try to give a more concrete response.
1) I say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might geht into trouble with checkboxes for example. Let's say someone checks 3 out of 5 possible answers, the next only checks 1 (i.e. "don't know") . Now it will be much harder to create a spreadsheet (data.frame) type of results view as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey(i.e longitudinal study asking the same participants over and over again) . That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv . RMySQL can connect directly to the database and access its tables and more important its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides I TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
now you have a data.frame with 0s, so what you need to do is to get 1s in an appropriate position (this is a bit cumbersome), a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assing 1 according to indexes in z variable
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (i in 1:responses) {
e[i,] <- ifelse(c %in% b[[i]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.

Resources