zipline backtesting using non-US (European) intraday data - hft

I'm trying to get zipline working with non-US, intraday data, that I've loaded into a pandas DataFrame:
BARC HSBA LLOY STAN
Date
2014-07-01 08:30:00 321.250 894.55 112.105 1777.25
2014-07-01 08:32:00 321.150 894.70 112.095 1777.00
2014-07-01 08:34:00 321.075 894.80 112.140 1776.50
2014-07-01 08:36:00 321.725 894.80 112.255 1777.00
2014-07-01 08:38:00 321.675 894.70 112.290 1777.00
I've followed the moving-averages tutorial here, replacing "AAPL" with my own symbol code and the history calls with "1m" data instead of "1d".
Then I do the final call using algo_obj.run(DataFrameSource(mydf)), where mydf is the dataframe above.
However, all sorts of problems arise related to TradingEnvironment. According to the source code:
# This module maintains a global variable, environment, which is
# subsequently referenced directly by zipline financial
# components. To set the environment, you can set the property on
# the module directly:
# from zipline.finance import trading
# trading.environment = TradingEnvironment()
#
# or if you want to switch the environment for a limited context
# you can use a TradingEnvironment in a with clause:
# lse = TradingEnvironment(bm_index="^FTSE", exchange_tz="Europe/London")
# with lse:
# the code here will have lse as the global trading.environment
# algo.run(start, end)
However, using the context doesn't seem to fully work. I still get errors, for example stating that my timestamps are before the market open (and indeed, looking at trading.environment.open_and_close, the times are for the US market).
My question is, has anybody managed to use zipline with non-US, intra-day data? Could you point me to a resource and ideally example code on how to do this?
n.b. I've seen there are some tests on github that seem related to the trading calendars (tradingcalendar_lse.py, tradingcalendar_tse.py, etc.) - but these appear to only handle data at the daily level. I would need to fix:
open/close times
reference data for the benchmark
and probably more ...

I've got this working after fiddling around with the tutorial notebook. Code sample below. It's using the DF mid, as described in the original question. A few points bear mentioning:
Trading Calendar: I create one manually and assign it to trading.environment, using non_working_days from tradingcalendar_lse.py. Alternatively you could create one that fits your data exactly (though that could be a problem for out-of-sample data). There are two fields you need to define: trading_days and open_and_closes.
sim_params: There is a problem with the default start/end values because they aren't timezone-aware, so you must create a sim_params object and pass timezone-aware start/end parameters.
Also, run() must be called with the argument overwrite_sim_params=False, as calculate_first_open/close otherwise raise timestamp errors.
I should mention that it's also possible to pass pandas Panel data, with the fields open, high, low, close, price and volume in the minor_axis. In that case, those fields are mandatory - otherwise errors are raised (see the sketch below).
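A minimal sketch of building such a Panel from the mid DataFrame (my own example, assuming the old pd.Panel API available in pandas of this vintage; the constant volume and the repetition of the mid price for open/high/low/close are placeholders):

import pandas as pd

# items = sids, major_axis = timestamps, minor_axis = OHLCV fields
panel = pd.Panel({
    sid: pd.DataFrame({'open': mid[sid],
                       'high': mid[sid],
                       'low': mid[sid],
                       'close': mid[sid],
                       'price': mid[sid],
                       'volume': 1000},
                      index=mid.index)
    for sid in mid.columns})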
Note that this code only produces a daily summary of the performance. I'm sure there must be a way to get the result at a minute resolution (I thought this was set by emission_rate, but apparently it's not). If anybody knows please comment and I'll update the code.
Also, I'm not sure what the API call is to invoke analyze() (when using the %%zipline magic in IPython, as in the tutorial, the analyze() method gets called automatically; how do I do this manually?)
import pytz
import pandas as pd
from datetime import datetime
from zipline.algorithm import TradingAlgorithm
from zipline.utils import tradingcalendar
from zipline.utils import tradingcalendar_lse
from zipline.finance.trading import TradingEnvironment
from zipline.api import order_target, record, symbol, history, add_history
from zipline.finance import trading

def initialize(context):
    # Register 2 histories that track minute prices,
    # one with a 10-minute window and one with a 30-minute window
    add_history(10, '1m', 'price')
    add_history(30, '1m', 'price')
    context.i = 0

def handle_data(context, data):
    # Skip first 30 mins to get full windows
    context.i += 1
    if context.i < 30:
        return

    # Compute averages
    # history() has to be called with the same params
    # from above and returns a pandas dataframe.
    short_mavg = history(10, '1m', 'price').mean()
    long_mavg = history(30, '1m', 'price').mean()

    sym = symbol('BARC')

    # Trading logic
    if short_mavg[sym] > long_mavg[sym]:
        # order_target orders as many shares as needed to
        # achieve the desired number of shares.
        order_target(sym, 100)
    elif short_mavg[sym] < long_mavg[sym]:
        order_target(sym, 0)

    # Save values for later inspection
    record(BARC=data[sym].price,
           short_mavg=short_mavg[sym],
           long_mavg=long_mavg[sym])

def analyze(context, perf):
    perf["pnl"].plot(title="Strategy P&L")

# Create algorithm object passing in initialize and
# handle_data functions

# This is needed to handle the correct calendar. Assume that market data has the right index for tradeable days.
# Passing in env_trading_calendar=tradingcalendar_lse doesn't appear to work, as it doesn't implement open_and_closes
from zipline.utils import tradingcalendar_lse
trading.environment = TradingEnvironment(bm_symbol='^FTSE', exchange_tz='Europe/London')
#trading.environment.trading_days = mid.index.normalize().unique()
trading.environment.trading_days = pd.date_range(
    start=mid.index.normalize()[0],
    end=mid.index.normalize()[-1],
    freq=pd.tseries.offsets.CDay(holidays=tradingcalendar_lse.non_trading_days))
trading.environment.open_and_closes = pd.DataFrame(
    index=trading.environment.trading_days,
    columns=["market_open", "market_close"])
trading.environment.open_and_closes.market_open = (
    trading.environment.open_and_closes.index + pd.to_timedelta(60*7, unit="T")).to_pydatetime()
trading.environment.open_and_closes.market_close = (
    trading.environment.open_and_closes.index + pd.to_timedelta(60*15+30, unit="T")).to_pydatetime()

from zipline.utils.factory import create_simulation_parameters
sim_params = create_simulation_parameters(
    # Bug in code doesn't set tz if these are not specified
    # (finance/trading.py: SimulationParameters.calculate_first_open[close])
    start=pd.to_datetime("2014-07-01 08:30:00").tz_localize("Europe/London").tz_convert("UTC"),
    end=pd.to_datetime("2014-07-24 16:30:00").tz_localize("Europe/London").tz_convert("UTC"),
    data_frequency="minute",
    emission_rate="minute",
    sids=["BARC"])

algo_obj = TradingAlgorithm(initialize=initialize,
                            handle_data=handle_data,
                            sim_params=sim_params)

# Run algorithm
perf_manual = algo_obj.run(mid, overwrite_sim_params=False)  # overwrite == True calls calculate_first_open[close] (see above)

@Luciano: You can add analyze(None, perf_manual) at the end of your code to run the analyze step automatically.
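For example (a minimal sketch, assuming matplotlib is available and analyze/perf_manual are defined as above):

import matplotlib.pyplot as plt

analyze(None, perf_manual)   # calls the analyze() defined earlier to plot the P&L
plt.show()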

Related

Biopython : how to extract only relevant atom and save a pdb file (not locally)?

Using Biopython. I have a list of atoms. rep_atoms = [CA, CB, CD3] (Carbon atoms).
I want to save only these from any given PDB file. I don't want to save it locally; I want it saved in memory (there are lots of iterations).
I have arrived at the code below, but it saves the file locally and is very slow.
So, my goal is: for each atom in the PDB file, if it is present in rep_atoms, store it in a new_pdb object so that when I call it later in my code it behaves like a PDB file, without being saved to a local folder on my computer.
How do I append each atom? Printing all atoms is very fast. I want to append it, but it wouldn't be a PDB structure file. What should I do?
from Bio.PDB import .... PDBIO, Select ....

class rep_atom_Select(Select):
    def accept_atom(self, atom):
        if atom.get_name() in rep_atoms:
            return 1
        else:
            return 0

def rep_atoms_pdb(input_pdb):
    io = PDBIO()
    io.set_structure(input_pdb)
    for model in input_pdb:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    if atom.get_name() in rep_atoms:
                        print(atom)
    # dnr_only = io.save("dnr_only.pdb", rep_atom_Select())
Save after the loop, once, instead of thousands of times inside the loop.
def rep_atoms_pdb(input_pdb):
    my_atoms = list()
    for model in input_pdb:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    if atom.get_name() in rep_atoms:  # or if rep_atom_Select().accept_atom(atom):
                        my_atoms.append(atom)  # or something like this
    # The function returns the list of extracted atoms
    return my_atoms
Your definition of rep_atom_Select() does not seem to be directly compatible with this design, nor am I sure receiving the atoms as a list is actually what you want, but this should at least give you a nudge in the right direction.
Brief reading of the Bio.PDB.PDBIO documentation suggests that you might simply want to return the actual PDBIO object. I think something like this:
class rep_atom_Select(Select):
    def accept_atom(self, atom):
        if atom.get_name() in rep_atoms:
            return 1
        else:
            return 0

def rep_atoms_pdb(input_pdb):
    io = PDBIO()                 # Select subclasses have no set_structure(); use PDBIO here
    io.set_structure(input_pdb)
    return io                    # pass rep_atom_Select() later, when calling io.save(...)
This is based on a very cursory reading of the documentation, but at least demonstrates how you would use your overridden class to select only some of the atoms in the input_pdb structure.
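If the goal is to keep the filtered structure in memory rather than on disk, one option (a sketch on my part, assuming Python 3 and that rep_atoms holds atom names as strings) is to save once to an in-memory text handle:

from io import StringIO
from Bio.PDB import PDBIO, Select

rep_atoms = ["CA", "CB", "CD3"]   # hypothetical atom names from the question

class RepAtomSelect(Select):
    def accept_atom(self, atom):
        # keep only the atoms whose names appear in rep_atoms
        return atom.get_name() in rep_atoms

def rep_atoms_pdb_in_memory(input_pdb):
    io = PDBIO()
    io.set_structure(input_pdb)
    handle = StringIO()
    io.save(handle, RepAtomSelect())   # save once, filtering with the Select subclass
    handle.seek(0)
    return handle   # file-like object holding the filtered PDB text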

Override Python Eve Endpoint because of backend column changes?

How do I override an endpoint if I want to run some data transformations before the request hits the database? For example, let's say we had a people table with a column named fname, and we renamed it to first_name, but our users are still making queries with fname. Is there a way of overriding the endpoint with a custom route for people so that I can transform the column name from fname to first_name, or, more involved, run some Python code before either calling SQLAlchemy myself or handing control back to the Eve framework to continue with the database call?
E.g. using the QuickStart, I tried something like this but it didn't work:
from eve import Eve
from flask import jsonify

app = Eve()

@app.route('/people/<name>')
def custom_people_func(name):
    return jsonify(name='Override', people_name=name)

if __name__ == '__main__':
    app.run()
settings.py
people = {
    # 'title' tag used in item links. Defaults to the resource title minus
    # the final, plural 's' (works fine in most cases but not for 'people')
    'item_title': 'person',

    # by default the standard item entry point is defined as
    # '/people/<ObjectId>'. We leave it untouched, and we also enable an
    # additional read-only entry point. This way consumers can also perform
    # GET requests at '/people/<lastname>'.
    'additional_lookup': {
        'url': 'regex("[\w]+")',
        'field': 'lastname'
    },

    # We choose to override global cache-control directives for this resource.
    'cache_control': 'max-age=10,must-revalidate',
    'cache_expires': 10,

    # most global settings can be overridden at resource level
    'resource_methods': ['GET', 'POST'],

    'schema': schema
}

DOMAIN = {'people': people}
When I do curl -i http://127.0.0.1:5000/people/obama, it doesn't call the method I defined; it goes through the default Eve routing instead.
More generally, how do we manage these kinds of database changes, if at all possible, with Eve?
Have you looked into Event Hooks, specifically into database event hooks? They allow you to hook a callback function to your db events (insert, replace, delete, fetch). Within your function you can, for example, change the incoming payload before it hits the database.
>>> def before_insert(resource_name, items):
...     print('About to store items to "%s"' % resource_name)
...     # modify incoming items here
>>> app = Eve()
>>> app.on_insert += before_insert
>>> app.run()
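As a sketch of how this could apply to the fname/first_name rename from the question (my own example, not part of the original answer; it assumes clients still send fname in their payloads and uses Eve's resource-level on_insert_people hook):

from eve import Eve

def rename_fname(items):
    # rewrite the legacy field name before the documents reach the database
    for item in items:
        if 'fname' in item:
            item['first_name'] = item.pop('fname')

app = Eve()
app.on_insert_people += rename_fname   # resource-specific variant of on_insert

if __name__ == '__main__':
    app.run()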

How broadcast variables are used in dask parallelization

I have some code applying a map function on a dask bag. I need a lookup dictionary to apply that function and it doesn't work with client.scatter.
I don't know if I am doing the right thing, because the workers start but they don't do anything. I have tried different configurations based on different examples, but I can't get it to work. Any support would be appreciated.
I know that in Spark you define a broadcast variable and access its content with variable.value inside the function you want to apply. I don't see the same in dask.
# Function to map
def transform_contacts_add_to_historic_sin(data, historic_dict):
    raw_buffer = ''
    line = json.loads(data)
    if line['timestamp'] > historic_dict['timestamp']:
        raw_buffer = raw_buffer + line['vid']
    return raw_buffer

# main program
# historic_dict is a dictionary previously filled, which is the lookup variable for the map function
# file_records will be a list of json.dump entries read from an S3 file
from distributed import Client

client = Client()

historic_dict_scattered = client.scatter(historic_dict, broadcast=True)

file_records = []
raw_data = s3_procedure.read_raw_file(... S3 file.......)
data = TextIOWrapper(raw_data)
for line in data:
    file_records.append(line)

bag_chunk = db.from_sequence(file_records, npartitions=16)
bag_transform = bag_chunk.map(lambda x: transform_contacts_add_to_historic(x), args=[historic_dict_scattered])
bag_transform.compute()
If your dictionary is small you can just include it directly
def func(partition, d):
    return ...

my_dict = {...}
b = b.map(func, d=my_dict)
If it's large then you might want to wrap it up in Dask delayed first
my_dict = dask.delayed(my_dict)
b = b.map(func, d=my_dict)
If it's very large then yes, you might want to scatter it first (though I would avoid this if things work out with either of the approaches above).
[my_dict] = client.scatter([my_dict])
b = b.map(func, d=my_dict)
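Putting the first approach together with the function from the question (a sketch under the assumption that historic_dict and file_records are built as in the original code):

import json
import dask.bag as db

def transform_contacts_add_to_historic_sin(data, historic_dict):
    line = json.loads(data)
    if line['timestamp'] > historic_dict['timestamp']:
        return line['vid']
    return ''

bag_chunk = db.from_sequence(file_records, npartitions=16)
# extra keyword arguments to map() are passed through to the function
bag_transform = bag_chunk.map(transform_contacts_add_to_historic_sin,
                              historic_dict=historic_dict)
result = bag_transform.compute()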

Posting date issue due to time zones and the use of SystemDateGet()

Summary
We are experiencing a problem where the systemDateGet() function returns the AOS date where the local date is required when determining the date for posting purposes.
Details
We have branches around the world, both sides of the international date line, with
users connecting to local terminal servers with the appropriate time zone settings
for the user's branch.
Each branch is associated with a different company, which has been configured with
the correct time zone.
We run a single AOS with the time zone setting set for the head office. There is
not an option to introduce additional AOS's.
The SystemDateGet() function is returning the AOS date as the user is not changing
their Axapta session date.
A number of system fields in the database related to posting dates are DATE based and
not UTCDATETIME based.
We are running AX2009 SP1 RU7.
Kernel version 5.0.1500.4570
Application version 5.0.1500.4570
Localization version: Brazil, China, Japan, Thailand, India
I am aware that the SystemDateGet() function was designed to return the AOS date unless
the user changes their session date in which case that date is returned.
Each user has the appropriate time zone setting in their user profile.
Problem
One example of the problem is when a user attempts to post a journal involving financial
transactions, where the ledger period is checked to see if it is open. For example,
the user is in England attempting to post a journal at 3:00pm on the 30th of November, local
time, the standard Axapta code uses the systemDateGet() function to determine the date to use
in the validation (determining if the period is open). In our case, the AOS is based in
Brisbane Australia and the systemDateGet() function is returning the 1st of December
(local time 1:00am on the 1st of December).
Another example of the problem is where an invoice is posted in the United States on a Friday
and the day of the week where the AOS is situated is a Saturday. We need the system to
record the local date.
Question
Besides rewriting all system code involving systemDateGet(), over 2000 entities, are there
any other options that can be used to get around the problem of obtaining the correct local
date?
Solution limitations.
In the example given above of the ledger period being closed, it is not possible from a
business practices standpoint to open the next period before end of month processing
has been completed.
There is currently no option for the purchase of additional AOS's.
Create a function in the Global class:
static date systemDateGetLocal()
{
    return DateTimeUtil::date(DateTimeUtil::applyTimeZoneOffset(DateTimeUtil::utcNow(), DateTimeUtil::getUserPreferredTimeZone()));
}
Then in Info.watchDog() do a:
systemDateSet(systemDateGetLocal());
This may only be a partial solution, as the watchDog is executed on the client side only.
Here is a quick update. Let me know if there are any issues or situations that need to be addressed.
USING LOCAL TIME
Introduction:
The following is a work in progress update, as the solution has yet to be fully tested in all situations.
Problem:
If the user has a different time zone to the AOS, the timeNow() and systemDateGet() functions are returning the AOS details when the local date time is required.
Clarifiers:
When the local date time is changed within Axapta, the systemDateGet() function will return the local date, however the timeNow() function still returns the AOS time.
Coding changes:
A number of coding changes have been made to handle a number of different situations. Comments will be added where an explanation may be required.
The brackets were changed to { and } to allow this to be posted.
CLASS:GLOBAL
One requirement I was given was to allow the system to handle multiple sites within a company that may have different time zones. Presently this functionality is not required.
static server void setSessionDateTime(
    inventSiteId inventSiteId = '',
    utcDateTime reference = dateTimeUtil::utcNow())
{
    str sql;
    sqlStatementExecutePermission perm;
    connection conn = new UserConnection();
    timeZone timeZone;
    int ret;
    ;
    if (inventSiteId)
    {
        timeZone = inventSite::find(inventSiteId).Timezone;
    }
    else
    {
        timeZone = dateTimeUtil::getCompanyTimeZone();
    }

    //This is to get around the kernel validation of changing timezones
    //when the user has more than one session open.
    sql = strfmt("Update userInfo set preferredTimeZone = %1 where userInfo.id = '%2'", enum2int(timeZone), curUserId());
    perm = new SQLStatementExecutePermission(sql);
    perm.assert();
    ret = conn.createStatement().executeUpdate(sql);

    dateTimeUtil::setUserPreferredTimeZone(timeZone);
    dateTimeUtil::setSystemDateTime(reference);

    CodeAccessPermission::revertAssert();
}
static int localTime()
{
    utcDateTime tmp;
    ;
    setSessionDateTime();
    tmp = dateTimeUtil::applyTimeZoneOffset(dateTimeUtil::utcNow(), dateTimeUtil::getCompanyTimeZone());
    return dateTimeUtil::time(tmp);
}
The following method was implemented as a cross check to ensure that systemDateGet() returns the expected value.
static date localDate()
{
    utcDateTime tmp;
    ;
    setSessionDateTime();
    tmp = dateTimeUtil::applyTimeZoneOffset(dateTimeUtil::utcNow(), dateTimeUtil::getCompanyTimeZone());
    return dateTimeUtil::date(tmp);
}
CLASS:APPLICATION
Modify the method setDefaultCompany. Add the line setSessionDateTime(); directly after the super call. This is to allow the time to change when the user changes company (another requirement I was given).
CLASS:INFO
So that the system uses the correct date/time from the start of the session.
void startupPost()
{
    ;
    setSessionDateTime();
}
Modify the method canViewAlertInbox() adding the line setSessionDateTime(); as the first line. This is to handle if the user has multiple forms open for different companies.
Localization dependent changes:
Depending on your service pack and localizations, you will need to change a number of objects to use the localTime() function, replacing timeNow(). IMPORTANT NOTE: Do not change the class BatchRun to use the new localTime function as this will stop it working correctly.
In our system there were around 260 instances that could be changed. If you do not use all modules and localizations the actual number of lines you need to change will be less.
Final note:
There are a number of today() calls in the code. I have not yet gone through each line to ensure it is coded correctly, i.e. using today() instead of systemDateGet().
Known issues:
I have come across a situation where the timezone change function did not work completely as expected. This was when one session was doing a database synchronisation and another session was opened in a different company. As normal users will never be able to do this, I have not spent much time on its solution at this stage. It is something I do intend to resolve.

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but
only needs to index the file once, or to define a higher level object
which acts like a dictionary over your multiple files. Here is a
simple example:
from Bio import SeqIO

class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)

print combo['gi|3318709|pdb|1A91|'].description
print combo['gi|1348917|gb|G26685|G26685'].description
print combo["key_failure"]
If you don't plan to use the index again and memory isn't a limitation (both of which appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:", like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
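A small usage sketch (my own example; "some_id" is a hypothetical record identifier present in one of the files, and infile/pairfile/infmt are as defined in the question):

from Bio import SeqIO

combined = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
record = combined["some_id"]      # dictionary-style lookup across both files
print(record.description)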
