CVXPY constraint formulation - cvxpy

I'm creating a long-short, market-neutral portfolio using the following functions.
import cvxpy as cp

def create_gross_exposure_constraint(w):
    return cp.norm1(w) <= 2

def create_market_neutral_constraint(w):
    return cp.sum(w) == 0
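For context, here is a minimal sketch (not from the original post) of how these helpers might be wired into a problem; the expected-return vector mu and the maximize objective are placeholder assumptions:

import numpy as np

n = 10
mu = np.random.randn(n)                     # hypothetical expected returns
w = cp.Variable(n)
constraints = [
    create_gross_exposure_constraint(w),    # gross exposure: ||w||_1 <= 2
    create_market_neutral_constraint(w),    # market neutral: sum(w) == 0
]
problem = cp.Problem(cp.Maximize(mu @ w), constraints)
problem.solve()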
I'm trying to add a net asset value constraint, as explained below:
I want the net exposure to be < 1 million USD (this number can be changed). What this means is that if I do the allocation using the above weights, then abs(Position_of_Asset * Price_of_Asset) < 1 million USD.
I tried creating it using the code below, but it is not working.
def create_value_exposure_constraint(w, idx):
    x = cp.sum(cp.abs(w))
    return w[idx] / x <= 0.1
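As a note, w[idx] / x divides one expression by another, which is not DCP-compliant, so CVXPY will reject this constraint. A minimal sketch of one way to cap the dollar exposure of a single asset, under the assumption that the total portfolio value nav is a known constant (nav, limit, and the function name are placeholders, not from the original post):

def create_value_exposure_constraint(w, idx, nav, limit=1_000_000):
    # dollar exposure of asset idx is w[idx] * nav when nav is a constant,
    # and abs(affine expression) <= constant is a valid DCP constraint
    return cp.abs(w[idx]) * nav <= limit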

Related

Where can I perform EDA and run machine learning model on a large dataset?

I want to perform EDA, feature selection, and run a machine learning model on a large dataset.
I run the code (Python) in Google Colab with Colab Pro.
I'm combining 3 tables (all in CSV files) and doing some feature engineering, so I end up with 11,521,958 rows × 50 columns (I started with 20,000,000 entries but downsized it because I got the error at a much earlier stage).
From that point on, when I run any code, I get the error:
"Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro." - even though I am using Colab Pro.
In addition, I'm using the following code to reduce memory usage:
import numpy as np
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

def reduce_mem_usage(df, use_float16=False):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    # Iterate over every column and find its type
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            continue
        col_type = df[col].dtype
        # If the column is not of object dtype
        if col_type != object:
            # Get the minimum and maximum value
            c_min = df[col].min()
            c_max = df[col].max()
            # If the type is int, pick the smallest integer type whose range fits the data
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Otherwise pick the smallest float type whose range fits the data
                # (float16 only when use_float16 is requested)
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
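For what it's worth, one way to keep peak memory lower is to apply a function like this to each table right after reading it, before any merging or feature engineering, so the tables never sit in memory at their original dtypes. A minimal sketch with placeholder file names:

import pandas as pd

# Placeholder file names; downcast each table immediately after reading,
# before merging and feature engineering.
df1 = reduce_mem_usage(pd.read_csv('table1.csv'))
df2 = reduce_mem_usage(pd.read_csv('table2.csv'))
df3 = reduce_mem_usage(pd.read_csv('table3.csv'))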
Nothing helps. I want to keep this dataset and not remove entries.
What other options do I have for analyzing a large dataframe on my local machine? Cloud computing services are not my strong suit, so I'd be happy to get some help.
Thank you all!

MCMC for multiple coefficients with normal (Gaussian) distributions

I have a linear model as follows: Acceleration = C1*V + C2*(X - D), where D = alpha - beta*V + gamma*M.
Note that the values for V, X, M are given in the dataset.
My goal is to run MCMC 350 times for each of the following coefficients: C1, C2, alpha, beta, gamma.
1- I have the mean and standard deviation for C1, C2, alpha, beta, gamma.
2- All coefficients (C1, C2, alpha, beta, gamma) are normally distributed.
I have tried two methods to find the MCMC posterior for each coefficient: one uses pymc3 (which I'm not sure I have done correctly), the other defines a likelihood function based on the method described in the following link by Jonny Homfmeister (in my case, I changed the distribution from binomial to normal Gaussian):
https://towardsdatascience.com/bayesian-inference-and-markov-chain-monte-carlo-sampling-in-python-bada1beabca7
The problem is that, after running MCMC for C1, C2, alpha, beta, and gamma, and using the mean of the posterior (the output of MCMC) in my main model, I see that the absolute error has increased! This means the coefficients have not been optimized by MCMC and my method is not working properly.
I would appreciate it if someone could help me with the correct MCMC algorithm for a normal distribution.
#### First method: pymc3 ####
import pymc3 as pm
import scipy.stats as st
import arviz as az

for row in range(350):
    X_c1 = st.norm(loc=-0.06, scale=0.47).rvs(size=100)
    with pm.Model() as model:
        prior = pm.Normal('c1', mu=-0.06, sd=0.47)                 # prior (weights)
        obs = pm.Normal('obs', mu=prior, sd=0.47, observed=X_c1)   # likelihood
        step = pm.Metropolis()
        trace_c1 = pm.sample(draws=30, chains=2, step=step, return_inferencedata=True)
    # calculate the mean of the output (posterior distribution)
    mean_c1 = az.summary(trace_c1, var_names=["c1"], round_to=2).iloc[0][['mean']]
    mean_c1 = mean_c1.to_numpy()
    Acceleration = (mean_c1 * V) + C2 * (X - D)   # apply model
#### Second method, from the linked article ####
import numpy as np
import scipy.stats

## Define the likelihood P(x|p) - normal distribution
def likelihood(p):
    return scipy.stats.norm.cdf(C1, loc=-0.06, scale=0.47)

def prior(p):
    return scipy.stats.norm.pdf(p)

def acceptance_ratio(p, p_new):
    # Return R, using the functions we created before
    return min(1, ((likelihood(p_new) / likelihood(p)) * (prior(p_new) / prior(p))))

p = np.random.normal(C1, 0.47)   # Initialize a value of p

#### Define model parameters
n_samples = 790                  # I HAVE NO IDEA HOW TO CHOOSE THIS VALUE???
burn_in = 99
lag = 2
results_1 = []

##### Create the MCMC loop
for i in range(n_samples):
    p_new = np.random.random_sample()   # Propose a new value of p
    R = acceptance_ratio(p, p_new)      # Compute acceptance probability
    u = np.random.uniform(0, 1)         # Draw a random sample to compare R to
    if u < R:                           # If R is greater than u, accept the new value of p
        p = p_new
    if i > burn_in and i % lag == 0:    # Record values after burn-in - how often is determined by lag
        results_1.append(p)
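This is not from the original post, but for comparison, a minimal random-walk Metropolis sketch for a single coefficient with a normal prior and a normal likelihood could look like the following; the data, prior mean/scale, proposal width, and sample counts are placeholder assumptions:

import numpy as np
import scipy.stats

# Placeholder data and hyperparameters (assumptions for illustration only)
data = np.random.normal(loc=-0.06, scale=0.47, size=100)   # observed values for one coefficient
prior_mu, prior_sd = -0.06, 0.47
obs_sd = 0.47

def log_posterior(p):
    # log prior + log likelihood of all observations given p
    log_prior = scipy.stats.norm.logpdf(p, loc=prior_mu, scale=prior_sd)
    log_lik = scipy.stats.norm.logpdf(data, loc=p, scale=obs_sd).sum()
    return log_prior + log_lik

n_samples, burn_in, lag = 5000, 500, 2
p = prior_mu                                   # start the chain at the prior mean
samples = []
for i in range(n_samples):
    p_new = p + np.random.normal(0, 0.1)       # random-walk proposal around the current value
    log_r = log_posterior(p_new) - log_posterior(p)
    if np.log(np.random.uniform()) < log_r:    # accept with probability min(1, r)
        p = p_new
    if i > burn_in and i % lag == 0:           # record thinned samples after burn-in
        samples.append(p)

posterior_mean = np.mean(samples)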

Can I calculate the confidence bound of a Prophet model that would contain a certain value?

Can I use the y-hat variance, the bounds, and the point estimate from the forecast data frame to calculate the confidence level that would contain a given value?
I've seen that I can change my interval level prior to fitting, but programmatically that feels like a LOT of expensive trial and error.
Is there a way to estimate the confidence bound using only the information from the model parameters and the forecast data frame?
Something like:
for level in [.05, .1, .15, ... , .95]:
    if value_in_question in (yhat - Z_{level}*yhat_variance/N, yhat + Z_{level}*yhat_variance/N):
        print 'im in the bound level {level}'
# This is pseudo-code, not meant to run in a console
EDIT: working Prophet example
# csv from fbprophet's working examples https://github.com/facebook/prophet/blob/master/examples/example_wp_log_peyton_manning.csv
import pandas as pd
import numpy as np
import scipy.stats as st
from fbprophet import Prophet

df = pd.read_csv('example_wp_log_peyton_manning.csv')
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)

# the smallest confidence level s.t. the confidence interval of the 30th prediction contains 9
## My current approach
def __probability_calculation(estimate, forecast, j = 30):
    sd_residuals = (forecast.yhat_lower[j] - forecast.yhat[j]) / (-1.28)
    for alpha in np.arange(.5, .95, .01):
        z_val = st.norm.ppf(alpha)
        if (forecast.yhat[j] - z_val*sd_residuals < estimate < forecast.yhat[j] + z_val*sd_residuals):
            return alpha

prob = __probability_calculation(9, forecast)
fbprophet uses the numpy.percentile method to estimate the percentiles, as you can see here in the source code:
https://github.com/facebook/prophet/blob/0616bfb5daa6888e9665bba1f95d9d67e91fed66/python/prophet/forecaster.py#L1448
How to inversely map values to their corresponding percentiles is already answered here:
Map each list value to its corresponding percentile
Combining everything based on your code example:
import pandas as pd
import numpy as np
import scipy.stats as st
from fbprophet import Prophet
url = 'https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_peyton_manning.csv'
df = pd.read_csv(url)
# put the amount of uncertainty samples in a variable so we can use it later.
uncertainty_samples = 1000 # 1000 is the default
m = Prophet(uncertainty_samples=uncertainty_samples)
m.fit(df)
future = m.make_future_dataframe(periods=30)
# You need to replicate some of the preparation steps which are part of the predict() call internals
tmpdf = m.setup_dataframe(future)
tmpdf['trend'] = m.predict_trend(tmpdf)
sim_values = m.sample_posterior_predictive(tmpdf)
The sim_values object contains, for every datapoint, 1000 simulations on which the confidence interval is based.
Now you can call the scipy.stats.percentileofscore method with any target value:
target_value = 8
st.percentileofscore(sim_values['yhat'], target_value, 'weak') / uncertainty_samples
# returns 44.26
To prove this works backwards and forwards, you can take the output of the np.percentile method and feed it into the scipy.stats.percentileofscore method.
This works to an accuracy of 4 decimals:
ACCURACY = 4
for test_percentile in np.arange(0, 100, 0.5):
    target_value = np.percentile(sim_values['yhat'], test_percentile)
    if not np.round(st.percentileofscore(sim_values['yhat'], target_value, 'weak') / uncertainty_samples, ACCURACY) == np.round(test_percentile, ACCURACY):
        print(test_percentile)
        raise ValueError('This doesnt work')
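Building on this (not part of the original answer), a small hypothetical helper could turn that percentile into the smallest central interval width that contains a target value for one forecast row, assuming sim_values['yhat'] holds one row of simulations per date:

def level_containing(sim_yhat, j, target):
    # quantile of `target` among the simulations for row index j, as a value in [0, 1]
    q = st.percentileofscore(sim_yhat[j], target, 'weak') / 100.0
    # a central interval [(1-w)/2, (1+w)/2] contains `target` once w >= |2q - 1|
    return abs(2 * q - 1)

# illustrative call: level_containing(sim_values['yhat'], 30, 9)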

How does this method or formula for calculating ROC AUC work?

I was trying to calculate the AUC using MySQL for data in a table like the one below:
y p
1 0.872637
0 0.130633
0 0.098054
...
...
1 0.060190
0 0.110938
I came across the following SQL query, which gives the correct AUC score (I verified it using the sklearn method).
SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
FROM (
SELECT y, row_number() OVER (ORDER BY p) r
FROM probs
) t
Using pandas this can be done as follows:
temp = df.sort_values(by="p")
temp['r'] = np.arange(1, len(df)+1, 1)
temp['yr'] = temp['y']*temp['r']
print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )
I did not understand how we are able to calculate the AUC using this method. Can somebody please give the intuition behind it?
I am already familiar with the trapezoidal method, which involves summing the areas of small trapezoids under the ROC curve.
Short answer: it is the Wilcoxon-Mann-Whitney statistic; see
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
The page has a proof as well.
The denominator of your formula is identical to the formula in the wiki. The numerator is trickier. f in the wiki corresponds to p in your data, and t_0 and t_1 are indexes in the data frame. Note that we first sort by p, which makes our life easier.
Note that the double sum may be decomposed as
Sum_{t_1 such that y(t_1)=1} #{t_0 such that p(t_0) < p(t_1) and y(t_0)=0}
Here # stands for the total number of such indexes.
For each row index t_1 (such that y(t_1) = 1), how many t_0 are such that p(t_0) < p(t_1) and y(t_0) = 0? We know that there are exactly t_1 values of p that are less than or equal to p(t_1), because the values are sorted. We conclude that
#{t_0: p(t_0) < p(t_1) and y(t_0)=0} = t_1 - #{t_0: t_0 <= t_1 and y(t_0)=1}
Now imagine scrolling down the sorted dataframe. The first time we meet y=1, #{t_0: t_0 <= t_1 and y(t_0)=1} = 1; the second time we meet y=1, the same quantity is 2; the third time, it is 3, and so on. Therefore, when we sum the equality over all indexes t_1 with y(t_1)=1, we get
Sum_{t_1: y(t_1)=1} #{t_0: p(t_0) < p(t_1) and y(t_0)=0} = Sum_{t_1: y(t_1)=1} t_1 - (1 + 2 + 3 + ... + n),
where n is the total number of ones in the y column. Now we need one more simplification. Note that
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1)
If y(t_1) is not one, it is zero. Therefore,
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1) = Sum_{t} t y(t)
Plugging this into our formula and using
1 + 2 + 3 + ... + n = n(n+1)/2
finishes the proof of the formula you found.
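As a quick numerical check (not in the original answer), the rank-sum formula can be compared with sklearn's trapezoidal AUC on random, tie-free data:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)      # random binary labels
p = rng.random(1000)                   # random continuous scores, effectively no ties
r = p.argsort().argsort() + 1          # 1-based rank of each score (same as ORDER BY p)

auc_rank = (np.sum(y * r) - 0.5 * y.sum() * (y.sum() + 1)) / (y.sum() * (1 - y).sum())
print(auc_rank, roc_auc_score(y, p))   # the two values agree up to floating point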
P.S. I think that posting this question on the math or stats Stack Exchange would make more sense.

ML coursera submission (week 2) Feature Normalization

I have written the following code for the section "Feature Normalization".
Here X is the feature matrix (m x n) such that:
m = number of examples
n = number of features
Code
mu = mean(X);
sigma = std(X);
m = size(X,1);

% Subtracting the mean from each row
for i = 1:m
    X_norm(i,:) = X(i,:) - mu;
end;

% Dividing by the STD for each row
for i = 1:m
    X_norm(i,:) = X(i,:)./sigma;
end;
But on submitting it to the server built for Andrew Ng's class, it's not giving me any confirmation of whether it's wrong or correct.
==
== Part Name | Score | Feedback
== --------- | ----- | --------
== Warm-up Exercise | 10 / 10 | Nice work!
== Computing Cost (for One Variable) | 40 / 40 | Nice work!
== Gradient Descent (for One Variable) | 50 / 50 | Nice work!
== Feature Normalization | 0 / 0 |
== Computing Cost (for Multiple Variables) | 0 / 0 |
== Gradient Descent (for Multiple Variables) | 0 / 0 |
== Normal Equations | 0 / 0 |
== --------------------------------
== | 100 / 100 |
Is this a bug in the web frontend presentation layer or my code?
When submit() does not give you any points, it means your answer is not correct.
This usually means that either you have not implemented it yet, or there is a mistake in your implementation.
From what I can see, your indices are not correct. However, in order not to violate the code of conduct of this course, you should ask your question in the Coursera forum (without posting your code).
There are also tutorials with each programming exercise. Those are usually very helpful and guide you through the entire exercise.
You need to iterate over EACH feature.
m = size(X,1);
What you are actually getting with m is the number of ROWS (examples), but you want the number of COLUMNS (features).
Solution:
m = size(X,2);
Try this, it worked for me. Also notice that you're making a mistake by dividing each row of X without subtracting the mean.
Combine both and do it with less code, like this:
% Subtracting the mean and dividing by the STD for each row:
for i = 1:m
    X_norm(i,:) = (X(i,:) - mu) ./ sigma;
end;
At the end of the class, the final correct answer was given as featureNormalize.m:
function [X_norm, mu, sigma] = featureNormalize(X)
% Description: Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

% Instructions: First, for each feature dimension, compute the mean
%               of the feature and subtract it from the dataset,
%               storing the mean value in mu. Next, compute the
%               standard deviation of each feature and divide
%               each feature by its standard deviation, storing
%               the standard deviation in sigma.
%
%               Note that X is a matrix where each column is a
%               feature and each row is an example. You need
%               to perform the normalization separately for
%               each feature.

mu = mean(X);
sigma = std(X);
X_norm = (X - mu) ./ sigma;

end
If you're taking this class and feel the urge to copy and paste, you're in a grey area of academic honesty. You're supposed to figure it out from first principles, not google it and regurgitate the answer.
