I am following the Pyro introductory tutorial on forecasting and, when trying to access the learned parameters after training the model, I get different results using different access methods for some of them (while getting identical results for others).
Here is the stripped-down reproducible code from the tutorial:
import torch
import pyro
import pyro.distributions as dist
from pyro.contrib.examples.bart import load_bart_od
from pyro.contrib.forecast import ForecastingModel, Forecaster
pyro.enable_validation(True)
pyro.clear_param_store()
pyro.__version__
# '1.3.1'
torch.__version__
# '1.5.0+cu101'
# import & prepare the data
dataset = load_bart_od()
T, O, D = dataset["counts"].shape
data = dataset["counts"][:T // (24 * 7) * 24 * 7].reshape(T // (24 * 7), -1).sum(-1).log()
data = data.unsqueeze(-1)
T0 = 0 # beginning
T2 = data.size(-2) # end
T1 = T2 - 52 # train/test split
# define the model class
class Model1(ForecastingModel):
    def model(self, zero_data, covariates):
        data_dim = zero_data.size(-1)
        feature_dim = covariates.size(-1)
        bias = pyro.sample("bias", dist.Normal(0, 10).expand([data_dim]).to_event(1))
        weight = pyro.sample("weight", dist.Normal(0, 0.1).expand([feature_dim]).to_event(1))
        prediction = bias + (weight * covariates).sum(-1, keepdim=True)
        assert prediction.shape[-2:] == zero_data.shape
        noise_scale = pyro.sample("noise_scale", dist.LogNormal(-5, 5).expand([1]).to_event(1))
        noise_dist = dist.Normal(0, noise_scale)
        self.predict(noise_dist, prediction)
# fit the model
pyro.set_rng_seed(1)
pyro.clear_param_store()
time = torch.arange(float(T2)) / 365
covariates = torch.stack([time], dim=-1)
forecaster = Forecaster(Model1(), data[:T1], covariates[:T1], learning_rate=0.1)
So far so good; now I want to inspect the learned latent parameters stored in the ParamStore. There seems to be more than one way to do this. Using the get_all_param_names() method:
for name in pyro.get_param_store().get_all_param_names():
    print(name, pyro.param(name).data.numpy())
I get
AutoNormal.locs.bias [14.585433]
AutoNormal.scales.bias [0.00631594]
AutoNormal.locs.weight [0.11947815]
AutoNormal.scales.weight [0.00922901]
AutoNormal.locs.noise_scale [-2.0719821]
AutoNormal.scales.noise_scale [0.03469057]
But using the named_parameters() method:
pyro.get_param_store().named_parameters()
gives the same values for the location (locs) parameters, but different values for all the scales parameters:
dict_items([
('AutoNormal.locs.bias', Parameter containing: tensor([14.5854], requires_grad=True)),
('AutoNormal.scales.bias', Parameter containing: tensor([-5.0647], requires_grad=True)),
('AutoNormal.locs.weight', Parameter containing: tensor([0.1195], requires_grad=True)),
('AutoNormal.scales.weight', Parameter containing: tensor([-4.6854], requires_grad=True)),
('AutoNormal.locs.noise_scale', Parameter containing: tensor([-2.0720], requires_grad=True)),
('AutoNormal.scales.noise_scale', Parameter containing: tensor([-3.3613], requires_grad=True))
])
How is this possible? According to the documentation, the ParamStore is a simple key-value store, and there are only these six keys in it:
pyro.get_param_store().get_all_param_names() # .keys() method gives identical result
# result
dict_keys([
'AutoNormal.locs.bias',
'AutoNormal.scales.bias',
'AutoNormal.locs.weight',
'AutoNormal.scales.weight',
'AutoNormal.locs.noise_scale',
'AutoNormal.scales.noise_scale'])
so there is no way that one method accesses one set of items and the other a different one.
Am I missing something here?
pyro.param() returns the transformed parameters, in this case constrained to the positive reals for the scales.
Here is the situation, as revealed in the GitHub thread I opened in parallel with this question...
The ParamStore is no longer just a simple key-value store; it also performs constraint transformations. Quoting a Pyro developer from the above link:
here's some historical background. The ParamStore was originally just a key-value store. Then we added support for constrained parameters; this introduced a new layer of separation between user-facing constrained values and internal unconstrained values. We created a new dict-like user-facing interface that exposed only constrained values, but to keep backwards compatibility with old code we kept the old interface around. The two interfaces are distinguished in the source files [...] but as you observe it looks like we forgot to mark the old interface as DEPRECATED.
I guess in clarifying docs we should:
clarify that the ParamStore is no longer a simple key-value store
but also performs constraint transforms;
mark all "old" style interface methods as DEPRECATED;
remove "old" style interface usage from examples and tutorials.
As a consequence, it turns out that, while pyro.param() returns the results in the constrained (user-facing) space, the older method named_parameters() returns the unconstrained (i.e. for internal use only) values, hence the apparent discrepancy.
Indeed, it's not difficult to verify that the scales values returned by the two methods above are related by a logarithmic transformation:
import numpy as np

items = list(pyro.get_param_store().named_parameters())  # unconstrained space
i = 0
for name in pyro.get_param_store().keys():
    if 'scales' in name:
        temp = np.log(
            pyro.param(name).item()  # constrained space
        )
        print(temp, items[i][1][0].item(), np.allclose(temp, items[i][1][0].item()))
    i += 1
# result:
-5.027793402915326 -5.0277934074401855 True
-4.600319371162187 -4.6003193855285645 True
-3.3920585732532835 -3.3920586109161377 True
Why does this discrepancy affect only the scales parameters? That's because the scales (i.e. standard deviations) are by definition constrained to be positive; that doesn't hold for the locs (i.e. means), which are unconstrained, hence the two representations coincide for them.
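To make the mechanism explicit, here is a minimal sketch (assuming the param store fitted above is still in memory; transform_to and constraints come from torch.distributions and are not part of the tutorial) showing that applying the transform registered for the positive constraint to the unconstrained tensor recovers the value reported by pyro.param():
import torch
from torch.distributions import constraints, transform_to

# old (internal) interface: unconstrained tensor, here living in log space
unconstrained = dict(pyro.get_param_store().named_parameters())["AutoNormal.scales.bias"]
# new (user-facing) interface: constrained value on the positive reals
constrained = pyro.param("AutoNormal.scales.bias")

# the transform registered for the positive constraint is effectively exp()
to_positive = transform_to(constraints.positive)
print(torch.allclose(to_positive(unconstrained), constrained))  # True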
As a result of the question above, a new bullet has now been added to the ParamStore documentation, giving a relevant hint:
in general parameters are associated with both constrained and unconstrained values. for example, under the hood a parameter that is constrained to be positive is represented as an unconstrained tensor in log space.
as well as in the documentation of the named_parameters() method of the old interface:
Note that, in the event the parameter is constrained, unconstrained_value is in the unconstrained space implicitly used by the constraint.
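For completeness, here is a minimal sketch (assuming the forecaster fitted above) contrasting the two interfaces side by side; the dict-like items() method belongs to the newer, user-facing interface and yields constrained values, in agreement with pyro.param():
store = pyro.get_param_store()

# newer, dict-like interface: constrained (user-facing) values
for name, value in store.items():
    print(name, value.detach().numpy())

# older interface: unconstrained tensors, meant for internal use
for name, value in store.named_parameters():
    print(name, value.detach().numpy())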
Within my current research I'm trying to find out how big the impact of ad-hoc sentiment on daily stock returns is.
The calculations have worked quite well so far and the results are also plausible.
The calculations to date, using the quantmod package and Yahoo financial data, look like this:
getSymbols(c("^CDAXX", Symbols), env = myenviron, src = "yahoo",
           from = as.Date("2007-01-02"), to = as.Date("2016-12-30"))
Returns <- eapply(myenviron, function(s) ROC(Ad(s), type="discrete"))
ReturnsDF <- as.data.table(do.call(merge.xts, Returns))
# adjust column names
colnames(ReturnsDF) <- gsub(".Adjusted","",colnames(ReturnsDF))
ReturnsDF <- as.data.table(ReturnsDF)
However, to make it more robust against the noisy influence of penny-stock data, I wonder how it is possible to exclude stocks that at any point in the period drop below a certain value x, let's say 1€.
I guess the best thing would be to exclude them before calculating the returns and merging the xts objects, or even better, before downloading them with the getSymbols command.
Does anybody have an idea how this could work best? Thanks in advance.
Try this:
build a price frame of the adjusted closing prices of your symbols
(I use the PF function of the quantmod add-on package qmao, which has lots of other useful functions for this type of analysis: install.packages("qmao", repos="http://R-Forge.R-project.org"))
check by column whether any price is below your minimum trigger price
select only the columns which have no closings below the trigger price
To stay more flexible I would suggest taking a sub-period, let's say no price below 5 during the last 21 trading days. The toy example below may illustrate my point.
I use AAPL, FB and MSFT as the symbol universe.
> symbols <- c('AAPL','MSFT','FB')
> getSymbols(symbols, from='2018-02-01')
[1] "AAPL" "MSFT" "FB"
> prices <- PF(symbols, silent = TRUE)
> prices
AAPL MSFT FB
2018-02-01 167.0987 93.81929 193.09
2018-02-02 159.8483 91.35088 190.28
2018-02-05 155.8546 87.58855 181.26
2018-02-06 162.3680 90.90299 185.31
2018-02-07 158.8922 89.19102 180.18
2018-02-08 154.5200 84.61253 171.58
2018-02-09 156.4100 87.76771 176.11
2018-02-12 162.7100 88.71327 176.41
2018-02-13 164.3400 89.41000 173.15
2018-02-14 167.3700 90.81000 179.52
2018-02-15 172.9900 92.66000 179.96
2018-02-16 172.4300 92.00000 177.36
2018-02-20 171.8500 92.72000 176.01
2018-02-21 171.0700 91.49000 177.91
2018-02-22 172.5000 91.73000 178.99
2018-02-23 175.5000 94.06000 183.29
2018-02-26 178.9700 95.42000 184.93
2018-02-27 178.3900 94.20000 181.46
2018-02-28 178.1200 93.77000 178.32
2018-03-01 175.0000 92.85000 175.94
2018-03-02 176.2100 93.05000 176.62
Let’s assume you would like any instrument which traded below 175.40 during the last 6 trading days to be excluded from your analysis :-) .
As you can see, that should exclude AAPL and MSFT.
apply and the base function any, applied(!) to a 6-day subset of prices, give us exactly what we want. Here are the last 3 days of prices for the instruments that never closed below the trigger:
> tail(prices[,apply(tail(prices),2, function(x) any(x < 175.4)) == FALSE],3)
FB
2018-02-28 178.32
2018-03-01 175.94
2018-03-02 176.62
Assume I have an array of Strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I'd like to extract from this description features like:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I prepare the pre-defined known features first? Like
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
I am not sure that I need to use CountVectorizer and TfidfVectorizer here; it seems more appropriate to use DictVectorizer, but how can I make the dicts, with keys and values extracted from the entire string?
Is it possible with scikit-learn's feature extraction, or should I write my own .fit() and .transform() methods?
UPDATE:
@sergzach, please review whether I understood you correctly:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # and other descriptions

for d in data:
    for brand in brands:
        if brand in d:
            pass  # ok, brand is found
    for model in models:
        if model in d:
            pass  # ok, model is found
So, creating nested loops for each feature? This might work, but I'm not sure it is right and flexible.
Yes, something like the following. Excuse me, you will probably have to correct the code below.
import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # and other descriptions

features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}

cat_data = []  # your categories, which you should convert into numbers
not_found_columns = []

for line in data:
    line_cats = {}
    for col, patterns in features.items():
        found = False
        for i, pattern in enumerate(patterns):
            if re.findall(pattern, line.lower(), flags=re.UNICODE) != []:
                line_cats[col] = i + 1  # numeric category found for this column; e.g. for dell it's 2, for acer it's 5
                found = True
                break  # the category is determined by the first occurrence
        # the loop ended without a match: store a default "not found" value
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)

# now cat_data holds, for each line, a numeric category (index + 1) per column
# if a feature was determined, otherwise 0
Now you have the column names and lines (not_found_columns) for which no feature was found. Review them; you probably forgot some features.
We can also write strings (instead of numbers) as categories and then use DictVectorizer; in the end the two approaches are equivalent.
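For example, a minimal sketch of that DictVectorizer variant (the category dicts below are illustrative, not extracted from your actual data):
from sklearn.feature_extraction import DictVectorizer

# suppose the loop above stored the matched strings instead of numeric indices
cat_data = [
    {'brand': 'apple', 'cpu': 'core i7'},
    {'brand': 'dell', 'cpu': 'core i5'},
]
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(cat_data)  # one-hot encodes the string categories
print(dv.get_feature_names_out())  # use get_feature_names() on older scikit-learn versions
print(X)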
Scikit-learn's vectorizers will convert an array of strings into a document-term matrix (a 2d array with a column for each term/word found). Each row (1st dimension) in the original array maps to a row in the output matrix. Each cell holds a count or a weight, depending on which kind of vectorizer you use and its parameters.
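To illustrate with a minimal sketch (the second description string below is made up):
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS',
        'Laptop Dell Latitude E7240, Core i5, 4Gb']  # second string is hypothetical

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse document-term matrix, one row per description
print(vec.get_feature_names_out())  # one column per term; get_feature_names() on older versions
print(X.toarray())  # cells hold counts (TfidfVectorizer would hold tf-idf weights instead)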
I am not sure this is what you need, based on your code. Could you tell where you intend to use these features you are looking for? Do you intend to train a classifier? For what purpose?
Suppose I have the following sample ARFF file with two attributes:
(1) sentiment: positive [1] or negative [-1]
(2) tweet: text
@relation sentiment_analysis
@attribute sentiment {1, -1}
@attribute tweet string
@data
-1,'is upset that he can\'t update his Facebook by texting it... and might cry as a result School today also. Blah!'
-1,'#Kenichan I dived many times for the ball. Managed to save 50\% The rest go out of bounds'
-1,'my whole body feels itchy and like its on fire '
-1,'#nationwideclass no, it\'s not behaving at all. i\'m mad. why am i here? because I can\'t see you all over there. '
-1,'#Kwesidei not the whole crew '
-1,'Need a hug '
1,'#Cliff_Forster Yeah, that does work better than just waiting for it In the end I just wonder if I have time to keep up a good blog.'
1,'Just woke up. Having no school is the best feeling ever '
1,'TheWDB.com - Very cool to hear old Walt interviews! ? http://blip.fm/~8bmta'
1,'Are you ready for your MoJo Makeover? Ask me for details '
1,'Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur '
1,'happy #charitytuesday #theNSPCC #SparksCharity #SpeakingUpH4H '
I want to convert the values of second attribute into equivalent TF-IDF values.
By the way, I tried the following code, but its output ARFF file doesn't contain the first (class) attribute for the instances with positive (1) values.
// Set the tokenizer
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters("\\W");
// Set the filter
StringToWordVector filter = new StringToWordVector();
filter.setAttributeIndicesArray(new int[]{1});
filter.setOutputWordCounts(true);
filter.setTokenizer(tokenizer);
filter.setInputFormat(inputInstances);
filter.setWordsToKeep(1000000);
filter.setDoNotOperateOnPerClassBasis(true);
filter.setLowerCaseTokens(true);
filter.setTFTransform(true);
filter.setIDFTransform(true);
// Filter the input instances into the output ones
outputInstances = Filter.useFilter(inputInstances, filter);
Sample output ARFF file:
@data
{0 -1,320 1,367 1,374 1,397 1,482 1,537 1,553 1,681 1,831 1,1002 1,1033 1,1112 1,1119 1,1291 1,1582 1,1618 1,1787 1,1810 1,1816 1,1855 1,1939 1,1941 1}
{0 -1,72 1,194 1,436 1,502 1,740 1,891 1,935 1,1075 1,1256 1,1260 1,1388 1,1415 1,1579 1,1611 1,1818 2,1849 1,1853 1}
{0 -1,374 1,491 1,854 1,873 1,1120 1,1121 1,1197 1,1337 1,1399 1,2019 1}
{0 -1,240 1,359 2,369 1,407 1,447 1,454 1,553 1,1019 1,1075 3,1119 1,1240 1,1244 1,1373 1,1379 1,1417 1,1599 1,1628 1,1787 1,1824 1,2021 1,2075 1}
{0 -1,198 1,677 1,1379 1,1818 1,2019 1}
{0 -1,320 1,1070 1,1353 1}
{0 -1,210 1,320 2,477 2,867 1,1020 1,1067 1,1075 1,1212 1,1213 1,1240 1,1373 1,1404 1,1542 1,1599 1,1628 1,1815 1,1847 1,2067 1,2075 1}
{179 1,1815 1}
{298 1,504 1,662 1,713 1,752 1,1163 1,1275 1,1488 1,1787 1,2011 1,2075 1}
{144 1,785 1,1274 1}
{19 1,256 1,390 1,808 1,1314 1,1350 1,1442 1,1464 1,1532 1,1786 1,1823 1,1864 1,1908 1,1924 1}
{84 1,186 1,320 1,459 1,564 1,636 1,673 1,810 1,811 1,966 1,997 1,1094 1,1163 1,1207 1,1592 1,1593 1,1714 1,1836 1,1853 1,1964 1,1984 1,1997 2,2058 1}
{9 1,1173 1,1768 1,1818 1}
{86 1,935 1,1112 1,1337 1,1348 1,1482 1,1549 1,1783 1,1853 1}
As you can see, the first few instances are okay (they contain the -1 class along with the other features), but the remaining instances don't contain the positive class attribute (1).
I mean, there should be {0 1,...} as the very first entry in those last instances of the output ARFF file, but it is missing.
You have to specify explicitly in the Java program which attribute is your class attribute, because when you apply the StringToWordVector filter your input gets divided among the specified n-grams, and hence the class attribute's position changes once StringToWordVector vectorizes the input. You can then use the Reorder filter, which will place the class attribute at the last position, and Weka will pick the last attribute as the class attribute.
More info about the Reorder filter in Weka can be found at http://weka.sourceforge.net/doc.stable-3-8/weka/filters/unsupervised/attribute/Reorder.html. Also, example 5 at http://www.programcreek.com/java-api-examples/index.php?api=weka.filters.unsupervised.attribute.Reorder may help you with the reordering.
Hope it helps.
Your process for obtaining TF-IDF seems correct.
The output is written in sparse ARFF format, where values equal to the first declared nominal value (index 0) are simply omitted.
In your case the class attribute is declared as {1, -1}, so 1 is the first value: records with class -1 show it explicitly, while for records with class 1 the class entry is implied rather than missing.