how to use non-target attributes for timeseries prediction in weka? - time-series

i have used the weka timeseries plugin w/ algorithms like SMOReg (w/ RegSMOImproved and RegSMO) and HoltWinters. But for all of them i've observed that lag variables are created only for target attributes.
how does one have lag variables created for other (non-target) attributes such that the algo uses these too?
eg: i have 5 attributes ", a, b, c, d"
of which i have to predict for "a". ie. "a" is the "target" attribute
i've observed that lag variables are created only for "date" and "a" and that none of b, c or d are used by the algorithms
note that "overlay" does not really help me because i don't have "future" values for either of b, c or d
what i need is that lag variables be created for b, c and d and they be used for prediction by the chosen algorithm
==================== update ====================
i tried the following approach:
use the "filters->unsupervised->Copy" filter to make multiple (14) copies of the a,b,c,d variables
use "filters->unsupervised->TimesSeriesDelta" filter to shift the copies by consecutive values (eg, 1st copy by 1 day, 2nd copy by 2 days, ... 14th copy by 14 days)
use SMOReg from "classify" panel (w/ %-split of 70%) instead of from "forecast" panel (w/ .3 hold out training evaluation)
but faced the following barriers:
1. can classify (regress, actually, since the target is numeric) only 1 variable at a time
2. did not accept "date" attribute (even though the "date" values are numeric 20150601, 20150602, 20150603, and so on)
3. ran for a long time and then crashed :(
any guidance will be greatly appreciated
ps: the above example is contrived. in my real example, i have date + 8 attributes (all of them numeric), and 3 of them are target (multivariate forecasting)
==================== update ====================
https://github.com/log0ymxm/weka-timeseriesforecasting/blob/master/src/main/java/weka/classifiers/timeseries/core/TSLagMaker.java#L2974
shows that the extra attributes (non-target) are being removed because (line #3027 says):
// otherwise, this is some attribute that we are not predicting and
// wont be able to determine the value for when forecasting future
// instances. So we can't let the model use it.
==================== update ====================
https://github.com/log0ymxm/weka-timeseriesforecasting/blob/master/src/main/java/weka/classifiers/timeseries/WekaForecaster.java#L576
shows that fields-to-lag are same as fields-to-forecast

from the last 2 updates, i found that:
non-target attributes are removed
only target attributes are lagged
i thought this was algorithm specific (eg. HoltWinters, etc), but it's a feature/bug of timesseriesForecasting plugin itself
basically what i want isn't possible without code change :(

Related

How to create a language model with 2 different heads in huggingface?

I know I can create a language model with 1 head:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained("distilbert-base-cased").to(device)
But how can I create the same base model structure (e.g., distilbert-base-cased) with 2 heads? Say, one is AutoModelForMultipleChoice and the second is AutoModelForSequenceClassification. I need the only difference between the 2 models (1 head vs 2 heads) to be the additional head (from parameters perspective).
So now my input for the 2 heads model is something like [sequence_label, multiple_choice_labels]
In general case you will need to create a custom class derived from the DistilBertPreTrainedModel. Inside __init__() you will need to define your desired heads architectures. Then you will need to create your own forward() function and define inside it a custom loss involving both heads, and return result.
But if you are talking specifically about DistilBertForMultipleChoice and DistilBertForSequenceClassification, there is a shortcut, as the heads architecture happen to be identical (see source) and the difference is only in loss function. So you can try to train your model as multi label sequence classification problem, where the label per sequence will be [sequence_label, multiple_choice_label_0, multiple_choice_label_1, ...] . For example, in case you have an entry like {sequence, choice0, choice1, seq_label:True, correct_choice:0}
your dataset will be
[ {'text':(sequence, choice0), 'label':(1 1 0)},
{'text':(sequence, choice1), 'label':(1 0 0)} ]
This way the result of the sequence classification will be in the first position and to get the correct choice probability you will need to apply softmax function on the rest of the logits.

gretl - dummy interactions

There does not seem to be an "easy" way (such as in R or python) to create interaction terms between dummy variables in gretl ?
Do we really need to code those manually which will be difficult for many levels? Here is a minimal example of manual coding:
open credscore.gdt
SelfemplOwnRent=OwnRent*Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
Now my manual interaction term will not work for factors with many levels and in fact does not even do the job for binary variables.
Thanks,
ML
One way of doing this is to use lists. Use the dummify-command for generating dummies for each level and the ^-operator for creating the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The command discrete turns your variable into a discrete variable and allows to use dummify (this step is not necessary if your variable is already discrete). Now all interactions terms are stored in the list INT and you can easily assess them in the following ols-command.
#Markus Loecher on your second question:
You can always use the rename command to rename a series. So you would have to loop over all elements in list INT to do so. However, I would rather suggest to rename both input series, in the above example mrt and med respectively, before computing the interaction terms if you want shorter series names.

Question about SPSS modeler (There is an obstacle for make the stream run automatically)

I have SPSSmodeler stream which is now used and updated every week constantly to generate a certain dataset. A raw data for this stream is also renewed on a weekly basis.
In part of this stream, there is a chunk of nodes that were necessary to modify and update manually every week, and the sequence of this part is below: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them as bellow.
Because the original raw data is changed weekly basis, the range of Unit value above is always varied, sometimes more than 6 (maybe 100) others less than 6 (maybe 3). That is why somebody has to modify there and update those chunk of nodes on a weekly basis until now. *Unit value has a certain limitation (300 for now)
However, now we are aiming to run this stream automatically without touching any human operations on it that we need to customize there to work perfectly, automatically. Please help and will appreciate your efforts, thanks!
In order to automatize, I suggest to try to use global nodes combined with clem scripts inside the execution (default script). I have a stream that calculates the first date and the last date and those variables are used to rename files at the end of execution. I think you could use something similar as explained here:
1) Create derive nodes to bring the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (only to record the number of variables. The step 2 created a table with only 1 values, so the GLOBAL_MAX will only bring the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You now can use the information of variables just using the already creatde "count_variable"
It's not easy to explain just by typing, but I hope to have been helpful.
Please mark as +1 in this answer if it was relevant one.
I think there is a better, simpler and more effective (yet risky, due to node's requirements to input data) solution to your problem. It is called Transpose node and does exactly that - pivot your table. But just from version 18.1 on. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/

How do you include categories with 0 responses in SPSS frequency output?

Is there a way to display response options that have 0 responses in SPSS frequency output? The default is for SPSS to omit in the frequency table output any response option that is not selected by at least a single respondent. I looked for a syntax-driven option to no avail. Thank you in advance for any assistance!
It doesn't show because there is no one single case in the data is with that attribute. So, by forcing a row of zero you'll need to realize we're asking SPSS to do something incorrect.
Having said that, you can introduce a fake case with the missing category. E.g. if you have Orange, Apple, and Pear, but no one answered they like Pear, the add one fake case that says Pear.
Now, make a new weight variable that consists of only 1. But for the Pear case, make it very very small like 0.00001. Then, go to Data > Weight Cases > Weight cases by and put that new weight variable over. Click OK to apply. Now what happens is that SPSS will treat the "1" with a weight of 1 and the fake case with a weight that is 1/10000 of a normal case. If you rerun the frequency you should see the one with zero count shows up.
If you have purchased the Custom Table module you can also do that directly as well, as far as I can tell from their technical document. That module costs 637 to 3630 depending on license type, so probably only worth a try if your institute has it.
So, I'm a noob with SPSS, I (shame on me) have a cracked version of SPSS 22 and if I understood your question correctly, this is my solution:
double click the Frequency table in Output
right click table, select Table Properties
go to General and then uncheck the Hide empty rows and columns option
Hope this helps someone!
If your SPSS version has no Custom Tables installed and you haven't collected money for that module yet then use the following (run this syntax):
*Note: please use variable names up to 8 characters long.
set mxloops 1000. /*in case your list of values is longer than 40
matrix.
get vars /vari= V1 V2 /names= names /miss= omit. /*V1 V2 here is your categorical variable(s)
comp vals= {1,2,3,4,5,99}. /*let this be the list of possible values shared by the variables
comp freq= make(ncol(vals),ncol(vars),0).
loop i= 1 to ncol(vals).
comp freq(i,:)= csum(vars=vals(i)).
end loop.
comp names= {'vals',names}.
print {t(vals),freq} /cnames= names /title 'Frequency'. /*here you are - the frequencies
print {t(vals),freq/nrow(vars)*100} /cnames= names /format f8.2 /title 'Percent'. /*and percents
end matrix.
*If variables have missing values, they are deleted listwise. To include missings, use
get vars /vari= V1 V2 /names= names /miss= -999. /*or other value
*To exclude missings individually from each variable, analyze by separate variables.

How to display the results of multiple comparisons

If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns, or two panes, such as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying differences between 2 files is relatively easy, as I mentioned WinMerge, whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document that have been created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 comparisons (n * (n-1) / 2)
I have a different view than some of those who provided Answers--i.e., that you need to further specify the problem. The abstraction level is about right. Further specification would make the problem easier, but the solution less useful.
A couple of years ago, i saw a graphic on ProgrammableWeb--it compared the results from a search on Yahoo with the results from the same search on Google. There's a lot of information to covey: some results are in both sets, some in just one, and the common results will have different positions in the respective engine's results, which somehow has to be shown.
I like the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points as well as python code i used to generate it:
from matplotlib import pyplot as PLT
xvals = NP.array([(2,3), (5,7), (8,6), (1.5,1.8), (3.0,3.8), (5.3,5.2),
(3.7,4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile( NP.array([5,3]), [10,1] )
fig = PLT.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, "-", lw=3, color='b')
ax1.plot(x, y2, "-", lw=3, color='b')
for a, b in zip(xvals, yvals) : ax1.plot(a,b,'-o',ms=8,mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order in the Results List). For other types of data, you might have to assign a score to each data point--a similarity metric might i suppose (in a sense, that's actually what the search result order is, an distance from the top of the list)
Since there has been so much work into displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then using whatever you want to show a diff between those text formats.
But you should tell us more about your data sets!
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter, you should specify what type your data is and what you wish to bring out in the comparison.
Depending on the nature of the data/comparison you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e. fine grain or gross comparison?
Examples:
Visualizing a comparison of unordered data could just be plotting the two histograms of your sets (i.e. distributions):
image source
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out visual complexity, it's a great resource for interesting visualization.

Resources