Is there a way to obtain an (over-/under-)approximation of the set of models of an expression (containing bitvectors) without enumerating all models? The result should be an expression itself, i.e. the characteristic function of the approximated set. Thanks!
Related
I want to implement a boosted decision tree for my analysis. But the entries in my array contain are of varying length, so the array is not convertible directly into numpy or pandas.
Is there any way to use existing ML libraries with awkward array?
Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call np.to_numpy (or equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).
The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).
I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)
You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different length inputs... Or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum of that set (because the tree is finite!). Apply a tree that wants to split on index k on an input with n < k elements would have to have a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's maximum element.
In Awkward Array, the function for that is ak.pad_none. If you know the maximum length list in your sample (ak.num and ak.max), you can pad the whole array such that all lists have the same length with missing values at the end. If you set clip=True, then the resulting array type is "regular," it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to np.to_numpy (and not np.asarray), then it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
The only problem with this plan is that padding every list to have the same length as the maximum length list uses more memory. If the BDT algorithm were aware of jaggedness (the way that TensorFlow and soon PyTorch is/will be aware of jaggedness), then it should be able to make these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!
I was wondering if anyone knows how SPSS builds the interaction terms/calculates the significance for predictors behind the scenes in a GLM? From my understanding it dummy codes variables and treats the one that comes alphabetically last as the reference group.
The reason I'm asking is I have a GLM model which has 3 continuous predictors and two categorical predictors (dummy coded). When I build all the 2-way and 3-way interactions with syntax ie:
Age_Centred Age_CentredDx Age_Centredgender Age_CentredDxgender BMI_Centred BMI_CentredDx BMI_Centredgender BMI_CentredDxgender BPS_Centred BPS_CentredDx BPS_Centredgender BPS_CentredDxgender Dx Dxgender DxICV_Centred DxICV_Centredgender gender ICV_Centred ICV_Centred*gender.
vs manually creating all the variables by hand ie:
Age_Centred Age_Centred_Dx Age_Centred_gender Age_Centred_gender_Dx BMI_Centred BMI_Centred_Dx BMI_Centred_gender BMI_Centred_gender_Dx BPS_Centred BPS_Centred_Dx BPS_Centred_gender BPS_Centred_gender_Dx Dx gender_Dx ICV_Dx ICV_Centred_Dx_gender gender ICV_Centred ICV_gender.
I end up with a model which has the same intercept, overall significance, and R squared however the individual significance of the predictors changes. Refer to output below. To troubleshoot I've tried to flip the references groups when manually creating the variables but it still does not replicate the results. I've had another statistician try the same thing and ended up reaching the same point as what I did. Does it have to do with some of the parameters being redundant?
Building the terms via syntax:
Physically creating the variables by multiplying them together
All the details one might reasonably want about how GLM (and UNIANOVA, which is the same underlying code) parameterizes models, estimates parameters, and conducts hypothesis tests are available in the IBM SPSS Statistics Algorithms manual, available for download as a pdf at ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/26.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf. (Note that this is a large file, about 78 MB; clicking on the link starts a download.) In addition to the information in the GLM chapter, appendices F (Indicator Method) and H (Sums of Squares) are relevant, respectively, for building the design matrix and specifying linear combinations of model parameters for computing sums of squares for testing hypotheses.
In building the design matrix, categorical predictors (factors) are indeed represented by sets of indicator (0-1) variables. For a factor with k levels, k indicator variables are created, one for each observed level of the factor. The procedure does not explicitly treat the last category (sorted in ascending order, alphabetical for strings) as a reference category, though in simpler models the effect of what's done is essentially the same. If there is an intercept in the model, then the kth indicator will be redundant (linearly dependent) on the intercept and the preceding k-1 indicators. The estimation algorithm used in GLM/UNIANOVA will set the row and column in the cross-product matrix representing the redundant column in the design matrix to 0s, alias the corresponding parameter estimate to 0, and the results are similar to a reparameterization approach treating the last category as a reference category, except that you have to remember that it's there if you want to specify a linear combination of the parameters to estimate.
If you suppress the intercept, then for the first factor entered into the model the kth indicator would not be redundant (unless the factor is preceded by an unusual covariate or set of covariates). Any subsequent factors included in the model would involve redundant parameters, as would any interactions among factors, whether or not an intercept is included. Interactions among factors are created by multiplying the 0s and 1s for each level of the factors by those for each level of the other factor. So for an interaction of two two-level factors, there are four columns generated, of which typically the last three are redundant.
Covariates are entered simply by copying the values of the variables into the design matrix. Interactions involving covariates and other covariates multiply values for the columns involved within each row, and interactions involving covariates and factors multiply covariates (or products of them) by the indicator variables for the factor(s). Usually covariate-by-covariate terms do not involve redundancies, but factor-by-covariate terms do.
To get to the specifics of what's going on with your data, I can't replicate your exact results without your data, but I am able to replicate the patterns shown if I assume you've used the binary Dx variable as a covariate and the binary gender variable as a factor in each analysis. (There seem to actually be four continuous predictors in your model rather than three, but that doesn't affect anything of importance for understanding what's going on.)
There are two aspects of the situation to be considered. One is the parameterization and how the two ways of entering the variables into the model treat the variables and whether or not they produce the same estimates of parameters. The second is how the model specification results in the Type III tests shown in the ANOVA tables.
If I'm understanding things correctly based on what you've posted here, you should find if you compare parameter estimates for the two analyses that the parameter estimates for the intercepts and the non-redundant estimates for gender ([gender=0]) are the same, and have the same standard errors. For the terms involving just covariates or products of covariates, I expect that you will find the parameter estimates to differ between the two analyses and produce different t statistics. For interactions involving gender and covariates (which is all the other variables or products created outside the procedure), I expect the estimates will be the same in magnitude and opposite in sign, with the same standard errors.
None of the estimates or tests here are wrong. The models fitted involve interaction effects. An interaction means that effect of one variable varies by the levels of the other variable(s) in the interaction, and in order to estimate the same simple effects you have to parameterize the model in the same way, at least as far as the non-redundant parameters are concerned. However, to get the Type III tests for all terms to be identical, it's not always enough to have the same parameter estimates and standard errors. Type III tests involve a concept called containment that must also be considered.
For two effects in a model, effect A is contained in effect B if:
A and B contain the same covariate terms, if any.
B contains all factor effects in A, and at least one more (with the intercept being contained in all factor-only effects).
In your original model, the intercept is included in the gender effect, gender is not included in any effects, and all the covariate main effects and two-way interactions among covariates are contained within the interactions between those terms and gender, while the three-way interactions (which include gender) are not contained within any other effects.
Type III sums of squares (not invented by SPSS, but by our friends at SAS) are based on linear combinations of parameters where a given effect is adjusted for any effects that do not contain it, and made orthogonal to any effects that contain it. The practical application of these rules is complicated (see Appendix H of the algorithms).
If you recode the gender variable to swap the 0 and 1 values, specify it as a covariate along with all the other variables, and fit the same model, you should be able to match all the non-redundant parameter estimates from the original model, along with their standard errors and t statistics. However, because the containment relationships in the original model are no longer there, the Type III tests for the terms not involving gender (which were previously contained in terms involving gender) will not match up.
The bottom line is that all results are translatable and all correct for what's being done, and that in order to make much sense out of individual terms you have to carefully focus on what's being estimated in a given parameterization, as well as the containment relationships. The difficult part gets simpler when you take seriously the fact that when variable X is involved in interaction terms, there is no single estimate of the effect of X. Any estimates are conditional one where you fix the value(s) of the terms with which X interacts.
I am trying to build a logistic regression model and a lot of my features have ordered categorical variables. I think dummy variable may not be useful as it treats each category with equal weightage. So, do i need to treat to ordered categorial variable like numerical ?
Thanks in advance .
Ordered categorical values are termed as "Ordinal" attribute in data mining where one value is less than or greater than another value. You can treat these values as nominal values or continuous values (numbers).
Some of the pros and cons of treating them as numbers (continuous) are:
Pros:
This gives you a lot of flexibility in your choice of analysis and
preserves the information in the ordering. More importantly to many
analysts, it allows you to analyze the data easily.
Cons:
This approach requires the assumption that the numerical distance
between each set of subsequent categories is equal. Otherwise
depending on the domain you can make the interval large.
I am a new to Reinforcement learning, but I am trying to use RL in this task:
Given a function definition in written e.g. in C with 1 to 10s of input arguments (only numerical ones - integer, float, etc.) and the body of the function (represented as a Abstract Syntax Tree/ Abstract Decision Tree with data dependencies - how the internal variable values change) I would like to approximate the values of these input parameters so for e.g. a certain decision block is executed. For this I thought of a recurrent network with LSTM cells.
Now, to achieve this, I would traverse one path in the tree leading to the block and take note of any data changes and decision blocks in the path. These steps would influence my parameter input predictions - what values to insert into/change in the input parameters if I wish to have a certain decision block executed.
Action: Changing the value of one chosen input parameter of the function OR Changing the value of all input parameters individually (with mathematical different operation). After action execution, moving onto the next node in the tree.
Reward: How close I am to executing the given decision block (thus satisfying the condition) with given input parameter values.
Goal: Have a condition in the code satisfied and a decision block thus executed (e.g. an if condition is met).
State: Current position in the AST/ADT with data dependencies.
Assuming that I already have a way to evaluate, how far I am from executing the wanted decision block given current parameter input values, I came across two problems:
How would I deal with varying number of function input parameters in RL? If I want to change their values to be closer to the execution of the wanted decision block, the number of given actions changes with the number of parameters defined for the given function.
If I already did chose one parameter, what is the best way to do number approximation using RL? In the function body there could be numerous very complex mathematical operations happening, so should there be defined action as logarithm, exponentiation, division, multiplying, etc. or is there a better way with maybe just adding/subtracting from the current value?
If you find any mistakes in my definition of the Actions, Reward, Goal or State, please do correct me, as I am still a big learner in this field.
Thank you for your answers.
If you mean varied length of input and outputs in LSTM.You can use the encoder,decoder LSTM
What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification.
If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature?
Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. During classification time, if I am trying to classify a new instance, what value do I put for features that have missing values? How would I descend the tree past a decision node for which I have an unknown value?
Would using Naive Bayes be better for handling missing values? I would just assign a very small non-zero probability for them, right?
From Pedro Domingos' ML course in University of Washington:
Here are three approaches what Pedro suggests for missing value of A:
Assign most common value of A among other examples sorted to node n
Assign most common value of A among other examples with same target value
Assign probability p_i to each possible value v_i of A; Assign fraction p_i of example to each descendant in tree.
The slides and video is now viewable at here.
An alternative approach is to leave the missing value as the '?', and not use it for the information gain calculation. No node should have an unknown value during classification because you ignored it during the information gain step. For classifying, I believe you simply consider the missing value unknown and do not delete it during classification on that specific attribute.