I have a 5 dimensional matrix in an hdf5 data file. I would like to plot this data using paraview. The solution I have in mind is describing the data via the Xdmf Format.
The 5 dimensional matrix is structured as follows:
matrix[time][type][x][y][z]
The 'time' index specifies a time step. The 'type' selects the matrices for different particle types. And x,y,z describes the spatial coordinates of a grid. The value of the matrix is a Scalar that I would like to plot.
My question is: How can I select a specific 3 dimensional matrix for a given time step and type to plot, using the xdmf format? Ideally the timestep can be represented by the <time> functionality of Xdmf.
I tried the 'HyperSlab' functionality of Xdmf, but it does not seem to reduce the dimensionality, which I need in order to plot the grid.
I also had a look at the 'SubSet' functionality, but I could not work out how to use it from the official Xdmf documentation.
With help of the mailing list of Xdmf I found a solution that works for me.
My input matrix is 5-dim (1,2,12,6,6) in the hdf5 file "ana.h5" and I select timestep 0 and type 1.
<?xml version="1.0" ?>
<!DOCTYPE Xdmf SYSTEM "Xdmf.dtd" []>
<Xdmf xmlns:xi="http://www.w3.org/2003/XInclude" Version="2.2">
  <Domain>
    <Topology name="topo" TopologyType="3DCoRectMesh" Dimensions="12 6 6"></Topology>
    <Geometry name="geo" Type="ORIGIN_DXDYDZ">
      <!-- Origin -->
      <DataItem Format="XML" Dimensions="3">
        0.0 0.0 0.0
      </DataItem>
      <!-- DxDyDz -->
      <DataItem Format="XML" Dimensions="3">
        1 1 1
      </DataItem>
    </Geometry>
    <Grid Name="TimeStep_0" GridType="Uniform">
      <Topology Reference="/Xdmf/Domain/Topology[1]"/>
      <Geometry Reference="/Xdmf/Domain/Geometry[1]"/>
      <Time Value="64"/>
      <Attribute Type="Scalar" Center="Cell" Name="Type1">
        <!-- Result will be 3 dimensions -->
        <DataItem ItemType="HyperSlab" Dimensions="12 6 6 ">
          <!-- The source is 5 dimensions -->
          <!-- Origin=0,1,0,0,0 Stride=1,1,1,1,1 Count=1,1,12,6,6 -->
          <DataItem Dimensions="3 5" Format="XML">
            0 1 0 0 0
            1 1 1 1 1
            1 1 12 6 6
          </DataItem>
          <DataItem Format="HDF" NumberType="UInt" Precision="2" Dimensions="1 2 12 6 6 ">
            ana.h5:/density_field
          </DataItem>
        </DataItem>
      </Attribute>
    </Grid>
  </Domain>
</Xdmf>
The resulting matrix is 3-dimensional (12,6,6) and can be plotted with ParaView.
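To avoid writing the hyperslab block by hand for every time step and type, the Attribute element could be generated with a small script. This is only a sketch: the helper name `hyperslab_attribute` is made up for illustration, and the NumberType and Precision values are simply copied from the file above, so they would need adjusting for a different dataset.

```python
import xml.etree.ElementTree as ET


def hyperslab_attribute(h5_path, shape, time_idx, type_idx, name):
    """Build an <Attribute> that selects matrix[time_idx][type_idx] as a 3-D block.

    Hypothetical helper: shape is the full 5-D shape (time, type, x, y, z).
    """
    nt, ntype, nx, ny, nz = shape
    attr = ET.Element("Attribute", Type="Scalar", Center="Cell", Name=name)
    slab = ET.SubElement(attr, "DataItem", ItemType="HyperSlab",
                         Dimensions=f"{nx} {ny} {nz}")
    # Three rows of five numbers: start, stride, count (one column per source dimension).
    sel = ET.SubElement(slab, "DataItem", Dimensions="3 5", Format="XML")
    sel.text = (f"{time_idx} {type_idx} 0 0 0 "
                f"1 1 1 1 1 "
                f"1 1 {nx} {ny} {nz}")
    # NumberType/Precision are copied from the example above (assumption for other files).
    src = ET.SubElement(slab, "DataItem", Format="HDF", NumberType="UInt",
                        Precision="2", Dimensions=" ".join(map(str, shape)))
    src.text = h5_path
    return attr


attr = hyperslab_attribute("ana.h5:/density_field", (1, 2, 12, 6, 6), 0, 1, "Type1")
xml_text = ET.tostring(attr, encoding="unicode")
print(xml_text)
```

Calling the helper in a loop over time and type indices would give one Attribute per selection, each of which can be placed in its own Grid with a matching Time element.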
My xml file is test.xml. See below
<?xml version="1.0" ?>
<Output>
<partialstore /> <!-- The code writes the spectrum at Sun position for each species on a FITS file (optional) -->
<fullstore /> <!-- The code writes the complete (r,z,p) grid of propagated particles for each species on a FITS file (optional) -->
<feedback value="2" /> </Output>
<Diffusion type="Constant"> <!-- Spatial distribution of the diffusion coefficient; options: Constant, Exp, Qtau -->
<!-- In Constant mode: D(Rigidity) = beta^etaT * D0 * (Rigidity/4GV)^delta -->
<!-- In Exp mode: D(Rigidity,z) = beta^etaT * D0 * (Rigidity/4GV)^delta * exp(z/zt) -->
<D0_1e28 value="2.7" /> <!-- Normalization of the diffusion coefficient at reference rigidity DiffRefRig Unit: 10^28 cm^2/s -->
<DiffRefRig value = "4" /> <!-- Reference rigidity for the normalization of the diffusion coefficient -->
<!-- NOTE: the reference rigidity 4 GV is stored in the const D_ref_rig defined in include/constants.h -->
<Delta value="0.6" /> <!-- Slope of the diffusion coefficient spectrum -->
<zt value="4" /> <!-- Scale heigth of the diffusion coefficient, useful in Exp mode: D(z) \propto exp(z/zt) (optional) -->
<etaT value="1." /> <!-- Low energy correction factor of the diffusion coefficient: D \propto beta^etaT --> </Diffusion>
I want to parse this XML with Python. There are two root elements, Output and Diffusion. I want to add 0.6 to the value attributes of D0_1e28 and Delta, so that they become 3.3 and 1.2 respectively, and save the modified test.xml as test_mod.xml. How can I do this in Python?
If I understand you correctly, this should work, using xpath with lxml:
xml = """
<root>
[your xml above; note that it was an invalid xml file, so I added a root]
</root>
"""
import lxml.etree as et

tree = et.fromstring(xml)
to_change = ['//Diffusion/D0_1e28', '//Delta']  # XPath of each element to change
for path in to_change:
    item = tree.xpath(path)[0]  # xpath returns a list with one item, so [0] selects it
    # The "value" attribute is a string: cast it to float to add 0.6,
    # round away floating-point noise (such as 3.3000000000000003),
    # then cast back to a string to store it in the attribute.
    item.attrib['value'] = str(round(float(item.attrib['value']) + 0.6, 10))
mydata = et.tostring(tree)
with open("test_mod.xml", "wb") as myfile:
    myfile.write(mydata)
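If lxml is not available, roughly the same thing can be done with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The sketch below uses a shortened, hypothetical stand-in for test.xml (again wrapped in an added root element), and the function name `bump_values` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for test.xml, wrapped in an added <root>.
XML_TEXT = """<root>
<Output><feedback value="2" /></Output>
<Diffusion type="Constant">
<D0_1e28 value="2.7" />
<Delta value="0.6" />
</Diffusion>
</root>"""


def bump_values(xml_text, paths, delta):
    """Add delta to the 'value' attribute of the first element matching each path."""
    root = ET.fromstring(xml_text)
    for path in paths:
        elem = root.find(path)
        if elem is None:
            raise KeyError(f"no element matches {path!r}")
        # round() strips floating-point noise before converting back to text.
        elem.set("value", str(round(float(elem.get("value")) + delta, 10)))
    return ET.tostring(root, encoding="unicode")


modified = bump_values(XML_TEXT, ["./Diffusion/D0_1e28", ".//Delta"], 0.6)
print(modified)
```

The returned string can then be written to test_mod.xml with an ordinary open(..., "w") call.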
I am working on a dataset which has a feature that has multiple categories for a single example.
The feature looks like this:-
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to this question: Encode categorical features with multiple categories per example - sklearn
Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to that question. But there are around 2000 categories, which results in sparse and very high-dimensional encoded data.
Is there any other encoding that can be used? Or any possible solution for this problem. Thanks.
Given an incredibly sparse array one could use a dimensionality reduction technique such as PCA (Principal component analysis) to reduce the feature space to the top k features that best describe the variance.
Assuming X is the MultiLabelBinarizer output with 2000 features:
from sklearn.decomposition import PCA
k = 5
model = PCA(n_components=k, random_state=666)
model.fit(X)
# PCA has no predict(); transform() projects X onto the top k components
Components = model.transform(X)
And then you can use the top K components as a smaller dimensional feature space that can explain a large portion of the variance for the original feature space.
If you want to understand how well the new smaller feature space describes the variance you could use the following command
model.explained_variance_
In many cases where a column with many categories generated too many features, I opted for binary encoding. It worked out fine most of the time, so it may be worth a shot for you.
Imagine you have 9 categories numbered 1 to 9. Binary-encoding them gives:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Since 2^11 = 2048 and you have around 2000 categories, you can reduce them to 11 feature columns instead of many (for example, 1999 in the case of one-hot)!
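The table above can be reproduced with a few lines of plain Python. This is a minimal sketch of the idea only, assuming each category has already been assigned a positive integer id; the function names are made up for illustration and are not taken from any encoder library.

```python
def binary_encode(category_id, n_bits):
    """Return category_id (a positive integer) as a list of n_bits 0/1 values."""
    if not 0 < category_id < 2 ** n_bits:
        raise ValueError("category_id does not fit in n_bits")
    # Most significant bit first, matching the table above.
    return [(category_id >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def bits_needed(n_categories):
    """Smallest number of bits that can label categories 1..n_categories."""
    return n_categories.bit_length()


print(binary_encode(9, 4))   # [1, 0, 0, 1]
print(bits_needed(2000))     # 11
```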
I ran into the same problem and solved it using CountVectorizer from sklearn.feature_extraction.text, simply by passing binary=True, i.e. CountVectorizer(binary=True)
I have tried to train a detector by following a tutorial. When training finished, stage files and a cascade file were created. I know the algorithm, but I don't know the meaning of the information inside these files.
<internalNodes>
0 -1 13569 2.8149113059043884e-003</internalNodes>
<leafValues>
9.8837211728096008e-002 -8.5897433757781982e-001</leafValues></_>
and
<rects>
<_>
0 0 3 1 -1.</_>
<_>
1 0 1 1 3.</_></rects>
<tilted>0</tilted></_>
What are the meanings of these values?
Let's start with first block:
<internalNodes>
0 -1 13569 2.8149113059043884e-003</internalNodes>
<leafValues>
9.8837211728096008e-002 -8.5897433757781982e-001</leafValues></_>
It describes one of the weak classifiers. In this case it is a stump, i.e. a tree with a maximum depth of 1. 0 and -1 are the indexes of the left and right children of the root node. An index less than or equal to zero indicates a leaf node; negate it to get the leaf index (so 0 is leaf 0 and -1 is leaf 1). The next number (13569) is the index of the feature in the <features> section, and the last number (2.8149113059043884e-003) is the node threshold. The leafValues section holds the weights of the leaves of the tree.
For example, in this weak classifier we need to compute the value of feature 13569 and compare it with the threshold (2.8149113059043884e-003). If it is less than the threshold, add the first leaf value (9.8837211728096008e-002); otherwise add the second leaf value (-8.5897433757781982e-001).
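The comparison described above can be sketched in a few lines of Python. This is an illustration of the explanation only, not OpenCV's code: the function names are made up, and any normalisation a real detector applies to the feature value before comparing is not modelled here.

```python
def parse_stump(internal_nodes, leaf_values):
    """Parse the text of <internalNodes> and <leafValues> for a depth-1 tree."""
    left, right, feature_idx, threshold = internal_nodes.split()
    return {
        # Child indexes <= 0 denote leaves; negating gives the leaf index.
        "left_leaf": -int(left),
        "right_leaf": -int(right),
        "feature": int(feature_idx),
        "threshold": float(threshold),
        "leaves": [float(v) for v in leaf_values.split()],
    }


def eval_stump(stump, feature_value):
    """Return the leaf weight this weak classifier adds to the stage sum."""
    if feature_value < stump["threshold"]:
        return stump["leaves"][stump["left_leaf"]]
    return stump["leaves"][stump["right_leaf"]]


stump = parse_stump("0 -1 13569 2.8149113059043884e-003",
                    "9.8837211728096008e-002 -8.5897433757781982e-001")
print(stump["feature"])        # 13569
print(eval_stump(stump, 0.0))  # below the threshold: first leaf value
print(eval_stump(stump, 1.0))  # above the threshold: second leaf value
```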
The next section describes one of the Haar features:
<rects>
<_>
0 0 3 1 -1.</_>
<_>
1 0 1 1 3.</_></rects>
<tilted>0</tilted></_>
It describes the parameters of each rectangle (x, y, width, height) and the rectangle's weight. A feature may also be tilted, which is indicated by the <tilted> flag (0 here, meaning not tilted).
I hope it will help.
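To make the rectangle description concrete, here is a small sketch that evaluates the feature above as a weighted sum of pixel sums. It assumes coordinates are relative to the top-left corner of the detection window and uses direct summation for clarity, whereas real implementations typically use an integral image. The function names and the tiny test images are made up for illustration.

```python
def rect_sum(image, x, y, w, h):
    """Sum of pixels in the rectangle with top-left corner (x, y), width w, height h."""
    return sum(image[row][col]
               for row in range(y, y + h)
               for col in range(x, x + w))


def haar_feature(image, rects):
    """Weighted sum over rectangles given as (x, y, width, height, weight)."""
    return sum(weight * rect_sum(image, x, y, w, h)
               for x, y, w, h, weight in rects)


# The two rectangles from the <rects> block above.
RECTS = [(0, 0, 3, 1, -1.0), (1, 0, 1, 1, 3.0)]

flat = [[1, 1, 1]]   # uniform strip: -1*3 + 3*1
peak = [[1, 5, 1]]   # bright centre pixel: -1*7 + 3*5

print(haar_feature(flat, RECTS))  # 0.0
print(haar_feature(peak, RECTS))  # 8.0
```

With these weights the feature is zero on a uniform strip and grows when the middle pixel differs from its neighbours, which is why it acts as a simple line detector.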
I am going to do some work for transition-based dependency parsing using LIBLINEAR. But I am confused how to utilize it. As follows:
I set 3 feature templates for my training&testing processes of transition-based dependency parsing:
1. the word in the top of the stack
2. the word in the front of the queue
3. information from the current tree formed with the steps
And the feature defined in LIBLINEAR is:
FeatureNode(int index, double value)
Some examples like:
LABEL ATTR1 ATTR2 ATTR3 ATTR4 ATTR5
----- ----- ----- ----- ----- -----
1 0 0.1 0.2 0 0
2 0 0.1 0.3 -1.2 0
1 0.4 0 0 0 0
2 0 0.1 0 1.4 0.5
3 -0.1 -0.2 0.1 1.1 0.1
But I want to define my features like this at some stage (for the sentence 'I love you'):
feature template 1: the word is 'love'
feature template 2: the word is 'you'
feature template 3: the information is - the left son of 'love' is 'I'
Does it mean I must define features with LIBLINEAR like: -------FORMAT 1
(indexes in vocabulary: 0-I, 1-love, 2-you)
LABEL ATTR1(template1) ATTR2(template2) ATTR3(template3)
----- ----- ----- -----
SHIFT 1 2 0
(or LEFT-arc,
RIGHT-arc)
But having gone through some statements by others, it seems I have to define features in binary, so I would need a word vector like:
('I', 'love', 'you'), when 'you' appears for example, the vector will be (0, 0, 1)
So the features in LIBLINEAR may be: -------FORMAT 2
LABEL ATTR1('I') ATTR2('love') ATTR3('you')
----- ----- ----- -----
SHIFT 0 1 0 ->denoting the feature template 1
(or LEFT-arc,
RIGHT-arc)
SHIFT 0 0 1 ->denoting the feature template 2
(or LEFT-arc,
RIGHT-arc)
SHIFT 1 0 0 ->denoting the feature template 3
(or LEFT-arc,
RIGHT-arc)
Which is correct, FORMAT 1 or FORMAT 2?
Is there something I have misunderstood?
Basically you have a feature vector of the form:
LABEL RESULT_OF_FEATURE_TEMPLATE_1 RESULT_OF_FEATURE_TEMPLATE_2 RESULT_OF_FEATURE_TEMPLATE_3
Liblinear or LibSVM expect you to translate it into integer representation:
1 1:1 2:1 3:1
Nowadays, depending on the language you use there are lots of packages/libraries, which would translate the string vector into libsvm format automatically, without you having to know the details.
However, if for whatever reason you want to do it yourself, the easiest thing would be to maintain two mappings: one for labels ('shift' -> 1, 'left-arc' -> 2, 'right-arc' -> 3, 'reduce' -> 4) and one for your feature template results ('f1=I' -> 1, 'f2=love' -> 2, 'f3=you' -> 3). Basically, every time your algorithm applies a feature template, you check whether the result is already in the mapping, and if not, you add it with a new index.
Remember that Liblinear or Libsvm expect a sorted list in ascending order.
During processing you would first apply your feature templates to the current state of your stacks and then translate the strings to the libsvm/liblinear integer representation and sort the indexes in ascending order.
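The two mappings and the sorting step could be sketched like this in Python. The class name, function name and feature strings are hypothetical, and the sketch only covers producing the text line, not training or the parser itself.

```python
class IndexMap:
    """Assign consecutive integer ids (starting at 1) to strings on first sight."""

    def __init__(self):
        self._index = {}

    def get(self, key):
        if key not in self._index:
            self._index[key] = len(self._index) + 1
        return self._index[key]


labels = IndexMap()     # e.g. 'SHIFT' -> 1, 'LEFT-ARC' -> 2, ...
features = IndexMap()   # e.g. 's0=love' -> 1, 'q0=you' -> 2, ...


def to_liblinear_line(label, feature_strings):
    """Translate one training instance into 'label idx:1 idx:1 ...' form."""
    label_id = labels.get(label)
    # Each fired feature is binary (value 1); indexes must be in ascending order.
    ids = sorted({features.get(f) for f in feature_strings})
    return " ".join([str(label_id)] + [f"{i}:1" for i in ids])


print(to_liblinear_line("SHIFT", ["s0=love", "q0=you", "left(love)=I"]))
# 1 1:1 2:1 3:1
print(to_liblinear_line("LEFT-ARC", ["s0=you", "q0=love"]))
# 2 4:1 5:1
```

Prefixing each string with its template name (here s0=, q0=) keeps 'love' on top of the stack distinct from 'love' at the front of the queue, which is what FORMAT 2 in the question is missing.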
I am trying to compare means of the two groups 'single mothers with one child' and 'single mothers with more than one child' before and after the reform of the EITC system in 1993.
Through the procedure T-test in SPSS, I can get the difference between groups before and after the reform. But how do I get the difference of the difference (I still want standard errors)?
I found these methods for STATA and R (http://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/), but I can't seem to figure it out in SPSS.
Hope someone will be able to help.
All the best,
Anne
This can be done with the GENLIN procedure. Here's some random data I generated to show how:
data list list /after oneChild value.
begin data.
0 1 12
0 1 12
0 1 11
0 1 13
0 1 11
1 1 10
1 1 9
1 1 8
1 1 9
1 1 7
0 0 16
0 0 16
0 0 18
0 0 15
0 0 17
1 0 6
1 0 6
1 0 5
1 0 5
1 0 4
end data.
dataset name exampleData WINDOW=front.
EXECUTE.
value labels after 0 'before' 1 'after'.
value labels oneChild 0 '>1 child' 1 '1 child'.
The group means (before I truncated the values to integers) are 17 (>1 child, before), 6 (>1 child, after), 12 (1 child, before), and 9 (1 child, after). So our GENLIN procedure should generate values of -11 (the after-before difference in the >1 child group), -5 (the before-period difference of 1 child minus >1 child), and 8 (the difference between the two groups' after-before differences).
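As a quick check of what to expect, the same three contrasts can be computed directly from the cell means of the truncated example data. This is a plain-Python sketch for illustration, not SPSS syntax; the variable names are made up.

```python
from statistics import mean

# Example data keyed by (after, oneChild), copied from the data list above.
DATA = {
    (0, 1): [12, 12, 11, 13, 11],
    (1, 1): [10, 9, 8, 9, 7],
    (0, 0): [16, 16, 18, 15, 17],
    (1, 0): [6, 6, 5, 5, 4],
}

means = {cell: mean(values) for cell, values in DATA.items()}

# After-before change within each group.
change_many = means[(1, 0)] - means[(0, 0)]   # >1 child
change_one = means[(1, 1)] - means[(0, 1)]    # 1 child

# Before-period gap between the groups, and the difference-in-differences.
gap_before = means[(0, 1)] - means[(0, 0)]
did = change_one - change_many

print(round(change_many, 2), round(gap_before, 2), round(did, 2))
# -11.2 -4.6 8.0
```

The sample values differ slightly from -11, -5 and 8 because the data were truncated to integers.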
To graph the data, just so you can see what we're expecting:
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=after value oneChild MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: after=col(source(s), name("after"), unit.category())
DATA: value=col(source(s), name("value"))
DATA: oneChild=col(source(s), name("oneChild"), unit.category())
GUIDE: axis(dim(2), label("value"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label(""))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(smooth.linear(after*value)), color.interior(oneChild))
ELEMENT: point.dodge.symmetric(position(after*value), color.interior(oneChild))
END GPL.
Now, for the GENLIN:
* Generalized Linear Models.
GENLIN value BY after oneChild (ORDER=DESCENDING)
/MODEL after oneChild after*oneChild INTERCEPT=YES
DISTRIBUTION=NORMAL LINK=IDENTITY
/CRITERIA SCALE=MLE COVB=MODEL PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD)
CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
/MISSING CLASSMISSING=EXCLUDE
/PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION.
The results table shows just what we expect.
The >1 child group is between 10.1 and 12.3 lower after vs. before. This 95% CI contains the "real" value of 11.
The before difference between >1 child and 1 child is between 3.5 and 5.7, containing the real value of 5.
The difference-of-differences is between 6.4 and 9.6, containing the real value of (17-6) - (12-9) = 8.
Std. errors, p values, and the other hypothesis testing values are all reported as well. Hope that helps.
EDIT: this can be done with less "complicated" syntax by computing the interaction term yourself and doing simple linear regression:
compute interaction = after*onechild.
execute.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT value
/METHOD=ENTER after oneChild interaction.
Note that the resulting standard errors and confidence intervals are actually different from the previous method. I don't know enough about SPSS's GENLIN and REGRESSION procedures to tell you why that's the case. In this contrived example, the conclusion you'd draw from your data would be approximately the same. In real life, the data aren't likely to be this clean, so I don't know which method is "better".
I take the general linear model to be an 'ANOVA' model.
So use the corresponding module in SPSS's Analyze menu.
After the t-test, you need to check the equality of variances (sigma) across the groups.
Regarding the first answer above:
* Note that GENLIN uses maximum likelihood estimation (MLE) whereas REGRESSION
* uses ordinary least squares (OLS). Therefore, GENLIN reports z- and Chi-square tests
* where REGRESSION reports t- and F-tests. Rather than using GENLIN, use UNIANOVA
* to get the same results as REGRESSION, but without the need to compute your own
* product term.
UNIANOVA value BY after oneChild
/PLOT=PROFILE(after*oneChild)
/PLOT=PROFILE(oneChild*after)
/PRINT PARAMETER
/EMMEANS=TABLES(after*oneChild) COMPARE(after)
/EMMEANS=TABLES(after*oneChild) COMPARE(oneChild)
/DESIGN=after oneChild after*oneChild.
HTH.
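The MLE versus OLS point in the note above can be illustrated numerically on the example data. The sketch below assumes that the maximum likelihood scale estimate divides the residual sum of squares by n, while OLS divides by n minus the number of estimated cell means; under that assumption the MLE standard error reproduces the 6.4 to 9.6 interval quoted for the difference-of-differences. It is a plain-Python illustration, not SPSS output.

```python
from math import sqrt
from statistics import mean

# Example data keyed by (after, oneChild), copied from the first answer.
DATA = {
    (0, 1): [12, 12, 11, 13, 11],
    (1, 1): [10, 9, 8, 9, 7],
    (0, 0): [16, 16, 18, 15, 17],
    (1, 0): [6, 6, 5, 5, 4],
}

n = sum(len(v) for v in DATA.values())
# Residual sum of squares around the four cell means (the saturated model).
sse = sum((x - mean(v)) ** 2 for v in DATA.values() for x in v)

var_ols = sse / (n - len(DATA))   # unbiased estimate, n - 4 degrees of freedom
var_mle = sse / n                 # assumed maximum likelihood estimate

# The difference-in-differences is a +/-1 contrast of the four cell means,
# so its variance is sigma^2 * sum(1 / n_cell).
factor = sum(1 / len(v) for v in DATA.values())
se_ols = sqrt(var_ols * factor)
se_mle = sqrt(var_mle * factor)

did = 8.0  # difference-in-differences of the cell means for these data
print(round(se_ols, 3), round(se_mle, 3))
print(round(did - 1.96 * se_mle, 1), round(did + 1.96 * se_mle, 1))
```

The point estimate is the same under both approaches; only the scale estimate, and hence the standard error and interval width, changes.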