Compound types with units - hdf5

I wonder whether it is possible in HDF5 to store physical units together with the components of a compound datatype.
To give an example, consider geographic coordinates. A location can be indicated in different ways: latitude, longitude and radius may be given in degrees and kilometers, but radians and meters would be just as valid. How can I store this unit information using h5py?

HDF5 itself does not know about units. Therefore, you have to create an appropriate schema to document the units of your coordinate data. How you do this is really up to you. I can think of at least 3 approaches:
Method 1 creates 1 dataset for each coordinate.
-- Units are defined with a dataset-level attribute.
Method 2 creates 1 compound dataset with all 3 coordinates.
-- Units are defined with 3 dataset-level attributes AND as part of the field (column) names.
Method 3 creates 1 compound dataset with all 3 coordinates.
-- Units are defined with 3 additional string fields (columns).
-- Attributes are NOT used to save units.
Here is a code sample that demonstrates all 3 methods with a small set of data (10 values of each coordinate). Hope this gives you some ideas.
import h5py
import numpy as np

long_arr = np.random.uniform(-180., 180., 10)
lat_arr = np.random.uniform(-90., 90., 10)
rad_arr = np.random.uniform(6357., 6378., 10)

with h5py.File('SO_65977032.h5', mode='w') as h5f:

    # Method 1 creates 1 dataset for each coordinate.
    # Units are defined with a dataset-level attribute.
    h5f.create_group('Method_1')
    h5f['Method_1'].create_dataset('Long', data=long_arr)
    h5f['Method_1']['Long'].attrs['Units'] = 'Degrees'
    h5f['Method_1'].create_dataset('Lat', data=lat_arr)
    h5f['Method_1']['Lat'].attrs['Units'] = 'Degrees'
    h5f['Method_1'].create_dataset('Radius', data=rad_arr)
    h5f['Method_1']['Radius'].attrs['Units'] = 'km'

    # Method 2 creates 1 compound dataset with all 3 coordinates.
    # Units are defined with 3 dataset-level attributes AND as part of the field (column) names.
    h5f.create_group('Method_2')
    llr_dt = [('Long(Deg)', float), ('Lat(Deg)', float), ('Radius(km)', float)]
    h5f['Method_2'].create_dataset('Coords', dtype=llr_dt, shape=(10,))
    h5f['Method_2']['Coords']['Long(Deg)'] = long_arr
    h5f['Method_2']['Coords'].attrs['Long Units'] = 'Degrees'
    h5f['Method_2']['Coords']['Lat(Deg)'] = lat_arr
    h5f['Method_2']['Coords'].attrs['Lat Units'] = 'Degrees'
    h5f['Method_2']['Coords']['Radius(km)'] = rad_arr
    h5f['Method_2']['Coords'].attrs['Radius Units'] = 'km'

    # Method 3 creates 1 compound dataset with all 3 coordinates.
    # Units are defined with 3 additional string fields (columns);
    # attributes are NOT used to save units.
    h5f.create_group('Method_3')
    llru_dt = [('Long', float), ('Long_units', 'S8'),
               ('Lat', float), ('Lat_units', 'S8'),
               ('Radius', float), ('Rad_units', 'S8')]
    h5f['Method_3'].create_dataset('Coords', dtype=llru_dt, shape=(10,))
    h5f['Method_3']['Coords']['Long'] = long_arr
    h5f['Method_3']['Coords']['Long_units'] = ['Degree' for _ in range(10)]
    h5f['Method_3']['Coords']['Lat'] = lat_arr
    h5f['Method_3']['Coords']['Lat_units'] = ['Degree' for _ in range(10)]
    h5f['Method_3']['Coords']['Radius'] = rad_arr
    h5f['Method_3']['Coords']['Rad_units'] = ['km' for _ in range(10)]
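To read the units back, a minimal sketch (using the same file, group, and attribute names as the writer code above) could look like this:

import h5py

# Minimal sketch: read the units back from the file written above.
with h5py.File('SO_65977032.h5', mode='r') as h5f:
    # Method 1: units live in a dataset-level attribute
    print(h5f['Method_1']['Long'].attrs['Units'])
    # Method 2: units live in dataset attributes and in the field (column) names
    print(h5f['Method_2']['Coords'].attrs['Radius Units'])
    print(h5f['Method_2']['Coords'].dtype.names)
    # Method 3: units are stored per row as fixed-length byte strings
    print(h5f['Method_3']['Coords']['Rad_units'][0].decode())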

Finding the Jacobian of a frame with respect to the joints of a given model in Pydrake

Is there any way to find the Jacobian of a frame with respect to the joints of a given model (as opposed to the whole plant), or alternatively to determine which columns of the full plant Jacobian correspond to a given model’s joints? I’ve found MultibodyPlant.CalcJacobian*, but I’m not sure if those are the right methods.
I also tried mapping the JointIndex of each joint in the model to a column of MultibodyPlant.CalcJacobian*, but the results didn't make sense -- the joint indices are sequential (all of one model followed by all of the other), but the Jacobian columns look interleaved (a column corresponding to one model followed by one corresponding to the other).
Assuming you are computing with respect to velocities, you'll want to use Joint.velocity_start() and Joint.num_velocities() to create a mask or set of indices. If you are in Python, then you can use NumPy's array slicing to select the desired columns of your Jacobian.
(If you compute w.r.t. position, then make sure you use Joint.position_start() and Joint.num_positions().)
Example notebook:
https://nbviewer.jupyter.org/github/EricCousineau-TRI/repro/blob/eb7f11d/drake_stuff/notebooks/multibody_plant_jacobian_subset.ipynb
(TODO: Point to a more official source.)
Main code to pay attention to:
def get_velocity_mask(plant, joints):
    """
    Generates a mask according to the supplied set of ``joints``.

    The binary mask is unable to preserve ordering for joint indices, thus
    ``joints`` is required to be a ``set`` (for simplicity).
    """
    assert isinstance(joints, set)
    mask = np.zeros(plant.num_velocities(), dtype=bool)
    for joint in joints:
        start = joint.velocity_start()
        end = start + joint.num_velocities()
        mask[start:end] = True
    return mask


def get_velocity_indices(plant, joints):
    """
    Generates a list of indices according to the supplied list of ``joints``.

    The indices are generated according to the order of ``joints``, thus
    ``joints`` is required to be a ``list`` (for simplicity).
    """
    indices = []
    for joint in joints:
        start = joint.velocity_start()
        end = start + joint.num_velocities()
        for i in range(start, end):
            indices.append(i)
    return indices
...
# print(Jv1_WG1) # Prints 7 dof from a 14 dof plant
[[0.000 -0.707 0.354 0.707 0.612 -0.750 0.256]
[0.000 0.707 0.354 -0.707 0.612 0.250 0.963]
[1.000 -0.000 0.866 -0.000 0.500 0.612 -0.079]
[-0.471 0.394 -0.211 -0.137 -0.043 -0.049 0.000]
[0.414 0.394 0.162 -0.137 0.014 0.008 0.000]
[0.000 -0.626 0.020 0.416 0.035 -0.064 0.000]]
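To apply this to a full Jacobian, here is a minimal usage sketch; it assumes you already have a plant, a model_instance, and a full Jacobian Jv_full computed with one of the MultibodyPlant.CalcJacobian* methods with respect to velocities:

# Assumed to exist: plant, model_instance, and Jv_full with plant.num_velocities() columns.
joints = {plant.get_joint(i) for i in plant.GetJointIndices(model_instance)}
mask = get_velocity_mask(plant, joints)
Jv_model = Jv_full[:, mask]  # keep only the columns for this model's joints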

H2O Gain/Lift table

My question is about the H2O Gain/Lift table. I understand that the response rate is the proportion of all the events that fall into a group/bin. How do I get the pieces of data that fall into bin 1, bin 2, etc.? I want to see how the key variables look in each group/bin with respect to the response rate.
It would also be great to have a full description of how the measures in the Gain/Lift table are calculated (formulas).
The equations for the Gains and Lift Chart can be found in this file: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/hex/GainsLift.java
Which shows:
E = total number of events
N = total number of observations
G = number of groups (10 for deciles or 20 for demi-deciles)
P = overall proportion of observations that are events (P = E/N)
e_i = number of events in group i, i = 1, 2, ..., G
n_i = number of observations in group i
p_i = proportion of observations in group i that are events (p_i = e_i/n_i)
groups: hard-coded to 16; if there are fewer than 16 unique probability values, the number of groups is reduced to the number of unique quantile thresholds
cumulative_data_fraction = (n_1 + ... + n_i) / N
lower_threshold = set by the quantile bins
lift = p_i / P
cumulative_lift = ((e_1 + ... + e_i) / (n_1 + ... + n_i)) / P
response_rate = 100 * p_i
cumulative_response_rate = 100 * (e_1 + ... + e_i) / (n_1 + ... + n_i)
capture_rate = 100 * e_i / E
cumulative_capture_rate = 100 * (e_1 + ... + e_i) / E
gain = 100 * (lift - 1)
cumulative_gain = 100 * (cumulative_lift - 1)
average_response_rate = E / N
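As a quick numeric illustration of these formulas (the per-group counts below are made up, not H2O output):

import numpy as np

e = np.array([30, 20, 10, 5])       # e_i: events in each group (made up)
n = np.array([100, 100, 100, 100])  # n_i: observations in each group (made up)
E, N = e.sum(), n.sum()
P = E / N                           # overall event rate
lift = (e / n) / P
cumulative_lift = (np.cumsum(e) / np.cumsum(n)) / P
response_rate = 100 * e / n
cumulative_response_rate = 100 * np.cumsum(e) / np.cumsum(n)
capture_rate = 100 * e / E
cumulative_capture_rate = 100 * np.cumsum(e) / E
gain = 100 * (lift - 1)
cumulative_gain = 100 * (cumulative_lift - 1)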
So here is an example walkthrough using the H2O-3 Python API:
import h2o
import pandas as pd
import numpy as np
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import and split the dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert the response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split dataset
train, valid = cars.split_frame(ratios=[.7],seed=1234)
# Initialize and train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
# Generate Gains and Lift Table
# documentation on this parameter can be found here:
# http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html?#h2o.model.H2OBinomialModel.gains_lift
gainslift = cars_gbm.gains_lift(train=False, valid=True, xval=False)
Table Overview
As expected we have 16 groups, because this is the hardcoded default behavior. The table reports:
Cumulative data fractions
Threshold probability value
Response rates (proportion of observations that are events in a group)
Cumulative response rate
Event capture rate
Cumulative capture rate
Gain (difference in percentages between the overall proportion of events and the observed proportion of observations that are events in the group)
Cumulative gain
What if I Want Just the Deciles
By default the Gains and Lift Table provides you with more than just the deciles or ventiles, which means you have more flexibility to pick out the percentiles you are interested in.
Let's take the example of getting our deciles. In this example we see that we can start at row 6, skip row 7 and then take the rest of the rows to get our deciles.
Since the Gains and Lift Table returns a TwoDimTable we can use our group numbers as selection indices.
# show gains and lift table data type
print('H2O Gains Lift Table is of type: ', type(gainslift))
H2O Gains Lift Table is of type: <class 'h2o.two_dim_table.H2OTwoDimTable'>
# since this table is small, for ease of use let's convert it to a pandas dataframe
pandas_gl = gainslift.as_data_frame()
pandas_gl = pandas_gl.set_index('group')
gainslift_deciles = pandas_gl.iloc[np.r_[5, 7:16], :]
gainslift_deciles
What if I Want Just the Ventiles
Those are available to select out as well, so let's do that next.
gainslift_ventiles = pandas_gl.iloc[np.r_[7, 9, 11, 13, 15], :]
gainslift_ventiles

Difference between the weight parameter in xgb.DMatrix and scale_pos_weight in hyper params list?

I am having a little difficulty understanding the difference between the weight argument in xgb.DMatrix and the scale_pos_weight parameter in the param list. I am going through the following code, which uses the Higgs data.
Due to the data being unbalanced, the author defines a weight parameter:
weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
sumwpos <- sum(weight * (label==1.0))
sumwneg <- sum(weight * (label==0.0))
However column 32 is already a weight variable, so the author is modifying an already defined weight variable?
Then, the modified weight variable is being set as the "weight" argument of xgb.DMatrix:
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
Additionally, in the param list the author has "scale_pos_weight" = sumwneg / sumwpos.
So scale_pos_weight is a function of sumwneg and sumwpos, which are functions of weight, which is itself a function of a previously defined weight (column 32). So I am confused.
What does the author do in the following line: weight <- as.numeric(dtrain[[32]]) * testsize / length(label)?
And what is the difference between setting the weight in xgb.DMatrix and setting scale_pos_weight in the param list?
When you set
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
weight should be a vector corresponding to your data rows
If for example you have the following data:
A B C
1 1 1 1
2 2 2 2
you need to set weight as a vector of 2 weights
weight <- c(1, 2)
So the first row gets a weight of 1 and the second row gets a weight of 2. Why is this useful? Assume event 1 happened once and event 2 happened twice; you would like the weights to reflect the number of times each event occurred.
Here are a few more examples of using weights:
If you want recent events to have more "value"
The amount of confidence you have in a data row: set all weights between 0 and 1, where the weight represents how sure you are of that data. For example, weight = 0.88 means you give that row 88% confidence.
If you have repeated events: instead of creating more rows, you can store each event once and give it a weight equal to the number of times it occurred.
scale_pos_weight is usually used when you have imbalanced data. For example, in a classification problem where 5% of the data is labeled 1 and 95% is labeled 0, you would like to give more weight to every positive event, so you can simply set scale_pos_weight = 19 (or, as the author wrote, sumwneg / sumwpos).
As for the author redefining weight: I cannot know what he did there without the full code, but I assume he is doing some sort of normalization of the weights.

Is there any way to get abstracts for a given list of pubmed ids?

I have a list of PubMed IDs (PMIDs) and I want to get the abstracts for all of them in a single URL hit:
pmids=[17284678,9997]
abstract_dict={}
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=17284678,9997&retmode=text&rettype=xml"
My requirement is to get the result in this format:
abstract_dict = {"pmid1": "abstract1", "pmid2": "abstract2"}
I can get the above format by fetching each ID separately and updating the dictionary, but to save time I want to pass all IDs to the URL at once and extract only the abstract parts.
Using BioPython, you can give the joined list of Pubmed IDs to Entrez.efetch and that will perform a single URL lookup:
from Bio import Entrez

Entrez.email = 'your_email@provider.com'
pmids = [17284678, 9997]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
This gives as result:
{9997: 'Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction.',
17284678: 'Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.'}
Edit:
In the case of pmids without corresponding abstracts, watch out with the fix you suggested:
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()]
Suppose you have the list of Pubmed IDs pmids = [1, 2, 3], but pmid 2 doesn't have an abstract, so abstracts = ['abstract of 1', 'abstract of 3']
This will cause problems in the final step where I zip both lists together to make a dict:
>>> abstract_dict = dict(zip(pmids, abstracts))
>>> print(abstract_dict)
{1: 'abstract of 1',
2: 'abstract of 3'}
Note that abstracts are now out of sync with their corresponding Pubmed IDs, because you didn't filter out the pmids without abstracts and zip truncates to the shortest list.
Instead, do:
abstract_dict = {}
without_abstract = []
for pubmed_article in records['PubmedArticle']:
    pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
    article = pubmed_article['MedlineCitation']['Article']
    if 'Abstract' in article:
        abstract = article['Abstract']['AbstractText'][0]
        abstract_dict[pmid] = abstract
    else:
        without_abstract.append(pmid)

print(abstract_dict)
print(without_abstract)
from Bio import Entrez
import time

Entrez.email = 'your_email@provider.com'
pmids = [29090559, 29058482, 28991880, 28984387, 28862677, 28804631, 28801717,
         28770950, 28768831, 28707064, 28701466, 28685492, 28623948, 28551248]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()
             else pubmed_article['MedlineCitation']['Article']['ArticleTitle']
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
print(abstract_dict)

Compute annual mean using x-arrays

I have a python xarray dataset with time,x,y for its dimensions and value1 as its variable. I'm trying to compute annual mean of value1 for each x,y coordinate pair.
I've run into this function while reading the docs:
ds.groupby('time.year').mean()
This seems to compute a single annual mean for all x,y coordinate pairs in value1 at each given time slice
rather than the annual means of individual x,y coordinate pairs at each given time slice.
While the code snippet above produces the wrong output, I'm very interested in its oversimplified form. I would really like to figure out the "X-arrays trick" to doing annual mean for a given x,y coordinate pair rather than hacking it together myself.
Can someone point me in the right direction? Should I temporarily turn this into a pandas object?
To avoid the default of averaging over all dimensions, you simply need to supply the dimension you want to average over explicitly:
ds.groupby('time.year').mean('time')
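Here is a minimal, self-contained sketch (with made-up coordinates and random data) showing that this keeps one mean per year for every x, y pair:

import numpy as np
import pandas as pd
import xarray as xr

# Made-up daily data on a 2x2 grid, just to show the shape of the result.
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
ds = xr.Dataset(
    {"value1": (("time", "x", "y"), np.random.rand(time.size, 2, 2))},
    coords={"time": time, "x": [0, 1], "y": [0, 1]},
)

annual = ds.groupby("time.year").mean("time")
print(annual)  # value1 now has a 'year' dimension instead of 'time', per (x, y) cell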
Note that calling ds.groupby('time.year').mean('time') will be incorrect if you are working with monthly rather than daily data: the mean would place equal weight on months of different lengths, e.g., February and July, which is wrong.
Instead, use the function below from NCAR:
import numpy as np
import xarray as xr


def weighted_temporal_mean(ds, var):
    """
    Compute the annual mean, weighting each month by its number of days.
    """
    # Determine the month length
    month_length = ds.time.dt.days_in_month
    # Calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # Make sure the weights in each year add up to 1
    np.testing.assert_allclose(wgts.groupby("time.year").sum(xr.ALL_DIMS), 1.0)
    # Subset our dataset for our variable
    obs = ds[var]
    # Set up masking for NaN values
    cond = obs.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Calculate the numerator
    obs_sum = (obs * wgts).resample(time="AS").sum(dim="time")
    # Calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # Return the weighted average
    return obs_sum / ones_out


average_weighted_temp = weighted_temporal_mean(ds_first_five_years, 'TEMP')
