I've currently got the below set of smoothed data:
print(df_smooth.dropna())
mean std skew kurtosis peak2peak rms crestFactor \
4 0.247555 2.100961 0.001668 3.024679 20.628402 2.115862 5.066747
5 0.237015 2.062690 -0.000792 3.029156 20.314159 2.076466 5.043114
6 0.230783 2.044657 -0.001680 3.028746 20.219575 2.057846 5.030472
7 0.235838 1.986232 -0.001031 3.025417 19.497090 2.000425 4.960363
8 0.235062 1.984086 -0.001014 3.031342 19.817176 1.998209 4.989612
9 0.238660 1.968814 -0.001608 3.023882 19.340179 1.983427 4.998115
10 0.223305 1.975597 -0.000197 3.045224 19.701747 1.988305 5.135947
11 0.219480 2.007902 -0.002460 3.060428 20.252087 2.020074 5.117502
12 0.214518 2.071287 -0.002944 3.092217 21.489908 2.082439 5.302407
13 0.244281 2.122538 -0.003717 3.094335 21.792449 2.137164 5.271366
14 0.235806 2.161333 -0.003364 3.123866 23.128965 2.174895 5.472129
15 0.233630 2.175946 -0.002682 3.152740 24.045300 2.189226 5.610038
16 0.236764 2.188906 -0.000032 3.203623 24.745386 2.202420 5.772337
17 0.262289 2.205111 0.000350 3.192511 24.708587 2.221785 5.681394
18 0.229795 2.139946 0.001239 3.183109 23.745617 2.152940 5.564731
19 0.243538 2.150018 0.001071 3.170558 23.385026 2.164355 5.427326
20 0.266458 2.097468 -0.000830 3.144338 22.084817 2.115172 5.236667
21 0.280729 2.106302 -0.000618 3.101014 21.434129 2.125517 5.147621
22 0.252042 2.078190 0.000259 3.100911 20.991519 2.093988 5.231684
23 0.252297 2.097652 0.000383 3.126250 21.790854 2.113380 5.378267
24 0.250502 2.078781 0.000042 3.129014 21.559732 2.094428 5.340024
25 0.220506 2.070573 0.001974 3.110477 21.473643 2.082461 5.364519
26 0.204412 2.049979 -0.000306 3.227532 22.975315 2.060236 5.706146
27 0.215429 2.103150 -0.001421 3.275257 23.719901 2.114265 5.660891
28 0.216689 2.137870 -0.001783 3.298750 24.040561 2.148948 5.614089
29 0.208962 2.160487 0.000547 3.349068 24.546959 2.170628 5.732873
30 0.227231 2.267705 0.000101 3.413948 25.958169 2.279131 5.745555
31 0.221097 2.258519 0.001567 3.379193 25.424651 2.269446 5.662354
32 0.204962 2.224569 0.000951 3.458483 25.984242 2.234101 5.862379
33 0.224707 2.283631 0.000046 3.516125 27.410217 2.294934 6.024091
34 0.248792 2.354713 -0.001143 3.630634 29.159253 2.368248 6.197140
35 0.229501 2.339020 -0.000673 3.743356 30.695670 2.350898 6.613011
36 0.255474 2.454993 -0.001164 3.780962 32.480614 2.468843 6.627903
37 0.257979 2.530495 0.000630 3.962767 33.656646 2.544310 6.661273
38 0.232977 2.498537 0.001111 3.931879 32.754947 2.510044 6.557506
39 0.237025 2.392735 -0.000920 3.919665 31.277647 2.405969 6.494115
40 0.243630 2.368295 -0.001569 3.812383 29.306347 2.382131 6.077379
41 0.221252 2.305374 -0.000861 4.032235 29.548822 2.317355 6.292428
42 0.215262 2.254417 -0.002057 3.977328 28.970507 2.266098 6.353168
43 0.208581 2.240020 -0.001403 4.154288 30.121039 2.251270 6.630079
44 0.170230 2.302794 -0.001867 4.307822 31.556097 2.309174 6.838202
45 0.168889 2.353960 -0.001309 4.433633 32.825109 2.360053 6.977719
46 0.163156 2.337222 -0.001097 4.238485 31.344888 2.342934 6.658564
47 0.165685 2.369817 -0.002246 4.151915 31.154929 2.375626 6.438286
48 0.190677 2.552397 -0.003645 4.311166 33.473407 2.559565 6.428513
49 0.210200 2.667889 0.004168 4.495159 35.625185 2.676223 6.500683
I want to use scikit-learn's mutual information classification (mutual_info_classif) to test for monotonicity in this dataset, but I am having trouble with the syntax (specifically the X value) and with splitting the full dataset into test and train sets.
I only want 40% of the dataset to be used as the "test data".
Currently this is the command I have:
X_train,X_test,y_train,y_test=train_test_split(df_smooth.dropna(),
test_size=0.4,
random_state=0)
print(X_train)
This is the error I get:
ValueError: not enough values to unpack (expected 4, got 2)
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
The output I want is something like this:
[Image: monotonicity bar chart, descending]
Where the MIC array is ranked from highest to low.
Using the following command:
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
I tried extracting the ordered numbers 1-49 from the dataframe (which I believe is what should be passed as the "X" input to the MIC function), but they don't seem to be part of the dataframe when I call iloc[:,0] (which returns the values in the "mean" column instead). I also don't know how this accounts for the rows dropped because of "n/a" values.
If you're testing for something like "the degree of monotonicity between two variables," you're probably looking for Spearman's rank correlation coefficient, which is implemented in scipy.stats.spearmanr:
MRE:
from io import StringIO
import pandas as pd
from scipy import stats
data = StringIO("""mean,std,skew,kurtosis,peak2peak,rms,crestFactor
0.247555,2.100961,0.001668,3.024679,20.628402,2.115862,5.066747
0.237015,2.062690,-0.000792,3.029156,20.314159,2.076466,5.043114
0.230783,2.044657,-0.001680,3.028746,20.219575,2.057846,5.030472
0.235838,1.986232,-0.001031,3.025417,19.497090,2.000425,4.960363
0.235062,1.984086,-0.001014,3.031342,19.817176,1.998209,4.989612
""")
df = pd.read_csv(data)
for var in df.columns:
    print(f"{var} {stats.spearmanr(df[var], range(len(df))).correlation:.2f}")
Comparing the first five values of each column to the strictly monotonic sequence range() yields the following table, suggesting the first few samples are antimonotone:
mean -0.70
std -1.00
skew -0.60
kurtosis 0.60
peak2peak -0.90
rms -1.00
crestFactor -0.90
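To make it clearer what the coefficient measures, here is a minimal pure-Python sketch of Spearman's rank correlation, computed as the Pearson correlation of the ranks. The helper names (ranks, pearson, spearman) are made up for this illustration and are not part of scipy; in practice scipy.stats.spearmanr is the tool to use.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Plain Pearson correlation of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# First five rows of three columns from the question
mean = [0.247555, 0.237015, 0.230783, 0.235838, 0.235062]
std = [2.100961, 2.062690, 2.044657, 1.986232, 1.984086]
kurtosis = [3.024679, 3.029156, 3.028746, 3.025417, 3.031342]
time = list(range(5))

for name, column in [("mean", mean), ("std", std), ("kurtosis", kurtosis)]:
    print(f"{name} {spearman(column, time):.2f}")
```

A value of -1 or +1 means the column is perfectly monotone over the window (std decreases at every step here); values near 0 mean there is no consistent trend.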
I would like to add random numbers to a dask dataframe that uses a column intensity of the original dataframe to set the limits of the random numbers for each row. The code works with pandas and numpy.random, but not with dask and dask.array.
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
fns = [list-of-filenames]
df = dd.read_parquet(fns)
# dataframe has a column called intensity of type float
# and no missing values
df['separation_dimension_1'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
The error is:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (0,) and arg 1 with shape (33276691,).
Seems the syntax of numpy.random.uniform is a bit different than dask_array.random.uniform?
Full traceback
Cell In[21], line 7
5 df['mz_'] = df.mz * 1000000000
6 df['rt_'] = df.scan_time*10
----> 7 df['separation_dimension_1'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
8 #df['separation_dimension_2'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
9 #df['separation_dimension_3'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
11 df = df[df.intensity > 1e5][['rt_', 'mz_', 'logint']]
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:465, in _make_api.<locals>.wrapper(*args, **kwargs)
462 if backend not in _cached_random_states:
463 # Cache the default RandomState object for this backend
464 _cached_random_states[backend] = RandomState()
--> 465 return getattr(
466 _cached_random_states[backend],
467 attr,
468 )(*args, **kwargs)
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:423, in RandomState.uniform(self, low, high, size, chunks, **kwargs)
421 @derived_from(np.random.RandomState, skipblocks=1)
422 def uniform(self, low=0.0, high=1.0, size=None, chunks="auto", **kwargs):
--> 423 return self._wrap("uniform", low, high, size=size, chunks=chunks, **kwargs)
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:170, in RandomState._wrap(self, funcname, size, chunks, extra_chunks, *args, **kwargs)
165 kwrg[k] = (getitem, lookup[k], slc)
166 vals.append(
167 (_apply_random, self._RandomState, funcname, seed, size, arg, kwrg)
168 )
--> 170 meta = _apply_random(
171 self._RandomState,
172 funcname,
173 seed,
174 (0,) * len(size),
175 small_args,
176 small_kwargs,
177 )
179 dsk.update(dict(zip(keys, vals)))
181 graph = HighLevelGraph.from_collections(name, dsk, dependencies=dependencies)
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:453, in _apply_random(RandomState, funcname, state_data, size, args, kwargs)
451 state = RandomState(state_data)
452 func = getattr(state, funcname)
--> 453 return func(*args, size=size, **kwargs)
File mtrand.pyx:1134, in numpy.random.mtrand.RandomState.uniform()
File _common.pyx:600, in numpy.random._common.cont()
File _common.pyx:517, in numpy.random._common.cont_broadcast_2()
File __init__.pxd:741, in numpy.PyArray_MultiIterNew3()
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (0,) and arg 1 with shape (6249365,).
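As is often the case, you can do this with map_partitions, which applies your operation to each partition of the dask dataframe; each partition is a real pandas dataframe. Note that the size must then be the length of the partition, not the global N:
import numpy as np
def op(df):
    # df is a single partition here, i.e. a plain pandas DataFrame
    bound = noise_level / df.intensity
    df = df.copy()
    df['separation_dimension_1'] = np.random.uniform(low=-bound, high=bound, size=len(df))
    return df
df2 = df.map_partitions(op)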
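The per-partition idea can be illustrated without dask at all. The following is only a plain-Python stand-in, not dask API: lists of dicts play the role of the pandas partitions, the value of noise_level and the intensity values are invented for the example, and op draws exactly one random number per row of whatever partition it receives.

```python
import random

noise_level = 1000.0  # assumed value, for illustration only

def op(partition):
    """Add one uniform random value per row, bounded by noise_level / intensity."""
    out = []
    for row in partition:
        bound = noise_level / row["intensity"]
        new_row = dict(row)
        new_row["separation_dimension_1"] = random.uniform(-bound, bound)
        out.append(new_row)
    return out

# Two "partitions" of different lengths, standing in for the pieces of a dask dataframe
partitions = [
    [{"intensity": 2e5}, {"intensity": 5e4}],
    [{"intensity": 1e6}, {"intensity": 3e5}, {"intensity": 8e4}],
]

# Stand-in for df.map_partitions(op): apply op to each piece independently
result = [op(p) for p in partitions]

for part in result:
    print(len(part), [round(r["separation_dimension_1"], 6) for r in part])
```

Because each piece has its own length, any size argument has to be derived inside op from the piece itself, which is why a single global N does not fit.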
I'm trying to predict time series data for the next few days looking at past few days, using Keras. My label data is target values for multiple future days, regression model has multiple output neurons (the "direct approach" for time series).
Here is test data with predictions for 10 days, using 60 days history.
[Image: 10 days prediction for test data]
As you can see, future values for all days are about the same. I've spent quite some time on it, and must admit that I'm probably missing something with respect to LSTM...
Here is training data with prediction:
[Image: 10 days prediction for training data]
In order to confirm that I'm preparing data properly, I've created a "tracking data set" which I used to visualize data transformations. Here it is...
Data set:
Open,High,Low,Close,Volume,OpenInt
111,112,113,114,115,0
121,122,123,124,125,0
131,132,133,134,135,0
141,142,143,144,145,0
151,152,153,154,155,0
161,162,163,164,165,0
171,172,173,174,175,0
181,182,183,184,185,0
191,192,193,194,195,0
201,202,203,204,205,0
211,212,213,214,215,0
221,222,223,224,225,0
231,232,233,234,235,0
241,242,243,244,245,0
251,252,253,254,255,0
261,262,263,264,265,0
271,272,273,274,275,0
281,282,283,284,285,0
291,292,293,294,295,0
Training set using 2 days history, predicting 3 days future values (I used different values of history days and future days, and it all makes sense to me), without feature scaling in order to visualize data transformations:
X train (6, 2, 5)
[[[111 112 113 114 115]
[121 122 123 124 125]]
[[121 122 123 124 125]
[131 132 133 134 135]]
[[131 132 133 134 135]
[141 142 143 144 145]]
[[141 142 143 144 145]
[151 152 153 154 155]]
[[151 152 153 154 155]
[161 162 163 164 165]]
[[161 162 163 164 165]
[171 172 173 174 175]]]
Y train (6, 3)
[[131 141 151]
[141 151 161]
[151 161 171]
[161 171 181]
[171 181 191]
[181 191 201]]
Test set
X test (6, 2, 5)
[[[201 202 203 204 205]
[211 212 213 214 215]]
[[211 212 213 214 215]
[221 222 223 224 225]]
[[221 222 223 224 225]
[231 232 233 234 235]]
[[231 232 233 234 235]
[241 242 243 244 245]]
[[241 242 243 244 245]
[251 252 253 254 255]]
[[251 252 253 254 255]
[261 262 263 264 265]]]
Y test (6, 3)
[[221 231 241]
[231 241 251]
[241 251 261]
[251 261 271]
[261 271 281]
[271 281 291]]
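For reference, the arrays above are consistent with a sliding-window transformation like the sketch below. This is a reconstruction, not the original preprocessing code: the helper name make_windows, the choice of the Open column as the target, and the row ranges used for the train and test parts are assumptions inferred from the printed arrays.

```python
def make_windows(rows, history, future, target_col=0):
    """Build (X, y) pairs: `history` full rows in, `future` target values out."""
    X, y = [], []
    for start in range(len(rows) - history - future + 1):
        X.append(rows[start:start + history])
        y.append([r[target_col]
                  for r in rows[start + history:start + history + future]])
    return X, y

# Tracking data set without the OpenInt column: 19 rows, 5 features each
rows = [[base + k for k in range(1, 6)] for base in range(110, 300, 10)]

# Assumed split: first 10 rows for training, rows from index 9 onwards for testing
X_train, y_train = make_windows(rows[:10], history=2, future=3)
X_test, y_test = make_windows(rows[9:], history=2, future=3)

print(len(X_train), len(X_train[0]), len(X_train[0][0]))  # shape of X train
print(y_train)
print(y_test)
```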
Model:
def CreateRegressor(self,
                    optimizer='adam',
                    activation='tanh',  # RNN activation
                    init_mode='glorot_uniform',
                    hidden_neurons=50,
                    dropout_rate=0.0,
                    weight_constraint=0,
                    stateful=False,
                    # SGD parameters
                    learn_rate=0.01,
                    momentum=0):
    kernel_constraint = maxnorm(weight_constraint) if weight_constraint > 0 else None
    model = Sequential()
    model.add(LSTM(units=hidden_neurons, activation=activation, kernel_initializer=init_mode, kernel_constraint=kernel_constraint,
                   return_sequences=True, input_shape=(self.X_train.shape[1], self.X_train.shape[2]), stateful=stateful))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units=hidden_neurons, activation=activation, kernel_initializer=init_mode, kernel_constraint=kernel_constraint,
                   return_sequences=True, stateful=stateful))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units=hidden_neurons, activation=activation, kernel_initializer=init_mode, kernel_constraint=kernel_constraint,
                   return_sequences=True, stateful=stateful))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(units=hidden_neurons, activation=activation, kernel_initializer=init_mode, kernel_constraint=kernel_constraint,
                   return_sequences=False, stateful=stateful))
    model.add(Dropout(dropout_rate))
    model.add(Dense(units=self.y_train.shape[1]))
    if (optimizer == 'SGD'):
        optimizer = SGD(lr=learn_rate, momentum=momentum)
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model
...which I create with these params:
self.CreateRegressor(optimizer = 'adam', hidden_neurons = 100)
... and then fit like this:
self.regressor.fit(self.X_train, self.y_train, epochs=100, batch_size=32)
... and predict:
y_pred = self.regressor.predict(X_test)
... or
y_pred_train = self.regressor.predict(X_train)
What am I missing?
Using the Images package, I can open a color image, convert it to grayscale, and then reinterpret it as an array of integers:
using Images
img_gld = imread("...path to some color jpg...")
img_gld_gs = convert(Image{Gray},img_gld)
#change from floats to Array of values between 0 and 255:
img_gld_gs = reinterpret(Uint8,data(img_gld_gs))
Now I've got a 1920X1080 array of Uint8's:
julia> img_gld_gs
1920x1080 Array{Uint8,2}
Now I want to get a histogram of the 2D array of Uint8 values:
julia> hist(img_gld_gs)
(0.0:50.0:300.0,
6x1080 Array{Int64,2}:
1302 1288 1293 1302 1297 1300 1257 1234 … 12 13 13 12 13 15 14
618 632 627 618 623 620 663 686 189 187 187 188 185 183 183
0 0 0 0 0 0 0 0 9 9 8 7 8 7 7
0 0 0 0 0 0 0 0 10 12 9 7 13 7 9
0 0 0 0 0 0 0 0 1238 1230 1236 1235 1230 1240 1234
0 0 0 0 0 0 0 0 … 462 469 467 471 471 468 473)
But, instead of 6x1080, I'd like 256 slots in the histogram to show total number of times each value has appeared. I tried:
julia> hist(img_gld_gs,256)
But that gives:
(2.0:1.0:252.0,
250x1080 Array{Int64,2}:
So instead of a 256x1080 Array, it's 250x1080. Is there any way to force it to have 256 bins (without resorting to writing my own hist function)? I want to be able to compare different images and I want the histogram for each image to have the same number of bins.
Assuming you want a histogram for the entire image (rather than one per row), you might want
hist(vec(img_gld_gs), -1:255)
which first converts the image to a 1-dimensional vector. (You can also use img_gld_gs[:], but that copies the data.)
Also note the range here: the hist function uses a left-open interval, so it will omit counting zeros unless you use something smaller than 0.
hist also accepts a vector (or range) as an optional argument that specifies the edge boundaries, so
hist(img_gld_gs, 0:256)
should work.
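The effect of the left-open intervals mentioned above can be sketched in a few lines of Python. This is only an illustration of the binning rule as described in this answer (each bin covers edges[i] < v <= edges[i+1]), not a port of Julia's hist; the function name hist_left_open and the sample pixel values are made up.

```python
from bisect import bisect_left

def hist_left_open(values, edges):
    """Count values into bins (edges[i], edges[i+1]], i.e. left-open intervals."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # bisect_left finds the first edge >= v; the bin to its left contains v
        i = bisect_left(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

pixels = [0, 0, 3, 255, 128, 3]  # stand-in for vec(img_gld_gs)

# Edges -1, 0, ..., 255 give 256 bins, and bin k counts the value k exactly
counts = hist_left_open(pixels, list(range(-1, 256)))
print(len(counts), counts[0], counts[3], counts[128], counts[255])

# Edges 0, 1, ..., 256 also give 256 bins, but the zeros fall outside the first bin
shifted = hist_left_open(pixels, list(range(0, 257)))
print(len(shifted), sum(shifted))
```

With fixed edges, every image gets the same 256 bins regardless of which gray levels actually occur in it, which is what makes the histograms comparable.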
I'm beginning with biopython and I have a question about parsing results. I used a tutorial to get involved in this and here is the code that I used:
from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("/Users/jcastrof/blast/pruebarpsb.xml")):
    if record.alignments:
        print "Query: %s..." % record.query[:60]
        for align in record.alignments:
            for hsp in align.hsps:
                print " %s HSP,e=%f, from position %i to %i" \
                    % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
Part of the result obtained is:
gnl|CDD|225858 HSP,e=0.000000, from position 32 to 1118
gnl|CDD|225858 HSP,e=0.000000, from position 1775 to 2671
gnl|CDD|214836 HSP,e=0.000000, from position 37 to 458
gnl|CDD|214836 HSP,e=0.000000, from position 1775 to 2192
gnl|CDD|214838 HSP,e=0.000000, from position 567 to 850
And what I want to do is to sort that result by position of the hit (Hsp_hit-from), like this:
gnl|CDD|225858 HSP,e=0.000000, from position 32 to 1118
gnl|CDD|214836 HSP,e=0.000000, from position 37 to 458
gnl|CDD|214838 HSP,e=0.000000, from position 567 to 850
gnl|CDD|225858 HSP,e=0.000000, from position 1775 to 2671
gnl|CDD|214836 HSP,e=0.000000, from position 1775 to 2192
My input file for rps-blast is a *.xml file.
Any suggestion to proceed?
Thanks!
The HSPs list is just a Python list, and can be sorted as usual. Try:
align.hsps.sort(key = lambda hsp: hsp.query_start)
However, you are dealing with a nested list (each match has a list of HSPs), and you want to sort over all of them. Here making your own list might be best - something like this:
for record in ...:
    print("Query: %s..." % record.query[:60])
    # Flatten the HSPs of all alignments into one list of tuples and sort it.
    # The outer loop (alignments) must come first in the generator expression.
    hits = sorted((hsp.query_start, hsp.query_end, hsp.expect, align.hit_id)
                  for align in record.alignments for hsp in align.hsps)
    for q_start, q_end, expect, hit_id in hits:
        print(" %s HSP,e=%f, from position %i to %i"
              % (hit_id, expect, q_start, q_end))
Peter
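To try the flattening and sorting without a BLAST file, here is a small self-contained sketch. The Hsp and Alignment namedtuples are hypothetical stand-ins for the Biopython record objects (only the attributes used above are modelled), filled with the values from the question. It sorts on query_start alone, so hits with the same start keep their original relative order because Python's sort is stable; sorting full tuples would break such ties by query_end instead.

```python
from collections import namedtuple

# Stand-ins for the Biopython objects; only the attributes used here are modelled
Hsp = namedtuple("Hsp", "query_start query_end expect")
Alignment = namedtuple("Alignment", "hit_id hsps")

alignments = [
    Alignment("gnl|CDD|225858", [Hsp(32, 1118, 0.0), Hsp(1775, 2671, 0.0)]),
    Alignment("gnl|CDD|214836", [Hsp(37, 458, 0.0), Hsp(1775, 2192, 0.0)]),
    Alignment("gnl|CDD|214838", [Hsp(567, 850, 0.0)]),
]

# Flatten the nested structure into (alignment, hsp) pairs, outer loop first
pairs = [(align, hsp) for align in alignments for hsp in align.hsps]

# Stable sort on the query start position only
pairs.sort(key=lambda pair: pair[1].query_start)

for align, hsp in pairs:
    print(" %s HSP,e=%f, from position %i to %i"
          % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end))
```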
I tried to create a neural network to estimate y = x ^ 2. So I created a fitting neural network and gave it some samples for input and output. I tried to build this network in C++. But the result is different than I expected.
With the following inputs:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 -1
-2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23 -24 -25 -26 -27 -28 -29 -30 -31 -32 -33 -34 -35 -36 -37 -38 -39 -40 -41 -42 -43 -44 -45 -46 -47 -48 -49 -50 -51 -52 -53 -54 -55 -56 -57 -58 -59 -60 -61 -62 -63 -64 -65 -66 -67 -68 -69 -70 -71
and the following outputs:
0 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 400
441 484 529 576 625 676 729 784 841 900 961 1024 1089 1156 1225 1296
1369 1444 1521 1600 1681 1764 1849 1936 2025 2116 2209 2304 2401 2500
2601 2704 2809 2916 3025 3136 3249 3364 3481 3600 3721 3844 3969 4096
4225 4356 4489 4624 4761 4900 5041 1 4 9 16 25 36 49 64 81 100 121 144
169 196 225 256 289 324 361 400 441 484 529 576 625 676 729 784 841
900 961 1024 1089 1156 1225 1296 1369 1444 1521 1600 1681 1764 1849
1936 2025 2116 2209 2304 2401 2500 2601 2704 2809 2916 3025 3136 3249
3364 3481 3600 3721 3844 3969 4096 4225 4356 4489 4624 4761 4900 5041
I used the neural network fitting tool, with the samples arranged as matrix rows. Training is 70%, validation is 15% and testing is 15% as well. The number of hidden neurons is two. Then, at the command line, I evaluated this:
purelin(net.LW{2}*tansig(net.IW{1}*inputTest+net.b{1})+net.b{2})
Other information :
My net.b[1] is: -1.16610230053776 1.16667147712026
My net.b[2] is: 51.3266249426358
And net.IW(1) is: 0.344272596370387 0.344111217766824
net.LW(2) is: 31.7635369693519 -31.8082184881063
When my inputTest is 3, the result of this command is 16, while it should be about 9. Have I made an error somewhere?
I found the Stack Overflow post "Neural network in MATLAB", which describes a problem similar to mine. The difference is that in that problem the input and output ranges are the same, whereas in mine they are not. That solution says I need to rescale the results, but how do I do that?
You are right about scaling. As was mentioned in the linked answer, the neural network by default scales the input and output to the range [-1,1]. This can be seen in the network processing functions configuration:
>> net = fitnet(2);
>> net.inputs{1}.processFcns
ans =
'removeconstantrows' 'mapminmax'
>> net.outputs{2}.processFcns
ans =
'removeconstantrows' 'mapminmax'
The second preprocessing function applied to both input/output is mapminmax with the following parameters:
>> net.inputs{1}.processParams{2}
ans =
ymin: -1
ymax: 1
>> net.outputs{2}.processParams{2}
ans =
ymin: -1
ymax: 1
to map both into the range [-1,1] (prior to training).
This means that the trained network expects input values in this range, and outputs values also in the same range. If you want to manually feed input to the network, and compute the output yourself, you have to scale the data at input, and reverse the mapping at the output.
One last thing to remember is that each time you train the ANN, you will get different weights. If you want reproducible results, you need to fix the state of the random number generator (initialize it with the same seed each time). Read the documentation on functions like rng and RandStream.
You also have to pay attention that if you are dividing the data into training/testing/validation sets, you must use the same split each time (probably also affected by the randomness aspect I mentioned).
Here is an example to illustrate the idea (adapted from another post of mine):
%%# data
x = linspace(-71,71,200); %# 1D input
y_model = x.^2; %# model
y = y_model + 10*randn(size(x)).*x; %# add some noise
%%# create ANN, train, simulate
net = fitnet(2); %# one hidden layer with 2 nodes
net.divideFcn = 'dividerand';
net.trainParam.epochs = 50;
net = train(net,x,y);
y_hat = net(x);
%%# plot
plot(x, y, 'b.'), hold on
plot(x, x.^2, 'Color','g', 'LineWidth',2)
plot(x, y_hat, 'Color','r', 'LineWidth',2)
legend({'data (noisy)','model (x^2)','fitted'})
hold off, grid on
%%# manually simulate network
%# map input to [-1,1] range
[~,inMap] = mapminmax(x, -1, 1);
in = mapminmax('apply', x, inMap);
%# propagate values to get output (scaled to [-1,1])
hid = tansig( bsxfun(@plus, net.IW{1}*in, net.b{1}) ); %# hidden layer
outLayerOut = purelin( net.LW{2}*hid + net.b{2} ); %# output layer
%# reverse mapping from [-1,1] to original data scale
[~,outMap] = mapminmax(y, -1, 1);
out = mapminmax('reverse', outLayerOut, outMap);
%# compare against MATLAB output
max( abs(out - y_hat) ) %# this should be zero (or in the order of `eps`)
I opted to use the mapminmax function, but you could have done that manually as well. The formula is a pretty simple linear mapping:
y = (ymax-ymin)*(x-xmin)/(xmax-xmin) + ymin;
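As a rough cross-check outside MATLAB, the same scale, propagate, unscale sequence can be written in a few lines of Python using the weights posted in the question. This is a sketch under assumptions: it presumes the posted network was trained with the default mapminmax settings on the listed data (inputs from -71 to 71, targets from 0 to 5041), and the function names are made up here. tansig is mathematically the same as tanh, and purelin is the identity.

```python
import math

def mapminmax_apply(x, xmin, xmax, ymin=-1.0, ymax=1.0):
    """Linear map from [xmin, xmax] to [ymin, ymax] (the formula above)."""
    return (ymax - ymin) * (x - xmin) / (xmax - xmin) + ymin

def mapminmax_reverse(y, xmin, xmax, ymin=-1.0, ymax=1.0):
    """Inverse of mapminmax_apply: from [ymin, ymax] back to [xmin, xmax]."""
    return (y - ymin) * (xmax - xmin) / (ymax - ymin) + xmin

# Weights and biases as posted in the question
IW = [0.344272596370387, 0.344111217766824]   # net.IW{1}
b1 = [-1.16610230053776, 1.16667147712026]    # net.b{1}
LW = [31.7635369693519, -31.8082184881063]    # net.LW{2}
b2 = 51.3266249426358                         # net.b{2}

def predict(x, in_range=(-71.0, 71.0), out_range=(0.0, 5041.0)):
    """Scale the input, run the two-neuron network, and unscale the output."""
    u = mapminmax_apply(x, *in_range)
    hidden = [math.tanh(w * u + b) for w, b in zip(IW, b1)]   # tansig layer
    out = sum(w * h for w, h in zip(LW, hidden)) + b2         # purelin layer
    return mapminmax_reverse(out, *out_range)

for x in (0, 3, 71):
    print(x, predict(x))
```

With the scaling in place the network output lands close to 0 at x = 0 and close to 5041 at x = 71; with only two hidden neurons the fit in between is coarse, so intermediate inputs will not match x^2 exactly.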