GridSearchCV freezing with linear svm - machine-learning

I have a problem with GridSearchCV freezing (the CPU is active but the program is not advancing) with a linear SVM (with an RBF SVM it works fine).
Depending on the random_state that I use for splitting my data, the freeze happens at different CV splits and for different numbers of PCA components.
The features of one sample look like the following (there are 39 features):
[1 117 137 2 80 16 2 39 228 88 5 6 0 10 13 6 22 23 1 227 246 7 1.656934307 0 5 0.434195726 0.010123735 0.55568054 5 275 119.48398 0.9359527 0.80484825 3.1272728 98 334 526 0.13454546 0.10181818]
Another sample's features:
[23149 4 31839 9 219 117 23 5 31897 12389 108 2 0 33 23 0 0 18 0 0 0 23149 0 0 74 0.996405221 0.003549844 4.49347E-05 74 5144 6.4480677 0.286384 0.9947901 3.833787 20 5135 14586 0.0060264384 0.011664075]
If I delete the last 10 features I don't have this problem (before I added these 10 new features my code worked fine). I have not checked other combinations of the last 10 features to see whether a specific feature is causing the problem.
I also use StandardScaler to scale the features but still face this issue. The problem occurs less often if I use MinMaxScaler (but I read somewhere that it is not good for SVMs).
I also set n_jobs to different values; the search only advances a little further and then freezes again.
What do you suggest?
I followed part of this code to write my code:
TypeError grid seach

Related

How to extend nonlinear curve beyond supplied data in google sheets

I have a plotted spectral curve in Google Sheets. All points are real coordinates. As you can see, no data is provided for the slope below 614nm. I would like to extend the slope beyond the supplied data so that it reaches 0, in a mathematically sound way that follows the trajectory the curve was on where the slope starts. Someone mentioned that I might have to use a linear regression, but I'm not sure what that is. How would I go about extending this slope along its defined trajectory down to 0 in Google Sheets?
Here's the data
x-axis: y-axis:
614 0.7101
616 0.7863
618 0.8623
620 0.9345
622 1.0029
624 1.069
626 1.1317
628 1.1898
630 1.2424
632 1.289
634 1.3303
636 1.3667
638 1.3985
640 1.4261
642 1.4499
644 1.47
646 1.4867
648 1.5005
650 1.5118
652 1.5206
654 1.5273
656 1.532
658 1.5348
660 1.5359
662 1.5355
664 1.5336
666 1.5305
668 1.5263
670 1.5212
672 1.5151
674 1.5079
676 1.4994
678 1.4892
680 1.4771
682 1.4631
684 1.448
686 1.4332
688 1.4197
690 1.4088
692 1.4015
694 1.3965
696 1.3926
698 1.388
700 1.3813
702 1.3714
704 1.359
706 1.345
708 1.3305
710 1.3163
712 1.303
714 1.2904
716 1.2781
718 1.2656
720 1.2526
722 1.2387
724 1.2242
726 1.2091
728 1.1937
730 1.1782
Thanks
I understand that you want to extend the curve beyond the given data in a mathematically sound way until it reaches 0. In what follows I show how to do it using the last 2 data points, which makes the extended data linear; it might help. Take a look at this Sheet.
We need to:
1 - Paste this SEQUENCE formula in C3 to number the input rows in order.
=SEQUENCE(COUNTA(B3:B),1,1,1)
2 - SORT the input by pasting this formula in E3.
=SORT(A3:C61,3,0)
3 - In F62, right after the last line of the sorted data, paste this TREND formula. TREND fits an ideal linear trend to the known data using the least squares approach and predicts additional values along that trend.
=TREND(F60:F61,E60:E61,E62:E101)
TREND takes
'known_data_y' set to F60:F61
'[known_data_x]' set to E60:E61; together these are the 2 data points
'[new_data_x]' set to E62:E101, the new x values entered after the last line of the sorted data in the "x-axis:" column of the output table, starting at cell E62
4 - To show the newly generated data as the red curve, we need a new column that starts at K62 and runs to the bottom of the data. Paste this ArrayFormula in K62.
=ArrayFormula(E62:G101)
5 - Add a series to the chart in chart editor > Setup > Series > Add Series.
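Because only two known points are passed to TREND, the fitted line is simply the straight line through them. The same arithmetic can be sketched in Python, assuming the two points are the first two rows of the posted data (614 nm and 616 nm, which end up last after the descending sort). The function name extend_line and the 2 nm step are illustrative choices, not part of the sheet.

```python
def extend_line(p1, p2, step=2.0):
    """Extend the straight line through p1 and p2 below p1 while y stays above 0.

    p1 and p2 are (x, y) pairs, with p1 the lower-x point and a rising slope.
    Returns the extrapolated (x, y) points and the x where the line hits y = 0.
    """
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)      # rise per unit of x
    x_zero = x1 - y1 / slope           # solve y1 + slope * (x - x1) = 0
    points = []
    x = x1 - step
    while x > x_zero:
        points.append((x, y1 + slope * (x - x1)))
        x -= step
    return points, x_zero


# First two rows of the posted data (wavelength in nm, value)
points, x_zero = extend_line((614, 0.7101), (616, 0.7863))
print(round(x_zero, 2))
for x, y in points:
    print(x, round(y, 4))
```

A straight line through two points ignores the curvature visible in the data, so this is only the simplest possible extension.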

Which clustering model can I use to predict the following outcome?

I have three columns in my dataset. It is a list of restaurants in the category 'pizza', derived from the Yelp dataset. Each restaurant has three columns: latitude, longitude and checkins. I am supposed to build a model that predicts the coordinates (latitude, longitude) where I should open a new restaurant so that the number of check-ins is high. There are 4951 rows in total.
checkins latitude longitude
0 2 33.394877 -111.600194
1 2 43.841217 -79.303936
2 1 40.442828 -80.186293
3 1 41.141631 -81.356603
4 1 40.434399 -79.922983
5 1 33.552870 -112.133712
6 1 43.686836 -79.293838
7 2 41.131282 -81.490180
8 1 40.500796 -79.943429
9 12 36.010086 -115.118656
10 2 41.484475 -81.921150
11 1 43.842450 -79.027990
12 1 43.724840 -79.289919
13 2 45.448630 -73.608719
14 1 45.577027 -73.330855
15 1 36.238059 -115.210341
16 1 33.623055 -112.339758
17 1 43.762768 -79.491417
18 1 43.708415 -79.475884
19 1 45.588257 -73.428926
20 4 41.152875 -81.358754
21 1 41.608833 -81.525020
22 1 41.425152 -81.896178
23 1 43.694716 -79.304879
24 1 40.442147 -79.956513
25 1 41.336466 -81.784790
26 1 33.231942 -111.721218
27 2 36.291436 -115.287016
28 2 33.641847 -111.995571
29 1 43.570217 -79.566431
... ... ... ...
I tried to approach the problem with clustering using DBSCAN and ended up with the following graph, but I am not able to make any sense of it. How do I proceed, or how can I approach the problem in a different way to get my results?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Join the pizza restaurants with their check-in counts
review = pd.read_csv('pizza_category.csv')
checkin = pd.read_csv('yelp_academic_dataset/yelp_checkin.csv')
final = pd.merge(review, checkin, on='business_id', how='inner')
# dropna() returns a new frame, so its result has to be assigned
final = final.dropna().reset_index(drop=True)

# Feature matrix with columns: checkins, latitude, longitude
X = final[['checkins', 'latitude', 'longitude']].astype(np.float64)
print(X)
arr = X.values

db = DBSCAN(eps=2, min_samples=5)
y_pred = db.fit_predict(arr)

# Column 0 is checkins and column 1 is latitude
plt.figure(figsize=(20, 10))
plt.scatter(arr[:, 0], arr[:, 1], c=y_pred, cmap="plasma")
plt.xlabel("Feature 0 (checkins)")
plt.ylabel("Feature 1 (latitude)")
plt.show()
Here's the plot I got
This is not a clustering problem.
What you want to do is density estimation, where you estimate density based on previous check-in frequencies.
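As a rough illustration of what such a density estimate could look like, here is a minimal pure-Python sketch of a Gaussian kernel density weighted by check-ins. It uses a handful of rows copied from the table in the question. The bandwidth of 0.5 degrees, the function name checkin_density, and the use of plain Euclidean distance in degrees (rather than a proper geographic distance) are all simplifying assumptions.

```python
import math

# (checkins, latitude, longitude): a few rows copied from the table in the question
rows = [
    (12, 36.010086, -115.118656),
    (1, 36.238059, -115.210341),
    (2, 36.291436, -115.287016),
    (2, 43.841217, -79.303936),
    (1, 43.686836, -79.293838),
    (1, 43.724840, -79.289919),
    (1, 40.442828, -80.186293),
    (1, 40.434399, -79.922983),
]

BANDWIDTH = 0.5  # ASSUMPTION: kernel width in degrees, an arbitrary illustrative value


def checkin_density(lat, lon, rows, bandwidth=BANDWIDTH):
    """Unnormalised Gaussian kernel density at (lat, lon), weighted by check-ins."""
    total = 0.0
    for checkins, r_lat, r_lon in rows:
        # Squared Euclidean distance in degrees (not a true geographic distance)
        d2 = (lat - r_lat) ** 2 + (lon - r_lon) ** 2
        total += checkins * math.exp(-d2 / (2 * bandwidth ** 2))
    return total


# Score each existing location and report the one with the highest density
best = max(rows, key=lambda r: checkin_density(r[1], r[2], rows))
print(best[1], best[2], round(checkin_density(best[1], best[2], rows), 3))
```

In practice the density would be evaluated on a grid of candidate coordinates rather than only at existing restaurants, and the bandwidth would need tuning.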

Proper way to set corARMA() correlation structure for lme {nlme}

I have a time series with temperature measurements every 5 minutes over ca. 5-7 days. I'm looking to set the correlation structure for my model, as I have considerable temporal autocorrelation. I've decided that a moving average would be the best form, but I am unsure what to specify within the correlation = corARMA(q=?) part of the model. Here is the output of ACF(m1):
lag ACF
1 0 1.000000000
2 1 0.906757430
3 2 0.782992821
4 3 0.648405513
5 4 0.506600300
6 5 0.369248402
7 6 0.247234208
8 7 0.139716028
9 8 0.059351579
10 9 -0.009968973
11 10 -0.055269347
12 11 -0.086383590
13 12 -0.108512009
14 13 -0.114441343
15 14 -0.104985321
16 15 -0.089398656
17 16 -0.070320370
18 17 -0.051427604
19 18 -0.028491302
20 19 0.005331508
21 20 0.044325557
22 21 0.083718759
23 22 0.121348020
24 23 0.143549745
25 24 0.151540265
26 25 0.146369313
It appears that there is highly significant autocorrelation in the first ca. 7 lags. See also the attached images: 1[Residuals] & 2[Model]
Would this mean I set the correlation = corARMA(q=7)?
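One way to make "the first ca. 7 lags" concrete is to count how many consecutive lags exceed the usual approximate 95% white-noise bound of 1.96/sqrt(n). The sketch below does that in Python with the ACF values posted above. The series length is not given in the question, so N_OBS is an assumption (5 days of 5-minute readings), and the helper name is made up for illustration.

```python
import math

# ACF values for lags 0..25, copied from the ACF(m1) output above
acf = [1.000000000, 0.906757430, 0.782992821, 0.648405513, 0.506600300,
       0.369248402, 0.247234208, 0.139716028, 0.059351579, -0.009968973,
       -0.055269347, -0.086383590, -0.108512009, -0.114441343, -0.104985321,
       -0.089398656, -0.070320370, -0.051427604, -0.028491302, 0.005331508,
       0.044325557, 0.083718759, 0.121348020, 0.143549745, 0.151540265,
       0.146369313]

N_OBS = 5 * 24 * 12  # ASSUMPTION: 5 days of 5-minute readings = 1440 observations


def leading_significant_lags(acf, n_obs):
    """Count consecutive lags, starting at lag 1, with |ACF| above 1.96/sqrt(n)."""
    bound = 1.96 / math.sqrt(n_obs)
    count = 0
    for value in acf[1:]:
        if abs(value) <= bound:
            break
        count += 1
    return count, bound


count, bound = leading_significant_lags(acf, N_OBS)
print(count, round(bound, 4))
```

This only counts lags above the bound; it does not by itself say whether a moving-average or an autoregressive structure describes the decay better.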

Classification Supervised Training Confusion

So I am new to supervised machine learning, but I've been reading books and articles about it and I'm stuck on a problem. (Not stuck, but I don't understand the logic behind classification algorithms). I am trying to classify records as being wrong or not based on historical data.
So this is the original data (training data):
Name Office Age isWrong
F1 1 32 0
F2 2 61 1
F3 1 35 0
F4 0 25 0
F5 1 36 0
F6 2 52 0
F7 2 48 0
F8 1 17 1
F9 2 51 0
F10 0 24 0
F11 4 34 1
F12 0 21 0
F13 2 51 0
F14 0 27 0
F15 3 37 1
(only showing top 15 results of 200 results)
A wrong record is any record that reports an age LOWER than 18 or HIGHER than 60, or an office location that is NOT in {0, 1, 2}. I have more records that show a 1 when any of these conditions is met. I trained my model with this dataset and created a test dataset to check the results. However, I end up getting 0 in the prediction column for every record. I used a Naïve Bayes approach because it assumes independence between the feature variables, which is my case (there is no relationship between the office number and age). I know there are other methods like Logistic Regression and SVC (SVM), but I assume that they require some degree of relationship between the feature variables. Despite that, I still tried those two approaches and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
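The labelling rule above is a simple range check, so it can be written down directly and used to sanity-check the labels before training. A minimal Python sketch follows (the question's own code is Java/Spark; is_wrong and VALID_OFFICES are names made up for illustration):

```python
VALID_OFFICES = {0, 1, 2}


def is_wrong(office, age):
    """Return 1 if the record breaks the rule described above, else 0."""
    bad_age = age < 18 or age > 60
    bad_office = office not in VALID_OFFICES
    return int(bad_age or bad_office)


# (Name, Office, Age, isWrong): the 15 training rows shown above
training = [
    ("F1", 1, 32, 0), ("F2", 2, 61, 1), ("F3", 1, 35, 0), ("F4", 0, 25, 0),
    ("F5", 1, 36, 0), ("F6", 2, 52, 0), ("F7", 2, 48, 0), ("F8", 1, 17, 1),
    ("F9", 2, 51, 0), ("F10", 0, 24, 0), ("F11", 4, 34, 1), ("F12", 0, 21, 0),
    ("F13", 2, 51, 0), ("F14", 0, 27, 0), ("F15", 3, 37, 1),
]

# Rows whose stored label disagrees with the rule, and the number of positives
mismatches = [name for name, office, age, label in training
              if is_wrong(office, age) != label]
positives = sum(label for *_, label in training)
print(mismatches, positives, len(training))
```

In this sample only a minority of rows are labelled 1, and the age condition is two-sided (too low or too high), both of which are worth keeping in mind when reading a classifier's output.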
Here is what I did (very simple):
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).show();
Here is dataset2 (top 15):
Name Office Age
F1 9 36 //wrong, office is 9
F2 2 20
F3 1 17
F4 2 43
F5 2 90 // wrong, age is >60
F6 1 36
F7 1 40
F8 2 52
F9 2 49
F10 1 38
F11 0 28
F12 0 18
F13 1 40
F14 1 31
F15 2 45
But like I said, the prediction column displays 0 every time. Any idea why?
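Calling transform() and looking only at the prediction column shows you just the final label, not how confident the model is.
To judge that, look at the class probabilities. In scikit-learn this is the function:
predict_proba(X): Return probability estimates for the test vector X.
NaiveBayes in Spark ML has no predict_proba method; there, the fitted model adds a probability column (next to prediction) to the output of transform().
Something like the following should show the probabilities in your scenario:
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).select("probability", "prediction").show(false);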

Problems Implementing AR, ARMA, and possibly more complex timeseries models in pymc3 using theano.scan

I am trying to implement a simple ARMA model, but I have serious difficulty getting it to run. When I add a parameter to the error term everything works fine (see the return x_m1 + a*e statement, commented out below). However, if I add a parameter to the autoregressive part, I get a FloatingPointError, LinAlgError or PositiveDefiniteError, depending on the initialization method I use.
The code is also put into a gist you can find here. The model definition is replicated here:
with pm.Model() as model:
    a = pm.Normal("a", 0, 1)
    sigma = pm.Exponential('sigma', 0.1, testval=F(.1))
    e = pm.Normal("e", 0, sigma, shape=(N-1,))

    def x(e, x_m1, a):
        # return x_m1 + a*e
        return a*x_m1 + e

    x, updates = theano.scan(
        fn=x,
        sequences=[e],
        outputs_info=[tt.as_tensor_variable(data.iloc[0])],
        non_sequences=[a]
    )
    x = pm.Deterministic('x', x)
    lam = pm.Exponential('lambda', 5.0, testval=F(.1))
    y = pm.StudentT("y", mu=x, lam=lam, nu=1, observed=data.values[1:])

with model:
    trace = pm.sample(2000, init="NUTS", n_init=1000)
Here are the errors for the respective initialization methods:
"ADVI" / "ADVI_MAP": FloatingPointError: NaN occurred in ADVI optimization.
"MAP": LinAlgError: 35-th leading minor not positive definite
"NUTS": PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71]
For details on the error messages, please look at this github issue posted at pymc3.
To be explicit, I would really like a scan-like solution that is easily extendable to, for instance, a full ARMA model. I know that the presented AR(1) model can be represented without scan by defining logP, as already done in pymc3/distributions/timeseries.py#L18-L46, but I was not able to extend this vectorized style to a full ARMA model. I think using theano.scan is preferable.
Any help is highly appreciated!
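For reference, the recursion that the scan step is meant to compute can be written as a plain Python loop. This is only a sketch of the arithmetic, not a theano or pymc3 implementation. The argument order of scan_ar1 mirrors the step function x(e, x_m1, a) above, while scan_arma11 and its parameters e0 and b are hypothetical names for one possible ARMA(1,1) extension.

```python
def scan_ar1(e, x0, a):
    """Plain-Python equivalent of the scan step: x_t = a * x_{t-1} + e_t."""
    xs, x_prev = [], x0
    for e_t in e:
        x_prev = a * x_prev + e_t
        xs.append(x_prev)
    return xs


def scan_arma11(e, x0, e0, a, b):
    """Hypothetical ARMA(1,1) extension: x_t = a * x_{t-1} + e_t + b * e_{t-1}."""
    xs, x_prev, e_prev = [], x0, e0
    for e_t in e:
        x_prev = a * x_prev + e_t + b * e_prev
        e_prev = e_t          # carry the current error forward as the lagged one
        xs.append(x_prev)
    return xs


print(scan_ar1([1.0, 1.0, 1.0], x0=1.0, a=0.5))
print(scan_arma11([1.0, 1.0, 1.0], x0=1.0, e0=0.0, a=0.5, b=0.0))
```

With b set to 0 the ARMA(1,1) loop reduces to the AR(1) loop, which gives a quick consistency check for any scan-based version.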
