Predict next integer in sequence using ML.NET

Given a lengthy sequence of integers in the range 0-1 (each value is either 0 or 1), I would like to be able to predict the next likely integer.
Example dataset:
1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0
A quick look at the above perhaps shows some obvious patterns which may be recognised by an ML model.
I do have other features available in the dataset, but I don't think they correlate with the integer result, so the prediction should be based purely on the statistical relevance of the supplied integer sequence.
I'm unsure how to approach this using ML.NET. I have successfully built classification models previously, but those predictions were all made based on multiple features. In this case, if I just supply a 0 or 1, there's no relevant historical sequence to aid the prediction.
How do I train an ML.NET model to return a prediction based on a range of previous data?
Working theory: the above dataset has 100 integers. I could create a class which has 100 properties (Integer0..Integer99), painstakingly map each field, and submit that, but it seems really clunky.
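For what it's worth, the usual reframing of this kind of problem is a sliding window: each training example consists of the previous N values as features and the value that follows as the label. The snippet below is only a language-agnostic illustration of that windowing step (shown in Python rather than ML.NET; the window size of 5 is an arbitrary choice), not a full solution:
import numpy as np

# First few values from the question's dataset, used purely for illustration.
sequence = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]
window = 5  # arbitrary number of previous values to use as features

# Each row of X holds `window` consecutive values; y holds the value that follows.
X = np.array([sequence[i:i + window] for i in range(len(sequence) - window)])
y = np.array([sequence[i + window] for i in range(len(sequence) - window)])

# Any binary classifier can now be trained on (X, y) and asked to predict the
# value that follows the most recent window. In ML.NET this would correspond to
# a data class with `window` feature properties plus a label, rather than one
# property per position in the whole sequence.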

Related

How can I label connected components in APL?

I'm trying to solve the LeetCode puzzle https://leetcode.com/problems/max-area-of-island/, which requires labelling connected (by sides, not corners) components.
How can I transform something like
0 0 1 0 0
0 0 0 0 0
0 1 1 0 1
0 1 0 0 1
0 1 0 0 1
into
0 0 1 0 0
0 0 0 0 0
0 2 2 0 3
0 2 0 0 3
0 2 0 0 3
I've played with the stencil ⌺ operator and also tried using scan operators, but I'm still not quite there. Can somebody help?
We can start off by enumerating the ones. We do this by applying the function ⍸ (where; but since all values are 1s, it is equivalent to 1,2,3,…) @ at the subset masked by ⊢ the bits themselves, i.e. ⍸@⊢:
⍸@⊢m
0 0 1 0 0
0 0 0 0 0
0 2 3 0 4
0 5 0 0 6
0 7 0 0 8
Now we need to flood-fill the lowest number in each component. We do this with repeated application until the fix-point ⍣≡ of processing Moore neighbourhoods ⌺3 3. To get the von Neumann neighbours, we reshape the 9 elements in the Moore neighbourhood into a 4-row, 2-column matrix with 4 2⍴ and use ⊢/ to select the right column. We remove any 0s with 0~⍨, then prepend , the original value ⍵[2;2] (even if 0), and have ⌊/ select the smallest value:
{⌊/⍵[2;2],0~⍨⊢/4 2⍴⍵}⌺3 3⍣≡⍸@⊢m
0 0 1 0 0
0 0 0 0 0
0 2 2 0 4
0 2 0 0 4
0 2 0 0 4
We map the values to indices by finding their ⊢ indices ⍳⍨ in the unique elements of ∘∪ 0 followed by , the ravelled matrix ,:
(⊢⍳⍨∘∪0,,){⌊/⍵[2;2],0~⍨⊢/4 2⍴⍵}⌺3 3⍣≡⍸@⊢m
1 1 2 1 1
1 1 1 1 1
1 3 3 1 4
1 3 1 1 4
1 3 1 1 4
Finally, we decrement with ¯1+, which adjusts the labels back to begin at zero:
¯1+(⊢⍳⍨∘∪0,,){⌊/⍵[2;2],0~⍨⊢/4 2⍴⍵}⌺3 3⍣≡⍸@⊢m
0 0 1 0 0
0 0 0 0 0
0 2 2 0 3
0 2 0 0 3
0 2 0 0 3
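For readers who don't speak APL, here is a rough Python/NumPy sketch of the same idea: enumerate the ones, repeatedly replace each labelled cell with the minimum nonzero label among itself and its von Neumann neighbours until nothing changes, then compress the surviving labels to consecutive integers (the names and structure here are my own, not from the answer above):
import numpy as np

def label_components(mask):
    # Label 4-connected components of a 0/1 matrix by iterating a
    # minimum-label flood fill to a fixed point.
    m = np.asarray(mask)

    # Enumerate the ones: 1, 2, 3, ... in row-major order.
    labels = np.zeros_like(m, dtype=int)
    labels[m == 1] = np.arange(1, int((m == 1).sum()) + 1)

    while True:
        padded = np.pad(labels, 1)  # zero border so edge cells have "neighbours"
        neighbours = np.stack([
            padded[:-2, 1:-1],   # up
            padded[2:, 1:-1],    # down
            padded[1:-1, :-2],   # left
            padded[1:-1, 2:],    # right
            labels,              # the cell itself
        ])
        # Minimum nonzero label among the cell and its neighbours.
        candidates = np.where(neighbours == 0, np.iinfo(labels.dtype).max, neighbours)
        new = np.where(labels > 0, candidates.min(axis=0), 0)
        if np.array_equal(new, labels):
            break
        labels = new

    # Compress the surviving labels to 0, 1, 2, ... (0 stays 0).
    _, dense = np.unique(np.concatenate(([0], labels.ravel())), return_inverse=True)
    return dense[1:].reshape(labels.shape)

m = np.array([[0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 1, 1, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1]])
print(label_components(m))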

Conditionally Assign Value to Dask Dataframe using Apply

I am trying to iterate through a Dask dataframe and compare the values in one of its columns to a column in another Dask dataframe with the same name. If the columns match, I would like to update the value in the target Dask dataframe. The code below runs, but the values are not updated to '1' where I expected, or anywhere. I am new to Dask and suspect I am missing some crucial step or am not understanding the framework.
def populateSymptomsDDF(row):
    for vac in row['vac_codes']:
        if vac in symptoms_ddf.columns:
            symptoms_ddf[vac] = symptoms_ddf[vac].where(symptoms_ddf['dog'] == row['dog'], 1)

with ProgressBar():
    x = vac_ddf.apply(lambda x: populateSymptomsDDF(x), meta=('int64'), axis=1)
    x.compute(scheduler='processes')
symptoms_ddf.compute()
Head of icd_ddf:
dog vac_codes
0 1 [G35, E11.40, R53.1, Z79.899, I87.2]
1 2 [G35, R53.83, G47.00]
2 3 [G35, G95.9, R53.83, F41.9]
3 4 [G35, N53.9, E55.9, Z74.09]
4 5 [G35, M51.26, R53.1, M47.816, R25.2, G82.50, R...
Head of symptoms_ddf (before running code):
dog W19 W10 W05.0 V00.811 R53.83 R53.8 R53.1 R47.9 R47.89 ... G81.12 G81.11 G81.10 G50.0 G31.84 F52.8 F52.31 F52.22 F52.0 F03
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Thank you for any insights you can provide!
Dask dataframes don't have the same in-place behavior as pandas. Generally every operation should be a bulk parallel operation. Otherwise there isn't much reason to use Dask.
Also, iterating through dataframes will generally be quite slow. This is also true with Pandas.
Fortunately, I think that you're maybe just looking for a join or merge operation. I would encourage you to look up the documentation for Pandas merge
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
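As a rough illustration of that reshaping idea (plain pandas here, with the column names from the question; with Dask you would do the same per partition, or after a .compute() if the data fits in memory), one way to turn the code lists into 0/1 indicator columns is:
import pandas as pd

# Toy stand-in for the question's vac/icd dataframe.
vac_df = pd.DataFrame({
    "dog": [1, 2],
    "vac_codes": [["G35", "R53.1"], ["G35", "R53.83"]],
})

# One row per (dog, code) pair instead of a list column.
long = vac_df.explode("vac_codes").rename(columns={"vac_codes": "code"})
long["flag"] = 1

# Wide 0/1 indicator table: one row per dog, one column per code.
indicators = (
    long.pivot_table(index="dog", columns="code", values="flag", fill_value=0)
        .reset_index()
)
# `indicators` can then be merged with the existing symptoms table on "dog".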

R-Package vegan Decorana

I'm new to R and I was trying to run a Detrended Correspondence Analysis (DCA), which is a multivariate statistical analysis for ordination of species; I have four sites. I keep getting the error message:
Error in rowSums(x) : 'x' must be numeric
Species Haasfontein Mini Pit Vlaklaagte Mini Pit Vlaklaagte Block 3 Mini Pit Block 10 Mini Pit
Agrostis lachnantha 1 0 0 0
Aristida congesta subsp. Congesta 0 0 0 0
Brachiaria nigropedata 0 0 0 0
Cynodon dactylon 0 12 2 3
Cyperus esculentus  0 5 0 0
Digitaria eriantha 0 1 6 20
Elionurus muticus 0 0 0 0
Eragrostis acraea De Winter 0 0 1 0
Eragrostis chloromelas 35 0 12 4
Eragrostis curvula 6 0 0 0
Eragrostis lehmanniana 5 0 0 0
Eragrostis rigidior 3 0 1 0
Eragrostis rotifer 3 0 0 0
Eragrostis trichophora 10 1 2 2
Hyparrhenia hirta 0 0 9 1
Melinis repens 0 0 2 0
Panicum coloratum 0 4 0 0
Panicum deustum  3 0 0 0
Paspalum dilatatum 0 0 0 0
Setaria sphacelata var. sphacelata 0 1 0 0
Sporobolus africanus 0 0 2 0
Sporobolus centrifuges 1 0 1 0
Sporobolus fimbriatus 0 0 0 0
Sporobolus ioclados 2 0 5 1
Themeda triandra 0 0 0 0
Trachypogon spicatus 0 0 0 0
Tragus berteronianus 0 0 0 1
Verbena bonariensis 16 0 2 0
Cirsium vulgare 0 0 0 0
Eucalyptus cameldulensis 1 0 0 0
Xanthium strumarium 0 0 0 0
Argemone ochroleuca 0 0 0 0
Solanum sisymbriifolium 0 0 0 0
Campuloclinium macrocephalum  7 0 0 0
Paspalum dilatatum 0 0 0 0
Senecio ilicifolius 0 0 0 0
Pseudognaphalium luteoalbum (L.) 8 0 0 0
 Cyperus esculentus  0 0 0 0
Foeniculum vulgare  0 0 0 0
Conyza canadensis 0 0 0 1
Tagetes minuta 0 0 0 0
Hypochaeris radicata 0 0 0 0
Solanum incanum 0 0 0 0
Asclepias fruticosa 11 0 0 0
Hypochaeris radicata 0 0 0 0
My data is organised as shown above, and I'm not sure whether my data is organised correctly or there is some other error. Can someone please assist me?
You're still fighting to get your data into R. That is your first problem. After you tackle this problem and manage to read in your data, you have the following problems:
You should not have empty (all-zero) rows in your data, as they will give an error (empty columns are removed and only give a warning).
DCA treats rows and columns non-symmetrically, and you should have species as columns and sampling units as rows. You should transpose your data (function t()).
You really should not use DCA with only four sampling units. It will be meaningless.
I think the last point is most important.
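Purely as an illustration of the first two data fixes (sketched here in pandas rather than R, with a hypothetical file name; the DCA itself would still be run with vegan in R):
import pandas as pd

# Hypothetical file layout matching the table in the question:
# species as rows, the four sites as columns.
df = pd.read_csv("species.csv", index_col="Species")

df = df.T                          # transpose: sites become rows, species become columns
df = df.loc[df.sum(axis=1) > 0,    # drop all-zero rows (sampling units), which cause the error
            df.sum(axis=0) > 0]    # drop all-zero columns (species), which only warn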

Feature Reduction

How do I reduce feature dimension? My feature vector looks like:
1(Class Number) 10_10_1(File name) 0 0 0 0 0 0 0 0 0.564971751 23.16384181 25.98870056 19.20903955 16.10169492 13.27683616 1.694915254 0 0 0 0 0 0 0 3.95480226 11.5819209 20.33898305 60.4519774 3.672316384 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.107344633 62.99435028 33.89830508 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.412429379 66.66666667 31.92090395 0 0 0 0 0 0 0 0 0 0 0 0 0 0.564971751 22.59887006 26.83615819 46.89265537 3.107344633 0 0 0 0 0 0 0 0 0 0 0 0 0 0.564971751 16.38418079 28.53107345 50.84745763 3.672316384 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90.6779661 9.322033898 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.847457627 90.11299435 9.039548023 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17.79661017 81.3559322 0.847457627 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27.11864407 72.88135593 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.564971751 37.85310734 61.29943503 0.282485876 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.412429379 50.84745763 47.45762712 0.282485876 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24.57627119 75.42372881 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17.23163842 82.20338983 0.564971751 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29.37853107 70.62146893 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 55.64971751 44.35028249 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 64.40677966 35.59322034 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 67.79661017 32.20338983 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66.66666667 33.33333333 0 0 0 0 0 0 0 0 0 0 0 0 1 3 2 6 7 5 4 8 9 10 11 12 13 14 15 16 17 18 14.81834463 3.818489078 3.292123621 2.219541777 2.740791003 1.160544518 2.820053602 1.006906813 0.090413195 2.246638594 0.269778302 2.183126126 2.239168249 0.781498607 2.229795302 0.743329919 1.293839141 0.783068011 1.104421291 0.770312707 0.697659061 1.082266169 0.408339745 1.073922207 0.999148017 0.602195061 1.247286588 0.712143548 0.867327913 0.603063537 0.474115683 0.596387106 0.370847522 0.54900076 0.35930586 0.580272233 0.397060362 0.535337691
After the file name, the feature values are given.
If your setting is unsupervised (no class labels), you can use PCA:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
If it is supervised (classification), you can use LDA:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
For reducing the number of features handed to models, feature reduction or feature deletion/selection can be used, among other options. Some frequently used feature reduction approaches have already been pointed out here, like principal component analysis (PCA), linear discriminant analysis (LDA), or partial least squares regression (PLSR). Those essentially project the original data into a subspace, which aims at representing the information in fewer features. In particular, PCA tries to maximize the preserved variance in the original data (unsupervised), while LDA tries to minimize intra-class variance and maximize inter-class distance (supervised, for classification), and PLSR tries to maximize the preserved variance in the original data while maximizing the correlation with the target variable (supervised, for regression).
Additionally, classic feature selection can be employed for reducing the number of features. Those approaches don't project data into a subspace but select "useful" features straight from the existing set of features. Usually they are divided into feature filters and feature wrappers, where filters decide which features to use by looking only at the features and the target variable (e.g. trying to minimize inter-feature correlation while maximizing the feature-target correlation). In contrast, feature wrappers additionally consider the model that uses the selected features, so they directly optimize model performance instead. Usually, filters are computationally cheaper than wrappers, but, similar to using PCA, feature filters don't necessarily improve subsequent model performance, as they don't know what to optimize for.
Edit: as you are working with image data, feature filters and wrappers might not be optimal if used alone - they likely require image preprocessing and/or downsizing before being employed.
If you are using R, I'd recommend the caret package, which provides all of the above already embedded into the model training and evaluation process, which is quite important (cf. here for some details on their filters/wrappers). Here's a small snippet for usage of the approaches above:
library(caret)
# PCA with preserving 95% variance in original data
modelPca <- train(x = iris[,1:4], iris[,5], preProcess=c('center', 'scale', 'pca'), trControl=trainControl(preProcOptions=list(thresh=0.95)), method='svmLinear', tuneGrid=expand.grid(C=3**(-3:3)))
# LDA with selection of dimensions
modelLda2 <- train(x = iris[,1:4], y = iris[,5], method='lda2', tuneGrid=expand.grid(dimen=1:4))
# PLSR with selection of dimensions
modelPls <- train(x = iris[,1:3], y = iris[,4], method='pls', tuneGrid=expand.grid(ncomp=1:3))
# feature wrapper: (backwards) recursive feature elimination (there exist more...)
modelRfe <- rfe(x = iris[,1:4], y = iris[,5], sizes = 1:4, rfeControl = rfeControl(functions = rfFuncs))
# feature filter: univariate filtering
modelSbf <- sbf(x = iris[,1:4], y = iris[,5], sbfControl = sbfControl(functions = rfSBF))
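If you are working in Python instead, a rough scikit-learn sketch of the same two ideas (a univariate filter and a wrapper via recursive feature elimination; the estimator and k=2 are arbitrary choices of mine, not part of the answer above) might look like:
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Feature filter: keep the 2 features with the highest ANOVA F-score.
filt = SelectKBest(score_func=f_classif, k=2)
X_filtered = filt.fit_transform(X, y)

# Feature wrapper: recursive feature elimination around a concrete model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_wrapped = rfe.fit_transform(X, y)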

Why is my convolution result shifted when using FFT

I'm implementing convolutions using the Radix-2 Cooley-Tukey FFT and its inverse, and my output is correct but shifted upon completion.
My solution is to zero-pad both the input and the kernel to a size of 2^m for the smallest possible m, transform both input and kernel using the FFT, multiply the two element-wise, and transform the result back using the inverse FFT.
As an example on the resulting problem:
0 1 2 3 0 0 0 0
4 5 6 7 0 0 0 0
8 9 10 11 0 0 0 0
12 13 14 15 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
with identity kernel
0 0 0 0
0 1 0 0
0 0 0 0
0 0 0 0
becomes
0 0 0 0 0 0 0 0
0 0 1 2 3 0 0 0
0 4 5 6 7 0 0 0
0 8 9 10 11 0 0 0
0 12 13 14 15 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
It seems any sizes of inputs and kernels produce the same shift (1 row and 1 column), but I could be wrong. I've performed the same computations using the online calculator at this link and get the same results, so it's probably me missing some fundamental knowledge. My available literature has not helped. So my question is: why does this happen?
So I ended up finding the answer to why this happens myself. The answer is given through the definition of the convolution and the indexing that happens there. By definition, the convolution of s and k is given by
(s*k)(x) = sum(s(n) k(x-n), n = -inf..inf)
The center of the kernel is not "known" by this formula; it is an abstraction we make. Define c as the index of the kernel's center. When x-n = c in the sum, s(n) is s(x-c), so the term containing the interesting product s(x-c)k(c) ends up at index x. In other words, the output is shifted to the right by c.
FFT fast convolution does a circular convolution. If you zero pad so that both the data and kernel are circularly centered around (0,0) in the same size NxN arrays, the result will also stay centered. Otherwise any offsets will add.
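A small NumPy sketch of the effect described in both answers, using the example arrays above (np.roll is just one way to move the kernel's centre to (0, 0); ifftshift or different padding would also work):
import numpy as np

# Zero-padded 8x8 data and kernel, as in the example above.
data = np.zeros((8, 8))
data[:4, :4] = np.arange(16).reshape(4, 4)

kernel = np.zeros((8, 8))
kernel[1, 1] = 1.0   # "identity" kernel, but its 1 sits at (1, 1), not (0, 0)

# Circular convolution via FFT: the output comes back shifted by (1, 1).
shifted = np.real(np.fft.ifft2(np.fft.fft2(data) * np.fft.fft2(kernel)))

# Rolling the kernel so its centre lands on (0, 0) removes the shift.
kernel_centred = np.roll(kernel, (-1, -1), axis=(0, 1))
unshifted = np.real(np.fft.ifft2(np.fft.fft2(data) * np.fft.fft2(kernel_centred)))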
