Is it possible to pass multiple columns at once to Croston method? - time-series

I want to implement the Croston method for intermittent demand. I have a data frame with 10 features, all of which contain many zeros. I want to pass the entire data frame to the Croston model, but the model accepts only a one-dimensional array. I'm not interested in looping through the columns one at a time. Is there any way to do this, or are there other methods that can forecast intermittent demand?
Thanks in advance!

Any other methods that can give forecasts for intermittent demand?
For forecasting intermittent demand, Croston is one of the most widely used methods in the industry, but it has a few drawbacks, and those are addressed by variants of the Croston model such as SBA and TSB. For me, Croston TSB performed much better, because it can decay towards zero when there is no demand for a long time.
Next, passing an entire data frame to the Croston model in one go without looping:
NumPy vectorization is the better approach here; it is far faster than looping over the columns one by one.
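To illustrate, here is a minimal sketch of Croston TSB where the smoothing recursion runs over the time axis while every update is a NumPy operation over all columns at once. The function name, the alpha/beta defaults, and the initialisation are my own choices, not an existing library API:

import numpy as np
import pandas as pd

def croston_tsb(df, alpha=0.4, beta=0.4):
    y = df.to_numpy(dtype=float)           # shape (T, n_series)
    T, n = y.shape
    level = y[0].copy()                    # smoothed demand size per series
    prob = (y[0] > 0).astype(float)        # smoothed demand probability
    fcst = np.empty((T, n))
    for t in range(T):
        fcst[t] = prob * level             # TSB forecast for period t
        d = y[t] > 0
        # Update the level only where demand occurred (vectorized mask).
        level = np.where(d, level + alpha * (y[t] - level), level)
        # Update the probability every period; during long zero-demand
        # runs it decays towards zero, which is TSB's advantage.
        prob = prob + beta * (d.astype(float) - prob)
    next_fcst = prob * level               # flat forecast beyond the sample
    return pd.DataFrame(fcst, index=df.index, columns=df.columns), next_fcst

The only explicit loop is over time; the cost is one pass through the series no matter how many columns the data frame has.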

Related

Best way to store processed text data for streaming to gensim?

I've got several hundred pandas data frames, each of which has a column of very long strings that need to be processed/sentencized and finally tokenized before modeling with word2vec.
I can store them in any format on the disk, before I build a stream to pass them to gensim's word2vec function.
What format would be best, and why? The most important criterion would be performance vis-a-vis training (which will take many days), but coherent structure to the filesystem would also be nice.
Would it be crazy to store several million or maybe even a few billion text files containing one sentence each? Or perhaps some sort of database? If this were numerical data I'd use HDF5. But it's text. The cleanest would be to store them in the original data frames, but that seems less ideal from an I/O perspective, because I'd have to load each (largish) data frame every epoch.
What makes the most sense here?
As you do your preprocessing/tokenization of all the source data that you want to be part of a single training session, append the results to a single plain-text file.
Use space-separated words, and end each 'sentence' (or any other useful text-chunk that's less than 10,000 words long) with a newline.
Then you can use the corpus_file option for specifying your pre-tokenized training data, and will get the maximum possible multithreading benefit. (That mode will direct each thread to open its own view into a range of the single file, so there's no blocking on any distributor thread.)
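A rough sketch of that pipeline (the iterable of data frames, the "text" column name, and the split-on-period sentencizer are placeholders for your own code; corpus_file and simple_preprocess are real gensim features):

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Append every tokenized sentence to one plain-text file, one per line.
with open("corpus.txt", "w", encoding="utf-8") as out:
    for df in dataframes:                  # your iterable of data frames
        for doc in df["text"]:             # your long-string column
            for sent in doc.split("."):    # stand-in for a real sentencizer
                tokens = simple_preprocess(sent)
                if tokens:
                    out.write(" ".join(tokens) + "\n")

# corpus_file mode: each worker thread reads its own byte range of the
# file directly, so there is no single distributor bottleneck.
model = Word2Vec(corpus_file="corpus.txt", vector_size=100,
                 min_count=5, workers=8, epochs=5)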

Change in two 3D models

I'm trying to think of the best way to conduct some sort of analysis between two 3D models of the same object.
The first scan is of the original item and the second scan is after it has been put under some load x.
An example would be trying to find the difference between two types of metal.
I would like to be able to scan the initial metal cylinder, apply a measured load, scan it again, and then finally apply some sort of algorithm to compare the difference.
Is it possible to do this efficiently (maybe using Matlab) over, say, 50-100 items, for an object of around 5 in^3?
I am assuming I will need to work out some sort of utility function, as the total mass should be the same?
Would machine learning be beneficial in this case?
Any suggestions or direction would be amazing.
Thank you :)
EDIT: The scan files are coming through as '.stl'
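As a note on what the comparison step can look like in code, purely as a sketch: this assumes the trimesh and scipy packages, two .stl scans that start in roughly the same pose (ICP only refines an approximate alignment), and hypothetical file names:

import trimesh
from scipy.spatial import cKDTree

before = trimesh.load("before.stl")     # original scan (hypothetical names)
after = trimesh.load("after.stl")       # scan after applying the load

# Rigidly align the loaded scan onto the original with ICP.
matrix, aligned, cost = trimesh.registration.icp(after.vertices,
                                                 before.vertices)

# Deviation field: distance from each aligned vertex to the nearest
# vertex of the original scan.
dist, _ = cKDTree(before.vertices).query(aligned)
print("mean deviation:", dist.mean(), "max:", dist.max())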

Does it help to duplicate original data in order to make more data for building model?

I just got an interview question.
"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss told you can duplicate original data several times, to make more data for building the model" Does it help?
Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.
But is there anyone can explain it more statistically? Thanks
Consider, e.g., the variance. The data set with the duplicated data will have the exact same variance; you don't have a more precise estimate of the distribution afterwards.
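A short demonstration of that point (np.tile stands in for "duplicating several times"):

import numpy as np

x = np.array([1.0, 4.0, 2.0, 7.0, 3.0])
dup = np.tile(x, 10)                     # "ten times the data"

print(np.var(x), np.var(dup))            # identical variance
# The naive standard error of the mean looks ~sqrt(10) smaller for the
# duplicated set, but that extra precision is an illusion:
print(x.std(ddof=1) / np.sqrt(x.size))
print(dup.std(ddof=1) / np.sqrt(dup.size))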
There are, however, some exceptions: bootstrap validation, for example, helps when you are evaluating a model with very little data.
Well, it depends on exactly what one means by "duplicating the data".
If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result, since the log-likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log-likelihood and therefore has the same maxima. (This argument doesn't apply to methods that aren't based on the likelihood function; I believe CART and other tree models, and SVMs, are such models. In that case you'll have to work out a different argument.)
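To spell the likelihood argument out: if the data set $x_1,\dots,x_n$ is duplicated $k$ times, the log-likelihood becomes
$$\ell_{\text{dup}}(\theta)=\sum_{j=1}^{k}\sum_{i=1}^{n}\log f(x_i;\theta)=k\,\ell(\theta),$$
so every maximizer of $\ell$ is a maximizer of $\ell_{\text{dup}}$ and the point estimate is unchanged. Note, though, that the curvature (observed information) is $k$ times larger, so naively computed standard errors shrink by $\sqrt{k}$ without any real gain in information.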
However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
Also if one means bootstrapping, then that, too, makes a difference.
PS: You'll probably get more interest in this question on stats.stackexchange.com.

Hyperopt Exploration/Exploitation strategy

What settings does Hyperopt provide to adjust the balance between exploration and exploitation? There's something like "bandit" and "bandit_algo" in the code, but no explanation.
Could someone provide a code sample?
Thanks a lot for any help!
I just found hyperopt's partial(), a magical wrapper function for the optimizer algos. It allows you to balance between different strategies, and hence between exploration and exploitation:
partial returns the result of a randomly chosen suggest function. For example, to search by sometimes using random search, sometimes anneal, and sometimes TPE, type:
fmin(...,
     algo=partial(mix.suggest,
                  p_suggest=[
                      (.1, rand.suggest),
                      (.2, anneal.suggest),
                      (.7, tpe.suggest),
                  ]),
)
Parameter "p_suggest": list of (probability, suggest) pairs. Make a suggestion from one of the suggest functions, in proportion to its corresponding probability. sum(probabilities) must be [close to] 1.0.
If you want even sharper control of the algorithm's progression, you can exploit the fact that hyperopt's optimizer algos are stateless: each fmin call takes a trials object that can be passed back into a new fmin to continue the process. So you can call fmin in a loop, adding one evaluation at a time, and modify the trials and the suggest algo between iterations.
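A sketch of that loop; the objective and search space are toy placeholders, while fmin, Trials, and the suggest functions are the real hyperopt API:

from hyperopt import fmin, tpe, anneal, hp, Trials

def objective(x):
    return (x - 2.0) ** 2

space = hp.uniform("x", -10, 10)
trials = Trials()

for i in range(50):
    # Choose whichever suggest algo you want for this one evaluation.
    algo = anneal.suggest if i % 5 == 0 else tpe.suggest
    best = fmin(objective, space, algo=algo, trials=trials,
                max_evals=len(trials.trials) + 1)

print(best)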
For the best bet, read the papers by Bergstra et al. 1, 2 and 3. I am not 100% clear on what the bandit_algo is, except that one of the papers mentions it as an alternative to Gaussian Processes and the Tree of Parzen Estimators; maybe you can use it in the same way as those two?
My guess is that if it is not documented, it may not be finished yet. You can try raising an issue on GitHub; the devs are fairly responsive from what I have seen.
EDIT: Looking at this paper, these bandit algorithms may be the base class that the others inherit from.

How to bulk-load an r-tree in C#?

I am looking for C# code to construct an r-tree. I have code that builds an r-tree incrementally, i.e. items are added one by one to the tree, but I guess a better r-tree could be built if all items were given to the tree-creation algorithm at once. Please let me know if anyone knows how to bulk-load an r-tree in this manner. I tried searching but couldn't find anything very useful.
The most common method for low-dimensional point data is sort-tile-recursive (STR). It does exactly that: sort the data, tile it into the optimal number of slices, then recurse if necessary.
The leaf level of an STR-loaded tree with point data will have no overlap, so it is really good. Higher levels may have overlap, as STR does not take the extent of objects into account.
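For reference, a short sketch of STR leaf packing for 2-D points; it's in Python for brevity, but the logic transliterates directly to C#. The page capacity CAP is an arbitrary choice here:

import numpy as np

CAP = 16  # leaf page capacity (assumed)

def str_leaves(points):
    pts = points[np.argsort(points[:, 0])]      # 1. sort all points by x
    pages = int(np.ceil(len(pts) / CAP))
    n_slices = int(np.ceil(np.sqrt(pages)))     # 2. tile into vertical slices
    per_slice = n_slices * CAP
    leaves = []
    for i in range(0, len(pts), per_slice):
        sl = pts[i:i + per_slice]
        sl = sl[np.argsort(sl[:, 1])]           # sort each slice by y
        for j in range(0, len(sl), CAP):
            leaves.append(sl[j:j + CAP])        # 3. cut into leaf pages
    return leaves

# 4. Recurse: take each leaf's bounding-box center and STR-pack those
# centers the same way until a single root node remains.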
A provably good bulk-loading is a key component of the Priority R-Tree, too.
And even when not bulk-loading, the insertion strategy makes a big difference. R-trees built with linear splits such as Guttman's or Ang-Tan's will usually be worse than those built with the R*-tree split heuristics. In particular, Ang-Tan tends to produce "sliced" pages that are very unbalanced in their spatial extent. It is a fast split strategy, and probably the simplest, but the results aren't good.
A paper by Achakeev et al., "Sort-based Parallel Loading of R-trees", might be of some help. You could also find other methods in its references.
