What is meant by time-based splitting in cross-validation techniques? - machine-learning

I have a timestamp for every record in the data set.
I have heard about time-based splitting but don't know anything about it.

Normal cross-validation
You have a set of data points:
data_points = [2, 4, 5, 8, 6, 9]
Then, if you do a 2-fold split, your data points will get randomly assigned to 2 different groups.
For example:
split_1 = [2, 5, 9]
split_2 = [4, 8, 6]
However, this assumes that there is no need to keep the order of your data points.
You can train your model with split_1 and test it with split_2.
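As a minimal sketch of such a random 2-fold split in plain Python (the helper name `random_two_fold` and the fixed seed are assumptions made for illustration):

```python
import random

data_points = [2, 4, 5, 8, 6, 9]


def random_two_fold(points, seed=0):
    """Shuffle a copy of the points and cut it into two halves."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


split_1, split_2 = random_two_fold(data_points)
print(split_1, split_2)
```

Which values land in which split depends on the seed; the only guarantee is that every data point ends up in exactly one of the two splits.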
Time-based splitting
However, this assumption isn't always correct for time series prediction.
For example, given the same data points:
data_points = [2, 4, 5, 8, 6, 9]
They may be ordered in time.
You could then have a model that looks back 3 time steps to predict the next number (e.g. to predict the number after 9, it takes [8, 6, 9] as input). This means the order in which the data points appear is important, so you cannot split them randomly to test your model. The order in which they appear needs to be kept.
So if you do a 2-fold split, you could get the following splits:
split_1 = [2, 4, 5, 8]
split_2 = [5, 8, 6, 9]
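To make the look-back idea concrete, here is a small sketch that pairs each run of 3 consecutive values with the value that follows it. The helper name `make_windows` is hypothetical; it only illustrates why the order of the points has to be preserved.

```python
def make_windows(points, lookback=3):
    """Pair each run of `lookback` consecutive values with the value that follows it."""
    return [
        (points[i:i + lookback], points[i + lookback])
        for i in range(len(points) - lookback)
    ]


data_points = [2, 4, 5, 8, 6, 9]
windows = make_windows(data_points)
for inputs, target in windows:
    print(inputs, "->", target)
# [2, 4, 5] -> 8
# [4, 5, 8] -> 6
# [5, 8, 6] -> 9
```

Shuffling `data_points` before building the windows would produce input sequences that never occurred, which is why a random split does not fit this kind of model.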
Implementation
scikit-learn provides an implementation of time-based cross-validation: TimeSeriesSplit.
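A short usage sketch, assuming scikit-learn and NumPy are installed. Note that TimeSeriesSplit uses an expanding training window, so its splits differ from the overlapping ones shown above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data_points = np.array([2, 4, 5, 8, 6, 9])

# Each split yields index arrays; the test indices always come after the training ones.
splits = list(TimeSeriesSplit(n_splits=2).split(data_points))
for train_idx, test_idx in splits:
    print("train:", data_points[train_idx], "test:", data_points[test_idx])
```

With the default settings and 6 samples, the first split trains on the first two points and tests on the next two; the second split trains on the first four and tests on the last two.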

Related

Segment Tree - Finding all subarray sums

Suppose we have an array like [1, 2, 3, 4]. If I create a segment tree for that array, we get something like: [null, 10, 3, 7, 1, 2, 3, 4], so the subarray sums would be exactly what we have in the segment tree.
However, if our input array is like [1, 2, 3], our segment tree would be something like: [null, 6, 3, 3, 1, 2, 3, 0], with the trailing 0 since we don't have a complete binary tree due to 3 (the array's length) not being a power of 2.
Unlike in the first example, since our binary tree isn't complete, we run into duplicate ranges. In our tree: [null, 6, 3, 3, 1, 2, 3, 0], the 2nd last 3 and the last 3 represent the same range, since the right tree has a 0.
Is there any way to distinguish between these duplicate ranges? Or should I be using another data structure for a problem that's susceptible to the kind of duplicate-range issue I'm having with my second example?
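One possible approach, sketched here under the assumption that the tree is stored 1-indexed and padded with zeros as in the examples above, is to record the index range each node covers, clip it to the real array length, and keep only one node per range. The function names are hypothetical.

```python
def build_sum_tree(values):
    """Build a 1-indexed sum tree padded to a power of two, with each node's range."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    sums = [None] + [0] * (2 * size - 1)
    ranges = [None] * (2 * size)
    for i in range(size):
        sums[size + i] = values[i] if i < n else 0  # pad with zeros
        ranges[size + i] = (i, i)
    for node in range(size - 1, 0, -1):
        left, right = 2 * node, 2 * node + 1
        sums[node] = sums[left] + sums[right]
        ranges[node] = (ranges[left][0], ranges[right][1])
    return sums, ranges


def distinct_ranges(values):
    """Map each real (lo, hi) range to its sum, dropping padding and duplicates."""
    sums, ranges = build_sum_tree(values)
    last = len(values) - 1
    result = {}
    for node in range(1, len(sums)):
        lo, hi = ranges[node]
        hi = min(hi, last)  # clip the range to the real array
        if lo <= hi:
            result.setdefault((lo, hi), sums[node])
    return result


sums, _ = build_sum_tree([1, 2, 3])
ranges_to_sums = distinct_ranges([1, 2, 3])
```

For [1, 2, 3], the node covering indices 2..3 is clipped to 2..2, which makes it the same range as the leaf for index 2, so only one of them is kept.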

Influxdb: Query for distinct values

First: I am aware of the distinct() function, but that's not what I want.
My problem: Imagine a series of sensor readings that barely change, for example:
[2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 3, 3, 5, 5, 2]
In my application this series is very long (thousands of entries) and I would like to visualize it in a diagram (on Android, but that doesn't matter).
What I'd like to achieve:
I would like to get only the values at which the series changes, e.g.:
[2, 3, 4, 3, 5, 2]
of course with their respective timestamps and tags.
With the distinct() function the result would look like this:
[2, 3, 4, 5]
Thanks!
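The desired output can be sketched as a client-side filter in plain Python. This is not an InfluxDB query; it assumes the points have already been fetched as (timestamp, value) pairs, and the timestamps below are made-up positions in the series.

```python
def change_points(readings):
    """Keep only the (timestamp, value) pairs whose value differs from the previous one."""
    kept = []
    previous = object()  # sentinel that never equals a real reading
    for timestamp, value in readings:
        if value != previous:
            kept.append((timestamp, value))
            previous = value
    return kept


values = [2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 3, 3, 5, 5, 2]
readings = list(enumerate(values))  # hypothetical timestamps 0, 1, 2, ...
changes = change_points(readings)
print([value for _, value in changes])  # [2, 3, 4, 3, 5, 2]
```

Each kept pair carries the timestamp of the first reading at the new value, so tags could be carried along in the same way.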

how to algorithm this?

I have a number of fruit baskets, all of them have a random amount of apples and they have different properties.
arrayOfBaskets = [
["basketId": 1, "typeOfPesticidesUsed": 1, "fromCountry":1, "numberOfApples": 5],
["basketId": 2, "typeOfPesticidesUsed": 1, "fromCountry":1, "numberOfApples": 6],
["basketId": 3, "typeOfPesticidesUsed": 2, "fromCountry":1, "numberOfApples": 3],
["basketId": 4, "typeOfPesticidesUsed": 2, "fromCountry":1, "numberOfApples": 7],
["basketId": 5, "typeOfPesticidesUsed": 1, "fromCountry":2, "numberOfApples": 8],
["basketId": 6, "typeOfPesticidesUsed": 1, "fromCountry":2, "numberOfApples": 4],
["basketId": 7, "typeOfPesticidesUsed": 2, "fromCountry":2, "numberOfApples": 9],
["basketId": 8, "typeOfPesticidesUsed": 2, "fromCountry":2, "numberOfApples": 5]
]
In this case, how do I formulate an algorithm that outputs an array like this:
uniquePairingOfBasketProperties = [
["typeOfPesticidesUsed": 1, "fromCountry":1],
["typeOfPesticidesUsed": 2, "fromCountry":1],
["typeOfPesticidesUsed": 1, "fromCountry":2],
["typeOfPesticidesUsed": 2, "fromCountry":2]
]
My main goal is to let my UITableView know how many rows it should have, which in this case is 4 rather than the total number of baskets.
Huh? You have an array of dictionaries. You want to divide those dictionaries into "buckets" where each bucket has a unique combination of pesticide type and country of origin?
Assuming that's the case, how about this:
let kNumberOfCountries = 2
// Assumes basket is a [String: Int] dictionary that contains both keys.
let uniqueValue = (basket["typeOfPesticidesUsed"]! - 1) * kNumberOfCountries +
    basket["fromCountry"]!
uniqueValue jumps by kNumberOfCountries when the type of pesticide changes, and by 1 when the country of origin changes. Subtracting 1 from the pesticide type makes the numbering start at 1. Think of a rectangular grid where the country number starts at 1 on the left and increases to the right, and the pesticide number starts at 1 at the top and increases as you go down. The unique value is 1 in the top-left square, counts up to the right, then wraps around to the next row and keeps counting up by 1.
You can then group your table view based on uniqueValue.
If you want to know how many unique pairings you have, create an empty set of integers. Loop through your array of baskets, calculate the uniqueValue for each basket, and add it to the set (a set keeps only one entry per value). Once you are done looping, the number of entries in the set is the number of unique pairings you have. If you use an NSCountedSet, you can even get the number of baskets with each pairing. (I don't know if Swift has a native counted set collection. It didn't last time I checked.)
EDIT:
It looks like Swift does NOT have a native counted set collection (at least not yet). There is, however, at least one open source Swift counted set (aka a bag) on GitHub.
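The counting idea can be sketched in Python, where `collections.Counter` plays the role of the counted set. As a simplification, this sketch uses the (pesticide, country) pair itself as the key rather than the computed uniqueValue; the two serve the same purpose.

```python
from collections import Counter

baskets = [
    {"basketId": 1, "typeOfPesticidesUsed": 1, "fromCountry": 1, "numberOfApples": 5},
    {"basketId": 2, "typeOfPesticidesUsed": 1, "fromCountry": 1, "numberOfApples": 6},
    {"basketId": 3, "typeOfPesticidesUsed": 2, "fromCountry": 1, "numberOfApples": 3},
    {"basketId": 4, "typeOfPesticidesUsed": 2, "fromCountry": 1, "numberOfApples": 7},
    {"basketId": 5, "typeOfPesticidesUsed": 1, "fromCountry": 2, "numberOfApples": 8},
    {"basketId": 6, "typeOfPesticidesUsed": 1, "fromCountry": 2, "numberOfApples": 4},
    {"basketId": 7, "typeOfPesticidesUsed": 2, "fromCountry": 2, "numberOfApples": 9},
    {"basketId": 8, "typeOfPesticidesUsed": 2, "fromCountry": 2, "numberOfApples": 5},
]

# Count how many baskets share each (pesticide, country) pairing.
pair_counts = Counter(
    (basket["typeOfPesticidesUsed"], basket["fromCountry"]) for basket in baskets
)
number_of_rows = len(pair_counts)
print(number_of_rows)  # 4
```

The number of distinct keys gives the row count for the table, and each key's count gives the number of baskets in that group.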

Generate random integer from discrete selection?

I am trying to generate random numbers but only certain numbers. I know to generate a random number between 0 and 10 you'd use:
arc4random_uniform(11)
But what if I wanted to generate a random number between a selection of, say 3, 5, 8, and 10?
Vacawama is right and should be given credit.
A little more thought:
Choose the numbers you want and put them into an array, then use a random index into the array to get the number:
[3, 5, 8, 10]
Array indices start at zero, so: [0: 3, 1: 5, 2: 8, 3: 10].
Passing 4 to arc4random_uniform lets you choose an index between 0 and 3.
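The same index-based idea, sketched in Python for illustration, with `random.randrange` standing in for `arc4random_uniform`:

```python
import random

choices = [3, 5, 8, 10]

# Pick a random index from 0 to len(choices) - 1, then look up the value.
index = random.randrange(len(choices))
number = choices[index]
print(number)
```

Using the array's length instead of a hard-coded 4 keeps the code correct if the selection changes.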

How to create subscale scores for 4 subscales of the REI using SPSS?

I need to create subscale scores for 4 subscales of the REI: REI_Appear; REI_Hlth; REI_Mood; REI_Enjoy. The items comprising each subscale are as follows:
Appearance (9 items): 1, 5, 9, 13, 16, 17, 19, 21, 24
Health (8 items): 3, 6, 8, 15, 18, 20, 22, 23
Mood (4 items): 2, 7, 12, 14
Enjoyment (3 items): 4, 10, 25
For example, I have placed REI_Appear in the target variable, but I'm unsure what to place in the numeric expression section for it to work.
There are several important issues.
Do you want means or sums or some other composite?
Do any items need reversing?
How do you want to handle missing data?
Assuming you want means, there are no items needing reversing, and you want a participant to have at least 3 items to get a score, then you could use:
compute REI_appear = mean.3(item1, item5, ..., item24).
EXECUTE.
where you replace item1 etc. with the relevant variable names.
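To illustrate what mean.3 computes, here is a small Python sketch of the rule "mean of the valid items, but only if at least 3 are present". Representing a missing response as None is an assumption made for this example, as is the helper name `mean_n`.

```python
def mean_n(values, min_valid=3):
    """Mean of the non-missing values, or None when fewer than `min_valid` are present."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


enough_items = mean_n([4, None, 5, 3])      # three valid items -> 4.0
too_few_items = mean_n([4, None, None, 3])  # only two valid items -> None
print(enough_items, too_few_items)
```

A participant with fewer than 3 answered items gets no score rather than a mean based on too little data.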
I have an existing post dedicated to the topic of computing scale scores for psychological tests which discusses some of these issues further.

Resources