I have a dataset that looks like this:
     A  B
5/8  2  3
6/8  4  2
7/8  3  5
8/8  3  2
and I want to end up with this:
index1  index2  A  B
5/8     5/8     2  3
        6/8     4  2
6/8     6/8     4  2
        7/8     3  5
7/8     7/8     3  5
        8/8     3  2
etc.
and also an equivalent that would take numeric indexes. This way I can decide whether to flatten the data or create a 3-D array for the ML training.
I have done it with df.iterrows(), but it is very slow. I also tried this code:
def addDatas(x, df, window):
    global dfOo  # dataset to create
    if len(x) == window:
        y = df.loc[x.index]
        y.DateStarted = df.loc[x.index[-1]].created  # index1 in the table above
        dfOo = dfOo.append(y)
    return 0

dfOo = pd.DataFrame()
# 'created' is the date index in the first table
dfTargets.rolling("5s", on="created").apply(lambda x: addDatas(x, dfTargets, 5))
Both of these solutions work, but they aren't fast enough and are not usable with big chunks of data. I can't help but think that there must be an easier way to do this that I don't know.
The following will work on any sortable index. It does create a copy of the dataframe in memory, so that is a drawback of this approach if you are memory-constrained.
import pandas as pd
# Minimal example
df = pd.DataFrame(data={'index':['5/8','6/8','7/8','8/8'],'A':[2,4,3,3],'B':[3,2,5,2]})
# Create a shifted version of the index 'index' column
df['index_2'] = df['index'].shift()
# Copy to df2, renaming columns and dropping null value (first shifted row)
df2 = df.copy().rename({'index':'index_2','index_2':'index'},axis=1).dropna()
# In original df overwrite index_2 to be equal to index column
df['index_2'] = df['index']
# Concatenate, set index, and sort by index
pd.concat([df,df2]).set_index(['index','index_2']).sort_index()
Output:
               A  B
index index_2
5/8   5/8      2  3
      6/8      4  2
6/8   6/8      4  2
      7/8      3  5
7/8   7/8      3  5
      8/8      3  2
8/8   8/8      3  2
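The question also asked for an equivalent with numeric indexes; the same approach works unchanged if the index column holds integers instead of date strings. A minimal sketch (column names mirror the code above):

```python
import pandas as pd

# Same data, but with a numeric positional index instead of '5/8', '6/8', ...
df = pd.DataFrame({'A': [2, 4, 3, 3], 'B': [3, 2, 5, 2]})
df['index'] = range(len(df))          # 0, 1, 2, 3
df['index_2'] = df['index'].shift()   # NaN, 0, 1, 2

# Copy, swap the two index columns, and drop the NaN row from the shift
df2 = df.copy().rename({'index': 'index_2', 'index_2': 'index'}, axis=1).dropna()
df['index_2'] = df['index']

out = pd.concat([df, df2]).set_index(['index', 'index_2']).sort_index()
print(out)
```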
I'd like to present a solution using np.repeat.
We'll first load the data:
df = pd.DataFrame({'A':[2,4,3,3], 'B':[3,2,5,2]}, index=['5/8', '6/8', '7/8', '8/8'])
We first produce a list, call it xi, whose values are all 2 except for the first and last elements, which are 1.
xi=[2]*len(df)
xi[0]=1
xi[-1]=1
This list will be used in np.repeat to repeat the desired elements.
Basically, the following gives the desired data, except that an index level is still missing:
ndf = df.loc[np.repeat(df.index.values, xi)]
The following then adds the missing first level, producing the desired two-level index:
ndf.set_index([np.repeat(ndf.index, [2,0]*int(len(ndf)/2)), ndf.index])
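Putting the pieces together (and adding the imports the snippets assume), the whole approach runs as:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 3, 3], 'B': [3, 2, 5, 2]},
                  index=['5/8', '6/8', '7/8', '8/8'])

xi = [2] * len(df)
xi[0] = 1
xi[-1] = 1                                    # xi == [1, 2, 2, 1]

# Repeat the middle rows so each window's members appear
ndf = df.loc[np.repeat(df.index.values, xi)]  # 5/8, 6/8, 6/8, 7/8, 7/8, 8/8

# Build the first index level by repeating every other label twice
result = ndf.set_index(
    [np.repeat(ndf.index, [2, 0] * int(len(ndf) / 2)), ndf.index])
print(result)
```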
I'm currently working with an enormous time-series dataset in a CSV file. There are around 73,000 observations, and I need to slice them for a backtest.
I tried going via while and for loops, but they didn't work out at all: the cycle simply ran into infinity even with a break rule. Itertools also didn't help much.
If I try to describe what I'm trying to get: assume there are 10 observations (1,2,3,4,5,6,7,8,9,10), and I need to create cycle slices (1,2), (2,3), (3,4), (4,5) and so on, up to (9,10), but without manually coding it like this:
slice(1,2,1)
slice(2,3,1)
slice(3,4,1)
It is not very clear what you want. If you are taking 2 or 3 elements, in R you can do this with the combn function:
t(combn(1:10, 2))
t(combn(1:10, 3))
# > t(combn(1:10, 2))
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    3
# [3,]    1    4
# ...
# > t(combn(1:10, 3))
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    2    4
# [3,]    1    2    5
# [4,]    1    2    6
# ...
If it is a big data set you will probably have memory issues.
The number of possible combinations is enormous:
> choose(73000, 3)
[1] 6.48335e+13
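Since the question mentions itertools, the poster is presumably working in Python. If the goal is only the consecutive pairs (1,2), (2,3), ..., (9,10) rather than all combinations, a sliding window avoids the combinatorial blow-up entirely. A sketch (sliding_window_view requires NumPy >= 1.20):

```python
import numpy as np

obs = np.arange(1, 11)   # stand-in for the 73,000 observations
window = 2

# Build all consecutive windows as views, without copying the data
pairs = np.lib.stride_tricks.sliding_window_view(obs, window)
print(pairs)   # [[1 2], [2 3], ..., [9 10]]
```

With 73,000 observations this produces 72,999 windows, versus the 6.5e13 combinations of choose(73000, 3).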
I am unfamiliar with Lua, but the author of the article used Lua. Can you help me understand what these two lines do:
What does replicate(x, batch_size) do?
What does x = x:resize(x:size(1), 1):expand(x:size(1), batch_size) do?
The original source code can be found here:
https://github.com/wojzaremba/lstm/blob/master/data.lua
This basically boils down to simple maths and looking up a few functions in the torch manual.
Ok I'm bored so...
replicate(x,batch_size) as defined in https://github.com/wojzaremba/lstm/blob/master/data.lua
-- Stacks replicated, shifted versions of x_inp
-- into a single matrix of size x_inp:size(1) x batch_size.
local function replicate(x_inp, batch_size)
   local s = x_inp:size(1)
   local x = torch.zeros(torch.floor(s / batch_size), batch_size)
   for i = 1, batch_size do
      local start = torch.round((i - 1) * s / batch_size) + 1
      local finish = start + x:size(1) - 1
      x:sub(1, x:size(1), i, i):copy(x_inp:sub(start, finish))
   end
   return x
end
This code is using the Torch framework.
x_inp:size(1) returns the size of dimension 1 of the Torch tensor (a potentially multi-dimensional matrix) x_inp.
See https://cornebise.com/torch-doc-template/tensor.html#toc_18
So x_inp:size(1) gives you the number of rows in x_inp; x_inp:size(2) would give you the number of columns, and so on.
local x = torch.zeros(torch.floor(s / batch_size), batch_size)
creates a new two-dimensional tensor filled with zeros and creates a local reference to it, named x
The number of rows is calculated from s (x_inp's row count) and batch_size. So for your example input of 11 elements, it turns out to be floor(11/2) = floor(5.5) = 5.
The number of columns in your example is 2 as batch_size is 2.
So, simply put, x is the 5x2 matrix
0 0
0 0
0 0
0 0
0 0
The following lines copy x_inp's contents into x.
for i = 1, batch_size do
   local start = torch.round((i - 1) * s / batch_size) + 1
   local finish = start + x:size(1) - 1
   x:sub(1, x:size(1), i, i):copy(x_inp:sub(start, finish))
end
In the first run, start evaluates to 1 and finish to 5, as x:size(1) is of course the number of rows of x, which is 5 (1 + 5 - 1 = 5).
In the second run, start evaluates to 6 and finish to 10
So the first 5 rows of x_inp (your first batch) are copied into the first column of x and the second batch is copied into the second column of x
x:sub(1, x:size(1), i, i) is the sub-tensor of x, row 1 to 5, column 1 to 1 and in the second run row 1 to 5, column 2 to 2 (in your example). So it's nothing more than the first and second columns of x
See https://cornebise.com/torch-doc-template/tensor.html#toc_42
:copy(x_inp:sub(start, finish))
copies the elements from x_inp into the columns of x.
So to summarize: you take an input tensor and split it into batches, which are stored in a tensor with one column per batch.
So with x_inp
0
1
2
3
4
5
6
7
8
9
10
and batch_size = 2
x is
0 5
1 6
2 7
3 8
4 9
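For readers more at home in Python, here is a rough NumPy sketch of what replicate computes. It uses integer division for the batch boundaries, which matches the example output above; the Lua code uses torch.round, which can differ by one element at half-way points:

```python
import numpy as np

def replicate(x_inp, batch_size):
    """Stack consecutive batches of x_inp as columns of a matrix."""
    s = len(x_inp)
    rows = s // batch_size              # floor(s / batch_size)
    x = np.zeros((rows, batch_size))
    for i in range(batch_size):
        start = i * s // batch_size     # 0-based start of batch i
        x[:, i] = x_inp[start:start + rows]
    return x

x = replicate(np.arange(11), 2)   # the 0..10 example above
print(x)
```

As in the Lua version, any trailing elements that do not fill a whole row (here the value 10) are dropped.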
Further:
local function testdataset(batch_size)
   local x = load_data(ptb_path .. "ptb.test.txt")
   x = x:resize(x:size(1), 1):expand(x:size(1), batch_size)
   return x
end
This is another function, one that loads some data from a file. The x here is not related to the x above, other than both being tensors.
Let's use a simple example:
x being
1
2
3
4
and batch_size = 4
x = x:resize(x:size(1), 1):expand(x:size(1), batch_size)
First x will be resized to 4x1, read https://cornebise.com/torch-doc-template/tensor.html#toc_36
And then it is expanded to 4x4 by replicating its single column 3 more times.
Resulting in x being the tensor
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
read https://cornebise.com/torch-doc-template/tensor.html#toc_49
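The resize/expand pair maps directly onto NumPy's reshape and broadcasting; a sketch of the same 4-element example:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
batch_size = 4

x = x.reshape(-1, 1)                               # resize to 4x1
x = np.broadcast_to(x, (x.shape[0], batch_size))   # expand to 4x4 without copying
print(x)
```

Like Torch's expand, broadcast_to produces a view rather than a copy, so no extra memory is used for the duplicated columns.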
Is there a way to vectorize this for loop? I know about gallery("circul", y) thanks to user carandraug, but that only shifts the cells over to the next adjacent cell (I also tried toeplitz, but that didn't work).
I'm trying to make the shift adjustable, which is done in the example code with circshift and the variable shift_over.
The variable y_new is the output I'm trying to get, but without having to use a for loop as in the example (can this for loop be vectorized?).
Please note: the numbers used here are just an example; the real array will be 30-60 second voice/audio signals (so the y_new array could be large) and won't be sequential numbers like 1,2,3,4,5.
tic
y=[1:5];
[rw col]= size(y); %get size to create zero'd array
y_new= zeros(max(rw,col),max(rw,col)); %zero fill new array for speed
shift_over=-2; %cell amount to shift over
for aa=1:length(y)
  if aa==1
    y_new(aa,:)=y; %starts with original array
  else
    y_new(aa,:)=circshift(y,[1,(aa-1)*shift_over]);
  endif
end
y_new
fprintf('\nfinally Done-elapsed time -%4.4fsec- or -%4.4fmins- or -%4.4fhours-\n',toc,toc/60,toc/3600);
y_new =
1 2 3 4 5
3 4 5 1 2
5 1 2 3 4
2 3 4 5 1
4 5 1 2 3
PS: I'm using Octave 4.2.2 on Ubuntu 18.04 64-bit.
I'm pretty sure this is a classic XY problem: you want to calculate something and think it's a good idea to build a redundant n x n matrix, where n is the length of your audio file in samples. Perhaps you want to play with autocorrelation. The key point is that I doubt building the requested matrix is a good idea, but here you go:
Your code:
y = rand (1, 3e3);
shift_over = -2;
clear -x y shift_over
tic
[rw col]= size(y); %get size to create zero'd array
y_new= zeros(max(rw,col),max(rw,col)); %zero fill new array for speed
for aa=1:length(y)
  if aa==1
    y_new(aa,:)=y; %starts with original array
  else
    y_new(aa,:)=circshift(y,[1,(aa-1)*shift_over]);
  endif
end
toc
my code:
clear -x y shift_over
tic
n = numel (y);
y2 = y (mod ((0:n-1) - shift_over * (0:n-1).', n) + 1);
toc
gives on my system:
Elapsed time is 1.00379 seconds.
Elapsed time is 0.155854 seconds.
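The same index trick translates directly to NumPy, in case that helps readers outside Octave; row r of the result is y circularly shifted by r * shift_over positions:

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5])
shift_over = -2
n = y.size

# Build the full n x n index matrix in one shot, then fancy-index into y.
# Entry (r, j) is (j - shift_over * r) mod n, the 0-based source position.
idx = np.mod(np.arange(n)[None, :] - shift_over * np.arange(n)[:, None], n)
y_new = y[idx]
print(y_new)
```

This reproduces the y_new matrix shown in the question without any loop.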
I am going to do some work on transition-based dependency parsing using LIBLINEAR, but I am confused about how to use it. As follows:
I set 3 feature templates for my training & testing process of transition-based dependency parsing:
1. the word at the top of the stack
2. the word at the front of the queue
3. information from the current tree formed by the steps so far
And the feature defined in LIBLINEAR is:
FeatureNode(int index, double value)
Some examples like:
LABEL ATTR1 ATTR2 ATTR3 ATTR4 ATTR5
----- ----- ----- ----- ----- -----
1 0 0.1 0.2 0 0
2 0 0.1 0.3 -1.2 0
1 0.4 0 0 0 0
2 0 0.1 0 1.4 0.5
3 -0.1 -0.2 0.1 1.1 0.1
But I want to define my features like this (for the sentence 'I love you') at some stage:
feature template 1: the word is 'love'
feature template 2: the word is 'you'
feature template 3: the information is - the left son of 'love' is 'I'
Does it mean I must define features with LIBLINEAR like this? ------- FORMAT 1
(indexes in vocabulary: 0-I, 1-love, 2-you)
LABEL  ATTR1(template1)  ATTR2(template2)  ATTR3(template3)
-----  -----             -----             -----
SHIFT  1                 2                 0
(or LEFT-arc, RIGHT-arc)
But I have gone through some write-ups by others, and it seems I should define features in binary, so I have to define a word vector like ('I', 'love', 'you'): when 'you' appears, for example, the vector will be (0, 0, 1).
So the features in LIBLINEAR may be: ------- FORMAT 2
LABEL  ATTR1('I')  ATTR2('love')  ATTR3('you')
-----  -----       -----          -----
SHIFT  0           1              0             -> denoting feature template 1
(or LEFT-arc, RIGHT-arc)
SHIFT  0           0              1             -> denoting feature template 2
(or LEFT-arc, RIGHT-arc)
SHIFT  1           0              0             -> denoting feature template 3
(or LEFT-arc, RIGHT-arc)
Which is correct, FORMAT 1 or FORMAT 2?
Is there something I have misunderstood?
Basically you have a feature vector of the form:
LABEL RESULT_OF_FEATURE_TEMPLATE_1 RESULT_OF_FEATURE_TEMPLATE_2 RESULT_OF_FEATURE_TEMPLATE_3
Liblinear or LibSVM expects you to translate it into an integer representation:
1 1:1 2:1 3:1
Nowadays, depending on the language you use there are lots of packages/libraries, which would translate the string vector into libsvm format automatically, without you having to know the details.
However, if for whatever reason you want to do it yourself, the easiest thing would be to maintain two mappings: one for labels ('shift' -> 1, 'left-arc' -> 2, 'right-arc' -> 3, 'reduce' -> 4), and one for your feature template results ('f1=I' -> 1, 'f2=love' -> 2, 'f3=you' -> 3). Basically, every time your algorithm applies a feature template, you check whether the result is already in the mapping, and if not you add it with a new index.
Remember that Liblinear or Libsvm expect a sorted list in ascending order.
During processing you would first apply your feature templates to the current state of your stacks and then translate the strings to the libsvm/liblinear integer representation and sort the indexes in ascending order.
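The two-mapping bookkeeping described above can be sketched in a few lines of Python; the label and feature strings here are made up for illustration:

```python
def to_libsvm(label, features, label_map, feat_map):
    """Translate a (label, feature-strings) pair into a libsvm-format line."""
    label_id = label_map.setdefault(label, len(label_map) + 1)
    # Assign a fresh index to each unseen feature, then sort ascending
    idxs = sorted(feat_map.setdefault(f, len(feat_map) + 1) for f in features)
    return str(label_id) + ' ' + ' '.join(f'{i}:1' for i in idxs)

label_map, feat_map = {}, {}
line = to_libsvm('shift', ['f1=I', 'f2=love', 'f3=you'], label_map, feat_map)
print(line)   # 1 1:1 2:1 3:1
```

The maps grow as new labels and feature results appear, and the output indexes are sorted in ascending order as Liblinear/Libsvm require.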
Starting to learn image filtering and stumped on a question found on a website: applying a 3×3 mean filter twice does not produce quite the same result as applying a 5×5 mean filter once. However, a 5×5 convolution kernel can be constructed which is equivalent. What does this kernel look like?
Would appreciate help so that I can understand the subject better. Thanks.
Marcelo's answer is right. Another way of seeing it (easier to think about in one dimension first): we know that the mean filter is equivalent to convolution with a rectangular window, and we know that convolution is a linear operation, which is also associative.
Now, applying a mean filter M to a signal X can be written as
Y = M * X
where * denotes convolution. Appying the filter twice would then give
Y = M * (M * X) = (M * M) * X = M2 * X
This says that filtering a signal twice with a mean filter is the same as filtering it once with an equivalent filter given by M2 = M * M. Now, this amounts to applying the mean filter to itself, which gives a "smoother" filter (a triangular filter in this case).
The process can be repeated (see the first graph here), and it can be shown that the equivalent filter for many repetitions of a mean filter (N convolutions of the rectangular filter with itself) tends to a Gaussian filter. Further, it can be shown that the Gaussian filter has the property you didn't find in the rectangular (mean) filter: two passes of a Gaussian filter are equivalent to a single, wider Gaussian filter.
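The triangular-filter claim is easy to check numerically: convolving the 1-D mean filter with itself gives a triangular window, and because the 2-D mean filter is separable, the outer product of that window with itself is the equivalent 5×5 kernel. A quick NumPy sketch:

```python
import numpy as np

m = np.ones(3) / 3            # 1-D mean filter
m2 = np.convolve(m, m)        # triangular filter: [1, 2, 3, 2, 1] / 9

# The 3x3 mean kernel is outer(m, m), so two passes of it are
# equivalent to the single 5x5 kernel outer(m2, m2).
kernel = np.outer(m2, m2)
print(np.round(kernel * 81))  # integer pattern, scaled by 81
```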
3x3 mean:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]
3x3 mean twice:
[1 2 3 2 1]
[2 4 6 4 2]
[3 6 9 6 3] * 1/81
[2 4 6 4 2]
[1 2 3 2 1]
How? Each source cell contributes indirectly via one or more intermediate 3x3 windows. Consider the set of stage-1 windows that contribute to a given stage-2 computation: the number of 3x3 windows that contain a given source cell determines that cell's contribution. The middle cell, for instance, is contained in all nine windows, so its contribution is 9 * 1/9 * 1/9 = 9/81. I don't know if I've explained it that well, so I hope it makes sense to you.
Actually I believe that 3x3 twice should give:
[1 2 3 2 1]
[2 4 6 4 2]
[3 6 9 6 3] * 1/81
[2 4 6 4 2]
[1 2 3 2 1]
The reason is that the sum of all values must be equal to 1 (and indeed the entries above sum to 81, so the 1/81 factor normalizes correctly).