Pandas stack unstack pivot hierarchical index - reshape dataframe - stack

I have massaged a dataframe so it looks like this:
123
456
789
0AB
CDE
FGH
...
,,,
I would like to transform it, so it looks like this:
123789CDE...
4560ABFGH,,,
The pattern is this:
123 789 CDE ...
456 0AB FGH ,,,
That is, I take two rows and concatenate the next two rows, etc, so I get a wide dataframe.
But my real dataframe is not three columns, it is maybe 50 columns, and maybe 100,000 rows, so my dataframe is 100,000 x 50 big. I want to take 100 rows, and concatenate the next 100 rows, etc so I get a wide dataframe with dimension 100 x (50 * 100,000/100) = 100 x 50,000.
Can Pandas do this? My aim is to do some calculations on each of these 100 rows. Or is hierarchical indexing better?

shell [33]>>> df
[33]>>>
0
0 123
1 456
2 789
3 0AB
4 CDE
5 FGH
6 ...
7 ,,,
shell [34]>>> pd.DataFrame(df.values.reshape(4, 2)).sum()
[34]>>>
0 123789CDE...
1 4560ABFGH,,,
dtype: object
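For the multi-column case described in the question, the same reshape idea can be extended. The following is only a sketch under assumptions: the helper name `widen` and the toy six-row frame are made up for illustration, and the row count is assumed to be an exact multiple of the block size.

```python
import numpy as np
import pandas as pd


def widen(df, block):
    """Place consecutive `block`-row chunks of df side by side (hypothetical helper)."""
    n_rows, n_cols = df.shape
    if n_rows % block:
        raise ValueError("row count must be a multiple of block")
    n_blocks = n_rows // block
    wide = (
        df.to_numpy()
        .reshape(n_blocks, block, n_cols)   # (chunk, row within chunk, column)
        .transpose(1, 0, 2)                 # (row within chunk, chunk, column)
        .reshape(block, n_blocks * n_cols)  # chunks laid out left to right
    )
    return pd.DataFrame(wide)


# Toy data: the question's pattern with one character per cell
df = pd.DataFrame([list("123"), list("456"), list("789"),
                   list("0AB"), list("CDE"), list("FGH")])
wide = widen(df, block=2)
print(wide)
```

With `block=100` on a 100,000 x 50 frame this would give the 100 x 50,000 shape the question asks for. The original column labels are dropped in this sketch.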
Another approach is using groupby.
shell [35]>>> df['group'] = 0
shell [36]>>> df.loc[df.index[1::2], 'group'] = 1
shell [37]>>> grouped = df.groupby('group')
shell [38]>>> grouped.sum()
[38]>>>
0
group
0 123789CDE...
1 4560ABFGH,,,
It may be worth not creating a new frame and instead working directly on the groups, certainly with multiple columns and a huge number of rows.
shell [39]>>> for key, group in grouped:
   ....:     print(key)
   ....:     print(group)
   ....:
0
0 group
0 123 0
2 789 0
4 CDE 0
6 ... 0
1
0 group
1 456 1
3 0AB 1
5 FGH 1
7 ,,, 1
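To carry the groupby idea over to blocks of 100 rows, one option is to group by each row's position within its block. This is a minimal sketch on made-up numeric data; the column names `a` and `b` and the block size of 2 are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

block = 2  # would be 100 for the frame described in the question
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [10, 20, 30, 40, 50, 60]})

# Position of each row inside its block: 0, 1, 0, 1, 0, 1
position = np.arange(len(df)) % block

# Aggregate per position without building the wide frame first
sums = df.groupby(position).sum()
print(sums)
```

Any other aggregation can replace `sum()` here. The wide frame is never materialised, which may matter at 100,000 rows.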

Related

Expanding arrays of intervals in Arrayfire

I have three Arrayfire arrays that look like this:
Array 1 Array 2 Array 3
20 5 9
3 0 0
9 4 8
0 20 22
... ... ...
Using Arrayfire, I would like to generate 2 new arrays. The first should contain values from Array 1. Each value should be repeated a number of times dictated by the interval between the corresponding values in Array 2 (inclusive) and Array 3 (exclusive). The second array should contain an expansion of the values within each interval for each value from Array 1. Sorry if that's not clear. Here's the desired output to hopefully clarify:
Array 1 Array 2
20 5
20 6
20 7
20 8
9 4
9 5
9 6
9 7
0 20
0 21
... ...
The order of the output doesn't matter.
Thanks, in advance, from an Arrayfire novice.
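The expansion described above can be sketched with repeat and cumulative-sum primitives. The sketch below uses NumPy, not Arrayfire, so it only illustrates the logic; the variable names are assumptions, and translating the calls to Arrayfire equivalents is left open.

```python
import numpy as np

a1 = np.array([20, 3, 9, 0])
a2 = np.array([5, 0, 4, 20])   # interval start (inclusive)
a3 = np.array([9, 0, 8, 22])   # interval end (exclusive)

lengths = a3 - a2                      # number of values in each interval
starts = np.cumsum(lengths) - lengths  # output offset where each interval begins

# First output: each value of a1 repeated once per element of its interval
out1 = np.repeat(a1, lengths)

# Second output: interval start plus the running offset inside the interval
offset_in_interval = np.arange(lengths.sum()) - np.repeat(starts, lengths)
out2 = np.repeat(a2, lengths) + offset_in_interval

print(np.column_stack([out1, out2]))
```

Empty intervals, such as the second row where both bounds are 0, contribute nothing to either output.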

Apply function to each row in Torch

I know that tensors have an apply method, but this only applies a function to each element. Is there an elegant way to do row-wise operations? For example, can I multiply each row by a different value?
Say
A =
1 2 3
4 5 6
7 8 9
and
B =
1
2
3
and I want to multiply each element in the ith row of A by the ith element of B to get
1 2 3
8 10 12
21 24 27
how would I do that?
See this link: Torch - Apply function over dimension
(Thanks to Alexander Lutsenko for providing it. I just moved it to the answer.)
One possibility is to expand B as follows:
1 1 1
2 2 2
3 3 3
[torch.DoubleTensor of size 3x3]
Then you can use element-wise multiplication directly:
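local A = torch.Tensor{{1,2,3},{4,5,6},{7,8,9}}
local B = torch.Tensor{1,2,3}
-- view B as a 3x1 column, then expand it to 3x3 without copying
-- note: A:cmul(...) multiplies in place, so C and A refer to the same tensor
local C = A:cmul(B:view(3,1):expand(3,3))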
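For readers more used to Python, the same expand-then-multiply idea can be sketched in NumPy. This is an analogue rather than Torch code, and `np.broadcast_to` stands in for the `view`/`expand` pair by assumption.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
B = np.array([1, 2, 3], dtype=float)

# Turn B into a 3x1 column and expand it to A's shape without copying
B_expanded = np.broadcast_to(B.reshape(3, 1), A.shape)

# Element-wise product: row i of A is scaled by B[i]; A itself is not modified
C = A * B_expanded
print(C)
```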

Why is the result of the method mostSimilarItems in Mahout not ordered by weight?

I have the following code:
ItemSimilarity itemSimilarity = new UncenteredCosineSimilarity(dataModel);
recommender = new GenericItemBasedRecommender(dataModel,itemSimilarity);
List<RecommendedItem> items = recommender.mostSimilarItems(10, 5);
My data model looks like this:
userid itemid score
1 6 5
1 10 3
1 11 5
1 12 4
1 13 5
2 2 3
2 6 5
2 10 3
2 12 5
When I run the code above, the result looks like this:
13
6
11
2
12
I debugged the code and found that the items returned by recommender.mostSimilarItems(10, 5) all have the same score, namely one!
So I have a question. In my opinion, mostSimilarItems should take the item co-occurrence matrix into account:
     2  6 10 11 12 13
 2   0  1  1  0  1  0
 6   1  0  2  1  2  1
10   1  2  0  1  2  1
11   0  1  1  0  1  1
12   1  2  2  1  0  1
13   0  1  1  1  1  0
Going by the matrix above, the items most similar to item 10 should be [6,12,11,13,2], because items 6 and 12 are more similar to it than the other items, aren't they?
Can anyone explain this to me? Thanks!
In your matrix you have much more data than in your input. In particular you seem to be imputing 0 values that are not in the data. That is why you are likely getting answers different from what you expect.
Mahout expects your IDs to be contiguous Integers starting from 0. This is true of your row and column ids. Your matrix looks like it has missing ids. Just having Integers is not enough.
Could this be the problem? Not sure what Mahout would do with the input above.
I always keep a dictionary to map Mahout IDs to/from my own.
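One way to check the scores reported in the question is to compute an uncentered cosine by hand. The sketch below is plain Python, not Mahout code, and it assumes that the similarity between two items is taken only over users who rated both. That assumption is not confirmed by anything in this thread.

```python
import math

# (user, item, score) rows copied from the question's data model
ratings = [(1, 6, 5), (1, 10, 3), (1, 11, 5), (1, 12, 4), (1, 13, 5),
           (2, 2, 3), (2, 6, 5), (2, 10, 3), (2, 12, 5)]

by_item = {}
for user, item, score in ratings:
    by_item.setdefault(item, {})[user] = score


def cosine_on_shared_users(a, b):
    """Uncentered cosine over users who rated both items (assumed behaviour)."""
    shared = by_item[a].keys() & by_item[b].keys()
    if not shared:
        return float("nan")
    dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
    norm_a = sum(by_item[a][u] ** 2 for u in shared)
    norm_b = sum(by_item[b][u] ** 2 for u in shared)
    return dot / math.sqrt(norm_a * norm_b)


sims = {other: cosine_on_shared_users(10, other)
        for other in by_item if other != 10}
for item, sim in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(item, round(sim, 4))
```

Under this assumption, items 2, 6, 11 and 13 all score exactly one against item 10, and item 12 scores slightly below one. With so few users, a cosine over shared ratings can hardly separate the items, whereas the co-occurrence counts can.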

Automatically learning clusters

Hi, complete newbie question here. I have a table with two columns. The first column is the "bin", coded by where the fruit flies live. The second column is either 0 or 1: neutral vs. really likes sugar, respectively. I have two questions:
1) I suspect that there is a single variable, something about where they live, that determines how much they like sugar. Is there a way to have the computer group the bins into just 2 clusters: bins that like sugar vs. neutral bins? That way we can do further experiments to determine what it is about the bins.
2) Is there a way to automatically determine how many clusters there might be driving this behavior? For example, maybe there are 4 variables (4 clusters) that determine the outcome of sugar preference.
Apologies if this is trivial. The table is listed below. Thanks!
Bin sugar
1 1
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 1
3 1
4 1
4 1
4 1
5 1
5 0
5 1
6 0
6 0
6 0
7 0
7 1
7 1
8 1
8 0
8 1
9 1
9 0
9 0
9 0
10 0
10 0
10 0
11 1
11 1
11 1
12 0
12 0
12 0
12 0
13 0
13 0
13 1
13 0
13 0
14 0
14 0
14 0
14 0
15 1
15 0
15 0
16 1
16 1
17 1
17 1
18 0
18 1
18 1
17 1
19 1
20 1
20 0
20 0
20 1
21 0
21 0
21 1
21 0
22 1
22 0
22 1
22 1
23 1
23 1
24 1
24 0
25 0
25 1
25 0
26 1
26 1
27 1
27 1
Okay, assuming I understood what you meant, one way to approach problem 1) is Bayesian filtering.
Say event L is "a fly likes sugar", event B is "a fly is in bin B".
So what you have is:
number of flies = 84
size of each bin = (e.g. size of bin 1: 4)
probability that a fly likes sugar:
P(L) = flies that like sugar / total number of flies = 43/84
probability that a fly doesn't like sugar:
P(notL) = 1 - P(L) = 41/84
probability that a fly is in a given bin:
P(B) = size of the bin / sum of the sizes of all bins = 4/84 (for bin 1)
probability that a fly isn't in a given bin:
P(notB) = 1 - P(B) = 80/84 (for bin 1)
probability that a fly likes sugar, knowing that it's in bin B:
P(L|B) = flies that like sugar in a bin / size of the bin
(eg for bin 1 is 2/4 = 1/2)
probability that a fly likes sugar, knowing that it's not in bin B:
P(L|notB) = (total flies that like sugar - flies that like sugar in the bin) / (total number of flies - size of the bin) = 41/80 (for bin 1)
You want to know the probability that a fly is in a given bin B knowing that it likes sugar, which you can obtain with:
P(B|L) = (P(L|B) * P(B)) / (P(L|B) * P(B) + P(L|notB) * P(notB))
If you compute P(B|L) and P(B|notL) for each bin, then you know which of the bins have the highest probability of containing flies that like sugar. Then you can further study those bins.
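As a worked instance of the formula, the numbers given above for bin 1 can be plugged in with exact fractions. The function name and its parameters are illustrative assumptions; the counts (2 of 4 flies in bin 1 like sugar, 43 of 84 overall) come from the text above.

```python
from fractions import Fraction


def posterior_bin_given_likes(likes_in_bin, bin_size, total_likes, total_flies):
    """P(B|L) via Bayes' rule, using the quantities defined in the answer."""
    p_b = Fraction(bin_size, total_flies)
    p_not_b = 1 - p_b
    p_l_given_b = Fraction(likes_in_bin, bin_size)
    p_l_given_not_b = Fraction(total_likes - likes_in_bin,
                               total_flies - bin_size)
    numerator = p_l_given_b * p_b
    return numerator / (numerator + p_l_given_not_b * p_not_b)


# Bin 1: 2 of its 4 flies like sugar; 43 of all 84 flies like sugar
p = posterior_bin_given_likes(2, 4, 43, 84)
print(p)  # 2/43
```

The result equals the number of sugar-liking flies in the bin divided by all sugar-liking flies. So P(B|L) mostly reflects how many such flies a bin holds, while P(L|B) is the per-bin preference rate.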
Hope I was clear. My statistics is a bit rusty and I'm not even sure I am doing everything correctly, so take it as a hint to point you in the right direction to address the problem.
You can refer here to get more accurate reasoning and results.
As for problem 2)... I have to think about it a bit more.

Using COUNTIFS on 3 different columns and then need to SUM a 4th column?

I have written the formula below. I do not know which part of it to change so that it adds up the numbers I have in column AB2:AB552. As it is, this formula counts the number of cells in that range that have numbers in them, but I need it to total those numbers as my final result. Any help would be great.
=COUNTIFS(Cases!B2:B552,"1",Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Assuming you don't actually need the intermediate counts, the SUMIFS function should give you the final result:
=SUMIFS(Cases!AB2:AB552,Cases!B2:B552,1,Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Testing this with some limited data:
Row  B  G  X    AB
2    2  a  No   10
3    1  c  No   24
4    2  c  No   4
5    1  c  No   0
6    1  a  Yes  9
7    2  c  No   12
8    2  c  No   6
9    2  b  No   0
10   1  b  No   0
11   1  a  No   10
12   2  c  No   6
13   1  c  No   20
14   1  c  No   4
15   1  b  Yes  22
16   1  b  Yes  22
The formula above returned 48, the sum of AB3, AB13, and AB14, which were the only rows matching all 4 criteria.
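The same check can be reproduced outside the spreadsheet. This sketch re-implements the four criteria in plain Python on the test rows above; it is not how Excel evaluates SUMIFS internally. The `c*` wildcard from the question's COUNTIFS is emulated with `fnmatchcase`, and the case-insensitive matching is an assumption about the intended behaviour.

```python
from fnmatch import fnmatchcase

# (row, B, G, X, AB) copied from the test table above
rows = [
    (2, 2, "a", "No", 10), (3, 1, "c", "No", 24), (4, 2, "c", "No", 4),
    (5, 1, "c", "No", 0), (6, 1, "a", "Yes", 9), (7, 2, "c", "No", 12),
    (8, 2, "c", "No", 6), (9, 2, "b", "No", 0), (10, 1, "b", "No", 0),
    (11, 1, "a", "No", 10), (12, 2, "c", "No", 6), (13, 1, "c", "No", 20),
    (14, 1, "c", "No", 4), (15, 1, "b", "Yes", 22), (16, 1, "b", "Yes", 22),
]

total = 0
matched = []
for row, b, g, x, ab in rows:
    # All four criteria must hold, mirroring the criteria pairs in SUMIFS
    if b == 1 and fnmatchcase(g.lower(), "c*") and x.lower() == "no" and ab > 0:
        total += ab
        matched.append(row)

print(matched, total)
```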
