Sum of all subsets in the range - segment-tree

Given an array A consisting of N elements and Q queries of the form [l,r], print the sums of all possible subsets in the range [l,r].
Example: A[] = { 1, 2, 3 } and l = 1, r = 3;
Sol: print {1}, {2}, {3}, {1+2}, {1+3}, {2+3}, {1+2+3}
P.S. I am using segment trees with bit manipulation to find the sums of all possible subsets, but it gets TLE (time limit exceeded). Is there a more optimized solution?

A differentiable approach to counting elements in PyTorch

I need to count the number of times a certain element appears in a tensor in a differentiable way.
I have a tensor
a = torch.arange(10, dtype = float, requires_grad=True)
print(a)
>>>tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=torch.float64, requires_grad=True)
Say I'm trying to count the number of times the element 5.0 appears. I found this SO question that asks exactly the same thing, but the accepted answer is not differentiable:
(a == 5).sum()
>>>tensor(1)
(a == 5).sum().requires_grad
>>>False
My goal is to have a loss that enforces the element to appear N times:
loss = N - (a == 5).sum()
What you probably care about is differentiability with respect to parameters, so your vector [1,2,3,4,5] is actually the output of some f(x | theta). Since you cast everything to integers, this will never produce a meaningful gradient for theta. You have two paths:
Change your output so that you do not output numbers but rather distributions over number sequences. Instead of a vector of integers, output an N x K matrix of probabilities, where K is the maximum number and N is the number of integers, and an entry p_nk is the probability that the nth number equals k. Then you can write a nice smooth loss that takes the number of times you expect each digit to appear, let's call it Z (a vector of length K), and compares it with the expected count under P:
loss(P, Z) := SUM_k [ || Z_k - [ SUM_n P_nk ] || ]
Treat the whole setup as an RL problem, in which case you do not need a "differentiable" loss. Just use the difference between expected occurrences and actual occurrences as a negative reward.
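The first path can be illustrated with a minimal sketch in plain Python. It assumes the model emits an N x K grid of unnormalised scores (logits), turns each row into probabilities with a softmax, and penalises the gap between the desired counts Z and the expected counts SUM_n P_nk. The names softmax and expected_count_loss are illustrative, not from the original answer; in PyTorch the same arithmetic would be written with tensor operations such as torch.softmax so that gradients can flow back to the parameters.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def expected_count_loss(logits, target_counts):
    # logits: N rows of K scores; target_counts: Z, a list of length K.
    probs = [softmax(row) for row in logits]  # P, shape N x K
    num_classes = len(target_counts)
    # Expected number of occurrences of each value k: SUM_n P_nk.
    expected = [sum(p[k] for p in probs) for k in range(num_classes)]
    # Sum over k of |Z_k - expected_k|; smooth in the probabilities
    # except where a term is exactly zero.
    return sum(abs(z - e) for z, e in zip(target_counts, expected))

# Four positions, two possible values, uniform probabilities:
uniform = [[0.0, 0.0] for _ in range(4)]
print(expected_count_loss(uniform, [2, 2]))
print(expected_count_loss(uniform, [4, 0]))
```

With uniform probabilities each value is expected twice, so the loss vanishes for Z = [2, 2] and grows as Z moves away from that.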

if (freq) x$counts else x$density length > 1 and only the first element will be used

For my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employees at risk (Y) for each occupation category. I have a dataset like this:
         X     Y
1   0.1300     0
2   0.1000     0
3   0.0841  1513
4   0.0221   287
5   0.1175  3641
....
700 0.9875  4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what the problem is and give me the right command?
Thank you!
It is worth pointing out that the message displayed is a warning, not an error, and should not prevent the results from being plotted. It most likely arises because dataset1$Y is passed as a second unnamed argument: with breaks given by name, it is matched to the freq argument of hist(), which expects a single logical value. It also indicates that there are some issues with the data.
Without the full dataset, it is not 100% obvious what the underlying problem may be. I believe the data are not in the correct format, with two potential issues. Firstly, some rows have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two kinds of dataset:
A dataframe which has been aggregated into consistently sized bins.
A list of the values of X, with one entry per observation in the data.
I prefer the second technique. As originally shown here, the expandRows() function in the splitstackshape package can be used to repeat each row of the dataframe by its number of observations:
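# Simulated stand-in for the real data: X is a probability, Y a count per row
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
# Second technique: repeat each row Y times, then plot the raw X values
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
# First technique: assign each X to one of 100 equal-width bins before aggregating
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)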

How to use segment tree and scanline

Given 300000 segments.
Consider any pair of segments: a = [l1,r1] and b = [l2,r2].
If l2 >= l1 and r2 <= r1 , it is "good" pair.
If a == b, it is a "bad" pair.
Otherwise, it is a "bad" pair.
How to find number of all "good" pairs among given segments using segment tree and scanline?
Sort the segments in increasing order of their l-values; among segments with the same l-value, sort in decreasing order of their r-values.
Suppose that for a particular i you want to count the good pairs (aj, ai) with j < i, that is, the earlier segments aj that contain ai. Let ai = [l1,r1] and aj = [l2,r2]. Because of the sort order we already have l2 <= l1, so we only need to count the values of j such that r2 >= r1. This can be done by maintaining a segment tree over the r-values of all j < i. After querying for the i-th segment, update the segment tree with the r-value of the i-th segment.
Coming to the segment tree part, build a segment tree over the values of r. To insert a value r, add 1 at position r in the tree; to query for a particular value r, take the sum over the range from r up to the largest r-value. This gives the number of earlier segments that contain the given segment. Segments identical to it are included in that sum, so subtract them, since a == b is a "bad" pair.
If the values of r are too large to index a segment tree directly, apply coordinate compression to the values first and then build the segment tree over the compressed values.
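The steps above can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it assumes a "good" pair is an ordered pair (a, b) of distinct segments with b lying inside a, uses an iterative bottom-up segment tree for point updates and range sums, and tracks identical segments in a Counter so they can be subtracted. The function name count_good_pairs is made up for illustration.

```python
from collections import Counter

def count_good_pairs(segments):
    # Coordinate-compress the r-values so they can index the tree.
    r_values = sorted({r for _, r in segments})
    rank = {r: k for k, r in enumerate(r_values)}
    size = len(r_values)
    tree = [0] * (2 * size)  # leaves live at positions size .. 2*size-1

    def add_one(pos):
        # Point update: add 1 to the leaf and to all its ancestors.
        pos += size
        while pos >= 1:
            tree[pos] += 1
            pos //= 2

    def range_sum(lo, hi):
        # Sum of the leaves in the half-open range [lo, hi).
        total = 0
        lo += size
        hi += size
        while lo < hi:
            if lo & 1:
                total += tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                total += tree[hi]
            lo //= 2
            hi //= 2
        return total

    # Scanline order: l increasing, ties broken by r decreasing.
    order = sorted(segments, key=lambda s: (s[0], -s[1]))
    seen = Counter()  # identical segments processed so far
    good = 0
    for l, r in order:
        # Earlier segments have l' <= l; count those with r' >= r,
        # then drop exact copies, which are "bad" pairs.
        good += range_sum(rank[r], size) - seen[(l, r)]
        add_one(rank[r])
        seen[(l, r)] += 1
    return good

print(count_good_pairs([(1, 10), (2, 5), (2, 5), (3, 4), (1, 10), (6, 12)]))
```

Sorting dominates at O(n log n), and each segment costs one O(log n) query and one O(log n) update, which is comfortable for 300000 segments.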

How to predict users' preferences using item similarity?

I am wondering whether I can predict if a user will like an item, given the similarities between items and the user's ratings of items.
I know that in item-based collaborative filtering the predicted rating is determined by the overall rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r_{i}} + \frac{\sum S_{i,j} (r_{u,j} - \bar{r_{j}})}{\sum S_{i,j}}
My question is,
If I obtain the similarities using another approach (e.g. a content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not the actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there better methods or equations for this problem?
Another problem: I wrote some Python code to test the above equation. The code is:
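import numpy
from scipy import spatial
# Rows are users, columns are items
mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print(mat)
def prediction(u, i):
    target = mat[u, i]  # actual rating, not used in the prediction
    r = numpy.mean(mat[:, i])  # average rating of item i
    a = 0.0  # numerator: similarity-weighted rating deviations
    b = 0.0  # denominator: sum of similarities
    for j in range(5):
        if j != i:
            # Cosine similarity between the rating columns of items i and j
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b
for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print(" ".join(lst))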
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second one contains the predicted values. They do not look close. Is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
What you are referring to is a typical situation of implicit ratings (i.e. users do not give explicit ratings to items, let's say you just have likes and dislikes).
As for the approaches, you can use neighbourhood models or latent factor models.
I suggest you read this paper, which proposes a well-known machine-learning-based solution to the problem.
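As a rough illustration of a neighbourhood model on implicit data, the sketch below scores a candidate item by its average similarity to the items a user has marked as favourites. This is a simplification chosen for illustration, not the method from the paper mentioned above; the similarity matrix sim, the function name score and the toy numbers are all made up, and the similarities could come from a content-based approach as discussed earlier.

```python
def score(favourites, sim, item):
    # Average similarity between `item` and the user's favourite items.
    sims = [sim[item][j] for j in favourites if j != item]
    return sum(sims) / len(sims) if sims else 0.0

# Toy item-item similarity matrix for three items (symmetric).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
favourites = {0}  # the user only marked item 0 as a favourite

# Rank the items the user has not marked yet, best candidate first.
candidates = [i for i in range(len(sim)) if i not in favourites]
ranked = sorted(candidates, key=lambda i: score(favourites, sim, i), reverse=True)
print(ranked)
```

No rating values or item averages are needed here, which is why this kind of scoring is a common starting point when only likes are available.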

Maxima: Differentiate sum at specific index-position

In Maxima 12.04.0 I have a sum
mysum : sum(u[i]^2, i, 1, N);
Now I differentiate it:
diff(mysum, u[i]);
Now I specify a particular index i=A at which to evaluate the derivative:
at(%, i=A);
Unfortunately, Maxima won't replace the u[i] in the sum that way.
How can I get Maxima to return a result like
2*u[A]
After you differentiate, push the 2 into the sum, pick out the i-th term, and then substitute i=A:
(%i1) mysum : sum(u[i]^2, i, 1, N);
diff(mysum, u[i]);
intosum(%);
part(%, 1);
%, i=A;
