How to slice tensors with a predefined order in torch? - lua

I have a dataset of length 10, train = torch.range(1,10), and I want to slice it in a random order defined by p = torch.randperm(10).
To slice by range one can do a = train[{{1,3}}] to get the first three elements. But let's say I want the 2nd, 3rd and 9th elements. Can I get them without running a for loop like this?
for i = 1,3 do
print(a[{ p[i] }])
end
where
p[1] = 2, p[2] = 3, p[3] = 9.
a = train[{{ p[{{1,3}}] }}] doesn't work.

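Well, for one there's index, which selects entries along a given dimension; it requires a LongTensor of indices, however:
train = torch.range(1,10)
p = torch.randperm(10):long()
print(train:index(1, p)) -- all elements, in the order given by p
print(train:index(1, p[{{1,3}}])) -- only the first three entries of p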


ERROR: MethodError: no method matching DocumentTermMatrix(::Vector{String})

I am trying to train a basic SVM model for multiclass text classification in Julia. My dataset has around 75K rows and 2 columns (text and label). The context of the dataset is the abstracts of scientific papers gathered from PubMed. I have 10 labels in the dataset.
The dataset looks like this:
I keep receiving two different Method errors. The starting one is:
ERROR: MethodError: no method matching DocumentTermMatrix(::Vector{String})
I have tried:
convert(Array,data[:,:text])
and also:
convert(Matrix,data[:,:text])
Array conversion gives the same error, and matrix conversion gives:
ERROR: MethodError: no method matching (Matrix)(::Vector{String})
My code is:
using DataFrames, CSV, StatsBase, Printf, LIBSVM, TextAnalysis, Random
function ReadData(data)
    df = CSV.read(data, DataFrame)
    return df
end
function splitdf(df, pct)
    @assert 0 <= pct <= 1
    ids = collect(axes(df, 1))
    shuffle!(ids)
    sel = ids .<= nrow(df) .* pct
    return view(df, sel, :), view(df, .!sel, :)
end
function Feature_Extract(data)
    Text = convert(Array, data[:, :text])
    m = DocumentTermMatrix(Text)
    X = tfidf(m)
    return X
end
function Classify(data)
    data = ReadData(data)
    train, test = splitdf(data, 0.5)
    ytrain = train.label
    ytest = test.label
    Xtrain = Feature_Extract(train)
    Xtest = Feature_Extract(test)
    model = svmtrain(Xtrain, ytrain)
    ŷ, decision_values = svmpredict(model, Xtest);
    @printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end
data = "data/composite_data.csv"
@time Classify(data)
I appreciate your help to solve this problem.
EDIT:
I have managed to get the corpus but now facing DimensionMismatch Error:
using DataFrames, CSV, StatsBase, Printf, LIBSVM, TextAnalysis, Random
function ReadData(data)
    df = CSV.read(data, DataFrame)
    #count = countmap(df.label)
    #println(count)
    #amt,lesslabel = findmin(count)
    #println(amt, lesslabel)
    #println(first(df,5))
    return df
end
function splitdf(df, pct)
    @assert 0 <= pct <= 1
    ids = collect(axes(df, 1))
    shuffle!(ids)
    sel = ids .<= nrow(df) .* pct
    return view(df, sel, :), view(df, .!sel, :)
end
function Feature_Extract(data)
    crps = Corpus(StringDocument.(data.text))
    update_lexicon!(crps)
    m = DocumentTermMatrix(crps)
    X = tf_idf(m)
    return X
end
function Classify(data)
    data = ReadData(data)
    #println(labels)
    #println(first(instances))
    train, test = splitdf(data, 0.5)
    ytrain = train.label
    ytest = test.label
    Xtrain = Feature_Extract(train)
    Xtest = Feature_Extract(test)
    model = svmtrain(Xtrain, ytrain)
    ŷ, decision_values = svmpredict(model, Xtest);
    @printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end
data = "data/composite_data.csv"
@time Classify(data)
Error:
ERROR: DimensionMismatch("Size of second dimension of training instance\n matrix (247317) does not match length of\n labels (38263)")
(Copying Bogumił Kamiński's solution from the comments, as a community wiki answer, for better visibility.)
The argument to DocumentTermMatrix should be of type Corpus, as in this example.
A Corpus can be created with:
Corpus(StringDocument.(data.text))
There's a DimensionMismatch error after that, which is due to the mismatch between what tf_idf sends and what svmtrain expects. tf_idf's return value has one row per document, whereas svmtrain expects one column per document i.e. expects each column to be an X value. So, performing a permutedims on the result before passing it to svmtrain resolves this mismatch.
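The orientation issue can be illustrated outside Julia. The following is a minimal Python sketch, not the TextAnalysis or LIBSVM API: to_columns_per_document, check_labels_match, the toy weights and the label names are all hypothetical, and the check only mirrors the rule described above (the number of columns must equal the number of labels).

```python
def to_columns_per_document(doc_term_rows):
    """Transpose a documents-by-terms matrix into terms-by-documents."""
    return [list(column) for column in zip(*doc_term_rows)]


def check_labels_match(matrix, labels):
    """Mimic the rule described above: one column per labelled document."""
    n_columns = len(matrix[0])
    if n_columns != len(labels):
        raise ValueError(
            f"second dimension ({n_columns}) does not match labels ({len(labels)})"
        )


# Toy data (made up): 2 documents, 3 vocabulary terms.
tfidf_rows = [
    [0.0, 0.5, 0.1],
    [0.3, 0.0, 0.2],
]
labels = ["cardiology", "oncology"]

try:
    check_labels_match(tfidf_rows, labels)  # 3 columns vs 2 labels
    mismatch = False
except ValueError:
    mismatch = True

per_document_columns = to_columns_per_document(tfidf_rows)
check_labels_match(per_document_columns, labels)  # 2 columns vs 2 labels: passes
```

In the error message from the question, 247317 would then be the vocabulary size and 38263 the number of training documents, which is the same situation as the 3-versus-2 mismatch in this toy case.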

shape of input to calculate information gain

I want to calculate the information gain on the 20_newsgroup data set.
I am using the code here (I have also put a copy of the code at the bottom of the question).
As you can see, the input to the algorithm is X, y.
My confusion is that X is going to be a matrix with documents in rows and features in columns (for 20_newsgroup it is (11314, 1000) when I keep only 1000 features).
But according to the concept of information gain, it should be calculated for each feature.
(So I was expecting the code to loop through each feature, with the input to the function being a matrix where rows are features and columns are classes.)
But X here stands for documents, not features, and I cannot see the part of the code that takes care of this! (I mean considering each document and then going through each feature of that document, like looping through rows while also looping through columns, since the features are stored in columns.)
I have read this and this and many similar questions, but they are not clear about the shape of the input matrix.
This is the code for reading 20_newsgroup:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
newsgroup_train = fetch_20newsgroups(subset='train')
X, y = newsgroup_train.data, newsgroup_train.target
cv = CountVectorizer(max_df=0.99, min_df=0.001, max_features=1000, stop_words='english', lowercase=True, analyzer='word')
X_vec = cv.fit_transform(X)
X_vec.shape is (11314, 1000), so the rows are documents, not features. Am I calculating the information gain in an incorrect way?
This is the code for information gain:
import numpy as np
def information_gain(X, y):
    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)
    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)
    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)
    return np.asarray(information_gain)
Well, after going through the code in detail, I learned more about X.T.nonzero().
It is correct that information gain needs to loop through features.
It is also correct that the matrix scikit-learn gives us here is documents by features.
But the code uses X.T.nonzero(), which returns the indices of the nonzero entries of the transposed (features by documents) matrix as two arrays: nz[0] holds feature indices and nz[1] holds document indices. The next line then loops over these entries with range(0, len(nz[0])), and every time the feature index nz[0][i] changes, the information gain of the feature just finished is computed.
Overall, X.T.nonzero()[0] gives us the feature index of every nonzero entry, so the features are being looped over after all :)
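To make the per-feature loop explicit, here is a minimal pure-Python sketch of the same quantity (entropy before minus the weighted entropy with and without the feature). It is not the code from the question: entropy, information_gain_per_feature and the toy data are hypothetical, X is assumed to be a small dense list of rows (documents by features), and a feature counts as "set" in a document when its value is nonzero.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in Counter(labels).values())


def information_gain_per_feature(X, y):
    """X: list of document rows (documents x features); y: one label per document.

    Returns one information-gain value per feature (column).
    """
    n_docs = len(X)
    n_features = len(X[0])
    base = entropy(y)
    gains = []
    for j in range(n_features):  # explicit loop over features (columns)
        with_feature = [y[i] for i in range(n_docs) if X[i][j] != 0]
        without_feature = [y[i] for i in range(n_docs) if X[i][j] == 0]
        conditional = (
            len(with_feature) / n_docs * entropy(with_feature)
            + len(without_feature) / n_docs * entropy(without_feature)
        )
        gains.append(base - conditional)
    return gains


# Toy data: 4 documents, 2 features.
X = [[1, 1], [2, 0], [0, 1], [0, 0]]
y = ["a", "a", "b", "b"]
gains = information_gain_per_feature(X, y)
```

In this toy case the first feature separates the two classes perfectly, so its gain equals the full class entropy, while the second feature appears equally often in both classes and gets a gain of zero. The dense double loop is only for illustration; on an 11314 by 1000 sparse matrix, walking the nonzero entries as the original code does avoids touching the zeros.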

Random sum of elements in an array equals to y - ruby [duplicate]

This question already has answers here:
Finding all possible combinations of numbers to reach a given sum
(32 answers)
Closed 6 years ago.
I need to create an array, built from the elements of an input array, whose sum equals an expected value.
inp = [1,2,3,4,5,6,7,8,9,10]
sum = 200
output:
out = [10,10,9,1,3,3,3,7,.....] whose sum should be 200
or
out = [10,7,3,....] Repeated values can be used
or
out = [2,3,4,9,2,....]
I tried this:
arr = [5,10,15,20,30]
ee = []
max = 200
while (ee.sum < max) do
  ee << arr.sample(1).first
end
ee.pop(2)
val = max - ee.sum
pair = arr.uniq.combination(2).detect { |a, b| a + b == val }
ee << pair
ee.flatten
Is there a more efficient way to do it?
inp = [1,2,3,4,5,6,7,8,9,10]
sum = 20
inp.length.downto(1).flat_map do |i|
  inp.combination(i).to_a # take all subarrays of length `i`
end.select do |a|
  a.inject(:+) == sum # select only those summing to `sum`
end
One might take a random element of resulting array.
result = inp.length.downto(1).flat_map do |i|
  inp.combination(i).to_a # take all subarrays of length `i`
end.select do |a|
  a.inject(:+) == sum # select only those summing to `sum`
end
puts result.length
#⇒ 31
puts result.sample
#⇒ [2, 4, 5, 9]
puts result.sample
#⇒ [1, 2, 3, 6, 8]
...
Please note that this approach is not efficient for long inputs. Also, if a member of the original array may be used more than once, combination above would have to be replaced with repeated_combination (and the lengths tried would no longer be bounded by inp.length), which makes this brute-force approach too slow to be practical.
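For the case where values may be repeated, enumerating every combination is not needed if a single random solution is enough. Below is a minimal Python sketch under stated assumptions: random_sum_with_repeats is a hypothetical helper, the inputs are assumed to be positive integers, and a small reachability table is used so that the random walk never picks a value that makes the target impossible to hit. It does not sample uniformly over all possible solutions.

```python
import random


def random_sum_with_repeats(values, target, rng=random):
    """Return a random list drawn from `values` (repeats allowed) summing to `target`.

    Returns None when no such list exists. Non-positive values are ignored.
    """
    values = sorted({v for v in values if v > 0})
    # reachable[s] is True when s can be written as a sum of `values` with repeats.
    reachable = [False] * (target + 1)
    reachable[0] = True
    for s in range(1, target + 1):
        reachable[s] = any(v <= s and reachable[s - v] for v in values)
    if not reachable[target]:
        return None

    result = []
    remaining = target
    while remaining > 0:
        # Only pick values that leave a reachable remainder, so we never dead-end.
        candidates = [v for v in values if v <= remaining and reachable[remaining - v]]
        pick = rng.choice(candidates)
        result.append(pick)
        remaining -= pick
    return result


out = random_sum_with_repeats([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 200)
```

Building the table takes on the order of target times the number of distinct values, which stays small for inputs like the ones in the question.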
I found an answer to this question at the following link:
Finding all possible combinations of numbers to reach a given sum
def subset_sum(numbers, target, partial=[])
  s = partial.inject 0, :+
  # check if the partial sum equals the target
  puts "sum(#{partial})=#{target}" if s == target
  return if s >= target # if we reach the number, why bother to continue
  (0..(numbers.length - 1)).each do |i|
    n = numbers[i]
    remaining = numbers.drop(i+1)
    subset_sum(remaining, target, partial + [n])
  end
end
subset_sum([1,2,3,4,5,6,7,8,9,10],20)

Perform a find between a matrix and a vector and concatenate results - MATLAB

I have a 3D array
a = meshgrid(2500:1000:25000,2500:1000:25000,2500:1000:25000);
Usually I use a loop to execute the following logic:
k = [];
for b = 0.01:0.01:0.2
    c = find(a <= b.*0.3 & a <= b.*0.5);
    if(~isempty(c))
        for i=1:length(c)
            k = vertcat(k,a(c(i)));
        end
    end
end
How do I remove the loop and perform the action above in one line?
Of course,
b = [0.01:0.01:0.2];
c=find(a<b*.8)
is not possible.
Here is a bsxfun-based approach: create a mask for the finds and use it to index into a replicated version of the input array a to get the desired output -
vals = repmat(a,[1 1 1 numel(b)]); %// replicated version of input array
mask = bsxfun(@le,a,permute(b*0.3,[1 4 3 2])) & ...
    bsxfun(@le,a,permute(b*0.5,[1 4 3 2])); %// mask created
k = vals(mask); %// desired output in k
Please note that you would need to change the function handle used with bsxfun according to the condition you are using.

combine time series plot by using R

I want to combine three graphics on one graph. The data is the built-in R dataset "nottem". Can someone help me write code to put the seasonal mean model, the harmonic (cosine) model and the time series plot together, using different colors? I have already written the model code; I just don't know how to combine them for comparison.
Code:
library(TSA)
nottem
month.=season(nottem)
model=lm(nottem~month.-1)
summary(nottem)
har.=harmonic(nottem,1)
model1=lm(nottem~har.)
summary(model1)
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
points(y=nottem,x=time(nottem), pch=as.vector(season(nottem)))
Just put your time series inside a matrix:
x = cbind(serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)),
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)))
plot(x)
Or configure the plot region:
par(mfrow = c(2, 1)) # 2 rows, 1 column
serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
require(zoo)
plot(serie1)
lines(rollapply(serie1, width = 10, FUN = mean), col = 'red')
plot(serie2)
lines(rollapply(serie2, width = 10, FUN = mean), col = 'blue')
hope it helps.
PS.: zoo package is not needed in this example, you could use the filter function.
You can extract the seasonal mean with:
s.mean = tapply(serie, cycle(serie), mean)
# January, assuming serie is monthly data
print(s.mean[1])
This graph is pretty hard to read, because your three sets of values are so similar. Still, if you simply want to graph all of these on the same plot, you can do it pretty easily by using the coefficients generated by your models.
Step 1: Plot the raw data. This comes from your original code.
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
Step 2: Set up x-values for the mean and cosine plots.
x <- seq(1920, (1940 - 1/12), by=1/12)
Step 3: Plot the seasonal means by repeating the coefficients from the first model.
lines(x=x, y=rep(model$coefficients, 20), col="blue")
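Step 4: Calculate the y-values for the harmonic fit using the coefficients from the second model, and then plot. harmonic(nottem,1) supplies a cosine and a sine column, so model1 has three coefficients (intercept, cosine, sine).
y <- model1$coefficients[1] + model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[3] * sin(2 * pi * x)
lines(x=x, y=y, col="red")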
ggplot variant: If you decide to switch to the popular 'ggplot2' package for your plot, you would do it like so (melt comes from the 'reshape2' package):
library(ggplot2)
library(reshape2)
x <- seq(1920, (1940 - 1/12), by=1/12)
y.seas.mean <- rep(model$coefficients, 20)
# intercept + cosine term + sine term of the harmonic model
y.har.cos <- model1$coefficients[1] + model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[3] * sin(2 * pi * x)
plot_Data <- melt(data.frame(x=x, temp=as.numeric(nottem), seas.mean=y.seas.mean, har.cos=y.har.cos), id="x")
ggplot(plot_Data, aes(x=x, y=value, col=variable)) + geom_line()
