I am trying to clean up my formulas.
I favor using FILTER in my formulas. FILTER returns #N/A when it cannot find any matches, and COUNTA includes #N/A errors in its count. Take this table with the following formulas:
A
foo
bar
baz
=COUNTA(FILTER(A1:A3, A1:A3 = "foo"))
=COUNTA(FILTER(A1:A3, A1:A3 = "bar"))
=COUNTA(FILTER(A1:A3, A1:A3 = "baz"))
=COUNTA(FILTER(A1:A3, A1:A3 = "Gabriel"))
=COUNTA(FILTER(A1:A3, A1:A3 = "bog"))
=COUNTA(FILTER(A1:A3, A1:A3 = "nit"))
=COUNTA(FILTER(A1:A3, A1:A3 = "bug"))
All of the formulas above return 1, even when FILTER doesn't find a match! In that case the value is 1 because COUNTA is counting the #N/A error.
The only workaround I have found is something like this:
=IF(IFERROR(FILTER(A1:A3, A1:A3 = "bog"), -1) = -1, 0, COUNTA(FILTER(A1:A3, A1:A3 = "bog")))
This more than doubles the length of each formula I use this method on. In Excel I would just use LET, but I need to use Google Sheets.
The closest I got to a solution is using COUNTIF
=COUNTIF(FILTER(A1:A3, A1:A3 = "foo"), NA())
This returns the number of #N/As in the list, which would be 1, but I need something like
=COUNTIF(FILTER(A1:A3, A1:A3 = "foo"), "<>"&NA())
which doesn't work. Oddly enough, it does exactly the same thing as the previous formula.
You can wrap the FILTER in IFNA() so that a failed match yields an empty cell, which COUNTA() doesn't count:
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "foo"),))
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "bar"),))
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "baz"),))
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "Gabriel"),))
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "bog"),))
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "nit"),))
=COUNTA(ifna(FILTER(A1:A3, A1:A3 = "bug"),))
I am trying to train a basic SVM model for multiclass text classification in Julia. My dataset has around 75K rows and 2 columns (text and label). It consists of abstracts of scientific papers gathered from PubMed. I have 10 labels in the dataset.
The dataset looks like this:
I keep receiving two different MethodErrors. The first one is:
ERROR: MethodError: no method matching DocumentTermMatrix(::Vector{String})
I have tried:
convert(Array,data[:,:text])
and also:
convert(Matrix,data[:,:text])
Array conversion gives the same error, and matrix conversion gives:
ERROR: MethodError: no method matching (Matrix)(::Vector{String})
My code is:
using DataFrames, CSV, StatsBase,Printf, LIBSVM, TextAnalysis, Random
function ReadData(data)
df = CSV.read(data, DataFrame)
return df
end
function splitdf(df, pct)
@assert 0 <= pct <= 1
ids = collect(axes(df, 1))
shuffle!(ids)
sel = ids .<= nrow(df) .* pct
return view(df,sel, :), view(df, .!sel, :)
end
function Feature_Extract(data)
Text = convert(Array,data[:,:text])
m = DocumentTermMatrix(Text)
X = tfidf(m)
return X
end
function Classify(data)
data = ReadData(data)
train, test = splitdf(data, 0.5)
ytrain = train.label
ytest = test.label
Xtrain = Feature_Extract(train)
Xtest = Feature_Extract(test)
model = svmtrain(Xtrain, ytrain)
ŷ, decision_values = svmpredict(model, Xtest);
@printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end
data = "data/composite_data.csv"
@time Classify(data)
I would appreciate your help in solving this problem.
EDIT:
I have managed to build the corpus, but I am now facing a DimensionMismatch error:
using DataFrames, CSV, StatsBase,Printf, LIBSVM, TextAnalysis, Random
function ReadData(data)
df = CSV.read(data, DataFrame)
#count = countmap(df.label)
#println(count)
#amt,lesslabel = findmin(count)
#println(amt, lesslabel)
#println(first(df,5))
return df
end
function splitdf(df, pct)
@assert 0 <= pct <= 1
ids = collect(axes(df, 1))
shuffle!(ids)
sel = ids .<= nrow(df) .* pct
return view(df,sel, :), view(df, .!sel, :)
end
function Feature_Extract(data)
crps = Corpus(StringDocument.(data.text))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X = tf_idf(m)
return X
end
function Classify(data)
data = ReadData(data)
#println(labels)
#println(first(instances))
train, test = splitdf(data, 0.5)
ytrain = train.label
ytest = test.label
Xtrain = Feature_Extract(train)
Xtest = Feature_Extract(test)
model = svmtrain(Xtrain, ytrain)
ŷ, decision_values = svmpredict(model, Xtest);
@printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end
data = "data/composite_data.csv"
@time Classify(data)
Error:
ERROR: DimensionMismatch("Size of second dimension of training instance\n matrix (247317) does not match length of\n labels (38263)")
(Copying Bogumił Kamiński's solution from the comments, as a community wiki answer, for better visibility.)
The argument to DocumentTermMatrix should be of type Corpus, as in this example.
A Corpus can be created with:
Corpus(StringDocument.(data.text))
There's a DimensionMismatch error after that, caused by a mismatch between what tf_idf returns and what svmtrain expects. tf_idf's return value has one row per document, whereas svmtrain expects one column per document, i.e. it treats each column as one X value (one training instance). So, applying permutedims to the result before passing it to svmtrain resolves this mismatch.
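The row-versus-column mismatch described above can be illustrated outside Julia. The following is only a rough sketch in plain Python with made-up toy data; `transpose` and `check_column_layout` are hypothetical helpers written for this illustration, not part of TextAnalysis or LIBSVM. It shows why a matrix with one row per document fails a "one column per instance" check until it is transposed.

```python
# Hypothetical toy data: 3 documents, 4 vocabulary terms.
# One row per document, as a document-term matrix is usually laid out.
doc_rows = [
    [0.1, 0.0, 0.3, 0.0],
    [0.0, 0.2, 0.0, 0.5],
    [0.4, 0.0, 0.0, 0.1],
]
labels = ["a", "b", "a"]


def transpose(matrix):
    """Swap rows and columns (the role permutedims plays in the answer)."""
    return [list(column) for column in zip(*matrix)]


def check_column_layout(features, labels):
    """Mimic a trainer that expects one column per training instance."""
    n_columns = len(features[0])
    if n_columns != len(labels):
        raise ValueError(
            f"second dimension ({n_columns}) does not match "
            f"number of labels ({len(labels)})"
        )
    return n_columns


# After transposing, each column holds one document.
doc_columns = transpose(doc_rows)
```

Calling `check_column_layout(doc_rows, labels)` raises, because the second dimension is the vocabulary size (4) rather than the number of documents (3); `check_column_layout(doc_columns, labels)` passes.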
I am trying to sort a table 'array' that contains tables with two keys called 'pt' and 'angle'. I want to sort the 'array' elements by their 'angle' value. As I understand table.sort, this code snippet should do the trick:
local array = {}
-- Some code that calls
-- table.insert(array, {pt = somePt, angle = someAngle})
-- multiple times
local sorted_table = table.sort(array, function(a,b) return a.angle < b.angle end)
However, sorted_table is always nil. Am I doing something wrong here?
table.sort sorts the array part of a table in place. It does not return a new array. If you need to keep the original, you first need to copy to a temporary array.
So, try something like this:
table.sort(array,function(a,b) return a.angle < b.angle end)
table.sort sorts the table in place; that is, it changes the table that you give it and doesn't return a new one.
If you want a sorted copy, you'd first have to make a copy of the table yourself, then sort it.
This could look like this:
local function sorted_copy(tab, func)
local tab = {table.unpack(tab)}
table.sort(tab, func)
return tab
end
That will create a copy of the table's array part (the numeric indices from 1 up to #tab, which is only well defined when the sequence has no holes) and sort it.
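The same distinction between sorting in place and sorting a copy exists in other languages. As a loose analogy only (this is Python, not Lua, and the sample data is made up), `list.sort()` behaves like `table.sort` in that it returns nothing useful, while `sorted()` plays the role of the copy-then-sort helper:

```python
# Made-up sample data mirroring the {pt = ..., angle = ...} entries.
points = [
    {"pt": 1, "angle": 2},
    {"pt": 4, "angle": 9},
    {"pt": 2, "angle": 1},
]

# list.sort() sorts in place and returns None, much like table.sort.
working = points.copy()
returned = working.sort(key=lambda p: p["angle"])

# sorted() leaves its input untouched and returns a new, sorted list.
sorted_copy = sorted(points, key=lambda p: p["angle"])
```

Here `returned` is `None`, which corresponds to `sorted_table` being nil in the question, while `working` and `sorted_copy` are ordered by angle and `points` keeps its original order.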
There is no secret; there is a well-known algorithm for this kind of sorting, quicksort, which has been used many times:
function quicksort(t, sortname, start, endi)
start, endi = start or 1, endi or #t
sortname = sortname or 1
if(endi - start < 1) then return t end
local pivot = start
for i = start + 1, endi do
if t[i][sortname] <= t[pivot][sortname] then
local temp = t[pivot + 1]
t[pivot + 1] = t[pivot]
if(i == pivot + 1) then
t[pivot] = temp
else
t[pivot] = t[i]
t[i] = temp
end
pivot = pivot + 1
end
end
t = quicksort(t, sortname, start, pivot - 1)
return quicksort(t, sortname, pivot + 1, endi)
end
local array = {}
table.insert(array, {pt = 1, angle = 2})
table.insert(array, {pt = 4, angle = 9})
table.insert(array, {pt = 1, angle = 5})
table.insert(array, {pt = 2, angle = 7})
table.insert(array, {pt = 2, angle = 1})
table.insert(array, {pt = 5, angle = 2})
local s_t = quicksort(array, "angle")
for k,v in ipairs(s_t) do
print(k, "v=", v.pt, v.angle)
end
Output:
1 v= 2 1
2 v= 5 2
3 v= 1 2
4 v= 1 5
5 v= 2 7
6 v= 4 9
I have the following Cypher:
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv ORDER BY hv.createDate DESC
WITH count(hv) as count, count(hv) / {maxResults} as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
Right now:
count(hv) = 260
{maxResults} = 100
The step variable equals 2 but I expect round(260/100) = 3
I also tried round(count(hv) / {maxResults}) as step, but step is still 2.
How can I fix my query to get proper rounding (3 as the step variable in this particular case)?
Use toFloat() in one of the values:
return round(toFloat(260) / 100)
Output:
╒═══════════════════════════╕
│"round(toFloat(260) / 100)"│
╞═══════════════════════════╡
│3 │
└───────────────────────────┘
You're currently doing integer division. If you enter return 260/100 you'll get 2, and that's the value that gets rounded (though there's nothing to round, so you get 2 back).
You need to be working with floating point values. You can do this by giving maxResults an explicit decimal (100.0), or by using toFloat() around either maxResults or the count. Any of return 260/100.0, return toFloat(260)/100, or return 260/toFloat(100) will result in 2.6. If you round() that, you'll get your expected value of 3.
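The difference between the two divisions is easy to reproduce in any language that separates integer and floating point division. The sketch below uses Python rather than Cypher, with the numbers from the question; `sample_indices` is a hypothetical helper that imitates the RANGE/CASE sampling in the query. Note that Python's round() resolves exact .5 ties to the nearest even integer, which does not matter for 2.6.

```python
count = 260
max_results = 100

# Integer division truncates, so there is nothing left to round.
integer_step = count // max_results

# Floating point division yields 2.6, which rounds to 3.
rounded_step = round(count / max_results)


def sample_indices(count, step):
    """Indices 0, step, 2*step, ... below count; a zero step falls back to 1."""
    return list(range(0, count, step if step != 0 else 1))
```

With `rounded_step`, `sample_indices(260, rounded_step)` picks every third history value, whereas `integer_step` would pick every second one.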
My SPSS syntax below does not produce the intended results. Even when reason is equal to 15 or 16, it sets ped.crossing.with.signal to 1.
COMPUTE ped.crossing.with.signal = 0.
IF ((ped.action = 1 | ped.action = 2) AND (reason ~= 15 | reason ~= 16)) ped.crossing.with.signal = 1.
EXECUTE.
When I do it like this, it works... but why?
COMPUTE ped.crossing.with.signal = 0.
IF (ped.action = 1 | ped.action = 2) ped.crossing.with.signal = 1.
IF (reason = 15 | reason = 16) ped.crossing.with.signal = 0.
EXECUTE.
It's not a wonky result, but rather the correct application of Boolean algebra, as explained by De Morgan's laws.
The expression (reason ~= 15 | reason ~= 16) is equivalent to ~(reason = 15 & reason = 16), which in this case can never evaluate to false (because a single variable can't hold two values at once). Logically, the correct expression to use would be (reason ~= 15 & reason ~= 16) or ~(reason = 15 | reason = 16), although, as already pointed out, using the ANY function is more straightforward.
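The equivalence can be checked by brute force. This is a small sketch in Python rather than SPSS syntax; the function names `flawed`, `intended`, and `flag` are made up for the illustration.

```python
def flawed(reason):
    # reason != 15 OR reason != 16: at least one side is always true.
    return reason != 15 or reason != 16


def intended(reason):
    # reason != 15 AND reason != 16, i.e. not (reason == 15 or reason == 16).
    return reason != 15 and reason != 16


def flag(ped_action, reason, reason_ok):
    """Return 1 when the action is 1 or 2 and the reason test passes, else 0."""
    return int(ped_action in (1, 2) and reason_ok(reason))
```

For example, `flag(1, 15, flawed)` gives 1 (the behaviour reported in the question), while `flag(1, 15, intended)` gives 0.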
Your syntax looks good, but logically it does not do what you intend, as Jay points out in his answer.
Also, you can simplify the syntax as follows to avoid complicated Boolean negations:
COMPUTE ped.crossing.with.signal = ANY(ped.action, 1, 2) AND ANY(reason, 15, 16)=0.
Using a single COMPUTE command in this way should make your processing faster, and the code more parsimonious as well.
Sometimes, evaluating a boolean expression in a model doesn't return a concrete boolean value even when the expression clearly has a concrete value. I've been able to reduce this to cases involving array expressions such as this test case:
from z3 import *
x = K(IntSort(), 0)
s = Solver()
s.check()
m = s.model()
print(m.evaluate(x == Store(x, 0, 1), model_completion=True))
I would expect this to print False, but instead it prints K(Int, 0) == Store(K(Int, 0), 0, 1). Other examples produce similar results. Replacing the first line with x = Array('x', IntSort(), IntSort()) gives the same result (though that's up to the default interpretation). Interestingly, replacing the first line with x = Store(K(IntSort(), 0), 0, 1) causes the example to print True.
Is there a way to force Z3 to evaluate such expressions to concrete values?