Suppose I have the dataset in the following format:
col1 col2 col3 col4 col5 (to be predicted)
12 13 4 primary 12
1 15 2 secondary 13
5 7 8 primary 18
14 12 44 college 6
col5 needs to be predicted for some test data using col1, col2, col3 and col4.
During training, col1, col2 and col3 can be fed to the classifier as they are, in an array, but how should col4 be fed?
I am aware that it is categorical and needs to be converted to a numeric type, but even after assigning numbers it will still be nominal.
So if primary=1, secondary=2 and college=3, the numbers 1, 2 and 3 can't be compared by magnitude, because they are still just labels with no numerical significance.
So how should I proceed after this step? Should the values be normalized, or is some further processing needed?
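You should use one-hot encoding in such cases. Each possible categorical value becomes a new binary feature, so col4 turns into three 0/1 columns (primary, secondary, college) with exactly one of them set to 1 in each row.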
One Hot Encoding for Machine learning
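As a rough illustration of what one-hot encoding does to col4, here is a minimal pure-Python sketch using the sample rows from the question. The helper name `one_hot_encode` is made up for this example; in practice a library encoder such as scikit-learn's `OneHotEncoder` or `pandas.get_dummies` would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category.

    Hypothetical helper written for this illustration only.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows


# Sample data from the question: col1..col3 are numeric, col4 is categorical.
numeric = [[12, 13, 4], [1, 15, 2], [5, 7, 8], [14, 12, 44]]
col4 = ["primary", "secondary", "primary", "college"]

categories, encoded = one_hot_encode(col4)

# Append the binary columns to the numeric ones to get the feature matrix.
features = [nums + flags for nums, flags in zip(numeric, encoded)]

print(categories)
for row in features:
    print(row)
```

Because each category gets its own column, no ordering between primary, secondary and college is implied, which is the concern raised in the question.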
I'm trying to query/filter rows from a dataset structured like this:
Creator    Title      Barcode   Inv. No.
----------------------------------------
springer   Cellbio    014678    POL02P14x
springer   Cellbio    026938    POL02P26r
springer   Cellbio    038745
nature     Cellular   026672    POL02P26h
elsevier   Biomed     026678    POL02P26g
elsevier   Biomed     026678    POL02P26g
spring     Cellbit              POL02P147
spring     Cellbit    026938    POL02P26j
spring     Cellbit    038745
I need to return all rows where the value in column B (Title) is duplicated and, within those duplicate rows, at least one value in column C (Barcode) starts with 014 and at least one starts with 026. If that criterion is not met in column C, the next check is a similar one on column D (Inv. No.): at least one value starts with POL02P14 and at least one starts with POL02P26.
So the basic logic would be something like this:
Select all rows where B is duplicated and
((at least one value in C starts with x and one with y) or (at least one value in D starts with z and one with w)).
So the desired output should be like this:
Creator    Title     Barcode   Inv. No.
---------------------------------------
springer   Cellbio   014678    POL02P14x
springer   Cellbio   026938    POL02P26r
springer   Cellbio   038745
spring     Cellbit             POL02P147
spring     Cellbit   026938    POL02P26j
spring     Cellbit   038745
Here is a sample spreadsheet more similar to the actual dataset which is fairly large:
https://docs.google.com/spreadsheets/d/1xj5LnOxIwEmcjnXD0trmvcCKJIGIcfDkARV80Hx5Fvc/edit?usp=sharing
I tried adapting formulas with similar logic, but I always get errors or unexpected results: either the query logic/syntax is wrong, or there is a filter/array dimension mismatch.
Some examples (the column references are mixed up here because I was trying to reduce the number of columns):
=FILTER(query(list!A1:AR, "Select * where C starts with 'POL02P'"), list!B1:B<>"",COUNTIF(list!B1:B,list!B1:B)>1)
={results!A1:AR1;array_constrain(
query(
{Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26"));
countif(index(Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26")),0,45),
index(Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26")),0,45))}
,"Select * where Col46>1")
,9^9,44)}
=query(FILTER({list!A2:A&list!J2:J,list!A2:J,
iferror(
vlookup(list!A2:A&list!J2:J,query(query(filter(list!A2:A&
list!J2:J,REGEXMATCH(list!C2:C, "^POL02P14|POL02P26")),
"select Col4, count(Col4) where Col4 <> '' group by Col4"),
"select Col4 where Col2 >1 "),1,false))},REGEXMATCH(list!C2:C, "^POL02P14|POL02P26")),
"select Col1, Col2, Col3, Col5, Col6, Col7, Col8, Col9, Col10, Col11 where Col12 <> ''
order by Col3 asc, Col11 asc")
Please try this out in your sample sheet:
={results!A1:AR1;FILTER(results!A2:AR,REGEXMATCH(results!B2:B,JOIN("|","^"&LAMBDA(z,LAMBDA(x,y,z,{filter(filter(x,y="014"),xmatch(filter(x,y="014"),filter(x,y="026")));filter(filter(x,z="POL02P14"),xmatch(filter(x,z="POL02P14"),filter(x,z="POL02P26")))})(INDEX(z,,1),INDEX(z,,2),INDEX(z,,3)))((UNIQUE(FILTER({results!B2:B,LEFT(results!C2:C,3),LEFT(results!D2:D,8)},results!B2:B<>"",results!D2:D<>""))))&"$")))}
Formula logic at a glance:
1. Filter Col_B (Title) in 4 ways (matches to 014, 026, POL02P14, POL02P26).
2. Capture the Col_B values that have both 014 and 026.
3. Capture the Col_B values that have both POL02P14 and POL02P26.
4. Shortlist the Col_B values that are TRUE for either step 2 OR step 3 above.
5. Once the list is finalised, join them all for REGEXMATCH with Col_B for the final output.
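For readers who find the nested formula hard to follow, the selection logic can be sketched in Python. This follows the criteria as stated in the question (duplicate title, plus both prefixes present in either the Barcode or the Inv. No. column); it is not a line-by-line translation of the Sheets formula, and the function names are invented for the example.

```python
from collections import defaultdict


def has_both(values, prefix_a, prefix_b):
    """True if at least one value starts with each of the two prefixes."""
    return (any(v.startswith(prefix_a) for v in values)
            and any(v.startswith(prefix_b) for v in values))


def select_rows(rows):
    """Keep rows whose title is duplicated and meets either prefix criterion."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)  # group by Title (column B)

    keep = set()
    for title, members in groups.items():
        if len(members) < 2:  # title is not duplicated
            continue
        barcodes = [r[2] for r in members]
        inv_nos = [r[3] for r in members]
        if (has_both(barcodes, "014", "026")
                or has_both(inv_nos, "POL02P14", "POL02P26")):
            keep.add(title)

    return [row for row in rows if row[1] in keep]


# Sample rows from the question; blank cells are empty strings.
rows = [
    ("springer", "Cellbio", "014678", "POL02P14x"),
    ("springer", "Cellbio", "026938", "POL02P26r"),
    ("springer", "Cellbio", "038745", ""),
    ("nature", "Cellular", "026672", "POL02P26h"),
    ("elsevier", "Biomed", "026678", "POL02P26g"),
    ("elsevier", "Biomed", "026678", "POL02P26g"),
    ("spring", "Cellbit", "", "POL02P147"),
    ("spring", "Cellbit", "026938", "POL02P26j"),
    ("spring", "Cellbit", "038745", ""),
]

selected = select_rows(rows)
for row in selected:
    print(row)
```

On the sample data this keeps the three Cellbio rows (matched through the barcodes) and the three Cellbit rows (matched through the inventory numbers), which is the desired output listed in the question.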
I have to work with a large table in Google Sheets containing, roughly, weights associated with a day (and other superfluous data).
It looks like this:
Date       Weight 1   Weight 2   Weight 3
-----------------------------------------
01/01/22   20         22         21
01/02/22   19         25
A date, and multiple weights associated.
Ideally, I would need an intermediate table that includes all the columns of the initial table, but with one weight per row.
Like this one for example:
Date       Weight
-----------------
01/01/22   20
01/01/22   22
01/01/22   21
01/02/22   19
01/02/22   25
I tried several methods to filter this table and retrieve all the weights for each date independently.
Index/match, filter, query...
I couldn't get what I needed.
Do you know if there is a formula that would allow me to obtain this second table?
try:
=ARRAYFORMULA(QUERY(SPLIT(FLATTEN(if(len(B2:D)*len(A2:A),A2:A22&"|"&B2:D22,)),"|"),"Select * Where Col2 is not null"))
try:
=ARRAYFORMULA(QUERY(SPLIT(FLATTEN(A2:A&"|"&B2:D), "|"), "where Col2 is not null", ))
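Both formulas do the same thing: pair the date with every weight cell, flatten the pairs into a single column, and drop the pairs whose weight is blank. A minimal Python sketch of that unpivot step, using the sample table from the question (the function name `unpivot` and the use of `None` for blank cells are choices made for this example):

```python
def unpivot(rows):
    """Turn (date, w1, w2, ...) rows into one (date, weight) row per weight."""
    result = []
    for date, *weights in rows:
        for weight in weights:
            if weight is not None:  # skip blank weight cells
                result.append((date, weight))
    return result


# Sample table from the question; None marks the empty "Weight 3" cell.
table = [
    ("01/01/22", 20, 22, 21),
    ("01/02/22", 19, 25, None),
]

long_table = unpivot(table)
for row in long_table:
    print(row)
```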
I have a list of values in Google Sheets for example:
10
14
36
43
64
110
92
103
and I want to count how many values fall into each of the ranges
0-20, 21-40, 41-80, 81-120
so that it outputs
2
1
2
3
(two values in the range 0-20, one value in the 21-40 range, two values in the 41-80 range, and three values in the 81-120 range.)
You can do it in one step with the Frequency function FREQUENCY(data, classes):
=frequency(A2:A10,{20,40,80,120})
Note that Frequency creates one count per class, plus an extra count for values which exceed the highest class value. You can suppress this if you want to, but it could be a useful check for outliers.
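To make the counting rule concrete, here is a small Python sketch that mimics the behaviour described above: each value is counted in the first class whose upper bound is greater than or equal to it, and anything above the highest bound lands in an extra final bucket. The function is an illustration written for this answer, not the Sheets implementation, and it assumes the class bounds are sorted in ascending order.

```python
from bisect import bisect_left


def frequency(data, classes):
    """Count values per class; classes are ascending upper bounds (inclusive).

    The returned list has len(classes) + 1 entries; the last one counts
    values greater than the highest class bound.
    """
    counts = [0] * (len(classes) + 1)
    for value in data:
        # Index of the first bound that is >= value.
        counts[bisect_left(classes, value)] += 1
    return counts


values = [10, 14, 36, 43, 64, 110, 92, 103]
counts = frequency(values, [20, 40, 80, 120])
print(counts)
```

For the sample list the first four counts are 2, 1, 2 and 3, matching the expected output in the question, and the extra overflow bucket is 0 because no value exceeds 120.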
=QUERY(ARRAYFORMULA({A1:A, IF(LEN(A1:A),
IFERROR(VLOOKUP(A1:A, {{0, "0-20" };
{21, "21-40" };
{41, "41-80" };
{81, "81-120" }}, 2), ),)}),
"select Col2, count(Col2)
where Col2 !=''
group by Col2
label count(Col2)''")
alternatives: https://webapps.stackexchange.com/a/123741/186471
I have data like following:
col1 col2 col3
2 14 text, text, some text
I went through http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing but I could only find information to vectorize col3 and pass it on for classification. In my scenario, I have numerical information in col1 and col2 as well.
If I pass col1, col2 and col3 without vectorizing, I get an error for col3 because it is a string.
If I vectorize col3, the output is a sparse matrix. I need to add col1 and col2 to the vectorized data. How do I do that?
I am using scikit-learn.
I am trying to get the sum of two columns using the QUERY function in Google Sheets.
Col1 Col2 Col3
-----------------------
12 User1
23 44 creature
55 User1
14 User1
This works fine if there is at least one number in each column:
=QUERY(IMPORTRANGE('SomeURL';"Page!A1:C");
"select (sum(Col1) + sum(Col2)) where Col3 = 'User1'")
However, this query causes the error QUERY:AVG_SUM_ONLY_NUMERIC if all cells in one column are empty in the result set.
Col1 Col2 Col3
-----------------------
12 User1
23 44 creature
User1
14 User1
How can I get the sum of the columns using the QUERY function if the cells in one of the columns are sometimes empty?
=ARRAYFORMULA(SUM(query(IMPORTRANGE("url","page!A1:C6"),"select Col1,Col2 where Col3 = 'User1'")))
This should work for a simple sum. But I don't think there's a way inside QUERY to treat blanks as zero or as numbers. If you can actually import the range into the sheet (i.e., use it as helper columns), then you can use ARRAYFORMULA(Query ({filter (A1:B6*1,NOT(ISEMAIL(A1:A6))),C1:C6}, "select *.... You should either convert the blanks outside QUERY (with *1) or sum them outside QUERY. Or use a double QUERY and a double IMPORTRANGE, which would hurt performance.
You can SUM the result of the QUERY, like this:
=SUM(QUERY(IMPORTRANGE("SomeURL";"Page!A1:C"); "select Col1, Col2 where Col3 = 'User1'"))
Or use SUMIF() twice, once for each column. That means two IMPORTRANGE calls, though, so it will probably be slower.
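The idea shared by these answers is to select the matching rows first and do the summing outside the query, so that blank cells simply contribute nothing. A minimal Python sketch of that idea follows. The column positions of the numbers are an assumption, because the alignment of the sample table in the question was lost; blank cells are represented as `None`, and the function name is made up for the example.

```python
def sum_for_user(rows, user):
    """Sum Col1 and Col2 over rows whose Col3 equals user; blanks count as 0."""
    total = 0
    for col1, col2, col3 in rows:
        if col3 == user:
            total += (col1 or 0) + (col2 or 0)
    return total


# Second sample table from the question. Placing 12 and 14 in Col1 is an
# assumption; either way, one column is entirely blank for User1.
rows = [
    (12, None, "User1"),
    (23, 44, "creature"),
    (None, None, "User1"),
    (14, None, "User1"),
]

total = sum_for_user(rows, "User1")
print(total)
```

With these rows the total for User1 is 26, even though one of the two columns holds no numbers at all for that user.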