Machine Learning using Multiple Features - Text Processing - machine-learning

I have data like following:
col1 col2 col3
2 14 text, text, some text
I went through http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing but I could only find information to vectorize col3 and pass it on for classification. In my scenario, I have numerical information in col1 and col2 as well.
If without vectorizing I pass col1, 2 and 3 I get an error for col3 as it is String.
If I vectorize col3, the output is a sparse matrix. I need to add col1 and col2 to the vectorized data. How do I do that?
I am using scikit-learn.

Related

How to find most frequent value in two columns based on a condition in Google Sheets?

I'm not too savvy with programming, so bear with me. The data is of the players' names and fighters in a fighting game, with the table as below:
I'm making a leaderboard where it will show which fighter each player uses most. I found a formula that shows which fighter is used the most in general using the following formula:
=index(query({E2:E,G2:G},
"select Col1,count(Col1)
group by Col1
order by count(Col1) desc"), 2, 1)
The problem is, I don't understand how to make a formula that checks the player's name before returning the most used fighter from two columns. What I'm after is a result like so:
Thanks in advance!
Formula
=ArrayFormula(
{QUERY(QUERY({SPLIT("Wins"&"|"&UNIQUE({D2:D;F2:F}&"|"&"."),"|");SPLIT("Loss"&"|"&UNIQUE({D2:D;F2:F}&"|"&"."),"|");
SPLIT("Wins"&"|"&FILTER(D2:D&"|"&E2:E,D2:D<>""),"|");
SPLIT("Loss"&"|"&FILTER(F2:F&"|"&G2:G,F2:F<>""),"|")},
"select Col2,count(Col3)-1 where Col2!='.' group by Col2 pivot Col1"),"select Col1,Col3,Col2,Col3/Col2,Col3/(Col3+Col2) order by Col3 desc label Col1 'Name',Col2 'Loss',Col3 'Wins',Col3/Col2 'W/L',Col3/(Col3+Col2) 'Win Rate %' format Col3/Col2 '0.0',Col3/(Col3+Col2) '#.0%'"),
VLOOKUP(INDEX(QUERY({D2:D},"select Col1,count(Col1) where Col1!='' group by Col1 order by count(Col1) desc label Col1 'Name'"),0,1),QUERY({D2:E;F2:G},"select Col1,Col2,count(Col2) group by Col1,Col2 order by Count(Col2) desc label Col1 'Name',Col2 'Most Used Character'"),2,0)})
Note:
column Wins/Loss cannot be calculated when Loss=0 .
Hope you're okay with that

Efficient way to collate/aggregate specific data in Google Sheets

I'm looking for an efficient way to gather and aggregate some date in Google Sheets. I've been looking at the query function, pivot tables, and Index + Match formulas, but so far I've not found a way that brings me to the result I'm looking for. I have a set of data which looks more or less as follows.
The fields with an X represent irrelevant data which I don't want to show up in my end result. They only serve to illustrate that there are columns of data that I don't want in between the columns of data that I do want. The data in those columns is of varying types and of varying values per type, they are not actually fields with an "X" in it. Only the fields with numbers are of interest along with the related names at the top and left of those. The intent is to create a list that looks more or less like this.
I've highlighted those yellow fields because that data has been aggregated. For example, in the original file field D3 shows a relation between Laura and Pete with the number 1, and field L3 also shows a relation between Laura and Pete, so the number in that field is to be added to the number in the other field resulting in an aggregated total of 2 for that particular combination.
I would really appreciate any suggestions that can help me get to an elegant and efficient solution for this. The only solutions I can come up with would involve multiple "in-between" sheets and there just has to be a better way.
UPDATE:
Solved by applying the solution in player0's answer. I just had to switch around the order of Col1 and Col2 in the formula to get the table sorted the way I needed it. Formula looks like below now. Many thanks to both player0 and Erik Tyler for their efforts.
=INDEX(QUERY(SPLIT(FLATTEN(A2:A&"×"&D1:N1&"×"&D2:N), "×"),
"select Col2,Col1,sum(Col3)
where Col2 is not null
and Col3 is not null
group by Col2,Col1
label sum(Col3)''", ))
try:
=INDEX(QUERY(SPLIT(FLATTEN(A2:A&"×"&D1:N1&"×"&D2:N), "×"),
"where Col3 is not null and Col2 is not null", ))
update:
=INDEX(QUERY(SPLIT(FLATTEN(A2:A&"×"&D1:N1&"×"&D2:N), "×"),
"select Col1,Col2,sum(Col3)
where Col3 is not null
and Col2 is not null
group by Col1,Col2
label sum(Col3)''", ))
Given your current data set (which only appears to extend to Col N), place the following somewhere to the right of Col N:
=ArrayFormula(SPLIT(TRANSPOSE(QUERY(TRANSPOSE(QUERY(SPLIT(QUERY(FLATTEN(FILTER(IF(NOT(ISNUMBER(D2:N)),,D1:N1&"~ "&A2:A&"|"&D2:N),A2:A<>"")),"Select * WHERE Col1 Is Not Null"),"|"),"Select Col1, SUM(Col2) GROUP BY Col1 LABEL SUM(Col2) ''")&"~ "),,2)),"~ ",0,1))
It would be better if this were placed in a different sheet from the original data. Supposing that your original data sheet is named Sheet1, place the following version of the above formula into a new sheet:
=ArrayFormula(SPLIT(TRANSPOSE(QUERY(TRANSPOSE(QUERY(SPLIT(QUERY(FLATTEN(FILTER(IF(NOT(ISNUMBER(INDIRECT("Sheet1!D2:"&ROWS(Sheet1!A:A)))),,Sheet1!D1:1&"~ "&Sheet1!A2:A&"|"&INDIRECT("Sheet1!D2:"&ROWS(Sheet1!A2:A))),Sheet1!A2:A<>"")),"Select * WHERE Col1 Is Not Null"),"|"),"Select Col1, SUM(Col2) GROUP BY Col1 LABEL SUM(Col2) ''")&"~ "),,2)),"~ ",0,1))
This separate-sheet approach and formula allows for the original data to extend indefinitely past Col N.

How to round up the values to 1 decimal place using google sheet query?

Hi everyone,
I have a set of raw data, I use query function in cell C2 to order and count the raw data. May I know how to include the ROUND function in the QUERY so that the output in column C will be only 1 decimal place. The reason I'm doing this is to reduce the number of bars in the bar chart. As you can see in the chart, 50.79 and 50.8 are considered as 2 bars, but it will be more presentable if I combine them together by rounding up 50.79
*Preferably doing the rounding in QUERY instead of creating another column
This is my spreadsheet:
https://docs.google.com/spreadsheets/d/12enDKh4hDE67XyvA-21_0CeVxNojzMmNWGjRZfCFmQE/edit#gid=0
Any help will be greatly appreciated!
I have added two sheets ("Erik Help" and "Erik Help 2").
Yours is a case where pre-processing the QUERY data will be beneficial. Note that doing so creates a virtual array that is no longer able to be referenced by column letter; rather, Colx notation is required in the Select clause.
The formula in "Erik Help" produces exactly the results you requested in your post:
=query(FILTER(ROUND(A2:A,1),A2:A<>""),"select Col1, count(Col1) where Col1 is not null group by Col1 order by Col1 asc label Col1 'Values', count(Col1) 'Count'")
The formula in "Erik Help 2" refines the data by rounding every number in A3:A to the nearest 0.5 (which you could change to 0.2 or 0.25 or whatever you like). You can use this option depending on how discrete you need your results to be:
=query(FILTER(MROUND(A2:A,0.5),A2:A<>""),"select Col1, count(Col1) where Col1 is not null group by Col1 order by Col1 asc label Col1 'Values', count(Col1) 'Count'")

Filter IMPORTHTML data

When I import data, it comes in this format (image 1), with blank spaces. I would like to know if there is any way to adjust so that these blanks disappear, the two models expected (image 2 and 3) if there was any way to reach them would be important to me.
Remembering that all dates have / and all times have :
I tried to filter from QUERY, but when trying to "Select Col1, Col2, Col4 Where Col2 is not null" the dates disappear and only the times remain, I tried via REGEXMATCH to separate the dates from the times using / and : but also I was not successful.
I also tried it via IMPORTXML, but some data ends up not being imported correctly on some pages of the site, for IMPORTHTML these errors do not happen. The XML's I used were:
"//tr[#class='no-date-repetition-new' and ..//td[#class='team team-a']] | //tr[#class='no-date-repetition-new live-now' and ..//td[#class='team team-a']]"
"//td[#class='team team-a']/a | //td[#class='team team-a strong']/a"
The current formula is as follows:
=IMPORTHTML("https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/","table",1)
IMPORTHTML Original:
Expected formats:
---
Rather than filtering what you need is to restructure the imported data.
Anyway, I think that the easier solution to get the final result is to use multiple IMPORTXML formulas.
URL
A1: https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/
Headers
A2: //table[contains(#class,'matches')]/thead/tr/th
Day
A3: //td[contains(#class,'date')]/parent::tr
Teams and Score
A4: //td[contains(#class,'team-a')]/parent::tr
A6: =transpose(IMPORTXML($A$1,A2))
A7: =IMPORTXML($A$1,A3)
B7: =IMPORTXML(A1,A4)
You might want to replace the formula on A6 by static values in order to place them properly.
You can join 2 queries together (one next to the other) in a single formula, to get your results
={QUERY(IMPORTHTML("https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/","table",1),
"select Col1 where Col2 is null and not Col1 contains '*'",1),
QUERY(IMPORTHTML("https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/","table",1),
"select Col1, Col2, Col3, Col4 where Col2 is not null label Col1 'Time'",1)}
How the formula works:
As you notice the data part of both queries is the same in both of them. What is actually different is "what we ask for from the query"
In the first one we use "select Col1 where Col2 is null and not Col1 contains '*'"
In the second one "select Col1, Col2, Col3, Col4 where Col2 is not null label Col1 'Time'"
We create an array by joining them together as in ={1stQUERY,2ndQUERY}

feeding categorical data to classifier

Suppose I have the dataset in the following format:
col1 col2 col3 col4 col5 (to be predicted)
12 13 4 primary 12
1 15 2 secondary 13
5 7 8 primary 18
14 12 44 college 6
col5 needs to be predicted for some test data using col1, col2, col3 and col4
During training, col1, col2, col3 can be feeded as such in an array to the classifier but how to feed col4.
I am aware that this is categorical and need to be converted to numeric type, but even after assigning some number, it will still remain as nominal type.
So if primary=1, secondary=2 and college=3, the numbers 1,2 and 3 cant be compared as per their magnitude because they are still like labels, with no numerical significance.
So how should I proceed after this step... should they be normalized ? or any further should be done ?
You should use One Hot Encoding in such cases. Every possible categorial value creates new binary feature.
One Hot Encoding for Machine learning

Resources