Should we compare null value with known value? - machine-learning

I have a binary classification problem and need to prepare the data for model training. There are two classes, duplicate and non-duplicate. Assume two records in the data look like this:
Id   Name   Phone   Email     City
------------------------------------
A1   Mick   12345   m#m.com   London
A2   Mick   12345   null      London
It seems that these two records are duplicates. I need to merge them into one record and assign each feature a binary value: 1 if the two values match, 0 otherwise, as follows:
Id1   Id2   Name   Phone   Email   City   Label
------------------------------------------------
A1    A2    1      1       ?       1      1
As the first table shows, we have a missing value for the email in the second row. I know I cannot compare a known value with a missing one. The question is, what is the best practice in this case?
Note: The number of missing values is high in my dataset, and I cannot drop them.
I tried to put 0, but I know it introduces bias in the dataset.

You can drop the records with the null values. To do this, use the pandas dropna() method.
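
For illustration only, here is a minimal sketch of how the pairwise comparison and dropna() could fit together. The compare_pair helper is hypothetical, and encoding a comparison against a missing value as NA (rather than 0) is an assumption made for this sketch, not something stated in the question or the answer. Note that with many missing values, as the question describes, dropna() may discard most pairs.

```python
import pandas as pd

# The two example records from the question, indexed by Id.
records = pd.DataFrame(
    {
        "Id": ["A1", "A2"],
        "Name": ["Mick", "Mick"],
        "Phone": ["12345", "12345"],
        "Email": ["m#m.com", None],
        "City": ["London", "London"],
    }
).set_index("Id")


def compare_pair(left, right, fields):
    """Hypothetical helper: 1 for a match, 0 for a mismatch, NA if either side is missing."""
    row = {}
    for field in fields:
        a, b = left[field], right[field]
        if pd.isna(a) or pd.isna(b):
            row[field] = pd.NA  # assumption: keep "unknown" distinct from "mismatch"
        else:
            row[field] = int(a == b)
    return row


fields = ["Name", "Phone", "Email", "City"]
pair = compare_pair(records.loc["A1"], records.loc["A2"], fields)
features = pd.DataFrame(
    [pair],
    index=pd.MultiIndex.from_tuples([("A1", "A2")], names=["Id1", "Id2"]),
)

# The answer's suggestion applied to the pair features: drop incomplete rows.
complete_only = features.dropna()
```

Here the A1/A2 pair has an NA in the Email column, so dropna() removes it and complete_only is empty.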

Related

Identifying "mismatch duplicates" with a Google Sheets formula

Not sure how to describe this one, so apologies for the vague title.
I'm trying to identify when a specific value in a column in Google Sheets appears more than once, but only if the value in a separate column is different. A visual will probably help here:
So in this scenario you can see that the 1111 ID is assigned to James twice and Nicole once. It's absolutely fine that 1111 is listed multiple times, but it's not good that it's assigned to more than one unique person. I want every row using the 1111 ID to be flagged by a formula (as seen in the 'Status' column) so that I can filter for it and handle the problem.
The example above uses names for the Owner, but that could be numbers instead.
Here is an example sheet:
https://docs.google.com/spreadsheets/d/1eBF3G6UAICgzUJdUA8onGyJiba0Wmu4DOK3U6-SQz7c/edit?usp=sharing
If Owner is in column A and ID is in column B then, for Excel, you could put this in C2 and copy it to the other cells in column C:
=IF(COUNTIFS(B:B,B2)=COUNTIFS(A:A,A2,B:B,B2),"Good","Mismatch Detected")
It compares the count of the ID against the count of the ID and the name. If the ID 1111 appears 3 times but James,1111 only appears 2 times there is a mismatch.
You probably want to change the A:A and B:B to be the range of your actual data.
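
The count comparison described above can also be sketched in pandas, which may help when checking the logic outside a spreadsheet. The column names and the extra "Sam / 2222" row are made up for illustration; only the James/Nicole rows come from the question.

```python
import pandas as pd

# Rows for ID 1111 follow the question; the 2222 row is a hypothetical "good" case.
df = pd.DataFrame(
    {
        "Owner": ["James", "James", "Nicole", "Sam"],
        "ID": [1111, 1111, 1111, 2222],
    }
)

# Equivalent of COUNTIFS(B:B,B2): how many rows share this ID.
id_count = df.groupby("ID")["Owner"].transform("size")

# Equivalent of COUNTIFS(A:A,A2,B:B,B2): how many rows share this ID and this Owner.
pair_count = df.groupby(["ID", "Owner"])["Owner"].transform("size")

# If the two counts differ, the ID is attached to more than one owner.
df["Status"] = (id_count == pair_count).map(
    {True: "Good", False: "Mismatch Detected"}
)
```

All three 1111 rows are flagged, because the ID appears three times but neither James/1111 (twice) nor Nicole/1111 (once) accounts for all of them.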
You can try in GS:
=ArrayFormula(IF(LEN(A2:A),IF((COUNTIF(A2:A&B2:B,A2:A&B2:B)=COUNTIF(B2:B,B2:B)),"Good","Bad"),))
try:
=INDEX(IF(A2:A="",,IF(COUNTIFS(A2:A&B2:B, A2:A&B2:B)=1, "mismatch", "good")))

Have a result from several conditions on Google Sheets

I have set up a Sheets file that lists the personnel of a company. Each row corresponds to a person and each column to a piece of data about that person: the date they joined the company, their last name, their first name, whether they completed the mandatory training, and whether they were present on the first day of work.
I am trying to set up indicators and one of them is causing a problem. I would like to have a column where a certain result appears based on the data entered in the "here in formation?" and "here 1st day?" columns. Unfortunately, I can't combine the logical AND operator in my IFS to get the desired result.
You can see via this link the expected results in column F according to the data present in the above mentioned columns.
I'd suggest having an additional IFS layer. The idea would be to first check the D value, and for each of those values add an IFS that will check the E value.
Formula:
=IF(B2<>"";
IFS(D2="Yes";IFS(E2="";"Data not completed";E2="Yes";"Yes";E2="No";"No");
D2="No";IFS(E2="";"No";E2="No";"No");
D2="Formation < 1 year";IFS(E2="Yes";"Yes";E2="No";"No";E2="";"Data not completed");
D2="other formation";IFS(E2="Yes";"Yes";E2="No";"No";E2="";"Data not completed");
D2="No need";IFS(E2="No";"No";E2="";"No"));"")
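
To make the two-level structure of that formula easier to follow, here is a rough Python equivalent written as a nested lookup table. The function name first_day_status and its parameter names are hypothetical; the value pairs are copied from the formula above, and combinations the formula does not list (where IFS would return #N/A) come back as None.

```python
# Outer key: value of column D; inner key: value of column E; result: column F.
RULES = {
    "Yes": {"": "Data not completed", "Yes": "Yes", "No": "No"},
    "No": {"": "No", "No": "No"},
    "Formation < 1 year": {"Yes": "Yes", "No": "No", "": "Data not completed"},
    "other formation": {"Yes": "Yes", "No": "No", "": "Data not completed"},
    "No need": {"No": "No", "": "No"},
}


def first_day_status(name, formation, first_day):
    """Mirror the nested IFS: check D first, then E within that branch."""
    if name == "":
        return ""  # same as the outer IF(B2<>""; ...; "")
    # None stands in for the #N/A that IFS gives when no condition matches.
    return RULES.get(formation, {}).get(first_day)
```

For example, first_day_status("Doe", "Yes", "") gives "Data not completed", while first_day_status("Doe", "No need", "") gives "No".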
supreme fx:
=INDEX(IF(B2:B="";;IF(REGEXMATCH(D2:D&E2:E;
"^No(?: need)?(?:No)?$"); "No";
IF(E2:E=""; "Data not completed"; E2:E))))

Correct Way To "COUNTUNIQUE" That Only Counts Once

I am trying to count the unique values of a column, based on their status in another column, example:
Customers   License Active
--------------------------
Adam        Yes
Barry       No
Adam        No
Claire      No
In this situation, I want to know how many customers have at least 1 active license, and how many customers do not have at least one active license.
The formula I have tried is:
=COUNTUNIQUEIFS(A2:A,B2:B,"Yes")
This returns 1 in this situation which is correct, as there is 1 customer who has a Yes on column B.
My issue is when I try to do the reverse, count the "No" using this formula:
=COUNTUNIQUEIFS(A2:A,B2:B,"No")
It returns 3, which is not the desired result: it counts the second Adam as a unique value too, because that row has a "No" in column B.
The result I want here is 2, because Adam has a yes somewhere in column B so I don't want him counted again the next time his field is counted.
It seems to me that the easiest way to get the "No" count is like this:
=COUNTUNIQUE(A2:A)-COUNTUNIQUEIFS(A2:A,B2:B,"Yes")
It's even easier if you've already pulled the "Yes" count to a cell (say, C2), in which case the "No" count could be gained quite simply with this:
=COUNTUNIQUE(A2:A)-C2
I don't think you can do it in a single step - try filtering out those with at least one "Yes" like this:
=countunique(filter(A2:A,countifs(B2:B,"Yes",A2:A,A2:A)=0))
Explanation
When a countifs has a range instead of a single value in its criteria part countifs(B2:B,"Yes",A2:A,A2:A) , the countifs gets re-evaluated for each cell in the range. So you get an array with the results of
countifs(B2:B,"Yes",A2:A,A2)
countifs(B2:B,"Yes",A2:A,A3)
countifs(B2:B,"Yes",A2:A,A4)
countifs(B2:B,"Yes",A2:A,A5)
and so on all the way down the columns.
The first countifs above checks right through a2:a and b2:b to see if there are any cases where the name is Adam and the license condition is true and gets a count of 1 so that row is filtered out. The same thing happens in the next row containing Adam (row 4) - the countifs checks right through both columns excluding the headers and the count is still 1 so that row is filtered out as well leaving just Barry and Claire.
If you wanted to exclude all records containing "Test" in the Customer column, You could add a condition to the filter using the multiplication operator to 'AND' it with the existing condition:
=countunique(filter(A2:A,(countifs(B2:B,"Yes",A2:A,A2:A)=0)*(A2:A<>"Test")))
If you had several names to exclude, you would probably want to make a list of them and use a lookup to stop the formula getting too long and unwieldy, but it would be the same idea.
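
As a cross-check of the expected numbers, the same "at least one Yes per customer" logic can be sketched in pandas. The column names are chosen for this example; the data is the table from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Customer": ["Adam", "Barry", "Adam", "Claire"],
        "Active": ["Yes", "No", "No", "No"],
    }
)

# One boolean per customer: True if any of their rows has "Yes".
has_active = df["Active"].eq("Yes").groupby(df["Customer"]).any()

# Customers with at least one active license.
active_customers = int(has_active.sum())

# Customers with no active license at all (Adam is excluded here).
inactive_customers = int((~has_active).sum())
```

This gives 1 active customer (Adam) and 2 inactive customers (Barry and Claire), matching the result the question asks for.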

How can I get the last numerical value in a column in Google Sheets?

I need to find the last numerical value in a column. I was using this formula to get the last value in column G, but I made some changes and it no longer works: =INDEX(G:G, COUNTA(G:G), 1). My column now looks like this:
645
2345
4674.2345
123.1
"-"
"-"
"-"
...and the formula returns "-". I want it to return 123.1. How can I do this?
There are many ways to go about this. Here is one of them:
=QUERY(FILTER({G:G,ROW(G:G)},ISNUMBER(G:G)),"Select Col1 ORDER BY Col2 Desc LIMIT 1")
FILTER creates a virtual array of only numeric values in G in the first column and the row of those numeric values in the second column.
QUERY sorts those rows by row number in descending order and returns only the new top value from the first column (which winds up being your last numeric value in the original range).
However, if your numeric values start at G1, and if there are only numeric values up to where you start adding hyphens in cells, you could just alter your original formula like this:
=INDEX(G:G,COUNT(G:G))
This would work because COUNT only counts numeric values while COUNTA counts all non-null values (including errors BTW).
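
The idea behind both formulas (ignore non-numeric cells, then take the numeric value that sits furthest down) can be sketched in Python. The list below stands in for column G from the question, and last_numeric is a hypothetical helper name.

```python
# Column G from the question: four numbers followed by "-" placeholders.
column_g = [645, 2345, 4674.2345, 123.1, "-", "-", "-"]


def last_numeric(values):
    """Return the last int/float in values, or None if there is none."""
    for value in reversed(values):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    return None


result = last_numeric(column_g)  # 123.1
```

Unlike the INDEX/COUNT shortcut, this walk from the bottom does not assume the numbers are contiguous from the first row.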
Not to take anything away from the accepted answer, but I've been working on this lately in connection with the never-ending "last row" discussion and thought I'd share some similar solutions. These ideas are inspired by a pattern of Google Sheets array questions that seem to be coming up more often. I am also intentionally using different ways to do the same thing just to give people some ideas (e.g. LEFT and regex).
Last Row that is...
Number: =max(filter(row(G:G),isnumber(G:G)))
Text: =max(filter(row(G:G),isText(G:G)))
An error: =max(filter(row(G:G),iserror(G:G)))
Under 0 : =max(filter(row(G:G),G:G<0))
Also exists in column D: =max(filter(row(G:G),ISNUMBER(match(G:G,D:D,0))))
Not Blank: =max(filter(row(A:A),NOT(ISBLANK(A:A))))
Starts with ab: =max(filter(row(G:G),left(G:G,2)="ab"))
Contains the character !: =max(filter(row(G:G),isnumber(Find("!",G:G))))
Starts with a number: =max(filter(row(G:G),REGEXMATCH(G:G,"^\d")))
Only contains letters: =max(filter(row(G:G),REGEXMATCH(G:G,"^[a-zA-Z]+$")))
Last four characters are upper case: =Max(filter(row(G:G),REGEXMATCH(G:G,"[A-Z]{4}$")))
To get the actual value (which I realize was the actual question), just wrap an index function around the Max function. So for this question, a solution could be :
=Index(G:G,max(filter(row(G:G),isnumber(G:G))))

How do I order a mixed text and integer field in a pivot table in Google Sheets?

Let's say that we have two columns on a sheet:
Name Room
-------------
Steve A1
Jill A1
Sam A1
Steve A2
...
Lisa A10
Sally A11
Jim A11
My actual dataset has up to a hundred of these rooms.
The issue I'm running into is with pivot tables. When I want to get a list of rooms and the count (counta is the one I'm using) it works, but the order is not what I wanted. It comes out as:
Room Count
--------------
A1 3
A10 1
A11 2
...
A2 1
I guess I can kind of see why it would be doing that. I'd much rather have it list it out in order. A1, A2, A3... A10, A11, A12, etc.
Is there an easy way to do this without some sort of data manipulation?
An "easy" way to do this without "data manipulation" is to copy the PT, Paste special, Paste values only and then drag the relevant rows (presumably at most only 8) to where you want them. The easiest way is probably with "data manipulation", for example:
=if(len(A1)=2,SUBSTITUTE(A1,"A","A0"),A1)
(Though in your case, whichever column would be the right one, it would not be ColumnA.)
I suggest you transform the string elements into number values using a lookup table.
I've created a sample spreadsheet here.
The input data in the 'input' sheet has the keys as you described.
The next sheet is the "lookup table" that translates each key into a number. I suggest choosing large numbers to leave room for future intermediate numbers if needed.
Pivot 1 is based on the original data as you described
Pivot 2 is based on the re-calculated room name using the lookup table.
The formula I used for the re-calculation is:
=VALUE(SUBSTITUTE(A2,MID(A2,1,1),VLOOKUP(MID(A2,1,1),'Lookup table'!$A$1:$B$2,2)))
I was a little lazy with the string lookup in the original name (MID), assuming your string is the first character and is 1 character long. This can be mended specifically with pattern matching.
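
As a sketch of what the pattern-matching variant could look like, the Python snippet below splits a room name into its letter prefix and its number, then sorts on that pair. The room_key helper and the regular expression are assumptions for illustration; they expect names shaped like letters followed by digits, as in the question.

```python
import re


def room_key(room):
    """Split 'A10' into ('A', 10) so rooms sort by prefix, then numerically."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", room)
    if match is None:
        # Names that do not fit the pattern sort by their text alone.
        return (room, 0)
    return (match.group(1), int(match.group(2)))


rooms = ["A1", "A10", "A11", "A2"]
ordered = sorted(rooms, key=room_key)  # ['A1', 'A2', 'A10', 'A11']
```

The same key could be written to a helper column next to the data, which plays the role that the zero-padded or lookup-based column plays in the answers above.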

Resources