Choose the right duplicate to delete - SPSS

I have a dataset with 400,000 cases. The problem is that one case sometimes has around 21,000 duplicates. I also can't simply delete the duplicates, because I don't know which copy is complete with all the information. Unfortunately, there is no modification date. Is there a way to keep the case with the most information and delete all the duplicates that have less?
Thanks for your help in advance!

I suggest counting the missing/empty cells in each row and then selecting the rows with the minimum number of empty cells - for each ID.
*First creating some fake data for demonstration.
*I'm constructing this so it contains a mixture of string and number variables.
DATA LIST list/ID (f) Svar1 (a1) Nvar1(f) Svar2 (a1) Nvar2(f).
begin data
111,'a',1,'a',2
111,p ,1,'b',
111,'c',1, ,1
222,'v',1, ,2
222,'k',2,'x',2
222,'n',,'z',
end data.
* note the separate treatment of string and number variables.
count EmptyCells=Svar1 Svar2("") Nvar1 Nvar2 (sysmis).
aggregate /outfile=* mode=addvariables /break=id /EmptyCells_min=min(EmptyCells).
* variable EmptyCells now contains the count of empty cells in every row.
* EmptyCells_min has the minimum number of empty cells in a row for every ID.
select if EmptyCells_min=EmptyCells.
After selecting, you are left with the row that has the maximum number of non-empty cells for each ID.
Note: you may be left with several rows for one ID, all containing the same number of non-empty cells. You'll have to decide how to select among those rows, or consider aggregating them.
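For readers who find a general-purpose language easier to follow, the same count / aggregate / select logic can be sketched in JavaScript. This is only an illustration of the idea, not SPSS syntax. The function name keepMostCompleteRows and the row layout (one object per case, every variable present as a key) are assumptions made for the example.

```javascript
// Hypothetical helper: for each ID, keep the row(s) with the fewest empty cells.
// A cell counts as empty when it is null, undefined, or a blank string.
// Assumes every row object carries the same set of keys.
function keepMostCompleteRows(rows, idKey) {
  const isEmpty = (v) => v === null || v === undefined || String(v).trim() === "";
  const countEmpty = (row) =>
    Object.keys(row).filter((k) => k !== idKey && isEmpty(row[k])).length;

  // First pass: minimum number of empty cells per ID (the AGGREGATE step).
  const minEmpty = new Map();
  for (const row of rows) {
    const n = countEmpty(row);
    const id = row[idKey];
    if (!minEmpty.has(id) || n < minEmpty.get(id)) minEmpty.set(id, n);
  }

  // Second pass: keep only rows that reach that minimum (the SELECT IF step).
  return rows.filter((row) => countEmpty(row) === minEmpty.get(row[idKey]));
}

// Same values as the demonstration data above; null stands for a missing number.
const sample = [
  { ID: 111, Svar1: "a", Nvar1: 1, Svar2: "a", Nvar2: 2 },
  { ID: 111, Svar1: "p", Nvar1: 1, Svar2: "b", Nvar2: null },
  { ID: 111, Svar1: "c", Nvar1: 1, Svar2: "", Nvar2: 1 },
  { ID: 222, Svar1: "v", Nvar1: 1, Svar2: "", Nvar2: 2 },
  { ID: 222, Svar1: "k", Nvar1: 2, Svar2: "x", Nvar2: 2 },
  { ID: 222, Svar1: "n", Nvar1: null, Svar2: "z", Nvar2: null },
];
const kept = keepMostCompleteRows(sample, "ID");
console.log(kept);
```

As with the SPSS version, ties are not resolved: if two rows for one ID have the same number of empty cells, both are returned.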

Related

ARRAYFORMULA not populating down as expected

I have a table which shows the distance between point A and multiple point B's.
Point B is listed as a series of columns (Scout groups: 104th Portsmouth, 2nd Portsmouth, etc.).
From each row, I identify the 3 closest points to Point A (by finding the 3 smallest distances).
I use INDEX to extract the column number for each (for example, the smallest) and return the group name from the column header in Row 1.
I want to use an ARRAYFORMULA because the number of rows in the table will change over time.
The problem is, it only returns a single answer and does not populate down the column as expected.
I have a similar problem with the find-the-minimum formula in cell R2.
My latest attempt (in cell V2) is
=ARRAYFORMULA(INDEX($A$1:$P$1,,MIN(IF($A2:$P=R2,COLUMN(A:P)))))
My latest attempt (in cell R2) is
=ARRAYFORMULA(SORTN(TRANSPOSE(A2:P2),1,0,1,TRUE))
$A$1:$P$1 are the column headers - the Scout Group names (point B)
R2 is the cell containing the calculated smallest distance from the row (that I want to lookup).
$A2:$P is the range to search for the value in R2.
I expect the ARRAYFORMULA to populate down to the last row.
I have searched (and searched and searched) for a solution, trying many different approaches - all without success. So... it's time to ask for some help.
Example spreadsheet with data and formula
I think this gets the answers you're looking for (change end range as needed)
R2:
=byrow(A2:P25, LAMBDA(row, ARRAYFORMULA(SORTN(transpose(row),1,0,1,TRUE))))
V2:
=byrow(R2:R25, LAMBDA(row, ArrayFormula(INDEX($A$1:$P$1,,MIN(IF($A2:$P=row,COLUMN(A:P)))))))
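To make the row-by-row idea concrete, here is a rough JavaScript sketch of what the two columns are meant to compute: for each row, the smallest distance and the header of the column it came from. It is not a Sheets formula. The function name nearestGroups, the third group name, and the distances are made up for illustration. Note that this sketch looks up the minimum only within the row it came from.

```javascript
// headers: group names from row 1; distances: one array of numbers per data row.
// Returns, for every row, the smallest distance and the matching group name.
function nearestGroups(headers, distances) {
  return distances.map((row) => {
    const smallest = Math.min(...row);        // what R2 is meant to hold
    const column = row.indexOf(smallest);     // first column with that value
    return { distance: smallest, group: headers[column] }; // what V2 is meant to hold
  });
}

// "Group C" and all distances are hypothetical sample values.
const groupHeaders = ["104th Portsmouth", "2nd Portsmouth", "Group C"];
const sampleDistances = [
  [2.5, 1.2, 3.0],
  [0.8, 4.1, 0.9],
];
const nearest = nearestGroups(groupHeaders, sampleDistances);
console.log(nearest);
```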

How to count the number of contiguous blocks of cells, each block comprising of the same row values?

In Google Sheets, I have a sheet with a list of customers.
Row 1 has headers, and data starts in row 2.
Column A is Customer name,
Column B is street address,
Column C is City and Post Code,
Column D is Country.
I would like to count the number of occurrences of each customer's row, i.e. when A, B, C, D are the same as a composite key.
However, I want to count different occurrences of a row ONLY IF those occurrences are not adjacent / concurrent, i.e.
I do want to count separate occurrences if row 5 and 7 have the same customer,
but not if row 5 and 6 have the same customer...in this case I will count it as one occurrence
Sample sheet (Customers) with examples:
https://docs.google.com/spreadsheets/d/1J7WajZjJfl94tpgXXgk0y5ALCwG2PxoJw6poxwUyrU8/edit?usp=sharing
I have added explanations for counts in column N.
Say for example, you want to know the number of contiguous blocks whose column A value equals "O2 Arena", you can do
=countifs(FILTER(A2:A,A2:A<>A3:A),"="&A5)
It works because we want to omit rows where the value in column A is repeated in the next row. In other words, we keep those with different values than their next rows. Hence, A2:A<>A3:A.
If you want a list of counts for unique blocks, I recommend setting up a list of the unique values first, i.e. say in another sheet's A1, you have
=unique(Customers!A2:A)
then in B1, you can do
=countif(FILTER(Customers!$A$2:$A,Customers!$A$2:$A<>Customers!$A$3:$A),"="&A1)
and fill the above formula down by double-clicking the square at the lower right when B1 is selected.
The ranges in filter() should be absolute because the location of your data does not change. The cell reference in the 2nd input of countif() should be relative because it is meant to move down with each row.
If the values in column A do not uniquely identify your customers, you can add more columns to the input of filter() as required. For example, FILTER(A2:A,A2:A<>A3:A,B2:B<>B3:B)
For function usage, please consult official documentation by typing the function name in the search bar.
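If it helps to see the counting rule outside of a formula, here is a small JavaScript sketch of it. It treats the whole row as the composite key and counts a key only when it differs from the row directly before it. The formula above compares each row with the next one instead, which gives the same number of blocks. The function name countBlocks and the sample rows (apart from "O2 Arena") are assumptions for the example.

```javascript
// rows: array of rows, each row an array of cell values (e.g. [name, city]).
// Returns a Map from the JSON-encoded row to its number of contiguous blocks.
function countBlocks(rows) {
  const counts = new Map();
  let previousKey = null;
  for (const row of rows) {
    const key = JSON.stringify(row); // composite key over all columns
    // A new block starts whenever the key differs from the previous row.
    if (key !== previousKey) counts.set(key, (counts.get(key) || 0) + 1);
    previousKey = key;
  }
  return counts;
}

// Hypothetical sample: "O2 Arena" appears in two separate blocks.
const customers = [
  ["O2 Arena", "London"],
  ["O2 Arena", "London"],
  ["Wembley", "London"],
  ["O2 Arena", "London"],
];
const blocks = countBlocks(customers);
console.log(blocks.get(JSON.stringify(["O2 Arena", "London"])));
```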

Can change shape of range with ARRAYFORMULA() in Google Sheets?

My intention is to convert a single line of data into rows consisting of a specific number of columns in Google Sheets.
For example, starting with the raw data:
     A     B         C         D     E         F
1    id1   attr1-1   attr1-2   id2   attr2-1   attr2-2
And the expected result is:
(by dividing columns by three)
     A     B         C
1    id1   attr1-1   attr1-2
2    id2   attr2-1   attr2-2
I already know that it can be done somewhat manually, like:
=ARRAYFORMULA({A1:C1;D1:F1})
But I have to start over with it every time the target range is moved OR the subset size needs to be changed (in the case above it was three)!
So I guess there will be a much more graceful way (i.e. formula does not require manual update) to do the same thing and suspect ARRAYFORMULA() is the key.
Any help will be appreciated!
I added a new sheet ("Erik Help") where I reduced your manually entered parameters from two to one (leaving only # of columns to be entered in A2).
The formula that reshapes the grid:
=ArrayFormula(IFERROR(VLOOKUP(SEQUENCE(ROUNDUP(COUNTA(7:7)/A2),A2),{SEQUENCE(COUNTA(7:7),1),FLATTEN(FILTER(7:7,7:7<>""))},2,FALSE)))
SEQUENCE is used to shape the grid according to whatever is entered in A2. Rows would be the count of items in Row 7 divided by the number in A2 (rounded up to the next whole number); and the columns would just be whatever number is entered in A2.
Example: If there are 11 items in Row 7 and you want 4 columns, ROUNDUP(11/4)=3 rows to the SEQUENCE and your requested 4 columns.
Then, each of those numbers in the grid is VLOOKUP'ed in a virtual array consisting of a vertical SEQUENCE of ordered numbers matching the number of data pieces in Row 7 (in Column 1) and a FLATTENed (vertical) version of the Row-7 data pieces themselves (in Column 2). Matches are filled into the original SEQUENCE grid, while non-matches are left blank by IFERROR
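The same mapping can be sketched in JavaScript to show what the grid does step by step. This is only an illustration of the logic described above, not a replacement for the formula, and the function name reshapeRow is an assumption. Each grid position is turned into an index into the list of non-blank items, and positions past the end are left blank, which is the role IFERROR plays in the formula.

```javascript
// items: the cells of the source row; columns: the number entered in A2.
function reshapeRow(items, columns) {
  // Counterpart of FILTER(7:7, 7:7<>""): drop blank cells.
  const values = items.filter((v) => v !== "" && v !== null && v !== undefined);
  // Counterpart of ROUNDUP(COUNTA(7:7)/A2): number of rows needed.
  const rows = Math.ceil(values.length / columns);
  const grid = [];
  for (let r = 0; r < rows; r++) {
    const line = [];
    for (let c = 0; c < columns; c++) {
      const position = r * columns + c; // zero-based counterpart of the SEQUENCE number
      // Positions beyond the data stay blank, like the IFERROR fallback.
      line.push(position < values.length ? values[position] : "");
    }
    grid.push(line);
  }
  return grid;
}

// 11 items in 4 columns gives 3 rows, with one blank cell at the end.
const reshaped = reshapeRow(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"], 4);
console.log(reshaped);
```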
Though it's a bit messy, I managed to get it done thanks to the SEQUENCE() function.
It constructs a grid from a given number of rows and columns, and that was exactly what I was looking for.
For reference, I set up a sheet with the sample data here:
https://docs.google.com/spreadsheets/d/1p972tYlsPvC6nM39qLNjYRZZWGZYsUnGaA7kXyfJ8F4/edit#gid=0
Use a custom formula
Although you have already solved this, if you do this kind of thing a lot, it could be worth looking into Apps Script and custom formulas.
In this case you could use something like:
function transposeSingleRow(range, size) {
  // initialize new range
  let newRange = [];
  // initialize counter to keep track
  let count = 0;
  // start while loop to go through row (range[0])
  while (count < range[0].length) {
    // add a slice of the original range to the new range
    newRange.push(
      range[0].slice(count, count + size)
    );
    // increment counter
    count += size;
  }
  return newRange;
}
The nice thing about this function is that you select the range and then pass a number for how many elements make up a complete row. So if instead of 3 attributes you had 4, instead of calling:
=transposeSingleRow(A7:L7, 3)
you could do:
=transposeSingleRow(A7:L7, 4)
Additionally, if you want this conversion to be permanent and not dependent on formula recalculation, it would be necessary to run it fully in Apps Script without using formulas.
Reference
Apps Script
Custom Functions

Indirect Addresses in Array Formula

I have the following formula
=average(arrayformula(indirect(split(A1,","))))
Where A1 contains a list of cell addresses, such as E4,E6,E12. I expect this to be equivalent to =AVERAGE(E4,E6,E12), but this does not behave as expected, yielding 4 no matter what the data in the cells are. Preliminary research indicates that the INDIRECT() function doesn't pass through ARRAYFORMULA() correctly. Attempting SUM() on the outside yields precisely the same results.
Any ideas on how to average the values of cells obtained indirectly by a list of cell addresses?
I do have a list of columns and the row doesn't ever change for this average calculation, so I'm wondering if I could do some kind of subset instead, such as
=AVERAGE(RANGE){LIST_TO_SUBSET_BY}
I'm not sure about a built-in formula to do this so I've written a custom function to do it for you.
Go to Tools -> Script editor and replace the existing function with the code below and then save the project.
Now, in any cell of your spreadsheet, enter =CUSTOMAVERAGE(A1), where A1 contains a list of comma-separated cell references.
NOTE:
Updating values in the referenced cells won't force a recalculation of this formula, only updating cell A1 will.
I suggest you also go to File -> Spreadsheet settings -> Calculation and change 'Recalculation' to 'On change and every minute'. That will force a recalculation of this function every minute.
/**
 * Returns the average of the cells listed in another cell.
 * @param {"A1"} cell The cell containing the comma-separated list of cell references.
 * @return The average of the values in the referenced cells.
 * @customfunction
 */
function CUSTOMAVERAGE(cell) {
  var sheet = SpreadsheetApp.getActiveSheet();
  var cellRefs = cell.split(",");
  var sum = 0;
  for (var i = 0; i < cellRefs.length; i++) {
    // trim() tolerates spaces after the commas, e.g. "E4, E6, E12"
    sum += sheet.getRange(cellRefs[i].trim()).getValue();
  }
  return sum / cellRefs.length;
}
Though this is a very specific application in response to this question, for the sake of the knowledge base, I'd like to show how this can be done without a script.
To give this context, imagine that LIST_CELL is a list of question numbers (which are entered as a header row, call the range QUESTIONS) on a test that correspond to certain standards, and the goal is to average only the questions that correspond to the standard next to which the list is written, for each student. First, convert the list to column numbers using
=iferror(join(",",ArrayFormula(match(split(LIST_CELL,","),QUESTIONS,FALSE))),"")
The split function splits the hand-entered list of questions on commas, the match function returns the column number of that particular question in QUESTIONS, and the join function joins the data back together. ArrayFormula allows the match to be performed on an array instead of just the first value.
Another single row heading lists the standards to which each question has been matched (possibly to more than one standard) by the comma separated list in LIST_CELL. For a column list of students in A:A, each standard needs to average the scores of every question that is listed next to the standard. This is accomplished by the nifty (if clunky):
average(ArrayFormula(hlookup(split(vlookup(LOOKUP_VAL,SEARCH_RANGE,COL_W_LIST),","),DATA_SOURCE,row(CURRENT_CELL))))
Breakdown from center outward:
LOOKUP_VAL is the value being looked up (the one that has multiple matches); in the example context, it's the standard.
SEARCH_RANGE is a range of cells containing both the list of lookup values (the standards in context) and the comma-separated lists of column numbers generated by the first function. COL_W_LIST is the column number in SEARCH_RANGE that contains those lists of column numbers matched from LIST_CELL.
Split takes the elements apart and places them in a temporary array so that hlookup can be performed on each element. Via ArrayFormula, the hlookup grabs each value on the same row in the appropriate QUESTIONS column; in context, it grabs the point scores for each question matched to the standard.
Finally, average is self-explanatory, and it does accept an array as input.
These two functions in combination allow the use of indirect cell references in an array formula, and solve the much-asked "how do I include multiple matches in a calculation" question, at least in this specific context.
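For anyone who wants to check the logic away from the sheet, the two steps can be sketched in JavaScript: match each listed question to its column, then pick the scores in those columns and average them. This is an illustration under assumed data, not the formulas themselves. The function name averageListedQuestions, the question labels, and the scores are all hypothetical.

```javascript
// listCell: comma-separated question labels (the content of LIST_CELL).
// questions: the header row (QUESTIONS); scores: one student's row of points.
function averageListedQuestions(listCell, questions, scores) {
  // Step 1 (split + match): turn each label into a column index.
  const columns = listCell.split(",").map((label) => questions.indexOf(label.trim()));
  if (columns.some((c) => c === -1)) {
    throw new Error("A listed question was not found in the header row");
  }
  // Step 2 (hlookup + average): pick the scores in those columns and average them.
  const picked = columns.map((c) => scores[c]);
  return picked.reduce((total, value) => total + value, 0) / picked.length;
}

// Hypothetical data: the standard covers Q1 and Q3.
const questionHeaders = ["Q1", "Q2", "Q3", "Q4"];
const studentScores = [4, 2, 5, 1];
const standardAverage = averageListedQuestions("Q1, Q3", questionHeaders, studentScores);
console.log(standardAverage); // 4.5
```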
EDIT
There is an example "template" with this implemented here. You'll need to make your own copy to edit it.

Sum above cells ignoring blanks

I have a spreadsheet where I have data from a bank account. Each bank transaction has a date and an indication if that transaction is already done or if it's just expected. When it's already done, it must be added to the total balance up to date. If not, then the total balance up to date must be blank. I need to autofilter the data, so I can filter and order it depending on date or other conditions, that's why I've been using this formula:
=IF(D3="Y";B3+INDIRECT(ADDRESS(ROW()-1;COLUMN()));"")
Problem here is that when the cell above is blank, total sum resets and it starts from the value of that transaction. I need a formula that ignores the upper blank cells, and sums all cells above that are not blank plus the amount of that transaction.
Besides, once I change the "N" in "Done" Column to a "Y" I need the formula to update and show the correct balance.
I share an example sheet for better understanding https://docs.google.com/spreadsheets/d/1_gk0YaziUhOZfRbrlfHizMrVu6OT7njIaTUyQaE6Lbs/edit?usp=sharing
OK, I think I understand what you're going for (please let me know if I've misunderstood). I added an example on your sheet. Basically, I kept one of your conditionals and added another function to exclude the blank rows by way of FILTER, INDEX and COUNTA. It looks more complicated than it is because I nested it all back into one formula:
=IF(I3="Y";sum(G3;index(filter(indirect("F2:"&address(row()-1;column();4));ISNUMBER(indirect("F2:"&address(row()-1;column();4))));counta(filter(indirect("F2:"&address(row()-1;column();4));ISNUMBER(indirect("F2:"&address(row()-1;column();4)))))););)
To work from the inside out: I exclude the blank rows by using FILTER on all the rows starting from the first row with a value (like A2 in your example), and I use INDIRECT and ADDRESS so that the range ends exactly one cell above the current cell.
Then I apply the condition that each cell in that range holds a number, thereby excluding the blanks.
In order to get the last value available, I use COUNTA to find out the total rows in the filter, then wrap the formula with INDEX to use the COUNTA value as the row to return (which automatically is the last row available above the current cell).
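The behaviour the formula aims for can also be written out as a short JavaScript sketch: carry the last shown balance forward, add the amount only for transactions marked "Y", and leave the cell blank otherwise. This is just an illustration of the intended result. The function name runningBalances, the opening balance, and the amounts are assumptions for the example.

```javascript
// openingBalance: the starting number; transactions: rows with an amount
// and a done flag ("Y" or "N"). Returns one balance cell per transaction.
function runningBalances(openingBalance, transactions) {
  let lastBalance = openingBalance; // last non-blank balance seen so far
  return transactions.map(({ amount, done }) => {
    if (done !== "Y") return "";    // expected transaction: show a blank
    lastBalance += amount;          // done transaction: add to the last balance
    return lastBalance;
  });
}

// Hypothetical sample values.
const balances = runningBalances(5000, [
  { amount: 100, done: "Y" },
  { amount: -50, done: "N" },
  { amount: 200, done: "Y" },
]);
console.log(balances); // [5100, "", 5300]
```

Changing the "N" to "Y" in the second row and recomputing gives 5100, 5050, 5250, which is the update behaviour asked for in the question.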
Try this in A3 and copy down:
=IF(D3="Y";B3+INDIRECT(ADDRESS(ROW()-1;COLUMN()));A2+0)
If you want to display the "N" rows as blank, add a column (B) fill in the header and the starting number (5000) then put this in B3:
=if(E3="N";"";A3)
Copy it down then hide column A.
