Data preprocessing of very unusual data - machine-learning

I have a dataset with about 100k rows, and one of its columns holds people's salaries. Only 8k of the 100k cells in that column have a value, so most of the salary column is empty. Among those 8k cells, 500 have a salary of exactly 99k, 6 have 40k and 7 have 34k. The rest of the 8k cells hold salaries that decrease gradually from 27k. So the column as a whole looks like this:
500 cells => 99k
6 cells => 40k
7 cells => 34k
7487 cells => gradual decrease of salary from 27k to 10k
92k cells => no data
Now, how can I handle this data?

Related

How can I make Google Sheets iterate over columns

I have a sheet like this. How can I make the cell next to x in the Sum column show the sum of the x count column, and the cell next to y show the sum of the y count column?
Of course, I can use the SUM function for both, but my problem is how to do the same for z, x1, y1 and z1. I tried filling down, but as you can see in the picture, the result is wrong.
How can I do it for 100 rows?
This formula seems to give the result you want:
={"Header","Sum";
ARRAY_CONSTRAIN(TRANSPOSE(D1:1),COUNTA(D1:1),1),
ARRAYFORMULA(MMULT(TRANSPOSE(N(D2:AJ10)),{SEQUENCE(ROWS(D2:AJ10),1,1,0)}))}
It places the two column labels in the first row, then transposes all of the header values into a vertical column in A2:A. ARRAY_CONSTRAIN, together with a COUNTA check on the number of header values, limits the output so that no blank rows are produced.
The main result is the Sums, calculated using MMULT. You need to enter the range of the cells you are going to sum over - I've used D2:AJ10, entered twice in the formula. MMULT can slow down performance the more cells it has to review, but this seemed fine for 33 columns by 9 rows. Test it out in your actual sheet, and report back if any issues.
REFERENCES:
ARRAY_CONSTRAIN To limit size of an array result, by # rows and # columns
MMULT The matrix product of two matrices. Can be used for summing, if one matrix is one dimensional (eg. a row or a column) with values of just 1.
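
To illustrate why MMULT with a column of ones yields sums, here is a minimal Python sketch of the same idea. It is only an illustration, not part of the sheet formula: the helper names (transpose, mmult, column_sums) and the sample values are hypothetical.

```python
# Sketch of MMULT(TRANSPOSE(data), ones): multiplying the transposed data
# by a column of ones gives one sum per original column.

def transpose(matrix):
    """Swap rows and columns, like TRANSPOSE in a sheet."""
    return [list(col) for col in zip(*matrix)]

def mmult(a, b):
    """Plain matrix product of two lists of rows, like MMULT."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def column_sums(data):
    """Sum each column of data via a matrix product with a ones vector."""
    ones = [[1] for _ in data]  # plays the role of SEQUENCE(ROWS(range),1,1,0)
    return [row[0] for row in mmult(transpose(data), ones)]

# Hypothetical sample: 3 rows by 3 columns of counts.
sample = [
    [1, 2, 3],
    [4, 0, 6],
    [7, 8, 0],
]
print(column_sums(sample))
```

Each entry of the result is the total of one original column, which is what ends up next to the transposed headers.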

Delphi FMX: How to create a string grid cell with multiple columns

I have a string grid, but I want to create cells that are split into 3 "mini" columns within the cell. How would I do this, if it is possible, using FMX in Delphi?
So essentially each cell in the grid would have 3 columns (1 row), and each of those columns would be a cell in itself.
This string grid would represent a calendar/diary planner etc. Each of the main cells would be a day in the month, and the mini cells would represent an early/late/night time frame in the day. What I would like is the ability to access TStringGrid.Cells[x,y] and then set the columns within that cell to a certain letter/number. So the cell would be shown as, e.g., 24 E L N, where the 24 could, if possible, be a header for the main cell.

Is there a calculation time limit in Google Sheets?

I'm using an array formula in Google Sheets to calculate some values.
Each row has around 200 fields (from a Google Form).
Using the array formula, I've multiplied each column by a cell in another sheet (200 fields there).
The responses hold the number of units (1, 2, 3, ...) and the other sheet holds the prices (5, 10, 100).
So each unit count is multiplied by its price to get a total value.
=ArrayFormula(IF(ISBLANK('Form Responses 1'!T2:T),0,'Form Responses 1'!T2:T * Data!T2))
OK, then I have to find the total of all these results for each row. For that, I'm using MMULT.
=MMULT(T2:EV100,TRANSPOSE(ARRAYFORMULA(COLUMN(T2:EV100)^0)))
Now, the actual problem is that the ArrayFormula shows results (the value 0) for only 104 rows.
Is there a limit on the amount of calculation? Or will the number of rows increase over time?
I've tried the ArrayFormula in an isolated sheet, and there it goes all the way to the bottom.
Poor me.
I just noticed that the sheet attached to the Google Form had only 103 rows, which is why the ArrayFormula was only showing results up to row 103.
I added 1000 more rows and it expanded. But it is slower than before; I guess the calculations are being performed in my browser.
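
For reference, the calculation described in this thread (units times price per column, blanks counted as 0, then one total per response row) can be sketched in Python as follows. This is a rough illustration only: the function name row_totals and the sample responses are hypothetical, and None stands in for a blank cell.

```python
# Sketch of the two sheet steps: multiply each unit count by its price
# (blank cells count as 0, as in the IF(ISBLANK(...),0,...) formula), then
# sum across each row, which is what MMULT with a vector of ones does.

def row_totals(units, prices):
    """Return one total per response row."""
    totals = []
    for row in units:
        # Treat a blank (None) cell as 0 units.
        values = [(count or 0) * price for count, price in zip(row, prices)]
        totals.append(sum(values))
    return totals

# Prices taken from the example in the question; responses are made up.
prices = [5, 10, 100]
responses = [
    [1, 2, 3],
    [None, 1, 0],
    [2, None, None],
]
print(row_totals(responses, prices))
```

The number of totals always equals the number of response rows supplied, which mirrors the finding above: the formula output stopped where the sheet's rows stopped, not at a calculation limit.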

ArrayFormula Google Sheets - remove high and low from average

I'm trying to use an ARRAYFORMULA to calculate the average across 7 columns while removing the max and min values from those columns. The tricky part is that there is no preset limit on how many cells will be filled; it's different each time.
I have the formula that calculates the average working:
=ARRAYFORMULA(IF(ISBLANK($A$2:$A),"",IF($J$2:$J="Granted",($AO$2:$AO+$AP$2:$AP+$AQ$2:$AQ+$AR$2:$AR+$AS$2:$AS+AT2:AT)/6,0)))
I've tried using the TRIMMEAN function, =Trimmean(AO2:AU2,0.33), but it isn't working with the array formula. Any suggestions on how to get it to work?
Assuming the cells that you want to ignore are empty, you want:
Average of all cells that are filled and not the maximum or minimum
Which is
(Sum of all cells that are filled and not the maximum or minimum) / (number of filled cells - 2)
Thus
=(sum(YourRange)-max(YourRange)-min(YourRange))/(count(YourRange)-2)
should give you what you want.
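
The same arithmetic can be checked outside the sheet with a short Python sketch. The function name average_without_extremes and the sample values are hypothetical, None stands in for an empty cell, and the guard for fewer than three filled cells is an added assumption (the sheet formula would divide by zero or a negative count in that case).

```python
# Sketch of (sum - max - min) / (count - 2), ignoring empty cells.

def average_without_extremes(cells):
    """Average the filled cells after dropping one maximum and one minimum."""
    values = [v for v in cells if v is not None]  # keep only filled cells
    if len(values) < 3:
        # Assumption: with fewer than 3 values nothing is left to average.
        return None
    return (sum(values) - max(values) - min(values)) / (len(values) - 2)

# Hypothetical row with 5 of 7 cells filled.
scores = [4, 8, None, 6, 10, None, 2]
print(average_without_extremes(scores))
```

Only one maximum and one minimum are removed, so if the highest value appears twice, the second copy still counts toward the average, exactly as in the formula.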

How to eliminate highlighting duplicates in google sheets conditional formatting

I have a spreadsheet where I need to use conditional formatting to highlight the lowest 3 scores in a row, to reflect the dropped scores that are excluded from a Total calculation. I'm using the SMALL function to successfully calculate the Total: =SUM(A2:I2)-SMALL(A2:I2,1)-SMALL(A2:I2,2)-SMALL(A2:I2,3). But when I try to use the SMALL function in the Custom Formula field of the conditional format, it highlights 0,60,60,60 and not 0,60,60.
119 101 60 100 0 109 60 60 112 TOTAL:601
If four of the values are 0, it will highlight all four 0s. If 60 is the lowest score and there are 4 or more scores of 60, it will highlight all of them and not reflect that only 3 of the scores are actually dropped.
Is there another way (custom formula) that can only highlight the lowest 3 scores in the row even when the 3rd lowest may have duplicates in the row?
I've come up with this formula (assuming values start in A1), which unfortunately is a bit long:
=OR(A1<SMALL($A1:$I1,3),AND(A1=SMALL($A1:$I1,3),COUNTIF($A1:A1,SMALL($A1:$I1,3))<=(3-COUNTIF($A1:$I1,"<"&SMALL($A1:$I1,3)))))
or
=OR(A1<SMALL($A1:$I1,3),AND(A1=SMALL($A1:$I1,3),(COUNTIF($A1:A1,SMALL($A1:$I1,3))+COUNTIF($A1:$I1,"<"&SMALL($A1:$I1,3))<=3)))
The logic is that it highlights all cells which are less than the third smallest value, then any values (starting from the left) which are equal to the third smallest value until the total equals three.
I've changed the second row to show that it selects the second zero instead of the second 60.
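
The logic described above can also be traced in a small Python sketch. It is only an illustration of the rule, not a replacement for the custom formula: the function name lowest_three_flags is hypothetical, and it assumes a row with at least three numeric values.

```python
# Sketch of the highlighting rule: flag every value below the third smallest,
# then flag values equal to the third smallest, left to right, until three
# cells in total are flagged.

def lowest_three_flags(row):
    """Return a list of booleans, True where the cell would be highlighted."""
    third = sorted(row)[2]                    # like SMALL(row, 3)
    below = sum(1 for v in row if v < third)  # like COUNTIF(row, "<" & third)
    flags = []
    seen_equal = 0                            # like COUNTIF($A1:A1, third)
    for value in row:
        if value < third:
            flags.append(True)
        elif value == third:
            seen_equal += 1
            flags.append(seen_equal <= 3 - below)
        else:
            flags.append(False)
    return flags

# The example row from the question.
scores = [119, 101, 60, 100, 0, 109, 60, 60, 112]
flags = lowest_three_flags(scores)
kept_total = sum(v for v, dropped in zip(scores, flags) if not dropped)
print(flags, kept_total)
```

For the example row, exactly three cells are flagged (the 0 and the first two 60s), and the remaining scores add up to the TOTAL of 601 given in the question.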
