Overfitting in a data frame where some rows are repeated - machine-learning

I have a machine learning problem involving logistic regression. My data frame has some rows and features that are repeated, as in the table below:
feature 1   feature 2   feature 3   ...   feature n-1   feature n   Target
a1          a2          a3          ..    an            1           1
b1          b2          b3          ..    bn            1           0
c1          c2          c3          ..    cn            1           1
..          ..          ..          ..    ..            1           ..
a1          a2          a3          ..    an            2           ..
b1          b2          b3          ..    bn            2           ..
c1          c2          c3          ..    cn            2           ..
..          ..          ..          ..    ..            2           ..
a1          a2          a3          ..    an            3           ..
b1          b2          b3          ..    bn            3           ..
c1          c2          c3          ..    cn            3           ..
..          ..          ..          ..    ..            ..          ..
Can overfitting or underfitting occur with this data frame?
And what about a data frame that has 6 to 8 features and about 500 rows?
I should add that the rows that repeat in features 1 to n-1 differ in feature n.

Whether you overfit or not is due to:
the complexity of the model
the available data.
But what's important is the actual data. If you double the data by repeating it, you don't effectively change the data you have. In fact, many algorithms randomly sample from the dataset, so having duplicates changes nothing (except if the duplicated data has a different distribution than the non-duplicated data).
As such, removing the duplication in the data will not affect whether you overfit or not.
Edit: If the rows are exactly duplicated, as in your description ("where some rows and features are repeated"), then there is no effect. If the data is not duplicated but rather modified, it is a different story.
If the data is modified, as the table shows, then you need to explain: are these actual noisy measurements? Is this some random transcription or data collection error?
However, if these are not errors in the dataset but actual data, then it is important to keep them. This is not about overfitting; it is about training with the actual data.
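To illustrate the point about exact duplicates, here is a minimal Python sketch. The feature values, weights and function name are made up for illustration, and it assumes an unregularized objective that averages the log-loss over the rows. Under that assumption, repeating every row leaves the average loss of any given model unchanged, so the weights that minimize it are unchanged too:

```python
import math

def mean_log_loss(weights, bias, rows, targets):
    """Average logistic-regression log-loss of a fixed model over a dataset."""
    total = 0.0
    for x, y in zip(rows, targets):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(rows)

# Hypothetical toy data and an arbitrary model, for illustration only.
rows = [[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.3]]
targets = [1, 0, 1, 0]
weights, bias = [0.4, -0.2], 0.1

original = mean_log_loss(weights, bias, rows, targets)
doubled = mean_log_loss(weights, bias, rows * 2, targets * 2)  # every row repeated once

print(math.isclose(original, doubled))
```

If only some rows are duplicated, those rows get more weight in the average, which is the "different distribution" exception mentioned above.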

Related

How to generate a "series" of numbers based on a value

I am stuck with a little problem.
I would like to auto-generate a series of numbers (is this the right description?).
For example:
The input value in A1 is 5.
The output in B1 should now be 1,
in cell B2 - 2,
B3 - 3,
B4 - 4,
B5 - 5,
B6 - 1 (starting at one again),
B7 - 2
. . . and so on until the end of the range.
If I then change the value in A1 to 7, for example, it should now count from 1 to 7 and repeat until the end of the range again.
Any hints would be much appreciated.
=ARRAYFORMULA(MOD(SEQUENCE(ROWS(B2:B),1,0),A1)+1)
This creates an array of numbers starting at 0, with one number for each row of the range, then takes the modulus of each number using the value of A1 as the divisor, and finally adds 1 so that the count starts at 1 instead of 0.
MOD() means "what's the remainder after dividing by [n]?", where n in your example is 5 or 7 or whatever is in A1.
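The same modulus idea can be sketched outside of Sheets. This is only an illustration in Python; the function name and the lengths used are made up:

```python
def repeating_series(n, length):
    """Return 1..n repeated until `length` values have been produced."""
    # i runs 0, 1, 2, ...; i % n cycles through 0..n-1; adding 1 shifts it to 1..n.
    return [(i % n) + 1 for i in range(length)]

print(repeating_series(5, 7))  # [1, 2, 3, 4, 5, 1, 2]
print(repeating_series(7, 9))  # [1, 2, 3, 4, 5, 6, 7, 1, 2]
```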

How to manipulate multiple nested arrays in Dyalog APL?

I have been given matrices filled with alphanumerical values excluding lower case letters like so:
XX11X1X
XX88X8X
Y000YYY
ZZZZ789
ABABABC
and have been tasked with counting the repetitions in each row and then tallying up a score depending on the ranking of the character being repeated. I used {⍺ (≢⍵)}⌸¨ ↓ m to help me. For the example above I would get something like this:
X 4   X 4   Y 4   Z 4   A 3
1 3   8 3   0 3   7 1   B 3
                  8 1   C 1
                  9 1
This is great, but now I need to write a function that multiplies the numbers by a value for each letter. I can access the first matrix with ⊃, but then I am completely lost on how to access the other ones. I can simply write ⊃w[2] and ⊃w[3] and so forth, but I need a way to process every matrix at the same time in one function. For this example, the ranking array is as follows: ZYXWVUTSRQPONMLKJIHGFEDCBA9876543210 so for the first array XX11X1X
which corresponds to:
X 4
1 3
X is 3rd in the array, so it corresponds to a 3, and 1 is 35th, so it's a 35. The final score would be something like (3×10⁴)+(35×10³). My biggest problem is not necessarily the scoring part but being able to access each matrix individually in one function. So for this nested array:
X 4   X 4   Y 4   Z 4   A 3
1 3   8 3   0 3   7 1   B 3
                  8 1   C 1
                  9 1
if I do arr[1] it gives me the scalar
X 4
1 3
and ⍴ arr[1] gives me nothing, confirming it, so I can do ⊃arr[1] to get the matrix itself and have access to each column individually. This is where I'm stuck. I'm trying to write a function that does the math for each matrix and then saves the results to an array. I can easily do the math for the first matrix, but I can't do it for all of them. I might have made a mistake by using {⍺ (≢⍵)}⌸¨ ↓ m to get those matrices. Thanks.
Using your example arrangement:
⎕ ← arranged ← ⌽ ⎕D , ⎕A
ZYXWVUTSRQPONMLKJIHGFEDCBA9876543210
So now, we can get the index values:
1 ⌷ m
XX11X1X
∪ 1 ⌷ m
X1
arranged ⍳ ∪ 1 ⌷ m
3 35
While you could compute the intermediary step first, it is much simpler to include most of the final formula in Key's operand:
{ ( arranged ⍳ ⍺ ) × 10 * ≢⍵ }⌸¨ ↓m
┌───────────┬───────────┬───────────┬─────────────────┬───────────────┐
│30000 35000│30000 28000│20000 36000│10000 290 280 270│26000 25000 240│
└───────────┴───────────┴───────────┴─────────────────┴───────────────┘
Now we just need to sum each:
+/¨ { ( arranged ⍳ ⍺ ) × 10 * ≢⍵ }⌸¨ ↓m
65000 58000 56000 10840 51240
In fact, we can combine the summation with the application of Key to avoid a double loop:
{ +/ { ( arranged ⍳ ⍺ ) × 10 * ≢⍵ }⌸ ⍵}¨ ↓m
65000 58000 56000 10840 51240
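For readers who want to cross-check the numbers outside of APL, here is a rough Python equivalent of the computation above. It is only a sketch: the names ARRANGED, row_score and matrix are made up for illustration, and Counter plays the role that Key plays in the APL version.

```python
from collections import Counter
import string

# Same ranking as `arranged` above: ZYX...A9876543210
ARRANGED = (string.digits + string.ascii_uppercase)[::-1]

def row_score(row):
    """Sum of (1-based rank of the character) x 10^(number of occurrences)."""
    counts = Counter(row)
    return sum((ARRANGED.index(ch) + 1) * 10 ** n for ch, n in counts.items())

matrix = ["XX11X1X", "XX88X8X", "Y000YYY", "ZZZZ789", "ABABABC"]
scores = [row_score(row) for row in matrix]
print(scores)  # [65000, 58000, 56000, 10840, 51240]
```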
For completeness, here is a way to use the intermediary result. Let's start by working on just the first matrix (you can get the second one with 2⊃ instead of ⊃ ― for details, see Problems when trying to use arrays in APL. What have I missed?):
⊃{⍺ (≢⍵)}⌸¨ ↓m
X 4
1 3
We can insert a function between the left column elements and the right column elements with reduction:
{⍺ 'foo' ⍵}/ ⊃{⍺ (≢⍵)}⌸¨ ↓m
┌─────────┬─────────┐
│┌─┬───┬─┐│┌─┬───┬─┐│
││X│foo│4│││1│foo│3││
│└─┴───┴─┘│└─┴───┴─┘│
└─────────┴─────────┘
So now we simply have to modify the placeholder function with one that looks up the left argument in the arranged items, and multiplies by ten to the power of the right argument:
{ ( arranged ⍳ ⍺ ) × 10 * ⍵ }/ ⊃{⍺ (≢⍵)}⌸¨ ↓m
30000 35000
Instead of applying this to only the first matrix, we apply it to each matrix:
{ ( arranged ⍳ ⍺ ) × 10 * ⍵ }/¨ {⍺ (≢⍵)}⌸¨ ↓m
┌───────────┬───────────┬───────────┬─────────────────┬───────────────┐
│30000 35000│30000 28000│20000 36000│10000 290 280 270│26000 25000 240│
└───────────┴───────────┴───────────┴─────────────────┴───────────────┘
Now we just need to sum each:
+/¨ { ( arranged ⍳ ⍺ ) × 10 * ⍵ }/¨ {⍺ (≢⍵)}⌸¨ ↓m
65000 58000 56000 10840 51240
However, this is a much more circuitous approach, and is only provided here for reference.

Count mismatches between rows in google sheet

I need some help with this because I'm running into an issue.
I have these three columns (Time, R1 and R2) and I'm trying to count the mismatches between R1 and R2 for each month (taken from the Time column).
I already have a formula, but I'm having an issue getting it to add 1.
https://docs.google.com/spreadsheets/d/1bVP79Gbd14lO6xunu2K9POT7y55yrXegD-cTF70Fb4k/edit#gid=0 (the spreadsheet with the values)
=iferror(if(EOMONTH($A64,0)=$A64,SUMPRODUCT(month(Database!$C$2:$C) = month($A64),--(Database!G$2:G <> Database!H$2:H)),""),"Error")
This part "month(Database!$C$4:$C) = month($A5)" is where I compare the months (but I'm having an issue because "month(Database!$C$4:$C)" only returns 4, which is the month of April).
This part "(Database!G$4:G <> Database!H$4:H)" is where I compare the columns R1 and R2.
The part "EOMONTH($A5,0)=$A5" is where I pick the month to compare against.
Time R1 R2
2020-04-30 BA BU
2020-04-30 BU BA
2020-04-29 BA BU
2020-04-29 BU BA
2020-04-28 BA BU
2020-04-28 AA BA
2020-04-25 AA BA
2020-04-22 BU BA
2020-04-19 AA BU
2020-04-19 AA BA
2020-03-27 BA AA
2020-03-27 BA AA
2020-03-26 BU AA
2020-03-18 BA AA
2020-03-18 AA BU
Approach
To validate the answer I created a test spreadsheet from a copy of yours. In this sheet I created two support columns: one that contains the month number, MONTH($A1), and one that flags whether the two values R1 and R2 are different, IF($B1=$C1,"",1).
This way I can use these two support columns to validate the numbers obtained by your formula, which didn't use any. This time I use a much simpler formula to compute the sum: =SUMIF(D:D,month(<DATE_VALUE>),E:E).
I will link here the test sheet. As you can see, the values are the same as the ones obtained by your formula. So I can confirm that if you are expecting different results, your database is not consistent.
In conclusion, the formula if(EOMONTH($A64,0)=$A64,SUMPRODUCT(month(Database!$C$2:$C) = month($A64),--(Database!G$2:G <> Database!H$2:H)),"") is working correctly.
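As an independent cross-check of the same logic, the per-month count can be sketched in a few lines of Python using the sample rows from the question. The function and variable names are made up for illustration, and, like the formula, the sketch groups by month number only and ignores the year:

```python
from collections import Counter
from datetime import date

# Sample rows from the question: (Time, R1, R2)
rows = [
    ("2020-04-30", "BA", "BU"), ("2020-04-30", "BU", "BA"),
    ("2020-04-29", "BA", "BU"), ("2020-04-29", "BU", "BA"),
    ("2020-04-28", "BA", "BU"), ("2020-04-28", "AA", "BA"),
    ("2020-04-25", "AA", "BA"), ("2020-04-22", "BU", "BA"),
    ("2020-04-19", "AA", "BU"), ("2020-04-19", "AA", "BA"),
    ("2020-03-27", "BA", "AA"), ("2020-03-27", "BA", "AA"),
    ("2020-03-26", "BU", "AA"), ("2020-03-18", "BA", "AA"),
    ("2020-03-18", "AA", "BU"),
]

def mismatches_per_month(rows):
    """Count the rows where R1 differs from R2, keyed by month number."""
    counts = Counter()
    for day, r1, r2 in rows:
        if r1 != r2:
            counts[date.fromisoformat(day).month] += 1
    return dict(counts)

result = mismatches_per_month(rows)
print(result)  # {4: 10, 3: 5}
```

In this sample every row is a mismatch, so the counts equal the number of rows in each month.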

How to perform calculation with cumulative sum using ARRAYFORMULA

Is it possible to perform an arbitrary calculation (e.g. A2*B2) on a set of rows and obtain the cumulative sum along the way using ARRAYFORMULA? For example, in the following sheet we have numbers (column A), multipliers (column B), the result of multiplying them (column C), and a cumulative tally (column D):
  |  A       B           C       D           E              F
-------------------------------------------------------------------------------
1 |  number  multiplier  result  cumulative  array formula  array formula sum?
2 |  3       4           12      12          12
3 |  2       4           8       20          8
4 |  10      1           10      30          10
5 |  7       9           63      93          63
I can use ARRAYFORMULA in cell E2 (specifically, ARRAYFORMULA(A2:A5*B2:B5)) to do the multiplication. Is it possible to use ARRAYFORMULA (or alternative tool) in cell F2 to show the cumulative total?
use:
=ARRAYFORMULA(IF(A2:A="",,MMULT(TRANSPOSE((ROW(A2:A)<=
TRANSPOSE(ROW(A2:A)))*A2:A*B2:B), SIGN(B2:B))))
Calculate the cumulative sum with the SCAN and LAMBDA functions:
=SCAN(0, F5:F, LAMBDA(accumulated_value, cell_value, accumulated_value + cell_value))
This will run faster as it runs with linear complexity (O(N)) compared to the ARRAYFORMULA solution, which runs in quadratic time (O(N**2)).
Where:
0 is the initial value of the cumulative sum
F5:F is the range to sum over
LAMBDA(accumulated_value, cell_value, accumulated_value + cell_value) is the function that calculates the sum at each cell
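To make the difference between the two approaches concrete, here is a small Python sketch using the numbers from the question's table (the variable names are made up for illustration). itertools.accumulate carries a running value forward in the same way SCAN does, while the second version re-sums a growing prefix for every row, which is roughly the amount of work the MMULT formula does:

```python
from itertools import accumulate

numbers = [3, 2, 10, 7]      # column A
multipliers = [4, 4, 1, 9]   # column B

products = [a * b for a, b in zip(numbers, multipliers)]

# Linear: one pass that carries the running total forward, like SCAN.
running_total = list(accumulate(products))

# Quadratic: every row sums all products up to and including itself.
quadratic_total = [sum(products[: i + 1]) for i in range(len(products))]

print(products)         # [12, 8, 10, 63]
print(running_total)    # [12, 20, 30, 93]
print(quadratic_total)  # [12, 20, 30, 93]
```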

missing data in time series

I'm new to this field and I'm trying to explore the data for a time series: find the missing values, count them, study the distribution of their lengths, and fill in these gaps. I have, let's say, 10 .txt files, and each file has 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, let's say up to 100; the 10 files do not necessarily have the same number of observations.
In this example the missing values are 4, 5 and 6 in C2, together with the corresponding values in the first column C1 (measured in milliseconds, so the value of 928 ms is not a time neighbor of 912 ms). Such gaps do not necessarily appear in every file. I want to find those gaps (the total number of missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should.
path = "files path"
out.file<-data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
  file <- cbind(read.table(file.names[i],
                           header=F,
                           sep ="\t",
                           stringsAsFactors=FALSE),
                file.names[i])
  colnames(file) <- c('TS', 'Index', 'File')
  out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for(i in 2:(d-1)){
  if(abs(out.file$Index[i]-out.file$Index[i+1]) > 1)
    misDa = misDa+1
}
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (as it seems), the naniar and imputeTS packages offer nice functions for missing data visualization.
Some examples from the naniar package, which is especially good for multivariate data (more plot examples):
Some examples from the imputeTS package, which is especially good for time series data (additional plot examples):
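Independently of those packages, the gap lengths described in the question can be computed directly from the index column. The following is a minimal Python sketch, not a drop-in replacement for the R code: the function name is made up, the data is the C2 column from the question, and it assumes the index is sorted and increases by 1 when nothing is missing.

```python
from collections import Counter

def gap_lengths(indices):
    """Return the number of missing values in each gap of a sorted index column."""
    # A jump from 3 to 7 means that 4, 5 and 6 are missing: 7 - 3 - 1 = 3.
    return [b - a - 1 for a, b in zip(indices, indices[1:]) if b - a > 1]

index_column = [0, 1, 2, 3, 7, 8, 9, 10, 11, 12, 13]  # C2 from the question

gaps = gap_lengths(index_column)
total_missing = sum(gaps)   # total number of missing values
histogram = Counter(gaps)   # gap length -> number of gaps of that length

print(gaps)           # [3]
print(total_missing)  # 3
```

With several files, it is probably safer to call this once per file and concatenate the results. A loop over one combined data frame also compares the last index of one file with the first index of the next, and adding 1 per jump counts gaps rather than missing values.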
