Count mismatches between rows in Google Sheets - google-sheets

I need some help with this because I'm running into an issue.
I have three columns (Time, R1 and R2) and I'm trying to count the mismatches between R1 and R2 for each month (based on the Time column).
I already have a formula, but I'm having an issue getting it to add 1.
https://docs.google.com/spreadsheets/d/1bVP79Gbd14lO6xunu2K9POT7y55yrXegD-cTF70Fb4k/edit#gid=0 (the spreadsheet with the values)
=iferror(if(EOMONTH($A64,0)=$A64,SUMPRODUCT(month(Database!$C$2:$C) = month($A64),--(Database!G$2:G <> Database!H$2:H)),""),"Error")
This part, "month(Database!$C$4:$C) = month($A5)", is where I compare the months (but I'm having an issue because "month(Database!$C$4:$C)" only returns 4, which is the month of April).
This part, "(Database!G$4:G <> Database!H$4:H)", is where I compare the columns R1 and R2.
The part "EOMONTH($A5,0)=$A5" is where I pick the month to base the count on.
Time R1 R2
2020-04-30 BA BU
2020-04-30 BU BA
2020-04-29 BA BU
2020-04-29 BU BA
2020-04-28 BA BU
2020-04-28 AA BA
2020-04-25 AA BA
2020-04-22 BU BA
2020-04-19 AA BU
2020-04-19 AA BA
2020-03-27 BA AA
2020-03-27 BA AA
2020-03-26 BU AA
2020-03-18 BA AA
2020-03-18 AA BU

Approach
In order to validate the answer I created a test spreadsheet from a copy of yours. In this sheet I created two support columns: one contains the month number, MONTH($A1), and the other a flag that is set when the two values R1 and R2 differ, IF($B1=$C1,"",1).
This way I can use these two support columns to validate the numbers obtained by your formula, which does not use any support columns. To compute the sum I use a much simpler formula: =SUMIF(D:D,month(<DATE_VALUE>),E:E).
I will link the test sheet here. As you can see, the values are the same as the ones obtained by your formula. So I can confirm that, if you are expecting different results, your database is not consistent.
In conclusion, the formula if(EOMONTH($A64,0)=$A64,SUMPRODUCT(month(Database!$C$2:$C) = month($A64),--(Database!G$2:G <> Database!H$2:H)),"") is working correctly.
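If you want to cross-check the numbers outside the spreadsheet, the same count can be sketched in a few lines of Python. This is only a rough check, not part of the sheet: the rows are copied from the sample in the question, the function name is made up for illustration, and, like the formula, it groups by month number only (so the same month in different years would be merged).

```python
from collections import Counter
from datetime import date

# Sample rows copied from the question: (Time, R1, R2).
ROWS = [
    ("2020-04-30", "BA", "BU"), ("2020-04-30", "BU", "BA"),
    ("2020-04-29", "BA", "BU"), ("2020-04-29", "BU", "BA"),
    ("2020-04-28", "BA", "BU"), ("2020-04-28", "AA", "BA"),
    ("2020-04-25", "AA", "BA"), ("2020-04-22", "BU", "BA"),
    ("2020-04-19", "AA", "BU"), ("2020-04-19", "AA", "BA"),
    ("2020-03-27", "BA", "AA"), ("2020-03-27", "BA", "AA"),
    ("2020-03-26", "BU", "AA"), ("2020-03-18", "BA", "AA"),
    ("2020-03-18", "AA", "BU"),
]


def mismatches_per_month(rows):
    """Count rows where R1 differs from R2, keyed by month number."""
    counts = Counter()
    for time_str, r1, r2 in rows:
        if r1 != r2:
            # Same grouping as MONTH(...) in the sheet: the year is ignored.
            counts[date.fromisoformat(time_str).month] += 1
    return dict(counts)


print(mismatches_per_month(ROWS))  # {4: 10, 3: 5}
```

Every row in the sample has R1 different from R2, so the counts are simply the number of rows per month.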

Related

Overfitting in a data frame where some rows are repeated

I have a machine learning problem with a logistic regression algorithm. I have a data frame where some rows and features are repeated, as in the table below:
feature 1  feature 2  feature 3  ...  feature n-1  feature n  Target
a1         a2         a3         ..   an           1          1
b1         b2         b3         ..   bn           1          0
c1         c2         c3         ..   cn           1          1
..         ..         ..         ..   ..           1          ..
a1         a2         a3         ..   an           2          ..
b1         b2         b3         ..   bn           2          ..
c1         c2         c3         ..   cn           2          ..
..         ..         ..         ..   ..           2          ..
a1         a2         a3         ..   an           3          ..
b1         b2         b3         ..   bn           3          ..
c1         c2         c3         ..   cn           3          ..
..         ..         ..         ..   ..           ..         ..
Is it possible for overfitting or underfitting to occur with this data frame?
And what about a data frame that has between 6 and 8 features and about 500 rows?
I should add that the rows that are repeated in features 1 to n-1 vary in feature n.
Whether you overfit or not is due to:
the complexity of the model
the available data.
But what's important is the actual data. If you double the data by repeating it, you don't effectively change the data you have. In fact, many algorithms randomly sample from the dataset. So having duplicates changes nothing (except if the duplicated data has a different distribution than the non-duplicated data).
As such, removing the duplication in the data will not affect whether you overfit or not.
Edit: Now, if the data is not duplicated but rather modified, it is a different story. If it is as you first described it:
"where some rows and features are repeated"
then there is no effect.
But if the data is modified, as the table shows, then you need to explain: Are these actual noisy measurements? Is this some random transcription or data collection error?
However, if these are not errors in the dataset but actual data, then it is important to keep them. This is not about overfitting, this is about training with the actual data.
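To make the point about exact duplicates concrete, here is a small sketch in plain Python. The feature values, labels and weights are made up purely for illustration, and it assumes an unregularized logistic regression whose objective is the mean log loss; with a penalty term that is not scaled by the number of rows the picture could differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(weights, bias, X, y):
    """Average logistic (cross-entropy) loss of a linear model over a dataset."""
    total = 0.0
    for features, target in zip(X, y):
        p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
    return total / len(y)


# Hypothetical toy data and parameters, not taken from the question.
X = [[0.5, 1.2], [1.5, -0.3], [-1.0, 0.8], [2.0, 2.0]]
y = [1, 0, 1, 0]
weights, bias = [0.4, -0.7], 0.1

loss_original = mean_log_loss(weights, bias, X, y)
loss_doubled = mean_log_loss(weights, bias, X + X, y + y)  # every row repeated once

print(math.isclose(loss_original, loss_doubled))  # True
```

Since the average loss is the same for any choice of weights, the weights that minimize it are the same too, which is why repeating every row does not by itself change the fit.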

How to generate a series of numbers based on a value

I am stuck with a little problem.
I would like to auto-generate a series of numbers (is this the right description?).
For example:
The input value in A1 is 5.
The output in B1 should now be 1,
in cell B2 - 2,
B3 - 3,
B4 - 4,
B5 - 5,
B6 - 1 (starting at one again),
B7 - 2
... and so on until the end of the range.
If I then change the value in A1 to 7, for example, it should now "count" from 1 to 7 and repeat until the end of the range again.
Any hints would be much appreciated.
=ARRAYFORMULA(MOD(SEQUENCE(ROWS(B2:B),1,0),A1)+1)
This creates an array of numbers starting at 0, one for each row in the range, then takes the modulus of each number using the value of A1 as the divisor. Adding 1 shifts the result from 0..A1-1 to 1..A1.
MOD() means "what's the remainder after dividing by [n]?", where n in your example case is 5 or 7 or whatever.
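If it helps to see the arithmetic outside the sheet, the same idea can be sketched in Python. The function name and arguments are just for illustration: n plays the role of A1 and rows the number of cells to fill.

```python
def cyclic_series(n, rows):
    """Return 1..n repeated until `rows` values have been produced."""
    # i runs 0, 1, 2, ...; i % n cycles through 0..n-1; +1 shifts it to 1..n.
    return [(i % n) + 1 for i in range(rows)]


print(cyclic_series(5, 7))  # [1, 2, 3, 4, 5, 1, 2]
print(cyclic_series(7, 9))  # [1, 2, 3, 4, 5, 6, 7, 1, 2]
```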

How to use seq2seq to decode a concatenated string

I am trying to decode concatenated strings like the ones below (input on the left, expected output on the right):
SQCB7A750BATWE SQ CB 7 A 750 B A T WE
PT05A1219PY023 PT 05 A 12 19 P Y 023
PT55A1019PX02 PT 55 A 10 19 P X 02
PT33SE2215SW023 PT 33 SE 22 15 S W 023
PT05A2216PW023(LC) PT 05 A 22 16 P W 023 (LC)
I am looking for a smarter way than hard-coded rules, as the input will have variations (in the number of characters and digits). I came across the seq2seq model and I want to know whether it can be used for this kind of problem.
I already followed some tutorials to get a taste of it, but the results weren't even close.
It also seems there are two approaches, character level and word level, as per this tutorial.
Character level:
Input sentence: SQCACA333BA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
-
Input sentence: SQCAAC152DA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
I am still trying to implement the word-level approach, but I'd like to know whether the problem can be solved using seq2seq at all.

VARSTOCASES (in SPSS) function with unequal spacing of waves

My problem with VARSTOCASES is that I'm unable to deal with unequal spacing of waves in longitudinal data (I'm using the NLSY79). My dependent variable (log of wage) is not available for all years. In R, I can easily deal with that using syntax like this:
ld = reshape(d,
  varying = c("logwage1989", "logwage1990", "logwage1991", "logwage1992", "logwage1993",
              "logwage1994", "logwage1996", "logwage1998", "logwage2000", "logwage2002",
              "logwage2004", "logwage2006", "logwage2008", "logwage2010"),
  v.names = "logwage",
  timevar = "year",
  times = c("1989", "1990", "1991", "1992", "1993", "1994", "1996",
            "1998", "2000", "2002", "2004", "2006", "2008", "2010"),
  direction = "long")
And in SPSS, what I do is something like this:
VARSTOCASES
/make logwage from logwage1989 logwage1990 logwage1991 logwage1992 logwage1993 logwage1994 logwage1996 logwage1998 logwage2000 logwage2002 logwage2004 logwage2006 logwage2008 logwage2010
/index= year(14)
/keep=grade AFQT educmom educdad occupationmom occupationdad familyincome.
In the above, 14 is the total number of waves, and what SPSS outputs is a series of numbers going from 1 to 14. The data is collected once every year at first, and then once every two years. So for SPSS, the values 1 and 2 in the year variable correspond to 1989 and 1990, while the values 13 and 14 correspond to 2008 and 2010, respectively. And that's the problem.
How would I write the equivalent of the R reshape call in SPSS?
On the VARSTOCASES command instead of using a numeric index you can use a string index, which will put the original variable names into the column. This can then be converted to a numeric column of the years.
DATA LIST FREE /logwage1989 logwage1990 logwage1991 logwage1992 logwage1993 logwage1994 logwage1996 logwage1998
logwage2000 logwage2002 logwage2004 logwage2006 logwage2008 logwage2010.
BEGIN DATA.
89 90 91 92 93 94 96 98 00 02 04 06 08 10
END DATA.
VARSTOCASES
/MAKE logwage FROM logwage1989 TO logwage2010
/INDEX=year (logwage).
*Now convert to an actual year.
COMPUTE year = REPLACE(year,"logwage","").
ALTER TYPE year (F4.0).
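For readers who want to see the logic independent of SPSS, here is a rough Python sketch of the same two steps: keep the original variable name as the index, then strip the "logwage" stem to get a numeric year. The function name and the small record are made up for illustration; the values follow the toy data above.

```python
def wide_to_long(record, stem="logwage"):
    """Turn {stem + year: value} columns into one row per year."""
    rows = []
    for name, value in record.items():
        suffix = name[len(stem):]
        # Only reshape columns that look like <stem><digits>, e.g. logwage1996.
        if name.startswith(stem) and suffix.isdigit():
            rows.append({"year": int(suffix), stem: value})
    return rows


# Hypothetical single case with a few of the unequally spaced waves.
record = {"logwage1989": 89, "logwage1990": 90, "logwage1994": 94,
          "logwage1996": 96, "logwage2010": 10}

for row in wide_to_long(record):
    print(row["year"], row["logwage"])
```

Because the year comes from the variable name rather than from a running counter, the gaps between waves are preserved.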

Print list from different predicate in PROLOG

I'm currently learning a new language (Prolog) and I've come across a few issues.
I'm developing a simple board game in which I'm required to print the board. I've written the printBoard and initialBoard predicates (shown below), and I want to be able to run them from the main predicate, like so:
printBoard([Head|Tail]) :-
    printRow(Head),
    printBoard(Tail).
printBoard([]).

printRow([Head|Tail]) :-
    write(Head),
    write(' '),
    printRow(Tail).
printRow([]) :- nl.

initialBoard(Board) :- Board = ([
    ['b0','b0','b0','b0','b1'],
    ['b0','b0','b0','b0','b0'],
    ['b0','b0','b0','b0','b0'],
    ['b0','b0','b0','b0','b0'],
    ['b2','b0','b0','b0','b0']
]).

main :- Board = initialBoard(Board), printBoard(Board).
When I type main. in SICStus Prolog, the program should output the following:
b0 b0 b0 b0 b1
b0 b0 b0 b0 b0
b0 b0 b0 b0 b0
b0 b0 b0 b0 b0
b2 b0 b0 b0 b0
But instead it prints nothing (it returns no).
The only way it seems to work is by inserting the whole list again as the argument, like so:
main :- printBoard(<insert whole list here>).
Even though I'm looking to run it as:
main :- printBoard(initialBoard(Board)).
The portion of code above works, if main is passed the Board argument, but is it possible without passing it?
Functional code:
main(Board) :- printBoard(initialBoard(Board)).

Resources