Ignoring "noise" in ANTLR4 - parsing

I'd like to build a natural language date parser in ANTLR4 and got stuck on ignoring "noise" input. The simplified grammar below parses any string that contains valid dates in the format DATE MONTH:
dates
: simple_date dates
| EOF
;
simple_date
: DATE MONTH
;
DATE : [0-9][0-9]?;
MONTH : 'January' | 'February' | 'March' ; // etc.
Text such as "1 January 22 February" will be accepted. I wanted the grammar to accept other text as well, so I added ANY : . -> skip; at the end:
dates
: simple_date dates
| EOF
;
simple_date
: DATE MONTH
;
DATE : [0-9][0-9]?;
MONTH : 'January' | 'February' | 'March' ; // etc.
ANY : . -> skip;
This doesn't quite do what I want, however. While a string such as "On 1 January and 22 February" is accepted and the simple_date rule is matched twice, the string "On 1XX January" will also match the rule.
Question: How do I build a grammar where rules are matched only with the exact token sequence while ignoring all other input, including tokens in an order not defined in any of the rules? Consider the following cases:
"From 1 January to 2 February" -> simple_date matches "1 January" and "2 February"
"From 1XX January to 2 February" -> simple_date matches "2 February", rest is ignored
"From January to February" -> no match, everything ignored

Do not drop the extra "noise" in the lexer, as your ANY rule does. The lexer does not know the context in which the current token appears, and what you want is to drop noise tokens only when they are not part of a DATE MONTH form. Move the noise handling into parser rules that match the noise.
Also, it's advisable to drop white space IN THE LEXER; in that case, your ANY rule should exclude the characters matched by the WS rule. Note, too, that your DATE lexer rule claims any standalone token of the form [0-9][0-9]?, so the parser-level noise rule must include DATE:
dates
: (noise* (simple_date) noise*)+
;
simple_date
: DATE MONTH
;
noise: (DATE|ANY);
DATE : [0-9][0-9]?;
MONTH : 'January' | 'February' | 'March' ;
ANY : ~(' '|'\t' | '\f')+ ;
WS : [ \t\f]+ -> skip;
Accepts:
1 January and 22 February noise 33
1 January and 22 February 3
Rejects:
1xx January
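You can check these cases yourself with the ANTLR TestRig (assuming the rules live in a file Dates.g4 with a grammar Dates; header, and the usual antlr4/grun shell aliases from the ANTLR setup instructions are in place):
antlr4 Dates.g4
javac Dates*.java
echo "1 January and 22 February noise 33" | grun Dates dates -tree
The -tree flag prints the parse tree, so you can see which tokens ended up in simple_date and which were consumed as noise.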
This wasn't fully tested. Note also that your MONTH lexer rule claims a standalone month literal (e.g. January), which should count as noise but is not handled in my grammar, e.g.:
22 February January
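If you want standalone months swallowed as noise too, one way (an untested sketch) is to let the parser choose between a real date and noise at every token. It relies on ANTLR4 resolving ambiguities in favor of the first-listed alternative, so simple_date wins over noise whenever both could match:
dates
: chunk* EOF
;
chunk
: simple_date
| noise
;
simple_date
: DATE MONTH
;
noise
: DATE | MONTH | ANY
;
(lexer rules as above)
With this, "22 February January" parses as a simple_date followed by a single MONTH noise token, and "From January to February" is consumed entirely as noise.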


Get multi column range but only where specific column is not repeated

So, I have a sheet named "Calendar" and another sheet called "Stats".
Here's a sample of the "Calendar" sheet:
F          | G   | H        | I           | J                 | K
-----------|-----|----------|-------------|-------------------|----
2023-01-27 | Fri | 11:30 PM | Family      | Family Activity 1 | YYY
2023-01-27 | Fri | 11:45 PM | Family      | Family Activity 1 | YYY
2023-01-28 | Sat | 12:00 AM | Family      | Family Activity 1 | YYY
2023-01-28 | Sat | 12:15 AM | Family      | Family Activity 1 | XY
2023-01-28 | Sat | 12:30 AM | Fun         | Fun Activity 1    | ABC
2023-01-28 | Sat | 12:45 AM | Fun         | Fun Activity 1    | ABC
2023-01-28 | Sat | 1:00 AM  | Obligations | Obligations 1     | AAA
2023-01-28 | Sat | 1:15 AM  | Fun         | Fun Activity 2    | ZZZ
2023-01-28 | Sat | 1:30 AM  | Fun         | Fun Activity 2    | ZZZ
2023-01-28 | Sat | 1:45 AM  | Family      | Family Activity 2 | MMM
2023-01-28 | Sat | 2:00 AM  | Family      | Family Activity 2 | MMM
Now, on the "Stats" sheet there's a date in cell B16. For this example, it's 2023-01-28.
What I want is to get columns H, I, J, and K from "Calendar" where F equals the date specified in cell B16 of the "Stats" sheet.
The tricky part, where I'm having issues, is to only show the rows where the previous row isn't identical, i.e. where I, J, and K aren't exactly the same as in the previous row, like this:
H        | I           | J                 | K
---------|-------------|-------------------|----
12:00 AM | Family      | Family Activity 1 | YYY
12:15 AM | Family      | Family Activity 1 | XY
12:30 AM | Fun         | Fun Activity 1    | ABC
1:00 AM  | Obligations | Obligations 1     | AAA
1:15 AM  | Fun         | Fun Activity 2    | ZZZ
1:45 AM  | Family      | Family Activity 2 | MMM
I'm not sure if this is comprehensive; if it isn't, please let me know so I can clarify.
What I got so far is the following formula:
=QUERY(A:K,"select H,I,J,K where F = date '2023-01-28'")
This only works if I execute it in the "Calendar" sheet, and the date isn't dependent on cell B16 of the "Stats" sheet. Ideally, however, I'd like to place the formula in the "Stats" sheet.
You can try:
=FILTER(Calendar!H2:K,Calendar!F2:F=B16,{"";LAMBDA(z,MAKEARRAY(COUNTA(z),1,LAMBDA(r,c,IF(INDEX(z,r)=INDEX(z,r-1),1,))))(INDEX(Calendar!F3:F&Calendar!I3:I&Calendar!J3:J&Calendar!K3:K))}<>1)
If you can, add an auxiliary column to your raw data sheet. Let's say it's L, with this formula in L2:
=MAP(F2:F,I2:I,J2:J,K2:K,LAMBDA(fx,ix,jx,kx,IF(OR(fx<>OFFSET(fx,-1,0),ix<>OFFSET(ix,-1,0),jx<>OFFSET(jx,-1,0),kx<>OFFSET(kx,-1,0)),1,0)))
It checks whether F, I, J, and K differ from the previous row and returns 1 or 0. Then you can run a QUERY like this:
=QUERY(A:L,"select H,I,J,K WHERE L = 1 AND F = date '"&TEXT(B16,"YYYY-MM-DD")&"'")
If you can't add the column, you can join all of this into one formula:
=QUERY({Calendar!F:K,"";MAP(Calendar!F2:F,Calendar!I2:I,Calendar!J2:J,Calendar!K2:K,LAMBDA(fx,ix,jx,kx,IF(OR(fx<>OFFSET(fx,-1,0),ix<>OFFSET(ix,-1,0),jx<>OFFSET(jx,-1,0),kx<>OFFSET(kx,-1,0)),1,0)))},"select Col3,Col4,Col5,Col6 WHERE Col7 = 1 AND Col1 = date '"&TEXT(Stats!B16,"YYYY-MM-DD")&"'")
A date is just a number. Try:
=QUERY(Calendar!A:K, "select H,I,J,K where F = "&B16*1, )

Speed up filter and sum results based on multiple criteria?

I am filtering then summing transaction data based on a date range and if a column contains one of multiple possible values.
example data
A | B | C | D
-----------|-----|---------------------------------------------------|-------
11/12/2017 | POS | 6443 09DEC17 C , ALDI 84 773 , OFFERTON GB | -3.87
18/12/2017 | POS | 6443 16DEC17 C , CO-OP GROUP 108144, STOCKPORT GB | -6.24
02/01/2018 | POS | 6443 01JAN18 , AXA INSURANCE , 0330 024 1229 GB | -220.10
I currently have the following formula, which works but is really quite slow.
=sum(
iferror(
filter(
Transactions!$D:$D,
Transactions!$A:$A>=date(A2,B2,1),
Transactions!$A:$A<=date(A2,B2,31),
regexmatch(Transactions!$C:$C, "ALDI|LIDL|CO-OP GROUP 108144|SPAR|SAINSBURYS S|SAINSBURY'S S|TESCO STORES|MORRISON|MARKS AND SPENCER , HAZEL GROVE|HAZELDINES|ASDA")
)
,0
)
) * -1
The formula is on a separate sheet that is just a simple view of the results breakdown for each month of a year:
| A | B | C
--|------|----|----------
1 | 2017 | 12 | <formula> # December 2017
2 | 2017 | 11 | <formula> # November 2017
3 | 2017 | 10 | <formula> # October 2017
Is there a way to achieve this that would be more performant?
I tried using ARRAYFORMULA and SUMIF, which works for the string criteria, but when I add the date criteria with SUMIFS, it stops working. I also couldn't figure out a way to utilize INDEX and/or MATCH.
=query(filter( {Transactions!$A:$A,
Transactions!$D:$D},
regexmatch(Transactions!$C:$C, "ALDI|LIDL|CO-OP GROUP 108144|SPAR|SAINSBURYS S|SAINSBURY'S S|TESCO STORES|MORRISON|MARKS AND SPENCER , HAZEL GROVE|HAZELDINES|ASDA")
), "select year(Col1), month(Col1)+1, -1*sum(Col2) group by year(Col1), month(Col1)+1", 0)
The result is a table like this:
year(Col1) | month(Col1)+1 | sum
2017       | 11            | 3.87
2017       | 12            | 6.24
Add labels if needed. Sample query text with labels:
"select year(Col1), month(Col1)+1, -1*sum(Col2) group by year(Col1), month(Col1)+1 label year(Col1) 'Year', month(Col1)+1 'Month'"
The result:
Year | Month | sum
2017 | 11    | 3.87
2017 | 12    | 6.24
Explanations:
The single-formula report reduces the number of FILTER calls, so it should be faster.
It uses QUERY syntax; more info here.

Estimating mortality with acmeR package

There is a relatively new package that has come out called acmeR for producing estimates of mortality (for birds and bats), and it takes into consideration things like bleedthrough (was the carcass still there but undetected, and then found in a later search?), diminishing searcher efficiency, etc. This is extremely useful, except I can't seem to get it to work, despite it seeming to be pretty straightforward. The data structure should be like:
Date, in US format mm/dd/yyyy or ISO 8601 format yyyy-mm-dd
Time, in am/pm US format HH:MM:SS AM or 24-hr ISO format HH:MM:SS
ID, arbitrary distinct alpha strings unique to each carcass
Species, arbitrary distinct alpha strings (e.g. AOU, ABMP, IBP)
Event, “Place”, “Check”, or “Search” (only 1st letter counts)
Found, TRUE or FALSE (only 1st letter counts)
and look something like this:
“Date”,“Time”,“ID”,“Species”,“Event”,“Found”
“1/7/2011”,“08:00:00 PM”,“T091”,“UNBA”,“Place”,TRUE
“1/8/2011”,“12:00:00 PM”,“T091”,“UNBA”,“Check”,TRUE
“1/8/2011”,“16:00:00 PM”,“T091”,“UNBA”,“Search”,FALSE
My data look like this:
Date: Date, format: "2017-11-09" "2017-11-09" "2017-11-09" ...
Time: Factor w/ 644 levels "1:00 PM","1:01 PM",..: 467 491 518 89 164 176 232 39 53 247 ...
Species: Factor w/ 52 levels "AMCR","AMKE",..: 31 27 39 27 39 31 39 45 27 24 ...
ID: Factor w/ 199 levels "GHBT001","GHBT002",..: 1 3 2 3 2 1 2 7 3 5 ...
Event: Factor w/ 3 levels "Check","Place",..: 2 2 2 3 3 3 1 2 1 2 ...
Found: logi TRUE TRUE TRUE FALSE TRUE TRUE ...
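For illustration, here is the sort of coercion I have been attempting (a sketch, untested; dat stands in for my raw data frame):
# Coerce the factor columns to the plain formats the package
# documents, then write a fresh CSV to feed back in.
clean <- data.frame(
  Date    = format(dat$Date, "%Y-%m-%d"),              # ISO 8601 date
  Time    = format(strptime(as.character(dat$Time),
                            "%I:%M %p"), "%H:%M:%S"),  # 24-hr HH:MM:SS
  ID      = as.character(dat$ID),
  Species = as.character(dat$Species),
  Event   = as.character(dat$Event),                   # "Place"/"Check"/"Search"
  Found   = dat$Found,                                 # already logical
  stringsAsFactors = FALSE
)
write.csv(clean, "acme_input.csv", row.names = FALSE)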
I have played with the date, time, event, etc. formats too, trying multiple formats because I have had some issues there. Here are some of the errors I have encountered, depending on what subset of data I use:
Error in optim(p0, f, rd = rd, method = "BFGS", hessian = TRUE) : non-finite value supplied by optim. In addition: Warning message: In log(c(a0, b0, t0)) : NaNs produced
Error in read.data(fname, spec = spec, blind = blind) : Expecting date format YYYY-MM-DD (ISO) or MM/DD/YYYY (USA) USA
Error in solve.default(rv$hessian): system is computationally singular: reciprocal condition number = 1.57221e-20
Warning message: In sqrt(diag(Sig)[2]) : NaNs produced
Error in solve.default(rv$hessian) : Lapack routine dgesv: system is exactly singular: U[2,2] = 0
The last error is most common (and note, my data are non-numeric, so I am not sure what math is being done behind the scenes to come up with these equations and then fail in the solve), but the others are persistent too. Sometimes, despite the formatting being exactly the same, a subset of a dataset will return an error when the parent dataset does not (as far as I can tell, it has nothing to do with a species being present in one dataset and not the other).
I cannot find any bug reports or issues with the acmeR package out there - so maybe it runs perfectly and my data are the problem, but after three ecologists have vetted the data and pronounced it good, I am at a dead end.
Here is a subset of the data, so you can see what they look like:
Date Time Species ID Event Found
8 2017-11-09 1:39 PM VATH GHBT007 P T
11 2017-11-09 2:26 PM CORA GHBT004 P T
12 2017-11-09 2:30 PM EUST GHBT006 P T
14 2017-11-09 6:43 AM CORA GHBT004 S T
18 2017-11-09 8:30 AM EUST GHBT006 S T
19 2017-11-09 9:40 AM CORA GHBT004 C T
20 2017-11-09 10:38 AM EUST GHBT006 C T
22 2017-11-09 11:27 AM VATH GHBT007 S F
32 2017-11-09 10:19 AM EUST GHBT006 C F

Is col X functionally dependent on col Y?

I am trying to understand database normalisation. I saw this example of a relation that is in second normal form (2NF) but not in third normal form (3NF):
Tournament           | Year | Winner         | Winner_Date_of_Birth
---------------------|------|----------------|---------------------
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open       | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters   | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977
Here the primary key is (Tournament, Year). Since no non-key attribute is functionally dependent on a proper subset of the primary key, it is in 2NF.
However, according to Wikipedia, it is not in 3NF because
Tournament, Year -> Winner and
Winner -> Winner_Date_Of_Birth
so there is a transitive functional dependency. I understand this part, but what I would like to know is: since for our key (Tournament, Year) there can be only one Winner_Date_Of_Birth, is it right to say that (Tournament, Year) -> Winner_Date_Of_Birth without using the transitive property above?
Yes, transitive means that you can derive A -> C from A -> B and B -> C.
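Spelled out with Armstrong's transitivity rule:
(Tournament, Year) -> Winner                  (given)
Winner -> Winner_Date_Of_Birth                (given)
----------------------------------------------------
(Tournament, Year) -> Winner_Date_Of_Birth    (by transitivity)
So the dependency you observe does hold, but it holds because it is derivable by transitivity -- and it is precisely this transitive dependency of a non-key attribute on the key that 3NF forbids.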

Fix shift/reduce conflict in bison grammar

The bison grammar I wrote for parsing a text file gives me 10 shift/reduce conflicts. The parser.output file doesn't help me enough; it gives me information like:
State 38 conflicts: 5 shift/reduce
State 40 conflicts: 4 shift/reduce
State 46 conflicts: 1 shift/reduce
Grammar
0 $accept: session $end
1 session: boot_data section_start
2 boot_data: head_desc statement head_desc head_desc
3 head_desc: OPEN_TOK BOOT_TOK statement CLOSE_TOK
4 | OPEN_TOK statement CLOSE_TOK
5 statement: word
6 | statement word
7 word: IDENTIFIER
8 | TIME
9 | DATE
10 | DATA
11 section_start: section_details
12 | section_start section_details
13 | section_start head_desc section_details
14 $#1: /* empty */
15 section_details: $#1 section_head section_body section_end
16 section_head: START_TOK head_desc START_TOK time_stamp
17 time_stamp: statement DATE TIME
18 section_body: log_entry
19 | section_body log_entry
20 log_entry: entry_prefix body_statements
21 | entry_prefix TIME body_statements
22 body_statements: statement
23 | head_desc
24 entry_prefix: ERROR_TOK
25 | WARN_TOK
26 | /* empty */
27 $#2: /* empty */
28 section_end: END_TOK statement $#2 END_TOK head_desc
state 38
8 word: TIME .
21 log_entry: entry_prefix TIME . body_statements
OPEN_TOK shift, and go to state 1
TIME shift, and go to state 6
DATE shift, and go to state 7
DATA shift, and go to state 8
IDENTIFIER shift, and go to state 9
OPEN_TOK [reduce using rule 8 (word)]
TIME [reduce using rule 8 (word)]
DATE [reduce using rule 8 (word)]
DATA [reduce using rule 8 (word)]
IDENTIFIER [reduce using rule 8 (word)]
$default reduce using rule 8 (word)
head_desc go to state 39
statement go to state 40
word go to state 11
body_statements go to state 45
state 39
23 body_statements: head_desc .
$default reduce using rule 23 (body_statements)
state 40
6 statement: statement . word
22 body_statements: statement .
TIME shift, and go to state 6
DATE shift, and go to state 7
DATA shift, and go to state 8
IDENTIFIER shift, and go to state 9
TIME [reduce using rule 22 (body_statements)]
DATE [reduce using rule 22 (body_statements)]
DATA [reduce using rule 22 (body_statements)]
IDENTIFIER [reduce using rule 22 (body_statements)]
$default reduce using rule 22 (body_statements)
word go to state 19
state 46
9 word: DATE .
17 time_stamp: statement DATE . TIME
TIME shift, and go to state 48
TIME [reduce using rule 9 (word)]
$default reduce using rule 9 (word)
The equivalent part of my grammar is:
statement : word
{
printf("WORD\n");
$$=$1;
}
|statement word
{
printf("STATEMENTS\n");
$$=$1;
printf("STATEMENT VALUE== %s\n\n",$$);
}
;
word : IDENTIFIER
{
printf("IDENTIFIER\n");
$$=$1;
}
|TIME
{
printf("TIME\n");
$$=$1;
}
|DATE
{
printf("DATE\n");
$$=$1;
}
|DATA
{
}
;
section_start : section_details
{
printf("SINGLE SECTIONS\n");
}
|section_start section_details
{
printf("MULTIPLE SECTIONS\n");
}
|section_start head_desc section_details
;
section_details :
{
fprintf(fp,"\n%d:\n",set_count);
}
section_head section_body section_end
{
printf("SECTION DETAILS\n");
set_count++;
}
;
section_head : START_TOK head_desc START_TOK statement time_stamp
{
printf("SECTION HEAD...\n\n%s===\n\n%s\n",$2,$4);
fprintf(fp,"%s\n",$4);
}
;
time_stamp : DATE TIME
{
}
;
section_body :log_entry
{
}
|section_body log_entry
{
}
;
log_entry : entry_prefix body_statements
{
}
|entry_prefix TIME body_statements
{
}
;
body_statements : statement
{
}
|head_desc
{
}
;
Please help me fix this. Thanks!
A conflict in a yacc/bison parser means that the grammar is not LALR(1) -- which generally means that something is either ambiguous or needs more than one token of lookahead. The default resolution is to always shift rather than reduce (for shift/reduce conflicts), or to always reduce the first rule (for reduce/reduce conflicts), which results in a parser that recognizes a subset of the language described by the grammar. That may or may not be OK -- with an ambiguous grammar, it is often the case that the "subset" is in fact the entire language, and the default resolution merely trims off the ambiguous parses. For cases that require more lookahead, however, as well as some ambiguous cases, the default resolution will result in failing to parse some things in the language.
To figure out what is going wrong in any given conflict, the .output file generally gives you all you need. In your case, you have 3 states with conflicts; generally, the conflicts in a single state are all part of a single related issue.
state 38
8 word: TIME .
21 log_entry: entry_prefix TIME . body_statements
This conflict is an ambiguity between the rules for log_entry and body_statements:
log_entry: entry_prefix body_statements
| entry_prefix TIME body_statements
A body_statements can be a sequence of one or more TIME/DATE/DATA/IDENTIFIER tokens, so when you have an input with (e.g.) entry_prefix TIME DATA, it could be either the first log_entry rule with TIME DATA as the body_statements or the second log_entry rule with just DATA as the body_statements.
The default resolution in this case will favor the second rule (shifting to treat the TIME as directly part of the log_entry rather than reducing it as a word to be part of a body_statements), and will result in a "subset" that is the entire language -- the only parses that will be missed are ambiguous. This is analogous to the "dangling else" that shows up in some languages, where the default shift likely does exactly what you want.
To eliminate this conflict, the easiest way is just to get rid of the log_entry: entry_prefix TIME body_statements rule. This has the opposite effect of the default resolution -- now TIME will always be considered part of the body. The issue is that you no longer have a separate rule to reduce if you want to treat the case of an initial TIME in the body differently. You can instead do a check in the action for a body that begins with TIME if you need to do something special.
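For instance (a hypothetical sketch -- it assumes the semantic value of body_statements lets you inspect its first token, which your current char* values would need extending to support):
log_entry : entry_prefix body_statements
    {
        /* body_starts_with_time() is a hypothetical helper that
           reports whether the body's first token was a TIME */
        if (body_starts_with_time($2)) {
            /* special handling for a leading TIME */
        }
    }
    ;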
state 40
6 statement: statement . word
22 body_statements: statement .
This is another ambiguity problem, this time with section_body, where it can't tell where one log_entry ends and another begins. A section_body consists of one or more log_entries, each of which is an entry_prefix followed by a body_statements. The body_statements, as noted above, may be one or more word tokens, while an entry_prefix may be empty. So if you have a section_body that is just a sequence of word tokens, it can be parsed as either a single log_entry (with no entry_prefix) or as a sequence of log_entry rules, each with no entry_prefix. The default shift-over-reduce resolution will favor putting as many tokens as possible into a single statement before reducing a body_statements, so it will parse the sequence as a single log_entry, which is likely OK.
To eliminate this conflict, you'll need to refactor the grammar. Since the tail of any statement in a log_entry might be another log_entry with no entry_prefix and statement for the body_statements, you pretty much need to eliminate that case (which is what the default conflict resolution does). Assuming you've fixed the previous conflict by removing the second log_entry rule, first unfactor log_entry to make the problematic case its own rule:
log_entry: ERROR_TOK body_statements
| WARN_TOK body_statements
| head_desc
initial_log_entry: log_entry
| statement
Now change the section_body rule so it uses the split-off rule for just the first one:
section_body: initial_log_entry
| section_body log_entry
and the conflict goes away.
state 46
9 word: DATE .
17 time_stamp: statement DATE . TIME
This conflict is a lookahead ambiguity problem -- since DATE and TIME tokens can appear in a statement, when parsing a time_stamp it can't tell where the statement ends and the terminal DATE/TIME begins. The default resolution will result in treating any DATE TIME pair seen as the end of the time_stamp. Now since time_stamp only appears at the end of a section_head just before a section_body, and a section_body may begin with a statement, this may be fine as well.
So it may well be the case that all the conflicts in your grammar are ignorable, and it may even be desirable to do so, as that would be simpler than rewriting the grammar to get rid of them. On the other hand, the presence of conflicts makes it tougher to modify the grammar, since any time you do, you need to reexamine all of the conflicts to make sure they are still benign.
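If you do decide to keep the conflicts, bison lets you declare the expected count so the warning doesn't later mask newly introduced conflicts. A one-line sketch, assuming all 10 are deemed benign:
%expect 10
With this declaration, bison stays silent when exactly 10 shift/reduce conflicts are found and complains if the number changes.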
There's a confusing issue with "the default resolution of a conflict" and "the default action in a state". These two defaults have nothing to do with one another -- the first is a decision made by yacc/bison when it builds the parser, and the second is a decision by the parser at runtime. So when you have a state in the output file like:
state 46
9 word: DATE .
17 time_stamp: statement DATE . TIME
TIME shift, and go to state 48
TIME [reduce using rule 9 (word)]
$default reduce using rule 9 (word)
This is telling you that bison has determined that the possible actions from this state are to either shift to state 48 or reduce rule 9. The shift action is only possible if the lookahead token is TIME, while the reduce action is possible for many lookahead tokens, including TIME. So in the interest of minimizing table size, rather than enumerating all the possible next tokens for the reduce action, it just says $default -- meaning the parser will do the reduce action as long as no previous action matches the lookahead token. The $default action will always be the last action in the state.
So the parser's code in this state will run down the list of actions until it finds one that matches the lookahead token and does that action. The TIME [reduce... action is only included to make it clear that there was a conflict in this state and that bison resolved it by disallowing the action reduce when the lookahead is TIME. So the actual action table constructed for this state will have just a single action (shift on token TIME) followed by a default action (reduce on any token).
Note that even though not all tokens are legal after a word, the parser still does the reduction on any lookahead token. That's fine: since the reduction doesn't consume the lookahead token, a later state (after potentially multiple default reductions) will see the illegal token and report the error.
