When to use the terms "delimiter," "terminator," and "separator" - delimiter

What are the semantics behind usage of the words "delimiter," "terminator," and "separator"? For example, I believe that a terminator would occur after each token and a separator between each token. Is a delimiter the same as either of these, or are they simply forms of a delimiter?
SO has all three as tags, yet they are not synonyms of each other. Is this because they are all truly different?

A delimiter denotes the limits of something, where it starts and where it ends. For example:
"this is a string"
has two delimiters, both of which happen to be the double-quote character. The delimiters indicate what's part of the thing, and what is not.
A separator distinguishes two things in a sequence:
one, two
1\t2
code(); // comment
The role of a separator is to demarcate two distinct entities so that they can be distinguished. (Note that I say "two" because in computer science we're generally talking about processing a linear sequence of characters).
A terminator indicates the end of a sequence. In a CSV, you could think of the newline as terminating the record on one line, or as separating one record from the next.
Token boundaries are often denoted by a change in syntax classes:
foo()
would likely be tokenised as word(foo), lparen, rparen - there aren't any explicit delimiters between the tokens, but a tokenizer would recognise the change in grammar classes between alpha and punctuation characters.
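As a rough sketch of that idea in Python (the token class names and the pattern are made up for this illustration, not taken from any particular tokenizer), token boundaries can be found purely from which character class matches next:

```python
import re

# Hypothetical token classes, chosen only for this illustration.
TOKEN_PATTERN = re.compile(r"(?P<word>[A-Za-z_]\w*)|(?P<lparen>\()|(?P<rparen>\))")

def tokenize(text):
    """Return (class, lexeme) pairs; no explicit delimiter between tokens is needed."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at offset %d" % (text[pos], pos))
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("foo()"))
```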
The categories aren't completely distinct. For example:
[red, green, blue]
could (depending on your syntax) be a list of three items; the brackets delimit the list and the right-bracket terminates the list and marks the end of the blue token.
As for SO's use of those terms as tags, they're just that: tags to indicate the topic of a question. There isn't a single unified controlled vocabulary for tags; anyone with enough karma can add a new tag. Enough differences in terminology exist that you could never have a single controlled tag vocabulary across all of the topics that SO covers.

Technically a delimiter goes between things, perhaps in order to tell you where one field ends and another begins, such as in a comma-separated-value (CSV) file.
A terminator goes at the end of something, terminating the line/input/whatever.
A separator can be a delimiter or anything else that separates things. Consider the spaces between words in the English language for example.
You could argue that a newline character is a line terminator, a delimiter of lines or something that separates two lines. For this reason there are a few different newline-type characters in the Unicode specification.

A delimiter is one or two markers that show the start and end of something. They're needed because we don't know how long that 'something' will be. We can have either: 1. a single delimiter, or 2. a pair of pair-delimiters
[a, b, c, d, e] each comma (,) is a single delimiter. The left and right brackets, ([, ]) are pair-delimiters.
"hello", the two quote symbols (") are pair-delimiters
A separator is a synonym of "delimiter", but in my experience it usually refers to field delimiters. A field delimiter acts as a divider between one field and the one following it, which is why it can be thought of as "separating" them.
<file1>␜<file2>␜<file3>, the file separator character (␜), despite having "separator" in its name, is both a delimiter and a separator
A terminator marks the end of a group of things, again needed because we don't know how long it is.
abdefa\0, here the null character \0 is a terminator that tells us the string has ended.
foo\n, here the newline character \n is a terminator that tells us the line has ended.
The terms delimiter and separator originate from the classical idea of storage as being conceptually composed of files, records, and fields (a file has many records, a record has many fields). In this context, a single delimiter and pair-delimiters might be called record delimiters and field delimiters. Because of the historical significance of the files-records-fields taxonomy, these terms have a more widespread usage (see Wikipedia page for Delimiter).
Below are two files, each with three records with each record having four fields:
martin,rodgers,33,28000\n
timothy,byrd,22,25000\n
marion,summers,35,37000\n
===
lucille,rowe,28,33000\n
whitney,turner,24,19000\n
fernando,simpson,35,40900\n
Here, , and \n are, as we know, single delimiters, but they might also be called field delimiters and record delimiters respectively.
For complex nested structures, a terminator can also be a delimiter/separator (they're not mutually exclusive definitions). From the previous example, the === marker from inside a file could be considered a terminator (it's the end of the file). But when we look at many files, the === acts like a delimiter/separator.
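To make the nesting concrete, here is a small Python sketch that reads the two-file example. It assumes the === marker sits on its own line followed by a newline, which the example suggests but does not state outright; parse_files is just an illustrative name.

```python
data = (
    "martin,rodgers,33,28000\n"
    "timothy,byrd,22,25000\n"
    "marion,summers,35,37000\n"
    "===\n"
    "lucille,rowe,28,33000\n"
    "whitney,turner,24,19000\n"
    "fernando,simpson,35,40900\n"
)

def parse_files(text):
    """Return a list of files, each a list of records, each a list of fields."""
    files = []
    for chunk in text.split("===\n"):      # === separates one file from the next
        records = chunk.splitlines()       # \n terminates each record
        files.append([record.split(",") for record in records])  # , separates fields
    return files

files = parse_files(data)
print(len(files), files[0][0])
```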
Consider lines in a UNIX file
This is line 1\n
This is line 2\n
This is line 3\n
The newlines are both terminators (they tell us where the string ends) and are delimiters (they tell us where each line begins and ends). From Wikipedia:
Two ways to view newlines, both of which are self-consistent, are that newlines either separate lines or that they terminate lines.
Really you'll only need to say "terminator" when you're talking about one individual item (just one string 1234\0, just one line abcd\n, etc.) -- and it'll be unclear whether the terminator in this context could also be a delimiter in a more complex parent structure.
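Python happens to offer both views of the newline, which makes for a quick illustration: str.split treats \n as a separator, while str.splitlines treats it as a terminator.

```python
text = "This is line 1\nThis is line 2\nThis is line 3\n"

# Separator view: something must exist on both sides, so an empty
# fourth item appears after the final newline.
as_separator = text.split("\n")

# Terminator view: each newline ends a line, nothing follows the last one.
as_terminator = text.splitlines()

print(len(as_separator), len(as_terminator))
```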

This response is in context of CSV because all of the provided answers focus on English language instead.
Delimiters are all elements mentioned in the given CSV specification that describe the boundaries of stuff, separator is a common name for field delimiters, terminator is a common name for record delimiters.
Delimiter is a part of CSV format specification, it defines boundaries and doesn't have to be a printable character.
Terminators, separators and field qualifiers are delimiters but are not necessary to specify a CSV format, e.g. 10 columns field delimiter and 30 columns record delimiter mean each 30 columns are one record and each 10 columns are one field (usually padded with white space). In other words CSV format without separators has a constant field and record length, e.g.:
will smith 1 chris rock 0
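A minimal Python sketch of reading such a positional layout, assuming the 10-column field and 30-column record widths mentioned above and space padding (the sample data is rebuilt with ljust because the padding is not visible in the line as shown):

```python
FIELD_WIDTH = 10    # assumed width of one field
RECORD_WIDTH = 30   # assumed width of one record (three fields)

def parse_fixed_width(data):
    """Cut the data into records and fields by position alone."""
    records = []
    for start in range(0, len(data), RECORD_WIDTH):
        record = data[start:start + RECORD_WIDTH]
        fields = [record[i:i + FIELD_WIDTH].rstrip()
                  for i in range(0, RECORD_WIDTH, FIELD_WIDTH)]
        records.append(fields)
    return records

# Rebuild the example with explicit padding.
values = ["will", "smith", "1", "chris", "rock", "0"]
data = "".join(value.ljust(FIELD_WIDTH) for value in values)
print(parse_fixed_width(data))
```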
Terminator is a delimiter that marks the end of a single CSV record and is usually represented either by Line Feed (LF), a Carriage Return (CR) or a combination of both (e.g. CRLF), e.g.:
will smith 1
chris rock 0
Separator is a delimiter that marks the division between CSV fields and is most often represented by a comma (or a semicolon), it has been introduced to store dynamic length values, e.g. two comma separated records in CSV format with CRLF terminator after 1 and 0:
will,smith,1
chris,rock,0
Field qualifier is a delimiter usually used in pairs instead of escape sequence. It is a printable character that isn't allowed in the field value (unless given CSV format specification provides the escape sequence) and marks the beginning and the end of a field, it was introduced to store values containing separators, e.g. this CSV has 2 records with 3 fields each but 3rd field value can contain a semicolon that otherwise acts as a fields separator:
will;smith;"rich;famous;slaps people"
chris;rock;"rich;famous;gets slapped"
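For what it's worth, Python's standard csv module exposes these roles directly: delimiter is the field separator and quotechar is the field qualifier. A short sketch with the two records above:

```python
import csv
import io

text = (
    'will;smith;"rich;famous;slaps people"\n'
    'chris;rock;"rich;famous;gets slapped"\n'
)

# delimiter = field separator, quotechar = field qualifier
rows = list(csv.reader(io.StringIO(text), delimiter=";", quotechar='"'))
for row in rows:
    print(row)
```

Each row comes back with three fields, and the semicolons inside the qualified field are kept as data.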
Escape sequence is a character (or a set of characters) that marks anything that follows the escape sequence as non-significant and therefore as a part of the field value (e.g. backslash might specify the immediately following separator as a part of the value). This sequence can escape one or multiple characters, e.g. CSV with \ as a 1 character escape sequence:
will;smith;rich\;famous\;slaps people 100\\100% of time
chris;rock;rich\;famous\;slaps people 0\\100% of time
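A hand-rolled sketch of a one-character escape in Python (split_escaped is a hypothetical helper, not part of any CSV library), assuming the escape character makes exactly the next character literal:

```python
def split_escaped(line, separator=";", escape="\\"):
    """Split on separator, treating the character after escape as plain data."""
    fields = []
    current = []
    chars = iter(line)
    for ch in chars:
        if ch == escape:
            # Take the next character literally; a trailing escape is kept as-is.
            current.append(next(chars, escape))
        elif ch == separator:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

line = r"will;smith;rich\;famous\;slaps people 100\\100% of time"
print(split_escaped(line))
```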

Delimiter
There are a couple of senses for delimiter:
As the space used in sentences (frontier).
A delimiter is like a frontier, it exists between countries.
In that sense, there must be two countries to have a frontier.
A space usually exists between words, but not at the end. The space delimits words but does not terminate sentences (collection of words). The sentence:
This is a short sentence.
Has four spaces, they act as word delimiters. There is no ending space.
In fact, there are two additional delimiters usually not named: The start and end of the sentence. Like the ^ and $ used in regular expressions to mark the start and end of a string of text.
And, in human language, there are punctuation marks (dot, comma, semicolon, colon, etc.) that serve also as word delimiters (additionally to spaces)
As used in quotes (boundary).
A sentence like:
“This is a short sentence.”
Is delimited (start and end) by the double quotes (“”). In this sense it is like "balanced delimiters" (Balanced Brackets in Wikipedia).
Some may argue the frontier and boundary are essentially the same, and, under some conditions they actually are correct.
Separator
Is exactly the same as the first sense (above) of a delimiter (a frontier).
So, a separator is a synonym of delimiter in many computer uses.
Terminator
Demarcates the end of an individual "field".
Like the newlines in a Unix text file. Each line is terminated by a NewLine (\n).
In a proper Unix text file all lines are terminated (even the last one).
Like paragraphs are terminated by a newline in human language.
Or, more strictly, as the NUL (\0) is the terminator of a C string:
A string is defined as a contiguous sequence of code units terminated by the first zero code unit (often called the NUL code unit).
So, a terminator character is also a delimiter but must also appear at the end.
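A tiny Python illustration of the "first zero code unit" rule, using a made-up byte buffer: everything up to the first NUL is the string, and whatever follows is not part of it.

```python
# Hypothetical buffer: a C-style string followed by unrelated bytes.
buffer = b"abdefa\x00leftover bytes"

# partition stops at the FIRST terminator, as the definition requires.
value, terminator, rest = buffer.partition(b"\x00")
print(value)
```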
Tags
Stack Overflow has tags only for delimiters and separators:
delimiter: A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams.
separator: A character that separates parts of a string.
The terminator tag applies only to a terminal emulator:
terminator: Terminator is a GPL terminal emulator.
And, yes, delimiter and separator are often equivalent, except for parentheses, braces, square brackets and similar balanced delimiters.

Interesting question and answers. To summarize, 1) delimiter marks the "limits" of something, i.e. beginning and/or end; 2) terminator is just a special term for "end delimiter"; 3) separator entails there are items on both sides of it (unlike delimiter).
Best example I can think of for a start delimiter is the start-comment markers in programming languages ("#", "//", etc.).
Best example I can think of for a terminator (end delimiter) is the newline character in Unix. It's a misnomer -- it always terminates a (possibly empty) line but doesn't always start a new line, i.e. when it is the last character in a file. Maybe a better common example is the simple period for sentences.
Best example I can think of for a separator is the simple comma. Note that comma never appears in English without text both before and after it.
Interesting to note that none of these is necessarily limited to single-character. In fact awk (or maybe only gawk?) in Unix allows FS (field separator) to be any regexp.
Also, although "any non-zero amount of whitespace" is considered a "word delimiter" in e.g. the wc command, there are also zero-width "word boundary" specifiers in regexps (e.g. \b). Interesting to ponder whether such zero-width items/boundaries could be considered "delimiters" as well. I tend to think not (too much of a stretch).
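Both points can be tried out in Python's re module, used here as a stand-in for awk's FS (the sample strings are invented): a separator can be an arbitrary pattern, and \b matches positions rather than characters.

```python
import re

# A multi-character, pattern-based field separator:
# optional whitespace around a comma or semicolon.
fields = re.split(r"\s*[,;]\s*", "alpha , beta;gamma")

# \b is zero-width: it matches at offsets and consumes nothing.
boundaries = [m.start() for m in re.finditer(r"\b", "foo bar")]

print(fields)
print(boundaries)
```

The boundary matches are all empty strings, which is a fair argument for not calling them delimiters.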

Terminators are separators when you start with empty. A;B;C; is actually A;B;C;empty.
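That equivalence is easy to check in Python: terminating every item produces the same text as separating the items plus one trailing empty item.

```python
items = ["A", "B", "C"]

# Terminator view: every item is followed by ';'.
terminated = "".join(item + ";" for item in items)

# Separator view: ';' goes between items, with an empty item appended.
separated_with_empty = ";".join(items + [""])

print(terminated == separated_with_empty)
```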

Just like the English language, there is the technically correct answer, and the generally used answer, and it is probably relevant to isolate to the programming usage of the term definitions being sought.
The industry has long used the phrase 'Comma Delimited' file to mean:
FirstRowFirstValue,FirstRowSecondValue,FirstRowThirdValue
SecondRowFirstValue,SecondRowSecondValue,SecondRowThirdValue
TECHNICALLY, this is a Comma 'SEPARATED' list.
TECHNICALLY, THIS is a Comma 'DELIMITED' list.
,FirstRowFirstValue,FirstRowSecondValue,FirstRowThirdValue,
,SecondRowFirstValue,SecondRowSecondValue,SecondRowThirdValue,
or this:
,FirstRowFirstValue,,FirstRowSecondValue,,FirstRowThirdValue,
,SecondRowFirstValue,,SecondRowSecondValue,,SecondRowThirdValue,
and nobody does that. Ever.
And the industry standard is to use 'TEXT QUALIFIER' for the TECHNICAL definition of a 'DELIMITER' where (") is the 'TEXT QUALIFIER' and (,) is called the 'DELIMITER'.
FirstRowFirstValue,"First Row Second Value",FirstRowThirdValue
SecondRowFirstValue,SecondRowSecondValue,SecondRowThirdValue

Adding to the answers here already, I've used the term notator.
Annotation is a super set of notation.
A notator is the super set of delimiter.
A delimiter is the super set of terminator and separator.
Annotation is all notation and markup used in a particular document. For example, a "TODO List" document must be a line separated list of strings.
Notation is markup used to denote specific meaning. For example, "strings are in quotes" is a notation.
A delimiter is the character or set of characters used to denote a notation. For example, the character quote is the delimiter for strings.
A terminator is an ending delimiter and a prefix is a starting delimiter. For the "TODO List" document, quote may be used as both the prefix and the terminating delimiter.
A separator is a delimiter that separates two things. For example, "new line" is the separator for each "TODO List" item. In this example, "new line" is also a terminator; a new line may be used to terminate each line. A separator also being a terminator is typical, but not guaranteed to always be the case.
Delimiters can also be "positional". A positionally delimited example is a column delimited mainframe flat file.

"word 1", "word 2" \NULL
The words are delimited by quotes,
separated by the comma,
and the whole thing is terminated by \NULL.

Related

Antlr differentiating a newline from a \n

Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what its name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two-character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
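The ambiguity is easy to reproduce. The Python function below is a hypothetical re-creation of the behaviour described above (it is not ANTLR's source code); it shows two different strings rendering identically, and how escaping backslashes first would keep them apart.

```python
def escape_whitespace(text):
    # Re-creation for illustration only: escapes newline, CR and tab,
    # but leaves backslashes untouched.
    return text.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")

def escape_unambiguously(text):
    # Doubling backslashes first makes the rendering reversible.
    return escape_whitespace(text.replace("\\", "\\\\"))

escape_then_newline = "hi\\n\nthere"   # backslash, 'n', then a real newline
two_real_newlines = "hi\n\nthere"      # two real newlines

print(escape_whitespace(escape_then_newline))
print(escape_whitespace(two_real_newlines))
print(escape_unambiguously(escape_then_newline))
```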
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you will not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think, "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but maybe it's the process which rendered the representation which is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.

Can EDI files have ~ in data?

I'm parsing an EDI file and splitting by ~s. I am wondering if it's possible for EDI to have ~ in the data itself? Is there a rule that says no ~ in the data? This is for 810/850 etc
The value defined in the 106th character of the ISA segment (or, alternatively – to be a bit less brittle to whitespace issues – the 1st character after the ISA16 element) is the segment delimiter (in official terms: the segment terminator). Most of the time people specify the ~ character, but other choices are certainly valid.
In this example, the 106th character is ~:
ISA*00* *00* *ZZ*AMAZONDS *01*TESTID *070808*1310*U*00401*000000043*1*T*+~
Instead of counting 106 characters (which, again, can be brittle to whitespace issues), you can count 16 elements – that is, 16 asterisks – to find the value for ISA16 (which is +), and then pick the next character (which is ~).
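A short Python sketch of the element-counting approach (find_delimiters is a hypothetical helper name). It assumes the interchange starts with ISA and that the character right after it is the element separator; the sample is the ISA line as it appears above.

```python
def find_delimiters(interchange):
    """Return (element separator, ISA16 value, segment terminator)."""
    element_sep = interchange[3]          # the character right after "ISA"
    pos = -1
    for _ in range(16):                   # walk to the 16th element separator
        pos = interchange.index(element_sep, pos + 1)
    isa16 = interchange[pos + 1]          # single-character ISA16 value
    segment_term = interchange[pos + 2]   # the character after ISA16
    return element_sep, isa16, segment_term

isa = "ISA*00* *00* *ZZ*AMAZONDS *01*TESTID *070808*1310*U*00401*000000043*1*T*+~"
print(find_delimiters(isa))
```

Because it counts separators rather than columns, this still works when the fixed-width padding has been altered.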
There are two relevant sections in the official X12 specification (bolded for emphasis):
12.5.4.3 Delimiter Specifications
The delimiters consist of three separators and a terminator. The
delimiters are devised for inclusion within the data stream of the transfer. The delimiters are:
segment terminator [note: this is the one we're discussing]
data element separator
component element separator
repetition separator
The delimiters are assigned by the interchange sender. These characters are disjoint from those of the data elements; if a character is selected for the data element separator, the component element separator, the repetition separator or the segment terminator from those available for the data elements, that character is no longer available during this interchange for use in a data element. The instance of the terminator (<tr>) must be different from the instance of the data element separator (<gs>), the component element separator (<us>) and the repetition separator (<rs>). The data element separator, component element separator and repetition separator must not have the same character assignment.
So, according to this part of the spec, if the ~ is used as the segment terminator, then the use of the ~ is disallowed in a data element (that is, the textual body).
Now, let's look at section 12.5.A.5 – Recommendations for the Delimiters:
Delimiter characters must be chosen with care, after consideration of data content, limitations of the transmission protocol(s) used, and applicable industry conventions. In the absence of other guidelines, the following recommendations are offered:
<tr> terminator: ~ | Note: the "~" was chosen for its infrequency of use in textual data.
This section is saying that ~ was chosen as the default because ~ is seldom found in textual data (it would have been a bad idea, for example, to use . as the default, since that's such a common inclusion).
That said, even though using the segment terminator character inside a data element is technically prohibited, it's still possible for an EDI transmission to inadvertently include ~ in the textual data – in other words, your trading partner may include this by accident. Further, the BIN and BSD (binary data) segments can certainly include ~ (though these may not apply based on the transaction sets you're working with).
In our parsing API, we apply a specific set of patterns based on the type of segment we encounter. For us, it's not sufficient to split naively on the segment delimiter alone because we may encounter binary segments (BIN, BSD), where the segment delimiter character may legitimately appear in the data.
For a regular segment (i.e. not BIN or BSD), the logic is something like this:
Consume the segment code (i.e. the characters before the first element delimiter).
Consume each element of the segment based on the element delimiter.
Stop if the next character is a segment delimiter or a new line.
As an example, for segment BEG*PO-00001**20210901~, the process would look like:
Consume BEG. Since this is not a special segment (BIN or BSD), consume elements by splitting on *.
Consume PO-00001.
Consume ''.
Consume 20210901.
Stop since next char is ~.
(The pattern for binary segments is different from the pattern we use for regular segments.)
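The walk-through for a regular segment can be sketched in Python roughly as follows. This is only an illustration of the steps listed above, not the actual parser; parse_segment and its default delimiters are assumptions, and BIN/BSD handling is left out entirely.

```python
def parse_segment(text, element_sep="*", segment_term="~"):
    """Return (segment code, elements) for one regular, non-binary segment."""
    end = len(text)
    for i, ch in enumerate(text):
        # Stop at the segment delimiter or at a new line.
        if ch == segment_term or ch == "\n":
            end = i
            break
    # The first piece is the segment code, the rest are the elements.
    code, *elements = text[:end].split(element_sep)
    return code, elements

print(parse_segment("BEG*PO-00001**20210901~"))
```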
Here's an example of how our parser "fails" on a ~ in the textual data when the ISA16 segment delimiter is also ~; the JSON representation is particularly helpful for seeing the issue.
Here's an example of our parser succeeding on a ~ in the textual data when the ISA16 segment delimiter is ^.
Lastly, here's an example of our parsing succeeding where the ~ is specified in ISA16, but has been omitted altogether in favor of newlines – which we see occasionally.
Hope this helps.

How to create a model with table names as integers?

Is there a way to generate models with table field names as numbers?
rails g model Numbers 1-10:string 11-20:string
You should not do this according to both SQL Standards and Ruby Syntax.
PostgreSQL 4.1.1. Identifiers and Key Words
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable. The SQL standard will not define a key word that contains digits or starts or ends with an underscore, so identifiers of this form are safe against possible conflict with future extensions of the standard.
Many SQL servers, such as MSSQL and MySQL/MariaDB, will allow column names that start with numbers. However, referencing these columns requires explicit double quotes or square brackets (MSSQL) or backticks (MySQL/MariaDB); otherwise 1-10 would be treated as an expression rather than a column reference, resulting in -9.
Also, even if this weren't true, Ruby does not allow method names to begin with a number, so this would make for very awkward code.
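The expression-versus-identifier problem can be seen with any engine that accepts quoted identifiers. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in (SQLite is not one of the servers named above, and the table and value are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE numbers ("1-10" TEXT)')
conn.execute('INSERT INTO numbers ("1-10") VALUES (?)', ("first range",))

# Unquoted, 1-10 is arithmetic.
unquoted = conn.execute("SELECT 1-10 FROM numbers").fetchone()[0]

# Quoted, it is a column reference.
quoted = conn.execute('SELECT "1-10" FROM numbers').fetchone()[0]

print(unquoted, quoted)
conn.close()
```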
Method names may be one of the operators or must start a letter or a character with the eight bit set. It may contain letters, numbers, an _ (underscore or low line) or a character with the eight bit set. The convention is to use underscores to separate words in a multiword method name.

CFStringTokenizer not tokenizing lower-case sentences

I'm trying to use CFStringTokenizer with kCFStringTokenizerUnitSentence to split a string into sentences. The first problem I'm having is that sentences need to be capitalized in order for them to be recognized as sentences. If not, it just thinks it's part of the previous sentence.
I'm splitting user-entered text so I'm expecting the text to be very unclean.
Is there something else I can do with CFStringTokenizer to have it detect uncapitalized sentences? Or will I have to use another method of splitting altogether?
I followed the answer on this SO question for my implementation:
How to get an array of sentences using CFStringTokenizer?
NOTE: After testing a bit more it seems that with kCFStringTokenizerUnitSentence, if a '!' or a '?' is followed by an uncapitalized sentence, it will recognize the sentence. Also, if one of those punctuation marks is followed by a sentence without a space between the '!' and the first word, it will still separate.
So the one case I need to work around is a '.' followed by an uncapitalized sentence.
ANOTHER OPTION I found, if you're getting the text from a textField, is to use this:
textField.autocapitalizationType = UITextAutocapitalizationTypeSentences;
It will automatically capitalize sentences so you don't have to worry about converting for CFStringTokenizer. It still doesn't account for edge cases like abbreviations, but at least in my case the user will have an option to delete the auto-capitalization if it's wrong.
You can convert the input string to all uppercase first and then run it through CFStringTokenizer and use the ranges to get the substrings of the original input string. But you must be careful here because some characters might become more than 1 character after conversion to uppercase.
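The length caveat is real. As a quick illustration (in Python rather than Core Foundation, and with an invented sample string), the German ß uppercases to two characters, so offsets computed on the uppercased copy no longer line up with the original:

```python
original = "die straße ist lang. es regnet."
uppercased = original.upper()

# 'ß' becomes 'SS', so every offset after it shifts by one.
print(len(original), len(uppercased))
print(original.index("."), uppercased.index("."))
```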

What to do when unescapable character(s) are escaped?

In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
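A minimal sketch of the strict approach in Python, for a made-up mini-language in which only \" and \\ are legal escapes (the table and function names are illustrative):

```python
# Hypothetical escape table: anything not listed here is a syntax error.
LEGAL_ESCAPES = {'"': '"', "\\": "\\"}

def unescape(body):
    """Decode the body of a string literal, rejecting undefined escapes."""
    out = []
    chars = iter(body)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        following = next(chars, None)
        if following not in LEGAL_ESCAPES:
            shown = "end of input" if following is None else following
            raise ValueError("illegal escape: backslash followed by " + shown)
        out.append(LEGAL_ESCAPES[following])
    return "".join(out)

print(unescape(r'\"We want Moshiach Now\"'))
```

With this in place, the escaped s in \said is reported as an error, which leaves \s free to be given a meaning later.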
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's probably the safest mechanism, and it's not really burdensome to write.
The other option is to ensure that you are "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison, and comparing only the final values rather than the literals.
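URL percent-encoding is a handy example of the second option. In this Python sketch (the paths are invented), the two literals differ, but they name the same value once decoded, so any comparison should happen after canonicalizing:

```python
from urllib.parse import unquote

requested = "/files/%61dmin"   # 'a' written as a percent-escape
blocked = "/files/admin"

literal_match = requested == blocked
canonical_match = unquote(requested) == unquote(blocked)

print(literal_match, canonical_match)
```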
Most systems interpret the backslash as Will Hartung says, except for alphanumerics, which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexes. JavaScript, which interprets it as 's' in one context and as whitespace in another, suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').

Resources