CFStringTokenizer not tokenizing lower-case sentences - ios

I'm trying to use CFStringTokenizer with kCFStringTokenizerUnitSentence to split a string into sentences. The first problem I'm having is that sentences need to be capitalized in order for them to be recognized as sentences. If not, it just thinks it's part of the previous sentence.
I'm splitting user-entered text so I'm expecting the text to be very unclean.
Is there something else I can do with CFStringTokenizer to have it detect uncapitalized sentences? Or will I have to use another method of splitting altogether?
I followed the answer on this SO question for my implementation:
How to get an array of sentences using CFStringTokenizer?
NOTE: After testing a bit more it seems that with kCFStringTokenizerUnitSentence, if a '!' or a '?' is followed by an uncapitalized sentence, it will recognize the sentence. Also, if one of those punctuation marks is followed by a sentence without a space between the '!' and the first word, it will still separate.
So the one case I need to work around is a '.' followed by an uncapitalized sentence.
ANOTHER OPTION I found, if you're getting the text from a textField, is to use this:
textField.autocapitalizationType = UITextAutocapitalizationTypeSentences;
It will automatically capitalize sentences so you don't have to worry about converting for CFStringTokenizer. It still doesn't account for edge cases like abbreviations, but at least in my case the user will have an option to delete the auto-capitalization if it's wrong.

You can convert the input string to all uppercase first and then run it through CFStringTokenizer and use the ranges to get the substrings of the original input string. But you must be careful here because some characters might become more than 1 character after conversion to uppercase.

Related

NSRegularExpression not matching number sign (#)

I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have run into a few problems already due to the number sign (hash tag).
#
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have a NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched. (i.e. the NSTextCheckingResult's second range has a location of NSNotFound) Note, it works for Bb... it matches the 'b'
I'm wondering what I need to do here. It would seem the documentation doesn't cover this case of '#' which IS in fact sometimes used in Regex patterns (I think related to comments or sth)
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
The \b word boundary is a context dependent construct. It matches in 4 contexts: 1) between start of string and a word char, 2) between a word char and end of string, 3) between word and a non-word and 4) a non-word and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
See the regex demo.
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead failing the match (and causing backtracking into the preceding pattern! To avoid that, add + after ? to prevent backtracking into that pattern) if there is a word char immediately to the right of the current position.
(I'd like to first say #WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. They are made up of a letter between A->G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played) An accidental can be a flat (represented as a ♭ or simply a b), or a sharp (represented as a ♯ or simply a #, as these are easier to type on a keyboard). An accidental serves to make a note a semitone higher (#) or lower (b). As such, a F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. Depending on some factors of the piece of music, that piece won't mix accidental types. It will either be flats throughout the piece or sharps. (Depending on the musical key of the composition, but this is not that relevant here.)
In terms of regex, you have something like ABCDEFG? to determine the note. In reality it's more complicated.
Then, a Musical Chord is comprised of the root note and it's chord type. There are over 50 types of chords. They have a 'text signature' that is unique. Also, a 'major' chord has an empty signature. So in terms of pseudo-regex you have for a Chord:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional is to determine the chord type. The different types were omitted, but can be as simple as a m (for Minor chord), or maj7#5 (for a major 7th chord with an augmented 5th... don't worry about it. Just know there are many string constants to represent a chord type)
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its unicode \u0023, or what ultimately worked was replacing all occurrences of # with another character (such as 'S'), and then convert it back once regex did it's thing. So this code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a Regex Pattern to perfectly parse a chord structure. It wasn't fully working for a Chord with a bass note, but it would successfully match a Chord with a bass note, then I had to split those 2 components and parse them separately, then recombine them
Regex is really a bit of voodoo, and I think it sucks that for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to Regex patterns he wrote to help me solve the problem on www.regex101.com, that would WORK on that website, but these would not work on iOS, and NSRegularExpression would throw an error (often it had something to do with this # character)
My solution pays absolutely no regard to performance. It just wanted it to work.

I want to my edittext first 6 characters should allow only alphabets and next characters should be only numbers

I want to my edittext first 6 characters should allow only alphabets and next characters should be only numbers and whole edittext should not allow special characters
When you can get your hands on the text the user typed you can try putting it through a regular expression. Java has a class, java.util.regex.Pattern, that can tell you whether a string fits a pattern. (the match method)
Your pattern would be something like:
[\p{Alpha}][\p{Alpha}][\p{Alpha}][\p{Alpha}][\p{Alpha}][\p{Alpha}][\d]+
This regular expression says "six alphabet characters followed by one or more digits."
That's the best I can offer. Your question is a bit vague. Does the string the user enters have to have six letters? Or is it one to six letters? If it's the latter then my expression above is insufficient. And what about the digits? Is there a minimum number of digits required?

Smarter Autocapitalization

I've been looking around, and I am wondering whether there is a simple way to capitalize all words in a UITextField, while leaving certain words (such as of, the, or, etc.) lowercase, unless they are the first word of the phrase.
This is an
Example of the Effect I'm Trying to Convey.
One of the methods I've found is to search the text field value for the certain words and replace them with lowercase versions, as the user types a new word or character, perhaps listening for the space bar.
I'm not sure if the method above is best practice, or whether my searches haven't been broad enough to find a solution already in the mix.
I was originally thinking something along these "pseudocode" lines:
When value of textfield is changed
Get current value textfield
For each word in value:
If the word matches ("For", "Of", "The", etc.) and the word is not the first word in the value:
Change the word to lowercase, and replace word
Go to next word
My actual question is mainly one of performance. Would this method be overly strenuous on my application? If so, are there any better solutions?
Thank you all for your assistance!
Update:
Thanks to holex, cluemein, and others who have already commented and answered. I will try your solutions when I get the opportunity to do so.
A better way then converting the words to lowercase is to capitalize the words that are NOT those words you specified. Set up if statements to capitalize the beginning letter of the first word, and to capitalize the words following that if they are not the words you specified. Then, if you want to make sure the specified words weren't capitalized after the first word, use an else statement. "pseudocode" example:
Capitalize letter of first word;
Move on to next word;
While not end of textfield (or while typing):
if word is not ("the"|"and"|"of"|"or"|...):
capitalize first letter;
else:
set first letter to lowercase;
move to next word at space;
This will on average be roughly twice as fast as going back through the text looking for the specified words in terms of runtime. This isn't the code you would use, but the algorithm you would implement. Also, take into account what holex said about spaces. I leave how you implement this algorithm up to you. Just to clarify, this algorithm is for both autocapitalizng and auto-setting to lower case.

When to use the terms "delimiter," "terminator," and "separator"

What are the semantics behind usage of the words "delimiter," "terminator," and "separator"? For example, I believe that a terminator would occur after each token and a separator between each token. Is a delimiter the same as either of these, or are they simply forms of a delimiter?
SO has all three as tags, yet they are not synonyms of each other. Is this because they are all truly different?
A delimiter denotes the limits of something, where it starts and where it ends. For example:
"this is a string"
has two delimiters, both of which happen to be the double-quote character. The delimiters indicate what's part of the thing, and what is not.
A separator distinguishes two things in a sequence:
one, two
1\t2
code(); // comment
The role of a separator is to demarcate two distinct entities so that they can be distinguished. (Note that I say "two" because in computer science we're generally talking about processing a linear sequence of characters).
A terminator indicates the end of a sequence. In a CSV, you could think of the newline as terminating the record on one line, or as separating one record from the next.
Token boundaries are often denoted by a change in syntax classes:
foo()
would likely be tokenised as word(foo), lparen, rparen - there aren't any explicit delimiters between the tokens, but a tokenizer would recognise the change in grammar classes between alpha and punctuation characters.
The categories aren't completely distinct. For example:
[red, green, blue]
could (depending on your syntax) be a list of three items; the brackets delimit the list and the right-bracket terminates the list and marks the end of the blue token.
As for SO's use of those terms as tags, they're just that: tags to indicate the topic of a question. There isn't a single unified controlled vocabulary for tags; anyone with enough karma can add a new tag. Enough differences in terminology exist that you could never have a single controlled tag vocabulary across all of the topics that SO covers.
Technically a delimiter goes between things, perhaps in order to tell you where one field ends and another begins, such as in a comma-separated-value (CSV) file.
A terminator goes at the end of something, terminating the line/input/whatever.
A separator can be a delimiter or anything else that separates things. Consider the spaces between words in the English language for example.
You could argue that a newline character is a line terminator, a delimiter of lines or something that separates two lines. For this reason there are a few different newline-type characters in the Unicode specification.
A delimiter is one or two markers that show the start and end of something. They're needed because we don't know how long that 'something' will be. We can have either: 1. a single delimiter, or 2. a pair of pair-delimiters
[a, b, c, d, e] each comma (,) is a single delimiter. The left and right brackets, ([, ]) are pair-delimiters.
"hello", the two quote symbols (") are pair-delimiters
A seperator is a synonym of a "delimiter", but from my experience it usually refers to field delimiters. A field delimiter acts as a divider between one field and the one following it, which is why is can be though of as "separating" them.
<file1>␜<file2>␜<file3>, the file separator character (␜), despite explicitly the name having "separator", is both a delimiter and a separator
A terminator marks the end of a group of things, again needed because we don't know how long it is.
abdefa\0, here the null character \0 is a terminator that tells us the string has ended.
foo\n, here the newline character \n is a terminator that tells us the line has ended.
The terms, delimiter, separator originate from the classical idea of storage, conceptually, being comprised of files, records, and fields, (a file has many records, a record has many fields). In this context, a single delimiter and pair-delimiters might be called record delimiters and field delimiters. Because of the historical significance of files-records-field taxonomy, this terms have a more widespread usage (see Wikipedia page for Delimiter).
Below are two files, each with three records with each record having four fields:
martin,rodgers,33,28000\n
timothy,byrd,22,25000\n
marion,summers,35,37000\n
===
lucille,rowe,28,33000\n
whitney,turner,24,19000\n
fernando,simpson,35,40900\n
Here, , and \n as we know are single delimiters, but they might also be called a record delimiters and field delimiters respectively.
For complex nested structures, a terminator can also be a delimiter/separator (they're not mutually exclusive definitions). From the previous example, the === marker from inside a file could be considered a terminator (it's the end of the file). But when we look at many files, the === acts like a delimiter/separator.
Consider lines in a UNIX file
This is line 1\n
This is line 2\n
This is line 3\n
The newlines are both terminators (they tell us where the string ends) and are delimiters (they tell us where each line begins and ends). From Wikipedia:
Two ways to view newlines, both of which are self-consistent, are that newlines either separate lines or that they terminate lines.
Really you'll only need to say "terminator" when you're talking at one individual item, (just one string 1234\0, just one line abcd\n, etc.) -- and it'll be unclear whether the terminator in this context could also be a delimiter in a more complex parent structure.
This response is in context of CSV because all of the provided answers focus on English language instead.
Delimiters are all elements mentioned in the given CSV specification that describe the boundaries of stuff, separator is a common name for field delimiters, terminator is a common name for record delimiters.
Delimiter is a part of CSV format specification, it defines boundaries and doesn't have to be a printable character.
Terminators, separators and field qualifiers are delimiters but are not necessary to specify a CSV format, e.g. 10 columns field delimiter and 30 columns record delimiter mean each 30 columns are one record and each 10 columns are one field (usually padded with white space). In other words CSV format without separators has a constant field and record length, e.g.:
will smith 1 chris rock 0
Terminator is a delimiter that marks the end of a single CSV record and is usually represented either by Line Feed (LF), a Carriage Return (CR) or a combination of both (e.g. CRLF), e.g.:
will smith 1
chris rock 0
Separator is a delimiter that marks the division between CSV fields and is most often represented by a comma (or a semicolon), it has been introduced to store dynamic length values, e.g. two comma separated records in CSV format with CRLF terminator after 1 and 0:
will,smith,1
chris,rock,0
Field qualifier is a delimiter usually used in pairs instead of escape sequence. It is a printable character that isn't allowed in the field value (unless given CSV format specification provides the escape sequence) and marks the beginning and the end of a field, it was introduced to store values containing separators, e.g. this CSV has 2 records with 3 fields each but 3rd field value can contain a semicolon that otherwise acts as a fields separator:
will;smith;"rich;famous;slaps people"
chris;rock;"rich;famous;gets slapped"
Escape sequence is a character (or a set of characters) that marks anything that follows the escape sequence as non-significant and therefore as a part of the field value (e.g. backslash might specify the immediately following separator as a part of the value). This sequence can escape one or multiple characters, e.g. CSV with \ as a 1 character escape sequence:
will;smith;rich\;famous\;slaps people 100\\100% of time
chris;rock;rich\;famous\;slaps people 0\\100% of time
Delimiter
There are a couple of senses for delimiter:
As the space used in sentences (frontier).
A delimiter is like a frontier, it exists between countries.
In that sense, there must be two countries to have a frontier.
An space usually exists between words, but not at the end. The space delimits words but does not terminate sentences (collection of words). The sentence:
This is a short sentence.
Has four spaces, they act as word delimiters. There is no ending space.
In fact, there are two additional delimiters usually not named: The start and end of the sentence. Like the ^ and $ used in regular expressions to mark the start and end of an string of text.
And, in human language, there are punctuation marks (dot, comma, semicolon, colon, etc.) that serve also as word delimiters (additionally to spaces)
As used in quotes (boundary).
A sentence like:
“This is a short sentence.”
Is delimited (start and end) by the double quotes (“”). In this sense it is like "balanced delimiters" (Balanced Brackets in Wikipedia).
Some may argue the frontier and boundary are essentially the same, and, under some conditions they actually are correct.
Separator
Is exactly the same as the first sense (above) of a delimiter (a frontier).
So, a separator is a synonym of delimiter in many computer uses.
Terminator
Demarcate the end of an individual "field".
Like the newlines in a Unix text file. Each line is terminated by a NewLine (\n).
In a proper Unix text file all lines are terminated (even the last one).
Like paragraphs are terminated by a newline in human language.
Or, more strictly, as the NUL (\0) is the terminator of a C string:
A string is defined as a contiguous sequence of code units terminated by the first zero code unit (often called the NUL code unit).
So, a terminator character is also a delimiter but must also appear at the end.
Tags
Stackoverflow has tags only for delimiters and separators
delimiterA delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams.
separatorA character that separates parts of a string.
The terminator tag only apply to a shell terminal emulator:
terminatorTerminator is a GPL terminal emulator.
And, yes, delimiter and separator are many times equivalent
except for the parenthesis, braces, square brackets and similar balanced delimiters.
Interesting question and answers. To summarize, 1) delimiter marks the "limits" of something, i.e. beginning and/or end; 2) terminator is just a special term for "end delimiter"; 3) separator entails there are items on both sides of it (unlike delimiter).
Best example I can think of for a start delimiter is the start-comment markers in programming languages ("#", "//", etc.).
Best example I can think of for a terminator (end delimiter) is the newline character in Unix. It's a misnomer -- it always terminates a (possibly empty) line but doesn't always start a new line, i.e. when it is the last character in a file. Maybe a better common example is the simple period for sentences.
Best example I can think of for a separator is the simple comma. Note that comma never appears in English without text both before and after it.
Interesting to note that none of these is necessarily limited to single-character. In fact awk (or maybe only gawk?) in Unix allows FS (field separator) to be any regexp.
Also, although "any non-zero amount of whitespace" is considered a "word delimiter" in e.g. the wc command, there are also zero-width "word boundary" specifiers in regexps (e.g. \b). Interesting to ponder whether such zero-width items/boundaries could be considered "delimiters" as well. I tend to think not (too much of a stretch).
Terminators are separators when you start with empty. A;B;C; is actually A;B;C;empty.
Just like the English language, there is the technically correct answer, and the generally used answer, and it is probably relevant to isolate to the programming usage of the term definitions being sought.
The industry has long used the phrase 'Comma Delimited' file to mean:
FirstRowFirstValue,FirstRowSecondValue,FirstRowThirdValue
SecondRowFirstValue,SecondRowSecondValue,SecondRowThirdValue
TECHNICALLY, this is a Comma 'SEPARATED' list.
TECHNICALLY, THIS is a Comma 'DELIMITED' list.
,FirstRowFirstValue,FirstRowSecondValue,FirstRowThirdValue,
,SecondRowFirstValue,SecondRowSecondValue,SecondRowThirdValue,
or this:
,FirstRowFirstValue,,FirstRowSecondValue,,FirstRowThirdValue,
,SecondRowFirstValue,,SecondRowSecondValue,,SecondRowThirdValue,
and nobody does that. Ever.
And the industry standard is to use 'TEXT QUALIFIER' for the TECHNICAL definition of a 'DELIMITER' where (") is the 'TEXT QUALIFIER' and (,) is called the 'DELIMITER'.
FirstRowFirstValue,"First Row Second Value",FirstRowThirdValue
SecondRowFirstValue,SecondRowSecondValue,SecondRowThirdValue
Adding to the answer here already, I've use the term notator.
Annotation is a super set of notation.
A notator is the super set of delimiter.
A delimiter is the super set of terminator and separator.
Annotation is all notation and markup used in a particular document. For example, a "TODO List" document must be a line separated list of strings.
Notation is markup used to denote specific meaning. For example, "string are in quotes" is a notation.
A delimiter is the character or set of characters used to denote a notation. For example, the character quote is the delimiter for strings.
A terminator is ending delimiter and prefix is the starting delimiter. For the "TODO List" document, quote may be used as the prefix and terminating delimiter.
A seperator is a delimiter that separates two things. For example, "new line" is the separator for each "TODO List" item. In this example, "new line" is also a terminator; a new line may be used to terminate each line. A separator also being a terminator is typical, but not guaranteed to always be the case.
Delimiters can also be "positional". A positionally delimited example is a column delimited mainframe flat file.
"word 1", "word 2" \NULL
The words are delimited by quotes,
separated by the comma,
and the whole thing is terminated by \NULL.

Regex for string first chars

in my Rails app I need to validate a string that on creation can not have its first chars empty or composed by any special chars.
For example: " file" and "%file" aren't valid. Do you know what Regex I should use?
Thanks!
The following regex will only match if the first letter of the string is a letter, number, or '_':
^\w
To restrict to just letters or numbers:
^[0-9a-zA-Z]
The ^ has a special meaning in regular expressions, when it is outside of a character class ([...]) it matches the start of the string (without actually matching any characters).
If you want to match all invalid strings you can place a ^ inside of the character class to negate it, so the previous expressions would be:
^[^\w]
or
^[^0-9a-zA-Z]
A good place to interactively try out Ruby regexes is Rubular. The link I gave shows the answer that #Dave G gave along with a few test examples (and at first glance it seems to work). You could expand the examples to convince yourself further.
The regex
^[^[:punct:][:space:]]+
Should do what you want. I'm not 100% sure of what Ruby provides as far as regular expressions and POSIX class support so your mileage on this may vary.

Resources