Regex that finds a line with exactly 3 words in it - grep

I have a problem that requires me to write a regex that finds a line that containing exactly 3 groups of characters (it could be words or numbers) and that ends with another specific word. The way I had in mind was to find a pattern that ended in a space, and look for it 3 times. assuming this is the correct way to go about it, I do no know how to find a space, but I thought it would look like .*"find a space"{3} endword$. Is this the way it would be done? Even if it is not the way to do it how do you find a space? Any suggestions?

Assuming by three groups of words you would accept any non-space character, you could write:
/^\s*(?:\S+\s+){3}endword$/
The initial caret is to make sure you have exactly 3 non-space groups on the line.
Of course you need to consider whether things like control characters could appear, and adjust accordingly.

Depending on your flavor, something like the below would do it:
\b+.+?\b+.+?\b+.+?\bendword$
This makes use of the word boundary mark (\b) and non-greedy repetitions (+?), so it may be slightly different in your specific implementation, especially if you're using something old like grep.

Related

UILabel: "opposite" of non breaking space? I.e. show where a word can be hyphenated

I need to mark occurrences in words where it can be hyphenated if there's not enough rough, e.g.
loremip|sum
so if there's enough room, it should show loremipsum if not, it should be loremip- on the first line, sum on the second.
Bonus point: if the character is ignored by search, i.e. searching for loremips would find that occurrence. Does something like that exist on iOS?
Using NSParagraphStyle isn't enough as it splits the words at points where it isn't allowed (note: I'm using german). Also I would like to use words not really common as they are dialect.

How to get a % difference of two NSStrings

I'm thinking this may be impossible to do resonably, but I figured I would take a shot at it. So lets say I have two NSStrings. One is #"Singin' In The Rain" and the other is #"Singing In The Rain". These strings are very similar, but have a small difference. I'm trying to find a way where I could write something like the following:
NSString *stringOne = #"Singin' In The Rain";
NSString *stringTwo = #"Singing In The Rain";
float dif = [stringOne differenceFrom:stringTwo];
//dif = .9634 or something like that
One project that I did find similar to this was taken from the previous similar question on Stack Overflow: Check if two NSStrings are similar. However, this simply returns a BOOL which isn't as accurate as I need it to be. I also tried looking into the compare: documentation for NSString but it all looked too basic. Another similar thing I found is at https://gist.github.com/iloveitaly/1515464. However, this gives varying results, even saying two of the same string are different occasionally. Any advice would be much appreciated.
The question is a little vague, but I would assume that the most satisfactory results will come from using NSLinguisticTagger. If you parse each for tags with the NSLinguisticTagSchemeLexicalClass scheme then your string will be broken down into verbs, nouns, adjectives, etc. In your example, even if you weren't spotting that singin' and singing are the same, you'd spot the other three words are the same and that the thing at the end is a noun, so they're both about doing something in the same thing.
It'd probably be wise to use something like a BK-Tree to compare individual words where you suspect there may be a match (a noun obviously doesn't match an adverb but two nouns may match even if spellings differ).
Another off the wall suggestion:
The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.
When comparing text strings for "closeness" then the insertion, deletion or changing of words seems as good a measure as any.
So:
Break each string into "words" (white space separated should be sufficient).
Compare the two lists using the diff algorithm, treating each "word" as a "line", use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync)
Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.
For the two example strings this would give 1:4 changes or 75% similar.
If you want greater granularity for each change split the two words into characters and repeat the algorithm giving you a fraction the word is similar by (as opposed to the whole word).
For the two example strings this would give 3 6/7 words out of 4, or 96% similar.
I'd recommend dynamic time warping for such comparisons:
http://en.wikipedia.org/wiki/Dynamic_time_warping
This will however return distance between two strings (so you'll get 0 for identical), but this the best starting point I can think of.

seaching 2D ArrayLib does not work for some cases

I have 2D array in which the second column has domain names of some emails, let us call the array myData[][]. I decided to use ArrayLib in order to search the second column for a specific domain.
ArrayLib.indexOf(myData, 1, domain)
Here is where I found an issue. In myData array, one of the domains look like this "ewmining.com" (pay attention to the w).
While searching for "e.mining.com" (notice the first dot), the indexOf() function actully gave me the row containing "ewmining.com".
This is what is in the array "ewmining.com"
This is what is in the serach string "e.mining.com"
It seams that ArrayLib treats the dot to mean any character. Is this supposed to be the correct behavior? Is there a way to stop this behavior and search for exact match.
I really need help on this issue.
Thanks in advance for your help.
The dot usually represents "any character" in regular expressions. I am not familiar with ArrayLib, but maybe you should look for a way to turn off regular expressions when searching. Otherwise you might have to escape the dot, for example search for e[.]mining[.]com

RegEx how to properly use OR pipelines

I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
(p.s., not sure if it will be any different but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
it's be someting like
/([^....]|\<b\/\>|\<font color .... \>)/
though, the usual caveats about regexes and html apply here.
As for the confusion about where to put the |, consider this this hackneyed example: You want to find the word color, but also want to accommodate the british spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.

Latex - Apply an operation to every character in a string

I am using LaTeX and I have a problem concerning string manipulation.
I want to have an operation applied to every character of a string, specifically
I want to replace every character "x" with "\discretionary{}{}{}x". I want to do
this because I have a long string (DNA) which I want to be able to separate at
any point without hyphenation.
Thus I would like to have a command called "myDNA" that will do this for me instead of
inserting manually \discretionary{}{}{} after every character.
Is this possible? I have looked around the web and there wasnt much helpful
information on this topic (at least not any I could understand) and I hoped
that you could help.
--edit
To clarify:
What I want to see in the finished document is something like this:
the dna sequence is CTAAAGAAAACAGGACGATTAGATGAGCTTGAGAAAGCCATCACCACTCA
AATACTAAATGTGTTACCATACCAAGCACTTGCTCTGAAATTTGGGGACTGAGTACACCAAATACGATAG
ATCAGTGGGATACAACAGGCCTTTACAGCTTCTCTGAACAAACCAGGTCTCTTGATGGTCGTCTCCAGGT
ATCCCATCGAAAAGGATTGCCACATGTTATATATTGCCGATTATGGCGCTGGCCTGATCTTCACAGTCAT
CATGAACTCAAGGCAATTGAAAACTGCGAATATGCTTTTAATCTTAAAAAGGATGAAGTATGTGTAAACC
CTTACCACTATCAGAGAGTTGAGACACCAGTTTTGCCTCCAGTATTAGTGCCCCGACACACCGAGATCCT
AACAGAACTTCCGCCTCTGGATGACTATACTCACTCCATTCCAGAAAACACTAACTTCCCAGCAGGAATT
just plain linebreaks, without any hyphens. The DNA sequence will be one
long string without any spaces or anything but it can break at any point.
This is why my idea was to inesert a "\discretionary{}{}{}" after every
character, so that it can break at any point without inserting any hyphens.
This takes a string as an argument and calls \discretionary{}{}{} after each character. The input string stops at the first dollar sign, so you should not use that.
\def\hyphenateWholeString #1{\xHyphenate#1$\wholeString}
\def\xHyphenate#1#2\wholeString {\if#1$%
\else\say{#1}\discretionary{}{}{}%
\takeTheRest#2\ofTheString
\fi}
\def\takeTheRest#1\ofTheString\fi
{\fi \xHyphenate#1\wholeString}
\def\say#1{#1}
You’d call it like \hyphenateWholeString{CTAAAGAAAACAGGACG}.
Instead of \discretionary{}{}{} you can also try \hspace{0pt}, if you like that more (and are in a latex environment). In order to align the right margin, I think you’d need to do some more fine tuning (but see below). The effect is of course minimised by using a font of fixed width.
Revision:
\def\hyphenateWholeString #1{\xHyphenate#1$\wholeString\unskip}
\def\xHyphenate#1#2\wholeString {\if#1$%
\else\transform{#1}%
\takeTheRest#2\ofTheString\fi}
\def\takeTheRest#1\ofTheString\fi
{\fi \xHyphenate#1\wholeString}
\def\transform#1{#1\hskip 0pt plus 1pt}
Steve’s suggestion of using \hskip sounds like a very good idea to me, so I made a few corrections. Note that I’ve renamed the \say macro and made it more useful in that it now actually does the transformation. (However, if you remove the \hskip from \transform, you’ll also need to remove the \unskip in the main macro definition.
Edit:
There is also the seqsplit package which seems to be made for printing DNA data or long numbers. They also bring a few options for nicer output, so maybe that is what you’re looking for…
Debilski's post is definitely a solid way to do it, although the \say is not necessary. Here's a shorter way that makes use of some LaTeX internal shortcuts (\#gobble and \#ifnextchar):
\makeatletter
\def\hyphenatestring#1{\xHyphen#te#1$\unskip}
\def\xHyphen#te{\#ifnextchar${\#gobble}{\sw#p{\hskip 0pt plus 1pt\xHyphen#te}}}
\def\sw#p#1#2{#2#1}
\makeatother
Note the use of \hskip 0pt plus 1pt instead of \discretionary - when I tried your example I ended up with a ragged margin because there's no stretchability. The \hskip adds some stretchable glue in between each character (and the \unskip afterwards cancels the extra one we added). Also note the LaTeX style convention that "end user" macros are all lowercase, while internal macros have an # in them somewhere so that users don't accidentally call them.
If you want to figure out how this works, \#gobble just eats whatever's in front of it (in this case the $, since that branch is only run when a $ is the next char). The main point is that \sw#p is only given one argument in the "else" branch, so it swaps that argument with the next char (that isn't a $). We could just as well have written \def\hyphenate#next#1{#1\hskip...\xHyphen#te} and put that with no args in the "else" branch, but (in my opinion) \sw#p is more general (and I'm surprised it's not in standard LaTeX already).
There is a contrib package on CTAN that deals with typesetting DNA sequences. It does a little more than just line-breaking, for example, it also supports colouring. I'm not sure if it is possible to get the output you are after though, and I have no experience in the DNA-sequence-typesetting area, but is one long string the most readable representation?
Assuming your string is the same, in your preamble, use the \newcommand{}{}. Like this:
\newcommand{\myDNA}{blah blah blah}
if that doesn't satisfy your requirements, I suggest:
2. Break the strings down to the smallest portion, then use the \newcommand and then call the new commands in sequence: \myDNA1 \myDNA2.
If that still doesn't work, you might want to look at writing a perl script to satisfy your string replacement needs.

Resources