How do I use GREP to find and replace all hyphens between numbers in an InDesign document? - grep

I am working on a multi-language book, and I need to find and replace all hyphens with en-dashes that occur between numbers in the citations section. I need to avoid all hyphens that exist between roman letters.
If I use a GREP [0-9]-[0-9] it selects the numbers before and after the hyphen and I have to manually select the hyphen and replace it with an en-dash. This is labor intensive.
Is there a way for me to find the hyphen that exists between the numbers, but EXCLUDE the numbers themselves from being highlighted? That way I could run a single Find and Replace instead of making what will probably be 1000+ manual changes.
I tried using GREP [0-9]-[0-9] to find the hyphens, but then couldn't find a way to have the find and replace keep the existing numbers.

That's what lookaheads and lookbehinds are for:
(?<=[0-9])-(?=[0-9])

If you have selected GREP, another way could be to use 2 capture groups and use those 2 groups in the replacement with an en-dash in between.
([0-9])-([0-9])
Replace with $1–$2
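To compare the two approaches outside InDesign, here is a minimal sketch using Python's re module rather than InDesign's GREP engine. The function names are made up for illustration. It shows one difference that I can only vouch for in Python: in a chain such as 1-2-3 the capture-group version consumes the middle digit, so the second hyphen is skipped, while the lookaround version replaces both.

```python
import re

EN_DASH = "\u2013"

def replace_with_lookarounds(text):
    # Only the hyphen is matched; the digits on either side are merely checked.
    return re.sub(r"(?<=[0-9])-(?=[0-9])", EN_DASH, text)

def replace_with_capture_groups(text):
    # The digits are part of the match and are put back via \1 and \2.
    return re.sub(r"([0-9])-([0-9])", r"\1" + EN_DASH + r"\2", text)

citation = "pp. 12-15, a well-known source"
print(replace_with_lookarounds(citation))
print(replace_with_capture_groups(citation))

# Digit chains: the capture-group match "1-2" consumes the 2,
# so the hyphen before the 3 has no digit left to match against.
print(replace_with_lookarounds("1-2-3"))
print(replace_with_capture_groups("1-2-3"))
```

Hyphens between letters ("well-known") are left alone by both versions.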

Related

Why are hyphens screwing up my duplicate finding conditional formatting?

I have a sheet I use as a database of scientific papers. I copy journal article titles from different sources (some could be from an email, others are links on a web page, or just the title from the article page). I have conditional formatting set to let me know if I'm adding a title that is already in the list. I've noticed that there are some titles that are "ignoring" the conditional formatting, and it looks like there are hyphens in all of the offenders. If I remove the hyphens, the conditional formatting works. So there is some 'difference' in the hyphens originating from the same title that is preventing the conditional formatting from viewing them as identical.
Shared sheet
Examples of offending titles:
End-to-end continuous bioprocessing: impact on facility design, cost of goods and cost of development for monoclonal antibodies
End‐to‐end continuous bioprocessing: impact on facility design, cost of goods and cost of development for monoclonal antibodies
End‐to‐end continuous bioprocessing: Impact on facility design, cost of goods, and cost of development for monoclonal antibodies
What is this difference, and is there a way to fix it? Do I need to write a script to find/replace the hyphens to get this to work?
TIA
Just because characters appear identical, it does not mean that they are identical. You have fallen foul of the similarity between the hyphen and dashes. Visually, they are almost identical - dashes are slightly wider than the hyphen.
Dashes are regarded as "special characters" (i.e. they aren't keys on the keyboard) but they are used widely in HTML. So if, for instance, you copied an item from a website then you might unwittingly have copied dashes rather than hyphens.
You can identify the exact nature of a character by using the CODE function.
You ask "What is this difference, and is there a way to fix it? Do I need to write a script to find/replace the hyphens to get this to work?"
WHAT IS THIS DIFFERENCE?
It's important to recognise that though these examples appear identical, there are other differences that are more than just hyphens vs dashes.
Example#1 - Hyphen - CODE returns "45"
Example#2 - Dash - CODE returns "8208"
Example#3 - Dash - CODE returns "8208".
But other factors also prevent Example#3 from triggering the conditional formatting rule:
Length = 128 (vs 127 for the other examples). There is an additional comma (after "cost of goods")
the word "Impact" is spelled with an upper case "I" (lower case for the other examples)
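The same inspection can be sketched in Python, assuming the three titles are exactly as quoted in the question. Code 45 is U+002D (HYPHEN-MINUS, the keyboard character) and 8208 is U+2010, which Unicode itself names HYPHEN. The helper name describe is made up for illustration.

```python
# The three example titles; U+2010 is written as an escape so the difference is visible.
EXAMPLE_1 = ("End-to-end continuous bioprocessing: impact on facility design, "
             "cost of goods and cost of development for monoclonal antibodies")
EXAMPLE_2 = EXAMPLE_1.replace("-", "\u2010")
EXAMPLE_3 = ("End\u2010to\u2010end continuous bioprocessing: Impact on facility design, "
             "cost of goods, and cost of development for monoclonal antibodies")

def describe(title):
    """Return (length, code point of the 4th character), similar to LEN and CODE."""
    return len(title), ord(title[3])

for title in (EXAMPLE_1, EXAMPLE_2, EXAMPLE_3):
    print(describe(title))
# (127, 45)
# (127, 8208)
# (128, 8208)
```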
MOVING FORWARD
Do you need a script? No (IMHO)
Is there a way to fix it? As outlined above, there are more differences than just hyphens and dashes. And, as time goes by, the number & type of differences might increase. However, there is a solution to the "Hyphen Vs Dash" problem which is the focus of this question.
FORMULA AND FORMATTING
Your data is currently in Column A and Column A is also subject to conditional formatting.
Remove the conditional formatting rules from Column A
Insert this formula in cell B2
=arrayformula(if(LEN($A2:A)-LEN(SUBSTITUTE($A2:A, char(8208), ""))=0,A2:A,arrayformula(substitute(A2:A,char(8208),char(45)))))
Conditional Formatting for Column B
Select the range in Column B
Select Format, then Conditional Formatting.
Select "Custom Formula is" and enter this formula: =countif($B$2:$B2,B2)>1
Select a preferred Formatting Style and then click Done.
FORMULA LOGIC
arrayformula enables the formula to automatically populate all the relevant cells in the column.
LEN($A2:A)-LEN(SUBSTITUTE($A2:A, char(8208), ""))=0
a test for dashes in a string. It substitutes an empty string for any/all instances of a dash (char(8208)), then compares the original length to the adjusted length. If the difference is zero, then there are no dashes in the string.
IF: Test for any dashes,
if the string doesn't contain any dashes then use that value
else, the string must contain dashes, so replace any dashes with hyphens, and use the substituted value
arrayformula(substitute(A2:A,char(8208),char(45)))
The conditional formatting rule then looks for duplicate values in the column, and formats any/all duplicate values.
You'll note that Example#3 is not flagged as a duplicate despite containing dashes. This is because of the capitalisation of "Impact" and the extra comma after "cost of goods".
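For anyone who would rather check the logic outside the sheet, here is a rough Python equivalent of the helper column plus the duplicate rule. It is a sketch, not what the sheet does internally: the function names are invented, only char(8208) is normalised, and the comparison is an exact string match. Like =countif($B$2:$B2,B2)>1, it flags the second and later occurrences only.

```python
def to_ascii_hyphens(title):
    # Same idea as SUBSTITUTE(A2, CHAR(8208), CHAR(45)).
    return title.replace("\u2010", "-")

def flag_duplicates(titles):
    """Return one boolean per title: True if an earlier title normalises to the same text."""
    seen = set()
    flags = []
    for title in titles:
        key = to_ascii_hyphens(title)
        flags.append(key in seen)
        seen.add(key)
    return flags

titles = [
    "End-to-end bioprocessing: impact on design",            # keyboard hyphens
    "End\u2010to\u2010end bioprocessing: impact on design",  # U+2010 characters
    "End\u2010to\u2010end bioprocessing: Impact on design,", # capital I, extra comma
]
print(flag_duplicates(titles))  # [False, True, False]
```

The shortened titles are stand-ins for the real ones; the third stays unflagged for the same reasons as Example#3.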
Sample

Regex that finds a line with exactly 3 words in it

I have a problem that requires me to write a regex that finds a line that contains exactly 3 groups of characters (they could be words or numbers) and that ends with another specific word. The way I had in mind was to find a pattern that ended in a space, and look for it 3 times. Assuming this is the correct way to go about it, I do not know how to find a space, but I thought it would look like .*"find a space"{3} endword$. Is this the way it would be done? Even if it is not the way to do it, how do you find a space? Any suggestions?
Assuming by three groups of words you would accept any non-space character, you could write:
/^\s*(?:\S+\s+){3}endword$/
The initial caret anchors the match to the start of the line, which makes sure you have exactly 3 non-space groups before endword.
Of course you need to consider whether things like control characters could appear, and adjust accordingly.
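A quick way to convince yourself is to run the pattern against a few lines. This sketch uses Python's re module, and the sample lines and function name are made up:

```python
import re

# ^ anchors at the start of the line, (?:\S+\s+){3} is three groups each followed by whitespace.
THREE_THEN_ENDWORD = re.compile(r"^\s*(?:\S+\s+){3}endword$")

def has_three_groups(line):
    return THREE_THEN_ENDWORD.search(line) is not None

print(has_three_groups("alpha 42 gamma endword"))        # True
print(has_three_groups("alpha 42 endword"))              # False, only two groups
print(has_three_groups("alpha 42 gamma delta endword"))  # False, four groups
```

Here \s matches the space (and tabs), which answers the "how do you find a space" part.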
Depending on your flavor, something like the below would do it:
\b+.+?\b+.+?\b+.+?\bendword$
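This makes use of the word boundary mark (\b) and non-greedy repetition (.+?), so it may behave slightly differently in your specific implementation, especially if you're using something old like grep. Be aware that some engines reject a quantifier applied to \b, and that because .+? can also match spaces and the pattern is not anchored with ^, it will also match lines with more than three words before endword.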

RegEx how to properly use OR pipelines

I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
P.S. Not sure if it will be any different, but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
It'd be something like
/([^....]|<b>|<font color .... >)/
though, the usual caveats about regexes and html apply here.
As for the confusion about where to put the |, consider this hackneyed example: You want to find the word color, but also want to accommodate the British spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.
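To make the placement of | concrete, here is a small sketch in Python (not Objective-C, so the escapes differ). It assumes the special characters \0-\7 are the control characters U+0000 to U+0007, and the sample string is invented. The single characters stay inside one [...] class with no pipes between them; the pipes go only between the class and each whole phrase, and the phrases need no quotes.

```python
import re

UNWANTED = re.compile(
    r'[\u200b\u200c\u200d\x00-\x07]'  # any ONE of these single characters
    r'|<b>'                            # OR the literal phrase <b>
    r'|<font color="#FF0000">'         # OR this literal phrase
)

def strip_unwanted(text):
    return UNWANTED.sub("", text)

sample = 'a\u200bb<b>c<font color="#FF0000">d\x01e'
print(strip_unwanted(sample))  # abcde

# The three spellings of the color/colour pattern accept the same two words.
for pattern in (r"color|colour", r"colou?r", r"colo(?:r|ur)"):
    print(bool(re.fullmatch(pattern, "color")), bool(re.fullmatch(pattern, "colour")))
```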

is it ever appropriate to localize a single ascii character

When would it be appropriate to localize a single ASCII character?
For instance /, or | ?
Is it ever necessary to add these "strings" to the localization effort?
Just want to give some people the benefit of the doubt and make sure there's not something I didn't think of.
Generally it wouldn't be appropriate to use something like that except as a graphic element (which of course wouldn't be I18N'd in the first place, much less L10N'd). If you are trying to use it to e.g. indicate a ratio then you should have something like "%d / %d" instead, and localize the whole thing.
Yes, there are cases where these individual characters change in localization. This is not a comprehensive list, just examples I happen to know.
Not every locale uses , to separate thousands and . for the decimal. (However, these will usually be handled by your number formatter. If you do so yourself, you're probably doing it wrong. See this MSDN blog post by Michael Kaplan, Number format and currency format are not always the same.)
Not every language uses the same quotation marks (“, ”, ‘ and ’). See Wikipedia on Non-English Uses of Quotation Marks. (Many of these are only easy to replace if you use full quote marks. If you use the " and ' on your keyboard to mark both the start and end of quotations, you won't know which of two symbols to substitute.)
In Spanish, a question or exclamation is preceded by an inverted ? or !. ¿Question? ¡Exclamation! (Obviously, you can't fix this with a locale substitution for a single character. Any questions or exclamations in your application should be entire strings anyway, unless you're writing some stunningly intelligent natural language generator.)
If you do find a circumstance where you need to localize these symbols, be extra cautious not to accidentally localize a symbol like / used as a file separator, " to denote a string literal or ? for a search wildcard.
However, this has already happened with CSV files. These may be separated by ,, or may be separated by the local list separator. See What would happen if you defined your system's CSV delimiter as being a quotation mark?
In Greek, questions end with a semicolon rather than ?, so essentially the ? is replaced with ; ... however, you should aim to always translate the question as a complete string including question mark anyway.
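The common thread in these answers is to translate the whole string rather than the punctuation mark. A minimal sketch of that idea, with a hypothetical in-memory catalog (the keys, the translations and the translate helper are invented for illustration, not taken from any real localization library):

```python
# Hypothetical message catalog: the complete pattern is the unit of translation,
# so each language can place or change its punctuation freely.
CATALOG = {
    "en": {"ratio": "%d / %d", "confirm_delete": "Delete %s?"},
    "es": {"ratio": "%d / %d", "confirm_delete": "¿Eliminar %s?"},
    "el": {"ratio": "%d / %d", "confirm_delete": "Διαγραφή %s;"},
}

def translate(locale, key, *args):
    """Look up the full pattern for the locale and fill in the arguments."""
    return CATALOG[locale][key] % args

print(translate("en", "ratio", 3, 4))
print(translate("es", "confirm_delete", "foto.png"))
print(translate("el", "confirm_delete", "foto.png"))
```

Nothing here swaps a lone ? for another character; the Spanish and Greek entries simply carry their own punctuation.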

What's the best character to represent blank spaces in a URL?

When you are building URLs that should be legible for users and search engines and you do it automatically from the content, what's the best way to represent blank spaces? Hyphens (this is what StackOverflow uses)? Underscores? Any other? Do any of those make a difference for SEO?
Both are valid URL characters and both have their pros and cons.
Pro dash
Google recommends dashes, and here is what Matt Cutts from Google has to say about dashes vs. underscores:
"If you have a url like word1-word2, that page can be returned for the searches word1, word2, and even “word1 word2”. That’s why I would always choose dashes instead of underscores."
Dashes seem to be what major blogs do: The Huffington Post, TechCrunch, Engadget, ...
Dashes seem to be what major CMS do.
Not sure about that one anymore, can anyone comment?
As mentioned by Kazar, underscores can clash with the underlining of links.
I find underscores awkward to type.
Rene Saarsoo pointed out that dashes take less space than underscores in proportional fonts.
Ionut G. Stan mentioned that underscores are not allowed in hostnames. If you strive for consistency you should opt for dashes.
Pro underscore
Dashes are not allowed in ISO9660 file systems. This can be a problem if your content is also shipped on DVD or CD (e.g. help files or eLearning content).
In some languages (e.g. German) dashes can be word characters and are not generally considered word separators.
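Since the question is about building these URLs automatically, here is one possible sketch of a slug function with the separator as a parameter, so either convention can be tried. The function name and the exact cleaning rules (dropping apostrophes and non-ASCII characters) are assumptions for illustration, not how any particular site does it.

```python
import re
import unicodedata

def slugify(title, separator="-"):
    # Decompose accented letters, then drop whatever is still non-ASCII.
    ascii_text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Remove apostrophes so "What's" becomes "whats" rather than "what-s".
    ascii_text = ascii_text.replace("'", "")
    # Every run of letters or digits is a word; everything else is a break.
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return separator.join(words)

title = "What's the best character to represent blank spaces in a URL?"
print(slugify(title))
print(slugify(title, separator="_"))
```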
Another advantage of dashes is that in a proportional font they take less space than underscores. Compare:
https://stackoverflow.com/../whats-the-best-character-to-represent-blank-spaces-in-a-url
https://stackoverflow.com/../whats_the_best_character_to_represent_blank_spaces_in_a_url
It's not a lot, but every little helps :)
Again, personal preference - personally I think hyphens work better than underscores, because underscores can clash with the underlining a tags add (by default), so http://someurl.com/this_is_a_address looks like there are no underscores there. (as this is stack overflow, roll over the link). http://someurl.com/this-is-a-address looks fine.
You know, if you buy a domain name, you're allowed to use hyphens inside that name, but no underscores. This is an additional reason for which I believe hyphens are better than underscores.
I'd say dashes. I used to use underscores for pretty much every such purpose (representing spaces) but nowadays, with all the visual thingies blinking all round, you often find underlining that makes them normally invisible.
This may answer your question. It looks like things changed for Google a few years ago regarding - and _
See this article here:
http://www.blog-tutorials.com/marketing-and-seo/linking/google-oks-underscores-as-word-separators-in-urls-and-more-seo-tips/
I think that depends on your favorite. My favourites are underscores, but I don't see any (dis-)advantages if using hyphens or other valid URL characters instead. And everything looks better than %20 :)
