How to use grep to find Em dashes with a character right before or after and no spaces? - grep

I am trying to clean up a bunch of writing that has Em Dashes, and want them formatted as "word - word". But in the text there are a lot of occurrences where it is "word-word". What is a good way in grep to do a search-and-replace that will identify cases where there are no spaces before and after the Em Dash and insert them?
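One possible approach, sketched in Python rather than in any particular editor's GREP dialect: use a lookbehind and a lookahead so that only the dash itself is matched when it has a non-space character on both sides. The names EM_DASH, TIGHT_EM_DASH and space_em_dashes are made up for this illustration, and the sketch assumes the character in question is U+2014.

```python
import re

EM_DASH = "\u2014"  # assumption: the dashes in the text are U+2014

# Match an em dash only when the characters on both sides are non-spaces.
# The lookarounds check the neighbours without including them in the match.
TIGHT_EM_DASH = re.compile(rf"(?<=\S){EM_DASH}(?=\S)")

def space_em_dashes(text: str) -> str:
    """Insert one space on each side of every unspaced em dash."""
    return TIGHT_EM_DASH.sub(f" {EM_DASH} ", text)
```

Dashes that already have a space on either side are left alone, so this only covers the "no spaces before and after" case described in the question.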

Related

How do I use GREP to find and replace all hyphens between numbers in an InDesign document?

I am working on a multi-language book, and I need to find and replace all hyphens with en-dashes that occur between numbers in the citations section. I need to avoid all hyphens that exist between roman letters.
If I use a GREP [0-9]-[0-9] it selects the numbers before and after the hyphen and I have to manually select the hyphen and replace it with an en-dash. This is labor intensive.
Is there a way for me to find the hyphen that exists between the numbers, but EXCLUDE the numbers themselves from being highlighted? That way I could run a single Find and Replace instead of making what will probably be 1000+ manual changes.
I tried using GREP [0-9]-[0-9] to find the hyphens, but then couldn't find a way to have the find and replace keep the existing numbers.
That's what lookaheads and lookbehinds are for:
(?<=[0-9])-(?=[0-9])
If you have selected GREP, another way could be to use 2 capture groups and use those 2 groups in the replacement with an en-dash in between.
([0-9])-([0-9])
Replace with $1–$2
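The two approaches above can be compared in a short Python sketch. Python's re module writes backreferences as \1 and \2 rather than $1 and $2, and the function names here are invented for the illustration.

```python
import re

EN_DASH = "\u2013"

def with_lookarounds(text: str) -> str:
    # Only the hyphen is matched; the digits are checked but not consumed.
    return re.sub(r"(?<=[0-9])-(?=[0-9])", EN_DASH, text)

def with_capture_groups(text: str) -> str:
    # The digits are matched too, so they are put back via \1 and \2.
    return re.sub(r"([0-9])-([0-9])", rf"\1{EN_DASH}\2", text)

print(with_lookarounds("pp. 12-15, well-known"))
print(with_capture_groups("pp. 12-15, well-known"))
```

Both leave the hyphen in "well-known" untouched. They differ on chains such as "1-2-3": the capture-group version consumes the "2" as part of the first match, so the second hyphen is not replaced in a single pass, while the lookaround version replaces both.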

Slug - How to deal with Unicode or non-ASCII characters

I am learning slug conventions, but so far the material mostly deals with English text: "Hello world" would be converted to "hello-world". There is very little information on non-ASCII characters.
I saw that some Latin-looking letters can be converted to plain Latin letters, e.g. ă ă ą ä would be converted to a, but for something like こんにちは, how should I deal with it?
Some websites, like the Japanese Wikipedia, just keep the Japanese text as it is in the URL; others transliterate it, like /konichiwa. Practice tends to be all over the place.
On top of that, there are symbols, like ¿ ¡ 、 (Japanese comma) 。 (Japanese period).
So what is the convention here: should these characters be kept as is, transliterated, or avoided altogether? Should it be all lowercase? Should we keep the symbols? If not, how do we separate symbols from non-Latin words, as in Chinese or Arabic text?
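As an illustration of one of the options raised above (not an established convention), here is a minimal Python sketch that folds accented Latin letters to ASCII, keeps non-Latin letters such as Japanese as they are, lowercases the result, and turns every run of symbols or whitespace into a single hyphen. The function names fold_char and slugify are hypothetical, and the policy itself is an assumption chosen for the example.

```python
import re
import unicodedata

def fold_char(ch: str) -> str:
    """Return an ASCII equivalent of ch if stripping accents yields one."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep the original character when no pure-ASCII form exists
    # (for example kana or kanji), so non-Latin text is not damaged.
    return stripped if stripped and stripped.isascii() else ch

def slugify(text: str) -> str:
    folded = "".join(fold_char(ch) for ch in text).lower()
    # \W covers punctuation and spaces in any script; "_" is added explicitly.
    return re.sub(r"[\W_]+", "-", folded).strip("-")

print(slugify("Hello world"))
print(slugify("こんにちは、世界。"))
```

Folding character by character, rather than stripping all combining marks from the whole string, avoids altering scripts where the marks change the letter. Transliteration (for example to romaji) would need a separate library and is not attempted here.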

GREP to find long quotes in text

I'm trying to find long quotes in the text that I'm editing so that I can apply a different style to them. I've tried this GREP:
~[.{230}(?!.~])
What I need is for the GREP to find any 230 characters preceded by a left quote mark, not including any 230-character sequence including a character followed by a right quote mark. This should then eliminate quotes of less than 230 characters from the search. My GREP finds the correct length sequence but doesn't exclude those sequences which include a right quote mark.
So I want to find this, which my GREP does:
But not this, which my GREP also finds:
Because it has a closing quote in it and is therefore what I'm classing as a short quote.
Any ideas? TIA
It took me a while to figure out how to express this in a way that would suit my purposes. Wiktor Stribiżew came up with the code:
‘[^‘]{260,}[.,?!]’
It finds an opening quote followed by 260 or more characters that are not opening quotes (to preclude runs of multiple short quotes), ending with a full point, comma, question mark, or exclamation mark and then a closing quote. 260 characters is about five lines in my text, which is the point at which a long quote should be formatted as a broken-off quote. I've included the punctuation marks before the closing quote, rather than matching just a closing quote, because otherwise a possessive apostrophe would be seen as the end of the quote.
All thanks to Wiktor Stribiżew for the code!
Edit: Neil is correct that this code won't find multi-paragraph long quotes. But I can run:
‘[^’]{150,}~b‘
which will find any multiparagraph quotes (doesn't work in the Regex demo but does in InDesign for some reason).
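For anyone testing the first pattern outside InDesign, a small Python sketch of the same idea follows. The function name find_long_quotes and its min_chars parameter are invented for the example; the quote characters are written as escapes (U+2018 and U+2019).

```python
import re

def find_long_quotes(text: str, min_chars: int = 260) -> list[str]:
    """Return quoted passages with at least min_chars characters inside."""
    # Opening quote, then min_chars or more characters that are not an
    # opening quote, then closing punctuation and a closing quote.
    pattern = re.compile(
        "\u2018[^\u2018]{%d,}[.,?!]\u2019" % min_chars
    )
    return pattern.findall(text)

short = "\u2018No.\u2019"
long_quote = "\u2018" + "word " * 60 + "end.\u2019"
sample = "He said " + short + " Then she said " + long_quote + " Done."
print(len(find_long_quotes(sample)))
```

Note that in Python a negated character class also matches line breaks, so behaviour across paragraphs may differ from what is described for InDesign.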

GREP to find capital T not preceded by a full stop and a space in InDesign

I have a document that has lots of capital letter Ts (The A&E Department, The Post Office). I want to find all instances of a capital T when not preceded by a full point and a space so I can change the capital T to a small t.
I tried:
(?<!.~.)[T]
and
(?<!.~.)T
which I thought should find all Ts not preceded by a full point and a space. However, both find all capital Ts; the negative lookbehind seems to be ignored.
I'm fairly new to GREP. I've spent a few hours Googling and tried lots of different variations, but it seems to me that these should work?
Thanks in advance.
Use (?<!\. )T, which will match T only if it is not preceded by a full stop followed by a space.
. is a metacharacter, so it has to be escaped to match it literally.
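The escaped lookbehind can be tried out in Python, whose re module supports the same fixed-width negative lookbehind. This is only a sketch; the function name is made up.

```python
import re

def lowercase_mid_sentence_t(text: str) -> str:
    # (?<!\. ) asserts that the two characters before T are not ". ".
    return re.sub(r"(?<!\. )T", "t", text)

sample = "Visit The Post Office. The A&E Department is near The park."
print(lowercase_mid_sentence_t(sample))
# prints: Visit the Post Office. The A&E Department is near the park.
```

A capital T at the very start of the text is also not preceded by a full stop and a space, so it will be lowercased as well and may need to be handled separately.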

RegEx how to properly use OR pipelines

I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(The backslash sequences represent the Unicode characters U+200B to U+200D and the special characters \0-\7 in my software. They are all individual characters, and these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
P.S. Not sure if it will be any different, but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
It'd be something like
/([^....]|\<b\/\>|\<font color .... \>)/
though the usual caveats about regexes and HTML apply here.
As for the confusion about where to put the |, consider this hackneyed example: you want to find the word color, but also want to accommodate the British spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.
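A short Python sketch of both points: the three spelling patterns accept the same words, and a character class can be combined with whole phrases by putting | between the alternatives, with no quotes needed. The names below are hypothetical, the software-specific \0-\7 characters from the question are left out, and the non-capturing group (?:...) is used in place of the plain group.

```python
import re

SPELLINGS = [r"color|colour", r"colou?r", r"colo(?:r|ur)"]

def matches_all(word: str) -> bool:
    """True if every spelling pattern matches the whole word."""
    return all(re.fullmatch(p, word) for p in SPELLINGS)

# Single characters go in one character class; each phrase is its own
# alternative. re.escape guards any regex metacharacters in the phrases.
INVISIBLE = "[\u200b\u200c\u200d]"
PHRASES = ["<b>", '<font color="#FF0000">']
UNWANTED = re.compile("|".join([INVISIBLE] + [re.escape(p) for p in PHRASES]))

def strip_unwanted(text: str) -> str:
    return UNWANTED.sub("", text)

print(matches_all("color"), matches_all("colour"))
```

The | only needs to sit between alternatives at the same level; characters inside the square brackets are already alternatives of each other.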
