GREP to find long quotes in text - grep

I'm trying to find long quotes in the text that I'm editing so that I can apply a different style to them. I've tried this GREP:
~[.{230}(?!.~])
What I need is for the GREP to find any 230 characters preceded by a left quote mark, not including any 230-character sequence including a character followed by a right quote mark. This should then eliminate quotes of less than 230 characters from the search. My GREP finds the correct length sequence but doesn't exclude those sequences which include a right quote mark.
So I want to find this, which my GREP does:
But not this, which my GREP also finds:
Because it has a closing quote in it and is therefore what I'm classing as a short quote.
Any ideas? TIA

It took me a while to figure out how to express this in a way that would suit my purposes. Wiktor Stribiżew came up with the code:
‘[^‘]{260,}[.,?!]’
This finds an opening quote followed by 260 or more characters that are not opening quotes (which stops it matching across several short quotes), ending with a full point, comma, question mark, or exclamation mark and then a closing quote. 260 characters is about five lines in my text, which is the point at which a long quote should be formatted as a broken-off quote. I've required a punctuation mark before the closing quote, rather than just a closing quote, because otherwise a possessive apostrophe would be taken as the end of the quote.
All thanks to Wiktor Stribiżew for the code!
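For anyone who wants to try the pattern outside InDesign, here is a minimal sketch in Python. The function name find_long_quotes and the min_len parameter are made up for illustration, and Python's re engine is assumed to treat this particular pattern the same way InDesign does.

```python
import re

def find_long_quotes(text, min_len=260):
    """Return quotes with at least min_len characters after the opening quote."""
    # ‘ opens the quote; [^‘] forbids a second opening quote inside it;
    # [.,?!]’ requires punctuation directly before the closing quote.
    pattern = re.compile("‘[^‘]{%d,}[.,?!]’" % min_len)
    return pattern.findall(text)

short_quote = "‘Too short.’"
long_quote = "‘" + "word " * 60 + "end.’"
sample = short_quote + " he said. " + long_quote

print(len(find_long_quotes(sample)))  # 1 (only the long quote is found)
```

The short quote is skipped because fewer than 260 characters follow its opening mark before the next opening mark appears.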
Edit: Neil is correct that this code won't find multi-paragraph long quotes. But I can run:
‘[^’]{150,}~b‘
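which will find any multi-paragraph quotes (it doesn't work in the Regex demo but does in InDesign, presumably because ~b is InDesign's own code for a paragraph return and other regex engines don't recognise it).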
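A rough Python equivalent of the multi-paragraph search, as a sketch only: it assumes paragraphs are separated by "\n", which stands in for InDesign's ~b, and the function name is hypothetical.

```python
import re

def find_multiparagraph_quotes(text, min_len=150):
    """Find an opening quote, min_len+ characters with no closing quote,
    then a paragraph break followed by another opening quote."""
    # "\n" replaces InDesign's ~b here (assumption: newline-separated paragraphs).
    pattern = re.compile("‘[^’]{%d,}\n‘" % min_len)
    return pattern.findall(text)

first_paragraph = "‘" + "a" * 200
sample = first_paragraph + "\n" + "‘and it goes on.’"
print(len(find_multiparagraph_quotes(sample)))  # 1
```

A quote that is closed before the paragraph break is not matched, because [^’] cannot cross the closing mark.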

Related

Does [:space:] in a grep command not include newlines and carriage returns? [duplicate]

I'm currently writing a simple Bash script. The idea is to use grep to find the lines where a certain pattern is found, within some files. The pattern contains 3 capital letters at the start, followed by 6 digits; so the regex is [A-Z]{3}[0-9]{6}.
However, I need to only include the lines where this pattern is not concatenated with other strings, or in other words, if such a pattern is found, it has to be separated from other strings with spaces.
So if the string which matches the pattern is ABC123456 for example, the line something ABC123456 something should be fine, but somethingABC123456something should fail.
I've extended my regex using the [:space:] character class, like so:
[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]
And this seems to work, except for when the string which matches the pattern is the first or last one in the line.
So, the line something ABC123456 something will match correctly;
The line ABC123456 something won't;
And the line something ABC123456 won't as well.
I believe this has something to do with [:space:] not counting new lines and carriage returns as whitespace characters, even though it should from my understanding. Could anyone spot if I'm doing something wrong here?
A common solution to your problem is to normalize the input so that every line starts and ends with a space, which puts a space before and after each word.
sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]'
Your question assumes that the newlines are part of what grep sees, but that is not true (or at least not how grep is commonly implemented). Instead, it reads just the contents of each new line into a memory buffer, and then applies the regular expression to that buffer.
A similar but different solution is to specify beginning of line or space, and correspondingly space or end of line:
grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' file
but this might not be entirely portable.
You might want to postprocess the results to trim any spaces from the extracted strings, too; but I have already had to guess several things about what you are actually trying to accomplish, so I'll stop here.
(Of course, sed can do everything grep can do, and then some, so perhaps switch to sed or Awk entirely rather than build an elaborate normalization pipeline around grep.)
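The "beginning of line or space ... space or end of line" idea can also be sketched in Python, for anyone not tied to grep. The names CODE and standalone_codes are hypothetical, \s is used in place of [[:space:]], and only the first code on each line is reported.

```python
import re

# (?:^|\s) and (?:\s|$) mirror (^|[[:space:]]) and ([[:space:]]|$).
CODE = re.compile(r"(?:^|\s)([A-Z]{3}[0-9]{6})(?:\s|$)")

def standalone_codes(lines):
    """Yield the code from each line that contains one as a separate word."""
    for line in lines:
        # Like grep, match against the line contents without the newline.
        line = line.rstrip("\n")
        match = CODE.search(line)
        if match:
            # group(1) is the code itself, already free of surrounding spaces.
            yield match.group(1)

lines = [
    "something ABC123456 something\n",
    "ABC123456 something\n",
    "something ABC123456\n",
    "somethingABC123456something\n",
]
print(list(standalone_codes(lines)))  # ['ABC123456', 'ABC123456', 'ABC123456']
```

Capturing the code in its own group avoids the trimming step that grep -o would otherwise need.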

GREP to find capital T not preceded by a full stop and a space in InDesign

I have a document that has lots of capital letter Ts (The A&E Department, The Post Office). I want to find all instances of a capital T when not preceded by a full point and a space so I can change the capital T to a small t.
I tried:
(?<!.~.)[T]
and
(?<!.~.)T
which I thought should find all Ts not preceded by a full point and a space. However, both find all capital Ts, the negative lookbehind seems to be ignored.
I'm fairly new to GREP and I've spent a few hours Googling and tried lots of different variations but these seem to me that they should work?
Thanks in advance.
Use (?<!\. )T, which matches T only when it is not preceded by a full stop followed by a space.
An unescaped . is a metacharacter that matches any character, so it has to be escaped as \. to match a literal full stop.
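The difference the escape makes can be checked with a small Python sketch. The function name is hypothetical, and Python's re is assumed to treat this lookbehind the same way InDesign does.

```python
import re

def lowercase_mid_sentence_t(text):
    # (?<!\. ) is a negative lookbehind: T must not follow "full stop + space".
    return re.sub(r"(?<!\. )T", "t", text)

sample = "Go to The Post Office. The end."
print(lowercase_mid_sentence_t(sample))  # Go to the Post Office. The end.

# With the dot unescaped, "any character + space" blocks the match,
# so neither T is changed.
print(re.sub(r"(?<!. )T", "t", sample))  # Go to The Post Office. The end.
```

One caveat: a T at the very start of the text is not preceded by a full stop and a space either, so it would also be lowercased.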

RegEx how to properly use OR pipelines

I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
(p.s., not sure if it will be any different but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
It'd be something like
/([^....]|\<b\/\>|\<font color .... \>)/
though the usual caveats about regexes and HTML apply here.
As for the confusion about where to put the |, consider this hackneyed example: You want to find the word color, but also want to accommodate the British spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.
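Here is a short Python sketch of both points. The names SPELLINGS, accepts, STRIP and strip_markup are made up for illustration, and the software-specific \0-\7 characters from the question are left out.

```python
import re

SPELLINGS = [r"(color|colour)", r"(colou?r)", r"(colo(r|ur))"]

def accepts(pattern, word):
    """True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# Single characters go inside one character class; whole phrases are
# separate alternatives joined by |, with no quotes around them.
STRIP = re.compile("[\u200b\u200c\u200d]+|<b>|<font color=\"#FF0000\">")

def strip_markup(text):
    """Remove zero-width characters and the two listed tags."""
    return STRIP.sub("", text)

print(all(accepts(p, w) for p in SPELLINGS for w in ("color", "colour")))  # True
print(strip_markup('a\u200bb<b>c<font color="#FF0000">d'))  # abcd
```

The | only ever sits between complete alternatives; the characters inside the square brackets are already an implicit "or".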

Erlang a special value double quote

When I run my program, an error happens, and when I look into the log, this appears: {k,3108,"s"},{k,3109,"}. How can a single double quote be a variable's value?
In the text font it is a little hard to see exactly what you actually got in the log but I am guessing it is:
{k,3108,"s"},{k,3109,''}
The first pair of true double quotes makes an Erlang string (which is really a list of integers), while the second is actually a pair of ' characters, which is the quote character for atoms. In this case it is the atom with the empty name, which is allowed. This is what @shk indicated.
But without more information from you it is really hard to give a proper answer.

is it ever appropriate to localize a single ascii character

When would it be appropriate to localize a single ASCII character?
For instance, / or |?
Is it ever necessary to add these "strings" to the localization effort?
I just want to give some people the benefit of the doubt and make sure there's not something I didn't think of.
Generally it wouldn't be appropriate to use something like that except as a graphic element (which of course wouldn't be I18N'd in the first place, much less L10N'd). If you are trying to use it to e.g. indicate a ratio then you should have something like "%d / %d" instead, and localize the whole thing.
Yes, there are cases where these individual characters change in localization. This is not a comprehensive list, just examples I happen to know.
Not every locale uses , to separate thousands and . for the decimal. (However, these will usually be handled by your number formatter. If you do so yourself, you're probably doing it wrong. See this MSDN blog post by Michael Kaplan, Number format and currency format are not always the same.)
Not every language uses the same quotation marks (“, ”, ‘ and ’). See Wikipedia on Non-English Uses of Quotation Marks. (Many of these are only easy to replace if you use full quote marks. If you use the " and ' on your keyboard to mark both the start and end of sentences, you won't know which of two symbols to substitute.)
In Spanish, a question or exclamation is preceded by an inverted ? or !. ¿Question? ¡Exclamation! (Obviously, you can't fix this with a locale substitution for a single character. Any questions or exclamations in your application should be entire strings anyway, unless you're writing some stunningly intelligent natural language generator.)
If you do find a circumstance where you need to localize these symbols, be extra cautious not to accidentally localize a symbol like / used as a file separator, " to denote a string literal or ? for a search wildcard.
However, this has already happened with CSV files. These may be separated by a comma (,), or by the local list separator. See What would happen if you defined your system's CSV delimiter as being a quotation mark?
In Greek, questions end with a semicolon rather than ?, so essentially the ? is replaced with ; ... however, you should aim to always translate the question as a complete string including question mark anyway.
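As a rough illustration of the quotation-mark case, here is a Python sketch. The QUOTE_MARKS table and the quote function are hypothetical; a real application would take these marks from its locale data rather than hard-coding them.

```python
# Hypothetical lookup table, for illustration only.
QUOTE_MARKS = {
    "en": ("“", "”"),
    "de": ("„", "“"),
    "fr": ("«\u00a0", "\u00a0»"),  # French guillemets with non-breaking spaces
}

def quote(text, locale):
    """Wrap text in the quotation marks for the locale (English as fallback)."""
    opening, closing = QUOTE_MARKS.get(locale, QUOTE_MARKS["en"])
    return opening + text + closing

print(quote("Hallo", "de"))  # „Hallo“
```

Because the opening and closing marks are looked up as a pair, the ambiguity of a single keyboard " or ' never arises.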

Resources