BBEdit/Grep: Find character and replace everything after character - grep

I'm loosing my mind trying to figure this out with Find function in BBEdit.
Example data:
Co-founder & Head of Engineering ; 10 months
Technical Recruiter ; 1 year
Chief Technology Officer ; 3 years
The result should be:
Co-founder & Head of Engineering
Technical Recruiter
Chief Technology Officer
How do I tell BBedit to remove everything in the line after each ";"?
I've tried googling and wondered around in the documentation, but can't seem to figure it out.

Just saw this. There are different ways of doing it. Here is one way that is easy to follow hopefully.
In the Find window choose "Grep", then type the following regex expression:
(.*);(.*)
Here, each pair of round brackets capture different part of the sentence in each line. For example:
(Co-founder & Head of Engineering) ; (10 months)
When you use parentheses in regex match, it holds the value in a capture symbol. The content of thirst () will be stored in "\1" symbol, the content of the second () will be stored in "\2", in the given order. Now that we have \1 and \2, we can put them back with "Replace", and anything not captured or not replaced will be deleted. See the attached picture.
Note: Actually you do not need second parentheses since you are not using what you capture with it. I just put it there in case you want to experiment with it by doing: \1 and \2.
I hope it helps in the future.

Related

NSRegularExpression not matching number sign (#)

I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have run into a few problems already due to the number sign (hash tag).
#
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have a NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched. (i.e. the NSTextCheckingResult's second range has a location of NSNotFound) Note, it works for Bb... it matches the 'b'
I'm wondering what I need to do here. It would seem the documentation doesn't cover this case of '#' which IS in fact sometimes used in Regex patterns (I think related to comments or sth)
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
The \b word boundary is a context dependent construct. It matches in 4 contexts: 1) between start of string and a word char, 2) between a word char and end of string, 3) between word and a non-word and 4) a non-word and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
See the regex demo.
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead failing the match (and causing backtracking into the preceding pattern! To avoid that, add + after ? to prevent backtracking into that pattern) if there is a word char immediately to the right of the current position.
(I'd like to first say #WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. They are made up of a letter between A->G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played) An accidental can be a flat (represented as a ♭ or simply a b), or a sharp (represented as a ♯ or simply a #, as these are easier to type on a keyboard). An accidental serves to make a note a semitone higher (#) or lower (b). As such, a F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. Depending on some factors of the piece of music, that piece won't mix accidental types. It will either be flats throughout the piece or sharps. (Depending on the musical key of the composition, but this is not that relevant here.)
In terms of regex, you have something like ABCDEFG? to determine the note. In reality it's more complicated.
Then, a Musical Chord is comprised of the root note and it's chord type. There are over 50 types of chords. They have a 'text signature' that is unique. Also, a 'major' chord has an empty signature. So in terms of pseudo-regex you have for a Chord:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional is to determine the chord type. The different types were omitted, but can be as simple as a m (for Minor chord), or maj7#5 (for a major 7th chord with an augmented 5th... don't worry about it. Just know there are many string constants to represent a chord type)
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its unicode \u0023, or what ultimately worked was replacing all occurrences of # with another character (such as 'S'), and then convert it back once regex did it's thing. So this code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a Regex Pattern to perfectly parse a chord structure. It wasn't fully working for a Chord with a bass note, but it would successfully match a Chord with a bass note, then I had to split those 2 components and parse them separately, then recombine them
Regex is really a bit of voodoo, and I think it sucks that for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to Regex patterns he wrote to help me solve the problem on www.regex101.com, that would WORK on that website, but these would not work on iOS, and NSRegularExpression would throw an error (often it had something to do with this # character)
My solution pays absolutely no regard to performance. It just wanted it to work.

ASCII Representation of Hexadecimal

I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
local convString = ""
for char in input:gmatch("(..)") do
convString = convString..(string.char("0x"..char))
end
return convString
end
It appeared to work, but didnt think about characters above 127. Rookie mistake. Now I'm unsure how I can get the additional characters up to 256 display their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
print(input)
end
I did a few gsub strings to substitute in other characters and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it got all forgotten.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
return (input:gsub( "..", function(c)
return string.char( tonumber( c, 16 ) )
end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.

How do I get auto-conversion of Part [[ double-brackets ]] on paste?

A pet peeve of mine is the use of double square brackets for Part rather than the single character \[LeftDoubleBracket] and \[RightDoubleBracket]. I would like to have these automatically replaced when pasting plain-text code (from StackOverflow for example) into a Mathematica Notebook. I have been unable to configure this.
Can it be done with ImportAutoReplacements or another automatic method (preferred), or will I need use a method like the "Paste Tabular Data Palette" referenced here?
Either way, I am not good with string parsing, and I want to learn the best way to handle bracket counting.
Sjoerd gave Defer and Simon gave Ctrl+Shift+N which both cause Mathematica to auto-format code. These are fine options.
I am still interested in a method that is automatic and/or preserves as much of the original code as possible. For example, maintaining prefix f#1, infix 1 ~f~ 2, and postfix 1 // f functions in their original forms.
A subsection of this question was reposted as Matching brackets in a string and received several good answers.
Not really an answer, but a thread on entering the double [[ ]] pair (with the cursor between both pairs) using a single keystroke occurred a couple of weeks ago on the mathgroup. It didn't help me, but for others this was a solution apparently.
EDIT
to make good on my slightly off-topic first response here's a pattern replacement that seems to do the job (although I have difficulties myself to understand why it should be b and not b_; the latter doesn't work):
Defer[f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]] /.
HoldPattern[Part[b, a_]] -> HoldPattern[b\[LeftDoubleBracket]a\[RightDoubleBracket]]
I leave the automation part to you.
EDIT 2
I discovered that if you add the above rule to ImportAutoReplacements and paste your SO code in a notebook in a Defer[] and evaluate this, you end up with a usable form with double brackets which can be used as input somewhere else.
EDIT 3
As remarked by Mr.Wizard invisibly below in the comments, the replacement rule isn't necessary. Defer does it on its own! Scientific progress goes "Boink", to cite Bill Watterson.
EDIT 4
The jury is still out on Defer. It has some peculiar side effects, and doesn't work well on all expressions. try the "Paste Tabular Data Palette" in the toolbag question for instance. Pasting this block of code in Defer and executing gives me this:
It worked much better in another code snippet from the same thread:
The second part is how it looks after turning it in to input by editing the output of the first block (basically, I inserted a couple of returns to restore the format). This turns it into Input. Please notice that all double brackets turned into the correct corresponding symbol, but notice also the changing position of ReleaseHold.
Simon wrote in a comment, but declined to post as an answer, something fairly similar to what I requested, though it is not automatic on paste, and is not in isolation from other formatting.
(One can) select the text and press Ctrl+Shift+N to translate to StandardForm

Sanitize pasted text from MS-Word

Here's my wild and whacky psuedo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a ckeditor. And a lot of folks paste Microsoft Word content in it. No worries, if I just call the attribute untouched it loads pretty. But the catch is that I want it to be just 125 characters abbreviated. When I add truncation to it, then all of the Microsoft Word scripts start popping up. Then I added simple_format, and sanitize, and truncate, and even made my controller start spotting out specific variables that MS would make and gsub them out. But there's too many of them, and it seems like an awfully messy way to accomplish this. Thus so! Realizing that by itself, its clean. I thought, why not just slice it. However, the microsoft word text becomes blank but still holds its numbered position in the string. So I came up with this (probably awful) solution below.
It's in three steps.
When the text parses, it doesn't display any of the MSWord junk. But that text still holds a number position in a slice statement. So I want to use a regexp to find the first actual character.
Take that character and find out what its numbered position is in the total string.
Use a slice statement to cut it from.
def about_us_truncated
x = self.about_us.find.first(regExp representing first actual character)
x.charCount = y
self.about_us[y..125]
end
The only other idea i got, is a regex statement that allows it to explicitly slice only actual characters like so :
about_us([a-zA-Z][0..125]) , but that is definately not how it is written.
Here is some sample text of MS Word junk :
&Lt;! [If Gte Mso 9]>&Lt;Xml>&Lt;Br /> &Lt;O:Office Document Settings>&Lt;Br /> &Lt;O:Allow Png/>&Lt;Br /> &Lt;/O:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this
http://gist.github.com/139987
it looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best possible one you can find.
In order to prevent MS Word, you should be using CK Editor's built-in MS word sanitizer. This is because writing regex for it can be very complicated and you can very easily break tags in half and destroy your site with it.
What I did as a workaround, is I did a force paste as plain text in the CK Editor.

Latex multicols. Can I group content so it won't split over cols and/or suggest colbreaks?

I'm trying to learn LaTeX. I've been googling this one for a couple days, but I don't speak enough LaTeX to be able to search for it effectively and what documentation I have found is either too simple or goes way over my head (http://www.uoregon.edu/~dspivak/files/multicol.pdf)
I have a document using the multicol package. (I'm actually using multicols* so that the first col fills before the second begins instead of trying to balance them, but I don't think that's relevant here.) The columns output nicely, but I want to be able to indicate that some content won't be broken up into different columns.
For instance,
aaaaaaaa bbbbbbb
aaaaaaaa bbbbbbb
aaaaaaaa
ccccccc
bbbbbbbb ccccccc
That poor attempt at ascii art columns is what's happening. I'd like to indicate that the b block is a whole unit that shouldn't be broken up into different columns. Since it doesn't fit under the a block, the entirety of the b block should be moved to the second column.
Should b be wrapped in something? Is there a block/float/section/box/minipage/paragraph structure I can use? Something specific to multicol? Alternatively is there a way that I can suggest a columnbreak? I'm thinking of something like \- that suggests a hyphenated line break if its convenient, but this would go between blocks.
Thanks!
Would putting the text inside a minipage not work for this?
\begin{minipage}{\columnwidth}
text etc
\end{minipage}
Forcing a column break is as easy as \columnbreak.
There are some gentler possibilities here.
If you decide to fight LaTeX algorithms to the bitter end, there is also this page on preventing page breaks. You can try the \samepage command, but as the page says, "it turns out to be surprisingly tricky".

Resources