Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv) - ruby-on-rails

There is a very similar question already. One of the solutions uses code like this one:
string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s
Which works wonders, until you notice it also removes spaces, dots, dashes, and who knows what else.
I'm not really sure how the first code works, but could it be made to strip only accents? Or at the very least be given a list of chars to preserve? My knowledge of regexps is small, but I tried (to no avail):
/[^\-x00-\x7F]/n # So it would leave the dash alone
I'm about to do something like this:
string.mb_chars.normalize(:kd).gsub('-', '__DASH__').gsub
(/[^x00-\x7F]/n, '').gsub('__DASH__', '-').to_s
Atrocious? Yes...
I've also tried:
iconv = Iconv.new('UTF-8', 'US-ASCII//TRANSLIT') # Also tried ISO-8859-1
iconv.iconv 'Café' # Throws an error: Iconv::IllegalSequence: "é"
Help please?

it also removes spaces, dots, dashes, and who knows what else.
It shouldn't.
string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s
You've mistyped, there should be a backslash before the x00, to refer to the NUL character.
/[^\-x00-\x7F]/n # So it would leave the dash alone
You've put the ‘-’ between the ‘\’ and the ‘x’, which will break the reference to the null character, and thus break the range.

I'd use the transliterate method. See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate

It's not as neat as Iconv, but does what I think you want:
http://snippets.dzone.com/posts/show/2384

Related

Don't match dot in beginning of string

I have one path in form of string like this Folder1/File.png
But in this string sometimes if file is hidden or folder is hidden I don't want it to be matched by my regex.
regex = %r{([a-zA-Z0-9_ -]*)\/[^.]+$}
input_path = "Folder_1/.file" # This shouldn't be matched.
input_path = "Folder/file.png" # This should be matched.
But my regex works for first input but its not even matching second one.
You are currently looking for \/[^.]+$, that is a / followed by any character except . until the end. Since the filename+extension format has a . character, it fails to match the second case.
Instead of using [^.]+$, check only that the character following / is not ., and match everything after that:
([a-zA-Z0-9_ -]*)\/[^.].*$
While there are some suggestions here that work, my suggestion would be
\/[^.][^\/\n]+$
It finds a slash, followed by anything but a dot, which in turn is followed by one, or more, of anything but a slash or a newline.
To handle the two lines given as an example,
Folder_1/.file
Folder/file.png
it takes 8 steps.
The suggested ones all work, but ([a-zA-Z0-9_ -]*)\/[^.] takes 75 steps, ([a-zA-Z0-9_ -]*)\/[^.]+\.[^.]+\z 78 steps and ([a-zA-Z0-9_ -]*)\/[^.].*$ takes 77 steps.
This may be totally irrelevant and I may have missed some angle, but I wanted to mention it ;)
Se it here at regex101.
regex = %r{([a-zA-Z0-9_ -]*)\/[^.]}

Grep with BBEdit

I do have one large text file with lot of the following patterns;
because of,this
or that,has
or,not
Of course I want to change the following
because of, this
or that, has
or, not
To make myself clear: i would like to insert a space after each ,
How can i do that with BBEdit Find/Replace/Grep?
Find works ok with
[\,](\w)
but i can't figure out the coresponding part for replace.
Find: (\,)(\w)
Replace: \1 \2
Note: You need to hit the spacebar between \1 and \2. Works on my computer with BBedit v13
I complicates matters.
Pattern like Letter,Letter can also be found with ,\b.
\b is at the beginning or end of a word. \b in a regular expression means "word boundary".
The replacement is then done with ,_
Nota Bene: _ is a "Space" after ,
you could just replace all commas with a comma-space and then replace all space-space with space

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try add commas to the expression, everything is getting stripped. I got the aboves from somewhere else and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means not when inside the character class brackets [] which you are using, and then you can just add the comma to the list. The decimal needs to be escaped with a backslash because in regular expressions they are a special character that means "match anything".
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub!(/[^0-9,\.]/, '')
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub!(/[^0-9\.\,]/g,'')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

What does these two regex match?

I can't figure out what does this regex match:
A: "\\/\\/c\\/(\\d*)"
B: "\\/\\/(\\d*)"
I suppose they are matching some kind of number sequence since \d matches any digit but I'd like to know an example of a string that would be a match for this regex.
The pattern syntax is that specified by ICU. Expressions are created with NSRegularExpression in an iOS app and are correct.
The first matches //c/ + 0 or more digits. The second matches // + 0 or more digits. In both the digits are captured.
An example of a match for A) is //c/123
An example of a match for B) is //12345
When I use Cygwin which emulates Bash on Windows, I sometimes run into situations where I have to escape my escape characters which is what I think is making this expression look so weird. For instance, when I use sed to look for a single '\' I sometimes have to write it as '\\\\'. (Funny, StackOverflow proved my point. If you write 4 backslashes in the comment, it only shows two. So if you process it again, they might all disappear depending on your situation).
Considering this, it might be helpful to think of pairs of backslashes as representing only one if you're coming from a similar situation. My guess would be you are. Because of this I would say Erik Duymelinck is probably spot on. This will capture a sequence of digits that may or may not follow a couple slashes and a c:
//c/000
//00000
This regex matches an odd sequence of characters, which, at first glance, almost seem like a regex, since \d is a digit, and followed by an asterisk (\d*) would mean zero-or-more digits. But it's not a digit, because the escape-slash is escaped.
\\/\\/c\\/(\\d*)
So, for instance, this one matches the following text:
\/\/c\/\
\/\/c\/\d
\/\/c\/\dd
\/\/c\/\ddd
\/\/c\/\dddd
\/\/c\/\ddddd
\/\/c\/\dddddd
...
This one is almost the same
\\/\\/(\\d*)
except you just delete the c\/ from the above results:
\/\/\
\/\/\d
\/\/\dd
\/\/\ddd
\/\/\dddd
\/\/\ddddd
\/\/\dddddd
...
In both cases, the final \ and optional d is [capture group][1] one.
My first impression was that these regexes were intended for escaping in Java strings, meaning they would be completely invalid. If the were escaped for Java strings, such as
Pattern p = Pattern.compile("\\/\\/c\\/(\\d*)");
It would be invalid, because after un-escaping, it would result in this invalid regex:
\/\/c\/(\d*)
The single escape-slashes (\) are invalid. But the \d is valid, as it would mean any digit.
But again, I don't think they're invalid, and they're not escaped for a Java string. They're just odd.

.gsub erroring with non-regular character 194

I've seen this posted a couple of times but none of the solutions seem to work for me so far...
I'm trying to remove a spurious  character from a string...
e.g.
"myÂstring here Â$100"
..but it should be my string here $100
I've tried:
string.gsub(/\194/,'')
string.gsub(194.chr,'')
string.delete 194.chr
All of these still leave the  intact..
Any thoughts?
By default, Rails supports UTF-8.
You can use your favorite editor to write a gsub call using the proper character you want to replace, as in:
"myÂstring here Â$100".gsub(/Â/,"")
If this does not work as well, you might be having an encoding error somewhere on your stack, probably on your HTML document. Try running rails console, extract somehow that string (if it comes from the Model, try to perform a find on the containing class) and run the gsub. It won't solve your problem, but you'll get a clue to where exactly the problem may lie.
Looks like a character encoding problem to me. For every Unicode code point in the range U+0080..U+00BF inclusive, the UTF-8 encoding is a two-byte sequence, 0xC2 (194 decimal) and the numeric value the code point. For example, a non-breaking space--U+00A0--becomes 0xC2 0xA0. Was there another extra character in there, that you already removed?
At any rate, gsub(/\194/,'') is wrong. \nnn is supposed to be an octal escape, but the number is in its decimal form. 194 in octal is \302.
"myÂstring here Â$100".gsub("Â","") # "mystring here $100"
Is that what you meant?

Resources