.gsub erroring with non-regular character 194

I've seen this posted a couple of times but none of the solutions seem to work for me so far...
I'm trying to remove a spurious Â character from a string...
e.g.
"myÂstring here Â$100"
..but it should be my string here $100
I've tried:
string.gsub(/\194/,'')
string.gsub(194.chr,'')
string.delete 194.chr
All of these still leave the Â intact..
Any thoughts?

By default, Rails supports UTF-8.
You can use your favorite editor to write a gsub call using the proper character you want to replace, as in:
"myÂstring here Â$100".gsub(/Â/,"")
If this does not work either, you may have an encoding error somewhere in your stack, probably in your HTML document. Try opening the Rails console, fetching that string (if it comes from a model, run a find on the containing class), and running the gsub there. That won't solve your problem, but it will give you a clue as to where exactly the problem lies.

Looks like a character encoding problem to me. For every Unicode code point in the range U+0080..U+00BF inclusive, the UTF-8 encoding is a two-byte sequence: 0xC2 (194 decimal) followed by a byte equal to the code point's numeric value. For example, a non-breaking space (U+00A0) becomes 0xC2 0xA0. Was there another extra character in there that you already removed?
At any rate, gsub(/\194/,'') is wrong. \nnn is supposed to be an octal escape, but you've written the number in decimal. 194 decimal is 302 in octal, so the escape would be \302.
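A minimal sketch of that diagnosis, assuming the stray "Â" really is the 0xC2 lead byte of a UTF-8 non-breaking space that some layer displayed as Latin-1 (the sample string is made up to match that assumption, not taken from the asker's data):

```ruby
# Hypothetical input: two non-breaking spaces (U+00A0) where plain
# spaces were intended.
raw = "my\u00A0string here\u00A0$100"

# Inspect the raw bytes first: 194 and 160 are 0xC2 and 0xA0.
first_bytes = raw.bytes.first(4)

# The octal escape for byte 194 is \302, not \194.
lead_byte = "\302".bytes

# Replace the whole character rather than one of its two bytes.
cleaned = raw.gsub("\u00A0", " ")

p first_bytes   # [109, 121, 194, 160]
p lead_byte     # [194]
puts cleaned    # my string here $100
```

If the bytes turn out to be something else, the same `bytes` inspection should show which character actually needs replacing.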

"myÂstring here Â$100".gsub("Â","") # "mystring here $100"
Is that what you meant?

Related

Character Encoding not resolved

I have a text file with unknown character formatting, below is a snapshot
\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134
Anyone has an idea how can I convert it to normal text?

This is apparently how Lua stores strings. Each \nnn represents a single byte where nnn is the byte's value in decimal. (A similar notation is commonly used for octal, which threw me off for longer than I would like to admit. I should have noticed that there were digits 8 and 9 in the data!) This particular string is just plain old UTF-8.
$ perl -ple 's/\\(\d{3})/chr($1)/ge' <<<'\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134'
دموع المرأة أقوى نفوذاً من القوانين
You would obviously get a similar result simply by printing the string from Lua, though I'm not familiar enough with the language to tell you how exactly to do that.
Post scriptum: I had to look this up for other reasons, so here's how to execute Lua from the command line.
lua -e 'print("\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134")'
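For comparison, here is a rough Ruby equivalent of the Perl one-liner above. It is only a sketch: it assumes every escape is exactly three decimal digits and that the resulting bytes are UTF-8, and it uses just the first word of the sample.

```ruby
# First word of the sample, kept as literal backslash sequences
# (single quotes stop Ruby from interpreting the escapes itself).
escaped = '\216\175\217\133\217\136\216\185'

# Turn each \ddd into the byte with that decimal value, working on a
# binary copy so the intermediate bytes are not validated as UTF-8.
bytes_str = escaped.b.gsub(/\\(\d{3})/) { $1.to_i.chr }

# Reinterpret the finished byte string as UTF-8.
decoded = bytes_str.force_encoding("UTF-8")

puts decoded                  # دموع
puts decoded.valid_encoding?  # true
```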

Regex for clearly single line words with wildcards in Swift

I'm attempting to construct a regex string in Swift 4 that gets characters at the start of a line where some are known and others aren't.
Let's say I've got a text file with line breaks for each word that reads as follows:
pucker
tuckered
duckerdinger
sucker punch
I'd like to get every word that contains "cker" in it that's 1 to 8 characters long.
I'm attempting to use this statement ^..cker..{1,8} as my RegEx string. All I'm getting is a partial match in Patterns (a Mac App), but Regex101.com's saying no match, and most importantly, Xcode says I'm using an invalid regex. I've also tried ^(..cker..) and a bazillion other variations.
What am I screwing up and how do I fix it? What I'm trying to do seems like it would be super simple, but I've wasted more time than I care to admit fiddling with it.
Update:
This has been the best I've been able to get so far...
"\\b..cker..", but I'm only able to get words that are exactly 8 characters long. I'd like to capture words that contain "cker" that are the 3rd, 4th, 5th, and 6th letters while capturing words up to 8 characters long.

Try this regex:
\b(?=.*cker)[a-zA-Z]{1,8}\b
Explanation:
\b - matches a word boundary
(?=.*cker) - positive lookahead that requires the character sequence cker to appear somewhere ahead on the same line
[a-zA-Z]{1,8} - Matches 1 to 8 occurrences of a letter
\b - matches a word boundary
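The pattern can be checked against the sample word list. The sketch below uses Ruby rather than Swift, on the assumption that both engines treat \b, lookahead, and {m,n} the same way for this input. The second pattern is not from the answer: it is a hypothetical anchored variant for the stricter requirement in the question's update, where cker must be letters 3 to 6.

```ruby
words = "pucker\ntuckered\nduckerdinger\nsucker punch\n"

# Pattern from the answer: any word of 1-8 letters with "cker"
# somewhere ahead on the same line.
answer_pattern = /\b(?=.*cker)[a-zA-Z]{1,8}\b/

# Hypothetical variant: "cker" at positions 3-6 and at most two more
# characters. In Ruby, ^ and $ match at line boundaries by default.
anchored_pattern = /^..cker.{0,2}$/

by_answer = words.scan(answer_pattern)
by_anchor = words.scan(anchored_pattern)

p by_answer   # ["pucker", "tuckered", "sucker"]
p by_anchor   # ["pucker", "tuckered"]
```

Note that the lookahead only checks the rest of the line, so on a line holding several words a short word that precedes one containing cker would also match.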

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try to add commas to the expression, everything gets stripped. I got the above from somewhere else, and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks

Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, ^ means "not" when it appears at the start of a character class (the square brackets [] you are using), so you can just add the comma and the period to the list. Outside a character class the period is a special character that matches any character, which is why it is escaped with a backslash here; inside the brackets it is already literal, so the backslash is optional but harmless.
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub(/[^0-9,\.]/, '')
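One more reason to be careful with the bang version: gsub! returns nil when it makes no substitutions, so assigning its result can wipe out the value. A small sketch with made-up sample values:

```ruby
rate = '$1,234.50 per month'

# gsub returns a new string and leaves the receiver untouched.
cleaned = rate.gsub(/[^0-9,.]/, '')

# gsub! edits in place, but returns nil if nothing needed replacing.
already_clean = '1,234.50'.dup
result = already_clean.gsub!(/[^0-9,.]/, '')

p cleaned         # "1,234.50"
p rate            # "$1,234.50 per month"
p result          # nil
p already_clean   # "1,234.50"
```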
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/

You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.

You can use this:
rate = rate.gsub(/[^0-9\.\,]/, '')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

What character encoding are the following German words using?

I'm trying to process a German word list and can't figure out what encoding the file is in. The 'file' unix command says the file is "Non-ISO extended-ASCII text". Most of the words are in ascii, but here are the exceptions:
ANDR\x82
ATTACH\x82
C\x82ZANNE
CH\x83TEAU
CONF\x82RENCIER
FABERG\x82
L\x82VI-STRAUSS
RH\x93NETAL
P\xF2ANGE
Any hints would be great. Thanks!
EDIT: To be clear, the hex codes above are C hex string literals so replace \xXX with the literal hex value XX.

It looks like CP437 or CP852, assuming the \x82 sequences encode single characters and are not literally four characters. At least, every line but the last decodes sensibly that way (in both code pages \x82 is é, \x83 is â, and \x93 is ô); the last line is a bit of a puzzle.
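One way to test the guess is to decode a few of the words and see whether they look right. This sketch assumes CP437, which Ruby calls IBM437; swapping in "IBM852" checks the other candidate.

```ruby
# Three of the words from the question, with the raw bytes as given.
samples = ["ANDR\x82", "CH\x83TEAU", "RH\x93NETAL"]

# Label each byte string as CP437 and transcode it to UTF-8.
decoded = samples.map do |raw|
  raw.b.force_encoding("IBM437").encode("UTF-8")
end

puts decoded   # ANDRé, CHâTEAU, RHôNETAL (one per line)
```

The accented letters come out lowercase because these code pages have no uppercase É, Â, or Ô at those byte values.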

Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)

There is a very similar question already. One of the solutions uses code like this one:
string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s
Which works wonders, until you notice it also removes spaces, dots, dashes, and who knows what else.
I'm not really sure how the first code works, but could it be made to strip only accents? Or at the very least be given a list of chars to preserve? My knowledge of regexps is small, but I tried (to no avail):
/[^\-x00-\x7F]/n # So it would leave the dash alone
I'm about to do something like this:
string.mb_chars.normalize(:kd).gsub('-', '__DASH__').gsub(/[^x00-\x7F]/n, '').gsub('__DASH__', '-').to_s
Atrocious? Yes...
I've also tried:
iconv = Iconv.new('UTF-8', 'US-ASCII//TRANSLIT') # Also tried ISO-8859-1
iconv.iconv 'Café' # Throws an error: Iconv::IllegalSequence: "é"
Help please?

it also removes spaces, dots, dashes, and who knows what else.
It shouldn't.
string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s
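You've mistyped: there should be a backslash before the x00 so that it refers to the NUL character. Without it, the class contains a literal x, a literal 0, and the range from the digit 0 to \x7F, so everything below the digit 0 (space, dot, dash, and so on) falls outside the class and gets stripped.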
/[^\-x00-\x7F]/n # So it would leave the dash alone
You've put the ‘-’ between the ‘\’ and the ‘x’, which will break the reference to the null character, and thus break the range.
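A sketch of the corrected range in action. It swaps in core Ruby's String#unicode_normalize for ActiveSupport's mb_chars.normalize so that it runs without Rails, and the sample text is made up. The second substitution is an alternative that is not in the answer above: it removes only the combining marks.

```ruby
# "Café au lait - déjà vu." written with escapes so the source file's
# own encoding does not matter.
text = "Caf\u00E9 au lait - d\u00E9j\u00E0 vu."

# Compatibility decomposition splits each accented letter into a base
# letter followed by a combining mark.
decomposed = text.unicode_normalize(:nfkd)

# Corrected range: drop everything outside ASCII (NUL through 0x7F).
ascii_only = decomposed.gsub(/[^\x00-\x7F]/, "")

# Alternative: drop only nonspacing marks, keeping other non-ASCII text.
marks_only = decomposed.gsub(/\p{Mn}/, "")

puts ascii_only   # Cafe au lait - deja vu.
puts marks_only   # Cafe au lait - deja vu.
```

Spaces, dots, and dashes survive both versions because they are plain ASCII and are not combining marks.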

I'd use the transliterate method. See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate

It's not as neat as Iconv, but does what I think you want:
http://snippets.dzone.com/posts/show/2384
