Testing for characters in right-to-left languages using STRING MATCH and REGEXP; and encodings?

I'm trying to test for a character at the end of a string in a right-to-left language, Hebrew, in Tcl.
I think it may be further complicated because the Hebrew was passed in JSON, but I'm not sure because I'm still quite confused by encodings.
I've been trying the following code on some test strings and, although I think I understand why the regular expressions at the bottom "work", I don't understand why the {*־} in string match provides the desired result.
set hebJoined עַל־הָאָרֶץ
set hebSep עַל־
chan puts stdout [string match {*־} $hebJoined]; #=> 0
chan puts stdout [string match {^*־} $hebJoined]; #=> 0
chan puts stdout [string match {*^־} $hebJoined]; #=> 0
chan puts stdout [string match {*־^} $hebJoined]; #=> 0
chan puts stdout [string match {$*־} $hebJoined]; #=> 0
chan puts stdout [string match {*־$} $hebJoined]; #=> 0
chan puts stdout [string match {*$־} $hebJoined]; #=> 0
chan puts stdout [string match {*־} $hebSep]; #=> 1
chan puts stdout [string match {^*־} $hebSep]; #=> 0
chan puts stdout [string match {*^־} $hebSep]; #=> 0
chan puts stdout [string match {*־^} $hebSep]; #=> 0
chan puts stdout [string match {$*־} $hebSep]; #=> 0
chan puts stdout [string match {*־$} $hebSep]; #=> 0
chan puts stdout [string match {*$־} $hebSep]; #=> 0
chan puts stdout [regexp {(.*־)(.*)} $hebJoined {\1\2} heb1 heb2]
chan puts stdout $heb1; # => עַל־
chan puts stdout $heb2; # => הָאָרֶץ
chan puts stdout [regexp {(.*־)$} $hebJoined]; # 0
chan puts stdout [regexp {(.*־)$} $hebSep]; # 1
Then there is the real larger issue that I am working with data that was passed as JSON and the regular expressions above will not provide the desired result but a modification of the string match does.
string match {*־} [encoding convertto iso8859-1 $hebrew] appears to find all the words that end in a hyphen; that is, on the left-hand side; and does not return results for hyphens in the middle of a string. And I do not understand why it does so. I don't know how to provide an example because the stored data for the Hebrew looks like עַל.
Can string match or regular expression test for a unicode value like \u05BE which is what I think this hyphen is?
Would you please tell me why the code I've used seems to work and how I can correct it to work properly? If I change the encoding to utf-8, then the string match does not provide any matches.
Thank you.
EDIT:
I think this is what is needed. I was confused for awhile partly because I was looking at a file that purposely leaves off the hyphen. This code yields correct results but is ugly and likely not the best approach.
chan puts stdout [regexp {(×¢Ö·×Ö¾)} [encoding convertto utf-8 $hebJoined] {\1} h1]; # => 1
chan puts stdout [regexp {(Ö¾)$} [encoding convertto utf-8 $hebJoined] {\1} h2]; # => 0
chan puts stdout [regexp {(Ö¾)$} [encoding convertto utf-8 $hebSep] {\1} h2]; # => 1
chan configure stdout -encoding iso8859-1 -translation crlf
chan puts stdout $h1; # => עַל־
chan puts stdout $h2; # => ־ the desired hyphen.
ANOTHER EDIT:
I was making a serious mistake in reading this data into Tcl as iso8859-1 instead of utf-8. If I change the encoding of the channel receiving the data to utf-8, then most of these issues disappear altogether, and testing with Unicode values like \U05BE works nicely. In this particular case, my error of reading utf-8 as iso8859-1 appears to have resulted in each byte of a multi-byte character being read as a separate character, and that complicated the matching in string match and regexp.

Can string match or regular expression test for a unicode value like \u05BE which is what I think this hyphen is?
Sure. It is just a conventional character to both of those matching systems. Indeed, they work on the logical sequence of characters, so that's what you should always use; which way they display is an output rendering problem, and Tcl itself says nothing about that. (Your terminal, or a GUI toolkit like Tk, would care more.)
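The same behaviour can be reproduced outside Tcl. Below is a minimal sketch in Ruby, used purely for illustration (the question itself is about Tcl, and every name here is hypothetical). It assumes the two test strings from the question, written with escapes. It shows that U+05BE is simply the last character of the logical string, and that decoding the UTF-8 bytes as ISO-8859-1, the mistake described in the question's last edit, turns it into the two characters Ö¾ that the earlier workaround matched.

```ruby
# Illustration only: Ruby stand-in for the Tcl experiment; all names are hypothetical.
MAQAF_AT_END = /\u05BE\z/   # U+05BE (Hebrew maqaf) as the last logical character

# The question's test strings, written with escapes to avoid editor encoding issues.
HEB_SEP    = "\u05E2\u05B7\u05DC\u05BE"
HEB_JOINED = HEB_SEP + "\u05D4\u05B8\u05D0\u05B8\u05E8\u05B6\u05E5"

# Simulate reading UTF-8 bytes through an ISO-8859-1 channel:
# every byte becomes its own character.
def misread_as_latin1(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

puts HEB_SEP.match?(MAQAF_AT_END)      # true: the maqaf is last in logical order
puts HEB_JOINED.match?(MAQAF_AT_END)   # false: the maqaf is in the middle

misread = misread_as_latin1(HEB_SEP)
puts misread.match?(MAQAF_AT_END)         # false: U+05BE no longer exists as one character
puts misread.end_with?("\u00D6\u00BE")    # true: its bytes D6 BE were read as two characters
```

Which side of the screen the maqaf appears on never enters into it; only the order of characters in memory does.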

Related

Smarter CSV ignore blank lines in csv

I am using Smarter CSV and have encountered a csv that has blank lines. Is there any way to ignore these? Smarter CSV is taking the blank line as a header and not processing the file correctly. Is there any way I can bastardize the comment_regexp?
mail.attachments.each do |attachment|
  filename = attachment.filename
  #filedata = attachment.decoded
  puts filename
  begin
    tmp = Tempfile.new(filename)
    tmp.write attachment.decoded
    tmp.close
    puts tmp.path
    f = File.open(tmp.path, "r:bom|utf-8")
    options = {
      :comment_regexp => /^#/
    }
    data = SmarterCSV.process(f, options)
    f.close
    puts data
  end
end
Sample file and output: [images not preserved]
Let's first construct your file.
str = <<~_
#
# Report
#---------------
Date header1 header2 header3 header4
20200 jdk;df 4543 $8333 4387
20200 jdk 5004 $945876 67
_
fin_name = 'in'
File.write(fin_name, str)
#=> 223
Two problems must be addressed to read this file using the method SmarterCSV::process. The first is that comments--lines beginning with an octothorpe ('#')--and blank lines must be skipped. The second is that the field separator is not a fixed-length string.
The first of these problems can be dealt with by setting the value of process' :comment_regexp option key to a regular expression:
:comment_regexp => /\A#|\A\s*\z/
which reads, "match an octothorpe at the beginning of the string (\A being the beginning-of-string anchor) or (|) match a string containing zero or more whitespace characters (\s being a whitespace character and \z being the end-of-string anchor)".
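To see what that pattern accepts and rejects, it can be tried on a few lines in isolation. This is a minimal sketch that does not call SmarterCSV at all; the sample lines and the names SKIP_LINE and keep_data_lines are invented for illustration.

```ruby
# The regular expression from the answer; the constant name is hypothetical.
SKIP_LINE = /\A#|\A\s*\z/

# Returns only the lines that are neither comments nor blank.
def keep_data_lines(lines)
  lines.reject { |line| line.match?(SKIP_LINE) }
end

sample = ["#\n", "# Report\n", "\n", "  \t\n", "Date header1\n", "20200 jdk;df\n"]
p keep_data_lines(sample)
# prints ["Date header1\n", "20200 jdk;df\n"]
```

Note that a comment preceded by whitespace, such as "  # note", is not skipped by this pattern, because \A# requires the octothorpe to be the very first character.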
Unfortunately, SmarterCSV is not capable of dealing with variable-length field separators. It does have an option :col_sep, but its value must be a string, not a regular expression.
We must therefore pre-process the file before using SmarterCSV, though that is not difficult. While we are at it, we may as well remove the dollar signs and use commas for field separators (see note 1 below).
fout_name = 'out.csv'
fout = File.new(fout_name, 'w')
File.foreach(fin_name) do |line|
fout.puts(line.strip.gsub(/\s+\$?/, ',')) unless
line.match?(/\A#|\A\s*\z/)
end
fout.close
Let's look at the file produced.
puts File.read(fout_name)
displays
Date,header1,header2,header3,header4
20200,jdk;df,4543,8333,4387
20200,jdk,5004,945876,67
Now that's what a CSV file should look like! We may now use SmarterCSV on this file with no options specified:
SmarterCSV.process(fout_name)
#=> [{:date=>20200, :header1=>"jdk;df", :header2=>4543,
# :header3=>8333, :header4=>4387},
# {:date=>20200, :header1=>"jdk", :header2=>5004,
# :header3=>945876, :header4=>67}]
1. I used IO::foreach to read the file line-by-line and then write each manipulated line that is neither a comment nor a blank line to the output file. If the file is not huge we could instead gulp it into a string, modify the string and then write the resulting string to the output file: File.write(fout_name, File.read(fin_name).gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '').gsub(/[ \t]+/, ',')). The first regular expression reads, "match lines beginning with an octothorpe or lines containing only spaces and tabs or spaces and tabs at the beginning of a line or spaces and tabs at the end of a line or a dollar sign". The second gsub merely converts multiple tabs and spaces to a comma.
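The two-step gsub from the note can be checked on an in-memory string before pointing it at a file. A small sketch, with a made-up sample and a hypothetical helper name tidy; only the two regular expressions come from the note.

```ruby
# Applies the note's two substitutions to a string instead of a file.
def tidy(text)
  text
    .gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '')  # drop comments, blank lines, edge whitespace, dollar signs
    .gsub(/[ \t]+/, ',')                                # collapse remaining runs of spaces/tabs to commas
end

raw = "#\n# Report\n\nDate   header1  header2\n20200  jdk;df   $8333  \n"
puts tidy(raw)
# prints:
# Date,header1,header2
# 20200,jdk;df,8333
```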

Specify Unicode Character in Regular Expression

How can I create a ruby regular expression that includes a unicode character?
For example, I would like to match the character "\u0002" in my regular expression.
You can write /\x02/ :
"\u0002" =~ /\x02/
#=> 0
If you're not sure, you can just start from a string :
Regexp.new("\u0002")
#=> /\x02/
Here's another example :
"☀☁☂" =~ /\u2602/
#=> 2
As mentioned by @TomLord in the comments, you can also specify a range. To check if a string includes a character from the Unicode Arrows block:
"↹" =~ /[\u2190-\u21FF]/
#=> 0
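A few more forms may be worth knowing. The sketch below is illustrative only, with made-up sample strings and helper names, and assumes a UTF-8 source file (Ruby's default). It shows the braces form \u{...} for code points above U+FFFF, building a pattern from a string with Regexp.escape, and a script property such as \p{Hebrew}.

```ruby
# Index of a control character written as \uXXXX inside a regexp literal.
def control_index(text)
  text =~ /\u0002/
end

# \u{...} accepts code points beyond four hex digits.
def grinning_face?(text)
  text.match?(/\u{1F600}/)
end

# Build a regexp from an arbitrary string; Regexp.escape guards any metacharacters.
def ends_with_char?(text, char)
  text.match?(Regexp.new(Regexp.escape(char) + "\\z"))
end

# Unicode script property: any character from the Hebrew script.
def hebrew?(text)
  text.match?(/\p{Hebrew}/)
end

p control_index("a\u0002b")                              # 1
p grinning_face?("ok \u{1F600}")                         # true
p ends_with_char?("\u05E2\u05DC\u05BE", "\u05BE")        # true
p hebrew?("abc \u05D0")                                  # true
p hebrew?("abc")                                         # false
```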

How do I replace all the apostrophes that come right before or right after a comma?

I have a string aString = "old_tag1,old_tag2,'new_tag1','new_tag2'"
I want to replace the apostrophes that come right before or right after a comma. For example, in my case the apostrophes enclosing new_tag1 and new_tag2 should be removed.
This is what I have right now
aString = aString.gsub("'", "")
This is problematic, however, because it also removes any apostrophe inside a tag, for example if I had 'my_tag's' instead of 'new_tag1'. How do I get rid of only the apostrophes that come before or after the commas?
My desired output is
aString = "old_tag1,old_tag2,new_tag1,new_tag2"
My guess is to use regex as well, but in a slightly other way:
aString = "old_tag1,old_tag2,'new_tag1','new_tag2','new_tag3','new_tag4's'"
aString.gsub /(?<=^|,)'(.*?)'(?=,|$)/, '\1\2\3'
#=> "old_tag1,old_tag2,new_tag1,new_tag2,new_tag3,new_tag4's"
The idea is to find a substring with bounding apostrophes and paste it back without it.
regex = /
(?<=^|,) # watch for start of the line or comma before
' # find an apostrophe
(.*?) # get everything between apostrophes in a non-greedy way
' # find a closing apostrophe
(?=,|$) # watch after for the comma or the end of the string
/x
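The replacement part just pastes back the captured content. Only the first group (the text between the apostrophes) captures anything; the lookbehind and lookahead are not capture groups, so \2 and \3 expand to nothing and '\1' alone would do.
Thanks to @Cary for the /x modifier for regexes, I didn't know about it! Extremely useful for explanation.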
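As a quick check on the replacement string, here is a small sketch comparing '\1\2\3' with '\1' on the sample above. Only the regex and the sample string come from the answer; the constant and method names are hypothetical.

```ruby
# The answer's pattern: one capture group between two lookarounds.
QUOTED_ITEM = /(?<=^|,)'(.*?)'(?=,|$)/

# Removes apostrophes that enclose a whole comma-separated item.
def strip_enclosing_quotes(list)
  list.gsub(QUOTED_ITEM, '\1')
end

tags = "old_tag1,old_tag2,'new_tag1','new_tag2','new_tag3','new_tag4's'"
puts strip_enclosing_quotes(tags)
# prints old_tag1,old_tag2,new_tag1,new_tag2,new_tag3,new_tag4's

# Back-references to groups that do not exist expand to an empty string,
# so both replacement strings give the same result.
puts tags.gsub(QUOTED_ITEM, '\1\2\3') == strip_enclosing_quotes(tags)   # true
```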
This answers the question, "I want to replace the apostrophes that come right before or right after a comma".
r = /
(?<=,) # match a comma in a positive lookbehind
\' # match an apostrophe
| # or
\' # match an apostrophe
(?=,) # match a comma in a positive lookahead
/x # free-spacing regex definition mode
aString = "old_tag1,x'old_tag2'x,x'old_tag3','new_tag1','new_tag2'"
aString.gsub(r, '')
#=> "old_tag1,x'old_tag2'x,x'old_tag3,new_tag1,new_tag2'"
If the objective is instead to remove single quotes enclosing a substring when the left quote is at the beginning of the string or is immediately preceded by a comma and the right quote is at the end of the string or is immediately followed by a comma, several approaches are possible. One is to use a single, modified regex, as @Dimitry has done. Another is to split the string on commas, process each string in the resulting array and then join the modified substrings, separated by commas.
r = /
\A # match beginning of string
\' # match single quote
.* # match zero or more characters
\' # match single quote
\z # match end of string
/x # free-spacing regex definition mode
aString.split(',').map { |s| (s =~ r) ? s[1..-2] : s }.join(',')
#=> "old_tag1,x'old_tag2'x,x'old_tag3',new_tag1,new_tag2"
Note:
arr = aString.split(',')
#=> ["old_tag1", "x'old_tag2'x", "x'old_tag3'", "'new_tag1'", "'new_tag2'"]
"old_tag1" =~ r #=> nil
"x'old_tag2'x" =~ r #=> nil
"x'old_tag3'" =~ r #=> nil
"'new_tag1'" =~ r #=> 0
"'new_tag2'" =~ r #=> 0
Non regex replacement
Regular expressions can get really ugly. There is a simple way to do it with just string replacement: search for the pattern ,' and ', and replace with ,
aString.gsub(",'", ",").gsub("',", ",")
=> "old_tag1,old_tag2,new_tag1,new_tag2'"
This leaves the trailing ', but that is easy to remove with .chomp("'"). A leading ' can be removed with a simple regex .gsub(/^'/, "")
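Putting those pieces together, a minimal sketch might look like the following. The method name is hypothetical, and it uses sub with \A rather than gsub with ^ so that only the very first character of the string is considered.

```ruby
# Chains the plain-string replacements described above; the helper name is made up.
def strip_comma_adjacent_quotes(list)
  list
    .gsub(",'", ",")   # apostrophe right after a comma
    .gsub("',", ",")   # apostrophe right before a comma
    .chomp("'")        # trailing apostrophe, if any
    .sub(/\A'/, "")    # leading apostrophe, if any
end

puts strip_comma_adjacent_quotes("old_tag1,old_tag2,'new_tag1','new_tag2'")
# prints old_tag1,old_tag2,new_tag1,new_tag2
puts strip_comma_adjacent_quotes("'a','b's','c'")
# prints a,b's,c
```

One caveat of this approach: an unquoted final item that legitimately ends in an apostrophe would lose it, since chomp cannot tell an enclosing quote from a trailing one.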

Ruby regular expression for version numbers

I want to write a program which takes build number in the format of 23.0.23.345 (first two-digits then dot, then zero, then dot, then two-digits, dot, three-digits):
number=23.0.23.345
pattern = /(^[0-9]+\.{0}\.[0-9]+\.[0-9]$)/
numbers.each do |number|
if number.match pattern
puts "#{number} matches"
else
puts "#{number} does not match"
end
end
Output:
I am getting error:
floating literal anymore put zero before dot
I'd use something like this to find patterns that match:
number = 'foo 1.2.3.4 23.0.23.345 bar'
build_number = number[/
\d{2} # two digits
\.
0
\.
\d{2} # two more digits
\.
\d{3}
/x]
build_number # => "23.0.23.345"
This example is using String's [/regex/] method, which is a nice shorthand way to apply and return the result of a regex. It returns the first match only in the form I'm using. Read the documentation for more information and examples.
Your pattern won't work because it doesn't do what you think it does. Here's how I'd read it:
/( # group
^ # start of line
[0-9]+ # one or more digits
\.{0} # *NO* dots
\. # one dot
[0-9]+ # one or more digits
\. # one dot
[0-9] # one digit
$ # end of line
)/x
The problem is \.{0} which means you don't want any dots.
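The x flag tells Ruby to use extended (free-spacing) mode, which ignores whitespace and comments inside the pattern, making it easy to build a pattern that is documented. It is not the multiline flag; that one is m, which makes . match newlines.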
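The effect of the x flag is easy to confuse with m, so here is a tiny sketch contrasting the two. The method name and test strings are invented for illustration.

```ruby
# Compares the same patterns with and without the x and m flags.
def flags_demo
  {
    plain_with_space:      "ab".match?(/a b/),       # false: the space must match literally
    extended_with_space:   "ab".match?(/a b/x),      # true: x ignores whitespace in the pattern
    dot_newline_plain:     "a\nb".match?(/a.b/),     # false: . does not match a newline
    dot_newline_multiline: "a\nb".match?(/a.b/m)     # true: m lets . match a newline
  }
end

flags_demo.each { |name, result| puts "#{name}: #{result}" }
```

Under x, a space that should be matched literally has to be written explicitly, for example as \s.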
Why reinvent the wheel? Use a gem like versionomy. You can parse the versions, compare them, check for equality, increment a particular part, etc. It even handles alpha, beta, patchlevels, etc.
require 'versionomy'
number='23.0.23.345'
v = Versionomy.parse number
v.major #=> 23
v.minor #=> 0
v.tiny #=> 23
v.tiny2 #=> 345
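If adding a dependency is not an option, Gem::Version, which ships with RubyGems, covers parsing and comparison of dotted numeric versions. This is an alternative to versionomy, not part of it, and the helper names below are hypothetical. Note that it does not check the specific "two digits, zero, two digits, three digits" shape.

```ruby
require "rubygems"

# Splits a dotted version string into integer segments.
def version_segments(text)
  Gem::Version.new(text).segments
end

# True when version a is strictly newer than version b.
def newer?(a, b)
  Gem::Version.new(a) > Gem::Version.new(b)
end

p version_segments("23.0.23.345")          # [23, 0, 23, 345]
p newer?("23.0.33.173", "23.0.23.345")     # true
p newer?("23.0.23.345", "23.0.23.345")     # false
```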
numbers = "23.0.23.345", "23.0.33.173", "0.0.0.0"
pattern = /\d{2}\.0\.\d{2}\.\d{3}/x
numbers.each do |number|
if number.match pattern
puts "#{number} matches"
else
puts "#{number} does not match"
end
end
The value assigned in line one needs to be a string, not a bare number. I also renamed "number" to "numbers" and made it an array with several items, so that there is something for ".each" to iterate over in your loop.
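One thing to watch with an unanchored pattern is that it also accepts longer strings that merely contain a build number. If the whole string has to be a build number, anchoring with \A and \z is one option. A minimal sketch, with hypothetical names and made-up test values:

```ruby
# Anchored variant: the entire string must have the shape dd.0.dd.ddd.
BUILD_NUMBER = /\A\d{2}\.0\.\d{2}\.\d{3}\z/

def build_number?(text)
  text.match?(BUILD_NUMBER)
end

%w[23.0.23.345 123.0.23.3456 23.1.23.345].each do |candidate|
  puts "#{candidate} #{build_number?(candidate) ? 'matches' : 'does not match'}"
end
# prints:
# 23.0.23.345 matches
# 123.0.23.3456 does not match
# 23.1.23.345 does not match

# For comparison, the unanchored pattern accepts the second value:
puts "123.0.23.3456".match?(/\d{2}\.0\.\d{2}\.\d{3}/)   # true
```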
There seems to be agreement on what regular expression you should use. If your ultimate goal is to extract the elements of the strings as integers, you could do this:
str = "I'm looking for 23.0.345.26, or was that 23.0.26.345?"
str.scan(/(\d{2})\.(0)\.(\d{2})\.(\d{3})/).flatten.map(&:to_i)
#=> [23, 0, 26, 345]

Can't understand "puts" keyword in Ruby code [duplicate]

I don't know the Ruby language. I was reading a very interesting article which contains the following short piece of Ruby code, which I need to understand.
(0..0xFFFFFFFFFF).each do |i|
puts "#{"%010x" % i}"
end
By googling, I understood the first line, but I am not able to understand the second line. Can someone please explain its meaning?
puts "#{"%010x" % i}" has actually two parts - string interpolation (which G.B tells you about), and string format using %:
Format—Uses str as a format specification, and returns the result of
applying it to arg. If the format specification contains more than one
substitution, then arg must be an Array or Hash containing the values
to be substituted. See Kernel::sprintf for details of the format
string.
"%05d" % 123 #=> "00123"
"%-5s: %08x" % [ "ID", self.object_id ] #=> "ID : 200e14d6"
"foo = %{foo}" % { :foo => 'bar' } #=> "foo = bar"
So "%010x" % i formats the integer in hex format (x) with at least 10 digits (10), padding with zeros (the leading 0):
"%010x" % 150000
# => "00000249f0"
Actually
puts "#{"%010x" % i}"
is exactly the same as
puts "%010x" % i
since the interpolation simply puts the resulting value (a string) within a string....
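The equivalences mentioned above can be checked directly. A short sketch, where hex10 is a hypothetical helper name; format is Kernel#format, an alias of sprintf.

```ruby
# Formats an integer as lowercase hex, zero-padded to at least 10 digits.
def hex10(value)
  format("%010x", value)
end

puts hex10(150000)          # 00000249f0
puts hex10(0xFFFFFFFFFF)    # ffffffffff
puts hex10(255) == ("%010x" % 255)                 # true: String#% does the same formatting
puts hex10(255) == 255.to_s(16).rjust(10, "0")     # true: manual hex conversion plus padding
```

Keep in mind that the loop in the question runs from 0 to 0xFFFFFFFFFF, which is 2**40 (about 1.1 trillion) values, so it would print for a very long time.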
puts is a method (not a keyword) used to print data to the console.
For example,
puts "writing data to console"
prints exactly that text, writing data to console, to the console.
@a = "this is a string"
puts @a
This will print "this is a string".
puts "My variable a contains #{@a}"
This will print "My variable a contains this is a string"; this merging technique is called string interpolation.
In puts "#{"%010x" % i}", the first operand of % specifies the format and the second is the value.
and for your exact question and further details see this link
it's string interpolation and sprintf
documents:
http://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#Interpolation
http://www.ruby-doc.org/core-2.1.2/Kernel.html#method-i-sprintf
http://batsov.com/articles/2013/06/27/the-elements-of-style-in-ruby-number-2-favor-sprintf-format-over-string-number-percent/
"%010x" % i is same as sprintf("%010x", i)
puts "#{"%010x" % i}"
This line prints the content. If you want to interpolate the string, you can use single quotes inside the double quotes, like this: "#{'%010x' % i}"
%x converts the integer to hexadecimal, and %010x pads the result with zeros to a width of 10, so if the value is 0 the output is 0000000000.
Print
puts is roughly the equivalent of echo in PHP or printf in C
When included in either a command-line app, or as part of a larger application, it basically allows you to output the text you assign to the method:
puts "#{"%010x" % i}"
This will basically print the contents of "#{"%010x" % i}" on the screen. Ruby evaluates what's inside the curly braces and outputs the result (as explained in another answer).
