Using iconv to convert Traditional Chinese to Simplified Chinese - ruby-on-rails

How do I use Iconv in Ruby to convert a string from Simplified Chinese to Traditional Chinese (and vice versa)?
I've tried
Iconv.conv("gb2312//IGNORE", "big5//IGNORE", '大家一起學中文')
I get an entirely different string. When I try the GBK and BIG5 encodings instead, I get an IllegalSequence error.
Thanks.

https://rubygems.org/gems/tradsim
I just wrote a gem
To install the gem
gem install tradsim
To use the gem
# encoding: UTF-8
require 'tradsim'
puts Tradsim::to_sim("大家一起學中文")
it will yield
大家一起学中文
and you can use Tradsim::to_trad to do the reverse.

Are you trying to convert, say, 學 to 学? I could be wrong, but I don't think Iconv will perform that type of conversion.
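To illustrate the point: an encoding conversion changes how a character is stored as bytes, not which character it is. Here is a minimal sketch using Ruby's built-in String#encode (a stand-in for Iconv, chosen only because it ships with Ruby), assuming the source string is UTF-8:

```ruby
# Transcoding changes the byte representation, never the character itself.
trad = "學"                          # Traditional form, stored as UTF-8
big5 = trad.encode("Big5")           # same character, now as Big5 bytes
round_trip = big5.encode("UTF-8")    # back to UTF-8

puts trad.bytes.inspect              # UTF-8 bytes
puts big5.bytes.inspect              # different bytes, same character
puts round_trip == trad              # still the Traditional character
```

Getting from 學 to 学 needs a character mapping table, which is what the dedicated tools in the other answers provide.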

OpenCC
https://github.com/BYVoid/OpenCC
As of 2021, this seems to be the most popular choice:
sudo apt install opencc
opencc -i input.txt -o output.txt -c t2s.json
With:
input.txt
大家一起學中文
we get:
output.txt
大家一起学中文
It also has APIs for several languages like Python and Node.js.
Tested on Ubuntu 21.04, opencc 1.1.1.

Related

How to get fully working grep in git bash (msysgit) on windows?

I would like to use grep -o, but in git bash there is no -o option. Is there a way to get full working grep in git bash, just like it's in linux bash shell?
The POSIX specification of grep does not define an -o flag:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html
You can use sed instead.
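If Ruby happens to be available, the "print only the matching part" behaviour of grep -o can also be approximated with String#scan. This is only a sketch: the helper name only_matching and the sample lines are made up for illustration, and it assumes the pattern contains no capture groups (with groups, scan returns the groups rather than the whole match).

```ruby
# Collect only the matching parts of each line, one entry per match.
def only_matching(lines, pattern)
  lines.flat_map { |line| line.scan(pattern) }
end

sample = ["foo123 bar45", "no digits here", "baz6"]
puts only_matching(sample, /\d+/)
```

To process a file instead of an array, File.foreach(path) can be passed as the lines argument.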
There is an open issue for that on Github (even though it's under "nvm"). User UltCombo posted a workaround. Quoting:
Open <Git install directory>/bin and overwrite grep.exe with a more up to date version. I found two alternatives that provide -o support:
GnuWin32's grep 2.5.4 (link).
ezwinports' grep 2.10 (link). Note: You also have to extract libpcre-0.dll in the same folder as grep.
Though ezwinports' grep port is much more up to date, I can't say whether any of these will cause stability/compatibility issues. I haven't found any issues yet, but use it at your own risk.
Marking this Community Wiki because it's really somebody else's work.
Alternatively, get the pretty awesome MSYS2 and enjoy full grep and co.

Using globs in GNU grep's path argument

BSD (Mac) grep allows for this command:
grep -n "FIXME" **/*.rb
But GNU grep forces me to specify at least a folder to start from:
grep -n "FIXME" {lib,spec}/**/*.rb
Is there a way to get this to behave like it does in BSD grep?
Switch to ack. It uses the recursive strategy by default, and comes with loads of tricky regexes for types of language files available as flags.
For instance, writing:
ack FIXME --ruby
Will search the current directory recursively for anything that may be a Ruby file. This will work the same on Mac and Linux.
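For comparison, the same recursive search can be sketched in plain Ruby, where Dir.glob expands ** itself and so does not depend on the shell's globbing rules. The helper name grep_recursive and its glob: parameter are assumptions made for this example, not part of ack or grep.

```ruby
# Search every file matching the glob under root and return
# "path:line_number:line" strings for lines that match the pattern.
def grep_recursive(root, pattern, glob: "**/*.rb")
  Dir.glob(File.join(root, glob)).sort.flat_map do |path|
    next [] unless File.file?(path)
    File.foreach(path).with_index(1)
        .select { |line, _number| line.scrub.match?(pattern) }
        .map { |line, number| "#{path}:#{number}:#{line.chomp}" }
  end
end

puts grep_recursive(".", /FIXME/)
```

The output format mirrors grep -n, and the behaviour is the same on Mac and Linux because no shell expansion is involved.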

Rails includes return characters in environment variables

I'm storing environment variables in /etc/environment like:
FACEBOOK_API_KEY=XXXXXXXXXXX
FACEBOOK_API_SECRET=XXXXXXXXXX
But when I access the ENV variable through Rails I get this:
ENV['FACEBOOK_API_KEY']
=> XXXXXXXXX\r
Notice the \r. How do I get rid of it without cleaning up each and every call to ENV vars?
My guess is that you're getting a \r because you're editing /etc/environment using a Windows text editor and installing it on a Unix system (or possibly using Cygwin in Windows, but the same applies). Or you copied/pasted it from a source that did that. In any case, something has introduced a CRLF into a Unix file that only wants LF line-endings.
If it is an editor, you'll want to fix it to stop using Windows CRLF (\r\n) line endings, and use Unix LF (\n) line endings. Notepad++ has an option for this, as do many other editors and IDEs. Google around for yours and find out how to use Unix line endings. You'll run into a ton of problems like this otherwise.
You can fix the existing file by running it through a program like dos2unix (on a Unix system; you may have to install the package), or using a simple tr command like this:
Edit: fixed the filename order in the mv command below.
tr -d '\r' </etc/environment >/tmp/environment
# <verify new file looks good>
mv /tmp/environment /etc/environment
Please be careful, make backups, check the file, etc.
You can make sure there aren't any rogue \r characters in your file by looking at an octal dump:
od -c /tmp/environment
Look for any \r in the output.
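From the Ruby side, a quick way to confirm which variables are affected is to scan the environment for values ending in a carriage return. This is a diagnostic sketch only; the helper name vars_with_cr is made up for this example, and the real fix remains repairing the file's line endings.

```ruby
# Return the names of variables whose values end in a carriage return.
def vars_with_cr(env)
  env.select { |_name, value| value.end_with?("\r") }.keys
end

# Inspect the current process environment (e.g. from a Rails console).
puts vars_with_cr(ENV.to_h).inspect
```

An empty list after re-reading the fixed file suggests the stray \r characters are gone.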
You can use figaro to manage your ENV variables.
It creates a config/application.yml file for you that should not be checked into version control.
# config/application.yml
FACEBOOK_API_KEY: XXXXXXXXXXX
FACEBOOK_API_SECRET: XXXXXXXXXX
Variables will be available at ENV['FACEBOOK_API_KEY'] as you are used to.
An alternative would be configatron.

Download video from URL in ruby on rails

Currently I am using system command wget to download the video and store in our server:
system("cd #{RAILS_ROOT}/public/users/familyvideos/thumbs && wget -O sxyz1.mp4 #{video.download_url}")
but it's saying:
cd: 1: can't cd to /var/home/web/***.com/public/users/familyvideos/thumbs
Does anybody have any idea? Also, please suggest alternative ways to do this.
A much better solution would be to use open-uri for this.
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/
This snippet should do the trick:
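require 'open-uri'

url = video.download_url
new_file_path = "#{Rails.root}/public/users/familyvideos/thumbs/your_video.mp4"
# URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs
File.open(new_file_path, "wb") do |file|
  file.print URI.open(url).read
end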
It's the cd program complaining, and that's low-level operating system stuff, so the most likely reason is that one or more of the folders after #{RAILS_ROOT}/public doesn't yet exist, or there's a space or some other character in the file path causing problems. Try surrounding the directory name in quotes.
`cd "#{RAILS_ROOT}/public/users/familyvideos/thumbs" && wget -O sxyz1.mp4 #{video.download_url}`
As for better ways, wget is pretty tried and tested so there's nothing wrong with using it, since this is what it was designed for. You could just give it the folder and the filename at the same time instead of cd'ing to it first, however.
`wget -O "#{RAILS_ROOT}/public/users/familyvideos/thumbs/sxyz1.mp4" #{video.download_url}`
Note also I'm using the backtick method of execution instead of system so you're not dealing with escaping double quotes. You could also use %x{wget ...}.
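Another way to sidestep quoting is the multi-argument form of Kernel#system, which hands each argument to the program directly instead of passing a single string through a shell. The sketch below also creates the target directory first, since a missing directory is one likely cause of the cd error. The helper names wget_args and download_with_wget are made up for this example.

```ruby
require "fileutils"

# Build the wget argument list; each element reaches wget as one argument,
# so spaces in the path or URL need no quoting.
def wget_args(target_dir, filename, url)
  ["wget", "-O", File.join(target_dir, filename), url]
end

# Hypothetical helper: create the directory if it is missing, then run wget
# without going through a shell. Returns true when wget exits with status 0.
def download_with_wget(target_dir, filename, url)
  FileUtils.mkdir_p(target_dir)
  system(*wget_args(target_dir, filename, url))
end
```

Usage would look like download_with_wget("#{Rails.root}/public/users/familyvideos/thumbs", "sxyz1.mp4", video.download_url).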

An encoding-savvy grep replacement?

I am frustrated that grep fails to find a word like "hello" in my UTF-16 documents.
Can anyone recommend a version of grep that attempts to guess the file encoding and then properly handle it?
ack as perl-based grep replacement?
You'll definitely want to check out ack.
It supports Unicode encodings, and is basically grep, but better.
try a matching Unicode locale with grep
If you are on Linux, Unix, etc., you may want to set your LANG environment variable to a locale whose encoding matches your documents.
Check your locale first. Here is what mine is set to by default on my MacBook Pro:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
say, under bash:
$ LANG="foo" grep 'gotta be found now' file.name
something a little more permanent (be careful with this):
$ export LANG="foo"
$ grep 'bar' mitz.vah
Perl has a way better regex syntax than grep (much more powerful), it has UTF8 and UTF16 support, but I'm not sure how good it is at guessing the encoding... if you tell it which encoding to use, though, it can read these files without any issues and run regexes over them. You'll have to write yourself a tiny Perl program for that (your own micro-grep implementation in Perl so to say), but that isn't too hard. Perl exists for all major operating systems.
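The same "micro-grep" idea can be sketched in Ruby, which also has UTF-16 support built in. Like the Perl approach, this does not guess the encoding: the encoding: parameter (defaulting to UTF-16LE here) is an assumption you must supply, and the helper name grep_utf16 is made up for this example.

```ruby
# Read a file in the given encoding, convert it to UTF-8,
# and return the lines that match the pattern.
def grep_utf16(path, pattern, encoding: "UTF-16LE")
  text = File.binread(path).force_encoding(encoding).encode("UTF-8")
  text = text.delete_prefix("\uFEFF")   # drop a leading byte order mark, if any
  text.each_line(chomp: true).grep(pattern)
end

# Print matches for each file named on the command line.
ARGV.each { |path| puts grep_utf16(path, /hello/) }
```

For big-endian files, pass encoding: "UTF-16BE" instead.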
ugrep, which is free, open-source software under the BSD-3 license, supports all UTF encodings and claims to be a true drop-in replacement for grep because it supports the GNU/BSD grep command-line options. ripgrep, ack, and The Silver Searcher (ag) also support UTF encodings, but they are not drop-in replacements for grep since their behavior and options differ from grep's.
You could use the iconv filter utility in combination with grep to convert UTF-16 files to UTF-8, but you will have to explicitly specify the input and output encodings, something like:
iconv -f utf-16 -t utf-8 < file.txt | grep PATTERN
