rails: how to open the ENV in binary mode? - ruby-on-rails

One day my application declared all passwords invalid.
After tedious search the problem was found: a cipher initialization vector (just a bunch of random bits) is given to the application via ENV. And rails had decided to convert this string (which is arbitrary binary data) to UTF-8.
I'm doing basically this, before server start:
ENV["RAILS_ACC_VEC"] = "\xB3n%-\x9E^\xE1\x93 \x17\xEER\x1B\n\x84S"
Rack::Server.start( ...
and later
if Rails.env != "production"
salt = "dummy"
else
salt = ENV["RAILS_ACC_VEC"]
end
The bitstring should be 128 bit long. But it happened to be 176 bit long and contained valid UTF-8. (Obviousely, the cipher routines did utterly fail with that.)
The application currently runs on Rails 4.2.8 and ruby 2.4, and with default encoding.
The reason for the problem could be found: usually the application is started with the server or from deploy, with no locale in the environment. This time it was started from a console, and that console happened to be set to ISO 8859.
The consequence is also clear: one needs to take care that the application is always started with a definite locale in the ENV - either LC_CTYPE=C (equivalent to no locale), or -maybe better- UTF-8 (in case the application has default config.encoding).
What I am now trying to figure out, is, when and why does ruby/rails do such things?
I know that transcoding may happen with an IO object, but there the intended charset can be specified when opening.
It may make some sense, if the system seems to run in ISO 8859, and rails itself runs with UTF-8, that the ENV, when moved from outside to inside, may need transcoding. But that holds true only if language is concerned, and not all ENV content might be language.
So, how is the ENV opened in binary mode?
The more ambitioned question then is, are there more evil dangers of such kind around with the Encoding feature?

You should not store binary data in the system environment. The operating system is not designed to store binary data in its environment. I don't believe any provide that feature. All environment variables should be text. Maybe an OS can store binary data in the environment, but I don't believe that is a standard. I doubt they can store a null byte (\x00). It is probably a security risk for operating systems, leading to buffer overflow exploits for other programs that read the environment. Try a search of 'posix env binary'.
You should store your IV as base64 encoded data whenever you store it as text.
ENV['IV'] = 'VGhpcyBjYW4gYmUgYmluYXJ5Lg=='
export IV=VGhpcyBjYW4gYmUgYmluYXJ5Lg== # or from the shell
...
iv = Base64.decode64 ENV['IV']

Related

Is there a way to check if a Ruby variable contains binary data?

I'm using Ruby 2.4 and Rails 5. I have file content in a variabe named "content". The content could contain data from things like a PDF file, a Word file, or an HTML file. Is there any way to tell if the variable contains binary data? Ultimately, I would like to know if this is a PDf, Microsoft Office, or some other type of OpenOffice file. This answer -- Rails: possible to check if a string is binary? -- suggests that I can check the encoding of the variable
content.encoding
and it would produce
ASCII-8BIT
in the case of binary data, however, I've noticed there are cases where HTML content stored in the variable could also return "ASCII-8BIT" as the content.encoding, so using "content.encoding" is not a foolproof way to tell me if I have binary data. Does such a way exist and if so, what is it?
If your real question is not about binary data per se but about determining the file type of the data, I'd recommend to have a look at the ruby-filemagic gem which will give you this information much more reliably. The gem is a simple wrapper around the libmagic library which is standard on unix-like systems. The library works by scanning the content of a file and matching it against a set of known "magic" patterns in various file types.
Sample usage for a string buffer (e.g. data read form the database):
require "ruby-filemagic"
content = File.read("/.../sample.pdf") # just an example to get some data
fm = FileMagic.new
fm.buffer(content)
#=> "PDF document, version 1.4"
For the gem to work (and compile) you need the file utility as well as the magic library with headers installed on your system. Quoting from the readme:
The file(1) library and headers are required:
Debian/Ubuntu:: +libmagic-dev+
Fedora/SuSE:: +file-devel+
Gentoo:: +sys-libs/libmagic+
OS X:: brew install libmagic
Tested to work well under Rails 5.
If you're on an unix machine, you can use the file command:
file titi.pdf
You could then do something like:
require 'open2'
cmd = 'file -'
Open3.popen3(cmd) do |stdin, stdout, wait_thr|
stdin.write(content)
stdin.close
puts "file type is:" + stoud.read
end

Get localized name other channel

I get the version number of the firefox from the applications.ini.
Then I hardcoded that between date #### and #### v35 is release. So now based on this and the current date and version from applications.ini I figure out the channel of other builds.
But now I want to get the localized name of the channel.
So for example I'm using beta channel and from this build I want to get the localized name of "Nightly" in chineese, so it has the chineese characters, and word for nightly in chineese. Can this also be obtained from the applications.ini? Is [App] -> Name localized in applications.ini?
This is the applications.ini method: https://ask.mozilla.org/question/705/detect-if-auroranightlybetanormal-and-get-paths/ (credits to #paa)
EDIT
i discovered this file: OS.Path.join(Services.dirsvc.get('XREExeF', Ci.nsIFile).parent.path, 'defaults', 'pref', 'channel-prefs.js')
its contents is the following:
//#line 2 "c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\browser\app\profile\channel-prefs.js"
/* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, You can obtain one at http://mozilla.org/MPL/2.0/. */
pref("app.update.channel", "beta");
Is this a reliable check? Does this channel-prefs.js file exist for all builds as soon as they are installed?
Is this a reliable check?
Not really. There used to be channel switcher add-ons, and in theory the user can change this pref (although at the moment this is not sufficient to really switch the channel I think).
Does this channel-prefs.js file exist for all builds as soon as they are installed?
Yes, for now. But this is an implementation detail. There is no guarantee that the file won't be moved or renamed later, or merged with another file.
Can this also be obtained from the applications.ini?
The localized name? I didn't even know there was one... I thought it was called e.g. "Nightly" in all locales like it was a (product) name. But yeah, it is theoretically possible to localize that string. It is not available from the ini file, though.
I wouldn't poke in application.ini anyway, and instead just use Services.appinfo.defaultUpdateChannel
But now I want to get the localized name of the channel.
Since you're in a running Firefox instance already (judging from your OS.File code), you should use the string bundle service to load chrome://branding/locale/brand.properties and get the brandShortName or brandFullName string from there.

Foreign character issue with CSV import to Heroku Postgres DB

I have a rails app where my users can manually set up products via a web form. This works fine and accepts foreign characters well, words like 'Svölk' for example.
I now have a need to bulk import products and am using FasterCSV to do so. Generally this works without issue, but when the CSV contains foreign characters it stalls at that point.
Am I correct to believe the file needs to be UTF-8 in the first instance?
Also, I'm running Ruby 1.8.7 so is ICONV my only solution for converting the file? This could be an issue as the format of the original file won't be known.
Have others encountered this issue and if so, how did you overcome it?
You have two alternatives:
Use ensure_encoding gem to find the actual encoding of the strings.
Use Ruby to determine the file encoding using:
File.open(source_file).read.encoding
I prefer the first approach as it tries to detected the encoding based on Strings, and tries to convert to your desired encoding (UTF-8) and then you can set the encoding on FasterCSV options.

Options for MeCab Japanese tokenizer on iOS?

I'm using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having some trouble getting it to tokenize all possible words. Specifically, I cannot tokenize "吉本興業" into two pieces "吉本" and "興業". Are there any options that I could use to fix this? The iPhone library does not expose anything, but it uses C++ underneath the objective-c wrapper. I assume there must be some sort of setting I could change to give more fine-grained control, but I have no idea where to start.
By the way, if anyone wants to tag this 'mecab' that would probably be appropriate. I'm not allowed to create new tags yet.
UPDATE: The iOS library is calling mecab_sparse_tonode2() defined in libmecab.cpp. If anyone could point me to some English documentation on that file it might be enough.
There is nothing iOS-specific in this. The dictionary you are using with mecab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are listed as separate nouns as well, mecab has a strong preference to tag the compound name as one word.
Mecab lacks a feature that allows the user to choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
If you have a list of compounds you'd like to get segmented, a quick fix is to create a user dictionary for the parts they consist of, and make mecab use this in addition to the main dictionary.
There is Japanese documentation on how to do this here. For your particular example, it would involve the steps below.
Make a user dictionary with two entries, one for 吉本 and one for 興業:
吉本,,,100,名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業,,,100,名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
I suspect that both entries exist in the default dictionary already, but by adding them to a user dictionary and specifying a relatively low specificness indicator (I've used 100 for both -- the lower, the more likely to be split), you can get mecab to tend to prefer the parts over the whole.
Compile the user dictionary:
$> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
You may have to adjust the command. The above assumes:
Mecab was installed from source in $MECAB. If you use mecab installed by a package manager, you might have difficulties finding the mecab-dict-index tool. Best install from source.
The default dictionary is in /usr/lib64/mecab/dict/ipadic. This is not part of the mecab package; it comes as a separate package (e.g. this) and you may have difficulties finding this, too.
mydic is the name of the user dictionary created in step 1. mydic.dic is the name of the compiled dictionary you'll get as output (needs not exist).
Both the system dictionary (-t option) and the user dictionary (-f option) are encoded in UTF-8. This may be wrong, in which case you'll get an error message later when you use mecab.
Modify the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or similar. In your case it may be located somewhere else. Add the following line to the end of the configuration file:
userdic = home/myhome/mydic.dic
Make sure the absolute path to the dictionary compiled above is correct.
If you then run mecab against your input, it will split the compound into its parts (I tested it, using mecab 0.994 on a Linux system).
A more thorough fix would be to get the source of the default dictionary and manually remove all compoun nouns you want to get split, then recompile the dictionary. As a general remark, using a CJK tokenizer for a serious application in production mode over a longer period of time usually involves a certain amount of dictionary maintenance (adding/removing entries) regularly.

Rails: possible to check if a string is binary?

In a particular Rails application, I'm pulling binary data out of LDAP into a variable for processing. Is there a way to check if the variable contains binary data? I don't want to continue with processing of this variable if it's not binary. I would expect to use is_a?...
In fact, the binary data I'm pulling from LDAP is a photo. So maybe there's an even better way to ensure the variable contains binary JPEG data? The result of this check will determine whether to continue processing the JPEG data, or to render a default JPEG from disk instead.
There is actually a lot more to this question than you might think. Only since Ruby 1.9 has there been a concept of characters (in some encoding) versus raw bytes. So in Ruby 1.9 you might be able to get away with requesting the encoding. Since you are getting stuff from LDAP the encoding for the strings coming in should be well known, most likely ISO-8859-1 or UTF-8.
In which case you can get the encoding and act on that:
some_variable.encoding # => when ASCII-8BIT, treat as a photo
Since you really want to verify that the binary data is a photo, it would make sense to run it through an image library. RMagick comes to mind. The documentation will show you how to verify that any binary data is actually JPEG encoded. You will then also be able to store other properties such as width and height.
If you don't have RMagick installed, an alternative approach would be to save the data into a Tempfile, drop down into Unix (assuming you are on Unix) and try to identify the file. If your system has ImageMagick installed, the identify command will tell you all about images. But just calling file on it will tell you this too:
~/Pictures$ file P1020359.jpg
P1020359.jpg: JPEG image data, EXIF standard, comment: "AppleMark"
You need to call the identify and file commands in a shell from Ruby:
%x(identify #{tempfile})
%x(file #{tempfile})

Resources