Utility for reporting the character set associated with a given file

The 'file' utility does not provide what I am looking for.
I would like to "sanitize" selected files by using the iconv utility to transform them into any of the character-set encodings it recognizes (see iconv -l).
Is there a utility that will examine any file (regular files as well as zip, tar, etc.) and report the proper character-set designation for that file in a form that iconv accepts for its input-format specifier?
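For context, this is the kind of pipeline the question is aiming at, written as a hedged sketch (the file name is made up). The usual catch, and presumably why file alone does not suffice, is that the names printed by file --brief --mime-encoding do not always match the input-format names that iconv -l lists:
enc=$(file --brief --mime-encoding document.txt)   # e.g. "iso-8859-1"
iconv -f "$enc" -t UTF-8 document.txt > document.utf8.txt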

Related

Is there a way to determine the coverage of a .PO file?

I've got a python program under active development, which uses gettext for translation.
I've got a .POT file with translations, but it is slightly out of date. I've got a script to generate an up-to-date .PO file. Is there a way to check how much of the new .PO file is covered by the .POT file?
I've got a .POT file with translations, but it is slightly out of date. I've got a script to generate an up-to-date .PO file
I think you mean the other way around: POT files are generated from your source code, while PO files contain the translations.
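As a hedged illustration of that direction (the paths and the choice of the Python extractor are only assumptions based on the question), the POT file is typically extracted from the sources with xgettext:
xgettext --language=Python --output=messages.pot src/*.py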
Is there a way to check how much of the new .PO file is covered by the .POT file?
The Gettext command line msgmerge program can be used for syncing your out-of-date PO files with your latest source strings. To create a new PO file from an updated POT you would issue this command:
msgmerge old.po new.pot > updated.po
The new file will contain all the existing translations that are still valid and add any new source strings. Open it in your favourite PO editor and you should see how many strings now remain untranslated.
Update
As pointed out in the comments, you can see how many strings remain untranslated with the --statistics option of the msgfmt program (normally used for compiling to .mo), e.g.
msgfmt --statistics updated.po
Or without bothering with the interim file:
msgmerge old.po new.pot | msgfmt --statistics -
This would produce a synopsis like:
123 translated messages, 77 untranslated messages.

Is there a way to check if a Ruby variable contains binary data?

I'm using Ruby 2.4 and Rails 5. I have file content in a variable named "content". The content could contain data from things like a PDF file, a Word file, or an HTML file. Is there any way to tell if the variable contains binary data? Ultimately, I would like to know whether this is a PDF, a Microsoft Office file, or some other type of OpenOffice file. This answer -- Rails: possible to check if a string is binary? -- suggests that I can check the encoding of the variable
content.encoding
and it would produce
ASCII-8BIT
in the case of binary data. However, I've noticed cases where HTML content stored in the variable also returns "ASCII-8BIT" for content.encoding, so checking "content.encoding" is not a foolproof way to tell whether I have binary data. Does such a way exist, and if so, what is it?
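To make that ambiguity concrete, here is a minimal Ruby sketch (the file names are hypothetical): bytes read without an encoding hint are tagged ASCII-8BIT (Ruby's alias for BINARY) whether they come from a PDF or from an HTML file, so the encoding alone cannot distinguish the two.
# Both strings report ASCII-8BIT, because binread makes no claim about encoding.
pdf  = File.binread("sample.pdf")
html = File.binread("sample.html")
pdf.encoding   #=> #<Encoding:ASCII-8BIT>
html.encoding  #=> #<Encoding:ASCII-8BIT>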
If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably. The gem is a simple wrapper around the libmagic library, which is standard on Unix-like systems. The library works by scanning the content of a file and matching it against a set of known "magic" patterns for various file types.
Sample usage for a string buffer (e.g. data read from the database):
require "ruby-filemagic"
content = File.read("/.../sample.pdf") # just an example to get some data
fm = FileMagic.new
fm.buffer(content)
#=> "PDF document, version 1.4"
For the gem to work (and compile) you need the file utility as well as the magic library with headers installed on your system. Quoting from the readme:
The file(1) library and headers are required:
Debian/Ubuntu:: +libmagic-dev+
Fedora/SuSE:: +file-devel+
Gentoo:: +sys-libs/libmagic+
OS X:: brew install libmagic
Tested to work well under Rails 5.
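For completeness, here is a hedged install sketch for Debian/Ubuntu, using the package name quoted from the readme and the gem name mentioned above:
sudo apt-get install libmagic-dev
gem install ruby-filemagic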
If you're on a Unix machine, you can use the file command:
file titi.pdf
You could then do something like:
require 'open3'

cmd = 'file -'
Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdin.write(content)
  stdin.close
  puts "file type is: " + stdout.read
end
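If you only need the MIME type rather than file's full textual description, a slight variant of the same idea (still a sketch) is to pass --brief and --mime-type and let file read standard input:
printf '%%PDF-1.4\n' | file --brief --mime-type -
# prints something like: application/pdf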

karaf configuration property is garbled

I implement the org.osgi.service.cm.ManagedService interface to get the Karaf configuration. But when I give a Chinese value to a property, it comes back garbled. Initially, the files in the etc folder are encoded in Latin-1. I have tried to set UTF-8 encoding, but it has no effect. Can anyone help me?
In Karaf, the configuration files (i.e. etc/*.cfg) are handled by the Felix subproject "fileinstall".
fileinstall does not yet support specifying a custom character encoding for the configuration; it uses the Properties class and Properties.load(InputStream), whose documentation says:
The load(Reader) / store(Writer, String) methods load and store
properties from and to a character based stream in a simple
line-oriented format specified below. The load(InputStream) /
store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output
stream is encoded in ISO 8859-1 character encoding. Characters that
cannot be directly represented in this encoding can be written using
Unicode escapes as defined in section 3.3 of The Java™ Language
Specification; only a single 'u' character is allowed in an escape
sequence. The native2ascii tool can be used to convert property files
to and from other character encodings.
So, you have to encode your file in ISO-8859-1 and escape every non-Latin-1 character as a Unicode escape, or use an XML file for your configuration.
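Since the quoted documentation mentions native2ascii, one way to produce such a file (a sketch, assuming a JDK that still ships the tool, i.e. Java 8 or earlier, and a made-up configuration file name) is:
native2ascii -encoding UTF-8 etc/com.example.settings.cfg etc/com.example.settings.cfg.escaped
The escaped copy, with every non-Latin-1 character turned into a \uXXXX escape, can then replace the original.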
There is a way to change the encoding of the cfg files.
The configuration of the fileinstall subproject that polls the etc/*.cfg files lives in the config.properties file.
You can add
felix.fileinstall.configEncoding = UTF-8
This solution was verified on Karaf 4.

Default syntax in case of conflicting file extensions in the Atom text editor

In the Atom text editor, when two language packages define syntax and snippets for files with the same file extension, what determines the precedence?
For example, both language-ruby and language-ruby-on-rails are available by default, as they are included in the so-called Core Package set, and the two packages share the .rb file extension.
How can I make sure that, by default, Atom treats .rb files in my projects as, say, source.ruby.rails instead of source.ruby?
In your config.cson file you can specify filename regexes and their associated grammars, like so:
"*":
"file-types":
".rb?$": "source.ruby.rails"

How can I change a huge text file from ANSI to UTF-8?

I have a text file whose size is 1.3 GB. Most text editors (including Notepad++) cannot open it. I need to change its format from ANSI to UTF-8. In what program can I do this?
Try EmEditor. It supports huge files very well, and a free version exists.
If you want a free (and open source) command-line tool that can run on Windows, and which allows you to convert huge files from ANSI to UTF-8 (or any other encodings), you can use this tool that I've just created (runs on nodejs and uses the iconv-lite library):
https://github.com/sorin-postelnicu/convert-file-encoding
You can use it like this:
node bin\convertFileEncoding.js -f latin-1 -t utf-8 -i myinputfile.txt -o myoutputfile.txt
It is fast and supports converting very large files with minimal memory consumption (around 20MB of RAM no matter the size of the input file).
You can also use the shareware text editor UltraEdit.
First, configure UltraEdit for editing large files according to the power tip "large file text editor".
Then open your file in UltraEdit, use File - Save As, and for Encoding (Windows Vista/7/8/8.1) or Format (Windows XP/2000) select UTF-8 - NO BOM or UTF-8 to save the file converted to UTF-8 encoding without or with a byte order mark at the beginning of the file.
