How come Tika parses plain text as application/octet-stream?

Besides other files, I have tons of files for which the UNIX file command says "ASCII text", but Tika insists that they are "application/octet-stream" and does not parse them. I naively use
Tika tika = new Tika();
String text = tika.parseToString(inStream);
Some answers to somewhat related questions point to the AutoDetectParser, but the default setup in version 1.13 uses exactly that.
Is there a way to help Tika along so that it is more courageous in deciding for "text/plain" if there is a lot of ASCII in the file?
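One possible workaround, sketched here in plain Java and not part of Tika itself, is to measure how much of a leading sample of the file is printable ASCII and to treat the file as text when that share is high enough. The class name, the method names and the 0.95 threshold are assumptions chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class AsciiHeuristic {

    // Fraction of bytes in the sample that are printable ASCII or tab/LF/CR.
    static double asciiRatio(byte[] sample) {
        if (sample.length == 0) {
            return 0.0;
        }
        int textLike = 0;
        for (byte b : sample) {
            int v = b & 0xFF;
            boolean printable = v >= 0x20 && v <= 0x7E;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            if (printable || whitespace) {
                textLike++;
            }
        }
        return (double) textLike / sample.length;
    }

    // True when at least `threshold` of the sample looks like ASCII text.
    static boolean looksLikePlainText(byte[] sample, double threshold) {
        return asciiRatio(sample) >= threshold;
    }

    public static void main(String[] args) {
        // Mostly text, with one stray control byte (0x01) in it.
        byte[] text = "plain log line\nwith a stray control byte \u0001\n"
                .getBytes(StandardCharsets.US_ASCII);
        // Mostly non-text bytes.
        byte[] binary = {0x00, 0x01, 0x02, (byte) 0xFF, 'a', 'b'};

        System.out.println(looksLikePlainText(text, 0.95));   // true
        System.out.println(looksLikePlainText(binary, 0.95)); // false
    }
}
```

A file that passes such a check could then be read directly as text instead of going through type detection. Whether that is acceptable depends on how much misclassification the application tolerates.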

Related

Is there a way to check if a Ruby variable contains binary data?

I'm using Ruby 2.4 and Rails 5. I have file content in a variable named "content". The content could contain data from things like a PDF file, a Word file, or an HTML file. Is there any way to tell if the variable contains binary data? Ultimately, I would like to know if this is a PDF, a Microsoft Office file, or some type of OpenOffice file. This answer -- Rails: possible to check if a string is binary? -- suggests that I can check the encoding of the variable
content.encoding
and it would produce
ASCII-8BIT
in the case of binary data. However, I've noticed there are cases where HTML content stored in the variable also returns "ASCII-8BIT" as content.encoding, so checking content.encoding is not a foolproof way to tell whether I have binary data. Does such a way exist and, if so, what is it?
If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably. The gem is a simple wrapper around the libmagic library, which is standard on Unix-like systems. The library works by scanning the content of a file and matching it against a set of known "magic" patterns for various file types.
Sample usage for a string buffer (e.g. data read from the database):
require "ruby-filemagic"
content = File.read("/.../sample.pdf") # just an example to get some data
fm = FileMagic.new
fm.buffer(content)
#=> "PDF document, version 1.4"
For the gem to work (and compile) you need the file utility as well as the magic library with headers installed on your system. Quoting from the readme:
The file(1) library and headers are required:
Debian/Ubuntu:: +libmagic-dev+
Fedora/SuSE:: +file-devel+
Gentoo:: +sys-libs/libmagic+
OS X:: brew install libmagic
Tested to work well under Rails 5.
If you're on a Unix machine, you can use the file command:
file titi.pdf
You could then do something like:
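require 'open3'
cmd = 'file -' # "-" makes file(1) read the data from standard input
# popen2 yields stdin, stdout and the wait thread (popen3 would also yield stderr)
Open3.popen2(cmd) do |stdin, stdout, wait_thr|
  stdin.binmode # content may be binary
  stdin.write(content)
  stdin.close
  puts "file type is: " + stdout.read
end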

Karaf configuration property is garbled

I implement the org.osgi.service.cm.ManagedService interface to get the Karaf configuration. But when I give a property a Chinese value, it is garbled. Initially, the files in the etc folder are encoded in Latin-1. I have tried to set UTF-8 encoding, but it has no effect. Can anyone help me?
In Karaf, the configuration files (i.e. etc/*.cfg) are handled by the Felix subproject "fileinstall".
fileinstall does not yet support specifying a custom character encoding for the configuration. It uses the Properties class and Properties.load(InputStream), whose documentation says:
The load(Reader) / store(Writer, String) methods load and store
properties from and to a character based stream in a simple
line-oriented format specified below. The load(InputStream) /
store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output
stream is encoded in ISO 8859-1 character encoding. Characters that
cannot be directly represented in this encoding can be written using
Unicode escapes as defined in section 3.3 of The Java™ Language
Specification; only a single 'u' character is allowed in an escape
sequence. The native2ascii tool can be used to convert property files
to and from other character encodings.
So, you have to encode your file in ISO-8859-1 and write every character outside that encoding as a Unicode escape (\uXXXX), or use an XML file for your configuration files.
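The behaviour described in the quoted documentation can be reproduced with a minimal sketch that uses only java.util.Properties. The class name, the helper names and the "greeting" key are made up for illustration, and the escaping helper is only a rough stand-in for what native2ascii does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // load(InputStream): the bytes are always decoded as ISO 8859-1.
    static String loadAsLatin1(byte[] fileBytes, String key) {
        try {
            Properties props = new Properties();
            props.load(new ByteArrayInputStream(fileBytes));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // load(Reader): the caller chooses the charset, here UTF-8.
    static String loadAsUtf8(byte[] fileBytes, String key) {
        try {
            Properties props = new Properties();
            props.load(new InputStreamReader(
                    new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replaces every non-ASCII UTF-16 unit with a backslash-u escape sequence.
    static String escapeNonAscii(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters
        byte[] utf8File = ("greeting=" + chinese + "\n")
                .getBytes(StandardCharsets.UTF_8);
        byte[] escapedFile = ("greeting=" + escapeNonAscii(chinese) + "\n")
                .getBytes(StandardCharsets.ISO_8859_1);

        // UTF-8 bytes read as ISO 8859-1 come out garbled.
        System.out.println(chinese.equals(loadAsLatin1(utf8File, "greeting")));    // false
        // Unicode escapes survive the ISO 8859-1 decoding.
        System.out.println(chinese.equals(loadAsLatin1(escapedFile, "greeting"))); // true
        // A Reader with an explicit charset reads the UTF-8 bytes correctly.
        System.out.println(chinese.equals(loadAsUtf8(utf8File, "greeting")));      // true
    }
}
```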
There is a way to change the encoding of the cfg files.
The configuration for the fileinstall subproject, which polls the etc/*.cfg files, lives in the config.properties file.
You can add
felix.fileinstall.configEncoding = UTF-8
This solution was checked in Karaf 4.

"ï»¿—-" added in HTML when converting a Markdown file to HTML using Jekyll

I used Jekyll to convert a Markdown file to HTML. The conversion succeeded, but the following characters were added at the top of the HTML, apparently because the file is encoded in UTF-8:
"ï»¿—-"
After changing the same Markdown file to ANSI encoding in Notepad++ (Encoding option in the menu bar), the characters are no longer included in the generated HTML.
As it stands, we have to manually convert the Markdown file to ANSI before generating HTML with Jekyll.
Is there a solution for this?
ï»¿ is the UTF-8 BOM, so that's probably what you are seeing, assuming you are looking at it using CP1252; and — is something out of the General Punctuation block.
Proper diagnostics are not possible without an indication of which character encoding you are using instead of UTF-8 to view the file, and/or what exact bytes you have in the file, probably as a hex dump. The first few bytes (the BOM) would be EF BB BF. See also the character-encoding tag wiki for troubleshooting tips.
Quick googling indicates that Jekyll is highly allergic to UTF-8 BOM in its input, so it seems unlikely that it generates spurious BOM characters on output. I could speculate that the template file you are using has a BOM and that it is being faithfully included in the output, but I'm not really familiar enough with Jekyll to actually help troubleshoot any further.
Of course, as per the big ugly warnings all over the Jekyll site, I assume you have already made sure that your Markdown input doesn't have a BOM character. Many Windows editors are notorious for putting one in when you save as UTF-8; make sure you use "UTF-8 without a BOM" as the "Save As..." format -- and switch to an editor which offers this option if yours doesn't have it.
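To check a Markdown file for the BOM bytes EF BB BF without relying on an editor, something along these lines could be used. It is a minimal sketch, not anything Jekyll provides; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomCheck {

    // The UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True when the data starts with the three BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns the data without a leading BOM; data without a BOM is returned as-is.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }

    public static void main(String[] args) {
        // A file that starts with a BOM followed by "---".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '-', '-', '-'};

        System.out.println(hasUtf8Bom(withBom));                                  // true
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8)); // ---

        // Decoded as ISO 8859-1 the BOM bytes become U+00EF U+00BB U+00BF,
        // the same three characters CP1252 shows for these bytes.
        String misread = new String(UTF8_BOM, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u00ef\u00bb\u00bf"));                 // true
    }
}
```

In practice the byte array would come from reading the Markdown or template file, for example with java.nio.file.Files.readAllBytes, and the stripped bytes would be written back.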
Try using charset=utf-8,
or
check whether your content has any straight double quotes (" ") or straight single quotes (' ') and remove them:
http://practicaltypography.com/straight-and-curly-quotes.html
This is an encoding issue. Save the Markdown file as UTF-8 without BOM.
This will remove the stray characters from the generated HTML.

libCURL Multipart POST with a large file

I need to upload a large file (>2 GB) using a multipart POST request. The source file name can contain Unicode characters. The problem is that libcurl does not support the Unicode wfopen on Windows, so I am not able to complete this task in the usual way, like
curl_formadd(&formpost, &lastptr,
             CURLFORM_COPYNAME, fieldname,
             CURLFORM_FILENAME, filename,
             CURLFORM_FILE, full_path_to_file,
             CURLFORM_CONTENTTYPE, "application/octet-stream",
             CURLFORM_END);
I figured out that I can use the CURLFORM_STREAM option of curl_formadd in conjunction with CURLOPT_READFUNCTION. Now I need to set the file size manually through the CURLFORM_CONTENTSLENGTH option, but it accepts only a "long" as a parameter, whereas I need to set a "long long" file size. After a look through the curl manual I found the CURLOPT_POSTFIELDSIZE_LARGE option, but it does nothing in my case. It seems that the multipart request system ignores this parameter. I don't know what to do; I don't want to give up Unicode names or large file support.

What's the best way to programmatically output a file in the format of a Word document in Ruby?

I need to output a file in the format of a Word document from a Ruby-based web app (Rails/Sinatra) based on some textual content in the app. Is there library support in Ruby for creating and structuring a Word document?
Take a look at WordML, the XML format for Word files.
John Durant's blog has a useful list of WordML and FAQ resources
Walkthrough: Word 2007 XML Format
Useful tool for creating XSLT transforms: Office 2003 Tool: WordprocessingML Transform Inference Tool
These SO posts might also be of interest:
Creating Word or XML document with VBA
Generating WordML Reports Using Templates and XPath using ASP.Net
Convert XHTML to Word ML
XML to WordML using XSLT 1.0 - replace html tags within xml content with wordML formatting tags
How can I convert docx or wordml xml files to xsl-fo?
You don't specify what "a Word document" means exactly. Is it a Word 2003-style doc file? Is it a Word 2007 docx file? Is it just something Word can open that supports styling?
If the latter is what you want, you could use RTF, which is somewhat easier than the doc format. There is a library called Ruby RTF that should do what you want, though I've honestly never used it myself.
Would it be easier to generate a Word 2003 document: Is there an easier to understand file format for a basic Word 2003 .doc that doesn't require a PhD in XML, etc?
