Apache Tika detecting docx mimetype wrong

I'm using Apache Tika to check whether a file's extension matches its actual mimetype. For example, if a file is named .pdf but is actually an .exe, the check returns false.
I send in a .docx, whose expected mimetype is
application/vnd.openxmlformats-officedocument.wordprocessingml.document
Tika runs .detect on the content and reports
application/x-tika-ooxml
Any idea what could be wrong? The method returns false in this case because the two values don't match.
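
A possible explanation, offered as an assumption rather than a confirmed diagnosis: a .docx is a ZIP container, and application/x-tika-ooxml reads like the generic type for any Office Open XML package, reported before anything inside the container has been inspected. The Python sketch below is not Tika's API. It only illustrates how the specific type can be recovered from the [Content_Types].xml entry that OOXML packages carry. The function name ooxml_mimetype is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml inside OOXML packages.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
# The main document part is declared as "<package mimetype>.main+xml".
MAIN_SUFFIX = ".main+xml"

def ooxml_mimetype(data: bytes):
    """Return the specific OOXML mimetype, or None if data is not an OOXML package."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            content_types = package.read("[Content_Types].xml")
    except (zipfile.BadZipFile, KeyError):
        return None  # not a ZIP, or a ZIP without the OOXML manifest
    root = ET.fromstring(content_types)
    for override in root.iter(CT_NS + "Override"):
        content_type = override.get("ContentType", "")
        if content_type.endswith(MAIN_SUFFIX):
            return content_type[: -len(MAIN_SUFFIX)]
    return None
```

Under that assumption, an extension check could either treat application/x-tika-ooxml as compatible with .docx, .xlsx, and .pptx, or refine the generic result along the lines above before comparing.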

Related

How to extract inline images from PDF using Apache Tika Server and save them as files?

Is there a way to do this? I'm using the following headers in my PUT request to http://localhost:9998/tika
"Content-Type", "application/pdf"
"X-Tika-OCRLanguage", "eng"
"X-Tika-PDFextractInlineImages", "true"
"X-Tika-PDFOcrStrategy", "no_ocr"
Will the response contain the images? And if so, how do I save them?
Using Apache Tika server 1.26

The response will be a string, not the image(s).
The PDFOcrStrategy flag tells Tika whether to run OCR (Tesseract) or to extract only the text already present in the document without OCR, which is useful for native PDFs.
The PDFextractInlineImages flag tells Tika whether to include or ignore the embedded images.
So for a scanned PDF you should use
"X-Tika-PDFextractInlineImages", "true"
"X-Tika-PDFOcrStrategy", "ocr_only"
and for native PDFs
"X-Tika-PDFextractInlineImages", "false"
"X-Tika-PDFOcrStrategy", "no_ocr"
In both scenarios, though, Tika returns text.
If you want to pull the images out of a PDF document, IMO you should use PDFBox or a similar library. Tika's goal is to return text from the input.
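
The two header combinations above can be captured in a small helper. This is a minimal Python sketch using only the standard library. The function names are hypothetical, the URL is the one from the question, and decoding the response as UTF-8 is an assumption about the server's output.

```python
import urllib.request

def tika_pdf_headers(scanned: bool) -> dict:
    """Build the request headers suggested above for scanned vs. native PDFs."""
    headers = {"Content-Type": "application/pdf"}
    if scanned:
        headers["X-Tika-PDFextractInlineImages"] = "true"
        headers["X-Tika-PDFOcrStrategy"] = "ocr_only"
    else:
        headers["X-Tika-PDFextractInlineImages"] = "false"
        headers["X-Tika-PDFOcrStrategy"] = "no_ocr"
    return headers

def extract_text(pdf_bytes: bytes, scanned: bool,
                 url: str = "http://localhost:9998/tika") -> str:
    """PUT the PDF to a running Tika Server and return the text it sends back."""
    request = urllib.request.Request(
        url, data=pdf_bytes, headers=tika_pdf_headers(scanned), method="PUT")
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers with UTF-8 text.
        return response.read().decode("utf-8")
```

Either way the return value is text, which matches the point above that the images themselves are not part of the response.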

Handling Binary (excel) file in Multi-data Post data in Suave.IO

I am trying to build a simple Suave.IO application to centralize the sending of emails. Currently the application has one endpoint that takes subject, body, recipients, attachments, and sender as form data and turns them into an EWS email message from a logging email account.
Everything works as intended in most cases, but when one of the attachments is an Excel file, that file ends up corrupted.
Currently, I am filtering the request.multipartFields down to only the ones that are marked as attachment files, and then doing this:
for (fileField: (string*string)) in fileFields do
    let fname = fst fileField
    let fpath = "uploadedFiles\\" + fname
    File.WriteAllBytes(fpath, Encoding.ASCII.GetBytes (snd fileField)) |> ignore
The file path and the attachment names are then fed into the EWS message before sending.
Again, this seems to work for all attachments except binary ones. It seems that Suave.IO automatically exposes all multipartFields as (string*string), so binary data may require special handling.
How should I handle upload of binary files?
Thanks all in advance.

It looks like the issue was one of encoding. I was testing with Python's requests library, and by default the files are encoded as multipart/form-data. By specifying a content type for each file, I was able to help the server identify the incoming data as a file.
Instead of
requests.post(url, data=data, files={filename: open(filepath, 'rb')})
I needed
requests.post(url, data=data, files={filename: (filename, open(filepath, 'rb'), mimetypes.guess_type(filepath)[0])})
With the second Python script, the files do end up in the files section of the request, and I was able to save the Excel file without corruption.
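
A slightly fuller sketch of the fix above, in case several attachments are sent at once. The helper name build_files_payload and the application/octet-stream fallback are assumptions for illustration. The tuple shape (name, content, content type) is the one the requests library accepts for its files argument.

```python
import mimetypes
import os

def build_files_payload(paths):
    """Map each file name to a (name, content, content_type) tuple."""
    payload = {}
    for path in paths:
        name = os.path.basename(path)
        # guess_type returns (type, encoding); type is None for unknown extensions.
        content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
        with open(path, "rb") as handle:  # binary mode keeps the bytes intact
            payload[name] = (name, handle.read(), content_type)
    return payload
```

The result could then be passed as requests.post(url, data=data, files=build_files_payload(paths)), with url, data, and paths supplied by the caller.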

DXGI_FORMAT in dds file

I am parsing a DDS file to read its header data. I want to modify the image's format, but the header described at this site does not seem to specify where the DXGI_FORMAT (internal format) is stored. Where can I find the internal format in the file?
For example, DXGI_FORMAT_BC1_UNORM has the value 71, but I did not find it in the header.
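
For reference, in the DDS layout documented by Microsoft, DXGI_FORMAT is not part of the 124-byte DDS_HEADER. It lives in the optional DDS_HEADER_DXT10 extension, which follows the main header only when the pixel format's dwFourCC is 'DX10'. Legacy files instead encode the format in the FourCC itself (for example 'DXT1' for BC1), which would explain not finding 71 in the header. A minimal Python sketch follows, with read_dxgi_format as a hypothetical helper name and with the simplification that it does not check the DDPF_FOURCC flag.

```python
import struct

DDS_MAGIC = b"DDS "
# 4 (magic) + 72 (DDS_HEADER fields before ddspf) + 8 (ddspf.dwSize, ddspf.dwFlags)
FOURCC_OFFSET = 84
# 4 (magic) + 124 (DDS_HEADER); DDS_HEADER_DXT10 starts here when present
DX10_HEADER_OFFSET = 128
DX10_HEADER_SIZE = 20

def read_dxgi_format(data: bytes):
    """Return (fourcc, dxgi_format); dxgi_format is None without a DX10 header."""
    if len(data) < DX10_HEADER_OFFSET or data[:4] != DDS_MAGIC:
        raise ValueError("not a DDS file")
    fourcc = data[FOURCC_OFFSET:FOURCC_OFFSET + 4]
    if fourcc != b"DX10":
        return fourcc, None  # legacy file: the FourCC itself names the format
    if len(data) < DX10_HEADER_OFFSET + DX10_HEADER_SIZE:
        raise ValueError("truncated DDS_HEADER_DXT10")
    # dxgiFormat is the first little-endian uint32 of DDS_HEADER_DXT10.
    (dxgi_format,) = struct.unpack_from("<I", data, DX10_HEADER_OFFSET)
    return fourcc, dxgi_format
```

Changing the format of a legacy file would therefore mean rewriting the FourCC, or switching it to 'DX10' and inserting the 20-byte extension header, rather than patching a single integer in place.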

Spring API for Uploading Files (MultiPart) interprets contentType of images as application/octet-stream

Uploading files (images, etc.) with the Spring API (MultipartFile) works fine on localhost.
However, after deployment on a Linux server, the console shows that the Spring API interprets the contentType of the uploaded file as application/octet-stream:
at java.io.FileOutputStream.<init>(FileOutputStream.java:209)
at java.io.FileOutputStream.<init>(FileOutputStream.java:160)
at org.apache.commons.fileupload.disk.DiskFileItem.write(DiskFileItem.java:449)
at com.myproject.utils.upload.FileUploadUtil.uploadFile(FileUploadUtil.java:64)
at com.myproject.utils.GenericFileUploadService$_upload_closure1.doCall(GenericFileUploadService.groovy:56)
at com.myproject.utils.GenericFileUploadService.upload(GenericFileUploadService.groovy:53)
at com.myproject.utils.GenericFileUploadService.upload(GenericFileUploadService.groovy:63)
... 7 more
org.springframework.web.multipart.commons.CommonsMultipartFile@1723bb6
content.AssetService File instance : org.springframework.web.multipart.commons.CommonsMultipartFile@1723bb6
println contentType =application/octet-stream
Hence, when I use ImagikImage to convert the uploaded file to a thumbnail, I get the following error:
org.im4java.core.CommandException: org.im4java.core.CommandException: convert: unable to open image
/var/lib/tomcat7/myproject/ROOT/media/5/34: @ error/blob.c/OpenBlob/2587.
The image should normally be saved at the following path:
/var/lib/tomcat7/myproject/ROOT/media/5/34.png
I found this configuration setting, but I don't know whether it helps:
grails.web.disable.multipart=true

Your file upload form should have the attribute enctype='multipart/form-data'. If it doesn't, the file contents can be treated as if they were Unicode characters, and your image files will get corrupted.

Character Encoding issue in Rails v3/Ruby 1.9.2

I get this error sometimes "invalid byte sequence in UTF-8" when I read contents from a file. Note - this only happens when there are some special characters in the string. I have tried opening the file without "r:UTF-8", but still get the same error.
open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error
Contents of the file:
# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works
This is a CSV file I received from an external source, and I am trying to import it into my DB. It did not come with "# encoding: UTF-8" at the top; I added that line because I read somewhere that it would fix this problem, but it did not. :(
Environment:
Rails v3.0.3
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]

Ruby has a notion of an external encoding and an internal encoding for each file. This lets you work with a file's contents as UTF-8 in your program even when the file is stored in a more esoteric encoding. If your default external encoding is UTF-8 (which it is if you're on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. When you open your file and pass "r:UTF-8", you are forcing the same external encoding that Ruby already uses by default.
Chances are your source document isn't in UTF-8 and those non-ASCII characters aren't mapping cleanly to UTF-8 (if they mapped correctly, you would get the correct characters and no error; if they mapped incorrectly, you would get incorrect characters and no error). What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:
File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.strip(",") }
If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
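
The decode-then-transcode idea can also be sketched in Python with only the standard library. The raw bytes of the file are not shown in the question, so the sample line and the ISO-8859-1 candidate below are assumptions for illustration, and decode_with_fallback is a hypothetical helper. Note that ISO-8859-1 accepts every byte, so it only makes sense as the last candidate.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # invalid byte sequence for this candidate, try the next
    raise ValueError("none of the candidate encodings matched")

# Hypothetical raw line: 0xED is "í" in ISO-8859-1 but starts an
# incomplete multi-byte sequence in UTF-8.
raw_line = b'290956,"CZ","45","Horn\xed Bradlo"'
text, encoding = decode_with_fallback(raw_line)
utf8_bytes = text.encode("utf-8")  # transcoded copy that is valid UTF-8
```

Trying a fixed list is cruder than real charset detection, but it shows why forcing UTF-8 on non-UTF-8 bytes raises an error while naming the true source encoding does not.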

If you want to change your file's encoding, you can use the charlock_holmes gem:
https://github.com/brianmario/charlock_holmes
require 'charlock_holmes/string'
content = File.read('test2.txt')
if !content.is_utf8?
  detection = CharlockHolmes::EncodingDetector.detect(content)
  utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end
Then you can save the new content to a temp file and overwrite your original file.
Hope this helps.
