For any Apache Tika Server instance, how does it handle files sent to it in a PUT request, such as PDF or image files, when extracting text (for example via OCR)?
Are they written to disk at all, or are they read directly from the byte array? If they are written to disk before processing, are they cleaned up afterwards?
We store images in encrypted form at a local path. Once all documents have been captured and the user taps the submit button, we decrypt all the images using RNCryptor (https://github.com/RNCryptor/RNCryptor) and save them as a zip using https://github.com/marmelroy/Zip.
But we need to keep the decrypted data in memory instead of writing it to disk.
How would I zip a file purely in memory, so I could send it without writing anything to the hard drive?
Update
Another alternative is the ZIPFoundation library on GitHub (MIT, by Thomas Zoechling). It appears to be Swift-compatible and is apparently "effortless." BTW - I learned about this library while reading an interesting blog article in which the author (Max Desiatov) walks through how he unzips in memory using the library (see the section "Unzipping an archive in memory and parsing the contents").
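For what it's worth, here is a hedged sketch of what the in-memory path with ZIPFoundation can look like. The exact initializer has changed between releases (it was failable in older versions and throwing in newer ones), so treat this as an outline rather than copy-paste code:

```swift
import Foundation
import ZIPFoundation

// Sketch: read a zip archive held entirely in memory and collect its entries.
// Adjust the Archive initializer to the ZIPFoundation release you depend on;
// older versions expose it as a failable init rather than a throwing one.
func unzipInMemory(_ zipData: Data) throws -> [String: Data] {
    let archive = try Archive(data: zipData, accessMode: .read)
    var contents: [String: Data] = [:]
    for entry in archive {
        var entryData = Data()
        // The consumer closure receives the entry's bytes in successive chunks.
        _ = try archive.extract(entry) { chunk in
            entryData.append(chunk)
        }
        contents[entry.path] = entryData
    }
    return contents
}
```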
Original
Have you taken a close look at Apple's Single-Step Compression article? There is a section on writing the compressed data to a file, but by that point the data has already been compressed in memory. Once you have generated the data you can do with it as you like; a sketch of the in-memory part follows the list of steps below.
Article Steps
Create the Source Data
Create the Destination Buffer
Select a Compression Algorithm
Compress the Data
Write the Encoded Data to a File
Read the Encoded Data from a File
Decompress the Data
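If it helps, the in-memory part of those steps (creating the source data through compressing it) reduces to a single call to compression_encode_buffer. A minimal sketch, with zlib chosen arbitrarily and a guessed destination-buffer size (incompressible input may need a larger one):

```swift
import Compression
import Foundation

// Minimal sketch: compress a Data value entirely in memory (no file is written).
func compress(_ source: Data, using algorithm: compression_algorithm = COMPRESSION_ZLIB) -> Data? {
    // Destination buffer sized to the source plus some slack; this is a guess,
    // and truly incompressible data may still need a larger buffer.
    let capacity = source.count + 64 * 1024
    let destination = UnsafeMutablePointer<UInt8>.allocate(capacity: capacity)
    defer { destination.deallocate() }

    let compressedSize = source.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> Int in
        guard let src = raw.bindMemory(to: UInt8.self).baseAddress else { return 0 }
        return compression_encode_buffer(destination, capacity, src, source.count, nil, algorithm)
    }
    guard compressedSize > 0 else { return nil }   // 0 means the encode failed
    return Data(bytes: destination, count: compressedSize)
}
```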
I'm downloading an Excel file from an Azure Storage Blob and therefore want to use stream_get_contents to get the file. But PhpSpreadsheet seems to only want to read the file off the filesystem.
For now, I'm saving it to a temp folder and reading it back, but that is less than ideal.
Is there a way to get PhpSpreadsheet to load via something other than a local file?
This is not supported. PhpSpreadsheet will always read from disk.
On a side note, since 1.13.0, PhpSpreadsheet is able to write in memory. See https://github.com/PHPOffice/PhpSpreadsheet/pull/1292
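Until reading from memory is supported, the temp-file round trip from the question is the usual workaround. A minimal sketch, assuming $blobStream is the stream you already obtained from Azure (the variable name is illustrative):

```php
<?php
// Sketch: spill the blob stream to a temporary file, load it with PhpSpreadsheet,
// then remove the temporary copy. $blobStream is assumed to exist already.
require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;

$tempPath = tempnam(sys_get_temp_dir(), 'xlsx_');
file_put_contents($tempPath, stream_get_contents($blobStream));

$spreadsheet = IOFactory::load($tempPath);   // PhpSpreadsheet only reads from disk
unlink($tempPath);                           // clean up the temporary copy
```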
I am hosting a small fileserver, where users can upload documents from all around the world.
Due to encoding problems (see other question), I am wondering whether I should disallow users from uploading (and, conversely, downloading) files whose names are not representable in the CP1252 charset.
Or, put differently: is it sensible to allow users to upload documents with Arabic or Chinese characters in their filenames?
PS: users download the same file some time later, and it should have the same filename it was uploaded with.
You should store the files on disk under a randomly generated name, or name them after a hash of the file contents (which also deduplicates storage). Save the original filename as metadata in a database, together with all the other metadata about the file (who uploaded it, and so on). Then serve the file through a PHP script that sets the original filename from the database in an HTTP header (see the sketch after this list). This way you avoid:
file-name sanitisation and duplication issues
file-system encoding issues
storage duplication (if using a hash)
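A rough sketch of that flow; the form field name, the "files" table, its columns, and the $pdo connection are illustrative assumptions, not part of any particular framework:

```php
<?php
// --- on upload ---
$hash = hash_file('sha256', $_FILES['doc']['tmp_name']);          // content-based name
$storedPath = '/srv/uploads/' . $hash;
if (!file_exists($storedPath)) {
    move_uploaded_file($_FILES['doc']['tmp_name'], $storedPath);  // deduplicated storage
}
$stmt = $pdo->prepare('INSERT INTO files (hash, original_name) VALUES (?, ?)');
$stmt->execute([$hash, $_FILES['doc']['name']]);

// --- on download (e.g. download.php?id=123) ---
$stmt = $pdo->prepare('SELECT hash, original_name FROM files WHERE id = ?');
$stmt->execute([$_GET['id']]);
$row = $stmt->fetch();

header('Content-Type: application/octet-stream');
// RFC 5987 encoding keeps non-Latin (Arabic, Chinese, ...) filenames intact.
header("Content-Disposition: attachment; filename*=UTF-8''" . rawurlencode($row['original_name']));
readfile('/srv/uploads/' . $row['hash']);
```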
I am trying to upload a zip file of 350-500 MB to the server. It fails with an "ENOSPC" error.
Is it possible to upload the file in chunks and receive it on the server as one file?
or
can I use a custom location for the temporary files, so that uploads are independent of the system tmp? In my case tmp is only 128 MB.
Why not use a web-server upload feature, such as the nginx upload module? I'm not sure what the equivalent is called in Apache, but I believe Apache has one too.
If you are using nginx, there is also an nginx-upload-progress module, which can be helpful if you want to track the progress of the upload.
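Whichever module you go with, two stock nginx directives already address the "tmp is only 128 MB" part of the question. A minimal sketch; /data/nginx_tmp is a placeholder for any partition with enough free space:

```nginx
# Sketch: buffer large request bodies on a partition with room to spare.
http {
    client_max_body_size   500m;              # allow 350-500 MB uploads
    client_body_temp_path  /data/nginx_tmp;   # keep temp copies off the small system tmp
}
```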
Hope this helps.
In many Windows setups, when you print directly to a printer, two files are typically created in the Windows spool directory "C:\Windows\System32\spool\PRINTERS": a spool file such as "80021.SPL" and a shadow file such as "80021.SHD". The spool file contains the meat and potatoes of the drawing instructions so the printer can print the page. The data in this spool file comes in a smorgasbord of different formats depending on the language technology and the print driver used.
However, when you print to a printer that's on a print server, a single ".TMP" file is created instead and transmitted to the print server. I think it's fair to assume that this is just the .SHD and .SPL files combined into a single transport file to get the job to the server. However, it's unreadable; I'm not sure if it's zipped, encrypted, or something else, but I can't decipher it.
When printing PDFs you can typically see plain-text PostScript instructions in the spool file (.SPL) just by opening it in a text editor. You can even send that spool file (.SPL) to a PostScript viewer like GhostScript and have it draw the pages on screen. But when the job is all packaged up in a .TMP file, it's basically just a binary pile of bits. Does anyone know how to unpack the data from these transport .TMP spool files?
I believe the file you have is an EMF file padded with a proprietary Microsoft structure at the beginning. The easiest way to find out whether you are dealing with an EMF structure is to look for the ANSI characters ' EMF' in the .TMP file.
Assuming you do find these characters, it is just a matter of removing the proprietary structure from the beginning of the file and then treating the rest as a standard EMF file. Fortunately, all EMF files share a standard header format, so it should be reasonably easy to determine where the EMF data starts.
There is a good description of EMF file headers here
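A quick way to test that theory is to scan the .TMP file for the signature and strip everything before the EMF header. The sketch below relies only on the documented EMF header layout, where the ' EMF' signature (dSignature = 0x464D4520) sits at offset 40 of the ENHMETAHEADER record; the file names are placeholders, and multi-page jobs may carry extra spool records after the first EMF stream:

```python
# Sketch: locate an embedded EMF stream inside a spool .TMP file and write it out.
import sys

def extract_emf(tmp_path: str, out_path: str) -> bool:
    data = open(tmp_path, "rb").read()
    sig = data.find(b" EMF")          # dSignature bytes in little-endian order
    if sig < 40:                      # not found, or too early to be a real header
        return False
    start = sig - 40                  # the EMF header begins 40 bytes before the signature
    with open(out_path, "wb") as out:
        out.write(data[start:])       # drop the proprietary prefix, keep the rest
    return True

if __name__ == "__main__":
    print(extract_emf(sys.argv[1], sys.argv[2]))
```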