Looking at corrupted files on my FTP server, I am thinking about verifying files uploaded with TIdFtp.Put by downloading them immediately after upload and comparing them byte by byte.
I am concerned that TIdFtp might, in theory, cache data and return it from that cache instead of actually downloading it.
Please allay or confirm my concerns.
No, there is no caching, as there is no such thing in the FTP protocol in general. TIdFTP deals only with live data.
Are you, perhaps, uploading binary files in ASCII mode? If so, that would alter line break characters (CR and LF) during transmission. That is a common mistake to make, since ASCII is FTP's default mode. Make sure you are setting the TIdFTP.TransferType property as needed before transferring a file. ASCII mode should only be used for text files, if used at all.
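To illustrate why ASCII mode corrupts binary files, here is a small Python simulation (no real FTP connection involved; the translation function just models one common server-side line-ending rewrite):

```python
# Simulate what an ASCII-mode FTP transfer can do to binary data:
# line endings are rewritten in transit, so any 0x0D/0x0A bytes that
# happen to occur in a binary file get altered.
def ascii_mode_translate(data: bytes) -> bytes:
    """Model a server normalizing CRLF to LF (one common translation)."""
    return data.replace(b"\r\n", b"\n")

binary_payload = bytes([0x89, 0x50, 0x0D, 0x0A, 0x1A, 0x0A])  # PNG-style header
received = ascii_mode_translate(binary_payload)

print(len(binary_payload), len(received))  # the file shrinks: 6 -> 5
print(binary_payload == received)          # False: the upload is corrupted
```

Any byte sequence that happens to look like a line break is fair game in ASCII mode, which is exactly why binary files must be sent with a binary TransferType.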
And FWIW, you may not need to download a file to verify its bytes. If the server supports any X<Hash> commands (where Hash can be SHA512, SHA256, SHA1, MD5, or CRC), TIdFTP has VerifyFile() methods to use them. These calculate a hash of the local file and compare it to a hash calculated by the server for the remote file. No transfer of file data is needed.
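The same idea, sketched in Python rather than Delphi: hash the local file incrementally and compare against whatever the server reports. The X<Hash> commands are nonstandard extensions, so the command name and reply parsing shown in the comment are assumptions that vary by server.

```python
import hashlib

def local_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash a local file incrementally, as VerifyFile() does on the client side."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# With ftplib you could then ask the server for its hash (XSHA256 is a
# nonstandard extension; the command name and reply format are assumptions):
#   reply = ftp.sendcmd("XSHA256 remote.bin")
#   ok = reply.split()[-1].lower() == local_sha256("local.bin")
```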
The common method of storing images in a database is to convert the image to base64 before storing it. This process increases the size by 33%. Alternatively, it is possible to store the image directly as a BLOB; for example:
$image = new Imagick("image.jpg");
$data = $image->getImageBlob();
$data = $mysqli->real_escape_string($data);
$mysqli->query("INSERT INTO images (data) VALUES ('$data')");
and then display the image with
<img src="data:image/jpeg;base64,<?php echo base64_encode($data); ?>" />
With the latter method, we save 1/3 storage space. Why is it more common to store images as base64 in MySQL databases?
UPDATE: There are many debates about the advantages and disadvantages of storing images in databases, and most people believe it is not a practical approach. Anyway, here I assume we are storing the image in the database, and I am discussing the best method of doing so.
I contend that images (files) are NOT usually stored in a database base64 encoded. Instead, they are stored in their raw binary form in a binary column, blob column, or file.
Base64 is only used as a transport mechanism, not for storage. For example, you can embed a base64 encoded image into an XML document or an email message.
Base64 is also stream friendly. You can encode and decode on the fly (without knowing the total size of the data).
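A quick Python demonstration of that stream-friendliness: as long as each chunk's size is a multiple of 3 bytes, encoding chunk by chunk and concatenating gives exactly the same result as encoding the whole buffer at once, so no total size is needed up front.

```python
import base64

data = bytes(range(256)) * 3  # arbitrary binary payload

# Encode in 3-byte-aligned chunks (48 bytes here) without knowing the total
# size; concatenating the pieces equals encoding the whole buffer at once.
chunks = [base64.b64encode(data[i:i + 48]) for i in range(0, len(data), 48)]
streamed = b"".join(chunks)

print(streamed == base64.b64encode(data))  # True
```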
While base64 is fine for transport, do not store your images base64 encoded.
Base64 provides no checksum or anything of any value for storage.
Base64 encoding increases the storage requirement by 33% over the raw binary format. It also increases the amount of data that must be read from persistent storage, which is still generally the largest bottleneck in computing. It's generally faster to read fewer bytes and encode them on the fly. Only if your system is CPU bound rather than IO bound, and you regularly output the image in base64, should you consider storing it in base64.
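The 33% figure is easy to check: base64 maps every 3 input bytes to 4 output characters, so a 300 kB blob becomes 400 kB.

```python
import base64
import os

raw = os.urandom(300_000)       # simulate a ~300 kB image
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))   # 300000 400000
print(len(encoded) / len(raw))  # 1.333...: the 33% overhead
```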
Inline images (base64 encoded images embedded in HTML) are a bottleneck themselves--you're sending 33% more data over the wire, and doing it serially (the web browser has to wait on the inline images before it can finish downloading the page HTML).
On MySQL, and perhaps similar databases, for performance reasons, you might wish to store very small images in binary format in BINARY or VARBINARY columns so that they are on the same page as the primary key, as opposed to BLOB columns, which are always stored on a separate page and sometimes force the use of temporary tables.
If you still wish to store images base64 encoded, please, whatever you do, make sure you don't store base64 encoded data in a UTF8 column then index it.
Pro base64: the encoded representation you handle is a pretty safe string. It contains neither control chars nor quotes. The latter point helps against SQL injection attempts. I wouldn't expect any problem to just add the value to a "hand coded" SQL query string.
Pro BLOB: the database software knows what type of data to expect and can optimize for it. If you stored base64 in a TEXT field, the database might try to build an index or other data structures for it, which would be really nice and useful for "real" text data but pointless and a waste of time and space for image data. And BLOB is the smaller representation in terms of bytes.
I just want to give one example of why we decided to store images in the DB rather than as files or on a CDN: storing images of signatures.
We tried a CDN, cloud storage, and files, and finally decided to store them in the DB. We are happy with that decision, as it proved us right later on, when we moved, upgraded our scripts, and migrated the sites several times.
In our case, we wanted the signatures to stay with the records belonging to the author of the documents.
Storing them as files risks them going missing or being deleted by accident.
We stored them first in a binary BLOB format in MySQL, and later as base64-encoded images in a text field. The decision to change to base64 was driven by the smaller resulting size and faster loading; for some reason, BLOBs were slowing down the page load.
In our case, the decision to store signature images in the DB (whether as BLOBs or base64) was driven by:
Most signature images are very small.
We don't need to index the signature images stored in DB.
Index is done on the primary key.
We may have to move or switch servers; moving physical image files to different servers could leave images not found because the links change.
It is embarrassing to ask authors to re-sign their signatures.
It is more secure to save them in the DB than to expose them as files that can be downloaded if security is compromised. Storing them in the DB gives us better control over access.
For any future migration, change of web design, hosting, or servers, we have zero worries about reconciling signature file names against the physical files; it is all in the DB!
I recommend looking at modern NoSQL databases, and I also agree with user1252434's post. For instance, I am storing a few <500 kB PNGs as base64 in my MongoDB with binary set to true, with no performance hit at all. MongoDB can also be used to store large files, like 10 MB videos, which can offer huge time savings in metadata searches for those videos; see storing large objects and files in MongoDB.
I am implementing pdf upload using Carrierwave with Rails 4. I was asked by the client about malicious content, e.g. if someone attempts to upload a malicious file masked as a pdf. I will be restricting filetype on the frontend to 'application/pdf'. Is there anything else I need to worry about, assuming the uploaded file has a .pdf extension?
File uploads are often a security issue, since there are so many ways to get them wrong. Regarding just the issue of masking a malicious file as a PDF, checking the content type (application/pdf) is good, but not enough, since it's controlled by the client and can be modified.
Filtering on the .pdf extension is definitely advisable, but make sure you don't accept files like virus.pdf.exe.
Other filename attack techniques exist, e.g. involving null or control characters.
Consider using a file type detector to determine that the file is really a PDF document.
But that's just for restricting the file type. There are many other issues you need to be aware of when accepting file uploads.
PDF files can contain malicious code and are a common attack vector.
Make sure uploaded files are written to an appropriate directory on the server. If they aren't meant to be publicly accessible, choose a directory outside of the web root.
Restrict the maximum upload file size.
This is not a complete list by any means. Check out the Unrestricted File Upload vulnerability by OWASP for more info.
In addition to @StefanOS's great answer, PDF files are required to start with the string:
%PDF-[VERSION]
Generally, or at least often, the first few bytes (or more) indicate the file type, especially for executables (e.g., Windows executables, called PE files, should start, if memory serves, with "MZ").
For uploaded PDF files, opening the uploaded file and reading the first 5 bytes should always yield %PDF-.
This might be a good-enough verification for most use cases.
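A minimal server-side sketch of that check in Python (the function name is illustrative). Note that this only validates the header; it does not prove the file is safe, since genuine PDFs can still carry malicious content.

```python
def looks_like_pdf(path: str) -> bool:
    """Cheap server-side check: real PDFs start with the bytes %PDF-.
    This only verifies the magic header; it does NOT prove the file is safe."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Usage after the upload has been written to a quarantine directory:
# if not looks_like_pdf("/uploads/tmp/abc123"):
#     reject_upload()
```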
The ChicagoBoss controller API has this:
{stream, Generator::function(), Acc0}
Stream a response to the client using HTTP chunked encoding. For each
chunk, the Generator function is passed an accumulator (initially Acc0)
and should return either {output, Data, Acc1} or done.
I am wondering: what is the use case for this? There are other return types, like json and output. When would stream be useful?
Can someone present a real-world use case?
Serving large files for download might be the most straightforward use case.
You could argue that there are also other ways to serve files so that users can download them, but these might have other disadvantages:
By streaming the file, you don't have to read the entire file into memory before starting to send the response to the client. For small files, you could just read the content of the file, and return it as {output, BinaryContent, CustomHeader}. But that might become tricky if you want to serve large files like disk images.
People often suggest to serve downloadable files as static files (e.g. here). However, these downloads bypass all controllers, which might be an issue if you want things like download counters or access restrictions. Caching might be an issue, too.
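The original API is Erlang, but the generator-with-accumulator pattern translates directly to other languages. Here is a hedged Python sketch of the same idea: yield a large file chunk by chunk so it is never fully held in memory, with each yield corresponding to one {output, Data, Acc1} step and the end of file corresponding to done.

```python
def stream_file(path: str, chunk_size: int = 8192):
    """Yield a large file chunk by chunk, like the {output, Data, Acc1}
    generator loop, so the whole file is never held in memory at once."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:   # the equivalent of returning `done`
                return
            yield chunk     # the equivalent of {output, Data, Acc1}

# A web framework would send each yielded chunk as one HTTP chunk,
# so download counters or access checks can still run in the controller:
# for chunk in stream_file("disk.iso"):
#     send_chunk(chunk)   # hypothetical framework call
```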
When I download, say an ISO image, using a torrent; should I still verify the file's integrity (by calculating sha256 hash, for example), or is this done automatically while downloading?
The BitTorrent protocol has a mechanism for automatically verifying each chunk's integrity after download. Of course, this should only reassure you if you trust the source of the file.
If you have a checksum for the whole file (eg. for some software package), you can definitely verify the file yourself afterwards.
Torrent files have an "announce" section, which specifies the URL of the tracker, and an "info" section, containing (suggested) names for the files, their lengths, the piece length used, and a SHA-1 hash code for each piece, all of which are used by clients to verify the integrity of the data they receive.
https://en.wikipedia.org/wiki/Bittorrent
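The per-piece mechanism described above can be sketched in a few lines of Python: split the data into fixed-size pieces, hash each with SHA-1, and compare against the list a .torrent's "info" section would carry (the function names here are illustrative, not from any BitTorrent library).

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int):
    """SHA-1 of each fixed-size piece, as listed in a .torrent's info section."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]

def verify(data: bytes, piece_length: int, expected_hashes) -> bool:
    """A client accepts downloaded data only if every piece hash matches."""
    return piece_hashes(data, piece_length) == list(expected_hashes)
```

Because each piece is checked on arrival, a corrupted piece is simply re-requested; a whole-file checksum from the publisher still guards against a malicious .torrent in the first place.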
We take text/CSV-like data over long periods (~days) from costly experiments, so file corruption is to be avoided at all costs.
Recently, a file was copied from Explorer on XP while the experiment was in progress, and the data was partially lost, presumably due to a multiple-access conflict.
What are some good techniques to avoid such loss? - We are using Delphi on Windows XP systems.
Some ideas we came up with are listed below - we'd welcome comments as well as your own input.
Use a database as a secondary data storage mechanism and take advantage of the atomic transaction mechanisms
How about splitting the large file into separate files, one for each day.
If these machines are on a network: send a HTTP post with the logging data to a webserver.
(sending UDP packets would be even simpler).
Make sure you only copy old data. If you have a timestamp on the filename with a 1 hour resolution, you can safely copy the data older than 1 hour.
If a write fails, cache the result for a later write; that way, if the file is opened externally, the data is still held internally, or could even be stored to a separate disk.
I think what you're looking for is the Win32 CreateFile API, with these flags:
FILE_FLAG_WRITE_THROUGH : Write operations will not go through any intermediate cache; they will go directly to disk.
FILE_FLAG_NO_BUFFERING : The file or device is being opened with no system caching for data reads and writes. This flag does not affect hard disk caching or memory mapped files.
There are strict requirements for successfully working with files opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for details see File Buffering.
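The answer above is Win32/Delphi; the portable equivalent of write-through in Python is to flush the userspace buffer and then fsync so the OS pushes its cache to the device before the call returns. A hedged sketch (the function name is illustrative):

```python
import os

def append_durably(path: str, line: str) -> None:
    """Append one record and force it to disk: flush Python's buffer, then
    os.fsync() asks the OS to write its cache through to the device,
    roughly analogous to FILE_FLAG_WRITE_THROUGH."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())
```

This trades throughput for durability, which is usually the right trade for slow, irreplaceable experimental data.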
Each experiment should use a 'work' file and a 'done' file. The work file is opened exclusively, and the done file is copied to a place on the network. An application on the receiving machine would feed those files into a database. If Explorer tries to move or copy the work file, it will receive an 'Access denied' error.
The 'work' file becomes 'done' after a certain period (say, 6/12/24 hours or whatever period suits). The experiment then creates another work file (the name must contain a timestamp) and sends the 'done' file over the network (or a human can do that, which, if I understand your text correctly, is what you are doing already).
Copying a file while it is in use is asking for corruption.
Write data to a buffer file in an obscure directory and copy the data to the 'public' data file periodically (every 10 points for instance), thereby reducing writes and also providing a backup
Write data points discretely, i.e. open and close the file handle for every data point written; this reduces the amount of time the file is being accessed, provided the rate of data points is low.
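The buffer-file idea above can be made safe against exactly the Explorer-copy failure described in the question by publishing the public copy atomically: write a temporary file in the destination directory, then rename it into place, so a reader never observes a half-copied file. A Python sketch under that assumption (the function name is illustrative):

```python
import os
import shutil
import tempfile

def publish_snapshot(buffer_path: str, public_path: str) -> None:
    """Copy the private buffer file to the public location atomically:
    write a temp file in the same directory, then os.replace() it in,
    so a reader never sees a partially written public file."""
    directory = os.path.dirname(os.path.abspath(public_path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as out, open(buffer_path, "rb") as src:
            shutil.copyfileobj(src, out)
        os.replace(tmp, public_path)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp)                # clean up the half-written temp file
        raise
```

os.replace() is atomic only within one filesystem, so the temp file is deliberately created next to the destination rather than in the system temp directory.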