Rails: How to stream data to and from a binary column (blob) - ruby-on-rails

I have a question about how to efficiently store and retrieve large amounts of data to and from a blob column (data_type :binary). Most examples and code out there show simple assignments, but that cannot be efficient for large amounts of data. For instance, storing data from a file might look something like this:
# assume a model MyFileStore has a column blob_content :binary
my_db_rec = MyFileStore.new
File.open("#{Rails.root}/test/fixtures/alargefile.txt", "rb") do |f|
  my_db_rec.blob_content = f.read
end
my_db_rec.save
Clearly this would read the entire file content into memory before saving it to the database. This cannot be the only way to save blobs. In Java and in .NET, for instance, there are ways to stream to and from a blob column so you are not pulling everything into memory. Is there something similar in Rails? Or are we limited to storing only small chunks of data in blobs when it comes to Rails applications?

If this is Rails 4 you can use streaming with render stream: true. Here's an example: Rails 4 Streaming
I would ask, though, what database you're using, and whether it might be better to store the files on a filesystem or object store (Amazon S3, Google Cloud Storage, etc.), as this can greatly affect your ability to manage blobs. Microsoft, for example, has this recommendation: To Blob or Not to Blob
Uploading is generally done through forms, all at once or multi-part. Multi-part chunks the data so you can upload larger files with more confidence. The chunks are reassembled and stored in whatever database field (and type) you have defined in your model.
Downloads can be streamed. There is a strong tendency to hand off uploads and streaming to third-party cloud storage systems like Amazon S3. This drastically reduces the burden on Rails. You can also hand off upload duties to your web server. All modern web servers have a way to stream file uploads from a user. Doing this avoids memory issues, as only the currently uploading chunk is in memory at any given time. The web server should also be able to notify your app once the upload is completed.
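One common way to do the S3 hand-off is a presigned upload URL, so the browser PUTs the file straight to S3 and the Rails process never buffers it. A minimal sketch, assuming the aws-sdk-s3 gem (the bucket name and key layout below are made up):
require "aws-sdk-s3"
require "securerandom"

# Generate a short-lived URL the client can upload to directly.
s3  = Aws::S3::Resource.new(region: "us-east-1")
obj = s3.bucket("my-uploads-bucket").object("uploads/#{SecureRandom.uuid}")
url = obj.presigned_url(:put, expires_in: 15 * 60)
# Hand `url` to the client; it uploads directly, then notifies your app when done.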
For general streaming of output:
To stream a template you pass the :stream option from within your controller, like this: render stream: true. Because the order in which templates and layouts are rendered changes with streaming, pay attention to attributes like the page title: they need to be set with content_for, not yield. If you need to open, write to, and close the stream explicitly (finishing with response.stream.close), use the Live API, ActionController::Live. For this you need the puma gem. Also be aware that you need a web server that supports streaming; Unicorn can be configured to support it.
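For streaming a blob out of the database itself (the original question), one approach is to combine ActionController::Live with chunked reads so the whole value never sits in Ruby memory at once. This is only a rough sketch, assuming the MyFileStore model from the question, a MySQL-style SUBSTRING function, and a threaded server such as Puma:
class FileStreamsController < ApplicationController
  include ActionController::Live

  CHUNK_BYTES = 1.megabyte

  def show
    response.headers["Content-Type"] = "application/octet-stream"
    id     = params[:id].to_i
    offset = 1                                  # SQL SUBSTRING is 1-indexed

    loop do
      # Pull one chunk of the blob per query instead of the whole column
      chunk = MyFileStore.connection.select_value(
        "SELECT SUBSTRING(blob_content, #{offset}, #{CHUNK_BYTES}) " \
        "FROM my_file_stores WHERE id = #{id}"
      )
      break if chunk.nil? || chunk.empty?

      response.stream.write(chunk)
      offset += CHUNK_BYTES
    end
  ensure
    response.stream.close                       # always close a Live stream
  end
end
Writing could be chunked the same way on databases that support appending to a blob (for example CONCAT in an UPDATE), though ActiveRecord itself has no built-in API for it.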

Related

Images loaded using send_data are half broken when loading multiple images in iOS

For reasons I don't want to get into here, I need to resize images that are coming from users' cell phones (and are stupid large) from blobs on the fly when loading them using this tag:
image_tag thumbnail_document_path(document.id), alt: "#{document.document_title}", class: "document-thumbnail"
this calls the thumbnail controller action
def thumbnail
  document = Document.find params[:id]
  image_meta = document.image_details.first
  image = Document.get_image_for_display image_meta
  send_data image, type: "image/jpg", disposition: "inline"
end
Document.get_image_for_display pulls the blob from a database, checks the metadata we save separately to see if it's a PDF that needs to be converted to an image or a huge image that needs resizing, then does all that and asynchronously sends the blob. This works fine on everything except iOS. On iOS Safari and Chrome, if you're loading multiple images this way at the same time, this happens. If anyone has an explanation or solution for this I'd love to hear it!
Edit: This app shares a server and database with a C++ desktop app and does not (and cannot) have access to the locations on the server where these files are stored, and there are hundreds of thousands of files across hundreds of client servers, so storing a resized version next to the full-size one is off the table due to physical storage limitations. The issue only appears to happen when trying to convert and load more than 3 large images, and then only the first image stays broken; the others will fully load, although it takes a while for them to do so.
Your thumbnail action code implies that get_image_for_display must be synchronous, because its return value is used right away. It also needs to be reentrant if you're using a threaded server (like Puma), so that two instances can run concurrently inside one process. That requires it not to write any shared state (class variables, other globals), or at least to do so with synchronization (mutexes and so on).
But even if your server is single-threaded, in production there are usually multiple processes. So for each step of the function, think about how a concurrent run could interfere.
For example: I can guess that the function runs some external programs to process the data. The first thing to check is that there is no temporary file name collision - if two instances write data to the same file (even if the data is going to be the same), it will obviously be broken.
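One way to rule that out is to give every request its own temp files. A rough sketch (the convert command and resize geometry are only placeholders for whatever get_image_for_display actually runs):
require "tempfile"

# Per-request scratch files: two concurrent conversions can never clobber
# each other's data.
def resize_blob(blob)
  src = Tempfile.new(["original", ".jpg"])
  dst = Tempfile.new(["thumb", ".jpg"])
  src.binmode
  begin
    src.write(blob)
    src.flush
    # Placeholder for the real conversion command:
    system("convert", src.path, "-resize", "800x800>", dst.path) or
      raise "image conversion failed"
    File.binread(dst.path)
  ensure
    src.close!   # close! also unlinks the file
    dst.close!
  end
end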
This usually happens when the server breaks the connection with the browser while streaming the data, for any of several reasons (networking issues among them).
The recommended approach here is to use send_file, or to buffer the data by doing something like the snippet below, which ensures the data is completely streamed to the client:
# read the file in binary mode and hand the bytes to send_data
File.open('filename.jpeg', 'rb') do |file|
  send_data(file.read, type: "image/jpeg", disposition: "inline")
end
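If the resized image has already been written to disk, send_file hands delivery off to the web server instead of buffering the bytes in Ruby (the path below is purely illustrative):
send_file "/tmp/thumbnails/#{document.id}.jpg",
          type: "image/jpeg", disposition: "inline"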

What is the best way to save images? [duplicate]

The common method to store images in a database is to convert the image to base64 data before storing the data. This process will increase the size by 33%. Alternatively it is possible to directly store the image as a BLOB; for example:
$image = new Imagick("image.jpg");
$data = $image->getImageBlob();
$data = $mysqli->real_escape_string($data);
$mysqli->query("INSERT INTO images (data) VALUES ('$data')");
and then display the image with
echo '<img src="data:image/jpeg;base64,' . base64_encode($data) . '" />';
With the latter method, we save 1/3 storage space. Why is it more common to store images as base64 in MySQL databases?
UPDATE: There are many debates about the advantages and disadvantages of storing images in databases, and most people believe it is not a practical approach. Anyway, here I assume we do store images in the database, and am asking about the best method to do so.
I contend that images (files) are NOT usually stored in a database base64 encoded. Instead, they are stored in their raw binary form in a binary column, blob column, or file.
Base64 is only used as a transport mechanism, not for storage. For example, you can embed a base64 encoded image into an XML document or an email message.
Base64 is also stream friendly. You can encode and decode on the fly (without knowing the total size of the data).
While base64 is fine for transport, do not store your images base64 encoded.
Base64 provides no checksum or anything of any value for storage.
Base64 encoding increases the storage requirement by 33% over the raw binary format. It also increases the amount of data that must be read from persistent storage, which is still generally the largest bottleneck in computing. It's generally faster to read fewer bytes and encode them on the fly. Only if your system is CPU bound rather than IO bound, and you regularly output the image in base64, should you consider storing it in base64.
Inline images (base64 encoded images embedded in HTML) are a bottleneck themselves--you're sending 33% more data over the wire, and doing it serially (the web browser has to wait on the inline images before it can finish downloading the page HTML).
On MySQL, and perhaps similar databases, for performance reasons, you might wish to store very small images in binary format in BINARY or VARBINARY columns so that they are on the same page as the primary key, as opposed to BLOB columns, which are always stored on a separate page and sometimes force the use of temporary tables.
If you still wish to store images base64 encoded, please, whatever you do, make sure you don't store base64 encoded data in a UTF8 column then index it.
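The 33% figure above is easy to verify yourself; a quick Ruby check (the file name is arbitrary):
require "base64"

raw     = File.binread("image.jpg")          # any binary file
encoded = Base64.strict_encode64(raw)
puts encoded.bytesize.to_f / raw.bytesize    # => ~1.33 (4 output bytes per 3 input bytes)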
Pro base64: the encoded representation you handle is a pretty safe string. It contains neither control chars nor quotes. The latter point helps against SQL injection attempts. I wouldn't expect any problem just adding the value to a "hand-coded" SQL query string.
Pro BLOB: the database software knows what type of data to expect and can optimize for that. If you stored base64 in a TEXT field, it might try to build an index or other data structures for it, which would be really nice and useful for "real" text data but pointless and a waste of time and space for image data. And the BLOB is the smaller representation in terms of number of bytes.
I just want to give one example of why we decided to store images in the DB rather than in files or a CDN: storing images of signatures.
We tried a CDN, cloud storage, and files, and finally decided to store them in the DB. We are happy with that decision, as it proved right in subsequent events when we moved, upgraded our scripts, and migrated the sites several times.
In our case, we wanted the signatures to live with the records that belong to the author of the documents.
Storing them as files risks their going missing or being deleted by accident.
We stored them first as binary blobs in MySQL, and later as base64-encoded images in a text field. The decision to change to base64 was due to a smaller resulting size for some reason, and faster loading; the blob was slowing down the page load for some reason.
In our case, this solution of storing signature images in the DB (whether as blob or base64) was driven by:
Most signature images are very small.
We don't need to index the signature images stored in the DB.
Indexing is done on the primary key.
We may have to move or switch servers; moving physical image files to different servers could leave images not found because the links change.
It is embarrassing to ask the authors to re-sign their signatures.
It is more secure to save them in the DB than to expose them as files that can be downloaded if security is compromised; storing them in the DB gives us better control over access.
For any future migrations, changes of web design, hosting, or servers, we have zero worries about reconciling signature file names against physical files; it is all in the DB!
I recommend looking at modern NoSQL databases, and I also agree with user1252434's post. For instance, I am storing a few <500 KB PNGs as base64 in my MongoDB with the binary flag set to true, with no performance hit at all. Mongo can also be used to store large files like 10 MB videos, and that can offer huge time-saving advantages in metadata searches for those videos; see storing large objects and files in mongodb.
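If you do go the Mongo route, the raw bytes can also be stored directly as a BSON Binary value, skipping base64 entirely. A rough sketch with the official Ruby driver (connection details and the signatures collection are invented):
require "mongo"

client = Mongo::Client.new(["127.0.0.1:27017"], database: "myapp")
png    = File.binread("signature.png")                 # illustrative file

# Documents are capped at 16 MB; larger files would need GridFS instead.
client[:signatures].insert_one(
  author_id: 42,
  image:     BSON::Binary.new(png, :generic)
)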

iOS UIWebView - caching assets in native app

I am evaluating a project that was originally targeted to be just a PWA using React and Redux.
The application needs offline support though, and needs a sizable amount of media assets (images and videos) to be available offline.
Since the service worker storage limit on iOS is just 50 MB, this is not feasible there.
I have toyed with the idea of using a native app wrapper that handles the storage of the media files, with most of the app remaining a Redux/React implementation.
Is there a good way to expose such assets to the UIWebView from the native app? Or are there other common approaches for this situation?
First of all, you should try to cache only those assets which are necessary for your PWA. However, if you still want to store large files I would suggest you go with the IndexedDB API.
IndexedDB is a low-level API for client-side storage of significant amounts of structured data, including files/blobs. This API uses indexes to enable high-performance searches of this data. While Web Storage is useful for storing smaller amounts of data, it is less useful for storing larger amounts of structured data. IndexedDB provides a solution.
Why IndexedDB?
When the quota is exceeded on the IndexedDB API, the transaction's onabort() function is called with an Event as an argument.
When a browser asks the user for permission to extend the storage size, browsers call this function only when the user doesn't allow it; otherwise, the transaction continues.
If you want to know about other possible storage options, I would suggest you go through this link:
https://www.html5rocks.com/en/tutorials/offline/quota-research/

Storing infrequently used data sets in S3

I want a solution for storing data that is not accessed often: for example, analytical data that I may not need to access frequently, where I instead refer only to counter columns stored in records in a local Postgres database.
I’ve looked at Amazon SimpleDB and DynamoDB but I think S3 might be the best solution (for now). I like DynamoDB, but I think it's overkill for what I need. I’ve also looked into the Amazon Big Data solutions, which I feel are almost definitely overkill. I’ve heard of companies using S3 to store infrequently accessed data sets (rather than the traditional use of media).
What should I use? Is S3 a horribly bad idea? If S3 is a good idea, how would I use it to store data sets? I’ve used S3, carrierwave and fog a lot for storing media, but not for data.
Example data
I want to store all the responses from emails that are sent by Mandrill. This will eventually accumulate a lot of data and I may not use the data for a long time, but I want to store it regardless.
I’m thinking about combining the emails into a set and grouping them by day. And I can store the counters (which I will access frequently) in a single record in my local postgres database (one record per day). I won’t be able to search through the raw email data without loading the day, but I don’t think I need to.
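A rough sketch of that plan using the aws-sdk-s3 gem (the bucket name and key layout are invented; carrierwave/fog would work just as well):
require "aws-sdk-s3"
require "json"
require "zlib"

# Bundle one day's Mandrill responses into a single gzipped JSON object.
def archive_day(date, responses)
  Aws::S3::Client.new(region: "us-east-1").put_object(
    bucket: "my-email-archive",
    key:    "mandrill/#{date.strftime('%Y/%m/%d')}.json.gz",
    body:   Zlib.gzip(JSON.generate(responses))
  )
end

# The per-day counters that are read frequently stay in Postgres, one row per day.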

Machine-readability: Guidelines to follow such that data can be previewed nicely on CKAN

What are the guidelines to follow so that data can be previewed nicely on the CKAN Data Preview tool? I am working on CKAN and have been uploading data or linking it to external websites. Some could be previewed nicely, some not. I have been researching machine-readability online and could not find any resources pertaining to CKAN that state the correct way to structure data so that it previews nicely on CKAN. I hope to gather responses from all of you on the do's and don'ts so that they will come in useful to CKAN publishers and developers in future.
For example, data has to be in a tabular format with labelled rows and columns. Data has to be stored on the first tab of the spreadsheet as the other tabs cannot be previewed. Spreadsheet cannot contain formulas or macros. Data has to be stored in the correct file format (refer to another topic of mine: Which file formats can be previewed on CKAN Data Preview tool?)
Thanks!
Since CKAN is an open source data management system, it does not have specific guidelines on the machine readability of data. Instead, you might want to take a look at the current standard for data openness and machine readability right here: http://5stardata.info
The UK's implementation of CKAN also includes a set of plugins which help rate the openness of data based on the 5-star open data scheme, right here: https://github.com/ckan/ckanext-qa
Check Data Pusher Logs - When you host files in the CKAN DataStore, the tool that loads the data in provides logs; these will reveal problems with the format of the data.
Store Data Locally - Where possible, store the data locally, because data stored elsewhere has to go through the proxy process (https://github.com/okfn/dataproxy), which is slower and is of course subject to the external site remaining available.
Consider File Size and Connectivity - Keep the file size small enough for your installation and connectivity that it doesn't time out when loading into the CKAN Data Explorer. If the file is externally hosted and is large, and access to the file is slow (poor connectivity or too much load), you will end up with timeouts, since the proxy must read the entire file before it is presented for preview. Again, hosting data locally should mean better control over the load on compute resources and ensure that the Data Explorer works consistently.
Use Open File Formats - If you are using CKAN to publish open data, then the community generally holds that it is best to publish data in open formats (e.g. CSV, TXT) rather than proprietary ones (e.g. XLS). Beyond increasing access to data for all users, and reducing the chance that the data is not properly structured for preview, this has other advantages. For example, it is harder to accidentally publish information that you didn't mean to.
Validate Your Data - Use tools like CSVKIT to check that your data is in good shape.
The best way to get good previewing experiences is to start using the DataStore. When viewing remote data CKAN has to use the DataProxy to do its best to guess data types and convert the data to a form it can preview. If you put the data into the DataStore that isn't necessary as the data will already be in a good structure and types will have been set (e.g. you'll know this column is a date rather than a number).
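A hedged sketch of loading rows into the DataStore through the Action API's datastore_create call (the host, API key, resource id, and field names below are placeholders):
require "net/http"
require "json"
require "uri"

uri = URI("https://demo.ckan.org/api/3/action/datastore_create")
payload = {
  resource_id: "YOUR-RESOURCE-ID",
  force: true,
  fields: [
    { id: "observed_on", type: "date" },
    { id: "count",       type: "int" }
  ],
  records: [
    { observed_on: "2014-01-01", count: 42 }
  ]
}

req = Net::HTTP::Post.new(uri, "Content-Type"  => "application/json",
                               "Authorization" => "YOUR-API-KEY")
req.body = payload.to_json
res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
puts res.body   # CKAN answers with "success": true once the typed rows are stored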
