How are Google Docs stored on disk?

I know Google Docs are stored in Google Drive, but what is the underlying storage used for Google Docs?
My guesses are:
GFS: But GFS is not optimized for small files or for a large number of small writes.
BigTable: Each update is a new record. Every time someone requests a document, it is rebuilt from the sequence of updates and returned. But how would it find the size of the document and the other metadata required by Google Drive?
NFS: Not sure whether it would work here. I have no knowledge of NFS.
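To make the BigTable guess concrete, here is a minimal sketch of an append-only update log that is replayed to rebuild a document. It is purely illustrative and says nothing about how Google Docs is actually implemented; the names Update, DocLog, append, materialize and metadata are all hypothetical.

```ruby
# Hypothetical append-only update log: every edit is a new record,
# and the document is rebuilt by replaying the records in order.
Update = Struct.new(:seq, :op, :pos, :text)

class DocLog
  def initialize
    @updates = []
  end

  # Append one edit as a new record; nothing is overwritten.
  def append(op, pos, text)
    @updates << Update.new(@updates.size + 1, op, pos, text)
  end

  # Rebuild the document text by replaying every update in sequence.
  def materialize
    @updates.sort_by(&:seq).each_with_object(+"") do |u, doc|
      case u.op
      when :insert then doc.insert(u.pos, u.text)
      when :delete then doc.slice!(u.pos, u.text.length)
      end
    end
  end

  # Metadata is derived from the replayed text here (an assumption);
  # a real system would more likely cache it in a separate record.
  def metadata
    text = materialize
    { size_bytes: text.bytesize, revisions: @updates.size }
  end
end
```

In this sketch the size is only known after a full replay, which is exactly the concern raised in the BigTable guess: metadata would probably have to be kept alongside the log rather than recomputed on every request.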

Related

Why does a ~0.5MB binary file take up 4.3MB of iCloud space when stored in a Core Data record?

The situation
I am using Core Data's "Allows External Storage" to store compressed images and small audio files in Core Data. Performance benchmarks have shown that this is actually quite performant. Also, I am using Core Data's NSPersistentCloudKitContainer to sync my database with iCloud.
"Allows External Storage" will automatically save files that are bigger than ~500KB (?) to the file system and only store a reference in the database. This works nicely. For instance, a 1MB image file is stored as an external record and takes up the expected 1MB of iCloud storage after syncing. Files that are smaller than ~500KB are not stored as external records (ckAsset), but as binaryData in the database record.
The problem:
For some reason, a 0.47MB binary data file that is stored directly in the database takes up about 4.3MB of iCloud storage. That is roughly 9x the expected amount. Inspecting the record in the CloudKit Dashboard shows that the binary data itself has the expected size of only 0.47MB. I have also verified that the local app bundle only grows by the expected 0.47MB. So how can the additional 3.8MB of consumed iCloud storage be explained? In contrast, audio and image files that are larger than ~500KB are stored as external records and take up the correct amount of iCloud storage.
Please look at this annotated image for a better understanding:
Image that illustrates the problem (CloudKit Dashboard)
Ideas / Workarounds / What I tried:
I could try to find a way to always store files as ckAssets/external records (e.g. by lowering the limit for storing ckAssets to 0.01MB), if that is possible.
Could the write-ahead log (WAL) of SQLite be involved, by creating huge temporary sqlite-wal files? I tried limiting the WAL journal size and the local sqlite-wal is small, so I don't think that this is where the problem lies, unless there is an independent iCloud WAL file that I don't know about.
I would be glad if anyone could help me with this issue. Thanks in advance!

Google Sheet with Google Finance lookup and custom function causes Google Drive to constantly resync

I have a sheet with a Google Finance lookup:
=googlefinance("USDZAR")
and a custom function that returns a constant string (abc). It doesn't take any parameters:
=test()
See here
Google Drive keeps syncing this sheet to my computer every 5-10 minutes.
No actual content is being synced, since Sheet files are only 176 bytes in size. They must be references to cloud data at Google.
I've compared subsequent files and they are identical.
Also, the Drive API keeps generating change events for this file every few minutes (https://developers.google.com/drive/api/v3/reference/changes/watch)
It's definitely the combination of the Google Finance and custom function - either separately doesn't cause this.
Does anyone know how I can fix this? It seems like a bug?
It's a "feature" of GOOGLEFINANCE() to update at given intervals. It has nothing to do with your custom function.
And if GOOGLEFINANCE() is constantly updating, then the sheet is constantly syncing to your PC.
You can try =GOOGLEFINANCE("USDZAR"; "daily") to see if that does the trick; otherwise you will need to somehow freeze the GOOGLEFINANCE() formula.

Difference between three firebase storage download methods

I couldn't find resources discussing the difference between the three download methods in the firebase storage documentation and pros/cons of each. I would like some clarification about the firebase storage documentation.
My App
Displays 100 images ranging from 10 KB-500 KB in size on a table view
Will be used in a location where internet connection and/or phone service could be very weak
Could be used by many users
3 methods for downloading from Firebase storage
Download to NSData in memory
This is the easiest way to quickly download a file, but it must load entire contents of your file into memory. If you request a file larger than your app's available memory, your app will crash. To protect against memory issues, make sure to set the max size to something you know your app can handle, or use another download method.
Question: I tried this method to display 100 images that were 10 KB-500 KB in size in my table view cells. Although my app didn't crash, my memory usage increased to 268 MB as I scrolled through the table. Is this method not recommended for displaying a lot of images?
Download to an NSURL representing a file on device
The writeToFile:completion: method downloads a file directly to a local device. Use this if your users want to have access to the file while offline or to share in a different app.
Question: Does that mean all images from firebase storage will be downloaded on user's phone? Does that mean that the app will be taking up a large percentage of the available storage on the phone?
Generate an NSURL representing the file online
If you already have download infrastructure based around URLs, or just want a URL to share, you can get the download URL for a file by calling the downloadURLWithCompletion: method on a storage reference.
Question: Does this method require a strong internet connection and/or phone service connection to work?
Generally, your memory usage should not be affected by the method of retrieval. As long as you're displaying the 100 images, their data will be stored in the memory and should have the same size if they're identically formatted/compressed.
Whichever way you go, I suggest you implement pagination (for your convenience, this question's answer might serve as a good implementation reference/guide) to decrease memory and network usage.
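As a rough illustration of the pagination idea, independent of Firebase or any iOS API, the list of image references can be cut into fixed-size pages so that only one page is requested at a time. The helper name page_of and the page size used below are assumptions made for this sketch.

```ruby
# Return one page of items (for example, image references), so that only
# per_page entries need to be fetched and held in memory at a time.
# Pages are zero-based; a page past the end yields an empty array.
def page_of(items, page, per_page)
  raise ArgumentError, "per_page must be positive" unless per_page.positive?
  raise ArgumentError, "page must not be negative" if page.negative?

  items[page * per_page, per_page] || []
end
```

With 100 images and a hypothetical page size of 20, the table view would request page 0 first and ask for the next page only when the user scrolls near the end of the current one.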
Now, down to comparing the methods:
Method 1
...but it must load entire contents of your file into memory.
This line might throw some people off thinking it's a
memory-inefficient solution, when all it really means is that you
cannot retrieve parts of the data, you can only download the entire
file. In the case of storing images, you probably would want that for
the data to make sense.
If your application needs to download the images every time the users
access it (i.e if your images are regularly updated), then this
method will probably suit you best. The images will be downloaded
every time the application starts, then they'll get discarded when
you kill it.
You stated that part of your user base might have a weak internet
connection, so the next method might be more efficient and
user-friendly.
Method 2
First off, the answers to your questions:
Yes. The images downloaded using this method will be stored on the users' devices.
The images should take up about the same size they're taking on Firebase storage.
Secondly, if you plan to use this method, then I suggest you store a
timestamp (or any sort of marker) in your database for when the last
change to the images occurred. Then, every time the app opens up, do
the following flow:
If no images are downloaded -> download images and store the database timestamp locally
If the local timestamp does not equal the timestamp on the database -> download images and store the new timestamp locally
Else -> use the images you already have, they should be identical to the ones in Firebase storage
That would be the best way to go if your network usage priority is
higher than that of the local storage.
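The timestamp flow above can be sketched as a small decision function. This is only an outline of the logic, not Firebase code: sync_action and its two parameters are hypothetical names, and reading the timestamps and downloading the images are left to the caller.

```ruby
# Decide what to do on app start, following the three cases above.
# local_timestamp:  marker stored on the device, or nil if no images
#                   have been downloaded yet (assumption for this sketch)
# remote_timestamp: marker read from the database
def sync_action(local_timestamp, remote_timestamp)
  return :download if local_timestamp.nil?                # nothing cached yet
  return :download if local_timestamp != remote_timestamp # images changed
  :use_cache                                              # cache is current
end
```

After a :download result, the caller would store remote_timestamp locally so that the next launch can compare against it.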
And finally...
Method 3 (not really)
This is not a data download method, this simply generates a
download URL given a reference to the child. You can then use that
URL to download the data in your app or elsewhere as long as the used
app or API is authorized to access your Firebase storage.
Update:
The URL is generated from a Firebase Storage reference (FIRStorage.storage().reference().child("exampleReference")) and would look like this (note: this is a fake link that will not actually work, it is just used for illustration purposes):
https://firebasestorage.googleapis.com/v0/b/projectName.appspot.com/o/somePathHere%2FchildName%2FsomeOtherChildName%2FimageName.jpg?alt=media&token=1a8f83a7-95xf-4d3s-nf9b-99a274927bcb
If you simply try to access that link you generate through any regular web-browser (assuming you don't have any Firebase rule that conflicts with that in your project), you can directly download that image from anywhere, not just through your app.
So in conclusion, this "Method" does not download data from Firebase storage, it just returns a download URL for your data in case you want a direct link.

Rails: How to stream data to and from a binary column (blob)

I have a question about how to efficiently store and retrieve large amounts of data to and from a blob column (data_type :binary). Most examples and code out there show simple assignments but that cannot be efficient for large amounts of data. For instance storing data from a file may be something like this:
# assume a model MyFileStore has a column blob_content :binary
my_db_rec = MyFileStore.new
File.open("#{Rails.root}/test/fixtures/alargefile.txt", "rb") do |f|
  my_db_rec.blob_content = f.read
end
my_db_rec.save
Clearly this would read the entire file content into memory before saving it to the database. This cannot be the only way to save blobs. For instance, in Java and in .NET there are ways to stream to and from a blob column so you are not pulling everything into memory. Is there something similar in Rails? Or are we limited to storing only small chunks of data in blobs when it comes to Rails applications?
If this is Rails 4, you can use render stream. Here's an example: Rails 4 Streaming
I would ask, though, what database you're using, and whether it might be better to store the files outside the database (Amazon S3, Google Cloud Storage, etc.), as this can greatly affect your ability to manage blobs. Microsoft, for example, has this recommendation: To Blob or Not to Blob
Uploading is generally done through forms, all at once or multi-part. Multi-part chunks the data so you can upload larger files with more confidence. The chunks are reassembled and stored in whatever database field (and type) you have defined in your model.
Downloads can be streamed. There is a strong tendency to hand off upload and streaming to third-party cloud storage systems like Amazon S3. This drastically reduces the burden on Rails. You can also hand off upload duties to your web server. All modern web servers have a way to stream files from a user. Doing this avoids memory issues, as only the chunk currently being uploaded is in memory at any given time. The web server should also be able to notify your app once the upload is completed.
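The "only the current chunk is in memory" idea can be shown in plain Ruby. The sketch below copies from any readable IO to any writable IO in fixed-size pieces; copy_in_chunks and CHUNK_SIZE are names invented here, and this does not by itself stream into an ActiveRecord :binary column, which still expects the whole value on assignment.

```ruby
require "stringio"

CHUNK_SIZE = 64 * 1024 # hypothetical chunk size in bytes

# Copy source to sink one chunk at a time and return the bytes copied.
# IO#read(length) returns nil at end of input, which ends the loop.
def copy_in_chunks(source, sink, chunk_size = CHUNK_SIZE)
  total = 0
  while (chunk = source.read(chunk_size))
    sink.write(chunk)
    total += chunk.bytesize
  end
  total
end
```

The same method works with File objects opened in "rb" and "wb" mode, for example when moving an uploaded temp file to its final location on disk or to a storage client that accepts an IO.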
For general streaming of output:
To stream a template, pass the :stream option from within your controller like this: render stream: true. Since the method of rendering templates and layouts changes with streaming, it is important to pay attention to how attributes like the title are loaded. This needs to be done with content_for, not yield. You can also open and close streams explicitly using the Live API; in that case you need to close the stream yourself with response.stream.close. For this you need the puma gem. Also be aware that you need a web server that supports streaming. You can configure Unicorn to support streaming.

Machine-readability: Guidelines to follow such that data can be previewed nicely on CKAN

What guidelines should be followed so that data can be previewed nicely in the CKAN Data Preview tool? I am working on CKAN and have been uploading data or linking it to external websites. Some of it could be previewed nicely, some not. I have been researching machine-readability online and could not find any resources pertaining to CKAN that state the correct way to structure data so that it can be previewed nicely on CKAN. I hope to gather responses from all of you on the do's and don'ts so that they will be useful to CKAN publishers and developers in future.
For example, data has to be in a tabular format with labelled rows and columns. Data has to be stored on the first tab of the spreadsheet as the other tabs cannot be previewed. Spreadsheet cannot contain formulas or macros. Data has to be stored in the correct file format (refer to another topic of mine: Which file formats can be previewed on CKAN Data Preview tool?)
Thanks!
Since CKAN is an open source data management system, it does not have specific guidelines on the machine readability of data. Instead, you might want to take a look at the current standard for data openness and machine readability here: http://5stardata.info
UK's implementation of CKAN also includes a set of plugins which help to rate the openness of the data based on the 5 star open data scheme right here: https://github.com/ckan/ckanext-qa
Check Data Pusher Logs - When you host files in the CKAN Data Store - the tool that loads the data in provides logs - these will reveal problems with the format of data.
Store Data Locally - Where possible store the data locally - because data stored elsewhere has to go through the proxy process (https://github.com/okfn/dataproxy) which is slower and is of course subject to the external site maintaining availability.
Consider File Size and Connectivity - Keep the file size small enough for your installation and connectivity that it doesn't time out when loading into the CKAN Data Explorer. If the file is externally hosted and large, and access to the file is slow (poor connectivity or too much load), you will end up with timeouts, since the proxy must read the entire file before it is presented for preview. Again, hosting data locally should mean better control over the load on compute resources and ensure that the Data Explorer works consistently.
Use Open File Formats - If you are using CKAN to publish open data, then the community generally holds that it is best to publish data in open formats (e.g. CSV, TXT) rather than proprietary ones (e.g. XLS). Beyond increasing access to data for all users, and reducing the chance that the data is not properly structured for preview, this has other advantages. For example, it is harder to accidentally publish information that you didn't mean to.
Validate Your Data - Use tools like csvkit to check that your data is in good shape.
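As an illustration of the kind of check meant by "Validate Your Data", here is a minimal sketch using Ruby's standard csv library. It is not csvkit and not part of CKAN; csv_problems is a hypothetical helper that only looks for a labelled header row and a consistent number of fields per row.

```ruby
require "csv"

# Return a list of human-readable problems found in CSV text.
# An empty list means the basic structural checks passed.
def csv_problems(text)
  rows = CSV.parse(text)
  return ["file is empty"] if rows.empty?

  header = rows.first
  problems = []
  if header.any? { |name| name.nil? || name.strip.empty? }
    problems << "header has blank column names"
  end
  rows.drop(1).each_with_index do |row, i|
    next if row.size == header.size
    # i is zero-based within the data rows, so the file line is i + 2
    problems << "row #{i + 2} has #{row.size} fields, expected #{header.size}"
  end
  problems
rescue CSV::MalformedCSVError => e
  ["malformed CSV: #{e.message}"]
end
```

Running a check like this before uploading catches ragged rows and unlabelled columns, which are the same problems that tend to show up later in the Data Pusher logs.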
The best way to get good previewing experiences is to start using the DataStore. When viewing remote data CKAN has to use the DataProxy to do its best to guess data types and convert the data to a form it can preview. If you put the data into the DataStore that isn't necessary as the data will already be in a good structure and types will have been set (e.g. you'll know this column is a date rather than a number).

Resources