serve my text from the filesystem instead of a database? - ruby-on-rails

I am working on a content management application in which the data being stored on the database is extremely generic. In this particular instance a container has many resources and those resources map to some kind of digital asset, whether that be a picture, a movie, an uploaded file or even plain text.
I have been arguing with a colleague for a week now because in addition to storing the pictures, etc - they would like to store the text assets on the file system and have the application look up the file location(from the database) and read in the text file(from the file system) before serving to the client application.
Common sense seemed to scream at me that this was ridiculous and if we are bothering to look up something from the database, we might as well store the text in a database column and have it served along up with the row lookup. Database lookup + File IO seemed sounds uncontrollably slower then just Database Lookup. After going back and forth for some time, I decided to run some benchmarks and found the results a little surprising. There seems to be very little consistency when it comes to benchmark times. The only clear winner in the benchmarks was pulling a large dataset from the database and iterating over the results to display the text asset, however pulling objects one at a time from the database and displaying their text content seems to be neck and neck.
Now I know the limitations of running benchmarks, and I am not sure I am even running the correct idea of "tests" (for example, File system writes are ridiculously faster then database writes, didn't know that!). I guess my question is for confirmation. Is File I/O comparable to database text storage/lookup? Am I missing a part of the argument here? Thanks ahead of time for your opinions/advice!
A quick work about what I am using:
This is a Ruby on Rails application,
using Ruby 1.8.6 and Sqlite3. I plan
on moving the same codebase to MySQL
tomorrow and see if the benchmarks are
the same.

The major advantage you'll get from not using the filesystem is that the database will manage concurrent access properly.
Let's say 2 processes need to modify the same text as the same time, synchronisation with the filesystem may lead to race conditions, whereas you will have no problem at all with everyhing in database.

I think your benchmark results will depend on how you store the text data in your database.
If you store it as LOB then behind the scenes it is stored in an ordinary file.
With any kind of LOB you pay the Database lookup + File IO anyway.
VARCHAR is stored in the tablespace
Ordinary text data types (VARCHAR et al) are very limited in size in typical relational database systems. Something like 2000 or 4000 (Oracle) sometimes 8000 or even 65536 characters. Some databases support long text
but these have serious drawbacks and are not recommended.
LOBs are references to file system objects
If your text is larger you have to use a LOB data type (e.g. CLOB in Oracle).
LOBs usually work like this:
The database stores only a reference to a file system object.
The file system object contains the data (e.g. the text data).
This is very similar to what your colleague proposes except the DBMS lifts the heavy work of
managing references and files.
The bottom line is:
If you can store your text in a VARCHAR then go for it.
If you can't you have two options: Use a LOB or store the data in a file referenced from the database. Both are technically similar and slower than using VARCHAR.

I did this before. Its a mess, you need to keep the filesystem and the database synchronized all the time, so that makes the programming more complicated, as you would guess.
My advice is either go for an all filesystem solution, or all database solution, depending on the data. Notably, if you require lots of searches, conditional data retrieval, then go for database, otherwise fs.
Note that database may not be optimized for storage of large binary files. Still, remember, if you use both, youre gonna have to keep them synchronized, and it doesnt make for an elegant nor enjoyble (to program) solution.
Good luck!

At least, if your problems come from the "performance side", you could use a "no SQL" storage solution like Redis (via Ohm, for example), or CouchDB...

Related

How can I store images in my heroku postgres DB

I've found some similar questions, and good link on the topic. Such as this one (Easiest way to store images in database on Heroku?), where the top answer is to re-think using their DB and instead go with S3.
What I need is store 120 100KB images. That's it. There is no dynamic aspect. It's not like I have 10,000 users, and each one needs to store their profile picture. None of that. I just need to store 120 100KB images. The amount of images will never never change, neither growing or shrinking. 3 years from now, the same 120 images will be all that there is.
For these reasons, I want to store these in my DB. S3 is overkill, could cost some, extra time and cost needed implementing that solution, etc. In my DB, it's a measly 12MB which is a fraction of a fraction of the DB's total size.
How can I store the images? What datatype should I use and how can I upload them into my DB?
Just store them as bytea fields in a table. Insert them using a script in whatever your preferred language is and its database adapter. e.g. using Python with psycopg2 you'd open('filename','rb') the file, then pass it as a query parameter to the execute method when doing your INSERT.
For images that small there's no point using pg_largeobject and the lo wrapper, which can be useful for really big files. The only real advantage of that is that you can use lo_import to read the files directly into the database.

Read huge file from disk

I am developing an iOS app and I have this text file with a city name per line. I have like 3 Million cities in that file. In order to be able to perform searches and operations on it I am using a B-Tree but this tree takes a long time to be created. It is not good for the user experience having him to wait for this every time he uses the time. All this without using Core data!
Any tips on how can I speed up this process?
Thank you
My recommendation is that you use SQLite with an index on the fields you want to query (or some other type of permanent, indexed storage) so that the user only has to wait the first time the app is opened, and then you can query the database, which will be much faster. I am also fairly certain that you can install a SQLite database from a pre-generated file, so you might be able to generate this index offline, bundle it with your application, and then the user has no wait time at all. I'm not 100% sure on this options though, so you should investigate.
Either way, there is no magic solution here. If the data you want is on line 2 million of the file, you will have to read 2 million lines of text in order to get to that line. I would recommend finding a way to make the UX of your app acceptable so that the user feels better about waiting for the data to load. If you display some sort of pretty screen with a progress bar while the data indexes, the user will be more forgiving of this wait.
The B-Tree will obviously take some time to be created. If you don't want to use a database but stick with your own B-Tree implementation you could dump the tree data to a separate file and load that when the program starts instead of recreating it every time. However, you will have to update the cached tree every time the source data is modified.
In Python the pickle module can help you, but most programming languages will have a serialisation module.
Does this file come with the Application? If it does then you could already process the file file into an SQLite database. Before you ship the app containing the database. You can then use "Select" statements to search the data using indexed fields (like cityname).
If the file changes. Then Still ship with a database and just send amendments as a file. Which would edit the database to bring it back up to date. You may need to add a command to the file for each line like, REPLACE, NEW, DELETE.

How much data a column of mnesia table can store

How much data can a column of mnesia can store.Is there any limit on it or we can store as much as we want.Any pointer?(If table is disc_only_copy)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table to be a painfully naive decision to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferrable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, its not that hard to imagine simply plugging everything into the database. But if its a chat system that can have large files attached within a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to it in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I was using Postgres (which can actually stand up to that sort of abuse... but think about new awkwardness of database dumps, backups and massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if its silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database. It means that it is not designed to store large amount of data. When you ask yourself this question, it means you should look at another ejabberd backend.

How to manage millions of tiny HTML files

I have taken over a project which is a Ruby on Rails app that basically crawls the internet (in cronjobs). It crawls selected sites to build historical statistics about their subject over time.
Currently, all the raw HTML it finds is kept for future use. The HTML is saved into a Postgres database table and is currently at about 1,2 million records taking up 16 gigabytes and growing.
The application currently reads the HTML once to extract the (currently) relevant data and then flags the row as processed. However, in the future, it may be relevant to reprocess ALL files in one job to produce new information.
Since the tasks in my pipeline includes setting up backups, I don't really feel like doing daily backups of this growing amount of data, knowing that 99,9% of the backup is just static records with a practically immutable column with raw HTML.
I want this HTML out of the PostgreSQL database and into some storage. The storage must be easily manageable (inserts, reads, backups) and relatively fast. I (currently) have no requirement to search/query the content, only to resolve it by a key.
The ideal solution should allow me to insert and get data very fast, easy to backup, perhaps incrementally, perhaps support batch reads and writes for performance.
Currently, I am testing a solution where I push all these HTML files (about 15-20 KB each) to Rackspace Cloudfiles, but it takes forever and the access time is somewhat slow and relatively inconsistent (between 100 ms and several seconds). I estimate that accessing all files sequentially from rackspace could take weeks. That's not an ideal foundation for devising new statistics.
I don't believe Rackspace's CDN is the optimal solution and would like to get some ideas on how to tackle this problem in a graceful way.
I know this question has no code and is more a question about architecture and database/storage solutions, so it may be treading on the edges of SO's scope. If there's a more fitting community for the question, I'll gladly move it.

Sharing an large array with all users on a rails app

I have inherited an app that generates a large array for every user that visit the app. I recently discovered that it is identical for nearly all the users!!
Now I want to somehow make one copy of it so it is not built over and over again. I have thought of a few options and wanted input to see which one is the best:
1) Create a model and shove the data into the database
2) Create a YAML file and have the app load it when it initializes.
I personally like the model idea but a few engineers at work feel as though it does not deserve to be a full model. 97% of the times users will see the same exact thing but 3% of the time users will get a slightly different array (a few elements will have changed).
Any other approaches that I should consider.??..thanks in advance.
Remember that if you store the data in the DB, each request which requires the data will have to execute a DB query to pull it out. If you are running multiple server threads, each thread could have its own copy in memory (if they are all handling requests which require the use of the array). In that case, you wouldn't be saving any memory (though you might save time from not having to regenerate the array).
If you are running multiple server processes (not threads), and if the array contents change as the application is running, and the changes have to be visible to all the processes, caching in memory won't work. You will have to use the DB in that case.
From the information in your comment, I suggest you try something like this:
Store the array in your DB, and make sure that the record(s) used have created/updated timestamps. Cache the contents in memory using a constant/global variable/class variable. Also store the last time the cache was updated.
Every time you need to use the array, retrieve the relevant "updated" timestamp from the DB. (You may need to use hand-coded SQL and ModelName.connection.execute to avoid pulling back all the data in the record, which ActiveRecord will probably do.) If the timestamp is later than the last time your cache was updated, pull the array from the DB and update your cache.
Use a Mutex ('require thread') when retrieving/updating the cached data, in case your server setup may use multiple threads. (I don't think that Passenger does, but I have had problems similar to threading problems when using Passenger+RMagick, so I would still use a Mutex to be safe.)
Wrap all the code which deals with the cached array in a library class (or a class method on the model used to store the data), so the details of cache management don't spill over into the rest of the application.
Do a little bit of performance testing on the cache setup using Benchmark.measure {}. If a bug in the setup actually made performance worse rather than better, that would be sad...
I'd go with option 2. You can add two constants (for the 97% and 3%) that load from a YAML file when the app initializes. That ought to shrink your memory footprint considerably.
Having said that, yikes, this is just a band-aid on a hack, but you knew that already. I'd consider putting some time into a redesign, if you have that luxury.

Resources