Leave files as data source or put all in database - ruby-on-rails

I have a little bit of logs [ 200Mbytes/per day ]. What I want is to use certain data from this logs to build some statistics and show it through web interface. After pre-processing these files I get 4-5 files like this one:
hadooper#ubuntu:/usr/local/hadoop$ du -h part-r-00000
4.0K part-r-00000
hadooper#ubuntu:/usr/local/hadoop$ cat part-r-00000
201508042015 444335775
201508042020 563
201508042025 320787123
.....
I'm planning to store all this at least for year, maybe even more. Not sure yet.
My question is where would be better to store and retrieve data: files or database ?
I'm planning to use rails as backend. And as for now it seems like storing everything in files are ok option. But there might be some drawbacks in long term which I'm not aware of right now.
I'm sure there are a lot of experienced people who solved similar tasks. Would much appreciate your thoughts and help

If you are only trying to store the files, store as flat/zipped file or add to the database and then export them as backup file from the database. Preparing backup from database will ensure easier import later when you need the data.
If you will need to perform queries on them too all this time, store them in database as querying to database is faster (because of indices) and easier (because of availability of DDL, DML etc.)
If you are worried about security, encrypt your files or encrypt the database and then export.
Let me know if there is some case I forgot to address.

Related

Delete attachments to reduce disk space

I would like to remove massive all attachments since two years ago because need more space on disk.
Is it possible to update any field direclty on database? Is it a goot idea to reduce the space?
Are any tool or can execute any query on mysql database to do it?
Thanks!
Any time you need to remove data from the RT database, you should use the Shredder tool. Shredder knows about all of the connections between tables and records in the database, so it will keep your data in a good state. You can run Shredder from the web UI (Admin > Tools > Shredder) or from the command-line when removing larger amounts of data.
After shredding, it's a good idea to run the rt-validator utility to confirm the database doesn't have broken references. You may even want to run it before using Shredder to confirm you're starting from a good place.

Keeping a 'revisionable' copy of Neo4j data in the file system; how?

The idea is to have git or a git-like system (users, revision tracking, branches, forks, etc) store the 'master copy' of objects and relationships.
Since the master copy is on the filesystem, any changes can be checked in, tracked, and backed up. Neo4j could then import the files and serve queries. This also gives freedom since node and connection files can be imported to any other database.
Changes in Neo4j can be written to these files as part of the query
Nodes and connections can be added by other means (like copying from a seed dataset)
Nodes and connections are rarely created/updated/deleted by users
Most of the usage is where Neo4j shines: querying
Due to these two, the performance penalty on importing can be safely ignored
What's the best way to set this up?
If this isn't wise; how come?
It's possible to do that, but it will be lot of work which would not have a real value. IMHO.
With unmanaged extension for Transaction Event API you are able to store information about each transaction onto disk in your common file format.
Here is the some information about Transaction Event API - http://graphaware.com/neo4j/transactions/2014/07/11/neo4j-transaction-event-api.html
Could you please tell us more about the use case and how would design that system?
In general nothing keeps you from just keeping neo4j database files around (zipped).
Otherwise I would probably use a format which can be quickly exported / imported and diffed too.
So very probably csv files with node-file per label ordered by a sensible key
And then relationship-files between pairs of nodes, with neo4j-import you can recover that data quickly into a graph again.
If you want to write changes to the files you have to make sure they are replayable (appends + updates + deletes) , i.e. you have to chose a format which is more or less a transaction-log (which Neo4j already has).
If you want to do it yourself the TransactionHandler is what you want to look at. Alternatively you could dump the full database to a snapshot at times you request.
There are plans to add point-in-time recovery on the existing tx-logs, which I think would also address your question.

Download data option for customers

I have a multi tenant Rails app where the data of a customer is separated with a global scope. Now I want to give the customers the option to download all their own data in a single download. What is the best way to achieve this? Is it best to output everything into a CSV file?
Putting it into 1 csv file is going to likely cause you headaches.
I agree with Alex to do it as a background job if you go with CSV.
I will walk you through 2 approaches (CSV and Feed) and then you can choose what works for you.
CSV
Normally there are many tables that you want to export. If you put it all into one CSV file, it will be a bit messy of a file.
Instead I would set a nightly process for each customer for each table.
These generate CSVs for each customer for each table and stage them.
Finally for each customer I would bring those files together into a compressed file, and prep for delivery (Web download, FTP, Email etc)
The downside really is the lack of real time.
If you need real time (or if the data set is large), then you have to think about the impact this will have to your production database. It could cause serious performance degredation over time.
One option to get by this is to have read only replicated databases and you can deploy/utilize as needed.
Change Management
Instead of creating these ever-growing files every night, or on each request you can process data as it changes.
For example, if your customers really need to get this data, it could be for dropping in their database. I would move away from downloading CSV's or excel and offer an API.
When data changes come into your system, you notify interested components of the change. This way they do not have to go to the DB to get the changes. The API can have a pickup location that serves up the changed data whenever it exists.
We have used this mechanism in large scale, high volume environments with great success.
Push Notifications
Finally, there are web hooks. Basically when changes you post the data to their web server.
I would suggest if you are going to go with the CSV route, you look at the long term read impacts. You may not need to make a change now, but you should have in your plan an item and solution ready.
Finally I would break the task into many small tasks over 1 long running.
CSV is a commonly used format for this. There is a good rails cast on how to achieve this: http://railscasts.com/episodes/362-exporting-csv-and-excel
From my experience I can advice you to implement it as a background scheduled process because export could be expensive in resources and take long time to finish. After the task is finished you can email a user with the download link for example.

Syncing a local sqlite file to iCloud

I store some data in my iOS app directly in a local .sqlite file.  I chose to do this instead of CoreData because the data will need to be compatible with non-Apple platforms.
Now, I'm trying to come up with the best way to sync this file over iCloud.  I know you can't sync it directly, for many reasons.  I know CoreData is able to sync its DBs, but even ignoring that using CD would essentially lock this file into Apple platforms (I think? I've only looked into CD a bit), I need the iCloud syncing of this file to work across ALL of iCloud's supported platforms - which is supposed to include Windows.  I have to assume that there won't be any compatibility for the CoreData files in the Windows API.  Planning out the best way to accomplish this would be a lot easier if Apple would tell us any more than "There will be a Windows API [eventually?]"
In addition, I'll eventually need to implement at least one more sync service to support platforms that iCloud does not.  It would be helpful, though not required, if the method I use for iCloud can be mostly reused for future services.
For these reasons, I don't think CoreData can help me with this.  Am I correct in thinking this?
Moving on from there, I need to devise an algorithm for this, or find an existing one or an existing 3rd party solution.  I haven't stumbled across anything yet. However, I have been mulling over a couple possible methods I could implement:
Method 1:
Do something similar to how CoreData syncs sqlite DBs: send "transaction logs" to iCloud instead and build each local sqlite file off of those.
I'm thinking each device would send a (uniquely named) text file listing all the sql commands that that device executed, with timestamps.  The device would store how far along in each list of commands it has executed, and continue from that point each time the file is updated. If it received updates to multiple log files at once, it would execute each command in timestamp order.
Things could get 'interesting' efficiency-wise once these files get large, but it seems like a solvable problem.  
Method 2:
Periodically sync a copy of the working database to iCloud.  Have a modification timestamp field in every record.  When an updated copy of the DB comes through, query all the records with newer timestamps than some reference time and update the record in the local DB from the new data.
I see many potential problems with this method:
-Have to implement something further to recognize record deletion.
-The DB file could get conflicts. It might be possible to deal with them by handling each conflict version in timestamp order.
-Determining the date to check each update from could be tricky, as it depends on which device the update is coming from.
There are a lot of potential problems with method 2, but method 1 seems doable to me...
Does anyone have any suggestions as to what might be the best course of action? Any better ideas than my "Method 1" (or reasons why it wouldn't work)?
Try those two solutions from Ray Wenderlich:
Exporting/Importing data through mail:
http://www.raywenderlich.com/1980/how-to-import-and-export-app-data-via-email-in-your-ios-app
File Sharing with iTunes:
http://www.raywenderlich.com/1948/how-integrate-itunes-file-sharing-with-your-ios-app
I found it quite complex but helped me a lot.
Both method 1 and method 2 seem doable. Perhaps a combination of the two in fact - use iCloud to send a separate database file that is a subset of data - i.e. just changed items. Or maybe another file format instead of sqlite db - XML/JSON/CSV etc.
Another alternative is to do it outside of iCloud - i.e. a simple custom web service for syncing. So each change gets submitted to a central server via JSON/XML over HTTP, and then other devices pull updates from that.
Obviously it depends how much data and how many devices you want to sync across, and whether you have access to an appropriate server and/or budget to cover running such a server. iCloud will do that for "free" but all it really does is transfer files. A custom solution allows you to define your syncing model as you wish, but you have to develop and manage it and pay for it.
I've considered the possibility of transferring a database file through iCloud but I think that I would run into classic problems of timing - slow start for the user - and corrupted databases if the app is run on multiple devices simultaneously. (iPad/iPhone for example).
Sooo. I've had to use the transaction logs method. It really is difficult to implement, but once in place, seems ok.
I am using Apple's SharedCoreData sample as the base for this work. This link requires an Apple Developer Account.
I did find a much much better solution from Tim Roadley however this only works for IOS and I needed both IOS and MacOS.
rant> iCloud development really has to get easier and more stable! /rant

serve my text from the filesystem instead of a database?

I am working on a content management application in which the data being stored on the database is extremely generic. In this particular instance a container has many resources and those resources map to some kind of digital asset, whether that be a picture, a movie, an uploaded file or even plain text.
I have been arguing with a colleague for a week now because in addition to storing the pictures, etc - they would like to store the text assets on the file system and have the application look up the file location(from the database) and read in the text file(from the file system) before serving to the client application.
Common sense seemed to scream at me that this was ridiculous and if we are bothering to look up something from the database, we might as well store the text in a database column and have it served along up with the row lookup. Database lookup + File IO seemed sounds uncontrollably slower then just Database Lookup. After going back and forth for some time, I decided to run some benchmarks and found the results a little surprising. There seems to be very little consistency when it comes to benchmark times. The only clear winner in the benchmarks was pulling a large dataset from the database and iterating over the results to display the text asset, however pulling objects one at a time from the database and displaying their text content seems to be neck and neck.
Now I know the limitations of running benchmarks, and I am not sure I am even running the correct idea of "tests" (for example, File system writes are ridiculously faster then database writes, didn't know that!). I guess my question is for confirmation. Is File I/O comparable to database text storage/lookup? Am I missing a part of the argument here? Thanks ahead of time for your opinions/advice!
A quick work about what I am using:
This is a Ruby on Rails application,
using Ruby 1.8.6 and Sqlite3. I plan
on moving the same codebase to MySQL
tomorrow and see if the benchmarks are
the same.
The major advantage you'll get from not using the filesystem is that the database will manage concurrent access properly.
Let's say 2 processes need to modify the same text as the same time, synchronisation with the filesystem may lead to race conditions, whereas you will have no problem at all with everyhing in database.
I think your benchmark results will depend on how you store the text data in your database.
If you store it as LOB then behind the scenes it is stored in an ordinary file.
With any kind of LOB you pay the Database lookup + File IO anyway.
VARCHAR is stored in the tablespace
Ordinary text data types (VARCHAR et al) are very limited in size in typical relational database systems. Something like 2000 or 4000 (Oracle) sometimes 8000 or even 65536 characters. Some databases support long text
but these have serious drawbacks and are not recommended.
LOBs are references to file system objects
If your text is larger you have to use a LOB data type (e.g. CLOB in Oracle).
LOBs usually work like this:
The database stores only a reference to a file system object.
The file system object contains the data (e.g. the text data).
This is very similar to what your colleague proposes except the DBMS lifts the heavy work of
managing references and files.
The bottom line is:
If you can store your text in a VARCHAR then go for it.
If you can't you have two options: Use a LOB or store the data in a file referenced from the database. Both are technically similar and slower than using VARCHAR.
I did this before. Its a mess, you need to keep the filesystem and the database synchronized all the time, so that makes the programming more complicated, as you would guess.
My advice is either go for an all filesystem solution, or all database solution, depending on the data. Notably, if you require lots of searches, conditional data retrieval, then go for database, otherwise fs.
Note that database may not be optimized for storage of large binary files. Still, remember, if you use both, youre gonna have to keep them synchronized, and it doesnt make for an elegant nor enjoyble (to program) solution.
Good luck!
At least, if your problems come from the "performance side", you could use a "no SQL" storage solution like Redis (via Ohm, for example), or CouchDB...

Resources