How many connections can sqlite 3 handle? - connection

Is it ever a good idea to let a large amount of people connect to your website while it is using sqlite?
edit: I am using it in a critical ruby on rails application that may have hundreds of concurrent users.

There are two important properties unique to SQLite that I know of that are relevant:
When doing multiple inserts, you will get better performance if you wrap them all in a single transaction. If the inserts are done individually, SQLite waits for the disk platters to rotate around completely on each insert, so that the inserted data can be read back from the disk and validated.
When writing to a SQLite file, the entire file is locked, which can cause writer starvation. This situation improved in SQLite 3.
The SQLite website says that SQLite is suitable for small to medium traffic websites, with low OLTP capability. This accounts for about 95% of all websites.

Related

Firebase Realtime Database load issues / brownout

We experience intermittent, seemingly random brownouts of a firebase realtime database. We are beginning to shard our data into multiple databases, however, we are not sure this will solve our problem. It appears to us that firebase cannot scale to meet our needs in terms of doing frequent writes to a specific data set.
We sync data from a third-party data source in cycles (every 4-10 minutes, 1000 active jobs). Each update has the potential to change a few thousand nodes in firebase, most of which lie pretty low. However, most of the time the number of low-level nodes changed is much lower. We do differential updates on the sync'd data in order to allow very small writes to the lower-level nodes. This helps prevent our users from downloading a ton of additional data. We also batch all of our updates per cycle into only a handful of writes, between 10-20 (not sure of the performance impact of a batched write to multiple nodes vs. a write to a single node).
Here is an image of the database load graph, which includes some sharding:
Database Load
The "blue" line is our "main" database. The "orange line" is a database containing only the data that requires many writes, as described above. Currently, the main (blue) database is supporting normal operations, including reads/writes, etc.. The shard (orange) database is only handling writes. The mirror of these is pretty indicative of a "write" load issue, given that a large percentage of writes occurs in the morning.
At times, the database load reaches 100% and remains in this state for 30+ minutes.
Please let me know if I can expand on anything or explain anything in more detail. Would appreciate any suggestions on debugging strategies or explanations as to why this may be occurring.
We are actively refactoring a lot of code to mitigate this issue, however, it is not obvious what the main driver is.

How much data a column of mnesia table can store

How much data can a column of mnesia can store.Is there any limit on it or we can store as much as we want.Any pointer?(If table is disc_only_copy)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table to be a painfully naive decision to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferrable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, its not that hard to imagine simply plugging everything into the database. But if its a chat system that can have large files attached within a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to it in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I was using Postgres (which can actually stand up to that sort of abuse... but think about new awkwardness of database dumps, backups and massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if its silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database. It means that it is not designed to store large amount of data. When you ask yourself this question, it means you should look at another ejabberd backend.

How to manage millions of tiny HTML files

I have taken over a project which is a Ruby on Rails app that basically crawls the internet (in cronjobs). It crawls selected sites to build historical statistics about their subject over time.
Currently, all the raw HTML it finds is kept for future use. The HTML is saved into a Postgres database table and is currently at about 1,2 million records taking up 16 gigabytes and growing.
The application currently reads the HTML once to extract the (currently) relevant data and then flags the row as processed. However, in the future, it may be relevant to reprocess ALL files in one job to produce new information.
Since the tasks in my pipeline includes setting up backups, I don't really feel like doing daily backups of this growing amount of data, knowing that 99,9% of the backup is just static records with a practically immutable column with raw HTML.
I want this HTML out of the PostgreSQL database and into some storage. The storage must be easily manageable (inserts, reads, backups) and relatively fast. I (currently) have no requirement to search/query the content, only to resolve it by a key.
The ideal solution should allow me to insert and get data very fast, easy to backup, perhaps incrementally, perhaps support batch reads and writes for performance.
Currently, I am testing a solution where I push all these HTML files (about 15-20 KB each) to Rackspace Cloudfiles, but it takes forever and the access time is somewhat slow and relatively inconsistent (between 100 ms and several seconds). I estimate that accessing all files sequentially from rackspace could take weeks. That's not an ideal foundation for devising new statistics.
I don't believe Rackspace's CDN is the optimal solution and would like to get some ideas on how to tackle this problem in a graceful way.
I know this question has no code and is more a question about architecture and database/storage solutions, so it may be treading on the edges of SO's scope. If there's a more fitting community for the question, I'll gladly move it.

SQLCipher + FMDB performence is poor on iOS device

I used SQLCipher to encrypt sqlite database and used FMDB for performing sqlite operation on encrypted database by using [FMDB setKey:] call.
My application work slowly when I used SQLCipher with FMDB on encrypted database.
If I used only FMDB on non encrypted database then it works properly without taking more cpu uses of the device.
Please help me out here how to resolve this memory cpu uses with SQLCipher & FMDB?
Thanks in advance.
There are a few very important guidelines for optimal SQLCipher performance:
Do not repeatedly open and close connections, as key derivation is very expensive, by design
Use transactions to wrap insert / update / delete operations. Unless executed in a transaction scope, every operation will occur within it's own transaction which slows things down by several orders of magnitude
Ensure your data is normalized (i.e., using good practices for separation of data into multiple tables to eliminate redundancy). Unnecessary duplication of data leads to database bloat, which means more pages for SQLCipher to operate on
Ensure that any columns that are used for searches or join conditions are indexed. If you don't, SQLCipher will need to execute full database scans across large numbers of pages
Vacuum periodically to ensure databases are compact if you do large deletes, updates etc.
Finally, to diagnose further the performance of your specific query statements, could you run an explain query plan command against some of your queries? The output of the explain query command is described here.

serve my text from the filesystem instead of a database?

I am working on a content management application in which the data being stored on the database is extremely generic. In this particular instance a container has many resources and those resources map to some kind of digital asset, whether that be a picture, a movie, an uploaded file or even plain text.
I have been arguing with a colleague for a week now because in addition to storing the pictures, etc - they would like to store the text assets on the file system and have the application look up the file location(from the database) and read in the text file(from the file system) before serving to the client application.
Common sense seemed to scream at me that this was ridiculous and if we are bothering to look up something from the database, we might as well store the text in a database column and have it served along up with the row lookup. Database lookup + File IO seemed sounds uncontrollably slower then just Database Lookup. After going back and forth for some time, I decided to run some benchmarks and found the results a little surprising. There seems to be very little consistency when it comes to benchmark times. The only clear winner in the benchmarks was pulling a large dataset from the database and iterating over the results to display the text asset, however pulling objects one at a time from the database and displaying their text content seems to be neck and neck.
Now I know the limitations of running benchmarks, and I am not sure I am even running the correct idea of "tests" (for example, File system writes are ridiculously faster then database writes, didn't know that!). I guess my question is for confirmation. Is File I/O comparable to database text storage/lookup? Am I missing a part of the argument here? Thanks ahead of time for your opinions/advice!
A quick work about what I am using:
This is a Ruby on Rails application,
using Ruby 1.8.6 and Sqlite3. I plan
on moving the same codebase to MySQL
tomorrow and see if the benchmarks are
the same.
The major advantage you'll get from not using the filesystem is that the database will manage concurrent access properly.
Let's say 2 processes need to modify the same text as the same time, synchronisation with the filesystem may lead to race conditions, whereas you will have no problem at all with everyhing in database.
I think your benchmark results will depend on how you store the text data in your database.
If you store it as LOB then behind the scenes it is stored in an ordinary file.
With any kind of LOB you pay the Database lookup + File IO anyway.
VARCHAR is stored in the tablespace
Ordinary text data types (VARCHAR et al) are very limited in size in typical relational database systems. Something like 2000 or 4000 (Oracle) sometimes 8000 or even 65536 characters. Some databases support long text
but these have serious drawbacks and are not recommended.
LOBs are references to file system objects
If your text is larger you have to use a LOB data type (e.g. CLOB in Oracle).
LOBs usually work like this:
The database stores only a reference to a file system object.
The file system object contains the data (e.g. the text data).
This is very similar to what your colleague proposes except the DBMS lifts the heavy work of
managing references and files.
The bottom line is:
If you can store your text in a VARCHAR then go for it.
If you can't you have two options: Use a LOB or store the data in a file referenced from the database. Both are technically similar and slower than using VARCHAR.
I did this before. Its a mess, you need to keep the filesystem and the database synchronized all the time, so that makes the programming more complicated, as you would guess.
My advice is either go for an all filesystem solution, or all database solution, depending on the data. Notably, if you require lots of searches, conditional data retrieval, then go for database, otherwise fs.
Note that database may not be optimized for storage of large binary files. Still, remember, if you use both, youre gonna have to keep them synchronized, and it doesnt make for an elegant nor enjoyble (to program) solution.
Good luck!
At least, if your problems come from the "performance side", you could use a "no SQL" storage solution like Redis (via Ohm, for example), or CouchDB...

Resources