How to manage millions of tiny HTML files

How to manage millions of tiny HTML files - ruby-on-rails

I have taken over a project which is a Ruby on Rails app that basically crawls the internet (in cronjobs). It crawls selected sites to build historical statistics about their subject over time.
Currently, all the raw HTML it finds is kept for future use. The HTML is saved into a Postgres database table and is currently at about 1,2 million records taking up 16 gigabytes and growing.
The application currently reads the HTML once to extract the (currently) relevant data and then flags the row as processed. However, in the future, it may be relevant to reprocess ALL files in one job to produce new information.
Since the tasks in my pipeline includes setting up backups, I don't really feel like doing daily backups of this growing amount of data, knowing that 99,9% of the backup is just static records with a practically immutable column with raw HTML.
I want this HTML out of the PostgreSQL database and into some storage. The storage must be easily manageable (inserts, reads, backups) and relatively fast. I (currently) have no requirement to search/query the content, only to resolve it by a key.
The ideal solution should allow me to insert and get data very fast, easy to backup, perhaps incrementally, perhaps support batch reads and writes for performance.
Currently, I am testing a solution where I push all these HTML files (about 15-20 KB each) to Rackspace Cloudfiles, but it takes forever and the access time is somewhat slow and relatively inconsistent (between 100 ms and several seconds). I estimate that accessing all files sequentially from rackspace could take weeks. That's not an ideal foundation for devising new statistics.
I don't believe Rackspace's CDN is the optimal solution and would like to get some ideas on how to tackle this problem in a graceful way.
I know this question has no code and is more a question about architecture and database/storage solutions, so it may be treading on the edges of SO's scope. If there's a more fitting community for the question, I'll gladly move it.

Related

Firebase Realtime Database load issues / brownout

We experience intermittent, seemingly random brownouts of a firebase realtime database. We are beginning to shard our data into multiple databases, however, we are not sure this will solve our problem. It appears to us that firebase cannot scale to meet our needs in terms of doing frequent writes to a specific data set.
We sync data from a third-party data source in cycles (every 4-10 minutes, 1000 active jobs). Each update has the potential to change a few thousand nodes in firebase, most of which lie pretty low. However, most of the time the number of low-level nodes changed is much lower. We do differential updates on the sync'd data in order to allow very small writes to the lower-level nodes. This helps prevent our users from downloading a ton of additional data. We also batch all of our updates per cycle into only a handful of writes, between 10-20 (not sure of the performance impact of a batched write to multiple nodes vs. a write to a single node).
Here is an image of the database load graph, which includes some sharding:
Database Load
The "blue" line is our "main" database. The "orange line" is a database containing only the data that requires many writes, as described above. Currently, the main (blue) database is supporting normal operations, including reads/writes, etc.. The shard (orange) database is only handling writes. The mirror of these is pretty indicative of a "write" load issue, given that a large percentage of writes occurs in the morning.
At times, the database load reaches 100% and remains in this state for 30+ minutes.
Please let me know if I can expand on anything or explain anything in more detail. Would appreciate any suggestions on debugging strategies or explanations as to why this may be occurring.
We are actively refactoring a lot of code to mitigate this issue, however, it is not obvious what the main driver is.

How to quickly warmup Neo4J page cache for a set of nodes and relationships

I'm setting up a REST API backed by a Neo4J instance of complex data about 400 GB of store size and need to warmup page cache for a set of nodes and relationships.
There are specific calls for a set of input data which takes a lot of time to query and get a response. So I want a way to warmup the cache only for the nodes and relationships which are accessed for these calls. I tried using APOC.warmup.run(true, true) (took around 15 mins which is acceptable for me), but it does load the whole store to memory which I can't do since I have a constraint on memory.I tried writing a simple Cypher which traverses through these path, but it is taking lot of time to execute and when I check the memory growth of Neo4J instance which is very slow as compared to APOC warmup.
I am also thinking if there is a way to extend/customize APOC warmup to load only specific parts of the store, but want to see if there are people out there who already tried something similar before.
I expect a quick way of warming up specific part of the store rather than the whole store.

You should also have enough page-cache configured for that (400G)
If you use apoc.warmup.run(true,true,true) also the schema indexes will be warmed up.
I guess it only took 15 minutes for you because your disk read performance might be lower? And how many CPUs do you have?
You cannot really load a specific part of the store.
But recent Neo4j versions track page-cache usage and restore it after restart.
So you can just run your queries and after a restart the same pages will be in the page cache.
You also see with PROFILE if you have page-faults in your query.

Heroku database performance experience needed?

We are experiencing some serious scaling challenges for our intelligent search engine/aggregator. Our database holds around 200k objects. From profiling and newrelic it seems most of our troubles may come from the database. We are using the smallest dedicated database Heroku provide (Ronin).
We have been looking into indexing and caching. So far we managed to solve our problems by reducing database calls and caching content intelligently, but now even this seems to reach an end. We are constantly asking ourselves if our code/configuration is good enough or if we are simply not using enough "hardware".
We suspect that the database solution we buy from Heroku may be performing insufficiently. For example, just doing a simple count (no joins, no nothing) on the 200k items takes around 250ms. This seems like a long time, even though postgres is known for its bad performance on counts?
We have also started to use geolocation lookups based on latitude/longitude. Both columns are indexed floats. Doing a distance calculation involves pretty complicated math, but we are using the very well recommended geocoder gem that is suspected to run very optimized queries. Even geocoder still takes 4-10 seconds to perform a lookup on, say, 40.000 objects, returning only a limit of the first nearest 10. This again sounds like a long time, and all the experienced people we consult says that it sound very odd, again hinting at the database performance.
So basically we wonder: What can we expect from the database? Might there be a problem? And what can we expect if we decide to upgrade?
An additional question I have is: I read here that we can improve performance by loading the entire database into memory. Are we supposed to configure this ourselves and if so how?
UPDATE ON THE LAST QUESTION:
I got this from the helpful people at Heroku support:
"What this means is having enough memory (a large enough dedicated
database) to store your hot data set in memory. This isn't something
you have to do manually, Postgres is configured automatically use all
available memory on our dedicated databases.
I took a look at your database and it looks like you're currently
using about 1.25 GB of RAM, so you haven't maxed your memory usage
yet."
UPDATE ON THE NUMBERS AND FIGURES
Okay so now I've had time to look into the numbers and figures, and I'll try to answer the questions below as follows:
First of all, the db consists of around 29 tables with a lot of relations. But in reality most queries are done on a single table (some additional resources are joined in, to provide all needed information for the views).
The table has 130 columns.
Currently it holds around 200k records but only 70k are active - hence all indexes are made as partial-indexes on this "state".
All columns we search are indexed correctly and none is of text-type, and many are just booleans.
Answers to questions:
Hmm the baseline performance it's kind of hard to tell, we have sooo many different selects. The time it takes varies typically from 90ms to 250ms selecting a limit of 20 rows. We have a LOT of counts on the same table all varying from 250ms to 800ms.
Hmm well, that's hard to say cause they wont give it a shot.
We have around 8-10 users/clients running requests at the same time.
Our query load: In new relic's database reports it says this about the last 24 hours: throughput: 9.0 cpm, total time: 0.234 s, avg time: 25.9 ms
Yes we have examined the query plans of our long-running queries. The count queries are especially slow, often over 500ms for a pretty simple count on the 70k records done on indexed columns with a result around 300

I've tuned a few Rails apps hosted on Heroku, and also hosted on other platforms, and usually the problems fall into a few basic categories:
Doing too much in ruby that could be done at the db level (sorting, filtering, join data, etc)
Slow queries
Inefficient use of indexes (not enough, or too many)
Trying too hard to do it all in the db (this is not as common in rails, but does happen)
Not optimizing cacheable data
Not effectively using background processing
Right now its hard to help you because your question doesn't contain any specifics. I think you'll get a better response if you pinpoint the biggest issue you need help with and then ask.
Some info that will help us help you:
What is the average response time of your actions? (from new relic, request-log-analyzer, logs)
What is the slowest request that you want help with?
What are the queries and code in that request?
Is the site's performance different when you run it locally vs. heroku?
In the end I think you'll find that it is not an issue specific to Heroku, and if you had your app deployed on amazon, engineyard, etc you'd have the same performance. The good news is I think that your problems are common, and shouldn't be too hard to fix once you've done some benchmarking and profiling.
-John McCaffrey

We are constantly asking...
...this seems a lot...
...that is suspected...
...What can we expect...
Good news! You can put and end to seeming, suspecting wondering and expecting through the magic of measurement!!!
Seriously though, you've not mentioned any of the basic points you'd need to get a useful answer:
What's the baseline performance of the DB running a sequential scan and single-row index fetches? You say Heroku say your DB fits in RAM, so you shouldn't see disk I/O issues when you measure.
Does this performance match whatver Heroku say it should be?
How many concurrent clients?
What's your query load - what queries and how often?
Have you checked the query plans for any of your suspiciously long-running queries?
Once you've got this sort of information, maybe someone can say something useful. As it stands anything you read here is just guesswork.

First: you should check your postgres configuration. (show all from within psql or another client, or just look at postgres.conf in the data directory) The parameter with the largest impact on performance is effective_cache_size, which should be set to about (total_physical_ram - memory_in_use_by_kernel_and_all_processes). For a 4GB machine, this often is around 3GB (4-1). (this is very course tuning, but will give the best results for a first step)
Second: why do you want all the counts? Better use a typical query: just ask for what is needed, not what is available. (reason: there is no possible optimisation for a COUNT(*): eiither the whole table, or a whole index needs to be scanned)
Third: start gathering and analysing some queryplans (for typical queries that perform badly). You can get a query plan by putting EXPLAIN ANALYZE before the actual query. (another way is to increase the logging level, and obtain them from the logfile) A bad queryplan can point you at missing statistics or indexes, or even at bad data-modelling.

Newrelic monitoring can be included as an add-on for heroku (http://devcenter.heroku.com/articles/newrelic). At the very least this should give you a lot of insight into what is happening behind the scenes, and may help you pinpoint some issues.

How many connections can sqlite 3 handle?

Is it ever a good idea to let a large amount of people connect to your website while it is using sqlite?
edit: I am using it in a critical ruby on rails application that may have hundreds of concurrent users.

There are two important properties unique to SQLite that I know of that are relevant:
When doing multiple inserts, you will get better performance if you wrap them all in a single transaction. If the inserts are done individually, SQLite waits for the disk platters to rotate around completely on each insert, so that the inserted data can be read back from the disk and validated.
When writing to a SQLite file, the entire file is locked, which can cause writer starvation. This situation improved in SQLite 3.
The SQLite website says that SQLite is suitable for small to medium traffic websites, with low OLTP capability. This accounts for about 95% of all websites.

serve my text from the filesystem instead of a database?

I am working on a content management application in which the data being stored on the database is extremely generic. In this particular instance a container has many resources and those resources map to some kind of digital asset, whether that be a picture, a movie, an uploaded file or even plain text.
I have been arguing with a colleague for a week now because in addition to storing the pictures, etc - they would like to store the text assets on the file system and have the application look up the file location(from the database) and read in the text file(from the file system) before serving to the client application.
Common sense seemed to scream at me that this was ridiculous and if we are bothering to look up something from the database, we might as well store the text in a database column and have it served along up with the row lookup. Database lookup + File IO seemed sounds uncontrollably slower then just Database Lookup. After going back and forth for some time, I decided to run some benchmarks and found the results a little surprising. There seems to be very little consistency when it comes to benchmark times. The only clear winner in the benchmarks was pulling a large dataset from the database and iterating over the results to display the text asset, however pulling objects one at a time from the database and displaying their text content seems to be neck and neck.
Now I know the limitations of running benchmarks, and I am not sure I am even running the correct idea of "tests" (for example, File system writes are ridiculously faster then database writes, didn't know that!). I guess my question is for confirmation. Is File I/O comparable to database text storage/lookup? Am I missing a part of the argument here? Thanks ahead of time for your opinions/advice!
A quick work about what I am using:
This is a Ruby on Rails application,
using Ruby 1.8.6 and Sqlite3. I plan
on moving the same codebase to MySQL
tomorrow and see if the benchmarks are
the same.

The major advantage you'll get from not using the filesystem is that the database will manage concurrent access properly.
Let's say 2 processes need to modify the same text as the same time, synchronisation with the filesystem may lead to race conditions, whereas you will have no problem at all with everyhing in database.

I think your benchmark results will depend on how you store the text data in your database.
If you store it as LOB then behind the scenes it is stored in an ordinary file.
With any kind of LOB you pay the Database lookup + File IO anyway.
VARCHAR is stored in the tablespace
Ordinary text data types (VARCHAR et al) are very limited in size in typical relational database systems. Something like 2000 or 4000 (Oracle) sometimes 8000 or even 65536 characters. Some databases support long text
but these have serious drawbacks and are not recommended.
LOBs are references to file system objects
If your text is larger you have to use a LOB data type (e.g. CLOB in Oracle).
LOBs usually work like this:
The database stores only a reference to a file system object.
The file system object contains the data (e.g. the text data).
This is very similar to what your colleague proposes except the DBMS lifts the heavy work of
managing references and files.
The bottom line is:
If you can store your text in a VARCHAR then go for it.
If you can't you have two options: Use a LOB or store the data in a file referenced from the database. Both are technically similar and slower than using VARCHAR.

I did this before. Its a mess, you need to keep the filesystem and the database synchronized all the time, so that makes the programming more complicated, as you would guess.
My advice is either go for an all filesystem solution, or all database solution, depending on the data. Notably, if you require lots of searches, conditional data retrieval, then go for database, otherwise fs.
Note that database may not be optimized for storage of large binary files. Still, remember, if you use both, youre gonna have to keep them synchronized, and it doesnt make for an elegant nor enjoyble (to program) solution.
Good luck!

At least, if your problems come from the "performance side", you could use a "no SQL" storage solution like Redis (via Ohm, for example), or CouchDB...

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart