Count number of Postgres statements - ruby-on-rails

I am trying to count the number of Postgres statements my Ruby on Rails application is performing against our database. I found this entry on Stack Exchange, but it counts transactions. We have several transactions that make very large numbers of statements, so that doesn't give a good picture. I am hoping the data is available from PG itself, rather than trying to parse a log.
https://dba.stackexchange.com/questions/35940/how-many-queries-per-second-is-my-postgres-executing

I think you are looking for ActiveSupport instrumentation. Part of Rails, this framework is used throughout Rails applications to publish certain events. For example, there's an sql.active_record event type that you can subscribe to in order to count your queries.
counter = 0
ActiveSupport::Notifications.subscribe "sql.active_record" do |*args|
  # Runs once for every SQL statement Active Record sends to the database.
  counter += 1
end
You could put this in config/initializers/ (to count across the app) or in one of the various before_ hooks of a controller (to count statements for a single request; in that case, unsubscribe once the request finishes so that subscribers don't pile up).
(The fine print: I have not actually tested this snippet, but that's how it should work AFAIK.)

PostgreSQL provides a few facilities that will help.
The main one is pg_stat_statements, an extension you can install to collect statement statistics. I strongly recommend this extension; it's very useful. It can tell you which statements run most often, which take the longest, etc. You can query it to add up the number of queries for a given database.
To get a rate over time, have a script sample pg_stat_statements regularly and store the values that changed since the last sample in a table.
The pg_stat_database view tracks values including the transaction rate. It does not track number of queries.
There's pg_stat_user_tables, pg_stat_user_indexes, etc, which provide usage statistics for tables and indexes. These track individual index scans, sequential scans, etc done by a query, but again not the number of queries.
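The sampling idea can be sketched in Ruby as follows. This is only a sketch under assumptions: the names Sample and statements_per_second are made up for illustration, the two sample values are hand-written stand-ins, and in practice each sample would come from running the SQL shown in TOTAL_CALLS_SQL against the database.

```ruby
# Query to run at each sample: total calls recorded for the current database.
# pg_stat_statements must be installed and loaded for this view to exist.
TOTAL_CALLS_SQL = <<~SQL
  SELECT sum(s.calls) AS total_calls
  FROM pg_stat_statements s
  JOIN pg_database d ON d.oid = s.dbid
  WHERE d.datname = current_database()
SQL

# One sample: when it was taken and the cumulative call count at that time.
Sample = Struct.new(:taken_at, :total_calls)

# Average statements per second between two samples.
# Returns nil when the counter went backwards (e.g. statistics were reset)
# or when no time elapsed.
def statements_per_second(previous, current)
  elapsed = current.taken_at - previous.taken_at
  delta = current.total_calls - previous.total_calls
  return nil if elapsed <= 0 || delta < 0

  delta / elapsed.to_f
end

# Hypothetical values standing in for two real samples taken 60 seconds apart.
first  = Sample.new(Time.at(1_000), 120_000)
second = Sample.new(Time.at(1_060), 126_000)
rate = statements_per_second(first, second)
puts rate
```

Returning nil for a negative delta is a design choice for this sketch: the counters are cumulative, so a drop most likely means the statistics were reset and the interval should be skipped rather than reported.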

Related

Atomically updating the count of a DB field in a multi-threaded environment

Note - This question expands on an answer to another question here.
I'm importing a file into my DB by chunking it up into smaller groups and spawning background jobs to import each chunk (of 100 rows).
I want a way to track progress of how many chunks have been imported so far, so I had planned on each job incrementing a DB field by 1 when it's done so I know how many have processed so far.
This has a potential situation of two parallel jobs incrementing the DB field by 1 simultaneously and overwriting each other.
What's the best way to avoid this condition and ensure an atomic parallel operation? The linked post above suggests using Redis, which is one good approach. For the purposes of this question I'm curious if there is an alternate way to do it using persistent storage.
I'm using ActiveRecord in Rails with Postgres as my DB.
Thanks!
I suggest NOT incrementing a DB field by 1. Instead, create a DB record for each job, keyed by a job id. There are two benefits:
You can count the number of records to let you know how many have processed without worrying about parallel operations.
You can also add some necessary logs into each job record and easily debug when any of the jobs fails when importing.
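To make the difference concrete, here is a small in-memory illustration. It assumes nothing about the real schema: the counter variable and the job_records array are hypothetical stand-ins for a DB field and a table of job rows, and the interleaving of the two jobs is written out by hand rather than produced by real threads.

```ruby
# In-memory stand-ins for the two designs. No database is involved; the
# point is only to show why a read-modify-write counter can lose updates.

# Design 1: each job reads the counter, adds one, and writes it back.
counter = 0
read_by_job_a = counter      # job A reads 0
read_by_job_b = counter      # job B reads 0 before A has written
counter = read_by_job_a + 1  # job A writes 1
counter = read_by_job_b + 1  # job B also writes 1, overwriting A's update
lost_update_total = counter

# Design 2: each job inserts its own row; progress is the row count.
job_records = []
job_records << { job_id: "chunk-1", status: "done" }
job_records << { job_id: "chunk-2", status: "done" }
record_count_total = job_records.size

puts "counter: #{lost_update_total}, records: #{record_count_total}"
```

With one row per job, two jobs finishing at the same moment simply insert two separate rows, so there is nothing for them to overwrite.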
I suggest you use a PostgreSQL sequence.
See CREATE SEQUENCE and Sequence Manipulation.
Especially nextval():
Advance the sequence object to its next value and return that value. This is done atomically: even if multiple sessions execute nextval concurrently, each will safely receive a distinct sequence value.

Faceting in Solr when index contains millions of documents

I'm working on a project that uses a Solr index with a few million documents and we've recently hit a memory problem. Faceting has become unusable on a couple of our fields (Solr runs out of heap memory) because of the number of documents containing those fields.
What options do we have besides increasing the memory? We see memory increases as a temporary solution because the number of documents goes up by a few 100k documents per day.
I'm currently looking into SolrCloud, but I'm not sure this is the right solution.
Any suggestions?
Thanks!
FacetFields: Allow for facet counts based on distinct values in a field. There are two methods for FacetFields, one that performs well with few distinct values in a field, and the other for when a field contains many distinct values (generally, thousands and up – you should test what works best for you).
The first method, facet.method=enum, works by issuing a FacetQuery for every unique value in the field. As mentioned, this is an excellent method when the number of distinct values in a field is small. It requires excessive memory though, and breaks down when the number of distinct values gets large. When using this method, be careful to ensure that your FilterCache is large enough to contain at least one filter for every distinct value you plan on faceting on.
The second method uses the Lucene FieldCache (future versions of Solr will actually use a different non-inverted structure, the UnInvertedField). This method is actually slower and more memory intensive for fields with a low number of unique values, but if you have a lot of uniques, this is the way to go. This method uses the FieldCache to look up the values for the given field for each document, and every time a document with a given value is found, the value has its count incremented.
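The two counting strategies can be sketched in plain Ruby. This is a conceptual model only, not Solr's implementation: DOC_VALUES, facet_by_enum and facet_by_field_lookup are hypothetical names, and the five-document index is made up.

```ruby
require "set"

# Hypothetical tiny index: document id => value of the faceted field.
DOC_VALUES = { 1 => "red", 2 => "blue", 3 => "red", 4 => "green", 5 => "blue" }.freeze

# enum-style: one set intersection per distinct value in the field.
# The per-value document sets play the role of cached filters.
def facet_by_enum(doc_values, matching_ids)
  docs_per_value = Hash.new { |hash, value| hash[value] = Set.new }
  doc_values.each { |id, value| docs_per_value[value] << id }

  docs_per_value.each_with_object({}) do |(value, ids), counts|
    hits = (ids & matching_ids).size
    counts[value] = hits if hits.positive?
  end
end

# FieldCache-style: walk the matching documents, look up each value,
# and increment that value's count.
def facet_by_field_lookup(doc_values, matching_ids)
  matching_ids.each_with_object(Hash.new(0)) do |id, counts|
    counts[doc_values.fetch(id)] += 1
  end
end

matching = Set[1, 2, 3, 5] # ids matched by the (hypothetical) search
enum_counts = facet_by_enum(DOC_VALUES, matching)
lookup_counts = facet_by_field_lookup(DOC_VALUES, matching)
puts enum_counts.inspect
puts lookup_counts.inspect
```

In this model the first function does work proportional to the number of distinct values (one stored set each), while the second does work proportional to the number of matching documents, which mirrors the trade-off described above.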
Please check the memory allotted to each cache and whether you can tweak the FieldCache to handle the situation (as you have mentioned, type3 and type4 have a large number of documents).
The source for the above information is Scaling Lucene and Solr. I found one more article that talks about Solr faceting: You are faceting it wrong.
Before SolrCloud, you could consider using multiple Solr cores.
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
With SolrCloud, a single index can span multiple Solr instances.
This means that a single index can be made up of multiple SolrCore's on different machines.
The SolrCores that make up one logical index are called a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling and for redundancy.
If you wanted to move your 2 SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
SolrCloud adds the distributed capabilities in Solr.
With this enabled, you can have a highly available, fault-tolerant cluster of Solr servers.
Use SolrCloud when you want high scale, fault tolerant, distributed indexing and search capabilities.
You can get more info about SolrCloud here
https://cwiki.apache.org/confluence/display/solr/SolrCloud

Is it possible in ruby to set a specific active record call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products I have with the flag set to false, I can do a call something like this:
Product.where(:exported => false).count
The problem I have is even the count takes a long time, because the table of 1 million products is being written to. More specifically exports are happening, and the value I'm interested in counting is ever changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh.. I should mention ruby 1.9.3 heroku and postgresql.
Now.. if I'm missing another way to get the count, I'd be excited to try that.
OH SNOT one last thing.. this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternately, use system statistics to get a fast approximation.
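One reading of "system statistics" is the planner's row estimate in pg_class.reltuples. The sketch below makes several assumptions: estimated_row_count is a made-up helper, the connection is assumed to behave like a PG::Connection from the pg gem, and a fake connection with an invented value stands in for a real one. Note that reltuples estimates the row count of the whole table, not only the rows where exported is false.

```ruby
# Planner's row estimate for a whole table, read from pg_class.reltuples.
# The value is refreshed by VACUUM and ANALYZE, so it lags behind the table.
ESTIMATE_SQL = 'SELECT reltuples::bigint AS estimate FROM pg_class WHERE oid = $1::regclass'

# `conn` is assumed to respond to exec_params(sql, params) the way a
# PG::Connection from the pg gem does, returning rows of string values.
def estimated_row_count(conn, table_name)
  conn.exec_params(ESTIMATE_SQL, [table_name])[0]["estimate"].to_i
end

# Stand-in connection so the sketch runs without a database.
class FakeConnection
  def exec_params(_sql, _params)
    [{ "estimate" => "1000000" }]
  end
end

estimate = estimated_row_count(FakeConnection.new, "products")
puts estimate
```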
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up to date results because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up to date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
As soon as a query begins to execute it's against a frozen read-only state because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run, it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or on whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.

Thinking Sphinx & Rails questions

I'm building my first Rails app and have it working great with Thinking Sphinx. I'm understanding most of it but would love it if someone could help me clarify a few conceptual questions.
When displaying search results after a sphinx query, should I be using the sphinx_attributes that are returned from the sphinx query? Or should my view use normal rails objects, such as @property.title, @property.amenities.title etc? If I use normal rails objects, doesn't that mean it's doing extra queries?
In a forum, I'd like to display 'unread posts'. Obviously this is true/false for each user/topic combination, so I'm thinking I should be caching the 'reader' ids within the topic's sphinx index. This way I can quickly do a query for all unread posts for a given user_id. I've got this working, but then realised it's pointless, as there is a time delay between sphinx indexes. So if a user clicks on an unread post, it will still appear unread until the sphinx DB is re-indexed.
I'm still in development so I'm manually indexing/rebuilding, but in production, what is a standard time between re-indexing?
I have a model with several text fields - should I concat these all into one column in the sphinx index for a keyword search? Surely this is quicker than indexing all the separate fields.
Slightly off-topic, but just wondering - when you access nested models, for example @property.agents.name, does this affect performance? Or does rails automatically fetch all associated entries when a property is pulled from the database?
To answer each of your points:
For both of your examples, sphinx_attributes would not be helpful. Firstly, you've already loaded the property, so the title is available directly without an extra database hit. And for property.amenities.title you're dealing with an array of strings, which Sphinx has no concept of. Generally, I would only use sphinx_attributes for complicated calculated attributes, not standard column references.
Yes, you're right, there will be a delay with this value.
It depends on how often your data changes. I have some apps where I can index every day because changes are so rare, but others where we'll run it every 10 minutes. If the data is particularly volatile, I'll look at using deltas (usually via Sidekiq) to have changes reflected in Sphinx in a few seconds.
I don't think it's much difference either way - unless you want to search on any of those columns separately? If so, it'll need to be a separate field.
By default, as you use each property's agents, the agents for that property will be loaded from the database (one SQL call per property). You could look at the eager loading docs for how to manage this better when you're dealing with multiple records. Thinking Sphinx has the ability to pass through :include options to the underlying ActiveRecord call.

Exporting and/or displaying multiple records in Rails

I have been working in Rails (I mean serious working) for the last 1.5 years now. Coming from a .Net background and database/OLAP development, there are many things I like about Rails, but there are a few things about it that just don't make sense to me. I just need some clarification on one such issue.
I have been working on an educational institute's admission process, which is just a small part of a much bigger application. Now, for the administrator, we needed to display a list of all applied/enrolled students (which may range from 1000 to 10,000), and also give a way to export them as an Excel file. For now, we are just focusing on exporting in CSV format.
My questions are:
Is Rails meant to display so many records at the same time?
Is will_paginate the only way to paginate records in Rails? From what I understand, it still fetches all the records from the DB, and then selectively displays the relevant records. Back in .Net/PHP/JSP, we used to create a stored procedure and selectively return the relevant records from there. Since using stored procedures is a known issue in Rails, what other options do we have?
Same issue with exporting this data. I benchmarked the process, i.e. receiving the request at the server, executing the query and returning the response. The ActiveRecord creation was taking a helluva time. Why was that? There were only like 1000 records, and the page showed a connection timeout to the user. I mean, if the connection times out while working on 1000 records, then why use Rails? Or does it mean Rails is not meant for such applications? I have previously worked with TBs of data, and never had this issue.
I never understood ORM techniques at the core. Say, we have a table users, and are associated with multiple other tables, but for displaying records, we need data from only tables users and its associated table admissions, then does it actually create objects for all its associated tables. I know, the data will be fetched only if we use the association, but does it create all the objects before-hand?
I hope, these questions are not independent and do qualify as per the guidelines of SF.
Thank you.
EDIT: Any help? I re-checked and benchmarked again. For 1000 records, where we are joining 4-5 different tables (1000 users, 2-3 one-to-one associations, and 2-3 one-to-many associations), it is creating more than 15000 objects. This is for eager loading. As for lazy loading, it will be 1000 user queries plus some 20+ queries. What are other possible options for such problems and applications? I know, I am kinda bumping the question to come to the top again!
Rails can handle databases with TBs of data.
Is will_paginate only way to paginate records in Rails?
There are many other gems like "kaminari".
it fetches all records from the db..
NO. It doesn't work that way. For example, take the following query: User.all.page(1).per(10)
User.all won't fire a db query; it will return a proxy object. You then call page(1) and per(10) on the proxy (ActiveRecord::Relation). When you try to access the data from the proxy object, it will execute a db query. Active Record will accumulate all the conditions and parameters you pass and will execute a SQL query when required.
Go to the rails console and type u = User.all; "f" (the second statement, "f", is there to prevent the rails console from calling inspect on the proxy to display the result).
It won't fire any query. Now try u[0]; it will fire a query.
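The "accumulate first, query on first access" behaviour can be mimicked with a toy class. To be clear about assumptions: LazyRelation is a made-up stand-in, not ActiveRecord's implementation, the rows are plain integers, and unlike a real relation the chained calls here modify the same object instead of returning a new one.

```ruby
# Toy stand-in for a lazily evaluated relation. This is not ActiveRecord's
# implementation; it only mimics "build up the query, run it on first access".
class LazyRelation
  attr_reader :queries_run

  def initialize(rows)
    @rows = rows
    @page = 1
    @per = rows.size
    @loaded = nil
    @queries_run = 0
  end

  # Chainable, like page/per in kaminari: record the setting, run nothing.
  def page(number)
    @page = number
    self
  end

  def per(size)
    @per = size
    self
  end

  # First access to the data triggers the single (pretend) query.
  def [](index)
    load_rows[index]
  end

  private

  def load_rows
    @loaded ||= begin
      @queries_run += 1
      @rows.slice((@page - 1) * @per, @per) || []
    end
  end
end

relation = LazyRelation.new((1..100).to_a).page(2).per(10)
queries_before = relation.queries_run
first_row = relation[0]
queries_after = relation.queries_run
puts "before: #{queries_before}, after: #{queries_after}, first row: #{first_row}"
```

Only the ten rows of the requested page are ever sliced out, which is the point of the answer above: pagination is applied to the query, not to an already loaded list.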
ActiveRecord creation was taking a helluva time
1000 records shouldn't take much time.
Check the number of SQL queries fired against the db. Look for signs of the n+1 problem and fix them by eager loading.
Check the serialization of the records to csv format for any cpu or memory intensive operation.
Use a profiler and track down the function that is consuming most of the time.
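As an illustration of the n+1 check, the sketch below counts lookups against an in-memory stand-in for a database. Everything here is assumed for the example: CountingDb, its methods, and the users/admissions data are hypothetical. In a Rails app the eager form is usually written with includes, e.g. User.includes(:admissions).

```ruby
# In-memory stand-in for a database that counts the queries it receives.
# Table and column names are hypothetical.
class CountingDb
  attr_reader :query_count

  def initialize(admissions)
    @admissions = admissions
    @query_count = 0
  end

  # One query per call: admissions for a single user.
  def admissions_for(user_id)
    @query_count += 1
    @admissions.select { |a| a[:user_id] == user_id }
  end

  # One query in total: admissions for many users, grouped by user.
  def admissions_for_all(user_ids)
    @query_count += 1
    @admissions.select { |a| user_ids.include?(a[:user_id]) }
               .group_by { |a| a[:user_id] }
  end
end

user_ids = (1..1000).to_a
admissions = user_ids.map { |id| { user_id: id, course: "course-#{id % 5}" } }

# N+1 shape: one lookup per user inside the loop.
lazy_db = CountingDb.new(admissions)
user_ids.each { |id| lazy_db.admissions_for(id) }

# Eager shape: fetch everything the loop needs up front.
eager_db = CountingDb.new(admissions)
by_user = eager_db.admissions_for_all(user_ids)

puts "per-user lookups: #{lazy_db.query_count}, eager: #{eager_db.query_count}"
```

The per-user count does not include the query that loaded the users in the first place, which is the "+1" in the name.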

Resources