Limiting the search query to run faster - ruby-on-rails

Yes yes yes! I know! This is totally wrong to pass SQL in rails like the command below but I promise!:) it is just for some benchmark purposes
#medications = TestPharmOrderMain.select("brand_name,form,dose,generic_name as alternative,sum (order_count)
as total_count
,sum(order_cost) as total_cost").group("brand_name,form,dose,generic_name").limit(5)
The PostgresQL that I am running this REST service on it has two million rows and it takes like four minutes to return me the JSON from this query which is impossible to develop against it.
Is there a way I can change this query to for example only look at the first twenty rows in the DB and not two million rows so it runs faster for my dev purposes?

If this is for dev purposes do the smart thing and create a tiny database that is representative of the whole system. You can do this with a create table as select statement:
Create table my_test_table
as
select brand_name,form,dose,generic_name as alternative,sum (order_count)
as total_count
,sum(order_cost) as total_cost
from table
group by brand_name,form,dose,generic_name
limit 5
Now you can point your test query to my_test_table and it will only have 5 records and will therefore be quite fast.
Now you can also offset this to something like DBUnit, which is essentially a framework that is laid on top of XUnit. So this can be easily integrated into your testing that is done in what I presume is RubyUnit.

Related

Write native SQL in Core Data

I need to write a native SQL Query while I'm using Core Data in my project. I really need to do that, since I'm using NSPredicate right now and it's not efficient enough (in just one single case). I just need to write a couple of subqueries and joins to fetch a big number of rows and sort them by a special field. In particular, I need to sort it by the sum of values of their child-entites. Right now I'm fetching everything using NSPredicate and then I'm sorting my result (array) manually, but this just takes too long since there are many thousands of results.
Please correct me if I'm wrong, but I'm pretty sure this can't be a huge challenge, since there's a way of using sqlite in iOS applications.
It would be awesome if someone could guide me into the right direction.
Thanks in advance.
EDIT:
Let me explain what I'm doing.
Here's my Coredata model:
And here's how my result looks on the iPad:
I'm showing a table with one row per customer, where every customer has an amount of sales he made from January to June 2012 (Last) AND 2013 (Curr). Next to the Curr there's the variance between those two values. The same thing for gross margin and coverage ratio.
Every customer is saved in the Kunde table and every Kunde has a couple of PbsRows. PbsRow actually holds the sum of sales amounts per month.
So what I'm doing in order to show these results, is to fetch all the PbsRows between January and June 2013 and then do this:
self.kunden = [NSMutableOrderedSet orderedSetWithArray:[pbsRows valueForKeyPath:#"kunde"]];
Now I have all customers (Kunde) which have records between January and June 2013.
Then I'm using a for loop to calculate the sum for each single customer.
The idea is to get the amounts of sales of the current year and compare them to the last year.
The bad thing is that there are a lot of customers and the for-loop just takes very long :-(
This is a bit of a hack, but... The SQLite library is capable of opening more than one database file at a given time. It would be quite feasible to open the Core Data DB file (read/only usage) directly with SQLite and open a second file in conjunction with this (reporting/temporary tables). One could then execute direct SQL queries on the data in the Core Data DB and persist them into a second file (if persistence is needed).
I have done this sort of thing a few times. There are features available in the SQLite library (example: full-text search engine) that are not exposed through Core Data.
If you want to use Core Data there is no supported way to do a SQL query. You can fetch specific values and use [NSExpression expressionForFunction:arguments:] with a sum: function.
To see what SQL commands Core Data executes add -com.apple.CoreData.SQLDebug 1 to "Arguments Passed on Launch". Note that this should not tempt you to use the SQL commands youself, it's just for debugging purposes.
Short answer: you can't do this.
Long answer: Core Data is not a database per se - it's not guaranteed to have anything relational backing it, let alone a specific version of SQLite that you can query against. Furthermore, going mucking around in Core Data's persistent store files is a recipe for disaster, especially if Apple decides to change the format of that file in some way. You should instead try to find better ways to optimize your usage of NSPredicate or start caching the values you care about yourself.
Have you considered using the KVC collection operators? For example, if you have an entity Foo each with a bunch of children Bar, and those Bars have a Baz integer value, I think you can get the sum of those for each Foo by doing something like:
foo.bars.#sum.baz
Not sure if these are applicable to predicates, but it's worth looking into.

How to efficiently fetch n most recent rows with GROUP BY in sqlite?

I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k row dataset even on a macbook pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1,2,3,4,5,6]
n_results = 6
for pid in player_ids:
results_by_pid[pid] = exec_sql("SELECT *
FROM results
WHERE player_id = #{pid}
ORDER BY event_date DESC
LIMIT n_events")
And then I go on my merry way. But how can I turn this into a single fast query?
There is no better way.
SQL window functions, which might help, are not implemented in SQLite.
SQLite is designed as an embedded database where most of the logic stays in the application.
In contrast to client/server databases where network communication should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.
This won't be much of an answer, but here goes...
I have found that making things really quick can involve ideas from the nature of the data and schema themselves. For example, searching an ordered list is faster than searching an unordered list, but you have to pay a cost up front - both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! Soooooo many engineers don't get clever with their UX. Will your query be run as the result of a view controller push? Then set the thing going in a background thread from the PREVIOUS view controller, and let it work while iOS animates. How long does a push animation take? .2 seconds? At what point does your user indicate to the app (via some UX control) which playerids are going to be queried? As soon as he touches that button or TVCell, you can prefetch some data. So if the total work you have to do is O(n log n), that means you can probably break it up into O(n) and O(log n) pieces.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say..
CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
// is DATE a type? I don't know. you get the point
CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N
BEGIN
DELETE FROM recent_results
WHERE result_id = (SELECT result_id
FROM recent_results
WHERE event_date = MIN(event_date));
// or something like that. I have no idea if this will work,
// I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!

Reducers stopped working at 66.68% while running HIVE Join query

Trying to join 6 tables which are having 5 million rows approximately in each table. Trying to join on account number which is sorted in ascending order on all tables. Map tasks are successfully finished and reducers stopped working at 66.68%. Tried options like increasing number of reducers and also tried other options set hive.auto.convert.join = true; and set hive.hashtable.max.memory.usage = 0.9; and set hive.smalltable.filesize = 25000000L; but the result is same. Tried with small number of records (like 5000 rows) and the query works really well.
Please suggest what can be done here to make it work.
Reducers at 66% start doing the actual reduce (0-33% is shuffle, 33-66% is sort). In a join with hive, the reducer is performing a Cartesian product between the two data sets.
I'm going to guess that there is at least one foreign key that is appearing frequently in all of the data sets. Watch for NULL and default values.
For example, in a join, imagine the key "abc" appears ten times in each of the six tables (10^6). That's a million output records for that one key. If "abc" appears 1000 times in one table, 1000 in another, 1000 in another, then twice in the other three tables, you get 8 billion records (1000^3 * 2^3). You can see how this gets out of hand. I'm guessing there is at least one key that is resulting in a massive number of output records.
This is general good practice to avoid in RDBMS outside of Hive as well. Doing multiple inner joins between many-to-many relationships can get you in a lot of trouble.
For debugging this now, and in the future, you could use the JobTracker to find and examine the logs for the Reducer(s) in question. You can then instrument the reduce operation to get a better handle as to what's going on. be careful you don't blow it up with logging of course!
Try looking at the number of records input to the reduce operation for example.

Performance of generated T-SQL from Entity Framework

I recently used Entity Framework for a project, despite my DBA's strong disapproval. So one day he came to my office complaining about generated T-SQL that reaches his database.
For instance, when I want to select a product based on the id, I write something like this:
context.Products.FirstOrDefault(p=>p.Id==id);
Which translates to
SELECT ... FROM (SELECT TOP 1 ... FROM PRODUCTS WHERE ID=#id)
So he is shouting, "Why on earth would you write a SELECT * FROM (SELECT TOP 1)"
So I changed my code to
context.Products.Where(p=>p.Id==id).ToList().FirstOrDefault()
and this produces a much cleaner T-SQL:
SELECT ... FROM PRODUCTS WHERE ID=#id
The inner query and the TOP 1 dissappeared. Enough mambling, my question is this: Does the first query really put an overhead for SQL Server? Is it harder to parse than the second method? The Id column has a Clustered index on. I want a good answer so I can rub it on his face (or mine)
Thanks,
Themos
Have you tried running the queries manually and comparing the executions plans?
The biggest problem here isn't that the SQL isn't perfectly formed to your DBA's standards (although I'm fairly certain that the query engine will optimize out the extra select). The second query actually returns the entire contents of the Products table which you then analyse in memory and this is definitely a task that should be performed by the DB and not the application layer.
In short, he's being a pedant; leave it the way it was.

Search a relation without a second query

My question is about how to perform varying levels of search into a database while limiting the number of queries.
Let's start simple:
#companies = Company.where("active = ?", true)
Let's say we display records from this set. Then, we need:
#clientcompanies = #companies.where("client_id = ?", #client.id)
We display something from #clientcompanies. Then, we want to drill down further.
#searchcompanies = #clientcompanies.where("name LIKE ? OR notes LIKE ?", "#{params[:search]}%", "#{params[:search]}%")
Are these three statements the most efficient way to go about this?
If indeed the database is starting with the entire Company table each time around, is there a way to limit the scope so each of the above statements would take a shorter amount of time as the size of the set diminishes?
In case it matters, I'm running Rails 3 on both MySQL and PostgreSQL.
It doesn't get much more optimized then what you're already doing. Exactly zero of those statements will execute a SQL query until you try to iterate over the results. Calling methods like all, first, inspect, any?, each etc will be when the query is executed.
Each time you chain on a new where or other arel method, it appends to the sql query that it'll execute at the end. If, somewhere in the middle, you want to see the query that'll be executed you can do puts #searchcompanies.to_sql
Note that if you run these commands in the console each statement appears to run a SQL query only because the console automatically runs .inspect on the line you entered.
Hopefully I answered your question :)
There's a great railscast here: http://railscasts.com/episodes/239-activerecord-relation-walkthrough that explains how ActiveRelation works, and what you can do with it.
EDIT:
I may have mis-understood your question. You indicated that after each where call you were displaying information from the query. What's the use-case for this? Are you displaying all companies on the same page that you have filtered-out companies from a search? If you display something from that very first query then you will be pulling every single company row from your database (which is not going to be very scalable or performant at larger quantities of company entries).
Would it not make sense to only display information from the #searchcompanies variable?

Resources