Rails Testing: Fixtures, Factories, and Magic Numbers

I've got an application that needs quite a bit of data (1000s of records) to do appropriate testing. The only way I've found to get a decent set of testable, sensible data is to use a subset of my production DB. I've converted this to YAML fixtures in the normal `test/fixtures` location.
This works, but now I have a bunch of seemingly brittle tests and assertions that depend on there being a particular number of records that meet condition X...
Example:
def test_children_association
  p = Parent.find(1)
  assert_equal 18, p.children.count, "Parent.children isn't providing the right records"
end
This doesn't seem like a good idea to me, but I'm not sure if there is a better / accepted way to test an application that needs a large hierarchy of data.

Magic numbers in tests aren't an anti-pattern. Your tests need to be so dead-simple that you don't need to test them. This means you'll have some magic numbers. This means that your tests will break when you change small bits of functionality. This is good.
Fixtures have some problems, but there are a few simple things you can do to make them easier to work with:
Only have baseline data in your fixtures, the sort of data that most of your tests need but don't care about. This will involve a time investment up front, but it's better to take the pain early than write poor unit tests for the life of the project.
Add the data to be tested in the context of the test. This improves readability of your tests and saves you from writing "make sure nobody messed up the fixtures" sanity checks at the beginning of your unit tests.
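A sketch of what that split can look like, using the question's Parent/Child models (the fixture name and attributes here are made up for illustration):

# test/fixtures/parents.yml -- baseline data most tests need but don't care about
default_parent:
  name: Baseline Parent

# in the test itself, build the records the assertion actually cares about
def test_children_count
  parent = parents(:default_parent)
  3.times { |i| parent.children.create!(:name => "Child #{i}") }
  assert_equal 3, parent.children.count
end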

The first thing I'd say is: what are you testing in that example? If it's an ordinary AR has_many association, then I wouldn't bother writing a test for it. All you're doing is testing that AR works.
A better example might be if you had a very complicated query or if there was other processing involved in getting the list of children records. When you get them back, rather than testing for a count you could iterate through the returned list and verify that the children match the criteria you're using.
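For example, something along these lines (the eligible_children query and the eligible? criterion are hypothetical, standing in for whatever complicated query you actually have):

def test_eligible_children_query
  parent = Parent.find(1)
  children = parent.eligible_children   # the complicated query under test
  assert children.any?, "expected the query to return some children"
  children.each do |child|
    assert child.eligible?, "child #{child.id} doesn't match the query's criteria"
  end
end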

What I've found most useful in this situation is not using fixtures at all, but rather constructing the database objects on the fly, like:
def test_foo
  project = Project.create valid_project.merge(....)
  # assertions here
end
and in my test_helpers I'd have a bunch of methods:
def valid_project
  { :user_id => 23, :title => "My Project" }
end

def invalid_project
  valid_project.merge(:title => nil)
end
I found that the pain of having to build massive collections of test objects has led me naturally to design simpler and more versatile class structures.

Cameron's right: What are you testing?
What sort of system needs 1000s of records present to test? Remember, your tests should be as tiny as possible and should be testing application behavior. There's no way it needs thousands of records for the vast majority of those tests.
For little bits of behavior tests where you need object relationships, consider mock objects. You'll only be speccing out the exact minimum amount of behavior necessary to get your test to pass, and they won't hit the DB at all, which will amount to a huge performance gain in your test suite. The faster it is to run, the more often people will run it.
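For instance, with a mocking library such as mocha the association can be stubbed out entirely (the average_child_age method is hypothetical, just to show the shape of such a test):

def test_average_child_age_without_touching_the_db
  parent = Parent.new
  parent.stubs(:children).returns([stub(:age => 4), stub(:age => 8)])
  assert_equal 6, parent.average_child_age
end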

I may have a unique situation here, but I really did need quite a few records for testing this app (I got it down to 150 or so). I'm analyzing historical data and have numerous levels of has_many. Some of my methods do custom SQL queries across several tables which I might end up modifying to use ActiveRecord.find but I needed to get the test running first.
Anyway, I ended up using some ruby code to create the fixtures. The code is included in my test_helper; it checks the test DB to see if the data is stale (based on a time condition) and wipes and recreates the records procedurally. In this case, creating it procedurally allows me to know what the data I'm testing for SHOULD be, which is safer than using a subset of production data and hoping the numbers I calculate the first time are what I should test for in the future.
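A sketch of the kind of test_helper check described above (the model names, the staleness window, and the builder method are placeholders for whatever your app needs):

# test/test_helper.rb
def ensure_historical_test_data!
  newest = Parent.maximum(:updated_at)
  return if newest && newest > 1.day.ago   # data is fresh enough, keep it

  [Child, Parent].each(&:delete_all)
  build_historical_records                 # app-specific procedural builder
end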
I also moved to using Shoulda which, along with many other useful things, makes ActiveRecord association testing as easy as:
should_have_many :children
should_belong_to :parent

Related

When should inferred relationships and nodes be used over explicit ones?

I was looking up how to utilise temporary relationships in Neo4j when I came across this question: Cypher temp relationship
The comment underneath it made me wonder when they should be used, and since no one argued against it, I thought I would bring it up here.
I come from a mainly SQL background and my main reason for using virtual relationships was to eliminate duplicated data and do traversals to get properties of something instead.
For a more specific example, let's say we have a robust cake recipe, which has sugar as an ingredient. The sugar is what makes the cake sweet.
Now imagine a use case where I don't like sweet cakes so I want to get all the ingredients of the recipe that make the cake sweet and possibly remove them or find alternatives.
Then there's another use case where I just want foods that are sweet. I could work backwards from the sweet ingredients to get to the food or just store that a cake is sweet in general, which saves time from traversal and makes a query easier. However, as I mentioned before, this duplicates known data that can be inferred.
Sorry if the example is too strange, I suck at making them. I hope the main question comes across, though.
My feeling is that the only valid scenario for creating redundant "shortcut" relationships is this:
Your use case has a stringent time constraint (e.g., average query time must be less than 200ms), but your neo4j query -- despite optimization -- exceeds that constraint, and you have verified that adding "shortcut" relationships will indeed make the response time acceptable.
You should be aware that adding redundant "shortcut" relationships comes with its own costs:
Queries that modify the DB would need to be more complex (to modify the redundant relationships) and also slower.
You'd always have to add the redundant relationships -- even if you never actually end up needing some (most?) of them.
If you want to make concurrent updates to the DB, the chances that you may lose some updates and introduce inconsistencies into the DB would increase -- meaning that you'd have to work even harder to avoid inconsistencies.
NOTE: For visualization purposes, you can use virtual nodes and relationships, which are temporary and not actually stored in the DB.

What is one way that I can reduce .includes association query?

I have an extremely slow query that looks like this:
people = includes({ project: [{ best_analysis: :main_language }, :logo] }, :name, name_fact: :primary_language)
         .where(name_id: limit(limit).unclaimed_people(opts))
Look at the includes method call and notice that it is loading a huge number of associations. In the RailsSpeed book, there is the following quote:
“For example, consider this:
Car.all.includes(:drivers, { parts: :vendors }, :log_messages)
How many ActiveRecord objects might get instantiated here?
The answer is:
# Cars * ( avg # drivers/car + avg log messages/car + average parts/car * ( average parts/vendor) )
Each eager load increases the number of instantiated objects, and in turn slows down the query. If these objects aren't used, you're potentially slowing down the query unnecessarily. Note how nested eager loads (parts and vendors in the example above) can really increase the number of objects instantiated.
Be careful with nesting in your eager loads, and always test with production-like data to see if includes is really speeding up your overall performance.”
The book fails to mention what could be a good substitute for this though. So my question is what sort of technique could I substitute for includes?
Before I jump to the answer: I don't see you using any pagination or limit on the query; that alone may help quite a lot.
Unfortunately, there isn't really a substitute. And if you use all of the objects in a view, that's okay. There is one possible alternative to includes, though. It's quite complex, but it's still helpful sometimes: you join all the needed tables, select only the fields you actually use, alias them, and access them as a flat structure.
Something like
(NOTE: it uses arel helpers. You need to include ArelHelpers::ArelTable in models where you use syntax like NameFact[:id])
relation.joins(name_fact: :primary_language).select(
  NameFact[:id].as('name_fact_id'),
  PrimaryLanguage[:language].as('primary_language')
)
I'm not sure it will work for your case, but that's the only alternative I know.
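With a select like that, each record in the relation exposes the aliased columns as plain attributes, so (using the aliases above) you would read person.name_fact_id and person.primary_language instead of walking the associations.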
I have an extremely slow query that looks like this
There are a couple of potential causes:
Too many unnecessary objects fetched and created. From your comment, it looks like that is not the case and you need all the data that is being fetched.
DB indexes not optimised. Check the time taken by the query. EXPLAIN the generated query (check the logs, or call .to_sql to get it) and make sure it is not doing table scans or other costly operations.
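A quick way to run that check against the relation from the question, using ActiveRecord's built-in helpers:

puts people.to_sql    # copy this into psql and prefix it with EXPLAIN ANALYZE
puts people.explain   # or have ActiveRecord run EXPLAIN for you (Rails 3.2+)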

How do you test CSV file content in Rails?

I'm writing a simple CanadaPost price shipping API, where the prices are given in a PDF document in table format (see page 3 of https://www.canadapost.ca/tools/pg/prices/SBParcels-e.pdf).
The data is extracted into csv files and imported into the database (table prices has columns weight, rate_code and date). The model and controller are very simple: they take a weight, rate_code and date and query the prices table.
I'm trying to create a suite of tests which tests the existence and correctness of every single price in that chart. It has to be generic and reusable because the data changes every year.
I thought of some ideas:
1) Read the same csv files that were used to import the data into the database and create a single RSpec example that loops around all rows and columns and tests the returned price.
2) Convert the csv files into fixtures for the prices controller and test each fixture.
I'm also having a hard time categorizing these tests. They are not unit tests, but are they really functional/feature tests? They sound more like integration tests, but I'm not sure if Rails differentiates functional tests from integration tests.
Maybe the answer is obvious and I'm not seeing it.
Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence (Edsger Wybe Dijkstra)
First of all, I think you should assume that the CSV parser did its job correctly.
Once you've done that, get the output CSV file and put it in your specs (fixtures, require, load), then test that the output is the one you expect.
If it is correct for every single line of this CSV file (I'm assuming it's a big one, given what I saw in your link), then I would conclude that your code works. I wouldn't bother re-testing it every year, unless it would make you feel more secure about your code.
You might get an anomaly afterwards, but unfortunately your specs cannot detect that; they can simply ensure that everything works for this year's data.
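A rough sketch of idea 1 from the question, assuming a prices.csv fixture with weight, rate_code, and price columns and a Price.lookup query method (all of these names are placeholders for whatever your model actually provides):

require 'csv'

RSpec.describe Price do
  csv_path = Rails.root.join('spec', 'fixtures', 'prices.csv')

  CSV.foreach(csv_path, headers: true) do |row|
    it "returns #{row['price']} for #{row['weight']} kg at rate #{row['rate_code']}" do
      price = Price.lookup(weight: row['weight'].to_f,
                           rate_code: row['rate_code'],
                           date: Date.current)
      expect(price).to eq(BigDecimal(row['price']))
    end
  end
end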

In Rails 3, what is the most efficient way to track the number of visits to a given RESTful resource?

I searched for this and was surprised not to find an answer, so I might be overcomplicating this. But basically I have a couple of RESTful models in my Rails 3 application that I would like to keep track of, in a pretty simple way, just for popularity tracking over time. In my case, I'm only interested in tracking hits on the GET/show method: users log in and view these two resources, and their number of visits goes up with each viewing/page load.
So, I have placed a "visits" column on the Books model:
== AddVisitsToBooks: migrating ===============================================
-- add_column(:books, :visits, :integer)
-> 0.0008s
== AddVisitsToBooks: migrated (0.0009s) ======================================
The column initializes to zero, then, basically, inside the books_controller,
def show
  # hypothetically, we won't let an owner "cheat" their way to being popular
  unless @book.owner == current_user
    @book.visits = @book.visits + 1
    @book.save
  end
end
And this works fine, except now every time a show method is being called, you've got not only a read action for the object record, but a write, as well. And perhaps that gets to the heart of my question; is the total overhead required just to insert the single integer change a big deal in a small-to-midsize production app? Or is it a small deal, or basically nothing at all?
Is there a much smarter way to do it? Everything else I came up with still involved writing to a record every time the given page is viewed. Would indexing the field help, even if I'm rarely searching by it?
The database is PostgreSQL 9, by the way (running on Heroku).
Thanks!
What you described above has one significant con: once a process updates the database (increments the visit counter), the row is locked, and any other process has to wait. I would suggest using a DB sequence for this reason: http://www.postgresql.org/docs/8.1/static/sql-createsequence.html However, you need to maintain the sequence yourself in your code: Ruby on Rails + PostgreSQL: usage of custom sequences
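A minimal sketch of what maintaining such a sequence yourself could look like; the one-sequence-per-book scheme and the naming are assumptions this answer doesn't spell out, so treat it only as a starting point:

# in a migration: one sequence per book (hypothetical scheme)
Book.find_each do |book|
  execute "CREATE SEQUENCE book_#{book.id}_visits_seq"
end

# counting a visit: bump the sequence instead of locking the books row
ActiveRecord::Base.connection.select_value(
  "SELECT nextval('book_#{@book.id}_visits_seq')")

# reading the current total for display
ActiveRecord::Base.connection.select_value(
  "SELECT last_value FROM book_#{@book.id}_visits_seq")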
After some more searching, I decided to take the visits counter off of the models themselves, because as MiGro said, it would be blocking the row every time the page is shown, even if just for a moment. I think the DB sequence approach is probably the fastest, and I am going to research it more, but for the moment it is a bit beyond me, and seems a bit cumbersome to implement in ActiveRecord. Thus,
https://github.com/charlotte-ruby/impressionist
seems like a decent alternative; keeping the view counts in an alternate table and utilizing a gem with a blacklist of over 1200 robots, etc, etc.
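For reference, basic usage of that gem looks roughly like this (recalled from its README, so double-check against the current docs):

# app/models/book.rb
class Book < ActiveRecord::Base
  is_impressionable
end

# app/controllers/books_controller.rb
def show
  impressionist(@book)                  # records a row in the impressions table
  @visits = @book.impressionist_count
end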

Modeling associations between ActiveRecord objects with Redis: avoiding multiple queries

I've been reading / playing around with the idea of using Redis to complement my ActiveRecord models, in particular as a way of modeling relationships. Also watched a few screencasts like this one: http://www.youtube.com/watch?v=dH6VYRMRQFw
It seems like a good idea in cases where you want to fetch one object at a time, but it seems like the approach breaks down when you need to show a list of objects along with each of their associations (e.g. in a View or in a JSON response in the case of an API).
Whereas in the case of using purely ActiveRecord, you can use includes and eager loading to avoid running N more queries, I can't seem to think of how to do so when depending purely on Redis to model relationships.
For instance, suppose you have the following (taken from the very helpful redis_on_rails project):
class Conference < ActiveRecord::Base
  def attendees
    # Attendee.find(rdb[:attendee_ids])
    Attendee.find_all_by_id(rdb[:attendee_ids].smembers)
  end

  def register(attendee)
    Redis.current.multi do
      rdb[:attendee_ids].sadd(attendee.id)
      attendee.rdb[:conference_ids].sadd id
    end
  end

  def unregister(attendee)
    Redis.current.multi do
      rdb[:attendee_ids].srem(attendee.id)
      attendee.rdb[:conference_ids].srem id
    end
  end
end
If I did something like
conferences = Conference.first(20)
conferences.each { |c|
  c.attendees.each { |a| puts a.name }
}
I'm simply getting the first 20 conferences and getting the attendees in each and printing them out, but you can imagine a case where I am rendering the conferences along with a list of the attendees in a list in a view. In the above case, I would be running into the classic N+1 query problem.
If I had modeled the relationship in SQL along with has_many, I would have been able to use the includes function to avoid the same problem.
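For contrast, the SQL-modelled version might look something like this (the registrations join model is assumed here; a constant number of queries is issued no matter how many conferences are loaded):

class Conference < ActiveRecord::Base
  has_many :registrations
  has_many :attendees, :through => :registrations
end

Conference.includes(:attendees).limit(20).each do |conference|
  conference.attendees.each { |attendee| puts attendee.name }
end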
Ideas, links, questions welcome.
Redis can provide major benefits to your application's infrastructure, but I've found that, due to the specific operations you can perform on the various data types, you really need to put some thought ahead of time into how you're going to access your data. In this example, if you are very often iterating over a bunch of conferences and outputting the attendees, and are not otherwise benefiting from Redis' ability to do rich set operations (such as intersections, unions, etc.), maybe it's not a good fit for that data model.
On the other hand, if you are benefiting from Redis in performance-intensive parts of your application, it may be worth eating the occasional N+1 GET on Redis in order to reap those benefits. You have to do profiling on the parts of the app that you care about to see if the tradeoffs are worth it.
You may also be able to structure your data in Redis/your application in such a way that you can avoid the N+1 GETs; for example, if you can get all the keys up front, you can use MGET to get all the keys at once, which is a fast O(N) operation, or use pipelining to avoid network latency for multiple lookups.
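For example, against the Conference/Attendee model above, something along these lines collapses the N SMEMBERS calls into one round trip and the N attendee lookups into one SQL query (the key naming, the conference name attribute, and the block-argument form of pipelined are assumptions that depend on your setup and redis-rb version):

conferences = Conference.first(20)

# one network round trip for all the attendee id sets
id_sets = Redis.current.pipelined do |pipe|
  conferences.each { |c| pipe.smembers("conference:#{c.id}:attendee_ids") }
end

# one SQL query for every attendee we need
attendees_by_id = Attendee.where(:id => id_sets.flatten.uniq).index_by(&:id)

conferences.zip(id_sets).each do |conference, ids|
  names = ids.map { |id| attendees_by_id[id.to_i] }.compact.map(&:name)
  puts "#{conference.name}: #{names.join(', ')}"
end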
In an application I work on, we've built a caching layer that caches the foreign key IDs for has_many relationships so that we can do fast lookups on cached versions of a large set of models that have complex relationships with each other; while fetching these by SQL, we generate very large, relatively slow SQL queries, but by using Redis and the cached foreign keys, we can do a few MGETs without hitting the database at all. However, we only arrived at that solution by investigating where our bottlenecks were and discussing how we might avoid them.
