I'm writing a simple CanadaPost price shipping API, where the prices are given in a PDF document in table format (see page 3 of https://www.canadapost.ca/tools/pg/prices/SBParcels-e.pdf).
The data is extracted into csv files and imported into the database (table prices has columns weight, rate_code and date). The model and controller are very simple: they take a weight, rate_code and date and query the prices table.
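For context, the lookup is roughly along these lines (the rate column and the lookup method name are illustrative, not my actual code):

class Price < ActiveRecord::Base
  # Returns the published rate for an exact weight bracket, rate code and
  # effective date, or nil if no matching row exists.
  def self.lookup(weight, rate_code, date)
    where(weight: weight, rate_code: rate_code, date: date).first.try(:rate)
  end
end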
I'm trying to create a suite of tests which tests the existence and correctness of every single price in that chart. It has to be generic and reusable because the data changes every year.
I thought of some ideas:
1) Read the same csv files that were used to import the data into the database and create a single RSpec example that loops over all rows and columns and tests the returned price.
2) Convert the csv files into fixtures for the prices controller and test each fixture.
I'm also having a hard time categorizing these tests. They are not unit tests, but are they really functional/feature tests? They sound more like integration tests, but I'm not sure if Rails differentiates functional tests from integration tests.
Maybe the answer is obvious and I'm not seeing it.
Program testing can be a very effective way to show the presence of bugs, but is hopelessly inadequate for showing their absence (Edsger Wybe Dijkstra)
First of all, I think you should assume that the CSV parser did its job correctly.
Once you've done that, take the output CSV file and load it in your specs (as a fixture, via require, or load), then test that the output matches what you expect.
If it is correct for every single line of this csv file (I'm assuming it's a big one, given what I see in your link), then I would conclude that your code works. I wouldn't bother re-testing it every year, unless doing so makes you feel more secure about your code.
You might get an anomaly afterwards, but unfortunately your specs cannot detect that; they can only ensure that your code works for this year's data.
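As a sketch of what that could look like, assuming the CSV has weight, rate_code and price columns and that your model exposes something like a Price.lookup(weight, rate_code, date) method (adjust both to your real names):

require "csv"

describe "Canada Post price chart" do
  csv_path = Rails.root.join("db", "data", "prices_2015.csv") # wherever the import file lives

  it "returns the published price for every weight/rate_code cell" do
    CSV.foreach(csv_path, headers: true) do |row|
      price = Price.lookup(row["weight"], row["rate_code"], Date.new(2015, 1, 1))
      expect(price).to eq(BigDecimal(row["price"])),
        "wrong price for weight=#{row['weight']} rate_code=#{row['rate_code']}"
    end
  end
end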
How can I hit multiple APIs like example.com/1000/getUser, example.com/1001/getUser in Gatling? They are GET calls.
Note: the numbers start from a non-zero integer.
It's hard to give good advice based on the small amount of information in your question, but I'm guessing that passing the userIDs in with a feeder could be a simple, straightforward solution. A lot depends on how your API works, what kind of tests you're planning, and how many users (I'm assuming the numbers are user IDs) you need to test with.
If you need millions of users, a custom feeder that generates increments would probably be better, but beyond that the strategy would be the same. I advise you to read the feeder documentation for more information, both on usage in general and on how to write custom feeders: https://gatling.io/docs/3.0/session/feeder/
As an example, if you just need a relatively small number of users, something along these lines could work:
Make a simple csv file (for example named userid.csv) with all your userIDs and add it to the resources folder:
userid
1000
1001
1002
...
...
The .feed() step adds one value from the csv file to your Gatling user session, which you can then read the same way you would read any other session value. Each of the ten users injected in this example gets the next record from the csv file.
setUp(
  scenario("ScenarioName")
    .feed(csv("userid.csv"))
    .exec(http("Name of your request").get("/${userid}/getUser"))
    .inject(
      atOnceUsers(10)
    )
).protocols(http.baseUrl("http://example.com"))
I've done some research and found Ice Cube, but it stores the schedule as YAML in a TEXT SQL type column.
The goal is to find something that is standard as far as storing schedules and allows, for example, a query interface for "all records where the schedule includes TODAY()".
It seems this is a hard problem, and the leading solution (that I could find) is to create an occurrences table. If there's no end time to the schedule, then you have to make a decision as to how many occurrences to store. Are there any best practices here?
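To make the occurrences idea concrete, this is roughly the shape I have in mind (all the names here are hypothetical):

class CreateOccurrences < ActiveRecord::Migration
  def change
    create_table :occurrences do |t|
      t.references :event, null: false
      t.date       :occurs_on, null: false
    end
    add_index :occurrences, [:event_id, :occurs_on], unique: true
  end
end

# "All events whose schedule includes today" then becomes an ordinary query:
Event.joins(:occurrences).where(occurrences: { occurs_on: Date.current })

For schedules without an end date the table only holds occurrences out to some horizon (a year or two ahead, say), which is exactly the decision I'm asking about.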
I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data, two things have to happen:
1) Any dates or currencies need to be cleaned up: dates are represented as strings in the csv and need to be converted into Rails Date objects, and currencies need to be converted to decimals by removing any commas or dollar signs.
2) If a row already exists, it has to be updated; uniqueness is checked based on two columns.
Currently I use simple CSV import code:
CSV.foreach("orders.csv") do |row|
order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
order.quantity = row[1]
order.value= parse_currency(row[2])
order.order_date = parse_date(row[4])
order.status = row[5]
order.save!
end
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
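For completeness, the helpers are roughly along these lines (the exact formats shown are just illustrative):

require "bigdecimal"
require "date"

# "$1,234.56" -> BigDecimal("1234.56")
def parse_currency(raw)
  BigDecimal(raw.to_s.gsub(/[$,]/, ""))
end

# "12/31/2015" -> Date; the format string matches the csv being imported
def parse_date(raw)
  Date.strptime(raw, "%m/%d/%Y")
end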
I can add a check to skip rows whose records already exist, which should save a little time, but I am looking for something significantly faster. Currently, importing around 100k rows takes around 30 minutes with an empty database, and it will only get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from two places:
1) The multiple database calls per row, i.e. the first_or_initialize. This ends up issuing several SQL statements, because it has to first find the record, then update it, and then save it.
2) Bandwidth. Each time the SQL server is called, data flows back and forth, which adds up to a lot of time.
Now for my question: how do I move the update/create logic into the database? That is, if an order already exists for a given sku and customer_email the record should be updated, otherwise a new record should be created. Currently I use first_or_initialize to fetch the record if it exists and update it, or else build a new one and save it. How do I express that in SQL?
I could run a raw SQL query using ActiveRecord's connection.execute, but I don't think that would be a very elegant way of doing it. Is there a better way?
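For example, something along these lines is roughly what I have in mind, assuming PostgreSQL 9.5+ and a unique index on (sku, customer_email), though I'm not sure it's the right approach:

conn = ActiveRecord::Base.connection

CSV.foreach("orders.csv") do |row|
  conn.execute(<<-SQL)
    INSERT INTO orders (sku, quantity, value, customer_email, order_date, status)
    VALUES (#{conn.quote(row[0])}, #{conn.quote(row[1])}, #{conn.quote(parse_currency(row[2]))},
            #{conn.quote(row[3])}, #{conn.quote(parse_date(row[4]))}, #{conn.quote(row[5])})
    ON CONFLICT (sku, customer_email)
    DO UPDATE SET quantity   = EXCLUDED.quantity,
                  value      = EXCLUDED.value,
                  order_date = EXCLUDED.order_date,
                  status     = EXCLUDED.status
  SQL
end

This collapses the find/update/save into a single statement per row, but it still makes one round trip per row.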
Since Ruby 1.9, FasterCSV has been part of the Ruby core, so you don't need a special gem; simply use CSV.
At 100k records in 30 minutes, your import spends about 0.018 seconds per record. In my opinion most of that time is spent in Order.first_or_initialize: this part of your code makes an extra round trip to your database, and initializing an ActiveRecord object takes time too. But to really be sure, I would suggest that you benchmark your code:
Benchmark.bm do |x|
  x.report("CSV eval")       { CSV.foreach("orders.csv") {} }
  x.report("Init")           { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # random values to prevent query caching
  x.report("parse_currency") { 1.upto(100_000) { parse_currency(...) } }
  x.report("parse_date")     { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord entirely.
You can try the activerecord-import gem, or you can go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
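As a rough, untested sketch of the activerecord-import route (assuming a recent version of the gem, PostgreSQL, and a unique index on sku + customer_email):

require "csv"

orders = []
CSV.foreach("orders.csv") do |row|
  orders << Order.new(
    sku:            row[0],
    quantity:       row[1],
    value:          parse_currency(row[2]),
    customer_email: row[3],
    order_date:     parse_date(row[4]),
    status:         row[5]
  )
end

# A handful of bulk INSERT ... ON CONFLICT statements instead of ~100k round trips
Order.import(
  orders,
  validate: false,
  batch_size: 1_000,
  on_duplicate_key_update: {
    conflict_target: [:sku, :customer_email],
    columns: [:quantity, :value, :order_date, :status]
  }
)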
I am trying to count the number of Postgres statements my Ruby on Rails application is performing against our database. I found this entry on Stack Overflow, but it counts transactions. We have several transactions that issue very large numbers of statements, so that doesn't give a good picture. I am hoping the data is available from PG itself rather than by trying to parse a log.
https://dba.stackexchange.com/questions/35940/how-many-queries-per-second-is-my-postgres-executing
I think you are looking for ActiveSupport instrumentation. Part of Rails, this framework is used throughout Rails applications to publish certain events. For example, there's an sql.active_record event type that you can subscribe to in order to count your queries.
counter = 0
ActiveSupport::Notifications.subscribe "sql.active_record" do |*args|
  counter += 1
end
You could put this in config/initializers/ (to count across the app) or in one of the various before_ hooks of a controller (to count statements for a single request).
(The fine print: I have not actually tested this snippet, but that's how it should work AFAIK.)
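For the per-request case, a sketch (equally untested) using the block form of the subscription so the counter is removed again after the action finishes; the controller name and around_action (around_filter in older Rails) are just for illustration:

class OrdersController < ApplicationController
  around_action :count_sql_statements

  private

  def count_sql_statements
    count = 0
    counter = lambda { |*args| count += 1 }
    ActiveSupport::Notifications.subscribed(counter, "sql.active_record") do
      yield
    end
    Rails.logger.info "Request issued #{count} SQL statements"
  end
end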
PostgreSQL provides a few facilities that will help.
The main one is pg_stat_statements, an extension you can install to collect statement statistics. I strongly recommend this extension; it's very useful. It can tell you which statements run most often, which take the longest, etc. You can query it to add up the number of queries for a given database.
To get a rate over time you should have a script sample pg_stat_statements regularly, creating a table with the values that changed since last sample.
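For example, a one-off sample from Ruby could be as simple as this (assuming the extension is installed and the connecting role is allowed to read it):

sql = <<-SQL
  SELECT sum(calls)
  FROM pg_stat_statements
  WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
SQL

total_statements = ActiveRecord::Base.connection.select_value(sql).to_i
# Sample this periodically and diff successive values to get statements per second.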
The pg_stat_database view tracks values including the transaction rate. It does not track number of queries.
There's pg_stat_user_tables, pg_stat_user_indexes, etc, which provide usage statistics for tables and indexes. These track individual index scans, sequential scans, etc done by a query, but again not the number of queries.
I've got an application that needs quite a bit of data (thousands of records) to do appropriate testing. The only way I've found to get a decent set of testable, sensible data is to use a subset of my production DB. I've converted this to YAML fixtures in the normal test/fixtures location.
This works, but now I have a bunch of seemingly brittle tests and assertions that depend on there being a particular number of records that meet condition X...
Example:
def test_children_association
  p = Parent.find(1)
  assert_equal 18, p.children.count, "Parent.children isn't providing the right records"
end
This doesn't seem like a good idea to me, but I'm not sure if there is a better / accepted way to test an application that needs a large hierarchy of data.
Magic numbers in tests aren't an anti-pattern. Your tests need to be so dead-simple that you don't need to test them. This means you'll have some magic numbers. This means that your tests will break when you change small bits of functionality. This is good.
Fixtures have some problems, but there are a few simple things you can do to make them easier to work with:
Only have baseline data in your fixtures, the sort of data that most of your tests need but don't care about. This will involve a time investment up front, but it's better to take the pain early than write poor unit tests for the life of the project.
Add the data to be tested in the context of the test. This improves readability of your tests and saves you from writing "make sure nobody messed up the fixtures" sanity checks at the beginning of your unit tests.
The first thing I'd say is: what are you testing in that example? If it's an ordinary AR has_many association, then I wouldn't bother writing a test for it. All you're doing is testing that AR works.
A better example might be if you had a very complicated query or if there was other processing involved in getting the list of children records. When you get them back, rather than testing for a count you could iterate through the returned list and verify that the children match the criteria you're using.
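For instance, something like this, where children_born_after is a made-up stand-in for whatever complicated query you actually have:

def test_children_born_after_returns_only_recent_children
  parent = Parent.find(1)
  cutoff = Date.new(2000, 1, 1)
  children = parent.children_born_after(cutoff)

  assert !children.empty?, "expected at least one matching child in the fixtures"
  children.each do |child|
    assert child.born_on >= cutoff, "child #{child.id} was born before the cutoff"
  end
end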
What I've found most useful in this situation is not using fixtures at all, but rather constructing the database objects on the fly, like:
def test_foo
  project = Project.create valid_project.merge(....)
  # do assertions here
end
and in my test_helpers I'd have a bunch of methods:
def valid_project
  { :user_id => 23, :title => "My Project" }
end

def invalid_project
  valid_project.merge(:title => nil)
end
I found that the pain of having to build massive collections of test objects has led me naturally to design simpler and more versatile class structures.
Cameron's right: What are you testing?
What sort of system needs 1000s of records present to test? Remember, your tests should be as tiny as possible and should be testing application behavior. There's no way it needs thousands of records for the vast majority of those tests.
For little bits of behavior tests where you need object relationships, consider mock objects. You'll only be speccing out the exact minimum amount of behavior necessary to get your test to pass, and they won't hit the DB at all, which will amount to a huge performance gain in your test suite. The faster it is to run, the more often people will run it.
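A tiny sketch with the mocha gem (which I'm assuming here; oldest_child is a made-up method under test):

require "mocha/minitest" # "mocha/setup" on older versions

def test_oldest_child_picks_the_highest_age
  parent = Parent.new
  # No database rows needed: stub the association with plain stub objects
  parent.stubs(:children).returns([stub(age: 4), stub(age: 9), stub(age: 7)])

  assert_equal 9, parent.oldest_child.age
end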
I may have a unique situation here, but I really did need quite a few records for testing this app (I got it down to 150 or so). I'm analyzing historical data and have numerous levels of has_many. Some of my methods do custom SQL queries across several tables which I might end up modifying to use ActiveRecord.find but I needed to get the test running first.
Anyway, I ended up using some ruby code to create the fixtures. The code is included in my test_helper; it checks the test DB to see if the data is stale (based on a time condition) and wipes and recreates the records procedurally. In this case, creating it procedurally allows me to know what the data I'm testing for SHOULD be, which is safer than using a subset of production data and hoping the numbers I calculate the first time are what I should test for in the future.
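The test_helper code is roughly along these lines (simplified, and the model names are stand-ins):

def ensure_historical_data!
  newest = Parent.order("created_at DESC").first
  return if newest && newest.created_at > 12.hours.ago # data is still fresh

  Child.delete_all
  Parent.delete_all

  # Recreate records procedurally so the expected numbers are known in advance
  3.times do |i|
    parent = Parent.create!(name: "Parent #{i}")
    6.times { |j| parent.children.create!(position: j) }
  end
end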
I also moved to using Shoulda which, along with many other useful things, makes ActiveRecord association testing as easy as:
should_have_many :children
should_belong_to :parent