The Situation
So I have created some code in the form of modules that each represent a medical questionnaire (I'm calling them Catalogs). Each questionnaire has its own module, as they may differ slightly in their content and associated calculations, but they are essentially made up of simple questions with boolean or numeric responses. Here is an example:
http://www.janssenmedicalinformation.ca/assets/pdf/HarveyBradshaw_English.pdf
These Catalog modules are included in an Entry class that collects responses matching the question names. Each questionnaire is transformed into a DEFINITION, which the Entry uses to:
Validate inputs
Check completeness
Calculate scoring
Here are two examples for reference that illustrate the duplication problem: much of the code is similar but not exactly the same.
https://gist.github.com/theworkerant/3a074d5d2a642ded1b96
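To give a rough idea of the general shape without opening the gist, a Catalog module looks more or less like this (simplified, with made-up question names and ranges, so treat it as illustrative rather than the real code):

# Illustrative sketch only: invented question names, simplified DEFINITION
module Catalogs
  module HarveyBradshaw
    DEFINITION = {
      general_wellbeing: { type: :integer, range: 0..4 },
      abdominal_pain:    { type: :integer, range: 0..3 },
      arthralgia:        { type: :boolean }
    }.freeze

    def harvey_bradshaw_score
      DEFINITION.keys.sum do |question|
        case (value = send(question))   # the including Entry holds the responses
        when true       then 1          # a "yes" answer counts as 1
        when false, nil then 0
        else                 value.to_i
        end
      end
    end
  end
end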
The Problem
There is a lot of duplication here, but I'm not sure about the best strategy to remove it. There are a few things that make this particular problem difficult and make me lean towards accepting some duplication rather than a system that is too strict to work. The system needs to remain flexible enough to accommodate currently unknown medical questionnaires of a similar nature, so I need to be careful (which is why I've gone with a module system so far).
Here are some examples:
Each Catalog can have slightly different scoring requirements and custom grouping of questions that represent one "score"
Potentially many Catalogs are included in an Entry class and can't step on each other
Some Catalogs incorporate things like "Current Weight" for calculations, breaking the 1-5 or 1-10 paradigm and not fitting very nicely into simple sum reductions.
One Catalog requires a week of previous entries in order to be valid, a sort of weird custom validation.
The Question
What strategies might be employed here to reduce duplication overall? I'm not looking for tweaks that cut out a few lines from these specific examples. Implementation cost is a consideration.
Possibilities:
Put some of this into a database (sounds pretty good, but I think the cost of implementation could be high)
I fear there could be room for improvement in my metaprogramming here, perhaps there are better ways to accomplish this through some dynamic method creation or other voodoo.
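For example, this is the kind of dynamic method creation I have in mind (hypothetical names, sketched only to show the direction): each Catalog could declare its score groups as data and have the scoring methods generated for it:

# Hypothetical sketch: generate *_score methods from declared question groups
module CatalogScoring
  def score_group(name, questions)
    define_method("#{name}_score") do
      # assumes numeric responses; boolean questions would need mapping to 0/1
      questions.sum { |question| send(question).to_i }
    end
  end
end

module Catalogs
  module SomeCatalog
    extend CatalogScoring

    score_group :pain,    [:abdominal_pain, :back_pain]
    score_group :general, [:wellbeing, :energy_level]
  end
end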
Thanks!
Related
I am interested in what the best practice is for a model that has a lot of data attached to it. Most of my app revolves around one model (SKU), and it seems to have more and more things associated with it.
For example, my SKU model has multiple prices, dimensions, weight, recommended prices for multiple price levels, title, description, shelf life, etc. Would it make sense to break all the pricing info out into another table? Or break the SKU up into its different uses and associate them? For example, WebSKU, StockSKU, etc.
As mentioned in the answer linked by Tom, if all your attributes really belong to that model there is no reason to break it up. However, if you have columns like price1, price2, price3 or dimension_x_1, dimension_y_1, dimension_x_2, dimension_y_2, etc, then it usually means you should be creating another table to contain those.
For example, you could set it up so that you have the following models:

class Sku < ActiveRecord::Base
  has_many :prices
  has_many :dimensions
end

class Price < ActiveRecord::Base
  belongs_to :sku
end

class Dimension < ActiveRecord::Base
  belongs_to :sku
end
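With that in place, the extra data hangs off the SKU through ordinary associations (the column names below are just examples, not a suggested schema):

sku = Sku.first
sku.prices.create(level: "web", amount: 9.99)
sku.dimensions.pluck(:width, :height)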
As everyone else has said, the design of a database should follow the logic behind it. Why? Mainly because it will be easier to maintain and understand.
I was also going to draw attention to normalization rules, as @sawa did.
Generally, it is a good approach to normalize your database, as it provides several advantages. You should read this Wikipedia link (at least as a starting point).
Following the normal forms will help you design your database taking into account the logic behind your data.
But denormalization also has its advantages, the first (and most commonly considered) being read performance. It basically means keeping data in one table that you would have split across different tables under the normal forms, and it generally makes sense when that data has some logical relation.
You have to aim to achieve a balance depending on the problem you are facing.
On the other hand, from the tags on your post I can see you are using Ruby on Rails, which uses the Active Record pattern. One consequence of the database model you are presenting is that you will probably end up with a domain model that is just as complex, i.e. very large. I don't know every detail of your project, but I suspect it would quickly grow into a god object, making your code hard to maintain, extend and understand.
A database should be designed not according to how many columns it has, but according to its logic, particularly following Codd's normal forms. If there is systematic redundancy in your database, that is a sign it should be split into multiple tables. If not, keep it as it is.
I think it is good to design the data model taking into account how the DB engine works with files and memory. The first bottleneck of PostgreSQL is file I/O; memory consumption is also an important factor. When PostgreSQL reads table data (FYI: table data is not read during index-only scans), it reads 8 KB pages (a compile-time parameter). The more tuples fit in such a page, the less file I/O and memory consumption, the better the cache usage (more frequent hits, faster prewarming, etc.) and the better the performance.
So, if you have a really high-load project, it can be useful to think about separating frequently used data into isolated tables (and, as a next step, placing those tables in a separate tablespace on an SSD or a powerful RAID).
That is, there should be some balance between logical simplicity and performance tweaks.
I've asked a couple of questions around this subject recently, and I think I'm managing to narrow down what I need to do.
I am attempting to create some "metrics" (quotes because these should not be confused with metrics relating to the performance of the application; these are metrics that are generated based on application data) in a Rails app; essentially I would like to be able to use something similar to the following in my view:
metric(@customer, 'total_profit', '01-01-2011', '31-12-2011').result
This would give the total profit for the given customer for 2011.
I can, of course, create a metric model with a custom result method, but I am confused about the best way to go about creating the custom metrics (e.g. total_profit, total_revenue, etc.) in such a way that they are easily extensible so that custom metrics can be added on a per-user basis.
My initial thoughts were to attempt to store the formula for each custom metric in a structure with operand, operation and operation_type models, but this quickly got very messy and verbose, and was proving very hard to do in terms of adding each metric.
My thoughts now are that perhaps I could create a custom metrics helper method that would hold each of my metrics (thus I could just hard code each one, and pass variables to each method), but how extensible would this be? This option doesn't seem very rails-esque.
Can anyone suggest a better alternative for approaching this problem?
EDIT: The answer below is a good one in that it keeps things very simple, though I'm concerned it may be fraught with danger, as it uses eval (so there is no prospect of ever allowing user-supplied code). Is there another option for doing this? My previous approach, where operands etc. were broken down into chunks, used a combination of constantize and instance_variable_get; is there a way these could be used to make the execution of a string safer?
This question was largely answered with some discussion here: Rails - Scalable calculation model.
For anyone who comes across this, the solution is essentially to ensure an operation always has two operands, but an operand can be either an attribute or the result of a previous calculation (i.e. it can be a metric itself), which makes it highly scalable. This avoids the need to eval anything, and thus avoids the potential security holes that eval entails.
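A bare-bones sketch of that structure (class, operation and attribute names are illustrative, not the code from the linked question):

# An operand is either an attribute name on the record or another
# Metric, so formulas compose without ever calling eval.
class Metric
  OPERATIONS = %w[+ - * /].freeze

  def initialize(operation, left, right)
    raise ArgumentError, "unsupported operation" unless OPERATIONS.include?(operation)
    @operation, @left, @right = operation, left, right
  end

  def result(record)
    resolve(@left, record).public_send(@operation, resolve(@right, record))
  end

  private

  def resolve(operand, record)
    operand.is_a?(Metric) ? operand.result(record) : record.public_send(operand)
  end
end

# total_profit = revenue - cost; margin then reuses it as an operand
total_profit = Metric.new("-", :total_revenue, :total_cost)
margin       = Metric.new("/", total_profit, :total_revenue)
# margin.result(customer)  # assuming customer responds to those attribute methods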
I'm building an application where I will be gathering statistics from a game. Essentially, I will be parsing logs where each line is a game event. There are around 50 different kinds of events, but a lot of them are related. Each event has a specific set of values associated with it, and related events share a lot of these attributes. Overall there are around 50 attributes, but any given event only has around 5-10 attributes.
I would like to use Rails for the backend. Most of the queries will be event type related, meaning that I don't especially care about how two event types relate with each other in any given round, as much as I care about data from a single event type across many rounds. What kind of schema should I be building and what kind of database should I be using?
Given a relational database, I have thought of the following:
Have a flat structure, where there are only a couple of tables, but the events table has as many columns as there are overall event attributes. This would result in a lot of nulls in every row, but it would let me easily access what I need.
Have a table for each event type, among other things. This would let me save space and improve performance, but it seems excessive to have that many tables given that events aren't really separate 'ideas'.
Group related events together, minimizing both the number of tables and the number of attributes per table. The problem then becomes the grouping. It is far from clear cut, and it could take a long time to properly establish event supertypes. Also, it doesn't completely solve the problem of there being a fair amount of nulls.
It was also suggested that I look into using a NoSQL database, such as MongoDB. It seems very applicable in this case, but I've never used a non-relational database before. It seems like I would still need a lot of different models, even though I wouldn't have tables for each one.
Any ideas?
This feels like a great use case for MongoDB and a very awkward fit for a relational database.
The types of queries you would be making against this data are key to the best schema design, but imagine that your documents (in a single collection, similar to option 1 above) look something like this:
{ "round" : 1,
"eventType": "et1",
"attributeName": "attributeValue",
...
}
You can easily query by round, by eventType, getting back all attributes or just a specified subset, etc.
You don't have to know up front how many attributes you might have, which ones belong with which event types, or even how many event types you have. As you build your prototype/application you will be able to evolve your model as needed.
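For example, with an ODM like Mongoid the model could stay as thin as this (the field names are placeholders, and dynamic attributes are opt-in via a module in recent Mongoid versions):

class GameEvent
  include Mongoid::Document
  # Lets each document carry only the attributes its event type actually has
  include Mongoid::Attributes::Dynamic

  field :round,      type: Integer
  field :event_type, type: String
end

# All events of one type across many rounds, projecting just a few fields:
GameEvent.where(event_type: "et1").only(:round, :event_type)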
There is a very large active community of Rails/MongoDB folks and there's a good chance that you can find a lot of developers you can ask questions and a lot of code you can look at as examples.
I would encourage you to try it out, and see if it feels like a good fit. I was going to add some links to help you get started but there are too many of them to choose from!
You might also wonder whether or not to use an object mapper, so here's a good answer to that.
A good write-up of dealing with dynamic attributes with Ruby and MongoDB is here.
I wonder what sort of things you look for when you start working on an existing system that is new to you. Let's say the system is quite big (whatever that means to you).
Some of the things that were identified are:
Where is a particular subroutine or procedure invoked?
What are the arguments, results and predicates of a particular function?
How does the flow of control reach a particular location?
Where is a particular variable set, used or queried?
Where is a particular variable declared?
Where is a particular data object accessed, i.e. created, read, updated or deleted?
What are the inputs and outputs of a particular module?
But if you look for something more specific, or any of the above questions is particularly important to you, please share it with us :)
I'm particularly interested in something that could be extracted in dynamic analysis/execution.
I like to use a "use case" approach:
First, I ask myself "what's this software's purpose?": I try to identify how users are going to interact with the application;
Once I have some "use cases", I try to understand which objects are most involved and how they interact with other objects.
Once I've done this, I draw a UML-type diagram that describes what I've just learned, for further reference. What happens next depends on the task I've been assigned, e.g. modifying the code, documenting the code, etc.
There is the question of what motivation I have for learning the new system:
Bug fix/minor enhancement - In this case, I may focus solely on the portion of the system that performs the specific function that needs to be altered. This is a way to break down a huge system, but it is also a way to identify whether the issue is something I can fix or something I have to hand to the off-the-shelf company whose software we are using; e.g. a CRM, CMS, or ERP system can be a customized off-the-shelf system, so there are many pieces to it.
Project work - This is the other case, and it is where I'd probably try to build myself a view from 30,000 feet or so, to know what the high-level components are and which areas of the system the project impacts. An example of this is when I join a company and work off an existing code base, but don't have the luxury of the small focus of the previous case. Part of building that view is looking for patterns in the code in terms of naming conventions, project structure, etc., as this may be useful once I start changing code in the system. I'd probably do some tracing through the system and try to see where the uglier parts of the code are. By uglier I mean those parts that are kludge-like and may have some spaghetti code, because they were rushed when first written and are now being heavily reworked.
To my mind, another way to view this is to ask whether I'm going to spend days or weeks wrapping my head around the system, as in the second case, or whether it should, optimistically, take only a few hours to get my footing and make the necessary changes.
<tl;dr>
In source version control diff patch generation, would it be worth it to use the optimizations listed at the very bottom of this writing (see <optimizations>) in my Ruby implementation of diff for making diff patches?
</tl;dr>
<introduction>
I am programming something I have never done before, and there might already be tools out there that do the exact thing I am programming, but at this point I am having too much fun to care, so I am still going to do it from scratch, even if there is a tool for this.
So anyway, I am working on a Ruby on Rails app and need a certain feature. Basically, I want each entry in a table of mine, let's say for example a table of video games, to have a stored chunk of text that represents a review or something of the sort for that table entry. However, I want this text to be editable by any registered user and also to keep track of different submissions in a version control system. The simplest solution I could think of is to keep track of the text body and the diff patch history of its different versions as objects in Ruby, and then serialize them, preferably in human-readable form (so I'll most likely use YAML for this), so they can be edited if needed, for example after corruption by a software bug or a mistake made by an admin doing some version editing.
So at first I just tried to dive head first into this feature, only to find that generating a diff patch efficiently is more difficult than I thought. So I did some research and came across some ideas, some of which I have implemented already and some of which I have not. However, it all pretty much revolves around the longest common subsequence problem, as you will already know if you have done anything with diff or diff-like features, and around optimizing the function that solves it.
Currently I have it so that it truncates the compared versions of the text body from the beginning and end until non-matching lines are found. Then it solves the problem using a comparison matrix, but instead of incrementing the value stored in a cell when it finds a matching line, as in most longest common subsequence examples I have seen, I increment when I find a non-matching line, so as to calculate edit distance instead of longest common subsequence. As far as I can tell, the two approaches are essentially two sides of the same coin, so either could be used to derive an answer. It then back-traces through the comparison matrix, noting when there was an incrementation and in which adjacent cell (west, northwest, or north), to determine that line's diff entry, and assumes all other lines to be unchanged.
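For concreteness, the general shape of what I have looks roughly like this (an illustrative sketch, not my actual code, and omitting the prefix/suffix truncation step):

# Illustrative sketch: fill an edit-distance matrix over the lines, then
# backtrace through it, emitting :equal / :replace / :delete / :insert
# entries depending on which adjacent cell the path came from.
def line_diff(old_lines, new_lines)
  n, m = old_lines.size, new_lines.size
  dist = Array.new(n + 1) { Array.new(m + 1) }
  (0..n).each { |i| dist[i][0] = i }
  (0..m).each { |j| dist[0][j] = j }

  (1..n).each do |i|
    (1..m).each do |j|
      cost = old_lines[i - 1] == new_lines[j - 1] ? 0 : 1
      dist[i][j] = [
        dist[i - 1][j] + 1,        # north: delete a line
        dist[i][j - 1] + 1,        # west: insert a line
        dist[i - 1][j - 1] + cost  # northwest: keep or replace
      ].min
    end
  end

  ops, i, j = [], n, m
  while i > 0 || j > 0
    if i > 0 && j > 0 && old_lines[i - 1] == new_lines[j - 1] &&
       dist[i][j] == dist[i - 1][j - 1]
      ops << [:equal, old_lines[i - 1]]
      i -= 1; j -= 1
    elsif i > 0 && j > 0 && dist[i][j] == dist[i - 1][j - 1] + 1
      ops << [:replace, old_lines[i - 1], new_lines[j - 1]]
      i -= 1; j -= 1
    elsif i > 0 && dist[i][j] == dist[i - 1][j] + 1
      ops << [:delete, old_lines[i - 1]]
      i -= 1
    else
      ops << [:insert, new_lines[j - 1]]
      j -= 1
    end
  end
  ops.reverse
end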
Normally I would leave it at that, but since this is going into a Rails environment and not just some stand-alone Ruby script, I started worrying about optimizing at least enough that a spammer who somehow knew how I implemented the version control system, and knew my worst-case-scenario entry, still wouldn't be able to hit the server that badly. After some searching and reading of research papers and articles on the internet, I've come across several ideas that seem decent, but all of them seem to have pros and cons, and I am having a hard time deciding how the pros and cons balance out in this situation. So are the ones listed here worth it? I have listed them with their known pros and cons.
</introduction>
<optimizations>
Chop the compared sequences into multiple subsequences by splitting where lines are unchanged, and then truncating each section of unchanged lines at the beginning and end of each section. Then solve the edit distance of each subsequence.
Pro: Changes the time increase as the changed area gets bigger from a quadratic increase to something closer to a linear increase.
Con: Figuring out where to split already seems to require solving edit distance, except now you don't care how it is changed. It would be fine if this were solvable by a process closer to solving Hamming distance, but a single insertion would throw that off.
Use a cryptographic hash function to both convert all sequence elements into integers and ensure uniqueness. Then solve the edit distance comparing the hash integers instead of the sequence elements themselves.
Pro: Comparing two integers is faster than comparing two strings, so a slight performance gain is made on every comparison, which can add up overall.
Con: Using a cryptographic hash function takes time to convert all the sequence elements, and may end up costing more time for the conversion than you gain back from the integer comparisons. You could use the built-in hash function for a string, but that will not guarantee uniqueness.
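To make this concrete, the conversion step I have in mind looks roughly like this (illustrative only; line_diff stands for the matrix routine sketched earlier):

require 'digest'

# Map each line to a 64-bit integer taken from its SHA-256 digest.
# Truncating to 64 bits means collisions are theoretically possible,
# just astronomically unlikely.
def to_hash_ids(lines)
  lines.map { |line| Digest::SHA256.digest(line).unpack1("Q>") }
end

old_ids = to_hash_ids("alpha\nbeta\ngamma\n".lines)
new_ids = to_hash_ids("alpha\nbeta!\ngamma\n".lines)
# line_diff(old_ids, new_ids) then compares integers instead of strings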
Use lazy evaluation to only calculate the three center-most diagonals of the comparison matrix, and then only calculate additional diagonals as needed. Then also use this approach to possibly remove the need, in some comparisons, to compare all three adjacent cells, as described here.
Pro: Can turn an algorithm that always takes O(n * m) time into one where only the worst case takes that long; the best case becomes practically linear, and the average case is somewhere between the two.
Con: It is an algorithm I've only seen implemented in functional programming languages, and I am having a difficult time working out how to convert it into Ruby based on how it is described at the site linked above.
Make a C module and do the hard work at the native level in C and just make a Ruby wrapper for it so Ruby can make all the calls to it that it needs.
Pro: I have to imagine that evaluating something like this in C could be a LOT faster.
Con: I have no idea how Rails handles apps with Ruby code that has C extensions, and it hurts the portability of the app.
This is an optimization for after the edit distance has been solved, but the idea is to store additional combined diffs alongside the ones produced for each version, building a delta-tree data structure with the most recently made diff as the root node, so that getting to any version takes worst-case O(log n) time instead of O(n).
Pro: Would make going back to an old version a lot faster.
Con: It would mean that on every new commit the delta-tree gets a new root node, which costs time to reorganize the tree, for an operation that will be carried out far more often than going back a version, not to mention that, when you do go back, it is unlikely to be to a very old version.
</optimizations>
So are these things worth the effort?
With regard to item 4 in your list, this seems to be (from what I can tell) how most gems work if there is any heavy lifting to be done by the code. Rails plays nicely with the gem system, so you should find that if you need to incorporate this - probably alongside the other optimisations you have suggested here - it should be fine, although you may need to recompile for different platforms.