Design Pattern for Modeling Actuals that replace Estimates - data-warehouse

What, if any, is a good best practice or approach for a use case where a given business activity uses estimates that are then replaced by actuals as they become available? In the same way that effective dates can be used to "automatically" (without users having to know about it) retrieve historically accurate dimension rows, is there a similar way to have actuals "automatically" replace the estimates without overwriting the data? I'd rather not have separate fact tables or columns and require users to "know" about this and manually change it to get the latest actuals.

Why not have 2 measures in your fact table, one for estimate and one for actual?
You could then have a View over the fact table with a single measure calculated as "if actual = 0 then estimate else actual".
Users who just need the current position can use the View; users who need the full picture can access the underlying fact table.
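For illustration only, here is a minimal sketch of such a view. Every name (activity_facts, estimate_amount, actual_amount, reported_amount) is invented; the SQL sits in a Ruby heredoc simply so it could be run from a migration or script.

    # Hypothetical sketch: table and column names are assumptions, not the asker's schema.
    CREATE_CURRENT_VIEW_SQL = <<~SQL
      CREATE VIEW current_activity_facts AS
      SELECT f.*,
             CASE
               WHEN f.actual_amount IS NOT NULL AND f.actual_amount <> 0
                 THEN f.actual_amount
               ELSE f.estimate_amount
             END AS reported_amount
      FROM activity_facts f;
    SQL

    # e.g. run from a Rails migration, or paste the SQL directly into the warehouse:
    ActiveRecord::Base.connection.execute(CREATE_CURRENT_VIEW_SQL)

Anyone reading reported_amount from the view gets the actual as soon as it exists, while the fact table keeps both measures untouched.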

Related

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches for building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of this method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key: value} features. I choose to model a training dataset by collecting all the features observed in the data (the set of all unique keys) and treating that as the model's feature space. I define each sample by setting its values for the keys it contains and None for the features it does not include.
Given this training data set, I want to determine which features recur in the data and, for those recurring features, whether they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features, calculate the distribution of its values. However, as M and N are potentially large, I wonder whether there is a more compact way to represent the data, or a more sophisticated method for making claims about feature frequencies.
Am I reinventing an existing wheel? If there is an online approach for accomplishing this task, even better.
If I understand your question correctly, you need to go over all the data anyway, so why not use hashing?
Actually, two hash tables:
An inner hash table, per feature, for the distribution of that feature's values.
An outer hash table for feature existence.
This way, the size of each inner hash table indicates how common the feature is in your data, and the actual values stored there indicate how much they differ from one another. Another thing to notice is that you go over your data only once, and (if you allocate enough space from the beginning) the time complexity of almost every hash-table operation is O(1).
Hope it helps.
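A rough Ruby sketch of the two-level-hash idea. Here the inner table counts each distinct value rather than holding one entry per sample, and samples is assumed to be an enumerable of feature hashes; both are assumptions for illustration.

    # Outer hash: feature key => inner hash; inner hash: feature value => count.
    feature_counts = Hash.new { |h, k| h[k] = Hash.new(0) }

    samples.each do |sample|                 # a single pass over the data
      sample.each do |feature, value|
        feature_counts[feature][value] += 1  # amortised O(1) per update
      end
    end

    feature_counts.each do |feature, values|
      occurrences = values.values.sum                   # how common the feature is
      top_value, top_count = values.max_by { |_, c| c } # does one value dominate?
      puts "#{feature}: #{occurrences} occurrences, " \
           "most common value #{top_value.inspect} (#{top_count} times)"
    end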

Is it bad practice to save calculated data into a db as opposed to inputs for the calculation? (Rails)

Is it bad practice to save calculated data into a database record, as opposed to just the inputs for the calculation?
Example:
If we're saving results of language tests as a db record, and the test has 3 parts which need to be saved in separate columns: listening_score, speaking_score, writing_score
Is it OK to have a fourth column called overall_score, equal to
( listening_score + speaking_score + writing_score ) / 3?
Or should overall_score be recalculated each time current_user wants to look at historical results?
My thinking is that storing it would cause unnecessary duplication of data in the db, but it would make extracting data simpler.
Is there a general rule for this?
It's not bad, but it's not good either. There's no best practice here, because the answer is different in each situation. There are trade-offs to persisting calculated attributes instead of calculating them as needed. The big factors in deciding whether to calculate when needed or to persist are:
Complexity of calculation
Frequency of changes to dependent fields
Whether the calculated field will be used as a search criterion
Volume of calculated data
Usage of calculated fields (e.g. operational viewing of one record at a time vs. big-data-style reporting)
Impact to other processes during calculation
Frequency that calculated fields will be viewed.
There are a lot of opinions on this matter. Each situation is different. You have to determine whether the overhead of persisting your attributes and maintaining their values is worth it compared to just calculating them as needed.
Using the factors above, my preference for persisting a calculated attribute increases as
Complexity of calculation goes up
Frequency of changes to dependent fields goes down
The need to use the calculated field as a search criterion goes up
Calculated fields are used for complicated reporting
Frequency that calculated fields will be viewed goes up.
The factors I omitted from the second list are dependent on external factors, and are subject to even more variability.
Storing the calculated total could be thought of as caching. Caching calculations like this means you have to start dealing with keeping the calculation up to date and worrying about when it isn't. In the long run, that pattern can result in a lot of work. On the flip side, always calculating the total means you will always have a fresh calculation to work with.
I've seen folks store calculations to address performance issues, when calculating is taking a long time due to its complexity or the complexity of the query it's based on. That's a good reason to start thinking about caching results like this.
I've also seen folks store this value to make queries easier. That's a lower return on investment, but can still be worth it if the columns used in your calculations aren't changing frequently.
My default is to calculate, and I want to see good justification for storing the value of the calculation in another column.
(It may also be worth noting that if you are using the same calculation multiple times in a particular function call, you can memoize the result to increase performance without storing the result in the database.)
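A minimal sketch of that memoization idea, using a hypothetical TestResult model backed by the score columns from the question:

    class TestResult < ApplicationRecord
      def overall_score
        # Computed at most once per object instance; nothing extra is stored in the DB.
        # Dividing by 3.0 gives a float average; the question's "/ 3" would truncate
        # if the score columns are integers.
        @overall_score ||= (listening_score + speaking_score + writing_score) / 3.0
      end
    end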

Detecting HTML table orientation based only on table data

Given an HTML table with none of its cells identified as "<th>" or "header" cells, I want to automatically detect whether the table is a "vertical" table or a "horizontal" table.
For example, the original question shows a sample horizontal table and a sample vertical table (the examples are not reproduced here).
Of course, keep in mind that the bold styling, shading, and any other styling properties will not be available at classification time.
I was thinking of approaching this by statistical means: I can hand-write a couple of features like "if the first row has numbers but the first column doesn't, that's probably a vertical table", give a score to each feature, and combine them to decide the class of the table's orientation.
Is that how you would approach such a problem? I haven't used any statistics-based algorithm before, and I am not sure what would be optimal for such a problem.
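To make that concrete, a hand-rolled scoring sketch in Ruby. It is purely illustrative: table is assumed to be an array of rows (each row an array of cell strings), and the single rule, weight, and threshold are made up.

    NUMERIC_CELL = /\A-?\d+(\.\d+)?\z/

    def numeric_fraction(cells)
      return 0.0 if cells.empty?
      cells.count { |c| c.to_s.strip.match?(NUMERIC_CELL) }.fdiv(cells.size)
    end

    # Positive score => probably vertical, negative => probably horizontal,
    # following the rule quoted above. Each additional hand-written feature
    # would add or subtract its own weight here.
    def orientation_score(table)
      first_row = table.first || []
      first_col = table.map(&:first)
      score = 0.0
      score += 1.0 if numeric_fraction(first_row) > 0.5 && numeric_fraction(first_col) <= 0.5
      score -= 1.0 if numeric_fraction(first_col) > 0.5 && numeric_fraction(first_row) <= 0.5
      score
    end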
This is a bit of a confusing question. You are asking about an ML method, but it seems you have not created training/cross-validation/test sets yet. Without the data preprocessing step, any discussion about ML methods is useless.
If I'm right and you haven't created the datasets yet, give us more info on the data (when you look at one example, how do you know the table is vertical or horizontal? How much data do you have? Are you always sure whether a table is vertical/horizontal? ...).
If you have already created training/cross-validation/test sets, give us more details on what the training set looks like (what the features are, the number of examples, whether you need a white-box solution where you can see why an ML model gives a particular result, ...).
How general is the domain for the tables? I know some Web table schema identification algorithms use types, properties, and instance data from a general knowledge schema such as Freebase to attempt to identify the property associated with a column. You might try leveraging that knowledge in a classifier.
If you want to do this without any external information, you'll need a bunch of hand labelled horizontal and vertical examples.
You say "of course" the font information isn't available, but I wouldn't be so quick to dismiss this since it's potentially a source of very useful information. Are you sure you can't get your data from a little bit further back in the pipeline so that you can get access to this info?

If I have two models and need a calculation on each attribute, should I calculate on the fly or create a 3rd model?

I have two models - Score & Weight.
Each of these models have about 5 attributes.
I need to be able to create a weighted_score for my User, which is basically the set of products Score.attribute_A * Weight.attribute_A, Score.attribute_B * Weight.attribute_B, etc.
Am I better off creating a 3rd model - say Weighted_Score, where I store the product value for each attribute in a row with the user_id and then query that table whenever I need a particular weighted_score (e.g. my_user.weighted_score.attribute_A) or am I better off just doing the calculations on the fly every time?
I am asking from an efficiency stand-point.
Thanks.
I think the answer is very situation-dependent. Creating a 3rd table may be a good idea if the calculation is very expensive, you don't want to bog down the rest of the system, and it's OK for you to respond to the user right away with a message saying the calculation will happen later. In that case, you can offload the processing to a background worker and create an instance of the 3rd model asynchronously. Additionally, you should de-normalize that table so that you can read it directly without having to look up the Weight/Score records.
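That asynchronous route might look roughly like this; the job, model, and attribute names are all invented for illustration.

    # Hypothetical background job that (re)builds the denormalized WeightedScore row
    # so reads never have to touch Score or Weight.
    class RefreshWeightedScoreJob < ApplicationJob
      queue_as :default

      def perform(user_id)
        user = User.find(user_id)
        row  = WeightedScore.find_or_initialize_by(user_id: user.id)
        row.update!(
          attribute_a: user.score.attribute_a * user.weight.attribute_a,
          attribute_b: user.score.attribute_b * user.weight.attribute_b
        )
      end
    end

    # Enqueued whenever a Score or Weight changes, e.g.:
    # RefreshWeightedScoreJob.perform_later(user.id)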
Some other ideas:
Focus optimizations on the model that has many records. If Weight, for instance, will only have 100 records, but Score could have infinite, then load Weight into memory and focus all your effort on optimizing the Score queries.
Use memoization on the calc methods
Use caching on the most expensive actions/methods. If you don't care too much about how frequently the values update, you can explicitly sweep the cache nightly or something.
Unless there is a need to store the calculated score (let's say it changes and you want to preserve the history of those changes), I don't see any benefit in adding the complexity of storing it in a separate table.
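For comparison, calculating on the fly (with simple memoization) could look like this rough sketch; the model and attribute names are again just assumptions taken from the question.

    class User < ApplicationRecord
      has_one :score
      has_one :weight

      WEIGHTED_ATTRIBUTES = %i[attribute_a attribute_b].freeze  # extend as needed

      # Returns e.g. { attribute_a: 12.0, attribute_b: 3.5 }, computed once per instance.
      def weighted_scores
        @weighted_scores ||= WEIGHTED_ATTRIBUTES.each_with_object({}) do |attr, result|
          result[attr] = score.public_send(attr) * weight.public_send(attr)
        end
      end
    end

    # my_user.weighted_scores[:attribute_a]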

DB-agnostic calculations: Is it good to store calculation results? If yes, what's the best way to do this?

I want to perform some simple calculations while staying database-agnostic in my rails app.
I have three models:
.---------------.       .--------------.           .---------------.
| ImpactSummary |<------| ImpactReport |<----------| ImpactAuction |
`---------------'1     *`--------------'1         *`---------------'
Basically:
ImpactAuction holds data about... auctions (prices, quantities and such).
ImpactReport holds monthly reports that have many auctions as well as other attributes; it also shows some calculation results based on the auctions.
ImpactSummary holds a collection of reports as well as some information about a specific year, and also shows calculation results based on the two other models.
What I intend to do is store the results of these really simple calculations (just means, sums, and the like) in the relevant tables, so that reading them is fast and so that I can easily run queries on the calculation results.
Is it good practice to store calculation results? I'm pretty sure it's not a very good thing, but is it acceptable?
Is it useful, or should I not bother and perform the calculations on the fly?
If it is good practice and useful, what's the best way to achieve what I want?
That's the tricky part. At first, I implemented a simple chain of callbacks that would update the calculation fields of the parent model upon save (that is, when an auction is created or updated, it calls some_attribute_will_change! on its report and saves it, which triggers that report's own callbacks, and so on).
This approach works well when creating/updating a single record, but if I want to work on several records, it triggers the calculations on the whole chain for each record... So I suddenly find myself forced to put a condition on the callbacks depending on whether I have one record or many, and I can't figure out how to do that (using a class method that could be called on a relation? using an instance attribute #skip_calculations on each record? just using an "outdated" field to mark the parent records for later recalculation?).
Any advice is welcome.
Bonus question: would it be considered DB-agnostic if I implement this with DB views?
As usual, it depends. If you can perform the calculations in the database, either using a view or using #find_by_sql, I would do so. You'll save yourself a lot of trouble: otherwise you have to keep your summaries up to date whenever the underlying values change, and you've already met that problem when updating multiple rows. Having a view, or a query that implements the view stored as text in ImpactReport, will allow you to always have fresh data.
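For example, the per-report sums and averages could be pulled with #find_by_sql in a single query. Column names like quantity and price are guesses here; the extra SELECT columns come back as ordinary attributes on the loaded records.

    reports = ImpactReport.find_by_sql(<<~SQL)
      SELECT r.*,
             COALESCE(SUM(a.quantity), 0) AS total_quantity,
             AVG(a.price)                 AS average_price
      FROM impact_reports r
      LEFT JOIN impact_auctions a ON a.impact_report_id = r.id
      GROUP BY r.id
    SQL

    reports.first.total_quantity  # always reflects the current auctions, no callback chain needed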
The answer? Benchmark, benchmark, benchmark ;)
