I know there are three types of facts, and I've read that Transactional-Fact tables have fully additive facts which are the most useful type, but can non-additive facts be there as well? Or even semi-additive facts for that matter.
I'm asking this because my teacher had this in one of her presentations:
"While non-additive facts are not stored in fact tables, it is important not to
lose track of them. For many processes, ratios are critical
measurements without which a solution would leave much to
be desired. Non-additive facts should be documented as
part of the schema design."
If they can't be stored in there, how can they be documented as part of the schema design?
It isn't correct to say "... are not stored in fact tables"; there are circumstances in which it is desirable to store them.
For example, I recently worked on a data warehouse which had three dates - order, activation and completion. Those dates were related through dimensions, but the fact measures included days-order-to-activation, days-activation-to-completion, and days-order-to-completion.
Best practice would be to derive these measures in the BI tool. In this case, you would document the calculation of the day measures, to demonstrate how the requirement was met from existing data values.
In our recent example, however, these were KPI-level measures, critical to the business. Rather than have people calculating them (possibly differently) in Excel, Tableau, PowerBI, etc., we chose to implement those measures in the fact table.
They were documented as non-additive, because the sum(days-order-to-completion) is meaningless, although it is worth noting that the minimum, maximum and average values ARE meaningful in this case.
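A minimal sketch of why such a measure is flagged as non-additive. The rows and dates are made up for illustration and are not from the warehouse described above:

```python
from datetime import date
from statistics import mean

# Hypothetical fact rows: (order_date, activation_date, completion_date)
orders = [
    (date(2023, 1, 2), date(2023, 1, 5), date(2023, 1, 20)),
    (date(2023, 1, 10), date(2023, 1, 11), date(2023, 1, 15)),
    (date(2023, 2, 1), date(2023, 2, 8), date(2023, 3, 3)),
]

# Derived measure that would be stored on each fact row
days_order_to_completion = [(done - ordered).days for ordered, _, done in orders]

# Min, max and average are meaningful; a plain sum across orders is not
summary = {
    "min": min(days_order_to_completion),
    "max": max(days_order_to_completion),
    "avg": mean(days_order_to_completion),
}
print(days_order_to_completion, summary)
```

Documenting the measure as non-additive tells report authors to aggregate it with min, max or average rather than sum.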
I'm new to SPSS. I have data of skin cancer diagnosis for the years 2004 - 2018. I want to compare the changes in distribution of new cases with regards to which body part and compare between the different years. I've managed to create a crosstab and grouped bar graph that shows the percentages but I would like to run a statistical analysis to see if the changes in distribution are significant over time. The groups I have are face, trunk, arm, leg or not specified, the number of cases for each year vary greatly which is why I'm looking to compare the ratios (percentages) between the different body sites. The only explanations I've found all refer to repeated observations of the same subject which is not the case here (a person is only included with their first diagnosis so can only appear in one of the years).
The analysis would be similar to comparing the percentages of an election between 3+ parties and how that distribution changes over the years but I haven't found any such tutorials. Please help!
The CTABLES or Custom Tables procedure, if you have access to it, will let you create a crosstabulation like you mention, and then will let you test both for any changes overall in the distribution of types, as well as comparing each pair of columns for each row.
More generally, problems like this would usually be handled as loglinear or logit models.
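Outside SPSS, the overall test on a crosstab like this is commonly a Pearson chi-square test of independence. The sketch below computes only the statistic and its degrees of freedom in plain Python; the counts are invented for illustration, and the p-value (from a chi-square distribution with those degrees of freedom) is left to a statistics package:

```python
# Hypothetical counts of new cases: rows = body site, columns = year
counts = {
    "face":  [30, 35, 50],
    "trunk": [40, 45, 40],
    "arm":   [15, 20, 30],
    "leg":   [15, 20, 30],
}

def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a rows x columns count table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count if the site distribution were the same every year
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof

stat, dof = chi_square(counts)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

Because the test works on counts rather than percentages, years with very different numbers of cases are handled naturally.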
Let me just start by saying I only took the undergrad AI class at school so I know just enough to be dangerous.
Here's the problem I'm looking to solve...accurate credit scoring is a key part to the success of my business. Currently we rely on a team of actuaries and statistical analysis to suss out patterns in the few dozen variables we track about each individual that indicate that they may be a low or high credit risk. As I understand it this is exactly the type of job that neural nets are great at solving, that is, finding high order relationships across many inputs that a human would likely never spot and then rendering a decision or output that is on average more accurate than what a trained human could do. In short, I want to be able to input your name, address, marital status, what car you drive, where you work, hair color, favorite food, etc in and get a credit score back.
My question is what type or architecture for a neural network would be best for this particular problem. I've done a bit of research and it seems I'm generating questions faster than I'm finding answers at this point. The best I've been able to come up with is some kind of generative deep neural network with multiple hidden layers where each layer is able to abstract one level beyond the previous one. I'm assuming it's going to be feed-forward just because it seems to be the default. We have historical data on all previous customers including the information we used to make the initial score as well as data on what type of credit risk they actually turned out to be. This would seem to lend itself to unsupervised learning. Where I'm lost is in number of layers, how the layers are different from each other, size of each layer, connectedness of each of the perceptrons and so on. The more I dig the more I'm getting into research papers that are over my head, so I just need some smart person to point me in the right direction.
Does anyone have any ideas? Again, I don't need a thorough explanation just a general area I should focus on.
This is supervised learning since you have actual data that can be labelled. It's also feedforward since you're not predicting time series but assigning scores. Further, you should probably just prepare your data (assigning credit scores manually or with some rough heuristic) and start experimenting with some tools before you invest time into implementing state-of-the-art architectures. A multi-layer-perceptron (MLP) with 1 hidden layer is a sufficient starting point for such a problem. From there on, you can train the network to generalize your credit assignment heuristic you began with.
You should know that most "new" architectures you probably read about while researching are dealing with much more difficult problems than credit scoring (speech/image/character recognition/detection). There is a collection of papers on the scenario of credit scoring / risk classification, so I'd recommend shifting your focus from architectures to actual case studies (see e.g. this paper). Just pick a recent paper with MLPs and apply their parameters. Start simple and improve the system incrementally (as @roganjosh stated).
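To make "an MLP with 1 hidden layer" concrete, here is a minimal pure-Python sketch of such a network trained by stochastic gradient descent. The layer sizes, learning rate, features and labels are all hypothetical; a real experiment would more likely use an existing library and properly prepared data:

```python
import math
import random

random.seed(0)

N_INPUTS, N_HIDDEN = 3, 4  # hypothetical sizes

# Each weight row ends with a bias term
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(N_INPUTS + 1)]
            for _ in range(N_HIDDEN)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    """Return (hidden activations, predicted probability of default)."""
    hidden = [math.tanh(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
              for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out[:-1], hidden)) + w_out[-1])
    return hidden, out

def train_step(x, y, lr=0.1):
    """One stochastic gradient step on the cross-entropy loss."""
    hidden, out = forward(x)
    delta_out = out - y  # loss gradient w.r.t. the output pre-activation
    for j in range(N_HIDDEN):
        # Backpropagate through tanh, whose derivative is 1 - h^2
        delta_h = delta_out * w_out[j] * (1.0 - hidden[j] ** 2)
        for i in range(N_INPUTS):
            w_hidden[j][i] -= lr * delta_h * x[i]
        w_hidden[j][-1] -= lr * delta_h
        w_out[j] -= lr * delta_out * hidden[j]
    w_out[-1] -= lr * delta_out

# Hypothetical, already-scaled features; label 1 = turned out to be a bad risk
data = [([0.9, 0.1, 0.2], 1), ([0.8, 0.3, 0.1], 1),
        ([0.1, 0.9, 0.8], 0), ([0.2, 0.7, 0.9], 0)]

for _ in range(500):
    for x, y in data:
        train_step(x, y)

_, p = forward([0.85, 0.2, 0.15])
print(f"predicted default probability: {p:.2f}")
```

The labels here come from observed outcomes, which is what makes this supervised learning.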
As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic tree banks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the features and values I need and transform the data from one format to the other. This usually results in a very large number of very large files and a very large number of small programs which process them in order to get the input for some machine learning framework (like the arff files for Weka).
One needs to be extremely well organised to deal with that and program with great care not to miss any important peculiarities, exceptions or errors in the tons of data. Many principles of good software design like design patterns or refactoring paradigms are no big use for these tasks because things like security, maintainability or sustainability are of no real importance - once the program successfully processed the data one doesn't need it any longer. This has gone so far that I even stopped bothering about using classes or functions at all in my Python code and program in a simple procedural way. The next experiment will require different data sets with unique characteristics and in a different format so that their preparation will likely have to be programmed from scratch anyway. My experience so far is that it's not unusual to spend 80-90% of a project's time on the task of preparing training data. Hours and days go by only on thinking about how to get from one data format to another. At times, this can become quite frustrating.
Well, you probably guessed that I'm exaggerating a bit, on purpose even, but I'm positive you understand what I'm trying to say. My question, actually, is this:
Are there any general frameworks, architectures, best practices for approaching these tasks? How much of the code I write can I expect to be reusable given optimal design?
I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparations I need to make are simple enough for regular expressions, trivial filtering, etc.
It is still possible to make things reusable; the general rules also apply here. If you program procedurally with no regard for best practices and the like, it is IMHO no wonder that you have to do everything from scratch when starting a new project.
Even though the format requirements will vary a lot, there are still many common tasks, e.g. tag-stripping, tag-translation, selection, tabulation, and some trivial data harvesting such as counting tokens, sentences and the like. Programming these tasks with high reusability in mind will pay off, even though it takes longer at first.
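As a rough illustration of what reusable versions of those common tasks could look like, here is a small sketch. The function names and the deliberately naive regular expressions are assumptions for the example, not a recommended tokenizer:

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove SGML/XML-style tags, keeping the text between them."""
    return TAG_RE.sub("", text)

def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def corpus_stats(text):
    """Chain the steps above and report simple counts."""
    sentences = split_sentences(strip_tags(text))
    return {"sentences": len(sentences),
            "tokens": sum(len(tokenize(s)) for s in sentences)}

sample = "<s>The cat sat.</s> <s>It purred!</s>"
print(corpus_stats(sample))
```

Each step is a small function with one job, so a new experiment can swap out the tokenizer or the tag handling without rewriting the rest.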
I am not aware of any such frameworks, which doesn't mean they aren't out there. I prefer to use my own, which is just a collection of code snippets I've refined/tweaked/borrowed over time and that I can chain together in various configurations depending on the problem. If you already know Python, then I strongly recommend handling all of your data prep in NumPy. As you know, ML data sets tend to be large: thousands of row vectors packed with floats. NumPy is brilliant for that sort of thing. Additionally, I might suggest that for preparing training data for ML, there are a couple of tasks that arise in nearly every such effort and that don't vary a whole lot from one problem to the next. I've given you snippets for these below.
normalization (scaling & mean-centering your data to avoid overweighting). As I'm sure you know, you can scale -1 to 1 or 0 to 1. I usually choose the latter so that I can take advantage of sparsity patterns. In Python, using the NumPy library:

    import numpy as NP

    data = NP.linspace(1, 12, 12).reshape(4, 3)
    # min-max scale each column (axis 0) to the range 0 to 1
    data_norm = NP.apply_along_axis(
        lambda x: (x - x.min()) / (x.max() - x.min()), 0, data)
cross-validation (here I've set the default argument to 5, so the test set is 5 randomly chosen rows, i.e. 5% of a 100-row data set, and the training set is the remainder; putting this in a function makes k-fold much simpler)

    def divide_data(data, testset_size=5):
        # pick testset_size distinct row indices at random
        ndx2 = NP.random.choice(data.shape[0], testset_size, replace=False)
        TE = data[ndx2]                     # test set
        TR = NP.delete(data, ndx2, axis=0)  # training set: all remaining rows
        return TR, TE
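For the k-fold case mentioned above, one possible sketch is a generator that yields row indices for each fold. The name `k_fold_indices` and its parameters are hypothetical, and it uses only the standard library so it stands alone:

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # k disjoint folds of roughly equal size
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [ndx for j, fold in enumerate(folds) if j != i for ndx in fold]
        yield train, test

for train, test in k_fold_indices(10, k=5):
    print(sorted(test), len(train))
```

With a NumPy array, the yielded index lists can be used directly to select rows, for example `data[train]` and `data[test]`.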
Lastly, here's an excellent case study (IMHO), both clear and complete, showing literally the entire process from collection of the raw data through input to the ML algorithm (a MLP in this case). They also provide their code.
I am interested in doing some Collective Intelligence programming, but wonder how it can work?
It is said to be able to give accurate predictions: the O'Reilly Programming Collective Intelligence book, for example, says a collection of traders' actions can actually predict future prices (such as corn) better than an expert can.
Now we also know from statistics class that, in a room of 40 students taking an exam, there will be 3 to 5 students who get an "A" grade. There might be 8 that get a "B", 17 that get a "C", and so on. That is, basically, a bell curve.
So from these two standpoints, how can a collection of "B" and "C" answers give a better prediction than the answer that got an "A"?
Note that the corn price, for example, is the accurate price factoring in weather, demand of food companies using corn, etc, rather than "self fulfilling prophecy" (more people buy the corn futures and price goes up and more people buy the futures again). It is actually predicting the supply and demand accurately to give out an accurate price in the future.
How is it possible?
Update: can we say Collective Intelligence won't work in stock market euphoria and panic?
The Wisdom of Crowds wiki page offers a good explanation.
In short, you don't always get good answers. A few conditions need to hold for it to occur.
Well, you might want to think of the following "model" for a guess:
guess = right answer + error
If we ask a lot of people a question, we'll get lots of different guesses. But if, for some reason, the distribution of errors is symmetric around zero (actually it just has to have zero mean) then the average of the guesses will be a pretty good predictor of the right answer.
Note that the guesses don't necessarily have to be good -- i.e., the errors could indeed be large (grade B or C, rather than A) as long as there are grade B and C answers distributed on both sides of the right answer.
Of course, there are cases where this is a terrible model for our guesses, so collective intelligence won't always work...
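The "guess = right answer + error" model is easy to simulate. In the sketch below the true value, the error spread and the crowd size are arbitrary assumptions; the errors are drawn with zero mean, which is exactly the condition the argument relies on:

```python
import random
from statistics import mean

random.seed(42)

right_answer = 100.0  # hypothetical true value
# Each guess = right answer + zero-mean error; individual errors are large
guesses = [right_answer + random.gauss(0, 20) for _ in range(1000)]

crowd_error = abs(mean(guesses) - right_answer)
typical_individual_error = mean(abs(g - right_answer) for g in guesses)

print(f"error of the averaged guess: {crowd_error:.2f}")
print(f"average individual error:    {typical_individual_error:.2f}")
```

If the errors were drawn with a non-zero mean instead (everyone biased in the same direction), averaging would not remove that bias, which is the failure case described above.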
Crowd Wisdom techniques, like prediction markets, work well in some situations, and poorly in others, just as other approaches (experts, for instance) have their strengths and weaknesses. The optimal arenas therefore, are ones where no other approaches do very well, and prediction markets can do well. Some examples include predicting public elections, estimating project completion dates, and predicting the prevalence of epidemics. These are areas where information is spread around sparsely, and experts haven't found effective models that reliably predict.
The general idea is that market participants make up for one another's weaknesses. The expectation isn't that the markets will always predict every outcome correctly, but that, due to people noticing other people's mistakes, they won't miss crucial information as often, and that over the long haul, they'll do better. In cases where the experts actually know the answer, they'll be able to influence the outcome. Different experts can weigh in on different questions, so each has more influence where they have the most knowledge. And as markets continue over time, each participant gets feedback from their gains and losses that makes them better informed about which kinds of questions they actually understand and which ones they should stay away from.
In a classroom, people are often graded on a curve, so the distribution of grades doesn't tell you much about how good the answers were. Prediction markets calibrate all the answers against actual outcomes. This public record of successes and failures does a lot to reinforce the mechanism, and is missing in most other approaches to forecasting.
Collective intelligence is really good at coming up with answers to problems that have complex behavior behind them, because it is able to take multiple sources of opinions/attributes to determine the end result. With a setup like this, training helps to optimize the end result of the processes.
The fault is in your analogy: the two kinds of opinion are not equal. Traders predict the direct profit of their own transactions (the little part of the market they have an overview of), while the expert tries to predict the overall field.
IOW the overall traders position is pieced together like a jigsaw puzzle based on a large amount of small opinions for their respective piece of the pie (where they are assumed to be experts).
A single mind can't process that kind of detail, which is why the overall position MIGHT overshadow the real expert. Note that this particular phenomenon is usually restricted to a fairly static market, not periods of turmoil. Experts usually do better then, since they are often better trained and motivated to avoid going with the general sentiment (which is often comparable to that of a lemming in times of turmoil).
The problem with the class analogy is that the grading system doesn't assume that the students are masters in their (difficult to predict) terrain, so it is not comparable.
P.s. note that the base axiom depends on all players being experts in a small piece of the field. One can debate if this requirement actually transports well to a web 2 environment.
Currently, we plan to record a "batch id" for each batch of facts we load. That way, we can back out the load in case we find problems.
Should we consider tracking the batch id on the dimension rows, also?
It seems like dimension rows have different rules. If we treat them as slowly-changing, and use one of the SCD algorithms that preserves history, then a reload doesn't really mean much.
Typical Scenario. Conform dimension, handling SCD. Load facts. Done.
Extension. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts. Fix the problem. Reload facts. Done.
Possible Scenario. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts and the dimension rows. Fix the problem. Conform dimension, handling SCD. Load facts. Done.
It doesn't seem like tracking dimension changes helps very much at all. Any guidance on how best to handle an "undo" or "rollback" of a data warehouse load?
Our ETL tools are entirely home-grown Python applications.
From my perspective, as long as you are not abusing your dimensions (like tracking time to the millisecond), there is not a lot to be gained by tracking dimensions for a rollback. You can also build a tool to clean up unreferenced dimension rows once a month.
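Since the ETL tools are Python, here is a minimal sketch of that approach using the standard-library `sqlite3` module. The table names, column names and sample rows are hypothetical; the point is only that the batch id lives on the fact rows, while dimension rows are cleaned up separately when nothing references them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount REAL,
        batch_id INTEGER
    );
    INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO fact_sales VALUES (1, 10.0, 100), (1, 5.0, 101), (2, 7.5, 101);
""")

def rollback_batch(conn, batch_id):
    """Back out one load by deleting only the facts tagged with its batch id."""
    with conn:
        conn.execute("DELETE FROM fact_sales WHERE batch_id = ?", (batch_id,))

def cleanup_unreferenced_dimensions(conn):
    """Periodic housekeeping: drop dimension rows that no fact references."""
    with conn:
        conn.execute("""
            DELETE FROM dim_customer
            WHERE customer_key NOT IN (SELECT customer_key FROM fact_sales)
        """)

rollback_batch(conn, 101)
cleanup_unreferenced_dimensions(conn)

remaining_facts = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
remaining_dims = [row[0] for row in conn.execute("SELECT name FROM dim_customer")]
print(remaining_facts, remaining_dims)
```

In practice the cleanup would run on a schedule rather than immediately after every rollback, so that a reload of the corrected batch can reuse the dimension rows that are still there.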