How are dimensional models used differently in the two approaches to data warehousing?
I understand that a data warehouse created using the bottom-up approach has data marts as the building blocks of the data warehouse, and each data mart has its own dimensional model. Is it the same for the top-down approach? Does Inmon's method use dimensional models?
Kimball's method treats a collection of data marts with a common "dimension bus" as the data warehouse.
Inmon's method uses a subject-oriented, normalized structure as the warehouse; from that structure the data is exported to data marts, which may (or may not) be star-shaped like Kimball's.
For very large warehouses the two architectures converge, or at least become similar, due to the introduction of master-data management structures/storage in the Kimball-type architecture.
There is a white paper on Inmon's site called A Tale of Two Architectures which nicely summarizes the two approaches.
Dimensional modelling is a design pattern sometimes used for Data Marts. It's not a very effective technique for complex Data Warehouse design due to the redundancy and in-built bias in dimensional models. Kimball's "bottom-up" approach attempts to sidestep the issue by referring to a collection of Data Marts as a "Data Warehouse" - an excuse that looks far less credible today than it did in the 1990s when Kimball first proposed it.
Inmon recommends Normal Form as the most flexible, powerful and efficient basis for building a Data Warehouse.
I'm new to NLP. I'm currently building an NLP system in a specific domain. After training word2vec and fastText models on my documents, I found that the embeddings are not really good because I didn't feed in enough documents (e.g. the embeddings can't see that "bar" and "pub" are strongly related to each other because "pub" only appears a few times in the documents). Later, I found a word2vec model online built on that domain-specific corpus which definitely has a much better embedding (so "pub" is more related to "bar"). Is there any way to improve my word embeddings using the model I found? Thanks!
Word2Vec (and similar) models really require a large volume of varied data to create strong vectors.
But also, a model's vectors are typically only meaningful alongside other vectors that were trained together in the same session. This is both because the process includes some randomness, and the vectors only acquire their useful positions via a tug-of-war with all other vectors and aspects of the model-in-training.
So, there's no standard location for a word like 'bar' - just a good position, within a certain model, given the training data and model parameters and other words co-populating the model.
This means mixing vectors from different models is non-trivial. There are ways to learn a 'translation' that moves vectors from the space of one model to another – but that is itself a lot like a re-training. You can pre-initialize a model with vectors from elsewhere... but as soon as training starts, all the words in your training corpus will start drifting into the best alignment for that data, and gradually away from their original positions, and away from pure comparability with other words that aren't being updated.
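One common way to learn such a translation is an orthogonal Procrustes alignment over the words the two models share. Here is a minimal sketch, assuming gensim 4.x and SciPy, equal vector sizes in both models, and hypothetical file names:

```python
# Sketch only: align my small model's space to the downloaded model's space.
# Assumes both models use the same vector size; file names are placeholders.
import numpy as np
from scipy.linalg import orthogonal_procrustes
from gensim.models import KeyedVectors

kv_small = KeyedVectors.load_word2vec_format("my_domain_vectors.txt")
kv_big = KeyedVectors.load_word2vec_format("downloaded_domain_vectors.txt")

shared = [w for w in kv_small.index_to_key if w in kv_big]

A = np.vstack([kv_small[w] for w in shared])   # source space (my model)
B = np.vstack([kv_big[w] for w in shared])     # target space (found model)

R, _ = orthogonal_procrustes(A, B)             # rotation minimizing ||A @ R - B||

# Map one of my words into the big model's space and inspect its neighbours there.
mapped = kv_small["pub"] @ R
print(kv_big.similar_by_vector(mapped, topn=5))
```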
In my opinion, the best approach is usually to expand your corpus with more appropriate data, so that it has "enough" examples of every word important to you, in sufficiently varied contexts.
Many people use large free text dumps like Wikipedia articles for word-vector training, but be aware that its style of writing – dry, authoritative reference texts – may not be optimal for all domains. If your problem area is "business reviews", you'd probably do best finding other review texts. If it's fiction stories, more fictional writing. And so forth. You can shuffle these other text sources in with your own data to expand the vocabulary coverage.
You can also potentially shuffle in extra repeated examples of your own local data, if you want it to effectively have relatively more influence. (Generally, merely repeating a small number of non-varied examples can't help improve word-vectors: it's the subtle contrasts of different examples that helps. But as a way to incrementally boost the influence of some examples, when there are plenty of examples overall, it can make more sense.)
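If you do go the corpus-expansion route, a rough sketch of what that looks like with gensim (4.x API assumed; the file names and repeat count are hypothetical):

```python
# Sketch only: train on extra same-domain text shuffled in with (repeated)
# local data. File names are placeholders for whatever corpora you collect.
import random
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def read_corpus(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield simple_preprocess(line)   # lowercase, tokenize

local = list(read_corpus("my_domain_docs.txt"))        # small in-domain corpus
extra = list(read_corpus("related_review_texts.txt"))  # larger, similar-style texts

combined = extra + local * 3   # repeat local data to boost its relative influence
random.shuffle(combined)

model = Word2Vec(sentences=combined, vector_size=100, window=5,
                 min_count=5, workers=4, epochs=10)
print(model.wv.most_similar("pub", topn=5))
```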
I am asking this question in context of Data Warehousing only.
Are dimensional models and de-normalized models the same or different?
As far as I have heard from DW enthusiasts, there is no such thing as a normalized or de-normalized data model.
But my understanding is that breaking down the dimensions, i.e. snowflaking, gives the dimensional model, whereas the model with flattened hierarchy dimensions is called a de-normalized data model. Both are data modelling concepts in data warehousing.
I need your expert advice on this.
And what can we call the data model that does not have surrogate keys, but instead uses the primary keys (codes from the operational OLTP system) to join fact and dimension tables together?
A Dimensional model is normally thought of as 'denormalised', because of the way dimension tables are handled.
A data warehouse with 'snowflaked' dimensions can still be called a dimensional model, but snowflaking goes against the advice of Kimball, whose approach is what most people think of when they think of dimensional modelling.
Breaking down the dimensions (i.e. snowflaking) is normalising those tables, and dimensional modelling (as described by Kimball) suggests avoiding snowflaking where possible, although people of course sometimes do, for all sorts of reasons. The model with flattened hierarchy dimensions is a denormalised data model, and this is the main thing that people mean when they talk of a dimensional model.
As for a system that doesn't have surrogate keys: that could still be called a data warehouse, and you could also call it a dimensional model, but it goes against the approach recommended by Kimball (whether for better or worse!).
I have a dataset of approx. 4800 rows with 22 attributes, all numerical, describing mostly the geometry of rock / minerals, and 3 different classes.
I tried out cross-validation with a k-NN model inside it, with k = 7 and Numerical Measure -> Camberra Distance as the parameters, and I got a performance of 82.53% accuracy and 0.673 kappa. Is that result representative for the dataset? I mean, 82% is quite OK.
Before doing this, I evaluated the best subset of attributes with a decision table, which gave me 6 attributes.
The problem is, you still don't learn much from that kind of model, like instance-based k-NN. Can I get any more insight from k-NN? I don't know how to visualize the clusters in that high-dimensional space in RapidMiner; is that somehow possible?
I tried a decision tree on the data, but I got too many branches (300 or so) and it all looked too messy. The problem is that all the numerical attributes have about the same mean and distribution, so it's hard to get a distinct subset of meaningful attributes...
Ideally, the staff wants to "learn" something about the data, but my impression is that you cannot learn much meaningful from this data; what works best are "black box" learning models like neural nets, SVMs, and other instance-based models...
How should I proceed?
Welcome to the world of machine learning! This sounds like a classic real-world case: we want to make firm conclusions, but the data rows don't cooperate. :-)
Your goal is vague: "learn something"? I'm taking this to mean that you're investigating, hoping to find quantitative discriminations among the three classes.
First of all, I highly recommend Principal Component Analysis (PCA): find out whether you can eliminate some of these attributes by automated matrix operations, rather than a hand-built decision table. I expect that the messy branches are due to unfortunate choice of factors; decision trees work very hard at over-fitting. :-)
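Outside RapidMiner, a minimal scikit-learn sketch of that step might look like this (the CSV file names are hypothetical stand-ins for the 4800 x 22 dataset):

```python
# Sketch only: standardize, run PCA, check how much variance a few
# components capture, and take a quick 2-D look at the three classes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.loadtxt("rock_features.csv", delimiter=",")   # shape (4800, 22)
y = np.loadtxt("rock_classes.csv", dtype=int)        # 3 class labels

X_std = StandardScaler().fit_transform(X)            # PCA is scale-sensitive
pca = PCA().fit(X_std)

cum = np.cumsum(pca.explained_variance_ratio_)
print("components for 95% variance:", int(np.argmax(cum >= 0.95)) + 1)

X2 = PCA(n_components=2).fit_transform(X_std)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=5)
plt.show()
```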
How clean are the separations of the data sets? Since you already used k-NN, I'm hopeful that you have dense clusters with gaps. If so, perhaps spectral clustering would help; these methods are good at classifying data based on gaps between the clusters, even if the cluster shapes aren't spherical. Interpretation depends on having someone on staff who can read eigenvectors and work out what the values mean.
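Again purely as an illustration (scikit-learn assumed; X_std and y as in the PCA sketch above), a spectral clustering run compared against the known classes:

```python
# Sketch only: cluster on a nearest-neighbour affinity and see how well the
# unsupervised clusters line up with the three known classes.
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

clusters = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                              n_neighbors=10, random_state=0).fit_predict(X_std)
print("adjusted Rand index vs. known classes:", adjusted_rand_score(y, clusters))
```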
Try a multi-class SVM. Start with 3 classes, but increase if necessary until your 3 expected classes appear. (Sometimes you get one tiny outlier class, and then two major ones get combined.) The resulting kernel functions and the placement of the gaps can teach you something about your data.
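A rough scikit-learn sketch of a multi-class SVM on the same data (X_std and y as above; the kernel and C are just starting points):

```python
# Sketch only: RBF-kernel SVM, multi-class handled internally (one-vs-one),
# evaluated with 10-fold cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(svm, X_std, y, cv=10)
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```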
Try the Naive Bayes family, especially if you observe that the features come from a Gaussian or Bernoulli distribution.
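For completeness, the Gaussian variant is a one-liner to try (same hypothetical X_std and y):

```python
# Sketch only: Gaussian Naive Bayes, sensible if the features look roughly
# Gaussian within each class.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

print(cross_val_score(GaussianNB(), X_std, y, cv=10).mean())
```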
As a holistic approach, try a neural net, but use something to visualize the neurons and weights. Letting the human visual cortex play with the picture can help surface subtle relationships.
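One hedged way to do that with scikit-learn and matplotlib (a small MLP; X_std and y as above) is to plot the first-layer weight matrix as a heat map:

```python
# Sketch only: strong rows in the heat map hint at which of the 22 input
# attributes the network leans on most.
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_std, y)

plt.imshow(mlp.coefs_[0], aspect="auto", cmap="coolwarm")  # shape (22, 16)
plt.xlabel("hidden neuron")
plt.ylabel("input attribute")
plt.colorbar()
plt.show()
```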
There are several data sets for automobile manufacturers and models. Each contains several hundred data entries like the following:
Mercedes GLK 350 W2
Prius Plug-in Hybrid Advanced Toyota
General Motors Buick Regal 2012 GS 2.4L
How can I automatically split the above entries into manufacturers (e.g. Toyota) and models (e.g. Prius Plug-in Hybrid Advanced) using only those files?
Thanks in advance.
Machine Learning (ML) typically relies on training data, which allows the ML logic to produce and validate a model of the underlying data. With this model, it is then in a position to infer the class of new data presented to it (in a classification application, like the one at hand) or to infer the value of some variable (in the regression case, as would be, say, an ML application predicting the amount of rain a particular region will receive next month).
The situation presented in the question is a bit puzzling, at several levels.
Firstly, the number of automobile manufacturers is finite and relatively small. It would therefore be easy to manually make a list of these manufacturers and then simply use this lexicon to parse the manufacturers out of the model strings, using plain string parsing techniques, i.e. no ML needed or even desired here. (Alas, the requirement that one use "...only those files" seems to preclude this option.)
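For illustration only, here is what that lexicon-based, non-ML parsing might look like (the manufacturer list is a toy example, not derived from the files):

```python
# Sketch only: plain string matching against a hand-made manufacturer lexicon.
MANUFACTURERS = ["Toyota", "Mercedes", "Buick", "General Motors", "Honda"]

def split_entry(entry):
    """Return (manufacturer, model) by locating a known manufacturer name."""
    for maker in sorted(MANUFACTURERS, key=len, reverse=True):  # longest first
        idx = entry.lower().find(maker.lower())
        if idx != -1:
            model = (entry[:idx] + entry[idx + len(maker):]).strip()
            return maker, model
    return None, entry  # unknown manufacturer

print(split_entry("Prius Plug-in Hybrid Advanced Toyota"))
# ('Toyota', 'Prius Plug-in Hybrid Advanced')
print(split_entry("General Motors Buick Regal 2012 GS 2.4L"))
# ('General Motors', 'Buick Regal 2012 GS 2.4L')
```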
Secondly, one can think of a few patterns or heuristics that could be used to produce the desired classifier (tentatively a relatively weak one, as the patterns/heuristics that come to mind at the moment seem relatively unreliable). Furthermore, such an approach is also not quite an ML approach in the common understanding of the term.
I'm looking for an overview of the state-of-the-art methods that
find temporal patterns (of arbitrary length) in temporal data
and are unsupervised (no labels).
In other words, given a stream/sequence of (potentially high-dimensional) data, how do you find those common subsequences that best capture the structure in the data?
Any pointers to recent developments or papers (that go beyond HMMs, hopefully) are welcome!
Is this problem maybe well understood in a more specific application domain, like
motion capture
speech processing
natural language processing
game action sequences
stock market prediction?
In addition, are some of these methods general enough to deal with
highly noisy data
hierarchical structure
irregular spacing on the time axis
(I'm not interested in detecting known patterns, nor in classifying or segmenting the sequences.)
There has been a lot of recent emphasis on non-parametric HMMs, extensions to infinite state spaces, as well as factorial models, explaining an observation using a set of factors rather than a single mixture component.
Here are some interesting papers to start with (just google the paper names):
"Beam Sampling for the Infinite Hidden Markov Model"
"The Infinite Factorial Hidden Markov Model"
"Bayesian Nonparametric Inference of Switching Dynamic Linear Models"
"Sharing features among dynamical systems with beta processes"
The experiments sections of these papers discuss applications in text modeling, speaker diarization, and motion capture, among other things.
I don't know what kind of data you are analysing, but I would suggest (from a dynamical-systems analysis point of view) taking a look at:
Recurrence plots (easily found by googling)
Time-delay embedding (may unfold potential relationships between the different dimensions of the data) + a distance matrix (to study neighbourhood patterns, maybe?); a small sketch of both follows below
Note that this is just another way to represent your data, and analyse it based on this new representation. Just a suggestion!
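A small numpy/matplotlib sketch of both ideas on a toy 1-D signal (the delay, embedding dimension, and recurrence threshold are all illustrative choices):

```python
# Sketch only: time-delay embedding of a noisy sine, then a thresholded
# distance matrix plotted as a recurrence plot.
import numpy as np
import matplotlib.pyplot as plt

x = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * np.random.randn(2000)

def delay_embed(series, dim=3, tau=10):
    """Row i is (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    n = len(series) - (dim - 1) * tau
    return np.column_stack([series[i * tau: i * tau + n] for i in range(dim)])

emb = delay_embed(x)

# Pairwise distances between embedded points; a dark dot means the
# trajectory revisits a neighbourhood (a recurrence).
dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
rp = dists < 0.2 * dists.max()

plt.imshow(rp, cmap="binary", origin="lower")
plt.xlabel("time i")
plt.ylabel("time j")
plt.show()
```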