I am designing a data mart for university students and I am unsure whether visa and passport information should go in the student dimension or in a separate dimension of its own. Which would be the better approach?
The basic process for most supervised machine learning problems is to divide the dataset into a training set and test set and then train a model on the training set and evaluate its performance on the test set. But in many (most) settings, disease diagnosis for example, more data will be available in the future. How can I use this to improve upon the model? Do I need to retrain from scratch? When might be the appropriate time to retrain if this is the case (e.g., a specific percent of additional data points)?
Let's take the example of predicting house prices. House prices change all the time. A machine learning model that was trained on house-price data from six months ago could provide terrible predictions today. For house prices, it's imperative that you have up-to-date information to train your models.
When designing a machine learning system it is important to understand how your data is going to change over time. A well-architected system should take this into account, and a plan should be put in place for keeping your models updated.
Manual retraining
One way to maintain models with fresh data is to retrain and deploy your models using the same process you used to build them in the first place. As you can imagine, this process can be time-consuming. How often do you retrain your models? Weekly? Daily? There is a balance between cost and benefit. Costs in model retraining include:
Computational Costs
Labor Costs
Implementation Costs
On the other hand, as you are manually retraining your models, you may discover a new algorithm or a different set of features that improves accuracy.
Continuous learning
Another way to keep your models up-to-date is to have an automated system to continuously evaluate and retrain your models. This type of system is often referred to as continuous learning, and may look something like this:
Save new training data as you receive it. For example, if you are receiving updated prices of houses on the market, save that information to a database.
When you have enough new data, evaluate your current model's accuracy on it.
If you see the model's accuracy degrading over time, use the new data, or a combination of the new and old training data, to build and deploy a new model.
The benefit to a continuous learning system is that it can be completely automated.
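A minimal sketch of the evaluate-and-retrain step, assuming a scikit-learn style regressor (the 10% threshold, the RandomForestRegressor, and the maybe_retrain helper are illustrative assumptions; the data loading and deployment plumbing is left out):

```python
# Minimal sketch of a continuous-learning check. The threshold and model
# choice are illustrative assumptions, not a prescription.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

ERROR_GROWTH_THRESHOLD = 0.10  # retrain if error grows by more than 10%

def maybe_retrain(model, baseline_error, X_new, y_new, X_old, y_old):
    """Evaluate the current model on newly collected data; retrain if degraded."""
    new_error = mean_absolute_error(y_new, model.predict(X_new))
    if new_error <= baseline_error * (1 + ERROR_GROWTH_THRESHOLD):
        return model, False  # still good enough, keep serving the old model
    # Accuracy has degraded: refit on a combination of old and new data.
    X = np.vstack([X_old, X_new])
    y = np.concatenate([y_old, y_new])
    new_model = RandomForestRegressor(n_estimators=200, random_state=0)
    new_model.fit(X, y)
    return new_model, True
```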
Let's say there is a data warehouse created from shop data. A fact is a single purchase of a product. There is a dimension that describes a customer.
There is a need to create a measure that stores the number of distinct customers. Can this measure be based on the customer identifier in the dimension table, or does it need to come from the fact table? In which cases is one solution better than the other?
Below I post a visualization based on an AdventureWorks2016 database:
I have 1000 features (or more) with pairwise correlation below 0.7, and I plan to build neural networks for prediction. Should I build one model that incorporates all the features, or two models with 500 features each and then ensemble them? That is:
Option 1: A model with all features. The model structure may need to change in the future as more features are generated; for example, 100 features might require 3 hidden layers while 1000 features require 6.
Option 2: A model with a fixed number of features (e.g. 500). For every 500 new features I get in the future, I just feed the data into the same model structure without modifying it.
From my perspective, if I choose option 2, I can build a model with the proper capacity to handle 500 features, so whenever I generate new features I can feed them into the existing architecture, with the same network structure and even the same hyperparameters, and ensemble the results. However, I have not seen this approach in practice. I am not sure whether my idea is valid, and I am unsure which option is better.
In my experience, and in many high-ranking Kaggle solutions, you usually get the best results by training multiple models, each with all of the features.
But if we have to choose between the two options, option 1 is better.
Models learn better when more features are provided.
What if features a and b together are the most useful for the final answer, but feature a is only used to train model 1 and feature b only to train model 2? Neither model ever sees both.
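As a toy illustration of this point (my own construction, not from the question or answer): when the target depends on an interaction between features that end up in different models, the split ensemble cannot learn it at all, while a single model that sees all the features at least has the chance to.

```python
# Toy example: the target is the product of feature 0 and feature 10.
# If the features are split so that no single model sees both, neither
# half-model can do better than predicting the mean (R^2 near 0).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 20))
y = X[:, 0] * X[:, 10]                      # interaction across the two halves

X_tr, X_te, y_tr, y_te = X[:2500], X[2500:], y[:2500], y[2500:]

# Option 1: one model on all 20 features.
full = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=0)
full.fit(X_tr, y_tr)

# Option 2: two models on disjoint halves of the features, predictions averaged.
left = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=0)
right = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000, random_state=0)
left.fit(X_tr[:, :10], y_tr)
right.fit(X_tr[:, 10:], y_tr)
split_pred = (left.predict(X_te[:, :10]) + right.predict(X_te[:, 10:])) / 2

r2_split = 1 - np.mean((y_te - split_pred) ** 2) / np.var(y_te)
print("full model R^2:", full.score(X_te, y_te))
print("split ensemble R^2 (near zero, the interaction is invisible):", r2_split)
```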
In my opinion, go for option 2. Option 1 may sometimes overfit. It is true that models learn better with more features, but more features also make the model more complex. Models built the second way can be more accurate too (even though each is based only on the features selected for it).
I could really use some help!
The company I work for is made up of 52 very different businesses, so I can't predict at the company level; instead I need to predict business by business and then roll up the results to give a company-wide prediction.
I have written an ML model in studio.azureml.net
It works great, with a coefficient of determination of 0.947, but this is for only 1 of the businesses.
I now need to train the model for the other 51.
Is there a way to do this in a single ML model rather than having to create 52 very similar models?
Any help would be much appreciated !!!
Kind Regards
Martin
You can use ensembles, combining several models to improve predictions. The most direct approach is stacking, where a meta-model is trained on the outputs of the base models, each of which has been trained on the entire dataset.
The method that, I think, corresponds best to your problem is a bagging-style approach (bootstrap aggregation): divide the training set into different subsets (each corresponding to a certain business), then train a different model on each subset and combine the results of the classifiers.
Another way is boosting, but it is difficult to implement in Azure ML.
You can see an example in Azure ML Gallery.
Quote from book:
Stacking and bagging can be easily implemented in Azure Machine Learning, but other ensemble methods are more difficult. Also, it turns out to be very tedious to implement in Azure Machine Learning an ensemble of, say, more than five models. The experiment is filled with modules and is quite difficult to maintain. Sometimes it is worthwhile to use any ensemble method available in R or Python. Adding more models to an ensemble written in a script can be as trivial as changing a number in the code, instead of copying and pasting modules into the experiment.
You may also have a look at the sklearn (Python) and caret (R) documentation for further details.
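Outside Azure ML, the per-business idea itself is easy to sketch with scikit-learn. The snippet below is only an illustration of that pattern; the column names (business_id), the feature list, and GradientBoostingRegressor are placeholder assumptions for your own data and model choice.

```python
# Sketch: train one regressor per business, then roll predictions up to a
# company-wide figure. business_id / feature / target names are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def train_per_business(df, feature_cols, target_col, business_col="business_id"):
    """Fit one model per business and return them keyed by business id."""
    models = {}
    for business, group in df.groupby(business_col):
        model = GradientBoostingRegressor(random_state=0)
        model.fit(group[feature_cols], group[target_col])
        models[business] = model
    return models

def predict_company_total(models, df, feature_cols, business_col="business_id"):
    """Predict per business, then sum the results into a company-wide total."""
    total = 0.0
    for business, group in df.groupby(business_col):
        total += models[business].predict(group[feature_cols]).sum()
    return total
```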
I have a college assignment to build a data warehouse for product inventory management, which should help inventory management understand on-hand value and, using historical data, predict when to bring in new inventory. I have been reading to find the best way to do this using cubes or data marts. My question is: do I have to create a data warehouse first and build the cube or data mart on top of it, or can I extract transactional data directly into a cube/data mart?
Next, is it mandatory to build a star schema (or another DW schema) for this assignment? After reading multiple articles, my understanding is that an OLAP cube can have multiple facts surrounded by dimensions.
Your question is far bigger than you know!
As a general principle, you would have one or more staging databases that land the data from one or more OLTP systems. The staging database(s) would then feed data into a data warehouse (DWH). On top of the DWH a number of marts would be built; these are typically subject-area specific.
There are several DWH methodologies:
Kimball Star Schema - you mention star schema above; this is broadly the Kimball star schema, proposed by Ralph Kimball. I would also include snowflake schemas here, which are a variation on star schemas.
Inmon Model - Proposed by Bill Inmon
Data Vault - proposed by Dan Linstedt. Has a large user base in the Benelux countries. There are variations on the Data Vault.
It's important not to confuse a DWH methodology with the technology used to implement a DWH, though some technologies lend themselves to particular methodologies. For example, OLAP cubes work easily with Kimball star schemas. There is no particular need to use relational technology for any particular database; some NoSQL databases (like Cassandra) lend themselves well to staging databases.
To answer your specific questions:
Do I have to create a data warehouse first and build the cube or data mart on top of it, or can I extract transactional data directly into a cube/data mart?
OLAP cubes are optional if you have a specific mart that is tailored to your reporting, but it depends on your reporting and analysis requirements and the speed of access you need.
A Data Mart could actually be built only using an OLAP cube, coming straight from the DWH.
Specifically on inventory management, all of these DWH methodologies would be suitable.
I can't answer your last question, as that seems to be the point of the assignment and you haven't given enough information to answer it, but you need to do some research into dimensional modelling, so I hope this has pointed you in the right direction!
The answer is yes: a star model will always help with better analysis, but it is relational, while a cube is multidimensional (it performs all the data crossings) and often uses star models as its data source (recommended).
OLAP cubes are generally used for fast analysis and summaries of data.
So, as a standard approach, I recommend you build all the star models you need and then generate the OLAP cubes for your analysis.
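As a tiny illustration of that flow (my own sketch with invented inventory tables, not part of the assignment), a star-schema style fact table joined to its dimensions can be rolled up into the kind of summary an OLAP cube pre-computes:

```python
# Tiny star-schema sketch: an inventory fact table plus two dimensions,
# rolled up into a cube-like summary. Table and column names are invented.
import pandas as pd

fact_inventory = pd.DataFrame({
    "date_key":         [20240101, 20240101, 20240102, 20240102],
    "product_key":      [1, 2, 1, 2],
    "quantity_on_hand": [100, 40, 90, 55],
    "unit_cost":        [2.5, 10.0, 2.5, 10.0],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "category":    ["Widgets", "Gadgets"],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month":    ["2024-01", "2024-01"],
})

# Join the fact table to its dimensions, then aggregate: this group-by is
# the kind of summary an OLAP cube pre-computes for you.
star = (fact_inventory
        .merge(dim_product, on="product_key")
        .merge(dim_date, on="date_key"))
star["stock_value"] = star["quantity_on_hand"] * star["unit_cost"]
cube_like = star.groupby(["month", "category"])["stock_value"].sum()
print(cube_like)
```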
As this is a 'homework' question, I would guess that the lecturer is looking for the pros/cons of Kimball versus Inmon, which are the two 'default' designs for end-user reporting. In the real world, Data Vault can also be applied as part of the DWH strategy, but it serves a different purpose and is not recommended for end-user consumption.
Data Vault is a design pattern for bringing data in from source systems unmolested. Data will inevitably need to be cleaned before being presented to the end-user solution, and DV allows the DWH ETL process to be re-run if any issues are found or the business requirements change, especially if the granularity level goes down (e.g. the original fact table was for sales and the dimension requirements were for salesman and product category; now they want fact sales by sales round and salesman, for product subcategory and category. Without DV you do not have the granular data to replay the historical information and rebuild the DWH).