Delta Extraction + Business Intelligence - data-warehouse

What does Delta Extraction mean in regard to Data Warehousing?

Only picking up data that has changed since the last run. This saves you the effort of re-processing data you've already extracted. For example, if your last extract of customer data ran at April 1 00:00:00, your delta run would extract all customers who were added or had details updated since April 1 00:00:00. To do this, you will need either an attribute that stores when a record was last updated, or a log scraper.
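A minimal sketch of the attribute-based approach in ActiveRecord terms; Customer and ExtractRun are hypothetical models, with the standard updated_at column as the "last updated" attribute and ExtractRun as a watermark table:

    last_run_at = ExtractRun.maximum(:finished_at) || Time.at(0) # first run: take everything

    # Pull only rows created or changed since the previous extract.
    Customer.where("updated_at > ?", last_run_at).find_each do |customer|
      # ... load the row into the warehouse staging area ...
    end

    ExtractRun.create!(finished_at: Time.current) # advance the watermark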

Related

ActiveRecord query for all of last day's data and every 100th record prior

I have a process that generates a new record every 10 minutes. That was fine for a while; however, Datum.all now returns 30k+ records, which are unnecessary since the purpose is simply to display them on a chart.
So as a simple solution, I'd like to provide all available data generated in the past 24 hours, but low-res data (every 100th record) prior to the last 24 hours (right back to the beginning of the dataset).
I suspect the solution is some combination of this answer, which selects every nth record (but was provided in 2010), and this answer, which combines two ActiveRecord objects.
But I cannot work out how to get a working implementation that obtains all the required data into one instance variable.
You can use an OR query:
Datum.where("created_at > ?", 1.day.ago).or(Datum.where("id % 100 = 0"))
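Note that or on ActiveRecord relations requires Rails 5 or later, and the id % 100 = 0 condition assumes ids were assigned sequentially over time; you may also want to order the combined relation by created_at before charting.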

Fact Table Design - How to capture a fact which precedes the data start date

We have a fact table which collects information detailing when an employee selected a benefit. The problem we are trying to solve is how to count the total benefits selected by all employees.
We do have a BenefitSelectedOnDay flag, and ordinarily we can do a SUM on this to get a result, but this only works for benefit selections since we started loading the data.
For Example:
Suppose Client#1 has been using our analytics tool since October 2016. We have 4 months of data in the platform.
When the data is loaded in October, the Benefits source data will show:
Employee#1 selected a benefit on 4th April 2016.
Employee#2 selected a benefit on 3rd October 2016.
Setting the BenefitSelectedOnDay flag for Employee#2 is very straightforward.
The issue is what to do with Employee#1 because we can’t set a flag on a day which doesn’t exist for that client in the fact table. Client#1's data will start on 1st October 2016.
Counting the benefit selection is problematic in some scenarios. If we’re filtering the report by date and only looking at benefit selections in Q4 2016, we have no problem. But, if we want a total benefit selection count, we have a problem because we haven’t set a flag for Employee#1 because the selection date precedes Client#1’s dataset range (Oct 1st 2016 - Jan 31st 2017 currently).
Two approaches seem logical in your scenario:
Load some historical data going back as far as the first benefit selection date that is still relevant to current reporting. While it may take some work and extra space, this may be your only solution if employees qualify for different benefits based on how long the benefit has been active.
Add records for a single day prior to the join date (Sept 30 in this case) and flag all benefits that were selected before and are active on the Client join date (Oct 1) as being selected on that date. They will fall outside of the October reporting window but count for unbounded queries. If benefits are a binary on/off thing this should work just fine.
Personally, I would go with option 1 unless the storage requirements are ridiculous. Even then, you could load only the flagged records into the fact table. Your client might get confused if they can select a period prior to the joining date and see incomplete data, but you can explain/justify that.
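To make option 2 concrete, here is a minimal Ruby sketch; the source row layout, the in-memory mapping, and the benefit_selected_on_day column name are illustrative assumptions, not your actual ETL:

    require "date"

    client_start = Date.new(2016, 10, 1)
    baseline_day = client_start - 1 # Sept 30: one synthetic day before the join date

    source_rows = [
      { employee: 1, selected_on: Date.new(2016, 4, 4),  active: true },
      { employee: 2, selected_on: Date.new(2016, 10, 3), active: true },
    ]

    fact_rows = source_rows.select { |r| r[:active] }.map do |row|
      # Selections that precede the client's data range land on the synthetic
      # baseline day: outside any in-range date filter, but counted by
      # unbounded totals.
      date_key = row[:selected_on] < client_start ? baseline_day : row[:selected_on]
      { employee: row[:employee], date_key: date_key, benefit_selected_on_day: 1 }
    end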

How to get the daily data of "Microsoft.VSTS.Scheduling.CompletedWork"?

We need to get the daily data from the "Microsoft.VSTS.Scheduling.CompletedWork" field (which is detailed in Workload, scheduling and time tracking field references). However, when I get data from the Analysis database, I find that it only keeps the most recent value, so I can't get the historical data.
For example, take the task with ID 3356, whose CompletedWork was 3 hours on 2016/8/4; I got exactly that 3-hour value from the Analysis database the next day, 2016/8/5, as the pictures in this post show.
Then on 2016/8/5 I updated the CompletedWork from 3 hours to 4 hours, and I got the 4-hour value from the Analysis database the next day, 2016/8/6. However, the 3-hour value from 2016/8/4 is lost. So, how can I get the historical data of "Microsoft.VSTS.Scheduling.CompletedWork"?
First of all, it's important to understand that CompletedWork is a cumulative data field. So when one user enters 3 and another later enters 4, the total number of hours recorded in the field is 4, not 7.
The warehouse has a granularity of a day and keeps that data in the cube, though the relational warehouse tables store all the changes to the reportable fields on a per-revision basis. You can't easily query this data using the cube or Excel Power Pivot, and it's lost in the Dim* and Fact* tables, but you can write a SQL query against tfs_warehouse and iterate through the tables containing the work item data (tbl_workitems[are|were|latest]). This is much slower and much harder to build, unfortunately.
Your other alternative is to use the TFS Client Object Model and query the WorkItemStore object directly. You'll be able to query all work items of interest and iterate through them and their revisions. The API for work items is relatively easy to use and well documented.
If you're on TFS 2015 you can also use the new REST API to query work item data and revisions.
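As a minimal Ruby sketch against the TFS 2015 REST work item revisions endpoint; the collection URL, credentials, and plain-HTTP transport are assumptions to adapt to your server:

    require "net/http"
    require "json"

    collection   = "http://tfs:8080/tfs/DefaultCollection" # assumed URL
    work_item_id = 3356

    uri = URI("#{collection}/_apis/wit/workitems/#{work_item_id}/revisions?api-version=1.0")
    req = Net::HTTP::Get.new(uri)
    req.basic_auth("user", "password") # assumed credentials

    res       = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
    revisions = JSON.parse(res.body)["value"]

    # Each revision carries the field values as of that change, so the history
    # of CompletedWork can be reconstructed revision by revision.
    revisions.each do |rev|
      fields = rev["fields"]
      puts "#{fields['System.ChangedDate']}: " \
           "CompletedWork=#{fields['Microsoft.VSTS.Scheduling.CompletedWork']}"
    end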

Time series caching in Rails website

I'm developing a Rails website and trying to implement a cache layer. The current site displays stock information for each stock using D3 chart rendering, and every other second it sends a request to the server for new data and appends it to the currently rendered D3 chart.
My design approach
I will implement a caching layer that internally queries the database, say every 10 seconds, and updates the cache with the last hour of data. At any given point the cache will always hold the last hour of data, so any request that falls within this time window will be served from the cache.
Issues:
How to store data in the cache. Currently I'm thinking of Memcached for distributed caching, with the timestamp as the key, but how do I invalidate the earliest timestamp key when a new key with the latest 10 seconds of data comes in?
Some of the data doesn't arrive sequentially; say, data for 14:02:33 arrives later than data for 14:02:38. How do I handle such scenarios?
Let me know if you guys have a better approach to designing this.
Thanks
Memcached supports expiration natively. That means you can assign a 1-hour expiration to every new key, and they will be removed automatically by Memcached.
That problem does not exist, as Memcached doesn't care about key sequences. It's a single-key, single-value system.
If you care about aligning timeframes, you can quantize time to, say, 5-second buckets, so 14:02:33 -> 14:02:30; 14:02:37 -> 14:02:35, and so on.
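A minimal Ruby sketch of both ideas, assuming the dalli gem and a local Memcached instance; the key format, sample value, and 5-second bucket size are illustrative choices:

    require "dalli"

    cache  = Dalli::Client.new("localhost:11211")
    BUCKET = 5     # seconds per quantized bucket
    TTL    = 3600  # keys expire after one hour, so no manual invalidation is needed

    # Quantize a timestamp to the start of its bucket, e.g. 14:02:33 -> 14:02:30.
    def bucket_key(time)
      "stock:#{(time.to_i / BUCKET) * BUCKET}"
    end

    # Late-arriving data simply lands in the bucket its own timestamp dictates.
    cache.set(bucket_key(Time.now), { price: 42.17 }, TTL)

    # Read the last hour back by enumerating the bucket keys.
    now  = (Time.now.to_i / BUCKET) * BUCKET
    keys = (0...(TTL / BUCKET)).map { |i| "stock:#{now - i * BUCKET}" }
    last_hour = cache.get_multi(*keys)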

Handling change of grain for a snapshot fact table in a star-schema

The question
How do you handle a change in grain (from weekly measurement to daily measurement) for a snapshot fact table?
Background info
For a star-schema design I want to incorporate the results of a survey as a fact (e.g. in week 2 of 2015, 80% of respondents answered 'yes'; in week 3, 76%; etc.).
This survey is conducted each week, and I only have access to the result of the survey (% of people saying yes this week) and not to the individual responses.
Based on (my interpretation of) Christopher Adamson's "Star Schema: The Complete Reference", I believe I should use a snapshot fact table for this kind of measurement.
The date dimension for this fact should be on the week-level, and be a conformed rollup of a more fine-grained date dimension for other facts in other stars that take place on a daily basis.
Here comes trouble
Now someone decides they want to conduct these surveys daily instead of weekly. What is the best way to handle this? Some of the options I'm currently considering:
change the week dimension to a daily one, and fake the old facts as if they happened on the last day of the week.
change the week dimension to a daily one, and add 7 facts for each weekly one.
create a new star, with the daily fact and dimension and treat the old one as an aggregate.
I'd appreciate any input. Please tell me if my logic is off, or my question is not clear :)
I'm not convinced that this is a snapshot. Each survey response represents a "transaction".
With an appropriate date dimension you can calculate the Yes/No percentages, rolled up by week.
Further, this would enable you to show results like "Surveys issued on a Sunday night get more responses", or "People who respond on Friday are more likely to answer 'Yes'". (contrived examples)
Following clarification, this does look like a periodic snapshot. The example of a bank account balance is often used to describe a similar scenario.
A key feature of a periodic snapshot is that every combination of every dimension should be present. If your grain is monthly, then every month you record the fact, even if it has not changed from the previous month.
I think that is the key to your problem. Knowing that your grain may change from weekly to daily, make your grain daily. It does mean you'll be repeating the weekly value on every day of the week, but that is a true representation of your knowledge of the fact; on Wednesday you only knew that its value was the same as on Monday.
If you design your ETL right, you won't need to make any changes when the daily updates begin.
Your second option is the one I'd choose in your place.
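A minimal Ruby sketch of that ETL shape; the WeeklyResult layout and column names are illustrative assumptions. Each weekly measurement expands into seven daily rows, so daily measurements can later slot in without structural change:

    require "date"

    # WeeklyResult is a stand-in for however the weekly survey figure arrives.
    WeeklyResult = Struct.new(:week_start, :pct_yes)

    # Repeat the weekly value on every day of that week; once daily surveys
    # begin, each day arrives as its own row and no schema change is needed.
    def to_daily_rows(result)
      (0..6).map { |offset| { date_key: result.week_start + offset, pct_yes: result.pct_yes } }
    end

    weekly = WeeklyResult.new(Date.new(2015, 1, 5), 80.0) # ISO week 2 of 2015
    to_daily_rows(weekly).each { |row| p row }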