Reversing (or undoing) a large load to a warehouse fact table - data-warehouse

Currently, we plan to record a "batch id" for each batch of facts we load. That way, we can back out the load in case we find problems.
Should we consider tracking the batch id on the dimension rows, also?
It seems like dimension rows have different rules. If we treat them as slowly-changing, and use one of the SCD algorithms that preserves history, then a reload doesn't really mean much.
Typical Scenario. Conform dimension, handling SCD. Load facts. Done.
Extension. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts. Fix the problem. Reload facts. Done.
Possible Scenario. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts and the dimension rows. Fix the problem. Conform dimension, handling SCD. Load facts. Done.
It doesn't seem like tracking dimension changes helps very much at all. Any guidance on how best to handle an "undo" or "rollback" of a data warehouse load?
Our ETL tools are entirely home-grown Python applications.

From my perspective, as long as you are not abusing your dimensions (like tracking time to the millisecond), there is not a lot to be gained by tracking batch ids on dimension rows for a rollback. You can also build a tool that cleans up unreferenced dimension rows once a month.
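To make that concrete, here is a minimal sketch of both operations. The table and column names (sales_fact, customer_dim, batch_id, customer_key) are hypothetical, and sqlite3 only stands in for whatever DB-API driver the home-grown ETL actually uses:

```python
import sqlite3  # stand-in; swap in your warehouse's DB-API driver

def rollback_batch(conn, batch_id):
    # Back out one load: delete only the facts tagged with this batch id.
    conn.execute("DELETE FROM sales_fact WHERE batch_id = ?", (batch_id,))
    conn.commit()

def sweep_unreferenced_dimensions(conn):
    # Monthly cleanup: drop dimension rows that no fact references any more.
    # With history-preserving SCD this is optional; orphans are mostly harmless.
    conn.execute("""
        DELETE FROM customer_dim
        WHERE customer_key NOT IN (SELECT DISTINCT customer_key FROM sales_fact)
    """)
    conn.commit()
```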

Related

In ML, using an RNN for an NLP project, is data redundancy necessary?

Is it necessary to repeat similar template data? The meaning and context are the same, but the smaller details vary. If I remove these redundancies, the dataset is very small (hundreds of samples), but if data like this is included, it easily runs into the thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage: do you want to detect any kind of spam, or just spam similar to what your sample data shows? If the former, your dataset is simply not suited for this problem, since there are not enough varied data samples. When you think about it, each of the sentences is essentially the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information is almost the same. And since every input sample runs through the network multiple times (once each epoch), it doesn't really help to have almost the same sample multiple times.
So you shouldn't have one kind of almost identical data sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ...", you can try it with this dataset, although you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!
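If you do decide to thin out the near-duplicate templates before training, a minimal sketch might look like the following. The normalisation step is only an assumption about what varies between your samples (numbers, spacing, casing); adapt it to whatever actually differs, such as company names:

```python
import re

def normalize(text):
    # Collapse the details that vary between templates (numbers, spacing,
    # casing) so near-identical samples map to the same key.
    text = text.lower()
    text = re.sub(r"\d+", "<num>", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def deduplicate(samples):
    seen, unique = set(), []
    for s in samples:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```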

How to treat outliers if you have a data set with ~2,000 features and can't look at each feature individually

I'm wondering how one goes about treating outliers at scale. Based on my experience, I usually need to understand why there are outliers in the first place: what causes them, whether there are any patterns, or whether they just happen randomly. I know that, theoretically, we usually define outliers as data points more than three standard deviations from the mean. But when the data is so big that you can't treat each feature one by one, and you don't know if the three-standard-deviation rule is applicable anymore because of sparsity, how do we most effectively treat the outliers?
My intuition about high-dimensional data is that the data is sparse, so the definition of "outliers" is harder to determine. Do you think we could just get away with using ML algorithms that are more robust to outliers (tree-based models, robust SVM, etc.) instead of trying to treat outliers during the preprocessing step? And if we really do want to treat them, what is the best way to do it?
I would first propose a framework for understanding the data. Imagine you are handed a dataset with no explanation of what it is. Analytics can actually be used to help us gain that understanding. Usually rows are observations and columns are parameters of some sort describing the observations. You first want a framework for what you are trying to achieve. No matter what is going on, all data centers around the interests of people; that is why we decided to record it in some format. Given that, we are at most interested in:
1.) The object
2.) Attributes of the object
3.) Behaviors of the object
4.) Preferences of the object
5.) Behaviors and preferences of the object over time
6.) Relationships of the object to other objects
7.) Effects of attributes, behaviors, preferences, and other objects on the object
So you want to identify these items. You open a data set and maybe you instantly recognize a timestamp. You then see some categorical variables and start doing relationship analysis to work out what is one-to-one, one-to-many, or many-to-many. You then identify continuous variables. These all come together to give a foundation for identifying what is an outlier.
If we are evaluating objects over time, is the rare event indicative of something that happens rarely but that we want to know about? Forest fires are outlier events, but they are events of great concern. If I am analyzing machine data and seeing rare events, and these rare events are tied to machine failure, then it matters. Basically: does the rare event or parameter show evidence that it correlates with something that you care about?
Now if you have so many dimensions that the above approach is not feasible in your judgement, then you are seeking dimension reduction alternatives. I am currently employing Singular Value Decomposition as a technique. I am already seeing situations where I achieve the same level of predictive ability with 25% of the data. Which segues into my final thought: find a mark to decide whether the outliers matter or not.
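For reference, a minimal sketch of that kind of dimension reduction, here using scikit-learn's TruncatedSVD on a placeholder matrix (the component count of 50 is arbitrary and would need tuning):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(500, 2000)        # placeholder for your ~2,000-feature matrix

svd = TruncatedSVD(n_components=50)  # keep enough components to preserve most variance
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)                       # (500, 50)
print(svd.explained_variance_ratio_.sum())   # how much variance the 50 components retain
```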
Begin by leaving them in and running your analysis, then run the work again with them removed. What were the effects? I believe that when you are in doubt, you should simply do both and see how different the results are. If there is little difference, then maybe you are good to go. If there is a significant difference of concern, then you want to take an evidence-based approach to how often the outlier actually occurs. Simply because it is rare in your data does not mean it is rare. Think of certain types of crimes that are under-reported (via arrest records): a lack of data showing politicians being arrested for insider trading does not mean that politicians are not doing insider trading en masse.
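A rough sketch of that "do both and compare" idea, using a per-feature three-standard-deviation filter and a placeholder model (both the filter and the model are stand-ins, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own feature matrix and target.
X = np.random.rand(1000, 20)
y = X @ np.random.rand(20) + np.random.rand(1000)

# Keep only rows where every feature lies within 3 standard deviations of its mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
mask = (z < 3).all(axis=1)

score_all = cross_val_score(LinearRegression(), X, y).mean()
score_trimmed = cross_val_score(LinearRegression(), X[mask], y[mask]).mean()
print(score_all, score_trimmed)  # if these barely differ, the outliers may not matter
```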

What is the best way to implement Lambda-architecture batch_layer and serving_layer?

If I am building a project applying the Lambda architecture now, should I split the batch layer and the serving layer, i.e. program A does the batch layer's work and program B does the serving layer's? They are physically independent but logically related, since program A can tell B to work after A finishes the pre-compute work.
If so, would you please tell me how to implement it? I am thinking about IPC. If IPC could help, what is the specific way?
By the way, what does "batch view" mean exactly? Why and how does the serving layer index it?
What is the best way to implement Lambda Architecture batch_layer and serving_layer? That totally depends upon the specific requirements, system environment, and so on. I can address how to design Lambda Architecture batch_layer and serving_layer, though.
Incidentally, I was just discussing this with a colleague yesterday, and this is based on that discussion. I will explain in three parts. For the sake of this discussion, let's say we are interested in designing a system that computes the most-read stories (a) of the day, (b) of the week, (c) of the year:
Firstly, in a lambda architecture it is important to divide the problem you are trying to solve with respect to time first and features second. So if you model your data as an incoming stream, then the speed layer deals with the 'head' of the stream, e.g. the current day's data, while the batch layer deals with the 'head' plus the 'tail', i.e. the master dataset.
Secondly, divide the features along these time-based lines. For instance, some features can be computed using the 'head' of the stream alone, while other features require a wider breadth of data than the 'head', i.e. the master dataset. In our example, let's say we define the speed layer to compute one day's worth of data. Then the speed layer would compute the most-read stories (a) of the day in the so-called Speed View, while the batch layer would compute the most-read stories (a) of the day, (b) of the week, and (c) of the year in the so-called Batch View. Note that, yes, there may appear to be a bit of redundancy, but hold on to that thought.
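As an illustration only, a batch view for this example could be precomputed roughly like this (pandas, with a made-up table of story-read events; a real batch layer would run over the full master dataset):

```python
import pandas as pd

# Hypothetical master dataset: one row per story-read event.
reads = pd.DataFrame({
    "story_id": ["a", "b", "a", "c", "a", "b"],
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-02",
        "2024-01-08", "2024-03-01", "2024-03-02",
    ]),
})

def top_stories(df, freq):
    # Count reads per story within each period (D = day, W = week, Y = year).
    counts = df.groupby([pd.Grouper(key="ts", freq=freq), "story_id"]).size()
    return counts.sort_values(ascending=False)

batch_view = {
    "daily": top_stories(reads, "D"),
    "weekly": top_stories(reads, "W"),
    "yearly": top_stories(reads, "Y"),
}
```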
Thirdly, the serving layer responds to queries from clients regarding the Speed View and the Batch View and merges the results accordingly. There will necessarily be overlap between the results from the Speed View and the Batch View. No matter; this division of speed vs. batch, among other benefits, allows us to minimize exposure to risks such as (1) rolling out bugs, (2) corrupt data delivery, and (3) long-running batch processes. Ideally, issues will be caught in the Speed View and fixes applied prior to the Batch View re-compute. If so, then all is well and good.
In summary, no IPC needs to be used, since the two are completely independent of each other. So program A does not need to communicate with program B. Instead, the system relies on some overlap of processing. For instance, if program B computes its Batch View on a daily basis, then program A needs to compute the Speed View for that day plus any additional time the batch processing may take. This extra time needs to cover any downtime in the batch layer.
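To make the "merge results" part concrete, here is a minimal sketch with plain dictionaries standing in for the two views (a real serving layer would query its indexed batch view and the speed layer's store instead):

```python
def merge_views(batch_view, speed_view):
    """Merge per-day read counts keyed by (day, story_id).

    Prefer the batch view wherever it has already covered a day; fall back
    to the speed view for the recent window the batch layer has not reached.
    """
    merged = dict(speed_view)
    merged.update(batch_view)  # batch results win wherever the two overlap
    return merged

# Hypothetical usage:
batch_view = {("2024-03-01", "story-a"): 120, ("2024-03-01", "story-b"): 95}
speed_view = {("2024-03-01", "story-a"): 118, ("2024-03-02", "story-a"): 40}
print(merge_views(batch_view, speed_view))
```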
Hope this helps!
Notes:
Redundancy of the batch layer - it is necessary to have at least some redundancy in the batch layer, since the serving layer must be able to provide a single cohesive view of results to queries. At the least, the redundancy may help avoid time gaps in query responses.
Assessing which features are in the speed layer - this step will not always be as convenient as in the 'most read stories' example here. This is more of an art form.

Analyzing sensor data stored in Cassandra and drawing graphs

I'm collecting data from different sensors and writing it to a Cassandra database.
The sensor ID acts as the partition key, and the timestamp of the sensor reading as the clustering column. Additionally, the sensor's value is stored.
Each sensor collects something like 30,000 to 60,000 values a day.
The simplest thing I want to do is draw a graph showing this data. This is not a problem for a few hours, but when showing a week or an even longer range, all the data has to be loaded into the backend (a Rails application) for further processing. This isn't really fast with my test dataset and won't be faster in production, I think.
So my question is how to speed this up. I thought about pre-processing the data directly in the database, but it seems that Cassandra isn't able to do such things.
For a graph with a width of 1000px it isn't interesting to draw tens of thousands of points, so it would be better to fetch only relevant, pre-aggregated data from the database.
For example, when showing the data for a whole day in a graph with a width of 1000px, it would be enough to take 1000 average values (an average per bucket of roughly 86 seconds: 60*60*24 / 1000).
Is this a good approach? Or are there other techniques to speed this up? How would I handle this in the database? Create a second table and store some average values? But the resolution of the graph may change...
Another approach would be drawing mean values by day, week, month, and so on. Maybe for this a second table could do a good job!
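To illustrate the bucketed-averaging idea, here is a minimal sketch over (timestamp, value) pairs already fetched from Cassandra, assuming epoch-second timestamps sorted ascending and a 1000px target width:

```python
def downsample(points, buckets=1000):
    """Average (timestamp, value) pairs into a fixed number of buckets,
    e.g. one averaged value per horizontal pixel of the graph."""
    if not points:
        return []
    t_min, t_max = points[0][0], points[-1][0]
    width = (t_max - t_min) / buckets or 1   # bucket width in seconds
    sums = [0.0] * buckets
    counts = [0] * buckets
    for t, v in points:
        i = min(int((t - t_min) / width), buckets - 1)
        sums[i] += v
        counts[i] += 1
    return [(t_min + i * width, sums[i] / counts[i])
            for i in range(buckets) if counts[i]]
```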
Cassandra is all about letting you write and read your data quickly. Think of it as just a data store. It can't (really) do any processing on that data.
If you want to do operations on it, then you are going to need to put the data into something else. Storm is quite popular for building computation clusters that process data from Cassandra, but without knowing exactly the scale you need to operate at, that may be overkill.
Another option which might suit you is to aggregate data on the way in, or perhaps in nightly jobs. This is how OLAP is often done with other technologies. This works well if you know in advance what you need to aggregate. You could build your rollups into hourly, daily, or whatever buckets, then pull a smaller amount of data into Rails for graphing (and possibly aggregate it even further to exactly meet the desired graph requirements).
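As an illustration of the "aggregate on the way in or in nightly jobs" idea, a nightly rollup might look like the sketch below. The keyspace, table, and column names are hypothetical, and the DataStax cassandra-driver package is assumed:

```python
from collections import defaultdict
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("sensors")  # hypothetical keyspace

def rollup_day(sensor_id, day_start, day_end):
    # Read one day of raw values and write back one averaged row per hour.
    rows = session.execute(
        "SELECT ts, value FROM raw_values "
        "WHERE sensor_id = %s AND ts >= %s AND ts < %s",
        (sensor_id, day_start, day_end),
    )
    buckets = defaultdict(list)
    for row in rows:
        hour = row.ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(row.value)
    for hour, values in buckets.items():
        session.execute(
            "INSERT INTO hourly_avg (sensor_id, ts, value) VALUES (%s, %s, %s)",
            (sensor_id, hour, sum(values) / len(values)),
        )
```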
For the purposes of storing, aggregating, and graphing your sensor data, you might consider RRDtool which does basically everything you describe. Its main limitation is it does not store raw data, but instead stores aggregated, interpolated values. (If you need the raw data, you can still use Cassandra for that.)
AndySavage is onto something here when it comes to precomputing aggregate values. This does require you to understand in advance the sorts of metrics you'd like to see from the sensor values generally.
You correctly identify the limitation of a graph in informing the viewer. The questions you need to ask really fall into areas such as:
When you aggregate, are you interested in the mean, the median, or the spread of the values?
What's the biggest aggregation that you're interested in?
What's the goal of the data visualisation - is it really necessary to be looking at a whole year of data?
Are outliers the important part of the dataset?
Each of these questions will lead you down a different path with visualisation and the application itself too.
Once you know what you want to do, an ETL process harnessing some form of analytical processing will be needed. This is where the Hadoop world would be worth investigating.
Regarding your decision to use Cassandra as your timeseries historian, how is that working for you? I'm looking at technical solutions for a similar requirement at the moment and it's one of the options on the table.

People counting using OpenCV

I'm starting research to implement a system that must count the flow of people through some place.
The final idea is to have something like http://www.youtube.com/watch?v=u7N1MCBRdl0 . I'm working with OpenCV to start creating it, and I'm reading and studying the topic. But I'd like to know if someone can give me some hints on source code examples, articles, and anything else that can help me get up to speed faster.
I started by studying the blobtrack.exe sample, but I did not get good results.
Thanks in advance.
Blob detection is the correct way to do this, as long as you choose good threshold values and your lighting is even and consistent; but the real problem here is writing a tracking algorithm that can keep track of multiple blobs, being resistant to dropped frames. Basically you want to be able to assign persistent IDs to each blob over multiple frames, keeping in mind that due to changing lighting conditions and due to people walking very close together and/or crossing paths, the blobs may drop out for several frames, split, and/or merge.
To do this 'properly' you'd want a fuzzy ID assignment algorithm that is resistant to dropped frames (i.e. the blob ID remains, and motion is ideally predicted, if the blob drops out for a frame or two). You'd probably also want to keep a history of ID merges and splits, so that if two IDs merge into one, and then that one splits into two, you can re-assign the individual merged IDs to the resulting two blobs.
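A very stripped-down sketch of that idea in Python with OpenCV (contour centroids stand in for blobs; the distance threshold and missing-frame tolerance are arbitrary placeholders, and the merge/split history is left out):

```python
import cv2
import numpy as np

class CentroidTracker:
    """Assign persistent IDs to blob centroids, tolerating a few dropped frames."""

    def __init__(self, max_dist=50, max_missing=5):
        self.next_id = 0
        self.tracks = {}          # id -> (centroid, frames_missing)
        self.max_dist = max_dist
        self.max_missing = max_missing

    def update(self, centroids):
        assigned = set()
        for tid, (prev, missing) in list(self.tracks.items()):
            # Match each existing track to its nearest unassigned centroid.
            best = None
            for i, c in enumerate(centroids):
                if i in assigned:
                    continue
                d = np.hypot(c[0] - prev[0], c[1] - prev[1])
                if d < self.max_dist and (best is None or d < best[1]):
                    best = (i, d)
            if best is not None:
                assigned.add(best[0])
                self.tracks[tid] = (centroids[best[0]], 0)
            elif missing + 1 > self.max_missing:
                del self.tracks[tid]           # track lost for too long
            else:
                self.tracks[tid] = (prev, missing + 1)
        for i, c in enumerate(centroids):
            if i not in assigned:              # new blob, new persistent ID
                self.tracks[self.next_id] = (c, 0)
                self.next_id += 1
        return {tid: c for tid, (c, _) in self.tracks.items()}

def centroids_from_mask(mask):
    # Blob centroids from a thresholded foreground mask (OpenCV 4.x signature).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    result = []
    for cnt in contours:
        m = cv2.moments(cnt)
        if m["m00"] > 0:
            result.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return result
```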
In my experience the openFrameworks openCv basic example is a good starting point.
I'll not put this as the right answer.
It is just an option for those who are able to read Portuguese or can use a translator. It's my graduation project, and it contains the explanation of one option for counting people.
Limitations:
It does not behave well in environments where the background light changes a lot.
It must be configured for each location where you will use it.
Advantages:
It's fast!
I used OpenCV for the basic features, such as capturing the screen, going through the pixels, etc. But the algorithm to count people was written by myself.
You can check it in this paper.
Final opinion about this project: it's not ready to go live or become a product, but it works very well as a basis for study.
