Modelling chart data in a star schema

I am interested in storing scientific data from a chart (plot) of say x against y in a data warehouse, where both x and y are real numbers.
Each chart will be generated for a fixed set of descriptive dimensions (e.g. time, date, location, equipment) which can be modelled in a traditional star/snowflake schema.
An example would be, say, angle vs. response of a detector, where angle is the independent variable and response is the dependent variable. Angle here could be any real number between 0 and 360 degrees.
My current thought is to use the real value as a dimension, potentially prepopulating an angle_dimension table with values from 0 to 360 at a suitable resolution (e.g. 3 decimal places) and rounding the measured results where necessary, although this results in a loss of precision.
I am wondering if there are any more effective ways to store this data for later use in an OLAP cube?
The type of query I'd be looking to do is to compare chart data at different time points to look for changes or to look at the average response in a given range (say 0-15 degrees) at different locations or for different equipment.

Your last paragraph gives, I think, a good hint about how you want to store it for analysis: by time, by angle range, by location, by equipment - all of which would be dimensions.
One way of modelling this could be to have the grain of the fact as 'one row per plot point', with the two real values in the fact, losing no precision.
You could then add supplementary dimensions, as you say, to categorise the figures. In your angle example, you could also have 'angle range' as a column showing 0-15, 16-30, etc.
You may need a more complex/generic design if you have more than angle and response to contend with: a generic 'X Axis' dimension that includes the range, plus an additional 'X Axis type' column whose value is 'angle', 'response', etc.
I think your broad idea here is sound, and off-the-shelf tools should be fine with both details and summaries. The key is to model something that reflects both the essential nature of the thing you're measuring (i.e. a reading from a machine) and how people want to analyse it. You would want to use the cube's capabilities to provide the calculations for averages, rather than having the underlying dimensional model deal with that.
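As an illustration only, here is a minimal sketch of that grain and the 'angle range' attribute, assuming the fact rows have been loaded into a pandas DataFrame with hypothetical column names (angle, response, location, equipment):

import pandas as pd

# Hypothetical fact table: one row per plot point, full-precision angle and response,
# plus attributes/keys for the descriptive dimensions.
fact = pd.DataFrame({
    "angle":     [2.134, 7.892, 14.250, 21.007, 3.501],
    "response":  [0.91, 0.87, 0.83, 0.40, 0.95],
    "location":  ["site_a", "site_a", "site_b", "site_b", "site_a"],
    "equipment": ["det_1", "det_1", "det_2", "det_2", "det_1"],
})

# Derive the 'angle range' attribute (0-15, 15-30, ...) without rounding the raw angle.
fact["angle_range"] = pd.cut(fact["angle"], bins=range(0, 375, 15), right=False)

# Example analysis from the question: average response between 0 and 15 degrees, per location.
print(fact[fact["angle"].between(0, 15)].groupby("location")["response"].mean())

In a real cube the averaging would be done by the OLAP layer, as noted above; the groupby is only standing in for that here.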

Related

Approximate missing value for input given a data set

I have a data set with x attributes and y records. Given an input record which has up to x-1 missing values, how would I reasonably approximate one of the remaining missing values?
So in the example below, the input record has two values (for attributes 2 and 6, with the rest missing) and I would like to approximate a value for attribute 8.
I know missing values are dealt with through 'imputation', but I'm generally finding examples about pre-processing datasets. I'm looking for a solution which uses regression to determine the missing value and, ideally, makes use of a model which is built once (if possible, so as not to have to generate one each time).
The number of possibilities for which attributes are present or absent makes it seem impractical to maintain a collection of models, such as linear regressions, that would cover all of the cases. The one approach that seems practical to me is the one where you don't really build a model at all: nearest neighbors regression. My suggestion would be to use whatever attributes you have available and compute the distance to your training points. You could use the value from the nearest neighbor, or the (possibly weighted) average of several nearest neighbors. In your example, we would use only attributes 2 and 6 to compute distance. The nearest point is the last one (3.966469, 8.911591). That point has value 6.014256 for attribute 8, so that is your estimate of attribute 8 for the new point.
Alternatively, you could use the three nearest neighbors. Those are points 17, 8 and 12, so you could use the average of their values of attribute 8, or a weighted average; people sometimes use the weights 1/dist. Of course, three neighbors is just an example - you could pick another k.
This is probably better than using the global average (8.4) for all missing values of attribute 8.
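For concreteness, here is a minimal sketch of that idea in Python/NumPy - not a library routine, just the distance-on-available-attributes logic described above, with a made-up training array:

import numpy as np

def knn_impute(train, query, target_idx, k=3):
    # Estimate a missing attribute via (inverse-distance weighted) k-nearest neighbors,
    # using only the attributes that are actually present (non-NaN) in the query.
    observed = ~np.isnan(query)
    observed[target_idx] = False                  # never use the attribute being estimated
    diffs = train[:, observed] - query[observed]
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + 1e-12)       # the 1/dist weighting mentioned above
    return np.average(train[nearest, target_idx], weights=weights)

# Toy usage: 'train' is a complete (records x attributes) array; the query has only
# attributes 2 and 6 observed and we want an estimate for attribute 8.
train = np.random.rand(20, 10) * 10
query = np.full(10, np.nan)
query[[2, 6]] = [3.966469, 8.911591]
print(knn_impute(train, query, target_idx=8, k=1))   # k=1 reproduces the single-neighbor case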

Confusion about (Mean) Average Precision

In this question I asked for clarifications about the precision-recall curve.
In particular, I asked whether we have to consider a fixed number of ranked results to draw the curve, or whether we can reasonably choose that number ourselves. According to the answer, the second option is correct.
However, now I have a big doubt about the Average Precision (AP) value: AP is used to estimate numerically how good our algorithm is for a given query. Mean Average Precision (MAP) is the average of AP over multiple queries.
My doubt is this: if AP changes according to how many objects we retrieve, then we can tune this parameter to our advantage so that we show the best AP value possible. For example, supposing that the precision-recall curve performs wonderfully up to 10 elements and then horribly, we could "cheat" by computing the (M)AP value considering only the first 10 elements.
I know that this could sound confusing, but I didn't find anything about this anywhere.
AP is the area under the precision-recall curve, and the precision-recall curve is supposed to be computed over the entire returned ranked list.
It is not possible to cheat AP by tweaking the size of the returned ranked list. AP is the area below the precision-recall curve, which plots precision as a function of recall, where recall is the number of returned positives relative to the total number of positives that exist in the ground truth - not relative to the number of positives in the returned list. So if you crop the list, all you are doing is cropping the precision-recall curve and leaving out its tail. As AP is the area under the curve, cropping the list can only reduce AP, so there is no wisdom in tweaking the ranked list size: the maximal AP is achieved if you return the entire list. You can see this, for example, from the code you cited in your other question - cropping the list simply corresponds to
for ( ; i<ranked_list.size(); ++i) {
changing to
for ( ; i<some_number; ++i) {
which results in fewer increments of ap (all increments are non-negative, since old_precision and precision are non-negative and recall is non-decreasing) and thus a smaller AP value.
In practice, for purely computational reasons, you might want to crop the list at some reasonable length, e.g. 10k, as it is unlikely that AP will change much beyond that: precision@large_number is likely to be close to 0 unless you have an unusually large number of positives.
Your confusion might be related to the way some popular functions, such as VLFeat's vl_pr, compute the precision-recall curve: they assume you have provided the entire ranked list, and therefore compute the total number of positives in the ground truth by just looking at the ranked list instead of the ground truth itself. So if you used vl_pr naively on cropped lists you could indeed cheat it, but that would be an invalid computation. I agree it's not 100% clear from the description of the function, but if you examine the documentation in more detail you'll see it mentions NUMNEGATIVES and NUMPOSITIVES, so if you are giving it an incomplete ranked list you should set these two quantities to let the function know how to compute the precision-recall curve / AP properly. Now, if you plot different crops of a ranked list using vl_pr but with the same NUMNEGATIVES and NUMPOSITIVES for all calls, you will see that the precision-recall curves are just crops of each other, as I was explaining above (I haven't checked this as I don't have MATLAB here, but I'm fairly certain it's the case, and if it's not we should file a bug).
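To make the argument concrete, here is a minimal sketch (not the VLFeat code) using one common formulation of AP, in which precision is accumulated at each relevant rank and divided by the total number of ground-truth positives; the num_positives argument plays the role of NUMPOSITIVES above:

def average_precision(ranked_relevance, num_positives):
    # ranked_relevance: 0/1 relevance flags in ranked order.
    # num_positives: total positives in the ground truth, which may exceed
    # the number of positives present in the (possibly cropped) list.
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank              # precision at this recall point
    return ap / num_positives

full_list = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(average_precision(full_list, num_positives=3))       # ~0.656 for the full list
print(average_precision(full_list[:5], num_positives=3))   # ~0.556: cropping only loses area

The cited code computes the area trapezoidally rather than by summing precision at relevant ranks, but the effect of cropping is the same: every term dropped is non-negative.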
What you said is partially correct. If you get a reasonable MAP or AP on the top N retrieved documents, that's fine. It's not cheating, because your IR system is retrieving a good number of relevant documents within the top N returned documents - but yes, it is still missing some relevant docs. Note that for an IR system it matters less whether it retrieves every relevant document than whether it ranks the relevant documents it does retrieve highly, and that is what AP measures (higher rank here means rank 1 or 2 rather than 100 or 101).
Consider an example: you have two relevant documents, one returned at rank 1 and the other at rank 50. Now, if you compute MAP or AP over the top 10 returned documents, then you must report the answer as MAP@10 or AP@10. Generally AP means average precision over all returned documents, but if you consider only the top N documents your metric is AP@N rather than plain AP - and note that this is not cheating. But if you compute AP@N and report it as AP, then you are giving your readers only partial information.
An important fact about MAP: if a relevant document is never retrieved, we take the precision corresponding to that relevant document to be zero. While computing AP, we divide the accumulated precision by the total number of relevant documents. So when you compute MAP@N or AP@N, it means you only care about the top N documents returned by the IR system. For example, I have used MAP@100 in one of my research works.
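As a sketch only, following the convention just described (still dividing by the total number of relevant documents; some definitions divide by min(N, total relevant) instead), AP@N differs from the earlier function only in truncating the list:

def average_precision_at_n(ranked_relevance, num_positives, n):
    # AP@N: accumulate precision at each relevant rank within the top N only;
    # relevant documents that never appear in the top N contribute zero,
    # and we still divide by the total number of relevant documents.
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance[:n], start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / num_positives

# Two relevant documents, at ranks 1 and 50, as in the example above.
ranking = [1] + [0] * 48 + [1]
print(average_precision_at_n(ranking, num_positives=2, n=10))   # AP@10 = 0.50
print(average_precision_at_n(ranking, num_positives=2, n=50))   # AP@50 = 0.52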
If you are still confused about AP or MAP, you can see my brief answer explaining them here. Hopefully it will help to clarify things.

DBSCAN using spatial and temporal data

I am looking at data points that have lat, lng, and date/time of an event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works OK at clustering lat and lng, my concern is that it will fall apart when incorporating temporal information, since it's not on the same scale or the same type of distance.
What are my options for incorporating temporal data into the DBSCAN algorithm?
Look up Generalized DBSCAN by the same authors.
Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.
For (Generalized) DBSCAN, you need two functions:
findNeighbors - get all "related" objects from your database
corePoint - decide whether this set is enough to start a cluster
Then you can repeatedly find neighbors to grow the clusters.
Function 1, findNeighbors, is where you want to hook in, for example by using two thresholds: one geographic and one temporal (e.g. within 100 miles and within 1 hour).
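As an illustration of that dual-threshold neighborhood predicate (a sketch, not the published GDBSCAN implementation; the point layout and thresholds are made up):

import math
from datetime import timedelta

def haversine_miles(p, q):
    # Great-circle distance between two (lat, lng) pairs, in miles.
    lat1, lng1, lat2, lng2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 3959.0 * 2 * math.asin(math.sqrt(a))

def find_neighbors(points, i, max_miles=100.0, max_gap=timedelta(hours=1)):
    # GDBSCAN-style neighborhood: a point is a neighbor only if it is close
    # both geographically and temporally. 'points' is a list of (lat, lng, timestamp).
    lat, lng, t = points[i]
    return [j for j, (la, ln, tj) in enumerate(points)
            if j != i
            and haversine_miles((lat, lng), (la, ln)) <= max_miles
            and abs(tj - t) <= max_gap]

A core point is then any point whose neighborhood reaches MinPts, and cluster expansion proceeds exactly as in plain DBSCAN.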
tl;dr: you are going to have to modify your feature set, i.e. scale your date/time so it matches the magnitude of your geo data.
DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitude bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scale will skew the results.
Now, I suppose you could modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function instead of using the default Euclidean distance.
IMHO, though, the easier thing to do is to scale one of your dimensions to match the other: just multiply your time values by a fixed, linear factor so they are on the same order of magnitude as the geo values, and you should be good to go.
More generally, this is part of the feature selection and engineering process, which is arguably the most important part of solving any machine learning problem. Choose the right features, and transform them correctly, and you'll be more than halfway to a solution.
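A minimal sketch of that scaling approach with scikit-learn's DBSCAN, assuming a small made-up array of (lat, lng, epoch seconds); the scaling factor is exactly the judgement call described above:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical events: latitude, longitude (degrees) and time (epoch seconds).
events = np.array([
    [40.71, -74.00, 0],
    [40.72, -74.01, 600],
    [40.71, -74.00, 90000],    # same place, roughly a day later
    [34.05, -118.24, 300],
])

# Arbitrary choice: one hour of separation counts like ~0.01 degrees of separation.
time_scaled = events[:, 2:] / 3600.0 * 0.01
features = np.hstack([events[:, :2], time_scaled])

labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(features)
print(labels)   # events close in both space and time share a label; -1 marks noise

Alternatively, sklearn's DBSCAN accepts a callable metric, which is the "supply your own distance function" route mentioned above.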

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and just found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a useful form? I mean, how can I use it in practical life (in an application)?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1,000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You could perform your machine learning task on these vectors directly. But instead you could decide on 5 genres of movies and, using the data you already have, figure out whether each person likes or dislikes an entire genre, and in this way reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people might like movies only in their preferred genres.
However, it is not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
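Here is a minimal sketch of that 100-to-5 reduction, with made-up data and a made-up movie-to-genre mapping, just to show the shapes involved:

import numpy as np

n_people, n_movies, n_genres = 1000, 100, 5

# Hypothetical data: 1 if person p likes movie m, else 0.
likes = np.random.randint(0, 2, size=(n_people, n_movies))

# Hypothetical mapping: movies 0-19 are genre 0, 20-39 are genre 1, and so on.
genre_of_movie = np.arange(n_movies) // (n_movies // n_genres)

# Reduce 100 dimensions to 5: a person 'likes a genre' if they like
# more than half of the movies in that genre.
reduced = np.zeros((n_people, n_genres), dtype=int)
for g in range(n_genres):
    in_genre = genre_of_movie == g
    reduced[:, g] = (likes[:, in_genre].mean(axis=1) > 0.5).astype(int)

print(likes.shape, "->", reduced.shape)   # (1000, 100) -> (1000, 5)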
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis (PCA), which does something similar (and, incidentally, plotting the results of which was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables - to analyse the relationship between these, one obviously plots in two dimensions and might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resulting multidimensional graph to find the set of two or three axes within it which contain the largest amount of information. For example, the first principal coordinate will be a composite axis (i.e. at some angle through n-dimensional space) which carries the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resulting graph in 2D or 3D will typically give you a visualisation of the data which contains a significant amount of the information in the original dataset. It's usual for the technique to be considered valid when the representation captures around 70% of the information in the original data - enough to visualise relationships, with some confidence, that would otherwise not be apparent in the raw statistics. Note that the technique requires that all factors have the same weight, but given that, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980 - which is about as powerful as an iPhone).
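A minimal modern sketch of the same workflow, using scikit-learn on made-up data; the explained-variance ratio is the "around 70% of the information" check mentioned above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 200 samples described by 30 correlated factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))                       # 3 underlying directions
data = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))

# Standardise so all factors get equal weight, then keep the top 2 components.
scaled = StandardScaler().fit_transform(data)
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(scaled)

print(coords_2d.shape)                                   # (200, 2): ready to scatter-plot
print(pca.explained_variance_ratio_.sum())               # fraction of variance retained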
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (Principal Component Analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization-based methods, etc.
Here's a simple example. You have a lot of text files, and each file consists of some words. The files can be classified into two categories. You want to visualise each file as a point in 2D/3D space so that you can see the distribution clearly, so you need dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
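For instance, a sketch with a toy corpus; scikit-learn's TF-IDF vectoriser and truncated SVD are assumed here, since the answer above doesn't prescribe a specific method:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus: two rough topics; each document starts as a high-dimensional word vector.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply today",
    "the market rallied after the earnings report",
]

tfidf = TfidfVectorizer().fit_transform(docs)            # shape (4, vocabulary_size)
coords = TruncatedSVD(n_components=2).fit_transform(tfidf)

print(tfidf.shape, "->", coords.shape)                   # (4, vocab_size) -> (4, 2)
for doc, (x, y) in zip(docs, coords):
    print(f"({x:+.2f}, {y:+.2f})  {doc}")                # points ready for a 2D scatter plot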
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring longitude, latitude and height to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get on a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than with the 3-dimensional version.
It's a data mining technique. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analysing patterns in visual data, but can process a maximum of three dimensions (four if you count time, i.e. animated displays) - so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed, so we could say that the database is going to have many dimensions.
As a matter of fact, each database record will include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQ, shoe size is easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs from shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of the records initially. Principal component analysis, various forms of factor analysis and other methods are extensions of this simple idea.

Most meaningful way to compare multiple time series

I need to write a program that performs arithmetic (+ - * /) on multiple time series with different date ranges (mostly from 2007-2009) and frequencies (weekly, monthly, yearly...).
I came up with the following:
find the series with the highest frequency, then fill in the other series with zeros so they have the same number of elements, then perform the operation.
How can I present the data in the most meaningful way?
I'm trying to think of all the possibilities.
If zero can be a meaningful value for this time series (e.g. temperature in degrees Celsius), it might not be a good idea to fill all the gaps with zeros, because you will not be able to distinguish between real and stub values afterwards. You might want to interpolate your time series instead. A basic data structure for this can be an array or a doubly linked list.
You can take several approaches:
use the finest-grained time series data (for instance, seconds) and interpolate/fill data when needed
use the coarsest-grained (for instance, years) and summarize data when needed
any middle step between the two extremes
You should always know your data, because:
in the case of interpolation you have to choose the best algorithm (linear or quadratic interpolation, splines, exponential...)
in the case of summarizing you have to choose an appropriate aggregation function (sum, maximum, mean...)
Once you have the same time scale for all the time series you can perform your arithmetical magic, but be aware that interpolation generates extra information and summarization removes available information.
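A minimal pandas sketch of the two options, with made-up weekly and monthly series (the library and frequencies here are just one way to do it):

import pandas as pd

# Hypothetical series over the same rough date range, at different frequencies.
weekly = pd.Series(range(10), index=pd.date_range("2007-01-07", periods=10, freq="W"), dtype=float)
monthly = pd.Series([5.0, 7.0, 6.0], index=pd.date_range("2007-01-01", periods=3, freq="MS"))

# Option 1: go to the finest grain (weekly) and interpolate the monthly series.
monthly_as_weekly = monthly.resample("W").interpolate(method="linear")

# Option 2: go to the coarsest grain (monthly) and aggregate the weekly series.
weekly_as_monthly = weekly.resample("MS").mean()

# Either way the indices now line up, so arithmetic is well defined.
print(weekly_as_monthly + monthly)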
I've studied this problem fairly extensively. The danger of interpolation methods is that you bias various measures - especially volatility - and introduce spurious correlation. I found that Fourier interpolation mitigated this to some extent but the better approach is to go the other way: aggregate your more frequent observations to match the periodicity of your less frequent series, then compare these.
