Rules engine for spatial and temporal reasoning? - spatial

I have an application that receives a number of datums that characterize 3 dimensional spatial and temporal processes. It then filters these datums and creates actions which are then sent to processes that perform the actions. Rinse and repeat.
At present, I have a collection of custom filters that perform a lot of complicated spatial/temporal calculations.
Many times as I discuss my system to individuals in my company, they ask if I'm using a rules engine.
I have yet to find a rules engine that is able to reason well temporally and spatially. (Things like: When are two 3D entities ever close? Is 3D entity A ever contained in 3D region B? If entity C is near entity D but oriented backwards relative to C then perform action D.)
I have looked at Drools, Cyc, Jess in the past (say 3-4 years ago). It's time to re-examine the state of the art. Any suggestions? Any standards that you know of that support this kind of reasoning? Any defacto standards? Any applications?
Thanks!

Premise - remember that a SQL-based1 DBMS is a (quite capable) inference engine, as can be seen from these comparisons between SQL and Prolog:
prolog to SQL converter
difference between SQL and Prolog
To address specifically your spatio-temporal applications, this book will help:
TEMPORAL DATA AND THE RELATIONAL MODEL - A Detailed Investigation into
the Application of Interval and Relation Theory to the Problem of Temporal Database Management.
That is, combining Interval and Relation Theory is possible to reasoning about spatio-temporal problems effectively (see 5.2 Applications of Intervals).
Of course, if your SQL-based DBMS is not (yet) equipped with interval (and other) operators you will need to extend it appropriately (via store-procedures and/or User-Defined Functions - UDFs).
Update: skimming the paper pointed out in comments by timemirror (Towards a 3D Spatial Query Language for Building Information Models) they do essentially what I touched on above:
(last page)
IMPLEMENTATION CONCEPTS
The implementation of the abstract
type system into a query language will
be performed on the basis of the query
language SQL, which is a widely
established standard in the field of
object-relational databases. The
international standard SQL:1999
extends the relational model to
include object-oriented aspects, such
as the possibility to define complex
abstract data types with integrated
methods.
I do not concur with the "object-relational database" terminology (for reason off-topic here) but I think the rest is pertinent.
Update: a quote regardind 3D and interval theory from the book cited above:
NOTE: All of the intervals discussed
so far can be thought of as
one-dimensional. However, we might
want to combine two one-dimensional
intervals to form a twodimensional
interval. For example, a rectangular
plot of ground might be thought of as
a two-dimensional interval, because it
is, by definition, an object with
length and width, each of which is
basically a one-dimensional interval
measured along some axis. And, of
course, we can extend this idea to any
number of dimensions. For example, a
(rather simple!) building might be
regarded as a three-dimensional
interval: It is an object with length,
width, and height, or in other words a
cuboid. (More realistically, a
building might be regarded as a set of
several such cuboids that overlap in
various ways.) And so on. In what
follows, however, we will restrict our
attention to one-dimensional intervals
specifically, barring explicit
statements to the contrary, and we
will omit the "one-dimensional"
qualifier for simplicity.
Note
I wrote SQL-based and not relational because there are ways to use such DBMSes that completely deviate from relational theory.

This is Spatial Reasoning... a few models but 9DE-IM is now accepted by OGC and implemented in PostGIS and other programming tools.
PostGIS implements a spatial reasoning engine based on dimensionally extended 9 intersection model... 9DE-IM..
http://postgis.refractions.net/documentation/manual-svn/ch04.html#DE-9IM
check sect 4.3.6.1. Theory...
So does the Java Topology Suite (and Net Topology suite for C# etc)...
http://docs.codehaus.org/display/GEOTDOC/Point+Set+Theory+and+the+DE-9IM+Matrix
In particualr check out the geometry.relate stuff.. such as
boolean isRelated = geometry.relate( geometry2, "T*T***T**" )
You can test the relationships, or filter data based on them.
Works with pts, lines, polygons etc...
This might help on temporal stuff..
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.4643&rep=rep1&type=pdf

Check out SpatialRules at http://www.objectfx.com/. It's a geospatial complex event processor for 2D and 3D.

Related

replicating trees between ACID RDB using CRDT

I'm interested in replicating "hierachies" of data say similar to addresses.
Area
District
Sector
Unit
but you may have different pieces of data associated to each layer, so you may know the area of Sectors, but not of units, and you may know the population of a unit, basically its not a homogenious tree.
I know little about replication of data except brushing Brewers theorem/CAP, and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB, into other ACID RDBs, systemically the system needs to eventually converge, and obviously each RDB will enforce its own local consistent view, but any 2 nodes may not match at any given time (except 'eventually').
The simplest way to approach this is to simple store all the data in a single message from some designated leader and distribute it...like an overnight dump and load process, but thats too big.
So the next simplest thing (I thought) was if something inside an area changes, I can export the complete set of data inside an area, and load it into the nodes, thats still quite a coarse algorithm.
The next step was if, say an 'object' at any level changed, was to send all the data in the path to that 'object', i.e. if something in a sector is amended, you would send the data associated to the sector, its parent the district, and its parent the sector (with some sort of version stamp and lets say last update wins)....what i wanted to do was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which potentially would be created if it didn't exist).
then i stumbled on CRDTs and thought....ah...I'm reinventing the wheel here, and the algorithms are allegedly easy in principle, but tricky to get correct in practice
are there standards accepted patterns to do this sort of thing?
In my use case the hierarchies are quite shallow, and there is only a single designated leader (at this time), I'm quite attracted to state based CRDTs because then I can ignore ordering.
Simplicity is the key requirement.
Actually it appears I've reinvented (in a very crude naive way) the SHELF algorithm.
I'll write some code and see if I can get it to work, and try to understand whats going on.

Dynamic Process parameter adjustment in Semiconductor manufacturing data

I have process parameter data from semiconductor manufacturing.and requirement is to suggest what could be the best parameter adjustment to be made to process parameter to get better yield ie best path for high yield. what machine learning /Statistical models best suits this requirement
Note:I have thought of using decision tree which can give us best path for high yield.
Would like to know it any other methods that can be more efficient
data is like
lotno x1 x2 x3 x4 x5 yield(%)
<95% yield is considered as 0 and >95% as 1
I'm not really sure of the question here, but as a former semiconductor process engineer, here is how I look at the yield improvement approach - perspective.
Process Development.
DOE: Typically, I would run structured DOEs to understand my process (#4). I would first identify "potential" 'factors', and run various "screening" experiments to identify statistical significance. With the goal basically here to identify the most statistically significant (and for that matter, least significant) factors. So these are inherently simple experiments, low # of "levels" which don't target understanding of the curvature of the response surface, they just look for magnitude change of response vs factor. Generally, I am most concerned with 'Process' factors, but it is important to recognize that the influence of variable inputs can come from more than just "machine knobs' as example. Variable can arise from 1) People, 2) Environment (moisture, temp, etc), 3) consumables (used in the process), 4) Equipment (is 40 psi on this tool really 40 psi and the same as 40 psi on a different tool) 4) Process variable settings.
With the most statistically significant factors, I would run more elaborate DOE using the major factors and analyze this data to develop a model. There are generally more 'levels' used here to allow for curvature insight of the response surface via the analysis. There are many types of well known standard experimental design structures here. And there is software such as JMP that is specifically set up to do this analysis.
From here, the idea would be to generate a model in the form of Response = F (Factors). That allows you to essentially optimize the response based upon these factors where the response is a reflection of your yield criteria.
From here, the engineer would typically execute confirmation runs with optimized factors to confirm optimized response.
Note that the software analysis typically allows for the engineer to illuminate any run order dependence. The execution of the DOE is typically performed in a randomized cell fashion. (Each 'cell' is a set of conditions for the experiment). Similarly the experiments include some level of repetition to gauge 'repeatability' of the 'system'. This inclusion can be explicit (run the same cell twice), but there is also some level of repeatability inherent in the design as well since you are running multiple cells, albeit at difference settings. But generally, the experiment includes explicitly repeated cells.
And finally there is the concept of manufacturability, which includes constraints of time, cost, physical limits, equipment capability, etc. (The ideal process works great, but it takes 10 years, costs 1 million dollars and requires projected settings outsides the capability of the tool.)
Since you have manufacturing data, hopefully, you have the data that captures the other types of factors as well (1,2,3), so you should specifically analyze the data to try to identify such effects. This is typically done as A vs B comparisons. Person A vs B, Tool A vs B, Consumable A vs B, Consumable lot A vs B, Summer vs Winter, etc.
Basically, there are all sorts of comparisons you could envision here and check for statistically differences across two sets of populations.
A comment on response: What is the yield criteria? You should know this in order to formulate the model. For semiconductors, we have both line yield (process yield) but there is also device yield. I assume for your work, you are primarily concerned with line yield. So minimizing variability in the factors (from 1,2,3,4) to achieve the desired response (target response(s) with minimal variability) is the primary goal.
APC (Advanced Process Control).
In many cases, there is significant trending that results from whatever reason; crappy tool control (the tool heats up), crappy consumable (the target material wears, the polishing pad wears, the chemical bath gets loaded, whatever), and so the idea here is how to adjust the next batch/lot/wafer based upon the history of what came prior. Either improve the manufacturing to avoid/minimize this trending (run order dependence) or adjust process to accommodate it to achieve the desired response.
Time for lunch, hope this helps...if you post on the specific process module type, and even equipment and consumables, I might be able to provide more insight.

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration -
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
for each (RATING that PERSON has made) do
if (USER has rated the item that RATING refers to) do
MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
end
end
end
lowest values in MATCHES are the best matches for USER
The problem here being that as the number of items, ratings, and people increase, this code will take a very significant time to run, and ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
Technically your task is matching long strings made out of characters of a 5 letter alphabet. This kind of stuff is researched extensively in the area of computational biology. (Typically with 4 letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With rating, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the number of operations. However, a key trick of all the Hadoop solutions is to transpose the matrix, and work only on the non-zero values in the columns. Assuming that your sparsity is s=0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, your computation will be theoretically 10000 times faster.
Note that the resulting data will still be a O(n*n) distance matrix, so strictl speaking the problem is still quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.

How to bulk-load an r-tree in C#?

I am looking for C# code to construct an r-tree. I have code that builds an r-tree incrementally i.e. items are added one by one to the tree, but I guess a better r-tree could be built if all items are given all at once to the tree creation algorithm. Please let me know if anyone knows how to bulk-load an r-tree in this manner. I tried doing some search but couldn't find anything very useful.
The most common method for low-dimensional point data is sort-tile-recursive (STR). It does exactly that: sort the data, tile it into the optimal number of slices, then recurse if necessary.
The leaf level of a STR-loaded tree with point data will have no overlap, so it is really good. Higher levels may have overlap, as STR does not take the extend of objects into account.
A proven good bulk-loading is a key component to the Priority-R-Tree, too.
And even when not bulk-loading, the insertion strategy makes a big difference. R-Trees built with linear splits such as Guttmans or Ang-Tan will usually be worse than those built with the R*-Tree split heuristics. In particular Ang-Tan tends to produce "sliced" pages, that are very unbalanced in their spatial extend. It is a fast split strategy and probably the simplest, but the results aren't good.
A paper by Achakeev et al.,Sort-based Parallel Loading of R-trees might be of some help. And you could also find other methods in their references.

Apriori Algorithm Implementation

I am using an apiori algorithm implementation to generate association rules from a transaction set and I am getting the following association rules. but I get an association rules 1->8 can i assume 8->1 because see the association rules it starts from 0 and ends till 9 because there are 10 product classes, but using this algorithm I am not getting something like 8->2 or 9->1, so can i reverse an association rules 2->8 to 8->2. if not can someone point to a better apiori algorithm implementation
0-->5
0-->9
1-->2
1-->4
1-->5
1-->7
1-->8
1-->9
2-->3
2-->4
2-->5
2-->6
2-->7
2-->8
2-->9
3-->4
3-->5
3-->6
3-->7
3-->8
4-->5
4-->6
4-->7
4-->8
4-->9
5-->6
5-->7
5-->8
5-->9
6-->7
6-->8
6-->9
7-->8
7-->9
8-->9
You can get my favourite apriori implementation here:
http://www.borgelt.net/apriori.html
(Christian Borgelt also has implementations for many other mining algorithms.)
I use it regularly to mine datasets with millions of entries and it's blazingly fast.
And you can configure it to do what you want (frequent item sets vs. association rules).
of course you can assume so (1=>9 is equal to 9=>1). the items are basically combination among the others, not permutation.
FPGrowth is way more efficient than Apriori
If you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
http://www.philippe-fournier-viger.com/spmf/
It also offers implementations of Eclat, FPGrowth, Charm and many other algorithms that can be used for association rule mining, frequent itemset mining, sequential pattern mining and sequential rule mining.

Resources