How to bulk-load an R-tree in C#?

I am looking for C# code to construct an r-tree. I have code that builds an r-tree incrementally, i.e. items are added one by one to the tree, but I guess a better r-tree could be built if all items were given to the tree-creation algorithm at once. Please let me know if anyone knows how to bulk-load an r-tree in this manner. I tried searching but couldn't find anything very useful.

The most common method for low-dimensional point data is Sort-Tile-Recursive (STR). It does exactly what the name says: sort the data, tile it into the optimal number of slices, then recurse if necessary.
The leaf level of an STR-loaded tree with point data will have no overlap, so it is really good. Higher levels may have overlap, as STR does not take the extent of objects into account.
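For illustration, here is a minimal sketch of STR bulk loading in C#, assuming a hypothetical Node type that carries a minimum bounding rectangle (leaf entries are just nodes wrapping your items); it is not taken from any particular library:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal type for this sketch; not tied to any specific R-tree library.
public class Node
{
    public object Item;                    // leaf payload (null for inner nodes)
    public List<Node> Children = new();    // child nodes (empty for leaves)
    public double MinX, MinY, MaxX, MaxY;  // minimum bounding rectangle

    public double CenterX => (MinX + MaxX) / 2;
    public double CenterY => (MinY + MaxY) / 2;
}

public static class StrBulkLoader
{
    // Sort-Tile-Recursive: sort by x, cut into vertical slices, sort each slice by y,
    // pack runs of `capacity` entries, then repeat one level up until a root remains.
    // Assumes a non-empty list of entries.
    public static Node Load(List<Node> entries, int capacity)
    {
        var level = entries;
        while (level.Count > capacity)
            level = PackLevel(level, capacity);
        return Wrap(level);
    }

    static List<Node> PackLevel(List<Node> nodes, int capacity)
    {
        int pageCount = (nodes.Count + capacity - 1) / capacity;          // pages needed on this level
        int sliceSize = (int)Math.Ceiling(Math.Sqrt(pageCount)) * capacity;

        var parents = new List<Node>();
        foreach (var slice in Chunk(nodes.OrderBy(n => n.CenterX), sliceSize))
            foreach (var run in Chunk(slice.OrderBy(n => n.CenterY), capacity))
                parents.Add(Wrap(run));
        return parents;
    }

    // Create a parent node whose MBR covers all of its children.
    static Node Wrap(List<Node> children) => new Node
    {
        Children = children,
        MinX = children.Min(c => c.MinX), MaxX = children.Max(c => c.MaxX),
        MinY = children.Min(c => c.MinY), MaxY = children.Max(c => c.MaxY),
    };

    // Split a sequence into consecutive batches of at most `size` elements.
    static IEnumerable<List<Node>> Chunk(IEnumerable<Node> source, int size)
    {
        var batch = new List<Node>(size);
        foreach (var n in source)
        {
            batch.Add(n);
            if (batch.Count == size) { yield return batch; batch = new List<Node>(size); }
        }
        if (batch.Count > 0) yield return batch;
    }
}

Leaf entries here are Node objects whose MBR is the item's bounding box (a degenerate rectangle for a point); call StrBulkLoader.Load(entries, capacity) with whatever page capacity your tree uses.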
Proven-good bulk loading is also a key component of the Priority R-Tree.
And even when not bulk-loading, the insertion strategy makes a big difference. R-trees built with linear splits such as Guttman's or Ang-Tan will usually be worse than those built with the R*-tree split heuristics. In particular, Ang-Tan tends to produce "sliced" pages that are very unbalanced in their spatial extent. It is a fast split strategy, and probably the simplest, but the results aren't good.

A paper by Achakeev et al., Sort-based Parallel Loading of R-trees, might be of some help, and you may also find other methods in its references.

Related

Replicating trees between ACID RDBs using CRDTs

I'm interested in replicating "hierarchies" of data, say, similar to addresses:
Area
District
Sector
Unit
but you may have different pieces of data associated with each layer, so you may know the area of sectors but not of units, and you may know the population of a unit; basically, it's not a homogeneous tree.
I know little about replication of data beyond brushing against Brewer's theorem (CAP) and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB into other ACID RDBs. The system needs to eventually converge, and obviously each RDB will enforce its own locally consistent view, but any two nodes may not match at any given time (except 'eventually').
The simplest way to approach this is to simply store all the data in a single message from some designated leader and distribute it, like an overnight dump-and-load process, but that's too big.
So the next simplest thing (I thought) was: if something inside an area changes, I can export the complete set of data inside that area and load it into the nodes. That's still quite a coarse algorithm.
The next step was, if an 'object' at any level changed, to send all the data on the path to that 'object', i.e. if something in a sector is amended, you would send the data associated with the sector, its parent the district, and its parent the area (with some sort of version stamp, and let's say last update wins). What I wanted was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which would potentially be created if it didn't exist).
Then I stumbled on CRDTs and thought: ah, I'm reinventing the wheel here. The algorithms are allegedly easy in principle, but tricky to get correct in practice.
Are there standard, accepted patterns for doing this sort of thing?
In my use case the hierarchies are quite shallow, and there is only a single designated leader (at this time). I'm quite attracted to state-based CRDTs because then I can ignore ordering.
Simplicity is the key requirement.
Actually, it appears I've reinvented (in a very crude, naive way) the SHELF algorithm.
I'll write some code and see if I can get it to work, and try to understand what's going on.
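For reference, here is a minimal sketch of the state-based, last-update-wins idea mentioned above, as a last-write-wins map in C#. The shape of the map, the timestamp source and the node-id tie-break are illustrative assumptions, not any specific library's API:

using System;
using System.Collections.Generic;

// Minimal sketch of a state-based last-write-wins map.
// Keys could be hierarchy paths such as "area/district/sector"; the timestamp
// plus a node-id tie-break makes merge commutative, associative and idempotent.
public class LwwMap
{
    private readonly Dictionary<string, (string Value, DateTime Stamp, string NodeId)> _state = new();

    public void Set(string key, string value, DateTime stamp, string nodeId)
        => MergeEntry(key, (value, stamp, nodeId));

    public string Get(string key)
        => _state.TryGetValue(key, out var e) ? e.Value : null;

    // Merge the state of another replica into this one (state-based replication:
    // you can ship the whole map, or any subset of entries, in any order, any number of times).
    public void Merge(LwwMap other)
    {
        foreach (var kv in other._state)
            MergeEntry(kv.Key, kv.Value);
    }

    private void MergeEntry(string key, (string Value, DateTime Stamp, string NodeId) incoming)
    {
        if (!_state.TryGetValue(key, out var current)
            || incoming.Stamp > current.Stamp
            || (incoming.Stamp == current.Stamp
                && string.CompareOrdinal(incoming.NodeId, current.NodeId) > 0))
        {
            _state[key] = incoming;   // last write wins; ties broken by node id
        }
    }
}

Because merge is commutative, associative and idempotent, shipping the whole path of an amended object (or even the whole map) as often as you like, in any order, still lets replicas converge.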

Explanation of Mapping structure in Veins

I am trying to understand and implement modifications to the Veins framework. Right now, I have some difficulties figuring out how the "Mapping" structure works. It is used to set the transmission power in "Mac1609_4.cc":
ConstMapping* txPowerMapping = createSingleFrequencyMapping(start, end, frequency, 5.0e6, power);
and to calculate received power, SNR and SINR in "Decider80211p.cc". Could you give some insight and some examples of how this structure is manipulated?
The mapping structure is from MiXiM, as Veins initially forked that project. MiXiM, however, is deprecated now and should not be used anymore [2]. Unfortunately, there is no real documentation available (anymore).
As a replacement, there is either INET, which is also supported by Veins, or, as will be introduced in the next release, a much simpler representation of signals that removes the Mapping structure [4].
If you still need to understand the structure, you can have a look at this paper, in which the authors explain the physical layer, including the Mapping structure.

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running with -dbc.filter, but I would like to look at the original data records, not the normalized ones, in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but it has effectively been removed (for now), for several reasons:
Only a few normalizations ever supported this; most would fail.
There is no longer a well-defined "end" with the visualization: some users will want to visualize the normalized data, others not.
It requires carrying the normalization information along, which makes the data structures more complex (although the hierarchical approach we have now would allow this again).
Due to the numerical imprecision of floating-point math, you would frequently not get back exactly the same values you put in.
Keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter such as "keep non-normalized data"; furthermore, you would need to choose which version (normalized or non-normalized) to use for analysis and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command-line interface. (This is easy to do with Java, too.)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest workaround is this: add a (non-numerical) label column, and you can then identify the original objects in your original data by this label.

Does it help to duplicate original data in order to make more data for building a model?

I just got an interview question.
"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss told you can duplicate original data several times, to make more data for building the model" Does it help?
Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.
But is there anyone can explain it more statistically? Thanks
Consider, e.g., the variance: the data set with the duplicated data will have exactly the same variance - you don't have a more precise estimate of the distribution afterwards.
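A quick numerical illustration in C# (using the population variance, i.e. dividing by n, so the equality is exact):

using System;
using System.Linq;

class DuplicationDemo
{
    // Population variance: mean of squared deviations from the mean (divide by n).
    static double PopulationVariance(double[] xs)
    {
        double mean = xs.Average();
        return xs.Select(x => (x - mean) * (x - mean)).Average();
    }

    static void Main()
    {
        double[] data = { 1.0, 2.0, 4.0, 7.0 };
        double[] duplicated = data.Concat(data).Concat(data).ToArray(); // each point 3 times

        Console.WriteLine(PopulationVariance(data));        // 5.25
        Console.WriteLine(PopulationVariance(duplicated));  // 5.25 - identical
    }
}

What does change are the naive formulas that divide by the sample size, such as the standard error of the mean: after duplication they pretend you have three times as much data, which is exactly what makes this misleading.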
There are, however, some exceptions. For example, bootstrap validation helps when you are evaluating your model but have very little data.
Well, it depends on exactly what one means by "duplicating the data".
If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result, since the log-likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log-likelihood and therefore has the same maxima. (This argument doesn't apply to methods which aren't based on the likelihood function; I believe that CART and other tree models, and SVMs, are such models. In that case you'll have to work out a different argument.)
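In symbols (a sketch, with f the model density, n observations and k copies of the data set):

$$\ell_k(\theta) = \sum_{j=1}^{k} \sum_{i=1}^{n} \log f(x_i \mid \theta) = k\,\ell(\theta) \quad\Longrightarrow\quad \arg\max_\theta \ell_k(\theta) = \arg\max_\theta \ell(\theta).$$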
However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
Also if one means bootstrapping, then that, too, makes a difference.
PS. Probably you'll get more interest in this question on stats.stackexchange.com.

Rules engine for spatial and temporal reasoning?

I have an application that receives a number of datums that characterize 3 dimensional spatial and temporal processes. It then filters these datums and creates actions which are then sent to processes that perform the actions. Rinse and repeat.
At present, I have a collection of custom filters that perform a lot of complicated spatial/temporal calculations.
Many times, as I discuss my system with individuals in my company, they ask if I'm using a rules engine.
I have yet to find a rules engine that is able to reason well temporally and spatially. (Things like: Are two 3D entities ever close? Is 3D entity A ever contained in 3D region B? If entity C is near entity D but oriented backwards relative to D, then perform some action.)
I have looked at Drools, Cyc and Jess in the past (say 3-4 years ago). It's time to re-examine the state of the art. Any suggestions? Any standards that you know of that support this kind of reasoning? Any de facto standards? Any applications?
Thanks!
Premise: remember that a SQL-based DBMS (see the note at the end of this answer) is a (quite capable) inference engine, as can be seen from these comparisons between SQL and Prolog:
prolog to SQL converter
difference between SQL and Prolog
To address specifically your spatio-temporal applications, this book will help:
Temporal Data and the Relational Model - A Detailed Investigation into the Application of Interval and Relation Theory to the Problem of Temporal Database Management.
That is, by combining interval and relation theory it is possible to reason about spatio-temporal problems effectively (see section 5.2, Applications of Intervals).
Of course, if your SQL-based DBMS is not (yet) equipped with interval (and other) operators, you will need to extend it appropriately (via stored procedures and/or user-defined functions - UDFs).
Update: skimming the paper pointed out in the comments by timemirror (Towards a 3D Spatial Query Language for Building Information Models), they do essentially what I touched on above:
(last page)
IMPLEMENTATION CONCEPTS
The implementation of the abstract type system into a query language will be performed on the basis of the query language SQL, which is a widely established standard in the field of object-relational databases. The international standard SQL:1999 extends the relational model to include object-oriented aspects, such as the possibility to define complex abstract data types with integrated methods.
I do not concur with the "object-relational database" terminology (for reasons that are off-topic here), but I think the rest is pertinent.
Update: a quote regarding 3D and interval theory from the book cited above:
NOTE: All of the intervals discussed so far can be thought of as one-dimensional. However, we might want to combine two one-dimensional intervals to form a two-dimensional interval. For example, a rectangular plot of ground might be thought of as a two-dimensional interval, because it is, by definition, an object with length and width, each of which is basically a one-dimensional interval measured along some axis. And, of course, we can extend this idea to any number of dimensions. For example, a (rather simple!) building might be regarded as a three-dimensional interval: It is an object with length, width, and height, or in other words a cuboid. (More realistically, a building might be regarded as a set of several such cuboids that overlap in various ways.) And so on. In what follows, however, we will restrict our attention to one-dimensional intervals specifically, barring explicit statements to the contrary, and we will omit the "one-dimensional" qualifier for simplicity.
Note: I wrote "SQL-based" and not "relational" because there are ways to use such DBMSes that completely deviate from relational theory.
This is spatial reasoning... there are a few models, but DE-9IM is now accepted by the OGC and implemented in PostGIS and other programming tools.
PostGIS implements a spatial reasoning engine based on the dimensionally extended nine-intersection model (DE-9IM).
http://postgis.refractions.net/documentation/manual-svn/ch04.html#DE-9IM
Check section 4.3.6.1, Theory.
So does the Java Topology Suite (and the Net Topology Suite for C#, etc.):
http://docs.codehaus.org/display/GEOTDOC/Point+Set+Theory+and+the+DE-9IM+Matrix
In particular, check out the geometry.relate functionality, such as:
boolean isRelated = geometry.relate(geometry2, "T*T***T**");
You can test the relationships, or filter data based on them.
It works with points, lines, polygons, etc.
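For example, with the .NET port of JTS (NetTopologySuite), the same DE-9IM relate test looks roughly like this; the WKT geometries here are made up purely for illustration:

using System;
using NetTopologySuite.IO;

class Relate9ImDemo
{
    static void Main()
    {
        var reader = new WKTReader();
        var region = reader.Read("POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))");
        var track  = reader.Read("LINESTRING (-5 5, 5 5, 15 5)");

        // Full DE-9IM matrix describing how the two geometries relate.
        var im = region.Relate(track);
        Console.WriteLine(im);                         // prints the 9-character DE-9IM string

        // Test against a DE-9IM pattern (T = non-empty, F = empty, * = don't care).
        bool crossesInterior = region.Relate(track, "T********");
        Console.WriteLine(crossesInterior);            // true: the line passes through the interior

        // Named predicates are shorthands for specific patterns.
        Console.WriteLine(region.Intersects(track));   // true
        Console.WriteLine(region.Contains(track));     // false: part of the line lies outside
    }
}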
This might help with the temporal side:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.4643&rep=rep1&type=pdf
Check out SpatialRules at http://www.objectfx.com/. It's a geospatial complex event processor for 2D and 3D.
