ELKI DBSCAN R* tree index - r-tree

In the MiniGUI, I can see db.index. How do I set it to tree.spatial.rstarvariants.rstar.RStarTreeFactory via Java code?
I have implemented:
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, tree.spatial.rstarvariants.rstar.RStarTreeFactory);
For the second parameter of the addParameter() call, the tree.spatial...RStarTreeFactory class is not found.
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(
    FileBasedDatabaseConnection.Parameterizer.INPUT_ID,
    fileLocation);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
    RStarTreeFactory.class);
I am getting a NullPointerException. Did I use RStarTreeFactory.class correctly?

The ELKI command line (and the MiniGUI, which is a command line builder) allows you to specify shorthand class names, leaving out the package prefix of the implemented interface.
The full command line documentation yields:
-db.index <object_1|class_1,...,object_n|class_n>
Database indexes to add.
Implementing de.lmu.ifi.dbs.elki.index.IndexFactory
Known classes (default package de.lmu.ifi.dbs.elki.index.):
-> tree.spatial.rstarvariants.rstar.RStarTreeFactory
-> ...
I.e. for this parameter, the class prefix de.lmu.ifi.dbs.elki.index. may be omitted.
The full class name thus is:
de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeFactory
or you can just type RStarTreeFactory and let Eclipse auto-repair the import:
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
    RStarTreeFactory.class);
// Bulk loading static data yields much better trees and is much faster, too.
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID,
    SortTileRecursiveBulkSplit.class);
// Page size should fit your dimensionality.
// For 2-dimensional data, use page sizes less than 1000.
// Rule of thumb: 15...20 * (dim * 8 + 4) is usually reasonable
// (for in-memory bulk-loaded trees)
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 300);
See also: Geo Indexing example in the tutorial folder.
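Putting it all together, a minimal sketch (modelled on the ELKI 0.7.x "using ELKI from Java" tutorial; StaticArrayDatabase, the DBSCAN call and the epsilon/minPts values below are assumptions you will want to adapt to your data):
// (imports omitted - let Eclipse resolve them from de.lmu.ifi.dbs.elki.*, as described above)
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.Parameterizer.INPUT_ID, fileLocation);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 300);
// Instantiate and initialize the database; this loads the data and bulk-loads the R*-tree:
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();
// Run DBSCAN; its range queries will use the R*-tree index automatically:
DBSCAN<DoubleVector> dbscan = new DBSCAN<>(EuclideanDistanceFunction.STATIC, 0.1, 10);
Clustering<Model> result = dbscan.run(db);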

Related

Saxon - s9api - setParameter as node and access in transformation

We are trying to add parameters to a transformation at runtime. The only way we have found so far is to set each parameter as a single atomic value rather than as a node. We don't yet know how to create a node for setParameter.
Current setParameter:
QName TEST XdmAtomicValue 24
Expected setParameter:
<TempNode> <local>Value1</local> </TempNode>
We searched and tried to create an XdmNode and an XdmItem.
If you want to create an XdmNode by parsing XML, the best way to do it is:
DocumentBuilder db = processor.newDocumentBuilder();
XdmNode node = db.build(new StreamSource(
    new StringReader("<doc><elem/></doc>")));
You could also pass a string containing lexical XML as the parameter value, and then convert it to a tree by calling the XPath parse-xml() function.
If you want to construct the XdmNode programmatically, there are a number of options:
DocumentBuilder.newBuildingStreamWriter() gives you an instance of BuildingStreamWriter, which extends XMLStreamWriter; you can create the document by writing events to it using methods such as writeStartElement, writeCharacters, and writeEndElement, and at the end call getDocumentNode() on the BuildingStreamWriter, which gives you an XdmNode. This has the advantage that XMLStreamWriter is a standard API, though it's not actually a very nice one, because the documentation isn't very good and, as a result, implementations vary in their behaviour.
Another event-based API is Saxon's Push class; this differs from most push-based event APIs in that rather than having a flat sequence of methods like:
builder.startElement("x");
builder.characters("abc");
builder.endElement();
you have a nested sequence:
Element x = Document.elem("x");
x.text("abc");
x.close();
As mentioned by Martin, there is the "sapling" API: Saplings.doc().withChild(elem(...).withChild(elem(...))) etc. This API is rather radically different from anything you might be familiar with (though it's influenced by the LINQ API for tree construction on .NET), but once you've got used to it, it reads very well. The Sapling API constructs a very lightweight tree in memory (hence the name), and converts it to a fully-fledged XDM tree with a final call of SaplingDocument.toXdmNode().
If you're familiar with DOM, JDOM2, or XOM, you can construct a tree using any of those libraries and then convert it for use by Saxon. That's a bit convoluted and only really intended for applications that are already using a third-party tree model heavily (or for users who love these APIs and prefer them to anything else).
In the Saxon Java s9api, you can construct temporary trees as SaplingNode/SaplingElement/SaplingDocument, see https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingDocument.html and https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingElement.html.
To give you a simple example constructing from a Map, as you seem to want to do:
Processor processor = new Processor();
Map<String, String> xsltParameters = new HashMap<>();
xsltParameters.put("foo", "value 1");
xsltParameters.put("bar", "value 2");
SaplingElement saplingElement = new SaplingElement("Test");
for (Map.Entry<String, String> param : xsltParameters.entrySet())
{
    saplingElement = saplingElement.withChild(new SaplingElement(param.getKey()).withText(param.getValue()));
}
XdmNode paramNode = saplingElement.toXdmNode(processor);
System.out.println(paramNode);
outputs e.g. <Test><bar>value 2</bar><foo>value 1</foo></Test>.
So the key is to understand that withChild() returns a new SaplingElement.
The code can be compacted using streams e.g.
XdmNode paramNode2 = Saplings.elem("root").withChild(
    xsltParameters
        .entrySet()
        .stream()
        .map(p -> Saplings.elem(p.getKey()).withText(p.getValue()))
        .collect(Collectors.toList())
        .toArray(SaplingElement[]::new))
    .toXdmNode(processor);
System.out.println(paramNode2);
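To then pass the constructed node into the transformation, note that an XdmNode is an XdmValue, so it can be handed to setParameter directly (a minimal sketch; the stylesheet file name and the parameter name TempNode are placeholders, not from the question):
XsltCompiler compiler = processor.newXsltCompiler();
XsltExecutable executable = compiler.compile(new StreamSource(new File("transform.xsl")));
XsltTransformer transformer = executable.load();
// XdmNode implements XdmValue, so the node built above can be passed as-is:
transformer.setParameter(new QName("TempNode"), paramNode);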

Spatial querydsl - GeometryExpressions. Find entities that range contains given "Point"

Used tools and their versions:
I am using:
spring boot 2.2.6
hibernate/hibernate-spatial 5.3.10 with dialect set to: org.hibernate.spatial.dialect.mysql.MySQL56SpatialDialect
querydsl-spatial 4.2.1
com.vividsolutions.jts 1.13
jscience 4.3.1
Problem description:
I have an entity that represents medical-clinic:
import com.vividsolutions.jts.geom.Polygon;
@Entity
public class Clinic {

    @Column(name = "range", columnDefinition = "Polygon")
    private Polygon range;
}
The range is a circle calculated earlier based on the clinic's GPS location and radius. It represents the operating area for that clinic. That means it treats only patients whose home address lies within that circle. Let's assume that the above circle is correct.
My goal (question):
I have a GPS point with a patient's location: 45.7602322 4.8444941. I would like to find all clinics that are able to treat that patient, that is, all clinics whose range field contains the point 45.7602322 4.8444941.
My solution (partially correct (I think))
To get it done, I have created a simple "Predicate"/"BooleanExpression":
GeometryExpressions.asGeometry(QClinic.clinic.range)
.contains(Wkt.fromWkt("Point(45.7602322 4.8444941)"))
and it actually works, because I can see the proper SQL query in the console:
select (...) where
ST_Contains(clinic0_.range, ?)=1 limit ?
first binding param: POINT(45.7602322 4.8444941)
But I have two problems with that:
QClinic.clinic.range is marked with a warning in IntelliJ: "Unchecked assignment: 'com.querydsl.spatial.jts.JTSPolygonPath' to 'com.querydsl.core.types.Expression<org.geolatte.geom.Geometry>'". Yes, in QClinic, range is a com.querydsl.spatial.jts.JTSPolygonPath.
Using the debugger and IntelliJ's "evaluate" on the line above (the one that creates the expression), I can see an error message: "unknown operation with operator CONTAINS and args [clinic.range, POINT(45.7602322 4.8444941)]"
You can ignore the second warning. The spatial specific operations are simply not registered in the serializer used for the toString of the operation. They reside in their own module.
The first warning indicates that you're mixing up Geolatte and JTS expressions. From your mapping it seems you intend to use JTS. In that case you need to use com.querydsl.spatial.jts.JTSGeometryExpressions instead of com.querydsl.spatial.GeometryExpressions in order to get rid of that warning.
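For illustration, a minimal sketch of the JTS-only variant (the WKTReader parsing and the queryFactory variable are assumptions, not taken from the original code; WKTReader.read declares a checked ParseException):
// Parse the patient's location as a JTS point (same coordinate order as in the question):
Point patientLocation = (Point) new WKTReader().read("POINT(45.7602322 4.8444941)");
// QClinic.clinic.range is a JTSPolygonPath, i.e. already a JTS geometry expression,
// so contains() can be called on it directly without wrapping:
BooleanExpression inRange = QClinic.clinic.range.contains(patientLocation);
// e.g. with a JPAQueryFactory (assumed to be available in your setup):
List<Clinic> clinics = queryFactory.selectFrom(QClinic.clinic).where(inRange).fetch();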

Can I visualize a Multibody pose without explicitly calculating every body's full transform?

In the examples/quadrotor/ example, a custom QuadrotorPlant is specified and its output is passed into QuadrotorGeometry where the QuadrotorPlant state is packaged into FramePoseVector for the SceneGraph to visualize.
The relevant code segment in QuadrotorGeometry that does this:
...
builder->Connect(
    quadrotor_geometry->get_output_port(0),
    scene_graph->get_source_pose_port(quadrotor_geometry->source_id_));
...
void QuadrotorGeometry::OutputGeometryPose(
    const systems::Context<double>& context,
    geometry::FramePoseVector<double>* poses) const {
  DRAKE_DEMAND(frame_id_.is_valid());
  const auto& state = get_input_port(0).Eval(context);
  math::RigidTransformd pose(
      math::RollPitchYawd(state.segment<3>(3)),
      state.head<3>());
  *poses = {{frame_id_, pose.GetAsIsometry3()}};
}
In my case, I have a floating-base multibody system (think a quadrotor with a pendulum attached) for which I've created a custom plant (a LeafSystem). The minimal coordinates for such a system would be 4 (quaternion) + 3 (x,y,z) + 1 (joint angle) = 8. If I were to follow the QuadrotorGeometry example, I believe I would need to specify the full RigidTransformd for the quadrotor and the full RigidTransformd of the pendulum.
Question
Is it possible to set up the visualization / specify the pose such that I only need to specify the 8 minimal state coordinates (pose of the quadrotor + joint angle) and have an internal MultibodyPlant handle the computation of each individual body's (quadrotor and pendulum) full RigidTransform, which can then be passed to the SceneGraph for visualization?
I believe this was possible with the "attic-ed" (which I take to mean "to be deprecated") RigidBodyTree, which was accomplished in examples/compass_gait
lcm::DrakeLcm lcm;
auto publisher = builder.AddSystem<systems::DrakeVisualizer>(*tree, &lcm);
publisher->set_name("publisher");
builder.Connect(compass_gait->get_floating_base_state_output_port(),
publisher->get_input_port(0));
Where get_floating_base_state_output_port() was outputting the CompassGait state with only 7 states (3 rpy + 3 xyz + 1 hip angle).
What is the MultibodyPlant, SceneGraph equivalent of this?
Update (using MultibodyPositionToGeometryPose from Russ's deleted answer)
I created the following function, which attempts to create a MultibodyPlant from the given model_file and connect the given plant's pose_output_port through a MultibodyPositionToGeometryPose system.
The pose_output_port I'm using is the 4(quaternion) + 3(xyz) + 1(joint angle) minimal state.
void add_plant_visuals(
    systems::DiagramBuilder<double>* builder,
    geometry::SceneGraph<double>* scene_graph,
    const std::string model_file,
    const systems::OutputPort<double>& pose_output_port)
{
  multibody::MultibodyPlant<double> mbp;
  multibody::Parser parser(&mbp, scene_graph);
  auto model_id = parser.AddModelFromFile(model_file);
  mbp.Finalize();

  auto source_id = *mbp.get_source_id();

  auto multibody_position_to_geometry_pose =
      builder->AddSystem<systems::rendering::MultibodyPositionToGeometryPose<double>>(mbp);
  builder->Connect(pose_output_port,
                   multibody_position_to_geometry_pose->get_input_port());
  builder->Connect(
      multibody_position_to_geometry_pose->get_output_port(),
      scene_graph->get_source_pose_port(source_id));

  geometry::ConnectDrakeVisualizer(builder, *scene_graph);
}
The above fails with the following exception
abort: Failure at multibody/plant/multibody_plant.cc:2015 in get_geometry_poses_output_port(): condition 'geometry_source_is_registered()' failed.
So, there's a lot in here. I have a suspicion there's a simple answer, but we may have to converge on it.
First, my assumptions:
You've got an "internal" MultibodyPlant (MBP). Presumably, you also have a context for it, allowing you to perform meaningful state-dependent calculations.
Furthermore, I presume the MBP was responsible for registering the geometry (probably happened when you parsed it).
Your LeafSystem will directly connect to the SceneGraph to provide poses.
Given your state, you routinely set the state in the MBP's context to do that evaluation.
Option 1 (Edited):
In your custom LeafSystem, create the FramePoseVector output port and the calc callback for it. Inside that callback, simply invoke Eval() on the pose output port of the internal MBP that your LeafSystem owns (having previously set the state in your locally owned Context for the MBP), passing in the pointer to the FramePoseVector that your LeafSystem's callback was provided with.
Essentially (in a very coarse way):
MySystem::MySystem() {
  this->DeclareAbstractOutputPort("geometry_pose",
                                  &MySystem::OutputGeometryPose);
}

void MySystem::OutputGeometryPose(
    const Context& context, FramePoseVector* poses) const {
  mbp_context_.get_mutable_continuous_state()
      .SetFromVector(my_state_vector);
  mbp_.get_geometry_poses_output_port().Eval(mbp_context_, poses);
}
Option 2:
Rather than implementing a LeafSystem that has an internal plant, you could have a Diagram that contains an MBP and exports the MBP's FramePoseVector output directly through the diagram to connect.
This answer addresses, specifically, your edit where you are attempting to use the MultibodyPositionToGeometryPose approach. It doesn't address the larger design issues.
Your problem is that the MultibodyPositionToGeometryPose system takes a reference to an MBP and keeps a reference to that same MBP. That means the MBP must be alive and well for at least as long as the MPTGP is. However, in your code snippet, your MBP is local to the add_plant_visuals() function, so it is destroyed as soon as the function is over.
You'll need to create something that is persisted and owned by someone else.
(This is tightly related to my option 2 - now edited for improved clarity.)

How to Get Filename when using file pattern match in google-cloud-dataflow

Does anyone know how to get the filename when using a file pattern match in google-cloud-dataflow?
I'm a newbie to Dataflow. How can I get the filename when using a file pattern match like this?
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))
I'd like to know how to detect which file each record came from, e.g. kinglear.txt, Hamlet.txt, etc.
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFn's downstream in your pipeline - that is currently not supported (though there are some workarounds - see below). It is a common feature request and we are still thinking how best to fit it into the framework in a natural, generic and high-performant way.
Some workarounds include:
Writing a pipeline like this (the tf-idf example uses this approach):
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply only at the level of Read PTransforms, not at the level of ParDos with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work at the level of files, i.e. files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different sizes, some extremely large, then it may become an issue.
Implementing your own FileBasedSource (javadoc, general documentation) which would return records of type something like Pair<String, T> where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.
Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
Update based on latest SDK
Java (sdk 2.9.0):
Beam's TextIO readers do not give access to the filename itself. For these use cases we need to use FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The result of a FileIO read is a PCollection of ReadableFile; the ReadableFile class contains the file name as metadata, which can be used along with the contents of the file.
FileIO's ReadableFile does have a convenience method readFullyAsUTF8String(), which will read the entire file into a String object; note that this reads the whole file into memory first. If memory is a concern, you can work directly with the file using utility classes like FileSystems (see the sketch after the code below).
From the Beam FileIO documentation:
PCollection<KV<String, String>> filesAndContents = p
    .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
    // withCompression can be omitted - by default compression is detected from the filename.
    .apply(FileIO.readMatches().withCompression(GZIP))
    .apply(MapElements
        // uses imports from TypeDescriptors
        .into(kvs(strings(), strings()))
        .via((ReadableFile f) -> KV.of(
            f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
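If reading whole files into memory is a concern (as noted above), one alternative is to stream the matched file line by line in your own DoFn via ReadableFile.open(). This is a rough, hypothetical sketch rather than production code (it needs java.io.BufferedReader, java.nio.channels.Channels and the usual Beam imports):
// Emits (filename, line) pairs without loading the whole file into memory.
static class ReadLinesWithFilenameFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    FileIO.ReadableFile file = c.element();
    String fileName = file.getMetadata().resourceId().toString();
    // open() decompresses according to the compression chosen in readMatches().
    try (BufferedReader reader = new BufferedReader(
        Channels.newReader(file.open(), "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        c.output(KV.of(fileName, line));
      }
    }
  }
}
// Usage, mirroring the snippet above:
PCollection<KV<String, String>> linesWithFilenames = p
    .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
    .apply(FileIO.readMatches().withCompression(GZIP))
    .apply(ParDo.of(new ReadLinesWithFilenameFn()));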
Python (sdk 2.9.0):
For 2.9.0 of the Python SDK you will need to collect the list of URIs from outside of the Dataflow pipeline and feed them in as a parameter to the pipeline, for example by making use of FileSystems to read in the list of files via a glob pattern and then passing that to a PCollection for processing.
Once fileio lands (see PR https://github.com/apache/beam/pull/7791/), the following code will also be an option for Python:
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    readable_files = (p
                      | fileio.MatchFiles('hdfs://path/to/*.txt')
                      | fileio.ReadMatches()
                      | beam.Reshuffle())
    files_and_contents = (readable_files
                          | beam.Map(lambda x: (x.metadata.path,
                                                x.read_utf8())))
One approach is to build a List<PCollection> where each entry corresponds to an input file, then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:
public static class FooParserFn extends DoFn<String, Foo> {
  private String fileName;

  public FooParserFn(String fileName) {
    this.fileName = fileName;
  }

  @Override
  public void processElement(ProcessContext processContext) throws Exception {
    String line = processContext.element();
    // here you have access to both the line of text and the name of the file
    // from which it came.
  }
}

public static void main(String[] args) {
  ...
  List<String> inputFiles = ...;
  List<PCollection<Foo>> foosByFile =
      Lists.transform(inputFiles,
          new Function<String, PCollection<Foo>>() {
            @Override
            public PCollection<Foo> apply(String fileName) {
              return p.apply(TextIO.Read.from(fileName))
                  .apply(ParDo.of(new FooParserFn(fileName)));
            }
          });
  PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
  ...
}
One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
I also had the 100 input files = 100 nodes on the Dataflow diagram when using code similar to @danvk's. I switched to an approach like the one below, which resulted in all the reads being combined into a single block that you can expand to drill down into each file/directory that was read. The job also ran faster using this approach rather than the Lists.transform approach in our use case.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());

PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for (String fileName : filesToProcess) {
    pcl = pcl.and(
        p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
            .from(fileName)
            .withSchema(SomeClass.class)
        )
        .apply(ParDo.of(new MyDoFn(fileName)))
    );
}

// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
This might be a very late post for the above question, but I wanted to add an answer using Beam's bundled classes.
This could also be seen as code extracted from the solution provided by @Reza Rokni.
PCollection<String> listOfFilenames =
    pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
        .apply(FileIO.readMatches())
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via(
                    (FileIO.ReadableFile file) -> {
                      String f = file.getMetadata().resourceId().getFilename();
                      System.out.println(f);
                      return f;
                    }));

pipe.run().waitUntilFinish();
The above PCollection<String> will contain the list of filenames available in the provided directory.
I was struggling with the same use case while using a wildcard to read files from GCS, but I also needed to modify the collection based on the file name. The key is to use ReadFromTextWithFilename instead of ReadFromText. In Java you already have a way out, and you can use:
String filename = context.element().getMetadata().resourceId().getCurrentDirectory().toString()
inside your processElement method.
But for Python below technique will work:
-> Use beam.io.ReadFromTextWithFilename for reading the wildcard path from GCS
-> As per the document, ReadFromTextWithFilename returns the file's name and the file's content.
Below is the code snippet:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class GetFileNameFromWildcard(beam.DoFn):
    def process(self, element, *args, **kwargs):
        file_path, content = element
        schema = ["id", "name", "mob", "email", "dept", "store"]
        store_name = file_path.split("/")[-2]
        content_list = content.split(",")
        content_list.append(store_name)
        out_dict = dict(zip(schema, content_list))
        print(out_dict)
        yield out_dict


def run():
    pipeline_options = PipelineOptions()
    with beam.Pipeline(options=pipeline_options) as p:
        # saving main session so that it can load global namespace on the Cloud Dataflow Worker
        init = (p
                | 'Begin Pipeline With Initiator' >> beam.Create(["pcollection initializer"])
                | 'Read From GCS' >> beam.io.ReadFromTextWithFilename(
                    "gs://<bkt-name>/20220826/*/dlp*", skip_header_lines=1)
                | beam.ParDo(GetFileNameFromWildcard())
                | beam.io.WriteToText('df_out.csv'))

Is there anything like "CheckSum" in Dart (on Objects)?

For testing purposes I'm trying to design a way to verify that the results of statistical tests are identical across versions, platforms, and such. There is a lot going on that involves ints, nums, dates, Strings and more inside our collections of objects.
In the end I want to 'know' that the whole set of instantiated objects sums to the same value (by just doing something like adding up the checksum of all internal properties).
I can write low-level code for each internal value to return a checksum, but I was thinking that perhaps something like this already exists.
Thanks!
_swarmii
This sounds like you should be using the serialization library (install via Pub).
Here's a simple example to get you started:
import 'dart:io';
import 'package:serialization/serialization.dart';
class Address {
  String street;
  int number;
}

main() {
  var address = new Address()
    ..number = 5
    ..street = 'Luumut';

  var serialization = new Serialization()
    ..addRuleFor(address);

  Map output = serialization.write(address, new SimpleJsonFormat());
  print(output);
}
Then depending on what you want to do exactly, I'm sure you can fine tune the code for your purpose.
