If I have 2 million rows in a db4o database, would you recommend a large flat table, or a hierarchy? - db4o

I have 2 million rows in a flat db4o table. A lot of the information is repeated - for example, the first column has only three possible strings.
I could easily break the table into a 4-tier hierarchy (i.e. navigate from root >> symbol >> date >> final table) - but is this worth it from a speed and software maintenance point of view?
If it turns out that it would be cleaner to break the table into a hierarchy, any recommendations on a good method to achieve this within the current db4o framework?
Answers to questions
To actually answer your question, I would need more information. What kind of information do you store?
I'm storing objects containing strings and doubles. The hierarchy is exactly, in concept, like a file system with directories, sub-directories, and sub-sub-directories: a single root node contains an array of subclasses, and each sub-class in turn contains further arrays of sub-sub-classes, etc. Here is an example of the code:
// rootNode---|
// sub-node 1----|
// |-----sub-sub-node 1
// |-----sub-sub-node 2
// |-----sub-sub-node 3
// |-----sub-sub-node X (others, N elements)
// sub-node 2----|
// |-----sub-sub-node 1
// |-----sub-sub-node 2
// |-----sub-sub-node 3
// |-----sub-sub-node X (others, N elements)
// sub-node 3----|
// |-----sub-sub-node 1
// |-----sub-sub-node 2
// |-----sub-sub-node 3
// |-----sub-sub-node X (others, N elements)
// sub-node X (others, N elements)
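using System.Collections.Generic;

// Top level: owns a collection of subNode objects.
class rootNode
{
    IList<subNode> subNodeCollection = new List<subNode>();
    string rootNodeParam;
}

// Middle level: owns a collection of subSubNode objects.
class subNode
{
    IList<subSubNode> subSubNodeCollection = new List<subSubNode>();
    string subNodeParam;
}

// Leaf level: holds only its own parameter.
class subSubNode
{
    string subSubNodeParam;
}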
// Now, we have to work out a way to create a query that filters
// by rootNodeParam, subNodeParam and subSubNodeParam.
And what are the access patterns of your data? Are you reading single objects by a query / search? Or are you reading a lot of objects which are related to each other?
I'm trying to navigate down the tree, filtering by parameters as I go.
In general db4o (and other object databases) are good at navigational access. This means that you first query for some objects, and from there you navigate to related objects. For example, you first query for a user object. From there you navigate to the user's home, city, job, friends etc. objects. That kind of access works great in db4o.
This is exactly what I'm trying to do, and exactly what works well in db4o if you only have 1-1 mappings between classes and subclasses. If you have 1-to-many by implementing an ArrayList of classes within a class, it can't do a query without instantiating the whole tree - or am I misled on this one?
So in your case the 4-tier hierarchy can work great with db4o, but only when you can navigate from the root to the symbol object and so on. That means that the root object has a collection of its 'children' objects.
Yes - but is there any way to do a query, if each subNode contains a collection?

As Sam Stainsby already pointed out in his comment, db4o doesn't have the notion of tables. It stores objects, and that's db4o's unit of storage. Don't try to think in terms of tables; that doesn't really work with db4o.
As you said, you repeat information, so that's a good candidate to be separated into other objects, which can then be referenced by other objects. In general I would first design a good domain model, to be aware of how the data is organized and related to each other, and think about what kind of data access patterns you have. Then try to find out how you can design your classes/objects in a way which works with db4o.
To actually answer your question, I would need more information. What kind of information do you store? And what are the access patterns of your data? Are you reading single objects by a query / search? Or are you reading a lot of objects which are related to each other?
In general db4o (and other object databases) are good at navigational access. This means that you first query for some objects, and from there you navigate to related objects. For example, you first query for a user object. From there you navigate to the user's home, city, job, friends etc. objects. That kind of access works great in db4o.
So in your case the 4-tier hierarchy can work great with db4o, but only when you can navigate from the root to the symbol object and so on. That means that the root object has a collection of its 'children' objects.
Btw: If you feel that it is more natural to think in terms of tables for your data, then I recommend using a relational database. Relational databases are awesome at dealing with tables.
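As a rough illustration of the navigational layout described above, here is a minimal sketch in plain Python. It does not use the db4o API at all; it only shows the shape of the object graph. The class and field names (Root, Symbol, Day, Row, price) are hypothetical, chosen to match the root >> symbol >> date >> final table idea from the question.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Row:
    # Leaf record. The symbol and date that were repeated on every flat
    # row now live once, on the parent objects.
    price: float


@dataclass
class Day:
    date: str
    rows: List[Row] = field(default_factory=list)


@dataclass
class Symbol:
    name: str
    days: Dict[str, Day] = field(default_factory=dict)


@dataclass
class Root:
    symbols: Dict[str, Symbol] = field(default_factory=dict)

    def add(self, symbol: str, date: str, price: float) -> None:
        sym = self.symbols.setdefault(symbol, Symbol(symbol))
        day = sym.days.setdefault(date, Day(date))
        day.rows.append(Row(price))

    def rows_for(self, symbol: str, date: str) -> List[Row]:
        # Navigate root >> symbol >> date >> rows. Only the matching
        # branch is visited; the other branches are never touched.
        sym = self.symbols.get(symbol)
        day = sym.days.get(date) if sym else None
        return day.rows if day else []


root = Root()
root.add("AAA", "2010-01-04", 10.5)
root.add("AAA", "2010-01-04", 10.7)
root.add("BBB", "2010-01-04", 99.0)
prices = [r.price for r in root.rows_for("AAA", "2010-01-04")]
```

Whether an object database only activates the branch you navigate depends on how it is configured, so treat this as a picture of the model rather than a performance claim.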

Related

DDD Entity Framework Repository Return Complex Type

I have a repository called LeadRepository that returns a model called Lead which is a person.
The UI I have is a dashboard that displays the following stats. They are all leads but in different states.
Total Leads: 52
Assigned: 49
Unassigned: 3
Contacted: 49
Uncontacted: 0
I am using a stored procedure to query the db, so I'm not using lazy loading to work out the count on the fly.
I have thought about two possible solutions below, but neither of them feels quite right.
Use LeadRepository but have a method on it called GetStats() that returns a complex type. This does not have any association with the agg root Lead. Just a bunch of properties that have the different counts.
Create a LeadStatsRepository but this is not really an aggregate root as it has no id. It just is a grouped set of data.
If anyone has any suggestions that would be great.
Repositories are for aggregate roots. What you're after are read models and dedicated query objects.
Stats are best handled through a Service. From Evans’ DDD, a good Service has these characteristics:
The operation relates to a domain concept that is not a natural part of an Entity or Value Object
The interface is defined in terms of other elements in the domain model
The operation is stateless
Stats are related to a domain object, but not really a part of the entity or value object. They may not be defined in terms of other elements, but it's a possibility. There isn't any state with stats; even keeping them over time isn't truly stateful.
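A minimal sketch of what such a stateless stats service could look like, in Python for brevity. The names Lead, LeadStats and LeadStatsService, and the fields assigned_to and contacted, are assumptions for illustration. In the asker's setup the counts would come from the stored procedure; here they are computed in memory only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Lead:
    id: int
    assigned_to: Optional[str]  # None means unassigned
    contacted: bool


@dataclass(frozen=True)
class LeadStats:
    # Read model: plain counts, no identity, not an aggregate root.
    total: int
    assigned: int
    unassigned: int
    contacted: int
    uncontacted: int


class LeadStatsService:
    """Stateless: every call derives the counts from its input."""

    def summarize(self, leads: Iterable[Lead]) -> LeadStats:
        leads = list(leads)
        assigned = sum(1 for lead in leads if lead.assigned_to is not None)
        contacted = sum(1 for lead in leads if lead.contacted)
        return LeadStats(
            total=len(leads),
            assigned=assigned,
            unassigned=len(leads) - assigned,
            contacted=contacted,
            uncontacted=len(leads) - contacted,
        )


stats = LeadStatsService().summarize([
    Lead(1, "alice", True),
    Lead(2, "bob", False),
    Lead(3, None, False),
])
```

The returned LeadStats is just a bag of counts, which is why it does not need an id or a repository of its own.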

CoreData LinkedList?

The question is whether it is a good choice to build a linked list to maintain the order/priority when using Core Data, or whether one should just use a simple numeric attribute to hold the priority.
When maintaining a number, if a new object is inserted into an array of N objects at position N/2, then the priority values of all objects from N/2+1 to N will have to be modified, which will result in that many SQL queries, if I am not mistaken.
With a linked list, a self-relationship on the entity, say "next", can be maintained. If an object is inserted at position N/2, there are just two queries:
1. N/2-1 -> next -> newObj
2. newObj -> next ->N/2+1
But the problem here lies in using NSFetchedResultsController, which cannot sort the fetched results using this relationship. Or can it, in some way?
Please respond which one of the two techniques is better in relevance to the situations mentioned above.
The best solution would be to use an ordered to-many relationship. This uses an NSOrderedSet, which keeps the ordering like an array but also supports fast membership tests like a set. This is only available on iOS 5.0 or Mac OS X 10.7 or later, though.
If I needed to support earlier versions of iOS, I'd choose the approach with an extra property for the ordering. This makes fetches much easier. A linked list structure might be easier to update, but usually fetching (for displaying the data) is done much more often, so this is the case that should be easier.
If you need to update your ordering very often you can leave big gaps between your ordering numbers.
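The gap idea can be sketched roughly as follows. This is plain Python rather than Core Data code; the GAP constant, the dictionary layout and the helper names are assumptions for illustration. A new item takes the midpoint between its neighbours' order values, so no other item changes, and a full renumbering is only needed once a gap is used up.

```python
GAP = 1024  # assumed spacing between neighbouring order values


def renumber(items):
    """Rewrite every order value with fresh gaps (the expensive path)."""
    for i, item in enumerate(items):
        item["order"] = (i + 1) * GAP


def insert_at(items, index, name):
    """Insert `name` at `index` in a list kept sorted by 'order'."""
    lo = items[index - 1]["order"] if index > 0 else 0
    hi = items[index]["order"] if index < len(items) else lo + 2 * GAP
    if hi - lo < 2:
        # No integer left between the neighbours: renumber, then retry.
        renumber(items)
        return insert_at(items, index, name)
    item = {"name": name, "order": (lo + hi) // 2}
    items.insert(index, item)
    return item


items = []
for n in ["a", "b", "c"]:
    insert_at(items, len(items), n)
insert_at(items, 1, "x")  # lands between "a" and "b"; nothing else changes
```

With a gap of 1024, roughly ten inserts can land in the same slot before a renumbering pass is triggered, and fetching stays a simple sort on the order attribute.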

DDD Repository EF Performance

I was wondering how people who follow DDD get around potential performance issues with using EF and the repository pattern with returning an aggregate root with children.
e.g. Parent
----- Child A
Or even e.g. Parent
----- Child A
------- Child A2
If I bring back the aggregate root's data from the repository and use a navigational property, EF then fires off another query because it is utilising lazy loading. This is a problem because we are experiencing 100+ queries when we are in a loop.
If I bring back the aggregate root's data from the repository with the children's data as well by using the 'Include' statements, this will bring back the children's data from the repository with its parent. Then when I use the navigational properties no queries fire off because that data is already in memory.
The problem with the second approach is that some of our data for the child object can be quite big e.g. 100,000+ records.
Obviously I don't want to store 100,000+ records in memory for the child. We decided to use paging to select 10 at a time to get around this, but another issue is when we are trying to use calculations on the children like sum, total count etc but we can only do that in memory on the 10 records we have pulled back.
I know the DDD way is to pull back the object graph with all of its data in memory and then you traverse through the objects for the data you need to display.
There is a split in our team, with some believing we should pull back the aggregate root and its children together, and some feeling we should have a method on the aggregate root's repository that queries the children's data directly and pulls back the child object.
I just wondered how other people have solved the performance issues with large amounts of data being stored in memory with the parent/child.
If you have to deal with performance you must use the second approach with special method exposed on repository - that is the point of repository to provide you such methods otherwise you can use EF context / set directly.
Theory is nice if you work with theoretical data - once you have real data you must tweak theory to work in real world scenarios.
You can also check this article (there are three following articles on the blog). It does the second way but it pretends to be the first way. It works for Count but maybe you can use the idea for some other scenarios as well.
The DDD way isn't always to pull back all the data that is required. One technique we use is a pattern called double dispatch. This is where you make your call to your aggregate root's method (or domain service) with all the parameters it requires, but you also pass in a 'query only' repository-type interface parameter. This enables the root or its children to decide what extra data is required and when it should be returned, simply by calling methods on this injected interface.
This approach adheres to the DDD principle that aggregate roots should not be aware of repository implementation, while providing testable and highly performant domain code.
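A minimal sketch of the double dispatch idea, in Python for brevity. The names ChildQueries, Parent, is_over_limit and InMemoryChildQueries are hypothetical. In an EF setup the query interface would be implemented by something that runs aggregate SQL (COUNT, SUM) instead of loading 100,000+ children; the in-memory class here is only a stand-in so the example runs on its own.

```python
from typing import Protocol


class ChildQueries(Protocol):
    """Query-only interface; the domain sees this, not the repository."""

    def count_children(self, parent_id: int) -> int: ...

    def sum_child_amounts(self, parent_id: int) -> float: ...


class Parent:
    def __init__(self, parent_id: int, limit: float) -> None:
        self.id = parent_id
        self.limit = limit

    def is_over_limit(self, queries: ChildQueries) -> bool:
        # The aggregate decides which extra data it needs and asks for it
        # through the injected interface instead of loading every child.
        return queries.sum_child_amounts(self.id) > self.limit


class InMemoryChildQueries:
    """Stand-in for an implementation that would run aggregate SQL."""

    def __init__(self, amounts_by_parent):
        self._amounts = amounts_by_parent

    def count_children(self, parent_id: int) -> int:
        return len(self._amounts.get(parent_id, []))

    def sum_child_amounts(self, parent_id: int) -> float:
        return sum(self._amounts.get(parent_id, []))


queries = InMemoryChildQueries({1: [40.0, 70.0], 2: [5.0]})
over = Parent(1, limit=100.0).is_over_limit(queries)
under = Parent(2, limit=100.0).is_over_limit(queries)
```

Because the aggregate only depends on the small interface, a test can pass in a stand-in like the one above, which is where the testability claim comes from.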

Neo4j Key-Value List recommended implementation

I've been using Neo4j for a little while now and have an app up and running on it. It's all working really well, and Neo4j has been really cool at solving this problem, but I now need to extend the app. I have been trying to implement a key-value list of data in Neo4j, and I'm not sure of the best way to go about it.
I have a List, the list is around 7 million elements in length and so a bit long for just storing the whole list in memory and managing it myself. I tested this and it would consume 3Gb.
My choices are either:
(a) Neo4j is just the wrong tool for the job and I should use an actual key-value data store. I'm a little averse to doing this, as I'd have to introduce another data store just for this list of data.
(b) Use Neo4j, by creating a node per key-value, setting the key and value as properties on the node. There is no relationship, other than having an index to group these nodes together, exposing the key of the key-value as the key on the index.
(c) Create a single node and hold all key-values as properties, this feels wrong, because when getting the node it will load the whole thing into memory, then I'd have to search the properties for the one I'm interested in, and I might as well manage the List myself.
(d) The key is a two part key that actually points to two nodes, so create a relationship and set the value as a property on the relationship. I started down this path, but when it came to doing a lookup for a specific key/value it's not simple and fast, so backed away from this.
Either options 'a' or 'b' feel the way to go.
Any advice would be appreciated.
Example scenario
We have Node A and Node B which has a relationship between the two Nodes.
The nodes all have a property of 'foo', with foo having some value.
In this example node A has foo=X and Node B has foo=Y
We then have this list of K/Vs. One of those K/V is Key:X+Y=Value:Z
So, the original idea was to create another relationship between Node A and Node B and store a property on the relationship holding Z. Then create an index on 'foo' and a relationship idx on the new relationship.
When given Key X+Y get the value.
The lookup logic would be: get Node A (from X) and Node B (from Y), then walk through Node A's relationships to Node B, looking for this new relationship type.
While this will work, I do not like the fact that I have to look through all relationships to/from the nodes for a specific type; this is inefficient, especially if there are many relationships of different types.
So the conclusion is to go with options 'A' or 'B', or I'm trying to do something impractical with Neo.
Don't try to store 7 million items in a Neo4j property -- you're right, that's wrong.
Redis and Neo4j often make a good pairing, but I don't quite understand what you're trying to do or what you mean in "d" -- what are the key/value pairs, and how do they relate to the nodes and relationships in the graph? Examples would help.
UPDATE: The most natural way to do this with a graph database is to store it as a property on the edge between the two nodes. Then you can use Gremlin to get its value.
For example, to return a property on an edge that exists between two vertices (nodes) that have some properties:
start = g.idx('vertices')[[key:value]] // start vertex
edge = start.outE(label).as('e') // edge
end = edge.inV.filter{it.someprop == somevalue} // end vertex
prop = end.back('e').prop // edge property
return prop
You could store it in an index like you suggested, but this adds more complexity to your system, and if you need to reference the data as part of the traversal, then you will either have to store duplicate data or look it up in Redis during the traversal, which you can do, see:
Have Gremlin Talk to Redis in Real Time while It's Walking the Graph
https://groups.google.com/d/msg/gremlin-users/xhqP-0wIg5s/bxkNEh9jSw4J
UPDATE 2:
If the ID of vertex a and b are known ahead of time, then it's even easier:
g.v(a).outE(label).filter{it.inVertex.id == b}.prop
If the vertex objects a and b themselves are already in hand, then it's:
a.outE(label).filter{it.inVertex == b}.prop
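To make the shape of that lookup concrete, here is a small in-memory sketch in Python. It says nothing about how Neo4j stores or indexes relationships; it only illustrates that keeping Z on the edge means the key (X, Y) is resolved by naming the two endpoints and the edge label. The label "KV" and the helper names are hypothetical.

```python
from collections import defaultdict

# adjacency[source][(label, target)] -> edge properties
adjacency = defaultdict(dict)


def add_edge(source, label, target, **props):
    adjacency[source][(label, target)] = props


def edge_prop(source, label, target, name):
    """Return one property of the `label` edge from source to target."""
    edge = adjacency[source].get((label, target))
    return None if edge is None else edge.get(name)


add_edge("A", "KV", "B", value="Z")      # the X+Y -> Z pair from the example
add_edge("A", "KNOWS", "B", since=2011)  # an unrelated relationship type
z = edge_prop("A", "KV", "B", "value")
```

Filtering by label first, as the Gremlin snippets do with outE(label), is what keeps unrelated relationship types out of the lookup.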

Working with cyclical graphs in RoR

I haven't attempted to work with graphs in Rails before, and am curious as to the best approach. Some background:
I am making a Rails 3 site and thought it would be interesting to store certain objects and their relationships as a graph, where each object is a node and some are connected to show that the two objects are related. The graph does contain cycles, and there wouldn't be more than 100-150 nodes in the graph (probably only closer to 50). One node probably wouldn't have more than five edges, with an average of three to four edges per node.
I figured a simple join table with two columns (each the ID of the object) might be the easiest way to do it, but I doubt it's the best way. Another thought was to use a plugin such as acts_as_tree (which doesn't appear to be updated for Rails 3...) or acts_as_tree_with_dotted_ids, but I am unsure of their ability to work with cycles rather than hierarchical trees.
The most I would currently like is to easily traverse from one node to its siblings. I really can't think of a reason I would want to traverse to a node's sibling's sibling, which is why I was considering just making an SQL join table. I only want to have a section on the site to display objects related to a specified object, and this graph is one of the ways I am specifying relationships.
Advice? Things I should check out? Thanks!
I would use two SQL tables, node and link where a link is simply two foreign keys, source and target. This way you can get the set of inbound or outbound links to a node by performing an SQL select query by constraining the source or target node id. You could take it a step further by adding a "graph_id" column to both tables so you can retrieve all the data for a graph in two queries and build it as a post-processing step.
This strategy should be just as easy (if not easier) than finding, installing, learning to use, and implementing a plugin to do the same, IMHO.
Depending on whether your concern is primarily about operations on graphs, or on storage of graphs, what you need is potentially quite different. If you want convenient operations on graphs, investigate the gem "rgl" (ruby graph library). It has implementations of most of the basic classic traversal and search algorithms.
If you're dealing with something on the order of 150 nodes, you can probably get away with a minimalist adjacency list representation in the database itself, or incidence list. Then you can feed that into RGL for traversal and search operations.
If I remember correctly, RGL has enough abstraction that you may be able to work with an existing class structure and you simply provide methods to get adjacent nodes.
Assuming that it is a directed graph, use a mapping table such as
id | src | dest
where src and dest are FKs to your object table.
If your objects are not all of the same type, either have them all inherit a ruby class or have another table:
id | type | type_id
Where type is the type of object it is and type_id is its id in another table.
By doing this, you should be able to get an array of the objects that each object points to using:
select dest
from maptable
where src = self.id
If you need to know its inbound edges, you can perform the same type of query with src and dest swapped.
From there, you should be able to easily write any graph algorithms that you want. If you need weights, you can modify the mapping table as such.
id | src | dest | weight
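A minimal runnable sketch of this mapping table, using Python's built-in sqlite3 module with an in-memory database so it is self-contained. The sample edges and the helper names outbound and inbound are made up for illustration. Outbound edges of a node are the rows whose src is that node; inbound edges are the rows whose dest is that node.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE maptable ("
    " id INTEGER PRIMARY KEY, src INTEGER NOT NULL,"
    " dest INTEGER NOT NULL, weight REAL)"
)
# A small cycle 1 -> 2 -> 3 -> 1, plus a weighted shortcut 1 -> 3.
conn.executemany(
    "INSERT INTO maptable (src, dest, weight) VALUES (?, ?, ?)",
    [(1, 2, 1.0), (2, 3, 1.0), (3, 1, 1.0), (1, 3, 2.5)],
)


def outbound(node_id):
    """IDs of the nodes that node_id points to."""
    rows = conn.execute(
        "SELECT dest FROM maptable WHERE src = ? ORDER BY dest", (node_id,)
    )
    return [r[0] for r in rows]


def inbound(node_id):
    """IDs of the nodes that point to node_id."""
    rows = conn.execute(
        "SELECT src FROM maptable WHERE dest = ? ORDER BY src", (node_id,)
    )
    return [r[0] for r in rows]


out_1 = outbound(1)
in_1 = inbound(1)
```

Cycles need no special handling at the storage level; they only matter once you write a traversal, which then has to remember the nodes it has already visited.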
