Disclaimer: I'm a novice.
I want to simulate a join for my mongodb embedded document. If I have an embedded list:
{
_id: ObjectId("5320f6c34b6576d373000000"),
user_id: "52f581096b657612fe020000",
list: "52f4fd9f52e39bc0c15674ea"
{
player_1: "52f4fd9f52e39bc0c15674ex",
player_2: "52f4fd9f52e39bc0c15674ey",
player_3: "52f4fd9f52e39bc0c15674ez"
}
}
And a player collection with each player being something like:
{
_id: ObjectId("52f4fd9f52e39bc0c15674ex"),
college: "Louisville",
headshot: "player.png",
height: "6'2",
name: "Wayne Brady",
position: "QB",
weight: 205
}
I want to end up with:
{
_id: ObjectId("5320f6c34b6576d373000000"),
user_id: "52f581096b657612fe020000",
list: "52f4fd9f52e39bc0c15674ea"
{
player_1:
{
_id: ObjectId("52f4fd9f52e39bc0c15674ex"),
college: "Louisville",
headshot: "player.png",
height: "6'2",
name: "Wayne Brady",
position: "QB",
weight: 205
},
etc...
}
}
So I can call User.lists.first.player_1.name.
This is what makes sense in my mind since I'm new to rails...and I don't want to embed players in each user's list because I'd have so many redundancies...
Advice? Is this possible, if so how? Is it a good idea, or is there a better way?
So you have a typical relational model, let's call it "one to many", in which you have users or "user teams" and a whole pool of players. And in typical modelling form you want to "normalize" this to avoid duplication.
But here's the thing, MongoDB does not do joins. Joins are not "webscale" in the current parlance. So it leaves you thinking what to do. Or does MongoDB do joins?
db.eval(function() {
var user = db.user.findOne({ "user_id": "52f581096b657612fe020000" });
for ( var k in user.list ) {
var player = db.player.findOne({ "_id": user.list[k] });
user.list[k] = player;
}
return user;
});
Which "arguably" is "kind of a join". And it was all done on the server, right?
But DO NOT DO THAT. While db.eval() has uses, something that you are going to query regularly is not one of the practical uses. Read the documentation, which shows the warnings. In particular, all JavaScript is running in a single thread so that will lock things up very quickly.
Now client side, you are more or less doing the same thing. And your ODM of choice is likely again doing "the same thing", though it is usually hiding it away in some manner so you don't see that part. Likewise the same could likely be said of your SQL ORM, which was also "sneaking off behind your back" and querying the database while you just accessed the objects in your code.
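A minimal sketch of that client-side "join", written as plain JavaScript so it runs without a database. The helper name attachPlayers and the sample documents are illustrative assumptions, not anything from your code. In a real application the players array would probably come from one query such as db.player.find({ "_id": { "$in": ids } }) rather than one lookup per key.

```javascript
// Hypothetical helper: replace each player id in user.list with the
// matching player document. Ids are plain strings here for simplicity.
function attachPlayers(user, players) {
  // Index the fetched players by id so each lookup is a direct access.
  var byId = {};
  players.forEach(function (p) { byId[String(p._id)] = p; });

  var joined = {};
  Object.keys(user.list).forEach(function (k) {
    // Keep the raw id when no player document was found.
    joined[k] = byId[user.list[k]] || user.list[k];
  });

  // Return a copy so the stored shape of `user` is left untouched.
  return Object.assign({}, user, { list: joined });
}

// Sample data (made up, following the shapes in the question).
var user = {
  user_id: "52f581096b657612fe020000",
  list: { player_1: "p1", player_2: "p2" }
};
var players = [
  { _id: "p1", name: "Wayne Brady", position: "QB" }
];

var result = attachPlayers(user, players);
console.log(result.list.player_1.name); // "Wayne Brady"
console.log(result.list.player_2);      // "p2" (no document found)
```

This is two round trips instead of one, which is the cost the rest of this answer tries to avoid.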
As for mapReduce, the problem with the data you present is that there is nothing to "reduce". There is a technique known as "incremental mapReduce", but it would not be well suited to this type of data. That is a topic in itself, but you would basically need all the "users" associated with the "players" stored in the "player data" as well to make it viable. And it's ultimately just another way of "cheating" joins.
This is the space in which MongoDB exists.
So rather than going and doing all this fetching or joining, it allows the concept of being able to "pre-join" your data as it were. And the point of this is to allow faster, and more atomic reads and writes. And this is known as embedding.
Looking at your data, there should not be a problem with embedding at all. Consider the points:
Presumably you are modelling "fantasy teams" for a given user. It would be fair to say that a "team" does not consist of an infinite number of players.
Aside from other things, your "A1" usage is likely to be "displaying" the players associated with that "user team". You want to "display" as much information as possible and keep that to a single read operation. You also want to easily add "players" to the "user team".
While a "player" may have "extended information", and possibly even some global statistics or scores, that is probably not information you need every time you access the "user team". It can probably be written independently, and only read when looking at the "player detail".
Those are three good cases to support embedding. Sure, you would be duplicating information stored against each user team, as opposed to just a small "key" reference. And sure, that information is likely to exist elsewhere in the full "player detail", and that would be duplication as well.
But the point of the "duplication" here is to optimize. So here it would seem valid to embed "some of the data", not all, but what you regularly use in your main operations. Considering the "player's" name, position, height and weight are not likely to change on a regular basis or not even at all in the context, then that seems a reasonable trade-off.
{
"_id": ObjectId("5320f6c34b6576d373000000"),
"user_id": ObjectId("52f581096b657612fe020000"),
"list": [
{
"_id": ObjectId("52f4fd9f52e39bc0c15674ex"),
"label": "Player1",
"college": "Louisville",
"headshot": "player.png",
"height": "6'2",
"name": "Wayne Brady",
"position": "QB",
"weight": 205
},
{
"label": "Player2",
(...)
}
]
},
That's not that bad. And it would take a lot to break the 16MB limit. And considering this seems to be a "user team" then it could probably do with some information from the "user" as well.
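As a rough sketch of the "embed some of the data" idea, the function below copies only the fields used for display from a full player document into the team's list. The name toEmbeddedPlayer, the exact field subset and the stats field are assumptions for illustration. With a real collection, adding the entry would most likely be a "$push" update on the team document.

```javascript
// Hypothetical helper: copy only the fields the team view needs.
function toEmbeddedPlayer(player, label) {
  return {
    _id: player._id,
    label: label,
    name: player.name,
    position: player.position,
    height: player.height,
    weight: player.weight
  };
}

// Full player document; `stats` is a made-up example of extended detail
// that stays in the player collection only.
var player = {
  _id: "52f4fd9f52e39bc0c15674ex",
  college: "Louisville",
  headshot: "player.png",
  height: "6'2",
  name: "Wayne Brady",
  position: "QB",
  weight: 205,
  stats: { games: 12 }
};

var team = { user_id: "52f581096b657612fe020000", list: [] };
team.list.push(toEmbeddedPlayer(player, "Player1"));
console.log(JSON.stringify(team.list[0]));
```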
You also get a lot of power out of this when data is kept together like this, to find the top "player" picked by each user:
db.userteams.aggregate([
// Unwind the array
{ "$unwind": "$list" },
// Group and use the player name
{ "$group": {
"_id": {
"user_id": "$user_id",
"player": "$list.name",
},
"count": { "$sum": 1 }
}},
// Sort the results descending by popularity
{ "$sort": { "_id.user_id": 1, "count": -1 } },
// Group to limit the first one
{ "$group": {
"_id": "$_id.user_id",
"player": { "$first": "$_id.player" },
"picks": { "$first": "$count" }
}}
])
Which admittedly is a reasonably trivial use of a name in this case, but it is an example of using information that has become available by the use of some embedding.
Of course, if you really believe that you need everything to be properly normalized, then do it that way, and live with the patterns you would need to access it. But this offers a perspective on doing this another way.
So don't over-concern yourself with embedding everything, and lose a little fear of embedding some things. There are no "get out of jail free cards" for using something not suited to relational modeling in a standard relational way. Choose something that suits your needs.
The log4j2 PatternLayout offers a %notEmpty conversion pattern that allows you to skip sections of the pattern that refer to empty variables.
Is there any way to do something similar for JsonTemplateLayout, specifically for thread context data (MDC)? It correctly (IMO) suppresses null fields, but it doesn't do the same with empty ones.
E.g., given the following in my JSON template:
"application": {
"name": { "key": "x-app", "$resolver": "mdc" },
"context": { "key": "x-app-context", "$resolver": "mdc" },
"instance": {
"name": { "key": "x-appinst", "$resolver": "mdc" },
"context": { "key": "x-appinst-context", "$resolver": "mdc" }
}
}
is there a way to prevent blocks like this from being logged, where the only data in the subtree is the empty string values for context?
"application":{"context":"","instance":{"context":""}}
(Yes, ideally I'd prevent those empty strings being put into the context in the first place, but this isn't my app, I'm just configuring it.)
JsonTemplateLayout author speaking here. Currently, JsonTemplateLayout doesn't support blank property exclusion for the following reasons:
The definition of empty/blank is ambiguous. One might have null, {}, "\s*", [], [[]], [{}], etc. as valid JSON values. Which of these are empty/blank? Let's assume we have agreed on a certain behavior. Will it suit the rest of its users?
Checking if a value is empty/blank incurs an extra runtime cost.
Most of the time you don't care. You persist logs in a storage system, e.g., ELK stack, and there blank value elimination is provided out of the box by the storage engine in the most efficient way.
Would you mind sharing your use case, please? Why do you want to prevent the emission of "context": "" properties? If you deliver your logs to Elasticsearch, there you can easily exclude such fields via appropriate index mappings.
Near as I can tell, no. I would suggest you create a Jira issue to get that addressed.
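If the logs pass through a processing step you control before they are stored, blank values could be dropped there instead. A rough sketch in JavaScript, under the assumption that "empty" means the empty string or an object left with no keys (the ambiguity described above still applies). pruneEmpty is a hypothetical helper and not part of log4j2.

```javascript
// Hypothetical post-processing helper; not part of log4j2.
// Removes "" values and any objects that end up with no keys.
function pruneEmpty(value) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return value; // primitives and arrays are returned unchanged
  }
  var out = {};
  Object.keys(value).forEach(function (key) {
    var v = pruneEmpty(value[key]);
    var isEmptyObject =
      v !== null && typeof v === "object" && !Array.isArray(v) &&
      Object.keys(v).length === 0;
    if (v !== "" && !isEmptyObject) {
      out[key] = v;
    }
  });
  return out;
}

var line = '{"message":"started","application":{"context":"","instance":{"context":""}}}';
var cleaned = pruneEmpty(JSON.parse(line));
console.log(JSON.stringify(cleaned)); // {"message":"started"}
```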
New to ElasticSearch and I was wondering if there is a way to construct conditional queries/filters. I am working with Rails, so I suppose it has to be on that particular level, as I couldn't find anything that points to conditional queries at ES-Level and I am pretty sure it was silly just to assume!
So here is the (working) query I have:
search_definition = {
query: {
bool: {
must: [
{
more_like_this: {
fields: tag_types,
docs: [
{
_index: self.class.index_name,
_type: self.class.document_type,
_id: id
}
],
min_term_freq: 1
}
}
],
should: [
range: {
age: {
gte: min_age,
lte: max_age,
boost: 4.0
}
}
],
filter: {
bool: {
must: [
{
term: {
active: true
}
},
{
geo_distance: {
distance: xdistance,
unit: "km",
location: {
lat: xlat,
lon: xlng
},
boost: 5.0
}
}
]
}
}
}
},
size: how_many
}
And it works perfectly fine. Now let's assume I'd like to apply additional filters. In this particular example, I need to verify that the users on the other end are, in fact, looking for a person of the searching user's gender. This is held in 2 separate boolean attributes in the database (male/female). I thought it would be simple enough to prepare two similar filters. However, there are a few more conditional filters that feed into the queries, and I would eventually end up with more than ten pre-prepared filters. There must be a more elegant way! Thank you!
Are you familiar with elasticsearch search templates?
Using search templates you can have conditional and dynamic queries. For example, you can have a list of fields and values for a terms filter and pass it to the search template as a parameter.
As suggested by Mohammad, in the end I pursued a solution using ES search templates, which made my life a lot easier. The problem with JBuilder, ElasticSearch-DSL and other solutions is that they appear not to work with the latest ES, and so I am not sure where I would end up should there ever be any changes to the gems or the version of ES. Cutting out the middle man and taking full control with templates, which are in fact super easy to create, made a lot of sense to me. The versions I set up with JBuilder and ES-DSL never worked correctly, as their output was random at best.
Search Templates -> More Information
JBuilder -> More Information
ElasticSearch-DSL -> More Information
There are other solutions that I haven't tried, but with search templates, I didn't see any need for that.
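For comparison with templates, the same conditional idea can be sketched in application code: start with the filters that always apply and append a clause only when its condition holds, instead of preparing one full query per combination. This is a plain JavaScript illustration and not the Rails code from the question. buildFilters, its option names, and the assumption that the male/female booleans are indexed under those field names are all hypothetical.

```javascript
// Hypothetical builder: each optional condition adds one filter clause.
function buildFilters(opts) {
  // Clauses that always apply.
  var filters = [{ term: { active: true } }];

  // Assumed field names: `male` / `female` are true when the other
  // user is looking for a person of that gender.
  if (opts.searcherGender === "male") {
    filters.push({ term: { male: true } });
  } else if (opts.searcherGender === "female") {
    filters.push({ term: { female: true } });
  }

  // Only add the distance filter when all of its inputs are present.
  if (opts.distanceKm != null && opts.lat != null && opts.lon != null) {
    filters.push({
      geo_distance: {
        distance: opts.distanceKm + "km",
        location: { lat: opts.lat, lon: opts.lon }
      }
    });
  }

  return filters;
}

var filters = buildFilters({
  searcherGender: "female",
  distanceKm: 25,
  lat: 52.5,
  lon: 13.4
});
var body = { query: { bool: { filter: filters } }, size: 10 };
console.log(JSON.stringify(body, null, 2));
```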
It has taken me quite a long (calendar) time to get my head around CouchDB and map/reduce and how I can utilize it for various use cases. One challenge I've set myself is understanding how to use it effectively for normalized data. Sources all over the internet simply stop with "don't use it for normalized data". I do not like the lack of analysis on how to use it effectively with normalized data!
Some of the better resources I've found are below:
CouchDB: Single document vs "joining" documents together
http://www.cmlenz.net/archives/2007/10/couchdb-joins
In both cases, the authors do a great job at explaining how to do a "join" when it is necessary to join documents when there is denormalized commonality across them. If, however, I need to join more than two normalized "tables" the view collation tricks leveraged to query just one row of data together do not work. That is, it seems you need some sort of data about all elements in the join to exist in all documents that would participate in the join, and thus, your data is not normalized!
Consider the following simple Q&A example (question/answer/answer comment):
{ id: "Q1", type: "question", question: "How do I...?" }
{ id: "A1", type: "answer", answer: "Simple... You just..." }
{ id: "C1", type: "answer-comment", comment: "Great... But what about...?" }
{ id: "C2", type: "answer-comment", comment: "Great... But what about...?" }
{ id: "QA1", type: "question-answer-relationship", q_id:"Q1", a_id:"A1" }
{ id: "AC1", type: "answer-comment-relationship", a_id:"A1", c_id:"C1" }
{ id: "AC2", type: "answer-comment-relationship", a_id:"A1", c_id:"C2" }
{ id: "Q2", type: "question", question: "What is the fastest...?" }
{ id: "A2", type: "answer", answer: "Do it this way..." }
{ id: "C3", type: "answer-comment", comment: "Works great! Thanks!" }
{ id: "QA2", type: "question-answer-relationship", q_id:"Q2", a_id:"A2" }
{ id: "AC3", type: "answer-comment-relationship", a_id:"A2", c_id:"C3" }
I want to get one question, its answer, and all of its answer's comments, and no other records from the database, with only one query.
With the data set above, at a high level, you'd need to have views for each record type, ask for a particular question with an id in mind, then in another view, use the question id to look up relationships specified by the question-answer-relationship type, then in another view look up the answer by the id obtained by the question-answer-relationship type, and so on and so forth, aggregating the "row" over a series of requests.
Another option might be to create some sort of application that does the process above to cache denormalized documents in the desired format and automatically reacts to the normalized data being updated. This feels awkward and like a reimplementation of something that already exists/should exist.
After all of this background, the ultimate question is: Is there a better way to do this so the database, rather than the application, does the work?
Thanks in advance for anyone sharing their experience!
The document model you have is what I would use with a traditional relational database, since you can perform joins more naturally with those ids.
For a document database, however, this introduces complexity, since "joining" documents with MapReduce isn't the same thing.
In the Q&A scenario you presented, I would model it as follows:
{
id: "Q1",
type: "question",
question: "How do I...?",
answers: [
{
answer: "Simple... You just...",
comments: [
{ comment: "Great... But what about...?" },
{ comment: "Great... But what about...?" }
]
},
{
answer: "Do it this way...",
comments: [
{ comment: "Works great! Thanks!" },
{ comment: "Nope, it doesn't work" }
]
}
]
}
This can solve a lot of issues with reads from the db, but it does make your writes more complex. For example, when adding a new comment to an answer, you will need to:
Get the document out from CouchDB.
Loop through the answers, find the correct position, and push the comment into its array.
Save document back to CouchDB.
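The middle step of that read-modify-write cycle can be sketched as a small function. addComment is a hypothetical helper, and the document literal stands in for what a GET would return. The _rev value is made up, but CouchDB does require the current _rev when the document is saved back.

```javascript
// Hypothetical helper for step 2: returns an updated copy of the question
// document with the comment appended to the chosen answer.
function addComment(questionDoc, answerIndex, text) {
  var answers = questionDoc.answers.map(function (answer, i) {
    if (i !== answerIndex) return answer;
    return Object.assign({}, answer, {
      comments: (answer.comments || []).concat([{ comment: text }])
    });
  });
  return Object.assign({}, questionDoc, { answers: answers });
}

// Step 1 would be a GET of the document; a literal stands in for it here.
var doc = {
  _id: "Q1",
  _rev: "1-abc", // made-up revision value
  type: "question",
  question: "How do I...?",
  answers: [{ answer: "Simple... You just...", comments: [] }]
};

var updated = addComment(doc, 0, "Great... But what about...?");
// Step 3 would PUT `updated` back. A 409 conflict means another writer got
// there first, so the three steps are repeated with the fresh document.
console.log(updated.answers[0].comments.length); // 1
```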
I'd only consider splitting the answers into separate documents if there are a lot of them (e.g. one question yields 1000 answers); otherwise it's easier to just package them in a single document. But even in that case, try putting the relationship info inside the document, e.g.
{
id: "Q1",
type: "question",
question: "How do I...?"
}
{
id: "A1",
type: "answer",
answer: "Simple... You just...",
question_id: "Q1"
}
{
id: "C1",
type: "comment",
comment: "Works great! Thanks!",
answer_id: "A1"
}
This can make your write operations easier, but you will need to create a view to join the documents so that it returns all of them in one request.
And always keep in mind that the result returned from a view is not necessarily a flat structure of rows like the result of an SQL query.
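One possible shape for such a view, sketched so that it runs in Node with a stand-in for CouchDB's emit(). It relies on a loud assumption: comments also carry a question_id, which the documents above do not include, because view collation can only group rows that share a key prefix. The key layout is just one option, and _id is used as CouchDB's real id field.

```javascript
// Rows collected by a stand-in for CouchDB's emit(), so this runs in Node.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map function sketch. ASSUMPTION: comments also carry question_id.
function map(doc) {
  if (doc.type === "question") {
    emit([doc._id, 0], null);
  } else if (doc.type === "answer") {
    emit([doc.question_id, 1, doc._id, 0], null);
  } else if (doc.type === "comment") {
    emit([doc.question_id, 1, doc.answer_id, 1], null);
  }
}

[
  { _id: "Q1", type: "question", question: "How do I...?" },
  { _id: "A1", type: "answer", answer: "Simple... You just...", question_id: "Q1" },
  { _id: "C1", type: "comment", comment: "Works great! Thanks!",
    answer_id: "A1", question_id: "Q1" }
].forEach(map);

console.log(JSON.stringify(rows.map(function (r) { return r.key; })));
// [["Q1",0],["Q1",1,"A1",0],["Q1",1,"A1",1]]
```

Querying a view like this with startkey=["Q1"], endkey=["Q1",{}] and include_docs=true should return the question, then each answer followed by its comments, in a single request.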
Is there any mechanism for controlling the order of properties?
I cannot reproduce this in http://www.neo4j.org/console
Using Neo4j 1.9.2 Community if I do the following:
CREATE (m1 {`$type`: {moduleTypeName}, Name: 'M1', ModelNumber: 'MN1'})
When I later get this node back from a Cypher query using the REST Cypher endpoint, I get back...
{
"extensions": {},
"paged_traverse": "http://localhost:7575/db/data/node/3777/paged/traverse/{returnType}{?pageSize,leaseTime}",
"outgoing_relationships": "http://localhost:7575/db/data/node/3777/relationships/out",
"traverse": "http://localhost:7575/db/data/node/3777/traverse/{returnType}",
"all_typed_relationships": "http://localhost:7575/db/data/node/3777/relationships/all/{-list|&|types}",
"property": "http://localhost:7575/db/data/node/3777/properties/{key}",
"all_relationships": "http://localhost:7575/db/data/node/3777/relationships/all",
"self": "http://localhost:7575/db/data/node/3777",
"properties": "http://localhost:7575/db/data/node/3777/properties",
"outgoing_typed_relationships": "http://localhost:7575/db/data/node/3777/relationships/out/{-list|&|types}",
"incoming_relationships": "http://localhost:7575/db/data/node/3777/relationships/in",
"incoming_typed_relationships": "http://localhost:7575/db/data/node/3777/relationships/in/{-list|&|types}",
"create_relationship": "http://localhost:7575/db/data/node/3777/relationships",
"data": {
"ModelNumber": "MN1",
"$type": "ModuleType",
"Name": "M1"
}
}
I'm using http://james.newtonking.com/pages/json-net.aspx to parse JSON and for it to automatically infer an object type, the $type property must be first. It makes sense when parsing the JSON in a stream when you don't want to load the entire thing into memory first.
It does not appear to be alphabetical, and it does not seem to be random either. It seems that the order is consistent for different object types, but inconsistent between them.
I have pulled the node in the Shell as well and so it seems that the order does not depend on how I get the node, but is not related to the order in which I create the node either.
Properties have no guaranteed order. Do not make any assumptions about a "maybe" ordering. An upcoming version might change this assumed behaviour and break your code.
I guess it is simpler in Cypher to not return the node itself in favour of a list of properties, e.g.
START node=node(<myid>)
RETURN node.`$type`, node.ModelNumber, node.Name
This has defined columns.
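If the node's data object is consumed as JSON anyway, another option might be to reorder the keys on the client before the type-inferring parser sees them. A small sketch in JavaScript (not the C# side of the question), relying on the fact that JavaScript objects keep insertion order for non-numeric string keys and that JSON.stringify follows that order. typeFirst is a hypothetical name.

```javascript
// Hypothetical helper: rebuild the object so that "$type" is the first key.
function typeFirst(data) {
  var ordered = {};
  if (Object.prototype.hasOwnProperty.call(data, "$type")) {
    ordered["$type"] = data["$type"];
  }
  Object.keys(data).forEach(function (key) {
    if (key !== "$type") ordered[key] = data[key];
  });
  return ordered;
}

// The "data" block from the response above.
var data = { ModelNumber: "MN1", "$type": "ModuleType", Name: "M1" };
console.log(JSON.stringify(typeFirst(data)));
// {"$type":"ModuleType","ModelNumber":"MN1","Name":"M1"}
```

Keys that look like integers would still be serialized before "$type", so this only helps for ordinary property names.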
It definitely seems not to have that functionality.
My workaround is to alias the properties with a prefix in the format aXX_, as in a01_, a02_, a03_, and then remove it in the code.
Not pretty, not great, but it works as neo4j respects numerical order.
It needs a letter character at the beginning though, hence the "a" before the numbers.
I am trying to store a network layout in CouchDB, but my solution produces a rather randomized graph.
I store nodes as documents:
{_id ,
nodeName,
group}
and store links in the traditional way:
{_id, source_id, target_id, value}
Following multiple tutorials on handling joins and multiple relationships in CouchDB, I created this view:
function(doc) {
if(doc.type == 'connection') {
if (doc.source_id)
emit("source", {'_id': doc.source_id});
if(doc.target_id)
emit("target", {'_id': doc.target_id});
}
}
which should emit a sequence of source and target ids. I then pass it to a list function with include_docs=true, which assumes that sources and targets come in pairs and stitches everything back into a structure like this:
{
"nodes":[
{"nodeName":"Name 1","group":"1"},
{"nodeName":"Name 2","group":"1"}
],
"links": [
{"source":7,"target":0,"value":1},
{"source":7,"target":5,"value":1}
]
}
Although my list function produces proper JSON, the view map returns all the rows of source docs and then all the rows of target docs.
So far I don't have any ideas how to make this work properly. I am happy to fetch additional values by document _id in the list function, but so far I haven't found any good examples.
Alternative ways of achieving the same goal are welcome. _id values are standard for CouchDB so far.
Update: while writing the question I came up with a different view which sorted my immediate problem, but I would still like to see other options.
updated map:
function(doc) {
if(doc.type == 'connection') {
if (doc.source_id)
emit([doc._id,0,"source"], {'_id': doc.source_id});
if(doc.target_id)
emit([doc._id,1,"target"], {'_id': doc.target_id});
}
}
Your updated map function makes more sense. However, you don't need 0 and 1 in your key, since you already have "source" and "target", and "source" sorts before "target".
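The stitching step could then look roughly like the sketch below, written as plain JavaScript over rows shaped like a view response with include_docs=true. It makes two assumptions that go beyond the question: the rows arrive sorted by key, so each connection's "source" row is followed by its "target" row, and the map also emits the link's value next to the _id (for example {'_id': doc.source_id, 'value': doc.value}). buildGraph is a hypothetical name.

```javascript
// Hypothetical stitching step, as a list function might do it.
function buildGraph(rows) {
  var nodes = [];
  var indexById = Object.create(null); // node _id -> position in `nodes`
  var links = [];
  var current = null;

  function nodeIndex(doc) {
    if (!(doc._id in indexById)) {
      indexById[doc._id] = nodes.length;
      nodes.push({ nodeName: doc.nodeName, group: doc.group });
    }
    return indexById[doc._id];
  }

  rows.forEach(function (row) {
    var connectionId = row.key[0];
    var role = row.key[row.key.length - 1]; // "source" or "target"
    if (!current || current.id !== connectionId) {
      current = { id: connectionId };
    }
    current[role] = nodeIndex(row.doc);
    current.value = row.value.value; // ASSUMPTION: emitted by the map
    if (current.source !== undefined && current.target !== undefined) {
      links.push({
        source: current.source,
        target: current.target,
        value: current.value
      });
      current = null;
    }
  });

  return { nodes: nodes, links: links };
}

// Made-up rows for two connections between two nodes.
var n1 = { _id: "n1", nodeName: "Name 1", group: "1" };
var n2 = { _id: "n2", nodeName: "Name 2", group: "1" };
var rows = [
  { key: ["c1", "source"], value: { _id: "n1", value: 1 }, doc: n1 },
  { key: ["c1", "target"], value: { _id: "n2", value: 1 }, doc: n2 },
  { key: ["c2", "source"], value: { _id: "n2", value: 1 }, doc: n2 },
  { key: ["c2", "target"], value: { _id: "n1", value: 1 }, doc: n1 }
];

var graph = buildGraph(rows);
console.log(JSON.stringify(graph));
```

Each node appears once in nodes, and the links refer to nodes by their position in that array, which matches the structure shown in the question.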