How do I effectively use CouchDB with normalized data? - join

It has taken me quite a long (calendar) time to get my head around CouchDB and map/reduce and how I can utilize it for various use cases. One challenge I've set myself is understanding how to use it effectively for normalized data. Sources all over the internet simply stop at "don't use it for normalized data". I do not like the lack of analysis on how to use it effectively with normalized data!
Some of the better resources I've found are below:
CouchDB: Single document vs "joining" documents together
http://www.cmlenz.net/archives/2007/10/couchdb-joins
In both cases, the authors do a great job at explaining how to do a "join" when it is necessary to join documents when there is denormalized commonality across them. If, however, I need to join more than two normalized "tables" the view collation tricks leveraged to query just one row of data together do not work. That is, it seems you need some sort of data about all elements in the join to exist in all documents that would participate in the join, and thus, your data is not normalized!
Consider the following simple Q&A example (question/answer/answer comment):
{ id: "Q1", type: "question", question: "How do I...?" }
{ id: "A1", type: "answer", answer: "Simple... You just..." }
{ id: "C1", type: "answer-comment", comment: "Great... But what about...?" }
{ id: "C2", type: "answer-comment", comment: "Great... But what about...?" }
{ id: "QA1", type: "question-answer-relationship", q_id:"Q1", a_id:"A1" }
{ id: "AC1", type: "answer-comment-relationship", a_id:"A1", c_id:"C1" }
{ id: "AC2", type: "answer-comment-relationship", a_id:"A1", c_id:"C2" }
{ id: "Q2", type: "question", question: "What is the fastest...?" }
{ id: "A2", type: "answer", answer: "Do it this way..." }
{ id: "C3", type: "answer-comment", comment: "Works great! Thanks!" }
{ id: "QA2", type: "question-answer-relationship", q_id:"Q2", a_id:"A2" }
{ id: "AC3", type: "answer-comment-relationship", a_id:"A2", c_id:"C3" }
I want to get one question, its answer, and all of its answer's comments, and no other records from the database, with only one query.
With the data set above, at a high level, you'd need to have views for each record type, ask for a particular question with an id in mind, then in another view, use the question id to look up relationships specified by the question-answer-relationship type, then in another view look up the answer by the id obtained by the question-answer-relationship type, and so on and so forth, aggregating the "row" over a series of requests.
Another option might be to create some sort of application that does the process above to cache denormalized documents in the desired format and that automatically reacts to the normalized data being updated. This feels awkward and like a reimplementation of something that already exists/should exist.
After all of this background, the ultimate question is: Is there a better way to do this so the database, rather than the application, does the work?
Thanks in advance for anyone sharing their experience!

The document model you have is what I would use if I were using a traditional relational database, since you can perform joins more naturally with those ids.
For a document database, however, this introduces complexity, since 'joining' documents with MapReduce isn't the same thing.
In the Q&A scenario you presented, I would model it as follows:
{
id: "Q1",
type: "question",
question: "How do I...?",
answers: [
{
answer: "Simple... You just...",
comments: [
{ comment: "Great... But what about...?" },
{ comment: "Great... But what about...?" }
]
},
{
answer: "Do it this way...",
comments: [
{ comment: "Works great! Thanks!" },
{ comment: "Nope, it doesn't work" }
]
}
]
}
This solves a lot of issues when reading from the db, but it does make your writes more complex. For example, when adding a new comment to an answer, you will need to:
Get the document out of CouchDB.
Loop through the answers, find the correct one, and push the comment into its comments array.
Save the document back to CouchDB.
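The three steps above can be sketched as a small pure function. This is only a sketch under assumptions: the helper name addComment and the answerIndex parameter are made up for illustration, and fetching and saving the document are left to whatever CouchDB client you use.

```javascript
// Hypothetical helper: returns a copy of the question document with one more
// comment on the chosen answer. The input document is left untouched.
function addComment(questionDoc, answerIndex, commentText) {
  if (!questionDoc.answers || !questionDoc.answers[answerIndex]) {
    throw new Error("No answer at index " + answerIndex);
  }
  const answers = questionDoc.answers.map((answer, i) =>
    i === answerIndex
      ? Object.assign({}, answer, {
          comments: (answer.comments || []).concat({ comment: commentText })
        })
      : answer
  );
  return Object.assign({}, questionDoc, { answers: answers });
}

// Step 1 (get) is simulated with a literal document.
const fetchedQuestion = {
  id: "Q1",
  type: "question",
  question: "How do I...?",
  answers: [
    { answer: "Simple... You just...", comments: [{ comment: "Great... But what about...?" }] }
  ]
};

// Step 2: find the answer and push the comment.
const updatedQuestion = addComment(fetchedQuestion, 0, "Works great! Thanks!");

// Step 3 (save) would send updatedQuestion back to CouchDB.
console.log(updatedQuestion.answers[0].comments.length);
```

When saving, CouchDB rejects the update with a conflict if the document's _rev is stale, so the get/modify/save cycle may need to be retried.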
I'd only consider splitting the answers into separate documents if there are a lot of them (e.g. 1 question yields 1000 answers); otherwise it's easier to just package them in a single document. But even in that case, try putting the relationship info inside the document, e.g.
{
id: "Q1",
type: "question",
question: "How do I...?"
}
{
id: "A1",
type: "answer",
answer: "Simple... You just...",
question_id: "Q1"
}
{
id: "C1",
type: "comment",
comment: "Works great! Thanks!",
answer_id: "A1"
}
This can make your write operations easier, but you will need to create a view to join the documents so that it returns all of them in one request.
And always keep in mind that the result returned from a view is not necessarily a flat structure of rows like the result of an SQL query.
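A minimal sketch of such a joining view follows. It relies on one loud assumption that is not in the documents above: each comment also carries a question_id, so that every row can be keyed by its question. The names mapJoin, compareKeys and queryQuestion are hypothetical, and the sorting is a simplified stand-in for CouchDB's view collation so that the example runs on its own.

```javascript
// Sample documents; question_id on comments is an added assumption.
const qaDocs = [
  { id: "C2", type: "comment", comment: "Great... But what about...?", answer_id: "A1", question_id: "Q1" },
  { id: "A1", type: "answer", answer: "Simple... You just...", question_id: "Q1" },
  { id: "Q2", type: "question", question: "What is the fastest...?" },
  { id: "Q1", type: "question", question: "How do I...?" },
  { id: "C1", type: "comment", comment: "Works great! Thanks!", answer_id: "A1", question_id: "Q1" }
];

// Map function. In a real design document it would take only `doc` and call
// the global emit(); emit is passed in here so the sketch runs standalone.
function mapJoin(doc, emit) {
  if (doc.type === "question") {
    emit([doc.id, 0], doc);
  } else if (doc.type === "answer") {
    emit([doc.question_id, 1, doc.id, 0], doc);
  } else if (doc.type === "comment") {
    emit([doc.question_id, 1, doc.answer_id, 1, doc.id], doc);
  }
}

// Simplified collation: compare array keys element by element; a key that is
// a prefix of another sorts first.
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

function queryQuestion(allDocs, questionId) {
  const rows = [];
  allDocs.forEach(doc => mapJoin(doc, (key, value) => rows.push({ key: key, value: value })));
  rows.sort((x, y) => compareKeys(x.key, y.key));
  // Similar in spirit to querying with startkey=[questionId] and endkey=[questionId, {}].
  return rows.filter(row => row.key[0] === questionId);
}

const joinedIds = queryQuestion(qaDocs, "Q1").map(row => row.value.id);
console.log(joinedIds);
```

The rows come back as the question, then each answer followed by its comments, which the application can fold into a nested structure in a single pass.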

Related

Mongoid Return Specific Object from Array in Document

This seems as though it should be simple but I have been struggling with this for a while with no luck.
Let's assume I have a simple document that looks like the following:
{
data: [
{
name: "Minnesota",
},
{
name: "Mississippi",
},
...
]
}
If I run the following query in my Mongo Shell, everything works as I would expect:
db.collection.find({}, {data: {$elemMatch: {name: "Michigan"}}})
Returns:
{ "_id" : ObjectId("5e9ba60998d1ff88be83fffe"), "data" : [ { "name" : "Michigan" } ] }
However, using mongoid, attempting to run a similar query returns every object inside of the data array. Here is one of the many queries I've tried:
Model.where({data: {"$elemMatch": {name: "Michigan"}}}).first
As I mentioned above, that little query returns everything inside the data array, not the specific object I'm trying to pull out of the document.
Any help would be appreciated. I'm trying to avoid returning the results and post-processing them with Ruby. I'd love to handle this at the DB level.
Thank you.
There was a very similar question earlier for a different driver. Apparently the ruby driver behaves differently than the shell.
Try running your find as the equivalent database command:
session.command({'find' => 'my_collection', 'filter' => {}, 'projection' => {data: {'$elemMatch' => {name: "Michigan"}}}})
The Mongoid syntax for projections is the only method.

Conditional queries for ElasticSearch 5.x (elasticsearch-rails/elasticsearch-model)

New to ElasticSearch and I was wondering if there is a way to construct conditional queries/filters. I am working with Rails, so I suppose it has to be on that particular level, as I couldn't find anything that points to conditional queries at ES-Level and I am pretty sure it was silly just to assume!
So here is the (working) query I have:
search_definition = {
query: {
bool: {
must: [
{
more_like_this: {
fields: tag_types,
docs: [
{
_index: self.class.index_name,
_type: self.class.document_type,
_id: id
}
],
min_term_freq: 1
}
}
],
should: [
range: {
age: {
gte: min_age,
lte: max_age,
boost: 4.0
}
}
],
filter: {
bool: {
must: [
# Both conditions go in one array; repeating the must: key would make Ruby keep only the last one.
{
term: {
active: true
}
},
{
geo_distance: {
distance: xdistance,
unit: "km",
location: {
lat: xlat,
lon: xlng
},
boost: 5.0
}
}
]
}
}
}
},
size: how_many
}
And it works perfectly fine. Now let's assume I'd like to apply additional filters. In this particular example, I need to verify that the users on the other end are, in fact, looking for a person of the searching user's gender. This is held in 2 separate boolean attributes in the database (male/female). I thought it would be simple enough to prepare two similar filters. However, there are a few more conditional filters that feed into the queries, and I would eventually end up with more than ten pre-prepared filters. There must be a more elegant way! Thank you!
Are you familiar with elasticsearch search templates?
Using search templates you can have conditional and dynamic queries. For example, you can have a list of fields and values for a terms filter and pass it to the search template as a parameter.
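To illustrate what "conditional" means here, the sketch below assembles the filter clauses in application code. Note that this is a different and simpler technique than a search template, which would express the same conditions inside the template itself. The function name buildFilters, the option names, and the field names looking_for_male and looking_for_female are all hypothetical.

```javascript
// Hypothetical builder: only the filters that apply to this search are added.
function buildFilters(options) {
  const filters = [{ term: { active: true } }];

  // Gender preference of the users on the other end (field names are assumptions).
  if (options.seekerGender === "male") {
    filters.push({ term: { looking_for_male: true } });
  } else if (options.seekerGender === "female") {
    filters.push({ term: { looking_for_female: true } });
  }

  // Distance filter is added only when a location is supplied.
  if (options.location) {
    filters.push({
      geo_distance: {
        distance: options.distanceKm + "km",
        location: { lat: options.location.lat, lon: options.location.lon }
      }
    });
  }

  return { bool: { filter: filters } };
}

const exampleFilter = buildFilters({
  seekerGender: "female",
  distanceKm: 25,
  location: { lat: 52.5, lon: 13.4 }
});
console.log(JSON.stringify(exampleFilter, null, 2));
```

Each additional condition becomes one more if block instead of one more pre-prepared query.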
As suggested by Mohammad, in the end I pursued a solution using ES search templates, which made my life a lot easier. The problem with JBuilder, ElasticSearch-DSL and other solutions is that they appear not to work with the latest ES, and consequently I am not sure where I would end up should there ever be any changes to the gems or the version of ES. So cutting out the middleman and taking full control with templates, which are in fact super easy to create, made a lot of sense to me. The versions I set up with JBuilder and ES-DSL never worked correctly, as their output was random at best.
Search Templates -> More Information
JBuilder -> More Information
ElasticSearch-DSL -> More Information
There are other solutions that I haven't tried, but with search templates, I didn't see any need for that.

Left join on 2 foreign keys with Rails and active record

I have 3 objects :
Question
- id
- question
Answer
- question_id
- user_id
- answer
User
- id
One Answer belongs to one User and one Question.
I would like to get every question and its associated answer.
And I want only the answers of a specific user.
I have tried both of the queries below, but neither works as expected.
In this case, once one user has answered the questions, a second user running the query gets nothing back.
@questions = Question.includes(:answer).where(answers: { user_id: [@user.id, nil] })
This one returns the expected data with postgres, but then Rails is making a request for every question to get the answer but without using the condition on the user_id
@questions = Question.joins('left join answers on answers.question_id = questions.id and answers.user_id = @user_id')
I would like to produce this kind of Json for every User.
[
{
id: 1,
question: 'question',
answer: { id:22, answer: 'answer' }
},
{
id: 2,
question: 'question2',
answer: { id:23, answer: 'answer' }
},
{
id: 3,
question: 'question3',
answer: null
},
]
Any help is very welcome.
I can't figure this out

Mongoid simulate join by inject or mapreduce?

Disclaimer: I'm a novice.
I want to simulate a join for my mongodb embedded document. If I have an embedded list:
{
_id: ObjectId("5320f6c34b6576d373000000"),
user_id: "52f581096b657612fe020000",
list: "52f4fd9f52e39bc0c15674ea"
{
player_1: "52f4fd9f52e39bc0c15674ex",
player_2: "52f4fd9f52e39bc0c15674ey",
player_3: "52f4fd9f52e39bc0c15674ez"
}
}
And a player collection with each player being something like:
{
_id: ObjectId("52f4fd9f52e39bc0c15674ex"),
college: "Louisville",
headshot: "player.png",
height: "6'2",
name: "Wayne Brady",
position: "QB",
weight: 205
}
I want to end up with:
{
_id: ObjectId("5320f6c34b6576d373000000"),
user_id: "52f581096b657612fe020000",
list: "52f4fd9f52e39bc0c15674ea"
{
player_1:
{
_id: ObjectId("52f4fd9f52e39bc0c15674ex"),
college: "Louisville",
headshot: "player.png",
height: "6'2",
name: "Wayne Brady",
position: "QB",
weight: 205
},
etc...
}
}
So I can call User.lists.first.player_1.name.
This is what makes sense in my mind since I'm new to rails...and I don't want to embed players in each user's list because I'd have so many redundancies...
Advice? Is this possible, if so how? Is it a good idea, or is there a better way?
So you have a typical relational model, let's call it "one to many", in which you have users or "user teams" and a whole pool of players. And in typical modelling form you want to "normalize" this to avoid duplication.
But here's the thing, MongoDB does not do joins. Joins are not "webscale" in the current parlance. So it leaves you thinking what to do. Or does MongoDB do joins?
db.eval(function() {
var user = db.user.findOne({ "user_id": "52f581096b657612fe020000" });
for ( var k in user.list ) {
var player = db.player.findOne({ "_id": user.list[k] });
user.list[k] = player;
}
return user;
});
Which "arguably" is "kind of a join". And it was all done on the server, right?
But DO NOT DO THAT. While db.eval() has uses, something that you are going to query regularly is not one of the practical uses. Read the documentation, which shows the warnings. In particular, all JavaScript is running in a single thread so that will lock things up very quickly.
Now client side, you are more or less doing the same thing. And your ODM of choice is likely again doing "the same thing", though it is usually hiding it away in some manner so you don't see that part. Likewise the same could likely be said of your SQL ORM, which was also "sneaking off behind your back" and querying the database while you just accessed the objects in your code.
As for mapReduce. Well the problem with the data you present is that there is nothing to "reduce". There is a technique known as "incremental mapReduce" but it would not be well suited to this type of data. A topic in itself, but you would basically need all the "users" associated with the "players" stored in the "player data" as well to make that at all viable. And it's ultimately just another way of "cheating" joins.
This is the space in which MongoDB exists.
So rather than going and doing all this fetching or joining, it allows the concept of being able to "pre-join" your data as it were. And the point of this is to allow faster, and more atomic reads and writes. And this is known as embedding.
Looking at your data, there should not be a problem with embedding at all. Consider the points:
Presumably you are modelling "fantasy teams" for a given user. It would be fair to say that a "team" does not consist of an infinite number of players.
Aside from other things, your "A1" usage is likely to be "displaying" the players associated with that "user team". As such, you want to "display" as much information as possible while keeping that to a single read operation. You also want to easily add "players" to the "user team".
While a "player" may have "extended information", and possibly even some global statistics or scores, that information may well not be something you want to access very often in the context of the "user team". It can probably be written independently, and only read when looking at the "player detail".
Those are three good cases to support embedding. Sure, you would be duplicating information stored against each user team, as opposed to just a small "key" reference. And sure, that information is likely to exist elsewhere in the full "player detail" and that would be duplication as well.
But the point of the "duplication" here is to optimize. So here it would seem valid to embed "some of the data", not all, but what you regularly use in your main operations. Considering the "player's" name, position, height and weight are not likely to change on a regular basis or not even at all in the context, then that seems a reasonable trade-off.
{
"_id": ObjectId("5320f6c34b6576d373000000"),
"user_id": ObjectId("52f581096b657612fe020000"),
"list": [
{
"_id": ObjectId("52f4fd9f52e39bc0c15674ex"),
"label": "Player1",
"college": "Louisville",
"headshot": "player.png",
"height": "6'2",
"name": "Wayne Brady",
"position": "QB",
"weight": 205
},
{
"label": "Player2",
(...)
}
]
},
That's not that bad. And it would take a lot to break the 16MB limit. And considering this seems to be a "user team" then it could probably do with some information from the "user" as well.
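The idea of embedding "some of the data" can be sketched as a small helper that copies only the frequently read fields from a full player document. The helper name toEmbeddedPlayer, the EMBEDDED_FIELDS list, and the extra careerStats field on the sample player are hypothetical; the copied fields follow the example document above.

```javascript
// Fields duplicated into the team document; everything else stays in the
// separate player collection.
const EMBEDDED_FIELDS = ["_id", "college", "headshot", "height", "name", "position", "weight"];

// Hypothetical helper: build the embedded sub-document for a team's list.
function toEmbeddedPlayer(player, label) {
  const embedded = { label: label };
  for (const field of EMBEDDED_FIELDS) {
    if (field in player) {
      embedded[field] = player[field];
    }
  }
  return embedded;
}

// A full player document with some "extended information" (made-up field).
const fullPlayer = {
  _id: "52f4fd9f52e39bc0c15674ex",
  college: "Louisville",
  headshot: "player.png",
  height: "6'2",
  name: "Wayne Brady",
  position: "QB",
  weight: 205,
  careerStats: { games: 40 }
};

const embeddedPlayer = toEmbeddedPlayer(fullPlayer, "Player1");
console.log(embeddedPlayer);
```

The trade-off is the one described above: if an embedded field does change, every team document that carries a copy has to be updated as well.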
You also get a lot of power out of this when data is kept together like this, to find the top "player" picked by each user:
db.userteams.aggregate([
// Unwind the array
{ "$unwind": "$list" },
// Group and use the player name
{ "$group": {
"_id": {
"user_id": "$user_id",
"player": "$list.name",
},
"count": { "$sum": 1 }
}},
// Sort the results descending by popularity
{ "$sort": { "_id.user_id": 1, "count": -1 } },
// Group to limit the first one
{ "$group": {
"_id": "$_id.user_id",
"player": { "$first": "$_id.player" },
"picks": { "$first": "$count" }
}}
])
Which admittedly is a reasonably trivial use of a name in this case, but it is an example of using information that has become available by the use of some embedding.
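To make the stages of that pipeline easier to follow, here is a plain JavaScript sketch of the same computation over an in-memory array. It is not how MongoDB executes the aggregation; the function name topPickPerUser and the sample data are made up for illustration.

```javascript
// In-memory stand-in for the userteams collection (sample data is invented).
const sampleTeams = [
  { user_id: "u1", list: [{ name: "Wayne Brady" }, { name: "Player B" }] },
  { user_id: "u1", list: [{ name: "Wayne Brady" }] },
  { user_id: "u2", list: [{ name: "Player B" }] }
];

function topPickPerUser(teams) {
  // $unwind plus the first $group: count each (user_id, player name) pair.
  const counts = new Map();
  for (const team of teams) {
    for (const player of team.list) {
      const key = JSON.stringify([team.user_id, player.name]);
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }

  // $sort: user_id ascending, then count descending.
  const rows = Array.from(counts, ([key, count]) => {
    const [userId, player] = JSON.parse(key);
    return { userId: userId, player: player, count: count };
  }).sort((a, b) =>
    a.userId < b.userId ? -1 : a.userId > b.userId ? 1 : b.count - a.count
  );

  // Second $group with $first: keep the first row seen for each user.
  const result = [];
  const seenUsers = new Set();
  for (const row of rows) {
    if (!seenUsers.has(row.userId)) {
      seenUsers.add(row.userId);
      result.push({ _id: row.userId, player: row.player, picks: row.count });
    }
  }
  return result;
}

const topPicks = topPickPerUser(sampleTeams);
console.log(topPicks);
```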
Of course, if you really believe that you need everything to be properly normalized, then do it that way, and live with the patterns you would need to access it. But this offers a perspective on doing this another way.
So don't over-concern yourself with embedding everything, and lose a little fear on embedding some things. There are no "get out of jail free cards" for using something not suited to relational modeling in a standard relational way. Choose something that suits your needs.

Storing graph-like structure in Couch DB or do include_docs yourself

I am trying to store a network layout in CouchDB, but my solution produces a rather randomized graph.
I store the nodes as documents:
{_id ,
nodeName,
group}
and store the links in the traditional way:
{_id, source_id, target_id, value}
Following multiple tutorials on handling joins and multiple relationships in CouchDB, I created a view:
function(doc) {
if(doc.type == 'connection') {
if (doc.source_id)
emit("source", {'_id': doc.source_id});
if(doc.target_id)
emit("target", {'_id': doc.target_id});
}
}
This should emit a sequence of source and target ids. I then pass it to a list function with include_docs=true, which assumes that source and target come in pairs and stitches everything back into a structure like this:
{
"nodes":[
{"nodeName":"Name 1","group":"1"},
{"nodeName":"Name 2","group":"1"}
],
"links": [
{"source":7,"target":0,"value":1},
{"source":7,"target":5,"value":1}
]
}
Although my list function produces proper JSON, the view map returns all the rows for the source docs and then all the rows for the target docs.
So far I don't have any ideas for making this work properly. I am happy to fetch additional values by document _id in the list function, but so far I haven't found any good examples.
Alternative ways of achieving the same goal are welcome. The _id values are the CouchDB defaults so far.
Update: while writing the question I came up with a different view which solved my immediate problem, but I would still like to see other options.
Updated map:
function(doc) {
if(doc.type == 'connection') {
if (doc.source_id)
emit([doc._id,0,"source"], {'_id': doc.source_id});
if(doc.target_id)
emit([doc._id,1,"target"], {'_id': doc.target_id});
}
}
Your updated map function makes more sense. However, you don't need 0 and 1 in your key, since you already have "source" and "target", which sort in that order anyway.
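As for stitching the rows back together, here is a minimal sketch of what the list function's logic could look like, written as a standalone function so it can be run outside CouchDB. It assumes rows keyed by [connection id, "source" or "target"] with include_docs=true, and it assumes the map function also emits the link's value as linkValue in each row's value, which the map functions above do not do. The name buildGraph and the sample rows are hypothetical.

```javascript
// Rows as the view might return them, already sorted by key.
const sampleRows = [
  { key: ["c1", "source"], value: { _id: "n7", linkValue: 1 }, doc: { _id: "n7", nodeName: "Name 1", group: "1" } },
  { key: ["c1", "target"], value: { _id: "n0", linkValue: 1 }, doc: { _id: "n0", nodeName: "Name 2", group: "1" } },
  { key: ["c2", "source"], value: { _id: "n7", linkValue: 1 }, doc: { _id: "n7", nodeName: "Name 1", group: "1" } },
  { key: ["c2", "target"], value: { _id: "n5", linkValue: 1 }, doc: { _id: "n5", nodeName: "Name 3", group: "2" } }
];

function buildGraph(viewRows) {
  const nodes = [];
  const indexByNodeId = new Map();   // node _id -> position in nodes
  const linksByConnection = new Map(); // connection _id -> link object

  for (const row of viewRows) {
    const connectionId = row.key[0];
    const role = row.key[1]; // "source" or "target"

    // Register each node once and remember its array index.
    if (!indexByNodeId.has(row.doc._id)) {
      indexByNodeId.set(row.doc._id, nodes.length);
      nodes.push({ nodeName: row.doc.nodeName, group: row.doc.group });
    }

    // Fill in the source or target side of the link for this connection.
    const link = linksByConnection.get(connectionId) || { value: row.value.linkValue };
    link[role] = indexByNodeId.get(row.doc._id);
    linksByConnection.set(connectionId, link);
  }

  return { nodes: nodes, links: Array.from(linksByConnection.values()) };
}

const graph = buildGraph(sampleRows);
console.log(JSON.stringify(graph));
```

Because links are looked up by connection id, this version does not depend on source and target rows arriving as adjacent pairs.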

Resources