Since we can structure a MongoDB document any way we want, we could do it this way:
{
  products: [
    { date: "2010-09-08", data: { pageviews: 23, timeOnPage: 178 } },
    { date: "2010-09-09", data: { pageviews: 36, timeOnPage: 202 } }
  ],
  brands: [
    { date: "2010-09-08", data: { pageviews: 123, timeOnPage: 210 } },
    { date: "2010-09-09", data: { pageviews: 61, timeOnPage: 876 } }
  ]
}
So as we add data to it day after day, the products and brands documents will get bigger and bigger; after 3 years there will be about a thousand elements in each. Is that bad for MongoDB? Should we break it down further, into 4 documents like these:
{ type: 'products', date: "2010-09-08", data: { pageviews: 23, timeOnPage: 178 }}
{ type: 'products', date: "2010-09-09", data: { pageviews: 36, timeOnPage: 202 }}
{ type: 'brands', date: "2010-09-08", data: { pageviews: 123, timeOnPage: 210 }}
{ type: 'brands', date: "2010-09-09", data: { pageviews: 61, timeOnPage: 876 }}
So that after 3 years, there will be just 2000 "documents"?
Assuming you're using Mongoid (you tagged it), you wouldn't want to use your first schema idea. It would be very inefficient for Mongoid to pull out those huge documents each time you wanted to look up a single little value.
What would probably be a much better model for you is:
class Log
  include Mongoid::Document

  field :type
  field :date
  field :pageviews, :type => Integer
  field :time_on_page, :type => Integer
end
This would give you documents that look like:
{_id: ..., date: '2010-09-08', type: 'products', pageviews: 23, time_on_page: 178}
Don't worry about the number of documents - Mongo can handle billions of these. And you can index on type and date to easily find whatever figures you want.
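If you do add that index, the declaration might look roughly like this (abbreviated model shown again for context; the exact syntax depends on your Mongoid version, this is the newer hash form, and the index is built via Mongoid's rake task rather than automatically):
class Log
  include Mongoid::Document

  field :type
  field :date

  # Compound index so lookups filtering on type and date don't have to
  # scan the whole collection. Build it with: rake db:mongoid:create_indexes
  index({ type: 1, date: 1 })
end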
Furthermore, this way it's a lot easier to update the records through the driver, without even pulling the record from the database. For example, on each pageview you could do something like:
Log.collection.update({'type' => 'products', 'date' => '2010-09-08'}, {'$inc' => {'pageviews' => 1}})
I'm not a MongoDB expert, but 1000 isn't "huge". Also I would seriously doubt any difference between 1 top-level document containing 4000 total subelements, and 4 top-level documents each containing 1000 subelements -- one of those six-of-one vs. half-dozen-of-another issues.
Now if you were talking about 1 document with 1,000,000 elements vs. 1000 documents each with 1000 elements, that's a different order of magnitude, and there might be advantages of one over the other in storage time, query time, or both.
You have talked about how you are going to update the data, but how do you plan to query it? That probably makes a difference to how you should structure your docs.
The problem with using embedded elements in arrays is that each time you append, the document may no longer fit in the space currently allocated for it. That forces the (now larger) document to be reallocated and moved, and the move requires rewriting any indexes that point to it.
I would generally lean toward the second form you suggested, but it depends on the questions above.
Note: the 4 MB document limit is arbitrary and will be raised soon; in fact, you can recompile the server with any limit you want.
It seems your design closely resembles a relational table schema.
Every document added will be a separate entry in the collection, with its own identifier. Although a MongoDB document is limited to 4 MB, that's usually plenty for plain-text data, and you don't have to worry about the number of documents growing; that's the essence of a document database.
The only thing you need to worry about is the size of the collection: it's limited to about 2 GB on 32-bit systems, because MongoDB uses memory-mapped files and they're tied to the available memory addressing. This is not a problem on 64-bit systems.
Hope this helps
Again, this depends on how you plan to query the data. If you really only care about a single item, such as products per day:
{ type: 'products', date: "2010-09-08", data: { pageviews: 23, timeOnPage: 178 }}
then you could also include multiple days in one document per type, keyed by date:
{ type: 'products', "2010-09-08": { data: { pageviews: 23, timeOnPage: 178 } } }
We use something like this:
{ type: 'products', "2010": { "09": { "08": { data: { pageviews: 23, timeOnPage: 178 } } } } }
So we can increment by day: { "$inc" : { "2010.09.08.data.pageviews" : 1 } }
It may seem complicated, but the advantage is that you can store all the data about a 'type' in one record, so a single read gives you all the information.
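For reference, a minimal sketch of driving that increment from Ruby, along the same lines as the driver call in the earlier answer (the Log model name and the driver-level update call are reused purely for illustration):
# Bump the pageview counter for 2010-09-08 on the 'products' record,
# using the dotted-path form of $inc against the nested year/month/day keys.
Log.collection.update(
  { 'type' => 'products' },
  { '$inc' => { '2010.09.08.data.pageviews' => 1 } }
)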
I have an incoming CSV that I am trying to compare with an existing collection of mongo documents (Note objects) to determine additions, deletions, and updates. The incoming CSV and mongo collection are quite large at around 500K records each.
ex. csv_data
[
  { id: 1, text: "zzz" },
  { id: 2, text: "bbb" },
  { id: 4, text: "ddd" },
  { id: 5, text: "eee" }
]
Mongo collection of Note objects:
[
  { id: 1, text: "aaa" },
  { id: 2, text: "bbb" },
  { id: 3, text: "ccc" },
  { id: 4, text: "ddd" }
]
As a result I would want to get:
an array of additions
[{ id: 5, text: "eee" }]
an array of removals
[{ id: 3, text: "ccc" }]
an array of updates
[{ id: 1, text: "zzz" }]
I tried using select statements to filter for each particular difference, but it either fails or takes hours when run against the real data set with all 500K records.
additions = csv_data.select{|record| !Note.where(id: record[:id]).exists?}
deletions = Note.all.select{|note| !csv_data.any?{|row| row[:id] == note.id}}
updates = csv_data.select do |record|
  note = Note.where(id: record[:id])
  note.exists? && note.first.text != record[:text]
end
How would I better optimize this?
Assumption: the CSV file is a snapshot of the data in the database taken at some other time, and you want a diff.
In order to get the answers you want, you need to read every record in the DB. Right now you are effectively doing that three times, once to obtain each statistic, which is roughly 1.5 million DB calls, and possibly more if there are significantly more notes in the DB than there are in the file. I'd follow these steps:
Read the CSV data into a hash keyed by ID
Read each record in the database, and for each record:
If the DB ID is found in the CSV hash, remove it from the hash; if the text differs, add it to the updates
If the DB ID isn't found in the CSV hash, add it to the deletes
When you reach the end of the DB, anything still left in the CSV hash must therefore be an addition
While it's still not super-slick, at least you only get to do the database I/O once instead of three times...
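A minimal sketch of that approach, assuming csv_data is the array of hashes from the question and Note is the Mongoid model (the variable names are illustrative):
# Index the CSV rows by id so lookups are O(1) instead of scanning the array.
csv_by_id = csv_data.each_with_object({}) { |row, h| h[row[:id]] = row }

deletions = []
updates   = []

# Single pass over the collection; each record is classified as we go.
Note.all.each do |note|
  row = csv_by_id.delete(note.id)
  if row.nil?
    deletions << note             # in the DB but not in the CSV
  elsif row[:text] != note.text
    updates << row                # present in both, but the text changed
  end
end

# Anything still left in the hash never matched a DB record, so it's new.
additions = csv_by_id.values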
I've read a lot of posts about finding the highest-valued objects in arrays using max and max_by, but my situation is another level deeper, and I can't find any references on how to do it.
I have an experimental Rails app in which I am attempting to convert a legacy .NET/SQL application. The (simplified) model looks like Overlay -> Calibration <- Parameter. In a single data set, I will have, say, 20K Calibrations, but about 3,000-4,000 of these are versioned duplicates by Parameter name, and I need only the highest-versioned Parameter by each name. Further complicating matters is that the version lives on the Overlay. (I know this seems crazy, but this models our reality.)
In pure SQL, we add the following to a query to create a virtual table:
n = ROW_NUMBER() OVER (PARTITION BY Parameters.Designation ORDER BY Overlays.Version DESC)
And then select the entries where n = 1.
I can order the array like this:
ordered_calibrations = mainline_calibrations.sort do |e, f|
  [f.parameter.Designation, f.overlay.Version] <=> [e.parameter.Designation, e.overlay.Version] || 1
end
I get this kind of result:
C_SCR_trc_NH3SensCln_SCRT1_Thd 160
C_SCR_trc_NH3SensCln_SCRT1_Thd 87
C_SCR_trc_NH3Sen_DewPtHiThd_Tbl 310
C_SCR_trc_NH3Sen_DewPtHiThd_Tbl 160
C_SCR_trc_NH3Sen_DewPtHiThd_Tbl 87
So I'm wondering if there is a way, using Ruby's Enumerable built-in methods, to loop over the sorted array, and only return the highest-versioned elements per name. HUGE bonus points if I could feed an integer to this method's block, and only return the highest-versioned elements UP TO that version number ("160" would return just the second and fourth entries, above).
The alternative to this is that I could somehow implement the ROW_NUMBER() OVER in ActiveRecord, but that seems much more difficult to try. And, of course, I could write code to deal with this, but I'm quite certain it would be orders of magnitude slower than figuring out the right Enumerable function, if it exists.
(Also, to be clear, it's trivial to do .find_by_sql() and create the same result set as in the legacy application -- it's even fast -- but I'm trying to drag all the related objects along for the ride, which you really can't do with that method.)
I'm not convinced that doing this in the database isn't a better option, but since I'm unfamiliar with SQL Server I'll give you a Ruby answer.
I'm assuming that when you say "Parameter name" you're talking about the Parameters.Designation column, since that's the one in your examples.
One straightforward way you can do this is with Enumerable#slice_when, which is available in Ruby 2.2+. slice_when is good when you want to slice an array "between" values that are different in some way. For example:
[ { id: 1, name: "foo" }, { id: 2, name: "foo" }, { id: 3, name: "bar" } ]
.slice_when {|a,b| a[:name] != b[:name] }
# => [ [ { id: 1, name: "foo" }, { id: 2, name: "foo" } ],
# [ { id: 3, name: "bar" } ]
# ]
You've already sorted your collection, so to slice it you just need to do this:
calibrations_by_designation = ordered_calibrations.slice_when do |a, b|
  a.parameter.Designation != b.parameter.Designation
end
Now calibrations_by_designation is an array of arrays, each of which is sorted from greatest Overlay.Version to least. The final step, then, is to get the first element in each of those arrays:
highest_version_calibrations = calibrations_by_designation.map(&:first)
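For the "up to a version" bonus, one possible reading (treating "up to" as Version <= the cap) is to drop the higher versions before slicing; since each group is already sorted by descending version, first still picks the highest remaining one. A hedged sketch:
# Keep only calibrations at or below the version cap, then group by
# designation and take the first (i.e. highest remaining) of each group.
def highest_calibrations_up_to(ordered_calibrations, max_version)
  ordered_calibrations
    .reject { |c| c.overlay.Version > max_version }
    .slice_when { |a, b| a.parameter.Designation != b.parameter.Designation }
    .map(&:first)
end

highest_calibrations_up_to(ordered_calibrations, 160)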
So far I have found that we can iterate on arrays using {from: x, to: y}. Is there a way to iterate on a map?
For example, I have the following map:
companyMap: {
  61: { name: 'Apple' },
  66: { name: 'Microsoft' },
  70: { name: 'Uber' }
}
Is there a way to iterate on this map? Or at least get all the keys?
To iterate over a map, you need to first establish a practical (not theoretical) maximum for the number of keys you're going to have.
You can't make a call for an unbounded amount of data in Falcor, by design. If there is no practical maximum, it may be best to reconsider how you page through the data in the first place.
For example, if you set the practical maximum key at 70, you'll need to make the following request:
this.model.get(`companyMap[0..70]['name']`);
For those keys that do not exist in the dataset, there will be nothing returned.
You can ask for an arbitrary number of keys. For example the following path set:
["companyMap", [61, 66, 70], "name"]
returns the names from the 3 companies.
In my project I need to aggregate a lot of data into one string and later parse it back out.
The data is about people: it needs to record people_ids by state and age group, along with their counts.
For example, we have 5 people named John Smith in CA, 2 people between 20-29, 2 between 30-39, 1 between 40-49; 2 people named John Smith in NY, 1 between 20-29 and 1 between 30-39. Then the string will be somewhat like this,
John smith| [CA#5: 20-29#2{pid_1, pid_2};30-39#2{pid_3,pid_4};40-49#1{pid_5}] [NY#2: 20-29#1{pid_6};30-39#1{pid_7}]
It doesn't have to be exactly this format; any format that is easy to parse out will do. Is there a good way to do this? How about JSON?
And if it does look like the format above, and I want all the John Smiths in CA between ages 30-39, how should I parse out the data?
Thanks a lot!!
From my understanding of your post, this might be a format you're looking for (as represented in JSON).
Keep in mind that there are gems that can generate and parse JSON for you.
{
  "name": "John Smith",
  "states": {
    "CA": {
      "total": 5,
      "ages": {
        "20-29": ["pid_1", "pid_2"],
        "30-39": ["pid_3", "pid_4"],
        "40-49": ["pid_5"]
      }
    },
    "NY": {
      "total": 2,
      "ages": {
        "20-29": ["pid_6"],
        "30-39": ["pid_7"]
      }
    }
  }
}
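A hedged sketch of reading that structure back with Ruby's built-in json gem (trimmed to the CA branch for brevity; requires Ruby 2.3+ for dig and the squiggly heredoc):
require 'json'

json = <<~DOC
  {
    "name": "John Smith",
    "states": {
      "CA": {
        "total": 5,
        "ages": {
          "20-29": ["pid_1", "pid_2"],
          "30-39": ["pid_3", "pid_4"],
          "40-49": ["pid_5"]
        }
      }
    }
  }
DOC

record = JSON.parse(json)

# All John Smiths in CA between 30 and 39:
record.dig('states', 'CA', 'ages', '30-39')
# => ["pid_3", "pid_4"]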
I'm making a new web app using Rails, and was wondering, what's the difference between string and text? And when should each be used?
The difference lies in how the symbol is converted into its respective column type by the database adapter.
With MySQL, :string is mapped to VARCHAR(255).
https://edgeguides.rubyonrails.org/active_record_migrations.html
:string | VARCHAR                                  | :limit => 1 to 255 (default = 255)
:text   | TINYTEXT, TEXT, MEDIUMTEXT, or LONGTEXT  | :limit => 1 to 4294967296 (default = 65536)
Reference:
https://hub.packtpub.com/working-rails-activerecord-migrations-models-scaffolding-and-database-completion/
When should each be used?
As a general rule of thumb, use :string for short text input (username, email, password, titles, etc.) and use :text for longer expected input such as descriptions, comment content, etc.
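For example, a migration following that rule of thumb might look roughly like this (table and column names are illustrative; adjust the migration version to your Rails release):
class CreatePosts < ActiveRecord::Migration[5.2]
  def change
    create_table :posts do |t|
      t.string :title        # short input -> VARCHAR(255) on MySQL
      t.string :author_email
      t.text   :body         # long-form input -> TEXT (or the adapter's equivalent)

      t.timestamps
    end
  end
end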
If you are using Postgres, use text wherever you can, unless you have a size constraint, since there is no performance penalty for text vs varchar:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
(PostgreSQL manual)
String translates to "varchar" in your database, while text translates to "text". A varchar can hold far fewer characters; a text column can be of (almost) any length.
For an in-depth analysis with good references check http://www.pythian.com/news/7129/text-vs-varchar/
Edit: Some database engines can load a varchar in one go, but store text (and blob) outside of the table. A SELECT name, amount FROM products could be a lot slower when using text for name than when you use varchar. And since Rails by default loads records with SELECT * FROM..., your text columns will be loaded too. This will probably never be a real problem in your app or mine (premature optimization is ...), but it's good to know that text is not always "free".
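If that ever does matter, one way to sidestep it (assuming a products table as in the example above) is to ask ActiveRecord for only the columns you need instead of relying on SELECT *:
# Loads only name and amount, so a large text/blob column on products
# is never read for this query.
Product.select(:name, :amount).each do |product|
  puts "#{product.name}: #{product.amount}"
end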
Use string if the size is fixed and small, and text if it is variable and long.
This matters because text columns allow far more characters than string columns, and take correspondingly more storage.
So for small fields always use string (varchar): fields like first_name, login, email, subject (of an article or post).
Examples of text fields: the content/body of a post or article, fields for paragraphs, etc.
String: size 1 to 255 (default = 255)
Text: size 1 to 4294967296 (default = 65536)
As explained above, it's not just the DB data type; it also affects the view that gets generated if you are scaffolding: string will generate a text_field, text will generate a text_area.
Use string for shorter fields, like names, addresses, phone numbers, company names.
Use text for larger content: comments, post bodies, paragraphs.
My general rule: if it's something that is more than one line, I typically go for text; if it's a short 2-6 words, I go for string.
The official limit is 255 characters for a string, so if your value can exceed 255 characters, go for text.
The accepted answer is awesome; it properly explains the difference between string and text (mostly the size limit in the database, but there are a few other gotchas). I just wanted to point out a small issue that caught me, since that answer didn't completely work for me.
The max size :limit => 1 to 4294967296 didn't work exactly as written; I needed to subtract 1 from that max size. I'm storing large JSON blobs and they can be crazy huge sometimes.
Here's my migration with the larger value in place, the value MySQL doesn't complain about.
Note the 5 at the end of the limit instead of 6.
class ChangeUserSyncRecordDetailsToText < ActiveRecord::Migration[5.1]
  def up
    change_column :user_sync_records, :details, :text, :limit => 4294967295
  end

  def down
    change_column :user_sync_records, :details, :string, :limit => 1000
  end
end
If you are using Oracle, string will be created as a VARCHAR2(255) column and text as a CLOB.
NATIVE_DATABASE_TYPES = {
  primary_key: "NUMBER(38) NOT NULL PRIMARY KEY",
  string: { name: "VARCHAR2", limit: 255 },
  text: { name: "CLOB" },
  ntext: { name: "NCLOB" },
  integer: { name: "NUMBER", limit: 38 },
  float: { name: "BINARY_FLOAT" },
  decimal: { name: "DECIMAL" },
  datetime: { name: "TIMESTAMP" },
  timestamp: { name: "TIMESTAMP" },
  timestamptz: { name: "TIMESTAMP WITH TIME ZONE" },
  timestampltz: { name: "TIMESTAMP WITH LOCAL TIME ZONE" },
  time: { name: "TIMESTAMP" },
  date: { name: "DATE" },
  binary: { name: "BLOB" },
  boolean: { name: "NUMBER", limit: 1 },
  raw: { name: "RAW", limit: 2000 },
  bigint: { name: "NUMBER", limit: 19 }
}
https://github.com/rsim/oracle-enhanced/blob/master/lib/active_record/connection_adapters/oracle_enhanced_adapter.rb
If the attribute matches f.text_field in the form, use string; if it matches f.text_area, use text.
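In a scaffolded form that looks roughly like this (model and attribute names are only illustrative):
<%# A string column (e.g. title) scaffolds to a single-line text_field,
    while a text column (e.g. body) scaffolds to a multi-line text_area. %>
<%= form_for(@article) do |f| %>
  <%= f.text_field :title %>
  <%= f.text_area  :body %>
  <%= f.submit %>
<% end %>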