I am searching for a way, using the JavaScript API for Cosmos DB, to perform an INNER/OUTER JOIN between two document collections.
I have been unsuccessful.
From my understanding, Javascript Stored Procedures run within a collection, and cannot access/reference data in another collection.
If the above is true, where does this leave our application's data source, which has been designed in a relational way? Suppose the business requires an immediate query to collect the following data:
All agreements/contracts that have been migrated to a new product offering, within a specific region, for a given time frame. How would I go about this query if there are about 5 collections containing all the information related to it?
Any guidance?
UPDATE
Customer
{
"id": "d02e6668-ce24-455d-b241-32835bb2dcb5",
"Name": "Test User One",
"Surname": "Test"
}
Agreement
{
"id": "ee1094bd-16f4-45ec-9f5e-7ecd91d4e729",
"CustomerId": "d02e6668-ce24-455d-b241-32835bb2dcb5"
"RetailProductVersionInstance":
[
{
"id": "8ce31e7c-7b1a-4221-89a3-449ae4fd6622",
"RetailProductVersionId": "ce7a44a4-7e49-434b-8a51-840599fbbfbb",
"AgreementInstanceUser": {
"FirstName": "Luke",
"LastName": "Pothier",
"AgreementUserTypeId": ""
},
"AgreementInstanceMSISDN": {
"IsoCountryDialingCode": null,
"PhoneNumber": "0839263922",
"NetworkOperatorId": "30303728-9983-47f9-a494-1de853d66254"
},
"RetailProductVersionInstanceState": "IN USE",
"IsPrimaryRetailProduct": true,
"RetailProductVersionInstancePhysicalItems": [
{
"id": "f8090aba-f06b-4233-9f9e-eb2567a20afe",
"PhysicalItemId": "75f64ab3-81d2-f600-6acb-d37da216846f",
"RetailProductVersionInstancePhysicalItemNumbers": [
{
"id": "9905058b-8369-4a64-b9a5-e17e28750fba",
"PhysicalItemNumberTypeId": "39226b5a-429b-4634-bbce-2213974e5bab",
"PhysicalItemNumberValue": "KJDS959405"
},
{
"id": "1fe09dd2-fb8a-49b3-99e6-8c51df10adb1",
"PhysicalItemNumberTypeId": "960a1750-64be-4333-9a7f-c8da419d670a",
"PhysicalItemNumberValue": "DJDJ94943"
}
],
"RetailProductVersionInstancePhysicalItemState": "IN USE",
"DateCreatedUtc": "2018-11-21T13:55:00Z",
"DateUpdatedUtc": "2020-11-21T13:55:00Z"
}
]
}
]
}
RetailProduct
{
"id": "ce7a44a4-7e49-434b-8a51-840599fbbfbb",
"FriendlyName": "Data-Package 100GB",
"WholeSaleProductId": "d054dae5-173d-478b-bb0e-7516e6a24476"
}
WholeSaleProduct
{
"id": "d054dae5-173d-478b-bb0e-7516e6a24476",
"ProductName": "Data 100",
"ProviderLiabilities": []
}
Above, I have added some sample documents.
Relationships:
Agreement.CustomerId links to Customer.id
Agreement.RetailProductVersionInstance.RetailProductVersionId links to RetailProduct.id
RetailProduct.WholeSaleProductId links to WholeSaleProduct.id
How would I write a JavaScript stored procedure in Cosmos DB to perform joins between these 4 collections?
Short answer is that you cannot perform joins between different collections via SQL in Cosmos DB.
Generally, the solution to this type of question is multiple queries or a different schema. In your scenario, if you can denormalize your schema into one collection without duplicating data, then it is easy.
If you provide your schemas, it'd be possible to provide a more comprehensive answer.
-- Edit 1 --
Stored procedures are only good candidates for operations that require multiple operations on the same collection + partition key. This makes them good for bulk insert/delete/update, transactions (which need at least a read and a write), and a few other things. They aren't a good fit for CPU-intensive work, but rather for work that would normally be IO-bound by network latency. They cannot be used for cross-partition or cross-collection scenarios; in those cases, you must perform the operations from the remote client.
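For illustration, here is a minimal sketch (the procedure name and document shape are made up) of the kind of work a stored procedure is suited for: a transactional bulk insert where every document shares the partition key the procedure is invoked against.

function bulkInsert(docs) {
    // Server-side JavaScript: everything here runs inside one transaction,
    // scoped to a single collection + partition key.
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var inserted = 0;

    function insertNext() {
        if (inserted >= docs.length) {
            response.setBody(inserted);
            return;
        }
        var accepted = collection.createDocument(
            collection.getSelfLink(),
            docs[inserted],
            function (err) {
                if (err) throw err; // throwing aborts and rolls back the whole batch
                inserted++;
                insertNext();
            }
        );
        // If the server stops accepting work, report progress so the client
        // can re-invoke the procedure with the remaining documents.
        if (!accepted) response.setBody(inserted);
    }

    insertNext();
}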
In your case, it's a fairly straightforward 2 + 2N separate reads, where N is the number of products. You need to read the agreement first. Then you can look up the customer and the product records in parallel, and finally the wholesale records, so you should see a latency of about 3s + C, where s is the average duration of a given read request and C is some constant CPU time to perform the join/issue the requests/etc.
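As a rough sketch of that pattern, assuming the @azure/cosmos JavaScript SDK and that each container is partitioned by id (the database and container names here are assumptions), the client-side "join" looks something like this:

const { CosmosClient } = require("@azure/cosmos");

async function loadAgreementGraph(client, agreementId) {
    const db = client.database("mydb"); // database name is an assumption

    // Read 1: the agreement itself.
    const { resource: agreement } = await db
        .container("Agreement")
        .item(agreementId, agreementId)
        .read();

    // Next round, issued in parallel: the customer plus each retail product.
    const [customer, ...products] = await Promise.all([
        db.container("Customer").item(agreement.CustomerId, agreement.CustomerId).read(),
        ...agreement.RetailProductVersionInstance.map((i) =>
            db.container("RetailProduct")
                .item(i.RetailProductVersionId, i.RetailProductVersionId)
                .read()
        ),
    ]);

    // Last round, also in parallel: the wholesale product for each retail product.
    const wholesale = await Promise.all(
        products.map((p) =>
            db.container("WholeSaleProduct")
                .item(p.resource.WholeSaleProductId, p.resource.WholeSaleProductId)
                .read()
        )
    );

    // Stitch the "join" together in application code.
    return {
        agreement,
        customer: customer.resource,
        products: products.map((p, idx) => ({
            ...p.resource,
            wholesale: wholesale[idx].resource,
        })),
    };
}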
It's worth considering whether you can consolidate RetailProduct and WholesaleProduct into a single record, where the wholesale document contains all of its RetailProducts in an array, or store them as separate documents partitioned by the wholesale id, with a well-known id for a separate document holding the wholesale product info. That would reduce your latency by a third. If you go with partitioning by wholesale id, you could write one query for any records that share a wholesale id, so you'd get 2 + log(N) reads, but the same effective latency. For that strategy, you'd store a composite "wholesaleid+productid" key in the agreement. One issue to be aware of is that it duplicates the wholesale+product relationship, but as long as that relationship doesn't change, I don't think there is anything to worry about, and it provides a good optimization for lookups.
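A sketch of that "partition by wholesale id" variant, where one single-partition query returns every product document (and the well-known info document) sharing a wholesale id; the container and property names are assumptions:

async function loadByWholesaleId(db, wholesaleId) {
    // Assumes a "Products" container whose partition key is /WholesaleProductId.
    const { resources } = await db
        .container("Products")
        .items.query(
            {
                query: "SELECT * FROM p WHERE p.WholesaleProductId = @wid",
                parameters: [{ name: "@wid", value: wholesaleId }],
            },
            { partitionKey: wholesaleId } // stays on a single partition
        )
        .fetchAll();
    return resources;
}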
Related
I am trying to adapt my normalized datamodel, to a Firebase-friendly datamodel.
My application primarily runs 'summary queries' on the normalised tables e.g. things analogous to:
'SELECT ctryid, avg(age) FROM users GROUP BY ctryid'
Having an integer ctryid speeds things up tremendously, and in my mind that's because it has to compare an integer ctryid instead of strings (USA, FRA, ITA, ...).
Now I learned that Firebase generates keys like 'Xj34Fhe2sP0'. Would that indeed imply less efficiency as compared to my SQL queries?
What would such a query look like in Firebase? I do not wish to denormalize any calculated results.
Edit: Denormalizing to avoid costly joins would also imply including ctryname in the users object, right?
Thanks a lot.
Firebase doesn't support group-by clauses in its queries, nor any other aggregation operations. I don't think the keys that it generates are very important though.
What I often recommend is that you model your database to reflect what you show on the screens of your app. Your SQL query seems to deliver a list of country IDs with the average age of the users in each country.
If that's what you want to show, consider storing exactly that data in Firebase:
averageAgeByCountryId: {
"NL": 43.3,
"US": 38.1
}
Now, to write this data you'll need to update the existing average each time you add a new user to a country. To allow that, you'll probably want to instead store the total number of users in each country and the sum of their ages:
averageAgeByCountryId: {
"NL": { userCount: 5, sumOfAge: 217 },
"US": { userCount: 10, sumOfAge: 381 }
}
Now you can still easily calculate the average age, but in this format it is also easier to update the average ages as you add users.
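Here is a minimal sketch of keeping that aggregate in sync, using the Realtime Database JavaScript SDK (v8-style namespaced API); the ref paths and the user shape are assumptions:

const firebase = require("firebase/app");
require("firebase/database");
// Assumes firebase.initializeApp(config) has been called elsewhere.

function addUser(userId, user) {
    const db = firebase.database();

    // Write the user itself.
    const userWrite = db.ref("users/" + userId).set(user);

    // Update the per-country aggregate inside a transaction so concurrent
    // writers don't overwrite each other's counts.
    const aggregateWrite = db
        .ref("averageAgeByCountryId/" + user.ctryid)
        .transaction((current) => {
            const agg = current || { userCount: 0, sumOfAge: 0 };
            return {
                userCount: agg.userCount + 1,
                sumOfAge: agg.sumOfAge + user.age,
            };
        });

    return Promise.all([userWrite, aggregateWrite]);
}

// Reading an average back is then a single fetch:
// const snap = await firebase.database().ref("averageAgeByCountryId/NL").once("value");
// const { userCount, sumOfAge } = snap.val();
// const averageAge = sumOfAge / userCount;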
I'm having trouble doing an otherwise SQL valid self-join query on documentdb.
So the following query works:
SELECT * FROM c AS c1 WHERE c1.obj="car"
But this simple self join query does not: SELECT c1.url FROM c AS c1 JOIN c AS c2 WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url, with the error, Identifier 'c' could not be resolved.
It seems that DocumentDB supports self-joins within a document, but I'm asking at the collection level.
I looked at the official syntax doc and understand that the collection name is basically inferred; I tried explicitly changing c to my collection name, and to root, but neither worked.
Am I missing something obvious? Thanks!
A few things to clarify:
1.) Regarding Identifier 'c' could not be resolved
Queries are scoped to a single collection; and in the example above, c is an implicit alias for the collection which is being re-aliased to c1 with the AS keyword.
You can fix the example query by changing the JOIN to reference c1:
SELECT c1.url
FROM c AS c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
This is also equivalent to:
SELECT c1.url
FROM c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
2.) Understanding JOINs and examining your data model
With that said, I don't think fixing the query syntax issue above will produce the behavior you are expecting. The JOIN keyword in DocumentDB SQL is designed for forming a cross product with a denormalized array of elements within a document (as opposed to forming cross products across other documents in the same collection). If you run into trouble here, it may be worth taking a step back and revisiting how to model your data for Azure Cosmos DB.
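For illustration, here is a short sketch of the intra-document cross product that JOIN is meant for, run through the JavaScript SDK; the tags array, property names, and query are hypothetical, not taken from your data:

// `container` is an @azure/cosmos Container instance.
async function findByTag(container) {
    // JOIN fans out over the `tags` array *inside* each document;
    // it never reaches across other documents or containers.
    const querySpec = {
        query:
            "SELECT c.id, t.name " +
            "FROM c JOIN t IN c.tags " +
            "WHERE t.name = @tag",
        parameters: [{ name: "@tag", value: "car" }],
    };
    const { resources } = await container.items.query(querySpec).fetchAll();
    return resources;
}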
In a RDBMS, you are trained to think entity-first and normalize your data model based on entities. You rely heavily on a query engine to optimize queries to fit your workload (which typically do a good, but not always optimal, job for retrieving data). The challenges here are that many relational benefits are lost as scale increases, and scaling out to multiple shards/partitions becomes a requirement.
For a scale-out distributed database like Cosmos DB, you will want to start with understanding the workload first and optimize your data model to fit the workload (as opposed to thinking entity first). You'll want to keep in mind that collections are merely a logical abstraction composed of many replicas that live within partition sets. They do not enforce schema and are the boundary for queries.
When designing your model, you will want to incorporate the following questions in to your thought process:
What is the scale, in terms of size and throughput, for the broader solution (an estimate of order of magnitude is sufficient)?
What is the ratio of reads vs writes?
For writes - what is the pattern for writes? Is it mostly inserts, or are there a lot of updates?
For reads - what do top N queries look like?
The above should influence your choice of partition key as well as what your data / object model should look like. For example:
The ratio of requests will help guide how you make tradeoffs (use Pareto principle and optimize for the bulk of your workload).
For read-heavy workloads, commonly filtered properties become candidates for choice of partition key.
Properties that tend to be updated together frequently should be abstracted together in the data model, and away from properties that get updated with a slower cadence (to lower the RU charge for updates).
Don't be afraid to duplicate properties across different record types to enrich queryability, and to annotate each record with its type. For example, suppose we have two types of documents: cat and person.
{
"id": "Andrew",
"type": "Person",
"familyId": "Liu",
"employer": "Microsoft"
}
{
"id": "Ralph",
"type": "Cat",
"familyId": "Liu",
"fur": {
"length": "short",
"color": "brown"
}
}
We can query both types of documents without needing a JOIN simply by running a query without a filter on type:
SELECT * FROM c WHERE c.familyId = "Liu"
And if we wanted to filter on type = “Person”, we can simply add a filter on type to our query:
SELECT * FROM c WHERE c.familyId = "Liu" AND c.type = "Person"
The answer above, by Andrew Liu, has the corrected queries. That will resolve your error, but note that Azure Cosmos DB does not support cross-item and cross-container joins. Read about joins here: https://learn.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-join
I have two entities.
The first one:
{
"ID": 5777,
"Name": "",
"EventID": 18341,
"ParentID": 19702,
"LastModifiedDate": "2016-11-30 09:36:04",
"EntityType": 3,
"InstanceID": 916787
}
The second one:
{
"ID": 19702,
"Name": "Google",
"EventID": 18341,
"ParentID": 0,
"LastModifiedDate": "2016-12-01 06:20:49",
"EntityType": 0,
"FileAttribute": "",
"InstanceID": 0,
"IsFile": false,
"ResourceURL": "http://www.google.com",
"Taxonomies": "0",
"ViewCount": 2
}
Now I need to fetch, using Core Data, the record from the second entity whose ID matches the ParentID of the first one.
The MySQL query would be: SELECT * FROM `tbl_two` WHERE `ID` IN (SELECT `ParentID` FROM `tbl_ONE` WHERE `InstanceID` = '916787' AND `EventID` = '18341')
The first step is to stop thinking about Core Data like it's SQL. It has a different API, and thinking in SQL terms will lead you to poor designs.
In the most general terms, don't think of records, think of objects that you might use in an app and how those objects relate to each other. Instead of using foreign keys to relate one record to another, use object properties to relate one object to another.
From what I can make of your SQL query, you want something like
Two Core Data entities called Resource (your first sample) and ResourceMapping (your second sample).
Resource has a property called something like mapping, of type ResourceMapping.
You relate one to another using the relationship, not by ID. You can store the IDs if you need to sync the data to a remote server but they're usually not useful in Core Data.
The equivalent of your SQL query would be, I think:
Fetch the single instance of Resource (the first example) using an NSPredicate that matches one or more of its properties.
When you have that instance, ask it for the value of its relationship mapping (or whatever you call it), and you'll get the related object from the other entity.
Beyond that, Apple provides extensive, detailed documentation with code samples that will help.
Is it efficient to store multiple metrics in a single series? There is support for multiple columns, but based on the 0.9 documentation it seems there is a preference for a single series per metric, with a single value column.
What I'm looking at is a way to store some related data (such as disk free, used, and total); having 3 separate series seems like a pain and would most certainly complicate queries that need to be made across the series.
Are there some general best practices for storing metrics such as these?
InfluxDB 0.9 will happily support up to 255 fields per series. The examples in the docs mostly have single field examples with a field key of "value" but there's nothing preventing you from having multiple fields. Since fields aren't indexed it should have no performance impact at all.
For example, here's a point with three field values:
{
"database": "mydb",
"points": [
{
"measurement": "disk",
"tags": {
"host": "server01",
"type": "SSD"
},
"time": "2009-11-10T23:00:00Z",
"fields": {
"free": 318465464,
"used": 682324110,
"total": 1000789574
}
}
]
}
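As a rough sketch (assuming an InfluxDB 0.9 instance at localhost:8086, a database named mydb, and a fetch-capable JavaScript runtime), the same point can also be written as line protocol and all three fields read back with a single query:

// The same point as line protocol (seconds precision).
const point =
    "disk,host=server01,type=SSD " +
    "free=318465464,used=682324110,total=1000789574 1257894000";

async function writeAndQuery() {
    await fetch("http://localhost:8086/write?db=mydb&precision=s", {
        method: "POST",
        body: point,
    });

    const q = "SELECT free, used, total FROM disk WHERE host = 'server01'";
    const res = await fetch(
        "http://localhost:8086/query?db=mydb&q=" + encodeURIComponent(q)
    );
    return res.json(); // all three fields come back from the single series
}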
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused about what use MapReduce is if you already have all the keys (you would have had to search for them somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct: MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case.
Use more than one bucket. Instead of just a Sales bucket, you could break the data down by product, region, or month, so it is already split by one of your common reporting criteria.
Consider adding a secondary index to each document for each field. Your month query could then be a range query on the created_at index.
If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
You may also consider breaking each document into a series of keys. Instead of storing just an id key with a JSON document for a value, store a key for each field, like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from disk and held in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their solr-like search, but there are probably ways to restructure your data that it might be able to handle. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
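For illustration, here is a rough sketch of that secondary-index-scoped approach over Riak's HTTP interface: the job takes a created_at_int index range as its input and sums price per product_key for that range. The index name, host, and phase functions are assumptions, not your actual setup; run one such job per month (or key the map output by month + product) to build the full report.

const job = {
    inputs: {
        bucket: "sales",
        index: "created_at_int", // assumes this 2i was written with each object
        start: 1364774400,       // e.g. 2013-04-01
        end: 1367366399          // e.g. 2013-04-30
    },
    query: [
        {
            map: {
                language: "javascript",
                source: "function (v) {" +
                        "  var doc = JSON.parse(v.values[0].data);" +
                        "  var out = {};" +
                        "  out[doc.product_key] = parseFloat(doc.price);" +
                        "  return [out];" +
                        "}"
            }
        },
        {
            reduce: {
                language: "javascript",
                source: "function (values) {" +
                        "  var totals = {};" +
                        "  values.forEach(function (m) {" +
                        "    for (var k in m) { totals[k] = (totals[k] || 0) + m[k]; }" +
                        "  });" +
                        "  return [totals];" +
                        "}"
            }
        }
    ]
};

// POST the job to Riak's MapReduce endpoint (assumes localhost:8098).
fetch("http://localhost:8098/mapred", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(job)
})
    .then((res) => res.json())
    .then(console.log); // e.g. [{ "cyber-pet-toy": 1234.00, ... }]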