RavenDB - How to Use in Production with Many Inserts? - asp.net-mvc

After hearing about NoSQL for a couple years I finally started playing with RavenDB today in an .Net MVC app (simple blog). Getting the embedded database up and running was pretty quick and painless.
However, I've found that after inserting objects into the document store, they are not always there when the subsequent page refreshes. When I refresh the page they do show up. I read somewhere that this is due to stale indexes.
My question is, how are you supposed to use this in production on a site with inserts happening all the time (example: e-commerce). Isn't this always going to result in stale indexes and unreliable query results?

Think of what actually happens with a traditional database like SQL Server.
When an item is created, updated, or deleted from a table, any indexes associated with table also have to be updated.
The more indexes you have on a table, the slower your write operations will be.
If you create a new index on an existing table, it isn't used at all until it is fully built. If no other index can answer a query, then a slow table scan occurs.
If others attempt to query from an existing index while it is being modified, the reader will block until the modification is complete, because of the requirement for Consistency being higher priority than Availability.
This can often lead to slow reads, timeouts, and deadlocks.
The NoSQL concept of "Eventual Consistency" is designed to alleviate these concerns. It is optimized reads by prioritizing Availability higher than Consistency. RavenDB is not unique in this regard, but it is somewhat special in that it still has the ability to be consistent. If you are retrieving single document, such as reviewing an order or an end user viewing their profile, these operations are ACID compliant, and are not affected by the "eventual consistency" design.
To understand "eventual consistency", think about a typical user looking at a list of products on your web site. At the same time, the sales staff of your company is modifying the catalog, adding new products, changing prices, etc. One could argue that it's probably not super important that the list be fully consistent with these changes. After all, a user visiting the site a couple of seconds earlier would have received data without the changes anyway. The most important thing is to deliver product results quickly. Blocking the query because a write was in progress would mean a slower response time to the customer, and thus a poorer experience on your web site, and perhaps a lost sale.
So, in RavenDB:
Writes occur against the document store.
Single Load operations go directly to the document store.
Queries occur against the index store
As documents are being written, data is being copied from the document store to the index store, for those indexes that are already defined.
At any time you query an index, you will get whatever is already in that index, regardless of the state of the copying that's going on in the background. This is why sometimes indexes are "stale".
If you query without specifying an index, and Raven needs a new index to answer your query, it will start building an index on the fly and return you some of those results right away. It only blocks long enough to give you one page of results. It then continues building the index in the background so next time you query you will have more data available.
So now lets give an example that shows the down side to this approach.
A sales person goes to a "products list" page that is sorted alphabetically.
On the first page, they see that "Apples" aren't currently being sold.
So they click "add product", and go to a new page where they enter "Apples".
They are then returned to the "products list" page and they still don't see any Apples because the index is stale. WTF - right?
Addressing this problem requires the understanding that not all viewers of data should be considered equal. That particular sales person might demand to see the newly added product, but a customer isn't going to know or care about it with the same level of urgency.
So on the "products list" page that the sales person is viewing, you might do something like:
var results = session.Query<Product>()
.Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
.OrderBy(x=> x.Name)
.Skip((pageNumber-1) * pageSize).Take(pageSize);
While on the customer's view of the catalog, you would not want to add that customization line.
If you wanted to get super precise, you could use a slightly more optimized strategy:
When going back from the "add product" page to the "list products" page, pass along the ProductID that was just added.
Just before you query on that page, if the ProductID was passed in then change your query code to:
var product = session.Load(productId);
var etag = session.Advanced.GetEtagFor(product);
var results = session.Query<Product>()
.Customize(x => x.WaitForNonStaleResultsAsOf(etag))
.OrderBy(x=> x.Name)
.Skip((pageNumber-1) * pageSize).Take(pageSize);
This will ensure that you only wait as long as absolutely necessary to get just that one product's changes included in the results list along with the other results from the index.
You could optimize this slightly by passing the etag back instead of the ProductId, but that might be less reusable from other places in your application.
But do keep in mind that if the list is sorted alphabetically, and we added "Plums" instead of "Apples", then you might not have seen these results instantly anyway. By the time the user had skipped to the page that includes that product, it would likely have been there already.

You are running into stale queries.
That is a by design part of RavenDB. You need to make distinction between queries (BASE) and loading by id (ACID).

Related

Using EFFECTIVE_TS and EXPIRATION_TS on FACT tables

I have a requirement to create a Fact table which stores granted_share_qty awarded to employees. There are surrounding Dimensions like SPS Grant_dim which stores info about each grant, SPS Plan Dim which stores info about the Plan, SPS Client Dim which stores info about the Employer and SPS Customer Dim which stores info about the customer. The DimKeys (Surrogate Key) and DurableKeys(Supernatural Keys) from each Dimension is added to the Fact.
Reporting need is "as-of" ie on any given date, one should be able to see the granted_share_qty as of that date (similar to account balance as of that date) along with point-in-time values of few attributes from the Grant,Plan, Client, Customer dimensions.
First, we thought of creating a daily snapshot table where the data is repeated everyday in the fact (unless source sends any changes). However since there could be more than 100 million grant records , repeating this everyday was almost impossible, moreover the granted_share_qty doesnt change that often so why copy this everyday?.
So instead of a daily snapshot we thought of adding an EFFECTIVE_DT and EXPIRATION_DT on the Fact table (like a TIMESPAN PERIODIC SNAPSHOT table if such a thing exists)
This reduces the volume and perfectly satisfies a reporting need like "get me the granted_qty and grant details,client, plan, customer details as of 10/01/2022 " will translate to "select granted_qty from fact where 10/01/2022 between EFFECTIVE_DT and EXPIRATION_DT and Fact.DimKeys=Dim.DimKeys"
The challenge however is to keep the Dim Keys of the Fact in sync with Dim Keys of the Dimensions. Even if the Fact doesn't change, any DimKey changes due to versioning in any of the Dimension need to be tracked and versioned in the Fact. This has become an implementation nightmare
(To worsen the things, the Dims could undergo multiple intraday changes , so these are to be tracked near-real-time :-( )
Any thoughts how to handle such situations will be highly appreciated (Database: Snowflake)
P:S: We could remove the DimKeys from the Fact and use DurableKeys+Date to join between the Facts and Type 2 Dims, but that proposal is not favored/approved as of now
Thanks
Sunil
First, we thought of creating a daily snapshot table where the data is repeated everyday in the fact (unless source sends any changes). However
Stop right there. Whenever you know the right model but think it's un-workable for some reason, try harder. At a minimum test your assumption that it would be "too much data", and consider not materializing the snapshot but leaving it as a view and computing it at query time.
... moreover the granted_share_qty doesnt change that often so why copy this everyday?.
And there's your answer. Use a monthly snapshot instead of a daily snapshot, and you've divided the data by 30.

Querying Firebase Firestore Data

I'm looking to build an app that functions like a dating app:
User A fetches All Users.
User A removes Users B, C, and D.
User A fetches All Users again - excluding Users B, C, and D.
My goal is to perform a query that does not read the User B, C, and D documents in my fetch query.
I've read into array-contains-any, array-contains, not-in queries, but the 10 item limit prevents me from using these as options because the "removed users list" will continue to grow.
2 workaround options I've mulled over are...
Performing a paginated fetch on All User documents and then filtering out on the client side?
Store all User IDs (A, B, C, D) on 1 document in an array field, fetch the 1 document, and then filter client side?
Any guidance would be extremely appreciated either on suggestions around how I store my data or specific queries I can perform.
You can do it the other way around.
Instead of a removed or ignored array at your current user, you have an array of ignoredBy or removedBy in which you add your current user.
And when you fetch the users from the users collection, you just have to check if the requesting user is part of the array ignoredBy. So you don’t have tons of entries to check in the array, it is always just one.
Firestore may get a little pricey with the Tinder model but you can certainly implement a very extensible architecture, well enough to scale to millions of users, without breaking a sweat. So the user queries a pool of people, and each person is represented by their own document, this much is obvious. The user must then take an action on each person/document, and, presumably, when an action is taken that person should no longer reappear in the user's queries. We obviously can't edit the queried documents because there could be millions of users and that wouldn't scale. And we shouldn't use arrays because documents have byte limits and that also wouldn't scale. So we have to treat a collection like an array, using documents as items, because collections have no known limit to how many documents they can contain.
So when the user takes an action on someone, consider creating a new document in a subcollection in the user's own document (user A, the one performing the query) that contains the person's uid, and perhaps a boolean to determine if they liked or disliked that person (i.e. liked: true), and maybe a timestamp for UI purposes. This new document is the item in your limitless array.
When the user later performs another query, those same users are going to reappear in the results, which you need to filter out. You have no choice but to check if each person's uid is in this subcollection. If it is, omit the document and move to the next. But if your UI is configured like Tinder's, where there isn't a list of people to scroll through but instead cards stacked on top of each other, this is no big deal. The user will only be presented with one person at a time and they won't know how many you're filtering out behind the scenes. With a paginated list, the user may see odd behavior like uneven pages. The drawback is that you're now paying double for each query. Each query will cost you the original fetch and the subcollection-check fetch. But, hey, with this model you can scale to millions of users without ever breaking a sweat.

What's the most efficient way to pull only the records created since last pull?

I am building a Shopify App, and I want my customer (the store owner) to 'sync' the app with their store data occasionally, specifically for their Order data (ignore the fact that I will also have a webhook that pushes this data to my app whenever a new order is created).
Right now, just for crude illustration purposes, I am doing this (which works in dev mode):
store_orders = ShopifyAPI::Order.find(:all)
store_orders.each do |sorder|
new_order = Order.find_or_create_by(s_order_id: sorder.id)
new_order.update(
currency: sorder.currency,
...
)
So, I am pulling all of the orders directly from the ShopifyAPI. Ideally, what I want to happen is to only pull the new orders that have been made that have not been synced.
I have two constraints:
The Order IDs produced by the ShopifyAPI are not in a sequential order, and can be relatively haphazard
I can't do a where query on the ShopifyAPI according to the updated_at date, i.e. to only select the records after my last synced date (the ShopifyAPI is not allowing this at the moment, I am not sure if this will be fixed in the future).
So that leaves me with two questions:
What's another, efficient way, for me to quickly find the records that have not been pulled and only pull those?
How do I make sure to only update the records locally that any attribute has been changed or no record exists before? (i.e. I am trying to avoid updating every record that has been pulled).
I am not sure where you figure the Shopify order ID is "haphazard" and not in "sequential" order? If you study these things, you will find that they are integers and in fact, they are in order in the sense that an order created after another will have a bigger ID.
So the coolest little thing you can do, really quite easy, is to use the Shopify API filter known as "since_id". You get the luxury of pulling only orders that hit Shopify SINCE the last pull, assuming you stored that last pulled ID in the since_id field you keep around, on the shop model you keep.
Try it. It works perfectly. I have been doing that for years. Just update the since_id in your DB once you're done processing a batch of orders, and then next time you want more, filter using your since_id.

Event Store DB : temporal queries

regarding to asked question here :
suppose that we have ProductCreated and ProductRenamed events which both contain the title of the product.now we want to query EventStoreDB for all events of type ProductCreated and ProductRenamed with the given title.i want all these events to check whether there is any product in the system which has been created or renamed to the given title, so that i could throw the exception of repetitive title in the domain
i am using MongoDB for creating UI reports from all the published events and everything is fine there.but for checking some invariants, like checking for unique values, i have to either query the event store for some events along with their criteria and by iterating over them, decide whether there is a product created with the same title which has not renamed or a product renamed with the same title.
for such queries, the only way that event store provides is creating a one-time projection with the proper java script code which filters and emits required events to a new stream.and then all i have to do is to fetch events from the new generated stream which is filled by the projection
no the odd thing is, projections are great for subscriptions and generating new streams, but they seem to be odd for doing real time queries.immediately after i create a projection with the HTTP api, i check the new resulting stream for the query result, but it seems that the workers has not got the chance to elaborate on the result and i get 404 response.but after waiting for a bunch of seconds, the new streams pops out and gets filled with the result.
there are too many things wrong with this approach:
first, it seems that if the event store is filled with millions of events across many streams, it wont be able to process and filter all of them immediately to the resulting stream.it does not create the stream immediately, let alone the population.so i have to wait for some time and check for the result hoping the the projection is done
second, i have to fetch multiple times and issue multiple GET HTTP commands which seems to be slow.the new JVM client is not ready yet.
Third, i have to delete the resulting stream after i'm done with the result and failing to do so will leave event store with millions of orphan query result streams
i wish i could pass the java script to some api and get the result page by page like querying MongoDB without worrying about the projection, new streams and timing issues.
i have seen a query section in the Admin UI, but i dont know whats that for, and unfortunetly the documentation doesn't help much
am i expecting the event store to do something that is impossible?
do i have to create a bounded context inner read model for doing such checks?
i am using my events to dehyderate the aggregates and willing to use the same events for such simple queries without acquiring other techniques
I believe it would not be a separate bounded context since the check you want to perform belongs to the same bounded context where your Product aggregate lives. So, the projection that is solely used to prevent duplicate product names would be a part of the same context.
You can indeed use a custom projection to check it but I believe the complexity of such a solution would be higher than having a simple read model in MongoDB.
It is also fine to use an existing projection if you have one to do the check. It might be not what you would otherwise prefer if the aim of the existing projection is to show things in the UI.
For the collection that you could use for duplicates check, you can have the document schema limited to the id only (string), which would be the product title. Since collections are automatically indexed by the id, you won't need any additional indexes to support the duplicate check query. When the product gets renamed, you'd need to delete the document for the old title and add a new one.
Again, you will get a small time window when the duplicate can slip in. It's then up to the business to decide if the concern is real (it's not, most of the time) and what's the consequence of the situation if it happens one day. You'd be able to find a duplicate when projecting events quite easily and decide what to do when it happens.
Practically, when you have such a projection, all it takes is to build a simple domain service bool ProductTitleAlreadyExists.

Building a (simple) twitter-clone with CouchDB

I'm trying to build a (simple) twitter-clone which uses CouchDB as Database-Backend.
Because of its reduced feature set, I'm almost finished with coding, but there's one thing left I can't solve with CouchDB - the per user timeline.
As with twitter, the per user timeline should show the tweets of all people I'm following, in a chronological order. With SQL it's a quite simple Select-Statement, but I don't know how to reproduce this with CouchDBs Map/Reduce.
Here's the SQL-Statement I would use with an RDBMS:
SELECT * FROM tweets WHERE user_id IN [1,5,20,33,...] ORDER BY created_at DESC;
CouchDB schema details
user-schema:
{
_id:xxxxxxx,
_rev:yyyyyy,
"type":"user",
"user_id":1,
"username":"john",
...
}
tweet-schema:
{
"_id":"xxxx",
"_rev":"yyyy",
"type":"tweet",
"text":"Sample Text",
"user_id":1,
...
"created_at":"2011-10-17 10:21:36 +000"
}
With view collations it's quite simple to query CouchDB for a list of "all tweets with user_id = 1 ordered chronologically".
But how do I retrieve a list of "all tweets which belongs to the users with the ID 1,2,3,... ordered chronologically"? Do I need another schema for my application?
The best way of doing this would be to save the created_at as a timestamp and then create a view, and map all tweets to the user_id:
function(doc){
if(doc.type == 'tweet'){
emit(doc.user_id, doc);
}
}
Then query the view with the user id's as keys, and in your application sort them however you want(most have a sort method for arrays).
Edited one last time - Was trying to make it all in couchDB... see revisions :)
Is that a CouchDB-only app? Or do you use something in between for additional buisness logic. In the latter case, you could achieve this by running multiple queries.
This might include merging different views. Another approach would be to add a list of "private readers" for each tweet. It allows user-specific (partial) views, but also introduces the complexity of adding the list of readers for each new tweet, or even updating the list in case of new followers or unfollow operations.
It's important to think of possible operations and their frequencies. So when you're mostly generating lists of tweets, it's better to shift the complexity into the way how to integrate the reader information into your documents (i.e. integrating the readers into your tweet doc) and then easily build efficient view indices.
If you have many changes to your data, it's better to design your database not to update too many existing documents at the same time. Instead, try to add data by adding new documents and aggregate via complex views.
But you have shown an edge case where the simple (1-dimensional) list-based index is not enough. You'd actually need secondary indices to filter by time and user-ids (given that fact that you also need partial ranges for both). But this not possible in CouchDB, so you need to work around by shifting "query" data into your docs and use them when building the view.

Resources