For a social network site, I have an activity feed of events from the people a user follows, and I'd like to group similar types of events made within a short timeframe together, for a more compact feed. Imagine how Facebook displays a comma-separated list when you 'like' several things in rapid succession: 'Joe likes beer, football and chips.'
I understand how to use the group_by method on ActiveRecord (Enumerable) results, but some initial work needs to be done to populate a property that I can group by later. My questions deal both with storing activity data in a way that these groupings can be marked, and with retrieving them again later.
Right now I have an Activity model, which is a join association between the user that committed the activity and the item it's linked to (in my example above, assume 'beer', 'football' and 'chips' are records of a Like model). There are other activity types aside from 'likes' too (events, saving favorites, etc.). What I'm considering is this: as each association is created, check when the last association of that type was made, and if it was more than a certain time period ago, increment an 'activity block' counter that is part of the Activity model. Later, when rendering this activity feed, I can group by user, then type, then this activity block counter.
Example: let's say 2 blocks of updates are made within the same day. A user likes 2 things at 2:05 and later 3 more things at 5:45. When the third update (the start of the 2nd block) happens at 5:45, the model detects that too much time has passed and increments the activity block counter by 1, forcing any following updates into a new block when they are rendered via a group_by call (a rough sketch of this callback follows the example):
2:05 Joe likes beer nuts and Hooters.
5:45 Joe likes couches, chips and salsa.
7:00 Joe is attending the Football Viewing Party At Joe's
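Here's a rough sketch of the callback I have in mind (the activity_block and activity_type columns and the 15-minute window are placeholders I made up for illustration):

class Activity < ActiveRecord::Base
  belongs_to :user
  before_create :assign_activity_block

  private

  # Reuse the last block of this type if it is recent enough,
  # otherwise start a new block (column names are placeholders).
  def assign_activity_block
    last = Activity.where(user_id: user_id, activity_type: activity_type)
                   .order(:created_at).last
    if last && last.created_at > 15.minutes.ago
      self.activity_block = last.activity_block
    else
      self.activity_block = (Activity.maximum(:activity_block) || 0) + 1
    end
  end
end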
My first question: what's an efficient way to increment a counter like this? It's no longer auto_increment, so the easiest thing I can think of is looking at the counter of the last record as a reference point. However, this couldn't come from the same query that checked when the last update of that type was made, since a later update of another type could already have received the next counter value. The counter values don't have to be globally unique, but that would be nice.
The other overall strategy I thought of was another model called ActivityBlock, which joins groups of similar activities together. In many cases, though, updates will be isolated, so having one block record for each individual activity seems a little inefficient.
Do either of these seem like a solid strategy?
My final question revolves around pagination. Now that we're dealing with blocks, it's harder to always display exactly a certain number of entries before pagination kicks in. An individual (isolated) Activity update or a block of them should each count as just one entry, so at the lowest layer of my group_by I can incorporate a counter to track how many rows I've displayed. But this means I can't just make one DB query anymore and simply specify a limit. Is there any way I could still do this without repeatedly performing additional SQL queries until I've reached my page limit?
This would be one advantage of the ActivityBlock model approach, since I could easily apply a limit call to that, and blocks could contain an auto increment counter as well.
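For example, something roughly like this (assuming a hypothetical ActivityBlock model with has_many :activities, and page/per_page coming from the request):

# Paginate on blocks, so each block counts as one feed entry
# no matter how many activities it contains.
blocks = ActivityBlock.order('created_at DESC')
                      .includes(:activities)
                      .offset((page - 1) * per_page)
                      .limit(per_page)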
Check out http://railscasts.com/episodes/406-public-activity
He also posted one on how to do it from scratch in episode 407 (it's a Pro episode though).
You could use the epoch time, or a variation of it, as the counter, since that's semi-unique and deterministic.
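For example, a minimal sketch that buckets activities into 15-minute windows (the window size is arbitrary):

before_create do
  # Activities created in the same 15-minute window share a block value.
  self.activity_block = Time.current.to_i / (15 * 60)
end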
We have items in our app that form a tree-like structure. You might have a pattern like the following:
(c:card)-[:child]->(subcard:card)-[:child]->(subsubcard:card) ... etc
Every time an operation is performed on a card (at any level), we'd like to record it. Here are some possible events:
The title of a card was updated by Bob
A comment was added by Kate mentioning Joe
The status of a card changed from pending to approved
The linked list approach seems popular, but given the sorts of queries we'd like to perform, I'm not sure it works best for us.
Here are the main queries we will be running:
All of the activity associated with a particular card AND child cards, sorted by time of the event (basically we'd like to merge all of these activity feeds together)
All of the activity associated with a particular person sorted by time
On top of that we'd like to add filters like the following:
Filter by person involved
Filter by time period
It is also important to note that cards may be re-arranged very frequently. In other words, the parents may change.
Any ideas on how to best model something like this? Thanks!
I have a couple of suggestions, but I would suggest benchmarking them.
The linked list approach might be good if you could use the Java APIs (perhaps via an unmanaged extension for Neo4j). If the newest event in the list were the one attached to the card (and the list was essentially ordered by the date the events happened), then when filtering by time you could terminate early once you've found an event earlier than the specified time.
Attaching the events directly to the card has the potential to lead you into problems with supernodes/dense nodes. It would be the simplest to query for in Cypher, though; the problem is that Cypher will look at all of them before filtering. You could perhaps improve query performance by placing the date/time of the event not only on the event node but also on the relationships to it ((:Card)-[:HAS_EVENT]->(:Event) or (:Event)-[:PERFORMED_BY]->(:Person)). Then when you query, you can filter on the relationships so that the traversal doesn't need to visit the nodes.
Regardless, it would probably be helpful to break up the query like so:
MATCH (c:Card {uuid: 'id_here'})-[:child*0..]->(child:Card)
WITH child
MATCH (child)-[:HAS_EVENT]->(event:Event)
I think that would mean that the MATCH is going to have fewer permutations of paths that it will need to evaluate.
Others are welcome to supplement my dubious advice as I've never really dealt with supernodes personally, just read about them ;)
After hearing about NoSQL for a couple of years, I finally started playing with RavenDB today in a .NET MVC app (a simple blog). Getting the embedded database up and running was pretty quick and painless.
However, I've found that after inserting objects into the document store, they are not always there when the next page loads; when I refresh the page, they do show up. I read somewhere that this is due to stale indexes.
My question is, how are you supposed to use this in production on a site with inserts happening all the time (example: e-commerce). Isn't this always going to result in stale indexes and unreliable query results?
Think of what actually happens with a traditional database like SQL Server.
When an item is created, updated, or deleted from a table, any indexes associated with the table also have to be updated.
The more indexes you have on a table, the slower your write operations will be.
If you create a new index on an existing table, it isn't used at all until it is fully built. If no other index can answer a query, then a slow table scan occurs.
If others attempt to query from an existing index while it is being modified, the reader will block until the modification is complete, because Consistency is given higher priority than Availability.
This can often lead to slow reads, timeouts, and deadlocks.
The NoSQL concept of "Eventual Consistency" is designed to alleviate these concerns. It optimizes reads by prioritizing Availability over Consistency. RavenDB is not unique in this regard, but it is somewhat special in that it still has the ability to be consistent. If you are retrieving a single document, such as reviewing an order or an end user viewing their profile, these operations are ACID compliant and are not affected by the "eventual consistency" design.
To understand "eventual consistency", think about a typical user looking at a list of products on your web site. At the same time, the sales staff of your company is modifying the catalog, adding new products, changing prices, etc. One could argue that it's probably not super important that the list be fully consistent with these changes. After all, a user visiting the site a couple of seconds earlier would have received data without the changes anyway. The most important thing is to deliver product results quickly. Blocking the query because a write was in progress would mean a slower response time to the customer, and thus a poorer experience on your web site, and perhaps a lost sale.
So, in RavenDB:
Writes occur against the document store.
Single Load operations go directly to the document store.
Queries occur against the index store.
As documents are being written, data is being copied from the document store to the index store, for those indexes that are already defined.
At any time you query an index, you will get whatever is already in that index, regardless of the state of the copying that's going on in the background. This is why sometimes indexes are "stale".
If you query without specifying an index, and Raven needs a new index to answer your query, it will start building an index on the fly and return you some of those results right away. It only blocks long enough to give you one page of results. It then continues building the index in the background so next time you query you will have more data available.
So now let's give an example that shows the downside to this approach.
A sales person goes to a "products list" page that is sorted alphabetically.
On the first page, they see that "Apples" aren't currently being sold.
So they click "add product", and go to a new page where they enter "Apples".
They are then returned to the "products list" page and they still don't see any Apples because the index is stale. WTF - right?
Addressing this problem requires the understanding that not all viewers of data should be considered equal. That particular sales person might demand to see the newly added product, but a customer isn't going to know or care about it with the same level of urgency.
So on the "products list" page that the sales person is viewing, you might do something like:
var results = session.Query<Product>()
    .Customize(x => x.WaitForNonStaleResultsAsOfLastWrite())
    .OrderBy(x => x.Name)
    .Skip((pageNumber - 1) * pageSize).Take(pageSize);
While on the customer's view of the catalog, you would not want to add that customization line.
If you wanted to get super precise, you could use a slightly more optimized strategy:
When going back from the "add product" page to the "list products" page, pass along the ProductID that was just added.
Just before you query on that page, if the ProductID was passed in then change your query code to:
var product = session.Load<Product>(productId);
var etag = session.Advanced.GetEtagFor(product);
var results = session.Query<Product>()
    .Customize(x => x.WaitForNonStaleResultsAsOf(etag))
    .OrderBy(x => x.Name)
    .Skip((pageNumber - 1) * pageSize).Take(pageSize);
This will ensure that you only wait as long as absolutely necessary to get just that one product's changes included in the results list along with the other results from the index.
You could optimize this slightly by passing the etag back instead of the ProductId, but that might be less reusable from other places in your application.
But do keep in mind that if the list is sorted alphabetically, and we added "Plums" instead of "Apples", then you might not have seen these results instantly anyway. By the time the user had skipped to the page that includes that product, it would likely have been there already.
You are running into stale queries.
That is by design in RavenDB. You need to make a distinction between queries (BASE) and loading by id (ACID).
I want to perform some simple calculations while staying database-agnostic in my rails app.
I have three models:
.---------------. .--------------. .---------------.
| ImpactSummary |<------| ImpactReport |<----------| ImpactAuction |
`---------------'1 *`--------------'1 *`---------------'
Basically:
ImpactAuction holds data about... auctions (prices, quantities and such).
ImpactReport holds monthly reports that have many auctions as well as other attributes; it also shows some calculation results based on the auctions.
ImpactSummary holds a collection of reports as well as some information about a specific year, and also shows calculation results based on the two other models.
What I intend to do is store the results of these really simple calculations (just means, sums, and the like) in the relevant tables, so that reads are fast and so that I can easily run queries on the calculation results.
Is it good practice to store calculation results? I'm pretty sure that's not a very good thing, but is it acceptable?
Is it useful, or should I not bother and just perform the calculations on the fly?
If it is good practice and useful, what's the best way to achieve what I want?
That's the tricky part. At first, I implemented a simple chain of callbacks that update the calculation fields of the parent model on save (that is, when an auction is created or updated, it calls some_attribute_will_change! on its report and saves it, which triggers the report's own callbacks, and so on).
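Here is roughly what that chain looks like right now (attribute names like total and price are just illustrative):

class ImpactAuction < ActiveRecord::Base
  belongs_to :impact_report
  after_save :refresh_report

  private

  # Force the parent report to re-run its own calculation callbacks.
  def refresh_report
    impact_report.total_will_change!
    impact_report.save!
  end
end

class ImpactReport < ActiveRecord::Base
  belongs_to :impact_summary
  has_many :impact_auctions
  before_save :recalculate
  after_save :refresh_summary

  private

  def recalculate
    self.total = impact_auctions.sum(:price)  # 'total' and 'price' are placeholders
  end

  def refresh_summary
    impact_summary.save!  # the same pattern repeats one level up
  end
end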
This approach fits well when creating or updating a single record, but if I want to work on several records it will trigger the whole chain of calculations for each one. So I suddenly find myself forced to put a condition on the callbacks depending on whether I'm dealing with one record or many, and I can't figure out how to do that (using a class method that could be called on a relation? an instance attribute like #skip_calculations on each record? just using an outdated flag to mark the parent records for later recalculation?).
Any advice is welcome.
Bonus question: would it be considered DB-agnostic if I implemented this with DB views?
As usual, it depends. If you can perform the calculations in the database, either using a view or using #find_by_sql, I would do so. You'll save yourself a lot of trouble: otherwise you have to keep your summaries up to date whenever values change, and you've already run into that problem when updating multiple rows. A view, or a query that implements the view stored as text in ImpactReport, will always give you fresh data.
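For example, a rough sketch of the query-on-read approach with #find_by_sql (table and column names are assumptions on my part):

class ImpactReport < ActiveRecord::Base
  has_many :impact_auctions

  # Returns reports with aggregate columns computed by the database,
  # so nothing has to be stored or kept in sync.
  def self.with_totals
    find_by_sql(<<-SQL)
      SELECT impact_reports.*,
             SUM(impact_auctions.price)    AS total_price,
             AVG(impact_auctions.quantity) AS average_quantity
      FROM impact_reports
      LEFT JOIN impact_auctions
        ON impact_auctions.impact_report_id = impact_reports.id
      GROUP BY impact_reports.id
    SQL
  end
end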
The answer? Benchmark, benchmark, benchmark ;)
I have a table in my database that stores event totals, something like:
event1_count
event2_count
event3_count
I would like to transition from these simple aggregates to more of a time series, so the user can see on which days these events actually happened (like how Stack Overflow shows daily reputation gains).
Elsewhere in my system I already did this by creating a separate table with one record for each daily value, but then, in order to collect a time series, you end up with a huge database table and need to query tens or hundreds of records. It works, but I'm not convinced it's the best way.
What is the best way of storing these individual events along with their dates so I can do a daily plot for any of my users?
When building tables like this, the real key is having effective indexes. Test your queries with the EXPLAIN statement or the equivalent in your database of choice.
If you want to build summary tables you can query, build a view that represents the query, or roll the daily results into a new table on a regular schedule. Often summary tables are the best way to go as they are quick to query.
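For instance, a minimal sketch of a per-event table plus the index that supports the daily plot (table, column and model names are assumptions):

class CreateEvents < ActiveRecord::Migration
  def change
    create_table :events do |t|
      t.references :user, null: false
      t.string :event_type, null: false
      t.timestamps
    end
    # Composite index so the per-user, per-type, per-day query can use an index range scan.
    add_index :events, [:user_id, :event_type, :created_at]
  end
end

# Daily counts for one user's chart:
Event.where(user_id: user_id, event_type: 'event1').group('DATE(created_at)').count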
The best way to implement this is to use Redis. If you haven't worked with Redis before, I suggest you start; you will be surprised how fast it can be :). The way I would do this is with the Hash data structure Redis provides. Give every user their own Hash (with a unique key per user such as "user:23:counters"). Inside this Hash, use a daily date string such as "05/06/2011" as the field and increment its counter every time an event happens, or whatever you want to do with it!
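A minimal sketch with the redis-rb gem (the key layout and date format are just examples):

require 'redis'

redis = Redis.new
day = Time.now.strftime('%d/%m/%Y')

# HINCRBY creates the hash and the field if they don't exist yet.
redis.hincrby("user:23:counters", day, 1)

# Pull the whole series back when rendering the chart.
redis.hgetall("user:23:counters")  # => {"05/06/2011" => "3", ...}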
A good start would be this thread; it has a simple, beginner-level solution: Time Series Starter. If you are OK with Rails models, this is one way it could work for a so-called "irregular" time series, i.e. an event here and there rather than at a regular interval, like a sensor that sends data when your door is opened.
The other thing, and that is what I was looking for in this thread, is a regular time series DB: values arrive at an interval, say 60/minute, i.e. 1 per second, for example a temperature sensor. This all boils down to datasets with "buckets", as you suspect: a time series table gets long, indexes degrade at some point, etc. Here is one "bucket" approach using Postgres arrays that would be a feasible idea.
It's not available as "plug and play", as far as my research on the web goes.
Is there a way I can query the database to return only the elements that have happened successively (in time) according to a set variable?
Specifically, I need to find out how many times the user has won a game in a row. The games are stored in the database with an attribute win_loss (1 is win, 0 is loss).
I would like to see if the user has won 3 games in a row. Is there a way to do this in the database?
If not, what would it look like in the application?
I apologize ahead of time if this is confusing. Please ask questions and I will try to clear it up. I'm using Ruby on Rails.
I would take a different approach: add a consecutive_wins column to your User model (negative numbers could represent consecutive losses if you need those as well).
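A rough sketch of how that column could be maintained (assuming a Game model with a win_loss attribute and a belongs_to :user association):

class Game < ActiveRecord::Base
  belongs_to :user
  after_create :update_streak

  private

  # Wins extend a positive streak, losses extend a negative one.
  def update_streak
    streak = user.consecutive_wins.to_i
    if win_loss == 1
      streak = streak > 0 ? streak + 1 : 1
    else
      streak = streak < 0 ? streak - 1 : -1
    end
    user.update_attribute(:consecutive_wins, streak)
  end
end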
EDIT: if you really need to query it...
Two queries:
select the newest win and the newest loss for the given user; this can be done in one query, in SQL: SELECT MAX(created_at) AS newest, win_loss FROM games WHERE user_id = ? GROUP BY win_loss (I can't recall how to do grouping functions in Rails right now, but you should get the idea; a rough ActiveRecord sketch follows this list)
and then, regarding the result of query (1):
if both a newest win and a newest loss are found: when the win is more recent, count the games with created_at > newest_loss (those are the consecutive wins); when the loss is more recent, the streak is zero
if only a newest win or only a newest loss is found, the user has always won or always lost, so just count the number of their games
otherwise, there were no games at all
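A rough ActiveRecord version of the above (assuming a Game model with user_id, win_loss and created_at columns):

# Newest win and newest loss in one grouped query.
newest = Game.where(user_id: user.id).group(:win_loss).maximum(:created_at)
# => { 1 => <time of newest win>, 0 => <time of newest loss> }

consecutive_wins =
  if newest[1].nil?
    0                                     # the user has never won
  elsif newest[0].nil?
    Game.where(user_id: user.id).count    # never lost, so every game is part of the streak
  elsif newest[1] > newest[0]
    Game.where(user_id: user.id).where('created_at > ?', newest[0]).count
  else
    0                                     # the most recent game was a loss
  end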
I still recommend adding the column to the User model; it will be better when you need to display a whole table of players and their consecutive wins/losses.