I want to update a property on every "edge" every n cycles/seconds/minutes.
As you may suspect, this is time consuming and probably won't work well.
One possible approach is to do it in chunks.
The question is: what is the best way to do that?
Here is what a full sweep would look like:
MATCH (n1)-[x:q]-(n2)
SET x.decay = x.decay * exp(-rate)
So the idea is to decay the edges and remove them when they hit a specific value.
If I do it in chunks, how do I keep track of which edges I have already decayed so that I can skip them? Faster and cheaper is better.
Sounds like you need a better approach.
For example, store the calculated expiration time (as a timestamp) in every relationship, and have any query that wants to use such a relationship test that it has not expired. This way there is no need to update any relationship properties, and all queries will get the correct behavior (down to the millisecond).
Here is a sample snippet:
...
MATCH (foo)-[rel:REL]->(bar)
WHERE timestamp() < rel.expiration
You can also periodically remove expired relationships to clean up the DB and improve query performance.
My data model has a ClickerRecord entity with 2 attributes: date (NSDate) and numberOfBiscuits (NSNumber). Every time a new record is added, a different value for numberOfBiscuits can be entered.
To calculate a daily average for the number of biscuits I'm currently doing a fetch request for each day within range and using the corresponding NSExpression to calculate the sum of all numberOfBiscuits values for that day.
The problem: I'm using asynchronous fetch requests to avoid blocking the main thread, so it ends up being quite slow when there are many days between the first and last record. The fetch requests are performed one after another.
I could also load all records into memory and perform the sorting and calculations, but I'm worried that it could become an issue when the number of records becomes very large.
Therefore, my question: Is it possible to use NSExpressions to add something like sub-predicates for each date interval, in order to do a single fetch request and retrieve a dictionary with an entry for each daily sum of numberOfBiscuits?
If not, what would be the recommended approach for this situation?
I've read about subqueries, but as far as I understand they're not intended for this kind of use.
This is the first question I'm asking on SO, so I hope to have written it in a clear way :)
I think what you are looking for is propertiesToGroupBy (see the Apple Docs) on NSFetchRequest, though in your case it is not straightforward to implement, for reasons I will discuss later.
Suppose you could specify the category of biscuit consumed on each occasion, and this is stored in a category attribute of your entity. Then to obtain the total number of biscuits of each category (ignoring the date), you could use an NSExpression with the sum: function and specify:
fetch.propertiesToGroupBy = ["category"]
CoreData will then group the results of the fetch by the category and will calculate the sum for each group separately.
The problem in your case is that (unless you already strip the time information from your date attribute) there is no attribute representing the date interval you want to group by, and CoreData will not let you group by a computed value. You would need to add a new day attribute to your entity, calculate it whenever you add or update a record, and specify it in the group by. And you face the same problem again if you subsequently want to calculate your average over a different interval, such as weeks or months.
One other downside is that the results will only include days for which there are ClickerRecords: if the user has a day where they consume no biscuits, the fetch will not show a result for that day (i.e. it will not infer an average of 0). You would need to handle this appropriately when using the results.
It might be better either to tune your asynchronous fetch or, as you suggest, just to read the whole lot into memory to perform the calculations. If your entity only has those two attributes, and assuming your users don't live entirely on biscuits, the volumes should not be too problematic.
I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k row dataset even on a MacBook Pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1, 2, 3, 4, 5, 6]
n_results = 6

# exec_sql stands in for whatever helper actually runs the query
player_ids.each do |pid|
  results_by_pid[pid] = exec_sql("SELECT *
                                  FROM results
                                  WHERE player_id = #{pid}
                                  ORDER BY event_date DESC
                                  LIMIT #{n_results}")
end
And then I go on my merry way. But how can I turn this into a single fast query?
There is no better way.
SQL window functions, which might help here, are not implemented in SQLite. (They were only added later, in SQLite 3.25.)
SQLite is designed as an embedded database where most of the logic stays in the application.
Unlike client/server databases, where round trips over the network should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.
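To illustrate, here is a minimal sketch of that approach using the Ruby sqlite3 gem, mirroring the question's pseudocode (the database file name is an assumption):

require "sqlite3"

db = SQLite3::Database.new("events.db")

# One composite index serves every per-player query: SQLite can walk it
# backwards from the newest event and stop as soon as LIMIT is satisfied.
db.execute("CREATE INDEX IF NOT EXISTS idx_results_player_date
            ON results (player_id, event_date)")

n_results = 6
results_by_pid = {}
[1, 2, 3, 4, 5, 6].each do |pid|
  results_by_pid[pid] = db.execute(
    "SELECT * FROM results
     WHERE player_id = ?
     ORDER BY event_date DESC
     LIMIT ?", [pid, n_results]
  )
end

With that index in place, each query is a short index walk rather than a scan-and-sort, so six of them in a row stay cheap.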
This won't be much of an answer, but here goes...
I have found that making things really fast often involves exploiting the nature of the data and the schema themselves. For example, searching an ordered list is faster than searching an unordered one, but you pay a cost up front, both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! So many engineers don't get clever with their UX. Will your query run as the result of a view controller push? Then kick it off in a background thread from the previous view controller, and let it work while iOS animates. How long does a push animation take? 0.2 seconds? At what point does your user indicate to the app (via some UX control) which player_ids are going to be queried? As soon as they touch that button or cell, you can prefetch some data. So even if the total work is O(n log n), you can probably hide a good chunk of it behind the interaction.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say:

CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
-- SQLite has no strict DATE type; the value is stored with NUMERIC affinity

CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N
BEGIN
  DELETE FROM recent_results
  WHERE result_id = (SELECT result_id
                     FROM recent_results
                     ORDER BY event_date
                     LIMIT 1);
END;
-- or something like that (replace N with your cap).
-- I have no idea if this will work, I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
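A sketch of that variant, reusing the db handle from the earlier snippet (the 50-row cap and the result_id column name are assumptions):

# TEMP tables live outside the main database file and disappear when the
# connection closes; populate once at app load, then maintain as you insert.
db.execute("CREATE TEMP TABLE recent_results AS
            SELECT result_id, player_id, event_date
            FROM results
            ORDER BY event_date DESC
            LIMIT 50")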
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!
I have two models - Score & Weight.
Each of these models have about 5 attributes.
I need to be able to create a weighted_score for my User, which is basically the product of Score.attribute_A * Weight.attribute_A, Score.attribute_B * Weight.attribute_B, etc.
Am I better off creating a third model, say Weighted_Score, where I store the product value for each attribute in a row with the user_id and then query that table whenever I need a particular weighted score (e.g. my_user.weighted_score.attribute_A), or am I better off just doing the calculations on the fly every time?
I am asking from an efficiency stand-point.
Thanks.
I think the answer is very situation-dependent. Creating a 3rd table may be a good idea if the calculation is very expensive, you don't want to bog down the rest of the system and it's ok for you to respond to the user right away with a message saying that calculation will occur in the future. In that case, you can offload the processing into a background worker and create an instance of the 3rd model asynchronously. Additionally, you should de-normalize the table so that you can access it directly without having to lookup the Weight/Score records.
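For instance, the background-worker variant might look roughly like this with ActiveJob (every model, association and attribute name here is an assumption):

class WeightedScoreJob < ApplicationJob
  queue_as :default

  def perform(user)
    # Store the per-attribute products in the denormalized third table,
    # so reads never have to touch the Score/Weight records.
    user.create_weighted_score!(
      attribute_a: user.score.attribute_a * user.weight.attribute_a,
      attribute_b: user.score.attribute_b * user.weight.attribute_b
    )
  end
end

# Enqueue right after responding to the user:
WeightedScoreJob.perform_later(user)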
Some other ideas:
Focus optimizations on the model that has many records. If Weight, for instance, will only ever have 100 records but Score could grow without bound, then load Weight into memory and focus all your effort on optimizing the Score queries.
Use memoization on the calc methods (see the sketch after this list).
Use caching on the most expensive actions/methods. If you don't care too much about how frequently the values update, you can explicitly sweep the cache nightly or something.
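A minimal memoization sketch for the second idea (attribute and association names assumed):

class User < ApplicationRecord
  has_one :score
  has_one :weight

  # Memoized per instance, so repeated calls within a request
  # do the multiplication only once per attribute.
  def weighted_score_for(attr)
    @weighted_scores ||= {}
    @weighted_scores[attr] ||= score.public_send(attr) * weight.public_send(attr)
  end
end

Usage would then be along the lines of my_user.weighted_score_for(:attribute_a).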
Unless there is a need to store the calculated score (let's say that it changes and you want to preserve the changes to it), I don't see any benefit in adding the complexity of storing it in a separate table.
The Scenario
Update: It was brought to my attention that ordering by created_at will actually compare a millisecond float that's of sufficient resolution (by far). However, while I feel a bit dumb now, my question still stands. My scenario is just irrelevant, so I removed it.
The Question
I know that the database knows precisely the order of creation by tracking a row's ID.
Are there any pitfalls in relying on the latest ID to determine order?
A better solution is to replace latest_post_at with something more precise than a second. Time.now.to_f instead of .to_i will give you sub-second precision (microseconds in practice, though the docs aren't explicit). Should two posts happen to have the same timestamp, you could use the id as a tie-breaker.
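In ActiveRecord terms, assuming latest_post_at is stored as a float, the pair could look like this (a sketch):

# Record sub-second precision whenever the post is touched
post.update(latest_post_at: Time.now.to_f)

# Newest first, with id as the tie-breaker for identical timestamps
Post.order(latest_post_at: :desc, id: :desc)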
If you're using whatever is the "natural" way of generating autoincrementing surrogate primary keys for your database, the only pitfall that comes to mind is that the order in which the database sequencer generated the IDs might not be the order in which the transactions that create the Post records start or finish. (Or however you define the time when a post is "created".)
Considering that a transaction should normally take a fraction of a second to complete, this uncertainty might be irrelevant for your needs.
I want to perform some simple calculations while staying database-agnostic in my rails app.
I have three models:
.---------------. .--------------. .---------------.
| ImpactSummary |<------| ImpactReport |<----------| ImpactAuction |
`---------------'1 *`--------------'1 *`---------------'
Basically:
ImpactAuction holds data about... auctions (prices, quantities and such).
ImpactReport holds monthly reports that have many auctions as well as other attributes ; it also shows some calculation results based on the auctions.
ImpactSummary holds a collection of reports as well as some information about a specific year, and also shows calculation results based on the two other models.
What I intend to do is to store the results of these really simple calculations (just means, sums, and the like) in the relevant tables, so that reading them is fast and I can easily run queries against the calculation results.
Is it good practice to store calculation results? I'm pretty sure it isn't, but is it acceptable?
Is it useful, or should I not bother and just perform the calculations on the fly?
If it is good practice and useful, what is the best way to achieve what I want?
That's the tricky part. At first, I implemented a simple chain of callbacks that would update the calculation fields of the parent model upon save (that is, when an auction is created or updated, it marks some_attribute_will_change! on its report and saves it, which triggers the report's own callbacks, and so on).
This approach works well when creating or updating a single record, but when I work on several records it triggers the calculations on the whole chain once per record. So I suddenly find myself forced to put a condition on the callbacks depending on whether I have one or many records, and I can't figure out how (a class method that could be called on a relation? an instance attribute #skip_calculations on each record? just an "outdated" field to mark the parent records for later recalculation?).
Any advice is welcome.
Bonus question: would it still be considered DB-agnostic if I implemented this with DB views?
As usual, it depends. If you can perform the calculations in the database, either using a view or using #find_by_sql, I would do so. You'll save yourself a lot of trouble: otherwise you have to keep your summaries up to date whenever values change, and you've already run into that problem when updating multiple rows. Having a view, or a query that implements the view stored as text in ImpactReport, lets you always read fresh data.
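As an illustration of the #find_by_sql route, the report-level aggregates could be computed fresh on every read with something like this (the column names are assumptions based on the description above):

ImpactReport.find_by_sql(<<~SQL)
  SELECT impact_reports.*,
         SUM(impact_auctions.price * impact_auctions.quantity) AS total_value,
         AVG(impact_auctions.price) AS mean_price
  FROM impact_reports
  LEFT JOIN impact_auctions
    ON impact_auctions.impact_report_id = impact_reports.id
  GROUP BY impact_reports.id
SQL

The extra SELECT columns come back as readable attributes on the returned ImpactReport objects, and since nothing here is vendor-specific, it stays database-agnostic.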
The answer? Benchmark, benchmark, benchmark ;)