Basically here is the set up:
You have a number of marketplace items and you want to sort them by price. If the cache expires while someone is browsing, they may suddenly be presented with duplicate entries. This seems like a really terrible public API experience, and we are looking to avoid this problem.
Some basic philosophies I have seen include:
Reddit's, in which they track the last id seen by the client, but they still handle duplicates.
Will Paginate, which is a simple implementation that returns results based on the number of items you want per page and an offset
Then there are many varied solutions that involve Redis sorted sets, etc., but these also don't really solve the problem of how to remove the duplicate entries.
Does anyone have a fairly reliable way to deal with paginating sorted, dynamic lists without duplicates?
If the items you need to paginate are sorted properly (on unique values), then the only thing you need to do is select the results by that value instead of by offset.
A simple SQL example:
SELECT * FROM items ORDER BY id DESC LIMIT 10; /* page 1 */
Let's say row #10 has id = 42 (and id is the primary key):
SELECT * FROM items WHERE id < 42 ORDER BY id DESC LIMIT 10; /* page 2 */
If you are using PostgreSQL (MySQL probably has the same problem), this also solves the performance problem with OFFSET (OFFSET N LIMIT M has to scan N rows!).
If the sort is not unique (e.g. sorting on a creation timestamp, where multiple items can be created at the same time), you are going to have the duplication problem.
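One common fix (just a sketch here; the created_at column name is assumed) is to make the sort unique by adding the primary key as a tie-breaker, and to paginate on the pair of values:

SELECT * FROM items ORDER BY created_at DESC, id DESC LIMIT 10; /* page 1 */
-- the last row of page 1 supplies :last_created_at and :last_id
SELECT * FROM items
WHERE (created_at, id) < (:last_created_at, :last_id) /* row-value comparison, works in PostgreSQL */
ORDER BY created_at DESC, id DESC
LIMIT 10; /* page 2 */

Databases without row-value comparisons need the expanded form: created_at < :last_created_at OR (created_at = :last_created_at AND id < :last_id).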
I am developing an application which is very similar in nature to a todo list, except the order of the todos matters and can be changed by the user.
What's a good way to save this order in the db without having to re-save the whole todo list whenever the order changes?
I am developing in Rails, Postgres and React, newest versions.
I am thinking of saving it as an array in User Todos (there can be multiple users of the application), but that could complicate things a little, as every time I create a todo I would have to save the list as well.
You can look into the acts_as_list gem; for this you'll have to add an additional position column to your table. It does mass updates on the records when items move, but the gem is frequently updated.
If you want an optimised solution that minimises the number of updates when the list changes, then you should check the ranked-model gem, although that one is not updated as frequently. Here is a brief description of how it works:
This library is written using ARel from the ground-up. This leaves the code much cleaner than many implementations. ranked-model is also optimized to write to the database as little as possible: ranks are stored as a number between -2147483648 and 2147483647 (the INT range in MySQL). When an item is given a new position, it assigns itself a rank number between two neighbors. This allows several movements of items before no digits are available between two neighbors. When this occurs, ranked-model will try to shift other records out of the way. If items can't be easily shifted anymore, it will rebalance the distribution of rank numbers across all members of the ranked group.
You can refer to this gem and build your own implementation, as it only supports Rails 3 & 4.
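As a rough illustration of that rank-between-neighbors idea (this is not ranked-model's actual code; the todos table and its rank column are made up for the example):

-- move todo 7 between two neighbors whose ranks are 1000 and 2000:
UPDATE todos SET rank = (1000 + 2000) / 2 WHERE id = 7;  -- new rank = 1500, only one row written
-- only when no integer is left between two neighbors do you touch other rows,
-- by spreading the ranks of the group back out across the INT range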
This was a bit of a head-scratcher, but here is what I figured out:
create table orderedtable (
pk SERIAL PRIMARY KEY,
ord INTEGER NOT NULL,
UNIQUE(ord) DEFERRABLE INITIALLY DEFERRED
)
DEFERRABLE INITIALLY DEFERRED is important so that intermediate states don't cause constraint violations during reordering.
INSERT INTO orderedtable (ord) VALUES (1),(2),(3),(4),(5),(10),(11)
Note that when inserting into this table it would be more efficient to leave gaps between ord values, so as to minimize the number of ord values that need to be shifted when inserting or moving rows later. The consecutive values above are just for demonstration purposes.
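If you were leaving gaps, the initial insert would look more like this instead (a sketch, not used in the demonstration that follows):

INSERT INTO orderedtable (ord) VALUES (10),(20),(30),(40),(50);
-- a new row can later go between 20 and 30 by simply taking ord = 25, with no shifting at all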
Here's the trick: You can find a consecutive sequence of values starting at a particular value using a recursive query.
So, for example, let's say you wanted to insert or move a row just above position 3. One way would be to move the rows currently at positions 4 and 5 up by one, to open up position 4.
WITH RECURSIVE consecutives(ord) AS (
SELECT ord FROM orderedtable WHERE ord = 3+1 --start position
UNION ALL
SELECT orderedtable.ord FROM orderedtable JOIN consecutives ON orderedtable.ord=consecutives.ord+1 --recursively select rows one above, until there is a hole in the sequence
)
UPDATE orderedtable
SET ord=orderedtable.ord+1
FROM consecutives
WHERE orderedtable.ord=consecutives.ord;
The above renumbers the ord from 1,2,3,4,5,10,11 to 1,2,3,5,6,10,11 leaving a hole at 4.
If there was already a hole at ord=4, the above query wouldn't have done anything.
Then just insert or move another row by giving it the now free ord value of 4.
You could push rows down instead of up by changing the +1s to -1s.
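Putting it together, to place a new row at the freed position (a sketch using the table above; both statements go in one transaction so the deferred unique constraint is only checked at commit):

BEGIN;
-- run the recursive UPDATE from above here, shifting ord 4 and 5 up by one
INSERT INTO orderedtable (ord) VALUES (4);  -- take the freed slot
COMMIT;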
I have a table with three fields: Id, location, sortorder.
Id location sortorder
-- -------- ---------
1 a 1
2 b 2
3 c 3
4 d 4
I want the user to be able to amend the sort order of the items in the table. I'm using EF to write to the database; is there any way of amending the sort order on the table without having to make loads of calls to the database?
If I move an item to the top of the list from the bottom I would need to update all the rows that were underneath that new row, to move them down the order. If possible I would like to avoid n updates to the database, and just do it in the least number possible.
Is this possible?
I believe Gert's suggestion of using floats for the sort order is probably the best one to go with. Drupal uses weights on menu items for the same purpose, but inserts at increments of 100 or 1000 so you can go between things. I think it can also run a cron job to re-space the ordering so you don't run out of numbers in a more efficiently stored data type, but that sounds like a holdover from my BASIC days in middle school, where you had to do that with line numbers.
Also, I would wager that it isn't actually as awful as running n updates because it's instead doing one update that affects n rows. Yes, at the end of the day it does have to change n rows but that's on the DB side so there are tons of efficiencies that can be implemented to speed it up.
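A minimal sketch of the float approach (the table name items is assumed, and sortorder is assumed to have been changed to a float column; Id and sortorder come from the question). Moving Id 4 to sit between Id 1 and Id 2 becomes a single-row update instead of shifting every row underneath:

UPDATE items
SET sortorder = (1.0 + 2.0) / 2   -- midpoint between the two neighbours' sort orders
WHERE Id = 4;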
I have a table of event results, and I need to fetch the most recent n events per player for a given list of players.
This is on iOS so it needs to be fast. I've looked at a lot of top-n-per-group solutions that use subqueries or joins, but these run slow for my 100k row dataset even on a macbook pro. So far my dumb solution, since I will only run this with a maximum of 6 players, is to do 6 separate queries. It isn't terribly slow, but there has to be a better way, right? Here's the gist of what I'm doing now:
results_by_pid = {}
player_ids = [1, 2, 3, 4, 5, 6]
n_results = 6
for pid in player_ids:
    results_by_pid[pid] = exec_sql("SELECT *
                                    FROM results
                                    WHERE player_id = #{pid}
                                    ORDER BY event_date DESC
                                    LIMIT #{n_results}")
And then I go on my merry way. But how can I turn this into a single fast query?
There is no better way.
SQL window functions, which might help, are not implemented in SQLite.
SQLite is designed as an embedded database where most of the logic stays in the application.
In contrast to client/server databases where network communication should be avoided, there is no performance disadvantage to mixing SQL commands and program logic.
A less dumb solution requires you to do some SELECT player_id FROM somewhere beforehand, which should be no trouble.
To make the individual queries efficient, ensure you have one index on the two columns player_id and event_date.
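For example (a sketch; the table and column names are taken from the question):

CREATE INDEX IF NOT EXISTS idx_results_player_date
ON results (player_id, event_date DESC);
-- each per-player query then becomes a short index scan instead of a table scan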
This won't be much of an answer, but here goes...
I have found that making things really quick can involve ideas from the nature of the data and schema themselves. For example, searching an ordered list is faster than searching an unordered list, but you have to pay a cost up front - both in design and execution.
So ask yourself if there are any natural partitions on your data that may reduce the number of records SQLite must search. You might ask whether the latest n events fall within a particular time period. Will they all be from the last seven days? The last month? If so then you can construct the query to rule out whole chunks of data before performing more complex searches.
Also, if you just can't get the thing to work quickly, you can consider UX trickery! Soooooo many engineers don't get clever with their UX. Will your query be run as the result of a view controller push? Then set the thing going in a background thread from the PREVIOUS view controller, and let it work while iOS animates. How long does a push animation take? .2 seconds? At what point does your user indicate to the app (via some UX control) which playerids are going to be queried? As soon as he touches that button or TVCell, you can prefetch some data. So if the total work you have to do is O(n log n), that means you can probably break it up into O(n) and O(log n) pieces.
Just some thoughts while I avoid doing my own hard work.
More thoughts
How about a separate table that contains the ids of the previous n inserts? You could add a trigger to delete old ids if the size of the table grows above n. Say..
CREATE TABLE IF NOT EXISTS recent_results
(result_id INTEGER PRIMARY KEY, event_date DATE);
-- is DATE a type? I don't know. you get the point

CREATE TRIGGER IF NOT EXISTS optimizer
AFTER INSERT ON recent_results
WHEN (SELECT COUNT(*) FROM recent_results) > N   -- N = however many recent ids you want to keep
BEGIN
    DELETE FROM recent_results
    WHERE result_id = (SELECT result_id
                       FROM recent_results
                       ORDER BY event_date ASC
                       LIMIT 1);                  -- drop the oldest entry
END;
-- or something like that. I have no idea if this will work,
-- I just threw it together.
Or you could just create a temporary memory-based table that you populate at app load and keep up to date as you perform transactions during app execution. That way you only pay the steep price once!
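A sketch of that idea in SQLite (the results table and column names are carried over from the question; the top-n-per-player subquery is just one way to fill it):

PRAGMA temp_store = MEMORY;               -- keep the temp database in RAM
CREATE TEMP TABLE recent_results AS
SELECT r.*
FROM results AS r
WHERE r.rowid IN (SELECT rowid
                  FROM results
                  WHERE player_id = r.player_id
                  ORDER BY event_date DESC
                  LIMIT 6);
-- later reads hit recent_results; keep it current as new results are inserted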
Just a few more thoughts for you. Be creative, and remember that you can usually define what you want as a data structure as well as an algorithm. Good luck!
Trying to join 6 tables, each with approximately 5 million rows. The join is on account number, which is sorted in ascending order in all tables. The map tasks finish successfully, but the reducers stop making progress at 66.68%. I tried options like increasing the number of reducers, and also tried set hive.auto.convert.join = true;, set hive.hashtable.max.memory.usage = 0.9; and set hive.smalltable.filesize = 25000000L;, but the result is the same. With a small number of records (like 5000 rows) the query works really well.
Please suggest what can be done here to make it work.
Reducers at 66% start doing the actual reduce (0-33% is shuffle, 33-66% is sort). In a join with Hive, the reducer is performing a Cartesian product between the two data sets.
I'm going to guess that there is at least one foreign key that is appearing frequently in all of the data sets. Watch for NULL and default values.
For example, in a join, imagine the key "abc" appears ten times in each of the six tables (10^6). That's a million output records for that one key. If "abc" appears 1000 times in one table, 1000 in another, 1000 in another, then twice in the other three tables, you get 8 billion records (1000^3 * 2^3). You can see how this gets out of hand. I'm guessing there is at least one key that is resulting in a massive number of output records.
This is good general practice to avoid in an RDBMS outside of Hive as well: doing multiple inner joins between many-to-many relationships can get you into a lot of trouble.
For debugging this now and in the future, you could use the JobTracker to find and examine the logs for the reducer(s) in question. You can then instrument the reduce operation to get a better handle on what's going on. Be careful you don't blow it up with logging, of course!
Try looking at the number of records input to the reduce operation for example.
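One quick way to spot such a key (a sketch in HiveQL; account_number and table1 here stand in for whatever the actual join column and tables are called) is to count occurrences per key in each table and look at the worst offenders:

SELECT account_number, COUNT(*) AS cnt
FROM table1                         -- repeat for each of the six tables
GROUP BY account_number
ORDER BY cnt DESC
LIMIT 20;

Keys with very large counts (especially NULLs or default values) in several tables at once are the ones that explode the join output.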
Let's say we have a time consuming query described below :
(SELECT ...
FROM ...) AS FOO
LEFT JOIN (
SELECT ...
FROM ...) AS BAR
ON FOO.BarID = BAR.ID
Let's suppose that
(SELECT ...
FROM ...) AS FOO
returns many rows (let's say 10 M), and every single row has to be joined with data in BAR.
Now let's say we insert the result of
(SELECT ...
FROM ...) AS BAR
into a table, and add the ad hoc index(es) to it.
My question :
How would the performance of the "JOIN" with a live query differ from the performance of the "JOIN" to a table containing the result of the previous live query, to which ad hoc indexes would have been added?
Another way to put it :
If a JOIN is slow, would there be any gain in actually storing and indexing the table that we JOIN to?
The answer is 'Maybe'.
It depends on the statistics of the data in question. The only way you'll find out for sure is to actually load the first query into a temp table, stick a relevant index on it, then run the second part of the query.
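A minimal sketch of that test, in SQL Server syntax since the question doesn't name the DBMS (the source tables and columns are hypothetical stand-ins for the two subqueries):

-- materialize what the BAR subquery would return
SELECT b.ID, b.SomeValue
INTO #BAR
FROM dbo.BarSource AS b;

-- the ad hoc index on the join column
CREATE INDEX IX_BAR_ID ON #BAR (ID);

-- re-run the join against the materialized, indexed copy
SELECT f.*, bar.SomeValue
FROM dbo.FooSource AS f           -- stands in for (SELECT ... FROM ...) AS FOO
LEFT JOIN #BAR AS bar ON f.BarID = bar.ID;

Compare the timings and plans of the two versions before committing to a permanent table.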
I can tell you that if speed is what you want, and it's possible for you to load the results of your first query permanently into a table, then of course your query is going to be quicker.
If you want it to be even faster, depending on which DBMS you are using you could consider creating an index which crosses both tables - if you're using SQL Server they're called 'Indexed Views' or you can also look up 'Reified indexes' for other systems.
Finally, if you want the ultimate in speed, consider denormalising your data and eliminating the join that is occurring on the fly - basically you move the pre-processing (the join) offline at the cost of storage space and data consistency (your live table will be a little behind depending on how frequently you run your updates).
I hope this helps.