I'm building a model against Amazon's SQS Standard Queue, which can deliver updates out of order. My goal is to order them properly. I am long polling to copy all data from the queue into my DB.
Table example: let's say I fetch some messages and process them.
id | publish_time | run_at | payload
1 | 1:11pm | nil | ...
2 | 1:12pm | nil | ...
3 | 1:13pm | nil | ...
4 | 1:14pm | nil | ...
5 | 1:15pm | nil | ...
Then I fetch some more, and we can see that a few of the new messages arrived out of order.
id | publish_time | run_at | payload
1 | 1:11pm | 1:15pm | ...
2 | 1:12pm | 1:15pm | ...
3 | 1:13pm | 1:15pm | ...
4 | 1:14pm | 1:15pm | ...
5 | 1:15pm | 1:15pm | ...
6 | 1:13pm | nil | ...
7 | 1:14pm | nil | ...
8 | 1:16pm | nil | ...
If I order by publish_time, you can see that the queue needs to be re-processed from ID=6 onward to make sure messages are processed in order.
id | publish_time | run_at | payload
1 | 1:11pm | 1:15pm | ...
2 | 1:12pm | 1:15pm | ...
3 | 1:13pm | 1:15pm | ...
6 | 1:13pm | nil | ...
4 | 1:14pm | 1:15pm | ...
7 | 1:14pm | nil | ...
5 | 1:15pm | 1:15pm | ...
8 | 1:16pm | nil | ...
There is value in processing data accurately, and very little cost to processing twice, so re-running is not a problem.
I am mostly curious how best to find the oldest item that has not been run, and start running from that moment forward.
Currently doing:
# fetch the oldest publish_time that has not been run
first_publish_time = AnyOfferChange.where(run_at: nil).minimum(:publish_time)
if first_publish_time
# start there, and process in ascending order
AnyOfferChange.order("publish_time DESC").where("publish_time >= ?", first_publish_time).reverse.each(&:process!)
end
It feels quite fragile; I'd like to fetch the position and use it as a limit.
limit = AnyOfferChange.where(run_at: nil).order("publish_time ASC").pluck("POSITION SOMETHIN(SOMETHING)").first
if limit > 0
# start there, and process in ascending order
AnyOfferChange.order("publish_time DESC").limit(limit).reverse.each(&:process!)
end
The following query will give you the oldest publish_time:
AnyOfferChange.where(run_at: nil).minimum(:publish_time)
Or, if you want one record:
AnyOfferChange.where(run_at: nil).order(publish_time: :asc).first
This limits the SQL query to the oldest row that has not run.
Fetch all records that have not run, from old to new:
result = AnyOfferChange.where(run_at: nil).order(publish_time: :asc)
# or
result = AnyOfferChange.where(run_at: nil).order(:publish_time) # Defaults to :asc
result.each(&:process!) # Process result. See note below for batch info.
Fetch all records that have not run and share exactly the oldest publish_time:
# See note below to prevent unwanted SQL execution for the statements
# below when executing in the terminal.
# Create shorthand.
any_offer_changes = AnyOfferChange.arel_table
# Build query parts.
not_ran = AnyOfferChange.where(run_at: nil)
oldest_publish_time = not_ran.select(any_offer_changes[:publish_time].minimum)
# All records that have not run, with the oldest publish time.
result = not_ran.where(publish_time: oldest_publish_time)
result.each(&:process!) # Process result. See note below for batch info.
This will result in fetching all records with the lowest publish time in one SQL query, using a sub-query.
The reason I use a different way than AnyOfferChange.where(run_at: nil).minimum(:publish_time) to fetch the minimum for the last part is that minimum executes immediately, which breaks the chain and creates multiple SQL queries instead of one. AnyOfferChange.where(run_at: nil).select(any_offer_changes[:publish_time].minimum) keeps the chain intact when used in a where statement.
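If you want to check this, #to_sql shows the single query with the sub-query inlined (a rough sketch; the exact SQL and quoting depend on your adapter):
not_ran.where(publish_time: oldest_publish_time).to_sql
# => SELECT "any_offer_changes".* FROM "any_offer_changes"
#    WHERE "any_offer_changes"."run_at" IS NULL
#    AND "any_offer_changes"."publish_time" IN (
#      SELECT MIN("any_offer_changes"."publish_time")
#      FROM "any_offer_changes" WHERE "any_offer_changes"."run_at" IS NULL)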
Notes:
Unwanted SQL execution
When run one by one in a console, the statements above result in multiple queries, since #inspect (used to show you the result) triggers the SQL to execute. In the terminal, follow each statement with ; nil to prevent execution while building a #where chain. This is not needed when the code is executed in a script.
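For example, a hypothetical console session:
not_ran = AnyOfferChange.where(run_at: nil); nil
# Without the trailing "; nil" the console would execute the query
# just to print the relation via #inspect.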
using "batches"
For large amounts of records you may have to limit the resulting values. Rails support batches, but they don't respect the given order. To keep the order you can create your own batch, although probably less efficient. This can be done like so:
result = AnyOfferChange.where(run_at: nil).order(:publish_time).limit(100)
result.each(&:process!) while result.reload.any?
This assumes you set the run_at attribute in #process!; otherwise the above results in an endless loop.
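For illustration, a minimal sketch of such a #process! (the payload handling is hypothetical; the run_at update is what terminates the loop above):
class AnyOfferChange < ApplicationRecord
  def process!
    # ... apply the payload to your domain model here ...
    update!(run_at: Time.current) # mark the record as run
  end
end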
Related
I have an InfluxDB measurement currently set up with the following "schema":
+----+-------------+-----------+
| ts | cost(field) | type(tag) |
+----+-------------+-----------+
| 1 | 10 | 'a' |
| 1 | 20 | 'b' |
| 2 | 12 | 'a' |
| 2 | 18 | 'b' |
| 2 | 22 | 'c' |
+----+-------------+-----------+
I am trying to write a query that will group my table by timestamp and get the delta between field values of two different tags. If I want the delta between tag 'a' and tag 'b', it should give me the following result (note that I ignore tag 'c'; at ts=1 the delta between 10 and 20 is 10, and at ts=2 the delta between 12 and 18 is 6):
+----+-----------+------------+
| ts | type(tag) | delta_cost |
+----+-----------+------------+
| 1 | 'a' | 10 |
| 2 | 'b' | 6 |
+----+-----------+------------+
Is it something Influx can do or am I using the wrong tool?
Just managed to answer my own question. While one of the obvious ways would be performing a self-join, Influx does not support joins anymore. We can, however, use nested selects in the following format:
SELECT MEAN(cost_a) - MEAN(cost_b) AS delta_cost
FROM
  (SELECT cost AS cost_a FROM tablename WHERE "type" = 'a'),
  (SELECT cost AS cost_b FROM tablename WHERE "type" = 'b')
GROUP BY time(60s)
Since I am getting my data every 60 seconds anyway, and I have a guarantee of just one point per tag per 60 seconds, I can use GROUP BY and take the MEAN without any problems.
An Objective-C iOS app integrates an SQLite database with a set of rows, each identified by an ID. For example:
| id | user_name | age |
------------------------------
| 1 | johnny | 33 |
| 2 | mark | 30 |
| 3 | maroccia | 50 |
Asynchronously, the app receives the same set of records, but some of them are modified: it has to update (or replace) only the modified records, ignoring the unmodified ones.
For example, the app receives such updated rows:
| id | user_name | age |
------------------------------
| 1 | johnny | 33 |
| 2 | mark | 30 |
| 3 | ballarin | 50 | <------ CHANGED RECORD
In this case, only the third record is changed and the app should update or replace just it, ignoring the first two.
Obviously, a plain INSERT OR REPLACE does not suit me, because it would rewrite all the records. Is there some procedure in SQLite (or Objective-C) that can help me update only the modified records?
Thanks
You could simply replace all rows; the result is the same.
If you do not want to rewrite rows that have not actually changed, you have to compare all column values. If you have both the old rows and the received rows in separate tables, you can compare entire rows with a compound query; EXCEPT keeps only those ReceivedData rows that do not appear identically in MyData, i.e. the new or changed ones:
INSERT OR REPLACE INTO MyData
SELECT * FROM ReceivedData
EXCEPT
SELECT * FROM MyData;
This is my example history table.
id | time | price
1 | 1-02-17 | 15.99
1 | 1-03-17 | 15.99
1 | 1-04-17 | 15.99
1 | 1-05-17 | 20.99
1 | 1-06-17 | 20.99
1 | 1-07-17 | 15.99
1 | 1-08-17 | 15.99
I want to get an output similar to this:
1-02-17 | 15.99
1-05-17 | 20.99
1-07-17 | 15.99
Essentially I want to group the prices AFTER sorting by date.
Can this be done with Rails?
If the number of price points is not high, you can fetch the records from the db and do this:
price_ranges, prev_price = {}, nil
# Note: #find_each would ignore order(:date) (it batches by primary key), so use #each.
ProductHistory.where(product_id: 1).order(:date).each do |history|
  price_ranges[history.date] = history.price if history.price != prev_price
  prev_price = history.price
end
At the end of this loop, you would get:
{ "1-02-17": 15.99, "1-05-17": 20.99, "1-07-17": 15.99 }
Note: the keys would be Date objects instead of strings if the db column is a date/datetime.
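For comparison, the same grouping can be written with Enumerable#slice_when (a sketch, assuming Ruby >= 2.2 and the same ProductHistory model and columns):
histories = ProductHistory.where(product_id: 1).order(:date).to_a
price_ranges = histories
  .slice_when { |a, b| a.price != b.price } # start a new chunk whenever the price changes
  .map { |chunk| [chunk.first.date, chunk.first.price] }
  .to_h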
If the number of records is high, I would suggest doing it in your db. This question looks very similar to your use case, and if you have MySQL you can use it off-the-shelf:
How to eliminate only continuous duplicates but not all duplicates in a select query (MySQL)?
I'm migrating some properties in a labeled node and the query performance is very poor.
The old property is callerRef and the new property is code. There are 17M nodes that need to be updated, which I want to process in batches. Absence of the code property on an entity indicates that it has not yet been migrated.
PROFILE MATCH (e:Entity) WHERE NOT has(e.code) WITH e LIMIT 1000000 SET e.key = e.callerKeyRef, e.code = e.callerRef;
There is one index in the Entity label and that is for code.
schema -l :Entity
Indexes
ON :Entity(code) ONLINE
No constraints
The heap has 8 GB allocated, running Neo4j 2.2.4. The problem, if I'm reading the plan right, is that ALL nodes with the label are being hit even though a LIMIT clause is specified. I would have thought that in an unordered query where a limit is requested, processing would stop once the limit criteria are met.
+-------------------+
| No data returned. |
+-------------------+
Properties set: 2000000
870891 ms
Compiler CYPHER 2.2
+-------------+----------+----------+-------------+--------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+-------------+----------+----------+-------------+--------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 1000000 | 6000000 | e | PropertySet; PropertySet |
| Eager | 1000000 | 0 | e | |
| Slice | 1000000 | 0 | e | { AUTOINT0} |
| Filter | 1000000 | 16990200 | e | NOT(hasProp(e.code)) |
| NodeByLabel | 16990200 | 16990201 | e | :Entity |
+-------------+----------+----------+-------------+--------------------------+
Total database accesses: 39980401
Am I missing something obvious? TIA
Indexes are supported only for = and IN (which are basically the same, because the Cypher compiler transforms all = operations into IN).
Neo4j is a schema-less database, so if the property is absent, there is no index data for it. That is why it needs to scan all nodes.
My suggestions:
First step: add the code property to all necessary nodes with some default "falsy" value, e.g. "none".
Then make the update in batches using a WHERE e.code = "none" clause, so the existing :Entity(code) index can be used.
It might be faster to first assign a new label, say ToDo, to all the nodes that have yet to be migrated:
MATCH (e:Entity)
WHERE NOT HAS (e.code)
SET e:ToDo;
Then, you can iteratively match 1000000 (or whatever) ToDo nodes at a time, removing the ToDo label after migrating each node:
MATCH (e:ToDo)
WITH e
LIMIT 1000000
SET e.key = e.callerKeyRef, e.code = e.callerRef
REMOVE e:ToDo;
Is there any way to tell a Cucumber table's diff! method that I don't care about the row order?
Example:
The feature says:
| start | eat | left |
| 12 | 5 | 7 |
| 20 | 5 | 15 |
The code outputs
| start | eat | left |
| 20 | 5 | 15 |
| 12 | 5 | 7 |
which is OK for me. Cucumber would fail nonetheless, because it also checks the row order (which is nice in most cases).
Couldn't find a solution for it :(
Maybe you can sort the rows in both tables (the expected value and the actual value) in a way that ensures a unique order.
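For example, a sketch of that idea in a step definition (the step name and the actual_rows helper are hypothetical; assumes RSpec expectations):
Then(/^the output should contain the following rows in any order$/) do |expected|
  _header, *expected_rows = expected.raw
  # Sorting both row sets gives a canonical order, so the comparison
  # no longer depends on the order in which the code emitted the rows.
  expect(actual_rows.sort).to eq(expected_rows.sort)
end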
I would have expected there to be a corresponding Ruby version of the following Java method:
cucumber.api.DataTable#unorderedDiff(cucumber.api.DataTable)
This is in the cucumber-core artifact.