Why does a select with consistent read from Amazon SimpleDB yield different results?

I have a domain on SimpleDB and I never delete from it.
I am doing the following query on it.
select count(*) from table_name where last_updated > '2012-09-25';
Though I am setting the consistent read parameter to true, it still returns different results on different executions. As I am not deleting anything from this domain, ideally the results should be in increasing order, but that is not happening.
Am I missing something here?

If I understand your use case correctly, you might be misreading the semantics of the ConsistentRead parameter in the context of Amazon SimpleDB; see Select:
When set to true, ensures that the most recent data is returned. For
more information, see Consistency
The phrase most recent can admittedly be misleading, but it does not address or affect result ordering in any way; rather, it means most recently updated: ConsistentRead guarantees that every update operation preceding your select statement is already visible to that select operation. See the description:
Amazon SimpleDB keeps multiple copies of each domain. When data is
written or updated, all copies of the data are updated. However, it
takes time for the update to propagate to all storage locations. The
data will eventually be consistent, but an immediate read might not
show the change. If eventually consistent reads are not acceptable for
your application, use ConsistentRead. Although this operation might
take longer than a standard read, it always returns the last updated
value. [emphasis mine]
The linked section on Consistency provides more details and an illustration regarding this concept.
Sort order
To achieve the results you presumably desire, a simple order by statement should do the job, e.g.:
select * from table_name where last_updated > '2012-09-25' order by last_updated;
There are a couple of constraints/subtleties regarding this operation on SimpleDB, so make sure to skim the short documentation of Sort for details.
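For concreteness, here is a minimal sketch of such a query issued through the AWS SDK for Java; the domain and attribute names are taken from the question, while credentials setup and NextToken pagination are omitted:

import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;
import com.amazonaws.services.simpledb.model.SelectResult;

public class ConsistentSortedSelect {
    public static void main(String[] args) {
        // Uses the default credential provider chain.
        AmazonSimpleDBClient sdb = new AmazonSimpleDBClient();

        // ConsistentRead guarantees visibility of preceding writes;
        // the 'order by' clause is what controls result ordering.
        SelectRequest request = new SelectRequest(
                "select * from table_name where last_updated > '2012-09-25' order by last_updated")
                .withConsistentRead(true);

        SelectResult result = sdb.select(request);
        for (Item item : result.getItems()) {
            System.out.println(item.getName());
        }
    }
}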

Related

CloudKit query indices validity

Doc says:
WARNING Query indexes are updated asynchronously so they are not
guaranteed to be current. If you query for records that you recently
changed and do not allow enough time for those changes to be processed,
the query results may be incorrect. The results may not contain the
correct records and the records may be out of order.
Valid query results are crucial for me. It looks like there is currently no option other than allowing some time for changes to be processed. Is there an amount of time considered safe? Is there any case where a query is as reliable as a fetch? I would like to avoid forward referencing (highly recommended by Apple).

Is it possible to execute read-only Cypher queries from Java?

I'd like to know just what the title says.
The reason I'd want this is to permit constrained read-only cypher queries to be executed; the data results would later be interpreted and serialized by a separate API layer.
I've seen code that makes basic assumptions in an attempt to mimic this behavior, e.g. the code might filter out any Cypher query that contains certain special words associated with write query structures (merge, create, delete, set, and so on).
This approach tends to be limited and naive though; if it very simply looks for those tokens, it would prevent a query like MATCH n WHERE n.label =~ '.*create.*' RETURN n even though it's a read-only query.
I'd really prefer not to do a full parse of a candidate query and then descend through the AST trying to figure out whether something is read-only or not (although I would gladly accept an answer that shows how to do this easily in Java).
EDIT - I'm aware it's possible to start the entire database in read-only mode via the configuration property read_only=true, but this would be undesirable; no other aspect of the Java API would be able to change the database.
EDIT 2 - I found another possible strategy, but I'm not sure of its advisability. Comments welcome on this, and potential downsides:
try (Transaction ignore = graphDb.beginTx()) {
    ExecutionResult result = executionEngine.execute(query);
    // Do nifty stuff with result, then...
    // Force the transaction to fail so it rolls back on close.
    ignore.failure();
}
The idea here is that if queries happen within transactions and the transaction is always force-failed, then nothing can ever be written to the DB no matter what the result.
Read-only Cypher is not (yet) directly supported. However, I can think of two workarounds:
1) Assuming you're running a Neo4j enterprise cluster: you can set read_only=true on one instance. That instance is then used for the read-only queries, while the other cluster instances are used for r/w. A load balancer in front of the cluster can be set up to send the requests to the right instance.
2) Use a TransactionEventHandler that vetoes a transaction if its TransactionData contains write operations. Just for fun I've spent some minutes implementing that, see https://github.com/sarmbruster/read-only-cypher - feedback is appreciated. A minimal sketch of the same idea is below.
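For reference, here is roughly what the vetoing handler from option 2 looks like against the embedded Neo4j API (class and message names are mine; see the linked repository for the complete implementation):

import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

// Vetoes any transaction containing write operations: throwing from
// beforeCommit() causes the transaction to be rolled back.
class ReadOnlyGuard extends TransactionEventHandler.Adapter<Void> {
    @Override
    public Void beforeCommit(TransactionData data) throws Exception {
        boolean hasWrites = data.createdNodes().iterator().hasNext()
                || data.deletedNodes().iterator().hasNext()
                || data.createdRelationships().iterator().hasNext()
                || data.deletedRelationships().iterator().hasNext()
                || data.assignedNodeProperties().iterator().hasNext()
                || data.removedNodeProperties().iterator().hasNext()
                || data.assignedRelationshipProperties().iterator().hasNext()
                || data.removedRelationshipProperties().iterator().hasNext();
        if (hasWrites) {
            throw new RuntimeException("write operations are not permitted");
        }
        return null;
    }
}

// Registered once on the database:
// graphDb.registerTransactionEventHandler(new ReadOnlyGuard());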

Is it possible in Ruby to set a specific ActiveRecord call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products have the flag set to false, I can do a call something like this:
Product.where(:exported => false).count
The problem I have is that even the count takes a long time, because the table of 1 million products is being written to. More specifically, exports are happening, and the value I'm interested in counting is ever-changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh, I should mention: Ruby 1.9.3, Heroku, and PostgreSQL.
Now, if I'm missing another way to get the count, I'd be excited to try that.
Oh, and one last thing: this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternately, use system statistics to get a fast approximation; a sketch of that follows below.
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
As soon as a query begins to execute, it runs against a frozen read-only snapshot, because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run; it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.
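To make the statistics approach concrete, here is a hedged sketch, written in Java/JDBC for illustration (the table name and connection details are assumed); the same SQL can be issued from ActiveRecord via the raw connection. Note that reltuples is the planner's estimate of the total row count, refreshed by VACUUM/ANALYZE, not an exact or filtered figure:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ApproxCount {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT reltuples::bigint FROM pg_class WHERE relname = ?")) {
            ps.setString(1, "products");
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    // Planner estimate for the whole table; a filtered estimate
                    // would need EXPLAIN output or a trigger-maintained counter.
                    System.out.println("approx rows: " + rs.getLong(1));
                }
            }
        }
    }
}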

Uniqueness in BatchInserter of Neo4j

I am using a "BatchInserter" to build a graph (in a single thread). I want to make sure nodes (and possibly relationships) are unique. My current solution is to check whether the node exists in the following manner:
String name = (String) nodeProperties.get(IndexKeys.CATEGORY_KEY);
// Look the name up once and reuse the hits.
IndexHits<Long> hits = index.get(IndexKeys.CATEGORY_KEY, name);
if (hits.size() > 0)
    return hits.getSingle();
// Not found: create the node, index it, and flush so the next
// lookup can see it.
long nodeID = inserter.createNode(nodeProperties, categoryLabel);
index.add(nodeID, nodeProperties);
index.flush();
It seems to be working fine, but as you can see it is I/O expensive (flushing on every new addition, which I believe amounts to a Lucene "commit" command). This is slowing down my code considerably.
I am aware of put-if-absent and UniqueFactory. As documented:
By using put-if-absent functionality, entity uniqueness can be guaranteed using an index.
Here the index acts as the lock and will only lock the smallest part
needed to guarantee uniqueness across threads and transactions. To
get the more high-level get-or-create functionality make use of
UniqueFactory
However, these are for transaction-based interactions with the graph. What I would like to do is ensure uniqueness of nodes and possibly relationships with batch insertion semantics, in a way that is faster than my current setup.
Any pointers would be much appreciated.
Thank you
You should investigate the MERGE keyword in Cypher. I believe this will permit you to exploit your auto-indexes without requiring you to use them yourself. More broadly, you might want to see if you can formulate your bulk load in a way that is conducive to piping large volumes of Cypher queries through the neo4j-shell.
Finally, as general pointers and background, you should check out this information on bulk loading.
When I encountered this problem, I just decided to go tyrant and manage the index values on my own. Can't you do the same? I mean, ensure uniqueness before you do the insertions? A sketch of that idea follows.
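To sketch that idea (class and field names are mine, and it assumes all category names fit comfortably in memory): keep an in-process map from the unique key to the created node id, consult it instead of the Lucene index on every insert, and flush the index once at the end of the batch.

import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.Label;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;

// Dedupes in memory during the batch so the index never needs a
// per-insertion flush.
class CategoryCache {
    private final Map<String, Long> seen = new HashMap<String, Long>();
    private final BatchInserter inserter;
    private final BatchInserterIndex index;
    private final Label categoryLabel;

    CategoryCache(BatchInserter inserter, BatchInserterIndex index, Label categoryLabel) {
        this.inserter = inserter;
        this.index = index;
        this.categoryLabel = categoryLabel;
    }

    long getOrCreate(Map<String, Object> nodeProperties) {
        String name = (String) nodeProperties.get(IndexKeys.CATEGORY_KEY);
        Long nodeId = seen.get(name);
        if (nodeId == null) {
            nodeId = inserter.createNode(nodeProperties, categoryLabel);
            index.add(nodeId, nodeProperties); // added, but not flushed here
            seen.put(name, nodeId);
        }
        return nodeId;
    }
}

// Call index.flush() once after the whole batch rather than per insertion.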

Breeze.js reverses the query order when executed locally

So, a slightly weird one that I can't really find any cause for.
My app is set up to basically run almost all queries through one standard method that handles things like querying against the local cache etc. So essentially the queries are all pretty standardised.
Then I have just one query with a strange orderBy issue. The query includes a specific orderBy clause, and if I run the query the first time, the cache is checked, no results are found, so it queries the remote data source and gets the data, all correct and ordered.
When I return to the page, the query is executed again, this time against the local cache, where it does find the data and returns it... the weird part is that the order is reversed. Bear in mind the parameters going in are exactly the same; the only difference is that the query is executed with executeQueryLocally and results are found and returned (the first time, it is also executed with executeQueryLocally, it's just that no results are found, so it goes on to execute remotely).
I really can't see any specific reason why the results are reversed (I say they are reversed, but I can't actually guarantee that; they might just be unordered and happen to come out in reversed order).
This isn't really causing a headache; it's just weird, especially as it appears to happen for only one query.
Thoughts?
Server-side queries and client-side queries are not guaranteed to return results in any specific order UNLESS you have an "orderBy" clause specified. The reason the order may differ without an "orderBy" clause is that the data is stored very differently on the server vs. the client, and unless a specific order is specified, both will attempt to satisfy the query as efficiently as possible given their storage implementation.
One interesting side note: per the ANSI SQL-92 standard, even your SQL database is not required to return data in the same order for the same query (again, unless you have an ORDER BY clause). It's just very rare to see it happen.
