We have a server that processes portfolio and securities (inside it) in different actors. For portfolio with smaller number of securities (<20) this works fine. When i increase the number of security count to 1000, encountered following issues:
akka.dispatch.FutureTimeoutException: Futures timed out after [5000] milliseconds
I could bypass this error by increasing timeout inside akka config, is that the right thing to do? In akka versions earlier than 1.2 i could set self.timeout inside the actor but that is deprecated.
The other issue I faced (intermittently) is that the entire server hangs while joining in futures.map code inside my portfolio actor:
//fork out for each security
val listOfFutures = new ListBuffer[Future[Security]]()
for (security <- portfolio.getSecurities.toList) {
val securityProcessor = actorOf[SecurityProcessor].start()
listOfFutures += (securityProcessor ? security) map {
_.asInstanceOf[Security]
}
}
EventHandler.info(this,"joining results from security processors")
//join for each security
val futures = Future.sequence(listOfFutures.toList)
futures.map {
listOfSecurities =>
portfolioResponse = MergeHelper.merge(portfolio, listOfSecurities)
}.get
You do not state which version of Akka you're on, and given my limited time with the crystal ball I'll assume that you're on 1.2.
You can specify a Timeout when you call ask/?
(Also, your code is a bit convoluted, but that I have already solved in your other question.)
Cheers,
√
Related
Stored procedures in Cosmos DB are transactional and run under isolation snapshop with optimistic concurrency control. That means that write conflicts can occur, but they are detected so the transaction is rolled back.
If such a conflict occurs, does Cosmos DB automatically retry the stored procedure, or does the client receive an exception (maybe a HTTP 412 precondition failure?) and need to implement the retry logic itself?
I tried running 100 instances of a stored procedures in parallel that would produce a write conflict by reading the a document (without setting _etag), waiting for a while and then incrementing an integer property within that document (again without setting _etag).
In all trials so far, no errors occurred, and the result was as if the 100 runs were run sequentially. So the preliminary answer is: yes, Cosmos DB automatically retries running an SP on write conflicts (or perhaps enforces transactional isolation by some other means like locking), so clients hopefully don't need to worry about aborted SPs due to conflicts.
It would be great to hear from a Cosmos DB engineer how this is achieved: retry, locking or something different?
You're correct in that this isn't properly documented anywhere. Here's how OCC check can be done in a stored procedure:
function storedProcedureWithEtag(newItem)
{
var context = getContext();
var collection = context.getCollection();
var response = context.getResponse();
if (!newItem) {
throw 'Missing item';
}
// update the item to set changed time
newItem.ChangedTime = (new Date()).toISOString();
var etagForOcc = newItem._etag;
var upsertAccecpted = collection.upsertDocument(
collection.getSelfLink(),
newItem,
{ etag: etagForOcc }, // <-- Pass in the etag
function (err2, feed2, options2) {
if (err2) throw err2;
response.setBody(newItem);
}
);
if (!upsertAccecpted) {
throw "Unable to upsert item. Id: " + newItem.id;
}
}
Credit: https://peter.intheazuresky.com/2016/12/22/documentdb-optimistic-concurrency-in-a-stored-procedure/
SDK does not retry on a 412, 412 failures are related to Optimistic Concurrency and in those cases, you are controlling the ETag that you are passing. It is expected that the user handles the 412 by reading the newest version of the document, obtains the newer ETag, and retries the operation with the updated value.
Example for V3 SDK
Example for V2 SDK
I have an Apache Beam pipeline running on Google Dataflow whose job is rather simple:
It reads individual JSON objects from Pub/Sub
Parses them
And sends them via HTTP to some API
This API requires me to send the items in batches of 75. So I built a DoFn that accumulates events in a list and publish them via this API once they I get 75. This results to be too slow, so I thought instead of executing those HTTP requests in different threads using a thread pool.
The implementation of what I have right now looks like this:
private class WriteFn : DoFn<TheEvent, Void>() {
#Transient var api: TheApi
#Transient var currentBatch: MutableList<TheEvent>
#Transient var executor: ExecutorService
#Setup
fun setup() {
api = buildApi()
executor = Executors.newCachedThreadPool()
}
#StartBundle
fun startBundle() {
currentBatch = mutableListOf()
}
#ProcessElement
fun processElement(processContext: ProcessContext) {
val record = processContext.element()
currentBatch.add(record)
if (currentBatch.size >= 75) {
flush()
}
}
private fun flush() {
val payloadTrack = currentBatch.toList()
executor.submit {
api.sendToApi(payloadTrack)
}
currentBatch.clear()
}
#FinishBundle
fun finishBundle() {
if (currentBatch.isNotEmpty()) {
flush()
}
}
#Teardown
fun teardown() {
executor.shutdown()
executor.awaitTermination(30, TimeUnit.SECONDS)
}
}
This seems to work "fine" in the sense that data is making it to the API. But I don't know if this is the right approach and I have the sense that this is very slow.
The reason I think it's slow is that when load testing (by sending a few million events to Pub/Sub), it takes it up to 8 times more time for the pipeline to forward those messages to the API (which has response times of under 8ms) than for my laptop to feed them into Pub/Sub.
Is there any problem with my implementation? Is this the way I should be doing this?
Also... am I required to wait for all the requests to finish in my #FinishBundle method (i.e. by getting the futures returned by the executor and waiting on them)?
You have two interrelated questions here:
Are you doing this right / do you need to change anything?
Do you need to wait in #FinishBundle?
The second answer: yes. But actually you need to flush more thoroughly, as will become clear.
Once your #FinishBundle method succeeds, a Beam runner will assume the bundle has completed successfully. But your #FinishBundle only sends the requests - it does not ensure they have succeeded. So you could lose data that way if the requests subsequently fail. Your #FinishBundle method should actually be blocking and waiting for confirmation of success from the TheApi. Incidentally, all of the above should be idempotent, since after finishing the bundle, an earthquake could strike and cause a retry ;-)
So to answer the first question: should you change anything? Just the above. The practice of batching requests this way can work as long as you are sure the results are committed before the bundle is committed.
You may find that doing so will cause your pipeline to slow down, because #FinishBundle happens more frequently than #Setup. To batch up requests across bundles you need to use the lower-level features of state and timers. I wrote up a contrived version of your use case at https://beam.apache.org/blog/2017/08/28/timely-processing.html. I would be quite interested in how this works for you.
It may simply be that the extremely low latency you are expecting, in the low millisecond range, is not available when there is a durable shuffle in your pipeline.
How does one run SQL queries with different column dimensions in parallel using dask? Below was my attempt:
from dask.delayed import delayed
from dask.diagnostics import ProgressBar
import dask
ProgressBar().register()
con = cx_Oracle.connect(user="BLAH",password="BLAH",dsn = "BLAH")
#delayed
def loadsql(sql):
return pd.read_sql_query(sql,con)
results = [loadsql(x) for x in sql_to_run]
dask.compute(results)
df1=results[0]
df2=results[1]
df3=results[2]
df4=results[3]
df5=results[4]
df6=results[5]
However this results in the following error being thrown:
DatabaseError: Execution failed on sql: "SQL QUERY"
ORA-01013: user requested cancel of current operation
unable to rollback
and then shortly thereafter another error comes up:
MultipleInstanceError: Multiple incompatible subclass instances of TerminalInteractiveShell are being created.
sql_to_run is a list of different sql queries
Any suggestions or pointers?? Thanks!
Update 9.7.18
Think this is more a case of me not reading documentation close enough. Indeed the con being outside the loadsql function was causing the problem. The below is the code change that seems to be working as intended now.
def loadsql(sql):
con = cx_Oracle.connect(user="BLAH",password="BLAH",dsn = "BLAH")
result = pd.read_sql_query(sql,con)
con.close()
return result
values = [delayed(loadsql)(x) for x in sql_to_run]
#MultiProcessing version
import dask.multiprocessing
results = dask.compute(*values, scheduler='processes')
#My sample queries took 56.2 seconds
#MultiThreaded version
import dask.threaded
results = dask.compute(*values, scheduler='threads')
#My sample queries took 51.5 seconds
My guess is, that the oracle client is not thread-safe. You could try running with processes instead (by using the multiprocessing scheduler, or the distributed one), if the conn object serialises - this may be unlikely. More likely to work, would be to create the connection within loadsql, so it gets remade for each call, and the different connections hopefully don't interfere with one-another.
It looks like previously working approach is deprecated now:
unsupported.dbms.executiontime_limit.enabled=true
unsupported.dbms.executiontime_limit.time=1s
According to the documentation new variables are responsible for timeouts handling:
dbms.transaction.timeout
dbms.transaction_timeout
At the same time the new variables look related to the transactions.
The new timeout variables look not working. They were set in the neo4j.conf as follows:
dbms.transaction_timeout=5s
dbms.transaction.timeout=5s
Slow cypher query isn't terminated.
Then the Neo4j plugin was added to model a slow query with transaction:
#Procedure("test.slowQuery")
public Stream<Res> slowQuery(#Name("delay") Number Delay )
{
ArrayList<Res> res = new ArrayList<>();
try ( Transaction tx = db.beginTx() ){
Thread.sleep(Delay.intValue(), 0);
tx.success();
} catch (Exception e) {
System.out.println(e);
}
return res.stream();
}
The function served by the plugin is executed with neoism Golang package. And the timeout isn't triggered as well.
The timeout is only honored if your procedure code invokes either operations on the graph like reading nodes and rels or explicitly checks if the current transaction is marked as terminate.
For the later, see https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/master/src/main/java/apoc/util/Utils.java#L41-L51 as example.
According to the documentation the transaction guard is interested in orphaned transactions only.
The server guards against orphaned transactions by using a timeout. If there are no requests for a given transaction within the timeout period, the server will roll it back. You can configure the timeout in the server configuration, by setting dbms.transaction_timeout to the number of seconds before timeout. The default timeout is 60 seconds.
I've not found a way how to trigger timeout for a query which isn't orphaned with a native functionality.
#StefanArmbruster pointed a good direction. The timeout triggering functionality can be got with creating a wrapper function in Neo4j plugin like it is made in apoc.
We have a Grails project that runs behind a load balancer. There are three instances of the Grails application running on the server (using separate Tomcat instances). Each instance has its own searchable index. Because the indexes are separate, the automatic update is not enough keeping the index consistent between the application instances. Because of this we have disabled the searchable index mirroring and updates to the index are done manually in a scheduled quartz job. According to our understanding no other part of the application should modify the index.
The quartz job runs once a minute and it checks from the database which rows have been updated by the application, and re-indexes those objects. The job also checks if the same job is already running so it doesn’t do any concurrent indexing. The application runs fine for few hours after the startup and then suddenly when the job is starting, LockObtainFailedException is thrown:
22.10.2012 11:20:40 [xxxx.ReindexJob] ERROR Could not update searchable index, class org.compass.core.engine.SearchEngineException:
Failed to open writer for sub index [product]; nested exception is
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
out:
SimpleFSLock#/home/xxx/tomcat/searchable-index/index/product/lucene-a7bbc72a49512284f5ac54f5d7d32849-write.lock
According to the log the last time the job was executed, re-indexing was done without any errors and the job finished successfully. Still, this time the re-index operation throws the locking exception, as if the previous operation was unfinished and the lock had not been released. The lock will not be released until the application is restarted.
We tried to solve the problem by manually opening the locked index, which causes the following error to be printed to the log:
22.10.2012 11:21:30 [manager.IndexWritersManager ] ERROR Illegal state, marking an index writer as open, while another is marked as
open for sub index [product]
After this the job seems to be working correctly and doesn’t become stuck in a locked state again. However this causes the application to constantly use 100 % of the CPU resource. Below is a shortened version of the quartz job code.
Any help would be appreciated to solve the problem, thanks in advance.
class ReindexJob {
def compass
...
static Calendar lastIndexed
static triggers = {
// Every day every minute (at xx:xx:30), start delay 2 min
// cronExpression: "s m h D M W [Y]"
cron name: "ReindexTrigger", cronExpression: "30 * * * * ?", startDelay: 120000
}
def execute() {
if (ConcurrencyHelper.isLocked(ConcurrencyHelper.Locks.LUCENE_INDEX)) {
log.error("Search index has been locked, not doing anything.")
return
}
try {
boolean acquiredLock = ConcurrencyHelper.lock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
if (!acquiredLock) {
log.warn("Could not lock search index, not doing anything.")
return
}
Calendar reindexDate = lastIndexed
Calendar newReindexDate = Calendar.instance
if (!reindexDate) {
reindexDate = Calendar.instance
reindexDate.add(Calendar.MINUTE, -3)
lastIndexed = reindexDate
}
log.debug("+++ Starting ReindexJob, last indexed ${TextHelper.formatDate("yyyy-MM-dd HH:mm:ss", reindexDate.time)} +++")
Long start = System.currentTimeMillis()
String reindexMessage = ""
// Retrieve the ids of products that have been modified since the job last ran
String productQuery = "select p.id from Product ..."
List<Long> productIds = Product.executeQuery(productQuery, ["lastIndexedDate": reindexDate.time, "lastIndexedCalendar": reindexDate])
if (productIds) {
reindexMessage += "Found ${productIds.size()} product(s) to reindex. "
final int BATCH_SIZE = 10
Long time = TimeHelper.timer {
for (int inserted = 0; inserted < productIds.size(); inserted += BATCH_SIZE) {
log.debug("Indexing from ${inserted + 1} to ${Math.min(inserted + BATCH_SIZE, productIds.size())}: ${productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size()))}")
Product.reindex(productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size())))
Thread.sleep(250)
}
}
reindexMessage += " (${time / 1000} s). "
} else {
reindexMessage += "No products to reindex. "
}
log.debug(reindexMessage)
// Re-index brands
Brand.reindex()
lastIndexed = newReindexDate
log.debug("+++ Finished ReindexJob (${(System.currentTimeMillis() - start) / 1000} s) +++")
} catch (Exception e) {
log.error("Could not update searchable index, ${e.class}: ${e.message}")
if (e instanceof org.apache.lucene.store.LockObtainFailedException || e instanceof org.compass.core.engine.SearchEngineException) {
log.info("This is a Lucene index locking exception.")
for (String subIndex in compass.searchEngineIndexManager.getSubIndexes()) {
if (compass.searchEngineIndexManager.isLocked(subIndex)) {
log.info("Releasing Lucene index lock for sub index ${subIndex}")
compass.searchEngineIndexManager.releaseLock(subIndex)
}
}
}
} finally {
ConcurrencyHelper.unlock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
}
}
}
Based on JMX CPU samples, it seems that Compass is doing some scheduling behind the scenes. From 1 minute CPU samples it seems like there are few things different when normal and 100% CPU instances are compared:
org.apache.lucene.index.IndexWriter.doWait() is using most of the CPU time.
Compass Scheduled Executor Thread is shown in the thread list, this was not seen in a normal situation.
One Compass Executor Thread is doing commitMerge, in a normal situation none of these threads was doing commitMerge.
You can try increasing the 'compass.transaction.lockTimeout' setting. The default is 10 (seconds).
Another option is to disable concurrency in Compass and make it synchronous. This is controlled with the 'compass.transaction.processor.read_committed.concurrentOperations': 'false' setting. You might also have to set 'compass.transaction.processor' to 'read_committed'
These are the compass settings we are currently using:
compassSettings = [
'compass.engine.optimizer.schedule.period': '300',
'compass.engine.mergeFactor':'1000',
'compass.engine.maxBufferedDocs':'1000',
'compass.engine.ramBufferSize': '128',
'compass.engine.useCompoundFile': 'false',
'compass.transaction.processor': 'read_committed',
'compass.transaction.processor.read_committed.concurrentOperations': 'false',
'compass.transaction.lockTimeout': '30',
'compass.transaction.lockPollInterval': '500',
'compass.transaction.readCommitted.translog.connection': 'ram://'
]
This has concurrency switched off. You can make it asynchronous by changing the 'compass.transaction.processor.read_committed.concurrentOperations' setting to 'true'. (or removing the entry).
Compass configuration reference:
http://static.compassframework.org/docs/latest/core-configuration.html
Documentation for the concurrency of read_committed processor:
http://www.compass-project.org/docs/latest/reference/html/core-searchengine.html#core-searchengine-transaction-read_committed
If you want to keep async operations, you can also control the number of threads it uses. Using compass.transaction.processor.read_committed.concurrencyLevel=1 setting would allow asynchronous operations but just use one thread (the default is 5 threads). There are also the compass.transaction.processor.read_committed.backlog and compass.transaction.processor.read_committed.addTimeout settings.
I hope this helps.