Cassandra rpc timeout occurring during heavy writes and deletes - timeout

I'm using cassandra 2.0, and I have created a column family that looks like this:
CREATE TABLE user_id_timestamp_index (
user_id int,
timestamp text,
PRIMARY KEY (user_id, timestamp)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
I've written over 2 million rows to this table without any issues, and I perform numerous deletes as well.
The problem arises after about 10k deletes or so in rapid succession, and I start encountering numerous rpc_timeouts. A simple "delete from user_id_timestamp_index where user_id = 5 AND timestamp = '12345'" via cqlsh fails during this period.
Things I've noticed and tried:
During rpc timeouts, Load Average on 2 nodes (out of 5) shoot up to about 50.
Compaction is done almost every 5 minutes during these load-intensive writes and deletes.
During rpc_timeouts, tpstats shows pending mutation stages:
MutationStage 64 (active) 395 (pending) 48182373 (completed) 0 0
Timeouts tend to occur when memtable data size exceeds 3 mb for this CF.
After I perform a nodetool flush, pending mutations goes to zero and rpc time disappears, until memtable size creeps up again to past 3 mb.
My question is, there a configuration I can tune? For example, is the solution to simply force a memtable flush on this column family every 5 minutes? Lessen the write load on this table? A way to have faster writes and reduce pending stages? Or is there a better solution?

If you're experiencing GC pressure (which you can tell from the GCInspector lines in the log), you can reduce the amount of memory used by memtables by adjusting memtable_total_space_in_mb in cassandra.yaml. You may also need to reduce the key or row cache settings instead or as well.

Related

Over 8 minutes to read a 1 GB side input view (List)

We have a 1 GB List that was created using View.asList() method on beam sdk 2.0. We are trying to iterate through every member of the list and do, for now, nothing significant with it (we just sum up a value). Just reading this 1 GB list is taking about 8 minutes to do so (and that was after we set the workerCacheMb=8000, which we think means that the worker cache is 8 GB). (If we don't set the workerCacheMB to 8000, it takes over 50 minutes before we just kill the job.). We're using a n1-standard-32 instance, which should have more than enough RAM. There is ONLY a single thread reading this 8GB list. We know this because we create a dummy PCollection of one integer and we use it to then read this 8GB ViewList side-input.
It should not take 6 minutes to read in a 1 GB list, especially if there's enough RAM. EVEN if the list were materialized to disk (which it shouldn't be), a normal single NON-ssd disk can read data at 100 MB/s, so it should take ~10 seconds to read in this absolute worst case scenario....
What are we doing wrong? Did we discover a dataflow bug? Or maybe the workerCachMB is really in KB instead of MB? We're tearing our hair out here....
Try to use setWorkervacheMb(1000). 1000 MB = Around 1GB. It will pick the side input from cache of each worker node and that will be fast.
DataflowWorkerHarnessOptions options = PipelineOptionsFactory.create().cloneAs(DataflowWorkerHarnessOptions.class);
options.setWorkerCacheMb(1000);
Is it really required to iterate 1 GB of side input data every time or you are need some specific data to get during iteration?
In case you need specific data then you should get it by passing specific index in the list. Getting data specific to index is much faster operation then iterating whole 1GB data.
After checking with the Dataflow team, the rate of 1GB in 8 minutes sounds about right.
Side inputs in Dataflow are always serialized. This is because to have a side input, a view of a PCollection must be generated. Dataflow does this by serializing it to a special indexed file.
If you give more information about your use case, we can help you think of ways of doing it in a faster manner.

Spark JobServer, memory settings for release

I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
Sync with the main remote database, it makes a dump of some of the server's tables, reduce and aggregates the data, save the result as a parquet file and cache it as a sql table in memory. This operation will be done every day;
Queries, when the sync operation is finished, users can perform SQL complex queries on the aggregated dataset, (eventually) exporting the result as csv file. Every user can do only one query at time, and wait for its completion.
The biggest table (before and after the reduction, which include also some joins) has almost 30M of rows, with at least 30 fields.
Actually I'm working on a dev machine with 32GB of ram dedicated to the job server, and everything runs smoothly. Problem is that in the production one we have the same amount of ram shared with a PredictionIO server.
I'm asking how determine the memory configuration to avoid memory leaks or crashes for spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you
Take an example,
if you have a server with 32g ram.
set the following parameters :
spark.executor.memory = 32g
Take a note:
The likely first impulse would be to use --num-executors 6
--executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity
of the NodeManagers. The application master will take up a core on one
of the nodes, meaning that there won’t be room for a 15-core executor
on that node. 15 cores per executor can lead to bad HDFS I/O
throughput.
A better option would be to use --num-executors 17 --executor-cores 5
--executor-memory 19G. Why?
This config results in three executors on all nodes except for the one
with the AM, which will have two executors. --executor-memory was
derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47
~ 19.
This is explained here if you want to know more :
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

neo4j not creating index on large dataset

I am creating an index on a very large (8.2M node, 63M property) neo4j db instance.
CREATE INDEX ON :Article(lowerTitle)
It takes a negligible amount of time to issue the command, and the index (presumably) begins to process.
I have a max java heap of 100GB, and 40 cores (it's a large server). It is, however stupidly, a HDD.
Right after issuing the index command, my core usage spikes up to very efficient usage. After about 20 seconds, it drops to using almost no processor power, but about 90% of MEM.
I have left it running for 3 hours, and the index is still not created (or at least, there are no improvements for simple MATCH queries on single parameters, which average out at about 16 seconds).
MATCH (arti {lowerTitle: "quantum mechanics"}) RETURN arti
Is this reasonable? What is taking so long? Am I doing something wrong?
NOTE: I have also noticed that my total database size (38.02GB) has not increased over the 3 hours
For verifying that your index is online, issue the :schema command in the browser.
You should see your index status.
ONLINE means OK
POPULATING means it is still populating the indices
FAILED means, well, failed
Your query will never run fast, because you are not using a label, so no indices will be used, change it to :
MATCH (arti:Article {lowerTitle: "quantum mechanics"}) RETURN arti

NEsper memory usage of "output" keyword

I have many EPL statements that output a period of time (1~24 hours), and following is my statement
"SELECT MessageID, VName, count(VName) as count FROM DDIEvent(MajorType=4).std:groupwin(VName).win:time(3 hour).win:length(10) group by VName having count(VName) >= 10 output last every 3 hour"
If there is no limit of the length window, my case will retain around 700K records in 3 hours.
And in above, my test case only have 100 different VName. For my understanding, there will have maximum 1000 records keep in memory at the same time, (100 * 10[length]) am i right?
But the memory usage of my application will keep growing until output to listener.
The memory usage almost the same as the sample without length window.
And after output to listener the memory significantly fall down.
Then, another cycle begin, memory grow slowly until 3 hour later.
I already check the document, do not find the memory related topic of the "output" keyword.
Does anyone knows what is the really root cause? And how to avoid? Or just my EPL's problem?
Thank you~
If your query removes the "MessageId" from the select clause, it becomes a regular aggregation query. You could do a "last(MessageId)" instead. Because "MessageId" is in there the rows that the engine delivers are a row for each event rather then a row for each aggregation group.

Efficient creation of Neo4j relationship indexes

Can you please explain the best way to add relationship indexes to a Neo4j database created using the BatchInserter?
Our database contains about 30 million nodes and about 300 million relationships. If we build this without any indexes then it takes about 10 hours (just calls to BatchInserter.createNode and BatchInserter.createRelationship).
However if we also try to create relationship indexes using LuceneBatchInserterIndexProvider with repeated calls to index.add then the process takes 12 hours to add everything but then gets stuck on indexProvider.shutdown and doesn't complete. The longest I have left it is 3 days. Can you please explain what it is doing at this point? I expected the work to be done during the calls to index.add. What is going on during shutdown that is taking so long?
Our PC has 64GB RAM and we have allocated 40GB to the JVM. During this shutdown step, Windows reports that 99% of the memory is in use (far more than allocated to the JVM) and the computer becomes almost unusable.
The configuration settings I am using are:
neostore.nodestore.db.mapped_memory = 1G
neostore.propertystore.db.mapped_memory = 1G
neostore.propertystore.db.index.mapped_memory = 1M
neostore.propertystore.db.index.keys.mapped_memory = 1M
neostore.propertystore.db.strings.mapped_memory = 1G
neostore.propertystore.db.arrays.mapped_memory = 1M
neostore.relationshipstore.db.mapped_memory = 10G
We've tried changing some of these but it didn't appear to make any difference.
We have also tried adding the relationship indexes as a separate step after first building the database without any indexes. In this case we used GraphDatabaseFactory.newEmbeddedDatabaseBuilder and GraphDatabaseService.index().forRelationships. Doing it this way seems to work although it was estimated that it would take around 6 days to complete. We have tried invoking commit at various different intervals which makes some difference but not significant. Most of the time seems to be spent just iterating over the relationships.
The only thing I can think of that may be abnormal about our data is that the relationships have about 20 properties on them. But even creating an index on just 1 of these properties doesn't work.
The file sizes without any indexes are:
neostore.nodestore.db 400MB
neostore.propertystore.db 100GB
neostore.propertystore.db.strings 2GB
neostore.relationshipstore.db 10GB
Can you please give us some advice on how to get this working either during the BatchInserter process or as a separate step?
We are using version 2.0.1 of the Neo4j jars.
Thanks, Damon

Resources