DSE 4.8 Solr deep paging performance issue - datastax-enterprise

I have a question about Solr deep-paging performance. I have installed a DSE 4.8 cluster (Cassandra + Solr) and things were going well until we hit some issues with deep paging. Solr performs well for searching and indexing, but we see a serious lack of performance when deep paging.
Each node holds 110GB of data (4 nodes in total: 2 x C* + 2 x Solr). Solr resolves searches very quickly, and paging through the result set (20 rows per page) is fine; the problem comes when I try to fetch an entire result set of 3,000 or 5,000 rows from the database. For that scenario I am using the deep-paging technique, and unfortunately I can't fetch more than 1,000 rows every ~10 seconds. Is there any way to improve deep-paging performance on DSE Solr?
Cheers
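For reference, Solr 4.7+ supports cursor-based deep paging (cursorMark), which avoids the cost of large start= offsets; whether DSE 4.8's bundled Solr exposes it should be verified against the DSE docs. A minimal sketch of the loop shape, with the HTTP call abstracted behind a fetch(params) callable (endpoint, query, and sort field are illustrative assumptions):

```python
# Sketch of Solr cursorMark deep paging (Solr 4.7+). fetch(params) stands
# in for an HTTP GET against /solr/<core>/select returning parsed JSON.
# The query and sort values are illustrative; cursorMark requires a sort
# that includes the uniqueKey field as a tiebreaker.
def fetch_all(fetch, rows=1000):
    params = {"q": "*:*", "sort": "id asc", "rows": rows, "cursorMark": "*"}
    docs = []
    while True:
        resp = fetch(dict(params))          # pass a copy of the params
        docs.extend(resp["response"]["docs"])
        next_mark = resp["nextCursorMark"]
        if next_mark == params["cursorMark"]:  # cursor did not advance: done
            break
        params["cursorMark"] = next_mark
    return docs
```

Unlike start=N paging, each cursorMark page costs roughly the same regardless of how deep into the result set it is, which is exactly the deep-paging case described above.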

Related

Django or Ruby on Rails max users on one server deployment and implications of GIL

I'm aware of the hugely trafficked sites built in Django or Ruby on Rails. I'm considering one of these frameworks for an application that will be deployed on ONE box and used internally by a company. I'm a noob and I'm wondering how many concurrent users I can support with a response time of under 2 seconds.
Example box spec: Core i5 2.3GHz, 8GB RAM. Apache web server. Postgres DB.
App overview: Simple CRUD operations. Small models of 10-20 fields probably <1K data per record. Relational database schema, around 20 tables.
Example usage scenario: 100 users making a CRUD request every 5 seconds (=20 requests per second). At the same time 2 users uploading a video and one background search process running to identify potentially related data entries.
1) Would a video upload process run outside the GIL once an initial request set it off uploading?
2) For the above system built in Django with the full stack deployed on the box described above, with the usage figures above, should I expect response times <2s? If so what would you estimate my maximum hits per second could be for response time <2s?
3) For the same scenario with Ruby on Rails, could I expect response times of <2s?
4) Specifically with regards to response times at the above usage levels, would it be significantly better built in Java (Play framework) due to JVM support for concurrent processing?
Many thanks
Duncan
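A rough way to frame the load figures above is Little's law: the number of requests in flight equals arrival rate times mean service time. At the stated 20 req/s, even a pessimistic 100 ms per simple indexed CRUD request (that service time is an assumption, not a measured figure) keeps only a couple of requests in flight at once, well within a small pool of worker processes:

```python
# Back-of-envelope capacity check via Little's law:
# requests in flight = arrival rate (req/s) * mean service time (s).
# The 100 ms service time is an assumed figure for a simple CRUD query.
users = 100
request_interval_s = 5                       # each user requests every 5 s
arrival_rate = users / request_interval_s    # 20 req/s, as in the question
service_time_s = 0.100
in_flight = arrival_rate * service_time_s    # ~2 concurrent requests
```

This also suggests the GIL is unlikely to be the bottleneck at these levels: typical Django/Rails deployments run multiple worker processes, so requests are handled in separate interpreters anyway.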

bottleneck node taking long time to return

We're on Neo4j 2.1.4, soon to upgrade to 2.2.1.
We've been experiencing some slowdowns with certain Cypher queries, and I think they are mostly centered around two or three nodes out of the millions in the graph. These nodes were created with the intent of having some monitoring in place to check the availability of the graph. I've since found out that a few of the apps that have been built actually exercise these queries before performing their write operations on the graph. Then I found out that our load balancer was set up to run tests through multiple apps that end up querying the same nodes. So we have a large mix of applications that are all either pulling or updating these same nodes, and as a result those two nodes take anywhere from 8 to 40 seconds to be returned.
Is there any way to determine how many updates and how many queries are being issued against one node?
Since Neo4j 2.2 there's a config option to log queries taking longer than a given threshold; see the dbms.querylog.* settings in http://neo4j.com/docs/stable/configuration-settings.html.
To get an update count for a given node you could set up a custom TransactionEventHandler that tracks write accesses to the nodes in question.
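For the query-log option, the configuration in conf/neo4j.properties looks roughly like this (setting names as in the linked 2.2 settings page; verify them against your exact version before relying on this):

```properties
# conf/neo4j.properties (Neo4j 2.2+), names per the configuration-settings page
dbms.querylog.enabled=true
# only log queries slower than this threshold
dbms.querylog.threshold=2s
dbms.querylog.filename=data/log/queries.log
```

That would surface which of the hot-node queries are actually crossing the 8-40 second range described above.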

Direction of DSE Solr cluster capacity planning

Getting started with the latest DSE, trying to set up an initial DSE Solr cluster and wanting to make sure basic capacity needs are met. I have done some initial capacity testing following the directions here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/srch/srchCapazty.html
My test single-node setup is on AWS: m3.xlarge, 80GB RAID 0 across the two 40GB SSDs, latest DSE installed.
I have inserted a total of 6MM example records and run some Solr searches similar to those production would run.
I have the following numbers for my 6MM records:
6MM records
7.6GB disk (Cassandra + Solr)
2.56GB Solr index size
96.2MB Solr field cache (totalReadableMemSize)
25.57MB Solr heap
I am trying to plan out an initial starter cluster and would like to plan for around 250MM records stored and indexed to start. Read load will be pretty minimal in the early days, so I'm not too worried about read throughput to start.
Following the capacity-planning doc and extrapolating the numbers from 6MM to 250MM, the base requirements for the dataset look like:
250MM records
106GB Solr index size
317GB disk (Cassandra + Solr)
4GB Solr field cache (totalReadableMemSize)
1.1GB Solr heap
So, some questions I'm looking for guidance on, assuming I'm understanding the docs correctly:
Should I be targeting ~360GB+ storage to be safe and not exceed 80% disk capacity on average as data set grows?
Should I use nodes that can allocate 6GB for Solr + XGB for Cassandra? (i.e., if the entire Solr index for 250MM is around 6GB for heap and field cache, and I partition across 3 nodes with replication)
With ~6GB for Solr, how much should I try to dedicate to Cassandra proper?
Anything else to consider with planning (will be running on AWS)?
UPDATED (11/6) - Notes/suggestions from phact
With C* + Solr running together, I will target the prescribed 14GB per node for base operation, moving to 30GB-memory nodes on AWS, leaving 16GB for the OS, Solr index, and Solr field cache.
I added the Solr index size to the numbers above; if the suggestion is to keep most/all of the index in memory, it seems I might need to target AT LEAST 8 nodes to start, with 30GB of memory per node.
That seems like a good amount of extra overhead for Solr nodes to keep the index in memory; I might have to reconsider the approach.
DSE heap on a Solr node
The recommended heap size for a DSE node running Solr is 14GB. This is because Solr and Cassandra actually run in the same JVM; you don't have to allocate memory for Solr separately.
AWS M3.xl
m3.xlarge nodes with 15GB of RAM will be a bit tight with a 14GB heap. However, if your workload is relatively light, you can probably get away with a 12GB heap on your Solr nodes.
OS page cache
You do want to make sure that you can at least fit your Solr indexes in the OS page cache (the memory left over after subtracting your heap, assuming this is a dedicated box). Ideally, you will also have room for Cassandra to keep some of your frequently read rows in page cache.
A quick and dirty way of figuring out how big your indexes are is to check the size of your index directory on the file system. Make sure to extrapolate if you expect your data to grow. You can also check the index size for each of your cores as follows:
http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
Note - each node only needs to hold its own index in memory, not the entire cluster's index.
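If you pull that CoreAdmin STATUS response programmatically, summing the per-core sizes is straightforward. A small sketch, assuming each core reports its size under status -> <core> -> index -> sizeInBytes (the usual response shape, but verify against your Solr version's actual output):

```python
# Sum per-core index sizes from a parsed CoreAdmin STATUS JSON response.
# The 'status'/'index'/'sizeInBytes' keys are the commonly seen shape;
# confirm against the real response from your Solr version.
def total_index_bytes(status_json):
    return sum(core.get("index", {}).get("sizeInBytes", 0)
               for core in status_json.get("status", {}).values())
```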
Storage
Yes, you do want to ensure your disks are not over-utilized, or you may face issues during compaction. In theory (worst-case scenario), size-tiered compaction could require up to 50% of your disk to be free. This is not common, though; see more details here.

Slow Query Performance

I am running some very large databases (500 MB and 300 MB) in my application on several different machines.
From a hardware perspective, the machines have been identically configured.
I am using SQL Server CE 4.0 as my DBMS.
The performance critical query has been indexed to improve its performance.
The problem is that on [only] one of the machines, I am observing egregiously slow query performance. This usually happens after a long period of time of inactivity (from a query perspective). After I do several (about 7-8) queries, the slow performance disappears.
The weird thing is that this initial slow query performance does not happen on the other machine.
The only difference between the two machines is the data contained inside the databases.
I suspect that the distribution of data on the slow machine is somehow reducing the effectiveness of the indexing and that SQL Server CE has to rebalance the indexing in a much more significant way than on the other faster machine.
One thing I notice is that when the query is very slow, the disk activity increases significantly and the process corresponding to reading the database shows a spike in the read bytes.
This does not happen on the other machine.
Does anyone know how I might go about root causing this issue?
My code is written in C++ and uses the ATL/OLEDB API to manipulate the database.
UPDATE: My performance profiling indicates that it's not the query itself that is slow - it is the processing of the returned rowset that takes a while. For each row returned, I query another database for related data. I understand that this is not the right way to do it, but the performance problem only happens on one machine. One thing I noticed is that when I have other unrelated queries running against the same database in other threads, the unrelated queries will stall the query that is exhibiting the performance problem.
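The per-row secondary lookup described in the update is the classic N+1 query pattern, and batching the related-data fetch into a single IN query usually removes the per-row round trip (and the extra disk seeks that would show up as the read-bytes spike). A minimal illustration of the idea, using sqlite3 purely as a stand-in since the real app uses SQL Server CE via ATL/OLEDB (table and column names are made up for the example):

```python
import sqlite3

def fetch_related_batched(conn, ids):
    # One query with an IN list instead of one query per returned row (N+1).
    placeholders = ",".join("?" * len(ids))
    cur = conn.execute(
        f"SELECT id, payload FROM related WHERE id IN ({placeholders})", ids)
    return dict(cur.fetchall())

# Hypothetical stand-in database for the "other database" in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE related (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO related VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(10)])
related = fetch_related_batched(conn, [1, 3, 5])
```

The same shape works in C++ over OLE DB: build the IN list (or a join against a temp table of keys) once, instead of issuing a command per row.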

sql azure batch size

Have some employee segmentation tasks that result in a large number of records (about 2000) that need to be inserted into SQL Azure. The records themselves are very small, about 4 integers each. An Azure worker role performs the segmentation task and inserts the resulting rows into a SQL Azure table. There might be multiple such tasks (each with about 1000 - 2000 rows) in the queue, so each of these inserts needs to be performed pretty fast.
Timing tests from a local machine to SQL Azure took significant time (approximately 2 minutes for 1000 inserts). This might be due to network latency; I am assuming the inserts from the worker role should be much faster.
However, since Entity Framework does not batch inserts well, we were thinking about using SqlBulkCopy. Would using SqlBulkCopy result in the queries being throttled if the batch size is, say, 1000? Is there any recommended approach?
The Bulk Copy API should serve your purposes perfectly and result in very dramatic performance improvements.
I have tested inserting 10 million records with a batch size of 2000 into an Azure database, and no throttling occurred, with performance of ~10 seconds per batch when running from my local machine.
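SqlBulkCopy itself is a .NET API (you would set its BatchSize property and call WriteToServer), but the underlying idea is simply sending fixed-size batches per round trip instead of one INSERT per row. A language-neutral sketch of that batching shape, using Python's sqlite3 as a stand-in (table name and column count are made up for the example):

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=2000):
    # Send rows in fixed-size batches instead of one INSERT per row,
    # mirroring SqlBulkCopy's BatchSize behaviour at a high level.
    for i in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO seg VALUES (?, ?, ?, ?)",
                         rows[i:i + batch_size])
        conn.commit()  # one commit per batch, not per row

# Hypothetical 4-integer segmentation records, as described in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seg (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
insert_in_batches(conn, [(i, i, i, i) for i in range(5000)], batch_size=2000)
```

The per-row network round trip is what made the 1000-insert test from a local machine take ~2 minutes; batching amortizes that latency across 2000 rows at a time.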
