Splitting the workload between DSE Search and DSE Analytics Spark - datastax-enterprise

I have 2 types of use cases - search and analytics. I also have 2 distinct ways to categorize my primary key candidate fields.
Partition keys by high-cardinality fields, where the number of distinct values is between 100,000 and 10,000,000, for example:
Customer_id
Employee_id
IP_address
MAC_address
A query by a row key here typically returns a handful of results. Secondary indexes and facets are practical because they are on low-cardinality fields - see #2 below.
Partition keys by low-cardinality fields, where the number of unique values is less than 100, for example:
event_type - like "purchase" or "authenticated_OK"
platform - like 5 types of OS or 50 types of applications
metric_type - like CPU_utilization
protocol - like http or ftp
SNMP MIB name
country code, like us, ca, uk
state, like de, ny
A typical query by a row key returns millions of results, maybe for further analytics.
Secondary indexes are less practical here, because they are often on high-cardinality fields of the kind in #1 above.
My question:
Is modeling the data as in #1 above more suitable for DSE Search, and
is modeling the data as in #2 above more suitable for DSE Analytics?
Thanks

The first use case, if properly data modeled and on an appropriately sized cluster, will be fine querying Cassandra without any additional indexing (no secondary indexes and no need for Solr, aka DSE Search).
The second use case is harder to judge from the information provided; however, it does sound like a proper data model on an appropriately sized Cassandra cluster, plus secondary indexes on the low-cardinality fields, could be a good fit. It is unclear exactly what your access patterns are, though.
I suggest you read this, which provides some great info on secondary indexes and Solr with Cassandra: When to use Cassandra vs. Solr in DSE?
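To make the two patterns concrete, here is a minimal sketch of how they might look in CQL via the Python driver; the table and column names are hypothetical and assume a local DSE/Cassandra node with an existing keyspace named demo.

# Hypothetical tables illustrating the two modeling patterns from the question.
# Assumes a local Cassandra/DSE node and an existing keyspace "demo".
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# Pattern #1: high-cardinality partition key (customer_id). A query for one
# partition returns a handful of rows - a good fit for point lookups / search.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_customer (
        customer_id text,
        event_time  timestamp,
        event_type  text,
        payload     text,
        PRIMARY KEY (customer_id, event_time)
    )""")

# Pattern #2: low-cardinality partition key (event_type). One partition holds
# millions of rows, which is what an analytics job (e.g. Spark) would scan.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_type (
        event_type  text,
        event_time  timestamp,
        customer_id text,
        payload     text,
        PRIMARY KEY (event_type, event_time, customer_id)
    )""")

In practice, a design like pattern #2 would usually add something like a time bucket to the partition key so a single partition does not grow unbounded.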

Related

Multi Tenant dynamic key value store

I have to implement a system where a tenant can store multiple key-value stores. One key-value store can have a million records, and there will be multiple columns in one store.
[Edited] I have to store tabular data (lists with multiple columns), like Excel, where column headers will be unique and there is no defined schema.
This will be mostly static data (occasionally updated).
We will provide a UI to handle those updates.
Every tenant wants to store multiple table-structured datasets that they need to reference from different applications, and the contract will be JSON only.
For example, an organization/tenant wants to store their employee list or country-state list, and there are some custom lists specific to the product; this data runs into millions of records.
A simple solution would be SQL, but the schema here is not fixed - it is user-defined. I have handled this in SQL, but there are performance issues, so I want to choose a NoSQL database better suited to this requirement.
Design Constraints:
GET API latency should be minimal.
We can assume the Pareto rule (80:20): 80% read calls and 20% writes, so it is a read-heavy application.
Users can update a single record or a single column.
Users can query by column values, so we need to implement indexes on multiple columns.
It's schema-less, so NoSQL is the natural assumption; SQL also supports JSON, but it is very hard to update a single row and we cannot define indexes on dynamic columns.
I want to segregate key-value stores per tenant; no list will be shared between tenants.
An example key-value store: https://datahub.io/core/country-list
I am thinking of Cassandra or another wide-column database; we could also consider a document database (MongoDB), where every collection can be a key-value store, or Amazon DynamoDB.
Cassandra allows you to partition data by partition key, but in my use case I may want to get data by different columns; in Cassandra that means querying all partitions, which will be expensive.
Your example data shows duplicate items, which is not something NoSQL databases can store.
DynamoDB can handle this scenario quite efficiently; it's well suited for high read activity and delivers consistent single-digit-millisecond latency at any scale. One caveat of DynamoDB compared to the others you mention is the 400KB item size limit.
In order to get top performance from DynamoDB, you have to utilize the partition key as much as possible, because it provides you with hash-based access (super fast).
It's obvious that a unique identifier for the user should be present in the PK (username?), but if there is another field that you always have at request time, like the country for example, you should include it in the PK.
Like so
PK SK
Username#S2#Country#US#State#Georgia Address#A1
It might be worth storing a mapping for the countries alone so you can retrieve them before executing the heavy query. You can have at most 20 global secondary indexes per table by default, so keep that in mind and reuse/overload indexes and keys as much as possible.
Stick to single-table design to make the most of this.
As Lee Hannigan mentioned, duplicate elements are not supported; all keys (including those of the indexes) must be unique pairs.
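A minimal boto3 sketch of the composite-key layout described above; the table name, key attribute names, and item fields are all assumptions.

# Hypothetical single-table layout following the PK/SK pattern above.
# Assumes a DynamoDB table named "tenant_kv" with PK (hash key) and SK (range key).
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("tenant_kv")

# Write one record; the tenant, country, and state are packed into the partition key.
table.put_item(Item={
    "PK": "Tenant#T1#Country#US#State#Georgia",
    "SK": "Address#A1",
    "payload": {"street": "Peachtree St", "zip": "30303"},
})

# Hash-based access: one query fetches everything under that partition key.
resp = table.query(
    KeyConditionExpression=Key("PK").eq("Tenant#T1#Country#US#State#Georgia")
    & Key("SK").begins_with("Address#")
)
items = resp["Items"]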

Faceting in Solr when index contains millions of documents

I'm working on a project that uses a Solr index with a few million documents, and we've recently hit a memory problem. Faceting has become unusable on a couple of our fields - Solr runs out of heap memory - because of the number of documents containing those fields.
What options do we have besides increasing the memory? We see memory increases as a temporary solution, because the number of documents goes up by a few hundred thousand per day.
I'm currently looking into SolrCloud, but I'm not sure it's the right solution.
Any suggestions?
Thanks!
FacetFields: Allow for facet counts based on distinct values in a field. There are two methods for FacetFields, one that performs well with few distinct values in a field, and the other for when a field contains many distinct values (generally, thousands and up – you should test what works best for you).
The first method, facet.method=enum, works by issuing a FacetQuery for every unique value in the field. As mentioned, this is an excellent method when the number of distinct values in a field is small. It requires excessive memory though, and breaks down when the number of distinct values gets large. When using this method, be careful to ensure that your FilterCache is large enough to contain at least one filter for every distinct value you plan on faceting on.
The second method uses the Lucene FieldCache (future version of Solr will actually use a different non-inverted structure – the UnInvertedField). This method is actually slower and more memory intensive for fields with a low number of unique values, but if you have a lot of uniques, this is the way to go. This method uses the FieldCache to look up the values for the given field for each document, and every time a document with a given value is found, the value has its count incremented.
Please check the memory allotted to each cache and whether you can tweak the FieldCache to handle the situation. (As you have mentioned, type3 and type4 have a large number of documents.)
The source for the above information is Scaling Lucene and Solr. I found one more article that talks about Solr faceting: You are faceting it wrong.
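As a quick way to compare the two methods on your own index, you can set facet.method per request; a minimal sketch (the core and field names here are made up):

# Compare facet.method=enum vs facet.method=fc on a hypothetical field.
# Assumes a local Solr instance with a core named "products".
import requests

base = "http://localhost:8983/solr/products/select"
for method in ("enum", "fc"):
    resp = requests.get(base, params={
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "category",   # hypothetical field to facet on
        "facet.method": method,
        "wt": "json",
    })
    counts = resp.json()["facet_counts"]["facet_fields"]["category"]
    print(method, counts[:10])       # flat [value, count, value, count, ...] list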
Before SolrCloud, you could think of Solr with multiple cores.
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
With SolrCloud, a single index can span multiple Solr instances.
This means that a single index can be made up of multiple SolrCore's on different machines.
The SolrCores that make up one logical index are called a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling as well as redundancy.
If you wanted to move your 2 SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
SolrCloud adds the distributed capabilities in Solr.
With this enabled you can have a highly available, fault-tolerant cluster of Solr servers.
Use SolrCloud when you want high scale, fault tolerant, distributed indexing and search capabilities.
You can get more info about SolrCloud here
https://cwiki.apache.org/confluence/display/solr/SolrCloud
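If you do move to SolrCloud, a collection that spans several nodes is created through the Collections API; here is a rough sketch (the collection name, shard count, and configset name are arbitrary):

# Create a sharded, replicated collection via the SolrCloud Collections API.
# Assumes a running SolrCloud node on localhost:8983 with an uploaded configset.
import requests

resp = requests.get("http://localhost:8983/solr/admin/collections", params={
    "action": "CREATE",
    "name": "documents",                        # hypothetical collection name
    "numShards": 4,                             # index split across 4 shards
    "replicationFactor": 2,                     # each shard stored on 2 nodes
    "collection.configName": "default_config",  # hypothetical configset
    "wt": "json",
})
print(resp.json())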

Mnesia: time and space efficiency of read, match_object, select, and qlc queries

Mnesia has four methods of reading from the database: read, match_object, select, and qlc (besides their dirty counterparts, of course). Each of them is more expressive than the previous one.
Which of them use indices?
Given a query in one of these methods, will the same query in the more expressive methods be less efficient in time/memory usage? By how much?
UPD.
As I GIVE CRAP ANSWERS mentioned, read is just a key-value lookup, but after a while of exploration I also found the functions index_read and index_write, which work in the same manner but use indices instead of the primary key.
One at a time, though from memory:
read always uses a Key-lookup on the keypos. It is basically the key-value lookup.
match_object and select will optimize the query if they can on the keypos key. That is, they only use that key for optimization and never utilize further index types.
qlc has a query-compiler and will attempt to use additional indexes if possible, but it all depends on the query planner and if it triggers. erl -man qlc has the details and you can ask it to output its plan.
Mnesia tables are basically key-value maps from terms to terms. Usually, this means that if the key part is something the query can latch onto and use, then it is used. Otherwise, you will be looking at a full-table scan. This may be expensive, but do note that the scan is in-memory and thus usually fairly fast.
Also, take note of the table type: set is a hash-table and can't utilize a partial key match. ordered_set is a tree and can do a partial match:
Example - if we have a key {Id, Timestamp}, querying on {Id, '_'} as the key is reasonably fast on an ordered_set because the lexicographic ordering means we can utilize the tree for a fast walk. This is equivalent to specifying a composite INDEX/PRIMARY KEY in a traditional RDBMS.
If you can arrange data such that you can do simple queries without additional indexes, then that representation is preferred. Also note that additional indexes are implemented as bags, so if you have many matches for an index, then it is very inefficient. In other words, you should probably not index on a position in the tuples where there are few distinct values. It is better to index on things with many different (mostly) distinct values, like an e-mail address for a user-column for instance.

Riak MapReduce: Group items by field + sum another field

Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused about what use MapReduce is if you already have to have all the keys (you must have searched for them somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct: MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case:
Use more than one bucket. Instead of just a sales bucket, you could break the data down by product, region, or month so it is already split by one of your common reporting criteria.
Consider adding a secondary index to each document for each field. Your month query could then be a range query on the created_at index.
If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
You may also consider breaking each document into a series of keys. Instead of just storing an id key with a JSON document as the value, store a key for each field, like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from disk and stored in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their solr-like search, but there are probably ways to restructure your data that it might be able to handle. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
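A rough sketch of that approach with the Riak Python client - the bucket and index names are assumptions, the 2i entries are assumed to have been written with each object, and the aggregation is done client-side here rather than in a reduce phase:

# Fetch only the keys in a date range via a secondary index, then aggregate.
# Assumes each sales object was stored with a "created_at_int" index entry.
from collections import defaultdict
import riak

client = riak.RiakClient(protocol="pbc", pb_port=8087)
bucket = client.bucket("sales")

# Range query on the secondary index: only January 2013, not the whole bucket.
keys = bucket.get_index("created_at_int", 1356998400, 1359676799)

revenue = defaultdict(float)              # (month, product_key) -> total price
for key in keys:
    doc = bucket.get(key).data            # decoded JSON document
    revenue[("2013-01", doc["product_key"])] += float(doc["price"])

for (month, product), total in sorted(revenue.items()):
    print(month, product, total)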

Solr Join - getting data from different index

I'm working on a project where we have 2 million products and 50 clients with different pricing schemes. Indexing 2M x 50 records is not an option at the moment. I have looked at Solr join and cannot get it to work the way I want it to. I know it's like a self join, so I'm kinda skeptical it would work, but here it is anyway.
Here is the sample schema:
core0 - product
core1 - client
So, given a client id, I want to display all bags manufactured by Samsonite, sorted by lowest price.
If there's a better way of approaching this, I'm open to redesigning the existing schema.
Thank you in advance.
Solr is not a relational database. You should take a look at the sharding feature and split your indexes. Also, you could write custom plugins to compute the price data based on the client's id/name/whatever at index time (bad: you'll still get a product replicated for each client).
How we do it (as an example):
clients are handled by SQLite
products are stored in Solr with their "base" price
each client has a "pricing rule" applied via a custom query handler when they query the index (it's just a value modifier)
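A rough sketch of that pattern; the SQLite schema, Solr field names, and the multiplier-style rule are all assumptions:

# Query Solr for the base prices, then apply the client's pricing rule
# (stored in SQLite) before sorting - avoids indexing 2M x 50 documents.
import sqlite3
import requests

def products_for_client(client_id, query="manufacturer:Samsonite AND category:bags"):
    # Hypothetical rule table: client_id -> price multiplier.
    db = sqlite3.connect("clients.db")
    (multiplier,) = db.execute(
        "SELECT multiplier FROM pricing_rules WHERE client_id = ?", (client_id,)
    ).fetchone()

    resp = requests.get("http://localhost:8983/solr/product/select", params={
        "q": query, "fl": "id,name,base_price", "rows": 1000, "wt": "json",
    })
    docs = resp.json()["response"]["docs"]

    # Apply the per-client modifier and sort by the effective price.
    for doc in docs:
        doc["client_price"] = doc["base_price"] * multiplier
    return sorted(docs, key=lambda d: d["client_price"])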
