The answer to my question should be relatively simple, but I can't seem to find one. I'm trying to find a way to regularly connect to the Volusion API to retrieve orders in a manner that ensures no duplicates and no missing orders. However, the way queries must be written makes this difficult: 1) You can't use comparisons in queries (greater than, less than, etc.). 2) The date fields require a full date/time with HHmmss. 3) There is a limit on the number of orders retrieved with each query. If you query against a date/time, for example, you usually get only one order, because you can't use a comparison operator. I saw one post here that suggested iterating through order IDs until no data is received. Has anyone found an easier way to accomplish order retrieval with the Volusion API?
To perform geoqueries in Firebase or Firestore, there are libraries like GeoFire and GeoFirestore. But to sort the results of that geoquery by distance, the entire dataset must be read, correct? If a geoquery produces a large number of results, there is no way to paginate those results (on the backend, not to the user) when sorting by distance, is there?
Yes, in order to sort by distance you must read all results that fall into the Geoquery range.
The reason for this is how such queries work: they return a set of documents that are within a range of geohash values, which is not necessarily the same order as by their distance to the center of the query.
This also means that there is no way to do meaningful pagination in a list of documents that are ordered by their distance, since you need to read all results anyway. The best I can think of is implementing the Geoquery in Cloud Functions, so that you can do the sort/filter there, and only return the page-full of results to the client. While this doesn't save on your cost (as you're still reading all documents in the range), it will save bandwidth in sending documents to the user.
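As a rough illustration of the server-side step described above, here is a minimal Python sketch. It assumes the geohash range query has already returned its candidate documents as plain dicts with hypothetical `lat` and `lng` fields; `page_by_distance` is a made-up helper name, not part of GeoFire or GeoFirestore.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def page_by_distance(docs, center, radius_km, page, page_size):
    """Sort all candidate docs by distance, then slice out one page.

    `docs` stands in for everything the geohash range query returned;
    every one of them has already been read at this point.
    """
    lat0, lon0 = center
    with_dist = [(haversine_km(lat0, lon0, d["lat"], d["lng"]), d) for d in docs]
    # Geohash ranges over-fetch, so drop anything outside the real radius.
    in_range = [(dist, d) for dist, d in with_dist if dist <= radius_km]
    in_range.sort(key=lambda pair: pair[0])
    start = page * page_size
    return [d for _, d in in_range[start:start + page_size]]

# Hypothetical sample data: four documents north of the query center.
docs = [
    {"id": "far", "lat": 52.60, "lng": 4.90},
    {"id": "near", "lat": 52.38, "lng": 4.90},
    {"id": "mid", "lat": 52.45, "lng": 4.90},
    {"id": "outside", "lat": 53.50, "lng": 4.90},
]
first_page = page_by_distance(docs, (52.37, 4.90), 50, page=0, page_size=2)
second_page = page_by_distance(docs, (52.37, 4.90), 50, page=1, page_size=2)
```

Note that every call still sorts the full candidate set, which is exactly the cost limitation described above; only the slice sent to the client shrinks.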
To learn more about how such geoqueries work, which explains why they can't be optimized the way you're looking to do, have a look at the video of my talk here or this article+shorter video on Jeff Delaney's site.
There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it is easy to construct a Cypher query to find all events on a day, in a range, etc., this could create a lot of unnecessary nodes. It would also make it very easy to change individual events' times, locations etc., because there is already a node with the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
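To make the client-side pass of the second option concrete, here is a minimal Python sketch. The event shape is purely an assumption for illustration: a dict with hypothetical `start`, `end`, `interval_days` and `cancelled` fields, where the recurrence pattern is a fixed interval in days and the exceptions are simply cancelled dates. Richer patterns and exceptions that move an occurrence would need more logic.

```python
from datetime import date, timedelta

def occurrences(event, range_start, range_end):
    """Yield the dates on which `event` occurs within [range_start, range_end]."""
    step = timedelta(days=event["interval_days"])
    last = min(event["end"], range_end)
    current = event["start"]
    if current < range_start:
        # Jump ahead to the first occurrence on or after range_start.
        missed = (range_start - current).days
        steps = -(-missed // event["interval_days"])  # ceiling division
        current += steps * step
    while current <= last:
        # Exceptions are modelled here as a set of cancelled dates.
        if current not in event["cancelled"]:
            yield current
        current += step

# Hypothetical weekly event with one cancelled occurrence.
weekly = {
    "start": date(2024, 1, 1),
    "end": date(2024, 3, 31),
    "interval_days": 7,
    "cancelled": {date(2024, 1, 15)},
}
hits = list(occurrences(weekly, date(2024, 1, 10), date(2024, 1, 31)))
```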
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post, perhaps you'll find a bit of inspiration there:
I am evaluating AWS CloudSearch as our system's new search engine.
Assume that there are articles and some comments written on each article. The search API should return articles that match or that have any matching comments. So is there any way to retrieve DISTINCT values (in this case, the unique ID of the article) from CloudSearch with a single query execution? If not, what would be a good way to meet this requirement with CloudSearch?
I know there's a text-array type for document fields in CloudSearch, but it seems expensive to update documents, since the number of comments on a single article can run into the thousands.
I faced a similar problem. Putting the comments into an array field is not an option in your case, as arrays cannot have more than 1000 elements in CloudSearch. I would make two search domains, articles and comments, and issue the search query to both of them in parallel (async or multithreaded, depending on the language). The articles query will never return duplicate IDs, but on the results of the comments query you have to apply logic that keeps each article ID only once, always picking the top hit, since results are sorted by matching score.
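The deduplication step could look roughly like the Python sketch below. It assumes the two domain queries have already run and that each hit was reduced to a dict with hypothetical `article_id` and `score` keys; no CloudSearch client call is shown. It also treats scores from the two domains as directly comparable, which is an assumption you may want to revisit.

```python
def merge_hits(article_hits, comment_hits):
    """Combine hits from both domains into one ranked list of unique article IDs."""
    best = {}
    for hit in list(article_hits) + list(comment_hits):
        aid = hit["article_id"]
        # Keep only the highest score seen for each article.
        if aid not in best or hit["score"] > best[aid]:
            best[aid] = hit["score"]
    return sorted(best, key=lambda aid: best[aid], reverse=True)

# Hypothetical results from the two search domains.
article_hits = [
    {"article_id": "a1", "score": 2.0},
    {"article_id": "a2", "score": 1.0},
]
comment_hits = [
    {"article_id": "a3", "score": 3.0},
    {"article_id": "a1", "score": 2.5},
    {"article_id": "a3", "score": 0.5},
]
ranked = merge_hits(article_hits, comment_hits)
```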
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused what use MapReduce is if you already have all the keys (you have to have searched for them, somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key;
(Ordering not important right now).
You are correct, MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case:
Use more than one bucket. Instead of just a Sales bucket you could break them down by product, region, or month so the data is already split by one of your common reporting criteria.
Consider adding a secondary index to each document for each field. Your month query could then be a range query of the created_at index.
If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
You may also consider breaking each document into a series of keys. Instead of just storing an id key with a JSON document for a value, store a key for each field like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from the disk and stored in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their solr-like search, but there are probably ways to restructure your data that it might be able to handle. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
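The shape of that approach can be sketched in plain Python. This is only an in-memory simulation under stated assumptions: `keys_in_range` stands in for a secondary-index range query on created_at, and `map_sale` / `reduce_revenue` stand in for the map and reduce phases. None of these names are Riak client calls, and the sample records beyond the one in the question are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone
from decimal import Decimal

def keys_in_range(index, start_ts, end_ts):
    """Stand-in for a secondary-index range query: keys with start <= ts < end."""
    return [key for ts, key in index if start_ts <= ts < end_ts]

def map_sale(record):
    """Map phase: emit ((month, product_key), price) for one sales record."""
    stamp = datetime.fromtimestamp(record["created_at"], tz=timezone.utc)
    return ((stamp.strftime("%Y-%m"), record["product_key"]), Decimal(record["price"]))

def reduce_revenue(pairs):
    """Reduce phase: sum the emitted prices per (month, product_key)."""
    totals = defaultdict(Decimal)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)

# Hypothetical bucket contents, keyed by id.
bucket = {
    1: {"id": 1, "product_key": "cyber-pet-toy", "price": "10.00", "created_at": 1365931758},
    2: {"id": 2, "product_key": "cyber-pet-toy", "price": "5.50", "created_at": 1365931800},
    3: {"id": 3, "product_key": "cyber-pet-toy", "price": "7.00", "created_at": 1368000000},
}
created_at_index = [(rec["created_at"], key) for key, rec in bucket.items()]

april_start = int(datetime(2013, 4, 1, tzinfo=timezone.utc).timestamp())
may_start = int(datetime(2013, 5, 1, tzinfo=timezone.utc).timestamp())

# Only the keys selected by the index are fed to the map phase.
april_keys = keys_in_range(created_at_index, april_start, may_start)
april_revenue = reduce_revenue(map_sale(bucket[k]) for k in april_keys)
```

The point of the index step is that the map phase only touches the records for the requested month rather than the whole bucket.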
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
I'm trying to build a (simple) twitter-clone which uses CouchDB as Database-Backend.
Because of its reduced feature set, I'm almost finished with coding, but there's one thing left I can't solve with CouchDB - the per user timeline.
As with Twitter, the per-user timeline should show the tweets of all people I'm following, in chronological order. With SQL it's quite a simple SELECT statement, but I don't know how to reproduce this with CouchDB's Map/Reduce.
Here's the SQL-Statement I would use with an RDBMS:
SELECT * FROM tweets WHERE user_id IN (1,5,20,33,...) ORDER BY created_at DESC;
CouchDB schema details
"text":"Sample Text",
"created_at":"2011-10-17 10:21:36 +000"
With view collations it's quite simple to query CouchDB for a list of "all tweets with user_id = 1 ordered chronologically".
But how do I retrieve a list of "all tweets which belongs to the users with the ID 1,2,3,... ordered chronologically"? Do I need another schema for my application?
The best way of doing this would be to save the created_at as a timestamp and then create a view, and map all tweets to the user_id:
function (doc) {
  // Index every tweet under its author's user_id.
  if (doc.type == 'tweet') { emit(doc.user_id, doc); }
}
Then query the view with the user IDs as keys, and sort the results in your application however you want (most languages have a sort method for arrays).
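The application-side sort might look like this minimal Python sketch. It assumes the multi-key view query has already been made and its JSON response parsed into a dict with a `rows` list, and that created_at is stored as a numeric timestamp as suggested above. `timeline` is a hypothetical helper name and the sample rows are invented.

```python
def timeline(view_response, limit=20):
    """Return the newest tweets from a multi-key view response."""
    tweets = [row["value"] for row in view_response["rows"]]
    # Rows come back grouped by key (user_id), so re-sort by time.
    tweets.sort(key=lambda tweet: tweet["created_at"], reverse=True)
    return tweets[:limit]

# Hypothetical parsed response for keys [1, 5].
response = {
    "rows": [
        {"id": "t1", "key": 1, "value": {"user_id": 1, "text": "first", "created_at": 100}},
        {"id": "t3", "key": 1, "value": {"user_id": 1, "text": "third", "created_at": 300}},
        {"id": "t2", "key": 5, "value": {"user_id": 5, "text": "second", "created_at": 200}},
    ]
}
latest = timeline(response, limit=2)
```

Sorting in the application means every matching tweet is fetched before the limit is applied, so this suits small follow lists better than large ones.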
Edited one last time - Was trying to make it all in couchDB... see revisions :)
Is that a CouchDB-only app? Or do you use something in between for additional business logic? In the latter case, you could achieve this by running multiple queries.
This might include merging different views. Another approach would be to add a list of "private readers" for each tweet. It allows user-specific (partial) views, but also introduces the complexity of adding the list of readers for each new tweet, or even updating the list in case of new followers or unfollow operations.
It's important to think about the possible operations and their frequencies. So when you're mostly generating lists of tweets, it's better to shift the complexity into how you integrate the reader information into your documents (i.e. integrating the readers into your tweet doc) and then easily build efficient view indices.
If you have many changes to your data, it's better to design your database not to update too many existing documents at the same time. Instead, try to add data by adding new documents and aggregate via complex views.
But you have shown an edge case where the simple (1-dimensional) list-based index is not enough. You'd actually need secondary indices to filter by time and user IDs (given the fact that you also need partial ranges for both). But this is not possible in CouchDB, so you need to work around it by shifting "query" data into your docs and using it when building the view.
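To illustrate the "readers in the tweet doc" idea, here is a small in-memory Python simulation, not CouchDB code. It assumes each tweet document carries a hypothetical `readers` list and that the view emits a composite `[reader, created_at]` key, so one sorted index serves every reader's timeline. In CouchDB itself, the equivalent lookup would presumably be a descending key-range query on such a view.

```python
def map_tweet(doc):
    """Emit one ([reader, created_at], doc_id) row per reader of a tweet."""
    if doc.get("type") == "tweet":
        for reader in doc["readers"]:
            yield ([reader, doc["created_at"]], doc["_id"])

def build_index(docs):
    """Collect all emitted rows and keep them sorted by key, like a view index."""
    rows = [row for doc in docs for row in map_tweet(doc)]
    rows.sort(key=lambda row: row[0])
    return rows

def timeline_for(index, reader, limit=20):
    """Newest-first tweet IDs visible to one reader."""
    mine = [doc_id for key, doc_id in index if key[0] == reader]
    return mine[::-1][:limit]

# Hypothetical documents; `readers` holds the followers at write time.
docs = [
    {"_id": "t1", "type": "tweet", "user_id": 1, "created_at": 100, "readers": [7, 8]},
    {"_id": "t2", "type": "tweet", "user_id": 5, "created_at": 200, "readers": [7]},
    {"_id": "t3", "type": "tweet", "user_id": 1, "created_at": 300, "readers": [7, 8]},
    {"_id": "u1", "type": "user", "name": "someone"},
]
index = build_index(docs)
```

The trade-off is the one described above: reads become a single range lookup, while follow and unfollow operations have to touch the `readers` list of existing tweets.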