I've done some research and found Ice Cube, but it stores the schedule as YAML in a TEXT SQL type column.
The goal is to find a standard way of storing schedules that also gives me a query interface, for example for "all records where the schedule includes TODAY()".
It seems this is a hard problem, and the leading solution I could find is to create an occurrences table. If the schedule has no end time, you then have to decide how many occurrences to store. Are there any best practices here?
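For reference, here is a minimal sketch of the occurrences-table idea in Python/SQLite; the table name and the two-year horizon are just placeholders, and dateutil's rrule stands in for the Ice Cube rule:

# Minimal sketch (invented table name): materialize occurrences up to a fixed
# horizon so "schedule includes today" becomes an ordinary indexed equality query.
import sqlite3
from datetime import date, datetime
from dateutil.rrule import rrule, WEEKLY

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE schedule_occurrences (
                    schedule_id INTEGER NOT NULL,
                    occurs_on   TEXT    NOT NULL)""")
conn.execute("CREATE INDEX idx_occurs_on ON schedule_occurrences (occurs_on)")

# Expand a weekly rule for roughly two years - the arbitrary cut-off you need
# when the schedule itself has no end date.
occurrences = rrule(WEEKLY, dtstart=datetime.now(), count=104)
conn.executemany("INSERT INTO schedule_occurrences VALUES (?, ?)",
                 [(1, d.date().isoformat()) for d in occurrences])

# "All records where the schedule includes TODAY()"
today = date.today().isoformat()
print(conn.execute("""SELECT DISTINCT schedule_id
                      FROM schedule_occurrences
                      WHERE occurs_on = ?""", (today,)).fetchall())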
I have a fact table that includes "wait times in hours" for certain services. I have a lot of dimensions that could describe the wait times across different slices; however, I am also interested in knowing how many people (counts) came for services, filtered by those same dimensions.
Given that the dimensions for the wait times in hours and for the number of people who got services are exactly the same, I think it's best practice to keep them in the same fact table. My questions are:
Should there be a different fact table for the count measure mentioned?
How would I include this measure? Do I just put 1 in every single row? Because regardless of the wait-time, they've gotten the service only once (you cannot go above/below 1 in my scenario).
1) Think about the grain of your existing fact table. It sounds like it's probably "an occasion on which a person received a service." If that's the same thing you're trying to count, then yes - the waiting time and the count are the same grain.
However, while they may well be the same grain, there might be no need to add anything to the table. Read point 2 for an explanation.
2) You could put a 1 in a column on every row, but I'm not sure what you'd gain from it. You've not said what tools will be consuming this data, but you should be able to do a count/distinct count of some kind.
Working on the basis that you've tagged SSIS so are likely using Microsoft's BI stack:
TSQL has count(), and you can do count(distinct [column]).
SSAS has both counts and distinct counts as aggregation types.
MDX offers several different types of count.
SSRS has Count, CountDistinct, and CountRows.
Whether you do a normal count or a distinct count will depend on whether you're trying to ask "How many times was this service used?" or "How many different people used this service?"
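To illustrate, here's a minimal sketch in Python/SQLite (the table and column names are made up) showing that the same fact rows answer both questions without a stored 1 column:

# Minimal sketch (invented table/column names): the same fact rows answer both
# questions, so there is no need to store a 1 in every row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_service_wait (
                    person_key  INTEGER,
                    service_key INTEGER,
                    date_key    INTEGER,
                    wait_hours  REAL)""")
conn.executemany("INSERT INTO fact_service_wait VALUES (?, ?, ?, ?)", [
    (101, 1, 20240101, 2.5),
    (101, 1, 20240102, 1.0),   # the same person coming back a second time
    (102, 1, 20240101, 4.0),
])

# "How many times was the service used?" vs "How many different people used it?"
print(conn.execute("""SELECT service_key,
                             AVG(wait_hours)            AS avg_wait_hours,
                             COUNT(*)                   AS service_events,
                             COUNT(DISTINCT person_key) AS distinct_people
                      FROM fact_service_wait
                      GROUP BY service_key""").fetchall())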
There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it would be easy to construct a Cypher query to find all events on a day, in a range, etc., this could create a lot of unnecessary nodes. On the other hand, it would make it very easy to change an individual event's time, location, etc., because there is already a node holding the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
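To make that second pass concrete, here is a rough client-side sketch in Python; the iCal RRULE string is just an example format for the recurrence property, and the events list stands in for the initial query's result set:

# Rough sketch of the client-side "second pass": nodes matched on start/end
# date come back with a recurrence pattern plus any exception dates, and the
# client filters them against the search range.
from datetime import datetime
from dateutil.rrule import rrulestr

def occurs_in_range(event, range_start, range_end):
    rule = rrulestr(event["rrule"], dtstart=event["dtstart"])
    exceptions = set(event.get("exdates", []))   # dates removed from the series
    return any(occ.date() not in exceptions
               for occ in rule.between(range_start, range_end, inc=True))

# Pretend result set from the initial Cypher query.
events = [{"name": "standup",
           "dtstart": datetime(2024, 1, 1, 9, 0),
           "rrule": "FREQ=WEEKLY;BYDAY=MO,WE,FR",
           "exdates": []}]

matches = [e for e in events
           if occurs_in_range(e, datetime(2024, 3, 4), datetime(2024, 3, 8))]
print([e["name"] for e in matches])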
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post; perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why running over an entire bucket is slow when you only have one bucket in the entire system; either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused what use MapReduce is if you already have all the keys (you have to have searched for them, somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket): how does one use Riak's MapReduce to do that efficiently? Even just running an identity map operation on the data, I get a timeout after ~30 seconds, whereas MySQL handles this in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct, MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case:
Use more than one bucket. Instead of just a Sales bucket, you could break the data down by product, region, or month, so it is already split by one of your common reporting criteria.
Consider adding a secondary index to each document for each field. Your month query could then be a range query on the created_at index.
If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
You may also consider breaking each document into a series of keys. Instead of storing just an id key with a JSON document as its value, store a key for each field, like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from disk and held in RAM in order to process your MapReduce.
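As a rough illustration of what that leaves on the client side, here is a Python sketch of the month/product roll-up; fetch_sales_docs() is a hypothetical stand-in for fetching the documents whose keys came back from a created_at range query:

# Client-side roll-up once the keys have been narrowed down (e.g. by a
# created_at secondary-index range query).
from collections import defaultdict
from datetime import datetime, timezone

def fetch_sales_docs():
    # Placeholder: in practice, GET each key returned by the index query.
    return [
        {"id": 1, "product_key": "cyber-pet-toy", "price": "10.00",
         "tax": "1.00", "created_at": 1365931758},
        {"id": 2, "product_key": "cyber-pet-toy", "price": "10.00",
         "tax": "1.00", "created_at": 1366000000},
    ]

revenue = defaultdict(float)   # (month, product_key) -> total price
for doc in fetch_sales_docs():
    month = datetime.fromtimestamp(doc["created_at"],
                                   tz=timezone.utc).strftime("%Y-%m")
    revenue[(month, doc["product_key"])] += float(doc["price"])

for (month, product), total in sorted(revenue.items()):
    print(month, product, total)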
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each of which consists of a JSON string. My database consists of a single table named Sales, with a single column named Data, which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using its Solr-like search, but there are probably ways to restructure your data so that Riak can handle it. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
I have a table in my database that stores event totals, something like:
event1_count
event2_count
event3_count
I would like to transition from these simple aggregates to more of a time series, so the user can see on which days these events actually happened (like how Stack Overflow shows daily reputation gains).
Elsewhere in my system I already did this by creating a separate table with one record for each daily value; then, in order to collect a time series, you end up with a huge database table and the need to query tens or hundreds of records. It works, but I'm not convinced that it's the best way.
What is the best way of storing these individual events along with their dates so I can do a daily plot for any of my users?
When building tables like this, the real key is having effective indexes. Test your queries with the EXPLAIN statement, or the equivalent in your database of choice.
If you want summary tables you can query, either build a view that represents the query, or roll the daily results into a new table on a regular schedule. Often summary tables are the best way to go, as they are quick to query.
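Here is a minimal Python/SQLite sketch of both options (table and column names are invented): the raw events stay as they are, the daily series comes from a GROUP BY, and a scheduled job rolls it into a summary table:

# Keep the raw events, derive the daily series with GROUP BY, and optionally
# roll it up into a quick-to-query summary table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_type TEXT, occurred_at TEXT);
    CREATE INDEX idx_events_user_day ON events (user_id, occurred_at);
    CREATE TABLE daily_event_counts (
        user_id INTEGER, event_type TEXT, day TEXT, total INTEGER,
        PRIMARY KEY (user_id, event_type, day));
""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, 'event1', '2024-03-01 09:00:00'),
    (1, 'event1', '2024-03-01 17:30:00'),
    (1, 'event2', '2024-03-02 08:15:00'),
])

# Daily plot for one user, straight off the raw events (relies on the index).
print(conn.execute("""SELECT DATE(occurred_at) AS day, event_type, COUNT(*)
                      FROM events WHERE user_id = ?
                      GROUP BY day, event_type ORDER BY day""", (1,)).fetchall())

# Nightly rollup into the summary table.
conn.execute("""INSERT OR REPLACE INTO daily_event_counts
                SELECT user_id, event_type, DATE(occurred_at), COUNT(*)
                FROM events GROUP BY user_id, event_type, DATE(occurred_at)""")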
The best way to implement this is to use Redis. If you haven't worked with Redis before, I suggest you start; you will be surprised how fast this can get :). The way I would do such a thing is to use the Hash data structure Redis provides. Just assign every user their own Hash (making a unique key for every user, like "user:23:counters"). Inside this Hash you can store a daily timestamp such as "05/06/2011" as the field and increment its counter every time an event happens, or whatever else you want to do with it!
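A quick sketch with redis-py, assuming a Redis server on localhost; the key naming mirrors the "user:23:counters" idea above, and ISO dates are used as the hash fields simply because they sort naturally:

# Per-user hash of daily counters, incremented as events happen.
import datetime
import redis

r = redis.Redis()

def record_event(user_id):
    key = "user:%d:counters" % user_id
    field = datetime.date.today().isoformat()
    r.hincrby(key, field, 1)               # creates the field on first use

record_event(23)
record_event(23)
print(r.hgetall("user:23:counters"))       # e.g. {b'2024-03-05': b'2'}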
A good start would be this thread; it has a simple, beginner-level solution (Time Series Starter) if you are OK with Rails models. That is one way it could work for a so-called "irregular" time series: an event here and there, but not at a regular interval, like a sensor that sends data when your door is opened.
The other case, and the one I was looking for in this thread, is a regular time series database: values come at a fixed interval, say 60 per minute (i.e. 1 per second), for example from a temperature sensor. This all boils down to datasets with "buckets", as you are suspecting: a time series table gets long, indexes struggle at some point, etc. Here is one "bucket" approach using Postgres arrays that would be a feasible idea.
It's not available as "plug and play", as far as my research on the web goes.
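For what it's worth, here is a rough sketch of that bucket idea via psycopg2 (it assumes a reachable local Postgres, and the table and column names are made up):

# One row per sensor per day, with the samples held in an array column.
import psycopg2

conn = psycopg2.connect("dbname=metrics")   # assumes a local Postgres
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS sensor_daily (
                   sensor_id int  NOT NULL,
                   day       date NOT NULL,
                   samples   double precision[] NOT NULL DEFAULT '{}',
                   PRIMARY KEY (sensor_id, day))""")

# Append one reading to today's bucket; the upsert keeps it to one row per day.
cur.execute("""INSERT INTO sensor_daily (sensor_id, day, samples)
               VALUES (%s, current_date, ARRAY[%s]::double precision[])
               ON CONFLICT (sensor_id, day)
               DO UPDATE SET samples = sensor_daily.samples || EXCLUDED.samples""",
            (42, 21.5))
conn.commit()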
I am using Ruby on Rails and have a situation where I am wondering whether some sort of key-value store would be more appropriate than MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it has to create those hundreds of word objects at one time.
As an alternative, I am considering using some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL. Or each list could be a new key-value db? It seems like it would be faster to copy a key-value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a lists table, a words table, and a list-words join table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined up front, there is very little actual storage overhead for each record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined, and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would start with the straightforward MySQL approach, and only make changes like these if you need the performance and can prove that the alternative will actually be faster.
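As a minimal sketch of that straightforward model (Python/SQLite here, with invented table names), copying even a long list is a single set-based INSERT ... SELECT rather than hundreds of individual inserts:

# lists, words, and a list_words join table; the copy is one INSERT ... SELECT.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lists (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE list_words (list_id INTEGER, word_id INTEGER,
                             PRIMARY KEY (list_id, word_id));
""")
conn.execute("INSERT INTO lists VALUES (1, 10, 'GRE prep')")
conn.executemany("INSERT INTO words (word) VALUES (?)",
                 [("abate",), ("laconic",), ("ubiquitous",)])
conn.execute("INSERT INTO list_words SELECT 1, id FROM words")

# Copy list 1 for another user: one new list row, one set-based insert.
conn.execute("INSERT INTO lists VALUES (2, 11, 'GRE prep (copy)')")
conn.execute("""INSERT INTO list_words (list_id, word_id)
                SELECT 2, word_id FROM list_words WHERE list_id = 1""")
print(conn.execute("SELECT COUNT(*) FROM list_words WHERE list_id = 2").fetchone())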