I have ~10 Excel files which are produced by a third party, updated each night, and made available as a download. Each contains ~10 fields (all short text or dates) and between ~10,000 and ~1m rows.
I'm planning to create a simple web application to enable people to search the data. I'll host it on AWS or similar. Search load will be light, maybe ~1000 searches / day.
I have to assume that all the records are unique each night, so I need to completely replace the online dataset.
It's relatively simple for me to convert the data from the Excel files into a database such as Postgres and create a simple search on top of it.
My question is how do I deal with the time it takes to do the database update each night? Should I create two databases and have my application alternate between them every other night?
What is a typical strategy for dealing with a situation like this?
My current skill set is Ruby/Rails/Postgres, building simple(ish) web apps. I've been intentionally vague about technology because I'm open minded about what to use, and I'm quite happy to learn something new to solve the problem.
If you do all the updates in one transaction, you don't need two databases. While you are updating the tables, people still see the "old" version; only after COMMIT do they see the "new" data, all at once.
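A minimal sketch of the single-transaction reload described above. It uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; the `records` table, its columns, and the `replace_all` helper are assumptions made for illustration. In Postgres the same DELETE-then-INSERT inside one transaction additionally keeps concurrent readers on the old rows until COMMIT.

```python
import sqlite3


def replace_all(conn, rows):
    """Swap the whole table contents in a single transaction."""
    # sqlite3 opens a transaction implicitly before the first DELETE/INSERT;
    # the `with` block commits on success and rolls back on any exception.
    with conn:
        conn.execute("DELETE FROM records")
        conn.executemany(
            "INSERT INTO records (name, updated_on) VALUES (?, ?)", rows
        )


def names(conn):
    """Return the names currently visible in the table."""
    return [r[0] for r in conn.execute("SELECT name FROM records ORDER BY name")]


# Hypothetical table standing in for one nightly Excel file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, updated_on TEXT)")
replace_all(conn, [("alice", "2024-01-01"), ("bob", "2024-01-01")])

# A failed load (the second row is malformed) leaves last night's data in place.
try:
    replace_all(conn, [("carol", "2024-01-02"), ("broken",)])
except sqlite3.Error:
    pass
after_failed_load = names(conn)

# A successful load replaces everything at once.
replace_all(conn, [("carol", "2024-01-02"), ("dave", "2024-01-02")])
after_good_load = names(conn)
```

The useful property is that a broken nightly file never leaves the site half-loaded: either the whole new dataset appears, or the previous one stays.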
I want to add an autocomplete to a Rails 5 app with PostgreSQL. The database would have around 50,000 records for the user to search (neighborhoods from one entire country). I did some research on the web, but many tutorials are outdated and some of them use Redis as the best option for this case. So, is there something newer that I should follow these days? Thank you.
To summarize the comments under the question (writing it down, because users looking for answers usually gloss over comments... I know I do)
a data set of 50k records is not that big
by its nature, the data set will rarely be updated
therefore there will be a lot of reads, and almost no writes
So a PostgreSQL database should be more than enough, and Martin suggested a great read about trigram indexes, which are perfect for the task.
If the day comes that there are a ton of users trying to use this autocomplete at once - you should be fine with simple replication.
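To give a feel for what a trigram index matches on, here is a rough pure-Python sketch of trigram similarity in the style of PostgreSQL's `pg_trgm` extension (lowercase, pad each word, compare sets of three-character windows). It is a simplification for illustration, not the extension's actual code; the function names and the sample neighborhoods are made up, and 0.3 mirrors the extension's default similarity threshold.

```python
def trigrams(text):
    """Set of 3-character windows per word, padded with 2 leading and 1 trailing space."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_matches(query, candidates, limit=5, threshold=0.3):
    """Candidates at or above the threshold, best match first."""
    scored = [(similarity(query, c), c) for c in candidates]
    hits = [(s, c) for s, c in scored if s >= threshold]
    hits.sort(key=lambda sc: (-sc[0], sc[1]))
    return [c for _, c in hits[:limit]]


# Hypothetical sample data; note the typo in the query.
neighborhoods = ["Copacabana", "Ipanema", "Leblon"]
suggestions = fuzzy_matches("copacabna", neighborhoods)
```

In the database itself the scoring is done by the extension and the index, so the application only sends the query; this sketch just shows why a misspelled input can still find the right row.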
Is there an easy way to set up a database that's accessible to several people that can do all the things a single user would do?
I'm studying Database 101 and I am currently doing a project with four other people and we're having trouble meeting up and doing it so it would be great if we could do it from wherever.
When I say "easy way" I mean without having the super-ultra-deluxe-enterprise-edition of software.
Can it be done with a "local" Dropbox folder?
What you can do at zero cost is to have one project master.
Distribute a copy to each member. Each will have to do completely separate tasks, like one for designing a form, one for adjusting a report, one for some code module, one for another code module.
When done, in the evening or whenever you agree upon, you collect the different versions along with a list of which objects have been changed or added. Import these into your master, and then distribute it to the members as the current revised working copy.
It takes some discipline, but that's all. Save every master as a zip file whose name includes the date and time. This way, nothing can get lost.
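The "zip every master with a date and time in the name" step can be scripted. This is a small sketch using only the Python standard library; the folder layout, the `master_` prefix, and the `project.accdb` placeholder file are assumptions for the demo, not part of the advice above.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def archive_master(master_dir, backup_dir, now=None):
    """Zip master_dir into backup_dir as master_YYYYMMDD_HHMMSS.zip."""
    now = now or datetime.now()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    base = backup_dir / f"master_{now:%Y%m%d_%H%M%S}"
    # make_archive appends ".zip" and returns the path of the file it wrote.
    return Path(shutil.make_archive(str(base), "zip", root_dir=master_dir))


# Demo in a throwaway directory with a placeholder project file.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master"
    master.mkdir()
    (master / "project.accdb").write_bytes(b"placeholder")
    saved = archive_master(master, Path(tmp) / "backups", datetime(2024, 5, 1, 18, 30))
    saved_name, saved_exists = saved.name, saved.exists()
```

Keeping the backups folder outside the master folder avoids zipping old archives into new ones.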
I'm building an app that has a complex search (including geographic data). Users can save their searches to view them later, and they can also be notified when a new item matches a saved search.
When I think about this problem, I see thousands of searches that I need to process in a background job every week (or day). As each search is mostly unique, I cannot process them in batches, meaning I need to run a full search on the database with all the criteria the user chose. I'm worried about the performance issues this could cause in the application.
I was wondering if there is a pattern or a tool that can help me with this problem, or what the best way to solve it is.
I'm using ruby on rails and PostgreSQL with Postgis on the stack.
We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500 000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for a user to move from version to version, where versions are "snapshots" of the main database at different points in time. In addition, some portions of the data need to be served to other external applications through an API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want to create multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude and we're trying to find the best possible solution. We don't know if this kind of data separation makes sense. If it does, how should the different applications communicate with the individual services, and the services with each other, as this will also be required?
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In this, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution. However, the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. Inserts occur in this model, and updates are limited to metadata used by some queries, and only when absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes provided that integrity to the other models is not impinged. In a financial application this might be things like customer or vendor data, employee data, and the like.
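The event and closing models above can be sketched as follows, using `sqlite3` so the example runs without a server. The table names, the `post`, `close_period`, and `balance` helpers, and the use of integer cents and ISO date strings are all assumptions for illustration, not the answer's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Event model: append-only journal lines.
    CREATE TABLE journal_line (
        id INTEGER PRIMARY KEY, account TEXT, posted_on TEXT, amount INTEGER);
    -- Aggregate closing model: one balance per account per closed period.
    CREATE TABLE closing_balance (account TEXT, closed_on TEXT, balance INTEGER);
""")


def balance(account, as_of):
    """Start at the latest closing at or before as_of, then roll forward."""
    row = conn.execute(
        "SELECT closed_on, balance FROM closing_balance "
        "WHERE account = ? AND closed_on <= ? ORDER BY closed_on DESC LIMIT 1",
        (account, as_of),
    ).fetchone()
    start, total = row if row else ("", 0)
    delta = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM journal_line "
        "WHERE account = ? AND posted_on > ? AND posted_on <= ?",
        (account, start, as_of),
    ).fetchone()[0]
    return total + delta


def post(account, posted_on, amount):
    """Append a journal line; refuse entries that fall in a closed period."""
    last_close = conn.execute(
        "SELECT MAX(closed_on) FROM closing_balance WHERE account = ?", (account,)
    ).fetchone()[0]
    if last_close is not None and posted_on <= last_close:
        raise ValueError("period is closed")
    with conn:
        conn.execute(
            "INSERT INTO journal_line (account, posted_on, amount) VALUES (?, ?, ?)",
            (account, posted_on, amount),
        )


def close_period(account, closed_on):
    """Record the balance at closed_on as a new aggregation point."""
    total = balance(account, closed_on)
    with conn:
        conn.execute(
            "INSERT INTO closing_balance VALUES (?, ?, ?)", (account, closed_on, total)
        )


post("checking", "2024-01-10", 10000)
post("checking", "2024-01-20", -3000)
close_period("checking", "2024-01-31")
post("checking", "2024-02-05", 5000)
february_balance = balance("checking", "2024-02-28")

try:
    post("checking", "2024-01-15", 100)  # January is already closed
    late_entry_rejected = False
except ValueError:
    late_entry_rejected = True
```

The balance query only has to read the journal lines posted after the last closing, which is what keeps it fast as the journal grows.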
I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution, with tens of millions of rows, is not going to scale up much further. We query things like "tell me the sum of all the value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search on solutions for this I did come across Stack Overflow question What is the best open source solution for storing time series data? which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and eventually got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range, e.g. timestamps. This gives you much greater confidence in searches such as "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
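The "primary key plus range" layout can be modelled in a few lines of plain Python. This is an in-memory illustration of the access pattern only, not the DynamoDB API; the `DeviceReadings` class and its methods are hypothetical, and timestamps are assumed to be integer epoch seconds.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class DeviceReadings:
    """Readings grouped by device_id and kept sorted by timestamp."""

    def __init__(self):
        # device_id -> sorted list of (utc_timestamp, value1, value2)
        self._by_device = defaultdict(list)

    def put(self, device_id, utc_timestamp, value1, value2):
        insort(self._by_device[device_id], (utc_timestamp, value1, value2))

    def between(self, device_id, start, end):
        """All readings for one device with start <= timestamp <= end."""
        rows = self._by_device.get(device_id, [])
        lo = bisect_left(rows, (start,))
        hi = bisect_right(rows, (end, float("inf"), float("inf")))
        return rows[lo:hi]


store = DeviceReadings()
store.put("dev-1", 100, 1.0, 2.0)
store.put("dev-1", 300, 1.2, 2.2)
store.put("dev-1", 200, 1.1, 2.1)
store.put("dev-2", 150, 9.0, 9.0)
window = store.between("dev-1", 150, 300)
```

A range query touches only one device's readings and only the slice inside the time window, which is why this shape suits "device X between two dates" lookups.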
In my opinion, Amazon SimpleDB, like Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are a complete non-issue on relational databases, such as aggregates, you begin to run into trouble. So if you are going to do heavy reporting, it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time variable data in such a way so that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time series information.
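The fixed-size idea can be sketched as a ring of time buckets where new data overwrites the oldest slot. This is a toy illustration, not how RRDtool itself is implemented; the class name, the averaging rule, and the assumption that readings arrive in time order are all made up for the example.

```python
class RoundRobinArchive:
    """Fixed number of time buckets; the oldest bucket is overwritten."""

    def __init__(self, slots, step):
        self.step = step              # bucket width in seconds
        self.slots = [None] * slots   # each slot: (bucket_number, count, total)

    def update(self, timestamp, value):
        # Assumes readings arrive in non-decreasing timestamp order.
        bucket = timestamp // self.step
        i = bucket % len(self.slots)
        slot = self.slots[i]
        if slot is None or slot[0] != bucket:
            self.slots[i] = (bucket, 1, value)  # reuse the slot for a new bucket
        else:
            self.slots[i] = (bucket, slot[1] + 1, slot[2] + value)

    def averages(self):
        """(bucket_start, average) pairs, oldest first."""
        filled = sorted(s for s in self.slots if s is not None)
        return [(b * self.step, total / count) for b, count, total in filled]


# Three 15-minute buckets: storage stays at three slots however long it runs.
rra = RoundRobinArchive(slots=3, step=900)
for ts, value in [(0, 10), (100, 20), (900, 30), (1800, 40), (2700, 50)]:
    rra.update(ts, value)
recent = rra.averages()
```

The trade-off is that history older than the ring's span is gone, so this fits graphing and recent statistics better than keeping every reading for years.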
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in such a way that most of your queries can be executed against a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
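One possible partition strategy, sketched under the assumption that most queries cover a short, recent time window: put each month of readings in its own domain. The naming scheme and both helper functions are hypothetical; partitioning by device or region instead would follow the same pattern.

```python
from datetime import datetime


def domain_for(utc_timestamp, prefix="sensor_readings"):
    """Name of the domain that holds readings for this timestamp's month."""
    return f"{prefix}_{utc_timestamp:%Y_%m}"


def domains_for_range(start, end, prefix="sensor_readings"):
    """Every monthly domain a query from start to end has to touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


single = domain_for(datetime(2024, 3, 5, 14, 15))
spanning = domains_for_range(datetime(2023, 12, 31), datetime(2024, 1, 1))
```

With this layout a query such as "the average of value2 in the last 8 hours" reads one domain, or two when the window crosses a month boundary.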