Data storage solution for a Build Server

I'm drafting a proof-of-concept build server, and it got me thinking about how to store all the data it produces. For example, I'd like to store:
Unit test results: which tests were run, how long each test took, results, stack traces, number of assertions
Code coverage information, with line-level granularity
Various LoC metrics - per file, per file type
Code duplication information
Additionally, these are the kinds of queries I'd like to run:
How has test execution time changed over time?
How has the overall code coverage percentage changed over time? And what about this particular method? How has the uncovered line count changed over time?
How has the LoC count for *.cs files changed? How has the total LoC count changed?
Stuffing all this into an RDBMS doesn't sound like a particularly good idea. What storage technology fits the bill best here?

If you don't want to use an RDBMS you could definitely go with MongoDB for your requirements.
It allows you to group similar documents in a collection and each document in a collection does not have to have the same schema. One document can have 5 fields, another can have 10.
It can be fairly easily scaled to provide redundancy.
MongoDB also provides what they call the "aggregation framework", which allows you to generate stats/aggregations over your data. It's faster than their map/reduce option, which can be a little slow.
Of all the document databases out there right now, I would say it is clearly the most mature and definitely has the richest query language.
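For instance, here is a minimal pymongo sketch of answering the "how has test execution time changed?" question with the aggregation framework. The database name, "test_runs" collection, and field names are assumptions, not a prescribed schema.

    # Hypothetical schema: one document per build, with an embedded array of
    # per-test results, e.g.
    # {"build": 412, "started_at": <datetime>,
    #  "tests": [{"name": "...", "duration_ms": 12, "result": "passed"}, ...]}
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    runs = client["buildserver"]["test_runs"]

    pipeline = [
        # Only look at builds from the last 30 days
        {"$match": {"started_at": {"$gte": datetime.utcnow() - timedelta(days=30)}}},
        {"$unwind": "$tests"},
        # One output row per build: total test time and number of failures
        {"$group": {
            "_id": "$build",
            "total_ms": {"$sum": "$tests.duration_ms"},
            "failed": {"$sum": {"$cond": [{"$eq": ["$tests.result", "failed"]}, 1, 0]}},
        }},
        {"$sort": {"_id": 1}},
    ]

    for row in runs.aggregate(pipeline):
        print(row["_id"], row["total_ms"], row["failed"])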

Which approach promises better performance - a megaquery or several targeted queries?

I am creating an SSRS report that returns data for several "Units", which are all to be displayed on a row, with Unit 1 first, to its right Unit 2 data, etc.
I can either get all this data using a Stored Proc that queries the database using an "IN" clause, or with multiple targeted ("Unit = Bla") queries.
So I'm thinking I can either filter each "Unit" segment with something like "=UNIT:[Unit1]", or I can assign a different dataset to each segment (with the targeted data).
Which way would be more "performant" - getting a big chunk of data, and then filtering the same thing in various locations, or getting several instances/datasets of targeted data?
My guess is the latter, but I don't know whether SSRS is smart enough to make the former approach work just as well or better by doing some optimizing "behind the scenes".
I think it really depends on how big the big chunk of data is. My experience has been that SSRS can process quite a large amount of data after it comes back from the database, and it does so quickly. If the report is going to aggregate the data in the end, I try to do as much of that as I can on the database side, because the database server usually has more resources to do all that work. But if the detail is needed, and you can aggregate easily enough on the report server side, pull the 10K records and do it there.
I lean toward hitting the database as few times as possible, but sometimes it just makes sense to get the data I need with individual queries. I have built reports with over 20 datasets, each for very specific measures that just didn't union up very well. Breaking it up like this took the report run time from 3 minutes to 20 seconds.
Not a great answer if you were looking for which exact solution to go with. It depends on the situation. Often, trial and error gets you to the answer for the report in question.
SSRS is not going to do any "optimizing", and the rendering requirements sound trivial, so you should probably treat this as a SQL query issue, not really an SSRS one.
I would expect the single SELECT with an IN clause to be faster, as it will require fewer I/Os on the database files. A stored procedure is not required; you can just write a SELECT statement.
A further benefit is that you will have N times less code to maintain (where N is the number of Units), and you can guarantee the consistency of the code/logic across Units.
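To make the comparison concrete, here is a rough sketch of the two query shapes being discussed, written in Python with pyodbc; the connection string, table, and column names are assumptions, and the SSRS dataset itself would only contain the SQL text.

    import pyodbc

    UNITS = ["Unit1", "Unit2", "Unit3"]

    conn = pyodbc.connect("DSN=ReportDb;Trusted_Connection=yes")
    cursor = conn.cursor()

    # Option 1: a single round trip; each report segment then filters on Unit.
    placeholders = ", ".join("?" for _ in UNITS)
    cursor.execute(
        "SELECT Unit, Measure, Value FROM dbo.UnitMeasures "
        "WHERE Unit IN (" + placeholders + ")",
        UNITS,
    )
    all_rows = cursor.fetchall()

    # Option 2: one targeted query (and one dataset) per unit.
    per_unit = {}
    for unit in UNITS:
        cursor.execute(
            "SELECT Measure, Value FROM dbo.UnitMeasures WHERE Unit = ?",
            unit,
        )
        per_unit[unit] = cursor.fetchall()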

Input and test data for a SpecFlow scenario

I have recently started using SpecFlow and I have two basic questions I need to clarify, also to confirm I am on the right track:
As I understand it, all the input data (test parameters for the scenarios) must be provided by the tester, and the same goes for the test data (input data for the tables involved in the test scenarios).
Are there any existing tools for quickly generating test data and inserting it into the DB? I am using Entity Framework as part of the data access layer. I was wondering about some tool that would read the data from a file, or perhaps a desktop application for providing values for a table's fields (which could then generate a file from which some other tool could read all the data and generate all the required objects, etc.).
I also had a look at Preparing data for a SpecFlow scenario - I was wondering whether there is already a framework that handles inserting/deleting test data to use alongside SpecFlow.
I don't think you are on the right track. SpecFlow is a BDD tool, but in some ways it only covers part of the process. Have a read of http://lizkeogh.com/2013/07/01/behavior-driven-development-shallow-and-deep/ and see if any of the scenarios sound familiar.
To move forward I would recommend you start with http://dannorth.net/introducing-bdd/ to get a good idea of how it all began. Now let's consider your points:
The tester provides all the test data. Well, yes and no. The idea is that between yourself and the feature expert, you are able to have a conversation that produces all the examples you need to develop your feature. If you don't involve yourself in that conversation, then yes, all the data will come from the other side, but chances are it won't be of as high quality as if you are able to ask the right questions and guide the conversation so the data follows a structure that you can code tests to.
As an example, when I first started with BDD I thought I could get the business experts to write the plain-text scenario files with less input from the developers, but in practice the documents tended to be less useful than when we were involved. Not because they couldn't write decent specifications, but because they couldn't refactor them to reuse bindings, etc. Our skills were still needed in the process.
Why does data go into a database? A good test is isolated to the scope that it is testing. For a UI layer test this means that we don't have a database. For a business tier test we shouldn't be reliant on the database to get data either.
In practice a database is one of the most difficult things to include in your testing because once any part of the data changes you cause cascading test failures.
Instead I would recommend making your features smaller and providing the data for your test in the scenario or binding. This also makes the conversation easier, because the fiftieth row of a test data pack is not something either party is going to remember. ;-) I recommend giving your data identities instead, so "bob" might be an individual in a test you can discuss, and both sides understand what makes him an interesting example.
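SpecFlow bindings themselves are C#, but the shape of the idea is language-agnostic. Below is a minimal sketch (in Python purely for brevity, with entirely illustrative names) of a binding layer that seeds an in-memory fake from the scenario's own data instead of reaching for a database.

    # All names here ("bob", the account repository, the context object) are
    # illustrative assumptions, not part of SpecFlow itself.
    class InMemoryAccountRepository:
        """Stands in for the real data store while a scenario runs."""

        def __init__(self):
            self._balances = {}

        def add(self, name, balance):
            self._balances[name] = balance

        def balance_of(self, name):
            return self._balances[name]

    # Binding for: Given a customer "bob" with a balance of 50
    def given_a_customer_with_balance(context, name, balance):
        context.accounts = InMemoryAccountRepository()
        context.accounts.add(name, balance)

    # Binding for: Then "bob" has a balance of 50
    def then_customer_has_balance(context, name, expected):
        assert context.accounts.balance_of(name) == expected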
good luck :-)
Update: With regard to using a database during testing, my experience is that there are a lot of complexities that make it a difficult choice to work with. Consider these points,
How will you reset the state of your data between tests?
How will you reset the state if one / some tests fail?
If you are using branches or even just if two developers are making changes at the same time, how will you support multiple test datasets?
How will you handle two instances of the tests running at the same time (don't forget the build server)?
Have a look at the question SpecFlow Integration Testing with Database Patterns, which includes some patterns that you can use.

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I have heard that I should use a separate database for reporting purposes, but knowing that I will need to store a huge amount of data, I have a lot of questions:
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about the results of operations), and I will need, for example, to run a report to know how many users failed an operation during the previous month.
I know that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end users want from reporting, or how they want to (or should) visualize the data. Once you have some concepts in mind, start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RDBMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database, provided the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job at aggregation and scale pretty far. This would also give you the ability to join data together and return complex results as the users request them.
Just remember, replication isn't easy or without its own set of problems.
This will start to show signs of weakness in the hundreds-of-millions-of-rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending out standardized reports with little interaction, I wouldn't necessarily recommend backing it with an RDBMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RDBMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RDBMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simple results would then be shipped to a key/value store or RDBMS to make reporting easier and achieve higher performance, at the cost of latency, compute, and possibly storage.
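As a rough illustration of that batch-and-ship idea (in Python rather than Ruby purely for brevity; the table, key, and field names are assumptions), pre-aggregating "failed operations per user per month" and pushing the denormalized result into a key/value store might look like this:

    import sqlite3   # stand-in for a copy of the production database
    import redis

    source = sqlite3.connect("production_copy.db")
    kv = redis.Redis(host="localhost", port=6379)

    rows = source.execute(
        "SELECT user_id, strftime('%Y-%m', created_at) AS month, COUNT(*) "
        "FROM operations WHERE status = 'failed' "
        "GROUP BY user_id, month"
    )

    # One hash per month, one field per user: cheap to read at report time.
    for user_id, month, failures in rows:
        kv.hset("failed_ops:" + month, user_id, failures)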
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend using a pre-built reporting service rather than writing it all manually if you need a large set of reports.
You might want to look at Tableau http://www.tableausoftware.com/ and others available.
Database: yes, a separate one seems safer; plus, reporting is generally done on old, consolidated data, and your live data might be too large to perform analysis on.
Database type: you have to choose based on the reporting service used, though I think Mongo is not supported by any of the reporting services, and MySQL is preferred.
If there are only one or two reports, you could just build them in Rails.

Suitability of Amazon SimpleDB for large temporal data sets emanating from thousands of separate devices

I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all the value1 readings yesterday" or "show me the average of value2 over the last 8 hours". We do this in SQL but can happily switch to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search on solutions for this I did come across Stack Overflow question What is the best open source solution for storing time series data? which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and eventually got access to the beta programme for what became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range, e.g. timestamps. This gives you much greater confidence in searching, i.e. "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
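For example (a boto3 sketch written after the fact; the table name, key names, and the hash/range key schema are assumptions), with device_id as the hash key and utc_timestamp as the range key, "all the data for device X between Monday and Friday" becomes a single Query call:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("SensorReadings")

    response = table.query(
        KeyConditionExpression=(
            Key("device_id").eq("device-123")
            & Key("utc_timestamp").between("2012-06-04T00:00:00Z",
                                           "2012-06-08T23:59:59Z")
        )
    )

    for item in response["Items"]:
        print(item["utc_timestamp"], item["value1"], item["value2"])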
In my opinion, Amazon SimpleDB, as well as Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are a complete non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting, it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time series information.
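A rough sketch of what that could look like with the rrdtool Python bindings; the file layout, data source names, and archive sizes below are assumptions based on four readings an hour.

    import rrdtool

    # One fixed-size file per device; it never grows past this layout.
    rrdtool.create(
        "device-123.rrd",
        "--step", "900",                 # expect one sample every 15 minutes
        "DS:value1:GAUGE:1800:U:U",      # mark unknown after 30 min of silence
        "DS:value2:GAUGE:1800:U:U",
        "RRA:AVERAGE:0.5:1:35040",       # raw 15-minute samples for ~1 year
        "RRA:AVERAGE:0.5:96:3650",       # daily averages for ~10 years
    )

    # Append the latest reading ("N" = now).
    rrdtool.update("device-123.rrd", "N:42.0:7.5")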
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in a way that lets most of your queries be executed from a single domain, without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
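A rough sketch of the domain-per-month flavour of that idea using the boto 2.x SimpleDB API (the naming scheme and attribute names are assumptions; note that SimpleDB stores everything as strings, so numeric values need consistent formatting or zero-padding to compare correctly):

    import boto

    conn = boto.connect_sdb()   # credentials taken from the environment

    def domain_for(utc_timestamp):
        # e.g. "readings_2012_06" -- one domain per calendar month
        return conn.create_domain("readings_" + utc_timestamp[:7].replace("-", "_"))

    # Write: one item per reading, item name = device id + timestamp.
    domain = domain_for("2012-06-15T10:30:00Z")
    domain.put_attributes(
        "device-123_2012-06-15T10:30:00Z",
        {"device_id": "device-123",
         "utc_timestamp": "2012-06-15T10:30:00Z",
         "value1": "0042.0",
         "value2": "0007.5"},
    )

    # Read: a per-device, per-day query only ever touches one domain.
    rows = domain.select(
        "select value1, value2 from `readings_2012_06` "
        "where device_id = 'device-123' and utc_timestamp >= '2012-06-15'"
    )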

Free data warehousing systems--specifically, for data storage

I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am looking only for something to store the data - I plan to build a custom front end/UI for it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for: going by your description above, something to store large amounts of logged data without much normalization, i.e. a visitor log.
Hypertable is based on Google's BigTable project.
see http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks
You lose the relational capabilities of SQL-based DBs, but you gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space permitting).
Hope that helps.
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball’s “The Data Warehouse Toolkit”, you will find that all it takes for a basic DW is a plain-vanilla SQL database; in other words, you could build a decent DW with MySQL using MyISAM as the storage engine. The question is only the desired granularity of information – what you want to keep and for how long. If your reports are mostly periodic, and you implement a report storage or cache, then you don’t need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star with cached reporting can provide decent performance in many cases.
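As a minimal sketch of what such a Kimball-style star could look like (using Python's built-in sqlite3 purely to keep the example self-contained; the suggestion above is plain MySQL/MyISAM, and the table and column names are assumptions):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE dim_date (
            date_key  INTEGER PRIMARY KEY,
            full_date TEXT, year INTEGER, month INTEGER
        );
        CREATE TABLE dim_page (
            page_key INTEGER PRIMARY KEY,
            url TEXT, section TEXT
        );
        -- The fact table: one row per page per day, keyed into the dimensions.
        CREATE TABLE fact_pageview (
            date_key INTEGER REFERENCES dim_date(date_key),
            page_key INTEGER REFERENCES dim_page(page_key),
            views INTEGER,
            unique_visitors INTEGER
        );
    """)

    # A typical periodic report: views per site section per month,
    # answered straight off the star with one GROUP BY.
    report = db.execute("""
        SELECT d.year, d.month, p.section, SUM(f.views)
        FROM fact_pageview f
        JOIN dim_date d ON d.date_key = f.date_key
        JOIN dim_page p ON p.page_key = f.page_key
        GROUP BY d.year, d.month, p.section
    """).fetchall()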
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses standard relational database
MDX (think pivot table)
ETL (via Kettle)
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine that for reporting you might find their MapReduce implementation useful as well.
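For a flavour of the MapReduce side, here is a Hadoop Streaming-style sketch where the map and reduce steps are plain Python functions reading stdin and writing stdout (in practice they would be two separate scripts); the log format, with the URL in the first tab-separated field, is an assumption.

    import sys

    # mapper.py: emit "url<TAB>1" for every pageview line on stdin.
    def mapper():
        for line in sys.stdin:
            url = line.rstrip("\n").split("\t")[0]
            sys.stdout.write(url + "\t1\n")

    # reducer.py: Hadoop sorts the mapper output by key, so identical URLs
    # arrive contiguously and a running count is enough.
    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            url, n = line.rstrip("\n").split("\t")
            if url != current:
                if current is not None:
                    sys.stdout.write(current + "\t" + str(count) + "\n")
                current, count = url, 0
            count += int(n)
        if current is not None:
            sys.stdout.write(current + "\t" + str(count) + "\n")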
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from one of the most revolutionary teams in database technology. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising it.
Perhaps you should have a look at TPC and see which of their test problem datasets best matches your case, and work from there.
Also consider the need for concurrency; it adds a big overhead for any kind of approach and sometimes is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the next step.
About SQL, I don't like it either, but I don't think it's smart to rule out an engine just because of the front-end language.
I face a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or another similar free tool) seems very nice for this sort of transformation.
Hope this helps a bit.
A lot of people just use Mysql or Postgres :)
