ASP.NET MVC small amount of data storage - asp.net-mvc

I am writing some learning tests (i.e. "what's the answer for..."; "choose the correct options..."). Now my question is, how should I store them? A SQL db seems like overkill, but I really don't know what the best choice would be if I wanted to select a random subset of questions, etc. Perhaps some simple XML files?
Thanks for advice.

An RDBMS could be a good option for you, since it sounds like you're wanting to collect some sort of result set based on the questions you're asking. This way you'll be able to tie the questions, answers and users together in some logical way.
You could easily store the questions in an XML file and that would work; however, it makes it a little trickier to tie your overall data together.
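If you do go the XML route, a minimal sketch might look like the following; the questions.xml file name and the question/text element names are assumptions for illustration, and the random subset comes from sorting on a random key with LINQ to XML:

    // Minimal sketch: load questions from an XML file and pick a random
    // subset. File name and element names (question, text) are made up.
    using System;
    using System.Linq;
    using System.Xml.Linq;

    class QuizLoader
    {
        static void Main()
        {
            var doc = XDocument.Load("questions.xml");
            var rng = new Random();

            // Sort by a random key and keep the first five questions.
            var subset = doc.Descendants("question")
                            .OrderBy(q => rng.Next())
                            .Take(5)
                            .ToList();

            foreach (var q in subset)
                Console.WriteLine((string)q.Element("text"));
        }
    }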
One thing you could do is draw or write out a real plan of how you'd like your data to interact with itself and other data. This will probably give you a better idea of what needs to be done and how to go about doing it.

The easiest way is to go for a relational database solution. If you don't need the support of the heavier DBs out there like MS SQL Server, have a look at SQL Server Express or SQLite. These databases don't need any database server running to work; they are just file-based databases and are easily moved around.
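For the SQLite option, a sketch could look like this; it assumes the System.Data.SQLite provider and a Questions table with a Text column (both assumptions), and leans on SQLite's ORDER BY RANDOM() for the random-subset part:

    // Sketch only: the System.Data.SQLite package, the quiz.db file and
    // the Questions(Text) table are assumptions for illustration.
    using System;
    using System.Data.SQLite;

    class QuizDb
    {
        static void Main()
        {
            using (var conn = new SQLiteConnection("Data Source=quiz.db"))
            {
                conn.Open();
                var sql = "SELECT Text FROM Questions ORDER BY RANDOM() LIMIT 5";
                using (var cmd = new SQLiteCommand(sql, conn))
                using (var reader = cmd.ExecuteReader())
                {
                    // Print five randomly chosen questions.
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }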

Related

Avoid writing SQL queries altogether in SSIS

Working on a Data Warehouse project, the guy that gave us the tutorial advised that we stick to using SQL queries rather than defining a lot of data flow transformations, citing points like it'll consume a lot of memory on the ETL box, so we'd rather leave the processing to the DB box. Is this really advisable? Where's the balance between relying on GUI tools and executing a bunch of SQL scripts in your Integration package?
And honestly, I'd like to avoid writing SQL queries as much as I can. (but that's beside the point. I'd really like to look at this objectively.)
The answer is: it depends, but you want to pick one or the other for any given job and avoid mixing the two where possible.
Generally, it's best to either do everything possible within the tool or do everything possible within stored procedure code. When you have significant amounts of logic split between layers the system becomes harder to trace and debug.
Where the tool can do the transformations without the data flows becoming awkward and convoluted you could use the tool and try to have little or no logic in queries. This means that one single layer has the business logic and it should be fairly obvious where to find it. However, ETL tools tend to handle highly complex transformations relatively poorly. The sweet spot for this type of approach is on systems where you have a large number of data sources but relatively simple transformations.
If you have relatively complex transformations you may be better off putting all the business logic and transformation into a layer of stored procedures. SQL code is better at implementing complex transformations in a maintainable way - I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors use this type of architecture for precisely that reason. In this case the ETL tool can be used to implement relatively dumb data copies. Source data can be copied into staging areas essentially verbatim and then picked up by a body of stored procedure code that does the ETL. The ETL tool can be used for data copies, bulk load operations, logging, scheduling and other framework tasks.
In either case you're best off picking one approach. Otherwise, you can end up with business logic spread across extraction layers, database views, data flows, and stored procedure code. Logic spread across multiple layers is much harder to test.
When all of the logic is (for example) contained within stored procedures or focussed ETL transformation jobs you can unit test a given transformation in isolation. The clarity in design also helps with maintenance and auditing.
I find that using SQL code is not only faster to run, but also faster to develop and much, much easier to maintain.
Generally, when you want to process each row individually, use a data flow; otherwise it may be better to use a SQL command.
Personally I'd go with writing the SQL where I can. It's easier to optimise later and (usually) faster as well. Google will give much more detailed answers.
Another factor to think about is the provider you use for your connections.
You need to make the decision based on your needs. We use a Postgres DB, so we have to create a load of staging tables for some processes, which speeds the whole thing up.
You should also take into consideration the box it is running on: if you have an all-powerful DB box and a little ETL box, there'd be no point in running anything heavy on the ETL side.
If you do all your processing on the ETL box you'll be dragging a lot of data across the network as well.
Check out these links to get you started:
ssistalk.com/category/ssis/ssis-advanced-techniques/
msdn.microsoft.com/en-us/library/ms141031.aspx
weblogs.sqlteam.com/jamesn/Default.aspx
I think this is a difficult question; and an interesting one as well.
One reason to use SSIS is to improve maintainability, IMHO. If you pack all the logic into SQL statements (and you sure can!) you tend to defeat the reason for using SSIS in the first place. You cannot really "see the data flow" anymore.
On the other hand, I feel there are times when a well-placed SQL statement has its value. For example, when you read data from a table and already know, for whatever reason, that you will only ever need the rows satisfying condition X, I do not see the reason for reading the whole table and then "conditional-splitting most of it away" in the next step.
What I do not know is what this means in terms of performance, by the way. Is SSIS smart enough to see what is happening and change the "read-whole-table-and-conditionally-split-it" into a "select Y from ... where X" on the fly (or when building/deploying)?
The big question is where to draw the line. And this depends to a certain extent on the people working on your ETL process. If everyone who will ever support the process has known SQL from the beginning, you can support a higher amount of SQL in your ETL than if you have co-workers (or customers, or successors you care about) who hardly understand what is happening in all your SQL, let alone change, improve or add to it.
So I think the bottom line is that neither avoiding SQL entirely nor doing everything in SQL is better. Try to make up some simple rules that fit your requirements and that everyone can live with, then follow them. This buys you the most value from using SSIS.
SQL Server does some things well and other things not so well. I use SSIS to import data to or export data from SQL Server, and during the course of the move I use SSIS where it makes sense. I can easily do work on a per-row basis, which is not very efficient in SQL Server (cursors). To say that you shouldn't use transformations and data flows on an ETL box because it is too expensive on the ETL box is like saying "don't drive your car too fast, because it causes the engine to work". The purpose of an ETL tool like SSIS is to take some of the processing that SQL Server does not do well and move it to an engine that does.
Got to use the right tool for the job. Generally, you do most things in SSIS, with certain things done in "pure" SQL.
For instance, in cases where you do a lot of UPDATEs (a table difference on a dimension table in a dimensional model, say), you really don't want to execute an UPDATE for each row. In this scenario, you do a regular insert into a temporary table and then do the UPDATE in SQL, joining on appropriate keys, as sketched below.
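A rough sketch of that pattern; the DimCustomer/StagingCustomer tables and their columns are invented, and the staging load itself (SqlBulkCopy or an SSIS data flow) is left out:

    // Set-based UPDATE pattern: load changes into a staging table first,
    // then update the dimension in one statement instead of row by row.
    using System.Data.SqlClient;

    class DimensionUpdater
    {
        static void UpdateDimension(string connectionString)
        {
            // Step 1 (not shown): bulk-load the changed rows into
            // StagingCustomer, e.g. with SqlBulkCopy or an SSIS data flow.

            // Step 2: one set-based UPDATE joining on the business key,
            // instead of an UPDATE per row.
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(@"
                UPDATE d
                SET    d.Name = s.Name,
                       d.City = s.City
                FROM   DimCustomer d
                JOIN   StagingCustomer s ON s.CustomerKey = d.CustomerKey;", conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }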

Opinion Mining - What Database Type?

I am entering a project to do Opinion Mining (Data Mining -> Web Mining -> Opinion Mining) to get the semantic orientation of the words contained. We will use a crawler to fetch the opinion pages. Now the question is, what type of database (OO, relational, hierarchical, etc.) is best to use in this type of project?
I know this is a specific question; I'm not expecting everybody to respond, but an answer from someone who has already done it would help.
Regards!
If you need something large scale and responsive, you would probably need to go for Google's BigTable or something of that nature. At the prototype level, I am sure you can use traditional relational databases, but at certain point you'd hit the performance wall. See Brewer's CAP Theorem.
From my experience with this kind of scenario, a relational database can serve your purpose pretty well. You need to be extra careful when storing the web content part of it: decide whether you want to use a database to store it at all, or whether something as simple as a file system will do. BLOBs especially require extra care, and they increase your maintenance work.
Also, based on the nature of the project, you will certainly be using a lot of already-built components, many of which already support (or are easy to extend to use) a relational DB as a data store.

Is Using Db4o For Web Sites a judicious choice?

Is using Db4o as a backend datastore for a Web site (ASP.NET MVC) a judicious choice as an alternative to MS SQL Server ?
The main issue with DB4o is: Can you cut your object net in some useful manner? If not, then you'll keep too many objects in RAM for too long and your performance will suffer.
For example, in SQL you can create a cursor and then easily traverse a huge result set. You can also query for a small set of columns, while DB4o always loads whole objects (and their references, and the references of the references). With DB4o, you must make sure that it doesn't try to pull in all objects from the DB at once.
You'll also need to get used to querying your "DB" by filling out example objects, which feels weird in the beginning.
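To give a feel for it, here is a tiny query-by-example sketch against the db4o .NET API (exact API names vary a bit between db4o versions; the Person class and file name are invented):

    // Query-by-example sketch; Person and people.db4o are made up.
    using System;
    using Db4objects.Db4o;

    class Person
    {
        public string Name;
        public int Age;
        public Person(string name, int age) { Name = name; Age = age; }
    }

    class Example
    {
        static void Main()
        {
            IObjectContainer db = Db4oFactory.OpenFile("people.db4o");
            try
            {
                db.Store(new Person("Alice", 30));

                // The template object is the query: fields left at their
                // default values (here Age = 0) are ignored, so this
                // matches every Person named "Alice".
                IObjectSet result = db.QueryByExample(new Person("Alice", 0));
                foreach (Person p in result)
                    Console.WriteLine(p.Name + " " + p.Age);
            }
            finally
            {
                db.Close();
            }
        }
    }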
That depends: what kind of site are you creating, what traffic are you expecting, etc.? Are you going to handle a million requests a second, or 100 a minute? Does your domain justify using an object database? Do you really need it?
In general, most sites are not heavy hitters, so they might not require all the scale-out functionality (I believe, and this is only a belief, that traditional RDBMSs have been tested and designed to handle extreme loads, whereas object DBs might not have been given the same attention).
So then the question is: does your domain justify this? You're going to base a core piece of your site on a technology in which you will not find a lot of experts. So how do you handle turnover? Are you willing to take on the cost of training all current and future employees on this?

Free data warehousing systems--specifically, for data storage

I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am looking only for something to store the data--I plan to build a custom front end/UI for it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for: it is (and I'm going by your descriptions above here) something to store large amounts of logged data, i.e. a visitor log.
Hypertable is based on Google's BigTable project.
See http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks.
You lose the relational capabilities of SQL-based DBs, but you gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space notwithstanding).
Hope that helps.
I may not understand the problem correctly; however, if you find some time to (re)visit Kimball's "The Data Warehouse Toolkit", you will find that all it takes for a basic DW is a plain-vanilla SQL database. In other words, you could build a decent DW with MySQL using MyISAM as the storage engine. The question is only the desired granularity of information: what you want to keep and for how long. If your reports are mostly periodic, and you implement a report storage or cache, then you don't need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star with cached reporting can provide decent performance in many cases.
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses standard relational database
MDX (think pivot table)
ETL (via Kettle)
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine that for reporting you might find their MapReduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from the most revolutionary team in database technologies; they just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising them.
Perhaps you should have a look at TPC and see which of their test problem datasets best matches your case, and work from there.
Also consider the need for concurrency; it adds a big overhead for any kind of approach and sometimes is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the next step.
About SQL, I don't like it either but I don't think it's smart ruling out an engine just because of the front-end language.
I am facing a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or another similar free tool) seems very nice for this sort of transformation.
Hope this helps a bit.
A lot of people just use MySQL or Postgres :)

If using LINQ to SQL is there any good reason to learn SQL queries/syntax anymore?

I do understand SQL querying and syntax because of previous work using ASP.NET web forms and stored procedures, but I would not call myself an "expert" in it.
Since I have been using ASP.NET MVC and LinqToSql, it seems that so much of the heavy lifting is done for me and encapsulated away at the SQL end that I'm questioning whether there is any benefit in continuing to top up my knowledge of SQL queries, or whether I'm better off focusing my "learning time" on other things.
Your thoughts?
You should absolutely know SQL and keep your knowledge up-to-date. ORM is designed to ease the pain of doing something tedious that you know how to do, much like a graphing calculator is designed to do something that you can do by hand (and should know how).
The minute you start letting your ORM do things in the database that you don't fully understand is the minute you've lost control over your model.
In my opinion, knowing SQL is more valuable than any vendor specific technology. There will always be cases when those nice prepackaged frameworks will not be able to solve a particular situation and knowledge of advanced SQL will be required.
It is still important to learn SQL queries and syntax, because you need to at least understand how LINQ to SQL translates to the database behind the scenes.
This will help you when you find problems, for example something not updating correctly, or when query performance needs to improve.
It is the same as needing to understand what assembly language is and how it eventually becomes machine language. You don't have to be an expert in it, but you should at least be able to write it and understand it.
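One concrete way to keep that understanding is to watch what LINQ to SQL actually emits. A minimal sketch, assuming a hand-mapped Customer class and a made-up connection string (both placeholders for your own model):

    // Sketch: inspecting the SQL that LINQ to SQL generates. The
    // connection string and the Customers mapping are assumptions.
    using System;
    using System.Data.Linq;
    using System.Data.Linq.Mapping;
    using System.Linq;

    [Table(Name = "Customers")]
    class Customer
    {
        [Column(IsPrimaryKey = true)] public string CustomerID;
        [Column] public string City;
    }

    class Program
    {
        static void Main()
        {
            var db = new DataContext(@"Server=.;Database=Shop;Integrated Security=true");
            db.Log = Console.Out; // echoes every SQL statement the context runs

            var query = db.GetTable<Customer>().Where(c => c.City == "London");

            // Or grab the SQL text directly, without executing the query:
            Console.WriteLine(db.GetCommand(query).CommandText);
        }
    }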
It is still important to know SQL and the (set-based) paradigm behind it to be able to create efficient SQL statements, even if you're using LinqToSql or any other OR/M.
There will always be situations where you will want to write the query in native SQL, because it is not possible to express it in LinqToSql / HQL / whatever, or because LinqToSql is just not able to generate a performant query for it.
There will always be situations where you will want to execute an ad-hoc query on a database using native SQL, etc.
I think LinqToSQL (or other LINQ to SQL providers) should not prevent you from knowing SQL.
When your query is not returning what you expect, or when it takes 30 minutes to run on the production database, you'd better be able to understand what LTS has generated, and why it is failing.
I know, it's a rehashed topic, and it might not be applicable to what you do ("small" database that will never hit that kind of problem etc), but it pays not to get too oblivious of abstraction layers sometimes.
The other reason is that LINQ does not cover the whole range of what you can do in SQL, so you might have to resort to writing "raw" SQL, even if the result is materialised as objects.
It depends what you're working on, and from what you said it might make more sense to focus on other areas.
Having said that I find knowing SQL allows the following:
The ability to write queries to extract data from systems easily.
For ad-hoc queries, or for checking things.
The ability to write complex stored procedures, which allows me to group complex data processing in one place, where it should be, in the database.
The ability to fine-tune LinqToSql by adding indexes and understanding the SQL/query plans it produces.
Most of these help more on complex systems, so if you're not working on those they might not be as useful.
It may help in your situation to list the technologies which might be of use, and then prioritise them.
In other words, make a development plan for yourself, which may encompass more than just technical knowledge and allow a broader focus on things like design patterns, communication skills and other areas.
SQL is a tool. Linq to SQL is also a tool. Having more tools in your belt is a good thing. It'll give you more perspectives when attacking a problem.
Consider a scenario where you may want to do multiple queries or multiple updates to the db in one operation. If you can write TSQL you can potentially save yourself a lot of roundtrips to the database.
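As a rough sketch of that saving (the Orders/Stock/AuditLog tables and parameter values are invented), several statements can travel to the server as one batch:

    // One round trip instead of three: batch the statements into a single
    // command. Table names and parameters are hypothetical.
    using System.Data.SqlClient;

    class Batcher
    {
        static void ApplyChanges(string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = @"
                    UPDATE Orders SET Status = 'Shipped' WHERE OrderId = @id;
                    UPDATE Stock  SET Quantity = Quantity - 1 WHERE ProductId = @pid;
                    INSERT INTO AuditLog (Message) VALUES (@msg);";
                cmd.Parameters.AddWithValue("@id", 42);
                cmd.Parameters.AddWithValue("@pid", 7);
                cmd.Parameters.AddWithValue("@msg", "order 42 shipped");

                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }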
I would say you definitely need to know your SQL in depth, because you need to know what code your LINQ expression generates and what effects the code will have if you want high-performing queries. Sure, you might get the job done in most cases, but sometimes a very subtle difference in LINQ syntax makes a huge difference in performance.
I ran into this just this morning, actually, where I had done .Any(d => d.Id == (...).First().Id) instead of doing where (...).Any(i => i.Id == d.Id). This resulted in the query executing five times slower.
Sometimes you need to analyze the actual SQL query to realise the mistakes you make.
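A hypothetical reconstruction of that pitfall (Customers/Orders and the connection string stand in for the actual entities; the exact slowdown will vary by schema and provider):

    // Two LINQ shapes that look alike but translate very differently.
    using System;
    using System.Data.Linq;
    using System.Data.Linq.Mapping;
    using System.Linq;

    [Table(Name = "Customers")]
    class Customer
    {
        [Column(IsPrimaryKey = true)] public int Id;
    }

    [Table(Name = "Orders")]
    class Order
    {
        [Column(IsPrimaryKey = true)] public int Id;
        [Column] public int CustomerId;
    }

    class AnyShapes
    {
        static void Main()
        {
            var db = new DataContext(@"Server=.;Database=Shop;Integrated Security=true");
            var customers = db.GetTable<Customer>();
            var orders = db.GetTable<Order>();

            // Slower shape: First() inside the predicate forces a nested
            // TOP(1) subquery that may be re-evaluated for every row.
            var slow = customers.Where(c =>
                orders.Any(o => o.Id == orders.Where(x => x.CustomerId == c.Id)
                                              .First().Id));

            // Faster shape: a direct Any() translates to a single EXISTS.
            var fast = customers.Where(c =>
                orders.Any(o => o.CustomerId == c.Id));

            // Comparing the generated SQL shows the difference.
            Console.WriteLine(db.GetCommand(slow).CommandText);
            Console.WriteLine(db.GetCommand(fast).CommandText);
        }
    }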
It's always a good thing to learn the underlying language for tools like LINQ to SQL. SQL is pretty much standardized, and it will help you understand a new paradigm in programming.
You may not always be working in .NET.
Doesn't hurt to know the underlying concepts.
LINQ to SQL is not being maintained anymore, in favor of the Entity Framework.
Sooner or later you will run into problems that need at least a working knowledge of SQL to solve. And sooner or later you will run into requirements that are best realised in the DB (whether in SPs, triggers, views or whatever).
LINQ to SQL will only work with .NET. If you happen to get another job where you are not working with .NET, you will have to go back to writing stored procs.
Knowing SQL will also give you a better understanding of how the server operates as well as possibly making you a better database designer.