Update
While searching for something completely unrelated, I found this: Lightweight SQL database which doesn't require installation, which lists a few possible options with roughly the same goals.
Original post
We have a desktop .NET/WPF app with a large (for a desktop app) amount of stored data: layouts, templates, product lists, technical specs, and much more.
Today it's stored in an Access DB, but we have hit Access's limitations pretty hard: it's not very fast, the DB weighs 44 MB (which results in a large setup package), and, more importantly, it's a pain to use with version control because we can't merge data from one branch to another. For instance, we might create a branch to add a few products, but then we have to add them manually to the trunk when we merge. We could use SQL scripts, but writing advanced SQL scripts for Access is a pain.
Basically, I want to replace the MS Access DB with another storage format, because Access is not well suited to our needs.
I had thought of using JSON files that would be unzipped during or after install, but I'm a bit afraid of performance problems.
I'm also thinking of splitting the data into multiple files with multiple formats, depending on how each part is used, but juggling different formats might get complicated or annoying to develop.
Performance
Some parts of the DB are accessed pretty often and should be performance-optimized, whereas others are accessed maybe once or twice per work session, so a poor-performance but high-compression format would be OK for those.
Size
We want the installer to be as small as possible, so the library should be small and the format should produce small files. Using a library that adds 5 MB to the installer is out of the question.
Compatibility
The software must be able to run on .NET 4 (not 4.5), and it would be great if it ran on Windows XP (even though we're thinking more and more of just abandoning XP going forward, it's still more than 7% of our market share).
Moreover, it should not require installing a server (it should be serverless, like MS Access or SQLite), because it will be installed on end users' computers and we don't want to bloat them.
Versioning
It should be easy to version both the data and the DB structure. The files should either be text (like JSON), or scripts should be easy to run on the continuous integration platform (as with SQL Server).
So, which technology would you use that meets all these constraints?
Thanks!
As for your version control pains: from your description, if I were you I'd keep the raw data in version-controlled text files and have the build process produce the database from them. That way, you should be able to use SQLite.
I would go for SQLite in your case, since the files are self-contained and easy to locate (hence easy to keep in a version control system), the installer is small, and performance is good.
http://www.sqlite.org/
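To make the build-time idea concrete, here is a minimal C# sketch, assuming the System.Data.SQLite provider and a hypothetical products.json file with an invented schema; a CI step could run something like this to produce the shipped database from the version-controlled text files:

    // Build step: generate the shipped SQLite file from version-controlled JSON.
    // File names and schema are invented; adjust to the real data model.
    using System.Collections.Generic;
    using System.Data.SQLite;
    using System.IO;
    using System.Web.Script.Serialization; // built into .NET 4 (System.Web.Extensions)

    class DbBuilder
    {
        static void Main()
        {
            SQLiteConnection.CreateFile("appdata.db");
            using (var conn = new SQLiteConnection("Data Source=appdata.db"))
            {
                conn.Open();
                using (var cmd = conn.CreateCommand())
                {
                    cmd.CommandText = "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, spec TEXT)";
                    cmd.ExecuteNonQuery();
                }

                // One transaction for the whole import keeps the build step fast.
                using (var tx = conn.BeginTransaction())
                using (var insert = conn.CreateCommand())
                {
                    insert.CommandText = "INSERT INTO products (id, name, spec) VALUES (@id, @name, @spec)";
                    var rows = (object[])new JavaScriptSerializer()
                        .DeserializeObject(File.ReadAllText("products.json"));
                    foreach (Dictionary<string, object> row in rows)
                    {
                        insert.Parameters.Clear();
                        insert.Parameters.AddWithValue("@id", row["id"]);
                        insert.Parameters.AddWithValue("@name", row["name"]);
                        insert.Parameters.AddWithValue("@spec", row["spec"]);
                        insert.ExecuteNonQuery();
                    }
                    tx.Commit();
                }
            }
        }
    }

The JSON stays diffable and mergeable in branches, and the binary DB never enters source control.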
Related
I'm writing a program for a client. The data they send us is essentially information from a relational database that got flattened, resulting in utterly gigantic comma-delimited text files that consist of extremely redundant information with only a few fields changing per line.
I am reading this into a typed dataset and essentially organizing the data into third normal form, which drastically cuts down on the sheer amount of redundancy. From there, I convert the data in the dataset to XML and send it off to another program that creates forms and statements.
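Concretely, my current approach looks roughly like this (table and column names simplified for illustration), with the relation doing the constraint checking and WriteXml producing the hand-off file:

    using System.Data;

    class Pipeline
    {
        static void Main()
        {
            var ds = new DataSet("Export");

            // Normalized tables replace the flattened file's redundancy.
            var customers = ds.Tables.Add("Customer");
            customers.Columns.Add("CustomerId", typeof(int));
            customers.Columns.Add("Name", typeof(string));
            customers.PrimaryKey = new[] { customers.Columns["CustomerId"] };

            var orders = ds.Tables.Add("Order");
            orders.Columns.Add("OrderId", typeof(int));
            orders.Columns.Add("CustomerId", typeof(int));
            orders.PrimaryKey = new[] { orders.Columns["OrderId"] };

            // The relation rejects any order referencing a missing customer,
            // so validation happens on every Rows.Add.
            ds.Relations.Add("CustomerOrders",
                customers.Columns["CustomerId"], orders.Columns["CustomerId"]);
            ds.EnforceConstraints = true;

            customers.Rows.Add(1, "Acme");
            orders.Rows.Add(100, 1);

            // Nested XML mirrors the relational structure for the next program.
            ds.Relations["CustomerOrders"].Nested = true;
            ds.WriteXml("export.xml", XmlWriteMode.WriteSchema);
        }
    }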
However, I'm wondering if there's a better way to go about this. It might not be as bad as I think it is, but I can't shake the feeling that there's a better, faster way to do this. The important thing is that the data is organized and easily understood, and that it is constraint-checked and validated before I convert it to XML.
Since none of the data needs to persist (in fact, it shouldn't), an actual RDBMS didn't seem worth it if I was just going to end up clearing it after every use.
The program also needs to run in a myriad of environments; my workstation is Windows 7 64-bit, the testing server is Windows XP 32-bit, and the production server is Windows 7 64-bit or 32-bit depending on which server it's going on.
IMHO, I would start off with SQL Server Express: it's designed to work through those kinds of data volumes, it will run on the different platforms you mention, it scales up to the bigger editions if necessary, SSMS gives you a tool for easily examining intermediate results, and importing .csv files is straightforward. And it's free.
For all the above reasons, I would give SQL Express a try and evaluate its real-world performance.
Going back to your original question, my opinion fwiw is that this is a reasonable approach; I don't think you're missing anything.
In my application I want to use files for storing data. I don't want to use a database or a clear-text file; the goal is to save double and integer values along with a string, just to identify the name of the record. I simply need to save data on disk for generating reports. A file can grow to a gigabyte or more. What format do you suggest? Binary? If so, what VCL component/library do you know that is good to use? My goal is to create an application which creates and updates the files, while another tool will "eat" those files, producing nice PDF reports for the user on demand. What do you think? Any ideas or suggestions?
Thanks in advance.
If you don't want to reinvent the wheel, you may find all the needed open-source tools for your task on our side:
Synopse Big Table to store huge amounts of data - see in particular the TSynBigTableRecord class to store an unlimited number of records with fields, including indexes if needed - it will definitely be faster and use less disk space than any regular SQL DB
Synopse SQLite3 Framework if you would rather use a standard SQLite engine for the storage - it comes with a full Client/Server ORM
Reporting from code, including PDF file generation
With full source code, working from Delphi 6 up to XE.
I've just updated the documentation of the framework: more than 600 pages, with details of every class method, and a new, enhanced general introduction. See the SAD document.
Update: if you plan to use SQLite, you should first work out how the data will be stored, which indexes need to be created, and how a SQL query can speed up your requests. It's a bad idea to read the whole file content for every request: you should structure your data so that a single SQL query returns the expected results. Sometimes adding derived values (like running sums or means) to the data is a good idea. Also consider using SQLite3's RTree virtual table, which is dedicated to speeding up access to multi-dimensional double min/max data: it may speed up your requests a lot.
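To illustrate the RTree suggestion, here is a hedged C# sketch (table and column names invented, and assuming your SQLite build includes the R*Tree module): each row stores min/max bounds per dimension, and range queries resolve through the tree instead of a full scan.

    using System;
    using System.Data.SQLite;

    class RTreeDemo
    {
        static void Main()
        {
            using (var conn = new SQLiteConnection("Data Source=samples.db"))
            {
                conn.Open();
                using (var cmd = conn.CreateCommand())
                {
                    // One integer id, then a min/max pair per dimension.
                    cmd.CommandText = "CREATE VIRTUAL TABLE sample_index "
                                    + "USING rtree(id, minTime, maxTime, minValue, maxValue)";
                    cmd.ExecuteNonQuery();

                    cmd.CommandText = "INSERT INTO sample_index VALUES (1, 10.0, 20.0, 0.5, 0.9)";
                    cmd.ExecuteNonQuery();

                    // All entries overlapping the time window [12, 18]:
                    // the R*Tree answers this without scanning every row.
                    cmd.CommandText = "SELECT id FROM sample_index "
                                    + "WHERE maxTime >= 12.0 AND minTime <= 18.0";
                    using (var reader = cmd.ExecuteReader())
                        while (reader.Read())
                            Console.WriteLine(reader.GetInt64(0));
                }
            }
        }
    }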
You don't want to use a full SQL database, and you think that a plain text file is too simple.
Points in between those include:
Something that isn't a full SQL database, but more of a key-value store. Technically not a flat file, but it does provide a single key+value list that is quickly searchable on a single primary key. Such as BSDDB. It has the letters D and B in the name. Does that make it a database, in your view? It's not a relational database and doesn't do SQL; it's just a binary key-value (hashtable) blob-storage mechanism using a well-understood binary file format. Personally, I wouldn't start a new project using anything in this category.
Recommended: Something that uses SQL but isn't as large as standalone SQL database servers. For example, you could use SQLite and a delphi wrapper. It is well tested, and used in lots of C/C++ and Delphi applications, and can be trusted more than anything you could roll yourself. It is a very light embedded database, and is trusted by many.
Roll your own ISAM or VLIR format, which will eventually morph over time into your own in-house DBMS. Multiple files are involved, and there are indexes, so you can look up data quickly without loading everything into memory. Not recommended.
The flattest of flat binary fixed-record-length files. You originally mentioned PowerBASIC, which has something called Random Access files, and then deleted that from your question. This is probably what you are looking for, especially if append-only writes are the primary operation. You could roll your own Turbo Pascal-era "file of record", but if you use the FILE OF RECORD type you hit the 2 GB limit, and there are problems with Unicode, so use TStream instead. Binary file formats have a lot of strikes against them, above all that it is difficult to grow and extend a binary format over time without breaking your ability to read old files (see the sketch below). This is a key reason why I recommend starting out with what might at first seem like overkill (SQLite) instead of rolling your own binary solution.
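A minimal C# sketch of the fixed-record idea (the 12-byte layout is invented for illustration): writing is trivial and seeking is cheap, but the comment at the end is the catch that makes me recommend SQLite instead.

    using System;
    using System.IO;

    class FlatFile
    {
        // Fixed 12-byte record: 4-byte int id + 8-byte double value.
        // Record N lives at offset N * RecordSize, so lookups are one seek.
        const int RecordSize = 12;

        static void Main()
        {
            // Append-only write, the primary operation in this scenario.
            using (var w = new BinaryWriter(File.Open("samples.dat", FileMode.Append)))
            {
                w.Write(42);    // id
                w.Write(3.14);  // value
            }

            // Random-access read of the last record.
            using (var r = new BinaryReader(File.OpenRead("samples.dat")))
            {
                long count = r.BaseStream.Length / RecordSize;
                r.BaseStream.Seek((count - 1) * RecordSize, SeekOrigin.Begin);
                Console.WriteLine("last record: id={0} value={1}",
                                  r.ReadInt32(), r.ReadDouble());
            }

            // The catch: add one field to the record and every existing file
            // becomes unreadable unless you also write and check a versioned header.
        }
    }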
(Update 2: after the question was updated to mention PDFs and what sounds like a reporting-system requirement, I think you really should be using a real database, perhaps a small and easy-to-use one like Firebird or InterBase.)
I would suggest using TClientDataSet: use its SaveToFile() / SaveToStream() methods in the generating program, and its LoadFromFile() / LoadFromStream() methods in the program that will "consume" the data. That way, you can still have indexed records without connecting to any external database, while keeping the interchange data in a single file.
Define an API for working with your flat file, so that the API can be implemented by a separate data layer in many ways.
Implement the API using a standard embedded SQL database (e.g. SQLite or Firebird), as sketched below.
Only if there is something wrong with the standard solution should you think about rolling your own.
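A hedged sketch of that layering in C# (interface and method names invented; the same shape works in Delphi): callers see only the API, and the embedded-database implementation behind it can be swapped without touching them.

    using System;
    using System.Data.SQLite;

    // The application codes against this contract only.
    public interface IRecordStore
    {
        void Append(string name, double value);
        double Sum(string name);
    }

    // One possible data layer, using the System.Data.SQLite embedded engine.
    public class SQLiteRecordStore : IRecordStore
    {
        private readonly string _cs;

        public SQLiteRecordStore(string path)
        {
            _cs = "Data Source=" + path;
            Execute("CREATE TABLE IF NOT EXISTS records (name TEXT, value REAL)");
        }

        public void Append(string name, double value)
        {
            Execute("INSERT INTO records VALUES (@n, @v)", name, value);
        }

        public double Sum(string name)
        {
            using (var conn = new SQLiteConnection(_cs))
            using (var cmd = conn.CreateCommand())
            {
                conn.Open();
                cmd.CommandText = "SELECT COALESCE(SUM(value), 0.0) FROM records WHERE name = @n";
                cmd.Parameters.AddWithValue("@n", name);
                return Convert.ToDouble(cmd.ExecuteScalar());
            }
        }

        private void Execute(string sql, string name = null, double value = 0)
        {
            using (var conn = new SQLiteConnection(_cs))
            using (var cmd = conn.CreateCommand())
            {
                conn.Open();
                cmd.CommandText = sql;
                if (name != null)
                {
                    cmd.Parameters.AddWithValue("@n", name);
                    cmd.Parameters.AddWithValue("@v", value);
                }
                cmd.ExecuteNonQuery();
            }
        }
    }

If the embedded database ever proves to be the wrong choice, only SQLiteRecordStore gets rewritten; the report generator keeps calling Append and Sum.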
I use kbmMemTable - see http://www.components4developers.com/ - fast, reliable, been around a long time - it supports binary and CSV streaming in and out of files, as well as indexing, filters, and lots of other goodies - TClientDataSet will not do well with large datasets.
I am currently the single BI developer for a corporate data warehouse and cube. I use SQL Server 2008, SSAS, and SSIS as my basic toolkit, Visual Studio + BIDS as my IDE, and TFS for source control. I am about to take on multiple projects with an offshore vendor, and I am worried about managing change. My major concern is managing merges and changes between me and the offshore team. Merging and managing changes to SQL and XML for just one person is bad enough, but with multiple developers it seems like a nightmare. Any thoughts on how best to structure development, knowing that sometimes there is no way to avoid multiple individuals making changes to the same file?
SSIS, SSAS, and SSRS files are not merge-friendly. They are stored in XML files that change drastically even with minor edits (such as changing a property), so merging becomes practically impossible.
So stop thinking about parallel development on one file, and instead think about how to ensure people never need to develop the same file in parallel. Start by disabling multiple checkout of a file. You might even want to consider enabling the option to get the latest version on checkout.
Then start thinking about how you can enable people to work independently. This is more about the way you structure the work and the files:
Give people their own area to work on: one SSIS package is developed by only one person at any given moment in time.
Make smaller files, so the chance that two people need to work in the same file is small.
I have given feedback to the product team about the incompatibility of BIDS with merging. It is a known issue, but it will be hard to tackle. They don't know when it will be possible to really do parallel development on these files. Until then, keep away from parallel development.
As Ewald Hofman mentioned, SSAS and SSIS are not merge-friendly.
One environment I worked in solved the problem as follows:
only use SSIS when you have to (fuzzy matching or something similar). Replace SSIS packages with SQL code wherever you can (see linked servers for data synchronization and the MERGE command for dimension/fact-table creation, for instance).
build your data warehouse structure as follows:
build two databases: one for the "raw source data" from the source systems, and one (the "stage" database) for the dimension and fact views and tables
use procedures that can deploy the whole "stage" database
put the structure for the "stage" database into your Repository
build a C# application that builds your dimensions and cubes via the AMO API (I know, that's a tough job at the beginning, but it's worth it - think of what you gain, look at the pros below, and see the sketch after this list)
add the stage database and the C# application to your Repository (TFS/Git etc.)
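As a taste of what the AMO-based approach looks like, here is a hedged C# sketch (server, database, cube, measure-group, and query names are all invented) that generates one partition per year instead of clicking them together in BIDS:

    // Hedged sketch of automating partition creation via AMO
    // (Microsoft.AnalysisServices.dll); all object names are invented.
    using Microsoft.AnalysisServices;

    class PartitionBuilder
    {
        static void Main()
        {
            var server = new Server();
            server.Connect("Data Source=localhost");

            Database db = server.Databases.GetByName("Warehouse");
            Cube cube = db.Cubes.GetByName("Sales");
            MeasureGroup mg = cube.MeasureGroups.GetByName("Fact Sales");

            // One partition per year, generated instead of clicked together.
            for (int year = 2010; year <= 2012; year++)
            {
                Partition p = mg.Partitions.Add("Sales " + year);
                p.Source = new QueryBinding(
                    db.DataSources[0].ID,
                    "SELECT * FROM FactSales WHERE OrderYear = " + year);
                p.StorageMode = StorageMode.Molap;
                p.Update();
                p.Process(ProcessType.ProcessFull);
            }

            server.Disconnect();
        }
    }

Because it is plain C#, this generator diffs and merges like any other source file, which is the whole point.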
Pros of that structure:
you have a merge-able structure you can put in your Repository
you are using the AMO API, which means:
you can automate the generation of new partitions
you can use procedures to automate and clone measure groups across different cubes (which I think is sometimes a big benefit!)
you can outsource your translations and import them easily (the cube designer is probably not the best translator)
Cons:
the vendor would probably not adopt that structure
you have to pay more (because of higher skill requirements, or for teaching the vendor your individual structure)
you probably need knowledge of a new language, C# - if you don't have it already
Conclusion:
there are possibilities to get a merge-friendly environment
you will lose some nice click-and-run tools (e.g. BIDS), but will gain highly automated functionality
outsourcing may turn out to be unprofitable because of the high degree of customization
As long as both teams are using BIDS and TFS, this should not be a problem.
Assuming that your T-SQL code is checked into source control with a single file per object, merging T-SQL code is straightforward, since it is text-based. I have found that VSTS database projects help with this.
Merging the XML-based source files of SSIS and SSAS can be cumbersome, as you indicate. To alleviate some of the pain, I find that keeping each package limited to a single data flow or logical unit of work helps reduce developer contention on packages. I then call these packages from one or more master packages. I also try to externalize all of my T-SQL source queries using sprocs, views, or UDFs, so that the need to edit the package is further reduced. Using configuration files and variables also helps, to a smaller extent.
SSAS cubes are a little bit tougher. My best suggestion is to look into a third-party XML diffing tool. I have been able to successfully merge small changes using the standard text-based tools, but it can be a daunting task.
I am developing a web application to accept a bunch of text and attachments (1 or more) via email, web and other methods.
I am planning to build a single interface, mostly a web service to accept this content.
What design considerations should I make?
I am building the app using ASP.NET MVC 2.
Should the attachments be saved to disk or in the database?
Should the unified single interface be a web service?
What are the pros and cons of using web services to upload files?
As with any acceptance of files, I'd be checking them for viruses and the like; I'm very nervous about files transmitted from the Internet.
I always like putting my files in a database because I find it neater. I hate having files spread over the network, with folders needing rights and so on. I know there are people who prefer it the other way around, so I guess my answer also depends on personal preference.
I like the DB approach because I can more easily tie files to records and do searches. With a file system, you still need to store info about each file, plus do the extra work of storing the file itself.
Then, if you need to move files around, you may also need to modify references in the database.
Then again, you need to allocate enough space to grow the database, and perhaps cater for multiple databases as storage runs out.
So I guess if you're handling large files, then yes, I can see the point of a file system, as it's easier to grow. If you have small text files, then maybe a database will work.
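Since the question mentions ASP.NET MVC 2, here is a hedged sketch of the DB route (the in-memory dictionary stands in for a table with a varbinary(max) column; controller and action names are invented). The key point is keeping the content type with the bytes, so the file can be served back correctly:

    using System.Collections.Generic;
    using System.Web;
    using System.Web.Mvc;

    public class Attachment
    {
        public string FileName;
        public string ContentType;
        public byte[] Content;
    }

    public class AttachmentsController : Controller
    {
        // In-memory stand-in for a DB table with a varbinary(max) column.
        private static readonly Dictionary<int, Attachment> Store =
            new Dictionary<int, Attachment>();
        private static int _nextId;

        [HttpPost]
        public ActionResult Upload(HttpPostedFileBase file)
        {
            var bytes = new byte[file.ContentLength];
            file.InputStream.Read(bytes, 0, file.ContentLength);
            // Virus-scan the bytes here before persisting, per the advice above.
            Store[++_nextId] = new Attachment
            {
                FileName = file.FileName,
                ContentType = file.ContentType,
                Content = bytes
            };
            return RedirectToAction("Show", new { id = _nextId });
        }

        public ActionResult Show(int id)
        {
            Attachment a = Store[id];
            // File() sets Content-Type and Content-Disposition for us.
            return File(a.Content, a.ContentType, a.FileName);
        }
    }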
Using Rails, is there a reason why I should store attachments (which could be files of any type) in the filesystem instead of in the database? The database seems simpler to me: no need to worry about filesystem paths, structure, etc., you just look in your blob field. But most people seem to use the filesystem, which leaves me guessing that there must be some benefits to doing so that I'm not seeing, or some disadvantages to using the database for such storage. (In this case, I'm using Postgres.)
This is a pretty standard design question, and there isn't really a "one true answer".
The rule of thumb I typically follow is "data goes in databases, files go in files".
Some of the considerations to keep in mind:
If a file is stored in the database, how are you going to serve it out via HTTP? Remember, you need to set the content type, filename, etc. If it's a file on the filesystem, the web server takes care of all that for you, very quickly and efficiently (perhaps even in kernel space), with no interpreted code needed.
Files are typically big. Big databases are certainly viable, but they are slow and inconvenient to back up etc. Why make your database huge when you don't have to?
Much like 2., it's really easy to copy files to multiple machines. Say you're running a cluster, you can just periodically rsync the filesystem from your master machine to your slaves and use standard static http serving. Obviously databases can be clustered as well, it's just not necessarily as intuitive.
On the flip side of 3, if you're already clustering your database, then having to deal with clustered files in addition is administrative complexity. This would be a reason to consider storing files in the DB, I'd say.
Blob data in databases is typically opaque. You can't filter it, sort by it, or group by it. That lessens the value of storing it in the database.
On the flip side, databases understand concurrency. You can use your standard model of transaction isolation to ensure that two clients don't try to edit the same file at the same time. This might be nice. Not to say you couldn't use lockfiles, but now you've got two things to understand instead of one.
Accessibility. Files in a filesystem can be opened with regular tools. Vi, Photoshop, Word, whatever you need. This can be convenient. How are you gonna open that word document out of a blob field?
Permissions. Filesystems have permissions, and they can be a pain in the rear. Conversely, they might be useful to your application. Permissions will really bite you if you're taking advantage of 7, because it's almost guaranteed that your web server runs with different permissions than your applications.
Caching (from Sarah Mei below). This plays into the HTTP question above on the client side (are you going to remember to set cache lifetimes correctly?). On the server side, files on a filesystem are a very well-understood and optimized access pattern. Large blob fields may or may not be optimized well by your database, and you're almost guaranteed an additional network trip from the database to the web server as well.
In short, people tend to use filesystems for files because they support file-like idioms the best. There's no reason you have to do it though, and filesystems are becoming more and more like databases so it wouldn't surprise me at all to see a complete convergence eventually.
There's some good advice here about using the filesystem for files, but here's something else to think about. If you are storing sensitive or secure files/attachments, the DB really is the only way to go. I have built apps where the data couldn't be put out in a file; it had to be put into the DB for security reasons. You can't leave it in a file system for a user on the server/machine to look at or take with them without proper security. Using a high-end DB like Oracle, you can lock that data down very tightly and ensure that only appropriate users have access to it.
But the other points made are very valid. If you're simply doing things like avatar images or non-sensitive info, the filesystem is generally faster and more convenient for most plugin systems.
The DB is pretty easy to set up for sending files back; it's a little more work, but just a few minutes if you know what you're doing. So yes, the filesystem is the better way to go overall, IMO, but the DB is the only viable choice when security or sensitive data is a major concern.
I don't see what the problem with blobstores is. You can always reconstruct a file system store from it, e.g. by caching the stuff to the local web server while the system is being used.
But the authoritative store should always be the database. Which means you can deploy your application by tossing in the database and exporting the code from source control. Done.
And adding a web server is no issue at all.
Erik's answer is great. I will also add that if you want to do any caching, it's much easier and more straightforward to cache static files than to cache database contents.
If you use a plugin such as Paperclip, you don't have to worry about anything either. There's this thing called the filesystem, which is where files should go. Just because it's a bit harder doesn't mean you should put your files in the wrong place, and with Paperclip (or other similar plugins) it isn't hard. So: go, go, filesystem!
Unable to find an up-to-date answer to this question, I have implemented a database service for Active Storage (available since Rails 5.2) that works just like any other Active Storage service, but stores file content in a special database column instead of a cloud service.
The implementation is based on a standard Rails Active Storage service, adding a migration with a new model: an extra table that stores blob contents in a binary field. The service creates and destroys records in this table as requested by Active Storage.
Therefore, this service, once installed, can be consumed via a standard Rails Active Storage API.
https://github.com/TitovDigital/activestorage-database-service
Please be aware of all pros and cons of using a database for storing files.
With the right database it provides full ACID support and can wrap file storage and deletion in transactions. It is also much easier from a DevOps standpoint, as there is one less service to configure.
Large files or heavy traffic are the risky cases: either will put unnecessary strain on the app and database servers.