One InfluxDB instance, two databases on separate directories

I have two databases; let's call them "one" and "two". I would like "one" to use an SSD, since this will be my main database. "Two" will contain data older than, let's say, 30 days, stored on a normal spinning disk but still accessible for queries. So "one" holds data from now back to now-29d on an SSD, and "two" holds data from now-30d and older on a spinning disk.
Is it possible to have two databases in one InfluxDB instance pointing to different directories? Or is there a better way of doing this other than running two InfluxDB servers?
Creating a symlink in the InfluxDB data directory for database "two", pointing to a different location, could work, but it feels kind of hackish.

Unfortunately, it looks like InfluxDB does not support multi-volume storage (or storage tiers) at the configuration level yet; see here.
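If you do go the symlink route mentioned in the question, a minimal sketch in Python might look like the following. It assumes the InfluxDB 1.x default data layout under /var/lib/influxdb/data and a hypothetical spinning-disk mount at /mnt/hdd; InfluxDB should be stopped while the data is moved, and the wal directory (which is also laid out per database) may need the same treatment.
import os
import shutil
DATA_DIR = "/var/lib/influxdb/data"    # default 1.x data directory (assumption)
SLOW_DIR = "/mnt/hdd/influxdb/data"    # hypothetical spinning-disk location
def move_db_to_slow_disk(db_name):
    src = os.path.join(DATA_DIR, db_name)
    dst = os.path.join(SLOW_DIR, db_name)
    os.makedirs(SLOW_DIR, exist_ok=True)
    shutil.move(src, dst)    # move the database's shard directories to the HDD
    os.symlink(dst, src)     # leave a symlink where InfluxDB expects the data
move_db_to_slow_disk("two")  # run only while the InfluxDB service is stopped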

Related

Query performance on ADLS Gen 2

I'm trying to migrate our "old school" database (mostly time series) to an Azure Data Lake.
So I took a random table (10 years of data, 200M records, 20 GB), copied the data into a single CSV file AND also split the same data into 4000 daily files (in monthly folders).
On top of those two sets of files, I created two external tables... and I'm getting pretty much the same performance for both of them (?!?).
Whether I'm looking for data on a single day (thus in a single small file) or summing over the whole dataset, it basically takes 3 minutes, no matter if I'm querying the single file or the 4000 daily files. It's as if the whole dataset had to be loaded into memory before doing anything?!?
So is there a setting somewhere that I could change to avoid having to load all the data when it's not required? It could literally make my queries 1000x faster.
As far as I understand, indexes are not possible on external tables, and creating a materialized view would defeat the purpose of using a lake.
Full disclosure: I'm new to Azure Data Storage, and I'm trying to see if it's the correct technology to address our issue.
Best practice is to use Parquet format, not CSV. It is a columnar format optimized for OLAP-like queries.
With Synapse (preview), you can then use the SQL on-demand engine (serverless), so you do not need to provision a DW cluster, and you will be charged per TB of scanned data.
Or you can spin up a Synapse cluster and ingest your data into the DW using the COPY command (also in preview).
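As a rough illustration of the Parquet suggestion, a sketch in Python (assuming pandas with pyarrow is installed; the folder names and the "timestamp" column are assumptions) for converting the daily CSV files could look like this:
import glob
import os
import pandas as pd
CSV_ROOT = "daily_csv"          # hypothetical folder holding the 4000 daily files
PARQUET_ROOT = "daily_parquet"
for csv_path in glob.glob(os.path.join(CSV_ROOT, "**", "*.csv"), recursive=True):
    df = pd.read_csv(csv_path, parse_dates=["timestamp"])   # "timestamp" is an assumed column name
    out_path = os.path.join(PARQUET_ROOT, os.path.relpath(csv_path, CSV_ROOT)[:-4] + ".parquet")
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    df.to_parquet(out_path, index=False)   # columnar and compressed, so engines can scan only the columns they need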

rsync for sharing files across nodes in an application cluster?

I have a Rails application that runs on multiple load-balanced nodes. One of its functions is allowing users to upload content. That content needs to be visible fairly quickly, if not immediately, to all nodes.
Currently each node mounts a directory from an NFS server and reads/writes uploaded content to that shared location.
If possible, I'd like to move away from this solution and instead store content locally (on each node) and periodically sync with an rsync server in order to keep all nodes in sync.
Is this reasonable? How would rsync behave if a certain file were modified on multiple nodes at approximately the same time? Would the changes be serialized on the "server" with no potential for corruption (i.e. each change only partially applied, resulting in a corrupted file)?
I considered using some other shared resource (database, redis, etc.) but given how this content is used it's highly desirable for it to exist in "raw" form on the filesystem.
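For illustration, a minimal sketch of the kind of periodic sync loop described above, written in Python calling rsync over SSH; the host name, paths and interval are hypothetical, and this does not by itself answer the concurrent-modification question.
import subprocess
import time
LOCAL_DIR = "/var/app/uploads/"              # trailing slash: sync the directory contents
CENTRAL = "rsync-host:/srv/app/uploads/"     # hypothetical central server reachable over SSH
while True:
    # -a preserves metadata, -z compresses, -u skips files that are newer on the receiver
    subprocess.run(["rsync", "-azu", LOCAL_DIR, CENTRAL], check=True)   # push local changes up
    subprocess.run(["rsync", "-azu", CENTRAL, LOCAL_DIR], check=True)   # pull other nodes' changes down
    time.sleep(30)   # arbitrary sync interval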

Where to store created PDF files on ASP .NET MVC site [duplicate]

Possible Duplicate:
storing uploaded photos and documents - filesystem vs database blob
I am starting to develop a web app, the primary purpose of which is to display photos. The users will be able to upload photos as well.
The first question that came up was where to store the photos: on the file system or the database.
I will be using a Windows box to host the site. The database is MySQL and the backend code is in C# utilizing ASP.NET MVC.
Filesystem, of course, unless you're aiming for a story on thedailywtf. The easiest way is to have the photos organized by a property you can derive from the file itself, such as its SHA-1 hash. Then just store the hash in the database, attached to the photo's primary key and other attributes (who uploaded it, upload date, etc).
It's also a good idea to divvy up the photos on the filesystem, so you don't end up with millions of files in a single directory. So you'll have something like this:
storage/00/e4/f56c0de1c61fdb926e79e8a0a65bd12930c9.jpg
storage/25/9a/ec1c55bfb660548a6770238668c4b117d92f.jpg
storage/5d/d5/4b01d98f17a9ad9dd1526b49ba39b5aa37a1.jpg
storage/63/49/6f740b6c284ce6685dc17d473a7360ace249.jpg
storage/b1/75/066d178188dde110149a8422ab651b0ee615.jpg
storage/b1/20/a2b7d02b7b0c43530677ab06235382a37e20.jpg
storage/da/39/a3ee5e6b4b0d3255bfef95601890afd80709.jpg
This is also easy to port if you ever move to sharded storage.
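For illustration, here is a sketch of deriving that layout from the file's SHA-1 digest (shown in Python, although the question's backend is C#); the storage root matches the example paths above.
import hashlib
import os
import shutil
STORAGE_ROOT = "storage"   # matches the example paths above
def store_photo(src_path):
    with open(src_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()   # 40 hex chars, e.g. "da39a3...0709"
    dst = os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], digest[4:] + ".jpg")
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src_path, dst)
    return digest   # store the hash in the database alongside uploader, upload date, etc.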
If you are using SQL Server 2008 there's a Filestream datatype that handles most of the problems mentioned about the DB getting larger. It handles all the annoying details of synchronizing between the filesystem and the table.
Look here for a blog post about the topic: Store any data in SQL Server 2008 (Katmai)
If you're building a website around photos, then forget the database. If it becomes popular, your database is going to be hit hard, and the majority of its time will be spent delivering photos. Also, databases don't scale very well. There are many more advantages to keeping photos on the file system, and you can scale very well by having static content servers and using services for content delivery.
Also, Amazon S3 and other cloud providers have their advantages. For instance, S3 + Amazon CloudFront will provide good performance: CloudFront caches your files on servers around the world, so they'll be quickly accessible from anywhere. BUT if we're talking about pictures and the site becomes popular, your bills might be quite high.
For S3, Amazon charges for storage and for transfer in/out of the cloud; for CloudFront, per transfer.
Generally, people store binary data such as images on the filesystem, not the database. They reference the filesystem path from the database. Retrieving BLOBs (binary large objects) from the database is slower than allowing the web server to serve static files from the file system.
I would use something like Amazon S3.
But if the choice is between filesystem and database, I would choose the filesystem, because it's faster to serve images from a filesystem than from a database.
The only reason I would put photos as BLOBs in a database would be if I had a cluster of servers, and I was using database replication to automatically copy the photos to every machine in the cluster.
Life is much simpler if you just store the photos as files, and store the filenames of the photos in the database. If you need to create unique filenames for the photos, you can use a primary key integer from the database as part of the filename. But you could also just use a hash of the photo itself, as suggested by John Milliken. That's simple, and simple is better.
Some people point out that it's easier to manage if everything is in the database, including making backups and preserving referential integrity.
If you store it in the DB, the DB will grow quickly and will be much, much larger. It is just a touch more complicated to get an image out of the DB for display than it is to get it from the file system. On the other hand, you had better make sure that the file names and paths do not get out of sync with what is stored in the DB. In the past I have chosen to store on disk instead of in the DB. It made it easier for me to move the database to different boxes. It worked out well.
We had a similar decision to make for a project I am on. The compelling thing about jamming stuff (images and other BLOBy things) into the DB is that it is less likely that someone might delete/alter something (either intentionally or unintentionally). But that isn't the choice we made. Instead, we have the path info stored in the DB and use it to reference the data via a UNC path. Data paths are stored in two parts: a part that references the location of the data relative to the machine it resides on, and a part that points to which machine that group of data is on. When we need to move data around, we can update the appropriate path info.
It is certainly quick to get the data without pulling it out of the DB. Ultimately that was a major deciding factor.
It makes life so easy when you have a blob database. You should forget about the nightmare that is file system management.
EDIT: for example, a table with just an ID column and a VARBINARY column holding the file contents.
From experience this is an efficient way to manage binary files. You have one database that holds only binary files. How can this be any harder to back up?

Docker images across multiple disks

I'm getting going with Docker, and I've found that I can put the main image repository on a different disk (symlink /var/lib/docker to some other location).
However, now I'd like to see if there is a way to split that across multiple disks.
Specifically, I have an old SSD that is blazingly fast to read from, but doesn't have too many writes left before it kicks the can. It would be awesome if I could store the immutable images on it, and have my writable images in some other location that can handle the writes.
Is this something that is possible? How do you split up the repository?
Maybe you could do this using the AUFS driver and some trickery such as moving layers to the SSD after initially creating them and pointing symlinks at them - I'm not sure, I never had a proper look at how that storage driver worked.
With devicemapper thinp, btrfs and OverlayFS this isn't possible AFAICT:
The Docker dm-thinp and btrfs drivers both build layers one on top of the other using block device snapshot mechanisms. Your best bet here would be to include the SSD in the storage pool and rely on some ability to migrate the r/o snapshots to a specific block device that is part of the pool. Doubt this exists though.
The OverlayFS driver stacks layers by hard-linking files in independent directory structures. Hard links only work within a single filesystem.

What is Mnesia replication strategy?

What strategy does Mnesia use to define which nodes will store replicas of particular table?
Can I force Mnesia to use specific number of replicas for each table? Can this number be changed dynamically?
Are there any sources (besides the source code) with detailed (not just overview) description of Mnesia internal algorithms?
Manual. You're responsible for specifying what is replicated where.
Yes, as above, manually. This can be changed dynamically.
I'm afraid (though may be wrong) that none besides the source code.
In terms of documentation, the whole Erlang distribution is hardly the leader in the software world.
Mnesia does not automatically manage the number of replicas of a given table.
You are responsible for specifying each node that will store a table replica (and hence their number). A replica may then be:
stored in memory,
stored on disk,
stored both in memory and on disk,
not stored on that node - in this case the table will be accessible but data will be fetched on demand from some other node(s).
It's possible to reconfigure the replication strategy when the system is running, though to do it dynamically (based on a node-down event for example) you would have to come up with the solution yourself.
The Mnesia system events could be used to detect when a node goes down; given that you know which tables were stored on that node, you could check the number of their online replicas among the nodes that are still up and then perform a replication if needed.
I'm not aware of any application/library which already manages this kind of stuff, and it seems like quite an advanced endeavor (from my point of view, at least) to build one.
However, Riak is a database which manages data distribution among its nodes transparently to the user and is configurable with respect to the options you mentioned. That may be the way to go for you.
