Merging giant tables with small RAM - join

I want to join two tables, one giant and one small. The giant one is nearly 16 GB and the small one is about 1 MB, while my machine has 16 GB of RAM. I tried SQLite, but there wasn't enough RAM to complete the join. I think I could split the giant table into smaller chunks and join each chunk separately, but I suspect there are tools already optimized for this. Please let me know if you know of one.
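For what it's worth, the chunked approach can be done with plain pandas: load the small table once (it fits easily in RAM), stream the giant table in pieces, and merge each piece. A minimal sketch, assuming CSV inputs and a made-up join column called id:

    import pandas as pd

    # The small table fits comfortably in RAM, so load it once.
    small = pd.read_csv("small.csv")

    first = True
    for chunk in pd.read_csv("giant.csv", chunksize=1_000_000):
        merged = chunk.merge(small, on="id", how="inner")
        # Append each merged chunk so nothing large accumulates in memory.
        merged.to_csv("joined.csv", mode="w" if first else "a",
                      header=first, index=False)
        first = False

Dask's dd.merge does essentially the same thing with less bookkeeping, since the in-memory side can simply be joined against each partition.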

Related

What's the best way to handle large timeseries in dask / xarray?

I've got 17,000 CSV files, each ordered by timestamp (some with missing data). Together they are around 85 GB, which is much larger than my 32 GB of RAM.
I'm trying to figure out the best way to get these into a time-aligned, out-of-memory data structure, such that I can compute things like PCA.
What's the right approach?
(I've tried to set up an xarray.Dataset with dim=(filename, time), and then xr.merge() each CSV file into the Dataset, but it gets slower with every insert, and I expect it will crash when RAM runs out.)
Have you tried dd.read_csv(...)?
Dask reads CSVs lazily and can perform certain operations in a streaming manner, so you can run an analysis on a larger-than-memory dataset.
Make sure that Dask is able to set divisions properly when you read in your data. Once the data is read, check dd.divisions and make sure they're known values rather than None.
You can also use a Dask cluster to access more memory of course.
Those files are really small and Dask typically works best with partitions that are around 100MB. You might want to compact your data a bit.
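A rough sketch of what that might look like (the glob pattern, column names, and partition size are placeholders; setting the index triggers an out-of-core shuffle, but it gives Dask known divisions):

    import dask.dataframe as dd

    # Read all 17,000 files lazily; nothing is loaded until you compute.
    df = dd.read_csv("data/*.csv", parse_dates=["timestamp"])

    # Index on the time column so Dask knows the divisions, then compact the
    # many tiny source files into partitions of roughly 100 MB each.
    df = df.set_index("timestamp")
    df = df.repartition(partition_size="100MB")

    print(df.divisions[:5])  # should be actual timestamps, not None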

Is one large sorted set or many small sorted sets more memory performant in Redis

I'm trying to design a data abstraction for Redis using sorted sets. My scenario is that I would have either ~60 million keys in one large sorted set or ~2 million small sorted sets with maybe 10 keys each. In either scenario the functions I would be using are O(log(N)+M), so time complexity isn't a concern. What I am wondering is what the trade-offs are in memory usage. Having many sorted sets would allow for more flexibility, but I'm unsure whether the memory cost would become a problem. I know Redis says it now optimizes memory usage for smaller sorted sets, but it's unclear to me by how much, and at what size a set becomes too big.
Having many small sorted sets would also help spread the load over different Redis instances, in case the data set grows beyond a single host's memory limit.
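One way to get concrete numbers for your own data is to populate both layouts and ask Redis how much each key costs. A toy sketch with redis-py (key names and sizes are made up; on Redis 7+ the relevant threshold is zset-max-listpack-entries rather than zset-max-ziplist-entries):

    import redis

    r = redis.Redis()  # assumes a local Redis instance

    # Populate one big sorted set and many small ones with the same members.
    for i in range(10_000):
        r.zadd("big", {f"member:{i}": i})
        r.zadd(f"small:{i % 1_000}", {f"member:{i}": i})

    big = r.memory_usage("big")
    small_total = sum(r.memory_usage(f"small:{k}") for k in range(1_000))
    print(f"one big zset: {big} bytes, 1000 small zsets: {small_total} bytes")

    # Small sorted sets keep the compact encoding only below this threshold.
    print(r.config_get("zset-max-ziplist-entries"))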

Neo4j inserting large files - huge difference in time between

I am inserting a set of files (PDFs of about 2 MB each) into my database.
Inserting 100 files at once takes roughly 15 seconds, while inserting 250 files at once takes 80 seconds.
I am not quite sure why there is such a big difference, but I assume free memory runs out somewhere between those two amounts. Could this be the problem?
If there is any more detail I can provide, please let me know.
I'm not exactly sure what is happening on your side, but it really looks like what is described here in the Neo4j performance guide.
It could be:
Memory issues
If you are experiencing poor write performance after writing some data (initially fast, then massive slowdown), it may be the operating system that is writing out dirty pages from the memory-mapped regions of the store files. These regions do not need to be written out to maintain consistency, so to achieve the highest possible write speed that type of behavior should be avoided.
Transaction size
Are you using multiple transactions to upload your files?
Many small transactions result in a lot of I/O writes to disc and should be avoided. Too big transactions can result in OutOfMemory errors, since the uncommitted transaction data is held on the Java heap in memory.
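To illustrate the transaction-size point on the client side, here is a rough sketch of batching the inserts with the official Neo4j Python driver (the connection details, Cypher statement, node label, and batch size are all made up; with a 4.x driver the call is session.write_transaction instead of execute_write):

    from neo4j import GraphDatabase

    # Stand-in data: in practice this would come from your directory of PDFs.
    files = [{"name": f"doc{i}.pdf", "path": f"/data/doc{i}.pdf"} for i in range(250)]

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def insert_batch(tx, batch):
        for doc in batch:
            tx.run("CREATE (f:File {name: $name, path: $path})",
                   name=doc["name"], path=doc["path"])

    BATCH_SIZE = 50  # big enough to limit I/O, small enough to fit on the Java heap

    with driver.session() as session:
        for start in range(0, len(files), BATCH_SIZE):
            session.execute_write(insert_batch, files[start:start + BATCH_SIZE])

    driver.close()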
If you are on Linux, they also suggest some tuning to improve performance. See here; you can look up the details on that page.
Also on Linux, you can check memory usage yourself during the import with this command:
$ free -m
I hope this helps!

sybase stored procedure slow after deleting rows

We deleted table rows in order to improve performance, since we had a very large database. The database size was reduced to 50% of what it was, but the stored procedure became even slower after the delete. It used to run within 3 minutes and now it takes 3 hours. No changes were made to the procedure.
We ran the same procedure again on the old database (before the delete) and it worked fine. All other procedures run faster after the size reduction. What could be the problem?
Deleting rows in the database doesn't truly free up space on its own.
Space usually isn't really freed until you run a command that reorganizes the data stored in the table. In SAP ASE the reorg command can be run with options such as reclaim_space, rebuild, and forwarded_rows. Logically, it's a lot like defragmenting a hard drive: the data is reorganized to use less physical space.
In SQL Anywhere the command is REORGANIZE TABLE, or it can be found on the Fragmentation tab in Sybase Central. This will also help with index fragmentation.
The other thing that frequently needs to be done after large changes to the database is updating the table or index statistics. The query optimizer builds query plans based on the table statistics stored in system tables. After large transactions, or a large number of small transactions, stale statistics can lead the optimizer to make less optimal choices.
In SQL Anywhere this can be done using Sybase Central.
You may also want to check out the Monitoring and improving database performance section of the SQL Anywhere documentation. It covers these procedures, and much more.
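If the large table is on ASE, a rough sketch of the maintenance described above, assuming a pyodbc connection and a hypothetical table name (the same statements can of course be run from isql instead):

    import pyodbc

    # DSN, credentials, and table name are placeholders.
    conn = pyodbc.connect("DSN=my_ase_server;UID=sa;PWD=secret", autocommit=True)
    cur = conn.cursor()

    cur.execute("reorg reclaim_space my_big_table")      # reclaim space left behind by the deletes
    cur.execute("update index statistics my_big_table")  # refresh the optimizer's statistics

    conn.close()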

2 Files, Half the Content, vs. 1 File, Twice the Content, Which is Greater?

If I have 2 files each with this:
"Hello World" (x 1000)
Does that take up more space than 1 file with this:
"Hello World" (x 2000)
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
Update:
I'm using a MacBook Pro running OS X 10.5, but I'd also like to know for Ubuntu Linux.
Marcelo's answer gives the general performance case. I'd argue that worrying about this is premature optimization; you should split things into different files where it is logical to split them.
Also, if you really care about the file size of such repetitive files, you can compress them.
Your example even hints at this: a simple run-length encoding of
"Hello World" x 1000
is much more space-efficient than actually having "Hello World" written out 1000 times.
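A quick way to see the effect (using zlib as a stand-in for run-length encoding, since it exploits the same repetition):

    import zlib

    data = b"Hello World\n" * 1000
    compressed = zlib.compress(data)

    print(len(data), len(compressed))  # ~12,000 bytes shrinks to well under 100 bytes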
Files take up space in the form of clusters on the disk. A cluster is a number of sectors, and its size depends on how the disk was formatted.
A typical cluster size is 8 kilobytes. That would mean that the two smaller files would use two clusters (16 kilobytes) each and the larger file would use three clusters (24 kilobytes).
A file will on average use half a cluster more than its size. So with a cluster size of 8 kilobytes, each file will on average have an overhead of 4 kilobytes.
Most filesystems use a fixed-size cluster (4 kB is typical but not universal) for storing files. Files below this cluster size will all take up the same minimum amount.
Even above this size, the proportional wastage tends to be high when you have lots of small files. Ignoring skewness of size distribution (which makes things worse), the overall wastage is about half the cluster size times the number of files, so the fewer files you have for a given amount of data, the more efficiently you will store things.
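As a back-of-the-envelope check of that estimate (cluster size and file counts are arbitrary):

    # Expected waste is roughly half a cluster per file, so it scales with file count.
    def expected_waste_bytes(num_files, cluster_size=4096):
        return num_files * cluster_size // 2

    print(expected_waste_bytes(2))      # ~4 KB wasted for 2 files
    print(expected_waste_bytes(2000))   # ~4 MB wasted for 2,000 files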
Another consideration is that metadata operations, especially file deletion, can be very expensive, so again smaller files aren't your friends. Some interesting work was done in ReiserFS on this front until the author was jailed for murdering his wife (I don't know the current state of that project).
If you have the option, you can also tune the file sizes to always fill up a whole number of clusters, and then small files won't be a problem. This is usually too finicky to be worth it though, and there are other costs. For high-volume throughput, the optimal file size these days is between 64 MB and 256 MB (I think).
Practical advice: Stick your stuff in a database unless there are good reasons not to. SQLite substantially reduces the number of reasons.
I think how the file(s) will be used should also be taken into consideration, along with the API and the language used to read/write them (and hence any API restrictions).
Disk fragmentation, which tends to decrease when you have only big files, will penalize data access if you're reading one big file in one shot, whereas several accesses to small files spread out over time will not be penalized by fragmentation.
Most filesystems allocate space in units larger than a byte (typically 4KB nowadays). Effective file sizes get "rounded up" to the next multiple of that "cluster size". Therefore, dividing up a file will almost always consume more total space. And of course there's one extra entry in the directory, which may cause it to consume more space, and many file systems have an extra intermediate layer of inodes where each file consumes one entry.
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
More wasted space
The possibility of running out of inodes (in extreme cases)
On some filesystems: very bad performance when directories contain many files (because they're effectively unordered lists)
Content in a single file can usually be read sequentially (i.e. without having to move the read/write head) from the HD, which is the most efficient way. When it spans multiple files, this ideal case becomes much less likely.
