Bundle download limit? - open-policy-agent

I'm encountering a bit of an issue that I need to find a workaround for.
The OPA engine version is 0.23.2, and I am trying to download a bundle, but I am getting the following error message: "Bundle download failed: bundle exceeded max size (1073741824 bytes)".

The 1GB limit is hard-coded (as of now): https://github.com/open-policy-agent/opa/blob/63560e0d1e767a8c973bfa217d3c734adea6d5f7/bundle/bundle.go#L42
Keep in mind that 1GB of gzipped JSON data is likely to use a significant amount of memory when uncompressed and loaded into memory. To give some context on why that limit is there: the rule of thumb is that the expansion from raw JSON to Go/OPA data structures in memory is something like 20x, and while the gzip compression ratio varies more, a 5:1 ratio is pretty reasonable. Napkin math on that works out to something like 100GB of memory for OPA to load a 1GB bundle (1 GB x 5 x 20). It is unlikely that most OPA use cases would be able to (or want to) have that happen, hence the error.
EDIT: That 1GB limit turns out not to be for the compressed bundle size but for the uncompressed JSON files, so the memory expansion isn't as big of a ceiling (the example above still holds true for a 1GB compressed bundle, however).
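To make that napkin math concrete, here is a rough sketch in Python; the 20x and 5:1 ratios are the rules of thumb from above, not measured values:

# Rough estimate of the in-memory footprint of a gzipped JSON bundle once loaded by OPA.
# Both ratios are assumptions (rules of thumb), not measurements.
GZIP_RATIO = 5     # assumed compressed -> raw JSON expansion
INMEM_RATIO = 20   # assumed raw JSON -> in-memory data structure expansion

def estimated_memory_gb(compressed_gb):
    # Estimate how much memory a compressed bundle of the given size may need.
    return compressed_gb * GZIP_RATIO * INMEM_RATIO

print(estimated_memory_gb(1.0))  # ~100 GB for a 1 GB compressed bundle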
I would recommend filing an issue; it seems like we could probably expose a config option to increase the limit for use cases where the memory usage is acceptable (a lower gzip ratio, less overhead on the structures, etc., depending on the JSON data).

Related

What are the scenarios that make us compress data before we transfer it?

I am wondering why we need to apply file compression before uploading files to a server in some scenarios. To my understanding, as soon as the server receives the compressed files, they need to be extracted so that the server can read their contents. This certainly consumes the server's computation power if multiple HTTP POSTs are sent from many client-side platforms.
Therefore, the only scenarios I can think of for sending compressed files are uploading backup files, settings files, and files that only serve as backups for the client-side platforms. Please give me more scenarios for uploading compressed data.
I think the following article gives a perfect explanation of the question: http://www.dataexpedition.com/support/notes/tn0014.html
Here's the content:
Compression Pros & Cons
Simply put, compression is a process which trades CPU cycles for bytes. But the trade isn't always a good one. Sometimes you can spend a lot of valuable CPU cycles for little or no gain.
In the context of network data transport, "Should I compress?" is a common question. But the answer can get complicated, depending on several factors. The most important thing to remember is that compression can actually make your data move much slower, so it should not be used without some consideration.
When Compression Is Good
Compression algorithms try to identify large repeating patterns in a data set and replace them with smaller patterns. Ideally, this shrinks the size of the data set. For the purposes of network transport, having less data to move means it should take less time to move it.
Documents and files which consist mostly of plain text or machine executable code tend to compress well. Examples include word processing documents, HTML files, some .exe files, and some database files.
Combining many small files into a single archive prior to network transfer can often result in faster speeds than transferring each file individually. This may be true even if the individual files themselves are not compressible. Many archiving utilities have options to pack files into an archive without compression, such as the "-0" option for "zip". ExpeDat will combine the contents of a folder into a single data stream when you enable Streaming Folders.
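As an aside, a minimal Python sketch of packing a folder into a single uncompressed archive (the rough equivalent of zip -0); the folder and archive names are placeholders:

import tarfile

# Pack a folder into one archive without compressing it ("w" instead of "w:gz"),
# so many small files can be transferred as a single stream.
with tarfile.open("payload.tar", "w") as archive:
    archive.add("some_folder")  # placeholder path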
When Compression Is Bad
Many data types are not compressible, because the repeating patterns have already been removed. This includes most images, videos, songs, any data that is already compressed, or any data that has been encrypted.
Trying to compress data that is not compressible wastes CPU time. When you are trying to move data at high speeds, that CPU time may be critical to feeding the network. So by taking away processing time with worthless compression, you can actually end up moving your data much more slowly than if you had compression turned off.
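A quick way to see this for yourself is to time compression of compressible versus incompressible input; a small Python sketch:

import gzip
import os
import time

compressible = b"Hello World! " * 100000        # highly repetitive text
incompressible = os.urandom(len(compressible))  # random bytes: no patterns left to remove

for label, data in [("text", compressible), ("random", incompressible)]:
    start = time.perf_counter()
    out = gzip.compress(data)
    elapsed = time.perf_counter() - start
    print(label, len(data), "->", len(out), "bytes in", round(elapsed, 3), "s")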
If you are using a compression utility only for the purposes of combining many small files, check for options that disable compression. For example, the "zip" command has a "-0" option which packages files into an archive without spending time trying to compress them.
Inline versus Offline
Many transport mechanisms allow you to apply compression algorithms to data as it is being transferred. This is convenient because the compression and decompression occur seamlessly without the user having to perform extra steps. But it is also risky because any CPU time spent on compression is time NOT being spent on feeding data through the network. If the network is very fast, the CPU is very slow, or the compression algorithm is unable to scale, having inline compression turned on may cause your data to move more slowly than if you turn compression off. Inline compression can be slower than no compression even when the data is compressible!
If you are going to be transferring the same data set multiple times, it pays to compress it first using Zip or Tar-Gzip. Then you can transfer the compressed archive without taking CPU cycles away from the network processing. If you are planning to encrypt your data, make sure you compress it first, then encrypt second.
Hidden Compression
Devices in your network may be applying compression without you realizing it. This becomes evident if the "speed" of the network seems to change for different data types. If the network seems slow when you are transferring data that is already compressed, but fast when you are transferring uncompressed text files, then you can be pretty sure that something out there is making compression decisions for you.
Network compression devices can be helpful in that they take the compression burden away from the end-point CPUs. But they can also create very inconsistent results since they will not work for all destinations and data types. Network level compression can also run into the same CPU trade-offs discussed above, resulting in some files moving more slowly than they would if there was no compression.
If you are testing the speed of your network, try using data that is already compressed or encrypted to ensure consistent results.
Should I Turn On Inline Compression?
For compressed data, images, audio, video, or encrypted files: No.
For other types of data, test it both ways to see which is faster.
If the network is very fast (hundreds of megabits per second or faster), consider turning off inline compression and instead compress the data before you move it.

Neo4j inserting large files - huge difference in time between

I am inserting a set of files (pdfs, of each 2 MB) in my database.
Inserting 100 files at once takes about 15 seconds, while inserting 250 files at once takes 80 seconds.
I am not quite sure why this big difference is happening, but I assume it is because the free memory fills up somewhere between those two amounts. Could this be the problem?
If there is any more detail I can provide, please let me know.
I am not exactly sure what is happening on your side, but it really looks like what is described here in the Neo4j performance guide.
It could be:
Memory issues
If you are experiencing poor write performance after writing some data
(initially fast, then massive slowdown) it may be the operating system
that is writing out dirty pages from the memory mapped regions of the
store files. These regions do not need to be written out to maintain
consistency so to achieve highest possible write speed that type of
behavior should be avoided.
Transaction size
Are you using multiple transactions to upload your files?
Many small transactions result in a lot of I/O writes to disc and
should be avoided. Too big transactions can result in OutOfMemory
errors, since the uncommitted transaction data is held on the Java
Heap in memory.
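As an illustration of keeping transactions medium-sized, here is a hedged sketch using the Neo4j Python driver; the connection details, node label, property names, and batch size are assumptions, not taken from the question:

from neo4j import GraphDatabase

# Hypothetical connection and data; adjust to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
files = [("doc1.pdf", b"..."), ("doc2.pdf", b"...")]  # (name, content) pairs

BATCH_SIZE = 50  # assumed: large enough to cut I/O, small enough to fit on the heap

def insert_batch(tx, batch):
    for name, content in batch:
        tx.run("CREATE (f:File {name: $name, content: $content})",
               name=name, content=content)

with driver.session() as session:
    for i in range(0, len(files), BATCH_SIZE):
        # One transaction per batch rather than one per file or one for everything.
        session.execute_write(insert_batch, files[i:i + BATCH_SIZE])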
If you are on Linux, they also suggest some tuning to improve performance; see here.
You can look up the details on that page.
Also, if you are on Linux, you can check memory usage yourself during the import with this command:
$ free -m
I hope this helps!

Limiting ImageMagick Memory Use

Basically, I'm trying to keep the memory use on my Nginx server under a certain amount, both because I'm insane (according to my friends) and because I want to save money. However, I'm worried ImageMagick may push it over the edge.
I'm using -limit area 20MiB and I've also tried -limit memory 15MiB -limit map 15MiB, but when checking the process (as it runs) through top -c (with Shift-M) and ps aux, it sometimes shows the process using considerably more memory than I've set in the limits. To give numbers, it may be using 35MB or 40MB instead of the 20MB/30MB I would expect. I wouldn't be bothered by 2MB or 3MB, but that's quite a large offset.
I've been told the extra memory may be the ImageMagick's overhead as it loads the interpreter etc, but I'm not super familiar with Unix programs so haven't a clue in that department.
If anyone can explain why this is happening, that would be great. If it's a normal thing, fine; I'll just adjust things to take into account that it may use my limit plus a certain amount. But if it isn't, and the -limit parameter doesn't actually limit memory to a certain amount, what exactly is the point of having that parameter in ImageMagick?
Again thanks for your help in advance, it's much appreciated, as always.
According to the documentation, ImageMagick moves all memory operations to mmap-ed files, so it will start to swap if you have enough disk space; see the manual:
SNIP from manual -limit:
The value for File is in number of files. The Disk limit is in
Gigabytes and the values for the other resources are in Megabytes. By
default the limits are 768 files, 1024MB memory, 4096MB map, and
unlimited disk, but these are adjusted at startup time on platforms
that can provide information about available resources. When the limit
is reached, ImageMagick will fail in some fashion, or take
compensating actions if possible. For example, -limit memory 32 -limit
map 64 limits memory. When the pixel cache reaches the memory limit it
uses memory mapping. When that limit is reached it goes to disk. If
disk has a hard limit, the program will fail.
The Limits only affect ImageMagick's pixel cache. The program code and anything the libraries / delegates may do to load or process the images are not influenced by these settings at all.
You don't specify what you're looking at in top, but the proper column would obviously be RES or RSIZE. With such small limits as 20MiB, the program and library code will represent a significant fraction of the resident set size.
To verify that you're using the right units for your environment variables, use identify -list resource. If the size of the memory pixel cache (MAGICK_MEMORY_LIMIT) is insufficient for an image, an mmap-ed file will be used (MAGICK_MAP_LIMIT), and if that limit is too low, a conventional disk file (MAGICK_DISK_LIMIT) is used instead. If all the limits are too low, ImageMagick will fail immediately with an error such as cache resources exhausted, Memory allocation failed, or corrupt image.
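For reference, those limits can also be set through environment variables before invoking the command-line tools; a small sketch (the convert arguments and file names are only an example):

import os
import subprocess

# Apply ImageMagick resource limits via environment variables, then run convert.
env = dict(os.environ,
           MAGICK_MEMORY_LIMIT="15MiB",
           MAGICK_MAP_LIMIT="15MiB",
           MAGICK_DISK_LIMIT="1GiB")

subprocess.run(["convert", "-limit", "area", "20MiB",
                "input.jpg", "-resize", "50%", "output.jpg"],
               env=env, check=True)

# Show the limits ImageMagick actually picked up:
subprocess.run(["identify", "-list", "resource"], env=env, check=True)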

What is the fastest way for reading huge files in Delphi?

My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: after I implemented file access using a memory-mapped file as suggested, it turned out that it did not make much of a difference. But it also turned out, after I added some more timing code, that it is not the file access that slows down the program. The file access actually takes nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data, and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there, such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can experiment with them to gain some performance.
Next, copying the file data, even 100Kb in size, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid copying and also possibly gain certain performance benefits (but you need to measure speed carefully).
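The Delphi route would be CreateFileMapping/MapViewOfFile as described above; just to illustrate the idea, here is the same pattern sketched in Python with the standard mmap module (the file name, offset, and length are placeholders):

import mmap

# Map the whole file once, then slice chunks out by (offset, length) without
# copying the data through an intermediate stream.
with open("huge.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        offset, length = 1000000, 100 * 1024   # one (offset, length) entry
        chunk = view[offset:offset + length]   # pages are faulted in by the OS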
Maybe you can take this approach:
Sort the entries by file position and then do the following:
Take the entries that only need the first X MB of the file (up to a certain file position)
Read X MB from the file into a buffer (a TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history, Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.

2 Files, Half the Content, vs. 1 File, Twice the Content, Which is Greater?

If I have 2 files each with this:
"Hello World" (x 1000)
Does that take up more space than 1 file with this:
"Hello World" (x 2000)
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
Update:
I'm using a Macbook Pro, 10.5. But I'd also like to know for Ubuntu Linux.
Marcelos gives the general performance case. I'd argue worrying about this is premature optimization; you should split things into different files where it is logical to split them.
Also, if you really care about the file size of such repetitive files, you can compress them.
Your example even hints at this: a simple run-length encoding of
"Hello World" x 1000
is much more space-efficient than actually having "Hello World" written out 1000 times.
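To illustrate, even a general-purpose compressor shrinks the repeated text dramatically; a quick Python sketch:

import zlib

data = b"Hello World\n" * 1000
packed = zlib.compress(data)
print(len(data), "->", len(packed), "bytes")  # the output is a tiny fraction of the 12000-byte input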
Files take up space in the form of clusters on the disk. A cluster is a number of sectors, and the size depends on how the disk was formatted.
A typical size for clusters is 8 kilobytes. That would mean that the two smaller files would use two clusters (16 kilobytes) each and the larger file would use three clusters (24 kilobytes).
A file will on average use half a cluster more than its size. So with a cluster size of 8 kilobytes, each file will on average have an overhead of 4 kilobytes.
Most filesystems use a fixed-size cluster (4 kB is typical but not universal) for storing files. Files below this cluster size will all take up the same minimum amount.
Even above this size, the proportional wastage tends to be high when you have lots of small files. Ignoring skewness of size distribution (which makes things worse), the overall wastage is about half the cluster size times the number of files, so the fewer files you have for a given amount of data, the more efficiently you will store things.
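On Mac OS X or Linux you can check the allocated size against the logical size yourself; a small Python sketch using os.stat (st_blocks is counted in 512-byte units on these platforms, and the file names are placeholders):

import os

def disk_usage(path):
    st = os.stat(path)
    logical = st.st_size            # bytes of actual content
    allocated = st.st_blocks * 512  # bytes reserved on disk (cluster-rounded)
    return logical, allocated

for p in ("small1.txt", "small2.txt", "big.txt"):  # placeholder file names
    logical, allocated = disk_usage(p)
    print(p, logical, allocated, allocated - logical, "bytes wasted")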
Another consideration is that metadata operations, especially file deletion, can be very expensive, so again smaller files aren't your friends. Some interesting work was done in ReiserFS on this front until the author was jailed for murdering his wife (I don't know the current state of that project).
If you have the option, you can also tune the file sizes to always fill up a whole number of clusters, and then small files won't be a problem. This is usually too finicky to be worth it though, and there are other costs. For high-volume throughput, the optimal file size these days is between 64 MB and 256 MB (I think).
Practical advice: Stick your stuff in a database unless there are good reasons not to. SQLite substantially reduces the number of reasons.
I think the way the file(s) will be used should be taken into consideration, along with the API and the language used to read/write them (and hence any API restrictions).
Disk fragmentation, which tends to decrease when there are only big files, will penalize data access if you're reading one big file in one shot, whereas several accesses to small files spaced out over time will not be penalized as much by fragmentation.
Most filesystems allocate space in units larger than a byte (typically 4KB nowadays). Effective file sizes get "rounded up" to the next multiple of that "cluster size". Therefore, dividing up a file will almost always consume more total space. And of course there's one extra entry in the directory, which may cause it to consume more space, and many file systems have an extra intermediate layer of inodes where each file consumes one entry.
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
More wasted space
The possibility of running out of inodes (in extreme cases)
On some filesystems: very bad performance when directories contain many files (because they're effectively unordered lists)
Content in a single file can usually be read sequentially (i.e. without having to move the read/write head) from the HD, which is the most efficient way. When it spans multiple files, this ideal case becomes much less likely.
