Lucene: Indexing a stream (not available in a buffer)

I'd like to use Lucene to index a stream as it is being read. Due to the size of the data and limited RAM, I can't put the whole thing into a buffer; instead, I want Lucene to consume from the stream, index, wait for more data to be available, consume more, until EOF.
Lucene should only buffer what it needs to: that is, partial tokens until enough chars have been acquired to end the token.
Can I do that with Lucene? How?

You should be able to pass a Reader into your field constructor, rather than a String. I believe this avoids reading the entire field into memory (I haven't really done a great deal of testing on this). You would not be able to make the field stored, but if you can't load it into memory anyway, why would you want to?
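For illustration, here is a minimal sketch of that Reader-based field as I understand it (Lucene 4+ API; the index path, field name and analyzer are placeholders, not anything Lucene requires):

import java.io.Reader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class StreamIndexer {
    // Index a document whose body comes from a Reader; Lucene pulls characters
    // from the Reader during analysis instead of requiring the whole text up front.
    public static void indexStream(Reader contentReader) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/lucene-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // TextField(String, Reader) is tokenized and indexed but cannot be stored.
            doc.add(new TextField("body", contentReader));
            writer.addDocument(doc);
        }
    }
}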
However, I don't believe there is any way to read only partial tokens at a time. I believe you would need to be able to load each token into memory at an absolute minimum. If your tokens are so large as to overflow available memory, you probably need to rethink your analysis scheme.

Related

How do I increase maximum download size in an MVC application?

I am frankly stumped. This is beyond my experience.
I have a C# MVC program that generates a zip file in a MemoryStream for downloading. The action method is called from JavaScript when a button is clicked.
The only problem is that in some cases the potential file size can easily exceed one Gig and from my reading, that is a common problem. I've tried upping the Maximum Allowed Content Length to 3000000000 in Request Filtering on IIS (IIS8). I've tried adding requestLimits maxAllowedContentLength to my web.config. I've even tried breaking up the zip through multiple calls to the action method (without success), although I have yet to get any confirmation/denial that this is even possible.
Is there any setting within IIS or my web.config that I could be overlooking? Could this be a company network issue, not solvable on an app developer's level?
Okay, so it's kind of hard to explain big concepts in 400 characters or less, so I think I'm just causing more confusion by sticking to the comments section. Besides, I think this is about as close to an "answer" as you're likely to get.
The default constructor of MemoryStream essentially sets the initial size to 0. In reality, the initial capacity is set to somewhere around 256, but since the initial size is mostly a guide and the space isn't actually claimed until it's needed, it effectively starts at 0.
Each time you write to the stream, it checks how much is being written versus the remaining size of the buffer array. If it can't fit the write, it creates a new, larger buffer array and copies the old buffer array into that. In this way, setting an initial size can help somewhat, in that you start off with a larger initial buffer array and you may not need to grow that buffer. You might have a better chance of getting a contiguous block of memory, which I'll explain the importance of in a bit, but that actually kind of works against you, as well. If you only need 1MB for the file, but you're initializing with 100MB and there's not 100MB of contiguous memory, you'll get an OutOfMemoryException, even though there might be 1MB of contiguous memory available.
Regardless of whether you initialize or not, certain immutable facts remain. First, MemoryStream requires contiguous memory. Even if you technically have memory available on the system, you might not have large blocks of available memory. In other words, if you have 4GB available but it's all fragmented, even trying to create a 1GB stream in memory could fail, simply because 1GB of contiguous memory can't be reserved. Obviously, the larger the file you're trying to create in memory, the greater the chance that you're going to run into this issue. For this reason alone, I would say you're out of luck without raising the amount of system RAM. With 8GB, and probably only 4-6GB actually available to IIS and then split up between worker processes and threads, the odds that you'll be able to claim 25% or so of the available RAM as contiguous space are slim.
The next immutable fact may or may not be relevant, but since you haven't specified, I'll mention it. If your web app is deployed as 32-bit, you'll have a hard limit of 2GB for any object, meaning a MemoryStream could never house more than 2GB (actually around 1.3-1.6GB as .NET code consumes some of that address space), and any attempt to make it do so will result in an OutOfMemoryException, even if you had some ridiculous amount of RAM on the system like 1TB+. If your app is 64-bit, this is less likely an issue as you can address a ton more memory, assuming it's compiled properly. You'd have to pretty much try to screw that up, though, so you should be fine.
Finally, multiple writes can cause an issue as well. As I said previously, the buffer array resizes (if necessary) in response to writes. Each time it resizes, the new buffer array must also fit in contiguous address space. As a result, multiple resizes can cause you to bump into an OutOfMemoryException you wouldn't have hit if you had written all the data from the start. This is where initializing the MemoryStream can be helpful, but as I said before, it's also a double-edged sword: your initial buffer size might be too large to begin with, and you end up with an exception you might not have had if you'd let it grow organically. Long and short, try to write everything to the stream in one go rather than piecemeal.
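To make that growth behavior concrete, here is a rough analog in Java: ByteArrayOutputStream grows its contiguous backing array much like MemoryStream does, and the sizes below are made up purely for illustration.

import java.io.ByteArrayOutputStream;

public class BufferGrowthDemo {
    public static void main(String[] args) {
        byte[] chunk = new byte[1024 * 1024]; // write 1 MB at a time (arbitrary)

        // Default-sized buffer: the backing array is reallocated and copied every
        // time a write outgrows it, and each new array must again fit into one
        // contiguous block of memory.
        ByteArrayOutputStream grown = new ByteArrayOutputStream();
        for (int i = 0; i < 100; i++) {
            grown.write(chunk, 0, chunk.length);
        }

        // Pre-sized buffer: a single up-front allocation and no intermediate copies,
        // but the full 100 MB must be available as contiguous memory from the start.
        ByteArrayOutputStream presized = new ByteArrayOutputStream(100 * 1024 * 1024);
        for (int i = 0; i < 100; i++) {
            presized.write(chunk, 0, chunk.length);
        }
    }
}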

How do allocations work and how do you prevent them?

The go test tool has a profiler which can tell you the number of allocations performed inside the code.
However, seeing libraries such as this one:
https://github.com/valyala/fasthttp
stating "Zero memory allocations in hot paths"... what does that mean? and how do you achieve this in Go?
I personally don't like their use of language as it sounds like something a marketer would say... All they mean to say is that no allocations will occur in that code because a buffer has been allocated in advance for use there.
So to be clear, they mean 'in this limited scope, no allocations will occur'. How do you achieve this? By allocating a sufficiently large buffer in advance and then reusing it within that scope.
The intent of the package's author(s) is to speed up request handling by allocating up front, at the cost of using more memory (or at least having a more constant hold on memory; in theory the buffer could be the same size as what would otherwise need to be allocated).
If you're curious about the implementation details, take a look at files like byte_buffer.go and args.go and you'll find that there is a pool of buffer objects allocated in advance, so that your handler code doesn't have to do an allocation for the response body etc. Instead you obtain a buffer from the pool (already allocated), write the response data to it, and when you're done it's released back into the pool for reuse. In the standard scenario you'd instead allocate space for the response body, and after the response is returned that object would go out of scope and the memory would be freed. As I mentioned in the paragraph above, moving all of this up front means that when your service starts up it will obtain and hold a larger amount of memory than a similar service built on net/http, which instead obtains and releases memory on an as-needed basis.
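The pattern itself is language-agnostic: pre-allocate buffers once, hand them out, and reset them instead of freeing them. A minimal sketch (written in Java only to stay consistent with the other examples in this digest; the class and method names are made up) could look like this:

import java.util.concurrent.ConcurrentLinkedQueue;

// A tiny buffer pool: buffers are allocated once and reused across requests,
// so the hot path itself performs no allocations once the pool is warm.
public class BufferPool {
    private final ConcurrentLinkedQueue<StringBuilder> pool = new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    public BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    public StringBuilder acquire() {
        StringBuilder buf = pool.poll();
        return (buf != null) ? buf : new StringBuilder(bufferSize); // allocate only when the pool is empty
    }

    public void release(StringBuilder buf) {
        buf.setLength(0); // reset the contents but keep the backing array
        pool.offer(buf);
    }
}

// Usage in a handler: acquire, write the response into the buffer, release.
// StringBuilder buf = pool.acquire();
// try { buf.append("response body"); /* write buf to the client */ }
// finally { pool.release(buf); }

In Go, the standard library's sync.Pool provides this kind of reuse of pre-allocated objects.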

iOS SQLite or 1000 loose files?

Suppose I have 1000 records of variable size, ranging from around 256 bytes to a few K. I wonder whether there is any advantage to putting them into a SQLite database versus just reading/writing 1000 loose files on iOS? I don't need to do any operations other than access by a single key, which I can use as the filename. It seems like the file system would be the winner unless the number of records grows very large.
If your system were read-only, I would say that the file system is the clear winner: a simple binary file and perhaps a small index to know where each record starts would be all that you need. You could read the entire index into memory and then grab your records from the file system as needed, with performance that would be extremely tough for any RDBMS to match.
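For what it's worth, a minimal sketch of that read-only layout (written in Java only for consistency with the other examples in this digest; the on-disk format and how the index gets populated are left out):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Read-only record store: an in-memory index of key -> (offset, length)
// pointing into a single data file.
public class RecordStore {
    private final RandomAccessFile data;
    private final Map<String, long[]> index = new HashMap<>(); // value = {offset, length}

    public RecordStore(String dataPath) throws IOException {
        this.data = new RandomAccessFile(dataPath, "r");
        // In a real app the index would be loaded from a small index file here.
    }

    public byte[] read(String key) throws IOException {
        long[] entry = index.get(key);
        if (entry == null) {
            return null;
        }
        byte[] record = new byte[(int) entry[1]];
        data.seek(entry[0]);
        data.readFully(record);
        return record;
    }
}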
However, since you are planning on writing data back, I would suggest going with SQLite because of potential data integrity issues.
Performance concerns should not be underestimated either: since your records are of variable size, writing the data back may prove difficult when records need to expand. Moreover, since you are on a mobile platform, you would need to build something in to avoid data corruption when the program is killed unexpectedly in the middle of a write. SQLite takes care of this; your code would have to build something comparable, or risk data corruption problems.

Constant memory dumps

Is it possible to constantly dump the memory of a process to record every change that is happening? For example, if I have a program that modifies the contents of an array, I'd like to know the contents of that array before some modification. I imagine a program could save the initial memory and then all subsequent changes to a file; I'd then search the file for the modified contents of the array (which I know), look for changes at that specific memory location before that moment, and find the initial contents.
Does a program like that exist? If so, what program would you recommend?
EDIT: I wrote a program in C++ that captures packets of another process using pcap and I would like to know how these packets are constructed inside that program. I'm using Windows.
Notice that memory content is (or may be) changing a lot faster than a disk is capable of writing.
Also, your question is OS specific. I guess that you are using Linux.
In all cases, design your application around these goals very early.
Perhaps you are looking for application checkpointing. If on Linux, consider BLCR.
Perhaps you are looking for some persistence mechanism. A possible way might be to explicitly persist the state of your application at some points in your program which are executed frequently (see the short sketch at the end of this answer). Persistence of the call stack or of continuations is a difficult issue.
You may want to use textual formats (like JSON) for serialization. You could be interested in database technology, either relational/SQL (e.g. SQLite or PostgreSQL) or NoSQL (e.g. MongoDB).
Persistence and checkpointing may be related to garbage collection algorithms (notably copying GC).
Some language implementations are able to persist their entire heap. For example, in Common Lisp, the SBCL implementation offers save-lisp-and-die.
For debugging, you might want watchpoints, or the gcore(1) command.
Notice that if you fork(2) your process and immediately make the child process sleep or idle, you are keeping a snapshot of your address space in that child process.
Read also about transactional memory & ACID properties.
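As a concrete (if simplistic) illustration of the "persist your state at chosen points" suggestion above, using Java serialization purely as a stand-in for JSON or a database:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Write a snapshot of application state to disk at points the program chooses,
// e.g. after every N processed packets.
public class Checkpoint {
    public static void save(Serializable state, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(state);
        }
    }
}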

What is the fastest way for reading huge files in Delphi?

My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: after I implemented file access using a memory mapped file as suggested, it turned out that it did not make much of a difference. But it also turned out, after I added some more timing code, that it is not the file access that slows down the program. The file access actually takes nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data, and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can play with them to gain some performance.
Next, copying the file data, even 100Kb in size, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid copying and also possibly get certain performance benefits (but you need to measure speed carefully).
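For reference, the same memory-mapping idea expressed in Java (the Win32 calls named above are what you would use from Delphi; this is only an illustration of reading a chunk through a mapped view of the file):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedChunkReader {
    // Map only the requested region of the file and read the bytes directly
    // from the mapping, without going through an intermediate stream copy.
    public static byte[] readChunk(String path, long offset, int length) throws Exception {
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            MappedByteBuffer view = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] chunk = new byte[length];
            view.get(chunk);
            return chunk;
        }
    }
}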
Maybe you can take this approach:
Sort the entries on file position and then do the following:
Take the entries that only need the first X MB of the file (up to a certain file position)
Read X MB from the file into a buffer (TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.
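A rough sketch of that sort-and-cache idea (in Java for consistency with the other examples here; it assumes the window is at least as large as the largest entry, which holds for ~100 KB chunks and multi-MB windows):

import java.io.RandomAccessFile;
import java.util.Comparator;
import java.util.List;

// Sort the entries by offset, read the file in large windows, and serve every
// entry that falls completely inside the current window from that one buffer.
public class ChunkedReader {
    public record Entry(long offset, int length) {}

    public static void processAll(RandomAccessFile file, List<Entry> entries, int windowSize)
            throws Exception {
        entries.sort(Comparator.comparingLong(Entry::offset));
        int i = 0;
        while (i < entries.size()) {
            long windowStart = entries.get(i).offset();
            int toRead = (int) Math.min(windowSize, file.length() - windowStart);
            byte[] window = new byte[toRead];
            file.seek(windowStart);
            file.readFully(window);
            // Handle every entry that fits completely inside this window.
            while (i < entries.size()
                    && entries.get(i).offset() + entries.get(i).length() <= windowStart + toRead) {
                Entry e = entries.get(i);
                int start = (int) (e.offset() - windowStart);
                // process window[start .. start + e.length()) here, possibly on a worker thread
                i++;
            }
        }
    }
}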
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history, Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.
