Best way to represent formatted text in memory? (C++)

I'm writing a basic text editor; well, it's really an edit control box where I want to write code, numerical values, and expressions for my main program.
The way I'm currently doing it is that I feed character strings into the edit control. In the edit control I have a class that breaks the string up into “glyphs” like words, numbers, line breaks, tabs, format tokens, etc. The word glyphs, for example, contain a string representing a literal word and a short integer that represents the number of trailing whitespace characters. The glyphs also contain the info needed when drawing the text and calculating line wrapping.
For example, the text line “My name is Karl” would become a linked list of glyphs like this:
NewLineGlyph → WordGlyph (“My”, 1 whitespace) → WordGlyph (“name”, 1 whitespace) → WordGlyph(“is”, 1 whitespace ) → WordGlyph (“Karl”, 0 whitespace) → NULL.
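In code, a glyph node looks roughly like this (a simplified sketch, not my actual class):

```cpp
#include <memory>
#include <string>

// Simplified sketch of a glyph node; the real class also carries the
// drawing and line-wrapping info mentioned above.
struct Glyph {
    enum class Kind { NewLine, Word, Tab, FormatToken };
    Kind kind;
    std::string text;           // the literal word (empty for non-word glyphs)
    short trailingSpaces = 0;   // number of trailing whitespace characters
    std::unique_ptr<Glyph> next;
};
```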
So instead of storing the string in memory as a continuous block of chars (or WCHARs), it is stored in small chunks with potentially lots of small allocations and deallocations.
My question is: should I be concerned about heap fragmentation when doing it this way? Do you have any tips on making this more efficient? Or a completely different way of doing it? :)
PS. I'm working in C++ on Win7.

Should you be concerned about fragmentation? The answer likely depends on how large your documents are (e.g., the number of words), how much editing will occur, and the nature of those edits. The approach you have outlined might be reasonable for a static (read-only) document where you can "parse" the document once, but I imagine there will be a fair amount of work that needs to happen behind the scenes to keep your data structures in the correct state as a user makes arbitrary edits. Also, you'll have to decide what a "word" is, which isn't necessarily obvious or consistent in every case. For example, is "hard-working" one word or two? If it's one, does that mean you will never word wrap at the hyphen? Or consider the case where a "word" will not fit on a single line. In that case, will you simply truncate, or would you want to force-break the word across lines?
My recommendation would be store the text as a block, and store the line breaks separately (as offsets into the text block), then recalculate line breaks as needed each time there is a change. If you're concerned about fragmentation and minimizing the number of allocations/deallocations, you could allocate fixed-size blocks and then manage memory inside of those blocks yourself. Here's what I've done in the past:
Text is stored as a block of characters, but rather than having a single contiguous block for the entire document, I maintain a linked list of blocks that are always allocated 4KB (i.e., either 4K single-byte chars, or 2K WCHARs). In other words, the text is stored as a linked list of arrays, where each array is allocated to a constant size.
Each block keeps track of how much space (i.e., characters) are used/free within that block.
When inserting one or more characters, if there is space in the current block, I can simply shift memory within that block (no allocation/deallocation required). If no space is available in the current block, but space is available in the adjacent block, then again I can just shift memory between existing blocks (no allocation/deallocation required). If both blocks are full, only then do I allocate a new 4KB block and add at the appropriate position in the linked list.
When deleting one or more characters, I simply need to shift memory (at most 4KB) rather than entire document text. I also may have to deallocate and remove any block(s) that become completely empty.
I also do some "garbage collection" to coalesce free space at appropriate times. This is fairly straightforward and involves moving characters from one block to another so that some blocks become empty and can be removed.
From the OS and/or runtime library's perspective, all of the allocations/deallocations are the same size (4KB), so there is no fragmentation. And since I manage the contents of that memory, I can avoid fragmentation within my allocated space by shifting memory contents to eliminate wasted space. The other advantage is that it minimizes the number of alloc/dealloc calls, which can be a performance concern depending on what allocator you are using. So, it's an optimization for both speed and size -- how often does that happen? :-)
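A minimal C++ sketch of one such block (the list management, coalescing, and garbage collection are omitted; names are illustrative):

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t kBlockSize = 4096;  // every block is allocated at exactly 4KB

struct TextBlock {
    char        data[kBlockSize];
    std::size_t used = 0;       // characters currently stored in this block
    TextBlock*  prev = nullptr;
    TextBlock*  next = nullptr;

    std::size_t free() const { return kBlockSize - used; }

    // Insert n chars at pos within this block; caller has verified n <= free().
    void insert(std::size_t pos, const char* src, std::size_t n) {
        std::memmove(data + pos + n, data + pos, used - pos);  // shift the tail right
        std::memcpy(data + pos, src, n);
        used += n;
    }

    // Remove n chars starting at pos; at most 4KB is ever shifted.
    void erase(std::size_t pos, std::size_t n) {
        std::memmove(data + pos, data + pos + n, used - pos - n);  // shift the tail left
        used -= n;
    }
};
```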

I wouldn't worry about heap fragmentation; modern heap managers are pretty good at dealing with that.
I might worry about poor data locality, though. With each glyph as a separate allocation in a linked list (especially a non-invasive list like std::list), any sort of pass through the document is going to jump all over memory in a potentially non-cache-friendly way.
Text editors are harder than they seem at first blush. There are a lot of specialized data structures out there for representing blocks of text and structured documents. They each optimize for different types of operations. I recommend searching for explanations of them and then considering the types of operations you'll have to do most.
This paper is old, but it has a lot of good information: http://www.cs.unm.edu/~crowley/papers/sds.pdf

Related

How is data written to memory

When we store data in memory, how does it get stored so that the type of the data can be recognized when it is loaded?
What I want to ask is how data types like natural numbers, integers, characters, etc. are stored in memory, so that they can be recognized easily later when extracted from memory.
When we look at memory, what we see are hex numbers. How can we relate these hex numbers to an ASCII value, an integer value, or anything else?
Since all of your data is written in binary, there isn't much difference between how the char 'a' is written and how the int 97 is written: they produce the same bit pattern (at least in the lowest 8 bits). That being said, when you read from memory you read a particular data type, and that type tells you how to interpret the data.
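For example, in C++ (assuming an ASCII-compatible encoding; the snippet is illustrative only):

```cpp
#include <cstdio>

int main() {
    char c = 'a';
    int  i = 97;
    // Both hold the bit pattern 0110 0001 in their lowest byte; only the
    // interpretation differs, chosen here by the format specifier.
    std::printf("%d %c\n", c, i);  // prints "97 a"
}
```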
Memory does not operate in terms of "character" or "integer", these are high-level concepts that assume an abstract machine.
Typically, but not necessarily, a character is just an integer with a smaller size, often 8 bits (but a character could just as well be 32 bits!), which represents one symbol or letter rather than a discrete number. In some cases, a character may even be encoded using a variable length.
Memory operates in terms of bits that are organized in bytes (smallest directly addressable unit) or words. These are -- unbeknownst to you -- organized in banks. The hardware typically allows access in units called "cache lines", but this is something that happens secretly behind your back.
In assembler language, you can typically access bytes and power-of-two multiples of these, sometimes with special alignment requirements (there's usually also bit operations, but while they only change one bit, they still work on whole bytes/words).
All of that is, however, not very interesting, and also largely irrelevant for you. It is first and foremost the compiler's (or interpreter's) job to make sure that when you speak of an integer or a character, whatever you want comes out at the other end. It is also the tool's responsibility to convert one into another if possible, and to produce an error if not possible.
You do not even know for certain whether the value of an integer or a character has a memory location at all (it may very well be stored in a register) unless you explicitly enforce that.
You cannot distinguish a byte at some memory location that came from a "character" from a byte that belongs to an "integer". They look just the same.
And while it is possible to read the raw bytes of one type as another type in most languages, this is not something you normally need to do (or should do).
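In C++, for instance, the sanctioned way to inspect those raw bytes is through unsigned char (a sketch; the value is arbitrary):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    int value = 0x61626364;
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    for (std::size_t i = 0; i < sizeof value; ++i)
        std::printf("%02x ", bytes[i]);  // byte order depends on the platform (endianness)
    std::printf("\n");
}
```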

How much space does a tab take?

I want to quantify the space saving I can get by changing the format of a file.
I have a sparse matrix stored in a text file (30% sparsity). Columns are separated by tabs.
Following an idea in an SO answer, I will change the format to row_id, col_id for the non-zero terms only. I know how much space a float takes, but my question is: how much space does a tab take?
CouchDeveloper in his comment is correct. It's impossible to tell from the data you provide.
In a single byte character set encoding you'd save 1 byte per separator from the current ", ".
In a multibyte encoding it'd depend on the way each of those characters is encoded, and you could theoretically even lose space: if, say, a tab were encoded as 4 bytes and a comma and a space as 1 byte each, you'd end up using 2 more bytes per separator.
Unless you have many separators and relatively very little data, I wouldn't worry one way or the other; it'd be a micro-optimisation.
If you do, a binary encoding scheme might be more relevant.
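To make the arithmetic concrete, here's a toy C++ comparison (UTF-8/ASCII assumed; the row contents are made up):

```cpp
#include <iostream>
#include <string>

int main() {
    // The same three values, tab-separated vs. comma-space-separated.
    std::string tabbed = "1.5\t0\t2.25";   // separators: 1 byte each
    std::string csv    = "1.5, 0, 2.25";   // separators: 2 bytes each
    std::cout << tabbed.size() << " vs " << csv.size() << '\n';  // 10 vs 12
}
```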
1 byte, but significantly less if you're using compression (based on how common they will be, less than a bit on average). Use compression.

String indexing vs. dynamic array indexing in Delphi

In Delphi, why are AnsiStrings indexed from one and dynamic arrays indexed from zero? Is this a historical accident, to make AnsiStrings work more like ShortStrings, or is there some deeper logic at work?
One of the contributing factors that led to "Pascal" strings being 1 indexed instead of 0 indexed was that the length of the string was stored in the zeroth byte. Yes, that could have been hidden from the programmer's view by having the compiler internally add a constant offset to the string index expression (as was done in Delphi's long strings later) but in the beginning things were much simpler. Allocate a block of memory, store the length in byte zero, index the char data from byte 1. End of story.
As I recall UCSD Pascal was using this length-in-zero-byte convention long before Turbo Pascal came along.
As for why dynamic arrays are zero based, I don't recall any specific reason but I would guess it reflects the dynamic array's kinship to dynamically allocating a buffer and indexing off the buffer pointer. The array types that you would use to create array pointer types were zero based arrays. The first byte is found at buffer pointer + 0 offset. This is the C rationalization for zero based everything. There was no compelling reason to carry string's 1 based indexing pattern over to compiler managed arrays when string's 1 based indexing was already (and had always been) the exception rather than the norm.
It may well be that because the string type was the first array-like data type that everyone first encountered and possibly the most used data type across the board, there may be a perception of a bias towards 1 based indexing in the language. However, if you look closely I think you'll find arrays in Pascal (distinct from string) have never been inherently 1 based, especially when dynamically allocated.
The reason for the Delphi string tradition of 1-based strings is quite simple. The tradition comes from the implementation of old style Turbo Pascal strings. That data type stored the length of the string in the first byte of the variable, index 0. The string data began in the next byte, index 1.
You can still use that data type today. It's now called ShortString. As is immediately obvious from its implementation, there is a 255 character limit. This limit led to the introduction of huge strings, if I recall correctly, in Delphi 2. When huge strings were introduced the language designers chose to retain 1-based indexing to make it easier for developers to switch from short strings to huge strings.
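In C++ terms, the ShortString layout is roughly this (a sketch for illustration, not Delphi's actual declaration):

```cpp
// Approximate memory layout of a Turbo Pascal / Delphi ShortString.
struct ShortString {
    unsigned char len;       // element [0]: the current length, hence the 255-char limit
    char          data[255]; // elements [1]..[255]: the characters themselves
};

// s[1] in Pascal corresponds to data[0] here -- indexing from 1 falls out
// naturally from storing the length in byte zero.
```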
I guess Turbo Pascal didn't invent the idea of using element 0 for length. It's just that I'm too young to remember what came before then!
Dynamic arrays weren't bound by the past in the same way and had a free choice. I don't know why zero based was chosen. Perhaps because it fits more easily with the prevailing fashion on platform on which Delphi existed at that time, namely Windows. That's just a guess though. Danny Thorpe worked on the Delphi compiler at that time, and even he can't remember the rationale!
The Delphi language designers are currently moving towards zero based string indexing for huge strings. The initial steps in this direction can be seen in XE3 in the TStringHelper class which uses 0-based indexing. And also in the ZEROBASEDSTRINGS conditional which allows you to opt in to 0-based indexing. Expect the next generation Delphi compiler to use 0-based indexing only. The times they are changin'.
Historical accident.
Pascal strings and arrays traditionally start at 1.
C - and perhaps consequently AnsiStrings - start at 0.
I don't know the rationale for "breaking with Pascal tradition" for Dynamic arrays, which also start at zero. But it makes sense, and I agree with it ...
IMHO...

Delphi TStringList wrapper to implement on-the-fly compression

I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.
Does anyone know of a TStringList descendant that implements this, i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added, gives a much better performance than having to look up a 'best fit entry' (needed for similarity-compression) over the entire set.
Another thing you might want to read up on is 'ropes' - a conceptually different type from strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope' representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
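In its simplest form a rope is a binary tree whose leaves hold string fragments; concatenation just creates a new inner node. A bare-bones C++ sketch (names are illustrative):

```cpp
#include <memory>
#include <string>

// Concatenation without copying: an inner node just points at two sub-ropes.
struct Rope {
    std::shared_ptr<const Rope> left, right;  // set on inner (concat) nodes
    std::string fragment;                     // set on leaf nodes

    static std::shared_ptr<const Rope> leaf(std::string s) {
        auto r = std::make_shared<Rope>();
        r->fragment = std::move(s);
        return r;
    }
    static std::shared_ptr<const Rope> concat(std::shared_ptr<const Rope> a,
                                              std::shared_ptr<const Rope> b) {
        auto r = std::make_shared<Rope>();
        r->left  = std::move(a);
        r->right = std::move(b);
        return r;
    }
    // Reconstruct the full string by an in-order walk of the leaves.
    void build(std::string& out) const {
        if (left) { left->build(out); right->build(out); }
        else      { out += fragment; }
    }
};
```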
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression: just start each string with the index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access speed will suffer greatly. In one simple implementation, I did a mod 16 on the index to establish the entry at which prefix reduction started, which gave me on average about 40% memory savings (this number is completely data-dependent, of course).
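Here is what that prefix reduction might look like in C++ (the mod-16 restart mirrors the implementation described above; names are illustrative):

```cpp
#include <string>
#include <vector>

// Each stored entry: how many leading chars to reuse from the previous entry,
// plus the remaining suffix. Every 16th entry is stored in full so that
// reconstruction never chases a long chain of references.
struct Entry {
    std::size_t prefixLen;
    std::string suffix;
};

static std::size_t commonPrefix(const std::string& a, const std::string& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

class PrefixList {
    std::vector<Entry> entries_;
    std::string        prev_;   // last string added, kept uncompressed for diffing
public:
    void add(const std::string& s) {
        std::size_t p = (entries_.size() % 16 == 0) ? 0 : commonPrefix(prev_, s);
        entries_.push_back({p, s.substr(p)});
        prev_ = s;
    }
    std::string get(std::size_t i) const {
        // Walk back to the nearest full entry, then replay prefixes forward.
        std::size_t start = i - (i % 16);
        std::string s = entries_[start].suffix;
        for (std::size_t j = start + 1; j <= i; ++j)
            s = s.substr(0, entries_[j].prefixLen) + entries_[j].suffix;
        return s;
    }
};
```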
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
I suppose you expect general flexibility from the list (including the delete operation). In that case I don't know of any out-of-the-box solution, but I'd suggest one of these two approaches:
1. Split your strings into words and keep a separate, growing dictionary that maps each distinct word to an index; internally, each string is then stored as a list of word indexes (see the sketch below).
2. Implement something based on the zlib stream available in Delphi, but operating on blocks that each contain, for example, 10-100 strings. In this case you still have to decompress and recompress a complete block, but the "price" you pay is much lower.
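The first approach amounts to string interning. A minimal C++ sketch (a Delphi version would use a TDictionary similarly):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Each distinct word is stored once; strings become vectors of word indexes.
class WordInterner {
    std::unordered_map<std::string, std::size_t> index_;
    std::vector<std::string>                     words_;
public:
    std::size_t intern(const std::string& w) {
        auto it = index_.find(w);
        if (it != index_.end()) return it->second;
        words_.push_back(w);
        index_.emplace(w, words_.size() - 1);
        return words_.size() - 1;
    }
    const std::string& word(std::size_t i) const { return words_[i]; }
};
```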
I don't think you really want to compress TStrings items individually in memory, because it's terribly inefficient. I suggest you look at the TStream implementations in the Zlib unit. Just wrap a regular stream in TDecompressionStream on load and TCompressionStream on save (you can even emit a gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile.
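For comparison, the same save/load idea with raw zlib from C++ (compress the whole serialized list on save, decompress on load; error handling omitted):

```cpp
#include <string>
#include <vector>
#include <zlib.h>

// Compress an entire serialized string list in one shot (as on SaveToStream).
std::vector<unsigned char> deflateAll(const std::string& text) {
    uLongf destLen = compressBound(text.size());
    std::vector<unsigned char> out(destLen);
    compress(out.data(), &destLen,
             reinterpret_cast<const Bytef*>(text.data()), text.size());
    out.resize(destLen);
    return out;
}

// Decompress on load; the original (uncompressed) size must be stored alongside.
std::string inflateAll(const std::vector<unsigned char>& packed, uLongf originalSize) {
    std::string out(originalSize, '\0');
    uncompress(reinterpret_cast<Bytef*>(&out[0]), &originalSize,
               packed.data(), packed.size());
    return out;
}
```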

Does how you name a variable have any impact on the memory usage of an application?

When declaring a variable, how much impact (if any) does the length of its name have on the total memory of the application? Is there a maximum length anyway? Or are we free to elaborate on our variable (and instance) names as much as we want?
It depends on the language, actually.
If you're using C++ or C, it has no impact.
If you're using an interpreted language, you're passing the source code around, so it can have a dramatic impact.
If you're using a compiled language that compiles to an intermediate language, such as Java or any of the .NET languages, then typically the variable names, class names, method names, etc. are all part of the IL. Having longer method names will have an impact. However, if you later run it through an obfuscator, this goes away, since the obfuscator will rename everything to (typically) very short names. This is why obfuscation can sometimes yield a performance improvement.
However, I will strongly suggest using long, descriptive variable/method/class names. This makes your code understandable, maintainable, and readable - in the long run, that far outweighs any slight perf. benefit.
It has no impact in a compiled language.
In compiled languages, almost certainly not; everything becomes a symbol in a symbol table. In interpreted languages, the answer is also no, with a few extremely rare exceptions (in certain older versions of Python there would be a difference, for example).
MSVC++ truncates variable names to 255 characters. Variable name length has no impact on compiled code size.
As stated by others, variable names disappear in compiled languages. I believe that local variable names in .Net may be discarded. But generally speaking, even in an interpreted language, the memory consumption of variable names is negligible, especially in light of the advantages of good variable names.
Actually, in ASP.NET long variable names for controls and master pages do add to the size of the generated HTML. This will add some insignificant extra memory to buffer the output stream, but the effect will be most noticeable in the extra few hundred bytes you're sending over the network.
In Python, the names appear to be collected into a number of simple tables; each name appears exactly once in each code object. The names have no impact on performance.
For statistical purposes, I looked at a 20 line function that was a solution to Project Euler problem 15. This function created a 292-byte code object. It used 7 distinct names in the name table. You'd have to use 41-character variable names to double the size of the byte-code file.
That would be the only impact -- insanely large names might slow down load time for your module.
