Is it practical to build an adaptive or optimizing memory strategy into a library?

I have a library that does I/O. There are a couple of external knobs for tuning the sizes of the memory buffers used internally. When I ran some tests I found that the sizes of the buffers can affect performance significantly.
But the optimum size seems to depend on a bunch of things: the available memory on the PC, the size of the files being processed (which varies from very small to huge), the number of files, the speed of the output stream relative to the input stream, and I'm not sure what else.
Does it make sense to build an adaptive memory strategy into the library? Or is it better to just punt on that and let the users of the library figure out what to use?
Has anyone done something like this - and how hard is it? Did it work?
Given different buffer sizes, I suppose the library could track the time it takes for various operations, and then it could make some decisions about which size was optimal. I could imagine having the library rotate through various buffer sizes in the initial I/O rounds... and then it eventually would do the calculations and adjust the buffer size in future rounds depending on the outcomes. But then, how often to re-check? How often to adjust?
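Roughly, I imagine something like the following minimal C++ sketch. The class and method names (BufferTuner, recordRound, reprobe) are made up for illustration and are not tied to any particular library; the idea is just to sweep a few candidate sizes, measure throughput, then settle on the best one.

#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: cycles through candidate buffer sizes during a
// probing phase, then sticks with the size that gave the best throughput.
class BufferTuner {
public:
    explicit BufferTuner(std::vector<std::size_t> candidates)
        : sizes_(std::move(candidates)), bytesPerSec_(sizes_.size(), 0.0) {}

    // Size to use for the next I/O round.
    std::size_t nextSize() const {
        return probing_ ? sizes_[index_] : best_;
    }

    // Call after each round with the bytes moved and the time it took.
    void recordRound(std::size_t bytes, std::chrono::duration<double> elapsed) {
        if (!probing_) return;
        double rate = bytes / elapsed.count();
        // Keep the best rate seen for the current candidate size.
        if (rate > bytesPerSec_[index_]) bytesPerSec_[index_] = rate;
        if (++roundsForCurrent_ == roundsPerCandidate_) {
            roundsForCurrent_ = 0;
            if (++index_ == sizes_.size()) {       // finished one full sweep
                std::size_t bestIdx = 0;
                for (std::size_t i = 1; i < sizes_.size(); ++i)
                    if (bytesPerSec_[i] > bytesPerSec_[bestIdx]) bestIdx = i;
                best_ = sizes_[bestIdx];
                probing_ = false;                  // settle until re-probed
            }
        }
    }

    // Could be called every N minutes or every M files to re-check.
    void reprobe() { probing_ = true; index_ = 0; roundsForCurrent_ = 0; }

private:
    std::vector<std::size_t> sizes_;
    std::vector<double> bytesPerSec_;
    std::size_t index_ = 0;
    std::size_t best_ = 64 * 1024;
    int roundsForCurrent_ = 0;
    const int roundsPerCandidate_ = 4;   // average over a few rounds per size
    bool probing_ = true;
};

The open questions above (how often to re-check, how often to adjust) show up here as the roundsPerCandidate_ constant and whoever decides to call reprobe(); those would be policy choices the library has to expose or document.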

The adaptive approach is sometimes referred to as "autonomic", by analogy with the human autonomic nervous system: you don't consciously control your heart rate and respiration, your autonomic nervous system does that.
You can read about some of this here, and here (apologies for the plugs, but I wanted to show that the concept is being taken seriously, and is manifesting in real products.)
My experience of using products that try to do this is that they do actually work, but they can make me unhappy: there is a tendency for them to take a "Father knows best" approach. You make some (you believe) small change to your app or the environment, and something unexpected happens. You don't know why, and you don't know if it's good. So my rule for autonomy is:
Tell me what you are doing and why
Now sometimes the underlying math is quite complex. Consider that some autonomic systems track trends and hence make predictive changes (the number of requests of this type is growing, so let's provision more of resource X), so the mathematical models are non-trivial. Hence simple explanations are not always available. However, some level of feedback to the watching humans can be reassuring.
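To make "tell me what you are doing and why" concrete, an adaptive library could accept an optional listener and report every decision it makes. A hypothetical sketch (the TuningListener type and the notifyBufferChange helper are mine, not from any real library):

#include <cstddef>
#include <functional>
#include <string>

// Hypothetical hook: the library reports each tuning decision and its reason,
// and the application decides whether to log it, show it, or ignore it.
using TuningListener = std::function<void(const std::string& what,
                                          const std::string& why)>;

void notifyBufferChange(const TuningListener& listener,
                        std::size_t oldSize, std::size_t newSize,
                        double oldRate, double newRate) {
    if (!listener) return;
    listener("buffer size " + std::to_string(oldSize) + " -> " +
                 std::to_string(newSize),
             "measured throughput improved from " + std::to_string(oldRate) +
                 " to " + std::to_string(newRate) + " bytes/sec");
}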

Related

Is there any way to calculate DRAM access latency (cycles) from data size?

I need to calculate DRAM access latency from a given data size to be transferred between DRAM and SRAM.
The data is separated into a "load size" and a "store size", and the "number of iterations of load and store" is given.
I think there are many features I need to consider, like the first DRAM access latency, the latency to transfer one word, the address load latency, etc.
Is there some popular equation to compute this from the given information?
Thank you in advance.
Your question has many parts; I think I could help better if I knew the ultimate goal. If it's simply to measure access latency:
If you are using an x86 processor, maybe the Intel Memory Latency Checker will help:
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.
If not x86, I think the Gem5 Simulator has what you are looking for; here is the main page, but more specifically for your needs, I think this config for Gem5 will be the most helpful.
Now regarding a popular equation, the best I could find is this Carnegie Mellon paper, which goes over my head: https://users.ece.cmu.edu/~omutlu/pub/chargecache_low-latency-dram_hpca16.pdf However, it looks like your main "features", as you put it, revolve around cores and memory channels. The equation from the paper:
Storage_bits = C * MC * Entries * (EntrySize_bits + LRU_bits)
is used to create a cache that will ultimately (the goal of ChargeCache) reduce access latency in DRAM. I'm sure this isn't the equation you are looking for but just a piece of the puzzle. The LRU_bits relate to the cache this mechanism (in the memory controller, no DRAM modification necessary) creates.
EntrySize_bits is determined by this equation: EntrySize_bits = log2(R) + log2(B) + log2(Ro) + 1, and
R, B, and Ro are the number of ranks, banks, and rows in DRAM, respectively.
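If it helps, the two equations can be plugged together directly. Here is a small C++ sketch; my reading (not guaranteed) is that C is the number of cores and MC the number of memory controllers, and all the numeric values below are made-up examples just to show the arithmetic:

#include <cmath>
#include <cstdio>

// Sketch of the ChargeCache storage-overhead equations quoted above.
// Assumption: C = cores, MC = memory controllers, Entries = entries per cache.
double entrySizeBits(double ranks, double banks, double rows) {
    return std::log2(ranks) + std::log2(banks) + std::log2(rows) + 1.0;
}

double storageBits(double cores, double memControllers, double entries,
                   double entrySize, double lruBits) {
    return cores * memControllers * entries * (entrySize + lruBits);
}

int main() {
    // Example values only: 2 ranks, 8 banks, 32768 rows -> 20-bit entries.
    double entry = entrySizeBits(/*ranks=*/2, /*banks=*/8, /*rows=*/32768);
    double total = storageBits(/*cores=*/4, /*memControllers=*/2,
                               /*entries=*/128, entry, /*lruBits=*/7);
    std::printf("EntrySize = %.0f bits, Storage = %.0f bits\n", entry, total);
    return 0;
}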
I was surprised to learn that highly charged rows (those recently accessed) have a significantly lower access latency.
If this goes over your head as well, maybe the 2007 paper by Ulrich Drepper titled What Every Programmer Should Know About Memory will help you find the elements you need for your equation. I'm still working through that paper myself, and there are some dated references, but how much that matters depends on what CPU you're working with. Hope this helps; I look forward to being corrected on any of this, as I'm new to the topic.

What is "expensive" vs "inexpensive" memory?

I've been reading that accessing data from lower components in the memory hierarchy is slower but less expensive. For example, fetching data from registers is fast but expensive. Could someone explain what 'expensive' means here? Is it literally the dollar cost of components? If so, I don't understand why faster components would be more expensive. I read this answer (Memory Hierarchy - Why are registers expensive?) and it talks about how accessing data in registers requires additional data paths that aren't required by lower memory components, but I didn't understand from any of the examples why those data paths would be required when fetching from registers, but not from something like main memory.
So to summarize, my two questions are:
1) What does 'expensive' mean in this context?
2) Why are faster areas of memory like registers more expensive?
Thanks!
1) What does 'expensive' mean in this context?
Expensive has the usual meaning ($). For an integrated circuit, price depends on circuit size and is directly related to the number and size of transistors. It happens that "expensive" memory requires more area on an integrated circuit.
Actually, there are several technologies used to implement a memory device.
Registers are used in the processor. They are realized with logic devices called latches, and their main quality is speed, in order to allow two reads and one write per cycle. To that end, transistors are sized for stronger drive. It depends on the actual design, but typically a bit of memory requires ~10 transistors in a register.
Static memory (SRAM) is designed as a matrix of simplified flip-flops, with 2 cross-coupled inverters per cell, and only requires 6 transistors per stored bit. Moreover, to improve the number of bits per unit area, the transistors are designed to be smaller than those used in registers. SRAM is used for cache memory.
Dynamic memory (DRAM) uses a single transistor and a capacitor per bit; the capacitor is either charged or discharged to represent a 1 or a 0. Though extremely economical, this technique cannot be very fast, especially when a large number of cells is involved, as in present DRAM chips. To improve capacity (the number of bits in a given area), transistors are made as small as possible, and complex analog circuitry is used to detect the small voltage variations and speed up cell reads. Moreover, a read destroys the cell content and requires a write-back. Last, the capacitor leaks, so data must be periodically rewritten (refreshed) to ensure data integrity. Put together, this makes DRAM a slow device, with an access time of 100-200 processor cycles, but it provides extremely cheap physical memory.
2) Why are faster areas of memory like registers more expensive?
Processors rely on a memory hierarchy, and different levels of the hierarchy have specific constraints. To make a cheap memory, you need small transistors to reduce the area required to store a bit. But for electrical reasons, a small transistor is a weak driver that cannot provide enough current to switch its output quickly. So, while the underlying technology is similar, the design choices are different in different parts of the memory hierarchy.
What does 'expensive' mean in this context?
More and larger transistors (more silicon per bit), and more power to run those transistors.
Why are faster areas of memory like registers more expensive?
All memory is made as cheap as it can be for the needed speed -- there's no reason to make it more expensive than it needs to be, or slower than it needs to be. So it becomes a tradeoff, finding "sweet spots" in the design space: make a particular type of circuit as fast and cheap as possible, while a different circuit is also tuned to be as fast and cheap as possible. If one design is both slower and more expensive, then there's no reason to ever use it. It's only when one design is faster while the other is cheaper that it makes sense to use both in different parts of the system.

Methods to Find 'Best' Cut-Off Point for a Continuous Target Variable

I am working on a machine learning scenario where the target variable is Duration of power outages.
The distribution of the target variable is severely right-skewed (you can imagine that most power outages occur and are over fairly quickly, but there are many, many outliers that can last much longer). A lot of these power outages become less and less 'explainable' by the data as the durations get longer and longer. They become, more or less, 'unique outages', where events are occurring on site that are not necessarily 'typical' of other outages, nor is data recorded on the specifics of those events beyond what's already available for all other 'typical' outages.
This causes a problem when creating models. This unexplainable data mingles in with the explainable parts and skews the model's ability to predict.
I analyzed some percentiles to decide on a point that I considered to encompass as many outages as possible while I still believed that the duration was going to be mostly explainable. This was somewhere around the 320 minute mark and contained about 90% of the outages.
This was completely subjective on my part, though, and I know there has to be some kind of procedure to determine a 'best' cut-off point for this target variable. Ideally, I would like this procedure to be robust enough to consider the trade-off of encompassing as much data as possible, and not simply tell me to make my cut-off 2 hours, thus cutting out a significant number of customers, since the purpose of this is to provide an accurate Estimated Restoration Time to as many customers as possible.
FYI: The methods of modeling I am using that appear to be working the best right now are random forests and conditional random forests. Methods I have used in this scenario include multiple linear regression, decision trees, random forests, and conditional random forests. MLR was by far the least effective. :(
I have exactly the same problem! I hope someone more informed brings their knowledge. I wonder to what point a long duration is something we want to discard rather than something we want to predict!
Also, I tried treating my data by log-transforming it, and the density plot shows a funny artifact on the left side of the distribution (because I only have durations as integer numbers, not floats). I think this helps; you should also log-transform the features that have similar distributions.
I finally thought that the solution should be stratified sampling or giving weights to features, but I don't know exactly how to implement that. My attempts didn't produce any good results. Perhaps my data is too stochastic!
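As a concrete version of the percentile analysis described in the question and the log transform mentioned above, here is a small C++ sketch; the 90% figure and the duration values are placeholders, not real data:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Return the value below which roughly pct (0..1) of the samples fall.
double percentile(std::vector<double> values, double pct) {
    std::size_t k = static_cast<std::size_t>(pct * (values.size() - 1));
    std::nth_element(values.begin(), values.begin() + k, values.end());
    return values[k];
}

int main() {
    // Placeholder outage durations in minutes (real data would be loaded here).
    std::vector<double> durations = {12, 35, 48, 60, 75, 90, 110, 240, 320, 1500};

    double cutoff = percentile(durations, 0.90);   // e.g. the 90th percentile
    std::printf("90th percentile cut-off: %.1f minutes\n", cutoff);

    // log1p keeps integer-valued durations finite at 0 and compresses the tail.
    for (double d : durations)
        std::printf("duration %.0f -> log1p %.3f\n", d, std::log1p(d));
    return 0;
}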

Is CGPathContainsPoint() hardware accelerated?

I'm doing an iOS game and would like to use this method for collision detection.
As there are plenty (50+) of points to check every frame, I wondered if this method runs on the iDevice's graphics hardware.
Following up on @DavidRönnqvist's point: it doesn't matter whether it's "hardware accelerated" or not. What matters is whether it is fast enough for your purpose, and then you can use Instruments to check where it is eating time and try to improve things.
Moving code to the GPU doesn't automatically make it faster; it can in fact make it much slower since you have to haul all the data over to GPU memory, which is expensive. Ideally to run on the GPU, you want to move all the data once, then do lots of expensive vector operations, and then move the data back (or just put it on the screen). If you can't make the problem look like that, then the GPU isn't the right tool.
It is possible that it is NEON accelerated, but again that's kind of irrelevant; the compiler NEON-accelerates lots of things (and running on the NEON doesn't always mean it runs faster, either). That said, I'd bet this kind of problem would run best on the NEON if you can test lots of points (hundreds or thousands) against the same curves.
You should assume that CGPathContainsPoint() is written to be pretty fast for the general case of "I have one random curve and one random point." If your problem looks like that, it seems unlikely that you will beat the Apple engineers on their own hardware (and 50 points isn't much more than 1). I'd assume, for instance, that they're already checking the bounding box for you and that your re-check is wasting time (but I'd profile it to be sure).
But if you can change the problem to something else, like "I have a known curve and tens of thousands of points," then you can probably hand-code a better solution and should look at Accelerate or even hand-written NEON to attack it.
Profile first, then optimize. Don't assume that "vector processor" is exactly equivalent to "fast", even when your problem is "mathy." The graphics processor even more so.
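To illustrate the "known curve, many points" reformulation above (this is not how CGPathContainsPoint is implemented internally, which I don't know), here is a plain C++ sketch of a bounding-box pre-check plus the classic even-odd crossing test against a fixed polygon; a NEON or Accelerate version would test several points per iteration instead:

#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { float x, y; };

// Even-odd (crossing number) test of one point against a closed polygon.
// Assumes a non-empty polygon; this is the scalar version of the test.
bool containsPoint(const std::vector<Point>& poly, Point p) {
    bool inside = false;
    for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
        bool crosses = (poly[i].y > p.y) != (poly[j].y > p.y);
        if (crosses) {
            float xAtY = poly[j].x + (poly[i].x - poly[j].x) *
                         (p.y - poly[j].y) / (poly[i].y - poly[j].y);
            if (p.x < xAtY) inside = !inside;
        }
    }
    return inside;
}

// Batch version with a bounding-box pre-check computed once per frame.
std::vector<bool> containsPoints(const std::vector<Point>& poly,
                                 const std::vector<Point>& pts) {
    float minX = poly[0].x, maxX = poly[0].x, minY = poly[0].y, maxY = poly[0].y;
    for (const Point& v : poly) {
        minX = std::min(minX, v.x); maxX = std::max(maxX, v.x);
        minY = std::min(minY, v.y); maxY = std::max(maxY, v.y);
    }
    std::vector<bool> result(pts.size(), false);
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const Point& p = pts[i];
        if (p.x < minX || p.x > maxX || p.y < minY || p.y > maxY) continue;
        result[i] = containsPoint(poly, p);
    }
    return result;
}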

fastest image processing library?

I'm working on a robot vision system whose main purpose is to detect objects. I want to choose one of these libraries (CImg, OpenCV), and I have knowledge of both of them.
The robot I'm using runs Linux, with a 1 GHz CPU and 1 GB of RAM. I'm using C++, and the image size is 320p.
I want real-time image processing at around 20 out of 25 frames per second.
In your opinion, which library is more powerful? I have tested both and they have about the same processing time; OpenCV is slightly better, and I think that's because I use pointers in my OpenCV code.
Please share your opinion and your reasoning.
Thanks.
I think you can possibly get the best performance when you integrate OpenCV with IPP.
See this reference, http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq/
Here is another reference http://experienceopencv.blogspot.com/2011/07/speed-up-with-intel-integrated.html
Further, once you freeze on an algorithm that works well, you can usually isolate it and work your way toward serious optimization (such as memory optimization, porting to assembly, etc.), which might not come ready to use.
It really depends on what you want to do (what kind of objects you want to detect, the accuracy, which algorithm you are using, etc.) and how much time you have. If it is for generic computer vision/image processing, I would stick with OpenCV. As Dipan said, do consider further optimization. In my experience with optimization for computer vision, the bottleneck is usually memory interconnect bandwidth (or memory itself), so you might have to trade cycles (computation) to save on communication. Understand the algorithm really well so you can optimize it further (which at times can give huge improvements compared to what compilers achieve).
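Whichever library you pick, it's worth measuring per-frame time on the robot itself before optimizing. A minimal OpenCV timing sketch (the Canny call is just a stand-in for whatever detection pipeline you end up with, and the camera index is an assumption):

#include <cstdint>
#include <cstdio>
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);              // default camera on the robot
    if (!cap.isOpened()) return 1;

    cv::Mat frame, gray, edges;
    for (int i = 0; i < 100; ++i) {       // time 100 frames
        int64_t start = cv::getTickCount();

        cap >> frame;
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::Canny(gray, edges, 50, 150);  // stand-in for the real detector

        double ms = (cv::getTickCount() - start) * 1000.0 / cv::getTickFrequency();
        std::printf("frame %d: %.1f ms (%.1f fps)\n", i, ms, 1000.0 / ms);
    }
    return 0;
}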
