Is it technically possible to fine-tune Stable Diffusion (any checkpoint) on a GPU with 8GB VRAM?

I am currently playing around with some generative models such as Stable Diffusion, and I was wondering whether it is technically possible, and actually sensible, to fine-tune the model on a GeForce RTX 3070 with 8GB of VRAM. It's just to play around a bit, so the dataset is small and I don't expect good results out of it, but to my understanding, if I turn the batch size down far enough and use lower-resolution images, it should be technically possible. Or am I missing something? On their repository they say that you need a GPU with at least 24GB.
I have not started coding yet, because I wanted to first check whether it's even possible before I end up setting everything up and then find out it does not work.

I've read about a person who was able to train using 12GB of VRAM, following the instructions in this video:
https://www.youtube.com/watch?v=7bVZDeGPv6I&ab_channel=NerdyRodent
It sounds a bit painful, though. You would definitely want to try using the
--xformers
and
--lowvram
command line arguments when you start up SD. I would love to hear how this turns out and whether you get it working.
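On the training side (those flags are for the webui), the usual levers that make low-VRAM fine-tuning plausible are batch size 1, reduced resolution, gradient checkpointing, memory-efficient attention, and an 8-bit optimizer. Below is a minimal sketch of those knobs, assuming the Hugging Face diffusers and bitsandbytes libraries; the checkpoint ID, learning rate, and the omitted data pipeline are placeholders, not a verified 8GB recipe:

```python
# Sketch only: the memory-saving knobs, not a complete training loop.
import torch
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # placeholder checkpoint
)
unet.enable_gradient_checkpointing()               # recompute activations to save VRAM
unet.enable_xformers_memory_efficient_attention()  # requires xformers to be installed

optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-5)  # 8-bit optimizer states
scaler = torch.cuda.amp.GradScaler()               # fp16 mixed-precision training

# Train with batch size 1 at e.g. 256x256, and use gradient accumulation
# to stand in for a larger effective batch.
```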

Related

Emgu CV SURF picture detection against known database?

I'm trying to compare an image against a known set of images and find the closest match(es) using Emgu CV and SURF. I've found a lot of people trying to do the same thing, but no complete solution that uses the GPU for speed.
The closest I've gotten is the tutorial here:
http://romovs.github.io/blog/2013/07/05/matching-image-to-a-set-of-images-with-emgu-cv/
However, that doesn't take advantage of the GPU, and it's really slow for my application. I need something fast like the SurfFeature sample.
So I tried to refactor that tutorial code to match the SurfFeature logic that uses the GPU. Everything was going well, with GpuMats replacing Matrix here and there. But I ran into a major problem when I got to the core of the tutorial above, that is to say, the logic that concatenates all of the descriptors into one large matrix. I couldn't find a way to append GpuMats to each other, and even if I could, there's no guarantee that the FlannIndex search routine would work with the GPU-based code.
So now I'm stuck on something I thought would be relatively straightforward. A number of people have certainly tried to do this over the years, so I'm really surprised that there isn't a published solution.
If you could help me, I'd be most appreciative. To summarize, I need to do the following:
Build a large in-memory (on the GPU) list of descriptors and keypoints for a known set of images using Surf (as per the SurfFeature sample). Given an unknown image, search against the in-memory stuff to find the closest match (if any).
Thanks in advance if you can help!
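For reference, here is roughly what the concatenated-descriptor search being described looks like on the CPU in Python/OpenCV. The GPU/Emgu CV port is exactly the part the question is stuck on, so this only shows the shape of the logic; the image paths and thresholds are placeholders:

```python
import cv2
import numpy as np

# SURF lives in opencv-contrib (cv2.xfeatures2d) and is patented/non-free.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

# Build one large descriptor matrix for the known set, remembering
# which rows came from which image.
paths = ["known1.png", "known2.png"]          # placeholder image set
db_parts, row_to_image = [], []
for idx, path in enumerate(paths):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = surf.detectAndCompute(img, None)
    db_parts.append(desc)
    row_to_image.extend([idx] * len(desc))
db = np.vstack(db_parts)

# FLANN index over the concatenated matrix; query with the unknown image.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
query = cv2.imread("unknown.png", cv2.IMREAD_GRAYSCALE)
_, qdesc = surf.detectAndCompute(query, None)

# Vote for the known image whose rows attract the most good matches.
votes = np.zeros(len(paths))
for m, n in flann.knnMatch(qdesc, db, k=2):
    if m.distance < 0.7 * n.distance:          # Lowe's ratio test
        votes[row_to_image[m.trainIdx]] += 1
print("closest match:", paths[int(votes.argmax())])
```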

Is CGPathContainsPoint() hardware accelerated?

I'm doing an iOS game and would like to use this method for collision detection.
As there are plenty (50+) of points to check every frame, I wondered if this method runs on the iDevice's graphics hardware.
Following up on @DavidRönnqvist's point: it doesn't matter whether it's "hardware accelerated" or not. What matters is whether it is fast enough for your purpose; you can then use Instruments to check where it is eating time and try to improve things.
Moving code to the GPU doesn't automatically make it faster; it can in fact make it much slower since you have to haul all the data over to GPU memory, which is expensive. Ideally to run on the GPU, you want to move all the data once, then do lots of expensive vector operations, and then move the data back (or just put it on the screen). If you can't make the problem look like that, then the GPU isn't the right tool.
It is possible that it is NEON accelerated, but again that's kind of irrelevant; the compiler NEON-accelerates lots of things (and running on the NEON doesn't always mean it runs faster, either). That said, I'd bet this kind of problem would run best on the NEON if you can test lots of points (hundreds or thousands) against the same curves.
You should assume that CGPathContainsPoint() is written to be pretty fast for the general case of "I have one random curve and one random point." If your problem looks like that, it seems unlikely that you will beat the Apple engineers on their own hardware (and 50 points isn't much more than 1). I'd assume, for instance, that they're already checking the bounding box for you and that your re-check is wasting time (but I'd profile it to be sure).
But if you can change the problem to something else, like "I have a known curve and tens of thousands of points," then you can probably hand-code a better solution and should look at Accelerate or even hand-written NEON to attack it.
Profile first, then optimize. Don't assume that "vector processor" is exactly equivalent to "fast," even when your problem is "mathy." For the graphics processor, even more so.
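To make the "many points against one known curve" shape of the problem concrete, here is a hedged sketch in Python/NumPy (standing in for Accelerate/NEON): batch the points, prefilter with the bounding box, and run the exact test only on the survivors. The curve and point counts are placeholders:

```python
import numpy as np
from matplotlib.path import Path

curve = Path([(0, 0), (2, 0), (2, 2), (0, 2)])   # placeholder closed shape
points = np.random.rand(10_000, 2) * 3           # many query points at once

# Cheap bounding-box prefilter first, then the exact test on survivors.
x, y = points[:, 0], points[:, 1]
bbox = (x >= 0) & (x <= 2) & (y >= 0) & (y <= 2)
inside = np.zeros(len(points), dtype=bool)
inside[bbox] = curve.contains_points(points[bbox])  # batch containment test
print(inside.sum(), "points inside")
```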

fastest image processing library?

I'm working on a robot vision system whose main purpose is to detect objects. I want to choose one of these libraries (CImg, OpenCV), and I have knowledge of both of them.
The robot I'm using runs Linux on a 1GHz CPU with 1GB of RAM, I'm using C++, and the image size is 320p.
I want real-time image processing at around 20 out of 25 frames per second.
In your opinion, which library is more powerful? I have tested both and they have roughly the same processing time; OpenCV is slightly better, and I think that's because I use pointers in my OpenCV code.
Please share your opinion and your reasoning.
Thanks.
I think you can possibly get the best performance when you integrate OpenCV with IPP.
See this reference: http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq/
Here is another reference: http://experienceopencv.blogspot.com/2011/07/speed-up-with-intel-integrated.html
Further, once you have frozen an algorithm that works, you can usually isolate it and work your way towards serious optimization (memory optimization, porting to assembly, etc.), the kind that is not available ready-made.
It really depends on what you want to do (what kind of objects you want to detect, the accuracy, which algorithm you are using, etc.) and how much time you have. If it is for generic computer vision/image processing, I would stick with OpenCV. As Dipan said, do consider further optimization. In my experience with optimization for computer vision, the bottleneck is usually memory interconnect bandwidth (or memory itself), so you might have to trade cycles (computation) to save on communication. Understand the algorithm really well in order to optimize it further (which at times can give huge improvements compared to what compilers achieve).
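As a concrete way to compare the two libraries on the robot itself, it may help to measure each candidate pipeline against the frame budget rather than in the abstract. A rough sketch using OpenCV's Python bindings follows; the camera index and the Canny step are placeholders for the real detector, and the same timing approach carries over to C++:

```python
import time
import cv2

cap = cv2.VideoCapture(0)                  # placeholder: robot camera
budget = 1.0 / 20                          # 50ms per frame for 20fps

while True:
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    small = cv2.resize(frame, (320, 240))  # the question's ~320p working size
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)      # stand-in for the real detection step
    dt = time.perf_counter() - t0
    print(f"frame took {dt*1000:.1f}ms ({'OK' if dt < budget else 'over budget'})")
```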

GUI version of OpenCV for feature-detection (SIFT etc.) prototyping before actual project development?

I had an idea for which I need to be able to recognize certain objects or models in a rendered three-dimensional digital movie.
After limited research, I know now that what I need is called feature detection in the field of Computer Vision.
So, what I want to do is:
create a few screenshots of a certain character in the movie (e.g. front/back/left side/right side)
play the movie
while playing the movie, continuously create new screenshots of the movie
for each screenshot, perform feature detection (SIFT? with OpenCV?) to see if any of our character appearances are there (they must still be recognized if the character is further away and thus appears smaller, or if the character is e.g. lying down).
give a notice whenever the character is found
This would be possible with OpenCV, right?
The "issue" is that I would have to learn c++ or python to develop this application. This is not a problem if my movie and screenshots are applicable for what I want to do.
So, I would like to first test my screenshots of the movie. Is there a GUI version of OpenCV that I can input my test data and then execute it's feature detection algorithms manually as a means of prototyping?
Any feedback is appreciated. Thanks.
There is no GUI for OpenCV that is able to do what you want. You will be able to use OpenCV for some aspects of your problem, but there is no ready-made solution waiting for you.
While it's definitely possible to solve your problem, the learning curve is quite steep. If you're a professional, an alternative to learning it all yourself would be to hire an expert to do it for you. It would cost money, but save you time.
EDIT
As far as template matching goes, you wouldn't normally use it to solve such a problem, because the thing you're looking for changes appearance and shape. There aren't really any "dynamic parameters to set". The closest thing you could try is a massive template collection covering the expected forms your target may take. But it would hardly be an elegant solution, and it wouldn't scale.
Next, to your point about face recognition. This is kind of related, but most facial recognition applications deal with a controlled environment: lighting, distance, pose, angle, etc. Outside of that controlled environment face detection effectiveness drops significantly. If you're detecting objects in a movie, then your environment isn't really controlled.
You may want to first try a simpler problem of accurately detecting where the characters are, without determining who they are (video surveillance, essentially). While it may sound simple, you'll find that it's actually non-trivial for arbitrary scenes. The result of solving that problem may be useful in identifying the characters.
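If you do try that simpler detection-only problem first, a hedged starting point is OpenCV's built-in HOG person detector, sketched below in Python. It is trained on real pedestrians, so treat its usefulness on rendered characters as an assumption to verify; the file name is a placeholder:

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame.png")            # placeholder movie frame
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
for (x, y, w, h) in boxes:                 # draw where, not who
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
print(len(boxes), "person-shaped detections")
```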
There is Find-Object by Mathieu Labbé. It was very helpful for me in getting an understanding of the descriptors, since you can change them while your video is running and see what happens.
This is probably too late, but might help someone else looking for a solution.
Well, using OpenCV you would be taking a frame of a video file and doing your computations on it.
There are several different methods of detecting a character in that image, but it's not so easy to make it flexible enough to still find that person if they are, for example, lying on the floor, when you only provided reference images of the character standing.
Basically, you could try extracting all the important features from your set of reference pictures and have a (in your case supervised) learning algorithm produce a good feature vector of that character for classification.
You would then write code that plays the video and takes a video frame, say, every 500ms (or whatever interval you desire), gets a segmentation of the object you think might be that character, and compares it with the reference values you got from your learning algorithm. If there's a match, your code can yell "Yehaaawww!" or do other things...
But all this depends on how flexible you want it to be. You could also try template matching or cross-correlation, which basically shifts the reference image(s) over the frame and checks how similar the two parts are. Unfortunately, this is very sensitive to rotation, deformation, and other noise, so you wouldn't find that person if they were, e.g., lying down. And I doubt you could get all those calculations done in real time...
Basically: yes, OpenCV is good for your image processing/computer vision tasks. But it offers a lot of methods and approaches, and you'd need to find the one that works for your images... it's not a trivial task, though...
Hope that helps...
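As a rough illustration of the frame-every-500ms idea above, here is a hedged sketch in Python/OpenCV using SIFT and a ratio test. The file names, sampling interval, and match threshold are all placeholders, and the segmentation and learning steps are omitted:

```python
import cv2

sift = cv2.SIFT_create()
ref = cv2.imread("character_front.png", cv2.IMREAD_GRAYSCALE)  # one reference view
_, ref_desc = sift.detectAndCompute(ref, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
cap = cv2.VideoCapture("movie.mp4")        # placeholder video file
fps = cap.get(cv2.CAP_PROP_FPS) or 25
step = int(fps * 0.5)                      # sample a frame every ~500ms

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            pairs = matcher.knnMatch(desc, ref_desc, k=2)
            good = [m for m, n in (p for p in pairs if len(p) == 2)
                    if m.distance < 0.75 * n.distance]  # Lowe's ratio test
            if len(good) > 20:                          # arbitrary threshold
                print(f"character candidate at frame {frame_idx}")
    frame_idx += 1
```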
Have you tried looking at some of the work of the Oxford Visual Geometry Group?
Their Video Google system describes, to a large extent, what you want: instance detection.
Their work on Naming People in TV shows is also pretty relevant. A face detection and facial feature pipeline is included that can be run from Matlab. Are you familiar with Matlab?
Have you tried computer vision frameworks like Cassandra? There you can do exactly that with just a few mouse clicks.

Is it practical to include an adaptive or optimizing memory strategy into a library?

I have a library that does I/O. There are a couple of external knobs for tuning the sizes of the memory buffers used internally. When I ran some tests I found that the sizes of the buffers can affect performance significantly.
But the optimum size seems to depend on a bunch of things: the available memory on the PC, the size of the files being processed (which varies from very small to huge), the number of files, the speed of the output stream relative to the input stream, and I'm not sure what else.
Does it make sense to build an adaptive memory strategy into the library? Or is it better to just punt on that and let the users of the library figure out what to use?
Has anyone done something like this - and how hard is it? Did it work?
Given different buffer sizes, I suppose the library could track the time it takes for various operations and then make some decisions about which size is optimal. I could imagine having the library rotate through various buffer sizes in the initial I/O rounds, then eventually do the calculations and adjust the buffer size in future rounds depending on the outcomes. But then, how often should it re-check? How often should it adjust?
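To make the rotate-then-commit idea concrete, here is a minimal sketch in Python (the library in question is presumably not Python, so treat this purely as pseudocode that runs). The candidate sizes, the one-trial-per-size scheme, and the copy loop are all placeholders:

```python
import time

CANDIDATE_SIZES = [16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024]

def copy_with_buffer(src_path, dst_path, bufsize):
    """One I/O round with a given buffer size; returns elapsed seconds."""
    t0 = time.perf_counter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(bufsize):
            dst.write(chunk)
    return time.perf_counter() - t0

def pick_buffer_size(trial_files):
    """Initial rounds: try each candidate size on one (src, dst) pair
    from trial_files and keep the fastest. A real version would average
    over several rounds and re-check periodically."""
    timings = {}
    for size, (src, dst) in zip(CANDIDATE_SIZES, trial_files):
        timings[size] = copy_with_buffer(src, dst, size)
    return min(timings, key=timings.get)
```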
The adaptive approach is sometimes referred to as "autonomic", by analogy with the human autonomic nervous system: you don't consciously control your heart rate and respiration; your autonomic nervous system does that.
You can read about some of this here, and here (apologies for the plugs, but I wanted to show that the concept is being taken seriously, and is manifesting in real products.)
My experience of using products that try to do this is that they do actually work, but they can make me unhappy: there is a tendency for them to take a "father knows best" approach. You make some (you believe small) change to your app or the environment, and something unexpected happens. You don't know why, and you don't know whether it's good. So my rule for autonomy is:
Tell me what you are doing and why
Now, sometimes the underlying math is quite complex. Consider that some autonomic systems follow trends and hence make predictive changes (requests of this type are growing, so let's provision more of resource X), so the mathematical models are non-trivial and simple explanations are not always available. However, some level of feedback to the watching humans can be reassuring.
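In that spirit, the cheapest version of "tell me what you are doing and why" is a log line whenever the adaptive component changes something. A hedged sketch in Python; the function and its arguments are hypothetical:

```python
import logging

log = logging.getLogger("adaptive_io")

def adjust_buffer(current, best, timings):
    """Apply an adaptive decision, but announce it and the reason first."""
    if best != current:
        log.info("switching buffer %d -> %d bytes: %.1fms vs %.1fms per round",
                 current, best, timings[current] * 1e3, timings[best] * 1e3)
    return best
```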
