In PyTorch, I use the `torch.cuda.memory.CUDAPluggableAllocator` and `cudaMallocManaged` methods to allocate from the combined (GPU memory + DRAM + swap memory).

When I do this, my computer becomes very slow with extremely high `iowait` values and quickly hangs after using up all the GPU memory and DRAM, even though the `CPU usage` values appear to be normal.

I am using *Ubuntu Server 22.04*, *Anaconda3*, and *Docker*. Linux automatically starts using swap memory once all the DRAM is used up. I want to train an AI model, predict with it, and store it without the computer lagging, even when the required memory exceeds the (GPU memory + DRAM + swap memory).

##################################################

Reference:
- Using custom memory allocators for CUDA: https://pytorch.org/docs/stable/notes/cuda.html
- Introduction to swap memory: https://blogs.oracle.com/linux/post/understanding-linux-kernel-memory-statistics

What I have tried:


My goal is to train an AI model, predict with it, and store it without the computer lagging, even when the required memory exceeds the (GPU memory + DRAM + swap memory).

To achieve this, I have tried the following three approaches, alone and in combination:
- Force a program to use swap memory directly before DRAM runs out.
- Use PyTorch's built-in functions to accomplish my objectives.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

I have tried `cgroup v2`, Docker (including the NVIDIA Docker runtime), Linux `vm.swappiness` tuning, PyTorch `fbgemm` UVM tensors, and `torch.cuda.memory.CUDAPluggableAllocator`, but could not achieve my goal.

---
The following command line is expected to implement two of the methods above:
- Force a program to use swap memory directly before DRAM runs out.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

`cgroup v2` is used to limit DRAM use.
The command I used is:
```
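# 42949672960 bytes = 40 GiB; memory use above this memory.high limit is throttled and reclaimed by the kernel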
echo 42949672960 > /path/to/the/location/memory.high
```
---
The following segment of a command line is expected to implement the same two methods:
- Force a program to use swap memory directly before DRAM runs out.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

Docker is also used to limit DRAM and swap device usage.
The relevant segment of the command line is:
```
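# --memory caps the container's RAM at 10 GB; --memory-swap is the combined RAM+swap total, so swap here is ~3779 GB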
docker run ... \
--memory=10g \
--memory-swap=3789g \
...
```
and [EDITED ON 2024-05-25]
```
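# throttle the container's I/O on the given block device (write bandwidth in bytes per second, read rate in IO operations per second)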
docker run ... \
--device-write-bps=/path/to/device:1500mb \
--device-read-iops=/path/to/device:1500gb \
...
```
---
The following source code is expected to implement these two methods:
- Use PyTorch's built-in functions to accomplish my objectives.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

The `torch.cuda.memory.CUDAPluggableAllocator` method loads the allocation functions from `alloc.so`, which is compiled from the following `alloc.cc` source code:
```
// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda-11.8/include -shared -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// Allocate unified (managed) memory that CUDA can migrate between GPU memory and DRAM on demand.
void* my_malloc(ssize_t size, int device, cudaStream_t stream)
{
    void* ptr;
    cudaMallocManaged(&ptr, size);
    return ptr;
}

// Release memory allocated by my_malloc.
void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream)
{
    cudaFree(ptr);
}
}
```
and
```
// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda-11.8/include -shared -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// Allocate pinned (page-locked) host memory and map it into the device address space.
void* my_malloc(ssize_t size, int device, cudaStream_t stream)
{
    void* ptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
    return ptr;
}

// Release memory allocated by my_malloc.
void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream)
{
    cudaFreeHost(ptr);
}
}
```
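For reference, this is roughly how I register the compiled `alloc.so` on the Python side, following the custom-allocator example in the PyTorch docs linked above (the path to `alloc.so` is a placeholder for wherever the library is built):
```
import torch

# Load the custom allocator from the compiled shared library,
# naming the exported allocation and free functions.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    '/path/to/alloc.so', 'my_malloc', 'my_free')

# Make it the active CUDA allocator backend.
torch.cuda.memory.change_current_allocator(new_alloc)

# Subsequent CUDA tensor allocations now go through my_malloc/my_free.
x = torch.zeros(1024, device='cuda')
```
Note that `change_current_allocator` has to be called before any CUDA memory is allocated; otherwise PyTorch raises an error instead of swapping the allocator.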
---
I also tried `cpulimit`, `prlimit`, and `nice`, but they do not work either. [EDITED ON 2024-05-26]
The `cpulimit` command line is:
```
cpulimit --pid $(process_pid) --limit=15 --lazy --background
```
The CPU usage stays below 100%, but the process still lags and is automatically killed.
The `nice` command line is:
```
nice -n 19 python /path/to/file.py
```
This command does not solve the lagging problem.
And the `prlimit` command line is:
```
prlimit -m=42949672960 python3 /path/to/file.py
```
Process limit status
```
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          42949672960   42949672960   bytes
Max processes             256496        256496        processes
Max open files            1024          1048576       files
Max locked memory         8419708928    8419708928    bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       256496        256496        signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
```
This command does not solve the lagging problem either.
---
I have also tried the PyTorch `fbgemm` library, but the result is similar to using `torch.cuda.memory.CUDAPluggableAllocator`.

Would there be any other possible methods to achieve my goal?

##################################################

Complement 1:
I have already set up swap memory on M.2 NVMe PCIe 3.0 SSDs (two of them).
As far as I know, a PCIe 3.0 M.2 slot has a maximum bandwidth of around 3.x GB/s, and I have two M.2 SSDs.

##################################################

Reference:
- swappiness: https://phoenixnap.com/kb/swappiness
- Use RAM after GPU memory is not enough: https://stackoverflow.com/questions/27035851/use-ram-after-gpu-memory-is-not-enough
- What is the maximum read and write speed for PCIe 3.0 x4 M.2 slots?: https://pcpartpicker.com/forums/topic/391989-what-is-the-maximum-read-and-write-speed-for-pcie-30-x4-m2-slots

1 solution

You answered your own question:
Quote:
My goal is to train an AI model, predict with it, and store it without the computer lagging, even when the required memory exceeds the (GPU memory + DRAM + swap memory).

How on earth are you going to execute code on a machine that does not meet the requirements of the code? If your code demands 64 GB of RAM to execute with any speed and your machine only has 8 GB, you're going to run into severe performance problems. You simply have no way around that, besides adding more RAM to the machine!

Your machine slows down because it's swapping memory to the page file, a very slow process compared to the speed of memory. You cannot possibly expect millions of page swap operations to not slow the machine down.
 
Comments
Justin202 25-May-24 11:02am    
Sorry about that, I didn't state it clearly.
I have a further question: is there another way to limit the memory I/O speed?
Also, would a speed of around 3.x GB/s make the computer lag in this situation?
Is there any solution that lets the program run smoothly without lagging, without reducing the model parameters or batch size, and while still using the GPU for processing?
Thank you very much for the answer, Dave Kreskowiak.
Dave Kreskowiak 25-May-24 15:24pm    
Generally this isn't a good idea, but you can limit CPU usage using either "nice" or "cpulimit". Limiting memory usage can be done with "prlimit".

Keep in mind, doing this opens up a Pandora's box of issues that may cause problems for your code and make them more difficult to diagnose.
Justin202 26-May-24 10:05am    
Many thanks to Dave Kreskowiak again.
Are there other ways to fix it?
Dave Kreskowiak 26-May-24 11:16am    
You have control over CPU usage, that's about it. You cannot control memory I/O speed, nor can you control GPU usage.
Justin202 27-May-24 10:02am    
Is it possible to achieve this goal by starting from `torch.cuda.memory.CUDAPluggableAllocator`, `torch.cuda.change_current_allocator`, and the `alloc.so` file?
Or through a CUDA setting?
