In PyTorch, I use the `torch.cuda.memory.CUDAPluggableAllocator` and `cudaMallocManaged` methods to allocate from the combined (GPU memory + DRAM + swap memory).

When I do this, my computer becomes very slow with extremely high `iowait` values and quickly hangs after using up all the GPU memory and DRAM, even though the `CPU usage` values appear to be normal.

I am using *Ubuntu Server 22.04*, *Anaconda3*, and *Docker*. Linux automatically starts using swap memory once all the DRAM is used up. I want to train an AI model, predict with it, and store it without the computer lagging, even when the required memory exceeds the (GPU memory + DRAM + swap memory).

##################################################

Reference:
- Using custom memory allocators for CUDA: https://pytorch.org/docs/stable/notes/cuda.html
- Introduction to swap memory: https://blogs.oracle.com/linux/post/understanding-linux-kernel-memory-statistics

What I have tried:


My goal is to train an AI model, predict with it, and store it without the computer lagging, even when the required memory exceeds the (GPU memory + DRAM + swap memory).

To achieve this, I have tried the following three approaches, alone and in combination:
- Force a program to use swap memory directly before DRAM runs out.
- Use PyTorch's built-in functions to accomplish my objectives.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

I have tried `cgroup v2`, Docker (including the NVIDIA Docker runtime), Linux `vm.swappiness` tuning, PyTorch `fbgemm` UVM tensors, and `torch.cuda.memory.CUDAPluggableAllocator`, but could not achieve my goal.

---
The following command line is expected to implement two of the methods above:
- Force a program to use swap memory directly before DRAM runs out.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

`cgroup v2` is used to limit DRAM use.
The command I used is:
```
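# 42949672960 bytes = 40 GiB; memory use above this memory.high limit is throttled and reclaimed by the kernel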
echo 42949672960 > /path/to/the/location/memory.high
```
---
The following segment of a command line is expected to implement the same two methods:
- Force a program to use swap memory directly before DRAM runs out.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

Docker is also used to limit DRAM and swap device usage.
The relevant segment of the command line is:
```
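# --memory caps the container's RAM at 10 GB; --memory-swap is the combined RAM+swap total, so swap here is ~3779 GB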
docker run ... \
--memory=10g \
--memory-swap=3789g \
...
```
and [EDITED ON 2024-05-25]
```
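# throttle the container's I/O on the given block device (write bandwidth in bytes per second, read rate in IO operations per second)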
docker run ... \
--device-write-bps=/path/to/device:1500mb \
--device-read-iops=/path/to/device:1500gb \
...
```
---
The following source code is expected to implement these two methods:
- Use PyTorch's built-in functions to accomplish my objectives.
- Employ a software control program that prevents the computer from lagging while it continues to train the AI model, predict with it, and store it.

The `torch.cuda.memory.CUDAPluggableAllocator` method loads the allocation functions from `alloc.so`, which is compiled from the following `alloc.cc` source code:
```
// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda-11.8/include -shared -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// Allocate unified (managed) memory that CUDA can migrate between GPU memory and DRAM on demand.
void* my_malloc(ssize_t size, int device, cudaStream_t stream)
{
    void* ptr;
    cudaMallocManaged(&ptr, size);
    return ptr;
}

// Release memory allocated by my_malloc.
void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream)
{
    cudaFree(ptr);
}
}
```
and
```
// Compile with: g++ alloc.cc -o alloc.so -I/usr/local/cuda-11.8/include -shared -fPIC
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
// Allocate pinned (page-locked) host memory and map it into the device address space.
void* my_malloc(ssize_t size, int device, cudaStream_t stream)
{
    void* ptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
    return ptr;
}

// Release memory allocated by my_malloc.
void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream)
{
    cudaFreeHost(ptr);
}
}
```
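For reference, this is roughly how I register the compiled `alloc.so` on the Python side, following the custom-allocator example in the PyTorch docs linked above (the path to `alloc.so` is a placeholder for wherever the library is built):
```
import torch

# Load the custom allocator from the compiled shared library,
# naming the exported allocation and free functions.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    '/path/to/alloc.so', 'my_malloc', 'my_free')

# Make it the active CUDA allocator backend.
torch.cuda.memory.change_current_allocator(new_alloc)

# Subsequent CUDA tensor allocations now go through my_malloc/my_free.
x = torch.zeros(1024, device='cuda')
```
Note that `change_current_allocator` has to be called before any CUDA memory is allocated; otherwise PyTorch raises an error instead of swapping the allocator.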
---
I also tried `cpulimit`, `prlimit`, and `nice`, but they do not work either. [EDITED ON 2024-05-26]
The `cpulimit` command line is:
```
cpulimit --pid $(process_pid) --limit=15 --lazy --background
```
The CPU usage stays below 100%, but the process still lags and is automatically killed.
The `nice` command line is:
```
nice -n 19 python /path/to/file.py
```
This command does not solve the lagging problem.
And the `prlimit` command line is:
```
prlimit -m=42949672960 python3 /path/to/file.py
```
Process limit status
```
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          42949672960   42949672960   bytes
Max processes             256496        256496        processes
Max open files            1024          1048576       files
Max locked memory         8419708928    8419708928    bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       256496        256496        signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
```
This command does not solve the lagging problem either.
---
I have also tried the PyTorch `fbgemm` library, but the result is similar to using `torch.cuda.memory.CUDAPluggableAllocator`.

Would there be any other possible methods to achieve my goal?

##################################################

Complement 1:
I have already set up swap memory on M.2 NVMe PCIe 3.0 SSDs (two of them).
As far as I know, a PCIe 3.0 M.2 slot has a maximum bandwidth of around 3.x GB/s, and I have two M.2 SSDs.

##################################################

Reference:
- swappiness: https://phoenixnap.com/kb/swappiness
- Use RAM after GPU memory is not enough: https://stackoverflow.com/questions/27035851/use-ram-after-gpu-memory-is-not-enough
- What is the maximum read and write speed for PCIe 3.0 x4 M.2 slots?: https://pcpartpicker.com/forums/topic/391989-what-is-the-maximum-read-and-write-speed-for-pcie-30-x4-m2-slots

1 solution

You answered your own question:
Quote:
My goal is to train an AI model, predict with it, and store it without the computer lagging, even when the required memory exceeds the (GPU memory + DRAM + swap memory).

How on earth are you going to execute code on a machine that does not meet the requirements of the code? If your code demands 64 GB of RAM to execute with any speed and your machine only has 8 GB, you're going to run into severe performance problems. You simply have no way around that, besides adding more RAM to the machine!

Your machine slows down because it's swapping memory to the page file, a very slow process compared to the speed of memory. You cannot possibly expect millions of page swap operations to not slow the machine down.
 
Comments
Justin202 25-May-24 11:02am    
Sorry about that, I didn't state it clearly.
I have a further question: is there another way to limit the memory I/O speed?
Also, would a speed of around 3.x GB/s make the computer lag in this situation?
Is there any solution that lets the program run smoothly without lagging, without reducing the model parameters or batch size, and while still using the GPU for processing?
Thank you very much for the answer, Dave Kreskowiak.
Dave Kreskowiak 25-May-24 15:24pm    
Generally this isn't a good idea, but you can limit CPU usage using either "nice" or "cpulimit". Limiting memory usage can be done with "prlimit".

Keep in mind, doing this opens up a Pandora's box of issues that may cause problems for your code and make them more difficult to diagnose.
Justin202 26-May-24 10:05am    
Many thanks to Dave Kreskowiak again.
Are there other ways to fix it?
Dave Kreskowiak 26-May-24 11:16am    
You have control over CPU usage, that's about it. You cannot control memory I/O speed, nor can you control GPU usage.
Justin202 27-May-24 10:02am    
Is it possible to achieve this goal by starting from `torch.cuda.memory.CUDAPluggableAllocator`, `torch.cuda.change_current_allocator`, and the `alloc.so` file?
Or through a CUDA setting?
