GPUs and scaling

Request GPUs by canonical name, put several GPUs on one worker, bound the number of workers, and pin tasks to a region.

GPU types

Specify a GPU by its canonical name. All types are available in the US region; availability in other regions varies. A100 is the 40 GB variant; use A100_80 for the 80 GB variant.

python

# Consumer / inference
@app.task(gpu=gw.Gpu("T4"))       # 16 GB
@app.task(gpu=gw.Gpu("L4"))       # 24 GB
@app.task(gpu=gw.Gpu("A10G"))     # 24 GB

# Training / large models
@app.task(gpu=gw.Gpu("A100"))     # 40 GB
@app.task(gpu=gw.Gpu("A100_80"))  # 80 GB
@app.task(gpu=gw.Gpu("H100"))     # 80 GB
@app.task(gpu=gw.Gpu("H200"))     # 141 GB
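
When choosing among these types, one simple approach is to pick the smallest GPU whose memory covers your model. The sketch below is plain Python and is not part of gworker; the helper name `smallest_gpu_that_fits` is hypothetical, and the memory figures are copied from the list above.

```python
# GPU memory in GB, copied from the GPU types list above.
GPU_MEMORY_GB = {
    "T4": 16,
    "L4": 24,
    "A10G": 24,
    "A100": 40,
    "A100_80": 80,
    "H100": 80,
    "H200": 141,
}


def smallest_gpu_that_fits(required_gb: float) -> str:
    """Return the canonical name of the smallest listed GPU with enough memory.

    Hypothetical helper for illustration. Ties on memory go to whichever
    type appears first in GPU_MEMORY_GB.
    """
    candidates = [(name, mem) for name, mem in GPU_MEMORY_GB.items() if mem >= required_gb]
    if not candidates:
        raise ValueError(f"no listed GPU has {required_gb} GB of memory")
    # min() returns the first item with the smallest memory value.
    return min(candidates, key=lambda item: item[1])[0]
```

The returned name could then be passed to `gw.Gpu(...)`. If a model does not fit on any single GPU, the helper raises, which is the point at which a multi-GPU request becomes relevant.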

Multi-GPU

Request multiple GPUs on a single worker with the count argument. All GPUs are on the same node, which suits tensor-parallel inference or DDP training.

python

@app.task(gpu=gw.Gpu("A100", count=4), memory=65536, timeout=3600)
async def train_ddp():
    import torch
    print(torch.cuda.device_count())  # 4
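
In DDP training, each GPU (rank) typically processes a disjoint slice of the dataset. The following is a minimal, framework-free sketch of that sharding idea, assuming a strided split; the function name `shard_indices` is hypothetical. PyTorch's `DistributedSampler` works along similar lines but also handles shuffling and padding.

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Return the sample indices handled by one rank under a strided split.

    Hypothetical helper: rank r takes samples r, r + world_size, r + 2 * world_size, ...
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    # With count=4, world_size would be 4 (one rank per GPU).
    for rank in range(4):
        print(rank, shard_indices(10, world_size=4, rank=rank))
```

When the sample count is not divisible by the number of GPUs, some ranks receive one sample more than others; real samplers usually pad or drop samples to keep ranks in step.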

Worker bounds

min_workers keeps that many containers warm, so requests served by them skip the cold start. max_workers caps fan-out under load. Set min_workers=1 for latency-sensitive inference.

python

@app.task(
    gpu=gw.Gpu("A100"),
    min_workers=1,   # always 1 warm worker
    max_workers=10,  # scale up to 10 under load
)
async def infer(prompt: str) -> str: ...
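
To see how the two bounds interact, it can help to think of the worker count as a demand estimate clamped between them. The sketch below illustrates that idea only; it is not gworker's actual scaling algorithm, and `target_workers` and its `per_worker` parameter are assumptions made for the example.

```python
import math


def target_workers(pending: int, per_worker: int, min_workers: int, max_workers: int) -> int:
    """Clamp a demand-based worker count to [min_workers, max_workers].

    Hypothetical model: each worker is assumed to handle `per_worker`
    pending requests at a time.
    """
    if per_worker < 1:
        raise ValueError("per_worker must be at least 1")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    wanted = math.ceil(pending / per_worker)
    return max(min_workers, min(wanted, max_workers))


if __name__ == "__main__":
    for pending in (0, 12, 100):
        print(pending, target_workers(pending, per_worker=4, min_workers=1, max_workers=10))
```

With no pending requests the count stays at min_workers, and under heavy load it stops at max_workers regardless of demand.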

Region pinning

Pin a task to a datacenter region with the region argument. ANY (the default) lets gworker pick the cheapest available region. Pin the task to the same region as your volume to avoid cross-region egress.

python

@app.task(gpu=gw.Gpu("H100"), region=gw.Region.US)
async def run(): ...
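
A quick pre-deploy check can flag configurations that risk cross-region egress. This is a plain-Python sketch, not a gworker API: `may_cross_regions` is a hypothetical helper, regions are represented as strings, and "EU" in the usage example is a placeholder region name used only for illustration.

```python
def may_cross_regions(task_region: str, volume_region: str) -> bool:
    """Return True if the task could run in a different region than its volume.

    Hypothetical helper. "ANY" means the placement is not known ahead of
    time, so it is treated as a possible mismatch.
    """
    if task_region == "ANY":
        return True
    return task_region != volume_region


if __name__ == "__main__":
    print(may_cross_regions("US", "US"))   # pinned to the volume's region
    print(may_cross_regions("ANY", "US"))  # unpinned task
    print(may_cross_regions("EU", "US"))   # "EU" is a placeholder name
```

Treating ANY as a possible mismatch is a conservative choice: the task may still land next to the volume, but nothing guarantees it.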
