• brucethemoose@lemmy.world

Not at all. Not even close.

Image generation is usually batched and takes seconds: roughly 700W (a single H100 SXM) running for a few seconds, for a batch of a few images served to multiple users. Maybe more for the absolute biggest (but SFW, no porn) models.
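
A quick back-of-envelope check of that figure (the batch size and generation time here are illustrative assumptions, not benchmarks):

```python
# Rough energy-per-image estimate for batched diffusion inference.
# All numbers are illustrative assumptions, not measurements.
gpu_power_w = 700      # H100 SXM board power at full load
gen_time_s = 5         # "a few seconds" per batch
batch_size = 8         # a batch of images shared across users

batch_energy_wh = gpu_power_w * gen_time_s / 3600
per_image_wh = batch_energy_wh / batch_size

print(f"Batch energy:     {batch_energy_wh:.2f} Wh")  # ~0.97 Wh
print(f"Per-image energy: {per_image_wh:.2f} Wh")     # ~0.12 Wh
```

A 450W gaming PC burns that per-image amount in about one second of play.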

LLM generation takes more VRAM, but is MUCH lighter on compute. Typically one has banks of 8 GPUs across multiple servers serving many, many users at once. Even my lowly RTX 3090 can serve 8+ users in parallel with TabbyAPI (and a modestly sized model) before becoming compute-bound.
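
To make the "many users per GPU" point concrete, here is a minimal sketch of 8 concurrent users hitting one local OpenAI-compatible completion endpoint (TabbyAPI exposes one; the URL, port, and API key below are assumptions about a local setup, not a spec):

```python
# Minimal sketch: 8 concurrent "users" sharing one local LLM server.
# Endpoint URL, port, and API key are assumptions -- adjust to your setup.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:5000/v1/completions"        # assumed local port
HEADERS = {"Authorization": "Bearer your-api-key"}  # assumed auth scheme

def one_user(i: int) -> float:
    start = time.time()
    r = requests.post(URL, headers=HEADERS, json={
        "prompt": f"User {i}: say something brief about GPUs.",
        "max_tokens": 64,
    }, timeout=120)
    r.raise_for_status()
    return time.time() - start

with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(one_user, range(8)))

# With batched inference, total wall time is far less than 8x one
# request, because decode steps for all users share the same GPU passes.
print([f"{t:.1f}s" for t in latencies])
```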

So in a nutshell, imagegen (on an 80GB H100) works out to something more like 1/4-1/8 of a video game per user (not 8 games at once), and only for a few seconds.

Text generation is similarly efficient, if not more so. Responses take longer (many seconds, except on special hardware like Cerebras CS-2s), but they are parallelized over dozens of users per GPU.
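
Putting rough numbers on both of those per-user fractions (the batch sizes and gaming wattage are assumptions):

```python
# Per-user share of one GPU's power draw vs. a single gaming PC.
# Batch sizes and the 450 W gaming figure are illustrative assumptions.
h100_power_w = 700
gaming_pc_w = 450        # a high-end gaming rig under load

imagegen_users = 8       # users sharing the GPU during an image batch
textgen_users = 24       # "dozens" of concurrent text-gen users

for name, users in [("imagegen", imagegen_users), ("textgen", textgen_users)]:
    share_w = h100_power_w / users
    print(f"{name}: {share_w:.0f} W/user = "
          f"1/{gaming_pc_w / share_w:.0f} of a gaming PC")
# imagegen: ~88 W/user = ~1/5 of a gaming PC (and only for seconds)
# textgen:  ~29 W/user = ~1/15 of a gaming PC
```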


This is all excluding more specialized hardware like Google’s TPUs, Huawei NPUs, Cerebras CS-2s and so on, which are clocked far more efficiently than Nvidia/AMD GPUs.


…The worst are probably video generation models. These are extremely compute-intensive and take a long time (at the moment), so you are burning something like a few minutes of gaming time per output.
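
Same back-of-envelope treatment for video (the per-clip generation time is an assumption):

```python
# Energy for one generated video clip, expressed as minutes of gaming.
# Generation time and gaming wattage are illustrative assumptions.
gpu_power_w = 700
clip_gen_time_s = 120    # assume ~2 minutes of GPU time per clip
gaming_pc_w = 450

clip_energy_wh = gpu_power_w * clip_gen_time_s / 3600      # ~23 Wh
gaming_minutes = clip_energy_wh / gaming_pc_w * 60         # ~3 min

print(f"One clip: {clip_energy_wh:.1f} Wh ≈ {gaming_minutes:.1f} min of gaming")
```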

ollama/sd-web-ui are terrible analogs for all this because they are single-user and relatively unoptimized.