• notfromhere@lemmy.one (OP)
    1 year ago

    I hope llama.cpp supports SuperHOT at some point. I never use GPTQ, but I may need to make an exception to try out the larger context sizes. Are you using exllama? I'm curious why you’re getting garbage output.

    • simple@lemmy.mywire.xyz
      1 year ago

      Yeah, llama.cpp with SuperHOT support would be great, and yes, I’m using exllama with the oobabooga UI. I found out why I’m getting garbage output with 2K. It seems like SuperHOT 8K models, when run with a 2K context, have a massive increase in perplexity.

      (The higher the perplexity, the worse the output quality.)
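For anyone unfamiliar with the term, here is a minimal sketch of how perplexity is usually computed: the exponential of the average negative log-likelihood per token. The `perplexity` helper and the probability values are made up for illustration; they are not taken from exllama or from any real model run.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical natural-log probabilities the model assigned to the correct tokens.
confident = [math.log(0.5)] * 4    # model gives each true token 50% probability
uncertain = [math.log(0.05)] * 4   # model gives each true token 5% probability

print(perplexity(confident))
print(perplexity(uncertain))
```

A model that puts less probability on the tokens that actually occur gets a higher perplexity, which is why a big jump usually shows up as garbage output.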

      So I’ll need to figure out if I can get at least 4K running without running out of VRAM.

      Also, there is a new PR for exllama that uses a different method of getting a larger context (not SuperHOT) and also has a smaller perplexity increase. So that might be a better alternative.

      • notfromhere@lemmy.one (OP)
        1 year ago

        I read the guy’s blog post on SuperHOT and it sounded like it didn’t increase perplexity and kept perplexity super low with large contexts. I could have read it wrong but I thought it wasn’t supposed to increase perplexity.

        • simple@lemmy.mywire.xyz
          1 year ago

          The increase in perplexity is very small, but there is still some with 8K context. With 2K, though, it seems to be much larger. I could be misunderstanding something myself, but my little test with 2K context does suggest there’s something going on with 2K contexts on SuperHOT models.
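One possible explanation, as a rough sketch: as far as I understand it, SuperHOT 8K models are fine-tuned with rotary position indices compressed by a factor of 0.25 (2048 / 8192). If the loader runs them at 2K with no compression, the model sees position angles it was not tuned on. The `rope_angles` helper below is hypothetical and only illustrates the scaling idea; it is not exllama's actual code, and the head dimension and base are assumed values.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotation angles for one token position under rotary embeddings.

    scale < 1 compresses positions (linear position interpolation).
    head_dim and base are assumed, illustrative values.
    """
    pos = position * scale
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

TRAINED_CTX = 2048   # context length of the base model
TARGET_CTX = 8192    # context length the fine-tune targets
scale = TRAINED_CTX / TARGET_CTX  # 0.25

# With compression, a late position in the 8K window lands inside the original range.
compressed = rope_angles(8188, scale=scale)
# Without compression, the same token gets a 4x larger position.
uncompressed = rope_angles(8188, scale=1.0)

print(compressed[0], uncompressed[0])
```

If this is what is happening, the fix would be to keep the compression setting that matches the fine-tune even when running a shorter context, rather than turning it off.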