Hey, I’m working on some local LLM applications, and my goal is to run the smallest model possible without crippling performance. I’m already using 4-bit GPTQ, but I want something smaller. These models have been trained on a massive amount of data, but my specific use case only touches a very small fraction of it, so I’d imagine it’s possible to cut away large chunks of the model that I don’t care about. I’m wondering whether there has been any work on runtime pruning of LLMs (not just static pruning based on model weights) driven by “real world” data. Something like: you run the model a bunch of times on your actual data and monitor the neuron activations to inform some kind of pruning process. Does anyone here know of something like that?
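
To make the idea concrete, here is a minimal sketch of activation-based pruning on a toy two-layer MLP, not a real LLM. Everything in it is an assumption for illustration: the helper names (`mean_activation`, `prune`), the layer sizes, the random "calibration" inputs standing in for real usage data, and the scoring rule (mean absolute activation per hidden neuron). Half of the neurons are deliberately given a large negative bias so they never fire, which makes the effect easy to see.

```python
import random

def hidden_activations(W1, b1, x):
    # Hidden layer with ReLU: h_j = max(0, W1[j] . x + b1[j])
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]

def forward(W1, b1, W2, x):
    # Linear output layer on top of the hidden activations.
    h = hidden_activations(W1, b1, x)
    return [sum(w * hj for w, hj in zip(row, h)) for row in W2]

def mean_activation(W1, b1, calibration):
    # Score each hidden neuron by its mean absolute activation
    # over the calibration inputs (the "real world" data).
    totals = [0.0] * len(W1)
    for x in calibration:
        for j, a in enumerate(hidden_activations(W1, b1, x)):
            totals[j] += abs(a)
    return [t / len(calibration) for t in totals]

def prune(W1, b1, W2, scores, keep):
    # Keep the `keep` highest-scoring neurons: drop the other rows of
    # W1/b1 and the matching columns of W2.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    W1p = [W1[j] for j in kept]
    b1p = [b1[j] for j in kept]
    W2p = [[row[j] for j in kept] for row in W2]
    return W1p, b1p, W2p, kept

rng = random.Random(0)
n_in, n_hidden, n_out = 4, 8, 2
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
# Toy setup: odd-indexed neurons get a bias so negative they never fire.
b1 = [5.0 if j % 2 == 0 else -100.0 for j in range(n_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

# Stand-in for inputs collected from the actual use case.
calibration = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(32)]

scores = mean_activation(W1, b1, calibration)
W1p, b1p, W2p, kept = prune(W1, b1, W2, scores, keep=4)

x = calibration[0]
print("kept neurons:", kept)
print("original output:", forward(W1, b1, W2, x))
print("pruned output:  ", forward(W1p, b1p, W2p, x))
```

In this toy case the pruned neurons are completely dead on the calibration data, so the outputs on that data are unchanged. In a real model most neurons are only rarely or weakly active rather than dead, so pruning would trade some accuracy for size and would need to be evaluated on held-out data from the same use case.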

  • minipasila@lemmy.fmhy.ml
    1 year ago

    I don’t know about that, but you could try GGML (llama.cpp). It supports quantization down to 2 bits, so that might be small enough.
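
For a rough sense of what 2-bit quantization means, here is a toy sketch of block-wise quantization in Python. It is an assumption-laden illustration and not llama.cpp's actual format: the function names, the per-block min/scale scheme, and the example weights are all hypothetical, and the real formats are more elaborate.

```python
def quantize_block(block, bits=2):
    # Map each weight in the block to an integer code in 0..(2**bits - 1),
    # storing one scale and one minimum per block.
    levels = (1 << bits) - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    # Reconstruct approximate weights from the codes.
    return [lo + c * scale for c in codes]

weights = [-0.9, -0.2, 0.1, 0.35, 0.9, -0.6, 0.0, 0.7]  # hypothetical block
codes, scale, lo = quantize_block(weights)
restored = dequantize_block(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("max reconstruction error:", max_error)
```

Each weight is stored in 2 bits plus a small per-block overhead for the scale and minimum, and the reconstruction error per weight is bounded by about half the step size `scale`.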