• 0 Posts
  • 15 Comments
Joined 1 year ago
Cake day: June 14th, 2023


  • What use case would that be?

    I can get like 8 tokens/s running 13B models in Q3_K_L quantization on my laptop, about 2.2 for 33B, and 1.5 for 65B (I bought 64 GB of RAM to be able to run larger models lol). 7B was STUPID fast because the entire model fits inside my (8 GB) GPU, but 7B models mostly suck (wizard-vicuna-uncensored is decent, every other one I’ve tried was not).
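
    Rough sketch of how you can measure numbers like that with llama-cpp-python (the model path and settings are just placeholders, not my exact setup):

    ```python
    # Hedged sketch: time generation with llama-cpp-python and report tokens/s.
    # The model path is a placeholder; point it at any quantized GGML file.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/13b.ggmlv3.q3_K_L.bin",  # placeholder quantized model
        n_ctx=2048,
    )

    start = time.perf_counter()
    out = llm("Explain quantization in one paragraph.", max_tokens=128)
    elapsed = time.perf_counter() - start

    gen_tokens = out["usage"]["completion_tokens"]
    # Includes prompt-ingestion time, so this slightly underestimates pure generation speed.
    print(f"{gen_tokens / elapsed:.1f} tokens/s")
    ```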


  • Adding to this: text-generation-webui (https://github.com/oobabooga/text-generation-webui) works with the latest bleeding-edge llama.cpp via llama-cpp-python, and it has a nice graphical front-end. You do have to manually tell pip to install llama-cpp-python with the right compiler flags to get GPU acceleration working, but the llama-cpp-python GitHub and the ooba GitHub explain how to do this (there’s a rough sketch of the commands at the end of this comment).

    You can even set up GPU acceleration through Metal on M1 Macs. I’ve seen some fucking INSANE performance numbers online for the higher-RAM MacBook Pros (20+ tokens/sec, I think with a 33B model, but it might have been 13B; either way, impressive).
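
    Roughly what that install step looks like (flag names as documented by llama-cpp-python around mid-2023; double-check the current READMEs before copying):

    ```python
    # Hedged sketch of the install step described above; the build flags below are
    # the ones documented by llama-cpp-python circa mid-2023 and may have changed:
    #
    #   CUDA (NVIDIA):
    #     CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
    #       pip install --force-reinstall --no-cache-dir llama-cpp-python
    #   Metal (Apple Silicon):
    #     CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 \
    #       pip install --force-reinstall --no-cache-dir llama-cpp-python
    #
    # Quick check that the GPU-enabled build loads; n_gpu_layers > 0 enables offload
    # (with the Metal build even n_gpu_layers=1 moves compute onto the GPU).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/any-quantized-model.bin",  # placeholder path
        n_gpu_layers=1,
    )
    ```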


  • Llama.cpp recently added CUDA acceleration for generation (previously only prompt ingestion was GPU-accelerated), and in my experience it’s faster than GPTQ unless you can fit absolutely 100% of the model in VRAM. If even a single layer is offloaded to the CPU, GPTQ’s performance immediately becomes like 30-40% worse than an equivalent CPU offload with llama.cpp.
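
    For context, partial layer offload through llama-cpp-python looks roughly like this (a sketch; the path and layer count are placeholders):

    ```python
    # Hedged sketch: partial GPU offload with llama.cpp via llama-cpp-python.
    # Layers that aren't offloaded run on the CPU, which is the mixed case being
    # compared against GPTQ above. Path and numbers are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/13b.ggmlv3.q4_K_M.bin",  # placeholder quantized model
        n_gpu_layers=30,  # a 13B LLaMA has 40 layers, so ~10 stay on the CPU here
        n_ctx=2048,
    )

    print(llm("Hello", max_tokens=32)["choices"][0]["text"])
    ```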


  • It may or may not take off to the point of replacing Reddit, but I think the exodus of people now, and especially after the end of the month, will lead to it having at least as many users as Mastodon. Maybe more, since the average Reddit user is probably more tech-savvy and more willing to migrate to a different platform than the average Twitter user (since they follow subreddits rather than individual users). And a roughly Mastodon-sized Lemmy is more than usable as a Reddit replacement imo.