2 articles tagged #optimization.
Techniques to reduce the cost, latency, and resources needed to run language models in production, from quantization to distributed serving.
A technique that stores the model's internal computation for reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.
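As a rough illustration of the idea (not the article's implementation), here is a minimal sketch using the Hugging Face transformers library: the key/value state of a shared prefix is computed once and reused on every call, so only the new suffix tokens are processed. The model name, prompt text, and `answer` helper are placeholders chosen for the example.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model purely for illustration; any causal LM works the same way.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# The repetitive context shared by every call (system prompt, docs, few-shot examples).
shared_prefix = "You are a support assistant. Product manual: ...\n"
prefix_ids = tok(shared_prefix, return_tensors="pt").input_ids

# Pay for the prefix once: run it through the model and keep its key/value cache.
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

def answer(question: str, max_new_tokens: int = 30) -> str:
    """Greedy decoding that reuses the cached prefix; only the question
    and the newly generated tokens are actually computed."""
    cache = copy.deepcopy(prefix_cache)  # keep the shared cache intact across calls
    ids = tok(question, return_tensors="pt").input_ids
    new_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=cache, use_cache=True)
            cache = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            new_tokens.append(next_id.item())
            ids = next_id
    return tok.decode(new_tokens)

# Each call now skips recomputing the shared prefix.
print(answer("How do I reset the device?"))
print(answer("What is the warranty period?"))
```

Hosted LLM APIs expose the same idea as a server-side feature, which is where the quoted cost and latency savings come from without managing any cache yourself.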