2 articles tagged #inference.
Techniques to reduce the cost, latency, and resource requirements of running language models in production, from quantization to distributed serving.
Proposed standard for publishing a Markdown file at a website's root that enables language models to efficiently understand and use the site's content at inference time.