Expand description
Quantized Llama2 model implementation.
This provides an 8-bit quantized implementation of Meta’s LLaMA2 language model for reduced memory usage and faster inference.
Key characteristics:
- Decoder-only transformer architecture
- RoPE position embeddings
- Grouped Query Attention
- 8-bit quantization of weights
References:
Re-exports§
pub use crate::quantized_var_builder::VarBuilder;