Module moondream

MoonDream Model vision-to-text

Moondream is a computer-vision model that can answer real-world questions about images. It is lightweight, with only 1.6B parameters, which lets it run on mobile phones and edge devices.

Reference: MoonDream original implementation.

The model consists of:

  • Vision encoder using a ViT-style architecture
  • Text decoder based on Microsoft’s Phi model
  • Vision projection module to align vision and text embeddings
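How the three parts fit together can be sketched in plain Rust. This is a rough illustration, not the crate's API: the function names `num_patches`, `project`, and `decoder_input` are hypothetical, the patch size of 14 and the toy embedding widths are assumptions, and placing the image embeddings ahead of the prompt embeddings is one plausible arrangement rather than a confirmed detail of this module. Only the 378x378 image size comes from the example log in this page.

```rust
// Illustrative only: plain-Rust stand-ins, not the candle-transformers API.

/// Number of patch embeddings a ViT-style encoder produces for a square image.
fn num_patches(image_size: usize, patch_size: usize) -> usize {
    let per_side = image_size / patch_size;
    per_side * per_side
}

/// Linear projection mapping each vision embedding to the text embedding
/// width. `weight` is laid out as [out_dim][in_dim].
fn project(vision: &[Vec<f32>], weight: &[Vec<f32>]) -> Vec<Vec<f32>> {
    vision
        .iter()
        .map(|v| {
            weight
                .iter()
                .map(|row| row.iter().zip(v).map(|(w, x)| w * x).sum::<f32>())
                .collect()
        })
        .collect()
}

/// Builds the decoder input by placing the projected image embeddings before
/// the prompt's token embeddings (assumed ordering).
fn decoder_input(image: Vec<Vec<f32>>, prompt: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    image.into_iter().chain(prompt).collect()
}

fn main() {
    // 378 matches the tensor dims in the example log; patch size 14 is assumed.
    let n = num_patches(378, 14);
    println!("patches: {n}");

    // Toy widths (4 -> 3) so the sketch runs instantly.
    let vision = vec![vec![1.0_f32; 4]; n];
    let weight = vec![vec![0.5_f32; 4]; 3];
    let projected = project(&vision, &weight);

    // Five hypothetical prompt tokens, already embedded at the text width.
    let prompt = vec![vec![0.0_f32; 3]; 5];
    let input = decoder_input(projected, prompt);
    println!("decoder input: {} x {}", input.len(), input[0].len());
}
```

The point of the projection step is that the decoder sees a single sequence of same-width embeddings, so image patches and prompt tokens can be processed by the same text decoder.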

Examples

# download an example image
wget https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jpg

# Now you can run Moondream from the `candle-examples` crate:
cargo run --example moondream \
  --release -- \
  --prompt "What is the girl eating?" \
  --image "./demo-1.jpg"

> avx: false, neon: true, simd128: false, f16c: false
> temp: 0.00 repeat-penalty: 1.00 repeat-last-n: 64
> retrieved the files in 3.395583ms
> Running on CPU, to run on GPU(metal), build this example with `--features metal`
> loaded the model in 5.485493792s
> loaded and encoded the image Tensor[dims 3, 378, 378; f32] in 4.801396417s
> starting the inference loop
> The girl is eating a hamburger.<
> 9 tokens generated (0.68 token/s)

Structs

Config
Encoder
Model
VisionConfig
VisionEncoder