Traditional machine vision relies on rigid, rule-based algorithms or supervised models trained on thousands of labeled defect images. When lighting conditions change or a new defect type appears, the system breaks down and has to be retrained or re-engineered.
Enter MLLMs
Multimodal Large Language Models (MLLMs) like GPT-4V and Gemini Pro Vision are changing the game. These models can "see" and "reason" about images without task-specific training data.
Application in Manufacturing
Imagine a camera system where you can simply type a prompt: "Is there any debris on the conveyor belt?" or "Does this weld look consistent with the previous one?". Detailed prompt engineering allows for zero-shot defect detection, drastically reducing deployment time.
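As a rough illustration, a zero-shot check like this can be wired up against an MLLM API in a few lines. The sketch below uses the OpenAI Python client; the model name, image path, and prompt wording are placeholders, and a production system would add retries, structured output parsing, and integration with the camera pipeline.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def inspect_frame(image_path: str, question: str) -> str:
    """Send one camera frame plus a natural-language inspection prompt to an MLLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model could be substituted
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content


# Example zero-shot prompt from above; "frame_001.jpg" is a hypothetical capture.
print(inspect_frame("frame_001.jpg",
                    "Is there any debris on the conveyor belt? Answer yes or no, then explain."))
```

The same function works unchanged for the weld-consistency question or any new inspection task, which is exactly where the deployment-time savings come from: the "retraining" step becomes a prompt edit.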
Challenges
The main bottleneck is latency. Running a massive transformer model on every frame at 60 FPS is not yet practical. However, distillation techniques are rapidly making smaller, faster vision-language models viable for edge deployment.
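To make the latency gap concrete, here is a minimal back-of-the-envelope sketch. The per-call MLLM latency used below is an assumption for illustration, not a benchmark, and the every-Nth-frame sampling shown at the end is a common stopgap distinct from the distillation approach mentioned above.

```python
FPS = 60
frame_budget_ms = 1000 / FPS        # ~16.7 ms available per frame at 60 FPS
assumed_mllm_latency_ms = 800       # assumption: round trip to a large hosted MLLM

# How many frames arrive while a single MLLM call is still in flight?
frames_per_call = assumed_mllm_latency_ms / frame_budget_ms
print(f"Frame budget: {frame_budget_ms:.1f} ms, assumed MLLM call: {assumed_mllm_latency_ms} ms")
print(f"~{frames_per_call:.0f} frames arrive during one inference call")

# Stopgap: query the MLLM on every Nth frame and use cheap classical checks in between.
sample_every_n = int(frames_per_call) + 1
print(f"Querying every {sample_every_n}th frame keeps the MLLM off the per-frame critical path")
```

Distilled edge models shrink that 800 ms figure toward the frame budget itself, which is why they are the more durable fix.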
About the Author
Nay Linn Aung is a Senior Technical Product Owner specializing in the convergence of OT and IT.