Traditional machine vision relies on rigid, rule-based algorithms or supervised models trained on thousands of labeled defect images. When lighting conditions change or a new defect type appears, the system breaks down and has to be retrained or re-engineered.
Enter MLLMs
Multimodal Large Language Models (MLLMs) like GPT-4V and Gemini Pro Vision are changing the game. These models can "see" and "reason" about images without task-specific training data.
Application in Manufacturing
Imagine a camera system where you can simply type a prompt: "Is there any debris on the conveyor belt?" or "Does this weld look consistent with the previous one?". Detailed prompt engineering allows for zero-shot defect detection, drastically reducing deployment time.
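As a rough illustration, a zero-shot check like this can be wired up against an MLLM API in a few lines. The sketch below uses the OpenAI Python client; the model name, image path, and prompt wording are placeholders, and a production system would add retries, structured output parsing, and integration with the camera pipeline.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def inspect_frame(image_path: str, question: str) -> str:
    """Send one camera frame plus a natural-language inspection prompt to an MLLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model could be substituted
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content


# Example zero-shot prompt from above; "frame_001.jpg" is a hypothetical capture.
print(inspect_frame("frame_001.jpg",
                    "Is there any debris on the conveyor belt? Answer yes or no, then explain."))
```

The same function works unchanged for the weld-consistency question or any new inspection task, which is exactly where the deployment-time savings come from: the "retraining" step becomes a prompt edit.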
Challenges
The main bottleneck is latency. Running a massive transformer model on every frame at 60 FPS is not yet practical. However, distillation techniques are rapidly making smaller, faster vision-language models viable for edge deployment.
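To make the latency gap concrete, here is a minimal back-of-the-envelope sketch. The per-call MLLM latency used below is an assumption for illustration, not a benchmark, and the every-Nth-frame sampling shown at the end is a common stopgap distinct from the distillation approach mentioned above.

```python
FPS = 60
frame_budget_ms = 1000 / FPS        # ~16.7 ms available per frame at 60 FPS
assumed_mllm_latency_ms = 800       # assumption: round trip to a large hosted MLLM

# How many frames arrive while a single MLLM call is still in flight?
frames_per_call = assumed_mllm_latency_ms / frame_budget_ms
print(f"Frame budget: {frame_budget_ms:.1f} ms, assumed MLLM call: {assumed_mllm_latency_ms} ms")
print(f"~{frames_per_call:.0f} frames arrive during one inference call")

# Stopgap: query the MLLM on every Nth frame and use cheap classical checks in between.
sample_every_n = int(frames_per_call) + 1
print(f"Querying every {sample_every_n}th frame keeps the MLLM off the per-frame critical path")
```

Distilled edge models shrink that 800 ms figure toward the frame budget itself, which is why they are the more durable fix.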
About the Author
Nay Linn Aung is a Senior Technical Product Owner specializing in the convergence of OT and IT.