Jockey: Leveraging Twelve Labs APIs and LangGraph for Advanced Video Processing

Jockey, an open-source conversational video agent, has been significantly enhanced through the integration of Twelve Labs APIs and LangGraph. This combination aims to provide more intelligent and efficient video processing capabilities, according to a recent LangChain Blog post.

Overview of Twelve Labs APIs

Twelve Labs offers state-of-the-art video understanding APIs that extract rich insights and information directly from video content. These advanced video foundation models (VFMs) work natively with video, bypassing intermediary representations like pre-generated captions. This allows for a more accurate and contextual understanding of video content, including visuals, audio, on-screen text, and temporal relationships.

The APIs support various functionalities, such as video search, classification, summarization, and question answering. They can be integrated into applications for content discovery, video editing automation, interactive video FAQs, and AI-generated highlight reels. With enterprise-grade security and scalability, Twelve Labs APIs open up new possibilities for video-powered applications.

LangGraph v0.1 and LangGraph Cloud Launch

LangChain has introduced LangGraph v0.1, a framework designed for building agentic and multi-agent applications with enhanced control and precision. Unlike its predecessor, LangChain AgentExecutor, LangGraph provides a flexible API for custom cognitive architectures, allowing developers to control code flow, prompts, and LLM calls. It also supports human-agent collaboration through a built-in persistence layer, enabling human approval before task execution and ‘time travel’ for editing and resuming agent actions.
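The persistence idea described above can be illustrated with a minimal sketch in plain Python. This is not the LangGraph API: the class CheckpointedRun, the function run_with_approval, and the action names are hypothetical and exist only to show how saved snapshots make an approval gate and 'time travel' possible.

```python
from copy import deepcopy
from typing import Callable, List, Optional


class CheckpointedRun:
    """Toy persistence layer (hypothetical): snapshots the state after every step."""

    def __init__(self, state: dict) -> None:
        self.state = deepcopy(state)
        self.history: List[dict] = [deepcopy(self.state)]

    def step(self, update: dict) -> None:
        """Apply an update to the state and save a snapshot of the result."""
        self.state.update(update)
        self.history.append(deepcopy(self.state))

    def rewind(self, index: int, edits: Optional[dict] = None) -> None:
        """'Time travel': restore snapshot `index`, optionally edit it,
        and discard every snapshot that came after it."""
        self.state = deepcopy(self.history[index])
        if edits:
            self.state.update(edits)
        self.history = self.history[:index] + [deepcopy(self.state)]


def run_with_approval(run: CheckpointedRun, action: str,
                      approve: Callable[[str], bool]) -> bool:
    """Record a pending action, then execute it only if a human approves."""
    run.step({"pending": action})
    if approve(action):
        run.step({"pending": None, "done": run.state.get("done", []) + [action]})
        return True
    run.step({"pending": None})  # rejected: clear the pending action
    return False


run = CheckpointedRun({"query": "find goal clips"})
approved = run_with_approval(run, "video-search", approve=lambda a: True)
rejected = run_with_approval(run, "delete-index", approve=lambda a: False)

# Go back to the snapshot taken right after the approved search,
# edit the query, and resume from there.
run.rewind(2, edits={"query": "find penalty clips"})
```

Because each step is stored as a full snapshot, pausing for approval and resuming from an edited earlier state both reduce to reading or rewriting the history list.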

To complement this framework, LangChain has also launched LangGraph Cloud, currently in closed beta. This service provides scalable infrastructure for deploying LangGraph agents, managing horizontally-scaling servers and task queues to handle numerous concurrent users and store large states. LangGraph Cloud integrates with LangGraph Studio for visualizing and debugging agent trajectories, facilitating rapid iteration and feedback for developers.

How Jockey Leverages LangGraph and Twelve Labs APIs

In its latest v1.1 release, Jockey uses LangGraph for better scalability and functionality. Jockey was originally built on LangChain; the new LangGraph-based architecture gives more efficient and precise control over complex video workflows and makes video processing tasks easier to manage.

Jockey combines Large Language Models (LLMs) with Twelve Labs’ specialized video APIs through LangGraph’s flexible framework. The network of nodes shown in the LangGraph UI illustrates Jockey’s decision-making process, with components including the supervisor, planner, video-editing, video-search, and video-text-generation nodes. This granular, per-node control optimizes token usage and guides node responses, resulting in more efficient video processing.

The data-flow diagram of Jockey shows how information moves through the system, from initial query input to complex video processing steps. This involves retrieving videos from Twelve Labs APIs, segmenting content as needed, and presenting final results to the user.

Jockey Architecture Overview

Jockey’s architecture is designed to handle complex video-related tasks through a multi-agent system comprising the Supervisor, Planner, and Workers. The Supervisor acts as the central coordinator, routing tasks between nodes and managing the workflow. The Planner creates detailed plans for complex requests, while the Workers execute tasks using specialized tools like video search, text generation, and editing.

This architecture allows Jockey to adapt dynamically to different queries, from simple text responses to complex video manipulation tasks. LangGraph’s framework helps manage the state between nodes, optimize token usage, and provide granular control over each step in the video processing workflow.
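The Supervisor, Planner, and Worker roles can be sketched as a small routing loop over a shared state. The following is an illustrative sketch in plain Python, not Jockey's actual code and not the LangGraph API: the keyword-based planner stands in for an LLM call, the workers are stubs, and the state fields (planned, plan, results) are assumptions. Only the node names come from the description above.

```python
from typing import Callable, Dict, List, TypedDict


class JockeyState(TypedDict):
    query: str
    planned: bool       # has the planner run yet?
    plan: List[str]     # worker names still to execute, in order
    results: List[str]  # outputs collected from workers


def planner(state: JockeyState) -> JockeyState:
    """Stand-in for an LLM planner: picks workers with simple keyword rules."""
    q = state["query"].lower()
    steps: List[str] = []
    if "find" in q or "search" in q:
        steps.append("video-search")
    if "clip" in q or "edit" in q:
        steps.append("video-editing")
    if "summar" in q or "describe" in q:
        steps.append("video-text-generation")
    return {**state, "planned": True, "plan": steps}


def make_worker(name: str) -> Callable[[JockeyState], JockeyState]:
    """Build a stub worker that consumes its plan step and records a result."""
    def worker(state: JockeyState) -> JockeyState:
        return {**state,
                "plan": state["plan"][1:],
                "results": state["results"] + [f"{name}: done"]}
    return worker


NODES: Dict[str, Callable[[JockeyState], JockeyState]] = {
    "planner": planner,
    "video-search": make_worker("video-search"),
    "video-editing": make_worker("video-editing"),
    "video-text-generation": make_worker("video-text-generation"),
}


def supervisor(state: JockeyState) -> str:
    """Central coordinator: decide which node runs next."""
    if not state["planned"]:
        return "planner"
    if state["plan"]:
        return state["plan"][0]
    return "END"


def run(query: str) -> JockeyState:
    state: JockeyState = {"query": query, "planned": False, "plan": [], "results": []}
    while (node := supervisor(state)) != "END":
        state = NODES[node](state)
    return state


final = run("Find the goals and clip them")
```

A query that matches no rule yields an empty plan, so the supervisor ends the run right after planning. That mirrors the simple text-response case, while multi-step requests pass through one worker per plan entry.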

Customizing Jockey

Jockey’s modular design facilitates customization and extension. Developers can modify prompts, extend the state for more complex scenarios, or add new workers to address specific use cases. This flexibility makes Jockey a versatile foundation for building advanced video AI applications.

For example, developers can write prompts that instruct Jockey to identify specific scenes in videos, with no change to the core system. More substantial customizations can involve modifying Jockey’s own prompts, extending state management, or adding new specialized workers for tasks like advanced video effects or video generation.
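One way to picture adding a worker and extending the state is a registry that maps worker names to functions. This is a hypothetical sketch in plain Python, not Jockey's actual extension mechanism: register_worker, dispatch, the "video-effects" worker, and the state keys clips, effects, and requested_effect are all invented for illustration.

```python
from typing import Callable, Dict

Worker = Callable[[dict], dict]
WORKERS: Dict[str, Worker] = {}


def register_worker(name: str) -> Callable[[Worker], Worker]:
    """Decorator that adds a worker function to the registry under `name`."""
    def decorator(fn: Worker) -> Worker:
        WORKERS[name] = fn
        return fn
    return decorator


@register_worker("video-search")
def video_search(state: dict) -> dict:
    """Stub for an existing worker: pretends to find one clip."""
    return {**state, "clips": state.get("clips", []) + ["clip_001"]}


@register_worker("video-effects")
def video_effects(state: dict) -> dict:
    """New specialized worker: reads an extra state key ('requested_effect')
    and writes its output to another extra key ('effects')."""
    effect = state.get("requested_effect", "none")
    applied = [(clip, effect) for clip in state.get("clips", [])]
    return {**state, "effects": state.get("effects", []) + applied}


def dispatch(name: str, state: dict) -> dict:
    """Route a task to the named worker, failing loudly on unknown names."""
    if name not in WORKERS:
        raise KeyError(f"no worker registered for {name!r}")
    return WORKERS[name](state)


state = dispatch("video-search", {"requested_effect": "slow-motion"})
state = dispatch("video-effects", state)
```

In this sketch the new worker needs no change to the existing ones: it registers under a new name and touches only the state keys it introduces.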

Conclusion

Jockey represents a powerful fusion of LangGraph’s agent framework and Twelve Labs’ video understanding APIs, opening new possibilities for intelligent video processing and interaction. Developers can explore Jockey’s capabilities by visiting the Jockey GitHub repository or accessing the LangGraph documentation for more details.
