Python Engineer

Python and Machine Learning Tutorials

Create Conversational AI Applications With NVIDIA Jarvis

06 Mar 2021

NVIDIA Jarvis is an end-to-end application framework for multimodal conversational AI services that delivers real-time performance on GPUs.

In this tutorial, I give you an overview of the framework and show you how to get started with it. We'll also look at how to use the Python API to connect to the different services.

What does the framework include?

Jarvis is a fully accelerated application framework for building multimodal conversational AI services that use an end-to-end deep learning pipeline. It is optimized for inference to offer end-to-end real-time services that run in less than 300 milliseconds (ms) and delivers 7x higher throughput on GPUs compared with CPUs.

Additionally, it includes pre-trained conversational AI models and tools to easily fine-tune them to achieve a deeper understanding of a specific context.

Different services

Jarvis offers multiple services that can be combined to build various types of applications.

With these services, we can fuse speech and vision to offer accurate and natural interactions in virtual assistants, chatbots, and other conversational AI applications. To take full advantage of the computational power of GPUs, Jarvis is based on Triton to serve neural networks and ensemble pipelines that run efficiently with TensorRT.

The services that Jarvis provides are exposed through API operations that are accessible via gRPC endpoints, which hide the complexity from application developers. The API server can be run in a Docker container and accessed from the client with simple gRPC calls.
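Because the API server runs in a container and is reached over a network port, it can help to confirm that the port is reachable before creating a gRPC channel. The following is a minimal sketch using only the Python standard library. The helper name `port_is_open` is hypothetical and not part of Jarvis, and the default port 50051 is taken from the client example in this post.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 50051, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper, not part of the Jarvis API. It only checks that
    something accepts TCP connections on the port.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Jarvis port reachable:", port_is_open())
```

An open port only shows that a process is listening. It does not guarantee that the Jarvis services have finished loading their models, so a request can still fail shortly after the container starts.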

The following code shows a simple Python script that connects to the server and uses the TTS service with a simple request-response mechanism:

    import numpy as np
    import grpc

    import src.jarvis_proto.jarvis_tts_pb2 as jtts
    import src.jarvis_proto.jarvis_tts_pb2_grpc as jtts_srv
    import src.jarvis_proto.audio_pb2 as ja

    # Create a gRPC channel to the Jarvis endpoint:
    channel = grpc.insecure_channel('localhost:50051')
    jarvis_tts = jtts_srv.JarvisTTSStub(channel)

    # Create a TTS request:
    req = jtts.SynthesizeSpeechRequest()
    req.text = "We know what we are, but not what we may be?"
    req.language_code = "en-US"
    req.encoding = ja.AudioEncoding.LINEAR_PCM
    req.sample_rate_hz = 22050
    req.voice_name = "ljspeech"

    # Send the request to the service and get the response:
    resp = jarvis_tts.Synthesize(req)

    # Interpret the raw audio bytes in the response as float32 samples:
    audio_samples = np.frombuffer(resp.audio, dtype=np.float32)
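To listen to the result, the returned samples can be written to a WAV file. The sketch below uses only the standard library and assumes the response bytes are mono float32 samples in the range [-1, 1] at the requested 22050 Hz, as the request above suggests. The function name `float32_pcm_to_wav` is hypothetical and not part of the Jarvis API.

```python
import array
import wave


def float32_pcm_to_wav(raw_audio: bytes, path: str, sample_rate_hz: int = 22050) -> int:
    """Write raw float32 mono samples to a 16-bit PCM WAV file.

    Assumes native-endian float32 input in [-1, 1] (an assumption about the
    TTS response, not something verified here). Returns the sample count.
    """
    samples = array.array("f")  # 'f' is a 4-byte float in native byte order
    samples.frombytes(raw_audio)

    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm16 = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm16.tobytes())

    return len(pcm16)
```

With the script above, a call could look like `float32_pcm_to_wav(resp.audio, "output.wav", req.sample_rate_hz)`. Converting to 16-bit PCM is a design choice made here because most audio players handle that format.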

Create State-of-the-Art Deep Learning Models

The framework offers state-of-the-art pre-trained models for speech, language understanding, and vision tasks that have been trained for more than 100,000 hours on NVIDIA DGX™ systems. Pre-trained models and scripts used in Jarvis are freely available in NGC™.

Models can then be fine-tuned either with the Transfer Learning Toolkit (TLT), a zero-coding approach, or with NeMo, an open-source toolkit built on top of PyTorch.

Easy Deployment

Jarvis offers an end-to-end pipeline that includes easy deployment in the cloud or at the edge. Only one command is needed to deploy the entire Jarvis application or individual services through Helm charts on Kubernetes clusters.

Getting Started

To get started, I recommend following the official introductory resources and the quick start guide. Make sure to install all prerequisites first, and check the Support Matrix for the supported hardware and the software requirements.