🪄 Orchestration

With the rise of foundational models, we’ve seen the emergence of new tooling called Foundational Model Orchestration (FOMO) that coordinates tasks within a foundational-model-driven workflow.

One reason FOMO solutions have emerged is that the foundational model API exists within the context of a pipeline that includes data computation and knowledge systems. Most of the time, extracting value from the model isn’t just querying the API and receiving a result; it’s a multi-step process. Currently, hosted APIs don’t offer pre- and post-processing as part of the platform experience. One catalyst for pre-processing is that foundational models have a ~4K token limit, so users need to perform string splitting. A second reason FOMO solutions have appeared is that foundational models currently don’t allow teams to integrate directly with external resources like databases and SaaS products. This means users can’t add their own data to the model directly to enhance performance for particular tasks or domains.

Let’s say you want to summarize transcripts using a foundational model like GPT-3. A transcript of an hour-long meeting can run ~7.5K words. Since 1,500 words equate to about 2,048 tokens, the full transcript is roughly 10K tokens. That’s beyond the 4K token limit and too much for one prompt. You therefore need to start with string splitting to break the text into roughly three chunks. Then it becomes a map/reduce effort: you use GPT-3 to summarize each chunk of the document (with some degree of overlap between chunks), and finally you query the model to summarize the summaries for the final output.
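The map/reduce flow above can be sketched roughly as follows. `call_model` is a stand-in for a real foundational model API call (e.g. a GPT-3 completions request), and the chunk size and overlap values are illustrative:

```python
def split_text(text, chunk_size=1500, overlap=200):
    """Split text into word-based chunks with some overlap between them."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

def call_model(prompt):
    # Placeholder: a real implementation would call a foundational
    # model API here and return the completion text.
    return prompt[:100]

def summarize_transcript(transcript):
    # Map: summarize each overlapping chunk independently.
    chunk_summaries = [
        call_model("Summarize this transcript excerpt:\n" + chunk)
        for chunk in split_text(transcript)
    ]
    # Reduce: summarize the summaries into one final output.
    return call_model("Combine these summaries into one:\n" + "\n".join(chunk_summaries))
```

A ~7.5K-word transcript with these settings yields a handful of overlapping chunks, each small enough to fit in a single prompt.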

Another example would be creating a Q&A service for a textbook. A textbook is a large amount of information, so you start by string splitting the material into chunks. You then create embeddings of the chunks and add them to a vector store like Pinecone or Weaviate. When a user asks a question, you embed the query with the same embeddings model and run a cosine similarity search against your vector database. The most similar chunks let you build a prompt to send to a foundational model API for a response. This demonstrates the multi-step process: pre-processing, embedding, a k-nearest-neighbors search, and finally querying the LLM with a prompt to get an answer.
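A minimal sketch of this retrieval flow, assuming a toy hashed bag-of-words embedding in place of a real embeddings model, and an in-memory list of chunks in place of a hosted vector store like Pinecone or Weaviate:

```python
import hashlib
import math

def embed(text, dim=2048):
    """Toy hashed bag-of-words vector; a real system would call an embeddings model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(question, chunks, k=2):
    """k-nearest-neighbors search: rank stored chunks by similarity to the query."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    # Stuff the most similar chunks into a prompt for the foundational model.
    context = "\n".join(top_k(question, chunks))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"
```

In production the embeddings would come from a model API and the similarity search would run inside the vector database, but the pipeline shape is the same.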

FOMO solutions are valuable for tying LLMs to internal data systems; prompt engineering, such as A/B testing prompts; chaining models together; switching between foundational models; and A/B testing foundational models. We expect foundational models will add the ability to tie into third-party services for retrieval augmentation, so the value of this functionality goes down over time. While ML practitioners will still need to manage the individual prompts that ping the API, we believe the value of prompt engineering will decrease over time with zero-shot learning and increased token limits. We believe in a world where there are multiple foundational models, each tuned for a particular task, that must be chained together. In this world FOMO’s value goes up because it facilitates A/B testing foundational models, chaining services, and switching out models easily. When speaking with users, we heard that the decision criteria for picking a model vendor are a mix of performance, cost, result filters, and existing relationships. We consistently heard that pinging model APIs can be expensive, so companies may use different models based on customer tier. FOMO solutions can enable cost savings and vendor flexibility by acting as a unified multi-provider interface.
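The tier-based routing described above can be sketched as a thin multi-provider interface. The provider names, prices, and tiers below are illustrative assumptions, not real vendor data:

```python
# Hypothetical provider catalog; costs are made-up placeholders.
PROVIDERS = {
    "premium_model": {"cost_per_1k_tokens": 0.06},
    "budget_model": {"cost_per_1k_tokens": 0.002},
}

# Route higher-paying customers to the more capable (and expensive) model.
TIER_ROUTING = {
    "enterprise": "premium_model",
    "free": "budget_model",
}

def complete(prompt, customer_tier):
    """Dispatch a completion request to the model chosen for this tier."""
    model = TIER_ROUTING[customer_tier]
    # A real orchestrator would call the chosen provider's API here;
    # this stub just records which backend would handle the request.
    return {"model": model, "prompt": prompt}
```

Because callers only see `complete`, swapping vendors or A/B testing models becomes a one-line change to the routing table rather than a code rewrite.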

There are now a handful of offerings in the foundational model orchestration space, including LangChain, Dust, GPT Index, Fixie.ai, and Cognosis. We heard from users that these offerings accelerate foundational model application development and are being used in production only a few months after their release.