Building/augmenting loop
Thisstage is part of the second loop. After the team identifies the desired models, in this stage, the goal is to tailor the models based on business requirements through prompt engineering and grounding the data:
- Metaprompting and grounding: As outlined in Chapter 5, prompt engineering and metaprompts can enhance retrieval accuracy. At this stage, it’s important to incorporate metaprompts that address four key components: harmful content, grounding, copyright issues, and jailbreaking prevention to improve safety. We have already explored these metaprompt components with examples in Chapter 5, so we will not delve into details here. However, this area is continuously evolving, and you can expect to see more templates emerge over time. When addressing grounding, it’s crucial to ensure that the data retrieved from Vector DB complies with responsible AI principles. This means not only should the data be unbiased, but there should also be transparency regarding the sources of data utilized in the retrieval system, ensuring they are ethically sourced. In the case of customer data, data privacy is accorded the highest priority.
- Evaluation: It is important to evaluate LLM models before deploying into production. Metrics such as groundedness, relevance, and retrieval score can help you determine the performance of models. Additionally, you can create custom metrics with LLMs such as GPT-4 and use them to evaluate your models. Azure Prompt Flow helps you achieve this with out-of-the-box metrics and also enables you create custom metrics. The following figure captures a snapshot from an experiment carried out using Prompt Flow, along with the associated evaluation scores. Figure 9.6 offers a visualization of the test conducted on an evaluation dataset. The LLM responses were assessed against the actual answers, and an average rating of 4 or higher for groundedness, the retrieval score, and relevance suggests that the application is performing effectively:
Figure 9.6 – Azure Prompt Flow evaluation metrics (visualization)
Operationalizing/deployment loop
This stagemarks the final loop, transitioning from development into production, and includes designing monitoring processes that continuously evaluate metrics. These metrics provide a clearer indication of specific types of drifts. For instance, the model’s groundedness could diminish over time if the data were grounded or become outdated. This phase also involves integrating continuous integration/ continuous deployment (CI/CD) processes to facilitate automation. Additionally, collaboration with the user experience (UX) team is crucial to ensure the creation of a safe user experience:
- User experience: In this layer, incorporating a human feedback loop to assess the responses of LLM models is crucial. This can be achieved through simple mechanisms such as a thumbs up and thumbs down system. Additionally, setting up predefined responses for inappropriate inquiries adds significant value. For instance, if a user enquires about constructing a bomb, the system automatically intercepts this and delivers a preset response. Furthermore, offering a prompt guide that integrates RAI principles and includes citations with responses is an effective strategy to guarantee the reliability of the responses.
- Monitoring: Continuous model monitoring is a crucial component of LLMOps, guaranteeing that AI systems stay pertinent in the face of changing societal norms and data trends over time. Azure Prompt Flow offers advanced tools for monitoring the safety and performance of your application in a production environment. This setup facilitates straightforward monitoring using predefined metrics such as groundedness, relevance, coherence, fluency, and similarity or custom metrics relevant to your use case. We have already conducted a lab in Chapter 4, focusing on evaluating RAG workflows where we discussed these metrics.
Throughout all these stages, it’s important to engage with stakeholders, including diverse user groups, to understand the impact of the LLM and to ensure that it’s being used responsibly. Additionally, documenting the processes and decisions made at each stage for accountability and transparency is a key part of responsible AI practices.