Sérgio Ramos, Head of Engineering at YLD, discusses the relationship between semantics and syntax when working with LLMs, whether through RAG, caching, evals, or guardrailing. Learn how these elements enhance LLM applications by combining old and new techniques.

This blog was originally published on YLD’s website.

Syntactic and semantic techniques are both essential for getting the most out of LLMs, and practices like RAG and guardrailing play key roles. Whether you use RAG, caching, evals, or guardrailing, these components can improve a wide range of LLM applications. This article aims to simplify these concepts so you can build more complex systems with confidence.


Most GenAI use cases rely on RAG (Retrieval-Augmented Generation) to some degree, which directs a significant portion of the investment towards it. RAG lets businesses put their own data to use and build internal and external knowledge-based tools.

We’ve been assisting organisations in building RAG systems for numerous use cases and seeing how they can drive value for businesses. We have also embraced these systems ourselves by building RefStudio, an open source integrated writing environment tailored for technical writing.

There’s a saying in the Dev community that “Confluence is where documentation goes to die,” whereas RAG is where documentation comes to life. RAG isn’t limited to documentation, as it also applies to code and other data types. RAG can enhance operational efficiency, streamline onboarding, and facilitate knowledge sharing.

However, it’s important to note that you need to surface your data to your prompts. Unless you invest in fine-tuning models specifically for your needs, they won’t be trained on your data and will require access to it in some way. This means you’ll encounter a search problem: sifting through a large amount of information to find what’s relevant to your query — a challenge engineers have been tackling for over 20 years, leveraging search indexes.

YLD’s co-founder wrote about inverted indexes years ago, and that content is still relevant and interesting today: 

“An inverted index answers questions like ‘find me the document that contains the word blue but not the word black’. They are kind of like the index in the back of a book.

An inverted index looks a lot like a hash table. You hash the word and place it in a hash table. Then, like in the range index, you keep an array of the documents that match that term.” – Nuno Job, Database Indexes for The Inquisitive Mind.

Term       Documents
blue       A, B
running    B, C

Terms as the search key, documents as the value.

This allows you to know which documents contain which terms. That means that if your prompt is about running, you’ll likely need to include documents B and C in the context of that prompt. 
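As a minimal sketch of the idea, the snippet below builds an inverted index over three documents and answers the "blue but not black" style of question from the quote. The document texts, the `tokenize` helper, and the `search` signature are all assumptions made for illustration; real search engines do far more work per term.

```typescript
// Hypothetical corpus: the ids match the table above, the texts are made up.
const corpus: Record<string, string> = {
  A: "the blue sky",
  B: "running under a blue sky",
  C: "running in black shoes",
};

// term -> ids of the documents that contain it
type InvertedIndex = Record<string, string[]>;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

function buildIndex(docs: Record<string, string>): InvertedIndex {
  // No prototype, so terms like "constructor" cannot collide with built-ins.
  const index: InvertedIndex = Object.create(null);
  for (const id of Object.keys(docs)) {
    for (const term of tokenize(docs[id])) {
      if (!index[term]) index[term] = [];
      if (index[term].indexOf(id) === -1) index[term].push(id);
    }
  }
  return index;
}

// Ids of documents that contain `include` but not `exclude`.
function search(index: InvertedIndex, include: string, exclude?: string): string[] {
  const hits = index[include] || [];
  const banned = exclude ? index[exclude] || [] : [];
  return hits.filter((id) => banned.indexOf(id) === -1);
}

const invertedIndex = buildIndex(corpus);
console.log(search(invertedIndex, "running"));       // [ 'B', 'C' ]
console.log(search(invertedIndex, "blue", "black")); // [ 'A', 'B' ]
```

A prompt about running would therefore pull documents B and C into its context.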

Inverted indexes employ many techniques for effectiveness, including TF-IDF, stop word removal, stemming, and diacritics replacement, among others. For further details, please refer to YLD’s blog.
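To make a few of these techniques concrete, here is a rough sketch of stop word removal, a deliberately naive suffix-stripping stemmer, and a basic TF-IDF score. The stop word list, the stemming rules, and the `tf * ln(N / df)` formula are simplifying assumptions; production search engines use proper stemmers and tuned scoring functions.

```typescript
const STOP_WORDS = ["a", "an", "and", "in", "of", "the", "to"];

// Naive stand-in for a real stemmer: strips a few English suffixes,
// then collapses a doubled final consonant ("runn" -> "run").
function stem(term: string): string {
  const stripped = term.replace(/^(.{3,}?)(ing|ed|s)$/, "$1");
  return stripped.replace(/([b-df-hj-np-tv-z])\1$/, "$1");
}

// Lowercase, split on non-letters, drop stop words, stem what is left.
function normalise(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((t) => t.length > 0 && STOP_WORDS.indexOf(t) === -1)
    .map(stem);
}

// Term frequency in one document, weighted by how rare the term is overall.
function tfIdf(term: string, doc: string[], docs: string[][]): number {
  const docsWithTerm = docs.filter((d) => d.indexOf(term) !== -1).length;
  if (doc.length === 0 || docsWithTerm === 0) return 0;
  const tf = doc.filter((t) => t === term).length / doc.length;
  const idf = Math.log(docs.length / docsWithTerm);
  return tf * idf;
}

const normalisedDocs = [
  "The blue sky",
  "Running under a blue sky",
  "She runs in black shoes",
].map(normalise);

console.log(normalisedDocs[1]); // [ 'run', 'under', 'blue', 'sky' ]
// "black" appears in one document, "run" in two, so "black" scores higher.
console.log(tfIdf("black", normalisedDocs[2], normalisedDocs) > tfIdf("run", normalisedDocs[2], normalisedDocs)); // true
```

Stemming is what lets "running" and "runs" land on the same index entry.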

Inverted indexes, however, have no understanding of meaning or semantics. Embedding models, by contrast, excel at capturing semantic relationships and meaning.


Embeddings are a highly versatile and fascinating machine learning technique. From a piece of content, an embedding model generates a fixed-size array of floating-point numbers, which acts as a point in a multidimensional space.

The crucial aspect is that once all of those arrays are placed in that space, other points nearby have a similar semantic meaning. 
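"Nearby" is usually measured with a similarity function such as cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models emit hundreds or thousands of dimensions, and the values here do not come from any actual model.

```typescript
type Embedding = number[];

// Cosine similarity: close to 1 when two vectors point in the same direction,
// close to 0 when they are unrelated. Returns NaN for an all-zero vector.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) {
    throw new Error("embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical embeddings, hand-written so that related words sit close together.
const toyEmbeddings: Record<string, Embedding> = {
  jogging: [0.9, 0.1, 0.0],
  running: [0.8, 0.2, 0.1],
  invoice: [0.0, 0.1, 0.9],
};

const related = cosineSimilarity(toyEmbeddings.jogging, toyEmbeddings.running);
const unrelated = cosineSimilarity(toyEmbeddings.jogging, toyEmbeddings.invoice);
console.log(related > unrelated); // true
```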

Matt Webb built an interactive map of embeddings for the ‘In Our Time’ podcast. If you navigate through that map, you’ll see that the episodes close to each other have related topics.

If you can do that with podcast episodes, you can do that with everything. The Word2Vec playground allows you to play around with this concept. Give it a word, and it will give you other words that sit in similar positions in the same space.


With an inverted index, you get words that are syntactically similar to your search; with embeddings, you get semantically similar words. This is a fundamental difference. Embeddings have versatile applications, including recommendation systems, multimodal search, and data preprocessing.

There are many embedding models out there. You should choose the ones that fit your use cases the best:

  • text-embedding-ada-002 is the most common and popular model
  • fastText is very fast and lightweight
  • e5-large-v2 is strong with QA style content
  • You can go through the HuggingFace leaderboard to learn about all of them

In the context of GenAI, we can use embeddings to index all our data and search through it to find the most relevant content for our prompt. That means we need to have performant ways to navigate through vector indexes. And that’s what vector databases do: they let you carry out an Approximate Nearest Neighbour Search.
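For intuition, the sketch below performs an exact nearest-neighbour search by comparing the query against every stored vector. This brute-force scan is what vector databases avoid: they trade a little accuracy for speed with approximate index structures. The `IndexedChunk` shape, the ids, and the vectors are hypothetical.

```typescript
interface IndexedChunk {
  id: string;
  vector: number[];
}

function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Exact k-nearest-neighbour search: O(n) distance computations per query.
function nearestNeighbours(query: number[], chunks: IndexedChunk[], k: number): IndexedChunk[] {
  return chunks
    .slice() // copy, so the caller's array is not reordered
    .sort((x, y) => squaredDistance(query, x.vector) - squaredDistance(query, y.vector))
    .slice(0, k);
}

// Made-up vectors standing in for embedded document chunks.
const indexedChunks: IndexedChunk[] = [
  { id: "invoice", vector: [0.0, 0.1, 0.9] },
  { id: "runbook", vector: [0.7, 0.3, 0.1] },
  { id: "handbook", vector: [0.9, 0.1, 0.0] },
];

const promptVector = [0.85, 0.15, 0.05]; // hypothetical embedding of the prompt
console.log(nearestNeighbours(promptVector, indexedChunks, 2).map((c) => c.id));
// [ 'handbook', 'runbook' ]
```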

The vector database market is exploding, with many varied options. However, you don’t have to adopt a dedicated one: sqlite-vss and pgvector add vector search to SQLite and Postgres, respectively. Alternatively, you can search locally with ANN libraries like FAISS. And you can go to the edge with Athena.

Content indexing can be sliced per document, paragraph, phrase, Q&A, or other methods. Combining inverted indexes with embeddings addresses the limitations of each search type.
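One common way to combine the two result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible approach rather than the approach: each ranking contributes `1 / (k + rank)` to a document's score, with `k = 60` as the conventional default. The document ids below are hypothetical.

```typescript
// Merges several ranked lists of document ids into one ranking.
// Documents that rank well in more than one list rise to the top.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores: Record<string, number> = Object.create(null);
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      // position is zero-based, rank is one-based
      scores[id] = (scores[id] || 0) + 1 / (k + position + 1);
    });
  }
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}

const keywordHits = ["B", "C"];       // e.g. from an inverted index
const semanticHits = ["C", "A", "B"]; // e.g. from a vector search
console.log(reciprocalRankFusion([keywordHits, semanticHits]));
// [ 'C', 'B', 'A' ]
```

Document A only appears in the semantic results, so it still surfaces, but below the documents that both searches agree on.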

Beyond RAG, embeddings support prompt caching, enabling the discovery of semantically similar prompts. However, results may vary based on use cases and usage patterns.
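A semantic cache can be sketched as a list of (prompt embedding, response) pairs plus a similarity threshold. The class name, the threshold value, and the toy vectors below are assumptions; in practice the vectors would come from an embedding model, and the threshold needs tuning per use case, which is one reason results vary.

```typescript
interface CacheEntry {
  vector: number[];
  response: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Returns the cached response of the most similar prompt,
  // or undefined when nothing is similar enough.
  get(promptVector: number[]): string | undefined {
    let bestScore = -Infinity;
    let bestResponse: string | undefined;
    for (const entry of this.entries) {
      const score = cosine(promptVector, entry.vector);
      if (score > bestScore) {
        bestScore = score;
        bestResponse = entry.response;
      }
    }
    return bestScore >= this.threshold ? bestResponse : undefined;
  }

  set(promptVector: number[], response: string): void {
    this.entries.push({ vector: promptVector, response });
  }
}

const promptCache = new SemanticCache(0.95); // hypothetical threshold
promptCache.set([0.9, 0.1, 0.0], "cached answer about jogging");
console.log(promptCache.get([0.8, 0.2, 0.1])); // similar prompt: cache hit
console.log(promptCache.get([0.0, 0.1, 0.9])); // unrelated prompt: undefined
```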


It’s important to validate the output of LLMs, whether syntactically or semantically. Semantic validation checks that the text output is free from harmful content, correct, and factual. Syntactic validation means restricting the output to a machine-readable schema, whether it’s JSON, XML, TypeScript, etc.


Syntactic guardrailing basics involve instructing the model to respond in a specific format.

    You are a service that translates user requests into JSON objects of type "SentimentResponse" according to the following TypeScript definitions:
    ```
    export interface SentimentResponse {
      sentiment: "negative" | "neutral" | "positive"; // The sentiment of the text
    }
    ```
    The following is a user request:
    """
    hello, world
    """
    The following is the user request translated into a JSON object with 2 spaces of indentation and no properties with the value undefined:
    ```json
    {
      "sentiment": "neutral"
    }
    ```

In the example above, generated by TypeChat, we’re constraining the output to a schema defined by the TypeScript interface “SentimentResponse”. This way you can parse the output (JSON in this case) and perform all kinds of validations (URLs, email addresses, etc.) and transformations. You can use validation libraries like Zod. If validation fails, you can even return to the LLM with the error and ask it to fix the output.
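The parse, validate, and repair loop can be sketched without any library. The validator below is hand-written rather than Zod or TypeChat's own API, and `complete` is a hypothetical stand-in for a call to your LLM; it is synchronous here for brevity, whereas a real client would return a Promise.

```typescript
interface SentimentResponse {
  sentiment: "negative" | "neutral" | "positive";
}

type ParseResult =
  | { kind: "valid"; value: SentimentResponse }
  | { kind: "invalid"; error: string };

const SENTIMENTS = ["negative", "neutral", "positive"];

function parseSentiment(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return { kind: "invalid", error: "Output is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { kind: "invalid", error: "Output is not a JSON object" };
  }
  const sentiment = (data as { sentiment?: unknown }).sentiment;
  if (typeof sentiment !== "string" || SENTIMENTS.indexOf(sentiment) === -1) {
    return { kind: "invalid", error: 'Property "sentiment" must be one of: ' + SENTIMENTS.join(", ") };
  }
  return { kind: "valid", value: { sentiment: sentiment as SentimentResponse["sentiment"] } };
}

// Hypothetical LLM call: takes a prompt, returns the raw completion text.
type Complete = (prompt: string) => string;

// Asks the model, and on invalid output feeds the error back and retries.
function askWithRepair(prompt: string, complete: Complete, maxAttempts: number = 3): SentimentResponse {
  let currentPrompt = prompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = complete(currentPrompt);
    const result = parseSentiment(raw);
    if (result.kind === "valid") return result.value;
    currentPrompt =
      prompt +
      "\n\nYour previous answer was:\n" + raw +
      "\nIt was rejected: " + result.error +
      "\nReply with corrected JSON only.";
  }
  throw new Error("No valid response after " + maxAttempts + " attempts");
}
```

Capping the number of attempts matters: a model that keeps producing invalid output should surface as an error rather than loop forever.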

Many open-source libraries support syntactic guardrailing.

Additionally, llama.cpp recently added grammar-based sampling support. With it, you can author GBNF files, a form of Backus-Naur notation for defining a context-free language, and constrain generation to that grammar. A library like gbnfgen can help you generate grammars directly from TypeScript interfaces.
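To give a feel for what such a grammar looks like, the sketch below hand-assembles a small GBNF grammar for the “SentimentResponse” shape as a string. It does not use gbnfgen's API; the helper names are made up, and the escaping only covers plain string values.

```typescript
// Builds a GBNF rule matching exactly one of the given values as a JSON string.
// The inner JSON.stringify adds the JSON quotes, the outer one escapes them
// into a GBNF string literal.
function stringEnumRule(ruleName: string, values: string[]): string {
  const alternatives = values.map((v) => JSON.stringify(JSON.stringify(v)));
  return ruleName + " ::= " + alternatives.join(" | ");
}

// Grammar for objects of the form {"sentiment": "negative" | "neutral" | "positive"}.
function sentimentGrammar(): string {
  return [
    'root ::= "{" ws "\\"sentiment\\"" ws ":" ws sentiment ws "}"',
    stringEnumRule("sentiment", ["negative", "neutral", "positive"]),
    "ws ::= [ \\t\\n]*",
  ].join("\n");
}

console.log(sentimentGrammar());
// root ::= "{" ws "\"sentiment\"" ws ":" ws sentiment ws "}"
// sentiment ::= "\"negative\"" | "\"neutral\"" | "\"positive\""
// ws ::= [ \t\n]*
```

With a grammar like this, the sampler can only emit tokens that keep the output inside the schema, so the parse step cannot fail on structure.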


On the other hand, you might also need to semantically guardrail your outputs. Make sure that the output:

  • has no harmful or inappropriate content
  • is factual and relevant to the input

NVIDIA’s NeMo Guardrails is one option. It can help you:

  • prevent the model from engaging in discussions on unwanted topics
  • steer the model to follow pre-defined conversational paths and enforce standard operating procedures (e.g., authentication, support)

You can also go back to old-school methods, or chain the output to another LLM, ideally a more capable one, and ask it to classify it.
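Chaining to a classifier model might look roughly like the sketch below. The prompt wording, the "safe"/"unsafe" labels, and the `classify` callback (a stand-in for a call to the second LLM, synchronous for brevity) are all assumptions.

```typescript
type Verdict = "safe" | "unsafe";

// Hypothetical call to the classifier LLM: takes a prompt, returns raw text.
type Classify = (prompt: string) => string;

function buildModerationPrompt(output: string): string {
  return [
    "You are a content reviewer.",
    'Reply with exactly one word: "safe" or "unsafe".',
    "Text to review:",
    '"""',
    output,
    '"""',
  ].join("\n");
}

function moderate(output: string, classify: Classify): Verdict {
  const answer = classify(buildModerationPrompt(output)).trim().toLowerCase();
  // Fail closed: anything other than an explicit "safe" is treated as unsafe.
  return answer === "safe" ? "safe" : "unsafe";
}
```

Failing closed means a confused or verbose classifier blocks the output rather than letting it through, which is usually the safer default.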

Last, but not least, it’s also worth paying attention to your system prompt. It can have a strong impact on the alignment of the answers the model gives. For instance, read the breakdown of the Claude 3 system prompt. You can even give examples of behaviour in the system prompt.

Final thoughts

We’ve only skimmed the surface of embeddings and guardrailing here, and many more key components go into optimising the use of GenAI.

Embeddings are a big deal in GenAI, and their impact stretches further than their basic functionality. Mastering guardrailing, on the other hand, whether it’s about syntax or meaning, is crucial for top-notch GenAI systems.

Ultimately, our goal is to dig into these concepts so we can create applications that truly benefit our customers: a more diverse range of applications, not just chatbots.

For more details on the topic, check out YLD’s full blog.

