Tutorial — Building LLM applications with Meltano in 30 minutes or less

Sven Balnojan
5 min read · Sep 25, 2023
The streamlit app in action, based on my blog content. Image by the author.

Chatting with a PDF file? No problem, build it on top of an LLM. Chatting with your database without SQL? Same story. Building a chatbot into your Slack channel? That works too.

These types of applications use LLMs like ChatGPT as their backbone. Yet it is precisely this backbone work that is heavy data lifting, and quite ugly. It means scraping hundreds of web pages, loading documents, chunking them into little parts, feeding them into an embedding API, and finally uploading those embeddings into your favorite vector database.
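To make the "chunking" step concrete, here is a minimal sketch in plain Python. It is not the code the Meltano template uses; the function name chunk_text and the character-based sizes are assumptions chosen for illustration only.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size characters.

    Neighbouring chunks share `overlap` characters so that a sentence cut
    at a chunk border still appears whole in one of the two chunks.
    (Hypothetical helper, not part of the Meltano template.)
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

Each chunk would then be sent to the embedding API on its own, which is why the chunk size matters for both cost and answer quality.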

I’ll show you how you can do this in 30 minutes or less on a production-ready scale using the template provided by Meltano.

The 30 sec Pitch For Meltano as a Backbone

The core challenges of building LLM applications are moving data around and transforming it.

Many tools in the data space are already great at these tasks, while the LLM community around LlamaIndex and LangChain still needs production-ready capabilities.

Meltano is perfect for developers who want their applications to be reproducible; everything is versioned and testable. In addition, it integrates with Pinecone, one of the most popular vector databases, and with the LangChain and LlamaIndex frameworks.

All code for this tutorial is contained in this GitHub repository: LLM Data Backend Demo.

What We’re Going to Build

We’re going to build two things:

  1. A scraper for my personal blog (datacisions.com) to create a simple Q&A interface on that knowledge base
  2. A loader for one particular article (as CSV) to chat just with that one article.

Both applications will use Meltano, the OpenAI embeddings API, and Pinecone.

Image by the author.

In the picture above, you can see both applications. We’ll need to adapt each step, but it will be a quick process because the config is either in YAML or plain Python code.

Prerequisites

You don’t need to know anything about Meltano or these tools. Just make sure you have the following handy:

  • A fork of either this or that repository
  • An OpenAI API key. Paid, even if it is just $5. The rate limit for the embeddings API will otherwise cripple your application.
  • A Pinecone account. They have a free tier, which is more than enough for this tutorial.

Copy the file .env.template to .env and fill in your details. You can run the original template, scraping the https://sdk.meltano.com/en/latest/ website by executing the three commands inside the README.

Scraping a new Website

Paste the following configuration into the plugins: extractors: tap-beautifulsoup section:

plugins:
  extractors:
  - name: tap-beautifulsoup
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-beautifulsoup.git@v0.1.0
    config:
      source_name: svens-blog
      site_url: https://datacisions.com # OLD VALUE: https://sdk.meltano.com/en/latest/
      output_folder: output
      parser: html.parser
      download_recursively: true
      find_all_kwargs:
        attrs:
          role: main

Let’s go through this line by line.

  - name: tap-beautifulsoup
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-beautifulsoup.git@v0.1.0

We’re letting Meltano install the plugin located at this pip URL, with the name “tap-beautifulsoup”. Nothing changes here compared to the template.

      source_name: svens-blog
      site_url: https://datacisions.com # OLD VALUE: https://sdk.meltano.com/en/latest/
      output_folder: output

We’re giving the whole website a new name, “svens-blog,” and pointing the scraper to a new URL. We keep our output_folder, which is our temporary store for the web content.

      parser: html.parser
      download_recursively: true
      find_all_kwargs:
        attrs:
          role: main

We’ll use the standard HTML parser, but you could also install and use an XML parser. The “download_recursively” flag tells the tap to handle both the download and the parsing. If you set it to false, you must download the website into the specified output directory yourself.

Finally, we’re instructing the Beautiful Soup “find_all()” function to search the datacisions website for all elements with the attribute role=main (which is where my content is located).
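If you want a feel for what a role=main filter selects, the sketch below approximates it with Python’s built-in html.parser module, so it runs without installing anything. It is not the tap’s implementation (the tap uses Beautiful Soup); the class RoleMainExtractor and the function extract_main_text are hypothetical names for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RoleMainExtractor(HTMLParser):
    """Collect the text inside every element that carries role="main"."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside a role="main" element
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("role") == "main":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    """Return the text of all role="main" elements, joined by single spaces."""
    parser = RoleMainExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><body><nav>Menu</nav>'
            '<div role="main"><h1>Title</h1><p>Hello <b>world</b></p></div>'
            '<footer>Bye</footer></body></html>')
    print(extract_main_text(page))
```

Navigation and footer text are dropped, which is exactly why the filter is worth configuring: only the article body ends up in the embeddings.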

To test that everything works, you can run meltano invoke tap-beautifulsoup, and check your output folder.

Image by the author.

Then, use the command meltano run reload-pinecone to run the entire sequence of steps.

Open up the Q&A interface with meltano invoke streamlit_app:demo_ui.

The streamlit app in action, based on my blog content. Image by the author.

In case you don’t know my blog: these responses come straight from a couple of my articles, and they are summarized well.

That’s pretty neat.

Next, we will make even more modifications to build our second mini-app.

Adding a new data source

I added another file with one of my articles as text inside a CSV. It’s inside data/mental_models.csv. Feel free to add more articles to this CSV; they will all be processed.

We will add a new connector to include this CSV in our pipeline. Edit the meltano.yml file by inserting this part:

  - name: tap-csv
    pip_url: git+https://github.com/sbalnojan/tap-csv.git
    config:
      files:
      - entity: doc
        path: ./data/mental_models.csv
        keys: [id]
        delimiter: ;
      add_metadata_dict: True

As in the example above, we specify a pip URL and configure our “tap-csv.” We’re only passing one “file” here. The “entity: doc” block specifies which file we want to load.

      - entity: doc
        path: ./data/mental_models.csv
        keys: [id]
        delimiter: ;
      add_metadata_dict: True

We call our document doc, located at the indicated path. I included a primary key id in case we want to add more articles to this one CSV. The CSV delimiter is the usual semicolon, and we instruct the tap to add metadata to our stream.

The metadata will be the extraction time and the file’s name, so we can use it downstream inside the Pinecone target in case we want to load multiple files at different times.
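To illustrate what such records might look like, here is a rough sketch using only Python’s standard library. It is not how the tap works internally; the function read_docs, the column name page_content, and the metadata keys file_name and extracted_at are assumptions for this example.

```python
import csv
import io
from datetime import datetime, timezone


def read_docs(csv_text, file_name, delimiter=";"):
    """Parse delimited CSV text and attach a metadata dict to every row."""
    extracted_at = datetime.now(timezone.utc).isoformat()
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    records = []
    for row in reader:
        record = dict(row)
        # Hypothetical metadata keys; the real tap may name them differently.
        record["metadata"] = {
            "file_name": file_name,
            "extracted_at": extracted_at,
        }
        records.append(record)
    return records


if __name__ == "__main__":
    # Quoting keeps a semicolon inside the text from being read as a delimiter.
    sample = 'id;page_content\n1;"Mental models; a short intro"\n'
    for record in read_docs(sample, "mental_models.csv"):
        print(record)
```

With the file name and extraction time on every record, a downstream step can tell apart two loads of the same file or two different files.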

While we’re at it, we can quickly add another loader to test our CSV extraction:

  loaders:
  […]
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl

This adds a JSONL loader that dumps the CSV records to a local file, in the same form in which our Meltano pipeline would process them further.

You can test this out by installing these plugins with “meltano install” and then “meltano run tap-csv target-jsonl”.
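If you haven’t worked with JSONL before: it simply means one JSON object per line. The sketch below shows the idea with Python’s json module; the helper names to_jsonl and from_jsonl and the record fields are made up for illustration and are not part of target-jsonl.

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of records."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    records = [
        {"id": "1", "page_content": "First line\nSecond line"},
        {"id": "2", "page_content": "Another article"},
    ]
    dumped = to_jsonl(records)
    print(dumped, end="")
    # Newlines inside a value are escaped, so each record stays on one line.
    print(from_jsonl(dumped) == records)
```

Because every line is a complete record, you can inspect the dump with ordinary command-line tools before sending anything to the embeddings API.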

Adapting the data cleaning step

I would like to change up a few things inside the data cleaning because my article contains weird Unicode characters and a few too many mentions of my newsletter.

We can do this inside the file mappers/clean_text.py:

# Excerpt from the mapper's record handling; message_dict is one incoming record message
page_content = message_dict["record"]["page_content"]
# remove specific phrases
page_content = page_content.replace("Three Data Point Thursday", "").replace("Finish Slime", "")
# drop all non-ASCII characters
no_unicode = page_content.encode("ascii", "ignore").decode()
# turn newlines into spaces, then collapse runs of whitespace
text_nl = " ".join(no_unicode.split("\n"))
text_spaces = " ".join(text_nl.split())
message_dict["record"]["page_content"] = text_spaces
return message_dict

And that’s it. You could also use LlamaIndex or LangChain here to do advanced data manipulation.
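If you’d like to try the cleaning logic outside of Meltano first, a standalone sketch could look like this. The function name clean_text and its phrases parameter are my own additions for the example, not part of the template’s mapper.

```python
def clean_text(page_content, phrases=("Three Data Point Thursday",)):
    """Remove unwanted phrases, drop non-ASCII characters, normalize whitespace."""
    for phrase in phrases:
        page_content = page_content.replace(phrase, "")
    # encode/decode with "ignore" silently drops every non-ASCII character
    ascii_only = page_content.encode("ascii", "ignore").decode()
    # split() without arguments splits on any whitespace, including newlines,
    # so this collapses newlines and repeated spaces in a single step
    return " ".join(ascii_only.split())


if __name__ == "__main__":
    raw = "Mental\nmodels \u2013 explained.   Three Data Point Thursday"
    print(clean_text(raw))
```

Note that dropping all non-ASCII characters is a blunt tool: it also removes accented letters, so it suits English-only content like mine but not every blog.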

We can test that everything works fine, including the embeddings generation, up to this point by running “meltano run tap-csv clean-text add-embeddings target-jsonl”.

Finish Up — Loading Pinecone

To finally reload our embeddings into Pinecone, and thus rebuild our application from this one document, run

meltano run tap-csv clean-text add-embeddings target-pinecone

Again, run

meltano invoke streamlit_app:demo_ui

to open up the chat dialog, and enjoy!
