3 Ways to Run LLama 3.2 Locally

Sacha Schwab
2 min read · Oct 1, 2024

Large Language Models (LLMs) have revolutionized the AI landscape, and small models are on the rise, making it feasible to run capable LLMs even on older PCs and smartphones. As a starting point, we’ll explore three different methods to interact with LLama 3.2 locally.

Prerequisites

Before we dive in, make sure you have:

  • Ollama installed and running
  • The LLama 3.2 model pulled (use ollama pull llama3.2 in your terminal)
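If you’d rather stay in Python, you can script this setup with the ollama package itself. Here’s a minimal sketch, assuming the Ollama server is already running on its default port:

import ollama

# Download the model if it isn't available locally yet
# (the Python equivalent of running ollama pull llama3.2 in the terminal).
ollama.pull("llama3.2")

# List the locally available models to confirm the pull worked.
print(ollama.list()["models"])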

Now, let’s explore our three methods!

Method 1: Using the Ollama Python Package

The Ollama Python package provides a straightforward way to interact with LLama 3.2 in your Python scripts or Jupyter notebooks.

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {
            "role": "user",
            "content": "Tell me an interesting fact about elephants",
        },
    ],
)
print(response["message"]["content"])

This method is great for simple, synchronous interactions. But what if you want to stream the response? Ollama’s got you covered with its AsyncClient:

import asyncio
from ollama import AsyncClient

async def chat():
    message = {
        "role": "user",
        "content": "Tell me an interesting fact about elephants",
    }
    async for part in await AsyncClient().chat(
        model="llama3.2", messages=[message], stream=True
    ):
        print(part["message"]["content"], end="", flush=True)

# Run the async function
asyncio.run(chat())

Method 2: Using the Ollama API

For those who prefer working with APIs directly or want to integrate LLama 3.2 into non-Python applications, Ollama provides a simple HTTP API.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What are God Particles?"
    }
  ],
  "stream": false
}'

This method gives you the flexibility to interact with LLama 3.2 from any language or tool that can make HTTP requests.
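For example, here’s a rough Python equivalent of the curl call above, using the requests library; any HTTP client in any language would do the same job:

import requests

# Same request as the curl example, just sent from Python.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "user", "content": "What are God Particles?"}
        ],
        "stream": False,
    },
)
response.raise_for_status()

# With "stream": false, the API returns a single JSON object.
print(response.json()["message"]["content"])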

Method 3: Using Langchain for Advanced Applications

For more complex applications, especially those involving document analysis and retrieval, Langchain integrates seamlessly with Ollama and LLama 3.2.

Here’s a snippet that demonstrates loading documents, creating embeddings, and performing a similarity search:

from langchain_community.document_loaders import DirectoryLoader, UnstructuredWordDocumentLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Load documents
loader = DirectoryLoader('/path/to/documents', glob="**/*.docx", loader_cls=UnstructuredWordDocumentLoader)
documents = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(documents)

# Create embeddings and vector store (pull the embedding model first: ollama pull nomic-embed-text)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

# Initialize LLama 3.2
llm = Ollama(model="llama3.2", base_url="http://localhost:11434")

# Perform a similarity search and generate a response
query = "What was the main accomplishment of Thomas Jefferson?"
similar_docs = vectorstore.similarity_search(query)
context = "\n".join([doc.page_content for doc in similar_docs])
response = llm.invoke(f"Context: {context}\nQuestion: {query}\nAnswer:")
print(response)

This method allows you to build applications that can understand and reason about large amounts of text data using LLama 3.2’s powerful language understanding capabilities.
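If you want to go a step further, Langchain can also bundle the retrieve-then-answer steps above into a single chain. Here’s a minimal sketch using RetrievalQA, assuming the core langchain package is installed alongside langchain_community:

from langchain.chains import RetrievalQA

# Wrap the vector store and LLM from above into a retrieval chain.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved chunks directly into the prompt
    retriever=vectorstore.as_retriever(),
)

result = qa_chain.invoke({"query": "What was the main accomplishment of Thomas Jefferson?"})
print(result["result"])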

Conclusion

Running LLama 3.2 locally opens up a world of possibilities for AI-powered applications. Whether you’re looking for simple chat interactions, API-based integrations, or complex document analysis systems, these three methods provide the flexibility to suit a wide range of use cases.

Remember to use these powerful tools responsibly and ethically. Happy coding!
