3 Ways to Run DeepSeek and Llama Locally
Large Language Models (LLMs) have revolutionized the AI landscape, and small models are on the rise. This makes it possible to run advanced LLMs even on older PCs and smartphones. As a starting point, we’ll explore three different methods to interact with Llama 3.2 or DeepSeek locally.
TL;DR
- Python package: Use Ollama for simple or streamed chats with models like deepseek-r1:1.5b.
- API: Interact via Ollama’s local HTTP API for flexible integration in any language.
- LangChain: Build advanced applications like document analysis and retrieval with embeddings and vector stores.
Prerequisites
Before we dive in, make sure you have:
- Ollama installed and running (a quick check is sketched below)
- Python 3 with the ollama package (pip install ollama) for the Python examples
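If you want to verify that the Ollama server is actually reachable before running any of the examples, a minimal sketch using the ollama Python package might look like this (the exact exception raised depends on the package version, so it is caught broadly here):
import ollama

# List the models available locally; if the Ollama server is not running,
# this call fails with a connection error.
try:
    print(ollama.list())
except Exception as exc:
    print(f"Ollama doesn't seem to be reachable: {exc}")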
Decisions, decisions…
Which model version is small enough to run on your computer locally? My approach: Just try them out! Start with the smallest version, then work your way up until you know how many parameters your computer can handle.
To investigate which versions are available on Ollama, visit https://ollama.com/search.
For deepseek-r1, for example, we find that the smallest available version has 1.5B parameters. The pull command in the terminal is:
ollama pull deepseek-r1:1.5b
The pulled model turns out to be just over 1.1 GB in size. Not bad.
To quickly try out what we just got, run:
ollama run deepseek-r1:1.5b
We get:
>>> Send a message (/? for help)
Let’s send a message:

The result is quite impressive.
Three Ways to Get the Job Done
Now, let’s explore our three methods!
Method 1: Using the Ollama Python Package
The Ollama Python package provides a straightforward way to interact with deepseek-r1:1.5b in your Python scripts or Jupyter notebooks.
import ollama

response = ollama.chat(
    model="deepseek-r1:1.5b",
    messages=[
        {
            "role": "user",
            "content": "Tell me an interesting fact about elephants",
        },
    ],
)
print(response["message"]["content"])
So, we get an answer that’s much longer than 50 words… (by the way, I tried the 32b model with the same question, and the answer wasn’t much better).
But let’s continue with the programming fun; model performance is not the focus of this article.
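As an aside, if you want to keep answers short, the chat call accepts an options dictionary with Ollama’s generation parameters. A minimal sketch, assuming num_predict (a hard cap on the number of generated tokens, which truncates rather than summarizes) is the knob you want:
import ollama

# Same request as above, but with a cap on the number of generated tokens.
# Note: num_predict truncates the output; for reasoning models like deepseek-r1,
# part of this budget is also spent on the model's "thinking" tokens.
response = ollama.chat(
    model="deepseek-r1:1.5b",
    messages=[{"role": "user", "content": "Tell me an interesting fact about elephants"}],
    options={"num_predict": 128},  # assumption: 128 tokens is enough for a short answer
)
print(response["message"]["content"])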
The method above is great for simple, synchronous interactions. But what if you want to stream the response? Ollama’s got you covered with its AsyncClient:
import asyncio
from ollama import AsyncClient

async def chat():
    message = {
        "role": "user",
        "content": "Tell me an interesting fact about elephants",
    }
    async for part in await AsyncClient().chat(
        model="deepseek-r1:1.5b", messages=[message], stream=True
    ):
        print(part["message"]["content"], end="", flush=True)

# Run the async function
async def main():
    await chat()

# Check if there's a running event loop and use it, otherwise create a new one
try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    loop = None

if loop and loop.is_running():
    asyncio.ensure_future(main())
else:
    asyncio.run(main())
Method 2: Using the Ollama API
For those who prefer working with APIs directly or want to integrate Llama or DeepSeek into non-Python applications, Ollama provides a simple HTTP API. Let’s try it out in bash:
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:1.5b",
  "messages": [
    {
      "role": "user",
      "content": "What are God Particles?"
    }
  ],
  "stream": false
}'
This method gives you the flexibility to interact with the model from any language or tool that can make HTTP requests.
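For example, the same request sent from Python with the requests library (using the default local endpoint shown above) could look like this:
import requests

# Same payload as the curl example above, sent to Ollama's local chat endpoint.
payload = {
    "model": "deepseek-r1:1.5b",
    "messages": [{"role": "user", "content": "What are God Particles?"}],
    "stream": False,
}
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
r.raise_for_status()
# With "stream": false, the reply is a single JSON object containing the message.
print(r.json()["message"]["content"])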
Method 3: Using LangChain for Advanced Applications
For more complex applications, especially those involving document analysis and retrieval, LangChain integrates seamlessly with Ollama and Llama 3.2.
Here’s a snippet that demonstrates loading documents, creating embeddings, and performing a similarity search with a Wikipedia article on Thomas Jefferson that I saved as a Word document.
For this, make sure you have python-docx installed (e.g. pip install python-docx); the UnstructuredWordDocumentLoader below also relies on the unstructured package, and the imports assume langchain-community, langchain-text-splitters and chromadb are installed.
from langchain_community.document_loaders import DirectoryLoader, UnstructuredWordDocumentLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
# Load documents
loader = DirectoryLoader('/path/to/documents', glob="**/*.docx", loader_cls=UnstructuredWordDocumentLoader)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(documents)
# Create embeddings and vector store
# (pull the embedding model first: ollama pull nomic-embed-text)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
# Initialize Llama 3.2 1B (pull it first: ollama pull llama3.2:1b)
llm = Ollama(model="llama3.2:1b", base_url="http://localhost:11434")
# Perform a similarity search and generate a response
query = "What was the main accomplishment of Thomas Jefferson?"
similar_docs = vectorstore.similarity_search(query)
context = "\n".join([doc.page_content for doc in similar_docs])
response = llm(f"Context: {context}\nQuestion: {query}\nAnswer:")
print(response)
The model’s response is:

Not the perfect answer, but it shows that this approach lets you build applications that can understand and reason about large amounts of text data, using the powerful language understanding capabilities of Llama, DeepSeek and other models.
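Since the vectorstore and llm objects stay around after the snippet above, a small helper makes it easy to ask follow-up questions without repeating the retrieval boilerplate. This is just a sketch reusing those objects, and the example question is arbitrary:
# Reuse the vectorstore and llm defined above for further questions.
def ask(question, k=4):
    # Retrieve the k most similar chunks and stuff them into the prompt.
    similar_docs = vectorstore.similarity_search(question, k=k)
    context = "\n".join(doc.page_content for doc in similar_docs)
    return llm(f"Context: {context}\nQuestion: {question}\nAnswer:")

print(ask("Which political offices did Thomas Jefferson hold?"))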
Conclusion
Running Llama 3.2, DeepSeek or other models available through Ollama locally opens up a world of possibilities for AI-powered applications.
Whether you’re looking for simple chat interactions, API-based integrations, or complex document analysis systems, these three methods provide the flexibility to suit a wide range of use cases.
Remember to use these powerful tools responsibly and ethically. Happy coding!