DeepSeek-OCR + Llama 3.1 + RAG Just Revolutionized Agent OCR Forever


Over the weekend, I scrolled through Twitter to see what was happening in the AI community, and once again DeepSeek had drawn worldwide attention.

The buzz is around DeepSeek-OCR, and it isn’t just another text recognition tool: it’s a brand-new contextual optical compression technology that uses visual representations to tackle long-text processing, offering a completely new approach to handling massive amounts of document information.

Anyone who has used a large language model (LLM) has encountered a common pain point:

When you ask the model to summarise tens of thousands of words from conference notes or academic papers, it starts to lose its memory.

This is because attention’s quadratic complexity in sequence length inherently limits models like GPT, Gemini, and Claude: the longer the input, the more compute they require.
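A quick way to see what "quadratic" means in practice: self-attention compares every token with every other token, so doubling the context roughly quadruples the work. This is only an illustrative sketch of the scaling, not a measurement of any specific model.

def attention_pairs(n_tokens: int) -> int:
    # Pairwise token interactions computed per attention layer
    return n_tokens ** 2

print(attention_pairs(8_000) / attention_pairs(4_000))  # 4.0: double the context, ~4x the work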

But humans aren’t like that.
We can glance at a note or a diagram and instantly recall an entire passage.

Traditionally, for AI to understand long documents, the entire document must be converted into digital text. This process consumes a large number of tokens (which can be understood as the units used by AI to process information), resulting in low computational efficiency.

DeepSeek-OCR takes a different approach: it first converts text into images and then uses visual tokens to compress and represent this information. Imagine you have a 10,000-word article — instead of having AI read it word by word, it can simply “glance” at an image to understand and reconstruct the original text.

The core breakthrough is that a single image of document text can be represented with far fewer visual tokens than the equivalent digital text would require. This means optical compression with visual tokens can achieve higher compression ratios, allowing us to do more with fewer resources.
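To make that concrete, here is a rough back-of-the-envelope comparison. The word count, tokens-per-word ratio, page count, and visual-tokens-per-page figures are my own illustrative assumptions, not numbers from the paper.

# Illustrative compression arithmetic (assumed numbers, not measured results)
words = 10_000
text_tokens = int(words * 1.3)   # ~1.3 tokens per English word is a common rule of thumb
pages = 10                       # assume roughly 1,000 words fit on one rendered page
visual_tokens = pages * 256      # assume ~256 visual tokens per 1024x1024 page render
print(text_tokens, visual_tokens, round(text_tokens / visual_tokens, 1))  # 13000 2560 5.1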

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check out the video:

I will ask the chatbot a question: “What are the main findings?” If you watch how the output is generated, you’ll see that the agent extracts text from each page, but if a page contains fewer than 50 characters or lacks embedded text, it converts that page into a high-resolution image and sends it to DeepSeek-OCR on Replicate. The OCR model applies its contextual optical compression approach, turning the document into compressed visual tokens, essentially letting the AI “glance” at an image representation rather than read word by word, which can turn a 10,000-word article into a much more compact form.

Once all text is extracted, the system breaks it into 500-character chunks with a 50-character overlap to maintain context, converts each chunk into embedding vectors using OpenAI embeddings, and stores them in a Chroma vector database that persists on disk for future use.

When you ask a question, the agent searches these vectors for the 5 most semantically similar document chunks, assembles them into a context prompt along with your question and an instruction to cite page numbers, and then sends everything to the Llama 3.1 405B model running on Replicate’s streaming API, which generates the answer chunk by chunk in real time.

Finally, it returns the answer along with source citations showing which pages the information came from, giving you a complete RAG agent that can understand any PDF.

DeepSeek-OCR is an end-to-end OCR and document parsing model designed to achieve optical context compression.

This model consists of two major components: a DeepEncoder that compresses high-resolution image input into a small number of visual tokens, and a DeepSeek-3B-MoE decoder (a Mixture-of-Experts language model) that restores the original text from the visual token sequence.

DeepEncoder (approximately 380 million parameters) uses a SAM-based window-attention backbone for local image feature extraction, followed by a two-layer convolutional module that applies 16x downsampling, compressing a 1024×1024-pixel image from 4,096 patches to roughly 256 visual tokens.
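Here is a quick sanity check of those numbers; the 16-pixel ViT-style patch size is my assumption, but it reproduces the 4,096-patch and 256-token figures above.

# Back-of-the-envelope token count for DeepEncoder (assumed 16-pixel patches)
image_side = 1024
patch_side = 16
patches = (image_side // patch_side) ** 2       # 64 * 64 = 4096 patches
compression_factor = 16                         # 16x convolutional downsampling
visual_tokens = patches // compression_factor   # ~256 visual tokens reach the decoder
print(patches, visual_tokens)                   # 4096 256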

The decoder side, which receives these visual tokens, has 3 billion parameters in total (approximately 570 million active during inference) and uses a MoE structure that dynamically selects 6 of 64 experts per step, allowing for lightweight yet efficient text reconstruction.
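To illustrate what selecting 6 of 64 experts per step looks like, here is a minimal top-k routing sketch in PyTorch. The hidden size and the softmax gate are placeholder assumptions for illustration, not DeepSeek’s actual implementation.

import torch

num_experts, top_k, hidden = 64, 6, 1280          # hidden size is an assumed placeholder
gate = torch.nn.Linear(hidden, num_experts)        # router that scores each expert

token = torch.randn(1, hidden)                     # one decoder token representation
scores = torch.softmax(gate(token), dim=-1)        # probabilities over the 64 experts
weights, expert_ids = scores.topk(top_k, dim=-1)   # only these 6 experts run for this token
print(expert_ids.tolist())                         # e.g. [[3, 17, 42, 5, 60, 9]]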

With this architecture, DeepSeek-OCR takes an unconventional approach by converting the contents of a text document into an “image” and then reading it.

Check out the PaddleOCR-VL video:

When I tested both OCR models, I found something interesting — PaddleOCR-VL, which has fewer parameters (0.9B), was beating much larger 3B models in real-world tests.

I gave it tough jobs: reading vertical text in the right direction, understanding complex math formulas, and handling documents with multiple columns — and PaddleOCR-VL nailed them all, while DeepSeek-OCR made mistakes with reading order and formulas, even though it has cool compression features.

Then I discovered something fun in DeepSeek-OCR’s research paper: the authors actually thanked PaddleOCR and acknowledged using it to label their training data. That made me realise why companies like Baidu, DeepSeek, and Shanghai AI Lab are all releasing OCR models: building OCR tools isn’t their main business; they build them to clean up huge amounts of data for training their own models, and we get these powerful OCR tools as free bonuses.

After testing everything, I figured out that if you’re building something for real work and need to read printed text, forms, tables, or documents in different languages, PaddleOCR-VL is the way to go, while DeepSeek-OCR is better if you’re a researcher trying to compress data to save money on AI costs.

In traditional LLMs, text is broken down into discrete text tokens (typically words or subwords). Each token is assigned a fixed ID in the vocabulary and mapped to a vector via a large lookup table (the embedding layer). While this process is efficient, its expressive power is constrained by the fixed vocabulary.

Visual tokens are completely different. Instead of coming from a fixed lookup table, they are continuous vectors generated directly from image pixels by a neural network (the visual encoder), as the sketch after this list illustrates. This means:

Higher information density: Visual tokens exist in a continuous vector space and can encode richer and more nuanced information than discrete text tokens. A visual token can represent the color, shape, texture, and spatial relationships within an area, rather than just a word or subword.

Global pattern perception: The visual encoder can capture global information, such as the overall layout, typesetting, and font style of the text, which is lost in the plain text token sequence.

Larger expression space: In theory, the “vocabulary” of visual tokens is infinite because they are continuous vectors generated directly from pixels rather than selected from a fixed dictionary.
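Here is the sketch referenced above: a toy PyTorch contrast between a fixed embedding lookup for text tokens and a convolutional encoder that produces continuous visual tokens. The dimensions, token IDs, and modules are illustrative assumptions, not DeepSeek-OCR’s actual components.

import torch
import torch.nn as nn

# Text tokens: discrete IDs looked up in a fixed vocabulary table
vocab_size, dim = 32_000, 512
text_embed = nn.Embedding(vocab_size, dim)
token_ids = torch.tensor([[101, 2054, 2003]])       # three arbitrary token IDs
text_tokens = text_embed(token_ids)                  # shape: (1, 3, 512)

# Visual tokens: continuous vectors computed directly from pixels
visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16 patches -> vectors
image = torch.randn(1, 3, 1024, 1024)
feature_map = visual_encoder(image)                  # shape: (1, 512, 64, 64)
visual_tokens = feature_map.flatten(2).transpose(1, 2)  # (1, 4096, 512), no vocabulary involved

print(text_tokens.shape, visual_tokens.shape)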

Let’s start coding:

Before we dive into the application, we need to set up the environment by installing the necessary Python libraries.


pip install replicate langchain-openai langchain-core langchain-chroma langchain-text-splitters pymupdf python-dotenv

The next step is the usual one: we import the relevant libraries (their significance will become evident as we proceed) and perform some basic configuration.

import os
import replicate
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.language_models.llms import LLM
from typing import List, Optional, Any
import fitz
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

I developed this custom Llama class by inheriting from LangChain’s base LLM class and configuring it with the Llama 3.1 405B model identifier, token limits, and temperature settings.

I implemented the required _llm_type property to return an identifier, then I built the core _call method, which takes a prompt, packages it with the configuration into a dictionary, sends it to Replicate’s streaming API, and loops through the response chunks to concatenate them into a complete answer.

class Llama(LLM):
    model: str = "meta/meta-llama-3.1-405b-instruct"
    max_tokens: int = 1024
    temperature: float = 0.7

    @property
    def _llm_type(self) -> str:
        return "replicate_llama"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        input_data = {
            "prompt": prompt,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature
        }

        output = ""
        for event in replicate.stream(self.model, input=input_data):
            output += str(event)

        return output

I built this OCRPDFLoader class to extract text from PDFs by first trying text extraction and falling back to OCR when needed. I initialised it with a file path, an optional OCR flag, and a text threshold (default 50 characters) to detect if a page has enough text.

In the load method, I opened the PDF with PyMuPDF, looped through each page to extract text, then checked whether OCR was forced or the extracted text was below the threshold. If so, I called my _ocr_page method, which converts the page into a high-resolution PNG image, sends it to Replicate’s DeepSeek-OCR API, gets the OCR text back, cleans up the temporary image, and returns the extracted text.

Finally, I packaged each page’s text into LangChain Document objects with metadata (source file, page number, filename) and returned them as a list, giving me a smart loader that automatically handles both digital and scanned PDFs.

class OCRPDFLoader:
    def __init__(self, file_path: str, use_ocr: bool = False, text_threshold: int = 50):
        self.file_path = file_path
        self.use_ocr = use_ocr
        self.text_threshold = text_threshold

    def load(self) -> List[Document]:
        doc = fitz.open(self.file_path)
        documents = []

        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()

            if self.use_ocr or len(text.strip()) < self.text_threshold:
                print(f"OCR: page page_num + 1")
                text = self._ocr_page(page, page_num)

            if text.strip():
                documents.append(Document(
                    page_content=text.strip(),
                    metadata={
                        'source': self.file_path,
                        'page': page_num + 1,
                        'filename': Path(self.file_path).name
                    }
                ))

        doc.close()
        return documents

    def _ocr_page(self, page, page_num, temp_dir="./temp_ocr"):
        os.makedirs(temp_dir, exist_ok=True)

        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        img_path = f"temp_dir/page_page_num.png"
        pix.save(img_path)

        with open(img_path, "rb") as image_file:
            input_data = {
                "image": image_file,
                "task_type": "Free OCR"
            }

            output = replicate.run(
                "lucataco/deepseek-ocr:cb3b474fbfc56b1664c8c7841550bccecbe7b74c30e45ce938ffca1180b4dff5",
                input=input_data
            )

        os.remove(img_path)
        return output

Next, I built the LangChainPDFRAG class, the main orchestrator that ties everything together into a complete RAG system. I initialised it by setting up my custom Llama model for generating answers, OpenAI embeddings for converting text into vectors, a text splitter that breaks documents into 500-character chunks with a 50-character overlap to maintain context between chunks, and a Chroma vector database configured to persist on disk so it can reload existing data between sessions.

I created the add_pdf method, which uses my OCR loader to extract text from PDFs, splits that text into manageable chunks, then either creates a new vector store or adds to an existing one by converting each chunk into embeddings and storing them for semantic search.

Finally, I implemented the query method: it sets up a retriever to find the 5 most relevant document chunks, builds a LangChain chain that takes the user’s question, retrieves the relevant context, formats it into a prompt template asking the LLM to cite page numbers, passes everything to my Llama model for generation, and returns both the generated answer and the source documents with their page numbers. The result is a complete question-answering system that can intelligently search through PDFs and provide accurate, cited responses.

class LangChainPDFRAG:
    def __init__(self, 
                 llm_model="meta/meta-llama-3.1-405b-instruct",
                 embedding_model="text-embedding-3-small",
                 persist_directory='./chroma_db'):

        self.llm = Llama(model=llm_model)
        self.embeddings = OpenAIEmbeddings(model=embedding_model)
        self.persist_directory = persist_directory
        self.vectorstore = None

        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

        if os.path.exists(persist_directory):
            self.vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=self.embeddings
            )

    def add_pdf(self, pdf_path: str, use_ocr: bool = False):
        loader = OCRPDFLoader(pdf_path, use_ocr=use_ocr)
        documents = loader.load()
        splits = self.text_splitter.split_documents(documents)

        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=self.embeddings,
                persist_directory=self.persist_directory
            )
        else:
            self.vectorstore.add_documents(splits)

        print(f"Added len(splits) chunks from Path(pdf_path).name")
        return len(splits)

    def query(self, question: str):
        if self.vectorstore is None:
            raise ValueError("No documents.")

        retriever = self.vectorstore.as_retriever(search_kwargs={"k": 5})

        def format_docs(docs):
            return "\n\n".join([doc.page_content for doc in docs])

        prompt = ChatPromptTemplate.from_template(
            "You are a helpful assistant. Answer based on the context provided. Cite page numbers when relevant.\n\n"
            "Context:\ncontext\n\n"
            "Question: question\n\n"
            "Answer:"
        )

        chain = (
             format_docs, "question": RunnablePassthrough()
            | prompt
            | self.llm
            | StrOutputParser()
        )

        docs = retriever.invoke(question)
        answer = chain.invoke(question)

        return {
            'answer': answer,
            'sources': [
                {
                    'filename': doc.metadata.get('filename'),
                    'page': doc.metadata.get('page'),
                    'content': doc.page_content[:200]
                }
                for doc in docs
            ]
        }

I instantiated the RAG system with Llama 3.1 405B, loaded a PDF into the vector database, and queried it with a question. The agent retrieved the relevant document chunks, generated an answer, and returned both the answer and the source citations.

if __name__ == "__main__":
    # Using Llama 3.1 405B from Replicate
    rag = LangChainPDFRAG(llm_model="meta/meta-llama-3.1-405b-instruct")

    rag.add_pdf('TSLA-Q2-2025-Update.pdf', use_ocr=False)

    result = rag.query('What are the main findings?')

    print("=== Answer ===")
    print(result['answer'])

    print("\n=== Sources ===")
    for source in result['sources']:
        print(f"- source['filename'], Page source['page']")

Conclusion:

DeepSeek-OCR is not just a more powerful OCR tool, but a research paper that opens a new chapter. The concept of visual-text compression that it proposes offers an imaginative path to solving one of the biggest challenges facing current large-scale models: the bottleneck of long context processing efficiency.

By “rendering” textual information as two-dimensional images and compressing it into information-dense visual tokens using an efficient visual encoder, DeepSeek-OCR demonstrates that AI can “see images” like humans can, allowing it to understand and remember large amounts of information more efficiently.

I would highly appreciate it if you:

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai

Support the content (every dollar goes back into the video): https://buymeacoffee.com/gaodalie98d

Subscribe to the Newsletter for free: https://substack.com/@gaodalie


