1. Home
  2. Coding
  3. How to merge PDF documents using Python

How to merge PDF documents using Python

Share

It’s easy to merge PDF documents using Python code, if you use the right tools!

Nowadays PDF format is the most popular file format to share documents on the internet. Many software is available to manage PDF documents, but when you need to perform some easy task on them, it is not so easy to find the best one to use. One example is merging multiple PDF files into one file.

In this guide, we will show how you can merge many PDF documents into one file using a simple Python script. In particular, to merge two PDF documents in Python we need to:

Let’s now see in the following sections how to do it in more details.

Prerequisites

First, we need to make sure that our development environment meets all the minimum requirements we need. In particular, we should already have installed:

  • Python >= 3.x
  • Pip (Python’s package installer)

After we met all the minimum requirements, we can move on to the next steps.

Install PyPDF2 using Pip

In Python, there are different libraries that allow us to manipulate PDF documents, and some of them are pretty complex to use. In this guide, we use PyPDF2, which is a simple Python library that we can use also to merge multiple PDF documents.

We can use Pip, the Python’s package installer, to install PyPDF2. To do so, we simply need to run the following command:

python3 -m pip install PyPdf2==1.26.0

Merging Pdf Documents inside a Subfolder

When you have to merge lots of PDF documents, it is really annoying to pass all the absolute paths of the files. So, to save time in coding the paths we can just put all the PDF files in a parent directory, and then Python will handle automatically all the paths of the PDF files for us.

To do so, make sure your project structure looks like this:

\main.py
\parent_folder
    \folder_1
        file_1.pdf
        file_2.pdf
    \folder_2
        file_3.pdf
    \folder_3
        \folder_4
            file_4.pdf
        file_5.pdf
    file_6.pdf

Note that all the PDF files should be inside the \parent_folder.

Create the Python script to merge PDF documents

After setting up the environment, it’s finally time to code our script.

Python’s code to extract all the files under the Parent_folder

To get all the paths of the PDF files under the parent folder we can just use the following code:

import os

# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
    target_files = []
    for path, subdirs, files in os.walk(parent_folder):
        for name in files:
            target_files.append(os.path.join(path, name))
    return target_files 

# get a list of all the paths of the pdf
extracted_files = fetch_all_files('./parent_folder')

Basically, we use the fetch_all_files function to recursively find all PDF files located into the parent_folder and it’s subfolders.

Python’s code to Merge the PDF documents

Now that we have the list of all the PDF files paths, we can use the following code to merge them into a unique PDF file:

from PyPDF2 import PdfFileMerger

# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list [str]):
    merger   = PdfFileMerger()
    
    for pdf in extracted_files:
        merger.append(pdf)

    merger.write(out_path)
    merger.close()

merge_pdf('./final.pdf', extracted_files)

Complete code to merge PDFs in Python

Simple as that! We can now put all the final code in a ./main.py file:

#main.py

import os
from PyPDF2 import PdfFileMerger

# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
    target_files = []
    for path, subdirs, files in os.walk(parent_folder):
        for name in files:
            target_files.append(os.path.join(path, name))
    return target_files 

# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list [str]):
    merger   = PdfFileMerger()
    
    for pdf in extracted_files:
        merger.append(pdf)

    merger.write(out_path)
    merger.close()

# get a list of all the paths of the pdf
parent_folder_path = './parent_folder'
outup_pdf_path     = './final.pdf'

extracted_files = fetch_all_files(parent_folder_path)
merge_pdf(outup_pdf_path, extracted_files)

To run the script, you just need to type python3 main.py in your terminal!

Conclusions

In this article, we have learned to merge different PDF files into a unique PDF using Python code. We used the simple Python library PyPDF2 to manipulate the PDFs and we wrote some code to collect multiple PDFs paths under different subfolders.

For more details about what you can do with PyPDF2, you can check its official documentation:

If you like our post, please share it: