
Splitting a multipage PDF and extracting specific pages

Introduction

In this tutorial, you will learn how to split a multipage PDF file into single-page documents using PassportPDF and Python.

Your machine should already be set up by following the instructions in the getting started guide.

Splitting PDF files with PassportPDF

Now, we’ll see how to use PassportPDF to split a multipage PDF document into multiple single-page documents. Then we’ll show how to download these documents and save them on a local disk.

We will use the following endpoints from PassportPDF:

  • DocumentLoadFromURI to load a document from a URI.
  • ExtractPage to extract single pages from the PDF document.
  • SaveDocument to save every single page as a separate document.
  • DocumentClose to close the PDF document.

These endpoints are demonstrated in the following example:

"""
PassportPDF tutorial : Splitting a multi-page PDF document into single-page documents.
"""
 
import requests
import base64
from urllib.parse import urlparse
from pathlib import Path
 
 
if __name__=="__main__":
 
    endpoint = "https://passportpdfapi.com/api/document/DocumentLoadFromURI"
 
    headers = {
        "X-PassportPDF-API-Key" : "YOUR-PASSPORT-CODE",
    }
 
    URI = "https://passportpdfapi.com/test/multiple_pages.pdf"
 
    data = {
        "URI" : URI
    }
 
    document_name = Path(urlparse(URI).path).stem
    response = requests.post(endpoint, json=data, headers=headers)
 
    if(response.status_code == 200):
 
        json_response = response.json()
        file_id = json_response["FileId"]
 
        # Extract pages from document
        extract_page_endpoint = "https://passportpdfapi.com/api/pdf/ExtractPage"
 
        extraction_data = {
            "FileId" : file_id,
            "PageRange" : "*",
            "ExtractAsSeparate" : True
        }
 
        extract_page_response = requests.post(extract_page_endpoint, json=extraction_data, headers=headers)

        if(extract_page_response.status_code == 200):

            # Parse the response only after confirming that the request succeeded
            json_response = extract_page_response.json()
            pages_ids = json_response["FileIds"]

            print("Extracting pages..")
 
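            # Download pages
            save_document_endpoint = "https://passportpdfapi.com/api/pdf/SaveDocument"

            # Make sure the output directory exists before writing to it
            Path("./data/output/split_pdf").mkdir(parents=True, exist_ok=True)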
 
            for i, page_id in enumerate(pages_ids):
 
                save_document_response = requests.post(save_document_endpoint, json={"FileId" : page_id}, headers=headers)
                page_nbr = i+1
 
                if(save_document_response.status_code == 200):
 
                    json_response = save_document_response.json()
 
                    with open(f"./data/output/split_pdf/{document_name}_{page_nbr}.pdf", "wb") as file:
                        decoded_data = base64.b64decode(json_response["Data"].encode())
                        file.write(decoded_data)
 
                else:
                    print(f"Could not download page number {page_nbr}!")
 
            print("Done extracting pages.")
 
        else:
            print("Could not extract pages from document!")
 
 
        # Close document
        close_document_endpoint = "https://passportpdfapi.com/api/document/DocumentClose"
        close_response = requests.post(close_document_endpoint, json={"FileId" : file_id}, headers=headers)
 
        if(close_response.status_code == 200):
            print("Document closed successfully.")
        else:
            print("Could not close document!")
 
    else:
        print("Something went wrong when trying to load the document!")

Some important points to notice in the code:

  • When making a request to the ExtractPage endpoint, you need to set the parameter ExtractAsSeparate to True. This creates a separate file ID for each page, allowing you to save each page on its own.
  • To get the IDs of the extracted pages, read the value stored under the key “FileIds” in the JSON response dictionary.

After running the sample code above, you should have 4 PDF files saved on your local machine. Each file contains one page from the original PDF document.
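
The download step can also be isolated into a small helper, which makes the naming and decoding logic easier to check without calling the API. The following is a minimal sketch, assuming a SaveDocument response dictionary that carries the Base64-encoded file under the "Data" key, as in the example above. The function name save_page and the stand-in response are hypothetical and not part of PassportPDF.

```python
import base64
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def save_page(response_json: dict, uri: str, page_nbr: int, output_dir: Path) -> Path:
    """Decode the Base64 "Data" field and write it as <document>_<page>.pdf."""
    # Same naming scheme as the tutorial: the file stem of the source URI
    document_name = Path(urlparse(uri).path).stem
    # Create the output directory if it does not exist yet
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"{document_name}_{page_nbr}.pdf"
    target.write_bytes(base64.b64decode(response_json["Data"]))
    return target


# Stand-in for a SaveDocument response; the payload is placeholder bytes, not a real PDF
fake_response = {"Data": base64.b64encode(b"%PDF-1.4 placeholder").decode()}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(
        fake_response,
        "https://passportpdfapi.com/test/multiple_pages.pdf",
        1,
        Path(tmp),
    )
    print(saved.name)  # multiple_pages_1.pdf
```

In the tutorial script, the body of the download loop could call such a helper with save_document_response.json() in place of the stand-in response.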

Extracting specific pages from the document

What if you want to extract only specific pages from the document rather than all of them?
To achieve this, change the PageRange parameter so that it lists the numbers of the pages that you want to extract.

For example, if you want to extract only page 1 and page 3 from the previous document, then in the previous code, you need to change this part:

extraction_data = {
    "FileId" : file_id,
    "PageRange" : "*",
    "ExtractAsSeparate" : True
}

To this:

extraction_data = {
    "FileId" : file_id,
    "PageRange" : "1,3",
    "ExtractAsSeparate" : True
}

This will create 2 IDs, one for the first page and one for the third page of the PDF document. Then these pages will be downloaded to your local disk as separate documents.
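
If the page numbers come from a list rather than being typed by hand, the PageRange string can be built programmatically. The following is a minimal sketch, assuming only the comma-separated form shown above ("1,3"); any other range syntax the API may accept is not covered here. The helper name build_page_range and the placeholder FileId are hypothetical.

```python
def build_page_range(pages):
    """Return a comma-separated PageRange string from 1-based page numbers."""
    pages = list(pages)
    if not pages:
        raise ValueError("at least one page number is required")
    for page in pages:
        # Reject anything that is not a positive integer (bool is excluded on purpose)
        if not isinstance(page, int) or isinstance(page, bool) or page < 1:
            raise ValueError(f"invalid page number: {page!r}")
    # Keep the caller's order and join with commas, e.g. [1, 3] -> "1,3"
    return ",".join(str(page) for page in pages)


extraction_data = {
    "FileId": "YOUR-FILE-ID",  # placeholder; use the FileId returned by DocumentLoadFromURI
    "PageRange": build_page_range([1, 3]),
    "ExtractAsSeparate": True,
}
print(extraction_data["PageRange"])  # 1,3
```

Validating the numbers locally catches obvious mistakes, such as page 0 or an empty list, before a request is sent.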

Final remarks

If you don’t set the parameter ExtractAsSeparate to True, no separate ID is generated for each page; you get a single ID representing the whole extracted document. With a PageRange of "*", this means that you will download the same multipage document that you started with.

For more information about the endpoints used in this tutorial, please visit the PassportPDF API reference.