Compressing a PDF document with Python

Introduction

In this tutorial, you will learn how to reduce the size of a PDF document using the PassportPDF API and Python.

Your machine should already be set up by following the instructions in the getting started guide.

Everybody needs to compress documents

Businesses and organizations process large volumes of documents daily.
Compressing them brings many benefits, such as:

  • saving money by reducing storage costs,
  • boosting productivity by easing sharing and reducing transfer time,
  • reducing the carbon footprint of an organization.

PassportPDF PDF compression

The PassportPDF API makes it easy to compress PDF documents. It also offers a variety of options for compressing a document further.
For instance:

  • EnableCharRepair specifies whether character repairing will be performed during bitonal conversion.
  • EnableMRC specifies whether MRC (Mixed Raster Content) will be used for compressing the PDF contents.
  • RemoveAnnotations removes annotations from the PDF document.
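
As a rough illustration of how these options could be passed, the sketch below builds the JSON body for a Reduce call. The helper name build_reduce_payload is hypothetical, the allowed set only covers the three options listed above (not the full list from the API reference), and treating EnableCharRepair and RemoveAnnotations as booleans is an assumption.

```python
# Hypothetical helper: builds the JSON body for a Reduce request.
# Only the three options named in this tutorial are accepted here;
# the real endpoint supports more (see the API reference).
KNOWN_OPTIONS = {"EnableCharRepair", "EnableMRC", "RemoveAnnotations"}


def build_reduce_payload(file_id, **options):
    """Return a dict with the FileId plus any explicitly requested options."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unsupported option(s): {sorted(unknown)}")
    payload = {"FileId": file_id}
    payload.update(options)
    return payload


# Example: request MRC compression and annotation removal
# (boolean values are an assumption for RemoveAnnotations).
payload = build_reduce_payload("my-file-id", EnableMRC=True, RemoveAnnotations=True)
print(payload)
```

Options that are not passed are simply left out of the payload, so the API falls back to its defaults for them.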

MRC Compression

One of the options mentioned above enables Mixed Raster Content (MRC) compression, a powerful approach to compressing images.

This method is especially powerful with images containing binary-compressible text and continuous-tone components. It uses image segmentation to improve the level of compression and the quality of the rendered image.

The MRC engine separates the regions of an image into three images called layers (binary, background, and foreground layers). It then applies the most efficient and accurate compression algorithm to each layer before reconstructing a PDF document using specific rendering instructions.
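
To make the layer idea concrete, here is a toy sketch of the three-layer model on a tiny grayscale image. It is not how the PassportPDF or GdPicture.NET engine works internally: the fixed threshold is a stand-in for real image segmentation, the function names are made up for this example, and no actual compression is applied to the layers.

```python
def split_layers(pixels, threshold=128):
    """Split a grayscale image (rows of 0-255 ints) into MRC-style layers.

    Dark pixels are treated as text (foreground); everything else is background.
    A simple threshold is assumed here purely for illustration.
    """
    # Binary layer: 1 where the pixel belongs to the foreground, 0 otherwise.
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]
    # Foreground layer keeps only masked pixels, background keeps the rest.
    foreground = [[p if m else None for p, m in zip(row, mrow)]
                  for row, mrow in zip(pixels, mask)]
    background = [[None if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(pixels, mask)]
    return mask, foreground, background


def recombine(mask, foreground, background):
    """Rebuild the image: take the foreground where mask is 1, else the background."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


image = [[0, 200], [255, 10]]
mask, fg, bg = split_layers(image)
print(mask)
print(recombine(mask, fg, bg) == image)
```

In a real engine, each layer would then be handed to the codec that suits it best, which is where the size savings come from.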

You can find more information about how MRC compression works in this GdPicture.NET blog article about compression methods (PassportPDF uses the GdPicture.NET hyper-compression engine).

How to compress a PDF file using PassportPDF API

To compress a PDF file using PassportPDF API and Python, we will use the following endpoints:

  • DocumentLoadFromURI to load a document from a URI.
  • Reduce to compress the file.
  • SaveDocument to save the compressed file to your local disk.

The code below shows a minimal example of using these endpoints to compress a PDF file.

import requests
import base64


if __name__=="__main__":

    endpoint = "https://passportpdfapi.com/api/document/DocumentLoadFromURI"

    headers = {
        "X-PassportPDF-API-Key" : "YOUR-PASSPORT-CODE",
    }

    data = {
        "URI" : "https://passportpdfapi.com/test/invoice_with_barcode.pdf"
    }

    response = requests.post(endpoint, json=data, headers=headers)
    
    if(response.status_code == 200):

        json_response = response.json()
        file_id = json_response["FileId"]
        
        data = {
            "FileId" : file_id,
        }

        # Reduce document
        reduce_endpoint = "https://passportpdfapi.com/api/pdf/Reduce"
        reduce_response = requests.post(reduce_endpoint, json=data, headers=headers)

        if(reduce_response.status_code == 200):

            json_response = reduce_response.json()
            content_removed = json_response["ContentRemoved"]
            new_file_size = json_response["NewFileSize"]
            print("content_removed : ", content_removed)
            print("new file size : ", new_file_size)

            # Download reduced document
            save_document_endpoint = "https://passportpdfapi.com/api/pdf/SaveDocument"
            save_document_response = requests.post(save_document_endpoint, json=data, headers=headers)

            if(save_document_response.status_code == 200):
                json_response = save_document_response.json()

                # "Data" holds the document as a base64-encoded string.
                # The ./data/output directory must already exist.
                with open("./data/output/reduced.pdf", "wb") as file:
                    file.write(base64.b64decode(json_response["Data"]))
            else:
                print("Something went wrong when trying to save the document.")

        else:
            print("Something went wrong when trying to reduce the document.")

        # Close document
        close_document_endpoint = "https://passportpdfapi.com/api/document/DocumentClose"
        close_response = requests.post(close_document_endpoint, json={"FileId" : file_id}, headers=headers)

        if(close_response.status_code == 200):
            print("Document closed successfully.")
        else:
            print("Could not close document!")

    else:
        print("Something went wrong when trying to load the document!")
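
The saving step in the example writes to a directory that has to exist beforehand. As a small sketch of one way to make that step more robust, the hypothetical helper below (save_base64_pdf is not part of the API) decodes a base64 string such as the "Data" field and creates the output directory if needed. The sample bytes are made up so the snippet runs without calling the API.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_pdf(data_b64, output_path):
    """Decode a base64 string and write it to output_path.

    Missing parent directories are created. Returns the number of bytes written.
    """
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    raw = base64.b64decode(data_b64)
    path.write_bytes(raw)
    return len(raw)


# Demonstration with placeholder bytes instead of a real API response.
sample = base64.b64encode(b"%PDF-1.4 example bytes").decode()
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "output" / "reduced.pdf"
    written = save_base64_pdf(sample, target)
    print(written, target.exists())
```

In the tutorial code, this would be called with json_response["Data"] and "./data/output/reduced.pdf".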

The input file was 200 KB. After compression, we obtained a 50 KB file, a quarter of the original size!
Note that we only used the default compression options.
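
To compare results across option sets, it can help to express the saving as a percentage. The short sketch below does that for the sizes quoted above; the function name reduction_percent is just an illustrative choice.

```python
def reduction_percent(original_size, compressed_size):
    """Return how much smaller the compressed file is, as a percentage of the original."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (original_size - compressed_size) * 100 / original_size


# Sizes from this tutorial, in the same unit for both arguments.
print(reduction_percent(200, 50))  # 75.0
```

Going from 200 to 50 is therefore a 75% reduction with the default options.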

You can compress the file further (up to 90% in some cases) by enabling other options.
For example, EnableMRC reduces it further to 44 KB. To enable this option, add it to the JSON data of the POST request as shown below:

...
data = {
    "FileId" : file_id,
    "EnableMRC" : True
}

# Reduce document
reduce_endpoint = "https://passportpdfapi.com/api/pdf/Reduce"
reduce_response = requests.post(reduce_endpoint, json=data, headers=headers)
...

For the full list of options available, check the Reduce endpoint in the API reference.

Final remarks

With PassportPDF API, you can compress documents and images easily, thanks to many options. In addition, the API allows you to adjust the tradeoff between file size and image quality to customize the compression engine as you wish.

Keep in mind that with some compression techniques, your document may lose some important information that could prevent the use of the document in specific contexts (such as archiving legal files or viewing medical images, for instance).