Parsing PDF documents using .NET and C#

Introduction

In this tutorial, you will learn how to parse PDF documents and extract text from them using .NET and the PassportPDF API. Before you start, see our getting started guide to set up your machine.

There is a business need for parsing PDF documents

PDF is one of the most widely used document formats in the world, and many businesses use it to store information. Most of these businesses process large numbers of PDF documents every day. To consume this information, they need to parse these documents and extract text from them.

Parsing PDF documents helps automate and streamline processes: once a document is parsed, its text can be analyzed and decisions can be made based on its content.

How to use PassportPDF API and .NET to parse a PDF file

In this tutorial, we will use the PassportPDF API to OCR a document and then extract text from it.

It’s important to know that there are two types of PDF documents:

Text-based PDF documents.
Image-based PDF documents.

Text-based PDF documents already contain a text layer, which makes extracting text from them straightforward.
In contrast, image-based PDF documents such as scanned PDFs contain only page images and have no text layer. To parse such a document and extract text from it, you first need to generate that layer with OCR.

In the example below, we will work with an image-based PDF document to show how the PassportPDF API handles both the OCR step and the extraction step.

We will use these endpoints:

DocumentLoadFromURI to load a document from a URI.
OCR to generate an OCR layer on the PDF document.
ExtractText to extract the text from the PDF document.
DocumentClose to close the document.
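
The order of these calls matters: the document must be loaded before OCR, OCR must run before extraction, and the document should be closed even if a step fails. The sketch below illustrates only that sequencing, using a hypothetical IPdfSession interface and an in-memory fake so it can run offline. None of these type or method names come from the PassportPDF SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical abstraction over the four endpoints; NOT the PassportPDF SDK.
public interface IPdfSession
{
    Task<string> LoadFromUriAsync(string uri);   // returns a file ID
    Task OcrAsync(string fileId, string language);
    Task<IReadOnlyList<string>> ExtractTextAsync(string fileId);
    Task CloseAsync(string fileId);
}

// In-memory fake that records the call order.
public sealed class FakePdfSession : IPdfSession
{
    public List<string> Calls { get; } = new();
    private bool _hasTextLayer;

    public Task<string> LoadFromUriAsync(string uri)
    {
        Calls.Add("load");
        return Task.FromResult("file-1");
    }

    public Task OcrAsync(string fileId, string language)
    {
        Calls.Add("ocr");
        _hasTextLayer = true;
        return Task.CompletedTask;
    }

    public Task<IReadOnlyList<string>> ExtractTextAsync(string fileId)
    {
        Calls.Add("extract");
        // Without a text layer the fake returns no pages, like a scanned PDF.
        IReadOnlyList<string> pages = _hasTextLayer
            ? new[] { "INVOICE CIN22060001" }
            : Array.Empty<string>();
        return Task.FromResult(pages);
    }

    public Task CloseAsync(string fileId)
    {
        Calls.Add("close");
        return Task.CompletedTask;
    }
}

public static class Workflow
{
    // Load -> OCR -> extract, closing the document even if a step throws.
    public static async Task<IReadOnlyList<string>> ParseAsync(IPdfSession session, string uri)
    {
        string fileId = await session.LoadFromUriAsync(uri);
        try
        {
            await session.OcrAsync(fileId, "eng");
            return await session.ExtractTextAsync(fileId);
        }
        finally
        {
            await session.CloseAsync(fileId);
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var session = new FakePdfSession();
        IReadOnlyList<string> pages = await Workflow.ParseAsync(session, "https://example.com/scan.pdf");
        Console.WriteLine(string.Join(" -> ", session.Calls)); // load -> ocr -> extract -> close
        Console.WriteLine(pages[0]);
    }
}
```

Putting the close call in a finally block is a design choice of this sketch: it releases the document whether or not OCR or extraction succeeds.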

The code below illustrates how to use these endpoints:

using System;
using System.Threading.Tasks;
using PassportPDF.Api;
using PassportPDF.Client;
using PassportPDF.Model;

namespace TextExtraction
{

    public class TextExtractor
    {
        static async Task Main(string[] args)
        {
            GlobalConfiguration.ApiKey = "YOUR-PASSPORT-CODE";

            PassportManagerApi apiManager = new();
            PassportPDFPassport passportData = await apiManager.PassportManagerGetPassportInfoAsync(GlobalConfiguration.ApiKey);

            // Stop early if the passport is unknown or its plan is inactive.
            if (passportData == null)
            {
                throw new ApiException("The Passport number given is invalid. Please set a valid Passport number and try again.");
            }
            else if (passportData.IsActive is false)
            {
                throw new ApiException("The Passport number given is not active. Please go to your PassportPDF dashboard and activate your plan.");
            }

            string uri = "https://passportpdfapi.com/test/invoice_with_barcode.pdf";
            
            DocumentApi api = new();

            Console.WriteLine("Loading document into PassportPDF...");
            DocumentLoadResponse document = await api.DocumentLoadFromURIAsync(new LoadDocumentFromURIParameters(uri));
            Console.WriteLine("Document loaded.");

            PDFApi pdfApi = new();

            Console.WriteLine("Launching text recognition process...");
            PdfOCRResponse ocrPdfResponse = await pdfApi.OCRAsync(new PdfOCRParameters(document.FileId, "*")
            {
                Language = "eng",
                SkipPageWithText = false
            });

            if (ocrPdfResponse.Error is not null)
            {
                throw new ApiException(ocrPdfResponse.Error.ExtResultMessage);
            }
            else
            {
                Console.WriteLine("Text recognition process ended.");
            }

            Console.WriteLine("Start text extraction process...");
            PdfExtractTextResponse extractTextResponse = await pdfApi.ExtractTextAsync(new PdfExtractTextParameters(document.FileId, "*")
            {
                TextExtractionMode = PdfExtractTextMode.WholePagePreserveLayout
            });

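            // Print the extracted text page by page.
            Console.WriteLine("Extracted text:");
            foreach (PageText page in extractTextResponse.ExtractedText)
            {
                Console.WriteLine($"======== Page {page.PageNumber} ========");
                Console.WriteLine(page.ExtractedText);
                Console.WriteLine("========================");
            }

            // The DocumentClose endpoint listed above is not called in this sample;
            // call it here once you no longer need the document.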
        }
    }
}

For the full .NET project, please visit the GitHub repository.

The PDF document we used in this tutorial is shown below:

The ExtractedText entry of the response contains a list of elements, one for each page of the PDF document.
Each element has two entries: ExtractedText and PageNumber. Below you can see the text extracted from the sample document:

imagingORPALIStechnologies
 
 
    SAS ORPALIS IMAGING                                                                                                                         CUSTOMER
    52 rue de Marclan                                                                                                                           Sam Sung Macro Software
    Batiment Le Verdi                                                                                                                           42 Apple Road
    31600 MURET                                                                                                                                 12345 Amazonia
    FRANCE                                                                                                                                      California
    Tel: (+33)6 59 49 60 76                                                                                                                     USA
 
    Mail: esales@orpalis.com
    Site : https://orpalis.com
 
 
 
 
                                                                                                                                    REFERENCE       DATE            CUSTOMER VAT
 
INVOICE                                                                                                                             CIN22060001     02 June 2022
 
 
                                                                                                                                                                                                            Total Price
                                Description                                                                                                                         Qty             Unit Price
                                                                                                                                                                                                            (Excluding taxes)
 
    GdPicture.NETSDKUltimate - Annual Subscription - Worldwide license                                                                                              1               150.96                           150.96
 
 
 
 
 
 
 
 
 
 
                                                                                                                                                                    SUBTOTAL (Excl. taxes)          150.96
                                                                                                                                                                    VAT
                                                                                                                                                                    TOTAL PRICE                     150.96
                                                                                                                                                                    (incl. taxes)
VAT:                                                                                                                                                                Currency                        USD
Country outside the European Union: Tx exempt in accordance with the Art 262! of the code général des impéts.
 
Payment methods:
Payment methods: Wire transfer/Check in USD only
IBAN: FR76 1027 8022 4900 0203 3340 470
BIC: CMCIFR2A
Bank Name: CCM COLOMIERS PERGET
Bank Address: 1 B R A LAURENT DE LAVOISIER, 31770 COLOMIERS
 
 
Payment terms:
 
Early payment discount: none.
Any late payment as of the Due Date will automatically suspend the License key(s) and/or Maintenance contract(s}, as applicable.
in addition: late payment penalties will automatically apply to any portion of the Price that remains unpaid as of the due date.
The rate of late payment penalties is equal to 12% per year. A Jump sum recovery payment of 40 euros will apply for each late payment to recover.
 
                                                                                                Thank you for your Business
 
                                                                                    SAS au capital social de 201.732 €, RCS 891 177 560 Toulouse
                                                                                                        SIREN 891177560
                                                                                        N° TVA intracommunautaire FR 68891177560
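
Once the text is available as a string, simple pattern matching can turn it into structured data. The following is a minimal sketch using System.Text.RegularExpressions on a few lines copied from the output above. The InvoiceFields record, the InvoiceParser class, and the patterns are illustrative assumptions tuned to this one invoice, not a general-purpose parser.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical container for the fields pulled out of the text.
public record InvoiceFields(string? Reference, string? Iban, decimal? Total);

public static class InvoiceParser
{
    public static InvoiceFields Parse(string text)
    {
        // Reference: the token following "INVOICE" on the same line.
        Match reference = Regex.Match(text, @"^INVOICE[ \t]+(\S+)", RegexOptions.Multiline);

        // IBAN: letters, digits and spaces after "IBAN:"; spaces are removed below.
        Match iban = Regex.Match(text, @"IBAN:[ \t]*([A-Z0-9 ]+)");

        // Total: the amount on the upper-case "TOTAL PRICE" line.
        Match total = Regex.Match(text, @"TOTAL PRICE[ \t]+(\d+(?:\.\d+)?)");

        return new InvoiceFields(
            reference.Success ? reference.Groups[1].Value : null,
            iban.Success ? iban.Groups[1].Value.Replace(" ", "") : null,
            total.Success
                ? decimal.Parse(total.Groups[1].Value, CultureInfo.InvariantCulture)
                : (decimal?)null);
    }
}

public static class Program
{
    public static void Main()
    {
        // A few lines taken from the extracted text (whitespace shortened).
        string text = string.Join("\n",
            "INVOICE        CIN22060001     02 June 2022",
            "        TOTAL PRICE                     150.96",
            "IBAN: FR76 1027 8022 4900 0203 3340 470",
            "BIC: CMCIFR2A");

        InvoiceFields fields = InvoiceParser.Parse(text);
        Console.WriteLine($"Reference: {fields.Reference}");
        Console.WriteLine($"IBAN: {fields.Iban}");
        Console.WriteLine($"Total: {fields.Total}");
    }
}
```

Because the input comes from OCR, recognition errors such as "impéts" in the output above can also affect field values, so extracted fields are worth validating before they drive any decision.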

Final remarks

If you don’t run the OCR endpoint before you try to extract the text, you will get an empty list for the extracted text. This is because, as mentioned above, image-based PDF documents need an OCR layer before any text can be extracted.
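
A defensive check can make this failure mode explicit. The sketch below assumes the per-page text has already been copied into plain strings. TextLayerCheck and NeedsOcr are hypothetical helper names, not part of the PassportPDF SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextLayerCheck
{
    // True when extraction returned no pages, or only empty/whitespace pages,
    // which suggests the document has no text layer yet.
    public static bool NeedsOcr(IReadOnlyCollection<string> pageTexts) =>
        pageTexts.Count == 0 || pageTexts.All(string.IsNullOrWhiteSpace);
}

public static class Program
{
    public static void Main()
    {
        // Simulated extraction results for a scanned and a text-based document.
        var scanned = new List<string>();
        var textBased = new List<string> { "INVOICE CIN22060001" };

        Console.WriteLine(TextLayerCheck.NeedsOcr(scanned));   // True
        Console.WriteLine(TextLayerCheck.NeedsOcr(textBased)); // False
    }
}
```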

For more information about the endpoints used in this tutorial, please visit the PassportPDF API reference.
