How to Detect and Extract Table Data as JSON from PDF Documents in C#

source link: https://www.textcontrol.com/blog/2023/08/15/how-to-detect-and-extract-table-data-as-json-from-pdf-documents-in-csharp/

Extracting tabular data is a typical use case when importing "digitally born" PDF documents such as invoices. Once an invoice has been read, the recognized data can be matched with the corresponding purchase order data in an ERP system.

Import PDF Documents

Like any other supported file type, TX Text Control can import digitally born Adobe PDF documents. Typically, you can use this approach to search PDF documents for strings in document pages and extract the contents of form fields.

Learn More

The following article shows different ways to extract data from existing PDF documents:

Extract Text and Data from PDF Documents in C#

Unlike in the article above, the document here is actually loaded into TX Text Control so that the code can loop through the recognized tables. Consider the following invoice PDF document.

[Figure: Extracting Tables from PDFs (the sample invoice PDF document)]

In this case, the TX Text Control will recognize the table that is highlighted in red.

[Figure: Extracting Tables from PDFs (the invoice with the recognized table highlighted in red)]

Looping through Tables

We use the PDFImportSettings to specify that the TX Text Control should import the content and format it in such a way that tables are recognized as well. The following code loads the PDF document and loops through all of the tables it finds.
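
A minimal sketch of that step might look as follows. It assumes the TX Text Control .NET Server assemblies are referenced, that invoice.pdf is a placeholder file name, and that GenerateParagraphs is the PDFImportSettings value that makes tables recognizable; verify these against the product documentation before relying on them.

```csharp
using System;

// Assumes a reference to the TX Text Control .NET Server assemblies.
using (var tx = new TXTextControl.ServerTextControl())
{
    tx.Create();

    // Assumption: GenerateParagraphs is the import mode under which tables are recognized.
    var loadSettings = new TXTextControl.LoadSettings
    {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
    };

    // "invoice.pdf" is a placeholder path.
    tx.Load("invoice.pdf", TXTextControl.StreamType.AdobePDF, loadSettings);

    // Visit every table that was recognized during the import.
    foreach (TXTextControl.Table table in tx.Tables)
    {
        Console.WriteLine(
            $"Table {table.ID}: {table.Rows.Count} rows, {table.Columns.Count} columns");
    }
}
```

In a real application, the loop body would hand each table to a conversion method such as the Table2Data method described in the next section.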

Generate JSON

The Table2Data method loops through all rows and creates a Dictionary for each row. If the containsTableHeader parameter is set to true, the column names are taken from the first row. When finished, the list of dictionaries is returned as a JSON string.
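
The row-to-dictionary logic can be illustrated independently of TX Text Control. The sketch below is not the Table2Data method itself: it works on plain lists of strings, the name RowsToJson and the sample values are hypothetical, the fallback keys Column1, Column2, ... for the headerless case are an assumption, and System.Text.Json is used for serialization regardless of which serializer the original uses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: converts rows of cell text into a JSON array of objects.
static string RowsToJson(IReadOnlyList<IReadOnlyList<string>> rows, bool containsTableHeader)
{
    var data = new List<Dictionary<string, string>>();
    if (rows.Count == 0) return JsonSerializer.Serialize(data);

    // With a header row, its cells become the keys and the data starts at row 1.
    IReadOnlyList<string> header = containsTableHeader ? rows[0] : Array.Empty<string>();
    int firstDataRow = containsTableHeader ? 1 : 0;

    for (int r = firstDataRow; r < rows.Count; r++)
    {
        var record = new Dictionary<string, string>();
        for (int c = 0; c < rows[r].Count; c++)
        {
            // Assumption: without a header name for this column, fall back to "Column<n>".
            string key = c < header.Count ? header[c] : $"Column{c + 1}";
            record[key] = rows[r][c];
        }
        data.Add(record);
    }
    return JsonSerializer.Serialize(data);
}

// Made-up sample rows, not taken from the invoice in the article.
var rows = new List<IReadOnlyList<string>>
{
    new[] { "Item", "Qty" },
    new[] { "Widget", "2" }
};

Console.WriteLine(RowsToJson(rows, containsTableHeader: true));
Console.WriteLine(RowsToJson(rows, containsTableHeader: false));
```

With the header flag set, the first row supplies the keys and only the second row is emitted as a record; without it, both rows are emitted as records with generated keys.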

For the example document above, the following JSON is returned.

If the containsTableHeader parameter is set to false, the following JSON will be returned.

