Extracting Form Fields from a Multi-Page PDF with AWS Textract and .NET

2 years ago

source link: https://nodogmablog.bryanhogan.net/2023/02/extracting-form-fields-from-a-multi-page-pdf-aws-textract-and-net/

Download full source code.

In my previous post, I showed how to extract key-value pairs from an image. The Textract client sent a request to the service, and the service returned the results promptly.

However, a PDF must first be uploaded to S3. The request to the Textract service then returns a job ID rather than the results, and the application must poll for the results.

Keep in mind this is a demo, and the code has some limitations: duplicated keys are removed, as are keys without values.

The PDF to process

If you want to know more about the process for extracting key-value pairs from a form, see the Textract documentation.
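
As a rough illustration of that process: Textract returns a flat list of blocks, and each form field appears as a pair of KEY_VALUE_SET blocks, one tagged KEY and one tagged VALUE. The KEY block links to its VALUE block through a VALUE relationship, and both link to their WORD blocks through CHILD relationships. The sketch below walks those links. It is not the extension method from the zip; the `Block` and `Relationship` records are simplified, hypothetical stand-ins for the SDK types of the same names, and keeping the first of any duplicated keys is just one assumed reading of the demo's limitations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for Amazon.Textract.Model.Relationship and Block.
public record Relationship(string Type, List<string> Ids);

public record Block(
    string Id,
    string BlockType,                         // e.g. "KEY_VALUE_SET" or "WORD"
    string? Text = null,
    List<string>? EntityTypes = null,         // "KEY" or "VALUE" on KEY_VALUE_SET blocks
    List<Relationship>? Relationships = null);

public static class KeyValueSketch
{
    public static Dictionary<string, string> Extract(IEnumerable<Block> blocks)
    {
        var all = blocks.ToList();
        var byId = all.ToDictionary(b => b.Id);
        var result = new Dictionary<string, string>();

        var keyBlocks = all.Where(b =>
            b.BlockType == "KEY_VALUE_SET" && (b.EntityTypes?.Contains("KEY") ?? false));

        foreach (var keyBlock in keyBlocks)
        {
            string key = ChildText(keyBlock, byId);

            // Follow the VALUE relationship to the matching VALUE block(s).
            string value = string.Join(" ", RelatedIds(keyBlock, "VALUE")
                .Where(id => byId.ContainsKey(id))
                .Select(id => ChildText(byId[id], byId))).Trim();

            // Mirror the demo's limitations: skip keys without values,
            // and keep only the first occurrence of a duplicated key.
            if (key.Length == 0 || value.Length == 0) continue;
            result.TryAdd(key, value);
        }
        return result;
    }

    private static IEnumerable<string> RelatedIds(Block block, string type) =>
        (block.Relationships ?? new List<Relationship>())
            .Where(r => r.Type == type)
            .SelectMany(r => r.Ids);

    // Joins the text of the child blocks; children without text
    // (for example checkboxes) are ignored in this sketch.
    private static string ChildText(Block block, Dictionary<string, Block> byId) =>
        string.Join(" ", RelatedIds(block, "CHILD")
            .Where(id => byId.ContainsKey(id))
            .Select(id => byId[id].Text)
            .Where(t => !string.IsNullOrEmpty(t)));
}
```

Calling `KeyValueSketch.Extract(blocks)` on the blocks of one page would give a dictionary of field labels to field values for that page.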

Like the previous post, I use Blazor to load the form and display the extracted key-value pairs. My Blazor skills are limited, so don’t copy/paste the code.

The attached zip has the full source code, so I won’t go through it all here. Instead, I’ll show only a few snippets.

Using Textract

Because the file is a PDF, I need to upload it to S3 before it can be processed. Pass the Textract and S3 clients to the Razor page using dependency injection.

    
@inject IAmazonTextract TextractClient
@inject IAmazonS3 S3Client

Uploading the file to S3 requires you to have an S3 bucket in place already. See this blog post for more information on creating an S3 bucket.

PutObjectRequest putRequest = new PutObjectRequest
{
    BucketName = "textract-blog-posts", // you won't be able to use this bucket name
    Key = sourcePdf.Name,
    InputStream = sourcePdf.OpenReadStream(1024000), // allow reads of up to about 1 MB
    ContentType = sourcePdf.ContentType
};

PutObjectResponse response = await S3Client.PutObjectAsync(putRequest);

Then I send a request to Textract to process the file. The response contains a job ID, which I use to poll for results. In this example, I use a simple while loop, but a production application should use a more robust and scalable approach, such as having Textract publish a completion notification to an Amazon SNS topic.

var startDocumentAnalysisRequest = new StartDocumentAnalysisRequest
{
    DocumentLocation = new DocumentLocation
    {
        S3Object = new Amazon.Textract.Model.S3Object
        {
            Bucket = "textract-blog-posts",
            Name = sourcePdf.Name
        }
    },
    FeatureTypes = new List<string> { "FORMS" }
};

var startDocumentAnalysisResponse = await TextractClient.StartDocumentAnalysisAsync(startDocumentAnalysisRequest);

// Poll every five seconds until the job leaves the IN_PROGRESS state.
GetDocumentAnalysisResponse getDocumentAnalysisResponse;
while (true)
{
    getDocumentAnalysisResponse = await TextractClient.GetDocumentAnalysisAsync(new GetDocumentAnalysisRequest
    {
        JobId = startDocumentAnalysisResponse.JobId
    });
    if (getDocumentAnalysisResponse.JobStatus != JobStatus.IN_PROGRESS)
    {
        break; // the job has finished: SUCCEEDED, FAILED, or PARTIAL_SUCCESS
    }
    await Task.Delay(5000);
}

pagesKeysValues = getDocumentAnalysisResponse.GetKeyValuePairs();

The last line in the above code calls an extension method that extracts the key-value pairs from the response; see the attached zip for its source code. The extension method is for demo purposes only. For a production application, you should read the Textract documentation and implement your own logic.
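
One thing a production version would likely need is handling for failed jobs and for paged results: the results of a large document can be split across several responses, each carrying a NextToken that points to the next one. The sketch below shows one possible shape for that logic. `JobPoller`, `Page<T>`, and the `fetchPage` delegate are hypothetical names introduced here so the example runs without the SDK; they are not part of the AWS SDK or the attached zip. The sketch also treats any final status other than SUCCEEDED (including PARTIAL_SUCCESS) as an error, which is an assumption you may want to relax.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical container for one response page: job status, items, and the next-page token.
public record Page<T>(string Status, IReadOnlyList<T> Items, string? NextToken);

public static class JobPoller
{
    // fetchPage receives a page token (null for the first page) and returns that page.
    public static async Task<List<T>> WaitAndCollectAsync<T>(
        Func<string?, Task<Page<T>>> fetchPage, TimeSpan pollInterval, int maxPolls)
    {
        // Phase 1: re-request the first page until the job leaves IN_PROGRESS.
        Page<T> page = await fetchPage(null);
        for (int polls = 1; page.Status == "IN_PROGRESS"; polls++)
        {
            if (polls >= maxPolls)
                throw new TimeoutException($"Job still in progress after {maxPolls} polls.");
            await Task.Delay(pollInterval);
            page = await fetchPage(null);
        }

        if (page.Status != "SUCCEEDED")
            throw new InvalidOperationException($"Job ended with status {page.Status}.");

        // Phase 2: follow the next-page tokens until there are no more pages.
        var items = new List<T>(page.Items);
        while (!string.IsNullOrEmpty(page.NextToken))
        {
            page = await fetchPage(page.NextToken);
            items.AddRange(page.Items);
        }
        return items;
    }
}

// Wiring this to the SDK might look like the following (assumes the injected
// TextractClient and the job ID returned by StartDocumentAnalysisAsync):
//
// var blocks = await JobPoller.WaitAndCollectAsync<Block>(async token =>
// {
//     var r = await TextractClient.GetDocumentAnalysisAsync(
//         new GetDocumentAnalysisRequest { JobId = jobId, NextToken = token });
//     return new Page<Block>(r.JobStatus.Value, r.Blocks, r.NextToken);
// }, TimeSpan.FromSeconds(5), maxPolls: 60);
```

Because the delegate hides the Textract call, the same loop can be exercised in a unit test with a fake `fetchPage`. Note that the result is a combined list of blocks rather than a single response, so key-value extraction would need to work on that list.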

Output of Textract
