

Extracting Form Fields from a Multi-Page PDF AWS Textract and .NET
source link: https://nodogmablog.bryanhogan.net/2023/02/extracting-form-fields-from-a-multi-page-pdf-aws-textract-and-net/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Extracting Form Fields from a Multi-Page PDF AWS Textract and .NET
Want to talk with other .NET on AWS developers, ask questions, and share what you know? Join us on Slack!
Download full source code.
In my previous post, I showed how to extract key-value pairs from an image. The Textract client sent a request to the service, and the service returned the results promptly.
However, with a PDF, the file must be uploaded to S3. The request to the Textract service returns a job id. The application must then poll for the results.
Keep in mind this is a demo, and there are some limitations in the code - duplicated keys are removed, as are keys without values.
The pdf to process
If you want to know more about the process for extracting key-value pairs from a form, see the Textract documentation.
Like the previous post, I use Blazor to load the form and display the extracted key-value pairs. My Blazor skills are limited, so don’t copy/paste the code.
The attached zip has the full source code, so I won’t go through it all here, instead, I’ll show only a few snippets.
Using Textract
Because the file is a PDF, I need to upload it to S3 before it can be processed. Pass the Textract client to the Razor page using dependency injection.
@inject IAmazonTextract TextractClient
@inject IAmazonS3 S3Client
Uploading the file to S3 requires you to have an S3 bucket in place already. See this blog post for more information on creating an S3 bucket.
PutObjectRequest putRequest = new PutObjectRequest
{
BucketName = "textract-blog-posts", // you won't be able to use this bucket name
Key = sourcePdf.Name,
InputStream = sourcePdf.OpenReadStream(1024000),
ContentType = sourcePdf.ContentType
};
PutObjectResponse response = await S3Client.PutObjectAsync(putRequest);
Then I send a request to Textract to process the file. The response contains a job id, which I use to poll for results. In this example, I am using a simple while loop, but if you are building a production application, you should use a more robust and scalable approach.
var startDocumentAnalysisRequest = new StartDocumentAnalysisRequest
{
DocumentLocation = new DocumentLocation
{
S3Object = new Amazon.Textract.Model.S3Object
{
Bucket = "textract-blog-posts",
Name = sourcePdf.Name
}
},
FeatureTypes = new List<string> { "FORMS" }
};
var startDocumentAnalysisResponse = await textractClient.StartDocumentAnalysisAsync(startDocumentAnalysisRequest);
GetDocumentAnalysisResponse getDocumentAnalysisResponse;
while (true)
{
getDocumentAnalysisResponse = await textractClient.GetDocumentAnalysisAsync(new GetDocumentAnalysisRequest
{
JobId = startDocumentAnalysisResponse.JobId
});
if(getDocumentAnalysisResponse.JobStatus != JobStatus.IN_PROGRESS)
{
break;
}
await Task.Delay(5000);
}
pagesKeysValues = getDocumentAnalysisResponse.GetKeyValuePairs();
The last line in the above code calls an extension method to extract the key-value pairs from the response, see the attached zip for the source code. The extension method is for demo purposes only. For a production application, you should read the Textract documentation and implement your own logic.

Output of Textract
Download full source code.
Recommend
-
46
Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and informa...
-
6
PDF Text Extraction Is Hard Even for AWS Textract Mar 5, 2020 I have always found that serendipity plays a large rol...
-
19
Using JavaScript in PDF Form Fields
-
5
PHP: Do not Email Form Fields Left in White advertisements Currently I am making an online enquiry form with a set of fields that are non-mand...
-
5
Best practice for date-of-birth form fieldsWhat the evidence says vs. OS pattern libraries
-
2
Creating a multi-page form using MobX with Meteor & ReactJune 02, 2016 · 3 min read ·
-
7
Intelligently Extract Text & Data with OCR
-
7
Building an OCR service with Amazon Textract and AWS Lambda Are you looking for a good way to extract text from PDFs and images? What about extracting text from tables? If you have these questions in mind, you...
-
9
AWS adds AI features to Textract, Transcribe and Kendra At re:Invent 2022, the cloud services provider also updated its HealthLake and CodeWhisperer serv...
-
9
Extracting Text from an Image with AWS Textract and .NETWant to talk with other .NET on AWS developers, ask questions, and share what you know?
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK