Skip to main content
Version: 8.8 (unreleased)

IDP reference

Technical reference information for IDP, including technical architecture, supported documents, and known limitations.

Technical architecture

IDP offers a composable architecture that allows you to customize and extend IDP capabilities as needed. This flexibility enables you to adapt quickly to evolving business needs while maintaining a streamlined and manageable system.

IDP allows you to create, configure, and publish a document extraction template. This is a type of connector template.

Architecture diagram of IDP

The document extraction template integrates with Camunda document handling connectors and APIs such as Amazon S3, Amazon Textract, Amazon Comprehend, and Amazon Bedrock to retrieve, analyze, and process documents.

  1. Document upload: The template accepts uploaded documents as input. These documents can be uploaded to a local document store, and their references used in the extraction process. For example, the connector uploads the document to an Amazon S3 bucket for extraction.

  2. Amazon Textract: Uploaded documents are analyzed by Amazon Textract, which extracts text data and returns the results. The template configuration includes specifying the document, the S3 bucket name for temporary storage during Amazon Textract analysis, and other required parameters such as extraction fields and Amazon Bedrock Converse parameters.

  3. Amazon Bedrock: Your extraction field prompts are used by Amazon Bedrock to extract data from the document. The extracted content is mapped to process variables, and the results stored in a specified result variable.

note

Document storage

IDP stores documents as follows during the different extraction stages:

  • Web Modeler: Uploaded sample documents are stored within Web Modeler itself (SaaS) or the database (Self-Managed).
  • Cluster: During extraction testing (for example, when you click Extract document) the document is briefly stored in the cluster.
  • Extraction: Finally, when you extract content using a document extraction template, it is stored in an Amazon AWS S3 bucket, where it can be accessed by AWS Textract.

Document file formats

IDP currently only supports data extraction from the following uploaded document file formats.

File formatDescription

PDF

  • PDF documents must not be password protected.
  • Maximum document file size is 4MB for test extraction.

  • Both text and image content can be extracted from a PDF document. For example, data can be extracted from a scanned image that has been converted to PDF.

Document language support

IDP supports data extraction and processing of documents in multiple languages.

IDP integrates with Amazon Textract, which supports multilingual text extraction and is capable of detecting and extracting text in multiple languages. This ensures that the extracted text can be accurately mapped to process variables and used within your workflows, regardless of document language.

info

For example, as of February 2025, Amazon Textract can detect printed text and handwriting from the Standard English alphabet and ASCII symbols, and can extract printed text, forms and tables in English, German, French, Spanish, Italian and Portuguese. Refer to Amazon Textract FAQs for currently supported languages.

Extraction field data types

Specify the extraction field data type to indicate to the LLM what type of data it should be trying to extract. This helps the LLM more accurately analyze and extract the correct data.

For example, if you want to extract an expected numeric value (such as a monetary value), select the Number data type for the extraction field.

Supported data types

You can specify the following extraction field data types.

Data typeDescription
BooleanThe LLM should expect a true or false value, such as "yes" or "no".
NumberThe LLM should expect to extract a numeric value.
StringThe LLM should expect to extract a sequence of characters.

Extraction models

You can choose from the following supported LLM extraction models during data extraction.

Extraction modelModel providerDocumentation
Claude 3.5 SonnetAnthropicAnthropic's Claude in Amazon Bedrock
Claude 3 SonnetAnthropicAnthropic's Claude in Amazon Bedrock
Claude 3 HaikuAnthropicAnthropic's Claude in Amazon Bedrock
Llama 3 70B InstructMetaMeta's Llama in Amazon Bedrock
Llama 3 8B InstructMetaMeta's Llama in Amazon Bedrock
Titan Text PremierAmazon AWSAmazon Titan Text models

Validation status

During validation, a validation status is shown for extraction fields to indicate the accuracy of the extracted data.

IconStatusDescription
Pass iconPassThe document validation passed with accurate and expected results.
Caution iconCautionA test case is missing for comparison. Click Save test case to create a test case for this field.
Fail iconFailThe validation results do not match the expected output for the document. Click Review document to investigate and resolve.

Example

The following example shows the results of a partially successful extraction against three documents.

Example validation results table

The expanded contract_start_date field shows that each document returned different validation results.

  • The first document passed the validation, with the Extracted value matching the Expected test case output.
  • The second document could not be validated as a test case was not found for comparison. Click Save test case to create a test case for the document.
  • The third document failed validation as the Extracted value did not match the Expected test case output. Click Review document to open the document again and check the prompt for this field.