Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra


Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer.

Amazon Kendra securely connects to over 40 data sources. When using your data source, you might want better visibility into the document processing lifecycle during data source sync jobs. This could include knowing the status of each document you attempted to crawl and index, as well as being able to troubleshoot why certain documents were not returned with the expected answers. Additionally, you might need access to metadata, timestamps, and access control lists (ACLs) for the indexed documents.

We’re pleased to announce a new feature now available in Amazon Kendra that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement to sync job observability enables administrators to quickly investigate and resolve ingestion or access issues encountered while setting up Amazon Kendra indexes. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so critical sync job details are available on demand when troubleshooting.

In this post, we discuss the benefits of this new feature and how it offers enhanced data sync visibility in Amazon Kendra.

Lifecycle of a document in a data source sync run job

In this section, we examine the lifecycle of a document within a data source sync in Amazon Kendra. This provides valuable insight into the sync process. The data source sync comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting documents that meet the defined sync scope according to the data source configuration. These documents are then synced to the Amazon Kendra index during the syncing phase. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

The following diagram shows a flowchart of a sync run job.

Crawling stage

The first stage is the crawling stage, where the connector crawls all documents and their metadata from the data source. During this stage, the connector also compares the checksum of the document against the Amazon Kendra index to determine if a particular document needs to be added, modified, or deleted from the index. This operation corresponds to the CrawlAction field in the sync run history report.

If the document is unmodified, it’s marked as UNMODIFIED and skipped in the rest of the stages. If any document fails in the crawling stage, for example due to throttling errors, broken content, or because the document is too large, that document is marked in the sync run history report with the CrawlStatus as FAILED. If the document was skipped due to any validation errors, its CrawlStatus is marked as SKIPPED. These documents are not sent to the next stage. All successful documents are marked as SUCCESS and are sent forward.

We also capture the ACLs and metadata on each document in this stage to be able to add them to the sync run history report.

Syncing stage

During the syncing stage, the document is sent to Amazon Kendra ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a document is submitted to these APIs, Amazon Kendra runs validation checks on the submitted documents. If any document fails these checks, its SyncStatus is marked as FAILED. If there is an irrecoverable error for a particular document, it’s marked as SKIPPED and other documents are sent forward.

Indexing stage

In this step, Amazon Kendra parses the document, processes it according to its content type, and persists it in the index. If the document fails to be persisted, its IndexStatus is marked as FAILED; otherwise, it’s marked as SUCCESS.

After the statuses of all the stages have been captured, we emit these statuses as an Amazon CloudWatch event to the customer’s AWS account.
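The stage-by-stage behavior described above can be sketched in a few lines of Python. This is a minimal illustration rather than Amazon Kendra's implementation: it assumes each document's stage results are available as a plain dictionary keyed by the CrawlStatus, SyncStatus, and IndexStatus field names, and the helper name consolidated_status is hypothetical.

```python
# Stage order described in the post: crawling, then syncing, then indexing.
STAGES = ("CrawlStatus", "SyncStatus", "IndexStatus")


def consolidated_status(stage_statuses):
    """Return the first non-SUCCESS stage status, or SUCCESS if all stages passed.

    Assumption: stage_statuses is a dict such as {"CrawlStatus": "SUCCESS", ...}.
    Stages after a FAILED or SKIPPED stage are never read, mirroring the rule
    that such documents are not sent forward.
    """
    for stage in STAGES:
        status = stage_statuses[stage]
        if status != "SUCCESS":
            return status
    return "SUCCESS"


if __name__ == "__main__":
    print(consolidated_status({"CrawlStatus": "SUCCESS", "SyncStatus": "FAILED"}))
```

A document that is skipped while crawling never reaches the later stages, so only its CrawlStatus needs to be present for the sketch to return SKIPPED.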

Key features and benefits of document-level reports

The following are the key features and benefits of the new document-level report in Amazon Kendra indexes:

  • Enhanced sync run history page – A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.

  • Dedicated log stream – A new log stream named SYNC_RUN_HISTORY_REPORT has been created in the Amazon Kendra CloudWatch log group, containing the document-level report.

  • Comprehensive document information – The document-level report includes the following information for each document:
    • Document ID – This is the document ID that’s inherited directly from the data source or mapped by the customer in the data source field mappings.
    • Document title – The title of the document is taken from the data source or mapped by the customer in the data source field mappings.
    • Consolidated document status (SUCCESS, FAILED, or SKIPPED) – This is the final consolidated status of the document. If the document was successfully processed in all stages, then the value is SUCCESS. If the document failed or was skipped in any of the stages, then the value of this field is FAILED or SKIPPED, respectively.
    • Error message (if the document failed) – This field contains the error message with which a document failed. If a document was skipped due to throttling errors or any internal errors, this is shown in the error message field.
    • Crawl status – This field denotes whether the document was crawled successfully from the data source. This status correlates to the syncing-crawling state in the data source sync.
    • Sync status – This field denotes whether the document was sent for syncing successfully. This correlates to the syncing-indexing state in the data source sync.
    • Index status – This field denotes whether the document was successfully persisted in the index.
    • ACLs – This field contains a list of document-level permissions that were crawled from the data source. The details of each element in the list are:
      • Global name – This is the email or user name of the user. This field is mapped across multiple data sources. For example, if a user has three data sources (Confluence, SharePoint, and Gmail) with the local user IDs confluence_user, sharepoint_user, and gmail_user respectively, and their email address user@email.com is the globalName in the ACL for all of them, then Amazon Kendra understands that all of these local user IDs map to the same global name.
      • Name – This is the local unique ID of the user, which is assigned by the data source.
      • Type – This field indicates the principal type. This can be either USER or GROUP.
      • Is Federated – This is a boolean flag that indicates whether the group is of INDEX level (true) or DATASOURCE level (false).
      • Access – This field indicates whether the user has access allowed or denied explicitly. Values can be either ALLOWED or DENIED.
      • Data source ID – This is the data source ID. For federated groups (INDEX level), this field is null.
    • Metadata – This field contains the metadata fields (other than ACL) that were pulled from the data source. This list also includes the metadata fields mapped by the customer in the data source field mappings as well as extra metadata fields added by the connector.
    • Hashed document ID (for troubleshooting assistance) – To safeguard your data privacy, we present a secure, one-way hash of the document identifier. This hashed value enables the Amazon Kendra team to efficiently locate and analyze the specific document within our logs, should you encounter any issue that requires further investigation and resolution.
    • Timestamp – The timestamp indicates when the document status was logged in CloudWatch.
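To make the global name mapping concrete, the following Python sketch groups local user IDs by their global name, using the Confluence, SharePoint, and Gmail example from the list above. It is illustrative only: the globalName key comes from this post, but the name key and the overall entry shape are assumptions about how an ACL element might look once parsed, and local_ids_by_global_name is a hypothetical helper.

```python
from collections import defaultdict


def local_ids_by_global_name(acl_entries):
    """Group local principal IDs under the global name they map to.

    Assumption: each ACL entry is a dict with "globalName" and "name" keys.
    """
    grouped = defaultdict(set)
    for entry in acl_entries:
        grouped[entry["globalName"]].add(entry["name"])
    # Sort for stable, readable output.
    return {global_name: sorted(ids) for global_name, ids in grouped.items()}


if __name__ == "__main__":
    entries = [
        {"globalName": "user@email.com", "name": "confluence_user"},
        {"globalName": "user@email.com", "name": "sharepoint_user"},
        {"globalName": "user@email.com", "name": "gmail_user"},
    ]
    print(local_ids_by_global_name(entries))
```

All three local IDs end up under the single global name, which is the relationship the Global name field is meant to express.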

In the following sections, we explore different use cases for the logging feature.

Determine the optimal boosting duration for recent documents using document-level reporting

When it comes to generating accurate answers, you may want to fine-tune the way Amazon Kendra prioritizes its content. For instance, you may prefer to boost recent documents over older ones to make sure the most up-to-date passages are used to generate an answer. To achieve this, you can use the relevance tuning feature in Amazon Kendra to boost documents based on the last update date attribute, with a specified boosting duration. However, determining the optimal boosting period can be challenging when dealing with a large number of frequently changing documents.

You can now use the per-document-level report to obtain the _last_updated_at metadata field information for your documents, which can help you determine the appropriate boosting period. For this, you use the following CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream.

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the preceding query, you can gain insights into the last updated timestamps of your documents, enabling you to make informed decisions about the optimal boosting period. This approach makes sure your chat responses are generated using the latest and most relevant information, enhancing the overall accuracy and effectiveness of your Amazon Kendra implementation.

The following screenshot shows an example result.
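Once the timestamps are retrieved, one way to turn them into a candidate boosting period is to look at how old the documents typically are. The following Python sketch computes the median document age in days. It is a heuristic for illustration, not AWS guidance: it assumes the Metadata field has been parsed into a list of key/value items shaped like the pattern in the preceding query, that dateValue is ISO 8601 text, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from statistics import median


def last_updated_at(metadata):
    """Return the _last_updated_at value from one document's Metadata list, or None."""
    for item in metadata:
        if item.get("key") == "_last_updated_at":
            raw = item["value"]["dateValue"]
            # Assumption: ISO 8601 text; a trailing "Z" is normalized for older Pythons.
            parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed
    return None


def median_age_days(metadata_per_document, now):
    """Median document age in days; documents without the attribute are ignored."""
    ages = []
    for metadata in metadata_per_document:
        updated = last_updated_at(metadata)
        if updated is not None:
            ages.append((now - updated).total_seconds() / 86400)
    return median(ages) if ages else None


if __name__ == "__main__":
    docs = [
        [{"key": "_last_updated_at", "value": {"dateValue": "2024-05-01T00:00:00Z"}}],
        [{"key": "_last_updated_at", "value": {"dateValue": "2024-04-21T00:00:00Z"}}],
    ]
    print(median_age_days(docs, datetime(2024, 5, 11, tzinfo=timezone.utc)))
```

A median is used here because a few very old documents would otherwise skew the estimate; whether that is the right statistic depends on how your content changes over time.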

Common document indexing observability and troubleshooting methods

In this section, we explore some common admin tasks for observing and troubleshooting document indexing using the new document-level reporting feature.

List all successfully indexed documents from a data source

To retrieve a list of all documents that have been successfully indexed from a specific data source, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc | dedup DocumentTitle, DocumentId

The following screenshot shows an example result.
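If you prefer to run this kind of query from a script rather than the console, the following Python sketch shows one possible shape. It assumes you pass in a CloudWatch Logs client (for example, one created with boto3.client("logs")) and the name of your Amazon Kendra index log group. The helper names build_success_query and run_insights_query are hypothetical, and the polling loop is deliberately simple.

```python
import time


def build_success_query(data_source_id):
    """Build the 'successfully indexed documents' query for one data source."""
    return (
        "fields DocumentTitle, DocumentId, @timestamp\n"
        f"| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/{data_source_id}/'\n"
        'and ConnectorDocumentStatus.Status = "SUCCESS"\n'
        "| sort @timestamp desc | dedup DocumentTitle, DocumentId"
    )


def run_insights_query(logs_client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0):
    """Start a Logs Insights query, wait for it to finish, and return rows as dicts.

    Assumption: logs_client exposes start_query and get_query_results in the
    same way as the boto3 CloudWatch Logs client.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"Query ended with status {response['status']}")
    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

Passing the client in as a parameter keeps the sketch free of credentials and makes it straightforward to substitute a stub when testing.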

List all successfully indexed documents from a data source sync job

To retrieve a list of all documents that have been successfully indexed during a specific sync job, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "SUCCESS"
| sort DocumentTitle

The following screenshot shows an example result.

List all failed indexed documents from a data source sync job

To retrieve a list of all documents that failed to index during a specific sync job, along with the error messages, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc

The following screenshot shows an example result.
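When a sync job produces many failures, it can help to see which error messages occur most often before looking at individual documents. The following Python sketch tallies failures by error message. It assumes the query results have already been converted to one dictionary per row with an ErrorMsg key (the field name used in the preceding query), and count_failures_by_error is a hypothetical helper.

```python
from collections import Counter


def count_failures_by_error(rows):
    """Return (error message, count) pairs, most frequent first.

    Assumption: each row is a dict such as
    {"DocumentTitle": "...", "ErrorMsg": "..."}; rows without a message
    are grouped under a placeholder label.
    """
    counts = Counter(row.get("ErrorMsg") or "(no error message)" for row in rows)
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"DocumentTitle": "a.pdf", "ErrorMsg": "Throttled"},
        {"DocumentTitle": "b.pdf", "ErrorMsg": "Throttled"},
        {"DocumentTitle": "c.pdf", "ErrorMsg": "Document too large"},
    ]
    for message, count in count_failures_by_error(sample):
        print(count, message)
```

The sample error strings are invented for the example and are not actual Amazon Kendra messages.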

List all documents that contain a user’s ACL permission from an Amazon Kendra index

To retrieve a list of documents that have a specific user’s ACL permission, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/'
and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| display DocumentTitle, SourceUri

The following screenshot shows an example result.

List the ACL of an indexed document from a data source sync job

To retrieve the ACL information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Acl

The following screenshot shows an example result.

List metadata of an indexed document from a data source sync job

To retrieve the metadata information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id'
and DocumentTitle = "your-document-title"
| display DocumentTitle, Metadata

The following screenshot shows an example result.

Conclusion

The newly introduced document-level report in Amazon Kendra provides enhanced visibility and observability into the document processing lifecycle during data source sync jobs. This feature addresses a critical need expressed by customers for better troubleshooting capabilities and access to detailed information about the indexing status, metadata, and ACLs of individual documents.

The document-level report is stored in a log stream named SYNC_RUN_HISTORY_REPORT within the Amazon Kendra index CloudWatch log group. This report contains comprehensive information for each document, including the document ID, title, overall document sync status, and error messages (if any), along with its ACLs and metadata information retrieved from the data sources. The data source sync run history page now includes an Actions column, providing access to the document-level report for each sync run. This feature significantly improves the ability to troubleshoot issues related to document ingestion, access control, and metadata relevance, and provides better visibility into the documents synced with an Amazon Kendra index.

To get started with Amazon Kendra, explore the Getting started guide. To learn more about data source connectors and best practices, see Creating a data source connector.


About the Authors

Aneesh Mohan is a Senior Solutions Architect at Amazon Web Services (AWS), with over 20 years of experience in architecting and delivering high-impact solutions for mission-critical workloads. His expertise spans the financial services industry, AI/ML, security, and data technologies. Driven by a deep passion for technology, Aneesh is dedicated to partnering with customers to design and implement well-architected, innovative solutions that address their unique business needs.

Ashwin Shukla is a Software Development Engineer II on the Amazon Q for Business and Amazon Kendra engineering team, with 6 years of experience in developing enterprise software. In this role, he works on designing and developing foundational features for Amazon Q for Business.
