Introducing document-level sync studies: Enhanced information sync visibility in Amazon Q Enterprise


Amazon Q Business is a totally managed, generative synthetic intelligence (AI)-powered assistant that helps enterprises unlock the worth of their information and information. With Amazon Q, you possibly can rapidly discover solutions to questions, generate summaries and content material, and full duties through the use of the data and experience saved throughout your organization’s varied information sources and enterprise programs. On the core of this functionality are native information supply connectors that seamlessly combine and index content material from a number of repositories right into a unified index. This permits the Amazon Q massive language mannequin (LLM) to supply correct, well-written solutions by drawing from the consolidated information and knowledge. The info supply connectors act as a bridge, synchronizing content material from disparate programs like Salesforce, Jira, and SharePoint right into a centralized index that powers the pure language understanding and generative talents of Amazon Q.

Clients respect that Amazon Q Enterprise securely connects to over 40 information sources. Whereas utilizing their information supply, they need higher visibility into the doc processing lifecycle throughout information supply sync jobs. They need to know the standing of every doc they tried to crawl and index, in addition to the power to troubleshoot why sure paperwork weren’t returned with the anticipated solutions. Moreover, they need entry to metadata, timestamps, and entry management lists (ACLs) for the listed paperwork.

We’re happy to announce a brand new function now out there in Amazon Q Enterprise that considerably improves visibility into information supply sync operations. The newest launch introduces a complete document-level report integrated into the sync historical past, offering directors with granular indexing standing, metadata, and ACL particulars for each doc processed throughout a knowledge supply sync job. This enhancement to sync job observability permits directors to rapidly examine and resolve ingestion or entry points encountered whereas establishing an Amazon Q Enterprise utility. The detailed doc studies are persevered within the new SYNC_RUN_HISTORY_REPORT log stream underneath the Amazon Q Enterprise utility log group, so vital sync job particulars can be found on-demand when troubleshooting.

Lifecycle of a doc in a knowledge supply sync run job

On this part, we study the lifecycle of a doc inside a knowledge supply sync in Amazon Q Enterprise. This gives useful perception into the sync course of. The info supply sync includes three key levels: crawling, syncing, and indexing. Crawling includes the connector connecting to the info supply and extracting paperwork assembly the outlined sync scope in keeping with the info supply configuration. These paperwork are then synced to Amazon Q Enterprise in the course of the syncing part. Lastly, indexing makes the synced paperwork searchable inside the Amazon Q Enterprise surroundings.

The next diagram reveals a flowchart of a sync run job.

Crawling stage

The primary stage is the crawling stage, the place the connector crawls all paperwork and their metadata from the info supply. Throughout this stage, the connector additionally compares the checksum of the doc towards the Amazon Q index to determine if a specific doc must be added, modified, or deleted from the index. This operation corresponds to the CrawlAction discipline within the sync run historical past report.

If the doc is unmodified, it’s marked as UNMODIFIED and skipped in the remainder of the levels. If any doc fails within the crawling stage, for instance on account of throttling errors, damaged content material, or if the doc measurement is just too massive, that doc is marked as failed within the sync run historical past report with the CrawlStatus as FAILED. If the doc was skipped on account of any validation errors, its CrawlStatus is marked as SKIPPED. These paperwork should not despatched ahead to the following stage. All profitable paperwork are marked as SUCCESS and are despatched ahead.

We additionally seize the ACLs and metadata on every doc on this stage to have the ability to add it to the sync run historical past report.

Syncing stage

Through the syncing stage, the doc is shipped to Amazon Q Enterprise ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a doc is submitted to those APIs, Amazon Q Enterprise runs validation checks on the submitted paperwork. If any doc fails these checks, its SyncStatus is marked as FAILED. If there may be an irrecoverable error for a specific doc, it’s marked as SKIPPED and different paperwork are despatched ahead.

Indexing stage

On this step, Amazon Q Enterprise parses the doc, processes it in keeping with its content material kind, and persists it within the index. If the doc fails to be persevered, its IndexStatus is marked as FAILED; in any other case, it’s marked as SUCCESS.

After the statuses of all of the levels have been captured, we emit these statuses as an Amazon Cloudwatch occasion to the client’s AWS account.

Key options and advantages of document-level studies

The next are the important thing options and advantages of the brand new doc stage report in Amazon Q Enterprise functions:

  • Enhanced sync run historical past web page – A brand new Actions column has been added to the sync run historical past web page, offering entry to the document-level report for every sync run.
  • Devoted log stream – A brand new log stream named SYNC_RUN_HISTORY_REPORT has been created within the Amazon Q Enterprise CloudWatch log group, containing the document-level report.
  • Complete doc data – The document-level report consists of the next data for every doc.
  • Doc ID – That is the doc ID that’s inherited immediately from the info supply or mapped by the client within the information supply discipline mappings.
  • Doc title – The title of the doc is taken from the info supply or mapped by the client within the information supply discipline mappings.
  • Consolidated doc standing (SUCCESS, FAILED, or SKIPPED) – That is the ultimate consolidated standing of the doc. It may have a price of SUCCESS, FAILED, or SKIPPED. If the doc was efficiently processed in all levels, then the worth is SUCCESS. If the doc has failed or was skipped in any of the levels, then the worth of this discipline might be FAILED or SKIPPED.
  • Error message (if the doc failed) – This discipline accommodates the error message with which a doc failed. If a doc was skipped on account of throttling errors, or any inner errors, this might be proven within the error message discipline.
  • Crawl standing – This discipline denotes whether or not the doc was crawled efficiently from the info supply. This standing correlates to the syncing-crawling state within the information supply sync.
  • Sync standing – This discipline denotes whether or not the doc was despatched for syncing efficiently. This correlates to the syncing-indexing state within the information supply sync.
  • Index standing – This discipline denotes whether or not the doc was efficiently persevered within the index.
  • ACLs – This discipline accommodates a listing of document-level permissions that have been crawled from the info supply. The small print of every aspect within the listing are:
    • International identify: That is the e-mail/username of the consumer. This discipline is mapped throughout a number of information sources. For instance, if a consumer has 3 information sources – Confluence, Sharepoint and Gmail with the native consumer ID as confluence_user, sharepoint_user and gmail_user respectively, and their electronic mail handle consumer@electronic mail.com is the globalName within the ACL for all of them; then Amazon Q Enterprise understands that every one of those native consumer IDs map to the identical international identify.
    • Title: That is the native distinctive ID of the consumer which is assigned by the info supply.
    • Kind: This discipline signifies the principal kind. This may be both USER or GROUP.
    • Is Federated: It is a boolean flag which signifies whether or not the group is of INDEX stage (true) or DATASOURCE stage (false).
    • Entry: This discipline signifies whether or not the consumer has entry allowed or denied explicitly. Values could be both ALLOWED or DENIED.
    • Information supply ID: That is the info supply ID. For federated teams (INDEX stage), this discipline might be null.
  • Metadata – This discipline accommodates the metadata fields (aside from ACL) that have been pulled from the info supply. This listing additionally consists of the metadata fields mapped by the client within the information supply discipline mappings in addition to additional metadata fields added by the connector.
  • Hashed doc ID (for troubleshooting help) – To safeguard your information privateness, we current a safe, one-way hash of the doc identifier. This encrypted worth permits the Amazon Q Enterprise staff to effectively find and analyze the precise doc inside our logs, do you have to encounter any concern that requires additional investigation and determination.
  • Timestamp – The timestamp signifies when the doc standing was logged in CloudWatch.

Within the following sections, we discover completely different use instances for the logging function.

Troubleshoot “Sorry, I couldn’t discover related data” with the new logging feature

The brand new document-level logging function in Amazon Q Enterprise will help troubleshoot widespread points associated to the “Sorry, I couldn’t discover related data to finish your request” response.

Let’s discover an instance state of affairs. A mutual funds supervisor makes use of Amazon Q Enterprise chat for information retrieval and insights extraction throughout their enterprise information shops. When the fund supervisor asks, “What’s the CAGR of the multi-asset fund?” within the Amazon Q chat, they obtain the “Sorry, I couldn’t discover related data to finish your request” response.

Because the administrator managing their Amazon Q Enterprise utility, you possibly can troubleshoot the difficulty utilizing the next method with the brand new logging function. First, you need to decide whether or not the multi-asset fund doc was efficiently listed within the Amazon Q Enterprise utility. Subsequent, it is advisable confirm if the fund supervisor’s consumer account has the required permission to learn the data from the multi-asset fund doc. Amazon Q Enterprise enforces the doc permissions configured in its information supply, and you should utilize this new function to confirm that the doc ACL settings are synced within the Amazon Q Enterprise utility index.

You should utilize the next CloudWatch question string to test the doc ACL settings:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/' 
and DocumentTitle = "your-document-title"
| fields DocumentTitle, ConnectorDocumentStatus.Standing, Acl
| kind @timestamp desc
| restrict 1

This question filter makes use of the per-document-level logging stream SYNC_RUN_HISTORY_REPORT, and shows the doc title and its related ACL settings. By verifying the doc indexing and permissions, you possibly can establish and resolve potential points which may be inflicting the “Sorry, I couldn’t discover related data” response.

The next screenshot reveals an instance outcome.

Decide the optimum boosting period for current paperwork in utilizing document-level reporting

In the case of producing correct solutions, you might need to fine-tune the way in which Amazon Q prioritizes its content material. As an example, you might desire to spice up current paperwork over older ones to ensure essentially the most up-to-date passages are used to generate a solution. To realize this, you should utilize the enterprise’s relevance tuning function in Amazon Q Enterprise to spice up paperwork based mostly on the final replace date attribute, with a specified boosting period. Nevertheless, figuring out the optimum boosting interval could be difficult when coping with a lot of ceaselessly altering paperwork.

Now you can use the per-document-level report back to get hold of the _last_updated_at metadata discipline data to your paperwork, which will help you identify the suitable boosting interval. For this, you employ the next CloudWatch Logs Insights question to retrieve the _last_updated_at metadata attribute for machine studying paperwork from the SYNC_RUN_HISTORY_REPORT log stream:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/' 
and Metadata like 'Machine Studying'
| parse Metadata '{"key":"_last_updated_at","worth":{"dateValue":"*"}}' as @last_updated_at
| kind @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the previous question, you possibly can achieve insights into the final up to date timestamps of your paperwork, enabling you to make knowledgeable selections in regards to the optimum boosting interval. This method makes certain your chat responses are generated utilizing the latest and related data, enhancing the general accuracy and effectiveness of your Amazon Q Enterprise implementation.

The next screenshot reveals an instance outcome.

Widespread doc indexing observability and troubleshooting strategies

On this part, we discover some widespread admin duties for observing and troubleshooting doc indexing utilizing the brand new document-level reporting function.

Listing all efficiently listed paperwork from a information supply

To retrieve a listing of all paperwork which were efficiently listed from a selected information supply, you should utilize the next CloudWatch question:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/'
and ConnectorDocumentStatus.Standing = "SUCCESS"
| kind @timestamp desc | dedup DocumentTitle, DocumentId

The next screenshot reveals an instance outcome. 

Listing all efficiently listed paperwork from a information supply sync job

To retrieve a listing of all paperwork which were efficiently listed throughout a selected sync job, you should utilize the next CloudWatch question:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Standing AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Standing = "SUCCESS"
| kind DocumentTitle

The next screenshot reveals an instance outcome.

Listing all failed listed paperwork from a information supply sync job

To retrieve a listing of all paperwork that didn’t index throughout a selected sync job, together with the error messages, you should utilize the next CloudWatch question:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Standing AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id'
and ConnectorDocumentStatus.Standing = "FAILED"
| kind @timestamp desc

The next screenshot reveals an instance outcome.

Listing all paperwork that accommodates a specific consumer identify ACL permission from an Amazon Q Enterprise utility

To retrieve a listing of paperwork which have a selected consumer’s ACL permission, you should utilize the next CloudWatch Logs Insights question:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/' 
and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| show DocumentTitle, SourceUri

The next screenshot reveals an instance outcome.

 Listing the ACL of an listed doc from a information supply sync job

To retrieve the ACL data for a selected listed doc from a sync job, you should utilize the next CloudWatch Logs Insights question:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id' 
and DocumentTitle = "your-document-title"
| show DocumentTitle, Acl

The next screenshot reveals an instance outcome.

Listing metadata of an listed doc from a information supply sync job

To retrieve the metadata data for a selected listed doc from a sync job, you should utilize the next CloudWatch Logs Insights question:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id' 
and DocumentTitle = "your-document-title"
| show DocumentTitle, Metadata

The next screenshot reveals an instance outcome.

Conclusion

The newly launched document-level report in Amazon Q Enterprise gives enhanced visibility and observability into the doc processing lifecycle throughout information supply sync jobs. This function addresses a vital want expressed by clients for higher troubleshooting capabilities and entry to detailed details about the indexing standing, metadata, and ACLs of particular person paperwork.

The document-level report is saved in a devoted log stream named SYNC_RUN_HISTORY_REPORT inside the Amazon Q Enterprise utility CloudWatch log group. This report accommodates complete data for every doc, together with the doc ID, title, total doc sync standing, error messages (if any), together with its ACLs, and metadata data retrieved from the info sources. The info supply sync run historical past web page now consists of an Actions column, offering entry to the document-level report for every sync run. This function considerably improves the power to troubleshoot points associated to doc ingestion and entry management, and points associated to metadata relevance, and gives higher visibility in regards to the paperwork synced with an Amazon Q index.

To get began with Amazon Q Enterprise, discover the Getting started information. To be taught extra about information supply connectors and finest practices, see Configuring Amazon Q Business data source connectors.


Concerning the authors

Aneesh Mohan is a Senior Options Architect at Amazon Internet Companies (AWS), bringing twenty years of expertise in creating impactful options for business-critical workloads. He’s keen about expertise and loves working with clients to construct well-architected options, specializing in the monetary providers trade, AI/ML, safety, and information applied sciences.

Ashwin Shukla is a Software program Improvement Engineer II on the Amazon Q for Enterprise and Amazon Kendra engineering staff, with 6 years of expertise in growing enterprise software program. On this position, he works on designing and growing foundational options for Amazon Q for Enterprise.

Leave a Reply

Your email address will not be published. Required fields are marked *