Index website contents using the Amazon Q Web Crawler connector for Amazon Q Business


Amazon Q Business is a fully managed service that lets you build interactive chat applications using your enterprise data. These applications can generate answers based on your data or on the knowledge of a large language model (LLM). Your data is not used for training purposes, and the answers provided by Amazon Q Business are based solely on the data users have access to.

Enterprise data is often distributed across different sources, such as documents in Amazon Simple Storage Service (Amazon S3) buckets, database engines, websites, and more. In this post, we demonstrate how to create an Amazon Q Business application and index website contents using the Amazon Q Web Crawler connector for Amazon Q Business.

For this example, we use two data sources (websites). The first data source is an employee onboarding guide from a fictitious company, which requires basic authentication. We demonstrate how to set up authentication for the Web Crawler. The second data source is the official documentation for Amazon Q Business. For this data source, we demonstrate how to apply advanced settings, such as regular expressions, to instruct the Web Crawler to crawl only pages and links related to Amazon Q Business, ignoring pages related to other AWS services.

Overview of the Amazon Q Web Crawler connector

The Amazon Q Web Crawler connector makes it possible to crawl websites that use HTTPS and index their contents so you can build a generative artificial intelligence (AI) experience for your users based on the indexed data. This connector relies on the Selenium Web Crawler Package and a Chromium driver. The connector is fully managed, and updates to these components are applied automatically without your intervention.

This connector crawls and indexes the contents of webpages and attachments. Amazon Q Business supports multiple connectors, and each connector has its own properties and entities that it considers documents. In the context of the Web Crawler connector, a document refers to the contents of a single page or attachment. Separately, an index is commonly referred to as a corpus of documents; think of it as the place where you add and sync your documents for Amazon Q Business to use when generating answers to user requests.

Every document has its own attributes, also known as metadata. Metadata can be mapped to fields in your Amazon Q Business index. By creating index fields, you can boost results based on document attributes. For example, there might be use cases where you want to give more relevance to results from a specific category, department, or creation date.

Amazon Q Business data source connectors are designed to crawl the default attributes in your data source automatically. You can also add custom document attributes and map them to custom fields in your index. To learn more, see Mapping document attributes in Amazon Q Business.

For a better understanding of what is indexed by the Web Crawler connector, we present a list of metadata indexed from webpages and attachments.

The following table lists webpage metadata indexed by the Amazon Q Web Crawler connector.

Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type
Category | category | _category | String
URL | sourceUrl | _source_uri | String
Title | title | _document_title | String
Meta Tags | metaTags | wc_meta_tags | String List
File Size | htmlSize | wc_html_size | Long (numeric)

The following table lists attachment metadata indexed by the Amazon Q Web Crawler connector.

Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type
Category | category | _category | String
URL | sourceUrl | _source_uri | String
File Name | fileName | wc_file_name | String
File Type | fileType | wc_file_type | String
File Size | fileSize | wc_file_size | Long (numeric)
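
To make the two tables concrete, the following Python sketch restates them as dictionaries and renames the attributes of a crawled record to their reserved index fields. The to_index_fields helper and the sample record are hypothetical illustrations and are not part of the connector's API.

```python
# Data source field -> reserved Amazon Q Business index field (from the tables above).
WEBPAGE_FIELDS = {
    "category": "_category",
    "sourceUrl": "_source_uri",
    "title": "_document_title",
    "metaTags": "wc_meta_tags",
    "htmlSize": "wc_html_size",
}
ATTACHMENT_FIELDS = {
    "category": "_category",
    "sourceUrl": "_source_uri",
    "fileName": "wc_file_name",
    "fileType": "wc_file_type",
    "fileSize": "wc_file_size",
}


def to_index_fields(record, mapping):
    """Rename known data source fields to index fields; drop unmapped keys (hypothetical helper)."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Hypothetical webpage record, for illustration only.
page = {"title": "Welcome", "sourceUrl": "https://example.com/", "htmlSize": 2048, "lang": "en"}
indexed_page = to_index_fields(page, WEBPAGE_FIELDS)
```

In this sketch the unmapped lang key is dropped; in Amazon Q Business you would instead map a custom attribute to a custom index field.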

When configuring the data source for your website, you can use URLs or sitemaps, which can be defined either manually or using a text file stored in Amazon S3.
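
As an illustration of what a sitemap contributes, the sketch below extracts page URLs from a sitemap document using Python's standard library. The sample XML and the sitemap_urls helper are assumptions made for this example; the connector reads sitemaps on its own and you don't need to parse them yourself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/onboarding/day-one.html</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


urls = sitemap_urls(SAMPLE_SITEMAP)
```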

To enforce secure access to protected websites, the Amazon Q Web Crawler supports the following authentication types and standards:

  • Basic authentication
  • NTLM/Kerberos authentication
  • Form-based authentication
  • SAML authentication

Unlike other data source connectors, the Amazon Q Web Crawler connector doesn’t support access control list (ACL) crawling or identity crawling.

Finally, you have a range of options for configuring how and what data is synchronized. For example, you can choose to synchronize website domains only, website domains with subdomains only, or website domains with subdomains and the webpages included in links. Additionally, you can use regular expressions to filter which URLs to include or exclude in the crawling process.
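
The difference between the first two sync scopes can be sketched as a host comparison. The in_sync_scope function below is a hypothetical illustration of the idea, not the connector's actual logic, and it does not model the third option (following links to other sites).

```python
from urllib.parse import urlparse


def in_sync_scope(seed_url, candidate_url, include_subdomains):
    """Illustrative host check; not the connector's actual implementation."""
    seed_host = urlparse(seed_url).hostname or ""
    host = urlparse(candidate_url).hostname or ""
    if host == seed_host:
        return True
    # A subdomain ends with "." + the seed host, e.g. docs.example.com for example.com.
    return include_subdomains and host.endswith("." + seed_host)
```

For example, with the seed https://example.com/, a page on docs.example.com is in scope only when include_subdomains is True, and a page on an unrelated host is never in scope.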

Overview of solution

At a high level, this solution consists of an Amazon Q Business application that uses two data sources: a website hosting documents related to an employee onboarding guide, and the Amazon Q Business official documentation website. This solution demonstrates how to configure both websites as data sources for the Amazon Q Business application. The following steps will be performed:

  1. Deploy an AWS CloudFormation template containing a static website secured with basic authentication.
  2. Create an Amazon Q Business application.
  3. Create a Web Crawler data source for the Amazon Q Business documentation.
  4. Create a Web Crawler data source for the employee onboarding guide.
  5. Add groups and users to the Amazon Q Business application.
  6. Run sample queries to test the solution.

You can follow along using one or both data sources provided in this post or try your own URLs.

Prerequisites

To follow along with this demo, you should have the following prerequisites:

  • An AWS account with privileges to create Amazon Q Business applications and AWS Identity and Access Management (IAM) roles and policies
  • An IAM Identity Center instance with at least one user (and optionally, one or more groups)
  • If you decide to use a public website, make sure you have permission to crawl the website
  • Optionally, privileges to deploy CloudFormation templates

Deploy a CloudFormation template for the employee onboarding website secured with basic authentication

Deploying this CloudFormation template is optional, but we recommend using it so you can learn more about how the Web Crawler connector works with websites that require authentication.

We start by deploying a CloudFormation template. This template will create a simple static website secured with basic authentication.

  1. On the AWS CloudFormation console, choose Create stack and choose With new resources (standard).
  2. Select Choose an existing template.
  3. For Specify template, select Amazon S3 URL.
  4. For Amazon S3 URL, enter the URL https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16532/template-website.yml
  5. Choose Next.
  6. For Stack name, enter a name. For example, onboarding-website-for-q-business-sample.
  7. Choose Next.
  8. Leave all options in Configure stack options as default and choose Next.
  9. On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.

The deployment process will take a few minutes to complete. You can move to the next section of this post while it’s in progress. Keep this tab open; you’ll need to refer to the Outputs tab later.

Create an Amazon Q Business application

Before you start creating Amazon Q Business applications, you’re required to enable and configure an IAM Identity Center instance. This step is mandatory because Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business applications. If you don’t have an IAM Identity Center instance set up when trying to create your first application, you will see the option to create one, as shown in the following screenshot.

Create IAM Identity Center

If you already have an IAM Identity Center instance set up, you’re ready to start creating your first application by following these steps:

  1. On a new tab in your browser, open the Amazon Q Business console.
  2. Choose Get started or Create application (options will differ based on whether it’s your first time trying the service).
  3. For Application name, enter a name for your application, for example, my-q-business-app.
  4. For Service access, select Create and use a new service-linked role (SLR).
  5. Choose Create.
  6. For Retrievers, select Use native retriever.
  7. For Index provisioning, enter 1 for Number of units. One unit can index 20,000 documents (a document in this context is either a single page of content or a single attachment).
  8. Choose Next.
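
If you plan to crawl a larger site, the capacity figure in step 7 gives a quick way to estimate how many index units to provision. The following sketch is simple arithmetic based on the 20,000 documents per unit stated above; the index_units_needed helper is hypothetical.

```python
import math

DOCS_PER_UNIT = 20_000  # capacity of one index unit, as stated in step 7


def index_units_needed(document_count):
    """Smallest number of index units that can hold document_count documents (at least 1)."""
    return max(1, math.ceil(document_count / DOCS_PER_UNIT))
```

For example, 45,000 pages and attachments would need three units under this estimate.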

Create a Web Crawler data source for the Amazon Q Business documentation

After you complete the steps in the previous section, you should see the Connect data sources page, as shown in the following screenshot.

Connect data sources

If you closed the tab by accident, you can get to this page by navigating to the Amazon Q Business console, choosing your application name, and then choosing Add data source.

Let’s create the data source for the Amazon Q Business documentation website:

  1. On the Connect data sources page, choose Web crawler.
  2. For Data source name, enter a name, for example, q-business-documentation.
  3. For Description, enter a description.
  4. For Source, you have the option to provide either URLs or sitemaps. For this example, select Source URLs and enter the URL of the official documentation of Amazon Q: https://docs.aws.amazon.com/amazonq/

Starting point URLs can be added directly in this UI (up to 10), or you can use a file hosted in Amazon S3 to list up to 100 starting point URLs. Likewise, sitemap URLs can be added in this UI (up to three), or you can add up to three sitemap XML files hosted in Amazon S3.

We refer to source URLs as starting point URLs; later in this post, you’ll have the opportunity to define what gets crawled, for example, domains and subdomains that the webpages might link to. It’s worth mentioning that the Web Crawler connector can only work with HTTPS.
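
Before entering your own URLs, you can check them against the limits and the HTTPS requirement described above. The check_seed_urls helper below is a hypothetical pre-flight check written for this post; the console performs its own validation.

```python
from urllib.parse import urlparse

MAX_URLS_IN_UI = 10        # starting point URLs entered directly in the console
MAX_URLS_IN_S3_FILE = 100  # starting point URLs listed in a file in Amazon S3


def check_seed_urls(urls, from_s3_file=False):
    """Return a list of problems found in a list of starting point URLs (hypothetical helper)."""
    limit = MAX_URLS_IN_S3_FILE if from_s3_file else MAX_URLS_IN_UI
    problems = []
    if len(urls) > limit:
        problems.append(f"{len(urls)} URLs exceed the limit of {limit}")
    for url in urls:
        if urlparse(url).scheme != "https":
            problems.append(f"not HTTPS: {url}")
    return problems
```

An empty list means the URLs pass both checks.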

  1. Select No authentication in the Authentication section because this is a public website.
  2. The Web proxy section is optional, so we leave it empty.
  3. For Configure VPC and security group, select No VPC.
  4. In the IAM role section, choose Create a new service role.
  5. In the Sync scope section, for Sync domain range, select Sync domains with subdomains only.
  6. For Maximum file size, you can keep the default value of 50 MB.
  7. Under Additional configuration, expand Scope settings.
  8. Leave Crawl depth set to 2, Maximum links per page set to 999, and Maximum throttling set to 300.
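
To build intuition for the crawl depth and links-per-page settings, the sketch below walks a small in-memory link graph breadth first. It is an illustration only: the LINKS graph is made up, the starting point URL is assumed to count as depth 0 (the connector may count levels differently), and throttling is not modeled.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2", "/a/3"],
    "/a/1": ["/a/1/deep"],
    "/b": [],
}


def crawl(seed, links, max_depth=2, max_links_per_page=999):
    """Breadth-first walk that stops following links past max_depth."""
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in links.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


visited = crawl("/", LINKS)
```

With a depth of 2, the page /a/1/deep is never reached because it sits three links away from the starting point.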

If you open the Amazon Q official documentation, you’ll see that there are links to Amazon Q Developer documentation and other AWS services. Because we’re only interested in crawling Amazon Q Business, we need to instruct the crawler to focus only on relevant links and pages related to Amazon Q Business. To achieve this, we use regular expressions to define exactly what URLs the crawler should crawl.

  1. Under Crawl URL Patterns, enter the following expressions one by one, and choose Add:
    1. ^https://docs.aws.amazon.com/amazonq/$
    2. ^https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/.*.html$
    3. ^https://docs.aws.amazon.com/amazonq/latest/business-use-dg/.*.html$

List of URLs to crawl
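
You can try the crawl patterns against sample URLs before saving them. The sketch below uses Python's re module as a stand-in for the connector's regular expression engine, which may differ in details, and writes the patterns with the latest path segment used by AWS documentation URLs. The is_crawlable helper and the sample URLs are illustrative.

```python
import re

# The three crawl URL patterns for the Amazon Q Business documentation.
CRAWL_PATTERNS = [
    r"^https://docs.aws.amazon.com/amazonq/$",
    r"^https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/.*.html$",
    r"^https://docs.aws.amazon.com/amazonq/latest/business-use-dg/.*.html$",
]


def is_crawlable(url):
    """True if the URL matches at least one crawl pattern."""
    return any(re.search(pattern, url) for pattern in CRAWL_PATTERNS)
```

A page under qbusiness-ug matches, whereas a page under another guide on the same site does not. Note that an unescaped . matches any character, so these patterns are slightly more permissive than a literal dot; writing \. instead would be stricter.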

  1. In the Sync mode section, select Full sync. This option makes it possible to sync all contents regardless of their previous status.
  2. In the Sync run schedule section, you define how often Amazon Q Business should sync this data source. For Frequency, select Run on demand.

Choosing this option means you must manually run the sync operation; this option is suitable given the simplicity of this example. For production workloads, you’ll want to define a schedule tailored to your needs, for example, hourly, daily, or weekly, or you can define your own schedule using a cron expression.

  1. The Tags section is optional, so we leave it empty.

The default values in the Field mappings section can’t be modified at this point. They can only be modified after the application and retriever have been created.

  1. Choose Add data source and wait a few seconds while changes are applied.

After the data source is created, you’ll be shown the same interface you saw at the beginning of this section, with the note that one Web Crawler data source has been added. Keep this tab open, because you’ll create a second data source for the employee onboarding guide in the next section.

Web crawler added

Create a Web Crawler data source for the employee onboarding guide

Complete the following steps to create your second data source:

  1. On the Connect data sources page, choose Web crawler.
  2. Keep this tab open and navigate back to the AWS CloudFormation console tab and verify the stack’s status is CREATE_COMPLETE.
  3. If the status of the stack is CREATE_COMPLETE, choose the Outputs tab of the stack you deployed.
  4. Note the URL, user name, and password (the following screenshot shows sample values).

Website settings

  1. Choose the link for WebsiteURL.

Although unlikely, if the URL isn’t working, it might be because Amazon CloudFront hasn’t finished replicating the website. In that case, you should wait a few minutes and try again.

  1. Sign in with your user name and password.

Basic auth login form

You should now be able to browse the employee onboarding guide. Take a few minutes to get familiar with the contents of the website, because you’ll be asking your Amazon Q Business application questions about this content in a later step.

  1. Return to the browser tab where you’re creating the new data source.
  2. For Data source name, enter a name, for example, onboarding-guide.
  3. For Source, select Source URLs and enter the website URL you saved earlier.
  4. For Authentication, select Basic authentication.
  5. Under Authentication credentials, for AWS Secrets Manager secret, choose Create and add new secret.

Create and add secret

  1. For Secret name, enter a secret name of your choice.
  2. For User name and Password, use the values you saved earlier and make sure there are no extra whitespaces.
  3. Choose Save.

These credentials will be stored as a secret in AWS Secrets Manager.

Depending on the type of authentication you use, you’ll need certain fields present in your secret, as shown in the following table.

Authentication Type | Fields present in secret
Form based | username, password, userNameFieldXpath, passwordFieldXpath, passwordButtonXpath, loginPageUrl
NTLM | username, password
Basic auth | username, password
No Authentication | NA
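
If you create secrets outside the console, a quick check against this table can catch a missing field before a sync fails. The sketch below assumes the secret is stored as a JSON object of key/value pairs; the missing_secret_fields helper, the short authentication type labels, and the sample credentials are hypothetical.

```python
import json

# Required secret fields per authentication type, from the table above.
# The dictionary keys are labels chosen for this example.
REQUIRED_SECRET_FIELDS = {
    "form": ["username", "password", "userNameFieldXpath",
             "passwordFieldXpath", "passwordButtonXpath", "loginPageUrl"],
    "ntlm": ["username", "password"],
    "basic": ["username", "password"],
    "none": [],
}


def missing_secret_fields(auth_type, secret_json):
    """Return required fields that are absent or blank in a JSON secret string."""
    secret = json.loads(secret_json)
    return [field for field in REQUIRED_SECRET_FIELDS[auth_type]
            if not str(secret.get(field, "")).strip()]


# Placeholder credentials, for illustration only.
basic_secret = json.dumps({"username": "demo-user", "password": "demo-password"})
```

The same secret that is complete for basic authentication would be reported as missing the XPath and login page fields if you tried to use it for form-based authentication.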

  1. Leave the Web proxy section empty.
  2. Select No VPC in the Configure VPC and security group section.
  3. For IAM role, choose Create a new service role.
  4. Select Sync domains with subdomains only in the Sync scope section.
  5. Select Full sync in the Sync mode section.
  6. For Sync run schedule, choose Run on demand.
  7. Leave the sections Tags and Field mappings with their default values.
  8. Choose Add data source and wait a few seconds while changes are applied.

After changes are applied, the Connect data sources page shows two Web Crawler data sources have been added.

Two web crawlers have been added

  1. Scroll down to the end of the page and choose Next.

We have added our two data sources. In the next section, we add groups and users to our Amazon Q Business application.

Add groups and users to the Amazon Q Business application

Complete the following steps to add groups and users:

  1. On the Add groups and users page, choose Add groups and users.
  2. Select Assign existing users and groups and choose Next.

If you’ve completed the prerequisite of setting up IAM Identity Center, you’ve likely added at least one user. Although it’s not mandatory, we recommend creating multiple users and groups. This will allow you to fully explore and understand all the features of Amazon Q Business beyond what’s covered in this post.

If you haven’t added any users to your Identity Center directory, you can create them here by choosing Add new users. However, you’ll need to complete additional steps, such as setting up their passwords on the IAM Identity Center console. To fully benefit from this tutorial, we recommend having active users and groups by the time you reach this step.

  1. In the search bar, enter either the display name or group name you want to add to the application.

Start typing name

  1. Choose the user (or group) and choose Assign.

If you added a group, you’ll see it on the Groups tab. If you added a user, you’ll see it on the Users tab.

The next step is choosing a subscription for your groups or users.

  1. Select the user (or group) you just added, and on the Current subscription dropdown menu, choose your subscription tier. For this example, we choose Q Business Pro.

Assign Q Business license

This is a good time to get familiar with the Amazon Q Business subscription tiers and pricing. For this example, we use Q Business Pro, but you could also use a Q Business Lite subscription.

  1. In the Web experience service access section, select Create and use a new service role.

A web experience is the chat interface that your users will use to ask questions and perform tasks.

  1. Choose Create application.

After the application is created successfully, you’ll be redirected to the Amazon Q Business console, where you can see your new application. Your application is ready, but the data sources haven’t synced any data yet. We’ll do that in the next steps.

  1. Choose the name of your new application to open the Application Details.

Q Business Application

  1. In the Data sources section, select each data source and choose Sync now.

You will see the Current sync state for both data sources as Syncing. This process might take several minutes.

After the data sources are synced, you will see their Last sync status as Completed.

Sync completed

You’re now ready to test your application! Keep this page open because you’ll need it for the next steps.

Run sample queries to test the solution

At this point, you have created an Amazon Q Business application, added two data sources using the Amazon Q Web Crawler connector, added users to the application, and synchronized all data sources.

The next step is to go through the full user experience of logging in to the application and running a few test queries.

  1. On the Application Details page, navigate to the Web experience settings tab.
  2. Choose the link under Deployed URL.

Web experience settings tab

You’ll be redirected to the AWS access portal URL, which is set up by IAM Identity Center.

  1. Enter the user name of a user previously added to your Amazon Q Business application and choose Next.

You’re now in your Amazon Q Business app and ready to start asking questions!

  1. Enter your question (prompt) in the Enter a prompt text field and press Enter.

For this example, we start by asking questions related to the employee onboarding website.

Amazon Q Business Conversation

Amazon Q Business uses the onboarding guide data source you created earlier. If you choose Sources, you’ll see a list of in-text source citations in the form of a numbered list.

Now we ask questions related to the Amazon Q Business documentation.

Amazon Q Business conversation

Try it out with your own prompts!

Troubleshooting

In this section, we discuss several common issues and how to troubleshoot them:

  • Amazon Q Business isn’t answering your questions – If Amazon Q Business isn’t answering your questions, it’s likely because your data wasn’t indexed correctly. To confirm, check that the Last sync status of each data source is Completed.
  • The Web Crawler is unable to sync – If you used a starting point URL different from this post and the Web Crawler can’t sync, it might be due to permissions. If the website requires authentication, refer to the section where we create a data source for more information. Another common scenario is when settings on the web server or firewalls prevent the Web Crawler from accessing the data. Finally, it’s recommended to check whether a robots.txt file on your web server is explicitly denying access to the Web Crawler. For more details on how to configure a robots.txt file, refer to Configuring a robots.txt file for Amazon Q Business Web Crawler.
  • Amazon Q Business answers questions using old data – When you create a data source, you have the option to tell Amazon Q Business how often it should sync your data source with your index. During the creation of our data sources, we chose to sync the data sources manually (Run on demand), which means the sync process will occur only when we choose Sync now on our data source. For more information, refer to Sync run schedule.
  • Amazon Q Business provides an inaccurate answer or no answer at all – In situations where Amazon Q Business is providing an inaccurate answer, incomplete answers, or no answer at all, we recommend looking at the format of the data. Is the data part of an image? Is the data in a tabular format? Amazon Q Business works best with unstructured, plain text data.
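
For the robots.txt check mentioned above, Python's standard urllib.robotparser can evaluate a file's rules locally. In this sketch the robots.txt content is made up, and the user agent token amazon-QBusiness is an assumption; confirm the exact value in the AWS documentation. Python's parser applies the first matching rule, which may differ from how the Web Crawler resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The user agent token "amazon-QBusiness" is an
# assumption here; confirm the exact value in the AWS documentation.
ROBOTS_TXT = """\
User-agent: amazon-QBusiness
Disallow: /private/
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Evaluate robots.txt rules for a user agent with Python's standard parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this sample file, the assumed Web Crawler user agent can fetch pages outside /private/, and every other crawler is denied.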

Document enrichment

Although not covered in this post, we recommend exploring document enrichment. This functionality lets you manipulate and enrich document attributes before they are added to an index. The following are a few ideas for advanced applications of document enrichment:

  • Run an AWS Lambda function that sends your document to Amazon Textract. This service uses optical character recognition (OCR) to extract text from images containing handwriting, forms, tables, and more.
  • Use Amazon Transcribe to convert videos or audio files in your documents into text.
  • Use Amazon Comprehend to detect and redact personally identifiable information (PII).

Clean up

After you finish testing the solution, clean up the resources you created as part of this solution to avoid incurring additional costs.

Let’s start by deleting the Amazon Q Business application.

  1. On the Amazon Q Business console, select your application from the application list and on the Actions menu, choose Delete.

Delete Q Business application

  1. Confirm its deletion by entering Delete, then choose Delete.

You will be asked to complete an optional survey on your reasons for application deletion. You can select multiple reasons (or none), then choose Submit.

The next step is to delete the CloudFormation stack responsible for deploying the employee onboarding website we used as a data source.

  1. On the CloudFormation console, select the stack you created at the beginning of this walkthrough and choose Delete.

Delete Cloudformation stack

  1. Choose Delete to confirm the stack deletion.

The stack deletion might take a few minutes. When the deletion is complete, you’ll see the stack has been removed from your list of stacks.

Optionally, if you enabled IAM Identity Center only for this tutorial and want to delete your IAM Identity Center instance, follow these steps:

  1. On the IAM Identity Center console, choose Settings in the navigation pane.

IAM identity center settings

  1. Choose the Management tab.

IAM IDC management

  1. Choose Delete.
  2. Select the acknowledgement check boxes, enter your instance, and choose Confirm.

Conclusion

The Amazon Q Business Web Crawler lets you connect websites to your Amazon Q Business applications. This connector supports multiple forms of authentication (if required by your website) and can run sync jobs on a defined schedule.

To learn more about Amazon Q Business and its features, refer to the Amazon Q Business Developer Guide. For a comprehensive list of what can be done with this connector, refer to Connecting Web Crawler to Amazon Q Business.


About the Author

Guillermo Mansilla is a Senior Solutions Architect based in Orlando, Florida. He has had the opportunity to collaborate with startups and enterprise customers in the USA and Canada, assisting them in building and architecting their applications on AWS. Guillermo has developed a keen interest in serverless architectures and generative AI applications. Prior to his current role, he gained over a decade of experience working as a software developer. Away from work, Guillermo enjoys participating in chess tournaments at his local chess club, a pursuit that allows him to exercise his analytical skills in a different context.
