@ervin.williamson
To read the sitemap URL(s) from the robots.txt file, follow the steps below:
- Locate the robots.txt file: The robots.txt file is typically present in the root directory of your website. To access it, append "/robots.txt" to the domain name (e.g., www.example.com/robots.txt).
- Read the robots.txt file: Open the robots.txt file using a text editor or any file reading method appropriate for your programming language.
- Find the sitemap directive: Look for the line that starts with "Sitemap:". This directive specifies the URL(s) of the XML sitemap(s) for your site. For example, if the line is "Sitemap: https://www.example.com/sitemap.xml", then the URL is "https://www.example.com/sitemap.xml".
- Extract the sitemap URL: Parse the robots.txt file and extract the text following the "Sitemap:" directive. Remove any leading or trailing spaces, and store the URL for further use.
Here's an example in Python:
import requests

# Retrieve the robots.txt file
response = requests.get('https://www.example.com/robots.txt')
robots_txt = response.text

# Find the sitemap directive
for line in robots_txt.splitlines():
    if line.lower().startswith('sitemap:'):
        # Split on the first colon only, so the "https:" inside the URL survives
        sitemap_url = line.split(':', 1)[1].strip()
        print(sitemap_url)
        break
Make sure to replace 'https://www.example.com' with your actual site domain.