Scraping ICD codes from WHO dynamic website

Hi there,
This is my first time here, and I am completely new to Python. I use ICD-10 codes from the WHO website to analyze national data that has a variable containing ICD-10 diagnosis codes. One of the first verifications we do is to check for any invalid codes. To do so, we need the updated codes from the website (updates occur yearly). I spent hours trying but gave up, and I hope someone can help. The website works as a browser, and its dynamic nature makes it difficult to scrape the codes from each chapter and subchapter. This is what I came up with, but it gets stuck looping over the same section and codes instead of moving on to the next one.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Set up WebDriver
driver = webdriver.Chrome()
driver.implicitly_wait(10)  # Adding an implicit wait
driver.get("ICD-10 Version:2019")

# Wait for the main chapters to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "ygtvitem"))
)

# Find all chapters under the given class and id
chapters = driver.find_elements(By.CSS_SELECTOR, "#ygtv1 .ygtvitem")

for i in range(len(chapters)):
    try:
        # Re-find chapters
        chapters = driver.find_elements(By.CSS_SELECTOR, "#ygtv1 .ygtvitem")
        chapter = chapters[i]

        # Click on each chapter to expand it
        chapter.click()
        time.sleep(5)  # Increasing the wait time to ensure the sub-items load

        print(f"Clicked on chapter: {chapter.text}")

        # Find all sub-items
        sub_items = chapter.find_elements(By.CSS_SELECTOR, ".ygtvitem a")

        for j in range(len(sub_items)):
            try:
                # Re-find sub-items
                sub_items = chapter.find_elements(By.CSS_SELECTOR, ".ygtvitem a")
                sub_item = sub_items[j]

                sub_item.click()
                time.sleep(5)  # Increasing the wait time to ensure the content loads

                print(f"Clicked on sub-item: {sub_item.text}")

                # Extract text under the class "code"
                codes = driver.find_elements(By.CLASS_NAME, "code")

                for code in codes:
                    print(f"Code text: {code.text}")  # Print or save the extracted text

                # Go back to the main page
                driver.back()
                WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "ygtvitem"))
                )
                time.sleep(2)  # Adding a short wait to ensure the main page loads
            except Exception as e:
                print(f"Error with sub-item: {e}")
    except Exception as e:
        print(f"Error with chapter: {e}")

# Close the driver
driver.quit()
```

Please read this first to learn how to format code so we can help you: About the Python Help category

After looking at the website, I wonder if one giant PDF might be easier to extract the codes from. Do you have a link to such a PDF?
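If the codes can be obtained as plain text (for example, text already extracted from such a PDF), a regular expression may be enough to pull them out and check a data column against them. The following is only a minimal sketch: the pattern assumes codes look like a letter, two digits, and an optional dot followed by one or two digits, and the names `extract_codes` and `find_invalid` are hypothetical helpers, not part of any library.

```python
import re

# Assumed code shape: one uppercase letter, two digits, then optionally
# a dot and one or two more digits (e.g. "A00" or "A00.0").
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9]{2}(?:\.[0-9]{1,2})?\b")


def extract_codes(text):
    """Return the set of code-like strings found in a block of text."""
    return set(ICD10_PATTERN.findall(text))


def find_invalid(observed, valid):
    """Return the observed codes that are missing from the valid set.

    Observed codes are stripped and upper-cased before comparison,
    and the result is sorted so it is easy to review.
    """
    normalized = {code.strip().upper() for code in observed}
    return sorted(normalized - valid)


# Illustrative sample only; real input would be the full extracted text.
sample_text = """A00 Cholera
A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae
A01.0 Typhoid fever"""

valid_codes = extract_codes(sample_text)
print(find_invalid([" a00.0", "A01.0", "Z99.9X"], valid_codes))
```

Keeping the reference codes in a set makes each lookup cheap, which matters when the national data has many rows.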

  1. Have you asked on this site about free sources of this data, such as a PDF or even a CSV file? https://opendata.stackexchange.com/

Hi,
I want Python to visit this site, ICD-10 Version:2019, and click the arrows to expand them. The arrows are matched by this XPath: `//tr[@class='ygtvrow']//td[starts-with(@id, 'ygtvt') and (contains(@class, 'ygtvcell ygtvtm') or contains(@class, 'ygtvcell ygtvtp') or contains(@class, 'ygtvcell ygtvlm'))]`. If you visit the website, you will see that each arrow is nested inside another. When I inspect the HTML in Google Chrome, press Ctrl+F, and paste the above XPath, it finds all of them and lets me navigate through them. After Python has clicked and expanded all the arrows, I want it to select each element matched by `//tr[@class='ygtvrow']//td[@class='ygtvcell ygtvhtml ygtvcontent']` to load its page content. It would be better if this happened sequentially.

I appreciate any help.
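One possible way to structure the two steps described above (expand every arrow, then visit each content cell in order) is sketched below. It is not a tested solution for this site. It assumes the tree follows the YUI TreeView naming, where `ygtvtp` and `ygtvlp` (plus icons) mark collapsed nodes and `ygtvtm` and `ygtvlm` (minus icons) mark nodes that are already expanded, so only the plus icons are clicked. The helper names `expand_all` and `visit_all` and the `handle` callback are hypothetical.

```python
import time

# Same string value as selenium.webdriver.common.by.By.XPATH.
XPATH = "xpath"

# Assumption: plus icons ("ygtvtp", "ygtvlp") mark collapsed nodes.
# Clicking a minus icon ("ygtvtm", "ygtvlm") would collapse it again.
COLLAPSED_XPATH = (
    "//tr[@class='ygtvrow']//td[starts-with(@id, 'ygtvt') and "
    "(contains(@class, 'ygtvtp') or contains(@class, 'ygtvlp'))]"
)
CONTENT_XPATH = (
    "//tr[@class='ygtvrow']//td[@class='ygtvcell ygtvhtml ygtvcontent']"
)


def expand_all(driver, pause=1.0, max_clicks=5000):
    """Click collapsed arrows one at a time until none are left.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # Re-query on every pass: each click changes the tree, so
        # elements found earlier may be stale.
        collapsed = driver.find_elements(XPATH, COLLAPSED_XPATH)
        if not collapsed:
            break
        # The first match in document order has no collapsed ancestor.
        collapsed[0].click()
        clicks += 1
        time.sleep(pause)  # give the child nodes time to load
    return clicks


def visit_all(driver, handle, pause=1.0):
    """Click each content cell in order, calling handle(driver) after each.

    Returns the number of cells found before the first click.
    """
    total = len(driver.find_elements(XPATH, CONTENT_XPATH))
    for i in range(total):
        # Re-find the cells each time instead of reusing old references.
        cells = driver.find_elements(XPATH, CONTENT_XPATH)
        if i >= len(cells):
            break
        cells[i].click()
        time.sleep(pause)  # give the page content time to load
        handle(driver)
    return total
```

With a real browser, `driver` would be a `webdriver.Chrome()` instance that has already loaded the ICD-10 page, and `handle` could collect the text of the elements with class `code`, as the first script does. Note that this sketch never calls `driver.back()`; whether that call is needed depends on how the site loads its content.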