How to troubleshoot XPath not finding elements on a web page?

Question

I am using the following Bash script:

Bash

#!/bin/bash
# Define variables for the URL and browser
sGDomain="idealista"
sGCitta="fucecchio-firenze"
sGTypo="vendita-case"
iGPagina=1

# Start of the loop
while :; do

    # Build the URL with the iGPagina variable
    url="https://www.$sGDomain.it/$sGTypo/$sGCitta/lista-$iGPagina.htm"
    #echo "$url"
    
    # Get the HTML content of the page
    html_content=$(curl -s -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" "$url")

    echo "$html_content" > htmlcompleto.txt
    
    # Stop when the page does not contain "Successiva" (presumably the next-page link text).
    # Note: an error or block page lacks it too, so the loop also ends in that case.
    if [[ ! $html_content =~ "Successiva" ]]; then
        break  # Exit the loop: no further page to fetch
    fi
    
    # Use xidel to extract the ads; "-" makes xidel read from stdin,
    # so the downloaded HTML has to be fed to it explicitly
    xidel_output=$(xidel --silent --xpath '
        //div[contains(@class, "item-info-container")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' - <<< "$html_content")

    # Check if the temporary file exists and delete it if present
    if [ -f "temp.txt" ]; then
        rm temp.txt
    fi

    # Replace special characters from "desc=" to the end of each line in semi.txt
    echo "$xidel_output" | sed -e "s/desc=\(.*\)\(['\"]\)/desc=\1 /g" > semi.txt

    sed -i 's/\([0-9]\{1,\}\)\.\([0-9]\{1,\}\),[0-9]\{2\}/\1\2/g' semi.txt
    sed -i 's/m²//g' semi.txt

    # Concatenate semi.txt with debugtxt.txt for debugging purposes
    cat semi.txt >> debugtxt.txt

    # Connect to the SQLite database
    db_file="immo.db"
 
    # Loop through the lines and insert them into the SQLite database
    while IFS= read -r line; do
        # Extract price, size, link, and description values from the lines using awk
        prezzo=$(echo "$line" | awk -F 'price=' '{print $2}' | awk -F 'size=' '{print $1}')
        size=$(echo "$line" | awk -F 'size=' '{print $2}' | awk -F 'link=' '{print $1}')
        link=$(echo "$line" | awk -F 'link=' '{print $2}' | awk -F 'desc=' '{print $1}')
        descrizione=$(echo "$line" | awk -F 'desc=' '{print $2}')

        # Determine if the description contains "asta"
        if [[ $descrizione =~ "asta" ]]; then
            asta=1
        else
            asta=0
        fi

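        # Escape single quotes so values such as "all'asta" do not break the SQL statement
        descrizione=${descrizione//\'/\'\'}
        link=${link//\'/\'\'}

        # Insert the data into the SQLite database
        sqlite3 "$db_file" "INSERT INTO $sGDomain (prezzo, link, descrizione, metratura, asta) VALUES ('$prezzo', '$link', '$descrizione', '$size', $asta)"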
    done < semi.txt

    # Increment the iGPagina variable for the next iteration
    ((iGPagina++))
done
The script searches a web page with a specific XPath expression. Although I believe the XPath is correct, the script fails to find anything on the page.

Web page URL: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm

XPath expression used: the one passed to xidel in the script above.

Expected result: I expect to extract the price, the listing link, the description, and the square meters from each listing on the web page.

I also tried this XPath expression:

        //div[contains(@class, "items-container items-list")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)

This version uses the items-container items-list class instead, but it also returned nothing.

What I have tried:

I also tried this XPath expression, passing the URL directly to xidel:

xidel_output=$(xidel  --xpath '
     //main//div[contains(@class, "item-info-container ")] ! string-join(
          (
              ( "price=" || normalize-space(.//span[contains(@class, "item-price h2-simulated")]/text()[1]) ),
              ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
              ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
              ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
          ),
          codepoints-to-string(9)
      )
  ' "$url")

but this also returned nothing.

Posted 10-Jun-24 0:45am

GiulioRig

Comments

Richard Deeming 10-Jun-24 7:15am

That URL just returns a 403 Forbidden error. We can't tell you why your XPath isn't working without seeing the relevant parts of the source you're trying to evaluate it against.
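
To share the relevant part of the source, something along these lines might help. It is only a rough sketch: it prints the first few lines of a saved page that mention the class the XPath targets. The helper name `show_matches`, the three-line default, and the 200-character cut are assumptions for illustration; `htmlcompleto.txt` is the file the script in the question already writes.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Prints up to N lines of a saved page that contain a literal marker,
# prefixed with their line numbers and cut to 200 characters each.
show_matches() {
    local file=$1 marker=$2 max=${3:-3}
    grep -n -F -m "$max" -- "$marker" "$file" | cut -c1-200
}

# Example: inspect the file the script saves on each iteration.
# show_matches htmlcompleto.txt 'item-info-container'
```

If this prints nothing, the downloaded page does not contain the listings markup at all, and no XPath expression will match it.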

GiulioRig 10-Jun-24 9:31am

It works for me: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm. You can also use another page for testing: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-1.htm

Richard Deeming 10-Jun-24 9:37am

Exactly the same response - 403 Forbidden.

We can't help you scrape a site that we can't access!

GiulioRig 10-Jun-24 9:38am

That is very strange. Does it also happen if you request just the domain, idealista.it?

Richard Deeming 10-Jun-24 9:40am

Another 403 response, with a "Please enable JS and disable any ad blocker" message.

You cannot scrape a site that only works in a browser.

GiulioRig 10-Jun-24 10:21am

But I think the problem is your browser; probably you have some blocker? For me it works and navigates perfectly.

Richard Deeming 10-Jun-24 10:23am

No; I'm using Fiddler, which displays the HTML returned from the server, and does not execute JavaScript.

You would get the same result with Postman. Or with any other non-interactive client.

As I said, you CANNOT scrape a site that only works in an interactive browser!
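
A script can check for this situation itself before running any XPath. The following is a minimal sketch, assuming the HTTP status code is captured with curl's `-w '%{http_code}'` option and the body is saved to a file. The helper name `classify_response`, the file name `page.html`, and the `user_agent` variable are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# $1 = HTTP status code, $2 = file holding the response body.
classify_response() {
    local status=$1 body_file=$2
    if [[ $status != 200 ]]; then
        echo "blocked-or-error"      # e.g. the 403 reported above
    elif ! grep -q -F 'item-info-container' "$body_file"; then
        echo "no-listings"           # page loaded, but the expected markup is missing
    else
        echo "ok"
    fi
}

# Example usage (needs network access, so it is left commented out):
# status=$(curl -s -L -A "$user_agent" -o page.html -w '%{http_code}' "$url")
# classify_response "$status" page.html
```

Note that in the script from the question, a blocked response would most likely not contain the text "Successiva" either, so the loop would exit before xidel is run at all.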

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)