I am using a Bash script

Bash
#!/bin/bash
# Define variables for the URL and browser
sGDomain="idealista"
sGCitta="fucecchio-firenze"
sGTypo="vendita-case"
iGPagina=1

# Start of the loop
while :; do

    # Build the URL with the iGPagina variable
    url="https://www.$sGDomain.it/$sGTypo/$sGCitta/lista-$iGPagina.htm"
    #echo "$url"
    
    # Get the HTML content of the page
    html_content=$(curl -s -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" "$url")

    echo "$html_content" > htmlcompleto.txt
    
    # Stop when the page no longer contains "Successiva" (the text of the next-page link)
    if [[ ! $html_content =~ "Successiva" ]]; then
        break  # Exit the loop: there is no next page, or the expected page was not returned
    fi
    
    # Use xidel to extract the ads; the trailing "-" makes xidel read stdin, so pipe the HTML in
    xidel_output=$(printf '%s\n' "$html_content" | xidel --silent --xpath '
        //div[contains(@class, "item-info-container")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)

    # Check if the temporary file exists and delete it if present
    if [ -f "temp.txt" ]; then
        rm temp.txt
    fi

    # Replace special characters from "desc=" to the end of each line in semi.txt
    echo "$xidel_output" | sed -e "s/desc=\(.*\)\(['\"]\)/desc=\1 /g" > semi.txt

    sed -i 's/\([0-9]\{1,\}\)\.\([0-9]\{1,\}\),[0-9]\{2\}/\1\2/g' semi.txt
    sed -i 's/m²//g' semi.txt

    # Concatenate semi.txt with debugtxt.txt for debugging purposes
    cat semi.txt >> debugtxt.txt

    # Connect to the SQLite database
    db_file="immo.db"
 
    # Loop through the lines and insert them into the SQLite database
    while IFS= read -r line; do
        # Extract price, size, link, and description values from the lines using awk
        prezzo=$(echo "$line" | awk -F 'price=' '{print $2}' | awk -F 'size=' '{print $1}')
        size=$(echo "$line" | awk -F 'size=' '{print $2}' | awk -F 'link=' '{print $1}')
        link=$(echo "$line" | awk -F 'link=' '{print $2}' | awk -F 'desc=' '{print $1}')
        descrizione=$(echo "$line" | awk -F 'desc=' '{print $2}')

        # Determine if the description contains "asta"
        if [[ $descrizione =~ "asta" ]]; then
            asta=1
        else
            asta=0
        fi

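        # Insert the data into the SQLite database, doubling single quotes so the SQL string literals stay valid
        q="'"
        sqlite3 "$db_file" "INSERT INTO $sGDomain (prezzo, link, descrizione, metratura, asta) VALUES ('${prezzo//$q/$q$q}', '${link//$q/$q$q}', '${descrizione//$q/$q$q}', '${size//$q/$q$q}', $asta)"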
    done < semi.txt

    # Increment the iGPagina variable for the next iteration
    ((iGPagina++))
done
to search a web page with a specific XPath expression. Although I believe the XPath is correct, the script fails to find anything on the page.

Web page URL: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm

XPath expression used: the one passed to xidel in the script above.

Expected result: I expect to extract the price, the listing link, the description, and the square meters from each listing on the web page.

I also tried this XPath expression:

    xidel_output=$(xidel --silent --xpath '
        //div[contains(@class, "items-container items-list")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)
This uses items-container items-list as the container class, but it also returns nothing.

What I have tried:

I also tried passing the URL directly to xidel, with more specific class names:
xidel_output=$(xidel  --xpath '
     //main//div[contains(@class, "item-info-container ")] ! string-join(
          (
              ( "price=" || normalize-space(.//span[contains(@class, "item-price h2-simulated")]/text()[1]) ),
              ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
              ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
              ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
          ),
          codepoints-to-string(9)
      )
  ' "$url")
but this returns nothing as well.
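
One way to narrow this down is to check whether the HTML saved in htmlcompleto.txt contains the class names at all, before looking at the XPath. The following is only a rough sketch: count_marker is a hypothetical helper that is not part of the script above, and it assumes the saved file is what curl actually received.

```bash
#!/bin/bash
# count_marker is a hypothetical helper (not in the original script):
# it counts how many times a literal string occurs in a file.
count_marker() {
    local marker="$1" file="$2"
    # -o prints every match on its own line, -F treats the marker as a fixed string.
    # tr strips the padding that some wc implementations add.
    grep -oF -- "$marker" "$file" | wc -l | tr -d ' '
}

# Example call against the file written by the script:
#   count_marker 'item-info-container' htmlcompleto.txt
```

If the count is 0, the listing markup never reached the script, so no XPath expression could match it.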
Comments
Richard Deeming 10-Jun-24 7:15am    
That URL just returns a 403 Forbidden error. We can't tell you why your XPath isn't working without seeing the relevant parts of the source you're trying to evaluate it against.
GiulioRig 10-Jun-24 9:31am    
It works for me: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm. You can also use another page for testing: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-1.htm
Richard Deeming 10-Jun-24 9:37am    
Exactly the same response - 403 Forbidden.

We can't help you scrape a site that we can't access!
GiulioRig 10-Jun-24 9:38am    
That is very strange. Does it also happen if you request just the domain, idealista.it?
Richard Deeming 10-Jun-24 9:40am    
Another 403 response, with a "Please enable JS and disable any ad blocker" message.

You cannot scrape a site that only works in a browser.
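
A script can detect this situation itself by looking at the HTTP status code and the body before handing anything to xidel. The sketch below is an illustration only: looks_blocked is a hypothetical helper, and the "Please enable JS" check assumes the block page always carries the message quoted above.

```bash
#!/bin/bash
# looks_blocked is a hypothetical helper (not in the original script).
# It returns success (0) when the response should NOT be parsed.
looks_blocked() {
    local status="$1" body="$2"
    # Anything other than 200 (for example the 403 reported above) is not a listing page.
    if [[ "$status" != "200" ]]; then
        return 0
    fi
    # Assumption: the block page contains the message quoted in the comment above.
    if [[ "$body" == *"Please enable JS"* ]]; then
        return 0
    fi
    return 1
}

# Possible use inside the loop, in place of the plain curl call
# ($ua stands for the user agent string already used in the script):
#   status=$(curl -s -L -A "$ua" -o htmlcompleto.txt -w '%{http_code}' "$url")
#   html_content=$(<htmlcompleto.txt)
#   if looks_blocked "$status" "$html_content"; then
#       echo "Request blocked or failed: HTTP $status for $url" >&2
#       break
#   fi
```

With curl, -o writes the body to a file and -w '%{http_code}' prints only the final status code, so the two can be inspected separately.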

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)