Hi, I have a Bash script that scrapes a web page. It works, but one of my regular expressions does not behave correctly. This is my code:
Bash
#!/bin/bash
# Define the variables used to build the URL
sGCitta="fucecchio"
sGTypo="vendita-case"
sGDomain="immobiliare"
url="https://www.$sGDomain.it/$sGTypo/$sGCitta"
# Get HTML content of the page
html_content=$(curl -s -L "$url")
# Use xidel to extract the listings
xidel_output=$(xidel --xpath '
    //li//div[contains-token(@class, "in-listingCardPropertyContent")] ! string-join(
        (
            ( "price=" || tokenize(div[@class = "in-listingCardPrice"])[last()] ),
            ( "size="  || normalize-space(div[contains-token(@class,"in-listingCardFeatureList")]/div[contains(.,"m²")]) ),
            ( "link="  || a[@class = "in-listingCardTitle"]/@href ),
            ( "desc="  || a[@class = "in-listingCardTitle"]/@title )
        ),
        codepoints-to-string(9)
    )
' "$url")
# Check if the temporary file exists and delete it if present
if [ -f "temp.txt" ]; then
    rm temp.txt
fi
# Remove price=, the dots separating the thousands, and the decimal part if present
echo "$xidel_output" | awk -F 'price=' '{gsub(/\./,"",$2); gsub(/,[0-9]+/,"",$2); print $2}' | sed -e 's/size=/;/g' -e 's/link=/;/g' -e 's/desc=/;/g' -e 's/m²//g' > temp.txt
# Connection to SQLite database
db_file="immo.db"
# Loop through the listings and insert them into the SQLite database
while IFS= read -r row; do
    sqlite3 "$db_file" "INSERT INTO $sGDomain (prezzo, link, descrizione, metratura) VALUES ($row)"
done < temp.txt
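A side note on the insert loop above, offered only as a sketch: the description can contain an apostrophe (as in "all'asta"), and unquoted text cannot be used as an SQL value, so each text field probably needs to be quoted before it goes into the `INSERT`. The example below only builds and prints the statement. `sql_quote` is a hypothetical helper, the sample row is the desired output from this question, and it assumes that `prezzo` and `metratura` are numeric columns whose values have already been checked.

```shell
#!/bin/bash
# sql_quote is a hypothetical helper: wrap a value in single quotes for SQL
# and double every single quote inside it.
sql_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

table="immobiliare"
# Sample row in the ";price;size;link;desc" layout (assumed format).
row=";29920;80;https://www.immobiliare.it/annunci/112315175/;Appartamento all'asta via Saminiatese 82, Fucecchio"

# Split on ";". The leading ";" produces an empty first field, which is skipped.
# The last variable receives the rest of the line, so the description stays whole.
IFS=';' read -r skip prezzo metratura link descrizione <<EOF
$row
EOF

# Values are listed in the same order as the column names.
sql="INSERT INTO $table (prezzo, link, descrizione, metratura) VALUES ($prezzo, $(sql_quote "$link"), $(sql_quote "$descrizione"), $metratura);"
printf '%s\n' "$sql"
```

The printed statement could then be passed to `sqlite3 "$db_file"` in place of the raw `$row`.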
This is an example of a line that it extracts:
price=29.920,00	size=80 m²	link=https://www.immobiliare.it/annunci/112315175/	desc=Appartamento all'asta via Saminiatese 82, Fucecchio
I want to get something like this:
;29920;80;https://www.immobiliare.it/annunci/112315175/ ;Appartamento all'asta via Saminiatese 82, Fucecchio
But in my case it returns the following: there is no ; at the start, and every dot in the text is removed:
29920	;80 	;https://wwwimmobiliareit/annunci/112315175/	;Appartamento all'asta via Saminiatese 82, Fucecchio
The . is removed in the link part as well. Is it possible to tell awk to remove the dots only in the price (up to the following whitespace)?

What I have tried:

I also tried it this way:
#echo "$xidel_output" | sed -e 's/price=/;/g' -e 's/size=/;/g' -e 's/link=//g' -e 's/desc=/;/g' -e 's/m²/;/g' > temp.txt
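
One possible direction, as a sketch rather than a definitive fix: with `-F 'price='`, `$2` is everything after `price=`, so `gsub(/\./,"",$2)` also hits the dots in the link, and nothing puts a `;` in front of the price. Since xidel joins the fields with a tab, awk can split on the tab and edit only the first field. `clean_listing` is a hypothetical helper name, and the sample line is the one shown in this question.

```shell
#!/bin/bash
# Build the sample line from the question; the fields are tab-separated.
tab=$(printf '\t')
sample="price=29.920,00${tab}size=80 m²${tab}link=https://www.immobiliare.it/annunci/112315175/${tab}desc=Appartamento all'asta via Saminiatese 82, Fucecchio"

# clean_listing (hypothetical name) reads tab-separated lines on stdin and
# prints ";price;size;link;desc", touching dots only in the price field.
clean_listing() {
    awk -F '\t' '{
        price = $1; size = $2; link = $3; desc = $4
        sub(/^price=/, "", price)
        gsub(/\./, "", price)         # thousands separators, price only
        sub(/,[0-9]+$/, "", price)    # decimal part, if present
        sub(/^size=/, "", size)
        sub(/ *m²$/, "", size)        # unit and the space before it
        sub(/^link=/, "", link)
        sub(/^desc=/, "", desc)
        print ";" price ";" size ";" link ";" desc
    }'
}

result=$(printf '%s\n' "$sample" | clean_listing)
printf '%s\n' "$result"
```

In the script, `echo "$xidel_output" | clean_listing > temp.txt` would take the place of the awk and sed pipeline. The sketch assumes every line has exactly the four fields in this order.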

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


