Little trick for website scraping with Linux
#1
Hello to all ;-)

At the moment I am programming an EPG scraper for Linux ...
After many hours of working on a mirrored website with tools
like sed / awk / grep, I found a little trick that makes scraping
with Linux an easy job.

- I use snarf to download an entire website to my local computer

snarf -m -q http://tv.search.ch searchlocal 2>/dev/null &

My first attempt was to work directly with the HTML code ... without great success ...
My first bash script looked like this ...

Code:
grep -A 3 Titel: tmp.html | tail -1  | sed 's/[ \t]*$//' | sed -e 's/^[ \t]*//' | sed "s/^ *//;s/ *$//;s/ \{1,\}/ /g" | sed "s/<b>\|<\/b>//g" | sed "s/<\/td>//g" | sed 's/- Laufzeit/\n/g' | head -1
  grep -A 10 Inhalt: tmp.html | grep -m 1 -A 15 " >" | head -10 | sed '1d' | sed -e 's/^[ \t]*//' | sed "s/<b>\|<\/b>\|<td>\|<tr>\|<br \/>\|<\/td>\|<\/tr>//g" | sed '/^</ d' | sed "s/^ *//;s/ *$//;s/ \{1,\}/ /g"
  grep -A 3 'Genre:' tmp.html | tail -1 | sed -e 's/^[ \t]*//' | sed "s/<b>\|<\/b>\|<td>\|<tr>\|<br \/>\|<\/td>\|<\/tr>//g"
  echo regie && grep 'Regie:' tmp.html | sed -e 's/^[ \t]*//' | sed 's/search="/\n/g' | tail -1 | sed 's/">/\n/g' | grep '^<' | tail -1

It was not easy to change the code ...

But I found a solution that was right for my job ... lynx
It has a so-called dump feature: it renders the website as text inside a terminal, or stores that output in a text file ...

- An already formatted text file can easily be searched with tools like grep
- I do not have to handle line breaks or other HTML-related things

lynx -dump "$line" > tmp.html

All the text, as it would appear in a GUI browser, is now inside the text file
tmp.html

- With all the formatting / line breaks and everything ...
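
To show why the dumped text is easier to search than the raw HTML, here is a minimal sketch. It assumes the dump puts a label and its value on the same line (for example "Titel: ..."); that layout, the helper name get_field and the sample text are made up for illustration, so check them against a real dump first. The label must not contain characters that are special to sed, such as "/".

```shell
#!/bin/sh
# Hypothetical helper: read a lynx text dump on stdin and print whatever
# follows the given label on the first line that contains it.
get_field() {
    # grep -m 1 -F: stop at the first line containing the literal label.
    # sed: drop everything up to and including the label, then trim
    # trailing whitespace.
    grep -m 1 -F -- "$1" | sed -e "s/^.*$1[[:space:]]*//" -e 's/[[:space:]]*$//'
}

# Made-up sample standing in for the contents of tmp.html after
# running: lynx -dump "$line" > tmp.html
sample='   Titel: Tatort
   Genre: Krimi'

printf '%s\n' "$sample" | get_field 'Titel:'   # prints: Tatort
printf '%s\n' "$sample" | get_field 'Genre:'   # prints: Krimi
```

With a real dump the call would be something like get_field 'Titel:' < tmp.html. One grep plus one sed per field replaces the long tag-stripping pipelines, because lynx has already removed the markup.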



Regards from Switzerland
Hans
#2
That's so hardcore it's kinda scary