Wednesday, August 1, 2012

URL extraction using python urldigger.py

Reference: http://code.google.com/p/urldigger/


EXAMPLES:
GET URLS FROM A GOOGLE SEARCH TERM
ecasbas@cipher:~/proyectos/urldigger$ python urldigger.py -g urldigger
http://urldigger.com/
http://code.google.com/p/urldigger/
http://code.google.com/p/urldigger/updates/list
http://sniptools.com/vault/urldigger
http://www.urldigger.com/articles/81/asshole-of-the-year-nominee-abu-abdullah.html
----OUTPUT CUT-----
GET URLS FROM TWITTER HOT WORDS
ecasbas@cipher:~/proyectos/urldigger$ python urldigger.py -W
http://itunes.apple.com/us/album/now-playing/id193558513
http://sourceforge.net/projects/nnplaying/
http://vivapinkfloyd.blogspot.com/2008/06/how-to-make-simple-amarok-now-playing.html
http://vivapinkfloyd.blogspot.com/2008/05/how-to-make-simple-amarok-now-playing.html
----OUTPUT CUT-----
GET URLS FROM CRAWLING YOUR SITE
ecasbas@cipher:~/proyectos/urldigger$ python urldigger.py -c http://www.nasa.gov
http://www.nasa.gov/about/career/index.html
http://www.nasa.gov/about/highlights/bolden_bio.html
http://www.nasa.gov/about/highlights/garver_bio.html
http://www.nasa.gov/about/highlights/leadership_gallery.html
http://www.nasa.gov/about/org_index.html
http://www.nasa.gov/about/sites/index.html
http://www.nasa.gov/astronauts
----OUTPUT CUT-----
SHOW HOT URLS FROM ALEXA
ecasbas@cipher:~/proyectos/urldigger$ python urldigger.py -H
http://realestate.yahoo.com/promo/most-expensive-us-small-town-sagaponack-ny.html
http://www.realsimple.com/home-organizing/new-uses-for-old-things/new-uses-penny-00000000027632/index.html?xid=yahoobuzz-rs-012210&xid=yahoo
http://movies.yahoo.com/news/usmovies.thehollywoodreporter.com/forbes-lists-biggest-flops-last-five-years
http://health.yahoo.com/experts/drmao/23125/soup-therapy-detoxify-lose-weight-and-boost-immunity/
http://answers.yahoo.com/question/index?qid=20100111162407AATTvcJ
----OUTPUT CUT-----
BRUTE FORCE MODE
ecasbas@cipher:~/proyectos/urldigger$ python urldigger.py -b > allurls.txt
Be careful: the output currently contains about 18917 URLs.
DETECT SPAM OR SPURIOUS CODE IN YOUR SITE
ecasbas@cipher:~/proyectos/urldigger$ python urldigger.py -g "site:uclm.es"
Looking for SPAM in........http://publicaciones.uclm.es/
*Suspicious SPAM!!!-----> http://publicaciones.uclm.es/* [(viagra)]
Looking for SPAM in........http://www.uclm.es/to/cdeporte/pdf/PublicacionesProfesorado.pdf
Looking for SPAM in........http://www.uclm.es/cr/caminos/publicaciones/publicaciones.html
Looking for SPAM in........http://www.uclm.es/profesorado/ricardo/Publicaciones.htm
Looking for SPAM in........http://publicaciones.uclm.es/index.php?action=module&path_module=modules_Product_index
*Suspicious SPAM!!!-----> http://publicaciones.uclm.es/index.php?action=module&path_module=modules_Product_index*
Looking for SPAM in........http://www.uclm.es/PROFESORADO/mydiaz/_private/PUBLICACIONES.pdf
NOTE: Functional code is only available through the source in the repository.


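I also wrote a small script that filters the output lines of urldigger.py, for example keeping only the URLs that end in .jpg, and prints each matching URL on its own line.
wget is then used to download every link that passes the filter.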

python urldigger.py -c http://exawarosu.net/archives/7356470.html | python isPicurl.py | while read -r line; do wget -P /home/stayhigh/mypics "$line"; done
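
The source of isPicurl.py is not included in this post. The following is only a minimal sketch of what such a filter might look like, assuming it reads one URL per line from standard input and prints the ones whose path ends in .jpg. The names PIC_EXTENSIONS, is_pic_url and filter_pic_urls are hypothetical, and the sketch targets Python 3.

```python
import sys
from urllib.parse import urlsplit

# Hypothetical extension list; the post only mentions .jpg.
PIC_EXTENSIONS = (".jpg",)


def is_pic_url(url):
    """Return True if the URL's path ends with one of PIC_EXTENSIONS."""
    try:
        path = urlsplit(url.strip()).path
    except ValueError:
        # urlsplit rejects some malformed URLs; treat them as non-matches.
        return False
    # Compare case-insensitively and ignore any query string or fragment.
    return path.lower().endswith(PIC_EXTENSIONS)


def filter_pic_urls(lines):
    """Yield the stripped, non-empty lines that look like picture URLs."""
    for line in lines:
        url = line.strip()
        if url and is_pic_url(url):
            yield url


def main():
    # Refuse to wait on an interactive terminal; the script expects a pipe.
    if sys.stdin.isatty():
        sys.stderr.write("usage: python urldigger.py ... | python isPicurl.py\n")
        return
    for url in filter_pic_urls(sys.stdin):
        print(url)


if __name__ == "__main__":
    main()
```

Saved as isPicurl.py, this sketch would sit in the middle of the pipeline above: urldigger.py prints the crawled URLs, the filter passes on only the .jpg links, and the shell loop hands each one to wget. Adding more entries to PIC_EXTENSIONS (for instance ".png") would widen the filter.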
