Tuesday, November 18, 2008

Crawlers / web data mining / news bots

How to archive Bloomberg investment news articles (for future reference and to extract data later).
We may start with a simple PHP script:
- get the page shown below
http://www.bloomberg.com/news/moreinvest.html
- then get each headline (within the table, our investment articles)
- compare the headline with what we already have; if it is the same as an old one, do not fetch, else fetch the web page
- once you get the web page, search it for the 'printer friendly version' string and grab the URL
- the resulting URL is the real web page we need, with no ads etc.
- save all headlines in one file and each actual page in a different file, like
20081101-1 "goldman says crude oil avg. price is $65 for the year"
where the actual HTML content is in a file named 20081101-1.htm
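
The steps above could be sketched roughly as follows. This is only a sketch, written in Python rather than the PHP planned above, and it rests on several assumptions: the sample markup is made up, the index file name headings.txt is hypothetical, the link text is assumed to contain "printer friendly", and the code looks at every link on the page rather than only those inside the table. Fetching over HTTP is left out, so the functions work on HTML that has already been downloaded.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.links.append((self._href, text))
            self._href = None


def extract_links(html, base_url):
    """Return (absolute_url, link_text) for each link found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(urljoin(base_url, href), text) for href, text in parser.links]


def new_headlines(links, seen_headlines):
    """Keep only links whose headline text has not been archived yet."""
    return [(url, text) for url, text in links
            if text and text not in seen_headlines]


def find_printer_friendly(html, base_url):
    """Return the URL of the first link labeled 'printer friendly', or None."""
    for url, text in extract_links(html, base_url):
        if "printer friendly" in text.lower():
            return url
    return None


def save_article(archive_dir, date_stamp, sequence, headline, html):
    """Write <date>-<n>.htm and append a line to the headings index file."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    name = f"{date_stamp}-{sequence}"
    (archive_dir / f"{name}.htm").write_text(html, encoding="utf-8")
    # headings.txt is a hypothetical name for the "all headlines" file.
    with open(archive_dir / "headings.txt", "a", encoding="utf-8") as index:
        index.write(f'{name} "{headline}"\n')
    return name


# Made-up markup standing in for the listing page and one article page.
BASE_URL = "http://www.bloomberg.com/news/moreinvest.html"
LISTING_HTML = (
    '<table><tr><td><a href="/news/a1.html">Goldman says crude oil avg. '
    'price is $65 for the year</a></td></tr>'
    '<tr><td><a href="/news/a2.html">Old story</a></td></tr></table>'
)
ARTICLE_HTML = '<p>ads</p><a href="/news/print/a1.html">Printer Friendly Version</a>'
```

In use, the script would call extract_links on the listing page, pass the result through new_headlines with the set of headlines already in the index file, fetch each remaining article, use find_printer_friendly to locate the clean page, fetch that, and hand the result to save_article.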

It looks like the following two programs have some good features, but we do not need their crawling (we need custom PHP to get the Bloomberg news); we need only their "view and category" features. Mail them and ask whether they can do that, that is, whether they can read files stored on our local disk for categorization.
http://www.newzcrawler.com/
http://software.korzh.com/newspiper/


Here is a sample system
http://www.phptoys.com/e107_plugins/content/content.php?content.74.2

--------------
http://shuetech.com/minetheweb/requirements.php - runs on PHP
http://shuetech.com/minetheweb/demo/docs/examples.php -
http://shuetech.com/minetheweb/demo/docs/bettingodds.html -- this gives a good programmable code example
http://shuetech.com/minetheweb/news.php#10 -- good history, improving for 5 years ..

---------
http://www.qualityunit.com/unitminer/buy-web-data-extraction-software - $140
http://www.qualityunit.com/unitminer/data-extraction-live-demo-bbc - these are PHP-based scripts ...

---
http://www.newzcrawler.com/
- this seems good for creating the "whole web page" (the entire page with ads and adjacent page sections), as shown in the screenshots
