There are many products on the market that collect information from websites so the data can be put to business use, but these products are not free. There are also few free web utilities or services for this purpose, since a single request can take hours on very large sites.
How about creating your own crawler to collect emails, URLs, images and other useful content? Let's build our own web crawler using Java, Eclipse and crawler4j.
Step 1 - Install Eclipse from http://www.eclipse.org/downloads/
Step 2 - Download the crawler4j jar from http://code.google.com/p/crawler4j/downloads/list and add it to your project's build path.
Step 3 - Create a project and add the following two classes, following the steps given at https://code.google.com/p/crawler4j/
Controller
MyCrawler
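As a reference, here is a minimal MyCrawler sketch modeled on the crawler4j example. The file-extension filter is an assumption you can adjust, and the visit body is only a placeholder until we fill it in below (note that newer crawler4j versions pass the referring Page as a first argument to shouldVisit):

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // skip binary resources we do not want to parse as pages (assumed filter)
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|zip|gz|pdf))$");

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        // placeholder - replaced by the visit code shown below
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}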
Step 4 - Update the following within the Controller class to crawl the intended sites (a complete Controller sketch is shown after Step 7).
controller.addSeed("http://www.yourtargetforum.com");
controller.addSeed("http://www.yourtargetforum2.com");
controller.addSeed("http://www.yourtargetforum3.com");
Step 5 - Set the number of concurrent threads that should be started for crawling within the Controller class.
int numberOfCrawlers = 1;
( keep it at 1 if you don't want concurrent threads )
Step 6 - Set the depth of crawling within the Controller class.
config.setMaxDepthOfCrawling(-1);
( use -1 for unlimited depth )
Step 7 - Set the maximum number of pages to crawl.
config.setMaxPagesToFetch(-1);
( use -1 for unlimited pages )
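Putting Steps 4 through 7 together, a minimal Controller sketch could look like the one below. The crawl storage folder path is an assumption (any writable folder works); the rest follows the standard crawler4j example:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";   // assumption: any writable folder
        int numberOfCrawlers = 1;                         // keep 1 for a single thread

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(-1);   // -1 = unlimited depth
        config.setMaxPagesToFetch(-1);      // -1 = unlimited pages

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.yourtargetforum.com");
        controller.addSeed("http://www.yourtargetforum2.com");
        controller.addSeed("http://www.yourtargetforum3.com");

        // blocks until the crawl is finished
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}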
Now we are all set to crawl the intended websites. Perform a test by printing page.getWebURL().getURL() within the visit function of MyCrawler.
// requires imports: java.util.List, edu.uci.ics.crawler4j.parser.HtmlParseData, edu.uci.ics.crawler4j.url.WebURL
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();
        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
}
If everything works fine up to this point, go ahead with the following steps to add code for extracting emails and images.
Step 8 - Let's add code to get all emails within the visit function of MyCrawler.
// requires imports: java.util.regex.Pattern, java.util.regex.Matcher
// (reuse the existing htmlParseData if you add this inside the if block from the test above)
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
List<WebURL> tr = htmlParseData.getOutgoingUrls();
// find every '@' and inspect a 25-character window on each side of it
Pattern regex = Pattern.compile("@");
String pageText = htmlParseData.getText();
Matcher regexMatcher = regex.matcher(pageText);
int subEmailCounter = 0;
int width = 0;
while (regexMatcher.find()) {
    if ((regexMatcher.start() - 25 > 0) && (regexMatcher.end() + 25 < pageText.length())) {
        width = 25;
        String[] substr = pageText.substring(regexMatcher.start() - width, regexMatcher.end() + width).split(" ");
        for (int j = 0; j < substr.length; j++) {
            // keep only tokens that look like email addresses
            if (substr[j].contains("@") && (substr[j].contains(".com") || substr[j].contains(".net"))) {
                System.out.println(substr[j]);
                subEmailCounter++;
            }
        }
    } else {
        width = 0;
    }
}
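The window-around-'@' approach above is deliberately crude and only catches .com and .net addresses. If you prefer, a single regular expression can match complete addresses directly; the pattern below is an assumed, simplified one and can replace the while loop above:

// simplified email pattern (assumption); replaces the window-based loop above
Pattern emailPattern = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
Matcher emailMatcher = emailPattern.matcher(pageText);
while (emailMatcher.find()) {
    System.out.println(emailMatcher.group());
    subEmailCounter++;
}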
Step 9 - Let's add code to get all images (still within the visit function).
// requires imports: java.net.URL, java.net.MalformedURLException, java.awt.Image, javax.swing.ImageIcon
int imgUrlCounter = 0;
for (int i = 0; i < tr.size(); i++) {
    String link = tr.get(i).toString();
    // only look at outgoing links that point to image files
    if (link.contains(".jpg") || link.contains(".jpeg") || link.contains(".gif") || link.contains(".bmp")) {
        try {
            URL imageUrl = new URL(link);
            // ImageIcon loads the image fully, so its dimensions are available
            Image image = new ImageIcon(imageUrl).getImage();
            int imgWidth = image.getWidth(null);
            int imgHeight = image.getHeight(null);
            // keep only large images
            if ((imgWidth > 400) && (imgHeight > 400)) {
                System.out.println(link);
                imgUrlCounter++;
            }
        } catch (MalformedURLException e) {
            // skip links that are not valid URLs
        }
    }
}
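If you want to keep the large images rather than just print their URLs, a small helper like the hypothetical ImageSaver below could be called in place of the System.out.println in Step 9. The class name, its save method, and the PNG re-encoding are assumptions, not part of crawler4j:

import java.awt.image.BufferedImage;
import java.io.File;
import java.net.URL;
import javax.imageio.ImageIO;

public class ImageSaver {
    // hypothetical helper: download an image and save it if it is large enough
    public static void save(String link, String targetDir) {
        try {
            BufferedImage img = ImageIO.read(new URL(link));
            if (img != null && img.getWidth() > 400 && img.getHeight() > 400) {
                // use the last path segment as the file name
                String name = link.substring(link.lastIndexOf('/') + 1);
                ImageIO.write(img, "png", new File(targetDir, name + ".png"));
            }
        } catch (Exception e) {
            // skip images that cannot be downloaded or decoded
        }
    }
}

Calling ImageSaver.save(link, "images") inside the size check in Step 9 would write the file instead of only printing the URL.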
Now you are all done. Just run your program to get all emails and images from the intended sites.
Let me know if you face any problems with the crawler. Enjoy.