Create your own email, URL, image and content collector - Crawl websites to get emails, images and other content using Java and crawler4j


There are many products on the market that collect information from websites so a business can put it to use, but they are not free. There are also few free web utilities or services for this, since a single request can take hours on very large sites.

How about creating your own crawler to collect emails, URLs, images and other useful content? Let's build our own web crawler using Java, Eclipse and crawler4j.

Step 1 - Install Eclipse from http://www.eclipse.org/downloads/

Step 2 - Download the crawler4j jar from

http://code.google.com/p/crawler4j/downloads/list

Step 3 - Create a project and add the following classes, following the steps given at https://code.google.com/p/crawler4j/

Controller
MyCrawler
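Here is a minimal sketch of what the MyCrawler class can look like, modelled on the crawler4j quick-start example. The shouldVisit signature below matches the older releases available from the Google Code downloads page; newer crawler4j versions also pass the referring Page. The extension filter is just an illustration, not a requirement.

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip URLs that point to binary or style resources.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|zip|gz|pdf|mp3|mp4))$");

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        // Content extraction goes here (see the steps below).
    }
}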

Step 4 - Add the following seeds within the Controller class to crawl the intended sites.

controller.addSeed("http://www.yourtargetforum.com");
controller.addSeed("http://www.yourtargetforum2.com");
controller.addSeed("http://www.yourtargetforum3.com");

Step 5 - Set the number of concurrent crawler threads within the Controller class.

int numberOfCrawlers = 1;
( keep it at 1 if you don't want concurrent threads )

Step 6 - Set the crawl depth within the Controller class

config.setMaxDepthOfCrawling(-1);
( use -1 for unlimited depth )

Step 7 - Set the maximum number of pages to crawl

config.setMaxPagesToFetch(-1);
( use -1 for unlimited pages ) 
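To see how Steps 4 to 7 fit together, here is a minimal sketch of the Controller class, modelled on the crawler4j quick-start example. The crawl storage folder is only a placeholder, and package names or constructor signatures may differ slightly between crawler4j versions.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // placeholder folder for intermediate crawl data
        config.setMaxDepthOfCrawling(-1);            // -1 = unlimited depth
        config.setMaxPagesToFetch(-1);               // -1 = unlimited pages

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.yourtargetforum.com");
        controller.addSeed("http://www.yourtargetforum2.com");
        controller.addSeed("http://www.yourtargetforum3.com");

        int numberOfCrawlers = 1;
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}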

Now we are all set to crawl the intended website. Perform a test by printing page.getWebURL().getURL() within the visit function of MyCrawler.


@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();

        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
}

If everything works fine up to this point, go ahead with the following steps to add code for collecting emails and images.

Step 8 - Let's add code inside the visit function of MyCrawler to extract email addresses

// Requires java.util.regex.Pattern and java.util.regex.Matcher imports.
// The outgoing links (tr) are kept for the image step below.
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
List<WebURL> tr = htmlParseData.getOutgoingUrls();

// Find every '@' in the page text, then inspect a 25-character window
// on each side of it for tokens that look like email addresses.
Pattern regex = Pattern.compile("@");
String pageText = htmlParseData.getText();
Matcher regexMatcher = regex.matcher(pageText);
int subEmailCounter = 0;
int width = 25;
while (regexMatcher.find()) {
    if ((regexMatcher.start() - width > 0) && (regexMatcher.end() + width < pageText.length())) {
        String[] substr = pageText.substring(regexMatcher.start() - width, regexMatcher.end() + width).split(" ");
        for (int j = 0; j < substr.length; j++) {
            if (substr[j].contains("@") && (substr[j].contains(".com") || substr[j].contains(".net"))) {
                System.out.println(substr[j]);
                subEmailCounter++;
            }
        }
    }
}
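If the fixed 25-character window misses addresses near the start or end of the page text, an alternative sketch is to match complete addresses directly with a single email-shaped pattern. The pattern below is only an illustration and is not a complete RFC-compliant email matcher.

// Hypothetical alternative: match whole addresses instead of scanning around each '@'.
Pattern emailPattern = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
Matcher emailMatcher = emailPattern.matcher(pageText);
while (emailMatcher.find()) {
    System.out.println(emailMatcher.group());
    subEmailCounter++;
}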


Step 9 - Let's add code to collect image URLs

// Requires java.net.URL, java.net.MalformedURLException, java.awt.Image
// and javax.swing.ImageIcon imports.
// Scan the outgoing links collected above for common image extensions,
// load each image and keep only the large ones.
int imgUrlCounter = 0;
for (int i = 0; i < tr.size(); i++) {
    String link = tr.get(i).getURL();
    if (link.contains(".jpg") || link.contains(".jpeg") || link.contains(".gif") || link.contains(".bmp")) {
        try {
            URL imageUrl = new URL(link);
            Image image = new ImageIcon(imageUrl).getImage();
            // Keep only large images.
            int imgWidth = image.getWidth(null);
            int imgHeight = image.getHeight(null);
            if ((imgWidth > 400) && (imgHeight > 400)) {
                System.out.println(link);
                imgUrlCounter++;
            }
        } catch (MalformedURLException e) {
            // Skip links that are not valid URLs.
        }
    }
}
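ImageIcon loads images through the AWT toolkit; if you prefer to avoid that, a rough alternative sketch is javax.imageio.ImageIO, which downloads and decodes the image and returns null for unsupported formats. Note that either approach still fetches the full image just to read its size.

// Alternative size check using ImageIO (requires javax.imageio.ImageIO,
// java.awt.image.BufferedImage and java.io.IOException imports).
try {
    BufferedImage img = ImageIO.read(new URL(link));
    if (img != null && img.getWidth() > 400 && img.getHeight() > 400) {
        System.out.println(link);
        imgUrlCounter++;
    }
} catch (IOException e) {
    // Skip images that cannot be downloaded or decoded.
}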


Now you are all done. Just run your program to collect emails and images from the intended sites.

Let me know if you face any problems with the crawler. Enjoy.
  


This post was written by:

Vivek Vermani
WCS Developer