Create a WebCrawler with less than 100 Lines of Code

Hi,

while there are excellent open source web crawlers out there (e.g. crawler4j), I wanted to write a crawler with very little code.
I often use the well-known Jsoup library. Jsoup has some nice features to find and extract data from a URL:

// extract URLs from HTML using Jsoup
def doc = Jsoup.connect("http://example.com").get() as Document
def links = doc.select("a[href]") as Elements
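
Each link Jsoup finds can also be resolved to an absolute URL via the "abs:href" attribute, which the crawler below relies on. A quick sketch (the URL is just a placeholder):

import org.jsoup.Jsoup

// fetch a page and print the absolute URL of every link on it
def doc = Jsoup.connect("http://example.com").get()
doc.select("a[href]").each { link ->
   // "abs:href" resolves relative hrefs against the document's base URI
   println link.attr("abs:href")
}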

To crawl the web, you can use recursive programming and code a closure. This can be efficient and produces very compact code, but deep recursion may end in a stack overflow or out-of-memory error. If you prefer this approach, remember Groovy's @TailRecursive annotation: it rewrites self-recursive tail calls into iteration, so the call stack stays flat. A minimal sketch of this idea is shown below.
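
Here is a minimal, hypothetical sketch of that recursive approach; the class and method names are mine and not part of the crawler below. The remaining work travels along as an argument, so the recursive call stays in tail position and @TailRecursive can unroll it into a loop:

import groovy.transform.TailRecursive
import org.jsoup.Jsoup

class RecursiveCrawlSketch {

   // extract the absolute link targets of one page, as in the Jsoup snippet above
   static List<String> extractLinks(String url) {
      Jsoup.connect(url).get().select("a[href]").collect { it.attr("abs:href") }
   }

   @TailRecursive
   static Set<String> crawl(List<String> pending, Set<String> visited) {
      if (pending.isEmpty()) {
         return visited
      }
      String url = pending.head()
      if (visited.contains(url)) {
         return crawl(pending.tail(), visited)
      }
      def newLinks = extractLinks(url).findAll { !visited.contains(it) }
      return crawl(pending.tail() + newLinks, visited + url)
   }
}

Calling crawl(["https://jolorenz.wordpress.com"], [] as Set) would return the set of URLs reachable from that seed; the filtering done by shouldVisit in the crawler below is left out of this sketch.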

Another solution is to work with queues, e.g. ArrayDeque. Array deques have no capacity restrictions, they grow as necessary, and they are considerably faster than Stack or LinkedList. But remember that an ArrayDeque is not thread-safe; if you need thread safety, you have to provide your own synchronization code. ArrayDeque implements java.util.Deque, which defines a container supporting fast element insertion and removal at both ends, and that is exactly what the following code uses.
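
The two Deque operations the crawler relies on are add, which appends at the tail, and poll, which removes and returns the head (the URLs here are placeholders):

// the Deque operations used by the crawler below (no synchronization here)
def queue = new ArrayDeque<String>()
queue.add("http://example.com/a")   // append at the tail
queue.add("http://example.com/b")
assert queue.poll() == "http://example.com/a"   // poll removes and returns the head
assert queue.poll() == "http://example.com/b"
assert queue.poll() == null   // and returns null once the queue is empty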

package crawler

import org.codehaus.groovy.grails.validation.routines.UrlValidator
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements
import org.jsoup.Jsoup

import groovy.util.logging.*
import java.util.regex.Matcher
import java.util.regex.Pattern

@Log4j
class BasicWebCrawler {

   private final boolean followExternalLinks = false
   def linksToCrawl = [] as ArrayDeque
   def visitedUrls = [] as HashSet
   def seedURLs = []   // kept in a field so shouldVisit() can check links against the seeds
   def urlValidator = new UrlValidator()
   final static Pattern IGNORE_SUFFIX_PATTERN = Pattern.compile(
      ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4" +
      "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))\$")

   private final timeout = 3000        // connect/read timeout in milliseconds
   private final userAgent = "Mozilla" // user agent string sent with each request

   def collectUrls(List seedURLs) {
      this.seedURLs = seedURLs
      seedURLs.each { url ->
         linksToCrawl.add(url)
      }
      try {
         while (!linksToCrawl.isEmpty()) {
            // "poll" removes and returns the first url in the queue
            def urlToCrawl = linksToCrawl.poll() as String
            try {
               visitedUrls.add(urlToCrawl)
               // extract URLs from HTML using Jsoup
               def doc = Jsoup.connect(urlToCrawl).userAgent(userAgent).timeout(timeout).get() as Document
               def links = doc.select("a[href]") as Elements
               links.each { Element link ->
                  // resolve the link to an absolute URL
                  def absHref = link.attr("abs:href") as String
                  if (shouldVisit(absHref)) {
                     // add() leaves the set unchanged and returns false if the element is already present
                     if (visitedUrls.add(absHref)) {
                        if (!linksToCrawl.contains(absHref)) {
                           linksToCrawl.add(absHref)
                           log.debug "new link ${absHref} added to queue"
                        }
                     }
                  }
               }
            } catch (org.jsoup.HttpStatusException e) {
               // ignore 404 and other HTTP error responses
            } catch (java.net.SocketTimeoutException e) {
               // handle exception
            } catch (IOException e) {
               // handle exception
            }
         }
      } catch (Exception e) {
         // handle exception
      }
   }

   private boolean shouldVisit(String url) {
      // filter out invalid links
      def visitUrl = false
      try {
         boolean followUrl = false
         def match = IGNORE_SUFFIX_PATTERN.matcher(url) as Matcher
         def isUrlValid = urlValidator.isValid(url)

         if (!followExternalLinks) {
            // follow only urls which start with one of the seed urls
            followUrl = seedURLs.any { seedUrl -> url.startsWith(seedUrl) }
         } else {
            // follow any url
            followUrl = true
         }
         visitUrl = (!match.matches() && isUrlValid && followUrl)
      } catch (Exception e) {
         // handle exception
      }
      return visitUrl
   }
}

As shown, it is possible to code a web crawler in fewer than 100 lines of code.
Just provide a list of seed URLs as the argument to the collectUrls method:

def seedUrls = []
seedUrls.add("https://jolorenz.wordpress.com")
seedUrls.add("http://www.cnn.com")		
def crawler = new BasicWebCrawler()
crawler.collectUrls(seedUrls)

HTH Johannes

Update:
Developers at FAU picked up and extended the crawler code. You can find their version on GitHub; it is completely open source.


2 Responses to Create a WebCrawler with less than 100 Lines of Code

  1. A slightly revised version of the code is available at https://github.com/RRZE-PP/crawler/
    Thank you again for granting permission to re-use it.

  2. Excellent post. I was checking continuously this weblog
    and I am impressed! Very useful info specifically the remaining
    part 🙂 I care for such info much. I used to be seeking this
    certain info for a very lengthy time. Thanks and best of
    luck.
