Create a WebCrawler with fewer than 100 Lines of Code

Hi,

while there are excellent open source web crawlers (e.g. crawler4j), I wanted to write a crawler with very little code.
I often use the well-known Jsoup library. Jsoup has some nice features to find and extract data from a URL:

// extract links from HTML using Jsoup
def doc = Jsoup.connect("http://example.com").get() as Document
def links = doc.select("a[href]") as Elements

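To crawl the web, you can use recursion and code a closure. This can be efficient and produces very compact code, but on a deep crawl it can exhaust the call stack or end up in an out-of-memory scenario. If you prefer this way, remember Groovy’s @TailRecursive annotation (groovy.transform.TailRecursive, available since Groovy 2.3). Applied to a method whose recursive call is its last action, it rewrites the recursion into a loop, so the stack no longer grows. For closures, Groovy offers trampoline() for the same purpose.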
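As a minimal sketch of what the annotation buys you (the class ChainFollower and its method follow are made up for illustration and do not fetch anything), the following method recurses once per "page" and would normally blow the stack for a chain this long:

```groovy
import groovy.transform.TailRecursive

class ChainFollower {
    // Hypothetical stand-in for "handle one page, then recurse for the rest".
    // The recursive call is the last action, so @TailRecursive can turn it into a loop.
    @TailRecursive
    long follow(long pagesLeft, long pagesVisited) {
        if (pagesLeft == 0) {
            return pagesVisited
        }
        return follow(pagesLeft - 1, pagesVisited + 1)
    }
}

def follower = new ChainFollower()
// one million nested calls, executed as a loop
assert follower.follow(1000000, 0) == 1000000
```

Note that the annotation only helps if the recursive call really is in tail position; a crawler that recurses inside an each loop over the links of a page does not qualify as written.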

Another solution is to work with a queue, e.g. ArrayDeque. Array deques have no capacity restrictions, so they grow as necessary, and they are considerably faster than Stack or LinkedList. But remember that an ArrayDeque is not thread-safe. If you need thread safety, you have to provide your own synchronization code. ArrayDeque implements java.util.Deque, which defines a container that supports fast adding and removal of elements at both the beginning and the end. The crawler code below uses it as a FIFO queue.
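Before the full crawler, here is the queue idea in isolation. This is only a sketch: the map graph is a made-up, in-memory stand-in for real pages and their links, so nothing is fetched.

```groovy
// Hypothetical link graph: page -> links found on that page
def graph = [
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': ['a'],
    'd': []
]

def queue = new ArrayDeque<String>()      // pages still to process (FIFO)
def seen = new LinkedHashSet<String>()    // pages already queued, in discovery order

queue.add('a')
seen.add('a')
while (!queue.isEmpty()) {
    def page = queue.poll()               // removes and returns the head of the queue
    graph[page].each { link ->
        // Set.add returns false for duplicates, so every page is queued only once
        if (seen.add(link)) {
            queue.add(link)
        }
    }
}

assert seen.toList() == ['a', 'b', 'c', 'd']
```

Because the set is checked at the moment a link is queued, the loop terminates even though 'c' links back to 'a'.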

package crawler

import groovy.util.logging.Log4j
import org.codehaus.groovy.grails.validation.routines.UrlValidator
import org.jsoup.HttpStatusException
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements

import java.util.regex.Pattern

@Log4j
class BasicWebCrawler {

   // urls ending in one of these suffixes are never fetched
   private static final Pattern IGNORE_SUFFIX_PATTERN = Pattern.compile(
         ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4" +
         "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))\$")

   private final boolean followExternalLinks = false
   private final int timeout = 3000             // milliseconds
   private final String userAgent = "Mozilla"

   def linksToCrawl = [] as ArrayDeque          // urls still to fetch
   def visitedUrls = [] as HashSet              // urls already fetched or queued
   def seeds = []                               // kept as a field so that shouldVisit can use it
   def urlValidator = new UrlValidator()

   def collectUrls(List seedURLs) {
      seeds = seedURLs
      seedURLs.each { url ->
         // Set.add returns false for duplicates, so every seed is queued only once
         if (visitedUrls.add(url)) {
            linksToCrawl.add(url)
         }
      }
      try {
         while (!linksToCrawl.isEmpty()) {
            // "poll" removes and returns the first url in the queue
            def urlToCrawl = linksToCrawl.poll() as String
            try {
               // extract links from HTML using Jsoup
               def doc = Jsoup.connect(urlToCrawl).userAgent(userAgent).timeout(timeout).get() as Document
               def links = doc.select("a[href]") as Elements
               links.each { Element link ->
                  // resolve the link to an absolute url
                  def absHref = link.attr("abs:href") as String
                  // If the set already contains the url, add() leaves the set unchanged and returns false.
                  if (shouldVisit(absHref) && visitedUrls.add(absHref)) {
                     linksToCrawl.add(absHref)
                     log.debug "new link ${absHref} added to queue"
                  }
               }
            } catch (HttpStatusException e) {
               // e.g. 404: skip this page
               log.debug "skipping ${urlToCrawl}: HTTP status ${e.statusCode}"
            } catch (SocketTimeoutException e) {
               log.debug "skipping ${urlToCrawl}: timeout"
            } catch (IOException e) {
               log.debug "skipping ${urlToCrawl}: ${e.message}"
            }
         }
      } catch (Exception e) {
         // anything unexpected ends the crawl
         log.error "crawl aborted: ${e.message}"
      }
   }

   private boolean shouldVisit(String url) {
      // filter out static resources and invalid links
      if (IGNORE_SUFFIX_PATTERN.matcher(url).matches() || !urlValidator.isValid(url)) {
         return false
      }
      // unless external links are allowed, follow only urls which start with one of the seed urls
      return followExternalLinks || seeds.any { seedUrl -> url.startsWith(seedUrl as String) }
   }
}

As shown, it is possible to code a web crawler with fewer than 100 lines of code.
Just pass a list of seed urls as the argument to the collectUrls method:

def seedUrls = []
seedUrls.add("http://jolorenz.wordpress.com")
seedUrls.add("http://www.cnn.com")		
def crawler = new BasicWebCrawler()
crawler.collectUrls(seedUrls)
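To get a feeling for which links the filter lets through, the suffix check and the seed check can be tried on their own. The sketch below is not the crawler's code: the pattern is shortened to a few suffixes, the seed is a placeholder, and the UrlValidator check is left out so that the snippet runs without Grails or Jsoup.

```groovy
import java.util.regex.Pattern

// Shortened variant of the ignore pattern, for illustration only
def ignoreSuffix = Pattern.compile('.*\\.(css|js|gif|jpe?g|png|pdf|zip)$')
def seeds = ['http://example.com']   // placeholder seed url

// true if the url is not a static resource and starts with one of the seeds
def shouldVisit = { String url ->
    !ignoreSuffix.matcher(url).matches() && seeds.any { url.startsWith(it) }
}

assert shouldVisit('http://example.com/about.html')
assert !shouldVisit('http://example.com/logo.png')    // ignored suffix
assert !shouldVisit('http://other.org/index.html')    // external link
```

The pattern is case-sensitive, so a url ending in .PNG would pass the suffix check.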

HTH Johannes

Posted in Development, Groovy

How to solve: *** java.lang.instrument ASSERTION FAILED ***: “!errorOutstanding” with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line:

Hi,

while testing some memory-intensive, recursive functions with GGTS 3.6 and Grails 2.4, execution was interrupted with the following error message:

*** java.lang.instrument ASSERTION FAILED ***: 
"!errorOutstanding" with message transform method call failed at 
../../../src/share/instrument/JPLISAgent.c line: 844

Although this error message does not really help in finding the root cause, I guess it has to do with the forked execution of tests.

To make the error disappear, you can either comment out the complete grails.project.fork section in BuildConfig.groovy or use the following setting, which turns off forking for test-app:

// BuildConfig.groovy
//

forkConfig = [maxMemory: 1024, minMemory: 64, debug: false, maxPerm: 256]
grails.project.fork = [
    ...
    // configure settings for the test-app JVM, uses the daemon by default
    test: false,
    //test: [maxMemory: 768, minMemory: 64, debug: false, maxPerm: 256, daemon:true],
    ...
]

After you have edited your BuildConfig.groovy, also run grails clean, and you will be back in business.

HTH Johannes

Posted in Development, Grails

Debugging in GGTS fails with: Error – your app path – does not appear to be part of a Grails application.

Hi,

recently I worked on a Grails 2.4.x project in GGTS 3.6. After some coding, refactoring and reconfiguration, debugging stopped working with the following error message:

Error | /Users/jolo/Documents/workspace-ggts-3.6.0.M1/FeedHarvesterGPars does not appear to be part of a Grails application.

The following commands are supported outside of a project:

add-proxy
clear-proxy
create-app
create-multi-project-build
create-plugin
help
install-app-templates
list-plugins
package-plugin
plugin-info
remove-proxy
set-proxy
|Run 'grails help' for a complete list of available scripts.

The weird thing was that grails run-app still worked; only debugging was no longer possible.

As a first attempt, I went through the standard routine: I deleted the local caches (e.g. rm -rf .grails) and ran the typical grails commands:

grails clean
grails compile
grails refresh-dependencies

The result was that

grails run-app 

was still successful, but

grails run-app --debug-fork

failed again, with the error message above.

With that kind of error message, it is almost impossible to track down the root cause, especially when grails run-app works and grails run-app --debug-fork does not. These are the moments when I dislike GGTS/Grails and all of its magic.

After losing a fair amount of my precious time guessing what the heck was going on, I found the cause: the project had been pulled from the local GGTS repository into a new local git repository!

To solve the issue, do the following in GGTS:

  1. Choose Run -> Debug Configurations.
  2. Find the Grails section.
  3. Select the configuration that belongs to your project.
  4. Delete it with a right mouse click.

So be warned if you move your project to a local git repository!

HTH
Johannes

Posted in Development, Grails

Deploy a Grails App on AWS Beanstalk

Hi,

sometimes I use AWS Beanstalk to host my Grails apps.

It’s convenient, especially when it comes to managed databases, and it does its job at a reasonable price.

Here you will find a nice description of how to deploy a Grails app to AWS Beanstalk.

HTH Johannes

Posted in Development, Grails

Get a URL Mappings Report in Grails 2.3

With Grails 2.3 you can get a report about your URL Mappings.

To get the report, just execute:

grails url-mappings-report

Then you will get an output similar to this:

|Loading Grails 2.3.7
|Configuring classpath
|Environment set to development
|URL Mappings Configured for Application
|---------------------------------------

Dynamic Mappings
 | *    | /${controller}/${action}?/${id}?(.${format})?         | Action: (default action)
 | *    | /                                                     | View:   /index
 | *    | ERROR: 500                                            | View:   /error
 | *    | /viewCities                                           | Action: (default action)

Controller: cities
 | *    | /api/cities                                           | Action: {GET=list, POST=save, PUT=unsupported, DELETE=unsupported}

Controller: city
 | GET  | /api/city/${id}/create                                | Action: create
 | POST | /api/city/${id}                                       | Action: save
 | GET  | /api/city/${id}                                       | Action: show
 | GET  | /api/city/${id}/edit                                  | Action: edit

Controller: dbdoc
 | *    | /dbdoc/${section}?/${filename}?/${table}?/${column}?  | Action: (default action)

HTH Johannes

Posted in Development, Grails

Install Plugins in Grails 2.3

With Grails 2.3, the familiar plugin manager is no longer available.
To install a plugin, just follow these steps:

- add the plugin to: BuildConfig.groovy

- execute: grails run-app

If you end up with some errors, then

- execute: grails refresh-dependencies
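The first step can be sketched as follows. The plugin name and version are placeholders, not a real plugin. Replace them with the coordinates of the plugin you want, and pick the scope (build, compile, runtime, test) that fits it:

```groovy
// grails-app/conf/BuildConfig.groovy (excerpt)
grails.project.dependency.resolution = {
    repositories {
        grailsCentral()
        mavenCentral()
    }
    plugins {
        // placeholder coordinates in the form ":name:version"
        compile ":example-plugin:1.0"
    }
}
```

On the next grails run-app, the plugin is resolved and installed together with the other dependencies.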

HTH Johannes

Posted in Development, Grails