Java CSV Parser Comparison on GitHub

There are quite a few Java CSV parsers available. If you are unsure which one to choose or wonder how they perform, have a look at GitHub.

There is a project that does performance testing of various CSV parsers.

If you don't want to run the tests yourself, there are also charts with the results for different JDK versions.

Personally, I use the Jackson CSV library, which achieved quite good results in these tests.
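In case it helps, this is roughly what reading a CSV file with a header row looks like with Jackson's CSV data format, as a small Groovy sketch (the file name data.csv and the Jackson version are assumptions):

@Grab('com.fasterxml.jackson.dataformat:jackson-dataformat-csv:2.9.8')
import com.fasterxml.jackson.dataformat.csv.CsvMapper
import com.fasterxml.jackson.dataformat.csv.CsvSchema

// read a CSV file with a header row into one map per record
def mapper = new CsvMapper()
def schema = CsvSchema.emptySchema().withHeader()
mapper.readerFor(Map).with(schema).readValues(new File('data.csv')).each { row ->
   println row
}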

HTH Johannes


Install the AVR Toolchain on OS X

Programming AVRs or Arduinos is easy and convenient with the Arduino IDE. On the one hand, the high-level Arduino libraries hide a lot of complexity and you probably get results faster; on the other hand, these high-level libraries result in bloated hex files.

If you program small or cheap AVR MCUs, these hex files might not fit into the flash. And in the end, the Arduino IDE is basically a text editor with some syntax coloring. Therefore I prefer to program against avr-libc with the support of a powerful IDE like Eclipse. Here are the steps to install this AVR toolchain on OS X:

1. Download the AVR toolkit from the CrossPack website: www.obdev.at/products/crosspack/

2. Save the disk image

3. Run the disk image (dmg) and install the toolkit.

4. Open a terminal window and check the installation:

cd /usr/local/CrossPack-AVR-20131216/bin
avr-gcc --version

should respond with something like:

avr-gcc (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
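With the CrossPack bin directory on your PATH, you can already build and flash firmware from the command line. A minimal sketch (the source file name, MCU type and programmer are assumptions; adjust them to your hardware):

avr-gcc -mmcu=atmega328p -Os -o main.elf main.c
avr-objcopy -O ihex -R .eeprom main.elf main.hex
avrdude -c usbasp -p m328p -U flash:w:main.hex:i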

While some people are perfectly happy programming AVR firmware with vi, a few C source files, avr-gcc and make, I prefer a fully fledged and convenient IDE.

5. Download and install the Eclipse C++ IDE

6. Install the AVR Eclipse plugin: in Eclipse, open the Help -> Install New Software menu. In the dialog, enter AVR Eclipse Plugin and the update site http://avr-eclipse.sourceforge.net/updatesite/, press Enter and select the AVR Eclipse Plugin in the lower pane. Click Next, confirm all prompts, wait until the installation has finished, ignore the warning about unsigned software and restart Eclipse.

7. Configure AVRDude: open Eclipse -> Preferences -> AVR. Check the Paths settings and, if these are correct, click AVRDude -> Add to configure avrdude and the appropriate programmer.

8. Now you are ready to start your first AVR project. Select New -> Project -> C Project and the wizard will guide you through the initial setup of your project.

HTH Johannes


How to install Groovy on a Banana Pi or Raspberry Pi

The Raspberry Pi and the Banana Pi come with Python pre-installed.
If you prefer Groovy on these nice little devices, here are the instructions to install it.
GVM makes the whole process very easy and convenient.

Open a terminal window and use these commands:

curl -s get.gvmtool.net > installGroovy.sh
chmod u+x installGroovy.sh
./installGroovy.sh
source "$HOME/.gvm/bin/gvm-init.sh"
gvm install groovy

This will download and install the latest Groovy version.
Check the installed version:

groovy -version
Groovy Version: 2.4.2 JVM: 1.8.0 Vendor: Oracle Corporation OS: Linux
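As a quick smoke test, you can create a small script and run it (hello.groovy is just an example name):

// hello.groovy
println "Groovy ${GroovySystem.version} on Java ${System.getProperty('java.version')}"

groovy hello.groovy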

Happy Groovy coding on your Banana Pi,
Johannes


Adaptive machine learning in a streaming environment

Here is a very nice explanation of adaptive machine learning in a streaming environment.

Adaptive machine learning


How to Solve: “Too many connections”; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:

Hi,

In one of my projects I used GPars and its withPool() method to execute some asynchronous tasks.
The primary purpose of each task is to fetch some data from external sources and save the resulting data to a database. Very simple.

Since the tasks did not consume much memory and the performance was mostly determined by the response times of some external REST services, I used quite a high degree of parallelism with a pool size of 500.

// GPars
withPool(500) {
   tasksToDo.eachParallel { task ->
      // .. do some long running tasks asynchronously
      // and save the result to a database 

   }
}

Everything worked fine in the local Grails test environment.
For the production environment I used Amazon's AWS Elastic Beanstalk with an RDS MySQL instance as the database backend, both as t1.small instances.

But with the Amazon installation I got the following error message:

"Too many connections"; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:

2014-08-20 09:19:54,373 [ForkJoinPool-1-worker-218] ERROR crawler.BasicRssGatherer  - org.springframework.dao.DataAccessResourceFailureException: Hibernate operation: could not prepare statement; .....
.....
Data source rejected establishment of connection,  message from server: "Too many connections"
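To see how close you actually are to the limit, you can also query the MySQL server directly, for example with a small Groovy script (the JDBC URL, the credentials and the driver version are placeholders):

@GrabConfig(systemClassLoader = true)
@Grab('mysql:mysql-connector-java:5.1.38')
import groovy.sql.Sql

def sql = Sql.newInstance('jdbc:mysql://your-rds-endpoint:3306/yourdb', 'user', 'password', 'com.mysql.jdbc.Driver')
println sql.firstRow("SHOW VARIABLES LIKE 'max_connections'")   // configured limit
println sql.firstRow("SHOW STATUS LIKE 'Threads_connected'")    // currently open connections
sql.close()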

It turned out that Amazon RDS instances have a fixed limit on the number of available connections.
You can check this by:

  • opening the AWS console,
  • going to your RDS instance,
  • opening “Parameter Groups” and
  • searching for “max_connections”.

In my case the formula found there is {DBInstanceClassMemory/12582880}, which results in the following limits:

MODEL       max_connections   innodb_buffer_pool_size
----------  ----------------  ------------------------
t1.micro      34                 326107136 (  311M)
m1-small     125                1179648000 ( 1125M,  1.097G)
m1-large     623                5882511360 ( 5610M,  5.479G)
m1-xlarge   1263               11922309120 (11370M, 11.103G)
m2-xlarge   1441               13605273600 (12975M, 12.671G)
m2-2xlarge  2900               27367833600 (26100M, 25.488G)
m2-4xlarge  5816               54892953600 (52350M, 51.123G)

It is not possible to adjust these settings. So you can either switch to a bigger and more costly RDS instance, or reduce the pool size, which might reduce the overall performance of your app. This might be Amazon's approach to monetization. A third option is to roll out your own MySQL installation, where you can set the connection limit yourself, but then you lose all the benefits of a managed RDS database.
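If you stay on the small RDS instance, one workaround is to keep the high parallelism for the slow REST calls and only bound the number of simultaneous database writes. A minimal sketch of this idea (the permit count of 50, tasksToDo, fetchExternalData() and saveResult() are illustrative assumptions):

import groovyx.gpars.GParsPool
import java.util.concurrent.Semaphore

// at most 50 tasks may touch the database at the same time
def dbPermits = new Semaphore(50)

GParsPool.withPool(500) {
   tasksToDo.eachParallel { task ->
      def result = fetchExternalData(task)   // slow external REST call, high parallelism is fine here
      dbPermits.acquire()
      try {
         saveResult(result)                  // database write, bounded by the semaphore
      } finally {
         dbPermits.release()
      }
   }
}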

So, if you are working with a high number of parallel database accesses, be aware that Amazon's RDS MySQL instances have a fixed connection limit. The limit can be calculated with the formula found in the max_connections parameter group entry: DBInstanceClassMemory/12582880.

HTH
Johannes


Create a WebCrawler with less than 100 Lines of Code

Hi,

While there are excellent open source web crawlers available (e.g. crawler4j), I wanted to write a crawler with very little code.
Very often I use the famous Jsoup library. Jsoup has some nice features to find and extract data from a URL:

// extract URL from HTML using Jsoup
def doc = Jsoup.connect("http://example.com").get() as Document
def links = doc.select("a[href]") as Elements

To crawl the web, you can use recursive programming and code a closure. This produces very compact code, but deep recursion can end in a stack overflow or out-of-memory scenario. If you prefer this approach, remember Groovy's @TailRecursive annotation, which can help to avoid this (see the sketch below).
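@TailRecursive only works on methods (not closures) whose recursive call is the last action; the compiler then rewrites the recursion into a loop. A minimal sketch of that idea (the class name and the crawl/extractLinks methods are only illustrative; there is no depth limit or URL filtering):

import groovy.transform.TailRecursive
import org.jsoup.Jsoup

class RecursiveCrawler {

   @TailRecursive
   Set crawl(List frontier, Set visited) {
      if (frontier.isEmpty()) {
         return visited
      }
      def url = frontier.head()
      def rest = frontier.tail()
      if (visited.add(url)) {
         // append newly found links to the remaining frontier
         rest = rest + extractLinks(url).findAll { !visited.contains(it) }
      }
      return crawl(rest, visited)   // tail call, rewritten into a loop by @TailRecursive
   }

   // extract absolute links from one page with Jsoup
   List extractLinks(String url) {
      Jsoup.connect(url).get().select("a[href]").collect { it.attr("abs:href") }
   }
}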

Another solution is to work with queues, e.g. ArrayDeque. Array deques have no capacity restrictions, they grow as necessary, and they are considerably faster than Stack or LinkedList. But remember that an ArrayDeque is not thread-safe; if you need thread safety, you have to provide your own synchronization. ArrayDeque implements java.util.Deque, which defines a container that supports fast element insertion and removal at both ends. This is the approach used in the following code.

package crawler

import org.codehaus.groovy.grails.validation.routines.UrlValidator
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements

import groovy.util.logging.*
import java.util.regex.Matcher
import java.util.regex.Pattern

@Log4j
class BasicWebCrawler {

   private final boolean followExternalLinks = false
   private final int timeout = 3000
   private final String userAgent = "Mozilla"

   def linksToCrawl = [] as ArrayDeque
   def visitedUrls = [] as HashSet
   def seedURLs = []
   def urlValidator = new UrlValidator()

   private final static Pattern IGNORE_SUFFIX_PATTERN = Pattern.compile(
      ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4" +
      "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))\$")

   def collectUrls(List seedURLs) {
      this.seedURLs = seedURLs
      seedURLs.each { url ->
         linksToCrawl.add(url)
      }
      try {
         while (!linksToCrawl.isEmpty()) {
            // "poll" removes and returns the first url in the "queue"
            def urlToCrawl = linksToCrawl.poll() as String
            try {
               visitedUrls.add(urlToCrawl)
               // fetch the page and extract all links using Jsoup
               def doc = Jsoup.connect(urlToCrawl).userAgent(userAgent).timeout(timeout).get() as Document
               def links = doc.select("a[href]") as Elements
               links.each { Element link ->
                  // resolve the absolute path
                  def absHref = link.attr("abs:href") as String
                  if (shouldVisit(absHref)) {
                     // add() returns false if the set already contains the element
                     if (visitedUrls.add(absHref) && !linksToCrawl.contains(absHref)) {
                        linksToCrawl.add(absHref)
                        log.debug "new link ${absHref} added to queue"
                     }
                  }
               }
            } catch (org.jsoup.HttpStatusException e) {
               // ignore 404 and other HTTP errors
            } catch (java.net.SocketTimeoutException e) {
               // handle exception
            } catch (IOException e) {
               // handle exception
            }
         }
      } catch (Exception e) {
         // handle exception
      }
   }

   private boolean shouldVisit(String url) {
      // filter out invalid urls and binary resources
      def visitUrl = false
      try {
         def match = IGNORE_SUFFIX_PATTERN.matcher(url) as Matcher
         def isUrlValid = urlValidator.isValid(url)
         // follow only urls which start with one of the seed urls,
         // unless external links are explicitly allowed
         def followUrl = followExternalLinks || seedURLs.any { seedUrl -> url.startsWith(seedUrl) }
         visitUrl = (!match.matches() && isUrlValid && followUrl)
      } catch (Exception e) {
         // handle exception
      }
      return visitUrl
   }
}

As shown, it is possible to code a web crawler in less than 100 lines of code.
Just provide a list of seed URLs as an argument to the collectUrls method:

def seedUrls = []
seedUrls.add("https://jolorenz.wordpress.com")
seedUrls.add("http://www.cnn.com")		
def crawler = new BasicWebCrawler()
crawler.collectUrls(seedUrls)

HTH Johannes
