
Using Camel as a filesystem indexing tool

Following on from my article Creating an Apache Camel route in Java using Maven, step-by-step, this article presents a much more ambitious — and somewhat unusual — application of Camel: to index a filesystem into a relational database. The purpose of the application is to maintain an index that allows the user to search a whole filesystem for files by name, size, date, or MIME type. MIME type information is extracted from each file using an open-source library called SimpleMagic maintained by Gray Watson. The application could easily be extended to index more useful information, perhaps by parsing out keywords or tags from each file. Indexing a large filesystem takes a long time, but it only has to be done infrequently, and the index can be extended incrementally as new files are added.

I assume that the reader is broadly familiar with the use of Maven, as this application simply has too many dependencies to manage comfortably using only command-line invocations. If you have no idea what Camel is, then I would recommend reading Creating an Apache Camel route in Java from the ground up first. This example uses JMS messaging; if you're completely unfamiliar with JMS then you might like to read JMS messaging from the ground up using Apache ActiveMQ as well but, in reality, most of the JMS details are hidden by Camel. It also uses JDBC but, again, Camel's use of JDBC is idiosyncratic, and you don't need to know much about JDBC in general to follow this example. Some basic understanding of SQL syntax is helpful. Of course, since this is a Java example, I assume some familiarity with the Java programming language.

This example demonstrates the following Camel features:

  • The use of the Apache Derby database with a Camel jdbc: endpoint
  • Database connection pooling using Apache DBCP
  • Using Camel SimpleRegistry with DBCP to link a jdbc: endpoint to a database
  • Use of the file: endpoint to scan directories recursively
  • The use of Java-based filters to select which files are included by the file: endpoint
  • Using null messages to control program lifecycle
  • Using an embedded ActiveMQ message broker to aggregate messages from different threads
Note:
This example uses specific features in Camel 2.12; it will not work with earlier versions.
I can't present all the source code for the application in this article, only those parts that are specific to Camel. As always, however, the complete source code is available from the Downloads section at the end of this article.

I suppose that very few people will want to search for files using a SQL prompt; a companion application to the one described here will search the database according to criteria specified on the command line. However, that's not a Camel application, so I won't be describing it any further here.

About the application

This application scans one or more directories recursively, using the Camel file: endpoint. If multiple directories are specified, then they are scanned as separate routes and therefore on separate threads. My intention with this arrangement was to provide a way to scan multiple physical disks on separate threads, so that the indexing wasn't completely limited by disk seek times.

Each file scanning thread places an object of program-defined type FileInfo onto a JMS queue called filelist. This queue is consumed by another (single) route which uses a jdbc: endpoint to write to an Apache Derby database.

Because indexing can be a slow process on large filesystems, it seemed better to use a real database server, and not an embedded database, to collect the results. An embedded database is easier to set up and consumes fewer resources, but locking complications make it difficult for such a database to support multiple concurrent clients. We need a way to ensure that one process can be doing a search of the database, while another is updating it, without risking data corruption. Derby is an all-Java open-source database that can run as a network server and synchronize multiple clients.

The file scanning process not only records the easily-obtainable information about each file — name, size, modification times — but also reads the beginning of each file to attempt to determine its MIME type using the SimpleMagic library. This process is time-consuming where many files are involved, so we only want to do it for files that are newly added to the filesystem. Therefore, the file: endpoint is linked to a Java-based filter that looks up each file in the database and, if it already exists, the file is not processed any further. This operation is done in Java because I could not find any convenient way to do this check using only Camel features.

Directory scanner route

The directory scanning Camel route is defined as follows:
          // Read the directory recursively to...
          from ("file:" + directory +
            "?startingDirectoryMustExist=true&sendEmptyMessageWhenIdle=true" +
            "&recursive=true&filter=#filefilter&noop=true")
          // ... a logger, and then to ...
            .to("log:net.kevinboone.camelindex.DirectoryToJms?level=DEBUG")
          // ... a processor that replaces the file body with a FileInfo  ...
            .process (new Processor() {
              public void process(Exchange ex) throws Exception
                {
                org.apache.camel.Message in = ex.getIn();
                // Note that the body of the message will be null when
                //  the directory scanner has finished. When this happens
                //  we must simply write a new null body, without
                //  trying to get file information from this non-existent
                //  file
                if (in.getBody() == null)
                  {
                  in.setBody (null);
                  }
                else
                  {
                  GenericFile file = (GenericFile)in.getBody();
                  logger.info ("Indexing file " + file.getAbsoluteFilePath());
                  in.setBody
                    (new FileInfo (new File (file.getAbsoluteFilePath())));
                  }
                }
            })
          // ... and finally to a JMS queue called filelist
          .to("jms:filelist");
The file: endpoint is defined with recursive=true so that it descends the directory tree. The setting sendEmptyMessageWhenIdle=true tells the endpoint to send a null message as soon as directory scanning is complete. Since there can potentially be multiple file scanning routes in operation, and each one sends a null message when it is complete, the application must count the number of null messages to determine when the whole task is complete.
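
The counting of null messages can be sketched in plain Java as follows. This is only an illustration: the class name ScanCompletionCounter and its method are hypothetical, and the application itself uses a simple directoriesToScan field, as shown in the database writer route. An AtomicInteger is used here merely so that the sketch stays correct if it is ever called from more than one thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: counts the "scan finished" (null) messages.
class ScanCompletionCounter {
    // Number of directory scanners that have not yet reported completion
    private final AtomicInteger remaining;

    ScanCompletionCounter(int directoriesToScan) {
        this.remaining = new AtomicInteger(directoriesToScan);
    }

    // Call once for each null message received.
    // Returns true when every scanner has reported, i.e. the whole task is done.
    boolean scannerFinished() {
        return remaining.decrementAndGet() <= 0;
    }
}
```

With two directories to scan, the first call to scannerFinished() returns false and the second returns true, at which point the application can shut down.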

If Camel consumes from a file: endpoint and writes to a jms: endpoint, then notionally only file information is written to the message broker, not the file contents. This is usually not what is required, but in this case it is: we care only about the first few hundred bytes of the file, and the rest is irrelevant. Unfortunately, it is all too easy to coerce Camel into expanding a GenericFile (information) message into the file contents, which is most emphatically not what we want. It seems that any kind of processing done on the GenericFile message will have this effect. Consequently, this application uses a .process step to transform the Camel-specific object into an application-specific FileInfo, which cannot accidentally be transformed to the file's contents. This is an irksome necessity, but we make it a virtue by using the FileInfo to carry the other information about the file that the application needs, notably the MIME type. The relevant code that does this transformation is

  GenericFile file = (GenericFile)in.getBody();
  in.setBody
    (new FileInfo (new File (file.getAbsoluteFilePath())));
The examination of the file's contents is done in the constructor of the FileInfo; I haven't included details in this article, because this operation is not connected with Camel. Of course, you can look at the source code to see exactly how it's done.
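
For readers who want a rough picture without opening the source bundle, here is a minimal sketch of what such a class might look like. It is not the author's FileInfo: the name FileInfoSketch is hypothetical, and it uses the JDK's Files.probeContentType() in place of SimpleMagic, which is a swapped-in and much cruder technique. The getter names match the properties the route reads later (pathname, length, modified, mimeType). The class is Serializable on the assumption that the object travels over JMS as a serialized Java object.

```java
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.sql.Timestamp;

// Hypothetical stand-in for the application's FileInfo class.
// Declared package-private only to keep this sketch in a single file.
class FileInfoSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String pathname;
    private final long length;
    private final Timestamp modified;
    private final String mimeType;

    FileInfoSketch(File file) throws IOException {
        this.pathname = file.getAbsolutePath();
        this.length = file.length();
        this.modified = new Timestamp(file.lastModified());
        // Assumption: JDK content-type probe instead of SimpleMagic.
        // The probe may return null when the type cannot be determined.
        String probed = Files.probeContentType(file.toPath());
        this.mimeType = (probed != null) ? probed : "application/octet-stream";
    }

    public String getPathname() { return pathname; }
    public long getLength() { return length; }
    public Timestamp getModified() { return modified; }
    public String getMimeType() { return mimeType; }
}
```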

The setting filter=#filefilter instructs the file: endpoint to refer each filename to a Java-based filter class. This class, which is of type CamelIndexFileFilter here, must implement the GenericFileFilter interface. In outline the class looks like this:

public class CamelIndexFileFilter<T> implements GenericFileFilter<T>
  {
  public boolean accept(GenericFile<T> file)
    {
    String pathname = file.getAbsoluteFilePath();
    // Logic to determine if pathname is in the database
    return !fileIsInDatabase;
    }
  }
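
The decision the filter makes can be separated from both Camel and the database, as in the following sketch. The class name NewFileSelector is hypothetical, and the lookup is passed in as a predicate; in the real application the lookup would presumably be a query against the files table by pathname, but that detail is an assumption here.

```java
import java.util.function.Predicate;

// Hypothetical helper holding only the accept/reject decision.
class NewFileSelector {
    // Answers "is this pathname already in the index?"
    private final Predicate<String> alreadyIndexed;

    NewFileSelector(Predicate<String> alreadyIndexed) {
        this.alreadyIndexed = alreadyIndexed;
    }

    // Accept a file only if it has not been indexed yet,
    // mirroring "return !fileIsInDatabase" in the outline above.
    boolean accept(String absolutePath) {
        return !alreadyIndexed.test(absolutePath);
    }
}
```

A filter's accept() method could then delegate to accept(file.getAbsoluteFilePath()), which keeps the database code easy to replace when testing.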
How does Camel know to associate the name #filefilter with an object of class CamelIndexFileFilter? Well, like most name mapping in Camel, we store a reference to the object in the Camel registry, like this:
    SimpleRegistry reg = new SimpleRegistry() ;
    CamelContext camelContext = new DefaultCamelContext(reg);
    reg.put("filefilter", new CamelIndexFileFilter(camelContext));
On the subject of the registry, we must also map the configuration for the jdbc: endpoint to its associated DataSource. The DataSource provides a getConnection() method that we (or Camel) can use to create new database connections. In this application, the DataSource implementation comes from an Apache DBCP connection pool, and is created in the code as follows:
    BasicDataSource ds = new BasicDataSource();
    ds.setDriverClassName ("org.apache.derby.jdbc.ClientDriver");
    ds.setUrl ("jdbc:derby://localhost:1527/camelindex;create=true");
    reg.put("datasource", ds);
Note that it is stored in the registry under the name datasource, which is the name used by the jdbc: endpoint:
          .to("jdbc:datasource...")
Slightly oddly, the jms: endpoint is configured in a completely different way. We use the endpoint like this:
          .to("jms:filelist...")
but filelist is not a name in the registry. Instead, it is the actual name of the JMS destination on which messages will be placed. This destination will be created if it does not exist. So how do we configure the connection to the message broker? This is done when registering the jms component itself:
    javax.jms.ConnectionFactory jmsConnectionFactory =
      new ActiveMQConnectionFactory ("vm://internal?broker.persistent=false");
    camelContext.addComponent ("jms",
      JmsComponent.jmsComponentAutoAcknowledge(jmsConnectionFactory));
The URI vm://internal indicates that the broker will be embedded in the application's JVM, not a separate process. The name "internal" is actually completely arbitrary, but I disapprove of the common use of "localhost" here — it isn't actually a hostname, and is misleading. persistent=false indicates that the messages will only be held in memory, and not written to disk. This is much faster and, in this application, quite safe, since if anything goes horribly wrong we can always recreate the data simply by running the program again.

Database writer route

The database writer route consumes FileInfo messages from the message broker, and writes them to the database.
        from ("jms:filelist")
          .to ("log:net.kevinboone.camelindex.JmsToIndex?level=DEBUG")
          .choice()
            .when(simple("${body} == null"))
              .process (new Processor() {
                public void process(Exchange ex) throws Exception
                  {
                  directoriesToScan--;
                  if (directoriesToScan <= 0)
                    MainApp.stop();
                  }
              })
              .endChoice()
            .otherwise()
              .setHeader ("pathname", simple("${body.pathname}"))
              .setHeader ("length", simple("${body.length}"))
              .setHeader ("modified", simple("${body.modified}"))
              .setHeader ("mimeType", simple("${body.mimeType}"))
              .setBody (constant
                 ("insert into files(pathname, length, modified, mimetype) " +
                   "values (:?pathname, :?length, :?modified, :?mimeType)"))
              .to("jdbc:datasource?useHeadersAsParameters=true");

This route uses a choice() step to provide different handling for FileInfo messages and null messages; remember that the file scanner route will write a null message when it has read all its files. In outline, the choice() structure looks like this:
          .choice()
            .when(simple("${body} == null"))
               // Do something
              .endChoice()
            .otherwise()
               // Do something
The endChoice() step is necessary because, despite appearances, the Java statements do not form a programming language construct. That is, the Java "fluent builder syntax" does not allow a true if...then... construct to be formed. The endChoice() tells Camel to reset its state to that at the start of the choice(), so that it knows how to interpret otherwise(). Despite this, my experience is that choice() constructs in Java are often quite fiddly to get right, and small changes to the ordering of the elements can have quite dramatic implications.

The choice in this case is based on the expression

            .when(simple("${body} == null"))
'Simple' is the, well, simple embedded scripting syntax used by Camel. Because we're using Java, the same thing could more familiarly be expressed as a Java expression. However, the importance of Simple is that exactly the same expression can be used when specifying routes in XML using Spring or Blueprint. In any event, I hope it is clear what this expression tests.

Assuming that the body is not null, and is therefore a FileInfo, we must write it to the database. The way the jdbc: endpoint works is to take an SQL statement as the message body, substitute certain parameters if necessary, and then run the statement against the database. So we must take the existing message body, extract the relevant information from it into message headers, and then substitute the headers into the SQL statement. Note that we're using Simple again here:

              .setHeader ("mimeType", simple("${body.mimeType}"))
Remember that body is an object of type FileInfo. The Simple expression says to execute the method getMimeType() on that object, and store the result in a message header called mimeType. That header is then substituted for the token :?mimeType in the SQL statement. Although it isn't very visible, Camel does some handy implicit type conversions here: the database table is actually defined as
 create table files
  (
  pathname varchar(1024) primary key,
  length bigint,
  modified timestamp,
  mimetype varchar(256)
  )
so we have varchar, timestamp, and bigint fields, all converted implicitly to the right JDBC types.
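
To make the substitution step more concrete, here is a minimal sketch of how :?name tokens could be mapped onto an ordinary JDBC prepared statement. It is an illustration of the idea only, not Camel's actual implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of named-parameter handling.
class NamedParameterSketch {
    // Matches tokens of the form :?name
    private static final Pattern NAMED = Pattern.compile(":\\?(\\w+)");

    // Replace each :?name token with a positional JDBC placeholder.
    static String toPositionalSql(String sql) {
        return NAMED.matcher(sql).replaceAll("?");
    }

    // List the parameter names in the order they appear, so that the
    // matching header values can be bound to positions 1, 2, 3...
    static List<String> parameterNames(String sql) {
        List<String> names = new ArrayList<>();
        Matcher m = NAMED.matcher(sql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

Each name in the resulting list would be looked up among the message headers, and the header value bound to the corresponding position in the prepared statement.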

Building and running the example

Running the application requires first that the Apache Derby database be installed and running.

Installing and running Apache Derby

Note:
At the time of writing the latest version of Derby is 10.10.2. However, that version seems to experience deadlocks when Camel writes large amounts of data to it in a short time. Version 10.8.3 worked fine.
Apache Derby is available from the Derby web site. Installation amounts to unpacking the archive into a convenient directory. To run Derby as a network server:
$ ./bin/startNetworkServer
Security manager installed using the Basic server security policy.
Apache Derby Network Server - 10.8.3.0 - (1582446) 
  started and ready to accept connections on port 1527
(There is a batch file for Windows users.)
Note:
By default, Derby writes its database in the current working directory. You don't need to cd to the installation directory to run it, but you do need to be careful about where the data will end up.

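Derby ships with a tool called ij which provides an SQL prompt. This can be very useful for inspecting the database for debugging purposes. To use ij to examine (and optionally create) a database, put its name in the connection URL. The example below uses the name fileinfo; the indexer's own database is camelindex, as set in the DataSource URL shown earlier: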

$ ./bin/ij
ij version 10.8
ij> connect 'jdbc:derby://localhost:1527/fileinfo;create=true';

Building the file indexer

Since the application is based on Maven, building it should be a matter of unpacking the source code bundle (available from the Downloads section), and running
$ mvn compile
The application has a huge number of dependencies, and the first run of Maven will download about 20 MB of libraries.

Running the indexer and building an index

Assuming it builds correctly, you can run the indexer, specifying a starting directory, using Maven like this:
$ mvn exec:java -Dexec.args=/home/kevin/docs
I would recommend testing with a copy of a directory structure, not on data you actually need to keep, just in case something goes horribly wrong. Because using Maven to run the application is a bit of a nuisance, you can build a stand-alone executable JAR file that contains all the dependencies needed (and ends up about 12 MB in size, because Maven is not very selective about what it includes).
$ mvn compile assembly:single
To run the JAR file:
$ java -jar /path/to/jarfile.jar /some/directory
The use of assembly:single, and the necessary changes to pom.xml were covered in my article Creating an Apache Camel route in Java using Maven, step-by-step.

Testing the index

Assuming that the indexing operation completed successfully, we can query the database to find particular files. Since we don't have a specific client for this operation yet, we can in the meantime use the SQL prompt provided by Derby.

For example, to find all the MP3 files on my filesystem larger than 30 MB, I could do this:

ij> select * from files where length > 30000000 and mimetype='audio/mpeg';
I could use similar expressions to find, for example, all the .java files modified in the last week, etc.
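
As an illustration of that last query, a client might run it over JDBC roughly as follows. This is a sketch under assumptions, not the companion search application: the class and method names are hypothetical, it assumes the files table defined earlier, and it expects a Connection obtained elsewhere (for example from the DBCP DataSource).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical query helper: .java files modified in the last week.
class RecentJavaFilesSketch {
    static final String QUERY =
        "select pathname from files where pathname like '%.java' and modified > ?";

    // The cutoff is computed in Java and passed as a parameter,
    // which avoids relying on database-specific date arithmetic.
    static Timestamp oneWeekBefore(Instant now) {
        return Timestamp.from(now.minus(Duration.ofDays(7)));
    }

    static List<String> find(Connection conn, Instant now) throws SQLException {
        List<String> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
            ps.setTimestamp(1, oneWeekBefore(now));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(rs.getString("pathname"));
                }
            }
        }
        return result;
    }
}
```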

Summary

This is a rather unusual application of Camel, but I hope it has illustrated some interesting features of the platform.

Downloads

Source code bundle

Copyright © 1994-2013 Kevin Boone. Updated May 09 2014