New feature: Allow different connection-id in onerror element

I’d like to announce a minor improvement that was requested by several people. It is now possible to use connection-id in <onerror> elements. A few examples:

<!--
Example of how to use onerror to emulate CREATE TABLE ... IF NOT EXISTS:
Drops the table if it already exists and then calls CREATE again (retry=true).
-->
<script connection-id="in">
    CREATE TABLE Table1;
    <onerror message=".*Table already exists.*" retry="true">
        DROP TABLE Table1;
    </onerror>
</script>

<!--
Example of how to log all errors in the script using a custom Java class
-->
<script connection-id="in">
    INSERT INTO TableName VALUES (1, 'Value1');
    <onerror connection-id="java">
        //"error" object represents error details (java.lang.Throwable)
        Throwable error = (Throwable)get("error");
        com.app.ApplicationLogger.logEtlError(error);
    </onerror>
</script>

New feature: Flexible formatting and parsing rules in CSV and text files

Why is it necessary?

Previously, fields in CSV files were parsed as text, and the necessary conversions had to be done manually before sending data to the output datasource. Conversely, data had to be formatted manually before being written into a CSV/text file.

Now it’s possible to replace most of the boilerplate code with a simple declarative approach: formatting and parsing rules are defined as properties of a connection element.

Example. Import CSV data into a database

Let’s assume there is an input file with exchange rate data in CSV format:

CurrencyPair,RateDateTime,RateBid,RateAsk
EUR/USD,2012-02-05 17:00:34.427,1.311800,1.311970

and our goal is to import this data into the Rates table (columns Time, CurrencyPair, Bid and Ask).

The following table describes parsing rules used when importing CSV data into a database:

CSV Column (Input) | Database Column (Output) | Parse type | Pattern | Notes
CurrencyPair | CurrencyPair | N/A | - | Text value, no conversion is necessary
RateDateTime | Time | timestamp | yyyy-mm-dd hh:mm:ss.fffffffff | Values are in JDBC Timestamp escape format
RateBid | Bid | number | #.# | Decimal number. Most DB engines handle this conversion transparently. In this example, it is used for demonstration purposes.
RateAsk | Ask | number | #.# | Same as RateBid

These rules map into the following connection definition:

    <connection id="csv_in" driver="csv" url="eurusd_in.csv">
       #Define formatting of columns in the input CSV file
       format.CurrencyPair.trim=true
       format.RateBid.type=number
       format.RateBid.pattern=#.#
       format.RateAsk.type=number
       format.RateAsk.pattern=#.#
       format.RateDateTime.type=timestamp
    </connection>

Formats for decimal/date/time/timestamp types are supported out of the box. A detailed list is available on the Text Driver JavaDoc page.
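To make the parsing rules more concrete, here is a rough plain-Java equivalent of the two conversions applied to the sample row. This is only a sketch under assumptions: it assumes the number type behaves like java.text.DecimalFormat and the timestamp type like java.sql.Timestamp.valueOf. The class and method names (ParseRulesSketch, parseNumber, parseTimestamp) are made up for illustration, and this is not Scriptella's actual implementation.

```java
import java.math.BigDecimal;
import java.sql.Timestamp;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

final class ParseRulesSketch {
    // Hypothetical helper: parses a decimal with a DecimalFormat pattern such as "#.#".
    static BigDecimal parseNumber(String text, String pattern) {
        DecimalFormat format =
                new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
        format.setParseBigDecimal(true); // keep all digits instead of going through double
        try {
            return (BigDecimal) format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    // Hypothetical helper: parses the JDBC timestamp escape format
    // yyyy-mm-dd hh:mm:ss.fffffffff (the fractional part is optional).
    static Timestamp parseTimestamp(String text) {
        return Timestamp.valueOf(text);
    }

    public static void main(String[] args) {
        // Values taken from the sample CSV row above.
        System.out.println(parseTimestamp("2012-02-05 17:00:34.427"));
        System.out.println(parseNumber("1.311800", "#.#"));
        System.out.println(parseNumber("1.311970", "#.#"));
    }
}
```

With the declarative format.* properties, none of this code has to be written by hand in the ETL file.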

The import logic is very simple: read the CSV input and insert rows into a table (all conversions are done automatically based on the parsing options above):

<query connection-id="csv_in">
    <script connection-id="db">
        INSERT INTO Rates(TIME, CurrencyPair, Bid, Ask) VALUES (?RateDateTime, ?CurrencyPair, ?RateBid, ?RateAsk);
    </script>
</query>

Example. Export data from a database table into a CSV file

Let’s reuse the data imported in the previous step and export it into another CSV file. This time the output format is slightly different: numbers are rounded to 4 digits after the decimal point, and timestamps are formatted using the dd.MM.yyyy HH:mm:ss pattern:

Database Column (Input) | CSV Column (Output) | Format type | Pattern | Notes
Time | RateDateTime | date | dd.MM.yyyy HH:mm:ss | Timestamps are formatted as date/time patterns with seconds precision
Bid | RateBid | number | ##.0000 | Decimal number. Decimal numbers are output with 4 digits after the decimal point (with rounding)
Ask | RateAsk | number | ##.0000 | Same as Bid

These formatting rules can be expressed in the following connection definition:

    <connection id="csv_out" driver="csv" url="eurusd_out.csv">
       quote=
       #Define formatting of columns in the output CSV file
       format.Time.type=date
       format.Time.pattern=dd.MM.yyyy HH:mm:ss
       format.Bid.type=number
       format.Bid.pattern=##.0000
       format.Ask.type=number
       format.Ask.pattern=##.0000
    </connection>
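For comparison, the same two output patterns can be tried out in plain Java. The sketch below assumes that the date type follows java.text.SimpleDateFormat conventions and the number type follows java.text.DecimalFormat conventions; the class and method names are hypothetical, and this is not the driver's own code.

```java
import java.sql.Timestamp;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

final class FormatRulesSketch {
    // Hypothetical helper: formats a decimal with a DecimalFormat pattern such as "##.0000".
    static String formatNumber(double value, String pattern) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US))
                .format(value);
    }

    // Hypothetical helper: formats a date/time with a SimpleDateFormat pattern.
    static String formatDate(Date value, String pattern) {
        return new SimpleDateFormat(pattern, Locale.US).format(value);
    }

    public static void main(String[] args) {
        Timestamp time = Timestamp.valueOf("2012-02-05 17:00:34.427");
        System.out.println(formatDate(time, "dd.MM.yyyy HH:mm:ss")); // 05.02.2012 17:00:34
        System.out.println(formatNumber(1.311800, "##.0000"));       // 1.3118
        System.out.println(formatNumber(1.311970, "##.0000"));       // 1.3120
    }
}
```

Note that the milliseconds are dropped because the pattern stops at seconds, and the ask price is rounded up at the fourth decimal digit.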

The complete source for these examples is available in the Scriptella currency example on GitHub.
The latest 1.1-SNAPSHOT can be downloaded from JavaForge.

Note for Maven users

Since 1.1 hasn’t been released yet, a snapshots repository has to be added to settings.xml or directly to the pom file:

        <repository>
            <id>scriptella-snapshots</id>
            <name>Scriptella Central Development Repository</name>
            <url>http://oss.sonatype.org/content/repositories/snapshots</url>
        </repository>

Use this snippet to add a dependency on scriptella-core or scriptella-drivers:

        <dependency>
            <groupId>com.javaforge.scriptella</groupId>
            <artifactId>scriptella-drivers</artifactId>
            <version>1.1-SNAPSHOT</version>
        </dependency>

Running ETL files from Maven

If you are using Maven and your project includes a dependency on scriptella-tools:

      <dependency>
        <groupId>com.javaforge.scriptella</groupId>
        <artifactId>scriptella-tools</artifactId>
        <version>1.0</version>
      </dependency>

You can use the following command to execute an ETL file:

mvn exec:java -Dexec.mainClass="scriptella.tools.launcher.EtlLauncher" -Dexec.args="/path/to/etl/file.xml"

New feature: global variables

Let’s start with a quick intro to the Scriptella dataflow, which is based on the concept of rows and columns (columns can be treated as variables). When a query is executed, it emits multiple rows, making them available to nested elements. As a consequence, a variable change is only visible to nested elements of the query. Here is an example that illustrates this:

<properties>
userCount=0  <!-- Setting an initial value for the variable -->
</properties>

<query connection-id="db">
    <!-- The query selects the number of records in the Users table; the variable userCount is set to the value of the COUNT(*) column.
The change is available ONLY to nested elements -->
    SELECT COUNT(*) as userCount from Users
    <script connection-id="log">
        Overridden value of userCount: $userCount
    </script>
</query>
<script connection-id="log">
     Out of scope/unmodified value of userCount: $userCount
</script>

At times it is more convenient to just set a global variable, so that its value can be consumed in other places of the ETL file. In Scriptella 1.0 this was possible only with the help of workarounds:

  • Use System.setProperty and System.getProperty to share a variable between scripts.
  • Another approach is similar to the technique used with anonymous inner classes in Java: modify a single-element array declared as a final variable. The following example illustrates it:
    <!-- The query defines a scoped context by declaring 
           a globalVar array available to nested elements.
           Since the globalVar is an array, changes to its elements are immediately available to all callers -->
    <query connection-id="jexl">
        //Array with only one element modifiable by nested scripts
        globalVarArray = [0];
        query.next();
        <query connection-id="db">
            SELECT COUNT(*) as userCount from Users
            <script connection-id="jexl">
                <!-- Store userCount in a global array --> 
                globalVarArray[0] = userCount;
            </script>
            <!--And now print the value of global variable we've just set -->
            <script connection-id="log">
                Inner script: globalVar=${globalVarArray[0]}
            </script>
        </query>
        <script connection-id="log">
           Outer script: globalVar=${globalVarArray[0]}
        </script>
    </query>

    <script connection-id="log">
        Out of scope: globalVar=${globalVarArray[0]}
    </script>

If you run the script, the following output is printed on the console (USER_COUNT stands for the actual number of rows in the Users table):

Inner script: globalVar=USER_COUNT
Outer script: globalVar=USER_COUNT
Out of scope: globalVar=0

As you can see, it is possible to declare global variables in Scriptella, but this requires an additional query element and the use of arrays. Scriptella 1.1 introduces support for etl.globals, a global map of variables available to all ETL elements. With it, the example above can be rewritten as follows:

    <query connection-id="db">
        SELECT COUNT(*) as userCount from Users
        <script connection-id="jexl">
            etl.globals['globalVar'] = userCount;
        </script>  
        <script connection-id="log">
            Inner script: etl.globals.globalVar=${etl.globals['globalVar']}
        </script>
    </query>
    <script connection-id="log">
        Outer script: etl.globals.globalVar=${etl.globals['globalVar']}
        globalVar=$globalVar (normal variable globalVar is not defined)
    </script>

The code is now less verbose. Additionally, the “out of scope” script was removed, since global variables have no scope. The line globalVar=$globalVar (normal variable globalVar is not defined) was added to demonstrate that global variables do not affect normal variables. Still, to avoid confusion, it is not recommended to give a global variable the same name as a normal variable.
And this time the output would be:
Inner script: etl.globals.globalVar=USER_COUNT
Outer script: etl.globals.globalVar=USER_COUNT
globalVar=$globalVar (normal variable globalVar is not defined)

Implementation note: As of now, the etl.globals map is not shared between scripts when called by the “scriptella” driver. Bug-12790 was logged to track this issue, and it will be resolved prior to the 1.1 release.
Update 1:
Thanks to Anji for pointing out that the example with the array initialization globalVarArray = [0] will not work in Scriptella 1.0 due to the lack of array instantiation support in JEXL 1.1. JavaScript can be used as an alternative, as explained in the FAQ entry, or you can use the Janino driver to achieve the same effect:

<etl>
	<connection id="janino" driver="janino"/>
	<connection id="log" driver="text"/>

	<!-- The query defines a scoped context by declaring
       a globalVar array available to nested elements.
       Since the globalVar is an array, changes to its elements are immediately available to all callers -->
	<query connection-id="janino">
		// Array with only one element modifiable by nested scripts
		set("globalVarArray", new int[1]);
		next();
		<script connection-id="janino">
			<!-- Store userCount in a global array -->
			((int[])get("globalVarArray"))[0] = 22;
		</script>
		<!--And now print the value of global variable we've just set -->
		<script connection-id="log">
				Inner script: globalVar=${globalVarArray[0]}
		</script>
	</query>
</etl>
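For readers who have not seen the single-element array technique before, here is how it looks in plain Java, outside of any ETL file. It is a minimal illustration only; the class and method names are hypothetical, and the value 22 simply mirrors the Janino example above.

```java
final class SingleElementArraySketch {
    static int setFromNestedScope() {
        // The reference is final and cannot be reassigned, but the element it holds can change.
        final int[] globalVarArray = {0};

        // The anonymous inner class plays the role of a nested script.
        Runnable nested = new Runnable() {
            @Override
            public void run() {
                globalVarArray[0] = 22; // the change is visible to the enclosing scope
            }
        };
        nested.run();

        return globalVarArray[0];
    }

    public static void main(String[] args) {
        System.out.println(setFromNestedScope()); // prints 22
    }
}
```

The array acts as a mutable holder shared by every scope that can see the variable, which is the effect the workaround relies on.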

Support for JDBC batching and query fetch size has been added

JDBC batching is a very important feature that allows sending multiple commands to the database in one call. Scriptella batching is controlled by the statement.batchSize parameter. Its value specifies the number of statements to combine in a batch before sending it to the database.

Please note that batching behavior depends on the type of statements processed: non-prepared SQL statements (statements without ? parameters) are processed in a single batch group, separately from parameterized prepared SQL statements. The grouping rules are the following:

  • Normal (non-prepared) statements are always grouped in a single batch per ETL script element.
  • Parameterized prepared statements use the SQL text as a group key, i.e. different statements go into different groups, which may sometimes lead to undesired behavior.

As a result, mixing parameterized (prepared) statements with normal statements in a single ETL element is NOT RECOMMENDED in batch mode, since the statements are processed in different batch groups and the results may be unexpected.

The following 2 examples show RECOMMENDED ways of batching.
Example 1. Extract->Load using PreparedStatement.setParameters/addBatch:

<connection id="out" ...>
   #enables batching and sets the batch size to 100
   statement.batchSize=100
</connection>
<query connection-id="in">
    SELECT * from Bug
    <script connection-id="out">
        INSERT INTO Bug VALUES (?ID, ?priority, ?summary, ?status);
    </script>
</query>

Example 2. Bulk load of non-prepared statements (without ?, but $ are allowed):

<script connection-id="out">
    INSERT INTO Bug VALUES (1, 'One');
    INSERT INTO Bug VALUES (2, 'Two');
    INSERT INTO Bug VALUES (3, '$Value');
    ....
</script>
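At the JDBC level, Example 1 corresponds roughly to the following pattern. This is a sketch of the general PreparedStatement batching idiom, not Scriptella's internal code: the class name, the method name and the returned flush counter are hypothetical, and the caller is assumed to supply a Connection obtained from a real JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInsertSketch {
    /**
     * Inserts all rows through one prepared statement, sending a batch to the
     * database every batchSize rows. Returns the number of executeBatch() calls.
     */
    static int insertInBatches(Connection out, String sql, List<Object[]> rows, int batchSize)
            throws SQLException {
        int flushes = 0;
        try (PreparedStatement ps = out.prepareStatement(sql)) {
            int pending = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch(); // one round trip for batchSize statements
                    flushes++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the remainder
                flushes++;
            }
        }
        return flushes;
    }
}
```

For instance, 250 rows with a batch size of 100 result in three calls to executeBatch() instead of 250 separate statement executions.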

Please pay attention to the following items when using batching:

  1. Queries are not supported in batch mode. Batches are typically intended for DB modifications; consider using a separate connection if you need to run queries.
  2. Batch buffers (pending SQL commands) are flushed immediately before the ETL script commits, rather than after the ETL element completes. You may run into this only when querying tables that are being updated, which is not recommended (see item 1).
  3. The optimal batch size varies between DBs and depends on the available JVM heap size, but in most cases it should not be less than 10.

Another important parameter, statement.fetchSize, has been added to give the JDBC driver a hint about the number of rows that should be fetched from the database when more rows are needed for the result set. Examples:

<connection url="jdbc:mysql://localhost:3306/db" ...>
   #MySQL-specific hack
   #set fetchSize to Integer.MIN_VALUE to avoid loading all data into memory
   statement.fetchSize = -2147483648
</connection>
<connection url="jdbc:oracle:thin:@localhost:1521:orcl" ...>
   #For performance reasons increase fetchSize
   statement.fetchSize = 1000
</connection>

New feature: dynamic include elements

It seems obvious and natural, but it was challenging for me to implement dynamic includes, a feature requested by one of the Scriptella users. Here is an example of a dynamic include:

<query connection-id="db">
    SELECT version FROM schema_info
    <script>
        <include href="${version}.sql"/>
    </script>
</query>

This feature is especially useful for building Rails-like migrations. Later, Mike or I will post a short how-to on simplified migrations in Scriptella ETL.

New feature: ?{textfile 'url/file_name'} to upload CLOBs

Yesterday I implemented a requested feature that allows inserting CLOB content from files. It works the same way as ?{file } does for BLOBs.

A simple example:

INSERT INTO Email (ID, Text) VALUES (1, ?{textfile 'email1.txt'})
