Support for JDBC batching and query fetch has been added

JDBC batching is an important feature that allows sending multiple commands to the database in a single call. In Scriptella, batching is controlled by the statement.batchSize parameter, whose value specifies the number of statements to combine into a batch before sending it to the database.

Please note that batching behavior depends on the type of statements being processed: non-prepared SQL statements (statements without ? parameters) are placed in a batch group separate from parameterized prepared SQL statements. The grouping rules are as follows:

  • Normal (non-prepared) statements are always grouped into a single batch per ETL script element.
  • Parameterized prepared statements use the SQL text as a group key, i.e. different statements go into different groups, which may sometimes introduce undesired behavior.

As a result, mixing parameterized (prepared) statements with normal statements in a single ETL element is NOT RECOMMENDED in batch mode, since the statements are processed in different batch groups and the results may be unexpected.
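
For illustration, here is a sketch of the anti-pattern (the AuditLog table is made up): the plain INSERT joins the single per-element batch group, while the parameterized INSERT joins a group keyed by its SQL text, so the two may reach the database in an unexpected order.

<script connection-id="out">
    -- Non-prepared statement: goes into the single batch group for this element
    INSERT INTO AuditLog VALUES (1, 'ETL started');
    -- Prepared statement (? parameters): grouped separately by its SQL text
    INSERT INTO Bug VALUES (?ID, ?priority, ?summary, ?status);
</script>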

The following two examples show the RECOMMENDED ways of batching.
Example 1. Extract->Load using PreparedStatement parameters and addBatch:

<connection id="out" ...>
   #enable batching and set the batch size to 100
   statement.batchSize=100
</connection>
<query connection-id="in">
    SELECT * FROM Bug
    <script connection-id="out">
        INSERT INTO Bug VALUES (?ID, ?priority, ?summary, ?status);
    </script>
</query>
Example 2. Bulk load of non-prepared statements (without ? parameters; $ substitutions are allowed):
<script connection-id="out">
    INSERT INTO Bug VALUES (1, 'One');
    INSERT INTO Bug VALUES (2, 'Two');
    INSERT INTO Bug VALUES (3, '$Value');
    ....
</script>

Please pay attention to the following items when using batching:

  1. Queries are not supported in batch mode. Batches are typically intended for DB modifications; consider using a separate connection if you need to run queries (see the sketch after this list).
  2. Batch buffers (pending SQL commands) are flushed immediately before the ETL script commits, rather than after each ETL element completes. You may only face a problem here when querying tables that are being updated, which is not recommended (see item 1).
  3. The optimal batch size varies with the database and the available JVM heap size, but in most cases it should not be less than 10.
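
For example, a minimal sketch of the separate-connection approach from item 1 (the connection ids and the lookup SQL are illustrative): queries run on a plain connection, while modifications go through the batched one.

<connection id="read" .../>
<connection id="write" ...>
   statement.batchSize=100
</connection>
<query connection-id="read">
    SELECT ID FROM Bug WHERE status='OPEN'
    <script connection-id="write">
        UPDATE Bug SET priority=1 WHERE ID=?ID;
    </script>
</query>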

Another important parameter, statement.fetchSize, has been added to give the JDBC driver a hint about the number of rows that should be fetched from the database when more rows are needed for a result set. Examples:

<connection url="jdbc:mysql://localhost:3306/db" ...>
   #MySQL-specific hack:
   #set fetchSize to Integer.MIN_VALUE to avoid loading all data into memory
   statement.fetchSize = -2147483648
</connection>
<connection url="jdbc:oracle:thin:@localhost:1521:orcl" ...>
   #For performance reasons, increase the fetchSize
   statement.fetchSize = 1000
</connection>

About Fyodor Kupolov
Scriptella ETL founder.

14 Responses to Support for JDBC batching and query fetch has been added

  1. Remo says:

    Nice one! I’m not in the office this month, but I’ll definitely try this to see how it impacts my scripts, thanks a lot!

  2. scriptella says:

    Thanks! I would like to hear from you. So far I have been able to test these changes on HSQLDB, MySQL and Postgres…

  3. Cristian says:

    Hi, great work with Scriptella, it is very useful.

    Any idea of when will 1.1 be released?

  4. Riccardo says:

    Hi,
    I tried to use Scriptella to dump some big Oracle tables to csv files.
    It works, but performance is poor (avg 400 rps); there are no network or disk issues.
    I tried to set the new statement.fetchSize parameter, but saw no improvement.
    I also tried to set statement.batchSize.
    Any suggestions?

    Thanks.

    statement.fetchSize=10000

    SELECT a,b FROM T

    $a,$b

    • Hi Riccardo,

      Thank you for your feedback. I need some additional information in order to understand what could be causing the performance issues:
      1) You mention 400 rps. What type of hardware are you using? Are you using any alternative third-party tools that provide better throughput?
      2) Are you using the 1.1 development version or the 1.0 release?
      3) Did you try Oracle-specific connection properties like defaultRowPrefetch and useFetchSizeWithLongColumn? See here: http://docs.oracle.com/cd/B14117_01/java.101/b10979/basic.htm#g1028323
      4) Also, please try increasing the JVM memory parameters, e.g. running java -Xmx512m -jar scriptella.jar

      • Riccardo says:

        Hi,

        1) I’ll give you a better measure: avg 70kb/s. The source Oracle table has 15 columns of mixed numbers, dates and varchars, 218k rows. Intel Core2, 4gb ram, running win xp pro. The Oracle server is 10g.
        I tested the same select over a SAS system connection on my machine (it relies on the Oracle 10g client); throughput is 1mb/s. (The output is an internal format, comparable to a fixed-length text file.)
        2) Scriptella 1.0
        3) No, I didn’t. Where do I put those properties in the XML?
        4) Tried -Xmx512m but nothing changed.

        Thank you very much for your attention.

  5. Thanks for the details. Since the batch improvements are only available in 1.1, I would suggest using the development snapshot available here: http://javaforge.com/displayDocument/scriptella-1.1-SNAPSHOT.zip?doc_id=90286

    You can put both JDBC-specific and Scriptella configuration properties inside the connection element declaration, e.g.:

        
    <connection id="c1" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:DB" user="sys as sysdba" password="password">
        statement.fetchSize = 1000
        defaultRowPrefetch = 1000
        useFetchSizeWithLongColumn = true
    </connection>
        
    

    In this case, using -Xmx512m might be helpful, since prefetched rows are stored in memory. I would also suggest using the latest thin driver, available here: http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html (assuming that you are using JDK6 for testing).

    Please let me know if this helps. I would also appreciate it if you posted here (or sent to scriptella [at] gmail.com) the actual version of the XML (plus the properties file, if any) without disclosing any business-specific details.

    Thanks,
    Fyodor

    • Riccardo says:

      I’ve downloaded the 1.1 version and the fetch directives work well.
      I found optimal behaviour with
      statement.fetchSize = 10000
      useFetchSizeWithLongColumn = true

      Now the throughput is about 1mb/s!
      This is for the csv output connection…. Thank you very much.

      Now I’ve got another issue when trying to switch to an Oracle output connection (similar to the input connection).
      The job is slow and I stopped it after about 5 minutes.

      I tried to boost the performance by setting statement.batchSize to some value (tried 1000, 10000).
      The job gets faster and terminates in about 4.5 minutes.
      The execution seems to be insensitive to changes from 1000 to 10000.

      It seems quite slow compared to the csv test (one could expect a performance of some tens of seconds at most).

      Is there any other trick I’m missing?

      As soon as I do some serious work, I’ll be pleased to post some XML.

      • Thank you for the numbers. They are really helpful.

        Regarding the output, in addition to statement.batchSize I would suggest using defaultBatchValue=1000 (an Oracle-specific property for setting the batch size).

        Always run with enough memory, e.g. -Xmx512m or more.

        Use ? instead of $ in SQL statements. This enables statement caching. Something like this:
        INSERT INTO MYTABLE(ID, VALUE) VALUES (?1, ?2);
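
        Putting these suggestions together, a minimal sketch (the connection id and table name are illustrative):

        <connection id="dbt" ...>
           statement.batchSize=1000
           #Oracle-specific batch size property
           defaultBatchValue=1000
        </connection>
        <script connection-id="dbt">
            INSERT INTO MYTABLE(ID, VALUE) VALUES (?1, ?2);
        </script>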

      • As a follow-up to my previous comment: I would also suggest using the autocommit.size connection property. It splits a large transaction into multiple smaller ones, which might be useful for large datasets:
        autocommit.size=1000
        Please be aware of the downside – the ETL is no longer transactional, so you must clean up the tables before running it again.
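
        For example, a sketch of where the property goes (the connection id is illustrative):

        <connection id="dbt" ...>
           autocommit.size=1000
           statement.batchSize=1000
        </connection>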

  6. Riccardo says:

    All my attempts have been unsuccessful.
    I’ve posted the XML below.
    I also tried a value of 1000 for the parameters in the target connection.
    The autocommit=true setting was added after the initial unsuccessful attempts.
    I also tried specifying the column list (c1, …) in the INSERT statement.

    The performance remains above 4 minutes.

    <connection id="db" driver="oracle" url="jdbc:oracle:thin:@host:port:dbname"
        user="*" password="*" classpath="../lib/ojdbc6.jar">
        statement.fetchSize = 10000
        useFetchSizeWithLongColumn = true
    </connection>

    <connection id="dbt" driver="oracle" url="jdbc:oracle:thin:@host:port:dbname"
        user="*" password="*" classpath="../lib/ojdbc6.jar">
        defaultBatchValue=10000
        autocommit=true
        autocommit.size=10000
        statement.batchSize=10000
    </connection>

    <query connection-id="db">
        SELECT c1, c2, c3, c4, c5, c6, c7, c8
        FROM sourcetable
        <script connection-id="dbt">
            INSERT INTO targettable
            VALUES (?1,?2,?3,?4,?5,?6,?7,?8);
        </script>
    </query>
    
  7. I’m afraid I cannot help you further improve the transfer performance. Please keep in mind that the data travels over the network twice, as opposed to a local transfer. It’s hard for me to estimate what good performance would be for this kind of deployment and network configuration.

    I can only suggest splitting the dataset and running the ETLs in parallel. You can limit your SELECT using rownum within a specified interval, e.g.:

    
    <query connection-id="db">
        SELECT * FROM (SELECT c1, c2, c3, c4, c5, c6, c7, c8, rownum rn FROM sourcetable) WHERE rn BETWEEN ${rownum.from} AND ${rownum.to}
        <script connection-id="dbt">
            INSERT INTO targettable VALUES (?1,?2,?3,?4,?5,?6,?7,?8);
        </script>
    </query>
    

    Split the dataset into N parallel jobs, let’s say 4, and run them separately, passing each a unique interval, e.g. 1-50000, 50001-100000, 100001-150000, etc.
    You can specify the values of rownum.from/rownum.to in a properties file or simply pass them on the command line:

    java -Drownum.from=1 -Drownum.to=50000 -Xmx256m -jar scriptella.jar
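
    Or equivalently via a properties file (assuming the default convention of a properties file named after the XML file, e.g. etl.properties next to etl.xml):

    rownum.from=1
    rownum.to=50000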
    
    • Riccardo says:

      It’s ok. Parallel jobs could be a good option. Thank you very much.
      I’m going to try to tune the Oracle driver with custom Java code. If I get anything good, I’ll let you know.
      Best regards,
      Riccardo

  8. Sunny says:

    Hi Fyodor,

    I am using Scriptella 1.0 with the fetchSize code merged from 1.1 (dev release) to read records from a Postgres DB and write them to a csv file. The performance is poor if we have more than 5 million records. Could you please post Postgres-specific database optimization properties to specify in the connection element? Thanks in advance.

    Regards
    Sunny
