New feature: Allow different connection-id in onerror element →

New feature: Flexible formatting and parsing rules in CSV and text files

August 14, 2012 2 Comments

Why it is necessary?

Previously, fields in CSV files were parsed as text and necessary conversions had to be done manually before sending data to the output datasource. And vice versa, data has to be formatted before writing into a CSV/text file.

Now it’s possible to replace most of the boiler-plate code with a simple declarative approach. Basically formatting and parsing rules are defined as properties of a connection element.

Example. Import CSV data into a database

Let’s assume there is an input file with an exchange rate data in a CSV format:

CurrencyPair,RateDateTime,RateBid,RateAsk
EUR/USD,2012-02-05 17:00:34.427,1.311800,1.311970

and our goal is to import it data into the following table:

The following table describes parsing rules used when importing CSV data into a database:

CSV Column (Input)	Database Column (Output)	Parse type pattern	Notes
`CurrencyPair`	`CurrencyPair`	N/A	Text value, no conversion is necessary
`RateDateTime`	`Time`	`timestamp yyyy-mm-dd hh:mm:ss.fffffffff`	Values are in JDBC Timestamp escape format
RateBid	Bid	`number #.#`	Decimal number. Most DB engines handle this conversion transparently. In this example, it is used for demonstration purposes.
`RateAsk`	`Ask`	`number #.#`

These rules map into the following connection definition:

    <connection id="csv_in" driver="csv" url="eurusd_in.csv">
       #Define formatting of columns in the input CSV file
       format.CurrencyPair.trim=true
       format.RateBid.type=number
       format.RateBid.pattern=#.#
       format.RateAsk.type=number
       format.RateAsk.pattern=#.#
       format.RateDateTime.type=timestamp
    </connection>

Formats for decimal/date/time/timestamp types are supported out of the box. Detailed list is available on Text Driver JavaDoc page

The import logic is very simple – read CSV input and insert rows into a table (all conversion will be done automatically based on the parsing options above):

<query connection-id="csv_in">
    <script connection-id="db">
        INSERT INTO Rates(TIME, CurrencyPair, Bid, Ask) VALUES (?RateDateTime, ?CurrencyPair, ?RateBid, ?RateAsk);
    </script>
</query>

Example. Export data from a database table into a CSV file

Lets reuse data imported in the previous step. Now it will be exported into another CSV file. This time the format of the output file will be slightly changed – numbers will be rounded to 4 digits after decimal point and the time part will be formatted using a dd.MM.yyyy notation:

Database Column (Input)	CSV Column (Output)	Format type, pattern	Notes
`Time`	`RateDateTime`	`date dd.MM.yyyy HH:mm:ss`	Timestamps are formatted as date/time patterns with seconds precision
Bid	RateBid	`number ##.0000`	Decimal number. Decimal numbers are outputted with 4 digits after decimal point(with rounding)
`Ask`	`RateAsk`	`number ##.0000`

These formatting rules can be expressed in the following connection definition:

    <connection id="csv_out" driver="csv" url="eurusd_out.csv">
       quote=
       #Define formatting of columns in the output CSV file
       format.Time.type=date
       format.Time.pattern=dd.MM.yyyy HH:mm:ss
       format.Bid.type=number
       format.Bid.pattern=##.0000
       format.Ask.type=number
       format.Ask.pattern=##.0000
    </connection>

The complete source for these examples is available on Scriptella currency example at GitHub
The latest 1.1-SNAPSHOT can be downloaded from JavaForge.

Note for Maven users

Since 1.1 hasn’t been released yet, a snapshots repository has to be added to settings.xml or directly to the pom file:

        <repository>
            <id>scriptella-snapshots</id>
            <name>Scriptella Central Development Repository</name>
            <url>http://oss.sonatype.org/content/repositories/snapshots</url>
        </repository>

Use this snippet to add a dependency on scriptella-core or scriptella-drivers:

        <dependency>
            <groupId>com.javaforge.scriptella</groupId>
            <artifactId>scriptella-drivers</artifactId>
            <version>1.1-SNAPSHOT</version>
        </dependency>

Filed under Uncategorized

About Fyodor Kupolov
Scriptella ETL founder.

2 Responses to New feature: Flexible formatting and parsing rules in CSV and text files

Dhanush Gopinath says:

November 6, 2012 at 12:02 pm

Does it only take dd.MM.yyyy HH:mm:ss as the pattern.

I am trying the same to output to csv from my db. I keep the following format

null_string=NULL
quote=
format.date.pattern=yyyy-MM-dd

but in the CSV the date comes like this 2012-11-06 00:00:00.0

Reply
Fyodor Kupolov says:

November 8, 2012 at 5:58 pm

It looks like in your case the data is in JDBC timestamp format. In that case I’d suggest to use format.date.type=timestamp instead of date. Specifying a pattern is not necessary as it is always yyyy-mm-dd hh:mm:ss.fffffffff, where ffffffffff indicates nanoseconds.

Reply

Scriptella ETL Blog

New feature: Flexible formatting and parsing rules in CSV and text files

Why it is necessary?

Example. Import CSV data into a database

Example. Export data from a database table into a CSV file

Note for Maven users

2 Responses to New feature: Flexible formatting and parsing rules in CSV and text files

Leave a comment Cancel reply

Recent Posts

Archives

Scriptella on Facebook

Scriptella ETL Blog

New feature: Flexible formatting and parsing rules in CSV and text files

Why it is necessary?

Example. Import CSV data into a database

Example. Export data from a database table into a CSV file

Note for Maven users

Share this:

Related

2 Responses to New feature: Flexible formatting and parsing rules in CSV and text files

Leave a comment Cancel reply

Recent Posts

Archives

Scriptella on Facebook