Fixing Garbled Text When Syncing Oracle to Doris with SeaTunnel 2.3.9

admin, 10 hours ago


When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters, especially if the Oracle database uses an ASCII character set. Don't panic: this article walks you through why this happens and how to fix it.


🧠 Root cause

The issue stems from how SeaTunnel reads data from Oracle. If Oracle uses a character set such as ASCII and you sync to Doris (which expects valid UTF-8 or another compatible encoding), Chinese characters may not be read correctly.

The key is to intercept the data as it is read from the Oracle ResultSet and re-encode it.
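
To make the failure mode concrete, here is a minimal standalone sketch (not SeaTunnel code). It assumes the stored bytes are GBK-encoded and uses ISO-8859-1 as a stand-in for whatever single-byte charset the driver applies; both choices are assumptions for illustration, and the class name `MojibakeDemo` is hypothetical.

```java
import java.nio.charset.Charset;

final class MojibakeDemo {
    // Decode raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two Chinese characters for "Chinese"
        // Assumption: the database holds these characters as GBK bytes.
        byte[] stored = original.getBytes(Charset.forName("GBK"));

        // Decoding with the wrong (single-byte) charset yields one junk char per byte.
        String wrong = decode(stored, "ISO-8859-1");
        // Decoding the same bytes with the real source charset restores the text.
        String right = decode(stored, "GBK");

        System.out.println("wrong charset: " + wrong);
        System.out.println("right charset: " + right);
        System.out.println("round trip ok: " + original.equals(right));
    }
}
```

The bytes themselves are intact in both cases; only the charset used to decode them differs, which is why working from the raw bytes is enough to repair the text.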

🔍 Understanding the SeaTunnel read flow

Let's look at the SeaTunnel internals that handle JDBC data ingestion:

1. `JdbcSourceFactory`

This class:

Loads your source configuration.
Constructs JdbcSourceConfig and JdbcDialect.
Creates a JdbcSource instance.

2. `JdbcSource`

This class:

Initializes a SourceSplitEnumerator to divide the work into splits.
Creates a JdbcSourceReader to process them.

3. `JdbcSourceReader`

Responsible for:

Creating a JdbcInputFormat.
Repeatedly calling the pollNext() method to fetch data.

4. `pollNext()` method

This method:

Calls open() on JdbcInputFormat to prepare the PreparedStatement and ResultSet.
Then calls nextRecord() to process the ResultSet and convert each row into a SeaTunnelRow.
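
The open-then-iterate pattern described above can be sketched with a toy model. Everything here is a hypothetical stand-in: `ToyInputFormat`, `reachedEnd()`, and this `pollNext` only mirror the roles described in this article and are not the real SeaTunnel classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ReadFlowSketch {
    /** Hypothetical stand-in for JdbcInputFormat. */
    static final class ToyInputFormat {
        private final List<String[]> table;
        private Iterator<String[]> cursor; // plays the role of the ResultSet

        ToyInputFormat(List<String[]> table) {
            this.table = table;
        }

        // Plays the role of open(): prepare the cursor over the rows.
        void open() {
            cursor = table.iterator();
        }

        boolean reachedEnd() {
            return cursor == null || !cursor.hasNext();
        }

        // Plays the role of nextRecord(): convert one raw row into the
        // internal row type. Field decoding would happen at this point.
        List<String> nextRecord() {
            return Arrays.asList(cursor.next());
        }
    }

    /** Hypothetical stand-in for the reader's pollNext(): drain one split. */
    static List<List<String>> pollNext(ToyInputFormat format) {
        List<List<String>> out = new ArrayList<>();
        format.open();
        while (!format.reachedEnd()) {
            out.add(format.nextRecord());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> table = new ArrayList<>();
        table.add(new String[] {"1", "alice"});
        table.add(new String[] {"2", "bob"});
        System.out.println(pollNext(new ToyInputFormat(table)));
    }
}
```

The point of the sketch is that every row passes through a single conversion step, so that step is the natural place to hook in charset handling.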

5. `nextRecord()` and where the problem arises

In JdbcInputFormat:

The nextRecord() method calls toInternal() in JdbcRowConverter.
The default implementation uses JdbcFieldTypeUtils.getString(rs, resultSetIndex).

💥 Problem: If the ResultSet contains Chinese characters stored as ASCII, this method returns garbled text.

Solution approach

We need to pass in the source character set and re-encode the data once it has been read from the ResultSet.

Here's how to do it:

🛠 Implementation steps

Step 1: Add charset parameters

In JdbcInputFormat, add:

private final Map<String, String> params;

In the constructor:

public JdbcInputFormat(JdbcSourceConfig config, Map<TablePath, CatalogTable> tables) {
    this.jdbcDialect = JdbcDialectLoader.load(config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
    this.chunkSplitter = ChunkSplitter.create(config);
    this.jdbcRowConverter = jdbcDialect.getRowConverter();
    this.tables = tables;
    this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
}

Step 2: Pass `params` to the row converter

In the nextRecord() method of JdbcInputFormat, update the call to:

SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);

Step 3: Add a charset conversion method

In AbstractJdbcRowConverter, define:

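// Decode raw column bytes using the source charset name (e.g. "GBK").
// Needs java.io.UnsupportedEncodingException and java.util.Arrays imported.
public static String convertCharset(byte[] value, String charSet) {
    if (value == null || value.length == 0) {
        return null;
    }
    // Logs every value at INFO: handy while debugging, noisy on large tables.
    log.info("Value bytes: {}", Arrays.toString(value));
    try {
        return new String(value, charSet);
    } catch (UnsupportedEncodingException e) {
        // The configured charset name is not supported by this JVM.
        throw new RuntimeException(e);
    }
}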
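
If you want to try the conversion outside SeaTunnel, the following standalone sketch may help. It is a variant, not the method above: it swaps in `Charset.forName`, which reports an unknown charset through an unchecked exception, and it drops the logging. The class name `CharsetConverter` is hypothetical.

```java
import java.nio.charset.Charset;

final class CharsetConverter {
    // Variant of convertCharset: same null/empty handling, but the charset is
    // resolved with Charset.forName, so no checked exception has to be wrapped.
    static String convertCharset(byte[] value, String charSet) {
        if (value == null || value.length == 0) {
            return null;
        }
        return new String(value, Charset.forName(charSet));
    }

    public static void main(String[] args) {
        // Assumption for the demo: the column bytes are GBK-encoded.
        byte[] raw = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        System.out.println(convertCharset(raw, "GBK"));
        System.out.println(convertCharset(new byte[0], "GBK")); // empty input maps to null
    }
}
```

An unsupported charset name surfaces here as an `IllegalArgumentException` subtype instead of a wrapped `UnsupportedEncodingException`; which style fits better is a matter of taste.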

Step 4: Change `toInternal()` for string types

In AbstractJdbcRowConverter, give toInternal() the extra params argument passed in Step 2, then update the STRING type handling as follows:

case STRING:
    if (params == null || params.isEmpty()) {
        fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
    } else {
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            fields[fieldIndex] = convertCharset(JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
        } else {
            fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
        }
    }
    break;
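
The branching in that STRING case can be checked in isolation with a small sketch. It is a toy under stated assumptions: `readString` is a hypothetical helper, `driverString` stands in for the result of JdbcFieldTypeUtils.getString, and `rawBytes` stands in for the result of JdbcFieldTypeUtils.getBytes.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Map;

final class StringFieldSketch {
    // Mirrors the STRING branch: re-decode only when sourceCharset is GBK,
    // otherwise keep the string the driver already produced.
    static String readString(String driverString, byte[] rawBytes, Map<String, String> params) {
        if (params == null || params.isEmpty()) {
            return driverString;
        }
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            if (rawBytes == null || rawBytes.length == 0) {
                return null;
            }
            return new String(rawBytes, Charset.forName(sourceCharset));
        }
        return driverString;
    }

    public static void main(String[] args) {
        byte[] raw = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        Map<String, String> params = Collections.singletonMap("sourceCharset", "GBK");
        System.out.println(readString("garbled", raw, params)); // re-decoded from bytes
        System.out.println(readString("plain", raw, null));     // no params: driver string kept
    }
}
```

Note that, as written in this article, any sourceCharset value other than GBK falls through to the default getString path, so supporting another charset means widening that condition.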

Step 5: Rebuild and deploy

After making the above changes:

Rebuild the connector-jdbc module.
Replace the existing connector-jdbc-2.3.9.jar under SeaTunnel's connectors directory.
Restart the SeaTunnel cluster.

🧾 Configuration tips

If your Oracle database has no encoding issues, you don't need to pass the sourceCharset property.
If you do need it, pass it in your config like this:

sourceCharset=GBK

To debug using the log output from connector-jdbc, check the worker logs in the SeaTunnel logs directory.

Summary

By adding a simple charset conversion mechanism and tweaking the implementation of the JDBC source, you can eliminate garbled characters when syncing Oracle data to Doris with SeaTunnel.

No more broken characters: your data pipeline just got smarter. 🚀
