Fixing Garbled Text When Syncing Oracle to Doris with SeaTunnel 2.3.9

When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters, especially if the Oracle database uses the ASCII character set. Don't panic: this article walks you through why this happens and how to fix it.

🧠 Root Cause

The issue comes from how SeaTunnel reads data from Oracle. If Oracle uses a character set like ASCII and you sync to Doris (which expects valid UTF-8 or another compatible encoding), Chinese characters may not be read correctly.

The key is to intercept and re-encode the data when reading it from the Oracle ResultSet.
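To make the failure mode concrete, here is a minimal sketch of what happens when bytes written in one charset are decoded with another. It assumes the stored bytes are GBK and uses ISO-8859-1 as a stand-in for whatever single-byte charset the driver falls back to; the class and method names are hypothetical and not part of SeaTunnel.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Raw GBK bytes for the two characters U+4E2D U+6587.
    static final byte[] GBK_BYTES = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Decode with the wrong charset: every byte becomes one Latin-1 character.
    static String decodeWrong(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decode with the charset the bytes were actually written in.
    static String decodeRight(byte[] raw) {
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        System.out.println(decodeWrong(GBK_BYTES).length());                 // prints 4
        System.out.println(decodeRight(GBK_BYTES).length());                 // prints 2
        System.out.println(decodeRight(GBK_BYTES).equals("\u4e2d\u6587"));   // prints true
    }
}
```

The wrong decode yields four meaningless characters instead of two Chinese ones, which is the garbling seen downstream in Doris.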

🔍 Understanding SeaTunnel's Read Flow

Let's look at the SeaTunnel internals that handle JDBC data ingestion:

1. JdbcSourceFactory

This class:

  • Loads your source configuration.
  • Constructs JdbcSourceConfig and JdbcDialect.
  • Creates a JdbcSource instance.

2. JdbcSource

This class:

  • Initializes a SourceSplitEnumerator to divide the work into splits.
  • Creates a JdbcSourceReader to execute them.

3. JdbcSourceReader

Responsible for:

  • Constructing the JdbcInputFormat.
  • Repeatedly calling the pollNext() method to fetch data.

4. pollNext() Method

This method:

  • Calls open() on JdbcInputFormat to prepare the PreparedStatement and ResultSet.
  • Then calls nextRecord() to process the ResultSet and convert it to a SeaTunnelRow.

5. nextRecord() and the Garbled-Text Problem

In JdbcInputFormat:

  • The nextRecord() method calls toInternal() on JdbcRowConverter.
  • The default implementation uses JdbcFieldTypeUtils.getString(rs, resultSetIndex).

💥 Problem: If the ResultSet contains Chinese characters stored as ASCII, this method returns garbled text.
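The flow described in this section can be sketched in a few lines. The classes below are hypothetical, heavily simplified stand-ins (an in-memory list instead of a database, a plain function instead of JdbcRowConverter) and are not the real SeaTunnel classes; the point is only that the converter is the single place where raw column data becomes a string, which makes it the natural hook for re-encoding.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified stand-ins: these are NOT the real SeaTunnel classes.
class ReadFlowSketch {

    // Plays the role of JdbcInputFormat.
    static class InputFormat {
        private final List<byte[]> table;                 // stands in for the Oracle table
        private final Function<byte[], String> converter; // stands in for toInternal()
        private Iterator<byte[]> cursor;                  // stands in for the ResultSet

        InputFormat(List<byte[]> table, Function<byte[], String> converter) {
            this.table = table;
            this.converter = converter;
        }

        // Prepares the cursor, like open() preparing the statement and ResultSet.
        void open() {
            cursor = table.iterator();
        }

        // Converts the next raw record, or returns null when the cursor is exhausted.
        String nextRecord() {
            return cursor.hasNext() ? converter.apply(cursor.next()) : null;
        }
    }

    // Plays the role of the reader's polling loop: open once, then drain.
    static List<String> pollAll(InputFormat format) {
        List<String> rows = new ArrayList<>();
        format.open();
        for (String row = format.nextRecord(); row != null; row = format.nextRecord()) {
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        InputFormat format = new InputFormat(
                Collections.singletonList(gbk),
                raw -> new String(raw, Charset.forName("GBK")));
        System.out.println(pollAll(format).size()); // prints 1
    }
}
```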

Solution Approach

We need to know the source charset and re-encode the data once it is obtained from the ResultSet.

Here's how to do it:

🛠 Implementation Steps

Step 1: Add charset parameters

In JdbcInputFormat, add:

private final Map<String, String> params;

And in the constructor:

public JdbcInputFormat(JdbcSourceConfig config, Map<TablePath, CatalogTable> tables) {
    this.jdbcDialect = JdbcDialectLoader.load(config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
    this.chunkSplitter = ChunkSplitter.create(config);
    this.jdbcRowConverter = jdbcDialect.getRowConverter();
    this.tables = tables;
    this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
}

Step 2: Pass params to the row converter

In the nextRecord() method of JdbcInputFormat, update the call to:

SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);

Step 3: Add a charset conversion method

In AbstractJdbcRowConverter, define:

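// Requires imports: java.io.UnsupportedEncodingException, java.util.Arrays
public static String convertCharset(byte[] value, String charSet) {
    // Nothing to decode for NULL or empty columns
    if (value == null || value.length == 0) {
        return null;
    }
    // Debug level: this runs once per string field, so info would flood the logs
    log.debug("Value bytes: {}", Arrays.toString(value));
    try {
        // Decode the raw column bytes using the configured source charset
        return new String(value, charSet);
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException("Unsupported source charset: " + charSet, e);
    }
}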
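To try the conversion outside SeaTunnel, a standalone variant might look like the sketch below. It is not the article's exact method: it swaps in Charset.forName, which reports an unknown charset name with an unchecked exception, so no try/catch is needed. The class name is hypothetical.

```java
import java.nio.charset.Charset;

class CharsetConverter {
    // Standalone variant of convertCharset: null or empty input yields null,
    // matching the behavior of the method above.
    static String convertCharset(byte[] value, String charSet) {
        if (value == null || value.length == 0) {
            return null;
        }
        // Charset.forName throws an unchecked IllegalArgumentException subclass
        // when the charset name is unknown or malformed.
        return new String(value, Charset.forName(charSet));
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(convertCharset(gbk, "GBK").equals("\u4e2d\u6587")); // prints true
        System.out.println(convertCharset(null, "GBK"));                       // prints null
    }
}
```

Failing fast on a bad charset name is a design choice: a typo in the configuration surfaces on the first row rather than silently producing more garbled data.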

Step 4: Change toInternal() for string types

In AbstractJdbcRowConverter, update the STRING type handling as follows (toInternal() must also accept the params map passed in Step 2):

case STRING:
    if (params == null || params.isEmpty()) {
        // No extra properties configured: keep the default behavior
        fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
    } else {
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            // Read the raw bytes and decode them with the declared source charset
            fields[fieldIndex] = convertCharset(JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
        } else {
            // Any other (or missing) charset value falls back to the default
            fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
        }
    }
    break;
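The branch logic can be exercised in isolation with a small sketch. The helper below is hypothetical: rawBytes stands in for the result of getBytes() and driverDecoded for the result of getString(), so no database is needed.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Map;

class StringFieldSketch {
    // Hypothetical helper mirroring the STRING branch above.
    // rawBytes plays the role of getBytes(); driverDecoded plays getString().
    static String readString(byte[] rawBytes, String driverDecoded, Map<String, String> params) {
        if (params == null || params.isEmpty()) {
            return driverDecoded; // no properties configured: default path
        }
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            // Re-decode the raw bytes with the declared source charset
            return rawBytes == null ? null : new String(rawBytes, Charset.forName(sourceCharset));
        }
        return driverDecoded; // any other value: default path
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        Map<String, String> params = Collections.singletonMap("sourceCharset", "GBK");
        System.out.println(readString(gbk, "????", params).equals("\u4e2d\u6587")); // prints true
        System.out.println(readString(gbk, "????", null));                          // prints ????
    }
}
```

Note that only GBK triggers the re-decoding path; supporting other source charsets would mean widening that check.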

Step 5: Rebuild and deploy

After making the above changes:

  1. Rebuild the connector-jdbc module.
  2. Replace the existing connector-jdbc-2.3.9.jar under SeaTunnel's connectors directory.
  3. Restart the SeaTunnel cluster.

🧾 Configuration Tips

  • If your Oracle database has no encoding issues, you don't have to pass the sourceCharset parameter.
  • If necessary, pass it in the JDBC source's connection properties (the map read in Step 1), like this:
sourceCharset=GBK
  • To debug the logging from connector-jdbc, check the worker logs in SeaTunnel's logs directory.

Summary

By adding a simple charset conversion mechanism and tweaking the JDBC source implementation, you can eliminate garbled characters when syncing Oracle data to Doris with SeaTunnel.

No more broken characters: your data pipeline just got smarter. 🚀
