Fixing Garbled Text When Syncing Oracle to Doris with SeaTunnel 2.3.9

When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters, especially if the Oracle database uses the ASCII character set. But don't panic: this article walks you through why this happens and how to fix it.
🧠 Root cause
The issue comes from how SeaTunnel reads data from Oracle. If Oracle uses a character set like ASCII and you sync to Doris (which expects valid UTF-8 or another compatible encoding), Chinese characters may not be decoded correctly.
The key is to intercept and re-encode the data when reading it from the Oracle ResultSet.
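To make the failure mode concrete, here is a small plain-Java illustration (not SeaTunnel code). It assumes the bytes stored in Oracle are actually GBK-encoded Chinese text, which is an assumption for this sketch; the class name GarbledTextDemo is made up. It shows that decoding those bytes as ASCII is lossy, while decoding the raw bytes with the right charset recovers the text. That is why the fix below reads bytes rather than an already-decoded String.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledTextDemo {

    // Decode raw column bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption for this sketch: the stored bytes are GBK-encoded Chinese text.
        Charset gbk = Charset.forName("GBK");
        byte[] raw = "\u4e2d\u6587".getBytes(gbk); // two Chinese characters, written as escapes

        // Decoding as ASCII cannot represent bytes above 0x7F, so information is lost.
        String asAscii = decode(raw, StandardCharsets.US_ASCII);
        // Decoding the same bytes with the real charset recovers the original text.
        String asGbk = decode(raw, gbk);

        System.out.println("ASCII decode contains U+FFFD: " + (asAscii.indexOf('\uFFFD') >= 0));
        System.out.println("GBK decode matches original: " + asGbk.equals("\u4e2d\u6587"));
    }
}
```

Once a String has been decoded with the wrong charset, the original bytes may no longer be recoverable, so the conversion has to happen at the point where the ResultSet is read.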
🔍 Understanding SeaTunnel's read flow
Let's look at the SeaTunnel internals that handle JDBC data ingestion:
1. JdbcSourceFactory
This class:
- Loads your source configuration.
- Constructs JdbcSourceConfig and JdbcDialect.
- Creates a JdbcSource instance.
2. JdbcSource
It:
- Initializes a SourceSplitEnumerator to divide the tasks.
- Creates a JdbcSourceReader to perform them.
3. JdbcSourceReader
Responsible for:
- Creating JdbcInputFormat.
- Repeatedly calling the pollNext() method to fetch data.
4. pollNext() method
This method:
- Calls open() in JdbcInputFormat to prepare the PreparedStatement and ResultSet.
- Then calls nextRecord() to process the ResultSet and convert it to a SeaTunnelRow.
5. nextRecord() and where the problem arises
In JdbcInputFormat:
- The nextRecord() method calls toInternal() in JdbcRowConverter.
- The default implementation uses JdbcFieldTypeUtils.getString(rs, resultSetIndex).
💥 Problem: If the ResultSet contains Chinese characters stored as ASCII, this method returns garbled text.
Solution approach
We need to intercept the source read and re-encode the data once it is obtained from the ResultSet.
Here's how to do it:
🛠 Implementation steps
Step 1: Add charset parameters
In JdbcInputFormat, add a field:
    // Connection properties from the job config, including sourceCharset if set
    private final Map<String, String> params;
And initialize it in the constructor:
    public JdbcInputFormat(JdbcSourceConfig config, Map<TablePath, CatalogTable> tables) {
        this.jdbcDialect = JdbcDialectLoader.load(config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
        this.chunkSplitter = ChunkSplitter.create(config);
        this.jdbcRowConverter = jdbcDialect.getRowConverter();
        this.tables = tables;
        this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
    }
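Step 2: Pass params to the row converter
In the nextRecord() method of JdbcInputFormat, update the toInternal() call to pass params as a third argument (the toInternal() signature in the row converter must be extended to accept it):
    SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);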
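Step 3: Add a charset conversion method
In AbstractJdbcRowConverter, define:
    // Requires: import java.io.UnsupportedEncodingException; import java.util.Arrays;
    // Decodes the raw column bytes using the given charset name.
    public static String convertCharset(byte[] value, String charSet) {
        if (value == null || value.length == 0) {
            return null;
        }
        // Logged per value, so this is verbose on large tables
        log.info("Value bytes: {}", Arrays.toString(value));
        try {
            return new String(value, charSet);
        } catch (UnsupportedEncodingException e) {
            // Unknown charset name in the config
            throw new RuntimeException(e);
        }
    }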
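As a possible variation on the helper above, the charset name can be resolved through java.nio.charset.Charset, which avoids the checked UnsupportedEncodingException. This is a standalone sketch, not the code shipped in connector-jdbc; the class name CharsetConversionSketch is hypothetical, and the logging call is left out so the example runs on its own.

```java
import java.nio.charset.Charset;

public class CharsetConversionSketch {

    // Same contract as the helper in Step 3: null or empty input yields null,
    // otherwise the bytes are decoded with the named charset.
    static String convertCharset(byte[] value, String charsetName) {
        if (value == null || value.length == 0) {
            return null;
        }
        // Charset.forName throws an unchecked IllegalArgumentException subtype
        // if the name is illegal or the charset is not available on this JVM.
        Charset charset = Charset.forName(charsetName);
        return new String(value, charset);
    }

    public static void main(String[] args) {
        byte[] raw = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        String decoded = convertCharset(raw, "GBK");
        System.out.println("round trip ok: " + decoded.equals("\u4e2d\u6587"));
        System.out.println("null input: " + convertCharset(null, "GBK"));
    }
}
```

A misspelled charset name in the config then fails fast with a clear exception instead of being wrapped in a RuntimeException.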
Step 4: Modify toInternal() for string types
In AbstractJdbcRowConverter, update the STRING type handling as follows (params is the map passed in from nextRecord() in Step 2):
    case STRING:
        if (params == null || params.isEmpty()) {
            // No extra properties configured: keep the default behavior
            fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
        } else {
            String sourceCharset = params.get("sourceCharset");
            if ("GBK".equalsIgnoreCase(sourceCharset)) {
                // Read the raw bytes and decode them with the configured charset
                fields[fieldIndex] = convertCharset(JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
            } else {
                // Any other (or missing) sourceCharset falls back to the default
                fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
            }
        }
        break;
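The branching above can be tried out without a database. The sketch below mirrors the same decision logic in a standalone method: driverDecoded stands in for what JdbcFieldTypeUtils.getString would return and raw for what JdbcFieldTypeUtils.getBytes would return. The class StringFieldSketch and the method readString are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Map;

public class StringFieldSketch {

    // Mirrors the STRING branch: re-decode the raw bytes only when
    // sourceCharset is GBK, otherwise keep the driver-decoded value.
    static String readString(Map<String, String> params, byte[] raw, String driverDecoded) {
        if (params == null || params.isEmpty()) {
            return driverDecoded;
        }
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            if (raw == null || raw.length == 0) {
                return null;
            }
            return new String(raw, Charset.forName(sourceCharset));
        }
        return driverDecoded;
    }

    public static void main(String[] args) {
        byte[] raw = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        String garbled = "????"; // placeholder for a wrongly decoded value

        Map<String, String> gbkParams = Collections.singletonMap("sourceCharset", "gbk");
        Map<String, String> otherParams = Collections.singletonMap("sourceCharset", "UTF-8");

        System.out.println("no params keeps default: " + readString(null, raw, garbled).equals(garbled));
        System.out.println("GBK re-decodes bytes: " + readString(gbkParams, raw, garbled).equals("\u4e2d\u6587"));
        System.out.println("other charset keeps default: " + readString(otherParams, raw, garbled).equals(garbled));
    }
}
```

Note that, as in the article's version, only GBK triggers the conversion; supporting other source charsets would mean relaxing that check.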
Step 5: Rebuild and deploy
After making the above changes:
- Rebuild the connector-jdbc module.
- Replace the existing connector-jdbc-2.3.9.jar under SeaTunnel's connectors directory.
- Restart the SeaTunnel cluster.
🧾 Configuration tips
- If your Oracle database has no encoding issues, you don't have to pass the sourceCharset parameter.
- If necessary, pass it like this in the JDBC connection properties of your config:
sourceCharset=GBK
- To debug using the logging from connector-jdbc, check the worker logs in the SeaTunnel logs directory.
Summary
By adding a simple charset conversion mechanism and tweaking the JDBC source implementation, you can eliminate garbled characters when syncing Oracle data to Doris with SeaTunnel.
No more broken characters: your data pipeline just got smarter. 🚀