Efficient Transformers for Astronomical Images: Deconvolution and Denoising Unleashed

Authors:
(1) Hyosun Park, Department of Astronomy, Yonsei University, Seoul, Republic of Korea;
(2) Yongsik Jo, Graduate School of Artificial Intelligence, UNIST, Ulsan, Republic of Korea;
(3) Seokun Kang, Graduate School of Artificial Intelligence, UNIST, Ulsan, Republic of Korea;
(4) Taehwan Kim, Graduate School of Artificial Intelligence, UNIST, Ulsan, Republic of Korea;
(5) M. James Jee, Department of Astronomy, Yonsei University, Seoul, Republic of Korea and Department of Physics and Astronomy, University of California, Davis, CA, United States.
Table of Links
Abstract and 1 Introduction
2 Method
2.1. Overview and 2.2. Encoder-Decoder Architecture
2.3. Transformer for Image Restoration
2.4. Implementation Details
3 Data and 3.1. HST Dataset
3.2. GalSim Dataset
3.3. JWST Dataset
4 Results of the JWST Test Dataset and 4.1. PSNR and SSIM
4.2. Visual Inspection
4.3. Restoration of Morphological Parameters
4.4. Restoration of Photometric Parameters
5 Application to Real HST Images and 5.1. Restoration of Single-epoch Images and Comparison with Multi-epoch Images
5.2. Restoration of HST Multi-epoch Images and Comparison with Multi-epoch JWST Images
6 Limitations
6.1. Degradation of Restoration Quality Due to High Noise Level
6.2. Point Source Recovery Test
6.3. Artifacts Due to Pixel Correlation
7 Conclusions and Acknowledgments
Appendix: A. Image Restoration Test with Noise-only Images
References
2. Method
2.1. Overview
Throughout this paper, we use the term restoration to refer to the process that simultaneously improves resolution and reduces noise. Our objective is to restore HST-quality images to JWST quality based on Restormer (Zamir et al. 2022), an efficient implementation of the Transformer architecture (Vaswani et al. 2017). We first briefly review the encoder-decoder architecture in §2.2. §2.3 describes the Transformer architecture, including the implementation of Zamir et al. (2022). The implementation details are presented in §2.4.
2.2. Encoder-Decoder Architecture
The encoder-decoder architecture allows neural networks to learn to map input data to output data in a structured and hierarchical manner. The encoder captures the characteristic features of the input data and encodes them into a compressed representation, while the decoder reconstructs or generates the desired output from this encoded representation. This architecture has been widely used in various applications, including image-to-image translation, image segmentation, language translation, and more.
U-Net (Ronneberger et al. 2015) is a classic example of a CNN-based encoder-decoder architecture. It consists of a contracting path, which serves as the encoder, an expanding path, which acts as the decoder, and skip connections that link the corresponding layers of the contracting and expanding paths. In the convolutional layers of the contracting path, the spatial dimension is reduced while the number of channels is increased to capture the essential features of the image. The expanding path uses only this low-dimensional encoded information, reducing the number of channels and increasing the spatial dimensions, with the aim of restoring full-size images. To mitigate the loss of information along the contracting path, the skip connections concatenate the features obtained at each layer of the encoding step with the corresponding layer of the decoding step.
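To make this structure concrete, here is a minimal PyTorch sketch of a U-Net-like encoder-decoder with a single downsampling step and one skip connection; the layer widths, depth, and kernel sizes are illustrative placeholders rather than the configuration used in this work.

```python
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Minimal U-Net-style encoder-decoder with one skip connection (illustrative only)."""

    def __init__(self, in_ch=1, base_ch=16):
        super().__init__()
        # Contracting path: halve the spatial size while growing the channel count.
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)
        # Expanding path: restore the spatial size while shrinking the channel count.
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        # The skip connection doubles the channels entering the final convolution.
        self.dec = nn.Conv2d(base_ch * 2, in_ch, 3, padding=1)

    def forward(self, x):
        f1 = self.enc(x)               # encoder features at full resolution
        f2 = self.down(f1)             # compressed representation
        u = self.up(f2)                # upsample back to the input resolution
        u = torch.cat([u, f1], dim=1)  # skip connection: concatenate encoder features
        return self.dec(u)             # reconstruct the output image


x = torch.randn(1, 1, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 1, 64, 64])
```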
2.3. Transformer for Image Restoration
In the Transformer, the encoder consists of several layers of self-attention mechanisms followed by position-wise feed-forward neural networks. "Attention" refers to a mechanism that allows models to focus on specific parts of the input data during processing. It lets the model selectively weight different parts of the input, assigning more importance to relevant information and ignoring irrelevant or less important parts. The key idea behind attention is to dynamically compute weights for different parts of the input data, such as words in a sentence or pixels in an image, according to their relevance to the current task. In self-attention, each element (e.g., word or pixel) in the input sequence is compared with all the other elements to compute attention weights, which represent the importance of each element relative to the others. These attention weights are then used to compute a weighted sum of the input elements, resulting in an attention-based representation that highlights the relevant information.
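The weighted-sum computation described above can be expressed in a few lines. The following sketch implements generic scaled dot-product self-attention over a sequence (as in Vaswani et al. 2017); the sequence length, embedding size, and random projection matrices are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F


def self_attention(x, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention over a sequence of N elements.

    x: (N, d) input sequence (e.g., word or pixel embeddings).
    w_q, w_k, w_v: (d, d) projection matrices for queries, keys, and values.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention weights: every element is compared with every other element.
    scores = q @ k.T / (q.shape[-1] ** 0.5)  # (N, N)
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    # Weighted sum of the values, emphasizing the most relevant elements.
    return weights @ v                       # (N, d)


N, d = 8, 16
x = torch.randn(N, d)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([8, 16])
```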
The Transformer decoder also consists of several layers of self-attention mechanisms, along with additional attention mechanisms over the encoder output. The decoder predicts one element of the output sequence at a time, conditioned on the previously generated elements and the encoded representation of the input sequence.
The Transformer architecture was originally proposed and applied to the task of machine translation, which involves translating text from one language to another. The success of the Transformer in machine translation demonstrated its effectiveness in capturing long-range dependencies in sequences and in handling sequential data more effectively than traditional architectures. This breakthrough sparked widespread interest in the Transformer architecture, leading to its adoption and adaptation for various image processing tasks. Transformers show promising results in tasks such as image classification, object detection, semantic segmentation, and image generation, which were traditionally dominated by CNNs. Transformer models capture long-range pixel correlations more effectively than CNN-based models.
However, applying the Transformer to large images becomes difficult with its original implementation, which applies self-attention layers to pixels, because the computational complexity grows quadratically with the number of pixels. Zamir et al. (2022) overcame this obstacle by substituting the original self-attention block with the MDTA block, which implements self-attention in the feature domain and makes the complexity grow only linearly with the number of pixels. We propose to use the efficient Transformer of Zamir et al. (2022), Restormer, to apply deconvolution and denoising to astronomical images. We briefly describe the two core components of Restormer in §2.3.1 and §2.3.2. Readers are referred to Zamir et al. (2022) for more technical details.
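To make the scaling argument concrete, the snippet below counts the entries of the attention map for pixel-wise (spatial) self-attention versus channel-wise attention on a hypothetical image; the image size and channel count are illustrative and not the dimensions used by Restormer.

```python
# Pixel-wise self-attention builds an (HW x HW) attention map, so its cost grows
# quadratically with the number of pixels; channel-wise attention builds a (C x C)
# map whose size is independent of the image size (illustrative numbers only).
H, W, C = 256, 256, 48
pixels = H * W
print(f"spatial attention map : {pixels**2:,} entries")  # 4,294,967,296
print(f"channel attention map : {C**2:,} entries")       # 2,304
```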
2.3.1. MDTA block
MDTA, short for Multi-Dconv Head Transposed Attention, is a crucial module in Restormer. By performing self-attention across the channel dimension, MDTA computes the interactions between the channels of the input feature map, generating the attention map from query-key interactions across channels. Through this process, MDTA effectively models the inter-channel interactions of the input feature map, facilitating the learning of the global context necessary for image restoration tasks.
MDTA also uses depth-wise convolution to emphasize the local context. This allows MDTA to highlight the local context of the input image, ultimately enabling the modeling of both global and local contexts.
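The following PyTorch sketch approximates the MDTA design described above: a 1×1 convolution followed by a 3×3 depth-wise convolution produces queries, keys, and values, and the softmax attention map is computed across channels rather than pixels, so its size is independent of the image resolution. This is a simplified reimplementation for illustration, not the authors' code; the head count, feature width, and learnable temperature are assumptions modeled on Zamir et al. (2022).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransposedAttention(nn.Module):
    """MDTA-style channel-wise ("transposed") self-attention (simplified sketch)."""

    def __init__(self, dim=48, heads=4):
        super().__init__()
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))
        # Point-wise then depth-wise convolution builds Q, K, V with local context.
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3, padding=1, groups=dim * 3)
        self.project_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # Reshape to (batch, heads, channels_per_head, pixels).
        shape = (b, self.heads, c // self.heads, h * w)
        q, k, v = q.reshape(shape), k.reshape(shape), v.reshape(shape)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        # Channel-by-channel attention map: size (c/heads x c/heads), not (hw x hw).
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project_out(out)


x = torch.randn(1, 48, 64, 64)
print(TransposedAttention()(x).shape)  # torch.Size([1, 48, 64, 64])
```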
2.3.2. GDFN block
GDFN, short for Gated-Dconv Feed-Forward Network, is another crucial module in Restormer. Using a gating mechanism to improve the feed-forward network, GDFN provides an improved information flow, which leads to high-quality results for image restoration tasks.
GDFN controls the flow of information through gating layers, formed by the element-wise multiplication of two parallel linear projection layers, one of which is activated by the Gaussian Error Linear Unit (GELU) non-linearity. This allows GDFN to suppress less informative features and to propagate only valuable information through the hierarchy. Like the MDTA module, GDFN uses depth-wise convolution to mix local content. Through this, GDFN emphasizes the local context of the input image, providing a more robust information flow for improved results in image restoration tasks.
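Below is a minimal sketch of a GDFN-style block under the description above: a 1×1 convolution and a 3×3 depth-wise convolution produce two parallel branches, one branch is passed through GELU, and the two are multiplied element-wise to gate the information flow. The channel expansion factor is an illustrative assumption, and the code is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedDconvFFN(nn.Module):
    """GDFN-style gated depth-wise-conv feed-forward block (simplified sketch)."""

    def __init__(self, dim=48, expansion=2):
        super().__init__()
        hidden = dim * expansion
        # Expand into two parallel branches, then mix local context with a depth-wise conv.
        self.project_in = nn.Conv2d(dim, hidden * 2, kernel_size=1)
        self.dwconv = nn.Conv2d(hidden * 2, hidden * 2, kernel_size=3, padding=1,
                                groups=hidden * 2)
        self.project_out = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        x1, x2 = self.dwconv(self.project_in(x)).chunk(2, dim=1)
        # Gating: the GELU-activated branch suppresses less informative features
        # in the other branch via element-wise multiplication.
        return self.project_out(F.gelu(x1) * x2)


x = torch.randn(1, 48, 64, 64)
print(GatedDconvFFN()(x).shape)  # torch.Size([1, 48, 64, 64])
```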
2.4. Implementation Details
We use a transfer learning approach, in which a model trained on one dataset is reused for another related dataset. First, we train on the pre-training dataset (simplified galaxy images) for 150,000 iterations, followed by 150,000 additional iterations on the fine-tuning dataset (realistic galaxy images). The batch size remains fixed at 64. Our inference model is publicly available[1].
[1]
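The two-stage schedule can be summarized by the training skeleton below. The model, data loaders, optimizer, and loss function are hypothetical placeholders, since the text specifies only the iteration counts (150,000 + 150,000) and the batch size of 64.

```python
import torch


def train_stage(model, loader, optimizer, loss_fn, num_iters):
    """Run one stage of the schedule for a fixed number of iterations."""
    done = 0
    while done < num_iters:
        for degraded, target in loader:  # batches of (input, ground-truth) image pairs
            optimizer.zero_grad()
            loss = loss_fn(model(degraded), target)
            loss.backward()
            optimizer.step()
            done += 1
            if done >= num_iters:
                break


# Hypothetical usage: `restorer`, `pretrain_loader`, and `finetune_loader` are placeholders,
# with both loaders assumed to yield batches of 64 image pairs; the loss is also an assumption.
# optimizer = torch.optim.Adam(restorer.parameters())
# train_stage(restorer, pretrain_loader, optimizer, torch.nn.L1Loss(), 150_000)  # simplified galaxies
# train_stage(restorer, finetune_loader, optimizer, torch.nn.L1Loss(), 150_000)  # realistic galaxies
```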