
Cross-Attention is what you need! FusAtNet Fusion Network.

source link: https://mc.ai/cross-attention-is-what-you-need-fusatnet-fusion-network/

Today, with recent advances in sensing, multimodal data is becoming easily available for various applications, especially in remote sensing (RS), where many data types like multispectral (MSI), hyperspectral (HSI), LiDAR etc. are available.

Today, multimodal data is easily available!

Effective fusion of these multi-source datasets is becoming increasingly important, as multimodal features have been shown to generate highly accurate land-cover maps. However, fusion in the context of RS is non-trivial given the redundancy in the data and the large domain differences among modalities. In addition, the feature extraction modules for different modalities hardly interact with one another, which further limits their semantic relatedness.

Why is a single fused representation important?

Combining multimodal images offers several advantages, including:

  1. generating a rich, fused representation that helps select task-relevant features
  2. improved classification, increased confidence, and reduced ambiguity
  3. compensating for missing or noisy data
  4. reduced data size

Interestingly, most common methods today still rely on early concatenation, CNN-extracted feature-level concatenation, or multi-stream decision-level fusion, entirely overlooking cross-domain features. Visual attention, a recent addition to the deep-learning researcher's toolbox, remains largely unexplored in the multimodal domain.
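
To make the contrast concrete, here is a minimal PyTorch sketch of the three baseline fusion strategies mentioned above. The band count (144), patch size (11×11), class count (15), and layer sizes are placeholder assumptions, not the paper's settings, and the paper's reference implementation may use a different framework.

```python
import torch
import torch.nn as nn

hsi = torch.randn(8, 144, 11, 11)   # e.g. HSI patches: (batch, bands, H, W)
lidar = torch.randn(8, 1, 11, 11)   # e.g. LiDAR DSM patches: (batch, 1, H, W)

# 1) Early (input-level) concatenation: stack the modalities before any feature extraction.
early = torch.cat([hsi, lidar], dim=1)            # (8, 145, 11, 11)

# 2) Feature-level concatenation: extract CNN features per modality, then concatenate.
f_hsi = nn.Conv2d(144, 64, 3, padding=1)(hsi)     # modality-specific HSI features
f_lid = nn.Conv2d(1, 16, 3, padding=1)(lidar)     # modality-specific LiDAR features
feature_level = torch.cat([f_hsi, f_lid], dim=1)  # (8, 80, 11, 11)

# 3) Decision-level fusion: classify each stream separately, then combine the logits.
logits_hsi = nn.Linear(144 * 11 * 11, 15)(hsi.flatten(1))
logits_lid = nn.Linear(1 * 11 * 11, 15)(lidar.flatten(1))
decision_level = (logits_hsi + logits_lid) / 2
```

None of these baselines lets one modality modulate the features of the other, which is exactly the gap cross-attention targets.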

A question arises: how best to fuse these modalities into a joint, rich representation that can be used in downstream tasks?

Generic schematic of a multimodal fusion-based classification task. The objective is to effectively combine the two modalities (here, HSI and LiDAR) such that the resultant representation has rich, fused features that are relevant and robust enough for accurate classification.

An ideal fusion method would synergistically combine the two modalities and ensure that the resultant product reflects the salient features of input modalities.

A New Concept: Cross Attention

In this work, we propose the new concept of “cross-attention” and present attention-based HSI-LiDAR fusion in the context of land-cover classification.

Self-attention vs cross-attention for multimodal fusion. The self-attention module (left) works only on a single modality, where both the hidden representations and the attention mask are derived from the same modality (HSI). In the cross-attention module (right), the attention mask is derived from a different modality (LiDAR) and is harnessed to enhance the latent features from the first modality.

Cross-attention is a novel and intuitive fusion method in which attention masks from one modality (here, LiDAR) are used to highlight the extracted features of another modality (here, HSI). Note that this is different from self-attention, where the attention mask from HSI is used to highlight its own spectral features.
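
A minimal sketch of this idea, under assumed shapes and layer sizes (the module names `hsi_features` and `lidar_attention` are hypothetical stand-ins, not the paper's code): an attention mask derived from LiDAR is multiplied element-wise with latent HSI features, whereas self-attention would derive the mask from the HSI itself.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, hsi_bands=144, lidar_bands=1, feat_dim=64):
        super().__init__()
        # HSI feature extractor (rough stand-in for F_HS in the paper's notation)
        self.hsi_features = nn.Sequential(
            nn.Conv2d(hsi_bands, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # LiDAR-driven attention module (rough stand-in for A_T)
        self.lidar_attention = nn.Sequential(
            nn.Conv2d(lidar_bands, feat_dim, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, hsi, lidar):
        feats = self.hsi_features(hsi)      # latent HSI representation
        mask = self.lidar_attention(lidar)  # attention mask from the *other* modality
        return feats * mask                 # cross-attended HSI features

fused = CrossAttentionFusion()(torch.randn(2, 144, 11, 11), torch.randn(2, 1, 11, 11))
```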

FusAtNet: Using Cross Attention in practice

In this paper, we propose a feature fusion and extraction framework, FusAtNet, for collective land-cover classification of HSI and LiDAR data. The proposed framework utilizes the HSI modality to generate an attention map via a “self-attention” mechanism that highlights its own spectral features. Simultaneously, a “cross-attention” approach is used to harness a LiDAR-derived attention map that accentuates the spatial features of the HSI. These attentive spectral and spatial representations are then explored further, along with the original data, to obtain modality-specific feature embeddings. The modality-oriented joint spectro-spatial information thus obtained is subsequently utilized to carry out the land-cover classification task.

Schematic of FusAtNet (illustrated on the Houston dataset). First, the hyperspectral training samples XH are sent to the feature extractor FHS to obtain latent representations and to the spectral attention module AS to generate a spectral attention mask. Simultaneously, the corresponding LiDAR training samples XL are sent to the spatial attention module AT to obtain a spatial attention mask. The attention masks are individually multiplied with the latent HSI representations to get MS and MT. MS and MT are then concatenated with XH and XL and sent to the modality feature extractor FM and the modality attention module AM. The outputs of the two are then multiplied to get FSS, which is sent to the classification module C for pixel classification.
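
The sketch below follows the data flow in the schematic (FHS, AS, AT, FM, AM, C). Layer depths, channel counts, and the pooling-based classifier head are placeholder assumptions, not the published FusAtNet architecture; only the ordering of operations is taken from the caption above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, final):
    # Small two-layer convolutional block; the real network uses deeper blocks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), final,
    )

class FusAtNetSketch(nn.Module):
    def __init__(self, hsi_bands=144, lidar_bands=1, feat=64, n_classes=15):
        super().__init__()
        self.F_HS = conv_block(hsi_bands, feat, nn.ReLU())      # HSI feature extractor
        self.A_S = conv_block(hsi_bands, feat, nn.Sigmoid())    # spectral (self) attention
        self.A_T = conv_block(lidar_bands, feat, nn.Sigmoid())  # spatial (cross) attention from LiDAR
        fused_ch = 2 * feat + hsi_bands + lidar_bands           # channels of [M_S, M_T, X_H, X_L]
        self.F_M = conv_block(fused_ch, feat, nn.ReLU())        # modality feature extractor
        self.A_M = conv_block(fused_ch, feat, nn.Sigmoid())     # modality attention
        self.C = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                               nn.Linear(feat, n_classes))      # classifier (placeholder head)

    def forward(self, x_h, x_l):
        latent = self.F_HS(x_h)
        m_s = latent * self.A_S(x_h)               # spectrally attended features (M_S)
        m_t = latent * self.A_T(x_l)               # spatially attended features (M_T)
        fused = torch.cat([m_s, m_t, x_h, x_l], dim=1)
        f_ss = self.F_M(fused) * self.A_M(fused)   # joint spectro-spatial features (F_SS)
        return self.C(f_ss)                        # per-patch class logits

logits = FusAtNetSketch()(torch.randn(2, 144, 11, 11), torch.randn(2, 1, 11, 11))
```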

Results

Experimental evaluations on three HSI-LiDAR datasets show that the proposed method achieves state-of-the-art classification performance, including on the largest available HSI-LiDAR benchmark, Houston, opening new avenues in multimodal feature fusion for classification.

Classification results on the Houston dataset (figure).

