Abstract
Background
Objective
The main objective of this paper is to show that training on a single type of brain activity limits the generalization ability of EEG-MLLMs. To address this limitation, we propose WaveMind, a multimodal EEG alignment framework for EEG-MLLMs.
Methodology
Architecture
WaveMind decodes raw EEG signals $X_e \in \mathbb{R}^{T \times C}$ into natural language tokens $W = \{w_1, \dots, w_N\}$ via an LLM backbone, as shown in Figure 1 below. Since many works have shown that injecting category information benefits model generation, WaveMind incorporates a Retrieval-Augmented Generation (RAG) module that stores the features of the multimodal supervision (i.e., $\hat{Z}^I$ and $\hat{Z}^T$) together with their categories.
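The following is a minimal sketch of how such a retrieval step could look, assuming the stored supervision features and the EEG embedding share the same CLIP space; the function and variable names, and the cosine-similarity top-k retrieval, are illustrative rather than WaveMind's released implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_category(z_eeg, memory_feats, memory_labels, top_k=1):
    """Retrieve the categories of the stored multimodal features
    (image/text CLIP embeddings) closest to the EEG embedding.

    z_eeg:         (d,)   EEG feature projected into CLIP space
    memory_feats:  (N, d) stored supervision features (Z^I / Z^T)
    memory_labels: list of N category strings
    """
    z = F.normalize(z_eeg, dim=-1)
    mem = F.normalize(memory_feats, dim=-1)
    sims = mem @ z                        # cosine similarity to every stored feature
    idx = sims.topk(top_k).indices
    return [memory_labels[i] for i in idx.tolist()]

# The retrieved category names can then be inserted into the LLM prompt,
# e.g. "Candidate categories: ...", before generating the answer.
```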
Figure 1: The overall architecture of WaveMind. Left: three-stage training procedure. Right: inference procedure of WaveMind. The system projects EEG data into a unified semantic space and integrates retrieval-augmented generation (RAG) for robust language generation.
Training Paradigm
Stage 1: Dual-Supervision Representation Alignment
We align EEG features into a unified space using a dual-supervised CLIP framework:
• CLIP-ViT extracts image-guided features: $\mathbf{Z}_I \in \mathbb{R}^{768}$
• CLIP-BERT produces semantic features: $\mathbf{Z}_T \in \mathbb{R}^{768}$

After L2 normalization, both are mapped into the same CLIP space.
The objective function combines two InfoNCE losses: \(\mathcal{L} = \lambda \mathcal{L}_{\text{img}} + (1 - \lambda)\mathcal{L}_{\text{txt}}\). We train on 1.2M EEG pairs, outperforming 7 baseline encoders. Adding an auxiliary classification loss \(\mathcal{L}_{\text{cls}}\) showed no performance gain.
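A minimal sketch of this objective, assuming a standard symmetric InfoNCE over L2-normalized batches of paired embeddings (the temperature value and helper names are assumptions, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings (B, d)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def dual_supervision_loss(z_eeg, z_img, z_txt, lam=0.5):
    """L = lambda * L_img + (1 - lambda) * L_txt."""
    return lam * info_nce(z_eeg, z_img) + (1.0 - lam) * info_nce(z_eeg, z_txt)
```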
Stage 2: Cold-start Training for the Adapter
We propose pre-training the adapter on image-domain data $\hat{Z}^I$ (which shares the CLIP space with EEG features $\hat{Z}^e$) using LLaVA-Pretrain-558k before EEG instruction tuning. This aligns the MLLM with the CLIP space and provides a strong initialization for EEG-domain tuning.
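As a rough illustration of why this cold start transfers, the sketch below shows an adapter that maps 768-d CLIP-space features into the LLM embedding space; the two-layer MLP and the 4096-d LLM dimension are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects CLIP-space features (768-d) into the LLM embedding space."""
    def __init__(self, clip_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, z):          # z: (B, clip_dim)
        return self.proj(z)        # (B, llm_dim), prepended to the text tokens

# Stage 2: train the adapter on CLIP image features from LLaVA-Pretrain-558k
# with the LLM frozen; Stage 3 then reuses the same adapter for EEG features,
# which live in the same CLIP space.
```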
Stage 3: EEG Instruction Tuning
At this stage, we perform instruction tuning on WaveMind-Instruct-338K. The LoRA module and the modality adapter are unfrozen during training, while the ATMM is kept frozen.
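A hypothetical sketch of this trainability configuration using the PEFT library; the module handles `atmm`, `adapter`, and `llm`, as well as the LoRA hyperparameters and target modules, are placeholders rather than the authors' settings.

```python
from peft import LoraConfig, get_peft_model

def configure_stage3(atmm, adapter, llm):
    """Stage 3: freeze the ATMM encoder, train the adapter, add LoRA to the LLM."""
    for p in atmm.parameters():        # EEG encoder (ATMM) stays frozen
        p.requires_grad = False
    for p in adapter.parameters():     # modality adapter is trained
        p.requires_grad = True
    lora_cfg = LoraConfig(r=16, lora_alpha=32,
                          target_modules=["q_proj", "v_proj"])
    return get_peft_model(llm, lora_cfg)   # only LoRA weights are trainable in the LLM
```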
Figure 2: Instruction construction pipeline of WaveMind. The raw signals are first pre-processed into segments with the same configuration, then passed through different instruction synthesis processes depending on the type of supervision. We construct four types of instructions to ensure the model learns diverse knowledge.
Results
Classification Evaluation
We use WaveMind-Bench to evaluate the MLLM’s ability to recognize objects in visual stimuli and annotation categories represented in EEG. Table 1 presents MCQ classification results. WaveMind’s performance was assessed under three input conditions: random EEG, real EEG, and real EEG with the RAG module. Crucially, when predicting with real EEG data, WaveMind significantly outperforms the random-input baseline across all evaluated datasets. The RAG module further improves classification results on most EEG datasets. Notably, cognitive-task classification and MCQs with many options show the largest gains: THING-EEG’s 40-way accuracy roughly doubled from 0.122 to 0.250, while ImageNet-EEG increased from 0.574 to 0.603. Non-cognitive tasks also benefited, with TUEV improving from 0.888 to 0.904 and TUAB reaching 0.741. Only SEED showed a slight decrease.
| Dataset | Evaluation Protocol | k | Random EEG (2-way) | Random EEG (4-way) | Random EEG (k-way) | Real EEG (2-way) | Real EEG (4-way) | Real EEG (k-way) | EEG w/RAG† (2-way) | EEG w/RAG† (4-way) | EEG w/RAG† (k-way) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TUEV | SI | 6 | 0.434 | 0.240 | 0.159 | 0.940 | 0.867 | 0.888 | 0.925 | 0.890 | 0.904 |
| TUAB | SI | 2 | 0.501 | / | / | 0.736 | / | / | 0.741 | / | / |
| SEED | SD | 4 | 0.515 | / | 0.335 | 0.684 | / | 0.543 | 0.676 | / | 0.529 |
| ImageNet-EEG | SD | 40 | 0.507 | 0.244 | 0.021 | 0.914 | 0.853 | 0.574 | 0.937 | 0.887 | 0.603 |
| THING-EEG | SD | 40 | 0.474 | 0.243 | 0.027 | 0.760 | 0.554 | 0.122 | 0.869 | 0.721 | 0.250 |
† RAG: Retrieval-Augmented Generation
Table 1: Averaged classification results on WaveMind-Bench. Weighted accuracy over k options is reported, where each question consists of 1 correct and k−1 wrong options. The model is asked to output the letter corresponding to the correct option.
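A minimal sketch of how a k-way MCQ could be constructed and scored under this protocol (the option formatting and helper name are illustrative, not the benchmark's released code):

```python
import random
import string

def build_mcq(correct_label, distractor_pool, k=4, rng=random):
    """Build one k-way MCQ: 1 correct option plus (k-1) distractors, lettered A, B, ..."""
    distractors = rng.sample([d for d in distractor_pool if d != correct_label], k - 1)
    options = distractors + [correct_label]
    rng.shuffle(options)
    letters = string.ascii_uppercase[:k]
    prompt = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    answer = letters[options.index(correct_label)]
    return prompt, answer

# Accuracy is the fraction of questions where the model's output letter matches `answer`.
```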
Generalization Evaluation
For cognitive tasks using the THING-EEG dataset, we additionally assess closed-set and zero-shot conditions under both subject-dependent (SD) and subject-independent (SI) protocols. As shown in Table 2, WaveMind maintains consistent and accurate decoding performance whether encountering unseen object categories or untrained subjects.
| Group | Protocol | Closed-set, 1573 classes (2-way) | Closed-set (4-way) | Closed-set (40-way) | Zero-shot, 200 classes (2-way) | Zero-shot (4-way) | Zero-shot (40-way) |
|---|---|---|---|---|---|---|---|
| Chance | – | 0.500 | 0.250 | 0.033 | 0.500 | 0.250 | 0.033 |
| Real EEG | SD | 0.728 | 0.504 | 0.096 | 0.756 | 0.574 | 0.128 |
| Real EEG | SI | 0.680 | 0.419 | 0.074 | 0.689 | 0.442 | 0.058 |
| EEG w/RAG | SD | 0.786 | 0.627 | 0.182 | 0.862 | 0.732 | 0.243 |
| EEG w/RAG | SI | 0.698 | 0.492 | 0.108 | 0.761 | 0.578 | 0.159 |
Table 2: k-way classification performance on the THING-EEG dataset. Weighted accuracy is reported due to the class imbalance in the closed-set evaluation.
Contributions
• Unified EEG Alignment Framework: We propose WaveMind, a novel alignment framework that projects EEG signals paired with diverse modalities into a shared semantic space.
• Comprehensive Dataset and Benchmark: We synthesized WaveMind-Instruct-338K, the first cross-task instruction dataset, comprising 4 instruction-tuning types and 2 chat scenarios, along with WaveMind-Bench, which contains 12K MCQs, to facilitate evaluation of EEG-MLLMs.
• Multi-Stage Training and Performance: We propose a three-stage training scheme to fully unlock the model’s ability to recognize and understand EEG. The model performs well on classification tasks and has initially acquired the ability to answer open-ended questions.
* Benyou is the corresponding author.