WaveMind: Towards a Generalist EEG Foundation Model Aligned to Textual and Visual Modalities

Ziyi Zeng$^1$, Zhenyang Cai$^1$, Yixi Cai$^1$, Xidong Wang$^1$,
Rongsheng Wang$^1$, Siqi Cai$^2$, Benyou Wang$^1$* , Haizhou Li$^1$
$^1$ The Chinese University of Hong Kong, Shenzhen
$^2$ Harbin Institute of Technology, Shenzhen
*wangbenyou@cuhk.edu.cn

Abstract

While electroencephalography (EEG) interpretation using multimodal large language models (MLLMs) offers a novel approach for analyzing brain signals, the inherent complexity of brain activity poses significant challenges. This complexity stems from concurrent cognitive functions associated with consciousness and non-cognitive processes involved in homeostasis, generating distinct supervisory modalities during model training. To address these limitations, we propose WaveMind, an EEG alignment framework designed for EEG-MLLM training that projects EEG data into a shared semantic space across different supervisory modalities. To develop a cross-task EEG interpretation chatbot, we further contribute WaveMind-Instruct, comprising 338k GPT-assisted synthesized instruction pairs for fine-tuning. The resulting chatbot demonstrates robust classification performance and enables flexible, open-ended conversations covering four downstream tasks. Ablative analysis reveals the complementary relationship between diverse brain activities and supervision modalities, providing valuable insights for both neuroscience and the development of general-purpose EEG interpretation systems.

Background

Objective

The main objective of this paper is to show that training on a single type of brain activity limits the generalization ability of EEG-MLLMs. We propose a multimodal EEG alignment framework, WaveMind, for EEG-MLLM training to address this limitation.

Methodology

Architecture

WaveMind decodes raw EEG signals $X_e \in \mathbb{R}^{T \times C}$ into natural language tokens $W = \{w_1, \dots, w_N\}$ via an LLM backbone, as shown in Figure 1 below. Since many works have shown that injecting category information benefits model generation, WaveMind incorporates a Retrieval-Augmented Generation (RAG) module that stores multimodal supervision features (i.e., $\hat{Z}^I$ and $\hat{Z}^T$) together with their categories.

Figure 1: The overall architecture of WaveMind. Left: three-stage training procedure. Right: inference procedure of WaveMind. The system projects EEG data into a unified semantic space and integrates retrieval-augmented generation (RAG) for robust language generation.
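To make the retrieval step concrete, below is a minimal sketch of how such a RAG store could work, assuming cosine similarity over L2-normalized CLIP-space features; the names `SupervisionStore` and `retrieve` are illustrative and not taken from the WaveMind codebase.

```python
import torch
import torch.nn.functional as F

class SupervisionStore:
    """Caches CLIP-space supervision features together with their category labels."""

    def __init__(self):
        self.feats = []    # list of (768,) feature tensors
        self.labels = []   # parallel list of category strings

    def add(self, feat: torch.Tensor, label: str) -> None:
        # L2-normalize so a dot product equals cosine similarity.
        self.feats.append(F.normalize(feat, dim=-1))
        self.labels.append(label)

    def retrieve(self, eeg_feat: torch.Tensor, top_k: int = 3) -> list:
        """Return the top-k category labels closest to an EEG embedding."""
        bank = torch.stack(self.feats)           # (N, 768)
        query = F.normalize(eeg_feat, dim=-1)    # (768,)
        sims = bank @ query                      # (N,) cosine similarities
        idx = sims.topk(min(top_k, len(self.labels))).indices
        return [self.labels[i] for i in idx.tolist()]

# Toy usage: retrieved labels can be appended to the prompt as category hints.
store = SupervisionStore()
store.add(torch.randn(768), "seizure")
store.add(torch.randn(768), "normal")
hints = store.retrieve(torch.randn(768), top_k=2)
```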

Training Paradigm

Stage 1: Dual-Supervision Representation Alignment

We align EEG features into a unified space using a dual-supervised CLIP framework:

• CLIP-ViT extracts image-guided features: $\mathbf{Z}_I \in \mathbb{R}^{768}$

• CLIP-BERT produces semantic features: $\mathbf{Z}_T \in \mathbb{R}^{768}$

After L2 normalization, both are mapped into the same CLIP space.

The objective function combines two InfoNCE losses: $\mathcal{L} = \lambda \mathcal{L}_{\text{img}} + (1 - \lambda)\mathcal{L}_{\text{txt}}$. We train on 1.2M EEG pairs, and the resulting encoder outperforms 7 baseline encoders. Adding an auxiliary classification loss $\mathcal{L}_{\text{cls}}$ showed no performance gain.
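As a sketch of how this combined objective could be implemented, the snippet below computes an InfoNCE loss between EEG embeddings and each supervision modality and mixes the two terms with $\lambda$; the symmetric form and the temperature value are assumptions rather than details stated above.

```python
import torch
import torch.nn.functional as F

def info_nce(z_eeg, z_target, temperature=0.07):
    """Symmetric InfoNCE between a batch of EEG embeddings and target embeddings."""
    z_eeg = F.normalize(z_eeg, dim=-1)            # (B, 768)
    z_target = F.normalize(z_target, dim=-1)      # (B, 768)
    logits = z_eeg @ z_target.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(z_eeg.size(0), device=z_eeg.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def dual_supervision_loss(z_eeg, z_img, z_txt, lam=0.5):
    """L = lambda * L_img + (1 - lambda) * L_txt."""
    return lam * info_nce(z_eeg, z_img) + (1.0 - lam) * info_nce(z_eeg, z_txt)
```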

Stage 2: Cold-start Training for the Adapter

We propose pre-training the adapter on image-domain features $\hat{Z}^I$ (which share the CLIP space with EEG features $\hat{Z}^e$) using LLaVA-Pretrain-558k before EEG instruction tuning. This aligns the MLLM with the CLIP space and provides a warm start for EEG-domain tuning.
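A minimal sketch of this cold-start step, assuming the modality adapter is a small MLP that projects 768-dimensional CLIP-space features into the LLM's hidden size (the layer sizes and class name are illustrative): during cold start it consumes image features $\hat{Z}^I$, and during instruction tuning the same module consumes EEG features $\hat{Z}^e$.

```python
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects 768-d CLIP-space features into the LLM embedding space."""

    def __init__(self, clip_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, z):
        # z: (B, clip_dim) -- image features during cold start, EEG features later.
        return self.proj(z)
```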

Stage 3: EEG Instruction Tuning

At this stage, we perform instruction tuning using WaveMind-Instruct-338K. The LoRA module and modality adapter are unfrozen during training, while the ATMM remains frozen.
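The trainable-parameter setup at this stage could look like the sketch below, assuming a Hugging Face PEFT-style LoRA wrapper around the LLM; `eeg_encoder` (the frozen ATMM), `adapter`, and the LoRA hyperparameters are placeholders rather than the authors' exact settings.

```python
from peft import LoraConfig, get_peft_model

def setup_stage3(llm, eeg_encoder, adapter):
    # Freeze the EEG encoder (ATMM): its representations stay fixed.
    for p in eeg_encoder.parameters():
        p.requires_grad = False

    # Keep the modality adapter trainable.
    for p in adapter.parameters():
        p.requires_grad = True

    # Attach LoRA modules to the LLM; only the LoRA weights are updated.
    lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
    llm = get_peft_model(llm, lora_cfg)
    return llm, eeg_encoder, adapter
```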

Figure 2: Instruction construction pipeline of WaveMind. The raw signals are first pre-processed into segments with the same configuration, then passed through different instruction synthesis processes depending on the type of supervision. Four types of instructions are constructed to ensure the model learns diverse knowledge.

Results

Classification Evaluation

We use WaveMind-Bench to evaluate the MLLM’s ability to recognize objects in visual stimuli and annotation categories represented in EEG. Table 1 presents the MCQ classification results. WaveMind’s performance was assessed under three input conditions: random EEG, real EEG, and real EEG with the RAG module. Crucially, when predicting from real EEG data, WaveMind significantly outperforms the random-input baseline across all evaluated datasets. The RAG module further improves classification results on most EEG datasets. Notably, cognitive-task classification and MCQs with many options show the largest gains: THING-EEG’s 40-way accuracy roughly doubles from 0.122 to 0.250, while ImageNet-EEG increases from 0.574 to 0.603. Non-cognitive tasks also benefit, with TUEV improving from 0.888 to 0.904 and TUAB reaching 0.741. Only SEED shows a slight decrease.

| Dataset | Evaluation Protocol | k | Random EEG (2 / 4 / k) | Real EEG (2 / 4 / k) | EEG w/ RAG (2 / 4 / k) |
| --- | --- | --- | --- | --- | --- |
| TUEV | SI | 6 | 0.434 / 0.240 / 0.159 | 0.940 / 0.867 / 0.888 | 0.925 / 0.890 / 0.904 |
| TUAB | SI | 2 | 0.501 / – / – | 0.736 / – / – | 0.741 / – / – |
| SEED | SD | 4 | 0.515 / – / 0.335 | 0.684 / – / 0.543 | 0.676 / – / 0.529 |
| ImageNet-EEG | SD | 40 | 0.507 / 0.244 / 0.021 | 0.914 / 0.853 / 0.574 | 0.937 / 0.887 / 0.603 |
| THING-EEG | SD | 40 | 0.474 / 0.243 / 0.027 | 0.760 / 0.554 / 0.122 | 0.869 / 0.721 / 0.250 |

RAG: Retrieval-Augmented Generation; SI: Subject-Independent; SD: Subject-Dependent.

Table 1: Averaged classification results on WaveMind-Bench. Weighted accuracy over k options is reported, where each question consists of 1 correct and k-1 incorrect options; the model is asked to output the letter of the correct option.
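To illustrate the metric, the snippet below shows one plausible reading of "weighted accuracy": per-class MCQ accuracy averaged over classes to compensate for class imbalance. This weighting scheme is an assumption, not the authors' evaluation code.

```python
from collections import defaultdict

def class_weighted_mcq_accuracy(predictions, answers, classes):
    """Average per-class accuracy over k-way MCQs.

    predictions / answers: option letters (e.g. "A"); classes: the ground-truth
    category behind each question's correct option.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ans, cls in zip(predictions, answers, classes):
        total[cls] += 1
        correct[cls] += int(pred == ans)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)

# Toy usage on three 2-way questions from two classes:
acc = class_weighted_mcq_accuracy(["A", "B", "A"], ["A", "B", "B"], ["cat", "dog", "dog"])
```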

Generalization Evaluation

For cognitive tasks on the THING-EEG dataset, we additionally assess closed-set and zero-shot conditions under both subject-dependent (SD) and subject-independent (SI) protocols. As shown in Table 2, WaveMind maintains consistent and accurate decoding performance whether encountering unseen object categories or untrained subjects.

| Group | Protocol | Closed-set (1,573 classes): k = 2 / 4 / 40 | Zero-shot (200 classes): k = 2 / 4 / 40 |
| --- | --- | --- | --- |
| Chance | – | 0.500 / 0.250 / 0.033 | 0.500 / 0.250 / 0.033 |
| Real EEG | SD | 0.728 / 0.504 / 0.096 | 0.756 / 0.574 / 0.128 |
| Real EEG | SI | 0.680 / 0.419 / 0.074 | 0.689 / 0.442 / 0.058 |
| EEG w/ RAG | SD | 0.786 / 0.627 / 0.182 | 0.862 / 0.732 / 0.243 |
| EEG w/ RAG | SI | 0.698 / 0.492 / 0.108 | 0.761 / 0.578 / 0.159 |

Table 2: k-way classification performance on the THING-EEG dataset. Weighted accuracy is reported due to the class imbalance in the closed-set evaluation.

Contributions

Unified EEG Alignment Framework: We propose WaveMind, a novel alignment framework that projects EEG signals paired with diverse modalities into a shared semantic space.

Comprehensive Dataset and Benchmark: We synthesized WaveMind-Instruct, the first cross-task instruction dataset, comprising 4 instruction-tuning types and 2 chat scenarios, along with WaveMind-Bench, which contains 12K MCQs, to facilitate evaluation of EEG-MLLMs.

Multi-Stage Training and Performance: We propose a three-stage training scheme to fully unlock the model’s ability to recognize and understand EEG. The model performs well on classification tasks and has initially acquired the ability to answer open-ended questions.

* Benyou is the corresponding author.