[Lingtyp] VL-JEPA (Meta FAIR): Implications for the relationship between thought, language, and communication

Stela MANOVA manova.stela at gmail.com
Mon Jan 5 16:25:54 UTC 2026


Dear colleagues,

I would like to draw your attention to a recent development in vision–language modeling that appears highly relevant to linguistic discussions of the relationship between meaning and form.

On 11 December 2025, a team from Meta FAIR released VL-JEPA (Vision–Language Joint Embedding Predictive Architecture) (arXiv:2512.10942v1, Chen et al.; including Yann LeCun and Pascale Fung). VL-JEPA departs in a principled way from standard token-generative models. Instead of being trained to predict linguistic output token-by-token, it is trained to predict continuous semantic embeddings. Linguistic output is produced only optionally at inference time via a lightweight decoder, when human-readable communication is required.

VL-JEPA operates primarily in a non-linguistic semantic space. This suggests a model architecture where:

- Meaning exists as an abstract, non-linguistic representation.
- Form (symbolic, token-based representation) is a secondary encoding layer used for transmission.

Empirically, the model shows that core tasks—classification, retrieval, and visual question answering—can be performed efficiently at the level of these semantic embeddings. Linguistic form thus appears to serve communicative efficiency rather than being a prerequisite for the underlying cognitive processing itself. These results resonate with recent findings in cognitive neuroscience suggesting that language is a tool for communication rather than the primary medium of thought (e.g., work by Evelina Fedorenko and colleagues).
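For colleagues who find a concrete illustration helpful, the contrast with token-by-token generation can be sketched in a few lines of Python. This is a toy sketch of the general JEPA idea, not Meta's implementation: the encoders, dimensions, loss, and mini "vocabulary" below are all invented for illustration. The point is only that training and inference happen in a continuous embedding space, with text produced by a separate decoding step only when needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy encoder: a linear map followed by L2 normalisation."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Hypothetical dimensions: 16-dim "visual" input, 8-dim semantic space.
D_IN, D_EMB = 16, 8
W_ctx = rng.normal(size=(D_IN, D_EMB))  # context (e.g. masked-view) encoder
W_tgt = rng.normal(size=(D_IN, D_EMB))  # target (full-view) encoder

x_context = rng.normal(size=(4, D_IN))                     # partial input
x_target = x_context + 0.1 * rng.normal(size=(4, D_IN))    # full input

z_pred = encode(x_context, W_ctx)  # predicted semantic embedding
z_tgt = encode(x_target, W_tgt)    # target semantic embedding

# JEPA-style objective: a distance in embedding space — no tokens involved.
loss = np.mean(np.sum((z_pred - z_tgt) ** 2, axis=-1))

# Only when human-readable output is required does a (hypothetical)
# lightweight decoder map an embedding onto a linguistic form, here by
# nearest-neighbour lookup in a tiny invented inventory.
vocabulary = {
    "a dog runs": rng.normal(size=D_EMB),
    "a cat sleeps": rng.normal(size=D_EMB),
}

def decode(z, vocab):
    """Return the inventory item whose vector best matches embedding z."""
    return max(vocab, key=lambda form: float(z @ vocab[form]))
```

On this picture, `loss` is computed entirely over meaning-like vectors, and `decode` is an optional, detachable step — a crude analogue of the form-as-transmission-layer view sketched above.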

This development invites us to revisit the relationship between meaning and form—an issue also central to practical concerns such as glossing and corpus construction. If language production can occur via both form-based mechanisms (generative next-token selection) and meaning-based mechanisms (non-generative, semantics-driven decoding), it raises a fundamental question for our field: what exactly should linguists be annotating in corpora, and at which level of representation?

We previously discussed form–meaning separation on this list using the analogy of film-making: silent film as the primary semantic content, with text and sound as a linguistic layer added for transmission. That distinction may serve as a useful anchor for further discussion here.

I will address this development in more detail in my upcoming workshop series. More generally, however, I look forward to any reflections the list may have on how these architectural shifts in AI might inform our typological theories.

Best wishes for the New Year,

Stela

--
Dr. Stela MANOVA  
Chief Executive Officer (CEO), MANOVA AI
Principal Investigator (PI), Gauss:AI Global 
Registration number 655453b
Sterngasse 3/2/6, 1010 Vienna, Austria
Email: manova at manova-ai.com
       manova at gaussaiglobal.com
       manova.stela at gmail.com
Web:  https://www.stelamanova.com
      https://www.manova-ai.com
      https://gaussaiglobal.com

Workshop Series (2026) Linguistics Meets ChatGPT: From Prompt to Theory <https://gaussaiglobal.com/LingTransformer/>
Call for Participation and 5-Minute Talks <https://drive.google.com/file/d/1WJ2tOSgyXDAt0S-AyYys05wU4HyvLykt/view?usp=sharing>
Registration for Block 1 workshops is now open -- visit the Block 1 webpage <https://gaussaiglobal.com/WS-Block-1/> to register! 
Apply for a student stipend!
