VDT: General-purpose Video Diffusion Transformers via Mask Modeling Paper • 2305.13311 • Published May 22, 2023
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training Paper • 2103.06561 • Published Mar 11, 2021
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism Paper • 2401.02954 • Published Jan 5, 2024 • 50
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8, 2024 • 48
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling Paper • 2302.06605 • Published Feb 13, 2023
Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs Paper • 2406.09367 • Published Jun 13, 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining Paper • 2410.16166 • Published Oct 21, 2024
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22 • 126
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization Paper • 2503.10615 • Published Mar 13 • 17