1st Review
LLM
VLM
- [x] 2403.05468 Will GPT-4 Run DOOM? (arxiv.org), [[Will GPT-4 Run DOOM]]
- [ ] [[SpatialVLM- Endowing Vision-Language Models with Spatial Reasoning Capabilities]] ⭐
- [ ] [[An Image is Worth Half Tokens After Layer 2- Plug-and-Play Inference Acceleration for Large Vision-Language Models]]
- [ ] PaLI-3 Vision Language Models: Smaller, Faster, Stronger, Oct 2023
- [ ] [[LLaMA-Adapter V2 Parameter-Efficient Visual Instruction Model]]
- [ ] [[PaLM2-VAdapter Progressively Aligned Language Model Makes a Strong Vision-language Adapter]]
- [ ] Matcha: "Chat with the Environment: Interactive Multimodal Perception using Large Language Models", IROS, 2023
- [ ] [[LLaVA-Med Large Language and Vision Assistant for BioMedicine]]
- [ ] Luodian/Otter: 🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability. (github.com), May 2023
- Performance seems better.
- Resolution has no effect.
- [[Otter, A Multi-Modal Model with In-Context Instruction Tuning]]
- [ ] 2204.14198 Flamingo: a Visual Language Model for Few-Shot Learning (arxiv.org), Nov 2022
VLA