We propose a learning architecture for integrating multi-modal information, e.g., vision and audio. In recent years, artificial intelligence (AI) has made major progress in key tasks such as language, vision, and voice recognition. Most studies