- Introduction -

The ICDAR 2023 BDVT-QA Competition (Competition on Born Digital Video Text Question Answering) is coming. Textual information plays an important role in video understanding, as text instances are either direct indicators of scenes or linguistic cues about ongoing stories. Numerous works have addressed related tasks such as video text recognition and image text QA. In this competition, we go one step further and explore the video text QA problem, which requires a holistic, precise, and in-depth understanding of text information across space and time over video frames. Typical video text applications include navigation in advanced driver assistance systems, assisted shopping on live streams, and conversation understanding in drama. Though widely used, video text has rarely been explored because of challenging factors such as arbitrarily shaped text trajectories, animated text presentation, and long-term text language processing. To spark interest in tackling these challenges, we propose a novel task of question answering by reading text in born digital videos, which are widespread on the Internet. We refer to this problem as Born Digital Video Text QA.

- Competition Tasks -

Several comprehensive benchmarks for text-based VQA already exist (ST-VQA, TextVQA, etc.). Nevertheless, these works mainly focus on text in images, and benchmarks for video text-based tasks are lacking. In this competition, we take a significant step forward and present a new BDVT-QA (short for Born Digital Video Text QA) benchmark. To this end, we have collected more than 1,000 born digital videos (20 seconds per video on average) and annotated detailed information for each video, including the locations of text trajectories, transcriptions of text lines, and question-answer pairs. Questions in BDVT-QA mainly concern inferring the topics of videos and the temporal context across descriptive text, which means the potential answers are presented progressively as the video plays. Two tasks are proposed within this competition: (1) End-to-End Video Text Spotting; (2) Video Text Question Answering.
Task 1: End-to-End Video Text Spotting
The objective of this task is to assess the end-to-end performance of video text spotting systems. It requires models to localize, track, and recognize words simultaneously. In Task 1, the Normalized Edit Distance will serve as the official ranking metric, while the results of other metrics will be published for reference only.
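For reference, a minimal sketch of a Normalized Edit Distance computation is given below. The exact formulation used for official ranking is defined by the competition; this sketch assumes the common convention of the Levenshtein distance divided by the length of the longer string, so 0.0 indicates a perfect match.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # distances from a[:0] to every prefix of b
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                        # deletion of a[i-1]
                dp[j - 1] + 1,                    # insertion of b[j-1]
                prev + (a[i - 1] != b[j - 1]),    # substitution (free on match)
            )
            prev = cur
    return dp[n]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """Assumed formulation: edit distance normalized by the longer string."""
    if not pred and not gt:
        return 0.0  # both empty: perfect match
    return edit_distance(pred, gt) / max(len(pred), len(gt))
```

For example, `normalized_edit_distance("sitting", "kitten")` yields 3/7, since three edits are needed and the longer string has seven characters.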
Task 2: Video Text Question Answering
This task is the most generic and challenging one, since it requires participants to combine video text spotting and video question answering technologies. The submitted methods should be able to provide correct answers to the given questions by reading, tracking, and comprehending all text instances in the videos.

- Awards -

A bonus will be shared among the top three teams in each task ("$" denotes US dollars).

- Organizers -

Alibaba DAMO Academy
Nanjing University
Huazhong University of Science and Technology
Institute of Automation, Chinese Academy of Sciences

- Contact -

Competition Official Email: icdar2023_bdvtqa@list.alibaba-inc.com