- Introduction -

The ICDAR 2023 BDVT-QA Competition (Competition on Born Digital Video Text Question Answering) is coming. Textual information plays an important role in video understanding, as text instances are either direct indicators of scenes or linguistic cues about ongoing stories. Numerous works have been proposed for related tasks such as video text recognition and image text QA. In this competition, we would like to go one step further and explore the video text QA problem, which requires a holistic, precise and in-depth understanding of text information across space and time over video frames. Typical video text applications include navigation in advanced driver assistance systems, assistive shopping on live streams, and conversation understanding in dramas. Though video text is widely used, it has rarely been explored because of challenging factors such as arbitrarily-shaped text trajectories, animated text presentation and long-term language processing of text. To arouse interest in tackling these challenges, we propose a novel task of question answering by reading text in born digital videos, which are widespread on the Internet. We refer to this problem as Born Digital Video Text QA.

- Competition Tasks -

Several comprehensive benchmarks already exist for text-based VQA (ST-VQA, TextVQA, etc.). Nevertheless, these works mainly focus on text in images, and benchmarks for video text-based tasks are lacking. In the proposed competition, we take a significant step further and present a new BDVT-QA (short for Born Digital Video Text QA) benchmark. To this end, we have collected more than 1,000 born digital videos (20 seconds per video on average) and annotated detailed information for each video, including locations of text trajectories, transcriptions of text lines and question-answer pairs. Questions in BDVT-QA are mainly about inferring the topics of videos and the temporal context between description texts, which means the potential answers are presented progressively over the course of the video. Two tasks are proposed within this competition: (1) End-to-End Video Text Spotting; (2) Video Text Question Answering.
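
To make the three kinds of annotation mentioned above more concrete (text trajectory locations, text line transcriptions and question-answer pairs), here is a minimal Python sketch of how one annotated video might be held in memory. It is not the official BDVT-QA annotation format: every class name, field name and the per-frame polygon representation are assumptions made for illustration only, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels; an assumed convention


@dataclass
class TextTrajectory:
    """One text line tracked over time (hypothetical structure)."""
    track_id: int
    transcription: str
    # Assumed layout: frame index -> polygon outlining the text in that frame.
    polygons: Dict[int, List[Point]] = field(default_factory=dict)

    def frame_span(self) -> Tuple[int, int]:
        """First and last frame in which the text appears (needs >= 1 frame)."""
        return min(self.polygons), max(self.polygons)


@dataclass
class QAPair:
    """A question about the video and its answer (hypothetical structure)."""
    question: str
    answer: str


@dataclass
class VideoAnnotation:
    """All annotations attached to one video (hypothetical structure)."""
    video_id: str
    trajectories: List[TextTrajectory] = field(default_factory=list)
    qa_pairs: List[QAPair] = field(default_factory=list)

    def texts_visible_at(self, frame: int) -> List[str]:
        """Transcriptions of every text line annotated in the given frame."""
        return [t.transcription for t in self.trajectories if frame in t.polygons]


# Invented sample data, used only to exercise the sketch.
box = [(10.0, 10.0), (90.0, 10.0), (90.0, 30.0), (10.0, 30.0)]
example = VideoAnnotation(
    video_id="demo",
    trajectories=[
        TextTrajectory(0, "Step 1: boil water", {0: box, 1: box}),
        TextTrajectory(1, "Step 2: add noodles", {2: box, 3: box}),
    ],
    qa_pairs=[QAPair("What is added after the water boils?", "noodles")],
)
```

Because the texts appear one after another, a question such as the sample one can only be answered by reading text from several frames, which is the progressive presentation of answers described above. For instance, `example.texts_visible_at(2)` returns only the second text line.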