Evaluating and Improving Source Code Extraction From Programming Screencasts

The ability to extract and accurately transcribe source code from tutorial videos and online courses is crucial for software developers seeking to reuse and adapt code. However, current methods that rely on optical character recognition (OCR) often produce inaccurate results due to the complexity of the code and variations in screencast formats. In this dissertation, we present three studies that evaluate and improve the accuracy of code extraction from screencasts and code images. In the first study, we empirically evaluate six OCR engines for extracting source code from screencasts and code images. Our results show that transcription accuracy varies greatly from one OCR engine to another and that the OCR engine most widely chosen in previous studies is not the best choice. We also identify font type and size as factors that affect the results of some OCR engines, and provide guidelines for programming screencast creators and for researchers aiming to analyze source code in screencasts. In the second study, we propose a novel method that uses a pre-trained encoder-decoder model for code understanding and generation to enhance code extraction accuracy. By leveraging a large and diverse dataset of source code images and a state-of-the-art encoder-decoder model, we demonstrate significant improvements in the accuracy of the extracted code, outperforming the baseline OCR engine. The proposed approach is also significantly more time-efficient than the baseline. In the third study, we explore the effectiveness of image pre-processing techniques for improving code extraction accuracy from programming screenshots and screencasts. We comprehensively evaluate various image pre-processing methods, including resampling and denoising, that aim to enhance the quality of the input images. We also explore post-processing techniques, such as OCR post-processing with CodeT5-base-OCRfix and PLBART-base-OCRfix, to further refine the extracted code and improve overall accuracy. Overall, our research opens new opportunities for using advanced models and diverse datasets to enhance the accuracy of code extraction from screencasts and images, and highlights the importance of carefully evaluating and selecting OCR engines for source code recognition.
Code Extraction, Programming Screencasts, OCR Postprocessing, OCR Image Preprocessing