Evaluating and Improving Source Code Extraction From Programming Screencasts

Date

2023-07

Authors

Malkadi, Abdulkarim

Abstract

The ability to extract and accurately transcribe source code from tutorial videos and online courses is crucial for software developers seeking to reuse and adapt code. However, current methods that rely on optical character recognition (OCR) often produce inaccurate results due to the complexity of the code and variations in screencast formats. In this dissertation, we present three studies that evaluate and improve the accuracy of code extraction from screencasts and code images. In the first study, we empirically evaluate six OCR engines for extracting source code from screencasts and code images. Our results show that transcription accuracy varies greatly from one OCR engine to another and that the OCR engine most widely used in previous studies is not the best choice. We also identify font type and size as factors that affect the results of some OCR engines, and we provide guidelines for programming screencast creators and for researchers analyzing source code in screencasts. In the second study, we propose a novel method that uses a pre-trained encoder-decoder model for code understanding and generation to enhance code extraction accuracy. By leveraging a large and diverse dataset of source code images and a state-of-the-art encoder-decoder model, we demonstrate significant improvements in the accuracy of the extracted code, outperforming the baseline OCR engine. Our proposed approach is also significantly more time-efficient than the baseline. In the third study, we comprehensively evaluate image pre-processing techniques, including resampling and denoising, for enhancing the quality of input images and improving code extraction accuracy from programming screenshots and screencasts. We also explore post-processing techniques, such as OCR post-processing with CodeT5-base-OCRfix and PLBART-base-OCRfix, to further refine the extracted code and improve overall accuracy. Overall, our research opens new opportunities for using advanced models and diverse datasets to enhance the accuracy of code extraction from screencasts and images, and it highlights the importance of careful evaluation and selection of OCR engines for source code recognition.

Keywords

Code Extraction, Programming Screencasts, OCR Postprocessing, OCR Image Preprocessing

URI

https://hdl.handle.net/20.500.14154/68867

Collections

SACM - United States of America
