As we wrote in a previous post on AI-Assisted Grading, we built Gradescope to give instructors grading superpowers. Our technology allows instructors to spend less time on grading and other administrative tasks so that they can spend more time interacting with students and improving instruction.
Gradescope is used to grade online assignments, programming projects, and scanned handwritten work, so part of what our technology needs to do is handle handwritten text. In this two-part post, we detail the challenges in addressing this problem, current State-of-the-Art (SOTA) approaches, and how our End-to-End Deep Learning system performs.
The Challenge
The handwriting images we handle are test submissions turned in by students. They start out as physical papers filled out by hand and are then converted to digital images, from which the student’s work is automatically extracted. The resulting images (Figure 1) are partial or full-page sized and may contain multiple regions of handwritten text, math equations, tables, drawings, diagrams, side notes, scratched-out text, text inserted using an arrow or caret, and other artifacts. The content varies widely, spanning many subjects from grade school level all the way up to postgraduate courses.
The role of our handwriting recognition Artificial Intelligence (AI) is to identify and transcribe the handwritten answers from these images. Furthermore, since we need to serve a variety of use cases, the AI must go beyond just text recognition and perform additional tasks. Specifically, it must:
- Identify regions that should be transcribed as text and regions that should not be: drawings, scratched-out text, special symbols, tables, and math.
- Correctly transcribe the regions of text.
- Emit the transcribed regions in the sequence they were intended to be read.
- Perform auxiliary tasks such as producing formatting and semantic cues.
- Perform at least as well as publicly available text recognition services.
We call this problem Full Page Handwriting Recognition (Full Page HTR).[1] This problem is much harder than classical Handwritten Text Recognition (HTR), which is limited to the recognition of text in images of single words or single lines of text.
Figure 1. Data examples: (a) Full page text with drawing. (b) Full page computer source code. (c) Diagrams and text with embedded math. (d) Math and text regions, embedded math, and stray artifacts.
Academic literature and the typical approaches to this problem usually only attempt to recognize cropped images of single words or lines of text. The task of cropping said words/lines is delegated to another step, called image segmentation. An end-to-end text-recognition system is expected to chain these two steps together, followed by a third step: stitching the individually recognized units back into a passage. This approach suffers from a few problems:
First, image segmentation is usually based on hand-crafted features and heuristics, which are not robust to different sources of data and might break under unexpected scanning conditions, i.e., they are brittle.[2]
Second, clean segmentation of text is not even possible in many cases, e.g., when lines are curved or interspersed with non-textual symbols and artifacts, which is very common in the data that we deal with.
Third, stitching a complete transcription from the individually transcribed text regions introduces yet another system, with its own potential for errors and brittleness to changing data.
Fourth, in order to boost their accuracy, classical systems include closed lexicon decoding, a mechanism that limits their vocabulary to a fixed set of words. This doesn’t work for us, since we must cater to the terminology of many different subjects, international proper nouns, and even special items like chemical molecular formulas.
Finally, a multi-step design fragments the end-to-end task, making it difficult to perform sub-tasks that require information from another stage. Examples include stitching the individually recognized pieces back into a passage without losing the original formatting and indentation (important when transcribing computer source code), and auxiliary tasks such as identifying tables, drawings, etc., and skipping over them even when they contain some text.
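To make the point about closed lexicon decoding concrete, here is a minimal sketch of how such a decoder might behave. It is not the code of any particular system: the function names, the toy lexicon, and the choice of edit distance as the matching rule are all assumptions made for illustration. The idea is simply that every raw recognizer output is snapped to the nearest word in a fixed vocabulary.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def closed_lexicon_decode(raw_word: str, lexicon: Iterable[str]) -> str:
    """Snap a raw recognizer output to the closest word in a fixed lexicon.

    Ties are broken alphabetically so the result is deterministic.
    """
    return min(lexicon, key=lambda w: (edit_distance(raw_word, w), w))


# Hypothetical toy lexicon; a real one would be far larger.
LEXICON = ["the", "water", "molecule", "contains", "hydrogen", "hoe", "ha"]

# A misread in-vocabulary word is repaired by the lexicon.
in_vocab = closed_lexicon_decode("hydrogem", LEXICON)

# A correctly read out-of-vocabulary token is forced onto some lexicon word.
out_of_vocab = closed_lexicon_decode("H2O", LEXICON)

print(in_vocab)
print(out_of_vocab)
```

In this sketch the misspelled "hydrogem" is corrected to "hydrogen", which is the accuracy boost that closed lexicons are used for. The formula "H2O", however, can never be emitted, because it is not in the vocabulary, so it is replaced by an unrelated lexicon entry. That is the failure mode that matters for student work full of subject-specific terms, names, and formulas.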
In view of these problems, we designed an End-to-End Deep Learning-based model architecture, i.e., one in which all of the above steps are implicit and learned from data. Adapting the model to a new dataset or adding new capabilities is just a matter of retraining or fine-tuning with different labels or different data.
Ultimately, our model beat the performance of text-recognition Cloud APIs available from all major vendors and established a new state of the art in Full Page Handwriting Recognition. Read all about it in the follow-up to this post.
1. Our definition of Full Page HTR goes beyond that of some other publications, which define it as the recognition of single paragraphs rather than full pages. In our opinion, that problem should instead be called Full Paragraph HTR.
2. Such systems would need to be redeveloped in order to accommodate the new data or conditions.