The Center for Unified Biometrics and Sensors (CUBS) at the University at Buffalo is releasing a new handwriting dataset to the research community. The IBM_UB dataset is a bi-modal (online and offline), multilingual corpus of ground-truthed handwritten documents. It contains a variety of handwritten content, ranging from pages of free-form cursive writing to forms, spontaneously written letters, and tables of words, isolated characters, and symbols. We expect this dataset to be a valuable resource for multilingual OCR development and for IR applications.
The handwritten data in this corpus was originally collected on IBM's CrossPad™ device. The CrossPad™ was a portable digital notepad whose electronic pen produced real ink on paper while simultaneously capturing the online pen trajectories. Each handwriting sample was therefore available both as a hardcopy (offline) paper document and as online trajectory data in IBM's native format.
Researchers at the University at Buffalo (CUBS) have (a) converted the online data from IBM's native format into the InkML format, (b) scanned the hardcopy documents into 300 dpi grayscale images (PNG format), (c) developed visualization tools for the online data, and (d) established the correspondence between the online and offline data and generated the ground truth at different levels of granularity for a subset of the entire corpus.
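As a rough illustration of what working with the converted online data might look like, the sketch below reads pen strokes from an InkML document using Python's standard library. It is not part of the CUBS tools. It assumes the traces use the standard W3C InkML namespace and store plain absolute decimal values with X and Y as the first two channels; the actual channel layout of the IBM_UB files may differ. The function name `parse_traces` and the `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Standard W3C InkML namespace (assumed to be the one used by the files).
INKML_NS = "{http://www.w3.org/2003/InkML}"


def parse_traces(inkml_text):
    """Return a list of strokes; each stroke is a list of (x, y) floats.

    Assumes every sample is written as plain absolute decimal numbers,
    with X and Y as the first two channels. Any further channels
    (for example time or pressure) are ignored.
    """
    root = ET.fromstring(inkml_text)
    strokes = []
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        # Samples within a trace are separated by commas,
        # channel values within a sample by whitespace.
        for sample in (trace.text or "").split(","):
            fields = sample.split()
            if len(fields) >= 2:
                points.append((float(fields[0]), float(fields[1])))
        strokes.append(points)
    return strokes


# Hypothetical two-stroke document, not taken from the dataset.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 0, 9 14, 8 28</trace>
  <trace>130 155, 144 159</trace>
</ink>"""

strokes = parse_traces(SAMPLE)
for index, stroke in enumerate(strokes):
    print(index, len(stroke), stroke[0])
```

Each `<trace>` element corresponds to one pen-down to pen-up stroke, so the returned list preserves the stroke order in which the page was written.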
The current release of data comprises two sections: IBM_UB_1 and IBM_UB_2.
IBM_UB_1 (v1) contains free-form cursive handwritten pages in English. It contains 6654 pages of online data collected from 43 writers (4138 summary pages and 2516 query pages). It also contains 5934 pages of offline data collected from 41 writers (3714 summary pages and 2220 query pages).
IBM_UB_2 contains handwritten pages in French collected from 200 authors. The pages are in the form of booklets, each of which has several typed lines that the author reproduces by hand. These lines contain short cursive sentences or discrete characters, symbols, and/or digits.
For this dataset:
Additional data sets will be released as more sections of the corpus are processed and prepared. Future releases will include data for other Latin-script-based languages (Italian and German) and evaluation tools for both online and offline recognition. An online word recognizer is being developed to establish baseline performance on the online data set and will be released under an open-source license.
Individuals and organizations interested in the data for research purposes will need to execute this sub-license agreement with UB before the data is released.
A. Shivram, C. Ramaiah, S. Setlur, and V. Govindaraju. IBM_UB_1: A Dual Mode Unconstrained English Handwriting Dataset. In Proc. of the 12th International Conference on Document Analysis and Recognition (ICDAR), pages 13-17, 2013.
For further details, please contact:
Prof. Venu Govindaraju
SUNY Distinguished Professor, University at Buffalo
+1 716 645 1558
We are grateful to IBM for the donation of this valuable data and especially to Michael Perrone of IBM Research for his leadership in making this possible.
This data release was funded in part by a generous gift from Google Inc. We would like to thank Henry Rowley and Ashok Popat from Google for their support.