UB Ridgebase Fingerprint Dataset

Contactless fingerprint matching using smartphone cameras can alleviate major challenges of traditional fingerprint systems, including hygienic acquisition, portability, and presentation attacks. However, development of practical and robust contactless fingerprint matching techniques is constrained by the limited availability of large-scale real-world datasets. To motivate further advances in contactless fingerprint matching across sensors, we introduce the RidgeBase benchmark dataset.

RidgeBase consists of more than 15,000 contactless and contact-based fingerprint image pairs acquired from 88 individuals under different background and lighting conditions using two smartphone cameras and one flatbed contact sensor. RidgeBase is designed to promote research under different matching scenarios that include Single Finger Matching and Multi-Finger Matching for both contactless-to-contactless (CL2CL) and contact-to-contactless (C2CL) verification and identification.

The RidgeBase dataset can be used for training and evaluating contactless fingerprint matching algorithms (CL2CL and C2CL) on three types of tasks:

  1. Task 1: Single Finger Matching
  2. Task 2: Four Finger Matching
  3. Task 3: Set-Based Matching

Detailed information and download instructions are available here: RidgeBase Benchmark Dataset.

IBM-UB Online and Offline Multi-lingual Handwriting Data Set

The Center for Unified Biometrics and Sensors (CUBS) at the University at Buffalo is releasing a new handwriting dataset to the research community. The IBM_UB dataset is a bi-modal (online and offline), multilingual corpus of ground-truthed handwritten documents. It contains a variety of handwritten content, ranging from pages of free-form cursive writing to forms, spontaneously written letters, and tables of words, isolated characters, and symbols. We expect this dataset to be a valuable resource for multilingual OCR development and for IR applications.

To request this dataset, please contact and indicate the specific dataset.


This corpus containing handwritten data was originally collected on IBM's CrossPad™ device. The CrossPad™ was a portable digital notepad that used an electronic pen that produced real ink on paper while simultaneously capturing the online pen trajectories. Thus, the handwriting sample was available both as a hardcopy (offline) paper document as well as online trajectory data in IBM's native format.

Researchers at the University at Buffalo (CUBS) have (a) converted the online data, originally in IBM's native format, into the InkML format, (b) scanned the hardcopy documents into 300 dpi grayscale images (PNG format), (c) developed visualization tools for the online data, and (d) established correspondence between the online and offline data and generated the ground truth at different levels of granularity for a subset of the entire corpus.
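As a rough illustration of what reading online data in InkML might look like, the sketch below extracts pen strokes from `<trace>` elements using only the Python standard library. It assumes the simplest InkML encoding (explicit absolute values, with X and Y as the first two channels of each sample); the actual channel layout and annotations in the IBM_UB files are not described here, and the `SAMPLE` document and the `parse_traces` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C InkML recommendation.
INKML_NS = "http://www.w3.org/2003/InkML"


def parse_traces(inkml_text):
    """Return a list of strokes; each stroke is a list of (x, y) floats.

    Assumption: samples are comma-separated and each sample lists explicit
    absolute values with X and Y first. Difference-encoded traces are not
    handled by this sketch.
    """
    root = ET.fromstring(inkml_text)
    strokes = []
    for trace in root.iter(f"{{{INKML_NS}}}trace"):
        points = []
        for sample in (trace.text or "").split(","):
            values = sample.split()
            if len(values) >= 2:
                points.append((float(values[0]), float(values[1])))
        strokes.append(points)
    return strokes


# Hypothetical two-stroke document, not taken from the dataset.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 0, 9 14, 8 28</trace>
  <trace>25 13, 14 13</trace>
</ink>"""

strokes = parse_traces(SAMPLE)
print(len(strokes), "strokes; first stroke:", strokes[0])
```

Each stroke keeps its samples in pen order, so the trajectory can be plotted or resampled directly; additional channels such as time or pressure, if present, would follow X and Y in each sample.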

The current release of data comprises two sections: IBM_UB_1 and IBM_UB_2.


IBM_UB_1 (v1) contains free-form cursive handwritten pages in English. It contains 6654 pages of online data collected from 43 writers (4138 summary pages and 2516 query pages). It also contains 5934 pages of offline data collected from 41 writers (3714 summary pages and 2220 query pages).

  • The documents in the data set are characterized by summary text pages and corresponding query text pages
  • The summary text contains one or two pages of writing on a particular topic
  • The query text contains approximately 25 words that encapsulate the summary text
  • Each summary-query pair is labeled with a unique ID
  • Ground truth information is available for the online query text documents at the word level and for summary text documents at the page level
  • A page level correspondence between the online and offline documents has been established
  • A visualization tool for the online data is also being released


IBM_UB_2 contains handwritten pages in French collected from 200 authors. The pages are in the form of booklets, each of which has several typed lines that the author reproduces by hand. These lines contain short cursive sentences or discrete characters, symbols, and/or digits.

For this dataset:

  • Ground truth information is available at the line level
  • A page level correspondence between the online and offline documents has been established

Additional data sets will be released as more sections of the corpus are processed and prepared. Future releases will include data for other Latin-script languages (Italian and German) and evaluation tools for both online and offline recognition. An online word recognizer is being developed to establish baseline performance on the online data set and will be released under an open-source license.

Individuals and organizations interested in the data for research purposes will need to execute this sub-license agreement with UB before the data is released.


A. Shivram, C. Ramaiah, S. Setlur, and V. Govindaraju. IBM_UB_1: A Dual Mode Unconstrained English Handwriting Dataset. In Proc. of the 12th International Conference on Document Analysis and Recognition (ICDAR), pages 13-17, 2013.

Contact Information

For further details, please contact:

Prof. Venu Govindaraju
SUNY Distinguished Professor, University at Buffalo
+1 716 645 1558


We are grateful to IBM for the donation of this valuable data and especially to Michael Perrone of IBM Research for his leadership in making this possible.

This data release was funded in part by a generous gift from Google Inc. We would like to thank Henry Rowley and Ashok Popat from Google for their support.

Dataset for Keystroke Dynamics and Mouse Movements

University at Buffalo, the State University of New York Dataset for Keystroke Dynamics and Mouse Movements. Funded in part by National Science Foundation Grant No. CNS-1314803.

The dataset contains keystrokes based on transcription as well as free text typing. A part of the dataset is generated using different types of keyboards across sessions. In addition, the mouse coordinate data and related events data are also made available with the keystroke dataset. 
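To give a sense of how keystroke timing data of this kind is commonly used, the sketch below computes two timing features that are standard in keystroke dynamics: dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next). The event layout `(key, press_ms, release_ms)`, the sample values, and the `keystroke_features` helper are assumptions made for illustration; the dataset's actual file format is not described here.

```python
def keystroke_features(events):
    """Compute dwell and flight times from ordered keystroke events.

    events: list of (key, press_ms, release_ms) tuples sorted by press time.
    This tuple layout is hypothetical, not the dataset's file format.
    """
    # Dwell time: release minus press for each key.
    dwell = [release - press for _, press, release in events]
    # Flight time: next key's press minus the current key's release.
    # It can be negative when keystrokes overlap.
    flight = [
        events[i + 1][1] - events[i][2]
        for i in range(len(events) - 1)
    ]
    return dwell, flight


# Hypothetical sample: three keystrokes with millisecond timestamps.
events = [("h", 0, 95), ("i", 140, 230), ("!", 400, 470)]
dwell, flight = keystroke_features(events)
print("dwell:", dwell)
print("flight:", flight)
```

Per-user statistics of these two sequences are a typical starting point for comparing transcription and free-text typing, or typing across different keyboards.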

To request this dataset, please contact and indicate the specific dataset.