Detecting Fraud from Accounting Restatements and Company Filings

Annual filing for a company.

Project description

A team of researchers in the Management Science and Systems (MSS) Department is studying the use of text mining to detect fraud in filings from the EDGAR database. EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system, performs automated collection, validation, and indexing of submissions by companies that are required by law to file forms with the U.S. Securities and Exchange Commission (SEC).
The retrieval system comes with an API, and we would like the incumbent to help with at least the following tasks, and possibly more:

  • Downloading the appropriate data using the API.
  • Cleaning and data pre-processing: removal of stop words and punctuation, extraction and counting of named entities, and similar steps.
  • Feature extraction: we are considering extracting linguistic and natural language processing features (such as n-grams and word embeddings) from the downloaded data.
  • Statistical analysis of the extracted features: features (such as n-grams, word embeddings, report length, presence or absence of special characters, and header titles) will need to be incorporated into a statistical machine learning model, possibly on a year-by-year basis, and integrated with raw accounting data already available to us.
  • Machine learning models: work with our team to build machine learning models and appropriate baselines for the fraud detection system.
  • Performance reporting: present and visualize the precision-recall and AUC scores of the models.
  • The incumbent should be willing to read research papers, interpret them, and use them for coding and testing. Familiarity with MATLAB is a plus, since we have an existing code base in MATLAB and may need to extend it to a few more recent years of data collection and analysis.
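
To give a rough sense of the pre-processing, feature extraction, and performance reporting tasks listed above, here is a minimal Python sketch. It is only an illustration, not the team's existing code: the stop-word list, the sample sentence, the labels, and the function names (tokenize, ngram_counts, precision_recall) are all hypothetical, and a real pipeline would likely rely on established NLP and machine learning libraries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real project would use a much fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "was"}


def tokenize(text):
    """Lowercase the text, keep only runs of ASCII letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngram_counts(tokens, n=2):
    """Count contiguous n-grams (joined by spaces) in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = flagged as fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up sentence standing in for text pulled from a filing.
    sample = "The company restated revenue. The restated revenue was lower."
    tokens = tokenize(sample)
    print(tokens)
    print(ngram_counts(tokens, n=2).most_common(1))
    # Made-up true labels and model predictions, for illustration only.
    print(precision_recall([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

In practice, the n-gram counts would form one part of a per-filing feature vector alongside the other features mentioned above, and the metrics would be computed on held-out data.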

The incumbent should be willing to attend our research meetings (in person and on Zoom) and, if desired, help write up results as appropriate.
It may be possible to obtain research or independent study credits if desired.

Project outcome

  • Research papers published in conferences and journals
  • Databases extracted from text made available to students and researchers
  • Code submitted to a GitHub repository for reusability

Project details

Timing, eligibility and other details
Length of commitment: 10-12 months
Start time: Anytime
In-person, remote, or hybrid? Hybrid project
Level of collaboration: Individual student project
Benefits: Academic credit
Who is eligible: All undergraduate students with the ability to code in Java, Python, and MATLAB; knowledge and extensive use of SQL databases such as PostgreSQL; experience developing visualization software; and prior research experience beyond class projects

Core partners

  • School of Management, Accounting, and MSS Departments 

Project mentor

Haimonti Dutta

Associate Professor

Management Science and Systems

Phone: (484) 432-1484


Start the project

  1. Email the project mentor using the contact information above to express your interest and get approval to work on the project. (Here are helpful tips on how to contact a project mentor.)
  2. After you receive approval from the mentor to start this project, click the button to start the digital badge. (Learn more about ELN's digital badge options.) 

Preparation activities

Once you begin the digital badge series, you will have access to all the necessary activities and instructions. Your mentor has indicated they would like you to also complete the specific preparation activities below. Please reference this when you get to Step 2 of the Preparation Phase. 

  • Ability to code in Java, Python, and MATLAB (all three are essential). Please showcase projects done in each language.
  • Extensive use of databases, including writing SQL scripts.
  • Ability to read research papers and implement or alter algorithms as necessary.


fraud detection, machine learning, accounting, Management Science and Systems