Before starting a text mining project, consider the following questions:
What is your research question?
What text(s) do you want to use?
Are the texts available in machine-readable form?
What is the quality of the texts? Do they need to be corrected/cleaned up?
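If the texts do need cleanup, a few common steps can be scripted. The following is a minimal sketch in Python, assuming the texts are OCR output or scraped plain text; the function name `clean_text` and the specific rules are illustrative assumptions, not a recommended standard. Note that the hyphen rule also joins genuine hyphenated compounds that happen to fall at a line break, so check the results against your own texts.

```python
import re
import unicodedata


def clean_text(raw):
    """Apply a few common cleanup steps to OCR or scraped text (illustrative only)."""
    # Normalize Unicode so ligatures and compatibility characters become plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break, e.g. "news-\npaper" -> "newspaper".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For example, `clean_text("The news-\npaper   was\tprinted.\n")` returns `"The newspaper was printed."`.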
Find content to mine: library resources
Text and data mining and systematic downloading are not permitted under most of the Library's license agreements. The resources below do allow text mining. For questions about text mining access to other library resources, please contact us.
Download XML and PDF files of two newspapers from the ProQuest Historical Newspapers collections: The New York Times (1851-1936) and The Washington Post (1877-1934). These files may be downloaded and used for text and data mining.
Find and download full-text corpus data of American English drawn from spoken-language transcripts, fiction, popular magazines, newspapers, and academic texts from 1990 onward.
Access and download the complete COCA data sets in three different formats with your current UA NetID.
Search across multiple digitized primary source collections, including:
17th and 18th Century Burney Collection, Archives Unbound, Associated Press Collections Online, The Economist Historical Archive, 1843-2014, Eighteenth Century Collections Online, The Financial Times Historical Archive, Indigenous Peoples: North America, The Making of the Modern World, Nineteenth Century Collections Online, Smithsonian Collections Online, The Times and the Sunday Times Digital Archive, Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
Tools for finding term frequency and term clusters are also available in the database.
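For texts you have downloaded yourself, term frequency can also be computed locally. The following is a minimal sketch in Python, not the database's own tool; the function name `term_frequencies` and the simple tokenizing rule (lowercase runs of letters and apostrophes) are assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequencies(text, top_n=10):
    """Return the top_n most frequent terms as (term, count) pairs."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Counter tallies each token; most_common sorts by descending count.
    return Counter(tokens).most_common(top_n)
```

A real project would usually also remove stopwords such as "the" and "and" before counting.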
Access datasets for text and data mining from the Michigan State University Libraries. Includes corpora such as Feeding America: The Historic American Cookbook, U.S. Congressional Collection, and Sunday School Books in Nineteenth Century America.
Sample collections of texts that are ready to use for demonstrations or hands-on tutorials. Ideal collections for this purpose are public domain or open access, in plain text, relatively modest in number of files, organized neatly in one or more folders, and downloadable as a zip file.
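A collection packaged this way can be read straight from the zip file. The following is a minimal sketch in Python, assuming the archive contains UTF-8 plain-text files with a `.txt` extension; the function name `load_corpus` is hypothetical.

```python
import zipfile


def load_corpus(zip_path):
    """Return a {filename: text} dictionary for every .txt file in a zip archive."""
    corpus = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip folders and non-text files such as READMEs or metadata.
            if name.lower().endswith(".txt"):
                # errors="replace" avoids a crash if a file is not valid UTF-8.
                corpus[name] = archive.read(name).decode("utf-8", errors="replace")
    return corpus
```

Keys keep the folder path inside the archive (for example `texts/a.txt`), so the original folder organization is preserved.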
MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for natural language processing, topic modeling, document classification, clustering, and more. A graphical user interface (GUI)-based version is also available.
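As a rough illustration of a topic modeling run, the sketch below builds the two MALLET command lines typically used: `import-dir` to convert a folder of text files into MALLET's format, then `train-topics` to fit the model. It is a sketch under stated assumptions: the path `bin/mallet`, the output file names, and the function names `mallet_commands` and `run_topic_model` are hypothetical, and MALLET must already be installed for the run step to work.

```python
import subprocess


def mallet_commands(text_dir, num_topics=20, mallet_bin="bin/mallet"):
    """Build the import and training commands as argument lists (nothing is run)."""
    corpus_file = "corpus.mallet"  # assumed name for the imported corpus
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", text_dir,
        "--output", corpus_file,
        "--keep-sequence",       # keep word order, required for topic modeling
        "--remove-stopwords",    # drop common English function words
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", corpus_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", "topic_keys.txt",  # top words per topic
        "--output-doc-topics", "doc_topics.txt",  # topic mix per document
    ]
    return import_cmd, train_cmd


def run_topic_model(text_dir, num_topics=20):
    """Run both steps in order; raises CalledProcessError if MALLET fails."""
    for cmd in mallet_commands(text_dir, num_topics):
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to print and inspect them before committing to a long training run.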