LibGuides: Text Mining: Text Mining Resources

What is text mining?

Text mining is a method of turning text into data for computational analysis. It can uncover patterns in large bodies of text (called corpora) that might otherwise be hidden. (Underwood, T. 2015. Seven Ways Humanists are Using Computers to Understand Text. The Stone and the Shell.)
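
As a rough illustration of what "turning text into data" can mean, the sketch below counts word frequencies across a tiny, made-up corpus using only the Python standard library. The function names (tokenize, term_frequencies) and the sample sentences are invented for this example, and the tokenizer is deliberately simple; real projects usually need more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(documents):
    """Count how often each token appears across a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

# Invented two-document corpus, for illustration only.
corpus = [
    "The whale surfaced near the ship.",
    "The ship turned toward the whale.",
]
freqs = term_frequencies(corpus)
print(freqs.most_common(3))
```

Once text is represented as counts like these, it can be sorted, compared across documents, or passed to statistical tools.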

How do I get started?

In starting a text mining project, think about the following questions:

What is your research question?
What text(s) do you want to use?
Are the texts available in machine-readable form?
What is the quality of the texts? Do they need to be corrected/cleaned up?
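
For the last question, cleanup often starts with a few mechanical steps. The sketch below assumes plain text produced by OCR and shows three common ones: normalizing Unicode, rejoining words hyphenated across line breaks, and collapsing whitespace. The function name clean_text and the sample string are invented for this example, and the right steps depend on your texts. Note that the dehyphenation rule will also join genuine hyphenated compounds that happen to break at a line end.

```python
import re
import unicodedata

def clean_text(raw):
    """Apply a few common cleanup steps to OCR-style plain text."""
    # Fold compatibility characters, e.g. the ligature "ﬁ" becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Invented sample, for illustration only.
sample = "The ﬁrst edi-\ntion   was printed\nin 1851."
print(clean_text(sample))  # The first edition was printed in 1851.
```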

Find content to mine: library resources

Text and data mining and systematic downloading are not permitted under most of the Library's license agreements. The resources listed below do allow text mining. For questions about text mining access to other library resources, please contact us.

ProQuest Historical Newspaper Data Sets
Download XML and PDF files of two newspapers from the ProQuest Historical Newspapers collections: The New York Times (1851-1936) and The Washington Post (1877-1934). These files may be downloaded and used for text and data mining.

Corpus of Contemporary American English (COCA)
Find and download full-text corpus data of American English drawn from spoken transcripts, fiction, popular magazines, newspapers, and academic texts dating from 1990 onward. Download the complete COCA data sets in three different formats using your current UA NetID.
Gale Primary Sources
A built-in tool for viewing term frequency is included in this database of digitized primary source collections, which include: 17th and 18th Century Burney Collection, Archives Unbound, Associated Press Collections Online, The Economist Historical Archive, 1843-2014, Eighteenth Century Collections Online, The Financial Times Historical Archive, Indigenous Peoples: North America, The Making of the Modern World, Nineteenth Century Collections Online, Smithsonian Collections Online, The Times and the Sunday Times Digital Archive, Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
Linguistic Data Consortium Corpora
Find language resources that support language-related education, research, and technology development, including lexicons, speech files, transcripts, and other text files from 1999 to present.
Registration is required to download any datasets, and additional user agreements may be required. To register, create a new account using "University of Arizona, Library System" as the organization and your UA email. You will be authorized by our corpus administrator and will receive an email once your UA status is verified.
Some of this data is also available in the Library on DVDs and CD-ROMs (for check-out to use on computers outside the Library). Search for titles in the library's Catalog using Linguistic Data Consortium as the author, or search by known titles.

Find content to mine: other resources

American Presidency Project
Over a hundred thousand presidential documents consolidated, coded, and organized into a single searchable database.
BYU Corpora
Includes corpora of texts such as US Supreme Court Opinions, American Soap Operas, TIME magazine, Wikipedia, and more.
Chronicling America OCR Data
Download optical character recognition (OCR) text files of digitized historical newspapers from the Library of Congress/National Endowment for the Humanities.
CORE
Access data from open access research papers through an API or by downloading the dataset. CORE is a database of open access content developed by Jisc in the U.K.
Datasets for Digital Research
Access datasets for text and data mining from the Michigan State University Libraries. Includes corpora such as Feeding America: The Historic American Cookbook, U.S. Congressional Collection, and Sunday School Books in Nineteenth Century America.
Demo corpora
Sample collections of texts that are ready to use for demonstrations or hands-on tutorials. Ideal collections for this purpose are public domain or open access, in plain text, relatively modest in number of files, organized neatly in one or more folders, and downloadable as a zip file.
DH Toychest: Data collections and datasets
Find text collections that can be downloaded and used in text analysis and topic modeling tools.
Internet Archive
A digital library of internet sites and other cultural artifacts in digital form.
Library of Congress Selected Datasets
Project Gutenberg
Contains over 54,000 works of literature.

Use text analysis tools

Voyant
Use this free and user-friendly tool for text analysis.
MALLET
MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for natural language processing, topic modeling, document classification, clustering, and more. A graphical user interface (GUI)-based version is also available.
Topic Modeling Tool
A graphical user interface (GUI) version of MALLET for topic modeling.
Atlas.ti
Qualitative analysis software available on computers in the Main Library, Science-Engineering Library, and Fine Arts Library.
TAPoR
Find tools to use for text analysis and retrieval.

Use text mining tools with corpora included

Google Books Ngram Viewer
View the occurrence of words or phrases over time in the Google Books corpus.
HathiTrust Bookworm
View the occurrence of words or phrases in millions of volumes in HathiTrust.
Constellate
Text analytics service from ITHAKA. Build datasets for text mining from millions of documents. Constellate also offers classes and workshops to learn more.
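
To give a sense of the kind of measure behind word-over-time charts like those in the tools above, the sketch below computes how often a phrase occurs per 1,000 tokens in each year of a small, made-up set of dated texts. This is an illustration only and not how any of these services is implemented; the function name phrase_frequency_by_year and the sample data are invented for this example.

```python
import re
from collections import defaultdict

def phrase_frequency_by_year(dated_texts, phrase):
    """Return {year: occurrences of phrase per 1,000 tokens} for (year, text) pairs."""
    target = phrase.lower().split()
    if not target:
        raise ValueError("phrase must contain at least one word")
    n = len(target)
    hits = defaultdict(int)    # phrase matches per year
    totals = defaultdict(int)  # total tokens per year
    for year, text in dated_texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        # Slide a window of n tokens across the text and compare to the phrase.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                hits[year] += 1
    return {y: 1000 * hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Invented sample data, for illustration only.
dated_texts = [
    (1900, "the steam engine and the steam ship"),
    (1950, "the jet engine replaced the piston engine"),
]
result = phrase_frequency_by_year(dated_texts, "steam engine")
print(result)
```

Dividing by the total number of tokens in each year keeps years with more text from dominating the comparison.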

Main Library | 1510 E. University Blvd. Tucson, AZ 85721
(520) 621-6442
