When starting a text mining project, consider the following questions:
What is your research question?
What text(s) do you want to use?
Are the texts available in machine-readable form?
What is the quality of the texts? Do they need to be corrected/cleaned up?
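As an illustration of the clean-up question above, here is a minimal Python sketch of a few common steps for text that came from scanning or OCR. The function name `clean_text` and the sample string are hypothetical, and the steps shown are assumptions about what a typical corpus might need, not a required workflow.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply a few conservative clean-up steps to OCR-style text."""
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break, e.g. "news-\npaper".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of whitespace (including newlines) to single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical sample with doubled spaces and a word split across lines.
sample = "The  news-\npaper was  printed\nin 1851."
print(clean_text(sample))
```

Note that the hyphen step also removes genuine hyphens that happen to fall at a line break (for example "well-\nknown"), so check a sample of the results before applying it to a whole corpus.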
Find content to mine: library resources
Text and data mining and systematic downloading are not permitted under most of the Library's license agreements. The resources listed here do allow text mining. For questions about text mining access to other library resources, please contact us.
Download XML and PDF files of two newspapers from the ProQuest Historical Newspapers collections: The New York Times (1851-1936) and The Washington Post (1877-1934). These files may be downloaded and used for text and data mining.
Find and download full-text corpus data of American English drawn from spoken transcripts, fiction, popular magazines, newspapers, and academic texts from 1990 onward. Access and download the complete COCA data sets in three different formats with your current UA NetID.
This database of digitized primary source collections includes a built-in tool for viewing term frequency. Collections include: 17th and 18th Century Burney Collection, Archives Unbound, Associated Press Collections Online, The Economist Historical Archive, 1843-2014, Eighteenth Century Collections Online, The Financial Times Historical Archive, Indigenous Peoples: North America, The Making of the Modern World, Nineteenth Century Collections Online, Smithsonian Collections Online, The Times and the Sunday Times Digital Archive, Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
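For readers unfamiliar with term frequency, the idea can be sketched in a few lines of Python. This is only an illustration of the general concept under simple assumptions (lowercased English words, a hypothetical `term_frequencies` helper and sample sentence); it is not how the database's built-in tool works.

```python
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase word tokens in a text."""
    # Match runs of letters, allowing one internal apostrophe ("don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


# Hypothetical sample sentence.
sample = "The Times reported the news. The news was brief."
freqs = term_frequencies(sample)

# Show the three most frequent terms with their counts.
print(freqs.most_common(3))
```

In practice, very common words such as "the" dominate raw counts, so analyses often remove a stop-word list first.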
Find language resources, including lexicons, speech files, transcripts, and other text files from 1999 to present, that support language-related education, research, and technology development.
Registration is required to download any datasets, and additional user agreements may be required. Register and create a new account here. When creating a new account, enter "University of Arizona, Library System" as the organization and use your UA email address. Our corpus administrator will authorize your account, and you will receive an email once your UA status is verified. Some of this data is also available in the Library on DVDs and CD-ROMs (for check-out to use on computers outside the Library). Search for titles in the library's Catalog using Linguistic Data Consortium as the author, or search by known titles.
Access datasets for text and data mining from the Michigan State University Libraries. Includes corpora such as Feeding America: The Historic American Cookbook, U.S. Congressional Collection, and Sunday School Books in Nineteenth Century America.
Sample collections of texts that are ready to go for demonstration purposes or hands-on tutorials. Ideal collections for this purpose are public domain or open access, plain-text, relatively modest in number of files, organized neatly in one or more folders, and downloadable as a zip file.
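A collection packaged this way can be loaded without unzipping it by hand. The sketch below uses Python's standard `zipfile` module; the helper name `read_texts_from_zip`, the file names, and the UTF-8 encoding are assumptions for illustration, and the archive is built in memory so the example runs without downloading anything.

```python
import io
import zipfile
from typing import Dict


def read_texts_from_zip(zip_source) -> Dict[str, str]:
    """Return {file name: text} for every .txt file in a zip archive."""
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                # Assumes UTF-8; undecodable bytes are replaced, not raised.
                texts[name] = archive.read(name).decode("utf-8", errors="replace")
    return texts


# Build a tiny in-memory archive with hypothetical file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("corpus/doc1.txt", "First sample document.")
    archive.writestr("corpus/doc2.txt", "Second sample document.")
    archive.writestr("corpus/README.md", "Not a corpus file.")

corpus = read_texts_from_zip(buffer)
print(sorted(corpus))
```

With a real download, pass the path to the zip file (for example a hypothetical `sample_corpus.zip`) in place of `buffer`.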
MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for natural language processing, topic modeling, document classification, clustering, and more. A graphical user interface (GUI)-based version is also available.