LibGuides: Digital Humanities Tools: Text Analysis

Text Analysis Tools

Voyant
The easiest-to-use web-based text analysis tool. Voyant is free and lets users upload or paste text. It automatically computes word frequencies and collocates and displays them graphically.
MALLET
MALLET (MAchine Learning for LanguagE Toolkit) is a collection of tools that facilitate document classification, sequence tagging, and topic modeling. There is also an add-on package, GRMM (Graphical Models in MALLET), that supports inference in general graphical models.
The Stanford Natural Language Processing Group Software
The Stanford NLP Group makes some of its natural language processing software freely available. It provides statistical, deep learning, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs. These packages are widely used in industry, academia, and government.
Taporware
This collection of text analysis tools, hosted by the University of Alberta, handles XML, HTML, and plain text. Upload documents to extract common words, find collocates, separate HTML tags, and extract XML-tagged information.
WordSeer
WordSeer is a collection of text analysis tools targeted at humanities scholars that includes side-by-side comparison, grammatical search, and document/sentence/word-set features.
JSTOR Data for Research
Data for Research (DfR) is a free data mining tool for journal content on JSTOR, available to the public. DfR offers a powerful faceted search interface, online viewing of document-level data, and bulk downloads of datasets (including word frequencies, citations, key terms, and n-grams).

For more information on specific Text Analysis Tools, please explore these links to access the video tutorials.
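
Several of the tools above report word frequencies and collocates (words that appear near a chosen keyword). As a rough illustration of what those two measures mean, here is a minimal Python sketch using only the standard library. It is not how Voyant or Taporware compute their results; the function names, the sample sentence, and the two-word window are assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def collocates(tokens, keyword, window=2):
    """Count tokens found within `window` positions of each `keyword` occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical sample text, used only for illustration.
sample = "The whale swam. The white whale dived, and the crew watched the whale."
tokens = tokenize(sample)
freqs = word_frequencies(tokens)
near_whale = collocates(tokens, "whale", window=2)

print(freqs.most_common(3))
print(near_whale.most_common(3))
```

Real tools typically add steps this sketch omits, such as removing stop words ("the", "and") and handling punctuation and non-English characters more carefully.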

Text Corpora Tools

Google N-Grams
The classic interface designed by Google, which lets users plot the frequency of single words and short phrases over time in a large subset (~5 million books) of the Google Books corpus.
BYU Google Books
This interface is the only one of the above that allows users to search longer strings of words from the corpus. It offers the same corpora available in N-Grams, including American works (155 billion words), British works (34 billion words), Fiction (91 billion words), Spanish works (45 billion words), and a 1,000,000-book sample (89 billion words).
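
The tools above plot how often a word or phrase (an n-gram) appears over time. The idea can be sketched in a few lines of Python: for each year, count the occurrences of the phrase and divide by the total number of n-grams of the same length. This is only an illustration of the concept, not the method used by Google or BYU; the function names and the two-entry toy corpus are hypothetical.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(texts_by_year, phrase):
    """Map each year to the share of its n-grams that match `phrase`."""
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    result = {}
    for year, text in sorted(texts_by_year.items()):
        tokens = re.findall(r"[a-z']+", text.lower())
        grams = ngrams(tokens, n)
        counts = Counter(grams)
        result[year] = counts[target] / len(grams) if grams else 0.0
    return result


# Hypothetical toy corpus: one short "book" per year, for illustration only.
toy_corpus = {
    1900: "the steam engine powered the mill",
    1950: "the jet engine replaced the steam engine",
}
trend = relative_frequency_by_year(toy_corpus, "steam engine")

for year, share in trend.items():
    print(year, round(share, 3))
```

Dividing by the yearly total matters because far more text survives from some years than others; raw counts alone would mostly reflect the size of each year's collection.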

Open Source Text Corpora

Internet Archive and Open Library
The Internet Archive and Open Library offer over 6,000,000 fully accessible public domain eBooks.
Oxford Text Archive
Collection of more than 5,000 texts, more than 2,000 of which have been marked up and keyed in by hand. Includes a large number of early English texts from the ECCO-TCP collection as well as all of Shakespeare and other works.
HathiTrust Research Center
The HathiTrust Research Center gives non-profit and academic users access to the data behind the millions of books in HathiTrust.
Chronicling America
Full text of hundreds of pre-1923 American newspapers, made available by the Library of Congress.
Mark Davies' Corpora Site
Mark Davies at BYU hosts several large corpora including a 100+ million word corpus of Time Magazine (1923-2006).
Open Culture
Free cultural and educational media, including ebooks.
Project Gutenberg
Thousands of out-of-copyright books and digital texts.
Open Library
"One web page for every book." Browse millions of book titles, many of which are available to read online or download.

Main Library | 1510 E. University Blvd. Tucson, AZ 85721
(520) 621-6442
