An easy-to-use, web-based text analysis tool. Voyant is free and allows users to upload or paste text. The program automatically determines word frequencies and collocates and displays them graphically.
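For readers curious what "word frequencies" and "collocates" mean in practice, the following is a minimal Python sketch of both counts. It is not how Voyant itself is implemented; the function names (`tokenize`, `word_frequencies`, `collocates`), the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def collocates(tokens, keyword, window=2):
    """Count words appearing within `window` tokens of each occurrence of `keyword`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the keyword itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical sample text, used only to exercise the functions.
sample = "The whale surfaced. The white whale dived, and the whale was gone."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(collocates(tokens, "whale", window=1).most_common(3))
```

A real tool would typically also filter out stopwords such as "the" and handle punctuation and Unicode more carefully; this sketch leaves those steps out to keep the two counts easy to follow.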
MALLET (MAchine Learning for LanguagE Toolkit) is a collection of tools that facilitate document classification, sequence tagging, and topic modeling. There is also an add-on toolkit, GRMM (Graphical Models in MALLET), for inference and learning in graphical models.
The Stanford NLP Group makes some of its natural language processing software available to everyone. The group provides statistical, deep learning, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications that need human language technology. These packages are widely used in industry, academia, and government.
This collection of text analysis tools, hosted by the University of Alberta, handles XML, HTML, and plain text. Upload documents to extract common words, determine collocates, separate HTML tags, and extract XML-tagged information.
Data for Research (DfR) is a free data mining tool for journal content on JSTOR, available to the public. DfR offers a powerful faceted search interface, online viewing of document-level data, and bulk downloads of datasets (including word frequencies, citations, key terms, and n-grams).
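As a rough illustration of what an n-gram dataset contains, here is a minimal Python sketch that counts n-grams in a short token list. It does not reflect how DfR builds or formats its downloads; the `ngrams` helper and the sample phrase are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample phrase, split on whitespace for simplicity.
tokens = "to be or not to be".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(2))
print(trigram_counts.most_common(2))
```

Datasets of this kind are what make it possible to chart how often a word or phrase appears across a corpus over time.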
Developed by the Culturomics folks at Harvard, this tool limits itself to digitized texts that have information about them (full title, publication date, publication place, etc.) on OpenLibrary.org. As a result, users can run queries on highly selective corpora based on subject (books on world history, American books on science, etc.), though these corpora are much smaller than those in the full Google Books collection.
This interface is the only one of the above that allows users to search longer strings of words from the corpus. It offers the same corpora as are available in N-Grams, including American works (155 billion words), British works (34 billion words), fiction (91 billion words), Spanish works (45 billion words), and a 1,000,000-book sample (89 billion words).
Collection of more than 5,000 texts, more than 2,000 of which have been marked up and keyed in by hand. Includes a large number of early English texts from the ECCO-TCP collection as well as all of Shakespeare and other works.