Text Analysis Using Software
Text Analysis in R
- Tidytext package: Provides functionality for preprocessing, sentiment analysis, and topic modeling. Integrates with other Tidyverse packages for streamlined data analysis.
- Quanteda package: Provides functionality for preprocessing, Wordscores, and latent semantic scaling.
Text Analysis in Python
- NLTK library: Refers to the ‘Natural Language Toolkit.’ Provides functionality for preprocessing, word frequency, and chunking/chinking.
- spaCy library: Provides functionality for preprocessing and dependency parsing.
- Gensim: Provides functionality for topic modeling and implements Word2Vec/Doc2Vec models.
- Polyglot library: A multilingual library that supports over 130 languages. Provides functionality for tokenization, language detection, word embeddings, and sentiment analysis.
- Hugging Face Transformers: An open-source library that provides access to thousands of pre-trained models and datasets for advanced natural language processing (NLP) tasks like text classification, translation, and text generation/summarization.
- Hugging Face also offers a free Large Language Models (LLM) Course.
For additional software, platforms, and programming tools see: Text Analysis: A Guide to Text Mining Tools and Methods
Text mining (or text-as-data) uses computational, mathematical, or statistical methods to draw insights from a body of texts (referred to as a corpus). It is generally used when the body of texts is too large for manual analysis. In the digital humanities, it is often followed by manual analysis and interpretation of the results (a ‘human in the loop’ approach).
A corpus refers to the collection of written texts that a researcher uses to train algorithms and conduct computational or statistical analyses.
- Text corpora are generally large, unstructured, and electronically stored/processed.
- Text data can be sourced from library databases, open sources, interview/survey responses, social media, and web scraping.
- Note: Texts must be machine-readable in order to undergo text mining. You can use ABBYY FineReader (available in The Scholars’ Commons) to perform OCR (Optical Character Recognition) on images of text, including scanned pages, to output text that is machine readable.
- Tesseract is free, open-source software for performing OCR on documents. It requires use of the command line.
The command line uses manually-typed commands to interact with the computer’s operating system directly—rather than relying on a graphical user interface. MALLET and Tesseract require use of the command line.
Preprocessing prepares your corpus for analysis by reducing noise within the text data. R packages and Python libraries provide functionality for preprocessing. If you are using standalone software such as Voyant, you can preprocess your corpus using OpenRefine.
Preprocessing can include the following steps:
- Text cleaning converts all text to lowercase and removes URLs, special characters, punctuation marks, and numbers.
- Tokenization converts your raw text into a series of ‘tokens,’ which are generally individual words. For example, the sentence, “I would like a coffee,” would be tokenized into [“I”, “would”, “like”, “a”, “coffee”].
- Researchers may also be interested in grouping—or chunking—words into meaningful phrases. For example, “the quick brown fox” could be chunked as a noun phrase. Chinking removes unwanted elements, such as specific parts of speech. For more information, see Chunking and Chinking in NLP.
- Stop word removal deletes words like ‘the,’ ‘a,’ and ‘is,’ which carry little semantic value and can slow down your analysis. Creating a list of stopwords is often an important part of curating your data.
- For lists of stop words in multiple languages, see: stopwords-iso.
- Finally, researchers might choose to stem or lemmatize their text. Stemming reduces words to their root form. Lemmatization converts words to their dictionary form (or ‘lemma’). Lemmatization is more accurate, but it is also more computationally intensive.
- Note: If you are grouping your text data into meaningful phrases (or ‘chunks’), then you should not stem or lemmatize.
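The preprocessing steps above can be sketched in plain Python (in practice you would use NLTK or spaCy; the stop word list and the crude suffix-stripping ‘stemmer’ below are toy stand-ins):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "i", "would", "like"}  # toy list

def preprocess(text):
    # Text cleaning: lowercase, then strip URLs, punctuation, and numbers
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenization: split the cleaned text into word tokens
    tokens = text.split()
    # Stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a few common suffixes (a stand-in for a real stemmer)
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("I would like a coffee!"))  # → ['coffee']
```

Note how aggressive the pipeline is: every token in the example sentence except ‘coffee’ is a stop word, which illustrates why curating your stop word list matters.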
Term frequency examines the importance of specific terms based on how frequently they appear within your text corpus. See The President’s Words for an example of term frequency analysis.
- Researchers can compute either absolute or relative frequency. While absolute frequency is the raw count of a term, relative frequency is that count expressed as a proportion of the total number of terms.
- If your text corpus is composed of documents that vary considerably in length (interview transcripts, political speeches, etc.), then you can compute counts relative to all terms within a document to control for length.
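Both frequency measures can be computed in a few lines of Python (the tokenized document below is a toy example):

```python
from collections import Counter

def term_frequencies(tokens):
    """Return (absolute, relative) frequency for each term in a tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    # Relative frequency divides by document length, controlling for length
    return {term: (n, n / total) for term, n in counts.items()}

doc = ["tea", "coffee", "tea", "water"]
freqs = term_frequencies(doc)
print(freqs["tea"])  # → (2, 0.5)
```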
Sentiment analysis is used to classify and interpret emotional valence within text data. It generally relies on a pre-defined lexicon to assign a polarity score ranging from -1 (highly negative) to 1 (highly positive) to each word within a corpus. 0 indicates neutrality.
- To study sentiment across a text, most tools simply sum the polarity scores of all words and calculate the average. See Introduction to tidytext for a use case tracking sentiment in Jane Austen’s novels.
- The Python library VADER uses a more complex, rule-based metric to identify sentiment in social media data.
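The lexicon-based averaging approach can be sketched as follows (the tiny lexicon here is a made-up stand-in; real analyses use curated lexicons such as AFINN, Bing, or VADER’s):

```python
# Toy polarity lexicon; real analyses use curated lexicons with thousands of entries
LEXICON = {"love": 0.8, "great": 0.6, "bad": -0.6, "terrible": -0.9}

def document_sentiment(tokens):
    """Average the polarity scores of all lexicon words in a document.

    Returns 0.0 (neutral) when no token appears in the lexicon.
    """
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(round(document_sentiment(["i", "love", "this", "great", "film"]), 2))  # → 0.7
```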
NLP uses both supervised and unsupervised learning methods to analyze text corpora. While supervised methods rely on pre-defined labels to perform tasks like sentiment analysis and text classification, unsupervised methods find hidden patterns within unstructured data.
Topic modeling is an algorithmic method that identifies a series of topics—or frequently co-occurring words—across a large set of texts.
- Topic modeling uses unsupervised methods like Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM).
- Use Tidytext and STM for topic modeling in R, and Gensim for topic modeling in Python.
- Note: Voyant Tools also performs LDA using the jsLDA implementation.
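A full LDA run is best left to a library (in Python, Gensim’s LdaModel), but the intuition that topics are sets of frequently co-occurring words can be illustrated by counting which word pairs appear together in the same document (this toy count is not itself a topic model):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how often each word pair appears in the same (tokenized) document.

    A toy illustration of the co-occurrence signal that topic models exploit.
    """
    pairs = Counter()
    for doc in docs:
        # Sort the unique words so each pair has one canonical ordering
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [
    ["coffee", "tea", "cup"],
    ["coffee", "cup", "espresso"],
    ["senate", "vote", "bill"],
]
print(cooccurrence_counts(docs).most_common(1))  # → [(('coffee', 'cup'), 2)]
```

Words that repeatedly co-occur (‘coffee’ and ‘cup’) would tend to land in the same LDA topic, while words from unrelated documents (‘coffee’ and ‘senate’) would not.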
Semantics refers to meaning and how it is represented by and extracted from natural language.
Word embeddings represent each word in a corpus as a vector, which is referred to as an ‘embedding.’ This is a means of reducing dimensionality within text data and identifying semantic similarity.
- The popular Word2Vec algorithm uses a word’s context (i.e. the words that appear before and after it) to create the vector representation. Words like ‘coffee,’ ‘tea,’ and ‘water’ should have similar vector representations—indicating semantic similarity.
- The Doc2Vec algorithm generates a unique vector for an entire document (in addition to its word vectors). This allows for comparisons between documents.
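Semantic similarity between embeddings is typically measured with cosine similarity, sketched here with invented three-dimensional vectors (real Word2Vec embeddings usually have 100 or more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings invented for illustration
embeddings = {
    "coffee": [0.9, 0.8, 0.1],
    "tea":    [0.8, 0.9, 0.2],
    "senate": [0.1, 0.2, 0.9],
}

# Semantically related words should point in similar directions
print(cosine_similarity(embeddings["coffee"], embeddings["tea"]) >
      cosine_similarity(embeddings["coffee"], embeddings["senate"]))  # → True
```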
Latent Semantic Scaling (LSS) is a semi-supervised method for analyzing texts across domains and languages. It is a word embedding-based approach that uses a small set of sentiment seed words to calculate polarity scores for all words in a text based on their semantic proximity to the seed words. From there, researchers can calculate average scores per document.
- Seed words can be selected from secondary literature, meaning that LSS is a flexible (and computationally-efficient) method. See Watanabe (2021).
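The seed-word idea can be sketched with toy two-dimensional embeddings (real LSS learns embeddings from the corpus itself; the vectors and seed words below are invented for illustration):

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented embeddings; a real analysis derives these from the corpus
emb = {
    "good":  [0.9, 0.1],
    "bad":   [0.1, 0.9],
    "solid": [0.8, 0.2],
    "weak":  [0.2, 0.8],
}
pos_seeds, neg_seeds = ["good"], ["bad"]

def lss_polarity(word):
    """Score a word by its semantic proximity to positive vs. negative seeds."""
    pos = sum(cos(emb[word], emb[s]) for s in pos_seeds) / len(pos_seeds)
    neg = sum(cos(emb[word], emb[s]) for s in neg_seeds) / len(neg_seeds)
    return pos - neg

print(lss_polarity("solid") > 0 > lss_polarity("weak"))  # → True
```

Because only the handful of seed words needs to be labeled, the method extends a small amount of supervision to the entire vocabulary.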
Hugging Face Transformers models perform advanced NLP tasks, such as text clustering, classification, and summarization/generation:
- Text classification is a supervised learning method, which uses pre-defined labels/categories to organize texts based on their content. This can be useful for organizing survey responses.
- Text clustering is an unsupervised learning method, which automatically groups texts based on their content.
- Note: This often follows Doc2Vec in a text clustering pipeline. Researchers use Doc2Vec to convert their documents into vectors and then apply a clustering algorithm, such as k-means. From there, they can use keywords to interpret the results.
- Text summarization creates a brief synopsis that preserves the key arguments and themes of a longer document. It relies on two techniques: extractive and abstractive summarization. See Text Summarization in NLP.
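The Doc2Vec-plus-k-means pipeline described above can be sketched with a minimal k-means over toy two-dimensional ‘document vectors’ (real Doc2Vec vectors have far more dimensions; the centroid seeding here is simplified to keep the example deterministic):

```python
import math

def kmeans(vectors, k=2, iters=10):
    """A minimal k-means, deterministically seeding centroids with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [math.dist(v, c) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(dim) / len(cl) for dim in zip(*cl)]
    # Report the final cluster label for each vector
    return [min(range(k), key=lambda i: math.dist(v, centroids[i])) for v in vectors]

# Hypothetical document vectors standing in for Doc2Vec output
doc_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(kmeans(doc_vectors))  # → [0, 0, 1, 1]
```

The two ‘nearby’ documents land in one cluster and the two others in a second; in a real pipeline, the researcher would then inspect each cluster’s keywords to interpret it.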
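A minimal extractive summarizer can be sketched by scoring each sentence on the frequency of its words across the whole document and keeping the top-scoring sentence (abstractive summarization, which generates new wording, requires a pre-trained model):

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    """Return the n sentences whose words are most frequent across the document.

    A common frequency-based heuristic for extractive summarization.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    # Score each sentence by the document-wide frequency of its words
    scored = sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
    )
    return " ".join(scored[:n])

text = ("The corpus is large. Text mining draws insights from the corpus. "
        "Researchers interpret the results.")
print(extractive_summary(text))  # → Text mining draws insights from the corpus.
```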