Natural language processing (NLP) is a promising approach for analyzing large
volumes of climate-change and infrastructure-related scientific literature.
However, best-practice NLP techniques require a large collection of relevant
documents (a corpus). Furthermore, NLP techniques based on machine learning and
deep learning require labels that group the articles by user-defined criteria
for a significant subset of the corpus in order to train the supervised
model. Even labeling a few hundred documents with human subject-matter experts
is a time-consuming process. To expedite this process, we developed a weak
supervision-based NLP approach that leverages semantic similarity between
categories and documents to (i) establish a topic-specific corpus by subsetting
a large-scale open-access corpus and (ii) generate category labels for the
topic-specific corpus. In comparison with a months-long process of
subject-matter expert labeling, we assign category labels to the whole corpus
using weak supervision and supervised learning in about 13 hours. The labeled
climate and National Critical Function (NCF) corpus enables targeted, efficient identification of documents
discussing a topic (or combination of topics) of interest and identification of
various effects of climate change on critical infrastructure, improving the
usability of scientific literature and ultimately supporting enhanced policy
and decision making. To demonstrate this capability, we conduct topic modeling
on pairs of climate hazards and NCFs to discover trending topics at the
intersection of these categories. This method helps analysts and
decision-makers quickly grasp the relevant topics and the most important
documents linked to each topic.
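
As an illustration of the similarity-based weak supervision described above, the following minimal sketch scores each document against a short description of each category, drops documents that match no category (the corpus-subsetting step), and assigns the best-matching category as a weak label (the labeling step). It is not the authors' implementation: the bag-of-words cosine similarity is a simple stand-in for a semantic embedding model, and the function names, category descriptions, example documents, and the 0.1 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts; a stand-in for a semantic embedding (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def weak_label(documents, categories, threshold=0.1):
    """Keep documents similar to some category and attach the best category.

    The threshold is a hypothetical value chosen only for this example.
    """
    cat_vecs = {name: bag_of_words(desc) for name, desc in categories.items()}
    labeled = []
    for doc in documents:
        vec = bag_of_words(doc)
        scores = {name: cosine(vec, cv) for name, cv in cat_vecs.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:  # subsetting: drop off-topic documents
            labeled.append((doc, best, scores[best]))
    return labeled


# Hypothetical category descriptions and documents, for illustration only.
categories = {
    "flooding": "flood flooding inundation storm surge rainfall",
    "heat": "heat wave extreme temperature heatwave",
}
documents = [
    "Storm surge and coastal flooding damage to power substations",
    "Extreme heat reduces transmission line capacity",
    "A survey of medieval poetry",
]

result = weak_label(documents, categories)
for doc, label, score in result:
    print(f"{label:10s} {score:.2f}  {doc}")
```

In this sketch the off-topic third document is filtered out and the remaining two receive weak labels, which could then serve as training data for a supervised classifier.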