Textual data is unstructured, and it must be processed into numbers before models can use it. Natural Language Processing (NLP) is the set of techniques that helps convert text to numbers, and as such it is broadly a preprocessing field.
Through the 1980s, NLP systems were based primarily on complex sets of hand-written rules; not until the late 1980s did statistical algorithms start to replace those rules and gain ubiquity.
Such analysis could be applied to news, trade journals, annual reports, and other disclosures.
Here we construct a heat map showing sentiment trends for S&P 500 GICS industry groups.
The rows show the quarterly changes, sorted by Q2 2017 sentiment in descending order.
At the stock level, sentiment is measured as the proportion of negative words in the stock's earnings call (a simple count divided by the total word count; a minimal sketch follows below).
One could easily visualize sentiment trends for an industry group and spot potential inflection points and accelerations.
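As a concrete illustration, here is a minimal sketch of that count-and-divide measure in Python. The tiny word list is a stand-in for a real finance lexicon such as the Loughran-McDonald negative word list:

```python
# Sketch: stock-level sentiment as the proportion of negative words
# in an earnings-call transcript. The word list below is a tiny
# stand-in for a real finance lexicon (e.g., Loughran-McDonald).
import re

NEGATIVE_WORDS = {"loss", "decline", "impairment", "litigation", "weak"}

def negative_word_proportion(transcript: str) -> float:
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    negatives = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return negatives / len(tokens)

call = "Revenue declined this quarter and we recorded an impairment loss."
print(f"Negative proportion: {negative_word_proportion(call):.3f}")
```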
Although the above is still done, most funds have largely switched to statistical NLP methods. NLP becomes especially useful when you combine it with machine learning to develop prediction models based on textual information.
We have to identify a method to represent text (words) to models, and for that we need some notion of distance/similarity between words.
Below is one of the simplest ways to represent words; it goes by many names, such as bag-of-words, count vectorizer, or word-document matrix.
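For instance, scikit-learn's CountVectorizer builds exactly this kind of word-document count matrix. A minimal sketch, assuming scikit-learn is installed:

```python
# Bag-of-words / count-vectorizer sketch using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock price rose on strong revenue",
    "revenue fell and the stock price dropped",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # document-term count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # one row per document, one column per word
```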
Word vs Vector
Simple counting is a fully legitimate way to encode text; the problem is that it doesn't convey much information about surrounding words, relationships, context, position, etc.
We will soon see how we can represent a word not just as a count or a single scalar value, but as a large vector of attributes.
This simple conversion from word to vector opens up endless possibilities → words become features with which we can perform prediction, sentiment analysis, translation and many other tasks.
There are more than 1 million English words, and they are all related in some way or another, e.g., (a) Stock → Share; (b) Investor → Shareholder; (c) Equity → Capital; (d) Revenue → Sales
A “Shareholder” vector of length 3 could perhaps encode information about the word's function, its number (singular vs. plural), and its industry (e.g., financial vs. consumer).
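To make the idea concrete, here is a purely illustrative sketch with hand-picked 3-dimensional vectors; the dimensions and values are invented for illustration (real embeddings are learned, not hand-crafted):

```python
# Purely illustrative: hand-picked 3-dimensional "embeddings" where the
# dimensions loosely stand for function, number, and industry.
import numpy as np

vectors = {
    "shareholder": np.array([0.2, 0.9, 0.8]),
    "investor":    np.array([0.3, 0.9, 0.7]),
    "apples":      np.array([0.9, 0.1, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["shareholder"], vectors["investor"]))  # high
print(cosine(vectors["shareholder"], vectors["apples"]))    # much lower
```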
Maybe you have also heard of BERT, GPT-3, BART, T5, Longformer, or LUKE; these models are transformer models, and they can all be used to create word embeddings.
The transformer is a type of architecture for NLP tasks, much as the CNN is for image tasks.
With the earlier Word2Vec neural-network model, each word has a fixed representation regardless of the context in which it appears (each word has only one vector).
Transformer networks like BERT, in contrast, produce word representations that are dynamically informed by the surrounding words (so you need to feed the model more than a single word).
“The man was accused of robbing a bank.” | “The man went fishing by the bank of the river.”
Word2Vec would produce the same word embedding for the word “bank” in both sentences, while under BERT the word embedding for “bank” would be different for each sentence.
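A minimal sketch of this, assuming the Hugging Face transformers library and PyTorch are installed, extracts the vector for “bank” from each sentence and compares them:

```python
# Sketch: contextual embeddings with BERT via Hugging Face transformers;
# "bank" gets a different vector in each sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    idx = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[idx]

v1 = bank_vector("The man was accused of robbing a bank.")
v2 = bank_vector("The man went fishing by the bank of the river.")
print(torch.cosine_similarity(v1, v2, dim=0).item())  # noticeably below 1.0
```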
You can also think of the word “pike”: it is not a very common word, but it has many different meanings, all depending on the context (a fish, a medieval weapon, a turnpike, a diving position).
Using the language model and the vectors, we can, among other things, identify similarity between words, sentences, and documents.
We can also track the change in similarity over time: if a company's 10-K is fundamentally altered (i.e., less similar to the previous year's), it could be indicative of a large structural change.
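One simple way to operationalize this is TF-IDF vectors plus cosine similarity via scikit-learn; a minimal sketch with placeholder filing text:

```python
# Sketch: year-over-year 10-K similarity using TF-IDF and cosine
# similarity; a sharp drop versus prior years may flag structural change.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder strings; in practice these would be full filing texts.
filing_2016 = "We manufacture and sell consumer electronics worldwide."
filing_2017 = "We manufacture consumer electronics and now offer cloud services."

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform([filing_2016, filing_2017])
score = cosine_similarity(X[0], X[1])[0, 0]
print(f"Year-over-year similarity: {score:.3f}")
```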
We could also check whether a news article (document) is novel after linking it to a ticker using NER; if categorized as novel, we can run it through a sentiment classifier and update our opinion on the stock.
Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.
NER models could be trained to identify specific entities in a text, such as dates, individuals and places.
As an example, you might have news articles from Bloomberg, and your job is to identify each entity and then obtain the topic as well as the sentiment related to that entity.
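A minimal NER sketch using spaCy (this assumes the small English model en_core_web_sm has been downloaded):

```python
# Sketch: Named Entity Recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple shares rose 3% in New York on Tuesday, Bloomberg reported.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, New York GPE, Tuesday DATE
```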
D. Downstream Tasks
News articles produced by mainstream news providers like Thomson Reuters, Bloomberg, and FactSet are usually accessed via news-feed services provided by the vendors.
The news items usually contain a timestamp, a short headline and sometimes tags and other metadata.
In the past decade, most data vendors have invested substantially in infrastructure and human resources to process and enrich the articles they publish by providing insights from the textual contents of news.
Currently, Bloomberg, Thomson Reuters, RavenPack, among others, provide their own low-latency sentiment analysis and topic classification services.
Primary source news
Primary information sources that journalists research before they write articles include Securities and Exchange Commission (SEC) filings, product prospectuses, court documents and merger deals.
In particular, SEC's Electronic Data Gathering, Analysis and Retrieval (EDGAR) system provides free access to more than 21 million company filings in the US, including registration statements, periodic reports and other forms.
As such it has been the focus of numerous NLP research projects (Li 2010; Bodnaruk et al. 2015; Hadlock and Pierce 2010; Gerde 2003; Grant and Conlon 2006).
Analysis of most reports in EDGAR is fairly straightforward, as they have a consistent structure, which makes it easy to identify sections and extract the relevant text with HTML parsers.
In comparison with EDGAR, the contents and structures of company filings in the UK are less standardized, as firm management has more discretion over the format and content of their reports.
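As a rough sketch of that EDGAR parsing workflow (the URL below is a placeholder, not a real filing path, and EDGAR expects a descriptive User-Agent header):

```python
# Sketch: fetching a filing from EDGAR and stripping the HTML with
# BeautifulSoup. The URL is a placeholder for a real filing path.
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/<cik>/<accession>.htm"  # placeholder
headers = {"User-Agent": "Sample Research research@example.com"}

html = requests.get(url, headers=headers, timeout=30).text
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator="\n")
print(text[:500])  # first part of the extracted plain text
```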
There are three kinds of classification models that output labels and magnitudes over words, sentences, and documents: (1) self-trained models, (2) pre-trained models, and (3) zero-shot models.
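For the zero-shot case, the Hugging Face pipeline API offers a ready-made classifier; a minimal sketch (the model named here is one common choice, not prescribed by the text):

```python
# Sketch: zero-shot classification with the Hugging Face pipeline;
# the model assigns scores to labels it was never explicitly trained on.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier(
    "The company cut its full-year guidance after a weak quarter.",
    candidate_labels=["positive", "negative", "neutral"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score
```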
E. Feature Development
What are word vectors?
We will build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts, measuring similarity as the vector dot (scalar) product.
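A minimal sketch of learning such vectors with gensim's Word2Vec on a toy corpus (far too small to yield meaningful vectors, but it shows the mechanics):

```python
# Sketch: learning dense word vectors with gensim's Word2Vec on a toy
# corpus; similarity between words falls out of the learned vectors.
from gensim.models import Word2Vec

sentences = [
    ["the", "stock", "price", "rose"],
    ["the", "share", "price", "rose"],
    ["investors", "bought", "the", "stock"],
    ["shareholders", "bought", "the", "share"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

print(model.wv["stock"][:5])                  # first entries of the dense vector
print(model.wv.similarity("stock", "share"))  # cosine similarity of vectors
```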