Content from Introduction to Natural Language Processing
Last updated on 2024-05-12 | Edit this page
Estimated time: 10 minutes
Overview
Questions
- What are some common research applications of NLP?
- What are the basic concepts and terminology of NLP?
- How can I use NLP in my research field?
- How can I acquire data for NLP tasks?
Objectives
- Define natural language processing and its goals.
- Identify the main research applications and challenges of NLP.
- Explain the basic concepts and terminology of NLP, such as tokens, lemmas, and n-grams.
- Use some popular datasets and libraries to acquire data for NLP tasks.
1.1. Introduction to NLP Workshop
Natural Language Processing (NLP) is becoming a popular and robust tool for a wide range of research projects. In this episode, we embark on a journey to explore the transformative power of NLP tools in the realm of research.
This episode is tailored for researchers who are keen on harnessing the capabilities of NLP to enhance and expedite their work. Whether you are delving into text classification, extracting pivotal information, discerning sentiments, summarizing extensive documents, translating across languages, or developing sophisticated question-answering systems, this session will lay the foundational knowledge you need to leverage NLP effectively.
We will begin by delving into the Common Applications of NLP in Research, showcasing how these tools are not just theoretical concepts but practical instruments that drive forward today’s innovative research projects. From analyzing public sentiment to extracting critical data from a plethora of documents, NLP stands as a pillar in the modern researcher’s toolkit.
Figure: "NLP for Research" (illustration generated with DALL-E 3).
Next, we’ll demystify the Basic Concepts and Terminology of NLP. Understanding these fundamental terms is crucial, as they form the building blocks of any NLP application. We’ll cover everything from the basics of a corpus to the intricacies of transformers, ensuring you have a solid grasp of the language used in NLP.
Finally, we’ll guide you through Data Acquisition: Dataset Libraries, where you’ll learn about the treasure troves of data available at your fingertips. We’ll compare different libraries and demonstrate how to access and utilize these resources through hands-on examples.
By the end of this episode, you will not only understand the significance of NLP in research but also be equipped with the knowledge to start applying these tools to your own projects. Prepare to unlock new potentials and streamline your research process with the power of NLP!
Discussion
Teamwork: What are some examples of NLP in your everyday life? Think of some situations where you interact with or use NLP systems, such as online search, voice assistants, social media, etc. How do these examples demonstrate the usefulness of NLP in research projects?
Discussion
Teamwork: What are some examples of NLP in your daily research tasks? What are challenges of NLP that make it difficult, complex, and/or inaccurate?
1.2. Common Applications of NLP in Research
Sentiment Analysis is a powerful tool for researchers, especially in fields like market research, political science, and public health. It involves the computational identification of opinions expressed in text, categorizing them as positive, negative, or neutral.
In market research, for instance, sentiment analysis can be applied to product reviews to gauge consumer satisfaction: a study could analyze thousands of online reviews for a new smartphone model to determine the overall public sentiment. This can help companies identify areas of improvement or features that are well-received by consumers.
Information Extraction is crucial for quickly gathering specific information from large datasets. It is used extensively in legal research, medical research, and scientific studies to extract entities and relationships from texts.
In legal research, for example, information extraction can be used to sift through case law to find precedents related to a particular legal issue. A researcher could use NLP to extract instances of “negligence” from thousands of case files, aiding in the preparation of legal arguments.
Text Summarization helps researchers by providing concise summaries of lengthy documents, such as research papers or reports, allowing them to quickly understand the main points without reading the entire text.
In biomedical research, text summarization can assist in literature reviews by providing summaries of research articles. For example, a researcher could use an NLP model to summarize articles on gene therapy, enabling them to quickly assimilate key findings from a vast array of publications.
Topic Modeling is used to uncover latent topics within large volumes of text, which is particularly useful in fields like sociology and history to identify trends and patterns in historical documents or social media data.
For example, in historical research, topic modeling can reveal prevalent themes in primary source documents from a particular era. A historian might use NLP to analyze newspapers from the early 20th century to study public discourse around significant events like World War I.
Named Entity Recognition is a process where an algorithm takes a string of text (sentence or paragraph) and identifies relevant nouns (people, places, and organizations) that are mentioned in that string.
NER is used in many fields of NLP, and it can help answer many real-world questions, such as: Which companies were mentioned in the news article? Were specified products mentioned in complaints or reviews? Does a post on X (formerly Twitter) contain the name of a person? Does the post contain this person's location?
Challenges of NLP
One of the significant challenges in NLP is dealing with the ambiguity of language. Words or phrases can have multiple meanings, and determining the correct one based on context can be difficult for NLP systems. In a research paper discussing “bank erosion,” an NLP system might confuse “bank” with a financial institution rather than the geographical feature, leading to incorrect analysis.
Closely related is the challenge of contextual understanding: classical NLP systems often struggle to use context, which is crucial in text analysis tasks and can lead to misinterpretation of the meaning and sentiment of the text. If a research paper mentions "novel results," an NLP system might interpret "novel" as a literary work instead of "new" or "original," which could mislead the analysis of the paper's contributions.
Popular libraries and tools that support these applications include:
- Python's Natural Language Toolkit (NLTK) for sentiment analysis
- TextBlob, a library for processing textual data
- Stanford NER for named entity recognition
- spaCy, an open-source software library for advanced NLP
- Sumy, a Python library for automatic summarization of text documents
- BERT-based models for extractive and abstractive summarization
- Gensim for topic modeling and document similarity analysis
- MALLET, a Java-based package for statistical natural language processing
1.3. Basic Concepts and Terminology of NLP
Discussion
Teamwork: What are some of the basic concepts and terminology of natural language processing that you are familiar with or want to learn more about? Share your knowledge or questions with a partner or a small group, and try to explain or understand some of the key terms of natural language processing, such as tokens, lemmas, n-grams, etc.
Corpus: A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. In NLP, a corpus is used as a large and structured set of texts that can be used to perform statistical analysis and hypothesis testing, check occurrences, or validate linguistic rules within a specific language territory.
Token and Tokenization: Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful in situations where certain characters or words need to be treated as a single entity, despite any spaces or punctuation that might separate them.
Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form—generally a written word form. The idea is to remove affixes to get to the root form of the word. Stemming is often used in search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while improving recall.
Lemmatization: Lemmatization, unlike Stemming, reduces the inflected words properly, ensuring that the root word belongs to the language. In Lemmatization, the root word is called Lemma. A lemma is the canonical form, dictionary form, or citation form of a set of words. For example, runs, ran, and running are all forms of the word run, therefore run is the lemma of all these words.
Part-of-Speech (PoS) Tagging: Part-of-speech tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. It is a necessary step before performing more complex NLP tasks like parsing or grammar checking.
Chunking: Chunking is the process of extracting phrases from unstructured text. Rather than relying on simple tokens, which may not represent the actual meaning of the text, chunking extracts entities like noun phrases and verb phrases. It is essentially a meaningful grouping of words or tokens.
Word Embeddings: Word embeddings are a type of word representation that allows words with similar meanings to have a similar representation. They are a distributed representation of text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
Transformers: Transformers are neural network architectures that use attention mechanisms to model the relationships between all the elements of a sequence, regardless of their position. They are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization, and they are the foundation of most recent advances in NLP, including models like BERT and GPT.
1.4. Data Acquisition: Dataset Libraries
Different data libraries offer various datasets that are useful for training and testing NLP models. These libraries provide access to a wide range of text data, from literary works to social media posts, which can be used for tasks such as sentiment analysis, topic modeling, and more.
Natural Language Toolkit (NLTK): NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
spaCy: spaCy is a free, open-source library for advanced Natural Language Processing in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems or to pre-process text for deep learning.
Gensim: Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. Targeted at the NLP and information retrieval communities, Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that only target batch and in-memory processing.
Hugging Face's datasets: This library provides a simple, unified Python interface to download and pre-process any of the major public datasets (text datasets in 467 languages and dialects, etc.) provided on the HuggingFace Datasets Hub. It's designed to let the community easily add and share new datasets. Hugging Face Datasets simplifies working with data for machine learning, especially NLP tasks. It provides a central hub to find and load pre-processed datasets, saving you time on data preparation. You can explore a vast collection of datasets for various tasks and easily integrate them with other Hugging Face libraries.
Let's walk through data acquisition using the Hugging Face datasets library. First, we install the library and import the dataset loader:
PYTHON
# Install the library once if needed (shell command): pip install datasets

# To import the dataset loader, we can write:
from datasets import load_dataset
Use load_dataset with the dataset identifier in quotes. For example, to load the SQuAD question answering dataset:
PYTHON
squad_dataset = load_dataset("squad")
# Use the info attribute to view the training split information:
print(squad_dataset["train"].info)
Each data point is a dictionary with keys corresponding to data elements (e.g., question, context). Access them using those keys within square brackets:
PYTHON
question = squad_dataset["train"][0]["question"]
context = squad_dataset["train"][0]["context"]
We can use the print() function to see the output:
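PYTHON
print("Question:", question)
print("Context:", context)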
Challenge:
Q: Use the nltk library to acquire data for natural language processing tasks. You can use the following code to load the nltk library and download some popular datasets:
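PYTHON
import nltk

# With no arguments, this opens the NLTK downloader, where you can choose
# datasets such as brown, reuters, or gutenberg:
nltk.download()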
Choose one of the datasets from the nltk downloader, such as brown, reuters, or gutenberg, and load it using the nltk.corpus module. Then, print the name, size, and description of the dataset.
A: You can use the nltk.corpus module to access the dataset and its information. A minimal sketch using the Brown corpus (the other corpora work the same way) is shown below:
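PYTHON
import nltk

# Download the Brown corpus (swap in 'reuters' or 'gutenberg' if you prefer):
nltk.download('brown')

from nltk.corpus import brown

# Print the name, size, and description of the dataset:
print("Name: Brown Corpus")
print("Size:", len(brown.fileids()), "files,", len(brown.words()), "words")
print("Description:", brown.readme()[:200])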
Key Points
- NLP supports a wide range of research applications, including sentiment analysis, information extraction, text summarization, topic modeling, and named entity recognition.
- Core NLP concepts include corpora, tokens, stemming, lemmatization, part-of-speech tagging, word embeddings, and transformers.
- Ambiguity and limited contextual understanding are major challenges for classical NLP systems.
- Dataset libraries such as NLTK, spaCy, Gensim, and Hugging Face datasets provide easy access to text data for NLP tasks.
Content from Introduction to Text Preprocessing
Last updated on 2024-05-12 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- How can I prepare data for NLP text analysis?
- How can I use spaCy for text preprocessing?
Objectives
- Define text preprocessing and its purpose for NLP tasks.
- Perform sentence segmentation, tokenization, lemmatization, and stop-words removal using spaCy.
Text preprocessing is the method of cleaning and preparing text data for use in NLP. This step is vital because it transforms raw data into a format that can be analyzed and used effectively by NLP algorithms.
2.1. Sentence Segmentation
Sentence segmentation divides a text into its constituent sentences, which is essential for understanding the structure and flow of the content. To see how it works, we start with a field-specific example: a paragraph about perovskite nanocrystals from materials engineering, which we will divide into sentences.
We can use the open-source library spaCy to perform this task. First, we import spaCy and load the English language model:
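PYTHON
import spacy

# Load the small English model (install it first if needed with:
# python -m spacy download en_core_web_sm):
nlp = spacy.load("en_core_web_sm")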
We can store our text here:
PYTHON
perovskite_text = "Perovskite nanocrystals are a class of semiconductor nanocrystals with unique properties that distinguish them from traditional quantum dots. These nanocrystals have an ABX3 composition, where 'A' can be cesium, methylammonium (MA), or formamidinium (FA); 'B' is typically lead or tin; and 'X' is a halogen ion like chloride, bromide, or iodide. Their remarkable optoelectronic properties, such as high photoluminescence quantum yields and tunable emission across the visible spectrum, make them ideal for applications in light-emitting diodes, lasers, and solar cells."
Now we process the text with spaCy, extract the sentences from the processed text with the list() function, and use a for loop with print() to output each sentence and show the segmentation:
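PYTHON
# Process the text with spaCy:
doc = nlp(perovskite_text)

# Extract the sentences from the processed text:
sentences = list(doc.sents)

# Print each sentence to show the segmentation:
for sentence in sentences:
    print(sentence.text)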
Output: Perovskite nanocrystals are a class of semiconductor nanocrystals with unique properties that distinguish them from traditional quantum dots.
These nanocrystals have an ABX3 composition, where 'A' can be cesium, methylammonium (MA), or formamidinium (FA); 'B' is typically lead or tin; and 'X' is a halogen ion like chloride, bromide, or iodide.
Their remarkable optoelectronic properties, such as high photoluminescence quantum yields and tunable emission across the visible spectrum, make them ideal for applications in light-emitting diodes, lasers, and solar cells.
Discussion
Q: Let’s try again by completing the code below to segment sentences from a paragraph about “your field of research”:
A:
PYTHON
import spacy
nlp = spacy.load("en_core_web_sm")
# Add the paragraph about your field of research here
text = "***" # varies based on your field of research
doc = nlp(text)
# Fill in the blank to extract sentences:
sentences = list(doc.sents)
# Fill in the blank to print each sentence
for sentence in sentences:
    print(sentence.text)
Discussion
Teamwork: Why is text preprocessing necessary for NLP tasks? Think of some examples of NLP tasks that require text preprocessing, such as sentiment analysis, machine translation, or text summarization. How does text preprocessing improve the performance and accuracy of these tasks?
Challenge
Q: Use the spaCy library to perform sentence segmentation and tokenization on the following text:
PYTHON
text: "The research (Ref. [1]) focuses on developing perovskite nanocrystals with a bandgap of 1.5 eV, suitable for solar cell applications!".
Print the number of sentences and tokens in the text, and the list of sentences and tokens. You can use the following code to load the spaCy library and the English language model:
A:
PYTHON
import spacy
# Load the English language model:
nlp = spacy.load("en_core_web_sm")
# Define the text with marks, letters, and numbers:
text = "The research (Ref. [1]) focuses on developing perovskite nanocrystals with a bandgap of 1.5 eV, suitable for solar cell applications.!"
# Process the text with spaCy
doc = nlp(text)
# Print the original text:
print("Original text:", text)
# Sentence segmentation:
sentences = list(doc.sents)
# Print the sentences:
print("Sentences:")
for sentence in sentences:
    print(sentence.text)
# Tokenization:
tokens = [token.text for token in doc]
# Print the tokens:
print("Tokens:")
print(tokens)
# Print the counts asked for in the question:
print("Number of sentences:", len(sentences))
print("Number of tokens:", len(tokens))
2.2. Tokenization
As mentioned in the first episode, tokenization breaks down text into individual words or tokens, which is a fundamental step for many NLP tasks.
Discussion
Teamwork: To better understand how tokenization works, let's match tokens from the provided paragraph about perovskite nanocrystals with similar tokens from another scientific text. This helps in understanding the common vocabulary used in the scientific literature. Using the sentences we listed in the previous section, we can see how tokenization performs. Assuming 'sentences' is the list of sentences from the previous example, choose a sentence to tokenize:
PYTHON
sentence_to_tokenize = sentences[0]
# Tokenize the chosen sentence by using a list comprehension:
tokens = [token.text for token in sentence_to_tokenize]
# We can print the tokens:
print(tokens)
Output: [‘Perovskite’, ‘nanocrystals’, ‘are’, ‘a’, ‘class’, ‘of’, ‘semiconductor’, ‘nanocrystals’, ‘with’, ‘unique’, ‘properties’, ‘that’, ‘distinguish’, ‘them’, ‘from’, ‘traditional’, ‘quantum’, ‘dots’, ‘.’]
Tokenization is not just about splitting text into words; it’s about understanding the boundaries of words and symbols in different contexts, which can vary greatly between languages and even within the same language in different settings.
Callout
Tokenization is very important for text analysis tasks such as sentiment analysis. Here we can compare two different texts from different fields and see how their associated tokens are different:
Now, we can add a new text from the trading context for comparison. Tokenization of a trading text can be performed similarly to the previous text.
PYTHON
trading_text = "Trading strategies often involve analyzing patterns and executing trades based on predicted market movements. Successful traders analyze trends and volatility to make informed decisions."
trading_tokens = [token.text for token in nlp(trading_text)]
We can see the results by using the print() function to show the tokens from both texts:
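PYTHON
print("Perovskite Tokens:", tokens)
print("Trading Tokens:", trading_tokens)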
Output:
Perovskite Tokens: ['Perovskite', 'nanocrystals', 'are', 'a', 'class', 'of', 'semiconductor', 'nanocrystals', 'with', 'unique', 'properties', 'that', 'distinguish', 'them', 'from', 'traditional', 'quantum', 'dots', '.']
Trading Tokens: ['Trading', 'strategies', 'often', 'involve', 'analyzing', 'patterns', 'and', 'executing', 'trades', 'based', 'on', 'predicted', 'market', 'movements', '.', 'Successful', 'traders', 'analyze', 'trends', 'and', 'volatility', 'to', 'make', 'informed', 'decisions', '.']
The tokens from the perovskite text will be specific to materials science, while the trading tokens will include terms related to market analysis. The scientific texts may use more complex and compound words while trading texts might include more action-oriented and analytical language. This comparison helps in understanding the specialized language used in different fields.
2.3. Stemming and Lemmatization
Stemming and lemmatization are techniques used to reduce words to their base or root form, aiding in the normalization of text. As discussed in the previous episode, the two methods differ; decide whether stemming or lemmatization would be more appropriate for analyzing a set of research texts on perovskite nanocrystals.
Discussion
Teamwork: From the differences between lemmatization and stemming that we learned in the last episode, which technique would you select to get more accurate text analysis results? Explain why.
Following our initial tokenization example, we can see how lemmatization works. A minimal sketch, assuming nlp and perovskite_text are still defined from earlier in this episode, processes the text with spaCy and then prints the original text and the lemmatized text:
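PYTHON
# Process the text with spaCy (nlp and perovskite_text are defined above):
doc = nlp(perovskite_text)

# Join the lemma of each token back into a single string:
lemmatized_text = " ".join([token.lemma_ for token in doc])

print("Original Text:", perovskite_text)
print("Lemmatized Text:", lemmatized_text)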
Output:
Original Text: Perovskite nanocrystals are a class of semiconductor nanocrystals with unique properties that distinguish them from traditional quantum dots. These nanocrystals have an ABX3 composition, where ‘A’ can be cesium, methylammonium (MA), or formamidinium (FA); ‘B’ is typically lead or tin; and ‘X’ is a halogen ion like chloride, bromide, or iodide. Their remarkable optoelectronic properties, such as high photoluminescence quantum yields and tunable emission across the visible spectrum, make them ideal for applications in light-emitting diodes, lasers, and solar cells.
Lemmatized Text: Perovskite nanocrystal be a class of semiconductor nanocrystal with unique property that distinguish they from traditional quantum dot . these nanocrystal have an ABX3 composition , where ’ A ’ can be cesium , methylammonium ( MA ) , or formamidinium ( FA ) ; ’ b ’ be typically lead or tin ; and ’ x ’ be a halogen ion like chloride , bromide , or iodide . their remarkable optoelectronic property , such as high photoluminescence quantum yield and tunable emission across the visible spectrum , make they ideal for application in light - emit diode , laser , and solar cell .
Callout
The spaCy library does not have stemming capabilities, so if we want to compare stemming and lemmatization, we also need to use another language processing library, NLTK (refer to episode 1).
Based on what we just learned, let's compare lemmatization and stemming. First, we import the necessary libraries, create an instance of the PorterStemmer from NLTK, and load the English language model for spaCy (similar to what we did earlier in this episode):
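PYTHON
import spacy
import nltk
from nltk.stem import PorterStemmer

# The Punkt models are required by word_tokenize below:
nltk.download('punkt')

# Create a stemmer instance and load the English model for spaCy:
stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")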
We can conduct stemming and lemmatization with identical text data:
PYTHON
text = "Perovskite nanocrystals are a class of semiconductor nanocrystals with unique properties that distinguish them from traditional quantum dots. These nanocrystals have an ABX3 composition, where 'A' can be cesium, methylammonium (MA), or formamidinium (FA); 'B' is typically lead or tin; and 'X' is a halogen ion like chloride, bromide, or iodide. Their remarkable optoelectronic properties, such as high photoluminescence quantum yields and tunable emission across the visible spectrum, make them ideal for applications in light-emitting diodes, lasers, and solar cells."
Before we can stem or lemmatize, we need to tokenize the text.
PYTHON
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# Apply stemming to each token:
stemmed_tokens = [stemmer.stem(token) for token in tokens]
For lemmatization, we process the text with spaCy and extract the lemma for each token:
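PYTHON
# Process the text with spaCy and extract the lemma of each token:
doc = nlp(text)
lemmatized_tokens = [token.lemma_ for token in doc]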
Finally, we can compare the stemmed and lemmatized tokens:
PYTHON
print("Original Tokens:", tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
Original Tokens: ['Perovskite', 'nanocrystals', 'are', 'a', 'class', 'of', 'semiconductor', 'nanocrystals', 'with', 'unique', 'properties', 'that', 'distinguish', 'them', 'from', 'traditional', 'quantum', 'dots', '.', 'These', 'nanocrystals', 'have', 'an', 'ABX3', 'composition', ',', 'where', "'", 'A', "'", 'can', 'be', 'cesium', ',', 'methylammonium', '(', 'MA', ')', ',', 'or', 'formamidinium', '(', 'FA', ')', ';', "'", 'B', "'", 'is', 'typically', 'lead', 'or', 'tin', ';', 'and', "'", 'X', "'", 'is', 'a', 'halogen', 'ion', 'like', 'chloride', ',', 'bromide', ',', 'or', 'iodide', '.', 'Their', 'remarkable', 'optoelectronic', 'properties', ',', 'such', 'as', 'high', 'photoluminescence', 'quantum', 'yields', 'and', 'tunable', 'emission', 'across', 'the', 'visible', 'spectrum', ',', 'make', 'them', 'ideal', 'for', 'applications', 'in', 'light-emitting', 'diodes', ',', 'lasers', ',', 'and', 'solar', 'cells', '.']
Stemmed Tokens: ['perovskit', 'nanocryst', 'are', 'a', 'class', 'of', 'semiconductor', 'nanocryst', 'with', 'uniqu', 'properti', 'that', 'distinguish', 'them', 'from', 'tradit', 'quantum', 'dot', '.', 'these', 'nanocryst', 'have', 'an', 'abx3', 'composit', ',', 'where', "'", 'a', "'", 'can', 'be', 'cesium', ',', 'methylammonium', '(', 'ma', ')', ',', 'or', 'formamidinium', '(', 'fa', ')', ';', "'", 'b', "'", 'is', 'typic', 'lead', 'or', 'tin', ';', 'and', "'", 'x', "'", 'is', 'a', 'halogen', 'ion', 'like', 'chlorid', ',', 'bromid', ',', 'or', 'iodid', '.', 'their', 'remark', 'optoelectron', 'properti', ',', 'such', 'as', 'high', 'photoluminesc', 'quantum', 'yield', 'and', 'tunabl', 'emiss', 'across', 'the', 'visibl', 'spectrum', ',', 'make', 'them', 'ideal', 'for', 'applic', 'in', 'light-emit', 'diod', ',', 'laser', ',', 'and', 'solar', 'cell', '.']
Lemmatized Tokens: ['Perovskite', 'nanocrystal', 'be', 'a', 'class', 'of', 'semiconductor', 'nanocrystal', 'with', 'unique', 'property', 'that', 'distinguish', 'they', 'from', 'traditional', 'quantum', 'dot', '.', 'these', 'nanocrystal', 'have', 'an', 'ABX3', 'composition', ',', 'where', "'", 'A', "'", 'can', 'be', 'cesium', ',', 'methylammonium', '(', 'MA', ')', ',', 'or', 'formamidinium', '(', 'FA', ')', ';', "'", 'b', "'", 'be', 'typically', 'lead', 'or', 'tin', ';', 'and', "'", 'x', "'", 'be', 'a', 'halogen', 'ion', 'like', 'chloride', ',', 'bromide', ',', 'or', 'iodide', '.', 'their', 'remarkable', 'optoelectronic', 'property', ',', 'such', 'as', 'high', 'photoluminescence', 'quantum', 'yield', 'and', 'tunable', 'emission', 'across', 'the', 'visible', 'spectrum', ',', 'make', 'they', 'ideal', 'for', 'application', 'in', 'light', '-', 'emit', 'diode', ',', 'laser', ',', 'and', 'solar', 'cell', '.']
We can see how stemming often cuts off the end of words, sometimes resulting in non-words, while lemmatization returns the base or dictionary form of the word. For example, stemming might reduce “properties” to “properti” while lemmatization would correctly identify the lemma as “property”. Lemmatization provides a more readable and meaningful result, which is particularly useful in NLP tasks that require understanding the context and meaning of words.
Challenge
Q: Use the spaCy library to perform lemmatization on the following text: “Perovskite nanocrystals are a promising class of materials for optoelectronic applications due to their tunable bandgaps and high photoluminescence efficiencies.” Print the original text and the lemmatized text. You can use the following code to load the spacy library and the English language model:
A:
PYTHON
import spacy
# Load the English language model:
nlp = spacy.load("en_core_web_sm")
# Define the text:
text = "Perovskite nanocrystals are a promising class of materials for optoelectronic applications due to their tunable bandgaps and high photoluminescence efficiencies."
# Process the text with spaCy:
doc = nlp(text)
# Print the original text:
print("Original text:", text)
# Print the lemmatized text:
lemmatized_text = " ".join([token.lemma_ for token in doc])
print("Lemmatized text:", lemmatized_text)
2.4. Stop-words Removal
Removing stop-words, which are common words that add little value to the analysis (such as 'and' and 'the'), helps focus on the important content. Assuming 'doc' is the processed text from the previous 'perovskite nanocrystals' example, we can use a list comprehension to collect the non-stop words:
PYTHON
filtered_sentence = [word for word in doc if not word.is_stop]
# print the filtered sentence and see how it is changed:
print("Filtered sentence:", filtered_sentence)
Output: Filtered sentence: [Perovskite, nanocrystals, class, semiconductor, nanocrystals, unique, properties, distinguish, traditional, quantum, dots, ., nanocrystals, ABX3, composition, ,, ', ', cesium, ,, methylammonium, (, MA, ), ,, formamidinium, (, FA, ), ;, ', B, ', typically, lead, tin, ;, ', X, ', halogen, ion, like, chloride, ,, bromide, ,, iodide, ., remarkable, optoelectronic, properties, ,, high, photoluminescence, quantum, yields, tunable, emission, visible, spectrum, ,, ideal, applications, light, -, emitting, diodes, ,, lasers, ,, solar, cells, .]
List comprehensions provide a convenient method for rapidly generating lists based on a straightforward condition.
Challenge
Q: To see how list comprehensions are created, fill in the missing parts of the code to remove stop-words from a given sentence.
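A: A minimal sketch, assuming the spaCy pipeline nlp is already loaded as above (the example sentence is made up for illustration; substitute any sentence you like):
PYTHON
sentence = nlp("This result is not consistent with the previous findings.")
# Fill in the blank: keep only the tokens that are not stop-words
filtered_words = [token.text for token in sentence if not token.is_stop]
print("Filtered words:", filtered_words)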
Warning!
While stopwords are often removed to improve analysis, they can be important for certain tasks like sentiment analysis, where the word ‘not’ can change the entire meaning of a sentence.
It is important to note that tokenization is just the beginning. In modern NLP, vectorization and embeddings play a pivotal role in capturing the context and meaning of text.
Vectorization is the process of converting tokens into a numerical format that machine learning models can understand. This often involves creating a bag-of-words model, where each token is represented by a unique number in a vector. Embeddings are advanced representations where words are mapped to vectors of real numbers. They capture not just the presence of tokens but also the semantic relationships between them. This is achieved through techniques like Word2Vec, GloVe, or BERT, which we will explore in the second part of our workshop.
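To see what a bag-of-words vector looks like in practice, here is a minimal sketch using Gensim's doc2bow() (the two toy documents are invented for illustration; we will use the same mechanism for topic modeling in the next episode):
PYTHON
from gensim import corpora

# Two toy documents, already tokenized:
documents = [["solar", "cells", "convert", "light"],
             ["perovskite", "solar", "cells", "emit", "light"]]

# Map each unique token to an integer ID:
dictionary = corpora.Dictionary(documents)

# Represent each document as sparse (token_id, count) pairs:
vectors = [dictionary.doc2bow(doc) for doc in documents]
print(vectors)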
These embeddings allow models to understand the text in a more nuanced way, leading to better performance on tasks such as sentiment analysis, machine translation, and more.
Stay tuned for our next session, where we will dive deeper into how we can use vectorization and embeddings to enhance our NLP models and truly capture the richness of language.
Key Points
- Text preprocessing is essential for cleaning and standardizing text data.
- Techniques like sentence segmentation, tokenization, stemming, and lemmatization are fundamental to text preprocessing.
- Removing stop-words helps in focusing on the important words in text analysis.
- Tokenization splits sentences into tokens, which are the basic units for further processing.
Content from Text Analysis
Last updated on 2024-05-15 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- What are text analysis methods?
- How can I perform text analysis?
Objectives
- Define objectives associated with each one of the text analysis techniques.
- Implement named entity recognition and topic modeling using Python libraries and frameworks such as NLTK and Gensim.
3.1. Introduction to Text Analysis
In this episode, we will learn how to analyze text data for NLP tasks. We will explore some common techniques and methods for text analysis, such as named entity recognition, topic modeling, and text summarization. We will use some popular libraries and frameworks, such as spaCy, NLTK, and Gensim, to implement these techniques and methods.
Discussion
Teamwork: What are some of the goals of text analysis for NLP tasks in your research field (e.g. material science)? Think of some examples of NLP tasks that require text analysis, such as literature review, patent analysis, or material discovery. How does text analysis help to achieve these goals?
Discussion
Teamwork: Name some of the common techniques in text analysis and their associated libraries. Briefly explain how they differ from each other in terms of their objectives and required libraries.
3.2. Named Entity Recognition
Named Entity Recognition is a process of identifying and classifying key elements in text into predefined categories. The categories could be names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. Next, let’s discuss how it works.
Discussion
Teamwork: Discuss what tasks can be done with NER.
A: NER can help with 1) categorizing resumes, 2) categorizing customer feedback, 3) categorizing research papers, etc.
Using a text example from Wikipedia can help us see how NER works. Note that the spaCy library is a common framework here as well. Thus, first, we make sure that the library is installed and imported, and we create an NLP pipeline (nlp) by loading the small English model from spaCy, which is suitable for general tasks:
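PYTHON
# Install once if needed (shell command): pip install spacy
import spacy

# Load the small English model, suitable for general tasks:
nlp = spacy.load("en_core_web_sm")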
Create a variable to store your text (text from Wikipedia):
PYTHON
text = "Australian Shares Exchange Ltd (ASX) is an Australian public company that operates Australia's primary shares exchange, the Australian Shares Exchange (sometimes referred to outside of Australia as, or confused within Australia as, The Sydney Stock Exchange, a separate entity). The ASX was formed on 1 April 1987, through incorporation under legislation of the Australian Parliament as an amalgamation of the six state securities exchanges, and merged with the Sydney Futures Exchange in 2006. Today, ASX has an average daily turnover of A$4.685 billion and a market capitalization of around A$1.6 trillion, making it one of the world's top 20 listed exchange groups, and the largest in the southern hemisphere. ASX Clear is the clearing house for all shares, structured products, warrants and ASX Equity Derivatives."
Apply the model to the text, and use a for loop to print all the named entities in the document:
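PYTHON
# Apply the pipeline to the text:
doc = nlp(text)

# Print each named entity together with its predicted label:
for ent in doc.ents:
    print(ent.text, ent.label_)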
The results will be:
Output:
Australian Shares Exchange Ltd ORG
ASX ORG
Australian NORP
Australia GPE
the Australian Shares Exchange ORG
Australia GPE
Australia GPE
The Sydney Stock Exchange ORG
ASX ORG
1 April 1987 DATE
the Australian Parliament ORG
six CARDINAL
the Sydney Futures Exchange ORG
2006 DATE
Today DATE
ASX ORG
A$4.685 billion MONEY
around A$1.6 trillion MONEY
20 CARDINAL
Challenge
Q: How can you interpret the labels in the output?
A: The labels follow spaCy's annotation scheme: ORG marks organizations, GPE marks geopolitical entities such as countries and cities, NORP marks nationalities or religious or political groups, DATE marks dates, MONEY marks monetary values, and CARDINAL marks numerals that do not fall under another type.
Challenge
Q: Can we also use other libraries for NER analysis? Use the NLTK library to perform named entity recognition on the following text:
text = "Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers. Additionally, perovskite nanocrystals exhibit high photoluminescence efficiencies, meaning they can efficiently convert absorbed light into emitted light, further adding to their potential for various optoelectronic applications."
A: Download the necessary NLTK resources and import the required toolkit:
PYTHON
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')  # word list required by the NE chunker
Store the text:
PYTHON
text = "Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers. Additionally, perovskite nanocrystals exhibit high photoluminescence efficiencies, meaning they can efficiently convert absorbed light into emitted light, further adding to their potential for various optoelectronic applications."
PYTHON
# Tokenize the text:
tokens = nltk.word_tokenize(text)
# Assign part-of-speech tags:
pos_tags = nltk.pos_tag(tokens)
# Perform named entity recognition:
named_entities = nltk.ne_chunk(pos_tags)
# Print the original text:
print("Original Text:")
print(text)
# Print named entities and their types:
print("\nNamed Entities:")
for entity in named_entities:
    if type(entity) == nltk.Tree:
        print(f"Entity: {' '.join(word for word, _ in entity.leaves())}, Type: {entity.label()}")
Original Text: Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers. Additionally, perovskite nanocrystals exhibit high photoluminescence efficiencies, meaning they can efficiently convert absorbed light into emitted light, further adding to their potential for various optoelectronic applications.
- Named Entities:
- Entity: Perovskite, Type: ORGANIZATION
- Entity: light-emitting diodes (LEDs), Type: ORGANIZATION
Why NER?
When do we need to perform NER for your research?
NER helps in quickly finding specific information in large datasets, which is particularly useful in research fields for categorizing the text based on the entities. NER is also called entity chunking and entity extraction.
3.3. Topic Modeling
Topic Modeling is an unsupervised method for discovering the abstract "topics" that occur in a collection of documents. It is useful in understanding the main themes of a large corpus of text.
Challenge
Teamwork: To better understand this and to find the connection between concepts we have learned so far, let’s match the following terms to their brief definitions:
Callout
Some of the concepts in this section will be new to you. Although a detailed explanation of these concepts is out of the scope of this workshop, we will learn their basic definitions. You already learned a few of them in the earlier activity. Another one is bag-of-words (BoW), which we will learn more about in episode 5. BoW is a representation of text that describes the occurrence of words within a document. Topic modeling needs it to view the frequency of the words in a document regardless of the order of the words in the text.
To see how Topic Modeling can help us classify a text in action, let's look at the following example. We need to install the Gensim library and import the necessary modules:
PYTHON
# Install once if needed (shell command): pip install gensim
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
At this stage, we preprocess the text, including tokenization (text from Wikipedia):
text = “Australian Shares Exchange Ltd (ASX) is an Australian public company that operates Australia’s primary shares exchange, the Australian Shares Exchange (sometimes referred to outside of Australia as, or confused within Australia as, The Sydney Stock Exchange, a separate entity). The ASX was formed on 1 April 1987, through incorporation under legislation of the Australian Parliament as an amalgamation of the six state securities exchanges and merged with the Sydney Futures Exchange in 2006. Today, ASX has an average daily turnover of A$4.685 billion and a market capitalisation of around A$1.6 trillion, making it one of the world’s top 20 listed exchange groups, and the largest in the southern hemisphere. ASX Clear is the clearing house for all shares, structured products, warrants and ASX Equity Derivatives. ASX Group[3] is a market operator, clearing house and payments system facilitator. It also oversees compliance with its operating rules, promotes standards of corporate governance among Australia’s listed companies and helps to educate retail investors. Australia’s capital markets. Financial development – Australia was ranked 5th out of 57 of the world’s leading financial systems and capital markets by the World Economic Forum; Equity market – the 8th largest in the world (based on free-float market capitalisation) and the 2nd largest in Asia-Pacific, with A$1.2 trillion market capitalisation and average daily secondary trading of over A$5 billion a day; Bond market – 3rd largest debt market in the Asia Pacific; Derivatives market – largest fixed income derivatives in the Asia-Pacific region; Foreign exchange market – the Australian foreign exchange market is the 7th largest in the world in terms of global turnover, while the Australian dollar is the 5th most traded currency and the AUD/USD the 4th most traded currency pair; Funds management – Due in large part to its compulsory superannuation system, Australia has the largest pool of funds under management in the Asia-Pacific region, and the 4th largest in the world. Its primary markets are the AQUA Markets. Regulation. The Australian Securities & Investments Commission (ASIC) has responsibility for the supervision of real-time trading on Australia’s domestic licensed financial markets and the supervision of the conduct by participants (including the relationship between participants and their clients) on those markets. ASIC also supervises ASX’s own compliance as a public company with ASX Listing Rules. ASX Compliance is an ASX subsidiary company that is responsible for monitoring and enforcing ASX-listed companies’ compliance with the ASX operating rules. The Reserve Bank of Australia (RBA) has oversight of the ASX’s clearing and settlement facilities for financial system stability. In November 1903 the first interstate conference was held to coincide with the Melbourne Cup. The exchanges then met on an informal basis until 1937 when the Australian Associated Stock Exchanges (AASE) was established, with representatives from each exchange. Over time the AASE established uniform listing rules, broker rules, and commission rates. Trading was conducted by a call system, where an exchange employee called the names of each company and brokers bid or offered on each. In the 1960s this changed to a post system. Exchange employees called”chalkies” wrote bids and offers in chalk on blackboards continuously, and recorded transactions made. 
The ASX (Australian Stock Exchange Limited) was formed in 1987 by legislation of the Australian Parliament which enabled the amalgamation of six independent stock exchanges that formerly operated in the state capital cities. After demutualisation, the ASX was the first exchange in the world to have its shares quoted on its own market. The ASX was listed on 14 October 1998.[7] On 7 July 2006 the Australian Stock Exchange merged with SFE Corporation, holding company for the Sydney Futures Exchange. Trading system. ASX Group has two trading platforms – ASX Trade,[12] which facilitates the trading of ASX equity securities and ASX Trade24 for derivative securities trading. All ASX equity securities are traded on screen on ASX Trade. ASX Trade is a NASDAQ OMX ultra-low latency trading platform based on NASDAQ OMX’s Genium INET system, which is used by many exchanges around the world. It is one of the fastest and most functional multi-asset trading platforms in the world, delivering latency down to ~250 microseconds. ASX Trade24 is ASX global trading platform for derivatives. It is globally distributed with network access points (gateways) located in Chicago, New York, London, Hong Kong, Singapore, Sydney and Melbourne. It also allows for true 24-hour trading, and simultaneously maintains two active trading days which enables products to be opened for trading in the new trading day in one time zone while products are still trading under the previous day. Opening times. The normal trading or business days of the ASX are week-days, Monday to Friday. ASX does not trade on national public holidays: New Year’s Day (1 January), Australia Day (26 January, and observed on this day or the first business day after this date), Good Friday (that varies each year), Easter Monday, Anzac day (25 April), Queen’s birthday (June), Christmas Day (25 December) and Boxing Day (26 December). On each trading day there is a pre-market session from 7:00 am to 10:00 am AEST and a normal trading session from 10:00 am to 4:00 pm AEST. The market opens alphabetically in single-price auctions, phased over the first ten minutes, with a small random time built in to prevent exact prediction of the first trades. There is also a single-price auction between 4:10 pm and 4:12 pm to set the daily closing prices. Settlement. Security holders hold shares in one of two forms, both of which operate as uncertificated holdings, rather than through the issue of physical share certificates: Clearing House Electronic Sub-register System (CHESS). The investor’s controlling participant (normally a broker) sponsors the client into CHESS. The security holder is given a “holder identification number” (HIN) and monthly statements are sent to the security holder from the CHESS system when there is a movement in their holding that month. Issuer-sponsored. The company’s share register administers the security holder’s holding and issues the investor with a security-holder reference number (SRN) which may be quoted when selling. Holdings may be moved from issuer-sponsored to CHESS or between different brokers by electronic message initiated by the controlling participant. Short selling. Main article: Short (finance). Short selling of shares is permitted on the ASX, but only among designated stocks and with certain conditions: ASX trading participants (brokers) must report all daily gross short sales to ASX. The report will aggregate the gross short sales as reported by each trading participant at an individual stock level. 
ASX publishes aggregate gross short sales to ASX participants and the general public.[13] Many brokers do not offer short selling to small private investors. LEPOs can serve as an equivalent, while contracts for difference (CFDs) offered by third-party providers are another alternative. In September 2008, ASIC suspended nearly all forms of short selling due to concerns about market stability in the ongoing global financial crisis.[14][15] The ban on covered short selling was lifted in May 2009.[16] Also, in the biggest change for ASX in 15 years, ASTC Settlement Rule 10.11.12 was introduced, which requires the broker to provide stocks when settlement is due, otherwise the broker must buy the stock on the market to cover the shortfall. The rule requires that if a Failed Settlement Shortfall exists on the second business day after the day on which the Rescheduled Batch Instruction was originally scheduled for settlement (that is, generally on T+5), the delivering settlement participant must either: close out the Failed Settlement Shortfall on the next business day by purchasing the number of Financial Products of the relevant class equal to the shortfall; or acquire under a securities lending arrangement the number of Financial Products of the relevant class equal to the shortfall and deliver those Financial Products in Batch Settlement no more than two business days later.[17] Options. Options on leading shares are traded on the ASX, with standardised sets of strike prices and expiry dates. Liquidity is provided by market makers who are required to provide quotes. Each market maker is assigned two or more stocks. A stock can have more than one market maker, and they compete with one another. A market maker may choose one or both of: Make a market continuously, on a set of 18 options. Make a market in response to a quote request, in any option up to 9 months out. In both cases there is a minimum quantity (5 or 10 contracts depending on the shares) and a maximum spread permitted. Due to the higher risks in options, brokers must check clients’ suitability before allowing them to trade options. Clients may both take (i.e. buy) and write (i.e. sell) options. For written positions, the client must put up margin. Interest rate market. The ASX interest rate market is the set of corporate bonds, floating rate notes, and bond-like preference shares listed on the exchange. These securities are traded and settled in the same way as ordinary shares, but the ASX provides information such as their maturity, effective interest rate, etc., to aid comparison.[18] Futures. The Sydney Futures Exchange (SFE) was the 10th largest derivatives exchange in the world, providing derivatives in interest rates, equities, currencies and commodities. The SFE is now part of ASX and its most active products are: SPI 200 Futures – Futures contracts on an index representing the largest 200 stocks on the Australian Stock Exchange by market capitalisation. AU 90-day Bank Accepted Bill Futures – Australia’s equivalent of T-Bill futures. 3-Year Bond Futures – Futures contracts on Australian 3-year bonds. 10-Year Bond Futures – Futures contracts on Australian 10-year bonds. The ASX trades futures over the ASX 50, ASX 200 and ASX property indexes, and over grain, electricity and wool. Options over grain futures are also traded. Market indices. The ASX maintains stock indexes concerning stocks traded on the exchange in conjunction with Standard & Poor’s. 
There is a hierarchy of index groups called the S&P/ASX 20, S&P/ASX 50, S&P/ASX 100, S&P/ASX 200 and S&P/ASX 300, notionally containing the 20, 50, 100, 200 and 300 largest companies listed on the exchange, subject to some qualifications. Sharemarket Game. The ASX Sharemarket Game give members of the public and secondary school students the chance to learn about investing in the sharemarket using real market prices. Participants receive a hypothetical $50,000 to buy and sell shares in 150 companies and track the progress of their investments over the duration of the game.[19] Merger talks with SGX. ASX was (25 October 2010) in merger talks with Singapore Exchange (SGX). While there was an initial expectation that the merger would have created a bourse with a market value of US$14 billion,[20] this was a misconception; the final proposal intended that the ASX and SGX bourses would have continued functioning separately. The merger was blocked by Treasurer of Australia Wayne Swan on 8 April 2011, on advice from the Foreign Investment Review Board that the proposed merger was not in the best interests of Australia.[21]”
For Topic Modeling we need to map each word to a unique ID by creating a dictionary, and the dictionary is then converted into a bag-of-words with doc2bow(). Next, we use Latent Dirichlet Allocation (LDA), a popular topic modeling technique that assumes documents are produced from a mixture of topics, which in turn generate words based on their probability distribution. We set up the LDA model with the number of topics, train it on the corpus, and finally print the topics and their word distributions. A minimal sketch putting these steps together (assuming text holds the passage above):
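PYTHON
# Tokenize and normalize the text:
tokens = simple_preprocess(text)

# Map each word to a unique ID:
dictionary = corpora.Dictionary([tokens])

# Convert the tokenized document into a bag-of-words corpus:
corpus = [dictionary.doc2bow(tokens)]

# Set up and train the LDA model (the number of topics is a free choice):
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)

# Print the topics and their word distributions:
for topic in lda_model.print_topics():
    print(topic)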
Discussion
Teamwork: How does the topic modeling method help researchers? What about the text summarization? What are some of the challenges and limitations of text analysis in your research field (material science)? Consider some of the factors that affect the quality and accuracy of text analysis, such as data availability, language diversity, and domain specificity. How do these factors pose problems or difficulties for text analysis in material science?
Challenge
Q: Use the Gensim library to perform topic modeling on the following text, print the original text and the list of topics and their keywords.
text = "Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers. Additionally, perovskite nanocrystals exhibit high photoluminescence efficiencies, meaning they can efficiently convert absorbed light into emitted light, further adding to their potential for various optoelectronic applications."
A:
PYTHON
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
tokens = simple_preprocess(text)  # 'text' holds the passage above
# Map each token to an ID and build a one-document bag-of-words corpus:
dictionary = corpora.Dictionary([tokens])
corpus = [dictionary.doc2bow(tokens)]
# Train a two-topic LDA model and print the topics:
model = LdaModel(corpus, num_topics=2, id2word=dictionary)
print(text)
print(model.print_topics())
Challenge
Q: Use Gensim to perform topic modeling on the following two different texts and provide a comparison.
text1 = “Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers. Additionally, perovskite nanocrystals exhibit high photoluminescence efficiencies, meaning they can efficiently convert absorbed light into emitted light, further adding to their potential for various optoelectronic applications.”
text2 = “Graphene is a one-atom-thick sheet of carbon atoms arranged in a honeycomb lattice. It is a remarkable material with unique properties, including high electrical conductivity, thermal conductivity, mechanical strength, and optical transparency. Graphene has the potential to revolutionize various fields, including electronics, photonics, and composite materials. Due to its excellent electrical conductivity, graphene is a promising candidate for next-generation electronic devices, such as transistors and sensors. Additionally, its high thermal conductivity makes it suitable for heat dissipation applications.”
A: After storing the two texts in text1 and text2, preprocess the text (e.g., tokenization, stop word removal, stemming/lemmatization). Split the texts into documents:
PYTHON
import gensim
from gensim import corpora

# Split the texts into tokenized documents
documents = [text1.split(), text2.split()]

# Create a dictionary mapping each token to an ID
dictionary = corpora.Dictionary(documents)

# Create a corpus (bag-of-words representation)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train the LDA model (adjust num_topics as needed)
lda_model = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=3, passes=20)

# Print the original texts
print("Original Texts:")
print(f"Text 1:\n{text1}\n")
print(f"Text 2:\n{text2}\n")

# Identify which topic keywords appear in each text
print("Topics and Keywords:")
for topic in lda_model.show_topics(formatted=False):
    print(f"\nTopic {topic[0]}:")
    topic_words = [w[0] for w in topic[1]]
    print("Text 1 Keywords:", [w for w in topic_words if w in text1])
    print("Text 2 Keywords:", [w for w in topic_words if w in text2])

# Explain the conceptual similarity
print("\nConceptual Similarity:")
print("Both texts discuss novel materials (perovskite nanocrystals and graphene) "
      "with unique properties. While the specific applications and functionalities "
      "differ slightly, they both highlight the potential of these materials for "
      "various technological advancements.")
Output:

Original Texts:
Text 1: Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers. Additionally, perovskite nanocrystals exhibit high photoluminescence efficiencies, meaning they can efficiently convert absorbed light into emitted light, further adding to their potential for various optoelectronic applications.
Text 2: Graphene is a one-atom-thick sheet of carbon atoms arranged in a honeycomb lattice. It is a remarkable material with unique properties, including high electrical conductivity, thermal conductivity, mechanical strength, and optical transparency. Graphene has the potential to revolutionize various fields, including electronics, photonics, and composite materials. Due to its excellent electrical conductivity, graphene is a promising candidate for next-generation electronic devices, such as transistors and sensors. Additionally, its high thermal conductivity makes it suitable for heat dissipation applications.
Topics and Keywords:
Topic 0:
Text 1 Keywords: ['applications', 'devices', 'material', 'optoelectronic', 'properties']
Text 2 Keywords: ['applications', 'conductivity', 'electronic', 'graphene', 'material', 'potential']
Topic 1:
Text 1 Keywords: ['bandgaps', 'crystal', 'electronic', 'properties', 'structure']
Text 2 Keywords: ['conductivity', 'electrical', 'graphene', 'material', 'properties']
Topic 2:
Text 1 Keywords: ['absorption', 'emit', 'light', 'spectrum']
Text 2 Keywords: ['conductivity', 'graphene', 'material', 'optical', 'potential']
Conceptual Similarity:
Both texts discuss novel materials (perovskite nanocrystals and graphene) with unique properties. While the specific applications and functionalities differ slightly (optoelectronic devices vs. electronic devices), they both highlight the potential of these materials for various technological advancements. Notably, both topics identify “material,” “properties,” and “applications” as keywords, suggesting a shared focus on the materials’ characteristics and their potential uses. Additionally, keywords like “electronic,” “conductivity,” and “potential” appear in both texts within different topics, indicating a conceptual overlap in exploring the electronic properties and potential applications of these materials.
Challenge
Q: Use the Gensim library to perform topic modeling on the following text, then print the original text and the list of topics and their keywords.
text = “Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.”
A:
PYTHON
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
tokens = simple_preprocess(text)
dictionary = corpora.Dictionary([tokens])
corpus = [dictionary.doc2bow(tokens)]
model = LdaModel(corpus, num_topics=2, id2word=dictionary)
print(text)
print(model.print_topics())
Output:

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
[(0, '0.051*"natural" + 0.051*"language" + 0.051*"processing" + 0.027*"nlp" + 0.027*"challenges" + 0.027*"speech" + 0.027*"recognition" + 0.027*"understanding" + 0.027*"generation" + 0.027*"frequently"'), (1, '0.051*"natural" + 0.051*"language" + 0.051*"computers" + 0.027*"interactions" + 0.027*"between" + 0.027*"human" + 0.027*"languages" + 0.027*"particular" + 0.027*"program" + 0.027*"process"')]
Challenge of using a small corpus
The warning message "too few updates, training might not converge" arises when you use a very small corpus for topic modeling with Latent Dirichlet Allocation (LDA) in Gensim. LDA relies on statistical analysis of documents to discover hidden topics. With a limited corpus (a single document in this case), there aren't enough data points for the model to learn robust topics. Increasing the number of documents (corpus size) generally improves the accuracy and convergence of LDA models.
3.3. Text Summarization
Text summarization in NLP is the process of creating a concise and coherent version of a longer text document, preserving its key information. There are two primary approaches to text summarization:
- Extractive Summarization: This method involves identifying and extracting key sentences or phrases directly from the original text to form the summary. It is akin to creating a highlight reel of the most important points.
- Abstractive Summarization: This approach goes beyond mere extraction; it involves understanding the main ideas and then generating new, concise text that captures the essence of the original content. It is similar to writing a synopsis or an abstract for a research paper.
In the next part of the workshop, we will explore advanced tools like transformers, which can generate summaries that are more coherent and closer to what a human might write. Transformers use models like BERT and GPT to understand the context and semantics of the text, allowing for more sophisticated abstractive summaries.
Challenge
Q: Fill in the blanks with the correct terms related to text summarization:
- —— summarization selects sentences directly from the original text, while —— summarization generates new sentences.
- —— are advanced tools used for generating more coherent and human-like summaries.
- The —— and —— models are examples of transformers that understand the context and semantics of the text.
- —— summarization can often create summaries that are more —— and coherent than —— methods.
- Advanced summarization tools use —— and —— to interpret and condense text.
A:
Extractive summarization selects sentences directly from the original text, while abstractive summarization generates new sentences.
Transformers are advanced tools used for generating more coherent and human-like summaries.
The BERT and GPT models are examples of transformers that understand the context and semantics of the text.
Abstractive summarization can often create summaries that are more concise and coherent than extractive methods.
Advanced summarization tools use machine learning and natural language processing to interpret and condense text.
Callout
In the rapidly evolving field of NLP, summarization tasks are increasingly being carried out using transformer-based models due to their advanced capabilities in understanding context and generating coherent summaries. Tools like Gensim's summarization module have become outdated and were removed in its 4.0 release (source), as they relied on extractive methods that simply selected parts of the existing text, which is less effective compared to the abstractive approach of transformers. These cutting-edge transformer models, which can create concise and fluent summaries by generating new sentences, are leading to the gradual disappearance of older, less efficient summarization methods.
Key Points
- Named Entity Recognition (NER) is crucial for identifying and categorizing key information in text, such as names of people, organizations, and locations.
- Topic Modeling helps uncover the underlying thematic structure in a large corpus of text, which is beneficial for summarizing and understanding large datasets.
- Text Summarization provides a concise version of a longer text, highlighting the main points, which is essential for quick comprehension of extensive research material.
Content from Word Embedding
Last updated on 2024-05-10 | Edit this page
Estimated time: 16 minutes
Overview
Questions
- What is a vector space in the context of NLP?
- How can I visualize vector space in a 2D model?
- How can I use embeddings and how do embeddings capture the meaning of words?
Objectives
- Be able to explain vector space and how it is related to text analysis.
- Identify the tools required for text embeddings.
- To explore the Word2Vec algorithm and its advantages over traditional models.
4.1. Introduction to Vector Space & Embeddings:
We have discussed how tokenization works and why it is important in text analysis; however, this is not the whole story of preprocessing. For conducting robust and reliable text analysis with NLP models, vectorization and embedding are required after tokenization. To understand this concept, we first talk about vector space.
Vector space models represent text data as vectors, which can be used in various machine learning algorithms. Embeddings are dense vectors that capture the semantic meanings of words based on their context.
Discussion
Teamwork: Discuss how tokenization affects the representation of text in vector space models. Consider the impact of ignoring common words (stop words) and the importance of word order.
A: Ignoring stop words might lead to loss of some contextual information but can also reduce noise. Preserving word order can be crucial for understanding the meaning, especially in languages with flexible syntax.
Tokenization is a fundamental step in the processing of text for vector space models. It involves breaking down a string of text into individual units, or “tokens,” which typically represent words or phrases. Here’s how tokenization impacts the representation of text in vector space models:
- Granularity: Tokenization determines the granularity of text representation. Finer granularity (e.g., splitting on punctuation) can capture more nuances but may increase the dimensionality of the vector space.
- Dimensionality: Each unique token becomes a dimension in the vector space. The choice of tokenization can significantly affect the number of dimensions, with potential implications for computational efficiency and the “curse of dimensionality.”
- Semantic Meaning: Proper tokenization ensures that semantically significant units are captured as tokens, which is crucial for the model to understand the meaning of the text.
Ignoring common words, or “stop words,” can also have a significant impact:
- Noise Reduction: Stop words are often filtered out to reduce noise since they usually don't carry important meaning and are highly frequent (e.g., "the," "is," "at").
- Focus on Content Words: By removing stop words, the model can focus on content words that carry the core semantic meaning, potentially improving the performance of tasks like information retrieval or topic modeling.
- Computational Efficiency: Ignoring stop words reduces the dimensionality of the vector space, which can make computations more efficient.
The importance of word order is another critical aspect:
- Contextual Meaning: Word order is essential for capturing the syntactic structure and meaning of a sentence. Traditional bag-of-words models ignore word order, which can lead to a loss of contextual meaning.
- Phrase Identification: Preserving word order allows for the identification of multi-word expressions and phrases that have distinct meanings from their constituent words.
- Word Embeddings: Advanced models like word embeddings (e.g., Word2Vec) and contextual embeddings (e.g., BERT) can capture word order to some extent, leading to a more nuanced understanding of text semantics.
In summary, tokenization, the treatment of stop words, and the consideration of word order are all crucial factors that influence how text is represented in vector space models, affecting both the quality of the representation and the performance of downstream tasks.
Tokenization Vs. Vectorization Vs. Embedding
Initially, tokenization breaks down text into discrete elements, or tokens, which can include words, phrases, symbols, and even punctuation, each represented by a unique numerical identifier. These tokens are then mapped to vectors of real numbers within an n-dimensional space, a process that is part of embedding. During model training, these vectors are adjusted to reflect the semantic similarities between tokens, positioning those with similar meanings closer together in the embedding space. This allows the model to grasp the nuances of language and transforms raw text into a format that machine learning algorithms can interpret, paving the way for advanced text analysis and understanding.
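As a small illustration of the first two steps (a sketch using Gensim; the sentence is invented for the example):

PYTHON
from gensim import corpora
from gensim.utils import simple_preprocess

# Tokenization: text -> discrete tokens
tokens = simple_preprocess("Embeddings map tokens to dense vectors")

# Each unique token is assigned a numerical identifier
dictionary = corpora.Dictionary([tokens])
print(dictionary.token2id)  # e.g. {'dense': 0, 'embeddings': 1, ...}

The embedding step itself happens during model training, when these IDs are mapped to learned vectors (see the Word2Vec section below).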
4.2. Bag of Words & TF-IDF:
Feature extraction in machine learning involves creating numerical features that describe a document’s relationship to its corpus. Traditional methods like Bag-of-Words and TF-IDF count words or n-grams, with the latter assigning weights based on a word’s importance, calculated by Term Frequency (TF) and Inverse Document Frequency (IDF). TF measures a word’s importance within a document, while IDF assesses its rarity across the corpus.
The product of TF and IDF gives the TF-IDF score, which balances a word’s frequency in a document against its commonness in the corpus. This approach helps to highlight significant words while diminishing the impact of commonly used words like “the” or “a.”
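As a quick, hand-computed illustration of this product (the counts are invented for the example):

PYTHON
import math

# Toy corpus: 3 documents; we score the word "perovskite" in document 0
N = 3                    # number of documents in the corpus
tf = 4 / 100             # the word appears 4 times in a 100-word document
df = 1                   # it appears in only 1 of the 3 documents
idf = math.log(N / df)   # rare across the corpus -> high IDF

tfidf = tf * idf
print(f"TF-IDF score: {tfidf:.4f}")  # ~0.0439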
- BoW "encodes the total number of times a document uses each word in the associated corpus through the CountVectorizer."
- TF-IDF "creates features for each document based on how often each word shows up in a document versus the entire corpus."
- source
Discussion
Teamwork: Discuss how each method represents the importance of words and the potential impact on sentiment analysis.
A: To compare the Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) methods in representing text data and their implications for sentiment analysis.
Data Collection: Gather a corpus of product reviews. For this activity, let’s assume we have a list of reviews stored in a variable called reviews. Clean the text data by removing punctuation, converting to lowercase, and possibly removing stop words. Use a vectorizer to convert the reviews into a BoW representation.
Discuss how BoW represents the frequency of words without considering the context or rarity across documents. Use a vectorizer to convert the same reviews into a TF-IDF representation. Discuss how TF-IDF represents the importance of words by considering both the term frequency and how unique the word is across all documents.
Teamwork
Sentiment Analysis Implications:
Analyze a corpus of product reviews using both BoW and TF-IDF. Consider how the lack of context in BoW might affect sentiment analysis. Evaluate whether TF-IDF’s emphasis on unique words improves the model’s ability to understand sentiment.
Share Findings: Groups should present their findings, highlighting the strengths and weaknesses of each method.
PYTHON
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Sample corpus of product reviews
reviews = [
"Great product, really loved it!",
"Bad quality, totally disappointed.",
"Decent product for the price.",
"Excellent quality, will buy again!"
]
# Initialize the CountVectorizer for BoW
bow_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the reviews
bow_matrix = bow_vectorizer.fit_transform(reviews)
# Display the BoW matrix
print("Bag of Words Matrix:")
print(bow_matrix.toarray())
# Initialize the TfidfVectorizer for TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
# Fit and transform the reviews
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)
# Display the TF-IDF matrix
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
The BoW matrix shows the frequency of each word in the reviews, disregarding context and word importance. The TF-IDF matrix shows the weighted importance of words, giving less weight to common words and more to unique ones.
In sentiment analysis, BoW might misinterpret sentiments due to ignoring context, while TF-IDF might capture nuances better by emphasizing words that are significant in a particular review.
By comparing BoW and TF-IDF, participants can gain insights into how each method processes text data and their potential impact on NLP tasks like sentiment analysis. This activity encourages critical thinking about feature representation in machine learning models.
4.3. Word2Vec Algorithm:
More advanced techniques like Word2Vec and GloVe, as well as feature learning during neural network training, have also been developed to improve feature extraction.
Word2Vec uses neural networks to learn word associations from large text corpora. It has two architectures: Skip-Gram and Continuous Bag-of-Words (CBOW).
After training, the final prediction layer is discarded and the learned word embeddings are kept. These embeddings capture the context of words, so words used in similar contexts yield similar embeddings. After data preprocessing, these numerical features can be used in various NLP models for tasks like classification or named entity recognition.
Now let’s see how this framework can be used in practice. First import required libraries: Start by importing necessary libraries like gensim for Word2Vec and nltk for tokenization. Next, prepare the data: Tokenize your text data into words.
PYTHON
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# Sample text
text = "Tokenization splits text into words. Embeddings capture semantic meaning."
# Tokenize the text
tokens = word_tokenize(text.lower())
Now train the model: Use the Word2Vec class from gensim to train your model on the tokenized sentences.
PYTHON
# Train a Word2Vec model on the tokenized text (treated as a single sentence)
model = Word2Vec([tokens], vector_size=50, window=5, min_count=1)
Retrieve Vectors: After training, use the model to get vectors for words of interest.
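A sketch of this step (it assumes the model trained above and gensim 4.x's `model.wv` accessor):

PYTHON
# Look up the learned 50-dimensional vector for the word 'embeddings'
vector = model.wv['embeddings']
print(vector.shape)  # (50,)
print(vector)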
The code tokenizes the sample text, trains a Word2Vec model, and retrieves the vector for the word ‘embeddings’.
The resulting vector is a 50-dimensional representation of ‘embeddings’, capturing its context within the sample text. This vector can then be used in various NLP tasks to represent the semantic meaning of the word ‘embeddings’.
By understanding the roles of tokenization and embedding, we can better prepare text data for complex NLP tasks and build models that more accurately interpret human language.
Global Vectors for Word Representation (GloVe)
GloVe is a model for learning word embeddings, which are representations of words in the form of high-dimensional vectors. Unlike Word2Vec, which uses a neural network to learn word embeddings from local context information, GloVe is designed to capture both global statistics and local context. Here's how GloVe stands out:
- Matrix Factorization: GloVe uses matrix factorization on a word co-occurrence matrix that reflects how often each word appears in the context of every other word within a large corpus.
- Global Word-Word Co-Occurrence: It focuses on word-to-word co-occurrence globally across the entire corpus, rather than just within a local context window as in Word2Vec.
- Weighting Function: GloVe employs a weighting function that helps to address the disparity in word co-occurrence frequencies, giving less weight to rare and frequent co-occurrences.
The main difference between GloVe and Word2Vec is that GloVe is built on the idea that word meanings can be derived from their co-occurrence probabilities with other words, and hence it incorporates global corpus statistics, whereas Word2Vec relies more on local context information. This allows GloVe to effectively capture both the semantic and syntactic relationships between words, making it powerful for various natural language processing tasks.
Key Points
- Tokenization is crucial for converting text into a format usable by machine learning models.
- BoW and TF-IDF are fundamental techniques for feature extraction in NLP.
- Word2Vec and GloVe generate embeddings that encapsulate word meanings based on context and co-occurrence, respectively.
- Understanding these concepts is essential for building effective NLP models that can interpret and process human language.
Content from Transformers for Natural Language Processing
Last updated on 2024-05-15 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- How do Transformers work?
- How can I use Transformers for text analysis?
Objectives
- To be able to describe Transformers’ architecture.
- To be able to implement sentiment analysis and text summarization using transformers.
Transformers have revolutionized the field of NLP since their introduction by the Google team in 2017. Unlike previous models that processed text sequentially, Transformers use an attention mechanism to process all words at once, allowing them to capture context more effectively. This parallel processing capability enables Transformers to handle long-range dependencies and understand the nuances of language better than their predecessors. For now, try to recognize the building blocks of the general structure of a transformer.
5.1. Introduction to Artificial Neural Networks
To understand how Transformers work we also need to learn about artificial neural networks (ANNs). Imagine a neural network as a team of workers in a factory. Each worker (neuron) has a specific task (processing information), and they pass their work along to the next person in line until the final product (output) is created.
Just like a well-organized assembly line, a neural network processes information in stages, with each neuron contributing to the final result.
Activity
Teamwork: Take a look at the architecture of a simple ANN below. Identify the underlying layers and components of this ANN and add the correct name label to each one.
In the context of machine learning, a multilayer perceptron (MLP) is indeed a fully connected multi-layer neural network and is a classic example of a feedforward artificial neural network (ANN). It typically includes an input layer, one or more hidden layers, and an output layer. When an MLP has more than one hidden layer, it can be considered a deep ANN, part of a broader category known as deep learning.
Summation and Activation Function
If we zoom into a neuron in the hidden layer, we can see the mathematical operations (weights summation and activation function). An input is transformed at each hidden layer node through a process that multiplies the input (x_i) by learned weights (w_i), adds a bias (b), and then applies an activation function to determine the node’s output. This output is either passed on to the next layer or contributes to the final output of the network. Essentially, each node performs a small calculation that, when combined with the operations of other nodes, allows the network to process complex patterns and data.
Backpropagation is an algorithmic cornerstone in the training of ANNs, serving as a method for optimizing weights and biases through gradient descent. Conceptually, it is akin to an iterative refinement process where the network’s output error is propagated backward, layer by layer, using the chain rule of calculus. This backward flow of error information allows for the computation of gradients, which inform the magnitude and direction of adjustments to be made to the network’s parameters. The objective is to iteratively reduce the differences between the predicted output and the actual target values. This systematic adjustment of parameters, guided by error gradients, incrementally leads to a more accurate ANN model.
Challenge
Teamwork: When we talk about ANNs, we also consider their parameters. But what are the parameters? Draw a small neural network with the following 3 layers:
- Input Layer: 3 neurons
- Hidden Layer: 4 neurons
- Output Layer: 1 neuron
- Connect each neuron in the input layer to every neuron in the hidden layer (next layer). How many connections (weights) do we have?
- Now, add a bias for each neuron in the hidden layer. How many biases do we have?
- Repeat the process for the hidden layer to the output layer.
- Input layer to hidden layer: (3 neurons × 4 neurons) + 4 biases = 16 parameters
- Hidden layer to output layer: (4 neurons × 1 neuron) + 1 bias = 5 parameters
- Total parameters for this network: 16 + 5 = 21
Challenge
Q: Add another hidden layer with 4 neurons to the previous ANN and calculate the number of parameters.
We would add:
- 4 × 4 = 16 weights from the first hidden layer to the second hidden layer
- 4 biases for the new hidden layer
- 4 × 1 weights from the second hidden layer to the output layer, which simply replace the previously counted hidden-to-output weights (and the output-layer bias was already counted)

That's an additional 16 + 4 = 20 parameters, bringing our total to 21 + 20 = 41 parameters.
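We can check this arithmetic with a short helper (a sketch; the function is ours, not part of the lesson's required code):

PYTHON
# Count weights and biases of a fully connected feedforward network
def count_params(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus one bias per output neuron
    return total

print(count_params([3, 4, 1]))     # 21
print(count_params([3, 4, 4, 1]))  # 41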
5.2. Transformers
As mentioned in the introduction, most of the recent NLP models are built based on Transformers. Building on our understanding of ANNs, let's explore the architecture of transformers. Transformers consist of several key components that work together to process and generate data.
Activity
Teamwork: We go back to the first figure of this episode. In the simplified schematic below, write the function of each component in the allocated textbox:
A:
Briefly, we can say:
- Encoder: Processes input text into contextualized representations, enabling the understanding of the context within the input sequence. It is like the ‘listener’ in a conversation, taking in information and understanding it.
- Decoder: Generates output sequences by translating the contextualized representations from the encoder into coherent text, often using mechanisms like masked multi-head attention and encoder-decoder attention to maintain sequence order and coherence. This acts as the ‘speaker’ in the conversation, generating the output based on the information processed by the encoder.
- Positional Encoding: Adds unique information to each word embedding, indicating the word's position in the sequence, which is essential for the model to maintain the order of words and understand their relative positions within a sentence.
- Input Embedding: The input text is converted into vectors that the model can understand. Think of it as translating words into a secret code that the transformer can read.
- Output Embedding: Similar to input embedding, but for the output text. It translates the transformer’s secret code back into words we can understand.
- Softmax Output: Applies the softmax function to the final layer’s outputs to convert them into a probability distribution, which helps in tasks like classification and sequence generation by selecting the most likely next word or class. It is like choosing the best response in a conversation from many options.
Attention Mechanism
So far, we have learned what the architecture of a transformer block looks like. However, for simplicity, many parts of this architecture have not been considered.
In the following section, we will show the underlying components of a transformer.
For more details see source.
Attention mechanisms in transformers allow LLMs to focus on different parts of the input text to understand context and relationships between words. The concept of 'attention' in encoders and decoders is akin to the selective focus of 'fast reading,' where one zeroes in on crucial information and disregards the irrelevant. This mechanism adapts to the context of a query, emphasizing different words or tokens based on the query's intent. For instance, in the sentence "Sarah went to a restaurant to meet her friend that night," the words highlighted would vary depending on whether the question is about the action (What?), location (Where?), individuals involved (Who?), or time (When?).
In transformer models, this selective focus is achieved through ‘queries,’ ‘keys,’ and ‘values,’ all represented as vectors. A query vector seeks out the closest key vectors, which are encoded representations of values. The relationship between words, like ‘where’ and ‘restaurant,’ is determined by their frequency of co-occurrence in sentences, allowing the model to assign greater attention to ‘restaurant’ when the query pertains to a location. This dynamic adjustment of focus enables transformers to process language with a nuanced understanding of context and relevance.
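To make queries, keys, and values concrete, here is a toy NumPy sketch of scaled dot-product attention; the random vectors stand in for the learned projections of four tokens:

PYTHON
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))  # queries: what each token is looking for
K = rng.normal(size=(4, 3))  # keys: what each token offers
V = rng.normal(size=(4, 3))  # values: the information carried by each token

# Each row of `weights` says how much one token attends to every other token
scores = Q @ K.T / np.sqrt(3)
weights = softmax(scores)
output = weights @ V
print(weights.round(2))  # rows sum to 1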
Discussion
Teamwork: Have you heard of any other applications of the Transformers rather than in NLPs? Explain why transformers can be useful for other AI applications. Share your thoughts and findings with other groups.
A: Transformers, initially popular in NLP, have found applications beyond text analysis. They excel in computer vision, speech recognition, and even genomics. Their versatility extends to music generation and recommendation systems. Transformers’ innovative architecture allows them to adapt to diverse tasks, revolutionizing AI applications.
Transformers in Text Translation
Imagine you want to translate the sentence “What time is it?” from English to German using a transformer. The input embedding layer converts each English word into a vector. The six layers of encoders process these vectors, understanding the context of the sentence. The six layers of decoders then start generating the German translation, one word at a time.
For each word, the Softmax output predicts the most likely next word in German. The output embedding layer converts these predictions back into readable German words. By the end, you get the German translation of “What time is it?” as “Wie spät ist es?”
Pipelines
The pipeline module from Hugging Face’s transformers library is a high-level API that simplifies the use of complex machine learning models for a variety of NLP tasks. It is a versatile tool for NLP tasks, enabling users to perform text generation, sentiment analysis, question answering, summarization, and translation with minimal code. By abstracting away the intricacies of model selection, tokenization, and output generation, the pipeline module makes state-of-the-art AI accessible to developers of all skill levels, allowing them to harness the power of language models efficiently and intuitively.
Transformers are essential for NLP tasks because they overcome the limitations of earlier models like recurrent neural networks (RNNs) and long short-term memory models (LSTMs), which struggled with long sequences and were computationally intensive respectively. Transformers, in contrast to the sequential input processing of RNNs, handle entire sequences simultaneously. This parallel processing capability enables data scientists to employ GPUs to train large language models (LLMs) based on transformers, which markedly decreases the duration of training.
5.3. Sentiment Analysis
Sentiment analysis is a powerful tool in NLP that helps determine the emotional tone behind the text. It is used to understand opinions, sentiments, emotions, and attitudes from various entities and classify them according to their polarity.
Activity
Teamwork: How would you categorize the following text in terms of positive or negative tone? Select an emoji.
“A research team has unveiled a novel ligand exchange technique that enables the synthesis of organic cation-based perovskite quantum dots (PQDs), ensuring exceptional stability while suppressing internal defects in the photoactive layer of solar cells.” source
Computer models can do this job for us! Let’s see how it works through a step-by-step example: First, install the required libraries and pipelines:
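For example (assuming the Hugging Face transformers package; installation is a one-off step):

PYTHON
# Install once, e.g. from a terminal or notebook cell:
# pip install transformers
from transformers import pipeline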
Now, initialize the sentiment analysis pipeline and analyze the sentiment of a sample text:
PYTHON
sentiment_pipeline = pipeline('sentiment-analysis')
text = " A research team has unveiled a novel ligand exchange technique that enables the synthesis of organic cation-based perovskite quantum dots (PQDs), ensuring exceptional stability while suppressing internal defects in the photoactive layer of solar cells."
sentiment = sentiment_pipeline(text)
After the analysis is completed, you can print out the results:
PYTHON
print(f"Sentiment: {sentiment[0]['label']}, Confidence: {sentiment[0]['score']:.2f}")
Output: Sentiment: POSITIVE, Confidence: 1.00
In this example, the sentiment analysis pipeline from the Hugging Face library is used to analyze the sentiment of a research paper abstract. The model predicts the sentiment as positive, negative, or neutral, along with a confidence score. This can be particularly useful for gauging the reception of research papers in a field.
Activity
Teamwork: Fill in the blanks to complete the sentiment analysis process: Install the __________ library for sentiment analysis. Use the __________ function to create a sentiment analysis pipeline. The sentiment analysis model will output a __________ and a __________ score.
VADER
Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon and rule-based sentiment analysis tool that is particularly attuned to sentiments expressed in social media. VADER analyzes the sentiment of the text and returns a dictionary with scores for negative, neutral, positive, and a compound score that aggregates them. It is useful for quick sentiment analysis, especially on social media texts. Let's see how we can use this framework.
First, we need to import the SentimentIntensityAnalyzer class from the vaderSentiment library:
PYTHON
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Initialize VADER sentiment intensity analyzer:
analyzer = SentimentIntensityAnalyzer()
# We use the same sample text:
text = " A research team has unveiled a novel ligand exchange technique that enables the synthesis of organic cation-based perovskite quantum dots (PQDs), ensuring exceptional stability while suppressing internal defects in the photoactive layer of solar cells."
# Now we can analyze sentiment:
vader_sentiment = analyzer.polarity_scores(text)
# Print the sentiment:
print(f"Sentiment: {vader_sentiment}")
Output: Sentiment: {'neg': 0.069, 'neu': 0.818, 'pos': 0.113, 'compound': 0.1779}
Discussion
Teamwork: Which framework do you think could be more helpful for research applications? Elaborate your opinion. Share your thoughts with other team members.
A: Transformers use deep learning models that can understand context and nuances of language, making them suitable for complex and lengthy texts. They can be particularly useful for sentiment analysis of research papers, as they can understand the complex language and context often found in academic writing. This allows for a more nuanced understanding of the sentiment conveyed in the papers. VADER, on the other hand, is a rule-based model that excels in analyzing short texts with clear sentiment expressions, often found in social media.
Challenge
Use the transformers library to perform sentiment analysis on the following text:
“Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers.”
Print the original text and the sentiment score and label. You can use the following code to load the transformers library and the pre-trained model and tokenizer for sentiment analysis:
A:
PYTHON
from transformers import pipeline

# Create the sentiment analysis pipeline with its default pre-trained model
sentiment_analysis = pipeline("sentiment-analysis")

text = "Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers."

print(text)
print(sentiment_analysis(text))
Output:
"Perovskite nanocrystals have emerged as a promising class of materials for next-generation optoelectronic devices due to their unique properties. Their crystal structure allows for tunable bandgaps, which are the energy differences between occupied and unoccupied electronic states. This tunability enables the creation of materials that can absorb and emit light across a wide range of the electromagnetic spectrum, making them suitable for applications like solar cells, light-emitting diodes (LEDs), and lasers."
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
Challenge
Compare the transformers pipeline with VADER on a larger text. You can load a suitable dataset from the Hugging Face datasets library.
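A possible starting point (a sketch; it assumes the datasets and vaderSentiment packages and uses the imdb dataset purely as an example of longer texts):

PYTHON
from datasets import load_dataset
from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# A handful of long movie reviews as example texts
reviews = load_dataset("imdb", split="test[:5]")["text"]

transformer_clf = pipeline("sentiment-analysis")
vader = SentimentIntensityAnalyzer()

for review in reviews:
    t = transformer_clf(review, truncation=True)[0]    # truncate to the model's max length
    v = vader.polarity_scores(review)["compound"]      # compound score in [-1, 1]
    print(f"Transformer: {t['label']} ({t['score']:.2f}) | VADER compound: {v:.2f}")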
5.4. Text Summarization
Text summarization is the process of distilling the most important information from a source (or sources) to produce an abbreviated version for a particular user and task. It can be broadly classified into two types: extractive and abstractive summarization.
Discussion
How are extractive and abstractive summarization methods different? Connect the following text boxes to the correct category. Share your results with other group members.
A:
Now, let's see how to use the Hugging Face Transformers library to perform abstractive summarization. First, import the pipeline from the transformers library and create a summarizer:
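A minimal setup sketch (the pipeline downloads a default pre-trained summarization model):

PYTHON
from transformers import pipeline

# Create an abstractive summarization pipeline
summarizer = pipeline("summarization")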
Input a sample text from an article from source:
text = “A groundbreaking research breakthrough in solar energy has propelled the development of the world’s most efficient quantum dot (QD) solar cell, marking a significant leap towards the commercialization of next-generation solar cells. This cutting-edge QD solution and device have demonstrated exceptional performance, retaining their efficiency even after long-term storage. Led by Professor Sung-Yeon Jang from the School of Energy and Chemical Engineering at UNIST, a team of researchers has unveiled a novel ligand exchange technique. This innovative approach enables the synthesis of organic cation-based perovskite quantum dots (PQDs), ensuring exceptional stability while suppressing internal defects in the photoactive layer of solar cells. Our developed technology has achieved an impressive 18.1% efficiency in QD solar cells,” stated Professor Jang. This remarkable achievement represents the highest efficiency among quantum dot solar cells recognized by the National Renewable Energy Laboratory (NREL) in the United States. The increasing interest in related fields is evident, as last year, three scientists who discovered and developed QDs, as advanced nanotechnology products, were awarded the Nobel Prize in Chemistry. QDs are semiconducting nanocrystals with typical dimensions ranging from several to tens of nanometers, capable of controlling photoelectric properties based on their particle size. PQDs, in particular, have garnered significant attention from researchers due to their outstanding photoelectric properties. Furthermore, their manufacturing process involves simple spraying or application to a solvent, eliminating the need for the growth process on substrates. This streamlined approach allows for high-quality production in various manufacturing environments. However, the practical use of QDs as solar cells necessitates a technology that reduces the distance between QDs through ligand exchange, a process that binds a large molecule, such as a ligand receptor, to the surface of a QD. Organic PQDs face notable challenges, including defects in their crystals and surfaces during the substitution process. As a result, inorganic PQDs with limited efficiency of up to 16% have been predominantly utilized as materials for solar cells. In this study, the research team employed an alkyl ammonium iodide-based ligand exchange strategy, effectively substituting ligands for organic PQDs with excellent solar utilization. This breakthrough enables the creation of a photoactive layer of QDs for solar cells with high substitution efficiency and controlled defects. Consequently, the efficiency of organic PQDs, previously limited to 13% using existing ligand substitution technology, has been significantly improved to 18.1%. Moreover, these solar cells demonstrate exceptional stability, maintaining their performance even after long-term storage for over two years. The newly-developed organic PQD solar cells exhibit both high efficiency and stability simultaneously. Previous research on QD solar cells predominantly employed inorganic PQDs,” remarked Sang-Hak Lee, the first author of the study. Through this study, we have demonstrated the potential by addressing the challenges associated with organic PQDs, which have proven difficult to utilize. This study presents a new direction for the ligand exchange method in organic PQDs, serving as a catalyst to revolutionize the field of QD solar cell material research in the future,” commented Professor Jang. The findings of this study, co-authored by Dr. 
Javid Aqoma Khoiruddin and Sang-Hak Lee, have been published online in Nature Energy on January 27, 2024. The research was made possible through the support of the ‘Basic Research Laboratory (BRL)’ and ‘Mid-Career Researcher Program,’ as well as the ‘Nano·Material Technology Development Program,’ funded by the National Research Foundation of Korea (NRF) under the Ministry of Science and ICT (MSIT). It has also received support through the ’Global Basic Research Lab Project.”
Now we can perform summarization and print the results:
PYTHON
summary = summarizer(text, max_length=130, min_length=30, do_sample=False)
# Print the summary:
print("Summary:", summary[0]['summary_text'])
Output:
Sumy for summarization
Sumy is a Python library for extractive summarization. It uses algorithms like LSA to rank sentences based on their importance and creates a summary by selecting the top-ranked sentences. We can see how it works in practice. We start by importing the PlaintextParser, Tokenizer, and LsaSummarizer modules:
PYTHON
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
To create a parser we use the same text sample from an article from source:
PYTHON
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Next, we initialize the LSA summarizer:
summarizer = LsaSummarizer()

# Summarize the text into 5 sentences and print the results
summary = summarizer(parser.document, 5)
for sentence in summary:
    print(sentence)
Output:
Sumy extracts key sentences from the original text, which can be quicker but may lack the cohesiveness of an abstractive summary. On the other hand, Transformer is suitable for generating a new summary that captures the text’s essence in a coherent and often more readable form.
Activity
Teamwork: Which framework could be more useful for text summarization in your field of research? Explain why.
A: Transformers are particularly useful for summarizing research papers and documents where understanding the context and generating a coherent summary is crucial. They can produce summaries that are not only concise but also maintain the narrative flow, making them more readable. Sumy, while quicker and less resource-intensive, is best suited for scenarios where extracting key information without the need for narrative flow is acceptable.
Challenge
Use the transformers library to perform text summarization on the following text [generated by Copilot]:
text: “Perovskite nanocrystals are a class of semiconductor nanocrystals that have attracted a lot of attention in recent years due to their unique optical and electronic properties. Perovskite nanocrystals have an ABX3 composition, where A is a monovalent cation (such as cesium, methylammonium, or formamidinium), B is a divalent metal (such as lead or tin), and X is a halide (such as chloride, bromide, or iodide). Perovskite nanocrystals can emit brightly across the entire visible spectrum, with tunable colors depending on their composition and size. They also have high quantum yields, fast radiative decay rates, and narrow emission line widths, making them ideal candidates for various optoelectronic applications. The first report of perovskite nanocrystals was published in 2014 by Protesescu et al., who synthesized cesium lead halide nanocrystals using a hot-injection method. They demonstrated that the nanocrystals had cubic or orthorhombic crystal structures, depending on the halide ratio, and that they exhibited strong photoluminescence with quantum yields up to 90%. They also showed that the emission wavelength could be tuned from 410 nm to 700 nm by changing the halide composition or the nanocrystal size. Since then, many other groups have developed various synthetic methods and strategies to control the shape, size, composition, and surface chemistry of perovskite nanocrystals. One of the remarkable features of perovskite nanocrystals is their defect tolerance, which means that they can maintain high luminescence even with a high density of surface or bulk defects. This is in contrast to other semiconductor nanocrystals, such as CdSe, which require surface passivation to prevent non-radiative recombination and quenching of the emission. The defect tolerance of perovskite nanocrystals is attributed to their electronic band structure, which has a large density of states near the band edges and a small effective mass of the charge carriers. These factors reduce the formation energy and the localization of defects and enhance the radiative recombination rate of the excitons. Another interesting aspect of perovskite nanocrystals is their weak quantum confinement, which means that their emission properties are not strongly affected by their size. This is because the exciton binding energy of perovskite nanocrystals is much larger than the quantum confinement energy, and thus the excitons are localized within a few unit cells regardless of the nanocrystal size. As a result, perovskite nanocrystals can exhibit narrow emission line widths even with a large size distribution, which simplifies the synthesis and purification processes. Moreover, perovskite nanocrystals can show dual emission from both the band edge and the surface states, which can be exploited for color tuning and white light generation. Perovskite nanocrystals have been applied to a wide range of photonic devices, such as light-emitting diodes, lasers, solar cells, photodetectors, and scintillators. Perovskite nanocrystals can offer high brightness, color purity, and stability as light emitters, and can be integrated with various substrates and architectures. Perovskite nanocrystals can also act as efficient light absorbers and charge transporters and can be coupled with other materials to enhance the performance and functionality of the devices. Perovskite nanocrystals have shown promising results in terms of efficiency, stability, and versatility in these applications. 
However, perovskite nanocrystals also face some challenges and limitations, such as the toxicity of lead, the instability under ambient conditions, the hysteresis and degradation under electrical or optical stress, and the reproducibility and scalability of the synthesis and fabrication methods. These issues need to be addressed and overcome to realize the full potential of perovskite nanocrystals in practical devices. Therefore, further research and development are needed to improve the material quality, stability, and compatibility of perovskite nanocrystals, and to explore new compositions, structures, and functionalities of these fascinating nanomaterials.”
A: You can use the following code to load the transformers library and the pre-trained model and tokenizer for text summarization:
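One possible solution (a sketch; it relies on the pipeline's default model and truncates input that exceeds the model's maximum length):

PYTHON
from transformers import pipeline

# Create the summarization pipeline with its default pre-trained model
summarizer = pipeline("summarization")

# `text` holds the perovskite nanocrystals passage above
summary = summarizer(text, max_length=130, min_length=30,
                     do_sample=False, truncation=True)

print(text)
print("Summary:", summary[0]['summary_text'])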
Key Points
- Transformers revolutionized NLP by processing words in parallel through an attention mechanism, capturing context more effectively than sequential models
- The summation and activation function within a neuron transform inputs through weighted sums and biases, followed by an activation function to produce an output.
- Transformers consist of encoders, decoders, positional encoding, input/output embedding, and softmax output, working together to process and generate data.
- Transformers are not limited to NLP and can be applied to other AI applications due to their ability to handle complex data patterns.
- Sentiment analysis and text summarization are practical applications of transformers in NLP, enabling the analysis of emotional tone and the creation of concise summaries from large texts.
Content from Large Language Models
Last updated on 2024-05-12 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- What are the main features of large language models?
- How is BERT different from GPT models?
- How can I use open-source LLMs, such as those available on Hugging Face, for research tasks?
Objectives
- Be able to explain the structure of large language models and their main components
- Identify differences between BERT and GPT.
- Be able to use open-source LLMs, such as those available through Hugging Face, for text summarization, classification, and generation.
6.1. Introduction to LLMs
Large Language Models (LLMs) have become a cornerstone of modern natural language processing (NLP). Since the introduction of the transformer architecture in 2017, LLMs have leveraged this design to achieve remarkable language understanding and generation capabilities. In the previous episode, we discussed the transformer architecture, which is integral to all LLMs, utilizing its encoder and decoder components to process language.
LLMs have several key features.

Challenge
Fill in the above feature placeholders. Discuss what are these key components. Explain the key features in detail and compare your thoughts with the other group members:
Transformer Architecture: A neural network design that uses self-attention mechanisms to weigh the influence of different parts of the input data.
Pre-training: involves teaching LLMs to anticipate words in sentences, using either bi-directional or uni-directional approaches, (based on the LLM type), without the need for understanding or experience.
Word/Token Embedding: The process of converting words or phrases into numerical form (vectors) that computers can understand.
Recall embeddings? We covered this concept in the Word Embedding episode.
Context Window: The range of words the model considers for predicting the next word or understanding the current word within a sentence.
Parameters: The aspects of the model that are learned from training data and determine the model’s behavior.
Transfer Learning: The process LLMs use to apply their prior knowledge to new tasks.
Thus, the completed graph will be:
We can categorize LLMs based on the transformer architecture. Let's have another look at the transformer architecture; this time we categorize models based on the two main components: the encoder and the decoder. LLMs can be designed to handle different tasks based on their underlying transformer blocks and whether they have encoder-only, decoder-only, or encoder-decoder layers.
Challenge
How do you think we should connect each one of the following transformers to the correct color?
- Encoders are used for understanding tasks like sentence classification.
- Decoders excel in generative tasks like text generation.
- The combination of encoders and decoders in transformers allows them to be versatile and perform a variety of tasks, from translation to summarization, depending on the specific requirements of the task at hand.
Encoder Vs. Decoder and/or BERT Vs. GPT
We will see models like BERT use encoders for bidirectional understanding, and models like GPT use decoders for generating coherent text, making them suitable for chatbots or virtual assistants.
Discussion
Teamwork: Think of some examples of traditional NLP models, such as n-gram models, hidden Markov models, LSTMs, and RNNs. How do large language models differ from them in terms of architecture, data, and performance?
A: Traditional NLP models, such as n-gram models, hidden Markov models (HMMs), Long Short-Term Memory Networks (LSTMs), and Recurrent Neural Networks (RNNs), differ significantly from the recent LLMs. N-gram models predict the next item in a sequence based on the previous ‘n-1’ items without any deep understanding of context. HMMs are statistical models that output probabilities of sequences and are often used for tasks like part-of-speech tagging. LSTMs and RNNs are types of neural networks that can process sequences of data and are capable of learning order dependence in sequence prediction.
Compared to these traditional models, LLMs have several key differences: - Architecture: Novel LLMs use transformer architectures, which are more advanced than the simple recurrent units of RNNs or the gated units of LSTMs. Transformers use self-attention to weigh the influence of different parts of the input data, which is more effective for understanding context. - Data: Novel LLMs are trained on massive datasets, often sourced from the internet, which allows them to learn a wide variety of language patterns, common knowledge, and even reasoning abilities. Traditional models typically use smaller, more curated datasets. - Performance: Novel LLMs generally outperform traditional models in a wide range of language tasks due to their ability to understand and generate human-like text. They can capture subtleties and complexities of language that simpler models cannot, leading to more accurate and coherent outputs.
6.2. BERT
Bidirectional Encoder Representations from Transformers (BERT) is an LLM that uses an encoder-only architecture from transformers. It is designed to understand the context of a word based on all of its surroundings (bidirectional context). Let’s guess the missing words in the text below to comprehend the workings of BERT:
Challenge
Complete the following paragraph:
“BERT is a revolutionary language model that uses an ______ (encoder) to process words in a sentence. Unlike traditional models, it predicts words based on the ______ rather than in sequence. Its training involves ______, where words are intentionally hidden, or ______, and the model learns to predict them. This results in rich ______ that capture the nuanced meanings of words.”
“BERT is a revolutionary language model that uses an encoder to process words in a sentence. Unlike traditional models, it predicts words based on the context rather than in sequence. Its training involves self-supervised learning, where words are intentionally hidden, or ‘masked’, and the model learns to predict them. This results in rich embeddings that capture the nuanced meanings of words.”
Pre-training of language models involves a process where models like BERT and GPT learn to predict words in sentences without specific task training. This is achieved through methods like the Masked Language Model (MLM) for bi-directional models, which predict masked words using surrounding context. MLM in BERT predicts missing words in a sentence by masking them during training.
For Next Sentence Prediction (NSP) BERT learns to predict if two sentences logically follow each other.
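As a quick illustration of masked language modeling, we can use the transformers fill-mask pipeline with a BERT checkpoint to predict a masked word (a minimal sketch; the example sentence is our own):
PYTHON
from transformers import pipeline

# BERT was pre-trained with masked language modeling, so it can fill in [MASK]
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Print the top candidate words for the masked position with their scores
for prediction in unmasker("NLP is transforming the way researchers [MASK] text."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")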
6.3. GPT
Generative Pretrained Transformer (GPT) models, on the other hand, use a decoder-only architecture. They excel at generating coherent and contextually relevant text. Check the following table, which summarizes three different LLMs. The middle column is missing some information about GPT models. With the help of your teammates, complete the table and explain the differences at the end.
Challenge
Fill in the gray boxes with the correct explanations.
Discussion
Teamwork: From what you learned above how can you explain the differences between the three LLM types? Discuss in groups and share your answers.
A: We can see it as the difference between trying to understand a conversation (BERT) and trying to decide what to say next in the conversation (GPT). BERT is like someone who listens to the entire conversation before and after a word to really understand its meaning.
For example, in the sentence “I ate an apple,” BERT would look at both the words before and after a hidden word to figure out what it should be. It’s trained by playing a game of ‘guess the missing word,’ where some words are hidden (masked) and it has to use the surrounding context to fill in the blanks.
GPT, on the other hand, is like a storyteller who only needs to know what was said before to continue the tale. It would take “I ate an” and predict that the next word might be “apple.” It learns by reading a lot of text and practicing how to predict the next word in a sentence.
Both are smart in their own ways, but they’re used for different types of language tasks. BERT is great for understanding the context of words, while GPT is excellent at generating new text based on what it’s seen before. The following schematics demonstrate their performance differences:
Models are often benchmarked using standardized datasets and metrics. The Holistic Evaluation of Language Models (HELM) by Stanford provides a comprehensive framework for evaluating LLMs across multiple dimensions.
On many of these benchmarks, GPT-4 models currently outperform other LLMs in terms of accuracy.
Discussion
What are some examples of LLMs, and how are they trained and used for research tasks? Consider some of the main features and characteristics of LLMs, such as transformer architecture, self-attention mechanism, pre-training and fine-tuning, and embedding capabilities. How do these features enable LLMs to perform various NLP tasks, such as text classification, text generation, or question answering?
Challenge
How can we compare different LLMs? Are there any benchmarks?
A: Comparing performance (benchmarking):
1. Performance can be compared based on the model’s architecture, computational efficiency, and suitability for specific tasks.
2. Benchmarks and leaderboards (such as HELM) can provide insights into how different models perform on standardized datasets.
3. Community feedback and use-case studies can also inform the practical effectiveness of different LLMs.
6.4. Open-Source LLMs
It is very important for researchers to have open access to capable LLMs for their studies. Fortunately, some organizations are supporting open-source LLMs. The BLOOM model, developed by the BigScience Workshop in collaboration with Hugging Face and other organizations, was released on July 6, 2022. It offers a wide range of model sizes, from 1.1 billion to 176 billion parameters, and is licensed under the open RAIL-M v1 license. BLOOM is known for its instruct models, coding capabilities, and customization through fine-tuning, as well as for being open source. It is more openly accessible and benefits from a large community and extensive support.
On the other hand, the LLaMA model, developed by Meta AI, was released on February 24, 2023. It is available in four sizes: 7 billion, 13 billion, 33 billion, and 65 billion parameters. The license for LLaMA is restricted to noncommercial use, and access is primarily for researchers. Despite its smaller size, LLaMA is parameter-efficient and has outperformed GPT-3 on many benchmarks. However, its accessibility is more gated compared to BLOOM, and community support is limited to approved researchers.
Now let’s summarize what we learned here in the following table:

Hugging Face provides several different LLMs. Now we want to see how we can use an open-source model with the Hugging Face datasets library. We will go through the process of setting up the environment, installing the necessary libraries, loading a dataset, and then using an LLM to process the data. We will start with setting up the environment.
Heads up
Before we begin, ensure that you have Python installed on your system. Python 3.8 or later is recommended, since recent versions of the transformers library no longer support older releases. You can download Python from the official Python website.
Next, we will install the necessary libraries through the terminal or command prompt:
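For example (the exact package list here is our assumption based on the code that follows):
BASH
pip install transformers datasets torch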
We use the squad dataset here, which is a question-answering dataset. Question answering is one of the main goals of utilizing LLMs for research projects. When you run the full script below, the expected output is the answer to a question based on the provided context. Here is how to load the dataset:
PYTHON
from datasets import load_dataset
# Load the SQuAD dataset
squad_dataset = load_dataset('squad')
# Print the first example in the training set
print(squad_dataset['train'][0])
Now, we can load a pre-trained model from Hugging Face. Note that the plain bert-base-uncased checkpoint has an untrained question-answering head, so to get meaningful answers we load a BERT variant that has already been fine-tuned on SQuAD:
PYTHON
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
# Load a BERT checkpoint fine-tuned on SQuAD for extractive question answering
model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
We need to define the question and context here:
PYTHON
question = "What is the name of the university in Paris that was founded in 1257?"
context = "The University of Paris, founded in 1257, is often referred to as the Sorbonne after the college created by Robert de Sorbon. It is one of the world's oldest universities."
Recall that before we can feed data into the model, we must tokenize it. Once the data is tokenized, we can use the model to make predictions. Here is how to tokenize our question together with its context:
PYTHON
# Tokenize the question and context as a single input pair
inputs = tokenizer(question, context, return_tensors='pt')
# Get model predictions
outputs = model(**inputs)
Note that the model outputs are raw logits. We need to convert these into an answer by selecting the tokens with the highest start and end scores:
PYTHON
import torch
# Find the tokens with the highest `start` and `end` scores
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
# Convert tokens to the answer string
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))
print(answer)
This will print the answer to the question based on the context we provided. In this case, the output should be similar to:
the sorbonne
This output indicates that the model has correctly identified “the Sorbonne” as the name of the university in Paris founded in 1257, based on the context given. Remember, the actual output may vary slightly depending on the model version and the specific weights used at the time of inference.
Discussion
Teamwork: What are the challenges and implications of LLMs, such as scalability, generalization, and social impact? What does it mean when an LLM hallucinates?
Challenge
Use the OpenAI library to access a hosted LLM for text summarization (note that OpenAI’s models are proprietary rather than open source). With the current openai Python client (v1 and later), a summarization helper can be sketched as follows; the model name is illustrative, and a valid API key is required:
PYTHON
from openai import OpenAI

client = OpenAI(api_key="sk-<your_api_key>")

def summarize(text):
    # Ask a chat model to produce a short summary of the given text
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; any available chat model works
        messages=[
            {"role": "system", "content": "Summarize the user's text in a short paragraph."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
Use the summarize helper to summarize the following text.
“Perovskite nanocrystals are a class of semiconductor nanocrystals, which exhibit unique characteristics that separate them from traditional quantum dots. Perovskite nanocrystals have an ABX3 composition where A = cesium, methylammonium (MA), or formamidinium (FA); B = lead or tin; and X = chloride, bromide, or iodide. Their unique qualities largely involve their unusual band structure which renders these materials effectively defect-tolerant or able to emit brightly without surface passivation. This is in contrast to other quantum dots such as CdSe which must be passivated with an epitaxially matched shell to be bright emitters. In addition to this, lead-halide perovskite nanocrystals remain bright emitters when the size of the nanocrystal imposes only weak quantum confinement. This enables the production of nanocrystals that exhibit narrow emission linewidths regardless of their polydispersity. The combination of these attributes and their easy-to-perform synthesis has resulted in numerous articles demonstrating the use of perovskite nanocrystals as both classical and quantum light sources with considerable commercial interest. Perovskite nanocrystals have been applied to numerous other optoelectronic applications such as light-emitting diodes, lasers, visible communication, scintillators, solar cells, and photodetectors. The first report of perovskite nanocrystals was published in 2014 by Protesescu et al., who synthesized cesium lead halide nanocrystals using a hot-injection method. They showed that the nanocrystals can emit brightly when excited by ultraviolet or blue light, and their colors are tunable across the entire visible spectrum by changing the halide from chloride (UV/blue) to bromide (green) and iodide (red). They also demonstrated that the nanocrystals can be incorporated into thin films and show high photoluminescence quantum yields (PLQYs) of up to 90%. Since then, many other synthetic methods have been developed to produce perovskite nanocrystals with different shapes, sizes, compositions, and surface ligands. Some of the common methods include ligand-assisted reprecipitation, antisolvent precipitation, solvothermal synthesis, microwave-assisted synthesis, and microfluidic synthesis. Perovskite nanocrystals can be classified into different types based on their structure, dimensionality, and composition. The most common type is the three-dimensional (3D) perovskite nanocrystals, which have a cubic or orthorhombic crystal structure and a band gap that depends on the halide content. The 3D perovskite nanocrystals can be further divided into pure halide perovskites (such as CsPbX3) and mixed halide perovskites (such as CsPb(Br/I)3), which can exhibit color tuning, anion exchange, and halide segregation phenomena. Another type is the two-dimensional (2D) perovskite nanocrystals, which have a layered structure with organic cations sandwiched between inorganic perovskite layers. The 2D perovskite nanocrystals have a quantum well-like band structure and a band gap that depends on the thickness of the perovskite layers. The 2D perovskite nanocrystals can also be mixed with 3D perovskite nanocrystals to form quasi-2D perovskite nanocrystals, which can improve the stability and emission efficiency of the nanocrystals. A third type is the metal-free perovskite nanocrystals, which replace the metal cations (such as Pb or Sn) with other elements (such as Bi or Sb). 
The metal-free perovskite nanocrystals have a lower toxicity and higher stability than the metal-based perovskite nanocrystals, but they also have a lower PLQY and a broader emission linewidth. The development of perovskite nanocrystals in the past few years has been remarkable, with significant advances in synthesis, characterization, and application. However, there are still some challenges and opportunities for further improvement. One of the major challenges is the stability of perovskite nanocrystals, which are sensitive to moisture, oxygen, heat, light, and electric fields. These factors can cause degradation, phase transition, and non-radiative recombination of the nanocrystals, resulting in reduced emission intensity and color stability. Several strategies have been proposed to enhance the stability of perovskite nanocrystals, such as surface passivation, encapsulation, doping, alloying, and embedding in matrices. Another challenge is the toxicity of perovskite nanocrystals, which are mainly composed of lead, a heavy metal that can cause environmental and health hazards. Therefore, there is a need to develop lead-free or low-lead perovskite nanocrystals that can maintain the high performance and tunability of the lead-based ones. Some of the promising candidates include tin-based, bismuth-based, and antimony-based perovskite nanocrystals. A third challenge is the scalability and integration of perovskite nanocrystals, which are essential for practical applications. There is a need to develop cost-effective and large-scale synthesis methods that can produce high-quality and uniform perovskite nanocrystals. Moreover, there is a need to develop efficient and reliable fabrication techniques that can integrate perovskite nanocrystals into various devices and platforms. In conclusion, perovskite nanocrystals are a fascinating class of nanomaterials that have shown remarkable potential for various photonic applications. They have unique properties such as defect tolerance, high quantum yield, fast radiative decay, and narrow emission linewidth in weak confinement, which make them ideal candidates for light emission devices. They also have a wide color tunability from ultraviolet to near-infrared regions, which makes them suitable for various wavelength-dependent applications. However, there are still some challenges that need to be overcome, such as stability, toxicity, scalability, and integration. Therefore, further research and development are needed to address these issues and to explore new opportunities for perovskite nanocrystals in the field of nanophotonics.
Print the summarized text.
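A possible answer, reusing the summarize helper sketched above (the exact summary will vary by model and run):
PYTHON
# Paste the full passage above into `text` (shortened here for readability)
text = """Perovskite nanocrystals are a class of semiconductor nanocrystals, which exhibit unique characteristics that separate them from traditional quantum dots. ..."""

print(summarize(text))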
Challenge
Use the Hugging Face transformers library to access and use an open-source model for text classification. Because our categories (metals, ceramics, polymers, and composites) are custom labels, you can use the zero-shot classification pipeline to load a pre-trained model and tokenizer for this task.
Use the text_classifier to classify the following text into one of the categories: metals, ceramics, polymers, or composites. Print the text and the predicted category and score.
Text: "Polyethylene is a thermoplastic polymer that consists of long chains of ethylene monomers. It is one of the most common and widely used plastics in the world. It has many applications, such as packaging, bottles, containers, films, pipes, and cables. Polyethylene can be classified into different grades based on its density, molecular weight, branching, and crystallinity."
A:
PYTHON
from transformers import pipeline

# Zero-shot classification lets us supply our own candidate labels
text_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
text = "Polyethylene is a thermoplastic polymer that consists of long chains of ethylene monomers. It is one of the most common and widely used plastics in the world. It has many applications, such as packaging, bottles, containers, films, pipes, and cables. Polyethylene can be classified into different grades based on its density, molecular weight, branching, and crystallinity."
labels = ["metals", "ceramics", "polymers", "composites"]
result = text_classifier(text, candidate_labels=labels)
print(text)
print(f"Predicted category: {result['labels'][0]} (score: {result['scores'][0]:.4f})")
output: "Polyethylene is a thermoplastic polymer that consists of long chains of ethylene monomers. It is one of the most common and widely used plastics in the world. It has many applications, such as packaging, bottles, containers, films, pipes, and cables. Polyethylene can be classified into different grades based on its density, molecular weight, branching, and crystallinity."
"Predicted category: polymers (score: 0.9988)" (the exact score will vary)
Challenge
Use the Hugging Face transformers library to access and use an open-source LLM for text generation. You can use the following code to load a pre-trained model and tokenizer for text generation:
Use the text_generator to generate a paragraph of text based on the following prompt: “The applications of nanomaterials in material science are”. Print the prompt and the generated text.
PYTHON
from transformers import pipeline
text_generator = pipeline("text-generation")
prompt = "The applications of nanomaterials in material science are"
generated_text = text_generator(prompt)[0]['generated_text']
print(prompt)
print(generated_text)
output (illustrative; the text GPT-2 actually generates will differ between runs):
The applications of nanomaterials in material science are diverse and promising. Nanomaterials are materials that have at least one dimension in the nanometer range (1-100 nm). Nanomaterials can exhibit novel physical, chemical, and biological properties that are different from their bulk counterparts, such as high surface area, quantum confinement, enhanced reactivity, and tunable functionality. Nanomaterials can be used for various purposes in material science, such as improving the performance and functionality of existing materials, creating new materials with unique properties, and enabling new functionalities and devices. Some examples of nanomaterials applications in material science are:
- Nanocomposites: Nanomaterials can be incorporated into other materials, such as polymers, metals, ceramics, or biomaterials, to form nanocomposites that have improved mechanical, thermal, electrical, optical, or magnetic properties. For instance, carbon nanotubes can be used to reinforce polymer composites and increase their strength, stiffness, and conductivity.
- Nanocoatings: Nanomaterials can be used to coat the surface of other materials, such as metals, glass, or plastics, to provide them with enhanced protection, durability, or functionality. For example, titanium dioxide nanoparticles can be used to coat glass and make it self-cleaning, anti-fogging, and anti-bacterial.
- Nanosensors: Nanomaterials can be used to sense and measure various physical, chemical, or biological parameters, such as temperature, pressure, strain, pH, or biomolecules. For example, gold nanoparticles can be used to detect and quantify the presence of specific DNA sequences or proteins by changing their color or fluorescence.
- Nanomedicine: Nanomaterials can be used for various biomedical applications, such as drug delivery, imaging, diagnosis, or therapy. For example, magnetic nanoparticles can be used to deliver drugs to specific target sites in the body by using an external magnetic field, or to enhance the contrast of magnetic resonance imaging (MRI).
Key Points
- LLMs are based on the transformer architecture.
- BERT and GPT have distinct approaches to processing language.
- Open source LLMs provide transparency and customization for research applications.
- Benchmarking with HELM offers a holistic view of model performance.
Content from Domain-Specific LLMs
Last updated on 2024-05-20 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- How can we tune LLMs to be domain-specific?
- What approaches are available to help LLMs solve specific research problems?
- Which approach should I use for my research?
- What are the challenges and trade-offs of domain-specific LLMs?
Objectives
- Be able to identify approaches by which LLMs can be tuned for solving research problems.
- Be able to use introductory approaches for creating domain-specific LLMs.
7.1. Introduction to DSL (Available Approaches)
To enhance the response quality of an LLM on specific problems, we need strategies for tuning it. Generally, there are four ways to enhance the performance of LLMs:
1. Prompt Optimization:
To elicit specific and accurate responses from LLMs by designing prompts strategically.
- Zero-shot Prompting: This is the simplest form of prompting where the LLM is given a task or question without any context or examples. It relies on the LLM’s pre-existing knowledge to generate a response.
“What is the capital of France?” The LLM would respond with “Paris” based on its internal knowledge.
- Few-shot Prompting: In this technique, the LLM is provided with a few examples to demonstrate the expected response format or content.
To determine sentiment, you might provide examples like “I love sunny days. (+1)” and “I hate traffic. (-1)” before asking the LLM to analyze a new sentence.
2. Retrieval Augmented Generation (RAG):
To supplement the LLM’s generative capabilities with information retrieved from external databases or documents.
- Retrieval: The LLM queries a database to find relevant information that can inform its response.
If asked about recent scientific discoveries, the LLM might retrieve articles or papers on the topic.
- Generation: After retrieving the information, the LLM integrates it into a coherent response.
Using the retrieved scientific articles, the LLM could generate a summary of the latest findings in a particular field.
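To make the retrieve-then-generate loop concrete, here is a minimal sketch of our own: it uses TF-IDF retrieval over a toy in-memory corpus in place of a real document database, and the model choice is an assumption, not a recommendation.
PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# A toy "document store"; in practice this would be a vector database
documents = [
    "CRISPR-Cas9 enables precise editing of DNA sequences in living cells.",
    "Perovskite solar cells have recently reached record power-conversion efficiencies.",
    "Transformer models rely on self-attention to capture context in text.",
]
question = "How efficient are perovskite solar cells?"

# Retrieval step: find the document most similar to the question
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
question_vector = vectorizer.transform([question])
best_doc = documents[cosine_similarity(question_vector, doc_vectors)[0].argmax()]

# Generation step: condition the model on the retrieved document
generator = pipeline("text2text-generation", model="google/flan-t5-small")
prompt = f"Answer the question using the context.\nContext: {best_doc}\nQuestion: {question}"
print(generator(prompt, max_length=50)[0]["generated_text"])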
3. Fine-Tuning:
To adapt a general-purpose LLM to excel at a specific task or within a particular domain.
- Language Modeling Task Fine-Tuning: This involves training the LLM on a large corpus of text to improve its ability to predict the next word or phrase in a sentence.
An LLM fine-tuned on legal documents would become better at generating text that resembles legal writing.
- Supervised Q&A Fine-Tuning: Here, the LLM is trained on a dataset of question-answer pairs to enhance its performance on Q&A tasks.
An LLM fine-tuned with medical Q&A pairs would provide more accurate responses to health-related inquiries.
4. Training from Scratch:
Builds a model specifically for a domain, using relevant data from the ground up.
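As a minimal illustration of what “from scratch” means in code (a sketch of our own; the configuration values are arbitrary), the model below is initialized with random weights rather than loaded from a pre-trained checkpoint, and would then be trained entirely on domain data:
PYTHON
from transformers import GPT2Config, GPT2LMHeadModel

# A small transformer configuration for a domain corpus
config = GPT2Config(
    vocab_size=50257,  # could be replaced by a domain-specific tokenizer's vocabulary size
    n_layer=4,         # far smaller than GPT-2's 12 layers, purely for illustration
    n_head=4,
    n_embd=256,
)
# Randomly initialized model: it contains no pre-trained knowledge at all
model = GPT2LMHeadModel(config)
print(f"Parameters to train from scratch: {model.num_parameters():,}")
# From here, the model would be trained on domain text, e.g. with the Trainer API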
Discussion
Teamwork: Which approach do you think is more computation-intensive? Which is more accurate? How are these qualities related? Evaluate the trade-offs between fine-tuning and other approaches.
Discussion
Teamwork: What is DSL and why are they useful for research tasks? Think of some examples of NLP tasks that require domain-specific LLMs, such as literature review, patent analysis, or material discovery. How do domain-specific LLMs improve the performance and accuracy of these tasks?

7.2. Prompting
Prompting is a technique for crafting concise, clear instructions that guide the LLM toward more accurate, domain-specific outputs. In many cases it requires no coding, but keep in mind that the quality of the input critically influences the quality of the output. It is a relatively easy and fast DSL method for harnessing an LLM’s capabilities. We can see how it works in the following example:
PYTHON
# Install the Hugging Face libraries
!pip install transformers datasets
from transformers import pipeline
# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Example research question
question = "What is the role of CRISPR-Cas9 in genome editing?"
# Candidate topics to classify the question
topics = ["Biology", "Technology", "Healthcare", "Genetics", "Ethics"]
# Perform zero-shot classification
result = classifier(question, candidate_labels=topics)
# Output the results
print(f"Question: {question}")
print("Classified under topics with the following scores:")
for label, score in zip(result['labels'], result['scores']):
    print(f"{label}: {score:.4f}")
Which tool do researchers require?
For research applications where highly reliable answers are crucial, Prompt Engineering combined with Retrieval-Augmented Generation (RAG) is often the most suitable approach. This combination allows for flexibility and high-quality outputs by leveraging both the generative capabilities of LLMs and the precision of domain-specific data sources:
When fine-tuning a BERT model from Hugging Face, for instance, it is essential to approach the process with precision and care.
Begin by thoroughly understanding BERT’s architecture and the specific task at hand to select the most suitable model variant and hyperparameters.
Prepare your dataset meticulously, ensuring it is clean, representative, and split correctly to avoid data leakage and overfitting.
Hyperparameter selection, such as learning rates and batch sizes, should be made with consideration, and regularization techniques like dropout should be employed to enhance the model’s ability to generalize.
Evaluate the model’s performance using appropriate metrics and address any class imbalances with weighted loss functions or similar strategies. Save checkpoints to preserve progress and document every step of the fine-tuning process for transparency and reproducibility.
Ethical considerations are paramount; strive for a model that is fair and unbiased. Ensure compliance with data protection regulations, especially when handling sensitive information.
Lastly, manage computational resources wisely and engage with the Hugging Face community for additional support. Fine-tuning is iterative, and success often comes through continuous experimentation and learning.
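Putting these recommendations into a minimal working sketch (our own example; the dataset, subset sizes, and hyperparameters are illustrative assumptions, not prescriptions):
PYTHON
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load a small labeled dataset and a BERT checkpoint
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize and take small subsets so the example runs quickly
train_data = dataset["train"].shuffle(seed=42).select(range(1000)).map(tokenize, batched=True)
eval_data = dataset["test"].shuffle(seed=42).select(range(200)).map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./bert-finetuned",   # checkpoints are saved here
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,              # a conservative learning rate for BERT
    weight_decay=0.01,               # regularization to help generalization
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_data, eval_dataset=eval_data)
trainer.train()
print(trainer.evaluate())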
Challenge
Check the following structure. Guess which optimization strategy is represented in these architectures.
RAG addresses the challenge of real-time data fetching by merging the generative talents of these models with the ability to consult a broad document corpus, enhancing their responses. The potential for live-RAG in chatbots suggests a future where AI can conduct on-the-spot searches, access up-to-date information, and rival search engines in answering timely questions.
Discussion
Teamwork: What are the challenges and trade-offs of domain-specific LLMs, such as data availability, model size, and complexity?
Consider some of the factors that affect the quality and reliability of domain-specific LLMs, such as the amount and quality of domain-specific data, the computational resources and time required for training or fine-tuning, and the generalization and robustness of the model. How do these factors pose problems or difficulties for domain-specific LLMs and how can we overcome them?
Discussion
What are some available approaches for creating domain-specific LLMs, such as fine-tuning and knowledge distillation?
Consider some of the main steps and techniques for creating domain-specific LLMs, such as selecting a general LLM, collecting and preparing domain-specific data, training or fine-tuning the model, and evaluating and deploying the model. How do these approaches differ from each other and what are their advantages and disadvantages?
Now let’s try one-shot and few-shot prompting examples to see how they can enhance the sensitivity of the LLM to our field of study. One-shot prompting involves providing the model with a single example to follow; it is like giving the model a hint about what you expect. We will go through an example using Hugging Face’s transformers library:
PYTHON
from transformers import pipeline
# Load a pre-trained model and tokenizer
model_name = "gpt2"
generator = pipeline('text-generation', model=model_name)
# One-shot example
prompt = "Translate 'Hello, how are you?' to French:\nBonjour, comment ça va?\nTranslate 'I am learning new things every day' to French:"
result = generator(prompt, max_length=100)
# Output the result
print(result[0]['generated_text'])
In this example, we provide the model with one translation example and then ask it to translate a new sentence. The model uses the context from the one-shot example to generate the translation.
But what if we have a Few-Shot Prompting? Few-shot prompting gives the model several examples to learn from. This can improve the model’s ability to understand and complete the task.
Here is how you can implement few-shot prompting:
PYTHON
from transformers import pipeline
# Load a pre-trained model and tokenizer
model_name = "gpt2"
generator = pipeline('text-generation', model=model_name)
# Few-shot examples
prompt = """\
Q: What is the capital of France?
A: Paris.
Q: What is the largest mammal?
A: Blue whale.
Q: What is the human body's largest organ?
A: The skin.
Q: What is the currency of Japan?
A:"""
result = generator(prompt, max_length=100)
# Output the result
print(result[0]['generated_text'])
In this few-shot example, we provide the model with three question-answer pairs before posing a new question. The model uses the pattern it learned from the examples to answer the new question.
Challenge
To summarize this approach in a few steps, fill in the following gaps:
1. Choose a Model: Select a — model from Hugging Face that suits your task.
2. Load the Model: Use the — function to load the model and tokenizer.
3. Craft Your Prompt: Write a — that includes one or more examples, depending on whether you’re doing one-shot or few-shot prompting.
4. Generate Text: Call the — with your prompt to generate the —.
5. Review the Output: Check the generated text to see if the model followed the — correctly.
A:
1. Choose a Model: Select a pre-trained model from Hugging Face that suits your task.
2. Load the Model: Use the pipeline function to load the model and tokenizer.
3. Craft Your Prompt: Write a prompt that includes one or more examples, depending on whether you’re doing one-shot or few-shot prompting.
4. Generate Text: Call the generator with your prompt to generate the output.
5. Review the Output: Check the generated text to see if the model followed the examples correctly.
Remember, the quality of the output heavily depends on the quality and relevance of the examples you provide. It’s also important to note that larger models tend to perform better at these tasks due to their greater capacity to understand and generalize from examples.
Key Points
- Domain-specific LLMs are essential for tasks that require specialized knowledge.
- Prompt engineering, RAG, fine-tuning, and training from scratch are viable approaches to create DSLs.
- A mixed prompting-RAG approach is often preferred for its balance between performance and resource efficiency.
- Training from scratch offers the highest quality output but requires significant resources.
Content from Wrap-up and Final Project
Last updated on 2024-05-21 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- What are the core concepts and techniques we’ve learned about NLP and LLMs?
- How can these techniques be applied to solve real-world problems?
- What are the future directions and opportunities in NLP?
Objectives
- To be able to synthesize the key concepts from each episode.
- To plan a path for further learning and exploration in NLP and LLMs.
8.1. Takeaway from This Workshop
We have covered a vast landscape of NLP, starting with the basics and moving towards the intricacies of LLMs. Here is a brief recap to illustrate our journey:
- Text Preprocessing: Imagine cleaning a dataset of tweets for sentiment analysis. We learned how to remove noise and prepare the text for accurate classification.
- Text Analysis: Consider the task of extracting key information from news articles. Techniques like Named Entity Recognition helped us identify and categorize entities within the text.
- Word Embedding: We explored how words can be converted into vectors, enabling us to capture semantic relationships, as seen in the Word2Vec algorithm.
- Transformers and LLMs: We saw how transformers like BERT and GPT can be fine-tuned for tasks such as summarizing medical research papers, showcasing their power and flexibility.
Quiz
Match each term (A–J) with the description that fits it best:
A) Gensim
B) Word2Vec
C) Named Entity Recognition
D) Part-of-Speech Tagging
E) Stop-words Removal
F) Transformers
G) Bag of Words
H) Tokenization
I) BERT
J) Lemmatization
1. A masked language model for NLP tasks that require a good contextual understanding of an entire sequence.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
2. A process of reducing words to their root form, enabling the analysis of word frequency.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
3. An algorithm that uses neural networks to understand the relationships and meanings in human language.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
4. A technique for identifying the parts of speech for each word in a given sentence.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
5. A process where an algorithm takes a string of text and identifies relevant nouns that are mentioned in that string.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
6. A library that provides tools for machine learning and statistical modeling.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
7. A model that predicts the next word in a sentence based on the words that come before it.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
8. One of the first steps in text analysis, which involves breaking down text into individual elements.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
9. A technique that groups similar words together in vector space.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
10. A method for removing commonly used words that carry little meaning.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
A:
1. A masked language model for NLP tasks that require a good contextual understanding of an entire sequence.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [X] I - [ ] J
2. A process of reducing words to their root form, enabling the analysis of word frequency.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [X] J
3. An algorithm that uses neural networks to understand the relationships and meanings in human language.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [X] F - [ ] G - [ ] H - [ ] I - [ ] J
4. A technique for identifying the parts of speech for each word in a given sentence.
[ ] A - [ ] B - [ ] C - [X] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
5. A process where an algorithm takes a string of text and identifies relevant nouns that are mentioned in that string.
[ ] A - [ ] B - [X] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
6. A library that provides tools for machine learning and statistical modeling.
[X] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
7. A model that predicts the next word in a sentence based on the words that come before it.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [X] F - [ ] G - [ ] H - [ ] I - [ ] J
8. One of the first steps in text analysis, which involves breaking down text into individual elements.
[ ] A - [ ] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [X] H - [ ] I - [ ] J
9. A technique that groups similar words together in vector space.
[ ] A - [X] B - [ ] C - [ ] D - [ ] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
10. A method for removing commonly used words that carry little meaning.
[ ] A - [ ] B - [ ] C - [ ] D - [X] E - [ ] F - [ ] G - [ ] H - [ ] I - [ ] J
Discussion
Field of Interest
Teamwork: Share insights on how NLP can be applied in your field of interest.
Environmental Science
- NLP for Climate Change Research: How can NLP help in analyzing large volumes of research papers on climate change to identify trends and gaps in the literature?
- Social Media Analysis for Environmental Campaigns: Discuss the use of sentiment analysis to gauge public opinion on environmental policies.
- Automating Environmental Compliance: Share insights on how NLP can streamline the process of checking compliance with environmental regulations in corporate documents.
Education
- Personalized Learning: Explore the potential of NLP in creating personalized learning experiences by analyzing student feedback and performance.
- Content Summarization: Discuss the benefits of using NLP to summarize educational content for quick revision.
- Language Learning: Share thoughts on the role of NLP in developing language learning applications that adapt to the learner’s proficiency level.
Capstone Project: Build and Optimize a DSL Question-Answering System
From the Hugging Face model hub, use the GPT-2 model to build a question-answering system that can answer questions specific to a particular field (e.g., environmental science).
I. Test your model using the zero-shot prompting technique.
II. Test your model using the few-shot prompting technique. Ask the same question as in the zero-shot test and compare the generated answers.
A.I:
Context Example: Environmental science and climate change
PYTHON
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Since GPT-2 does not have a pad token, we set it to the eos_token
tokenizer.pad_token = tokenizer.eos_token
# Function to generate a response from the model
def get_model_response(question):
    # Encode the prompt with an attention mask and padding
    encoded_input = tokenizer.encode_plus(
        question,
        add_special_tokens=True,
        return_tensors='pt',
        padding='max_length',  # Pad to max length
        truncation=True,
        max_length=100
    )
    # Generate the response with the attention mask and pad token id
    outputs = model.generate(
        input_ids=encoded_input['input_ids'],
        attention_mask=encoded_input['attention_mask'],
        pad_token_id=tokenizer.eos_token_id,
        max_length=200,
        num_return_sequences=1
    )
    # Decode and return the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
Now, we need to create a function that takes user inputs and provides answers:
PYTHON
# Main function to interact with the user
def main():
    # Ask the user for their question
    user_question = input("What is your question about climate change? ")
    # Get the model's response
    answer = get_model_response(user_question)
    # Print the answer
    print(f"AI: {answer}")

# Run the main function
if __name__ == "__main__":
    main()
For the second part of the project, we will use few-shot prompting by providing examples of questions and answers related to environmental topics.
A.II:
PYTHON
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Set the pad token to the eos token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token
# Function to generate a response from the model
def get_model_response(question):
    # Construct the prompt with clear instructions and examples (source: generated by GPT-4)
    prompt = f"""
I am an AI trained to answer questions about climate change. Here are some examples:
Q: What causes climate change?
A: Climate change is primarily caused by the accumulation of greenhouse gases in the atmosphere due to human activities such as burning fossil fuels and deforestation.
Q: How does deforestation affect climate change?
A: Deforestation leads to increased carbon dioxide levels in the atmosphere because trees that absorb carbon dioxide are removed.
Q: What is climate change?
A: Climate change refers to significant changes in global temperatures and weather patterns over time.
Q: What causes global warming?
A: Global warming is primarily caused by the increase of greenhouse gases like carbon dioxide in the atmosphere due to human activities.
Q: How does deforestation affect climate change?
A: Deforestation contributes to climate change by reducing the number of trees that can absorb carbon dioxide, a greenhouse gas.
Q: Can planting trees help combat climate change?
A: Yes, planting trees can help mitigate climate change as trees absorb carbon dioxide from the atmosphere.
Q: What is renewable energy?
A: Renewable energy is derived from natural processes that are replenished constantly, like wind or solar power.
Q: Why is conserving water important for the environment?
A: Conserving water helps protect ecosystems, saves energy, and reduces the impact on water resources.
Q: What is sustainable living?
A: Sustainable living involves reducing one's carbon footprint by altering transportation methods, energy consumption, and diet.
Q: How do electric cars reduce pollution?
A: Electric cars produce fewer greenhouse gases and air pollutants than conventional vehicles.
Q: What is the impact of climate change on wildlife?
A: Climate change can alter habitats and food sources, leading to species migration and possible extinction.
Q: How does recycling help the environment?
A: Recycling reduces the need for raw materials, saves energy, and decreases pollution.
Q: What is carbon footprint?
A: A carbon footprint is the total amount of greenhouse gases emitted by an individual, organization, event, or product.
Q: Why is biodiversity important for the ecosystem?
A: Biodiversity boosts ecosystem productivity where each species has an important role to play.
Q: What are the effects of ocean acidification?
A: Ocean acidification can lead to the weakening of coral reefs and impact shell-forming marine life, disrupting the ocean's balance.
Q: How does climate change affect agriculture?
A: Climate change can alter crop yield, reduce water availability, and increase pests and diseases.
Q: What is the Paris Agreement?
A: The Paris Agreement is an international treaty aimed at reducing carbon emissions to combat climate change.
Q: How do fossil fuels contribute to global warming?
A: Burning fossil fuels releases large amounts of CO2, a major greenhouse gas, into the atmosphere.
Q: What is the significance of the ozone layer?
A: The ozone layer absorbs most of the sun's harmful ultraviolet radiation, protecting living organisms on Earth.
Q: What are green jobs?
A: Green jobs are positions in businesses that contribute to preserving or restoring the environment.
Q: How can we make our homes more energy-efficient?
A: We can insulate our homes, use energy-efficient appliances, and install smart thermostats to reduce energy consumption.
Q: What is the role of the United Nations in climate change?
A: The United Nations facilitates global climate negotiations and helps countries develop and implement climate policies.
Q: What are climate change mitigation and adaptation?
A: Mitigation involves reducing the flow of greenhouse gases into the atmosphere, while adaptation involves adjusting to current or expected climate change.
Q: How does urbanization affect the environment?
A: Urbanization can lead to habitat destruction, increased pollution, and higher energy consumption.
Q: What is a carbon tax?
A: A carbon tax is a fee imposed on the burning of carbon-based fuels, aimed at reducing greenhouse gas emissions.
Q: How does air pollution contribute to climate change?
A: Certain air pollutants like methane and black carbon have a warming effect on the atmosphere.
Q: What is an ecological footprint?
A: An ecological footprint measures the demand on Earth's ecosystems and compares it to nature's ability to regenerate resources.
Q: What are sustainable development goals?
A: Sustainable development goals are a collection of 17 global goals set by the United Nations to end poverty, protect the planet, and ensure prosperity for all.
Q: How does meat consumption affect the environment?
A: Meat production is resource-intensive and contributes to deforestation, water scarcity, and greenhouse gas emissions.
Q: What is an endangered species?
A: An endangered species is a type of organism that is at risk of extinction due to a drastic decline in population.
Q: How can businesses reduce their environmental impact?
A: Businesses can implement sustainable practices, reduce waste, use renewable energy, and invest in eco-friendly technologies.
Q: What is a climate model?
A: A climate model is a computer simulation used to predict future climate patterns based on different environmental scenarios.
Q: Why are wetlands important for the environment?
A: Wetlands provide critical habitat for many species, store floodwaters, and maintain surface water flow during dry periods.
Q: What is geoengineering?
A: Geoengineering is the deliberate large-scale intervention in the Earth’s climate system, aimed at mitigating the adverse effects of climate change.
Q: How does plastic pollution affect marine life?
A: Plastic pollution can injure or poison marine wildlife and disrupt marine ecosystems through ingestion and entanglement.
Q: What is a carbon sink?
A: A carbon sink is a natural or artificial reservoir that accumulates and stores carbon-containing chemical compounds for an indefinite period.
Q: How do solar panels work?
A: Solar panels convert sunlight into electricity through photovoltaic cells.
Q: What is the impact of climate change on human health?
A: Climate change can lead to health issues like heatstroke, allergies, and diseases spread by mosquitoes and ticks.
Q: What is a green economy?
A: A green economy is an economy that aims for sustainable development without degrading the environment.
Q: How does energy consumption contribute to climate change?
A: High energy consumption, especially from non-renewable sources, leads to higher emissions of greenhouse gases.
Q: What is the Kyoto Protocol?
A: The Kyoto Protocol is an international treaty that commits its parties to reduce greenhouse gas emissions.
Q: How can we protect coastal areas from rising sea levels?
A: We can protect coastal areas by restoring mangroves, building sea walls, and implementing better land-use policies.
Q: What is a heatwave, and how is it linked to climate change?
A: A heatwave is a prolonged period of excessively hot weather, which may become more frequent and intense due to climate change.
Q: How does climate change affect water resources?
A: Climate change can lead to changes in precipitation patterns, reduced snowpack, and increased evaporation rates.
Q: What is a carbon credit?
A: A carbon credit is a permit that allows the holder to emit a certain amount of carbon dioxide or other greenhouse gases.
Q: What are the benefits of wind energy?
A: Wind energy is a clean, renewable resource that reduces reliance on fossil fuels and decreases greenhouse gas emissions.
Q: What is an energy audit?
A: An energy audit is an assessment of energy use in a home or business to identify ways to improve efficiency and reduce costs.
Q: How do wildfires contribute to climate change?
A: Wildfires release stored carbon dioxide into the atmosphere and reduce the number of trees that can absorb CO2.
Q: What is a sustainable diet?
A: A sustainable diet involves consuming food that is healthy for individuals and sustainable for the environment.
Q: How does climate change affect the polar regions?
A: Climate change leads to melting ice caps, which can result in rising sea levels and loss of habitat for polar species.
Q: What is the role of youth in climate action?
A: Youth can play a crucial role in climate action through advocacy, innovation, and leading by example in sustainable practices.
Q: What is the significance of Earth Day?
A: Earth Day is an annual event to demonstrate support for environmental protection and promote awareness of environmental issues.
Q: {question}
A:"""
    # GPT-2's context window is 1024 tokens; truncate from the left so the
    # user's question at the end of this long prompt is not cut off
    tokenizer.truncation_side = 'left'
    # Encode the prompt with an attention mask
    encoded_input = tokenizer.encode_plus(
        prompt,
        add_special_tokens=True,
        return_tensors='pt',
        truncation=True,
        max_length=960  # leave room within the 1024-token window for the answer
    )
    # Generate the response with the attention mask and pad token id
    outputs = model.generate(
        input_ids=encoded_input['input_ids'],
        attention_mask=encoded_input['attention_mask'],
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=60,  # generate only the new answer tokens
        num_return_sequences=1
    )
    # Decode and return the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The decoded text repeats the whole prompt; keep only the final answer
    return response.split('A:')[-1].strip()
Similar to the first part of the project, we need to create a user interface function:
PYTHON
def main():
    user_question = input("Please enter your question about climate change: ")
    # Get the model's response
    answer = get_model_response(user_question)
    # Print the answer
    print(f"AI: {answer}")

# Run the main function
if __name__ == "__main__":
    main()
The model should provide a more relevant answer based on the few-shot examples provided. In this challenge, we used the GPT-2 model from Hugging Face’s transformers library to create a question-answering system. Few-shot prompting is employed to give the model context about environmental topics, which helps it generate more accurate answers to user queries.
However, you should note that the performance gains from few-shot prompting can be modest, and fine-tuning may be needed to get more accurate responses.
8.2. Further Resources
For continued learning, here are detailed resources:
- Natural Language Processing Specialization (Coursera): A series of courses that cover NLP foundations, algorithms, and how to build NLP applications.
- Stanford NLP Group: Access to pioneering NLP research, datasets, and tools like Stanford Parser and Stanford POS Tagger.
- Hugging Face: A platform for sharing and collaborating on ML models, with a focus on democratizing NLP technologies.
- Kaggle: An online community for data scientists, offering datasets, notebooks, and competitions to practice and improve your NLP skills.
Each resource is a gateway to further knowledge, community engagement, and hands-on experience.
8.3. Feedback
Please help us improve by answering the following survey questions:
1. How would you rate the overall quality of the workshop?
[ ] Excellent, [ ] Good, [ ] Average, [ ] Below Average, [ ] Poor
2. Was the pace of the workshop appropriate?
[ ] Too fast, [ ] Just right, [ ] Too slow
3. How clear were the instructions and explanations?
[ ] Very clear, [ ] Clear, [ ] Somewhat clear, [ ] Not clear
4. What was the most valuable part of the workshop for you?
5. How can we improve the workshop for future participants?
Your feedback is crucial for us to evolve and enhance the learning experience.
Key Points
- Various NLP techniques from preprocessing to advanced LLMs are reviewed.
- NLP’s transformative potential enables real-world applications in diverse fields.
- Few-shot learning can enhance the performance of LLMs for specific fields of research.
- Valuable resources are highlighted for continued learning and exploration in the field of NLP.