03) Literature workflow with LLMs (function-based)

Goal (60–70 min): go from a research question → build a small paper corpus → extract a structured table → write a mini literature review with links.

By the end you will be able to:

  • Search arXiv for relevant papers
  • Build a structured table (goal/method/findings/limitations)
  • Select Top‑K papers for a question
  • Produce a mini review grounded in the extracted table

Workshop style: you should only need to call a few functions. The API details stay hidden.

1) Quick setup

# Uncomment and run the line below if the packages are not installed.
# !pip install numpy pandas matplotlib openai requests
# We import only what we need.
import os  # Environment variables for API keys and configuration
from openai import OpenAI  # OpenAI-compatible client (works with Groq base_url too)
  • Paste your Groq API key below.
  • You can also change the model if you like.
  • chat() can be used to have a continuous chat. Usage: chat("Your question")
  • chat_reset() can be used to delete the history and set a new system prompt. Usage: chat_reset("You are a helper for a bioinformatician.") or simply chat_reset()
# --- Setup ---

# paste your key here for the session.
os.environ["GROQ_API_KEY"] = ""  # <-- Paste key for workshop only. Do NOT share publicly.

# Keep the same model string used in the workshop.
model = "llama-3.3-70b-versatile"  # You can change this later if needed.

# Keep the same base URL pattern used in the workshop.
os.environ["BASE_URL"] = "https://api.groq.com/openai/v1"  # Groq's OpenAI-compatible endpoint


# Create the client using the API key and base URL.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],  # Read the key from the environment variable
    base_url=os.environ["BASE_URL"],     # Read the base URL from the environment variable
)

# We keep a small conversation history in memory to make multi-turn interactions easy.
chat_history = []

def chat_reset(system_prompt="You are a helpful research assistant."):
    """Reset chat history to a single system message (in-place)."""
    chat_history[:] = [{"role": "system", "content": system_prompt}]

chat_reset()

def chat(user_text, temperature=0.2, max_turns=8):
    """Multi-turn helper that keeps a short conversation history.
    Use this instead of calling the API directly in most workshop cells.
    """
    # Add the user's message to the conversation history.
    chat_history.append({"role": "user", "content": user_text})
    
    # Keep only the system message plus the most recent turns (to manage context length).
    system = chat_history[:1]  # First message is the system instruction
    recent = chat_history[1:][-(max_turns * 2):]  # Skip the system message; each turn has user+assistant, hence *2
    window = system + recent  # Final message window sent to the model
    
    # Make the API call.
    resp = client.chat.completions.create(
        model=model,         # Model name
        messages=window,     # Conversation history window
        temperature=temperature,  # Randomness / creativity control
    )
    
    # Extract the assistant's reply text.
    reply = resp.choices[0].message.content
    
    # Store the assistant reply so the next call has memory.
    chat_history.append({"role": "assistant", "content": reply})
    
    # Return just the text (simple and workshop-friendly).
    return reply



print("Setup complete. `client`, `model`, and `chat()` are available.")
print("Testing chat function...")
print("User: Hello LLM")
print("Assistant: " + chat("Heloo LLM"))  # Test the chat function 

chat_reset()  # Reset chat history for next cells
Setup complete. `client`, `model`, and `chat()` are available.
Testing chat function...
User: Hello LLM
Assistant: Hello. It's nice to meet you. Is there something I can help you with or would you like to chat? I'm here to assist you with any questions or topics you'd like to discuss.
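
If you want to see the multi-turn memory and a custom persona in action, a minimal follow-up could look like this (the prompts are just illustrative):

# --- Optional: multi-turn example with a custom system prompt ---
chat_reset("You are a concise assistant for a bioinformatician.")   # New persona for this mini-session
print(chat("In one sentence, what is a literature review?"))        # First turn
print(chat("Name one common pitfall when writing one."))            # Second turn relies on the stored history
chat_reset()  # Back to the default system prompt for the next cells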

2) Install / imports

We’ll use:

  • requests to query the arXiv API
  • pandas to manage a table of papers

import requests  # HTTP requests to arXiv
import pandas as pd  # Tables for papers
import textwrap  # Pretty printing / wrapping long text
import re  # Light text cleaning

3) Step 1 — Search arXiv (function)

arXiv provides a public API. We’ll query it with a keyword string and return a tidy table.

Design principle: you should not have to read XML. The function returns a DataFrame.

What we store per paper

  • title
  • authors
  • published
  • summary (abstract)
  • link (arXiv URL)
# --- arXiv search helper (no lxml dependency) ---

import xml.etree.ElementTree as ET  # Built-in XML parser (no extra install needed)

def search_arxiv(query, max_results=15, sort_by="relevance"):
    """Search arXiv and return a DataFrame without requiring lxml."""
    base = "http://export.arxiv.org/api/query"

    params = {
        "search_query": query,
        "start": 0,
        "max_results": max_results,
        "sortBy": sort_by,
    }

    r = requests.get(base, params=params, timeout=30)
    r.raise_for_status()

    root = ET.fromstring(r.text)
    ns = {"atom": "http://www.w3.org/2005/Atom"}

    records = []
    for entry in root.findall("atom:entry", ns):
        title = entry.findtext("atom:title", default="", namespaces=ns)
        summary = entry.findtext("atom:summary", default="", namespaces=ns)
        published = entry.findtext("atom:published", default="", namespaces=ns)
        link = entry.findtext("atom:id", default="", namespaces=ns)

        authors = []
        for author in entry.findall("atom:author", ns):
            name = author.findtext("atom:name", default="", namespaces=ns)
            if name:
                authors.append(name)

        records.append({
            "title": re.sub(r"\s+", " ", title).strip(),
            "authors": ", ".join(authors),
            "published": published,
            "summary": re.sub(r"\s+", " ", summary).strip(),
            "link": link,
        })

    return pd.DataFrame(records)

Try it

Pick a topic. For example:

  • "retrieval augmented generation"
  • "LLM scientific discovery"
  • "LLM systematic review"

Tip: arXiv queries support field prefixes like ti: (title) and abs: (abstract).
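
For instance, a few more targeted query strings might look like the following (illustrative only; adapt them to your topic):

# --- Illustrative field-prefixed queries ---
q_title    = 'ti:"retrieval augmented generation"'                      # Phrase must appear in the title
q_combined = 'abs:"systematic review" AND abs:"large language model"'   # Both phrases in the abstract
q_exclude  = 'ti:"hallucination" ANDNOT abs:"image"'                    # Exclude abstracts mentioning "image"
# papers = search_arxiv(q_combined, max_results=10)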

# --- Example search ---

query = 'abs:"retrieval augmented generation" OR abs:"RAG"'
papers = search_arxiv(query, max_results=10)

print("Found", len(papers), "papers")
papers.head(5)
Found 10 papers
title authors published summary link
0 Benchmarking Large Language Models in Retrieva... Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun 2023-09-04T08:28:44Z Retrieval-Augmented Generation (RAG) is a prom... http://arxiv.org/abs/2309.01431v2
1 GFM-RAG: Graph Foundation Model for Retrieval ... Linhao Luo, Zicheng Zhao, Gholamreza Haffari, ... 2025-02-03T07:04:29Z Retrieval-augmented generation (RAG) has prove... http://arxiv.org/abs/2502.01113v3
2 CommunityKG-RAG: Leveraging Community Structur... Rong-Ching Chang, Jiawei Zhang 2024-08-16T05:15:12Z Despite advancements in Large Language Models ... http://arxiv.org/abs/2408.08535v1
3 MRAG: Benchmarking Retrieval-Augmented Generat... Wei Zhu 2026-01-23T07:07:13Z While Retrieval-Augmented Generation (RAG) has... http://arxiv.org/abs/2601.16503v1
4 Self-adaptive Multimodal Retrieval-Augmented G... Wenjia Zhai 2024-10-15T06:39:35Z Traditional Retrieval-Augmented Generation (RA... http://arxiv.org/abs/2410.11321v1
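
The summary column is truncated in the table view. To skim a full abstract, you can wrap it with textwrap (imported above):

# --- Optional: print the first abstract, wrapped for readability ---
print(papers.loc[0, "title"])
print(papers.loc[0, "link"])
print(textwrap.fill(papers.loc[0, "summary"], width=90))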

4) Step 2 — Summarise + extract structured fields (function)

Now we convert a list of papers into a structured table. This is the most useful pattern for research workflows:

  • the model outputs fields, not free text
  • you can filter/sort the results
  • you can later ground your review in this table

Fields we extract

  • research_goal
  • method
  • key_findings
  • limitations

We keep outputs short, and we tell the model not to invent details beyond the abstract.

# --- LLM extraction helper ---

def extract_from_abstract(title, abstract, link, temperature=0.2):
    """Use the LLM to extract structured fields from a single abstract."""

    # We ask for a strict JSON object to reduce ambiguity.
    prompt = f"""
You are extracting information for a research literature table.

RULES:
- Use ONLY the abstract text below.
- Do NOT invent details.
- If something is missing, write "unknown".
- Keep each value <= 25 words.

TITLE: {title}
LINK: {link}

ABSTRACT:
{abstract}

Return a JSON object with exactly these keys:
- research_goal
- method
- key_findings
- limitations
"""

    # Call the workshop helper.
    # Note: we reset before each extraction to avoid context bleed between papers.
    chat_reset("You are a careful research assistant that follows instructions.")
    raw = chat(prompt, temperature=temperature)

    return raw


def summarise_and_extract(df_papers, temperature=0.2, max_papers=None):
    """Batch extract structured fields from a DataFrame of papers.

    Parameters
    ----------
    df_papers : pd.DataFrame
        Must include columns: title, summary, link.
    temperature : float
        Lower values are better for extraction.
    max_papers : int or None
        Optional cap for workshop speed.
    """
    rows = []

    # Optionally cap the number of papers (useful in a workshop).
    work = df_papers.copy()
    if max_papers is not None:
        work = work.head(max_papers)

    for i, r in work.iterrows():
        title = str(r.get("title", ""))
        abstract = str(r.get("summary", ""))
        link = str(r.get("link", ""))

        # Extract JSON-like text from the model.
        raw = extract_from_abstract(title, abstract, link, temperature=temperature)

        # Store the raw output; we'll parse it later (parsing is brittle across models).
        rows.append({
            "title": title,
            "link": link,
            "abstract": abstract,
            "extraction_raw": raw,
        })

    return pd.DataFrame(rows)

Run extraction on a small set

Start with 5–8 papers for speed. You can scale up later.

# --- Extract structured fields ---

extracted = summarise_and_extract(papers, temperature=0.2, max_papers=6)
extracted[["title", "link", "extraction_raw"]].head(3)
title link extraction_raw
0 Benchmarking Large Language Models in Retrieva... http://arxiv.org/abs/2309.01431v2 ```\n{\n "research_goal": "Evaluate Retrieval...
1 GFM-RAG: Graph Foundation Model for Retrieval ... http://arxiv.org/abs/2502.01113v3 ```\n{\n "research_goal": "Improve retrieval ...
2 CommunityKG-RAG: Leveraging Community Structur... http://arxiv.org/abs/2408.08535v1 ```\n{\n "research_goal": "Enhance fact-check...

5) Parse the JSON-like output (best-effort)

Models are usually good at returning JSON, but sometimes they include extra text. This parser is best-effort:

  • it tries to find the first JSON object in the output
  • if parsing fails, it leaves the fields blank so you can fix them manually

In production workflows, you would tighten this using tool/function calling or a stricter JSON mode (where available).
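
For reference, a stricter call might request JSON directly from the API via response_format. Treat this as a sketch rather than a drop-in replacement for extract_from_abstract(): support for JSON mode varies by provider and model, so check your provider's documentation before relying on it.

# --- Sketch: requesting JSON mode (support varies by provider/model) ---
def extract_json_mode(prompt, temperature=0.2):
    """Single-turn call that asks the API for a JSON object (where supported)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Return a single JSON object and nothing else."},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature,
        response_format={"type": "json_object"},  # Not accepted by every model/provider
    )
    return resp.choices[0].message.content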

import json

def parse_first_json(text):
    """Extract and parse the first JSON object from a text string (best effort)."""
    if text is None:
        return None
    s = str(text)

    # Find a JSON object by locating the first '{' and the last '}'
    start = s.find("{")
    end = s.rfind("}")
    if start == -1 or end == -1 or end <= start:
        return None

    candidate = s[start:end+1]

    try:
        return json.loads(candidate)
    except Exception:
        return None


def normalize_extractions(df_extracted):
    """Turn raw extraction strings into a clean table."""
    records = []
    for _, r in df_extracted.iterrows():
        parsed = parse_first_json(r.get("extraction_raw", "")) or {}
        records.append({
            "title": r.get("title", ""),
            "link": r.get("link", ""),
            "research_goal": parsed.get("research_goal", ""),
            "method": parsed.get("method", ""),
            "key_findings": parsed.get("key_findings", ""),
            "limitations": parsed.get("limitations", ""),
        })
    return pd.DataFrame(records)


table = normalize_extractions(extracted)
table.head(10)
title link research_goal method key_findings limitations
0 Benchmarking Large Language Models in Retrieva... http://arxiv.org/abs/2309.01431v2 Evaluate Retrieval-Augmented Generation on lar... Established Retrieval-Augmented Generation Ben... LLMs struggle with negative rejection and info... unknown
1 GFM-RAG: Graph Foundation Model for Retrieval ... http://arxiv.org/abs/2502.01113v3 Improve retrieval augmented generation Graph foundation model with graph neural network State-of-the-art performance on QA datasets Unknown
2 CommunityKG-RAG: Leveraging Community Structur... http://arxiv.org/abs/2408.08535v1 Enhance fact-checking process CommunityKG-RAG framework Improves accuracy and relevance unknown
3 MRAG: Benchmarking Retrieval-Augmented Generat... http://arxiv.org/abs/2601.16503v1 Evaluate Retrieval-Augmented Generation in bio... Introduced MRAG benchmark and toolkit RAG enhances reliability, influenced by retrie... LLM responses less readable for long-form ques...
4 Self-adaptive Multimodal Retrieval-Augmented G... http://arxiv.org/abs/2410.11321v1 Improve Retrieval-Augmented Generation Self-adaptive Multimodal Retrieval-Augmented G... Surpasses state-of-the-art in accuracy and gen... unknown
5 Observations on Building RAG Systems for Techn... http://arxiv.org/abs/2404.00657v1 Review RAG for technical documents Experiments and prior art review Best practices and challenges unknown
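
Because the parser is best-effort, it is worth checking how many rows came back empty; anything listed here needs a manual fix or a re-run with a lower temperature:

# --- Optional: find rows where JSON parsing failed ---
fields = ["research_goal", "method", "key_findings", "limitations"]
failed = table[(table[fields] == "").all(axis=1)]
print(f"{len(failed)} of {len(table)} rows could not be parsed")
failed[["title", "link"]]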

6) Step 3 — Select Top‑K papers for your question

There are many ways to choose Top‑K papers.

For the workshop we provide two simple methods:

  1. Keyword score (fast, local)
  2. LLM relevance score (slower, but often better)

Start with keyword scoring. Use LLM scoring if results look off.

# --- Top-K selection helpers ---

def select_top_k_keyword(df, query, k=8):
    """Simple keyword scoring on extracted fields."""
    q = str(query).lower()

    # Build a text blob per row.
    blob = (
        df["title"].fillna("") + " " +
        df["research_goal"].fillna("") + " " +
        df["method"].fillna("") + " " +
        df["key_findings"].fillna("") + " " +
        df["limitations"].fillna("")
    ).str.lower()

    # Score by counting occurrences of query terms.
    terms = [t for t in re.split(r"\W+", q) if t]

    def _score(text):
        return sum(text.count(t) for t in terms)

    scored = df.copy()
    scored["score"] = blob.apply(_score)

    return scored.sort_values("score", ascending=False).head(k).reset_index(drop=True)


def select_top_k_llm(df, question, k=8, temperature=0.2):
    """Ask the LLM to score each row for relevance to a question."""
    scores = []

    for _, r in df.iterrows():
        # Keep context short so scoring is fast.
        snippet = f"""
TITLE: {r.get('title','')}
GOAL: {r.get('research_goal','')}
METHOD: {r.get('method','')}
FINDINGS: {r.get('key_findings','')}
LIMITATIONS: {r.get('limitations','')}
LINK: {r.get('link','')}
"""

        prompt = f"""
You are ranking papers for relevance.

QUESTION: {question}

PAPER SUMMARY:
{snippet}

Return ONLY an integer score from 0 to 5 where:
0 = not relevant
5 = directly answers the question
"""

        chat_reset("You are a strict scorer. Output ONLY the integer.")
        out = chat(prompt, temperature=temperature).strip()

        # Parse integer safely.
        m = re.search(r"\d+", out)
        score = int(m.group(0)) if m else 0
        scores.append(score)

    scored = df.copy()
    scored["score"] = scores
    return scored.sort_values("score", ascending=False).head(k).reset_index(drop=True)

Choose your selection method

  1. Start with keyword
  2. If needed, try LLM scoring (slower)
# --- Top-K selection ---

question = "How does RAG reduce hallucinations and what are the key failure modes?"

topk_kw = select_top_k_keyword(table, query=question, k=6)
topk_kw[["score", "title", "link"]]
score title link
0 4 MRAG: Benchmarking Retrieval-Augmented Generat... http://arxiv.org/abs/2601.16503v1
1 4 CommunityKG-RAG: Leveraging Community Structur... http://arxiv.org/abs/2408.08535v1
2 4 Observations on Building RAG Systems for Techn... http://arxiv.org/abs/2404.00657v1
3 2 GFM-RAG: Graph Foundation Model for Retrieval ... http://arxiv.org/abs/2502.01113v3
4 2 Self-adaptive Multimodal Retrieval-Augmented G... http://arxiv.org/abs/2410.11321v1
5 1 Benchmarking Large Language Models in Retrieva... http://arxiv.org/abs/2309.01431v2
# --- Define your research question ---
question = "How does RAG reduce hallucinations and what are common failure modes?"

# We use LLM-based scoring instead of keyword matching.
# This is slower but often more semantically accurate.
topk_llm = select_top_k_llm(
    table,          # The structured table with extracted fields
    question=question,
    k=6,            # Number of papers to keep
    temperature=0.0 # Keep deterministic for scoring
)

# Show results
topk_llm[["score", "title", "link"]]
score title link
0 4 Observations on Building RAG Systems for Techn... http://arxiv.org/abs/2404.00657v1
1 4 MRAG: Benchmarking Retrieval-Augmented Generat... http://arxiv.org/abs/2601.16503v1
2 2 CommunityKG-RAG: Leveraging Community Structur... http://arxiv.org/abs/2408.08535v1
3 2 Benchmarking Large Language Models in Retrieva... http://arxiv.org/abs/2309.01431v2
4 0 GFM-RAG: Graph Foundation Model for Retrieval ... http://arxiv.org/abs/2502.01113v3
5 0 Self-adaptive Multimodal Retrieval-Augmented G... http://arxiv.org/abs/2410.11321v1

7) Step 4 — Write a mini literature review grounded in Top‑K

Now we generate a short review that is grounded in the selected papers.

Important: we are not asking the model to cite external knowledge. We are asking it to use only what we extracted.

To keep it checkable, we include links next to each paper mention.

# --- Context builder + review writer ---

def build_context(df_topk, max_chars=12000):
    """Build a compact context string from Top-K rows."""
    parts = []
    for i, r in df_topk.iterrows():
        parts.append(
            f"[{i+1}] {r.get('title','')}\n"
            f"Link: {r.get('link','')}\n"
            f"Goal: {r.get('research_goal','')}\n"
            f"Method: {r.get('method','')}\n"
            f"Findings: {r.get('key_findings','')}\n"
            f"Limitations: {r.get('limitations','')}\n"
        )
    ctx = "\n---\n".join(parts)
    return ctx[:max_chars]


def write_mini_review(question, df_topk, temperature=0.2):
    """Write a short grounded review using only the provided table context."""
    context = build_context(df_topk)

    prompt = f"""
You are writing a short literature overview for a researcher.

QUESTION:
{question}

CONTEXT (use ONLY this):
{context}

Write:
1) A 2–3 paragraph mini review answering the question.
2) A bullet list of 3 common limitations / open problems.

Rules:
- Do NOT invent papers or claims.
- When you refer to a paper, cite it like [1], [2], etc.
- If context is insufficient, say what is missing.
"""

    chat_reset("You are a careful research assistant. Stay grounded in the provided context.")
    return chat(prompt, temperature=temperature)
# --- Generate the mini review ---

review = write_mini_review(question, topk_llm, temperature=0.2)
print(review)
Retrieval-Augmented Generation (RAG) has been shown to reduce hallucinations in various studies. Hallucinations refer to the generation of inaccurate or unrelated information. According to [2], RAG enhances reliability, which suggests that it can mitigate hallucinations by providing more accurate and relevant information. The retrieval component of RAG helps to ground the generated text in actual information, reducing the likelihood of hallucinations. However, the effectiveness of RAG in reducing hallucinations depends on the retrieval approach used, as noted in [2].

The reduction of hallucinations in RAG can be attributed to its ability to incorporate external knowledge into the generation process. For instance, [3] introduces CommunityKG-RAG, a framework that leverages community structures in knowledge graphs to improve fact-checking. This approach enhances accuracy and relevance, which are critical in reducing hallucinations. Similarly, [5] proposes a graph foundation model for RAG, which achieves state-of-the-art performance on QA datasets. These studies demonstrate the potential of RAG in reducing hallucinations by incorporating external knowledge and improving the generation process.

However, despite the advancements in RAG, there are still common failure modes and limitations. For example, [4] notes that large language models struggle with negative rejection and information integration, which can lead to hallucinations. Additionally, [2] mentions that LLM responses can be less readable for long-form questions, which may also contribute to hallucinations. To better understand how RAG reduces hallucinations and common failure modes, more research is needed on the specific mechanisms and limitations of RAG.

Common limitations and open problems:
* Large language models struggle with negative rejection and information integration, leading to potential hallucinations [4].
* LLM responses can be less readable for long-form questions, which may contribute to hallucinations [2].
* The limitations of RAG in terms of its ability to handle unknown or unseen data are not well understood, as many studies do not report on limitations [1], [3], [5], [6].
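
If you want to keep the results, the table and the review can be written to disk (the filenames below are just suggestions):

# --- Optional: save the structured table and the mini review ---
table.to_csv("papers_table.csv", index=False)  # Structured fields per paper
with open("mini_review.md", "w", encoding="utf-8") as f:
    f.write(f"# Mini review\n\nQuestion: {question}\n\n{review}\n")
print("Saved papers_table.csv and mini_review.md")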

8) Optional take-home ideas

If you want to extend this workflow later:

  • Increase the corpus size (e.g., 50–200 papers)
  • Add PDF fetching + chunking (keep it optional: PDF extraction can be messy)
  • Swap Top‑K scoring to embeddings / vector search (see the sketch below)
  • Use stricter JSON outputs (tool calling / JSON mode where supported)
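
As a starting point for the vector-search idea, here is a minimal sketch that ranks papers by TF-IDF cosine similarity, a lightweight local stand-in for dense embeddings. It assumes scikit-learn is installed and reuses the columns of the extracted table:

# --- Sketch: vector-style Top-K with TF-IDF (assumes scikit-learn is installed) ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_top_k_tfidf(df, question, k=8):
    """Rank rows by cosine similarity between the question and each paper's extracted fields."""
    docs = (
        df["title"].fillna("") + " " +
        df["research_goal"].fillna("") + " " +
        df["key_findings"].fillna("")
    ).tolist()
    vec = TfidfVectorizer(stop_words="english")
    doc_mat = vec.fit_transform(docs)                 # One row per paper
    q_vec = vec.transform([question])                 # The question in the same vector space
    sims = cosine_similarity(q_vec, doc_mat).ravel()  # Similarity of the question to each paper
    scored = df.copy()
    scored["score"] = sims
    return scored.sort_values("score", ascending=False).head(k).reset_index(drop=True)

# topk_vec = select_top_k_tfidf(table, question, k=6)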

Next notebook: Structured extraction as a general technique (beyond arXiv).