02) Core concepts

Goal (15 min): understand the practical concepts you need to use LLMs responsibly in research.

By the end you will be able to:

- Explain (and detect) hallucinations
- Understand why context matters (and how to manage it)
- Use temperature intentionally (without overthinking the maths)
- Write prompts that produce structured, checkable outputs

1) Quick setup

# Remove the # from the line below and run this cell if the packages are not installed
# !pip install numpy pandas matplotlib openai requests
# We import only what we need.
import os  # Environment variables for API keys and configuration
from openai import OpenAI  # OpenAI-compatible client (works with Groq base_url too)
  • Paste your Groq API key below.
  • You can also change the model if you like.
  • chat() can be used to have a continuous chat. Usage: chat("Your question")
  • chat_reset() can be used to delete the history and set a new system prompt. Usage: chat_reset("You are a helper for a bioinformatician.") or simply chat_reset()
# --- Setup ---

# paste your key here for the session.
os.environ["GROQ_API_KEY"] = ""  # <-- Paste key for workshop only. Do NOT share publicly.

# Keep the same model string used in the workshop.
model = "llama-3.3-70b-versatile"  # You can change this later if needed.

# Keep the same base URL pattern used in the workshop.
os.environ["BASE_URL"] = "https://api.groq.com/openai/v1"  # Groq's OpenAI-compatible endpoint


# Create the client using the API key and base URL.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],  # Read the key from the environment variable
    base_url=os.environ["BASE_URL"],     # Read the base URL from the environment variable
)

# We keep a small conversation history in memory to make multi-turn interactions easy.
chat_history = []

def chat_reset(system_prompt="You are a helpful research assistant."):
    """Reset chat history to a single system message (in-place)."""
    chat_history[:] = [{"role": "system", "content": system_prompt}]

chat_reset()

def chat(user_text, temperature=0.2, max_turns=8):
    """Multi-turn helper that keeps a short conversation history.
    Use this instead of calling the API directly in most workshop cells.
    """
    # Add the user's message to the conversation history.
    chat_history.append({"role": "user", "content": user_text})
    
    # Keep only the system message plus the most recent turns (to manage context length).
    system = chat_history[:1]  # First message is the system instruction
    recent = chat_history[1:][-(max_turns * 2):]  # Skip the system message; each turn has user+assistant, hence *2
    window = system + recent  # Final message window sent to the model
    
    # Make the API call.
    resp = client.chat.completions.create(
        model=model,         # Model name
        messages=window,     # Conversation history window
        temperature=temperature,  # Randomness / creativity control
    )
    
    # Extract the assistant's reply text.
    reply = resp.choices[0].message.content
    
    # Store the assistant reply so the next call has memory.
    chat_history.append({"role": "assistant", "content": reply})
    
    # Return just the text (simple and workshop-friendly).
    return reply



print("Setup complete. `client`, `model`, and `chat()` are available.")
print("Testing chat function...")
print("User: Hello LLM")
print("Assistant: " + chat("Heloo LLM"))  # Test the chat function 

chat_reset()  # Reset chat history for next cells
Setup complete. `client`, `model`, and `chat()` are available.
Testing chat function...
User: Hello LLM
Assistant: Hello. It's nice to meet you. Is there something I can help you with or would you like to chat? I'm here to assist you with any questions or topics you'd like to discuss.

2) Hallucinations (what they are and how to handle them)

Hallucination = the model produces a confident-looking statement that is not grounded in reality or in the context you provided.

In research, this shows up as:

- invented citations / DOIs
- incorrect numbers or claims
- fabricated details about a paper that sound plausible

Rule of thumb: treat the model as a drafting and reasoning tool, not as an authoritative source. Always verify critical facts against primary sources.

2.1 A quick demonstration

We will ask the model for very specific facts without giving it any sources. Then we will ask it to provide a verification plan.

The point is not to “trick” the model. The point is to learn the habit: ask for checks.

# --- Hallucination demo ---

# We ask a very specific question *without giving any citations or context*.
# In real research, this is where hallucinations can sneak in.
question = "What is the exact DOI of the first paper that introduced the Transformer architecture? Provide the DOI and the full citation."

# We use a low temperature to reduce randomness (but hallucinations can still happen).
answer = chat(question, temperature=0.2)

print(answer)
The first paper that introduced the Transformer architecture is:

"Attention Is All You Need" by Vaswani et al.

The DOI of this paper is: 10.5555/3295222.3295349

The full citation is:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).

However, a more commonly used citation format is:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

2.2 A better pattern: ask for uncertainty + a verification plan

Instead of trusting the first answer, ask the model to:

- state what it knows vs. guesses
- propose how to verify
- provide queries you could run in Google Scholar / arXiv / Semantic Scholar

This turns the model into a research assistant rather than a source of truth.

# --- Verification plan prompt ---

prompt = """
You may be wrong. Do NOT guess.
1) Say whether you are certain about the DOI (yes/no).
2) If not certain, propose a verification plan with 3-5 concrete steps.
3) Provide 3 search queries I can paste into Google Scholar.
"""

plan = chat(prompt, temperature=0.2)

print(plan)
1. No, I am not certain about the DOI.

2. To verify the DOI, I propose the following verification plan:
* Step 1: Check the official NeurIPS (Conference on Neural Information Processing Systems) website for the paper "Attention Is All You Need" by Vaswani et al.
* Step 2: Search for the paper on academic databases such as Google Scholar, arXiv, or ResearchGate to find the correct DOI.
* Step 3: Verify the DOI on the paper's official publication page or through a reliable citation manager like Crossref or PubMed.
* Step 4: Cross-check the DOI with other trusted sources, such as the authors' personal websites or institutional repositories.
* Step 5: If necessary, contact the authors or the publication team to confirm the correct DOI.

3. Here are three search queries you can paste into Google Scholar:
* "Attention Is All You Need" Vaswani
* "Vaswani et al. 2017 Attention Is All You Need"
* "Transformer architecture introduction paper Vaswani"

3) Context: why it matters

LLMs generate answers based on the messages you provide (system + conversation history + any pasted content).

If you don’t provide:

- the relevant paragraphs
- the table / figure description
- the definitions / constraints

…then the model must guess, which increases error risk.

Two practical tricks:

1) Keep the prompt short and specific
2) Provide just enough context (not everything)

3.1 Context budgeting (a practical habit)

Below is a template you can reuse:

- Task: what you want
- Context: pasted text (only the relevant parts)
- Output format: what you want back (table, bullets, JSON)
- Checks: ask the model to quote supporting lines / highlight uncertainty

# --- Context budgeting template ---

task = "Summarise the key claim and the main limitation."
context = """
[Paste 1-3 relevant paragraphs here. Keep it short.]
"""

prompt = f"""
TASK:
{task}

CONTEXT:
{context}

OUTPUT FORMAT:
- Key claim (1 sentence)
- Main limitation (1 sentence)
- 1 quote from the context that supports the key claim
- 1 quote from the context that supports the limitation

If the context is insufficient, say what is missing.
"""

print(chat(prompt, temperature=0.2))
There is no context provided. To complete the task, I need 1-3 relevant paragraphs from the paper "Attention Is All You Need" by Vaswani et al. Please provide the context, and I will be happy to assist you in summarizing the key claim and the main limitation.
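
The reply above shows the template working as intended: with an empty context, the model says what is missing instead of guessing. A related budgeting habit is to estimate how much context you are about to send before you send it. The sketch below uses the rough 4-characters-per-token heuristic, which is an approximation, not the model's real tokenizer.

# --- Rough context-size check before sending (sketch) ---
def approx_tokens(text):
    """Very rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4

def check_budget(prompt_text, budget_tokens=4000):
    """Warn if a prompt is likely to use a large share of the context window."""
    est = approx_tokens(prompt_text)
    status = "OK" if est <= budget_tokens else "consider trimming the context"
    print(f"~{est} estimated tokens (budget {budget_tokens}): {status}")

check_budget(prompt)  # `prompt` is the template built in the previous cell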

4) Temperature: practical guidance

Temperature controls randomness in the model’s outputs.

Practical defaults:

- 0.0–0.3: summarisation, extraction, careful writing
- 0.4–0.8: brainstorming, alternatives, creative ideation
- 0.9+: rarely needed for research workflows

In this workshop, we mostly stay around 0.2.
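
For intuition only (the maths stays behind the API): temperature divides the model's raw scores before they are turned into probabilities, so a low temperature sharpens the distribution towards the most likely token and a high temperature flattens it. Here is a toy numpy sketch with made-up scores.

# --- How temperature reshapes a toy token distribution (illustration only) ---
import numpy as np  # Used only for this small illustration

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    z = np.array(logits) / temperature
    z = z - z.max()              # Stabilise the exponentials
    p = np.exp(z)
    return p / p.sum()

toy_logits = [2.0, 1.0, 0.2]     # Made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(f"T={t}:", np.round(softmax_with_temperature(toy_logits, t), 3))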

# --- Temperature demo (short) ---

prompt = "Give me one concise sentence describing how RAG helps with literature review."

# Low temperature: more consistent phrasing.
print("Temperature 0.05:")
for i in range(3):
    chat_reset()
    print(chat(prompt, temperature=0.05))

# Higher temperature: more variation / creativity.
print("\nTemperature 0.95:")
for i in range(3):
    chat_reset()
    print(chat(prompt, temperature=0.95))
Temperature 0.05:
RAG (Retrieval-Augmented Generation) assists with literature review by efficiently retrieving and summarizing relevant existing research, saving time and effort in identifying key studies and synthesizing findings.
RAG (Retrieval-Augmented Generation) assists with literature review by efficiently retrieving and summarizing relevant existing research, saving time and effort in identifying key studies and synthesizing findings.
RAG (Retrieval-Augmented Generation) assists with literature review by efficiently retrieving and summarizing relevant existing research, saving time and effort in identifying key studies and synthesizing findings.

Temperature 0.95:
RAG (Retrieval-Augmented Generator) helps with literature review by efficiently searching and summarizing relevant papers, allowing researchers to quickly identify key findings and gaps in existing research.
RAG (Retrieve, Augment, Generate) assists with literature reviews by efficiently retrieving relevant papers, augmenting the search with additional sources, and generating summaries to help researchers quickly identify key findings and trends.
RAG (Retrieval-Augmented Generator) assists with literature reviews by efficiently retrieving and summarizing relevant existing research, saving time and effort in identifying and synthesizing key findings and concepts.

5) Prompt patterns that work in research

Most failures in LLM usage come from underspecified prompts.

Here are reliable patterns:

1) Role + task + constraints (see the sketch below)
2) Ask for structured outputs (tables / JSON / bullet fields)
3) Ask for caveats (“what could be wrong?”)
4) Ask for sources (“cite from the provided context”), or explicitly say “no external claims”
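
For pattern 1, here is a minimal sketch of a reusable prompt builder; the field names (role, task, constraints) are just one reasonable layout, not a fixed convention.

# --- Pattern 1: role + task + constraints (sketch) ---
def build_prompt(role, task, constraints):
    """Assemble a role + task + constraints prompt as one string."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return f"""ROLE:
{role}

TASK:
{task}

CONSTRAINTS:
{constraint_lines}
"""

print(chat(build_prompt(
    role="You are a careful methods reviewer.",
    task="List two weaknesses of evaluating a retrieval system on a single benchmark.",
    constraints=["Answer in at most 4 bullet points.", "Do not invent citations."],
), temperature=0.2))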

5.1 Structured output example

In later notebooks, we will extract structured fields from abstracts. Here is the basic pattern with a small example text.

# --- Structured output example ---

abstract = """
We propose a method for improving retrieval-augmented generation by re-ranking passages using a lightweight classifier.
Across three benchmarks, we show improved factual consistency and a reduction in citation errors.
Limitations include dependence on retrieval quality and reduced performance on highly novel queries.
"""

prompt = f"""
Extract the following fields from the text.

TEXT:
{abstract}

Return a JSON object with exactly these keys:
- research_goal
- method
- evidence
- limitation

Rules:
- Do not invent anything not present in the text.
- Keep each value under 25 words.
"""

print(chat(prompt, temperature=0.2))
```
{
  "research_goal": "improving retrieval-augmented generation",
  "method": "re-ranking passages using classifier",
  "evidence": "improved factual consistency",
  "limitation": "dependence on retrieval quality"
}
```
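
The model wrapped its JSON in a code fence, which is common. A checkable output is one you can parse and validate, so the small sketch below strips an optional fence, parses the reply, and confirms the four expected keys are present. It reads the last assistant reply from chat_history, which chat() already stores.

# --- Parse and validate the structured output (sketch) ---
import json  # Standard-library JSON parser
import re    # Used to strip an optional ``` code fence

reply = chat_history[-1]["content"]  # Last assistant reply stored by chat()

# Remove a surrounding ``` fence if the model added one.
cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())

try:
    data = json.loads(cleaned)
    expected = {"research_goal", "method", "evidence", "limitation"}
    missing = expected - set(data)
    print("Parsed OK. Missing keys:", missing or "none")
except json.JSONDecodeError as err:
    print("Could not parse JSON:", err)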

5.2 Safety check: force the model to separate facts from assumptions

This is especially useful when you ask the model to interpret results or summarise claims.

# --- Facts vs assumptions prompt pattern ---

text = """
This approach improves performance and should generalise to other datasets.
"""

prompt = f"""
Analyse the text below and separate:
1) Facts explicitly stated
2) Assumptions / claims that would need evidence

TEXT:
{text}
"""

print(chat(prompt, temperature=0.2))
**Facts explicitly stated:**
 None, the text only mentions a claim about the approach.

**Assumptions / claims that would need evidence:**
1. This approach improves performance.
2. The approach should generalise to other datasets.

6) RAG in one minute (concept only)

RAG (Retrieval-Augmented Generation) means:

1) retrieve relevant documents (papers, notes, PDFs)
2) pass the most relevant parts into the model as context
3) generate an answer that is grounded in that context

Why it matters:

- reduces hallucination risk
- enables citations (links / quotes) from provided sources
- makes answers checkable

We implement RAG hands-on in the merged literature workflow notebook.
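
If you want the intuition in code before then, here is a deliberately tiny toy: it "retrieves" by keyword overlap over three hard-coded notes (real RAG uses embeddings and a vector index) and then builds a grounded prompt from the best match. The notes, the query, and the scoring are made up purely for illustration.

# --- Toy RAG sketch (keyword overlap, illustration only) ---
notes = [
    "Note A: The re-ranking classifier improved factual consistency on all three benchmarks.",
    "Note B: Retrieval quality dropped sharply on highly novel queries.",
    "Note C: The lab meeting is moved to Thursday.",
]

def retrieve(query, documents):
    """Return the document sharing the most words with the query (toy retriever)."""
    q_words = set(query.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

query = "Why did performance drop on novel queries?"
best = retrieve(query, notes)

grounded_prompt = f"""Answer using ONLY the context below. Quote the supporting sentence.

CONTEXT:
{best}

QUESTION:
{query}
"""

chat_reset()
print(chat(grounded_prompt, temperature=0.2))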

7) Mini exercise (2–3 minutes)

Pick one research topic you care about. Ask the model for:

- 3 subquestions worth investigating
- a structured table describing what evidence you would need

Keep temperature around 0.4 for this brainstorming task.

# --- Mini exercise ---

topic = "Your topic here (e.g., 'LLMs for systematic reviews in medicine')"

prompt = f"""
Topic: {topic}

1) Propose 3 specific research subquestions.
2) For each subquestion, give a 2-column table:
   - What evidence/data would answer it?
   - What risks/biases could mislead us?
"""

print(chat(prompt, temperature=0.4))
**Topic: LLMs for Systematic Reviews in Medicine**

Here are 3 specific research subquestions:

1. Can LLMs accurately extract relevant data from medical literature?
2. How do LLMs compare to human reviewers in terms of efficiency and accuracy in systematic reviews?
3. Can LLMs help reduce bias in systematic reviews by identifying and mitigating potential sources of error?

Here are the 2-column tables for each subquestion:

**Subquestion 1: Can LLMs accurately extract relevant data from medical literature?**

| What evidence/data would answer it? | What risks/biases could mislead us? |
| --- | --- |
| Benchmarking datasets with annotated medical texts | Overfitting to specific datasets or domains |
| Comparison of LLM-extracted data to human-annotated gold standards | Lack of generalizability to diverse medical literature |

**Subquestion 2: How do LLMs compare to human reviewers in terms of efficiency and accuracy in systematic reviews?**

| What evidence/data would answer it? | What risks/biases could mislead us? |
| --- | --- |
| Head-to-head comparisons of LLMs and human reviewers on systematic review tasks | Selection bias in choosing which reviews to compare |
| Time-to-completion and accuracy metrics for LLMs and human reviewers | Differences in training data or expertise between LLMs and human reviewers |

**Subquestion 3: Can LLMs help reduce bias in systematic reviews by identifying and mitigating potential sources of error?**

| What evidence/data would answer it? | What risks/biases could mislead us? |
| --- | --- |
| Analysis of LLM-identified biases and errors in systematic reviews | Overreliance on LLMs to detect biases, rather than human critical appraisal |
| Comparison of bias detection and mitigation strategies with and without LLMs | Failure to account for LLMs' own potential biases or limitations |

8) Wrap-up

Key takeaways:

- Hallucinations happen: mitigate with context + structured outputs + verification prompts
- Context is your strongest control lever
- Temperature is a simple knob: low for extraction, medium for brainstorming
- RAG makes answers more checkable (we do it next)