04) LLMs for coding + data cleaning (workshop style)

Goal (25–30 min): learn a reliable workflow for using an LLM to write useful code for research — with one realistic data-cleaning example.

By the end you will be able to: - Ask for a function with a clear goal (inputs/outputs) - Ask for test code to validate it quickly - Ask for edge cases and improve robustness - Ask the LLM to explain the code line-by-line - Ask about failure modes (when your approach can mislead)

We keep everything function-based: you call chat() and copy/paste the code you want.

1) Quick setup

# remove # from the line below and run if packages are not installed 
# !pip install numpy pandas matplotlib openai requests

# We import only what we need.
import os  # Environment variables for API keys and configuration
from openai import OpenAI  # OpenAI-compatible client (works with Groq base_url too)

Paste your Groq API key below
You can also change the model if you like.
chat() can be used to have a continuous chat. Usage: chat(“Your question”)
chat_reset() can used to delete the history and set a new system prompt. Usage: chat_reset(“You are a helper for a bioinformatician.”) or simply chat_reset()

# --- Setup ---

# paste your key here for the session.
os.environ["GROQ_API_KEY"] = ""  # <-- Paste key for workshop only. Do NOT share publicly.

# Keep the same model string used in the workshop.
model = "llama-3.3-70b-versatile"  # You can change this later if needed.

# Keep the same base URL pattern used in the workshop.
os.environ["BASE_URL"] = "https://api.groq.com/openai/v1"  # Groq's OpenAI-compatible endpoint



# Create the client using the API key and base URL.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],  # Read the key from the environment variable
    base_url=os.environ["BASE_URL"],     # Read the base URL from the environment variable
)

# We keep a small conversation history in memory to make multi-turn interactions easy.
chat_history = []

def chat_reset(system_prompt="You are a helpful research assistant."):
    """Reset chat history to a single system message (in-place)."""
    chat_history[:] = [{"role": "system", "content": system_prompt}]

chat_reset()

def chat(user_text, temperature=0.2, max_turns=8):
    """Multi-turn helper that keeps a short conversation history.
    Use this instead of calling the API directly in most workshop cells.
    """
    # Add the user's message to the conversation history.
    chat_history.append({"role": "user", "content": user_text})
    
    # Keep only the system message plus the most recent turns (to manage context length).
    system = chat_history[:1]  # First message is the system instruction
    recent = chat_history[-(max_turns * 2):]  # Each turn has user+assistant, hence *2
    window = system + recent  # Final message window sent to the model
    
    # Make the API call.
    resp = client.chat.completions.create(
        model=model,         # Model name
        messages=window,     # Conversation history window
        temperature=temperature,  # Randomness / creativity control
    )
    
    # Extract the assistant's reply text.
    reply = resp.choices[0].message.content
    
    # Store the assistant reply so the next call has memory.
    chat_history.append({"role": "assistant", "content": reply})
    
    # Return just the text (simple and workshop-friendly).
    return reply



print("Setup complete. `client`, `model`, and `chat()` are available.")
print("Testing chat function...")
print("User: Hello LLM")
print("Assistant: " + chat("Heloo LLM"))  # Test the chat function 

chat_reset()  # Reset chat history for next cells

Setup complete. `client`, `model`, and `chat()` are available.
Testing chat function...
User: Hello LLM
Assistant: Hello. It's nice to meet you. Is there something I can help you with or would you like to chat? I'm here to assist you with any questions or topics you'd like to discuss.

2) The LLM coding checklist (fast)

When you ask an LLM to write code, include:

Clear goal — what the function should do
Inputs/outputs — types, columns, shapes
Constraints — e.g. do not modify inputs in place
Edge cases — what could break it
Explanation — ask it to explain the code back to you
Failure modes — ask when this approach could mislead
Iterate — run the code, show errors/output, refine

In this notebook we will apply most of these using one data-cleaning example.

3) A realistic example: cleaning a messy dataset

Imagine you received a CSV from a collaborator. It contains: - missing values - numeric columns stored as strings - stray whitespace in text fields - an optional date column that might need parsing

We will use the LLM to: 1) generate a reusable cleaning function 2) generate a small test dataset 3) improve the function for edge cases 4) discuss failure modes (when cleaning choices can bias results)

# --- Create a tiny "messy" dataset to motivate the task ---

import pandas as pd  # Tables
import numpy as np   # NaN values and numeric helpers

# We build a small DataFrame with common messiness patterns.
df_raw = pd.DataFrame({
    "patient_id": [" 001", "002 ", "003", "004"],          # whitespace around IDs
    "age": ["34", None, "29", "unknown"],                   # numeric stored as strings + missing + bad token
    "biomarker": [1.2, np.nan, 0.9, 1.5],                         # numeric + NaN
    "group": [" control", "treatment ", None, "control"],   # whitespace + missing
    "date": ["2024-01-05", "05/01/2024", "", None],         # mixed formats + empty + missing
})

df_raw

	patient_id	age	biomarker	group	date
0	001	34	1.2	control	2024-01-05
1	002	None	NaN	treatment	05/01/2024
2	003	29	0.9	None
3	004	unknown	1.5	control	None

4) Step A — Ask for a cleaning function (prompt → code)

Prompt pattern

Ask for: - a function with a clear signature - no in-place modification - minimal dependencies - readable code

Below we ask the LLM to generate a function named clean_research_df(df).

# --- Ask the LLM for a cleaning function ---

# We reset chat history so earlier prompts don't influence this code generation.
chat_reset("You are a senior Python data scientist. Write clean, readable code.")

prompt = """
Write a Python function named clean_research_df(df) that cleans a pandas DataFrame.

Requirements:
1) Input: a pandas DataFrame.
2) Output: a NEW cleaned DataFrame (do NOT modify the input df in place).
3) Strip whitespace from ALL string/object columns.
4) For numeric-looking columns that may contain strings (e.g., '34', '29', 'unknown'):
   - convert to numeric with errors coerced to NaN.
5) For numeric columns, fill missing values with the column mean.
6) If a column named 'date' exists:
   - parse it to datetime with errors coerced to NaT.
7) Keep the code minimal and readable (no classes).

Return ONLY Python code (no markdown).
"""

generated_code = chat(prompt, temperature=0.2)

print(generated_code)

```python
import pandas as pd
import numpy as np

def clean_research_df(df):
    # Create a copy of the input DataFrame to avoid modifying it in place
    cleaned_df = df.copy()

    # Strip whitespace from all string/object columns
    for col in cleaned_df.select_dtypes(include=['object']).columns:
        cleaned_df[col] = cleaned_df[col].str.strip()

    # Convert numeric-looking columns to numeric with errors coerced to NaN
    for col in cleaned_df.select_dtypes(include=['object']).columns:
        cleaned_df[col] = pd.to_numeric(cleaned_df[col], errors='coerce')

    # Fill missing values in numeric columns with the column mean
    for col in cleaned_df.select_dtypes(include=['int64', 'float64']).columns:
        cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())

    # Parse 'date' column to datetime with errors coerced to NaT if it exists
    if 'date' in cleaned_df.columns:
        cleaned_df['date'] = pd.to_datetime(cleaned_df['date'], errors='coerce')

    return cleaned_df
```

Run the generated code

Copy/paste the generated function into the cell below.

In a workshop, this copy/paste step is intentional: it makes it clear what code came from the model.

# --- Paste the generated function here ---
# After pasting, run this cell.

# Example:
# def clean_research_df(df):
#     ...

import pandas as pd
import numpy as np

def clean_research_df(df):
    # Create a copy of the input DataFrame to avoid modifying it in place
    cleaned_df = df.copy()

    # Strip whitespace from all string/object columns
    for col in cleaned_df.select_dtypes(include=['object', 'string']).columns:
        cleaned_df[col] = cleaned_df[col].str.strip()

    # Convert numeric-looking columns to numeric with errors coerced to NaN
    for col in cleaned_df.select_dtypes(include=['object', 'string']).columns:
        cleaned_df[col] = pd.to_numeric(cleaned_df[col], errors='coerce')

    # Fill missing values in numeric columns with the column mean
    for col in cleaned_df.select_dtypes(include=['int64', 'float64']).columns:
        cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())

    # Parse 'date' column to datetime with errors coerced to NaT if it exists
    if 'date' in cleaned_df.columns:
        cleaned_df['date'] = pd.to_datetime(cleaned_df['date'], errors='coerce')

    return cleaned_df

Apply the function

Once you have pasted the function, run it on df_raw.

# --- Apply the cleaning function (after you pasted it above) ---

df_clean = clean_research_df(df_raw)
df_clean

# pass

	patient_id	age	biomarker	group	date
0	1	34.0	1.2	NaN	NaT
1	2	31.5	1.2	NaN	NaT
2	3	29.0	0.9	NaN	NaT
3	4	31.5	1.5	NaN	NaT

5) Step B — Ask for test code

Testing doesn’t have to be fancy. Even a small “toy” test dataset plus a few checks can catch many mistakes.

Prompt pattern

Ask for a minimal test that verifies: - input not modified - whitespace removed - numeric conversion works - NaNs filled for numeric columns - date parsed if present

# --- Ask the LLM for simple test code ---

chat_reset("You write short, practical test code for a workshop.")

prompt = """
Write a small Python test snippet (no pytest) to validate a function clean_research_df(df).

Assume pandas is imported.

The test should:
1) Create a small messy DataFrame.
2) Call clean_research_df(df).
3) Print checks that demonstrate:
   - the original DataFrame did not change
   - whitespace stripping worked
   - numeric conversion worked (bad tokens become NaN)
   - missing numeric values were filled
   - 'date' column was parsed to datetime if present

Return ONLY Python code (no markdown).
"""

test_code = chat(prompt, temperature=0.2)
print(test_code)

```python
import pandas as pd
import numpy as np

def clean_research_df(df):
    df = df.copy()
    df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
    df = df.apply(pd.to_numeric, errors='coerce')
    df = df.fillna(0)
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'], errors='coerce')
    return df

# Create a small messy DataFrame
df = pd.DataFrame({
    'id': [' 1 ', ' 2 ', ' 3 '],
    'name': [' John ', ' Mary ', ' David '],
    'age': [' 25 ', ' thirty ', ' 30 '],
    'date': ['2022-01-01', '2022-02-02', 'not a date']
})

# Store the original DataFrame
original_df = df.copy()

# Call clean_research_df(df)
clean_df = clean_research_df(df)

# Print checks
print("Original DataFrame did not change:")
print(original_df.equals(df))

print("\nWhitespace stripping worked:")
print(clean_df['name'].equals(pd.Series(['John', 'Mary', 'David'])))

print("\nNumeric conversion worked:")
print(clean_df['age'].equals(pd.Series([25.0, np.nan, 30.0])))

print("\nMissing numeric values were filled:")
print(clean_df['age'].equals(pd.Series([25.0, 0.0, 30.0])))

print("\n'date' column was parsed to datetime:")
print(clean_df['date'].dtype == 'datetime64[ns]')
```

6) Step C — Edge cases: ask what could break

After you run the function once, ask: - What edge cases could break this? - How should we handle them?

Examples: - column already numeric - all values missing (mean is NaN) - mixed types (numbers + strings) - date formats inconsistent

# --- Ask for edge cases + an improved version (FIXED: includes the actual function) ---

import inspect  # Lets us extract the actual function source code from the notebook

# Reset chat so the model follows instructions cleanly (but we will INCLUDE the function code explicitly)
chat_reset("You are careful. Identify edge cases and improve the function. Follow instructions exactly.")

# Try to capture the function code the user actually pasted/ran earlier in the notebook.
# This avoids triple-quote issues and ensures the LLM is reasoning about the real function.
try:
    function_code = inspect.getsource(clean_research_df)
except Exception as e:
    # Fallback: if inspect can't retrieve it (rare), ask the user to paste it.
    function_code = "PASTE clean_research_df(df) FUNCTION CODE HERE"

prompt = f"""
You are reviewing and improving a Python function.

FUNCTION:
{function_code}

TASK:
1) List 5 realistic edge cases that could break this function or produce surprising results.
2) Provide an improved version of the function that handles those edge cases.

OUTPUT FORMAT:
- Bullet list of edge cases (short)
- Then ONLY Python code for the improved function (no markdown, no extra commentary)

RULES:
- Keep code minimal and readable.
- Do not add extra dependencies.
- Do not change the function name (must remain clean_research_df).
"""

edge_case_response = chat(prompt, temperature=0.2)
print(edge_case_response)

* Empty DataFrame
* DataFrame with no string or object columns
* DataFrame with no numeric columns
* DataFrame with a 'date' column that is not in a parseable format
* DataFrame with columns that have a mix of numeric and non-numeric data

```python
import pandas as pd
import numpy as np

def clean_research_df(df):
    if df.empty:
        return df

    cleaned_df = df.copy()

    for col in cleaned_df.select_dtypes(include=['object', 'string']).columns:
        cleaned_df[col] = cleaned_df[col].str.strip()

    for col in cleaned_df.select_dtypes(include=['object', 'string']).columns:
        try:
            cleaned_df[col] = pd.to_numeric(cleaned_df[col], errors='coerce')
        except ValueError:
            pass

    for col in cleaned_df.select_dtypes(include=['int64', 'float64']).columns:
        if cleaned_df[col].notnull().any():
            cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())
        else:
            cleaned_df[col] = cleaned_df[col].fillna(0)

    if 'date' in cleaned_df.columns:
        try:
            cleaned_df['date'] = pd.to_datetime(cleaned_df['date'], errors='coerce')
        except ValueError:
            pass

    return cleaned_df
```

7) Step D — Ask the model to explain the code

This is a very effective habit. It helps you: - understand what the model wrote - spot silent errors - learn patterns you can reuse

# --- Ask for an explanation of YOUR function (correct version) ---

chat_reset("Explain code clearly to a researcher. Be concise.")

# Try to capture the function code the user actually pasted/ran earlier in the notebook.
# This avoids triple-quote issues and ensures the LLM is reasoning about the real function.
try:
    function_code = inspect.getsource(clean_research_df)
except Exception as e:
    # Fallback: if inspect can't retrieve it (rare), ask the user to paste it.
    function_code = "PASTE clean_research_df(df) FUNCTION CODE HERE"
    
    

prompt = f'''Explain the following Python function line by line in simple terms.
Also point out 3 places where silent mistakes could happen.

FUNCTION:
{function_code}
'''

explanation = chat(prompt, temperature=0.2)
print(explanation)

Here's a line-by-line explanation of the function:

1. `def clean_research_df(df):` 
   - This line defines a function named `clean_research_df` that takes one argument, `df`, which is expected to be a pandas DataFrame.

2. `cleaned_df = df.copy()` 
   - This line creates a copy of the input DataFrame `df` and assigns it to `cleaned_df`. This is done to avoid modifying the original DataFrame.

3. `for col in cleaned_df.select_dtypes(include=['object', 'string']).columns:` 
   - This line starts a loop that will iterate over all columns in `cleaned_df` that contain string or object data types.

4. `cleaned_df[col] = cleaned_df[col].str.strip()` 
   - Inside the loop, this line removes leading and trailing whitespace from all string values in the current column.

5. `for col in cleaned_df.select_dtypes(include=['object', 'string']).columns:` 
   - This line starts another loop that will iterate over the same columns as before (those with string or object data types).

6. `cleaned_df[col] = pd.to_numeric(cleaned_df[col], errors='coerce')` 
   - Inside the loop, this line attempts to convert the values in the current column to numeric values. If a value cannot be converted, it will be replaced with NaN (Not a Number).

7. `for col in cleaned_df.select_dtypes(include=['int64', 'float64']).columns:` 
   - This line starts a loop that will iterate over all columns in `cleaned_df` that contain integer or floating-point data types.

8. `cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())` 
   - Inside the loop, this line replaces any missing values (NaN) in the current column with the mean value of that column.

9. `if 'date' in cleaned_df.columns:` 
   - This line checks if `cleaned_df` has a column named 'date'.

10. `cleaned_df['date'] = pd.to_datetime(cleaned_df['date'], errors='coerce')` 
    - If the 'date' column exists, this line attempts to convert its values to datetime format. If a value cannot be converted, it will be replaced with NaT (Not a Time).

11. `return cleaned_df` 
    - This line returns the cleaned DataFrame.

As for silent mistakes, here are three potential issues:

1. **Overwriting numeric columns**: In the second loop, the function attempts to convert all string/object columns to numeric. However, if a column was previously converted to numeric in the first loop (because it contained numeric-looking strings), this conversion will overwrite the existing numeric values with NaN if they cannot be converted again. To avoid this, the function should only attempt to convert columns that were not already converted to numeric.

2. **Mean calculation with NaN**: When calculating the mean of a column to fill missing values, the function will ignore NaN values. However, if a column contains only NaN values, the mean will also be NaN, and the function will not replace the NaN values. To avoid this, the function should check if a column contains only NaN values before attempting to calculate its mean.

3. **Loss of original data types**: The function converts all string/object columns to numeric, which can lead to loss of original data types. For example, if a column contains a mix of numeric and non-numeric strings, the non-numeric strings will be replaced with NaN. To avoid this, the function should preserve the original data types of columns that cannot be converted to numeric.

8) Step E — Failure modes (research thinking)

Cleaning choices can change your conclusions.

For example: - mean imputation can shrink variance - missingness might be systematic (not random) - stripping whitespace might merge categories unintentionally - date parsing might silently fail and create missing timestamps

Here we ask the model to reason about when this approach could mislead.

# --- Ask about failure modes ---

chat_reset("You are a careful applied statistician. Focus on research risks.")

# Try to capture the function code the user actually pasted/ran earlier in the notebook.
# This avoids triple-quote issues and ensures the LLM is reasoning about the real function.
try:
    function_code = inspect.getsource(clean_research_df)
except Exception as e:
    # Fallback: if inspect can't retrieve it (rare), ask the user to paste it.
    function_code = "PASTE clean_research_df(df) FUNCTION CODE HERE"
    

prompt = """
In the cleaning approach we used (whitespace stripping, numeric coercion, mean imputation, date parsing):

1) List 5 ways these choices could mislead downstream analysis.
2) For each, suggest one simple mitigation a researcher could do quickly.

Keep it practical and not too long.

FUNCTION:
{function_code}
"""

risks = chat(prompt, temperature=0.2)
print(risks)

Here are 5 ways the cleaning approach could mislead downstream analysis, along with a simple mitigation for each:

1. **Loss of meaningful zeros**: Whitespace stripping and numeric coercion might remove meaningful zeros, leading to incorrect interpretations.
Mitigation: Verify that zeros are correctly preserved by checking a sample of data before and after cleaning.

2. **Over-imputation**: Mean imputation might over-estimate or under-estimate values, especially for skewed distributions.
Mitigation: Use a simple visualization (e.g., histogram) to check the distribution of imputed values against the original data.

3. **Date parsing errors**: Incorrect date parsing might lead to incorrect time-related analyses.
Mitigation: Spot-check a sample of parsed dates against the original data to ensure accuracy.

4. **Masking of outliers**: Mean imputation might mask outliers, leading to incorrect models or conclusions.
Mitigation: Use a simple outlier detection method (e.g., boxplot) to identify potential outliers before imputing missing values.

5. **Ignoring data context**: Automated cleaning might ignore important context-specific information (e.g., missing values indicating a specific condition).
Mitigation: Review the data cleaning process with a domain expert to ensure that context-specific information is preserved and correctly interpreted.

These mitigations can help researchers quickly identify and address potential issues in their data cleaning approach.

9) Wrap-up

What you learned: - How to prompt for a function with a clear goal - How to prompt for quick test code - How to improve robustness via edge cases - How to use explanation + failure modes to build trust

Key habit: treat the LLM like a fast collaborator. You still run the code, check outputs, and decide what is scientifically appropriate.