Building an end-to-end AI chatbot with Python, Scrapy, OpenAI (GPT-3.5) and custom training data
A complete guide to collecting real-world developer knowledge from Stack Overflow, preparing high-quality JSONL datasets, fine-tuning GPT-3.5 Turbo, and deploying a fully functional, AI-powered chatbot using Python, Scrapy, and OpenAI’s fine-tuning tools.
Real-world usage scenario:
In QA workflows, seconds matter. Whether you’re writing test assertions, debugging a flaky selector, or onboarding a new team member, access to focused, contextual knowledge can make or break velocity. This chatbot was fine-tuned using hundreds of Stack Overflow Q&A examples.
This same approach can be expanded far beyond dev/QA knowledge, enabling teams to instantly query internal documentation, surface business-specific insights, and accelerate decision-making with a domain-trained conversational assistant.
Not in Scope
- Python cron jobs / scheduling: This project does not cover setting up scheduled tasks or background job automation using cron, APScheduler, or similar Python scheduling libraries.
- Graphical user interfaces (UI): The chatbot does not include a web-based or graphical UI (e.g., using Streamlit, Flask, or React). It operates purely through the command line or API interactions.
What you will need:
- Stack Overflow (Stack Exchange) API key
- OpenAI API key
- OpenAI credits to pay for fine-tuning tokens (a credit card on the account)
Once the core logic is in place, the project is quick to build and execute.

1. Scrape Stack Overflow
Tools: Scrapy, BeautifulSoup. Purpose: Extract real developer Q&A data for model training.
Using Python’s Scrapy and BeautifulSoup, we built a crawler that fetches high-voted questions tagged python along with their top-voted answers.
pip install scrapy beautifulsoup4        # beautifulsoup4 is used to clean the answer bodies
scrapy startproject project_name         # generic form
scrapy startproject stackoverflowdemo    # the project used in this demo
This will create a directory structure like this:

stackoverflowdemo/
    scrapy.cfg
    stackoverflowdemo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            stackoverflow.py   <-- add this file
What is really important is the spider stackoverflow.py.
We first get all questions related to Python, and we limit the page size to 100.
start_urls = [
    "https://api.stackexchange.com/2.3/questions?order=desc&sort=votes&tagged=Python&site=stackoverflow&key=rl_7MDaPq5dVss5TykzTyAY4soTi&pagesize=100"
]
In start_requests, we go through every question_id to obtain the answer_ids.
For example, for question_id = 231767 the answers URL looks like this:
https://api.stackexchange.com/2.3/questions/231767/answers?order=desc&sort=votes&site=stackoverflow&key=rl_7MDaPq5dVss5TykzTyAY4soTi
In parse_answers we iterate again to get the details (text) of each answer. I limit the loop to the first 3 answers per question.

Finally, in collect_answers we “beautify” the body of the data and dump it into a JSON file.
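For reference, here is a rough sketch of how the spider can be wired together. The callback names (parse_answers, collect_answers) follow the steps above, but the exact request chaining, the withbody filter, and the item fields are my assumptions rather than the project's verbatim code; replace YOUR_KEY with your own Stack Exchange API key.

```python
# spiders/stackoverflow.py -- illustrative sketch, not the exact project code
import scrapy
from bs4 import BeautifulSoup

API = "https://api.stackexchange.com/2.3"
KEY = "YOUR_KEY"  # your Stack Exchange API key


class StackOverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    # Highest-voted Python questions, 100 per page
    start_urls = [
        f"{API}/questions?order=desc&sort=votes&tagged=Python"
        f"&site=stackoverflow&pagesize=100&key={KEY}"
    ]

    def parse(self, response):
        # For every question, request its answers (ids only at this point)
        for q in response.json()["items"]:
            url = (f"{API}/questions/{q['question_id']}/answers"
                   f"?order=desc&sort=votes&site=stackoverflow&key={KEY}")
            yield scrapy.Request(url, callback=self.parse_answers,
                                 cb_kwargs={"question": q})

    def parse_answers(self, response, question):
        # Keep only the first 3 (highest-voted) answers, then fetch their bodies
        answer_ids = [a["answer_id"] for a in response.json()["items"][:3]]
        if not answer_ids:
            return
        ids = ";".join(str(i) for i in answer_ids)
        url = (f"{API}/answers/{ids}?order=desc&sort=votes"
               f"&site=stackoverflow&filter=withbody&key={KEY}")
        yield scrapy.Request(url, callback=self.collect_answers,
                             cb_kwargs={"question": question})

    def collect_answers(self, response, question):
        # "Beautify" the HTML body into plain text and hand the item to the pipeline
        for a in response.json()["items"]:
            yield {
                "question": question["title"],
                "tags": question["tags"],
                "link": question["link"],
                "answer_count": question.get("answer_count", 0),
                "answers": {
                    "id": a["answer_id"],
                    "body": BeautifulSoup(a["body"], "html.parser").get_text(" ", strip=True),
                    "score": a["score"],
                },
            }
```

Running scrapy crawl stackoverflow from the project root starts the crawl and sends every yielded item through the pipeline described in the next step.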

2. Store Q&A in JSON Format
Format: .jsonl (JSON Lines). Why: OpenAI expects training data in newline-delimited JSON format.
Each Q&A pair is saved in structured JSON, keeping fields like question, tags, answer, and link.
pipelines.py is where the logic for the JSON dump lives.
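A minimal sketch of such a pipeline, assuming one item per answer as produced by the spider sketch above (the class name and output path are illustrative):

```python
# pipelines.py -- illustrative JSONL writer
import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("stackoverflow_qna.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One item per line -> newline-delimited JSON (.jsonl)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Remember to register the pipeline in settings.py through ITEM_PIPELINES, otherwise Scrapy will not call it.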

Your file should follow this format (with around 300 lines of data):
{"question": "How do I split the definition of a long string over multiple lines?", "tags": ["python", "string", "multiline", "multilinestring"], "link": "https://stackoverflow.com/questions/10660435/how-do-i-split-the-definition-of-a-long-string-over-multiple-lines", "answer_count": 31, "answers": {"id": 24331604, "body": "Breaking lines by  \\  works for me.  Here is an example:   longStr = \"This is a very long string \" \\         \"that I wrote to help somebody \" \\         \"who had a question about \" \\         \"writing long strings in Python\"", "score": 199}} 
Some questions may include multiple answers (typically 2 or 3), which can lead to occasional repetition in the dataset. A potential improvement would be to group similar answers or select only the most relevant one for training.
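One simple way to do that is to keep only the highest-scored answer per question before converting the data. A small sketch of that cleaning step (the file names follow the project structure at the end of the article, where cleaned_data.jsonl appears as an optional artifact):

```python
# Optional cleaning step: keep one (best) answer per question
import json

best = {}
with open("output/stackoverflow_qna.jsonl", encoding="utf-8") as src:
    for line in src:
        record = json.loads(line)
        key = record["link"]  # one entry per question
        if key not in best or record["answers"]["score"] > best[key]["answers"]["score"]:
            best[key] = record

with open("output/cleaned_data.jsonl", "w", encoding="utf-8") as dst:
    for record in best.values():
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```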
3. Data Conversion & Transformation
In this step, we copy the output file from the scraper into the Chatbot project. (I created a separate project for the chatbot; the full structure is at the end of the article.)
As you can see, there is an “input_file” and an “output_file”. Since the model I’m using is gpt-3.5-turbo-0125, the output has to follow OpenAI’s chat fine-tuning structure: one JSON line per example, each containing a messages list with system, user, and assistant entries.
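A minimal sketch of fileconversion.py under those assumptions (the system prompt is a placeholder; the paths match the project structure at the end of the article):

```python
# fileconversion.py -- convert scraped Q&A lines into OpenAI chat fine-tuning format
import json

input_file = "output/stackoverflow_qna.jsonl"
output_file = "output/fine_tune_ready.jsonl"

system_prompt = "You are a helpful assistant that answers Python developer questions."

with open(input_file, encoding="utf-8") as src, \
        open(output_file, "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        example = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": record["question"]},
                {"role": "assistant", "content": record["answers"]["body"]},
            ]
        }
        dst.write(json.dumps(example, ensure_ascii=False) + "\n")
```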

4. Fine-Tune GPT-3.5 Turbo
- Upload dataset
- Start fine-tune
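Both steps can be done with the OpenAI Python SDK (openai >= 1.0). A minimal sketch of training.py, assuming the converted file from the previous step:

```python
# training.py -- upload the dataset and start the fine-tune job
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the dataset
upload = client.files.create(
    file=open("output/fine_tune_ready.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tune job
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-3.5-turbo-0125",
)
print("Fine-tune job started:", job.id)
```

The job runs asynchronously; you can poll it with client.fine_tuning.jobs.retrieve(job.id), and once it succeeds the job object contains the fine-tuned model id to use in the next step.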


5. Deploy and Execute the Chatbot
Interface: Command Line or Python script
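A minimal command-line loop is enough to talk to the fine-tuned model. In the sketch below, the model id is a placeholder for the one returned by your fine-tune job:

```python
# chatbotmodel.py -- simple command-line chat loop against the fine-tuned model
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo-0125:your-org::abc123"  # replace with your fine-tuned model id

while True:
    question = input("You: ")
    if question.lower() in {"exit", "quit"}:
        break
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers Python developer questions."},
            {"role": "user", "content": question},
        ],
    )
    print("Bot:", response.choices[0].message.content)
```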

Final structure:
DEMO/
├── Chatbot/
│   ├── chatbotmodel.py              # Script for interacting with the fine-tuned chatbot model
│   ├── fileconversion.py            # Script to convert StackOverflow Q&A data into fine-tuning format
│   ├── training.py                  # Script to upload data and start the fine-tuning process
│   ├── requirements.txt             # List of Python dependencies
│   ├── output/                      # Directory for storing output files
│   │   ├── stackoverflow_qna.jsonl  # Raw scraped Q&A data
│   │   ├── fine_tune_ready.jsonl    # Data formatted for fine-tuning
│   │   └── cleaned_data.jsonl       # Optional: cleaned and processed data
│   └── README.md                    # Documentation for the Chatbot module
├── stackoverflowdemo/               # Scrapy project directory for scraping StackOverflow data
│   ├── scrapy.cfg                   # Scrapy configuration file
│   └── stackoverflowdemo/           # Main Scrapy project module
│       ├── __init__.py              # Marks this directory as a Python package
│       ├── items.py                 # Define data structures for scraped items
│       ├── middlewares.py           # Scrapy middlewares (if needed)
│       ├── pipelines.py             # Define pipelines for processing scraped data
│       ├── settings.py              # Scrapy project settings
│       └── spiders/                 # Directory for Scrapy spiders
│           ├── __init__.py          # Marks this directory as a Python package
│           └── stackoverflow.py     # Spider for scraping StackOverflow data
└── README.md                        # Documentation for the entire project
Now, let’s address the concepts of:
- temperature: 0.2 – makes the output focused and deterministic (in this example, I chose to use 0.7)
- frequency_penalty: 0.5 – reduces repeated words within the same response
- presence_penalty: 0.3 – discourages repeating ideas/topics, making replies more diverse
Notice the difference in the answers.
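These parameters are passed straight into the chat completion call. A quick sketch, reusing the client and MODEL from the chatbotmodel.py sketch above, with the values discussed:

```python
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": question}],
    temperature=0.7,        # higher values -> more varied output; 0.2 would be more deterministic
    frequency_penalty=0.5,  # penalizes repeating the same words within a response
    presence_penalty=0.3,   # penalizes revisiting the same ideas/topics
)
```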



