JUNE 29, 2023

                                                            AIDEN SWANN

Caltech-GPT is a nano-GPT-based model trained on data scraped from caltech.edu as a student experiment.

Why did you choose to go to Caltech?

These narrow possible is inexpensive, rather-than-on-a-down-style The other behavior of the control and ethical dimensional communication form every perspective in the same world's so-careers from the own scientific doubled detection Caltech professor Earth Evolutionary Brown Professor of Physics Harvard Xenational Laboratory is established in the Kavli Nanoscience Institute for Physics, he Investigator, and researchers

The text above is a 420-character sample generated by a new large language model (LLM) that I call Caltech-GPT. Although it entirely lacks cohesion and grammar, and occasionally makes up words, it somehow captures something of the essence of Caltech.

I remember the first time I tried ChatGPT. I was absolutely floored. The model's ability to generate human-like and human-surpassing responses was unlike anything I'd ever seen before. In this post, I will first briefly discuss the principles underlying LLMs and then discuss my simplified version: Caltech-GPT.

Large language models (LLMs) are incredibly complex systems based on deep learning architectures, specifically a type of neural network called a Transformer. Their functioning is driven by a process known as unsupervised learning, where the model trains on vast amounts of text data and learns to generate human-like text based on the patterns it identifies. The model iterates through the text dataset, learning to predict the next word in a sequence.
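To make the next-word prediction objective concrete, here is a toy sketch of how a sentence becomes (context, target) training pairs. It splits on whole words purely for illustration; real models operate on subword or character tokens.

# Toy illustration of next-word prediction training pairs.
# Splitting on whole words is a simplification; real LLMs use
# subword or character tokens.
sentence = "Caltech is a world renowned research university".split()

# Each example asks the model to predict the next word given
# every word that came before it.
for i in range(1, len(sentence)):
    context, target = sentence[:i], sentence[i]
    print(f"context: {context} -> predict: {target!r}")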


Transformer architecture from the original "Attention Is All You Need" paper

Large Language Models (LLMs) learn from extensive internet-derived datasets, comprising billions of sentences in diverse languages, giving them a comprehensive grasp of human language, context, and grammar. This raw data undergoes preprocessing for cleaning and organization. The model then trains, reading the sentences and predicting the next word based on the preceding ones, with its layers of artificial neurons each learning different language aspects. This process adjusts the model's internal parameters to minimize prediction errors. Once trained, LLMs can generate grammatically and contextually coherent text, with each word's choice based on its predicted probability given the previous words. Moreover, LLMs can be fine-tuned for specific tasks or domains, such as medical literature, to enhance proficiency in the relevant terms and concepts.
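The generation step described above can be sketched in a few lines. The vocabulary and probabilities here are invented purely for illustration; a real model produces them with a softmax over tens of thousands of tokens, conditioned on everything generated so far.

import random

# Toy illustration of sampling the next word from a predicted
# probability distribution. The vocabulary and probabilities are
# made up for this example.
vocab = ["physics", "research", "students", "campus"]
probabilities = [0.5, 0.3, 0.15, 0.05]

next_word = random.choices(vocab, weights=probabilities, k=1)[0]
print(f"sampled next word: {next_word}")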


GPT-3 model parameters as published by OpenAI.

It is worth noting that Caltech-GPT is a massive simplification of the cutting-edge LLM technology found in models like GPT-3 or GPT-4. First, the number of parameters is far smaller.

While the smallest GPT-3 model has 125M parameters, Caltech-GPT has only 12M. Caltech-GPT is forked from nano-GPT, written by Andrej Karpathy. Karpathy also made an incredible video explaining nano-GPT and transformers in detail, which I highly recommend watching.
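For a sense of scale, a nano-GPT-style model is defined by just a handful of hyperparameters. The values below are one plausible configuration for a model in roughly this parameter range, not necessarily the exact settings used for Caltech-GPT.

# Hypothetical nano-GPT-style hyperparameters for a model of roughly
# this size (not necessarily the exact Caltech-GPT configuration).
n_layer = 6       # number of transformer blocks
n_head = 6        # attention heads per block
n_embd = 384      # embedding / hidden dimension
block_size = 256  # context length in tokens

# Rough count of weights: each block carries ~4 * n_embd^2 attention
# parameters plus ~8 * n_embd^2 MLP parameters, ignoring embeddings.
approx_params = n_layer * 12 * n_embd ** 2
print(f"~{approx_params / 1e6:.1f}M parameters")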

Data Sourcing and Preprocessing

I now want to discuss the aspects of this project that distinguish it from nano-GPT. This is where Caltech comes into the picture. My goal with this project was to see how much capability I could develop with only the data found on the caltech.edu domain. My first step was to create a web scraper that would crawl through all Caltech websites, extracting all of the links and text.

    # Seed the queue with the starting URL if it is valid and not yet visited
    if start_url and is_valid(start_url, domain) and start_url not in visited_links:
        to_visit.insert(0, start_url)

    # If nothing is queued, fall back to the root of the domain
    if not to_visit:
        to_visit.append(f'https://{domain}')

    while to_visit:
        current_link = to_visit.pop(0)

        if current_link not in visited_links and is_allowed_file_type(current_link) and not is_bot_protected(current_link):
            visited_links.add(current_link)
            print(f'Crawling {current_link}')

            # Save the page text and record the URL as visited
            save_text(current_link, output_file)
            save_visited_url(current_link, visited_urls_file)

            # Queue up every link discovered on the current page
            new_links = get_all_links(current_link, domain)
            to_visit.extend(new_links)

            # Persist the current queue to disk
            save_to_visit_list(to_visit, to_visit_file)

            # Throttle requests so as not to hammer the server
            time.sleep(REQUEST_DELAY)

Above is the main loop for the web scraper. The crawler maintains a queue of links to visit; for each new page it saves the extracted text, records the URL as visited, and adds any newly discovered links back onto the queue before pausing briefly and moving on.
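The loop relies on several helper functions (is_valid, get_all_links, save_text, and so on) whose implementations aren't shown here. As one example, here is a minimal sketch of what get_all_links might look like, assuming the requests and BeautifulSoup libraries; the actual helper may differ.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def get_all_links(url, domain):
    # Minimal sketch: fetch the page and return absolute links that
    # stay within the given domain. The real helper may handle errors
    # and filtering differently.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for anchor in soup.find_all('a', href=True):
        absolute = urljoin(url, anchor['href'])
        if urlparse(absolute).netloc.endswith(domain):
            links.append(absolute)
    return links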

Once this raw text file is generated, we need to complete a number of preprocessing steps. This is necessary because the raw text contains a lot of links and other garbage that we want to remove. The preprocessing I use for Caltech-GPT is extremely basic and is shown below.

import string

def is_line_allowed(line):
    # Permit letters, digits, punctuation, and ordinary whitespace,
    # dropping the last two whitespace characters (vertical tab and form feed)
    allowed_chars = string.ascii_letters + string.digits + string.punctuation + string.whitespace
    allowed_chars = allowed_chars[:-2]
    return all(char in allowed_chars for char in line)

def is_line_valid(line):
    # Keep only reasonably long lines; short fragments tend to be
    # navigation text, link labels, or other garbage
    return len(line.strip()) >= 100

def remove_unwanted_lines(input_file, output_file):
    # Copy over only the lines that pass both filters
    with open(input_file, 'r', encoding='utf-8') as f_in, open(output_file, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            if is_line_allowed(line) and is_line_valid(line):
                f_out.write(line)
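With these filters in place, cleaning the scraped text is a single call. The file names below are placeholders, not the actual paths used for Caltech-GPT.

# File names here are illustrative placeholders.
remove_unwanted_lines('caltech_raw.txt', 'caltech_clean.txt')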