chat-with-your-doc is a demonstration application that uses ChatGPT/GPT-4 and LangChain to let users chat with their documents. This repository hosts the codebase, instructions, and resources needed to set up and run the application.
The primary goal of this project is to simplify interaction with documents and extract valuable information using natural language. It is built with LangChain and GPT-4/ChatGPT to deliver a smooth, natural conversational experience, and supports both Azure OpenAI Service and OpenAI.
- 20230709: Add Support for OpenAI API
- 20230703: Web UI changed to Streamlit, with support for streaming
- Upload documents as an external knowledge base for GPT-4/ChatGPT, with support for both Azure OpenAI Service and OpenAI.
- Support various formats, including PDF, DOCX, PPTX, TXT and others.
- Chat with the document content, ask questions, and get relevant answers based on the context.
- User-friendly interface to ensure seamless interaction.
- [x] Show source documents for answers in the web GUI
- [x] Support streaming of answers
- Support switching the chain type and streaming LangChain output in the web GUI
Installing on Ubuntu rather than CentOS/Debian is recommended. See Issue #12.
To get started with chat-with-your-doc, follow these steps:
- Clone the repository:
git clone https://github.com/linjungz/chat-with-your-doc.git
- Change into the chat-with-your-doc directory:
cd chat-with-your-doc
- Install the required Python packages:
Create virtual environment:
python3 -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
This project supports both the OpenAI API and Azure OpenAI Service. Some environment variables are common to both APIs, while others apply to only one. The following table lists all supported environment variables:
Environment Variables | Azure OpenAI Service | OpenAI |
---|---|---|
OPENAI_API_BASE | ✅ | |
OPENAI_API_KEY | ✅ | ✅ |
OPENAI_GPT_DEPLOYMENT_NAME | ✅ | |
OPENAI_EMBEDDING_DEPLOYMENT_NAME | ✅ | ✅ |
CHAT_MODEL_NAME | | ✅ |
REQUEST_TIMEOUT | ✅ | ✅ |
VECTORDB_PATH | ✅ | ✅ |
TEMPERATURE | ✅ | ✅ |
CHUNK_SIZE | ✅ | ✅ |
CHUNK_OVERLAP | ✅ | ✅ |
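As a rough illustration of how these variables might be consumed, the sketch below reads them from a mapping and picks a backend. The helper name `load_settings`, the rule that a non-empty OPENAI_API_BASE signals Azure OpenAI Service, and every default value are assumptions made for this example; they are not taken from the project's code.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    # Assumption: a configured endpoint means Azure OpenAI Service is in use.
    use_azure = bool(env.get("OPENAI_API_BASE"))

    settings = {
        "backend": "azure" if use_azure else "openai",
        "api_key": env["OPENAI_API_KEY"],  # needed by both APIs
        # Optional values: None when the variable is not set.
        "chat_model": env.get("CHAT_MODEL_NAME"),
        "embedding_deployment": env.get("OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
        # The defaults below are illustrative guesses, not project defaults.
        "request_timeout": int(env.get("REQUEST_TIMEOUT", "30")),
        "vectordb_path": env.get("VECTORDB_PATH", "./data/vector_store"),
        "temperature": float(env.get("TEMPERATURE", "0.7")),
        "chunk_size": int(env.get("CHUNK_SIZE", "500")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "50")),
    }
    if use_azure:
        # Azure additionally needs the endpoint and the GPT deployment name.
        settings["api_base"] = env["OPENAI_API_BASE"]
        settings["gpt_deployment"] = env["OPENAI_GPT_DEPLOYMENT_NAME"]
    return settings
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to try out, for example `load_settings({"OPENAI_API_KEY": "your-key-here"})`.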
- Obtain your Azure OpenAI API key, Endpoint and Deployment Name from the Azure Portal.
- Create .env in the root dir and set the environment variables in the file:
OPENAI_API_BASE=https://your-endpoint.openai.azure.com
OPENAI_API_KEY=your-key-here
OPENAI_GPT_DEPLOYMENT_NAME=your-gpt-deployment-name
OPENAI_EMBEDDING_DEPLOYMENT_NAME=your-embedding-deployment-name
Here's where you can find the deployment names for GPT and Embedding:
- Obtain your OpenAI API key from platform.openai.com.
- Create .env in the root dir and set the environment variables in the file:
OPENAI_API_KEY=your-key-here
CHAT_MODEL_NAME="gpt-4-0314"
The following command starts the Streamlit-based application and opens the user interface in your default web browser. You can then upload a document to create a knowledge base and start a conversation with it.
$ streamlit run chat_web_st.py --server.address '0.0.0.0'
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.
You can now view your Streamlit app in your browser.
URL: http://0.0.0.0:8501
Note that the previous Web UI built with Gradio is deprecated and no longer maintained. You can find its code in the chat_web.py file.
The CLI application supports both ingest and chat commands. The Python library typer is used to build the command-line interface.
The ingest command takes documents as input, splits the text, generates embeddings, and stores them in a FAISS vector store. The vector store is saved locally for later use in chat.
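To make the splitting step more concrete, here is a minimal sketch of overlapping fixed-size chunking, the idea behind the CHUNK_SIZE and CHUNK_OVERLAP settings. It is a simplified stand-in written for illustration: the function name `split_text` is hypothetical, it counts plain characters, and the project's actual splitter may behave differently (for example by respecting separators).

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks when embeddings are generated.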
For example, to put all the PDFs in a directory into a single vector store named surface, you could run:
$ python chat_cli.py ingest --path "./data/source_documents/*.pdf" --name surface
Note that the path should be enclosed in double quotes to prevent shell expansion.
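The quoting matters because an unquoted pattern is expanded by the shell into many separate arguments before the program starts, whereas a quoted pattern arrives as one string. The sketch below assumes the CLI then expands that string itself with Python's standard glob module; this is an assumption about the implementation, and `find_documents` is a hypothetical helper name.

```python
import glob
import os
import tempfile
from typing import List


def find_documents(pattern: str) -> List[str]:
    """Expand a glob pattern inside the program (hypothetical helper)."""
    return sorted(glob.glob(pattern))


# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    matches = find_documents(os.path.join(tmp, "*.pdf"))
    names = [os.path.basename(m) for m in matches]

print(names)  # prints ['a.pdf', 'b.pdf']
```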
The chat command starts an interactive chat that uses the documents in a vector store as an external knowledge base. You can choose which knowledge base to load for the chat.
Two sample documents about Surface are provided in the data/source_documents directory and have already been ingested into the default vector store index, stored in data/vector_store. You can run the following command to start a chat with these documents:
$ python chat_cli.py chat
Or you could specify the vector store to load for chat:
$ python chat_cli.py chat --name surface
LangChain is used to quickly build a workflow that interacts with Azure GPT-4. ConversationalRetrievalChain is used in this particular use case to support chat history. You may refer to this link for more detail.
For chaintype, stuff is used by default. For more detail, please refer to this link.
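The flow described above can be sketched conceptually in plain Python. This is not LangChain code and does not call any API: `retrieve` and `llm` are placeholder callables supplied by the caller, all function names are hypothetical, and the prompt wording is invented for illustration. In a real conversational retrieval chain the follow-up question is typically rewritten into a standalone question by the model; here the history is simply prepended.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs from earlier turns


def build_stuff_prompt(question: str, docs: List[str]) -> str:
    """'Stuff' approach: place every retrieved chunk into one single prompt."""
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


def condense_question(question: str, history: History) -> str:
    """Fold earlier turns into the new question (simplified stand-in)."""
    if not history:
        return question
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return f"{turns}\nFollow-up: {question}"


def chat_turn(
    question: str,
    history: History,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Run one turn: condense, retrieve, answer, then record the exchange."""
    standalone = condense_question(question, history)
    docs = retrieve(standalone)  # e.g. a similarity search in the vector store
    answer = llm(build_stuff_prompt(standalone, docs))
    history.append((question, answer))
    return answer
```

Because every retrieved chunk goes into one prompt, the stuff approach needs only a single model call per turn, but the combined chunks must fit within the model's context window.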
- The LangChain usage is inspired by gpt4-pdf-chatbot-langchain
- The integration of LangChain streaming and Streamlit is inspired by Examples from Streamlit
- The processing of documents is inspired by OpenAIEnterpriseChatBotAndQA
chat-with-your-doc is released under the MIT License. See the LICENSE file for more details.