Datasets
GSM8K (Grade School Math 8K) contains 8.5K human-created grade school math problems (7.5K training problems and 1K test problems).
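As a rough sketch of how a GSM8K-style record is typically handled: each solution is a worked explanation that ends with a line of the form `#### <final answer>`, and intermediate arithmetic may carry calculator annotations such as `<<4*12=48>>`. The record below is made up for illustration (it is not taken from the dataset), and `extract_final_answer` is a hypothetical helper name.

```python
import re

# Hypothetical record in the GSM8K style: worked solution, then "#### <answer>".
# The problem text is invented for illustration, not copied from the dataset.
sample = {
    "question": "A box holds 12 pencils. How many pencils are in 4 boxes?",
    "answer": (
        "Each box holds 12 pencils, so 4 boxes hold "
        "4 * 12 = <<4*12=48>>48 pencils.\n#### 48"
    ),
}


def extract_final_answer(solution: str) -> str:
    """Return the text after the '####' marker on the final line, without commas."""
    match = re.search(r"####\s*(.+?)\s*$", solution)
    if match is None:
        raise ValueError("no '####' final-answer marker found")
    # Remove thousands separators so "1,234" compares equal to "1234".
    return match.group(1).replace(",", "")


final_answer = extract_final_answer(sample["answer"])
print(final_answer)
```

Comparing the extracted string with the model's final number is a common way to score answers, although the exact normalization is a design choice.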
Clever way of generating dataset
Structured LLM Output format
direct prompting
avoid manually copying and pasting code into separate files
Use the following prompt to avoid having to manually copy and paste LLM-generated code into separate files:
Please create a single code block containing `cat << EOF` statements that I can copy/paste to create all those files
(Credit: Jeremy Howard)
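To make the shape of the expected reply concrete, here is a minimal sketch that renders a set of files as `cat << 'EOF' > path` statements. It only illustrates what the prompt asks the model to produce. The helper name `to_heredoc_script` and the two example files are hypothetical, and quoting the delimiter as `'EOF'` is our assumption so that the shell does not expand `$` inside the file bodies.

```python
def to_heredoc_script(files: dict[str, str]) -> str:
    """Render {path: content} as a series of `cat << 'EOF' > path` statements."""
    parts = []
    for path, content in files.items():
        # Quoting the delimiter ('EOF') stops the shell from expanding $VARS in the body.
        parts.append(f"cat << 'EOF' > {path}\n{content.rstrip()}\nEOF")
    return "\n\n".join(parts) + "\n"


# Hypothetical two-file project, used only to show the output shape.
example_files = {
    "app.py": "from utils import greet\n\nprint(greet('world'))\n",
    "utils.py": "def greet(name):\n    return f'Hello, {name}!'\n",
}

script = to_heredoc_script(example_files)
print(script)
```

Pasting the printed text into a shell would create both files in the current directory. The sketch assumes that no file contains a line consisting only of `EOF` and that the paths contain no spaces.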
get clean markdown format
Use the following to get clean markdown output:
Put all your output in a `markdown` block
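When the reply comes back through an API rather than a chat window, the fenced block still has to be pulled out of any surrounding chatter. A minimal sketch, assuming the model wraps its output in a fence labelled `markdown` (or `md`); `extract_markdown_block` and the sample reply are hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly


def extract_markdown_block(reply: str) -> str:
    """Return the body of the first markdown fenced block, or the whole reply if none."""
    pattern = re.escape(FENCE) + r"(?:markdown|md)[ \t]*\n(.*?)\n" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    return match.group(1) if match else reply.strip()


# Hypothetical model reply with chatter around the fenced block.
reply = (
    f"Sure, here it is:\n{FENCE}markdown\n"
    "# Title\n\n- item one\n- item two\n"
    f"{FENCE}\nHope that helps!"
)
clean = extract_markdown_block(reply)
print(clean)
```

The non-greedy match stops at the first closing fence, so a reply whose markdown itself contains nested code fences would be cut short. That case needs a proper parser.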
libraries
annotation
Misc issues during LLM workflow
Bearer error
When working with LLM frameworks such as LlamaIndex along with Streamlit, we might encounter the following error:
LocalProtocolError: Illegal header value b'Bearer '
This often happens when the OpenAI API key is not found, so the Authorization header is sent as "Bearer " with an empty token. If the key is in the .env file, we can do:
import os
from dotenv import load_dotenv
load_dotenv()  # read variables from the .env file into the environment
assert os.getenv("OPENAI_API_KEY") is not None, "Please set the OPENAI_API_KEY environment variable"
Note that this only works in a local environment, where the .env file exists. If a remote GitHub CI workflow is involved, we need to store the key as a GitHub secret and expose it to the workflow in GitHub Actions. We should also make sure that the workflow does not depend on local files (e.g., test data) unless they are made available to it.
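A check that behaves the same locally and in CI can read only from the process environment, regardless of how the variable got there. The sketch below is one way to fail early with a clearer message than the header error. `require_api_key` and `authorization_header` are hypothetical helper names, and the key value is a placeholder.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, failing early if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        # An empty key would produce the header value "Bearer " seen in the error above.
        raise RuntimeError(
            f"{name} is not set. Locally, add it to .env and call load_dotenv(); "
            "in CI, expose it to the job as an environment variable."
        )
    return value


def authorization_header(api_key: str) -> dict[str, str]:
    """Build the header in the usual bearer-token form."""
    return {"Authorization": f"Bearer {api_key}"}


# Demonstration with a placeholder value (not a real key).
os.environ["OPENAI_API_KEY"] = "sk-placeholder"
headers = authorization_header(require_api_key())
print(headers)
```

Unlike `assert`, raising an explicit exception still fires when Python runs with `-O`, which strips assertions.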
Large Context Window LLMs
Resources
Hands-on Deep Learning and LLM fundamentals
The BEST & FREE courses:
- Neural Networks: Zero to Hero by Andrej Karpathy
- Practical Deep Learning for Coders Part 1 and Part 2 by Jeremy Howard
- Build a Large Language Model (From Scratch) by Sebastian Raschka (Book)
Data processing tools
- URL to LLM-friendly input by Jina.ai
- Webpage to Markdown