How anyone can use AI to create novel datasets
An example of classifying text data from the Fed's FOMC announcements
Autonomous Econ is a newsletter that empowers economists and policy analysts by equipping them with Python and data science skills. The content is designed to boost their productivity through automation and transform them into savvier analysts.
Upcoming posts will roughly be split between practical guides (80 percent) and data journalism pieces where I demonstrate the tools (20 percent).
If you find the content valuable, please consider becoming a free subscriber.
Recent advancements in large language models (LLMs), such as ChatGPT, have simplified the process of analyzing and creating datasets from text-based sources. For instance, they can facilitate economic sentiment analysis from news articles or social media platforms.
In this post, I'll guide you through using a language model to extract information from URL links and generate a structured CSV file for further analysis or as input to other models. I'll demonstrate this by instructing an LLM to extract data from Federal Open Market Committee (FOMC) statements on the Federal Reserve website and classify each statement on a numerical scale from very dovish (-1) to very hawkish (1). Anyone can replicate this demo in an internet browser using Google Colab with my concise template notebook. Use it as a starting point for creating your own unique datasets.
Introducing LangChain
Last year, I was exploring ways to create a personal recipe database that I could integrate with an AI assistant to streamline my shopping and cooking. As a food nerd, I was looking for structured information on ingredients, serving sizes, and macronutrients for all my bookmarked recipes. Manually collecting this crucial information from each website would have been tedious.
I knew ChatGPT could read these links and extract the information I needed, but I wanted to save this data in a structured file, which wasn't possible through the ChatGPT User Interface (UI). That's when I discovered a tutorial by DeepLearning.AI. It showed how one can use LLMs to analyze and parse information in a structured manner before saving it into a file format like CSV.
This is achieved using a Python library named LangChain, which serves as a bridge between LLMs like ChatGPT and the rest of the Python ecosystem. It lets you automate repetitive queries within specific contexts, so you can create custom GPTs similar to OpenAI's built-in GPT creator, but with even greater versatility thanks to its access to the wider universe of Python libraries.
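To make this concrete, here is a minimal sketch of the LangChain pattern used throughout this post: fill a prompt template, send it to a chat model, and read the reply. The recipe prompt is a placeholder of my own, and the imports assume the langchain package with OpenAI support is installed and an API key is set.

# Minimal sketch of the LangChain workflow: prompt template -> chat model -> reply
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

chat = ChatOpenAI(temperature=0.0)  # requires the OPENAI_API_KEY environment variable
prompt = ChatPromptTemplate.from_template(
    "List the ingredients and serving size in this recipe: {recipe_text}"
)
messages = prompt.format_messages(recipe_text="<recipe text copied from a website>")
response = chat(messages)
print(response.content)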
Using LLMs to interpret 'Fed Speak'
After a successful experiment with the recipe dataset, my mind turned to other use cases, especially those in economics. I learned that organizations had been experimenting with LLMs to analyze the vast amount of text data produced by the meetings of the Federal Open Market Committee (FOMC).
Research by Fed economists demonstrated that language models can effectively classify the tone of FOMC statements as either dovish or hawkish and explain the reasoning behind their classifications. They can also identify exogenous monetary policy shocks using the narrative approach. Moreover, it was shown that GPT models, like GPT-3 and GPT-4, outperformed all previous natural language models in these tasks.
This has a clear implication: it eliminates the need for analysts to painstakingly sift through text and make such classifications and judgments manually. Economists are now able to build rich datasets of monetary policy with very little effort.
A simple demo with FOMC announcements
Now, I want to provide a blueprint of how you could leverage LLMs in a similar way to build a dataset like the one shown at the beginning of the post. You could follow this to collect similar data for another central bank, or for a completely different use case like fiscal policy or sentiment analysis. It's easier than you might think, and you can follow the steps in this Google Colab notebook [1].
The only prerequisite is that you create an account with OpenAI and generate an API key, which is set as an environment variable at the start of the notebook.
# Set the OpenAI API key as an environment variable
import os

os.environ["OPENAI_API_KEY"] = "<YOUR API KEY>"
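If you plan to share your notebook, it's safer not to paste the key directly into the code. A simple alternative (my suggestion, not part of the original template) is to prompt for it at runtime using Python's standard library:

# Optional: prompt for the API key at runtime instead of hard-coding it
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")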
We then choose the OpenAI model we would like to use. There are a variety of models to choose from, and you can find an overview of them here. I won't delve into too much detail on model choice, but the key considerations are the size of the context window (how much text the model can handle at once) and the complexity of your task [2]. I found that GPT-3.5 with a 16k context window was sufficient for demo purposes, as I am only analyzing the short FOMC announcements rather than the detailed minutes of each meeting.
# Set the model and model parameters
from langchain.chat_models import ChatOpenAI

model = "gpt-3.5-turbo-16k"
# temperature=0.0 makes the output as deterministic as possible
chat = ChatOpenAI(temperature=0.0, model_name=model)
Now that the environment is set up, we can enter all the URL links of the FOMC statements we would like to extract data from. Here, we use WebBaseLoader() to load the text from each link into the `docs` object.
# Load multiple web links into a list of document objects
from langchain.document_loaders import WebBaseLoader

urls = [
    "https://www.federalreserve.gov/newsevents/pressreleases/monetary20240320a.htm",
    "https://www.federalreserve.gov/newsevents/pressreleases/monetary20230201a.htm",
    "https://www.federalreserve.gov/newsevents/pressreleases/monetary20220504a.htm",
    "https://www.federalreserve.gov/newsevents/pressreleases/monetary20220316a.htm",
    "https://www.federalreserve.gov/newsevents/pressreleases/monetary20200315a.htm",
]
loader = WebBaseLoader(web_path=urls)
docs = loader.load()
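Before moving on, it's worth a quick sanity check that each page loaded properly. A minimal check using the `docs` object created above:

# Quick sanity check: one document per URL, with the statement text inside
print(len(docs))  # should equal the number of URLs
print(docs[0].page_content[:500])  # preview the start of the first statement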
We need a way to instruct the language model on how to organize and structure the output it generates. For each piece of information we want to collect, a ResponseSchema() defines the field's name and a description of what it should contain.
# Create the schemas that will convert the LLM response into a JSON format
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

date_schema = ResponseSchema(
    name="fomc_date", description="date of fomc announcement"
)
target_range_schema = ResponseSchema(
    name="fed_funds_target_range", description="target range for the federal funds rate"
)
decision_schema = ResponseSchema(
    name="rate_decision", description="decision for the federal funds rate"
)
policy_stance_schema = ResponseSchema(
    name="policy_stance",
    description="policy stance of statement ranging from -1 (very dovish) to 1 (very hawkish)",
)
This list of response_schemas is then used to generate format instructions, telling the language model to return each item in JSON, a convenient format that can be consumed by a wide range of programs and Python modules.
response_schemas = [
    date_schema,
    target_range_schema,
    decision_schema,
    policy_stance_schema,
]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
    "fomc_date": string  // date of fomc announcement
    "fed_funds_target_range": string  // target range for the federal funds rate
    "rate_decision": string  // decision for the federal funds rate
    "policy_stance": string  // policy stance of statement ranging from -1 (very dovish) to 1 (very hawkish)
}
```
Next, we create a prompt template that includes the text from the FOMC statement and the instructions on how the output should be formatted. Think of the prompt template as the final blurb given to the language model, where we only change the FOMC statement each time.
The prompt template should contain specific instructions and context on how the language model should interpret and extract the information. For example, we need to describe our hawkish/dovish scale so the model can accurately classify the tone.
# Create the extraction instructions for the prompt template
extract_template = """\
For the following text, extract the following information:

date_schema: date of fomc announcement. Use the following format: dd/mm/YYYY.

target_range_schema: the target range for the federal funds rate that the Committee decided on. Record the range as values with 2 decimal places.

decision_schema: decision by the FOMC for the federal funds rate. Classify as either 'raise', 'maintain' or 'lower'.

policy_stance_schema: classify the text with the following values depending on how dovish or hawkish the overall message was in the FOMC statement.
-1: Strongly expresses a belief that the economy may be growing too slowly and/or inflation is too low and may need stimulus through monetary policy.
-0.5: Overall message expresses a belief that the economy may be growing too slowly and/or inflation is too low and may need stimulus through monetary policy.
0: Expresses neither a hawkish nor dovish view and the Fed is on track to achieve its employment and inflation goals.
0.5: Overall message expresses a belief that the economy is growing too quickly and/or inflation is too high and may need to be slowed down through monetary policy.
1: Strongly expresses a belief that the economy is growing too quickly and/or inflation is too high and may need to be slowed down through monetary policy.

text: {text}

{format_instructions}
"""

# Build the prompt template; only the FOMC statement text changes on each call
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(template=extract_template)
Finally, the fun part: we use the language model to process every item in our document object and store the extracted information.
# Generate a response for each document, parse it, and save it into a list
output_list = []
for doc in docs:
    messages = prompt.format_messages(
        text=doc.page_content,  # the statement text extracted from the web page
        format_instructions=format_instructions,
    )
    response = chat(messages)
    parsed_response = output_parser.parse(response.content)
    output_list.append(parsed_response)
We can then convert the list of parsed JSON records into a pandas DataFrame and save it to a CSV file.
# Convert the list of parsed records (a list of dicts) into a pandas DataFrame
import pandas as pd

df = pd.DataFrame(output_list)

# Save the DataFrame to a CSV file
df.to_csv("fomc_sample.csv", index=False)
In this simple demo, you can see that the model does a sufficient job of identifying the policy stance for each statement. It correctly captures the extremely dovish tone at the onset of the COVID-19 pandemic in 2020 and the hawkish stance over 2022 and 2023 as inflation took off. Of course, to validate this approach on a larger scale, one would need to manually classify a subset of the statements to check how well the model performs. The prompt instructions could then be refined to improve performance, or you could switch to a different model altogether.
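As a rough sketch of what that validation could look like, assume you record your own classifications for the same statements in a hypothetical manual_labels.csv with a policy_stance column:

# Hypothetical spot-check of model labels against manual classifications
import pandas as pd

model_df = pd.read_csv("fomc_sample.csv")
manual_df = pd.read_csv("manual_labels.csv")  # hypothetical hand-labelled file

# Share of statements where the model's stance matches the manual label exactly
agreement = (model_df["policy_stance"] == manual_df["policy_stance"]).mean()
print(f"Exact agreement with manual labels: {agreement:.0%}")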
What unique dataset will you create?
Recent models, with their extensive pre-loaded knowledge and nuanced understanding, eliminate the need for task-specific training or for building complex rules and keyword lists. Utilizing LLMs, as outlined in my approach, opens numerous opportunities for time-constrained analysts to generate novel datasets from text. This technique is not limited to text from URL links; it can also be applied to plain text and PDF files using different loaders from LangChain, as sketched below.
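For example, swapping in a PDF loader is only a one-line change. A minimal sketch, assuming a local file named fomc_minutes.pdf and the pypdf package installed:

# Sketch: the same pipeline starting from a PDF instead of a web page
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("fomc_minutes.pdf")  # hypothetical local file
docs = loader.load()  # one document per page; the rest of the pipeline is unchanged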
Personally, I'm keen to see language models applied to the analysis of text-rich fiscal policy statements, and I'm convinced there are many other potential applications. If you have ideas for other useful applications, feel free to share them in the comments below.
[1] You can directly open the notebook within Google Colab by going to 'File' -> 'Open notebook' and entering the notebook repository URL into the search field of the 'GitHub' tab. Otherwise, download the notebook directly from the GitHub site and manually upload it into Google Colab.
[2] Note that performing queries with a language model costs money, with the cost varying by the length of the input and output text as well as the complexity of the model. For small prototypes, the cost is relatively low, and OpenAI even provides some free credit up front. However, if you are performing queries at scale, you should consider using an enterprise account.