At CommieCapital, I developed a multi-agent, multi-step large language model (LLM) pipeline that converts financial news reports into quantifiable data. The pipeline's primary objective was to generate a 0-9 rating by answering specific questions about each report. These ratings were then correlated with the corresponding stock's price movements to identify patterns and trends. The ultimate goal was to uncover the factors most likely to influence a stock's price based on the historical data presented in the reports.
RAG: First, each report is split into smaller chunks, which are converted into vector embeddings and stored in a PostgreSQL database. Using LlamaIndex, we retrieve the chunks most relevant to the query and return them in descending order of relevance.
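The chunk-embed-retrieve loop can be sketched in plain Python. This is a toy illustration, not the production code: a hash-based bag-of-words embedding stands in for the real embedding model, and in-memory cosine similarity stands in for LlamaIndex's PostgreSQL vector search.

```python
import math

def chunk_report(text: str, chunk_size: int = 40) -> list[str]:
    """Split a report into fixed-size word chunks (toy stand-in for a
    sentence-aware splitter)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def embed(text: str, dim: int = 16) -> list[float]:
    """Toy bag-of-words hash embedding; production uses a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k chunks in descending order of similarity to the query."""
    q = embed(query)
    scored = [(c, cosine(q, embed(c))) for c in chunks]
    return sorted(scored, key=lambda cs: cs[1], reverse=True)[:top_k]
```

The real pipeline swaps `embed` for the embedding API and `retrieve` for a LlamaIndex retriever backed by the PostgreSQL vector store, but the ranking contract is the same: most relevant chunks first.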
Rerank & Sort: We use Gemini Flash to score the relevance of each retrieved chunk. The chunks are sorted by relevance, and those with low scores are filtered out. This redundancy measure ensures the returned information is highly relevant while eliminating unnecessary data, reducing token costs.
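The sort-and-filter logic of this step is simple enough to show directly. In the sketch below, `judge` is an injectable callable standing in for the Gemini Flash relevance call, and the `min_score` threshold is illustrative, not the production value.

```python
def rerank_and_filter(chunks, judge, min_score=3):
    """Score each chunk with an LLM judge (0-9), sort descending,
    and drop low-relevance chunks before they reach the reasoning step."""
    scored = [(chunk, judge(chunk)) for chunk in chunks]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return [(chunk, score) for chunk, score in scored if score >= min_score]
```

Because filtered-out chunks never reach the downstream GPT-4o calls, every chunk dropped here saves tokens on the most expensive part of the pipeline.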
Reasoning: In this step, we use GPT-4o to assess the potential impact of the retrieved information based on the query. Additional context, such as the company's financial data, is provided to enhance the model's reasoning performance. To ensure consistency and accuracy, we use multi-shot prompting and OpenAI's structured outputs.
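The request for this step combines the multi-shot examples, the retrieved chunks, the financial context, and a structured-output schema. A rough sketch of how such a request could be assembled is below; the helper name, prompt wording, and schema fields are all illustrative, while the `response_format` follows OpenAI's JSON-schema structured-outputs format.

```python
def build_reasoning_request(query, chunks, financials, examples):
    """Assemble a GPT-4o chat request: multi-shot examples, retrieved
    report excerpts, financial context, and a strict JSON schema."""
    messages = [{"role": "system",
                 "content": "You are a financial analyst. Assess the impact "
                            "of the report excerpts on the question asked."}]
    for example_input, example_output in examples:  # multi-shot examples
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    excerpts = "\n\n".join(chunks)
    messages.append({"role": "user",
                     "content": f"Question: {query}\n\nReport excerpts:\n"
                                f"{excerpts}\n\nFinancials:\n{financials}"})
    response_format = {
        "type": "json_schema",
        "json_schema": {
            "name": "reasoning",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "expected_impact": {"type": "string",
                                        "enum": ["positive", "negative", "neutral"]},
                },
                "required": ["reasoning", "expected_impact"],
                "additionalProperties": False,
            },
        },
    }
    return {"model": "gpt-4o", "messages": messages,
            "response_format": response_format}
```

The returned dict can be passed to `client.chat.completions.create(**request)`; the strict schema guarantees the reply parses into the same fields every time, which keeps downstream steps simple.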
Relevance check: After generating reasoning, we perform a double-check using two models—GPT-4o-mini and Gemini Flash—to verify the reasoning's relevance to the query. If both models determine the reasoning is irrelevant, we skip the rating step and assign a score of 0. This approach saves token costs and prevents irrelevant information from distorting the output. Multi-shot prompting is employed to further enhance model accuracy.
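The gating rule above reduces to a small piece of control flow. In this sketch the two judge callables stand in for the GPT-4o-mini and Gemini Flash relevance calls, and `rate` stands in for the rating step that follows.

```python
def relevance_gate(reasoning, query, judges):
    """True if at least one judge finds the reasoning relevant;
    only a unanimous 'irrelevant' verdict fails the gate."""
    return any(judge(reasoning, query) for judge in judges)

def rate_or_skip(reasoning, query, judges, rate):
    """Skip the rating step (score 0) when both judges flag irrelevance."""
    if not relevance_gate(reasoning, query, judges):
        return 0
    return rate(reasoning)
```

Requiring both judges to agree before skipping keeps a single model's false negative from silently zeroing out relevant information.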
Rating: Three separate LLMs generate a rating on a 0-9 Likert scale based on the provided criteria. The scores from these models are averaged to produce a final rating.
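The fan-out-and-average pattern looks like this, with the three rater callables standing in for the three LLM agents:

```python
def ensemble_rating(raters, reasoning):
    """Collect one 0-9 rating per model and average them into a final score."""
    ratings = [rate(reasoning) for rate in raters]
    return ratings, sum(ratings) / len(ratings)
```

Keeping the individual ratings alongside the average is what makes the discrepancy check in the next step possible.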
Evaluation: As a final redundancy measure, we check for significant discrepancies among the scores generated in the rating step. If discrepancies are detected, the process is rerun using larger models (o1 and o1-mini) to produce a more reliable final rating.
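Putting the rating and escalation steps together gives control flow like the following. The rater callables stand in for the base and large (o1 / o1-mini) model calls, and the `max_spread` threshold of 3 points is an assumed value for illustration, not the production setting.

```python
def rating_with_fallback(reasoning, base_raters, large_raters, max_spread=3):
    """Average the base-model ratings; if they disagree by more than
    max_spread points, rerun with the larger models and use their average."""
    ratings = [rate(reasoning) for rate in base_raters]
    if max(ratings) - min(ratings) > max_spread:  # significant discrepancy
        ratings = [rate(reasoning) for rate in large_raters]
    return sum(ratings) / len(ratings)
```

Escalating only on disagreement keeps the expensive large-model calls rare: most reports are settled by the cheaper ensemble, and the reasoning models are reserved for the genuinely ambiguous cases.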
I manually evaluated over 50 reports (>1,200 rows), assigning ratings based on the severity of the information in each report using the same Likert scale. In most cases, the model's ratings closely aligned with my own, with only minor discrepancies observed occasionally. Overall, the correlation between my ratings and the model's was strong (>0.95). Given the inherent subjectivity in some of the generated reasoning, this performance exceeded our expectations.
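The agreement metric here is a standard Pearson correlation between the two rating series, which can be computed directly (the sample values below are made up for illustration):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between human and model rating series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy example: hypothetical human vs. model ratings for five rows.
human = [2, 7, 5, 9, 0]
model = [3, 7, 4, 9, 1]
agreement = pearson(human, model)
```

In practice the same number comes out of `scipy.stats.pearsonr` or `numpy.corrcoef`; the point is simply that agreement was measured on paired per-row ratings rather than eyeballed.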
Through statistical analysis of stock price data against the generated ratings, we identified key factors that are likely to influence stock price movements based on the information presented in the reports.
Tools/Frameworks I used: Python, LlamaIndex, GPT API, Gemini API