The ability to swiftly process and analyze vast amounts of unstructured data from conferences is a significant competitive advantage. Over the past year, we've developed sophisticated Generative AI (GenAI) pipelines that not only handle unstructured data but transform it into actionable insights with remarkable efficiency.
By the end of this post, you'll understand:
- The challenges of traditional abstract analysis
- How to build a structured information extraction framework
- Real-world applications and insights gained
- Broader applications beyond the San Antonio Breast Cancer Symposium (SABCS)
- Technical implementation strategies with code samples
Understanding the Challenge: Navigating a PDF Data Dump
This year, the San Antonio Breast Cancer Symposium (SABCS) released their abstracts as a 3,000-page PDF booklet, each containing crucial information about new treatments, clinical trial results, and technological innovations. Traditionally, organizations deploy teams of analysts who spend weeks manually reviewing these abstracts, facing several limitations:
- Overwhelming Volume: The sheer number of abstracts exceeds human processing capacity.
- Inconsistent Data Capture: Manual extraction often leads to variability in the data collected.
- Missed Insights: Time constraints can cause critical information to be overlooked.
- Limited Pattern Recognition: Identifying trends across numerous abstracts is challenging.
- Shallow Analysis: Depth of analysis is limited due to tight deadlines.

Reimagining Abstract Analysis with GenAI
The key is to augment manual analysis with GenAI tools that enhance their capabilities. Here's how we approached this challenge for SABCS 2024.
1. Building a Structured Information Extraction Framework
Stage 1: Converting Unstructured PDFs into Structured Data
Working with PDFs is notoriously challenging due to their unstructured nature. Traditional methods rely heavily on complex regular expressions to parse text, which are brittle and often fail with inconsistent formatting. To overcome this, we implemented a multi-step pipeline:
- PDF Text Extraction: Utilized advanced text extraction tools to retrieve text while preserving layout information.
- Page Classification with LLMs: Leveraged GPT-4o to classify pages as complete abstracts or continuations, enhancing accuracy.
- Abstract Boundary Detection: By analyzing page content and using LLMs for pattern recognition, we accurately identified where abstracts begin and end.
Stage 2: Structured Data Extraction Using LLMs
Once we isolated each abstract, the next step was to extract over 35 key data points from each one. We crafted meticulous prompts to guide GPT-4 in parsing complex scientific language and extracting structured data reliably.
Our comprehensive framework captures:

2. Turning Data into Actionable Insights
With structured data in hand, we quickly uncovered significant trends within the first hour of processing SABCS 2024 abstracts.
Sample Output



Trend Summarization & Pattern Recognition
Use Case 1: Trends in Antibody-Drug Conjugates (ADCs)
- Post-Progression Strategies for T-DXd: T-DXd's widespread adoption as the standard of care is driving research on optimizing post-progression strategies, focusing on sequencing and identifying biomarkers for response (SESS-1115, SESS-1219, SESS-2102).
- Payload Resistance in Sequential ADC Use: ADCs with the same payload (e.g., topoisomerase 1 inhibitors like SG and T-DXd) may face resistance in later lines, necessitating strategic sequencing and alternative payloads (SESS-1262, SESS-2163, SESS-1809, SESS-452, SESS-796).
- Sequential ADCs with Different Payloads: Sequential ADCs with distinct payloads (e.g., T-DM1 followed by T-DXd) may offer benefits (SESS-2163).
- Emerging Targets for ADCs: Novel ADCs target diverse antigens like HER3, LIV-1, and SORT1, broadening the therapeutic landscape and addressing patients who do not benefit from current ADCs (SESS-1749, SESS-3612, SESS-2069, SESS-630, SESS-1509).
- Bispecific and Multi-Payload ADCs: Bispecific and multi-payload ADCs are being developed to overcome resistance and enhance efficacy, targeting multiple pathways simultaneously (SESS-630, SESS-3669).
- Role of Biomarkers and Tumor Microenvironment: Optimizing ADC use requires better understanding of biomarkers and the tumor microenvironment, leveraging spatial analysis and advanced imaging to improve efficacy and address resistance (SESS-1077, SESS-1547, SESS-2524, SESS-1817, SESS-2476, SESS-1810).
Use Case 2: Overview of Abstracts mentioning Natera
- Role of ctDNA in Breast Cancer Management: ctDNA studies highlight its role in monitoring therapeutic response, disease progression, and recurrence risk, with ctDNA positivity linked to higher anxiety, pain, and progression risks, while improving disease management perceptions (P3-01-22, P5-12-19, P4-03-29, P2-03-21).
- Advancements in Genomic Surveillance: Integration of ctDNA testing with WES reveals actionable mutations like PIK3CA and their association with disease-free survival, advancing early-stage HR+ breast cancer surveillance and treatment stratification (PS9-01).
Use Case 3: Real-World Data Outcomes on Ribociclib
- Higher Real-World Eligibility: Ribociclib's broader eligibility criteria (NATALEE trial) result in a larger proportion of HR+/HER2- breast cancer patients qualifying for treatment compared to abemaciclib (41.3% vs. 17.5%).
- Improved Disease-Free Survival: Ribociclib-eligible patients demonstrate better 5-year disease-free survival (86%) compared to abemaciclib-eligible patients (77%) in real-world settings, with outcomes favoring ribociclib's use in earlier-stage disease.
- Comparable Effectiveness in Elderly Patients: While ribociclib shows significant benefits in younger populations, its progression-free survival (PFS) in elderly patients is comparable to other CDK4/6 inhibitors, highlighting its potential as a well-tolerated option across age groups.
Use Case 4: Trends and Patterns in Spatial Technologies for Breast Cancer Research
- Integration of Spatial Profiling and Omics: Spatial transcriptomics and profiling are increasingly used to study tumor-immune-stromal interactions, revealing mechanisms of therapy response and recurrence (P2-02-19, PS18-04, P1-04-26, PS13-05).
- Tumor Microenvironment and Therapy Resistance: ECM alterations, immune suppression, and macrophage dynamics within the tumor microenvironment drive resistance across breast cancer subtypes (P2-06-05, P2-06-06, PS18-04).
- Personalized Therapeutic Strategies: Spatial analyses uncover subtype-specific features and mechanisms of resistance (e.g., ILC, IBC), guiding more precise and effective treatments (PS18-04, P1-04-26, PS13-05).
- Advances in Predictive Technologies: Novel platforms like mSIGHT, 3D spheroid models, and immune heatmaps enhance the ability to model and predict outcomes based on TME characteristics (P2-02-19, P2-06-05, P1-04-26).
- Standardization for Clinical Adoption: Simplified spatial tools, such as immune cell heatmaps and virtual multiplex imaging, are paving the way for scalable applications in oncology (P1-04-26, P2-02-19).
Optional Stage 3: Sales and Account Mapping Applications
Although the SABCS booklet lacks organization-level details, we have developed pipelines that can link each author to their LinkedIn, Research Organization Registry (ROR), or publication profiles. This makes it an efficient tool for meeting planning, targeted outreach or KOL mapping
Broader Applications Beyond SABCS
While our focus here is on SABCS, the implications of this approach are far-reaching. Imagine applying this methodology to:
- Clinical Trial Monitoring: Stay ahead by tracking global trial developments.
- Patent Analysis: Uncover innovation trends and potential IP opportunities.
- Regulatory Submissions: Streamline the review of complex regulatory documents.
- Competitive Intelligence: Gather and analyze data on market movements and competitor strategies.
- Literature Reviews: Accelerate research by quickly synthesizing large volumes of scientific publications.
Meet us at JPM 2025
The integration of GenAI into scientific analysis is not just a technological advancement—it's a strategic imperative. We're excited to explore how these innovations can drive value for your organization. I'll be attending the JP Morgan Healthcare Conference in January 2025 and would welcome the opportunity to connect.
If you're interested in:
- Tailoring GenAI solutions to your needs
- Identifying custom use cases aligned with your goals
- Exploring collaborative partnerships
- Seamlessly integrating new tools into your workflows
Let's schedule a meeting at JPM 2025, feel free to reach out directly at madan@decibio.com.
I look forward to connecting with you!
Appendix: Technical Implementation Strategies
In this appendix, we delve into the technical strategies employed to build our structured information extraction framework for the San Antonio Breast Cancer Symposium (SABCS) abstracts. This approach leverages advanced technologies to efficiently process unstructured data, transforming it into actionable insights.
Overview
Processing large volumes of unstructured data like PDFs presents significant challenges due to inconsistent formatting and complex content structures. Traditional methods relying on regular expressions (regex) are often brittle and time-consuming. To overcome these obstacles, we adopted a novel approach that leverages Large Language Models (LLMs) and modern orchestration tools.
Key Technologies and Tools
- LlamaParse: A document parsing service from LlamaIndex that converts complex documents into structured data. It handles PDFs, PowerPoints, Word documents, and spreadsheets, providing a clean JSON output.
- LLMs (e.g., GPT-4o): Used to intelligently detect abstract boundaries and extract key information without relying on rigid regex patterns.
- LiteLLM: A Python SDK that allows us to interface with over 100 LLMs using a unified API. It provides consistent input/output formats and simplifies integration with various LLM providers.
- Prefect: A workflow orchestration tool that manages data pipelines, ensuring scalability and reliability.
- PostgreSQL (Cloud-Hosted): Used as our database for storing and retrieving parsed abstracts, providing robust and scalable data storage.
Processing Pipeline
1. Converting Unstructured PDFs into Structured Data
Text Extraction with LlamaParse
We started by using LlamaParse to extract text from the SABCS PDF booklet. LlamaParse simplifies the extraction process by handling the complexities of PDF formats and providing a structured JSON output.
curl -X 'POST' \ 'https://api.cloud.llamaindex.ai/api/parsing/upload' \ -H 'accept: application/json' \ -H 'Content-Type: multipart/form-data' \ -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \ -F 'file=@/path/to/your/file.pdf;type=application/pdf'
This command uploads the PDF file to LlamaParse and retrieves the parsed content in JSON format, preserving the document's structure.
Intelligent Abstract Boundary Detection with LLMs
Instead of using regex to find patterns and abstract boundaries—a method prone to errors due to inconsistent formatting—we leveraged LLMs to analyze the text and identify where each abstract begins and ends.
We crafted prompts that instructed the LLM to recognize specific patterns at the start of abstracts, such as:
- Starts with 1-3 uppercase letters
- Followed by one or more numbers
- Contains hyphens separating numbers
- Ends with a colon
- Positioned at the very beginning of the text
Example Patterns:
- GS1-01:
- P1-01-15:
- OT3-01:
By using LLMs for boundary detection, we improved accuracy and saved significant time compared to manual regex pattern crafting.
2. Structured Data Extraction Using LLMs
After isolating each abstract, we employed LLMs to extract over 35 key data fields. We designed detailed prompts to guide the model in parsing complex scientific language and outputting the information in a structured JSON format.
To interact with various LLMs seamlessly, we used LiteLLM, a Python SDK that allows us to call over 100 LLMs using a consistent interface. LiteLLM provides:
- Unified API: Standardizes requests across different LLM providers.
- Consistent Output: Ensures text responses are always available at ['choices'][0]['message']['content'].
- Retry and Fallback Logic: Automatically handles retries and load balancing across multiple deployments.
- Cost Tracking: Monitors spend and sets budgets per project.
By leveraging LiteLLM, we could easily switch between different LLM providers, such as OpenAI, Anthropic, or HuggingFace models, depending on our requirements.
Prompt Engineering
This is the special sauce of the pipeline that requires the most iteration to ensure that all relevant fields are extracted in the correct format and with the right specifications. We also ensure proper versioning and check that any edits do not have unintended consequences.
Key Data Fields Extracted:
- Title
- Abstract Number
- Authors
- Funding Source
- Study Phase
- Sample Size
- Breast Cancer Subtype
- Primary Endpoint
- Key Findings
- Drug Names
- Biomarkers Evaluated
- Technologies Mentioned
- (and over 20 additional fields)
Extraction Guidelines:
- Use standardized terminology.
- Note specific platforms or instruments.
- Include performance metrics where available.
- Highlight novel applications and competitive advantages.
By providing the LLM with comprehensive guidelines and examples, we ensured consistency and reliability in the extracted data.
3. Orchestrating the Workflow with Prefect
To manage the data processing tasks efficiently, we utilized Prefect for workflow orchestration. Prefect allowed us to build scalable and maintainable data pipelines.
Workflow Overview:
- Task Definition: Each processing step, such as text extraction or data analysis, is defined as a task.
- Flow Creation: Tasks are connected in a flow that outlines the sequence of operations.
- Execution: The flow is executed, with Prefect handling retries, error handling, and task dependencies.
Example Code:

Explanation:
- Task Definition: The analyze_abstract function is a Prefect task that processes each abstract.
- LLM Interaction: Uses LiteLLM to interact with the LLM, passing in the compiled prompt and receiving the structured data.
- Data Storage: Updates the record in the PostgreSQL database with the extracted information.
4. Data Storage and Access with PostgreSQL
Parsed data is stored in a cloud-hosted PostgreSQL database, which offers a robust and scalable solution for data storage. This setup enables:
- Efficient Data Retrieval: Quick access to structured data for analysis.
- Integration: Easy integration with analytics tools and downstream applications.
- Scalability: Handling large volumes of data without performance degradation.
Benefits of This Approach
- Efficiency Gains: Significantly reduces the time and effort required to process large datasets.
- Improved Accuracy: LLMs handle inconsistencies in formatting better than regex, reducing parsing errors.
- Scalability: The use of Prefect and cloud-hosted PostgreSQL ensures that the system can handle increasing amounts of data.
- Flexibility: Using LiteLLM allows us to switch between different LLM providers and models easily.
- Actionable Insights: Rapid extraction and analysis of data enable quicker decision-making and trend identification.
Conclusion
By leveraging advanced tools like LlamaParse and integrating LLMs into our processing pipeline, we've transformed the way we handle unstructured data from scientific conferences like SABCS. This framework not only accelerates data processing but also enhances the depth and quality of insights derived. Our approach demonstrates the potential of combining AI technologies with thoughtful workflow design to tackle complex data challenges.
Meet us at JPM 2025
The integration of GenAI into scientific analysis is not just a technological advancement—it's a strategic imperative. We're excited to explore how these innovations can drive value for your organization. I'll be attending the JP Morgan Healthcare Conference in January 2025 and would welcome the opportunity to connect.
If you're interested in:
- Tailoring GenAI solutions to your needs
- Identifying custom use cases aligned with your goals
- Exploring collaborative partnerships
- Seamlessly integrating new tools into your workflows
Let's schedule a meeting at JPM 2025, feel free to reach out directly at madan@decibio.com.
I look forward to connecting with you!