API reference
tasks
cross_cutting_themes
cross_cutting_themes(questions_themes: dict[int, DataFrame], llm: RunnableWithFallbacks, n_concepts: int = 5, min_themes: int = 5, config: RunnableConfig | None = None) -> tuple[pd.DataFrame, pd.DataFrame]
Identify cross-cutting themes using a single-pass agent approach.
This function analyzes refined themes from multiple questions to identify semantic patterns that span across different questions, creating cross-cutting theme categories that represent common concerns or policy areas.
The analysis uses a single-pass process:

1. Identify high-level cross-cutting themes across all questions.
2. Map individual themes to the identified cross-cutting themes.
3. Refine descriptions based on the assigned themes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `questions_themes` | `dict[int, DataFrame]` | Dictionary mapping question numbers to their refined themes DataFrames. Each DataFrame should have the columns `topic_id` (theme identifier, e.g. 'A', 'B', 'C') and `topic` (string in the format "topic_name: topic_description"). | required |
| `llm` | `RunnableWithFallbacks` | Language model instance configured for structured output. | required |
| `n_concepts` | `int` | The target number of cross-cutting themes to generate. | `5` |
| `min_themes` | `int` | Minimum number of themes required for a valid cross-cutting theme group. Groups with fewer themes are discarded. Defaults to 5. | `5` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple containing: (1) a DataFrame of cross-cutting themes with the columns `name` (name of the cross-cutting theme), `description` (description of what this theme represents) and `themes` (dictionary mapping question number to a list of theme keys, e.g. `{1: ["A", "B"], 3: ["C"]}`); (2) an empty DataFrame, returned for consistency with other core functions. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `questions_themes` is empty or contains invalid data. |
| `KeyError` | If required columns are missing from the themes DataFrames. |
Source code in src/themefinder/tasks.py
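The shape of the `themes` column and the effect of `min_themes` can be illustrated with plain Python. This is a minimal sketch, not themefinder's implementation: `split_topic`, `count_assigned` and `drop_small_groups` are hypothetical helpers, the group data is invented, and the sketch assumes a group's size is the total number of theme keys across all questions.

```python
def split_topic(topic: str) -> tuple[str, str]:
    """Split a 'topic_name: topic_description' string at the first colon."""
    name, _, description = topic.partition(":")
    return name.strip(), description.strip()


def count_assigned(themes: dict[int, list[str]]) -> int:
    """Total number of theme keys assigned across all questions."""
    return sum(len(keys) for keys in themes.values())


def drop_small_groups(groups: list[dict], min_themes: int = 5) -> list[dict]:
    """Keep groups with at least `min_themes` assigned themes (assumed rule)."""
    return [g for g in groups if count_assigned(g["themes"]) >= min_themes]


# Invented example groups; `themes` maps question number -> list of topic_ids.
groups = [
    {"name": "Cost of living", "themes": {1: ["A", "B"], 2: ["C"], 3: ["A", "D"]}},
    {"name": "Digital access", "themes": {1: ["C"], 3: ["B"]}},
]

kept = drop_small_groups(groups, min_themes=5)
print([g["name"] for g in kept])
print(split_topic("Funding: Concerns about long-term funding"))
```

With the default `min_themes=5`, only the first group survives because it has five assigned themes while the second has two.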
detail_detection
async
detail_detection(responses_df: DataFrame, llm: RunnableWithFallbacks, question: str, batch_size: int = 20, prompt_template: str | Path | PromptTemplate = 'detail_detection', system_prompt: str = CONSULTATION_SYSTEM_PROMPT, concurrency: int = 10, config: RunnableConfig | None = None) -> tuple[pd.DataFrame, pd.DataFrame]
Identify responses that provide high-value detailed evidence.
This function processes survey responses in batches to analyze their level of detail and evidence using a language model. It identifies responses that contain specific examples, data, or detailed reasoning that provide strong supporting evidence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `responses_df` | `DataFrame` | DataFrame containing survey responses to analyze. Must contain 'response_id' and 'response' columns. | required |
| `llm` | `RunnableWithFallbacks` | Language model instance to use for detail detection. | required |
| `question` | `str` | The survey question. | required |
| `batch_size` | `int` | Number of responses to process in each batch. Defaults to 20. | `20` |
| `prompt_template` | `str \| Path \| PromptTemplate` | Template for structuring the prompt to the LLM. Can be a string identifier, path to template file, or PromptTemplate instance. Defaults to "detail_detection". | `'detail_detection'` |
| `system_prompt` | `str` | System prompt to guide the LLM's behavior. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
| `concurrency` | `int` | Number of concurrent API calls to make. Defaults to 10. | `10` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames: the first contains the rows that were successfully processed by the LLM; the second contains the rows that could not be processed by the LLM. |
Note
The function uses response_id_integrity_check to ensure responses maintain their original order and association after processing.
Source code in src/themefinder/tasks.py
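How `batch_size` and `concurrency` interact can be sketched with `asyncio` alone. The code below is an illustration under stated assumptions, not the library's internals: `make_batches`, `fake_llm_call` and `process` are hypothetical names, and the "LLM" is a stub that marks any response containing a digit as detailed.

```python
import asyncio


def make_batches(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def fake_llm_call(batch: list[dict]) -> list[dict]:
    """Stand-in for an LLM request (hypothetical): digit present => detailed."""
    await asyncio.sleep(0)
    return [
        {
            "response_id": row["response_id"],
            "detailed": any(ch.isdigit() for ch in row["response"]),
        }
        for row in batch
    ]


async def process(responses: list[dict], batch_size: int = 20, concurrency: int = 10) -> list[dict]:
    """Run all batches, with at most `concurrency` calls in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(batch: list[dict]) -> list[dict]:
        async with semaphore:
            return await fake_llm_call(batch)

    # gather() returns results in submission order, so row order is preserved.
    results = await asyncio.gather(*(run_one(b) for b in make_batches(responses, batch_size)))
    return [row for batch in results for row in batch]


responses = [
    {"response_id": i, "response": f"Response {i}" if i % 2 else "No detail"}
    for i in range(1, 46)
]
rows = asyncio.run(process(responses))
print(len(make_batches(responses, 20)), len(rows))
```

Here 45 responses with `batch_size=20` produce three batches of 20, 20 and 5 rows, and the semaphore caps how many of those batches are awaited simultaneously.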
find_themes
async
find_themes(responses_df: DataFrame, llm: RunnableWithFallbacks, question: str, system_prompt: str = CONSULTATION_SYSTEM_PROMPT, verbose: bool = True, concurrency: int = 10, config: RunnableConfig | None = None) -> dict[str, str | pd.DataFrame]
Process survey responses through a multi-stage theme analysis pipeline.
This pipeline performs sequential analysis steps:

1. Initial theme generation
2. Theme condensation (combining similar themes)
3. Theme refinement
4. Mapping responses to refined themes
5. Detail detection
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `responses_df` | `DataFrame` | DataFrame containing survey responses. | required |
| `llm` | `RunnableWithFallbacks` | Language model instance for text analysis. | required |
| `question` | `str` | The survey question. | required |
| `system_prompt` | `str` | System prompt to guide the LLM's behaviour. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
| `verbose` | `bool` | Whether to show information messages during processing. Defaults to True. | `True` |
| `concurrency` | `int` | Number of concurrent API calls to make. Defaults to 10. | `10` |
| `config` | `RunnableConfig \| None` | Optional LangChain config for tracing/callbacks. Defaults to None. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, str \| DataFrame]` | Dictionary containing results from each pipeline stage: `question` (the survey question string), `themes` (DataFrame with the final themes output), `mapping` (DataFrame mapping responses to final themes), `detailed_responses` (DataFrame with detail detection results) and `unprocessables` (DataFrame containing the inputs that could not be processed by the LLM). |
Source code in src/themefinder/tasks.py
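The order of the stages and the keys of the returned dictionary can be sketched with stub coroutines. Everything here is hypothetical: `stage` and `run_pipeline` are illustrative names, lists of strings stand in for DataFrames, and the choice to take `unprocessables` from the mapping stage is an assumption rather than documented behaviour.

```python
import asyncio


async def stage(name: str, items: list[str]) -> tuple[list[str], list[str]]:
    """Hypothetical stage returning (processed, unprocessable), like the task functions."""
    await asyncio.sleep(0)
    return [f"{name}({item})" for item in items], []


async def run_pipeline(question: str, responses: list[str]) -> dict:
    themes, _ = await stage("generate", responses)            # 1. initial theme generation
    themes, _ = await stage("condense", themes)               # 2. theme condensation
    themes, _ = await stage("refine", themes)                 # 3. theme refinement
    mapping, unprocessable = await stage("map", responses)    # 4. mapping (assumed source of unprocessables)
    detailed, _ = await stage("detail", responses)            # 5. detail detection
    return {
        "question": question,
        "themes": themes,
        "mapping": mapping,
        "detailed_responses": detailed,
        "unprocessables": unprocessable,
    }


result = asyncio.run(run_pipeline("Do you agree?", ["r1", "r2"]))
print(list(result))
```

The first three stages feed each other, while mapping and detail detection both read the original responses.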
theme_clustering
theme_clustering(themes_df: DataFrame, llm: RunnableWithFallbacks, max_iterations: int = 5, target_themes: int = 10, significance_percentage: float = 10.0, return_all_themes: bool = False, system_prompt: str = CONSULTATION_SYSTEM_PROMPT, config: RunnableConfig | None = None) -> tuple[pd.DataFrame, pd.DataFrame]
Perform hierarchical clustering of themes using an agentic approach.
This function takes a DataFrame of themes and uses the ThemeClusteringAgent to iteratively merge similar themes into a hierarchical structure, then selects the most significant themes based on a threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `themes_df` | `DataFrame` | DataFrame containing themes with the columns `topic_id` (unique identifier for each theme), `topic_label` (short descriptive label for the theme), `topic_description` (detailed description of the theme) and `source_topic_count` (number of source responses for this theme). | required |
| `llm` | `RunnableWithFallbacks` | Language model instance configured with structured output for HierarchicalClusteringResponse. | required |
| `max_iterations` | `int` | Maximum number of clustering iterations. Defaults to 5. | `5` |
| `target_themes` | `int` | Target number of themes to cluster down to. Defaults to 10. | `10` |
| `significance_percentage` | `float` | Percentage threshold for selecting significant themes. Defaults to 10.0. | `10.0` |
| `return_all_themes` | `bool` | If True, returns all clustered themes. If False, returns only significant themes. Defaults to False. | `False` |
| `system_prompt` | `str` | System prompt to guide the LLM's behavior. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple containing: (1) a DataFrame of clustered themes (all or only significant, depending on `return_all_themes`); (2) an empty DataFrame, returned for consistency with other functions. |
Source code in src/themefinder/tasks.py
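The reference does not spell out how `significance_percentage` is applied. One plausible reading, shown here purely as an assumption, is that a theme counts as significant when its `source_topic_count` is at least that percentage of the total count. `select_significant` is a hypothetical helper and the theme data is invented.

```python
def select_significant(themes: list[dict], significance_percentage: float = 10.0) -> list[dict]:
    """Keep themes whose count is >= the given percentage of the total (assumed rule)."""
    total = sum(t["source_topic_count"] for t in themes)
    if total == 0:
        return []
    threshold = total * significance_percentage / 100
    return [t for t in themes if t["source_topic_count"] >= threshold]


# Invented themes using the column names listed for `themes_df`.
themes = [
    {"topic_id": "A", "topic_label": "Cost", "source_topic_count": 60},
    {"topic_id": "B", "topic_label": "Access", "source_topic_count": 35},
    {"topic_id": "C", "topic_label": "Other", "source_topic_count": 5},
]

significant = select_significant(themes, significance_percentage=10.0)
print([t["topic_id"] for t in significant])
```

Under this reading, lowering the threshold admits smaller themes, and `return_all_themes=True` would correspond to skipping the filter entirely.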
theme_condensation
async
theme_condensation(themes_df: DataFrame, llm: RunnableWithFallbacks, question: str, batch_size: int = 75, prompt_template: str | Path | PromptTemplate = 'theme_condensation', system_prompt: str = CONSULTATION_SYSTEM_PROMPT, concurrency: int = 10, config: RunnableConfig | None = None, **kwargs) -> tuple[pd.DataFrame, pd.DataFrame]
Condense and combine similar themes identified from survey responses.
This function processes the initially identified themes to combine similar or overlapping topics into more cohesive, broader categories using an LLM.
When the theme count exceeds the batch size, a first pass condenses within each batch independently, then a second pass merges across batches. The model decides organically how many themes to produce — there is no artificial target.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `themes_df` | `DataFrame` | DataFrame containing the initial themes identified from survey responses. | required |
| `llm` | `RunnableWithFallbacks` | Language model instance to use for theme condensation. | required |
| `question` | `str` | The survey question. | required |
| `batch_size` | `int` | Number of themes to process in each batch. Defaults to 75. | `75` |
| `prompt_template` | `str \| Path \| PromptTemplate` | Template for structuring the prompt to the LLM. Can be a string identifier, path to template file, or PromptTemplate instance. Defaults to "theme_condensation". | `'theme_condensation'` |
| `system_prompt` | `str` | System prompt to guide the LLM's behavior. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
| `concurrency` | `int` | Number of concurrent API calls to make. Defaults to 10. | `10` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames: the first contains the rows that were successfully processed by the LLM; the second contains the rows that could not be processed by the LLM. |
Source code in src/themefinder/tasks.py
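The two-pass behaviour described above (condense within each batch, then merge across batches) can be sketched as follows. This is an illustration only: `condense` is a hypothetical stand-in for the LLM that merges themes whose labels match case-insensitively, and `two_pass_condense` is not a themefinder function.

```python
def condense(themes: list[str]) -> list[str]:
    """Stand-in for the LLM: merge themes that match case-insensitively."""
    seen: dict[str, str] = {}
    for theme in themes:
        seen.setdefault(theme.lower(), theme)  # keep the first spelling seen
    return list(seen.values())


def two_pass_condense(themes: list[str], batch_size: int = 75) -> list[str]:
    """Condense in one call if the themes fit in a batch, otherwise in two passes."""
    if len(themes) <= batch_size:
        return condense(themes)
    batches = [themes[i:i + batch_size] for i in range(0, len(themes), batch_size)]
    # First pass: each batch is condensed independently.
    first_pass = [theme for batch in batches for theme in condense(batch)]
    # Second pass: merge duplicates that were split across batches.
    return condense(first_pass)


themes = ["Cost", "cost", "Access", "Safety", "access", "COST"]
result = two_pass_condense(themes, batch_size=3)
print(result)
```

The second pass matters because "access" and "COST" land in a different batch from "Access" and "Cost", so the first pass alone cannot merge them.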
theme_generation
async
theme_generation(responses_df: DataFrame, llm: RunnableWithFallbacks, question: str, batch_size: int = 50, partition_key: str | None = None, prompt_template: str | Path | PromptTemplate = 'theme_generation', system_prompt: str = CONSULTATION_SYSTEM_PROMPT, concurrency: int = 10, config: RunnableConfig | None = None) -> tuple[pd.DataFrame, pd.DataFrame]
Generate themes from survey responses using an LLM.
This function processes batches of survey responses to identify common themes or topics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `responses_df` | `DataFrame` | DataFrame containing survey responses. Must include 'response_id' and 'response' columns. | required |
| `llm` | `RunnableWithFallbacks` | Language model instance to use for theme generation. | required |
| `question` | `str` | The survey question. | required |
| `batch_size` | `int` | Number of responses to process in each batch. Defaults to 50. | `50` |
| `partition_key` | `str \| None` | Column name to use for batching related responses together. Defaults to None for sequential batching, but can be set to another column name for different grouping strategies. | `None` |
| `prompt_template` | `str \| Path \| PromptTemplate` | Template for structuring the prompt to the LLM. Can be a string identifier, path to template file, or PromptTemplate instance. Defaults to "theme_generation". | `'theme_generation'` |
| `system_prompt` | `str` | System prompt to guide the LLM's behavior. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
| `concurrency` | `int` | Number of concurrent API calls to make. Defaults to 10. | `10` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames: the first contains the rows that were successfully processed by the LLM; the second contains the rows that could not be processed by the LLM. |
Source code in src/themefinder/tasks.py
theme_mapping
async
theme_mapping(responses_df: DataFrame, llm: RunnableWithFallbacks, question: str, refined_themes_df: DataFrame, batch_size: int = 20, prompt_template: str | Path | PromptTemplate = 'theme_mapping', system_prompt: str = CONSULTATION_SYSTEM_PROMPT, concurrency: int = 10, config: RunnableConfig | None = None) -> tuple[pd.DataFrame, pd.DataFrame]
Map survey responses to refined themes using an LLM.
This function analyzes each survey response and determines which of the refined themes best matches its content. Multiple themes can be assigned to a single response.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `responses_df` | `DataFrame` | DataFrame containing survey responses. Must include 'response_id' and 'response' columns. | required |
| `llm` | `RunnableWithFallbacks` | Language model instance to use for theme mapping. | required |
| `question` | `str` | The survey question. | required |
| `refined_themes_df` | `DataFrame` | Single-row DataFrame where each column represents a theme (from theme_refinement stage). | required |
| `batch_size` | `int` | Number of responses to process in each batch. Defaults to 20. | `20` |
| `prompt_template` | `str \| Path \| PromptTemplate` | Template for structuring the prompt to the LLM. Can be a string identifier, path to template file, or PromptTemplate instance. Defaults to "theme_mapping". | `'theme_mapping'` |
| `system_prompt` | `str` | System prompt to guide the LLM's behavior. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
| `concurrency` | `int` | Number of concurrent API calls to make. Defaults to 10. | `10` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames: the first contains the rows that were successfully processed by the LLM; the second contains the rows that could not be processed by the LLM. |
Source code in src/themefinder/tasks.py
theme_refinement
async
theme_refinement(condensed_themes_df: DataFrame, llm: RunnableWithFallbacks, question: str, batch_size: int = 10000, prompt_template: str | Path | PromptTemplate = 'theme_refinement', system_prompt: str = CONSULTATION_SYSTEM_PROMPT, concurrency: int = 10, config: RunnableConfig | None = None) -> tuple[pd.DataFrame, pd.DataFrame]
Refine and standardise condensed themes using an LLM.
This function processes previously condensed themes to create clear, standardised theme descriptions. It also transforms the output format for improved readability by transposing the results into a single-row DataFrame where columns represent individual themes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `condensed_themes_df` | `DataFrame` | DataFrame containing the condensed themes from the previous pipeline stage. | required |
| `llm` | `RunnableWithFallbacks` | Language model instance to use for theme refinement. | required |
| `question` | `str` | The survey question. | required |
| `batch_size` | `int` | Number of themes to process in each batch. Defaults to 10000. | `10000` |
| `prompt_template` | `str \| Path \| PromptTemplate` | Template for structuring the prompt to the LLM. Can be a string identifier, path to template file, or PromptTemplate instance. Defaults to "theme_refinement". | `'theme_refinement'` |
| `system_prompt` | `str` | System prompt to guide the LLM's behavior. Defaults to CONSULTATION_SYSTEM_PROMPT. | `CONSULTATION_SYSTEM_PROMPT` |
| `concurrency` | `int` | Number of concurrent API calls to make. Defaults to 10. | `10` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, DataFrame]` | A tuple of two DataFrames: the first contains the rows that were successfully processed by the LLM; the second contains the rows that could not be processed by the LLM. |
Note
The function adds sequential response_ids to the input DataFrame and transposes the output for improved readability and easier downstream processing.
Source code in src/themefinder/tasks.py
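The transposition mentioned in the note can be pictured with plain dictionaries. This is a sketch under assumptions: it supposes each refined theme has a `topic_id` and a `topic` string, and that the single-row output uses the `topic_id` values as column names. `to_single_row` and `to_rows` are hypothetical helpers, and the themes are invented.

```python
def to_single_row(themes: list[dict]) -> dict[str, str]:
    """Turn one-row-per-theme records into a single row keyed by topic_id."""
    return {theme["topic_id"]: theme["topic"] for theme in themes}


def to_rows(single_row: dict[str, str]) -> list[dict]:
    """Inverse operation: expand the single row back to one record per theme."""
    return [{"topic_id": key, "topic": value} for key, value in single_row.items()]


# Invented refined themes in "topic_name: topic_description" format.
refined = [
    {"topic_id": "A", "topic": "Funding: Concerns about long-term funding"},
    {"topic_id": "B", "topic": "Access: Difficulty reaching local services"},
]

single_row = to_single_row(refined)
print(list(single_row))
print(to_rows(single_row) == refined)
```

A wide, single-row layout like this is convenient for passing the full theme list to a later stage in one piece.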