Tags · mitdbg/palimpzest

0.7.10

Remove Embedding Models from LLM Rules for Convert Operators (#180)

* don't consider embedding models in rules

* bumping version

Jun 30, 2025
dcaa718
zip
tar.gz
Downloads

0.7.9

Re-organize Research Scripts for Abacus Paper (#176)

* update README

* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

Currently we have to do

records, _ = qr3.run()
outputDf = DataRecord.to_df(records)

I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

Update run() to DataRecordCollection, so that it will be easier for use to support more features for run() output.

We support to_df() in this change.

I'll send out following commits to update other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

The code will run into issue if we don't return any stats for this function in

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```

* update record.to_df interface

update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None) which is consistent with other function in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

people used to get plan from execute() of streaming processor, which is not a good practice.

I update plan_str to plan_stats, and they need to get physical plan from processor.

Consider use better ways to provide executed physical plan to  DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iter over streaming output.

* see if copilot just saved me 20 minutes

* fix package name

* use sed to get version from pyproject.toml

* bump project version; keep docs behind to test ci pipeline

* bumping docs version to match code version

* use new __iter__ method in demos where possible

* add type hint for output of __iter__; use __iter__ in unit tests

* Update download-testdata.sh (#89)

Added enron-tiny.csv

* Clean up the retrieve API (#79)

* Clean up the retrieve operator interface

* fix comments

* Update to the new to_df() API

* Code update for #84 (#101)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* changed types to make use of Python type system; updated use of types in tests; updated docs and README

* update test to match no longer allowing None default

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Skip an operator if this is a duplicate op instead of raise error (#102)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* Skip an operator when it doesn't need any logicalOP instead of raise error

#Final Effects
1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
2. No extra convert() will be added to the plan.
3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

#Issue
Currently Dataset(src, schema) initiation has 2 responsibilities:
1. read source
2. convert source to schema.

When we use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:
1. Read source to schema that DataSource provides. This schema is derived by system, so the users don't know (don't need to know).
2. Convert Source schema to DefaultSchema.

So everytime, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong.

#Solution
1. We use schema from Datasource if exists, which is reasonable.
2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

#Real Examples
##Before
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> DefaultSchema
    (contents, filename, text_conte) -> (value)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper
    (value) -> (contents, filename, paper_auth)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.0]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

 3. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 4. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

##After
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> ScientificPaper
    (contents, filename, text_conte) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 3. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

* make equality check for new field names a bit more explicit

* fix fixture usage

* update all plans within code base to explicitly convert when needed; and removed unnecessary schemas for reading from datasource

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Refactor demos to use .sem_add_columns or .add_columns instead of convert(), remove Schema from demos when possible. (#104)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* use field_values instead of field_types as field_values have the actual values,

use field_values instead of field_types as field_values have the actual values, since field_values have the actual key-value pairs, while field_types are just contain fields and their types.

records[0].schema is the schema of the output, which doesn't mean we already populate the schema into record.

* Remove .convert() and use .sem_add_columns or .add_columns instead

This change is based on #101 and #102, please review them first then this change.

1. This is to refactor all demos to use .sem_add_columns or .add_columns, and remove .convert().

2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema now. We can refactor these cases later.

* ruff check --fix

* fix unittest

* demos fixed and unit tests running

* fix add_columns --> sem_add_columns in demo

* udpate quickstart to reflect code changes; shorten text as much as possible

* passing unit tests

* remove convert() everywhere

* fixes to correct errors in demos; update quickstart and docs

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Simplify Datasource (#103)

## Summary of PR changes

**Note 1:** I did not change anything related to val_datasource (including tangential functions like Dataset._set_data_source()) as that will all be modified in a subsequent PR to reflect our discussion re: validation data.

**Note 2:** I have completely commented out datamanager.py and config.py; for now I am willing to leave the code around in case we desperately need it for PalimpChat. However, my hope is that PalimpChat can be tweaked to work without the data manager and those files can be deleted before merging dev into main

**Note 3:** Despite the branch name, fixing the progress managers will be part of a separate PR.

- Collapsed all four `DataSource` classes down to a single `DataReader` class
- Limit the number of methods the user needs to implement to just `__len__()` and `__getitem__()`
    - (Switched from using `get_item() --> __getitem__()` in `DataReader`)
- Provided `DataReader` directly to scan operators (also renamed `DataSourcePhysicalOp --> ScanPhysicalOp`
- Removed `DataDirectory()` from `src/` entirely; this included commenting out things which made use of the cache (e.g. caching computed `DataRecords` and codegen examples)
- Got rid of `dataset_id` everywhere (which tracks with the previous bullet)
- Removed the `Config` class which was a relic of a bygone era (and also intertwined with the `DataDirectory()`)
- Updated all demos to use `import palimpzest as pz` to make the import statement(s) more welcoming
- Fixed one bug resulting from converts now producing union schemas. Instead of including the `output_schema` in an operators' `get_id_params()` we simply report the `generated_fields`.
- Changed `source_id --> source_idx` everywhere (this eliminated some weird renaming logic)
- Finally, I added a large set of documentation for the DataSource class(es)

* Multi-LLM Refinement Pipeline for Query Output Validation (#118)

* Multi-LLM Refinement Pipeline for Query Output Validation  (#92)

## Summary of PR

This PR contains the work to add a new `CriticConvert` physical operator to PZ. At a high-level, this operator runs a bonded convert, and then asks a critic model if the answer produced by the bonded convert can be improved upon. The original output and the critique are then fed into a refinement model, which produces the improved output.

The work to implement this includes:
1. Defining the physical operator in `src/palimpzest/query/operators/critique_and_refine_convert.py`
2. Adding an implementation rule for this physical operator in `src/palimpzest/query/optimizer/rules.py`
3. Adding boolean flag(s) to enable allowing / disallowing this physical optimization
4. Adding base prompts for the critique and refinement generations

One other change which this work spawned was an attempt to improve the management and construction of our prompts -- and to decouple this logic from the `BaseGenerator` class. On the management side, I split our single `prompts.py` file into a set of files. On the construction side, I created a `PromptFactory` class which templates prompts based on the `prompt_strategy` and input record.

The `PromptFactory` is not a perfect solution, but I think it is a step in the right direction.

Finally, I fixed an error which previously filtered out `RAGConvert` operators from being considered by the `Optimizer`, and I made 2-3 more miscellaneous small tweaks.

---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* MkDocs Site for Palimpzest API Documentation  (#116)

## Summary of PR Changes
1. Changed `docs` to use [MkDocs](https://www.mkdocs.org/) instead of Sphinx
2. Created initial `Getting Started` content
3. Created placeholders for `User Guide` content (to follow in a subsequent PR)
4. Added autogenerated docs for our most user-facing code (we will need to add docstrings to our code in a subsequent PR)
5. Made small tweaks to `src/` to allow users to specify policy using kwargs in `.run()`
6. Renamed the `testdata/enron-tiny/` files so that they're not so damn weird
---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* remove registration of sources from CI; only check version bump if there is a code change

* remove filter for only checking version bump when src files changed

* Rename `nocache` --> `cache` everywhere (#128)

* first commit

* Removed myenv

* added to git ignore

* addressed the comments in review

* flip one minor comment

* minor spacing fix

* fix spaces in a few more spots

---------

Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* adding citation (and making 'others' explicit) (#136)

* Make Generator thread-safe (#139)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* Begin Process of Improving Index Abstraction(s) in PZ (#138)

* quick and dirty implementation which tracks retrieve costs

* bug fixes and currently unused index code

* add default search func which I forgot to implement and add chromadb to pyproject.toml

* leaving TODO

* hotfix to add cost for retrieve operation

* another hotfix to add ragatouille dependency

* Add logger for PZ (#134)

* add logger for PZ

1. When verbose=True, we save all logs to log_file and print them on  console;
2. when verbose=False, we only save ERROR+ log to file and print ERROR+.

I just add logging to somewhere I think might be important for the execution, we always can add/remove for more or less.

Also I might update the logging message based on my later annotation work. But this PR should setup the logging mechanism for now.

* ruff fix

* update code based on comments

1. not logging output_records
2. not logging plan_stats
3. make the files to ".pz_logs"

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fix merge bug (#141)

* ruff fix

* update log dir and fix tiny bug

* fix merge bug

* Use a singleton API client for operators (#140)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* also create parent dir. if missing

* CUAD benchmark (#143)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* fix CUAD benchmarlk

* fix type

* minor fixes

* Limit the Scope of Logging within the Optimizer (#144)

* making it possible to set log level based on env. variable; adding time limit on seven filters test

* deleting instead of commenting out

* Remove Conventional LLM Convert; Update Bonded LLM Convert retry logic (#145)

* use NullHandler in __init__ and let application control logging config (#146)

* use NullHandler in __init__ and let application control logging config

* ruff fix

* Fix Progress Manager and Simplify `execute_plan` methods (#148)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* address comments

* The great deletion (#149)

* Adding Preliminary Work on Abacus and MAB Sentinel Execution (#147)

* updating models to avoid llama3

* fix parsing bugs and some generation errors

* don't require json for proposer and code synth generations; fix prompt format instruction for proposers

* fix typo/bug

* fix bugs in generator prep for field_answers; fix bug in filter impl.; other improvements

* adding new file for abacus workload

* fix len

* fix errors with dataset copy; prompt construction; and more

* remove JSON instruction from MOA proposer

* fixed bugs in optimizer configuration, llama 3.3 generation, and filter generation

* clean up demos; fix missing base prompt from map

* add one more missing base prompt

* prepare demo for full run; get embedding cost info from RAGConvert; use reasoning output from Critique

* add script to generate text-embedding-3-small reaction embeddings

* write to .chroma

* run full scale generation

* compute embeddings slowly and add progress bar

* add sleep

* fix import

* add total iters

* create embeddings before ingesting

* fix index start and finish

* load embeddings and insert directly

* make chroma use cosine sim.; finish initial search fcn. for biodex workload; naming tweak in rag convert

* capturing gen stats in Retrieve

* added UDF map operator; rewrote biodex pipeline to match docetl impl.; switched to using __name__ for functions instead of str()

* add optimizations back in

* write data to csv in demo

* limit to same model choice(s) as docetl and lotus

* fix punctuation error(s)

* try run without filter

* remove unused demo file

* remove print

* remove prints

* remove costed_phys_op_ids which were used for debugging

* try slightly diff. approach

* remove temp changes while branch is in PR review

* remove depends_on for map

* fix iteration bug in sentinel processors

* one more hotfix

* fix more errors w/SentinelPlanStats and sentinel processors

* remove logger lib to reduce confusion (#159)

* Update research.md (#160)

AISD @ NAACL 2025

* Add Pneuma-Palimpzest Integration Demo (#158)

* Add Pneuma demo

* Remove dataset semantic column addition

* Fix progress managers episode 2 attack of the clones (#156)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* catch errors in generating embeddings

* fix comments

* Merging in Changes for Sentinel Progress Bars; Split Convert (off by default); `demos/enron-demo.py`; and MMQA Benchmark (#163)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* adding prints to generator; turn progress off in favor of verbose for now

* catch errors in generating embeddings

* inspect frontier updates

* remove args.workload

* fix num_inputs in selectivity computation

* pdb in score

* fixed score fn issue

* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier

* fix progress counter

* debug

* fix empty stats

* only count stats from newly computed results

* fix tuple unpacking

* only update sample counts for llm ops

* de-dup duplicate record

* ugh

* dont forget to increment

* plz

* more plz

* increment

* recycle ops back onto reservoir so they may be reconsidered in the future

* remove pdb

* add progress to script args

* try without rag

* use term recall

* just check in on term recall

* make it easier to turn off progress

* remove pdb

* try to get re-rank to keep all inputs

* try to generate more reactions

* track total LLM calls

* 10x parallelism

* try retrieve directly on fulltext

* up max workers

* adding enron-demo w/optimization

* remove config option

* adding recall and precision to output

* allow operators to be recycled back onto frontier

* revert to using reactions instead of fulltext for similarity

* better cycling of off-frontier operators

* safety check on reservoir ops

* remove pdb

* fixing 5 results per query

* investigate sampling behavior

* check on seeds

* remove pdb

* test SplitConvert

* debug chunking

* fix bug in rag and split convert

* run with chunks

* test chunking logic

* fix chunking logic

* sum list

* remove split merge for now

* minor fixes to CUAD script

* add embedding scripts for mmqa tables and image titles

* address issue with empty titles and title collisions

* prepare script for using clip embeddings for images

* fix bug

* get full space of possible extensions

* debug

* weird bug fix?

* more debug

* fix idiotic mistake

* handle corrupted images and minor things

* add another corrupted image

* another one

* anotha

* more bad images

* last disallow file

* prepare cuad for runs

* specify execution strategy

* up samples

* add sentinel execution strategy to output name

* adding plan str and more stats

* specify no prior

* verbose=False

* fix comment; comment out prints

* make split merge optional for now

* addressing comments

* applying syntax changes to pneuma demo and supporting strings within retrieve

* bump version; fix lint; fix docs

* more docs tweaks; tweaking dependencies

* fix install issues

* one more version fix

* one more version fix

* one more version fix

* one more version fix

* last try

* change runner python version

* actually changing runner python version

* increase time limit for runners

* increase time limit for runners

* Merge in Changes From Final Abacus Work (WIP) (#173)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* adding prints to generator; turn progress off in favor of verbose for now

* catch errors in generating embeddings

* inspect frontier updates

* remove args.workload

* fix num_inputs in selectivity computation

* pdb in score

* fixed score fn issue

* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier

* fix progress counter

* debug

* fix empty stats

* only count stats from newly computed results

* fix tuple unpacking

* only update sample counts for llm ops

* de-dup duplicate record

* ugh

* dont forget to increment

* plz

* more plz

* increment

* recycle ops back onto reservoir so they may be reconsidered in the future

* remove pdb

* add progress to script args

* try without rag

* use term recall

* just check in on term recall

* make it easier to turn off progress

* remove pdb

* try to get re-rank to keep all inputs

* try to generate more reactions

* track total LLM calls

* 10x parallelism

* try retrieve directly on fulltext

* up max workers

* adding enron-demo w/optimization

* remove config option

* adding recall and precision to output

* allow operators to be recycled back onto frontier

* revert to using reactions instead of fulltext for similarity

* better cycling of off-frontier operators

* safety check on reservoir ops

* remove pdb

* fixing 5 results per query

* investigate sampling behavior

* check on seeds

* remove pdb

* test SplitConvert

* debug chunking

* fix bug in rag and split convert

* run with chunks

* test chunking logic

* fix chunking logic

* sum list

* remove split merge for now

* minor fixes to CUAD script

* add embedding scripts for mmqa tables and image titles

* address issue with empty titles and title collisions

* prepare script for using clip embeddings for images

* fix bug

* get full space of possible extensions

* debug

* weird bug fix?

* more debug

* fix idiotic mistake

* handle corrupted images and minor things

* add another corrupted image

* another one

* anotha

* more bad images

* last disallow file

* prepare cuad for runs

* specify execution strategy

* up samples

* add sentinel execution strategy to output name

* adding plan str and more stats

* specify no prior

* verbose=False

* fix comment; comment out prints

* make split merge optional for now

* addressing comments

* applying syntax changes to pneuma demo and supporting strings within retrieve

* add prints

* debug sample sets

* checking in code before tweaks to mab

* state of repo after running final Abacus experiments

* revert to opt-profiling-data

* removing print statement

* remove prints

* final fixes

* removing ragatouille dependency

* fix ruff lint checks

* bump version

* passing tests locally

* remove pdb

* fix complaint about match

* Move Abacus Research Scripts into Separate Folder (#175)

* re-organizing abacus research-related scripts

* fix model selection and other tweaks

* add data download script

* bump version

* remove scripts from root

* removing python files which were merged back in from main

---------

Co-authored-by: Jun <130543538+chjuncn@users.noreply.github.com>
Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Sivaprasad Sudhir <sivaprasad2626@gmail.com>
Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>
Co-authored-by: Bari Bo LeBari <143016395+lilbarbar@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Tranway1 <tranway@qq.com>
Co-authored-by: Luthfi Balaka <luthfibalaka@gmail.com>

Jun 18, 2025
cd02dbf
zip
tar.gz
Downloads

0.7.8

Merging in Changes from Final Research Work on Abacus (#174)

* update README

* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

Currently we have to do

records, _ = qr3.run()
outputDf = DataRecord.to_df(records)

I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

Update run() to DataRecordCollection, so that it will be easier for use to support more features for run() output.

We support to_df() in this change.

I'll send out following commits to update other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

The code will run into issue if we don't return any stats for this function in

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```

* update record.to_df interface

update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None) which is consistent with other function in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

people used to get plan from execute() of streaming processor, which is not a good practice.

I update plan_str to plan_stats, and they need to get physical plan from processor.

Consider use better ways to provide executed physical plan to  DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iter over streaming output.

* see if copilot just saved me 20 minutes

* fix package name

* use sed to get version from pyproject.toml

* bump project version; keep docs behind to test ci pipeline

* bumping docs version to match code version

* use new __iter__ method in demos where possible

* add type hint for output of __iter__; use __iter__ in unit tests

* Update download-testdata.sh (#89)

Added enron-tiny.csv

* Clean up the retrieve API (#79)

* Clean up the retrieve operator interface

* fix comments

* Update to the new to_df() API

* Code update for #84 (#101)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* changed types to make use of Python type system; updated use of types in tests; updated docs and README

* update test to match no longer allowing None default

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Skip an operator if this is a duplicate op instead of raise error (#102)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* Skip an operator when it doesn't need any logicalOP instead of raise error

#Final Effects
1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
2. No extra convert() will be added to the plan.
3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

#Issue
Currently Dataset(src, schema) initiation has 2 responsibilities:
1. read source
2. convert source to schema.

When we use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:
1. Read source to schema that DataSource provides. This schema is derived by system, so the users don't know (don't need to know).
2. Convert Source schema to DefaultSchema.

So everytime, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong.

#Solution
1. We use schema from Datasource if exists, which is reasonable.
2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

#Real Examples
##Before
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> DefaultSchema
    (contents, filename, text_conte) -> (value)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper
    (value) -> (contents, filename, paper_auth)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.0]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

 3. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 4. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

##After
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> ScientificPaper
    (contents, filename, text_conte) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 3. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

* make equality check for new field names a bit more explicit

* fix fixture usage

* update all plans within code base to explicitly convert when needed; and removed unnecessary schemas for reading from datasource

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Refactor demos to use .sem_add_columns or .add_columns instead of convert(), remove Schema from demos when possible. (#104)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* use field_values instead of field_types as field_values have the actual values,

use field_values instead of field_types as field_values have the actual values, since field_values have the actual key-value pairs, while field_types are just contain fields and their types.

records[0].schema is the schema of the output, which doesn't mean we already populate the schema into record.

* Remove .convert() and use .sem_add_columns or .add_columns instead

This change is based on #101 and #102, please review them first then this change.

1. This is to refactor all demos to use .sem_add_columns or .add_columns, and remove .convert().

2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema now. We can refactor these cases later.

* ruff check --fix

* fix unittest

* demos fixed and unit tests running

* fix add_columns --> sem_add_columns in demo

* udpate quickstart to reflect code changes; shorten text as much as possible

* passing unit tests

* remove convert() everywhere

* fixes to correct errors in demos; update quickstart and docs

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Simplify Datasource (#103)

## Summary of PR changes

**Note 1:** I did not change anything related to val_datasource (including tangential functions like Dataset._set_data_source()) as that will all be modified in a subsequent PR to reflect our discussion re: validation data.

**Note 2:** I have completely commented out datamanager.py and config.py; for now I am willing to leave the code around in case we desperately need it for PalimpChat. However, my hope is that PalimpChat can be tweaked to work without the data manager and those files can be deleted before merging dev into main

**Note 3:** Despite the branch name, fixing the progress managers will be part of a separate PR.

- Collapsed all four `DataSource` classes down to a single `DataReader` class
- Limit the number of methods the user needs to implement to just `__len__()` and `__getitem__()`
    - (Switched from using `get_item() --> __getitem__()` in `DataReader`)
- Provided `DataReader` directly to scan operators (also renamed `DataSourcePhysicalOp --> ScanPhysicalOp`
- Removed `DataDirectory()` from `src/` entirely; this included commenting out things which made use of the cache (e.g. caching computed `DataRecords` and codegen examples)
- Got rid of `dataset_id` everywhere (which tracks with the previous bullet)
- Removed the `Config` class which was a relic of a bygone era (and also intertwined with the `DataDirectory()`)
- Updated all demos to use `import palimpzest as pz` to make the import statement(s) more welcoming
- Fixed one bug resulting from converts now producing union schemas. Instead of including the `output_schema` in an operators' `get_id_params()` we simply report the `generated_fields`.
- Changed `source_id --> source_idx` everywhere (this eliminated some weird renaming logic)
- Finally, I added a large set of documentation for the DataSource class(es)

* Multi-LLM Refinement Pipeline for Query Output Validation (#118)

* Multi-LLM Refinement Pipeline for Query Output Validation  (#92)

## Summary of PR

This PR contains the work to add a new `CriticConvert` physical operator to PZ. At a high-level, this operator runs a bonded convert, and then asks a critic model if the answer produced by the bonded convert can be improved upon. The original output and the critique are then fed into a refinement model, which produces the improved output.

The work to implement this includes:
1. Defining the physical operator in `src/palimpzest/query/operators/critique_and_refine_convert.py`
2. Adding an implementation rule for this physical operator in `src/palimpzest/query/optimizer/rules.py`
3. Adding boolean flag(s) to enable allowing / disallowing this physical optimization
4. Adding base prompts for the critique and refinement generations

One other change which this work spawned was an attempt to improve the management and construction of our prompts -- and to decouple this logic from the `BaseGenerator` class. On the management side, I split our single `prompts.py` file into a set of files. On the construction side, I created a `PromptFactory` class which templates prompts based on the `prompt_strategy` and input record.

The `PromptFactory` is not a perfect solution, but I think it is a step in the right direction.

Finally, I fixed an error which previously filtered out `RAGConvert` operators from being considered by the `Optimizer`, and I made 2-3 more miscellaneous small tweaks.

---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* MkDocs Site for Palimpzest API Documentation  (#116)

## Summary of PR Changes
1. Changed `docs` to use [MkDocs](https://www.mkdocs.org/) instead of Sphinx
2. Created initial `Getting Started` content
3. Created placeholders for `User Guide` content (to follow in a subsequent PR)
4. Added autogenerated docs for our most user-facing code (we will need to add docstrings to our code in a subsequent PR)
5. Made small tweaks to `src/` to allow users to specify policy using kwargs in `.run()`
6. Renamed the `testdata/enron-tiny/` files so that they're not so damn weird
---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* remove registration of sources from CI; only check version bump if there is a code change

* remove filter for only checking version bump when src files changed

* Rename `nocache` --> `cache` everywhere (#128)

* first commit

* Removed myenv

* added to git ignore

* addressed the comments in review

* flip one minor comment

* minor spacing fix

* fix spaces in a few more spots

---------

Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* adding citation (and making 'others' explicit) (#136)

* Make Generator thread-safe (#139)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* Begin Process of Improving Index Abstraction(s) in PZ (#138)

* quick and dirty implementation which tracks retrieve costs

* bug fixes and currently unused index code

* add default search func which I forgot to implement and add chromadb to pyproject.toml

* leaving TODO

* hotfix to add cost for retrieve operation

* another hotfix to add ragatouille dependency

* Add logger for PZ (#134)

* add logger for PZ

1. When verbose=True, we save all logs to log_file and print them on  console;
2. when verbose=False, we only save ERROR+ log to file and print ERROR+.

I just add logging to somewhere I think might be important for the execution, we always can add/remove for more or less.

Also I might update the logging message based on my later annotation work. But this PR should setup the logging mechanism for now.

* ruff fix

* update code based on comments

1. not logging output_records
2. not logging plan_stats
3. make the files to ".pz_logs"

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fix merge bug (#141)

* ruff fix

* update log dir and fix tiny bug

* fix merge bug

* Use a singleton API client for operators (#140)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* also create parent dir. if missing

* CUAD benchmark (#143)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* fix CUAD benchmarlk

* fix type

* minor fixes

* Limit the Scope of Logging within the Optimizer (#144)

* making it possible to set log level based on env. variable; adding time limit on seven filters test

* deleting instead of commenting out

* Remove Conventional LLM Convert; Update Bonded LLM Convert retry logic (#145)

* use NullHandler in __init__ and let application control logging config (#146)

* use NullHandler in __init__ and let application control logging config

* ruff fix

* Fix Progress Manager and Simplify `execute_plan` methods (#148)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrit
8000
e of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* address comments

* The great deletion (#149)

* Adding Preliminary Work on Abacus and MAB Sentinel Execution (#147)

* updating models to avoid llama3

* fix parsing bugs and some generation errors

* don't require json for proposer and code synth generations; fix prompt format instruction for proposers

* fix typo/bug

* fix bugs in generator prep for field_answers; fix bug in filter impl.; other improvements

* adding new file for abacus workload

* fix len

* fix errors with dataset copy; prompt construction; and more

* remove JSON instruction from MOA proposer

* fixed bugs in optimizer configuration, llama 3.3 generation, and filter generation

* clean up demos; fix missing base prompt from map

* add one more missing base prompt

* prepare demo for full run; get embedding cost info from RAGConvert; use reasoning output from Critique

* add script to generate text-embedding-3-small reaction embeddings

* write to .chroma

* run full scale generation

* compute embeddings slowly and add progress bar

* add sleep

* fix import

* add total iters

* create embeddings before ingesting

* fix index start and finish

* load embeddings and insert directly

* make chroma use cosine sim.; finish initial search fcn. for biodex workload; naming tweak in rag convert

* capturing gen stats in Retrieve

* added UDF map operator; rewrote biodex pipeline to match docetl impl.; switched to using __name__ for functions instead of str()

* add optimizations back in

* write data to csv in demo

* limit to same model choice(s) as docetl and lotus

* fix punctuation error(s)

* try run without filter

* remove unused demo file

* remove print

* remove prints

* remove costed_phys_op_ids which were used for debugging

* try slightly diff. approach

* remove temp changes while branch is in PR review

* remove depends_on for map

* fix iteration bug in sentinel processors

* one more hotfix

* fix more errors w/SentinelPlanStats and sentinel processors

* remove logger lib to reduce confusion (#159)

* Update research.md (#160)

AISD @ NAACL 2025

* Add Pneuma-Palimpzest Integration Demo (#158)

* Add Pneuma demo

* Remove dataset semantic column addition

* Fix progress managers episode 2 attack of the clones (#156)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* catch errors in generating embeddings

* fix comments

* Merging in Changes for Sentinel Progress Bars; Split Convert (off by default); `demos/enron-demo.py`; and MMQA Benchmark (#163)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* adding prints to generator; turn progress off in favor of verbose for now

* catch errors in generating embeddings

* inspect frontier updates

* remove args.workload

* fix num_inputs in selectivity computation

* pdb in score

* fixed score fn issue

* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier

* fix progress counter

* debug

* fix empty stats

* only count stats from newly computed results

* fix tuple unpacking

* only update sample counts for llm ops

* de-dup duplicate record

* ugh

* dont forget to increment

* plz

* more plz

* increment

* recycle ops back onto reservoir so they may be reconsidered in the future

* remove pdb

* add progress to script args

* try without rag

* use term recall

* just check in on term recall

* make it easier to turn off progress

* remove pdb

* try to get re-rank to keep all inputs

* try to generate more reactions

* track total LLM calls

* 10x parallelism

* try retrieve directly on fulltext

* up max workers

* adding enron-demo w/optimization

* remove config option

* adding recall and precision to output

* allow operators to be recycled back onto frontier

* revert to using reactions instead of fulltext for similarity

* better cycling of off-frontier operators

* safety check on reservoir ops

* remove pdb

* fixing 5 results per query

* investigate sampling behavior

* check on seeds

* remove pdb

* test SplitConvert

* debug chunking

* fix bug in rag and split convert

* run with chunks

* test chunking logic

* fix chunking logic

* sum list

* remove split merge for now

* minor fixes to CUAD script

* add embedding scripts for mmqa tables and image titles

* address issue with empty titles and title collisions

* prepare script for using clip embeddings for images

* fix bug

* get full space of possible extensions

* debug

* weird bug fix?

* more debug

* fix idiotic mistake

* handle corrupted images and minor things

* add another corrupted image

* another one

* anotha

* more bad images

* last disallow file

* prepare cuad for runs

* specify execution strategy

* up samples

* add sentinel execution strategy to output name

* adding plan str and more stats

* specify no prior

* verbose=False

* fix comment; comment out prints

* make split merge optional for now

* addressing comments

* applying syntax changes to pneuma demo and supporting strings within retrieve

* bump version; fix lint; fix docs

* more docs tweaks; tweaking dependencies

* fix install issues

* one more version fix

* one more version fix

* one more version fix

* one more version fix

* last try

* change runner python version

* actually changing runner python version

* increase time limit for runners

* increase time limit for runners

* Merge in Changes From Final Abacus Work (WIP) (#173)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* adding prints to generator; turn progress off in favor of verbose for now

* catch errors in generating embeddings

* inspect frontier updates

* remove args.workload

* fix num_inputs in selectivity computation

* pdb in score

* fixed score fn issue

* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier

* fix progress counter

* debug

* fix empty stats

* only count stats from newly computed results

* fix tuple unpacking

* only update sample counts for llm ops

* de-dup duplicate record

* ugh

* dont forget to increment

* plz

* more plz

* increment

* recycle ops back onto reservoir so they may be reconsidered in the future

* remove pdb

* add progress to script args

* try without rag

* use term recall

* just check in on term recall

* make it easier to turn off progress

* remove pdb

* try to get re-rank to keep all inputs

* try to generate more reactions

* track total LLM calls

* 10x parallelism

* try retrieve directly on fulltext

* up max workers

* adding enron-demo w/optimization

* remove config option

* adding recall and precision to output

* allow operators to be recycled back onto frontier

* revert to using reactions instead of fulltext for similarity

* better cycling of off-frontier operators

* safety check on reservoir ops

* remove pdb

* fixing 5 results per query

* investigate sampling behavior

* check on seeds

* remove pdb

* test SplitConvert

* debug chunking

* fix bug in rag and split convert

* run with chunks

* test chunking logic

* fix chunking logic

* sum list

* remove split merge for now

* minor fixes to CUAD script

* add embedding scripts for mmqa tables and image titles

* address issue with empty titles and title collisions

* prepare script for using clip embeddings for images

* fix bug

* get full space of possible extensions

* debug

* weird bug fix?

* more debug

* fix idiotic mistake

* handle corrupted images and minor things

* add another corrupted image

* another one

* anotha

* more bad images

* last disallow file

* prepare cuad for runs

* specify execution strategy

* up samples

* add sentinel execution strategy to output name

* adding plan str and more stats

* specify no prior

* verbose=False

* fix comment; comment out prints

* make split merge optional for now

* addressing comments

* applying syntax changes to pneuma demo and supporting strings within retrieve

* add prints

* debug sample sets

* checking in code before tweaks to mab

* state of repo after running final Abacus experiments

* revert to opt-profiling-data

* removing print statement

* remove prints

* final fixes

* removing ragatouille dependency

* fix ruff lint checks

* bump version

* passing tests locally

* remove pdb

* fix complaint about match

---------

Co-authored-by: Jun <130543538+chjuncn@users.noreply.github.com>
Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Sivaprasad Sudhir <sivaprasad2626@gmail.com>
Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>
Co-authored-by: Bari Bo LeBari <143016395+lilbarbar@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Tranway1 <tranway@qq.com>
Co-authored-by: Luthfi Balaka <luthfibalaka@gmail.com>

Jun 18, 2025
763fc37
zip
tar.gz
Downloads

abacus-paper-experiments

state of repo after running final Abacus experiments

Jun 12, 2025
63fa083
zip
tar.gz

0.7.7

clear output from demo (#171)

Apr 28, 2025
26372c6
zip
tar.gz
Downloads

0.7.6

Update Colab Link and Quickstart Demo (#170)

* update colab link and quickstart demo

* bump version

Apr 28, 2025
b383b1a
zip
tar.gz
Downloads

0.7.5

Hotfix turn off logs for demo (#169)

* update README

* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

Currently we have to do

records, _ = qr3.run()
outputDf = DataRecord.to_df(records)

I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

Update run() to DataRecordCollection, so that it will be easier for use to support more features for run() output.

We support to_df() in this change.

I'll send out following commits to update other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

The code will run into issue if we don't return any stats for this function in

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```

* update record.to_df interface

update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None) which is consistent with other function in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

people used to get plan from execute() of streaming processor, which is not a good practice.

I update plan_str to plan_stats, and they need to get physical plan from processor.

Consider use better ways to provide executed physical plan to  DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iter over streaming output.

* see if copilot just saved me 20 minutes

* fix package name

* use sed to get version from pyproject.toml

* bump project version; keep docs behind to test ci pipeline

* bumping docs version to match code version

* use new __iter__ method in demos where possible

* add type hint for output of __iter__; use __iter__ in unit tests

* Update download-testdata.sh (#89)

Added enron-tiny.csv

* Clean up the retrieve API (#79)

* Clean up the retrieve operator interface

* fix comments

* Update to the new to_df() API

* Code update for #84 (#101)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* changed types to make use of Python type system; updated use of types in tests; updated docs and README

* update test to match no longer allowing None default

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Skip an operator if this is a duplicate op instead of raise error (#102)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* Skip an operator when it doesn't need any logicalOP instead of raise error

#Final Effects
1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
2. No extra convert() will be added to the plan.
3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

#Issue
Currently Dataset(src, schema) initiation has 2 responsibilities:
1. read source
2. convert source to schema.

When we use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:
1. Read source to schema that DataSource provides. This schema is derived by system, so the users don't know (don't need to know).
2. Convert Source schema to DefaultSchema.

So everytime, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong.

#Solution
1. We use schema from Datasource if exists, which is reasonable.
2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

#Real Examples
##Before
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> DefaultSchema
    (contents, filename, text_conte) -> (value)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper
    (value) -> (contents, filename, paper_auth)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.0]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

 3. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 4. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

##After
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> ScientificPaper
    (contents, filename, text_conte) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 3. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

* make equality check for new field names a bit more explicit

* fix fixture usage

* update all plans within code base to explicitly convert when needed; and removed unnecessary schemas for reading from datasource

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Refactor demos to use .sem_add_columns or .add_columns instead of convert(), remove Schema from demos when possible. (#104)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* use field_values instead of field_types as field_values have the actual values,

use field_values instead of field_types as field_values have the actual values, since field_values have the actual key-value pairs, while field_types are just contain fields and their types.

records[0].schema is the schema of the output, which doesn't mean we already populate the schema into record.

* Remove .convert() and use .sem_add_columns or .add_columns instead

This change is based on #101 and #102, please review them first then this change.

1. This is to refactor all demos to use .sem_add_columns or .add_columns, and remove .convert().

2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema now. We can refactor these cases later.

* ruff check --fix

* fix unittest

* demos fixed and unit tests running

* fix add_columns --> sem_add_columns in demo

* udpate quickstart to reflect code changes; shorten text as much as possible

* passing unit tests

* remove convert() everywhere

* fixes to correct errors in demos; update quickstart and docs

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Simplify Datasource (#103)

## Summary of PR changes

**Note 1:** I did not change anything related to val_datasource (including tangential functions like Dataset._set_data_source()) as that will all be modified in a subsequent PR to reflect our discussion re: validation data.

**Note 2:** I have completely commented out datamanager.py and config.py; for now I am willing to leave the code around in case we desperately need it for PalimpChat. However, my hope is that PalimpChat can be tweaked to work without the data manager and those files can be deleted before merging dev into main

**Note 3:** Despite the branch name, fixing the progress managers will be part of a separate PR.

- Collapsed all four `DataSource` classes down to a single `DataReader` class
- Limit the number of methods the user needs to implement to just `__len__()` and `__getitem__()`
    - (Switched from using `get_item() --> __getitem__()` in `DataReader`)
- Provided `DataReader` directly to scan operators (also renamed `DataSourcePhysicalOp --> ScanPhysicalOp`
- Removed `DataDirectory()` from `src/` entirely; this included commenting out things which made use of the cache (e.g. caching computed `DataRecords` and codegen examples)
- Got rid of `dataset_id` everywhere (which tracks with the previous bullet)
- Removed the `Config` class which was a relic of a bygone era (and also intertwined with the `DataDirectory()`)
- Updated all demos to use `import palimpzest as pz` to make the import statement(s) more welcoming
- Fixed one bug resulting from converts now producing union schemas. Instead of including the `output_schema` in an operators' `get_id_params()` we simply report the `generated_fields`.
- Changed `source_id --> source_idx` everywhere (this eliminated some weird renaming logic)
- Finally, I added a large set of documentation for the DataSource class(es)

* Multi-LLM Refinement Pipeline for Query Output Validation (#118)

* Multi-LLM Refinement Pipeline for Query Output Validation  (#92)

## Summary of PR

This PR contains the work to add a new `CriticConvert` physical operator to PZ. At a high-level, this operator runs a bonded convert, and then asks a critic model if the answer produced by the bonded convert can be improved upon. The original output and the critique are then fed into a refinement model, which produces the improved output.

The work to implement this includes:
1. Defining the physical operator in `src/palimpzest/query/operators/critique_and_refine_convert.py`
2. Adding an implementation rule for this physical operator in `src/palimpzest/query/optimizer/rules.py`
3. Adding boolean flag(s) to enable allowing / disallowing this physical optimization
4. Adding base prompts for the critique and refinement generations

One other change which this work spawned was an attempt to improve the management and construction of our prompts -- and to decouple this logic from the `BaseGenerator` class. On the management side, I split our single `prompts.py` file into a set of files. On the construction side, I created a `PromptFactory` class which templates prompts based on the `prompt_strategy` and input record.

The `PromptFactory` is not a perfect solution, but I think it is a step in the right direction.

Finally, I fixed an error which previously filtered out `RAGConvert` operators from being considered by the `Optimizer`, and I made 2-3 more miscellaneous small tweaks.

---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* MkDocs Site for Palimpzest API Documentation  (#116)

## Summary of PR Changes
1. Changed `docs` to use [MkDocs](https://www.mkdocs.org/) instead of Sphinx
2. Created initial `Getting Started` content
3. Created placeholders for `User Guide` content (to follow in a subsequent PR)
4. Added autogenerated docs for our most user-facing code (we will need to add docstrings to our code in a subsequent PR)
5. Made small tweaks to `src/` to allow users to specify policy using kwargs in `.run()`
6. Renamed the `testdata/enron-tiny/` files so that they're not so damn weird
---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* remove registration of sources from CI; only check version bump if there is a code change

* remove filter for only checking version bump when src files changed

* Rename `nocache` --> `cache` everywhere (#128)

* first commit

* Removed myenv

* added to git ignore

* addressed the comments in review

* flip one minor comment

* minor spacing fix

* fix spaces in a few more spots

---------

Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* adding citation (and making 'others' explicit) (#136)

* Make Generator thread-safe (#139)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* Begin Process of Improving Index Abstraction(s) in PZ (#138)

* quick and dirty implementation which tracks retrieve costs

* bug fixes and currently unused index code

* add default search func which I forgot to implement and add chromadb to pyproject.toml

* leaving TODO

* hotfix to add cost for retrieve operation

* another hotfix to add ragatouille dependency

* Add logger for PZ (#134)

* add logger for PZ

1. When verbose=True, we save all logs to log_file and print them on  console;
2. when verbose=False, we only save ERROR+ log to file and print ERROR+.

I just add logging to somewhere I think might be important for the execution, we always can add/remove for more or less.

Also I might update the logging message based on my later annotation work. But this PR should setup the logging mechanism for now.

* ruff fix

* update code based on comments

1. not logging output_records
2. not logging plan_stats
3. make the files to ".pz_logs"

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fix merge bug (#141)

* ruff fix

* update log dir and fix tiny bug

* fix merge bug

* Use a singleton API client for operators (#140)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* also create parent dir. if missing

* CUAD benchmark (#143)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* fix CUAD benchmarlk

* fix type

* minor fixes

* Limit the Scope of Logging within the Optimizer (#144)

* making it possible to set log level based on env. variable; adding time limit on seven filters test

* deleting instead of commenting out

* Remove Conventional LLM Convert; Update Bonded LLM Convert retry logic (#145)

* use NullHandler in __init__ and let application control logging config (#146)

* use NullHandler in __init__ and let application control logging config

* ruff fix

* Fix Progress Manager and Simplify `execute_plan` methods (#148)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* address comments

* The great deletion (#149)

* Adding Preliminary Work on Abacus and MAB Sentinel Execution (#147)

* updating models to avoid llama3

* fix parsing bugs and some generation errors

* don't require json for proposer and code synth generations; fix prompt format instruction for proposers

* fix typo/bug

* fix bugs in generator prep for field_answers; fix bug in filter impl.; other improvements

* adding new file for abacus workload

* fix len

* fix errors with dataset copy; prompt construction; and more

* remove JSON instruction from MOA proposer

* fixed bugs in optimizer configuration, llama 3.3 generation, and filter generation

* clean up demos; fix missing base prompt from map

* add one more missing base prompt

* prepare demo for full run; get embedding cost info from RAGConvert; use reasoning output from Critique

* add script to generate text-embedding-3-small reaction embeddings

* write to .chroma

* run full scale generation

* compute embeddings slowly and add progress bar

* add sleep

* fix import

* add total iters

* create embeddings before ingesting

* fix index start and finish

* load embeddings and insert directly

* make chroma use cosine sim.; finish initial search fcn. for biodex workload; naming tweak in rag convert

* capturing gen stats in Retrieve

* added UDF map operator; rewrote biodex pipeline to match docetl impl.; switched to using __name__ for functions instead of str()

* add optimizations back in

* write data to csv in demo

* limit to same model choice(s) as docetl and lotus

* fix punctuation error(s)

* try run without filter

* remove unused demo file

* remove print

* remove prints

* remove costed_phys_op_ids which were used for debugging

* try slightly diff. approach

* remove temp changes while branch is in PR review

* remove depends_on for map

* fix iteration bug in sentinel processors

* one more hotfix

* fix more errors w/SentinelPlanStats and sentinel processors

* remove logger lib to reduce confusion (#159)

* Update research.md (#160)

AISD @ NAACL 2025

* Add Pneuma-Palimpzest Integration Demo (#158)

* Add Pneuma demo

* Remove dataset semantic column addition

* Fix progress managers episode 2 attack of the clones (#156)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* catch errors in generating embeddings

* fix comments

* Merging in Changes for Sentinel Progress Bars; Split Convert (off by default); `demos/enron-demo.py`; and MMQA Benchmark (#163)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* adding prints to generator; turn progress off in favor of verbose for now

* catch errors in generating embeddings

* inspect frontier updates

* remove args.workload

* fix num_inputs in selectivity computation

* pdb in score

* fixed score fn issue

* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier

* fix progress counter

* debug

* fix empty stats

* only count stats from newly computed results

* fix tuple unpacking

* only update sample counts for llm ops

* de-dup duplicate record

* ugh

* dont forget to increment

* plz

* more plz

* increment

* recycle ops back onto reservoir so they may be reconsidered in the future

* remove pdb

* add progress to script args

* try without rag

* use term recall

* just check in on term recall

* make it easier to turn off progress

* remove pdb

* try to get re-rank to keep all inputs

* try to generate more reactions

* track total LLM calls

* 10x parallelism

* try retrieve directly on fulltext

* up max workers

* adding enron-demo w/optimization

* remove config option

* adding recall and precision to output

* allow operators to be recycled back onto frontier

* revert to using reactions instead of fulltext for similarity

* better cycling of off-frontier operators

* safety check on reservoir ops

* remove pdb

* fixing 5 results per query

* investigate sampling behavior

* check on seeds

* remove pdb

* test SplitConvert

* debug chunking

* fix bug in rag and split convert

* run with chunks

* test chunking logic

* fix chunking logic

* sum list

* remove split merge for now

* minor fixes to CUAD script

* add embedding scripts for mmqa tables and image titles

* address issue with empty titles and title collisions

* prepare script for using clip embeddings for images

* fix bug

* get full space of possible extensions

* debug

* weird bug fix?

* more debug

* fix idiotic mistake

* handle corrupted images and minor things

* add another corrupted image

* another one

* anotha

* more bad images

* last disallow file

* prepare cuad for runs

* specify execution strategy

* up samples

* add sentinel execution strategy to output name

* adding plan str and more stats

* specify no prior

* verbose=False

* fix comment; comment out prints

* make split merge optional for now

* addressing comments

* applying syntax changes to pneuma demo and supporting strings within retrieve

* bump version; fix lint; fix docs

* more docs tweaks; tweaking dependencies

* fix install issues

* one more version fix

* one more version fix

* one more version fix

* one more version fix

* last try

* change runner python version

* actually changing runner python version

* increase time limit for runners

* increase time limit for runners

* turning off logs for demo

* fix lint issue

* fix lint issue

---------

Co-authored-by: Jun <130543538+chjuncn@users.noreply.github.com>
Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Sivaprasad Sudhir <sivaprasad2626@gmail.com>
Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>
Co-authored-by: Bari Bo LeBari <143016395+lilbarbar@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Tranway1 <tranway@qq.com>
Co-authored-by: Luthfi Balaka <luthfibalaka@gmail.com>

Apr 25, 2025
6746f78
zip
tar.gz
Downloads

0.7.4

Use Colab Versions for PIL and psutil (#168)

* set colab versions

* bump together

* remove dep.

* try making mkdocs optional

Apr 25, 2025
d720b5f
zip
tar.gz
Downloads

0.7.3

Set Gradio Version (#167)

* give gradio version

* bump version

Apr 25, 2025
50d3dc2
zip
tar.gz
Downloads

0.7.2

Simplify Dependencies in pyproject.toml; remove TokenReducedConvert (#…

…166)

* update README

* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

Currently we have to do

records, _ = qr3.run()
outputDf = DataRecord.to_df(records)

I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

Update run() to DataRecordCollection, so that it will be easier for use to support more features for run() output.

We support to_df() in this change.

I'll send out following commits to update other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

The code will run into issue if we don't return any stats for this function in

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```

* update record.to_df interface

update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None) which is consistent with other function in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

people used to get plan from execute() of streaming processor, which is not a good practice.

I update plan_str to plan_stats, and they need to get physical plan from processor.

Consider use better ways to provide executed physical plan to  DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iter over streaming output.

* see if copilot just saved me 20 minutes

* fix package name

* use sed to get version from pyproject.toml

* bump project version; keep docs behind to test ci pipeline

* bumping docs version to match code version

* use new __iter__ method in demos where possible

* add type hint for output of __iter__; use __iter__ in unit tests

* Update download-testdata.sh (#89)

Added enron-tiny.csv

* Clean up the retrieve API (#79)

* Clean up the retrieve operator interface

* fix comments

* Update to the new to_df() API

* Code update for #84 (#101)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* changed types to make use of Python type system; updated use of types in tests; updated docs and README

* update test to match no longer allowing None default

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Skip an operator if this is a duplicate op instead of raise error (#102)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* Skip an operator when it doesn't need any logicalOP instead of raise error

#Final Effects
1. Dataset() init only has one responsibility: wrap a datasource to a Dataset. I think this is a better interface.
2. No extra convert() will be added to the plan.
3. When users add the same op multiple times dataset.convert(File).convert(File), the system will just dedup the same op instead of raise error.

#Issue
Currently Dataset(src, schema) initiation has 2 responsibilities:
1. read source
2. convert source to schema.

When we use default schema for Dataset init(source, schema=DefaultSchema) for users, the code works like:
1. Read source to schema that DataSource provides. This schema is derived by system, so the users don't know (don't need to know).
2. Convert Source schema to DefaultSchema.

So everytime, the system will make one more convert call to convert SourceSchema to DefaultSchema, which is definitely wrong.

#Solution
1. We use schema from Datasource if exists, which is reasonable.
2. If we do 1, then we'll get a dataset node that no actual op as its input_schema ==output_schema, so I updated a line in optimizer to just skip the node if it doesn't do anything instead raiseerror.

#Real Examples
##Before
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> DefaultSchema
    (contents, filename, text_conte) -> (value)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. DefaultSchema -> MixtureOfAgentsConvert -> ScientificPaper
    (value) -> (contents, filename, paper_auth)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.0]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

 3. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 4. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

##After
Generated plan:
  0. MarshalAndScanDataOp -> PDFFile

 1. PDFFile -> LLMConvertBonded -> ScientificPaper
    (contents, filename, text_conte) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Prompt Strategy: PromptStrategy.COT_QA

 2. ScientificPaper -> LLMFilter -> ScientificPaper
    (contents, filename, paper_auth) -> (contents, filename, paper_auth)
    Model: Model.GPT_4o
    Filter: The paper mentions phosphorylation of Exo1

 3. ScientificPaper -> MixtureOfAgentsConvert -> Reference
    (contents, filename, paper_auth) -> (reference_first_author, refere)
    Prompt Strategy: None
    Proposer Models: [GPT_4o]
    Temperatures: [0.8]
    Aggregator Model: Model.GPT_4o
    Proposer Prompt Strategy: chain-of-thought-mixture-of-agents-proposer
    Aggregator Prompt Strategy: chain-of-thought-mixture-of-agents-aggregation

* make equality check for new field names a bit more explicit

* fix fixture usage

* update all plans within code base to explicitly convert when needed; and removed unnecessary schemas for reading from datasource

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Refactor demos to use .sem_add_columns or .add_columns instead of convert(), remove Schema from demos when possible. (#104)

* Create chat.rst (#96)

* Create chat.rst

* Update pyproject.toml

Hotfix for chat

* Update conf.py

Hotfix for chat.rst

* code update for #84

This implementation basically resolves #84.

One implementation is different from the #84:
.add_columns(
  cols=[
    {"name": "sender", "type": "string", "udf": compute_sender},
    ...
  ]
)

If add_columns() uses cols, udf, types as params, it will make this function confusing again. Instead, if users need to specify different udfs for different columns, they should just call add_columns() multiple times for different columns.

* use field_values instead of field_types as field_values have the actual values,

use field_values instead of field_types as field_values have the actual values, since field_values have the actual key-value pairs, while field_types are just contain fields and their types.

records[0].schema is the schema of the output, which doesn't mean we already populate the schema into record.

* Remove .convert() and use .sem_add_columns or .add_columns instead

This change is based on #101 and #102, please review them first then this change.

1. This is to refactor all demos to use .sem_add_columns or .add_columns, and remove .convert().

2. Remove Schema from demos, except demos using ValidationDataSource and dataset.retrieve() that need schema now. We can refactor these cases later.

* ruff check --fix

* fix unittest

* demos fixed and unit tests running

* fix add_columns --> sem_add_columns in demo

* udpate quickstart to reflect code changes; shorten text as much as possible

* passing unit tests

* remove convert() everywhere

* fixes to correct errors in demos; update quickstart and docs

---------

Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* Simplify Datasource (#103)

## Summary of PR changes

**Note 1:** I did not change anything related to val_datasource (including tangential functions like Dataset._set_data_source()) as that will all be modified in a subsequent PR to reflect our discussion re: validation data.

**Note 2:** I have completely commented out datamanager.py and config.py; for now I am willing to leave the code around in case we desperately need it for PalimpChat. However, my hope is that PalimpChat can be tweaked to work without the data manager and those files can be deleted before merging dev into main

**Note 3:** Despite the branch name, fixing the progress managers will be part of a separate PR.

- Collapsed all four `DataSource` classes down to a single `DataReader` class
- Limit the number of methods the user needs to implement to just `__len__()` and `__getitem__()`
    - (Switched from using `get_item() --> __getitem__()` in `DataReader`)
- Provided `DataReader` directly to scan operators (also renamed `DataSourcePhysicalOp --> ScanPhysicalOp`
- Removed `DataDirectory()` from `src/` entirely; this included commenting out things which made use of the cache (e.g. caching computed `DataRecords` and codegen examples)
- Got rid of `dataset_id` everywhere (which tracks with the previous bullet)
- Removed the `Config` class which was a relic of a bygone era (and also intertwined with the `DataDirectory()`)
- Updated all demos to use `import palimpzest as pz` to make the import statement(s) more welcoming
- Fixed one bug resulting from converts now producing union schemas. Instead of including the `output_schema` in an operators' `get_id_params()` we simply report the `generated_fields`.
- Changed `source_id --> source_idx` everywhere (this eliminated some weird renaming logic)
- Finally, I added a large set of documentation for the DataSource class(es)

* Multi-LLM Refinement Pipeline for Query Output Validation (#118)

* Multi-LLM Refinement Pipeline for Query Output Validation  (#92)

## Summary of PR

This PR contains the work to add a new `CriticConvert` physical operator to PZ. At a high-level, this operator runs a bonded convert, and then asks a critic model if the answer produced by the bonded convert can be improved upon. The original output and the critique are then fed into a refinement model, which produces the improved output.

The work to implement this includes:
1. Defining the physical operator in `src/palimpzest/query/operators/critique_and_refine_convert.py`
2. Adding an implementation rule for this physical operator in `src/palimpzest/query/optimizer/rules.py`
3. Adding boolean flag(s) to enable allowing / disallowing this physical optimization
4. Adding base prompts for the critique and refinement generations

One other change which this work spawned was an attempt to improve the management and construction of our prompts -- and to decouple this logic from the `BaseGenerator` class. On the management side, I split our single `prompts.py` file into a set of files. On the construction side, I created a `PromptFactory` class which templates prompts based on the `prompt_strategy` and input record.

The `PromptFactory` is not a perfect solution, but I think it is a step in the right direction.

Finally, I fixed an error which previously filtered out `RAGConvert` operators from being considered by the `Optimizer`, and I made 2-3 more miscellaneous small tweaks.

---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* MkDocs Site for Palimpzest API Documentation  (#116)

## Summary of PR Changes
1. Changed `docs` to use [MkDocs](https://www.mkdocs.org/) instead of Sphinx
2. Created initial `Getting Started` content
3. Created placeholders for `User Guide` content (to follow in a subsequent PR)
4. Added autogenerated docs for our most user-facing code (we will need to add docstrings to our code in a subsequent PR)
5. Made small tweaks to `src/` to allow users to specify policy using kwargs in `.run()`
6. Renamed the `testdata/enron-tiny/` files so that they're not so damn weird
---------

Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>

* remove registration of sources from CI; only check version bump if there is a code change

* remove filter for only checking version bump when src files changed

* Rename `nocache` --> `cache` everywhere (#128)

* first commit

* Removed myenv

* added to git ignore

* addressed the comments in review

* flip one minor comment

* minor spacing fix

* fix spaces in a few more spots

---------

Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* adding citation (and making 'others' explicit) (#136)

* Make Generator thread-safe (#139)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* Begin Process of Improving Index Abstraction(s) in PZ (#138)

* quick and dirty implementation which tracks retrieve costs

* bug fixes and currently unused index code

* add default search func which I forgot to implement and add chromadb to pyproject.toml

* leaving TODO

* hotfix to add cost for retrieve operation

* another hotfix to add ragatouille dependency

* Add logger for PZ (#134)

* add logger for PZ

1. When verbose=True, we save all logs to log_file and print them on  console;
2. when verbose=False, we only save ERROR+ log to file and print ERROR+.

I just add logging to somewhere I think might be important for the execution, we always can add/remove for more or less.

Also I might update the logging message based on my later annotation work. But this PR should setup the logging mechanism for now.

* ruff fix

* update code based on comments

1. not logging output_records
2. not logging plan_stats
3. make the files to ".pz_logs"

---------

Co-authored-by: Matthew Russo <mdrusso@mit.edu>

* fix merge bug (#141)

* ruff fix

* update log dir and fix tiny bug

* fix merge bug

* Use a singleton API client for operators (#140)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* also create parent dir. if missing

* CUAD benchmark (#143)

* fix moa prompt

* fix moa prompt aggregator

* update version

* make generator thread-safe

* update generator to return messages

* address comments

* create a singleton API client

* fix linting

* fix logging in generators

* fix CUAD benchmarlk

* fix type

* minor fixes

* Limit the Scope of Logging within the Optimizer (#144)

* making it possible to set log level based on env. variable; adding time limit on seven filters test

* deleting instead of commenting out

* Remove Conventional LLM Convert; Update Bonded LLM Convert retry logic (#145)

* use NullHandler in __init__ and let application control logging config (#146)

* use NullHandler in __init__ and let application control logging config

* ruff fix

* Fix Progress Manager and Simplify `execute_plan` methods (#148)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* address comments

* The great deletion (#149)

* Adding Preliminary Work on Abacus and MAB Sentinel Execution (#147)

* updating models to avoid llama3

* fix parsing bugs and some generation errors

* don't require json for proposer and code synth generations; fix prompt format instruction for proposers

* fix typo/bug

* fix bugs in generator prep for field_answers; fix bug in filter impl.; other improvements

* adding new file for abacus workload

* fix len

* fix errors with dataset copy; prompt construction; and more

* remove JSON instruction from MOA proposer

* fixed bugs in optimizer configuration, llama 3.3 generation, and filter generation

* clean up demos; fix missing base prompt from map

* add one more missing base prompt

* prepare demo for full run; get embedding cost info from RAGConvert; use reasoning output from Critique

* add script to generate text-embedding-3-small reaction embeddings

* write to .chroma

* run full scale generation

* compute embeddings slowly and add progress bar

* add sleep

* fix import

* add total iters

* create embeddings before ingesting

* fix index start and finish

* load embeddings and insert directly

* make chroma use cosine sim.; finish initial search fcn. for biodex workload; naming tweak in rag convert

* capturing gen stats in Retrieve

* added UDF map operator; rewrote biodex pipeline to match docetl impl.; switched to using __name__ for functions instead of str()

* add optimizations back in

* write data to csv in demo

* limit to same model choice(s) as docetl and lotus

* fix punctuation error(s)

* try run without filter

* remove unused demo file

* remove print

* remove prints

* remove costed_phys_op_ids which were used for debugging

* try slightly diff. approach

* remove temp changes while branch is in PR review

* remove depends_on for map

* fix iteration bug in sentinel processors

* one more hotfix

* fix more errors w/SentinelPlanStats and sentinel processors

* remove logger lib to reduce confusion (#159)

* Update research.md (#160)

AISD @ NAACL 2025

* Add Pneuma-Palimpzest Integration Demo (#158)

* Add Pneuma demo

* Remove dataset semantic column addition

* Fix progress managers episode 2 attack of the clones (#156)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* catch errors in generating embeddings

* fix comments

* Merging in Changes for Sentinel Progress Bars; Split Convert (off by default); `demos/enron-demo.py`; and MMQA Benchmark (#163)

* modifying ProgressManager class to allow for dynamically adding tasks

* beginning to use new progress manager

* initial rewrite of execute_plan methods with new progress manager

* unit tests passing

* trim a few lines

* unit tests passing; changes applied everywhere; MAB and Random coming in a separate PR

* enable final operator to show progress in parallel

* initial work to refactor sentinel processors

* passing unit tests

* checking in minor changes

* remove use of setup_logger inside library

* stuff seems to be working

* big print

* turn off rag for test

* try debugging exception

* checking in code before changes to scoring

* finished initial refactoring of mab sentinel execution strategy

* get random sampling execution working with changes

* passing unit tests

* nosentinel progress looks good

* eyeball test is working for progress bars

* remove the old gods

* revert small change

* pull up progress manager logic in parallel execution

* adding prints to generator; turn progress off in favor of verbose for now

* catch errors in generating embeddings

* inspect frontier updates

* remove args.workload

* fix num_inputs in selectivity computation

* pdb in score

* fixed score fn issue

* use execution cache to avoid unnecessary computation; use sentinel stats for updating frontier

* fix progress counter

* debug

* fix empty stats

* only count stats from newly computed results

* fix tuple unpacking

* only update sample counts for llm ops

* de-dup duplicate record

* ugh

* dont forget to increment

* plz

* more plz

* increment

* recycle ops back onto reservoir so they may be reconsidered in the future

* remove pdb

* add progress to script args

* try without rag

* use term recall

* just check in on term recall

* make it easier to turn off progress

* remove pdb

* try to get re-rank to keep all inputs

* try to generate more reactions

* track total LLM calls

* 10x parallelism

* try retrieve directly on fulltext

* up max workers

* adding enron-demo w/optimization

* remove config option

* adding recall and precision to output

* allow operators to be recycled back onto frontier

* revert to using reactions instead of fulltext for similarity

* better cycling of off-frontier operators

* safety check on reservoir ops

* remove pdb

* fixing 5 results per query

* investigate sampling behavior

* check on seeds

* remove pdb

* test SplitConvert

* debug chunking

* fix bug in rag and split convert

* run with chunks

* test chunking logic

* fix chunking logic

* sum list

* remove split merge for now

* minor fixes to CUAD script

* add embedding scripts for mmqa tables and image titles

* address issue with empty titles and title collisions

* prepare script for using clip embeddings for images

* fix bug

* get full space of possible extensions

* debug

* weird bug fix?

* more debug

* fix idiotic mistake

* handle corrupted images and minor things

* add another corrupted image

* another one

* anotha

* more bad images

* last disallow file

* prepa
416D
re cuad for runs

* specify execution strategy

* up samples

* add sentinel execution strategy to output name

* adding plan str and more stats

* specify no prior

* verbose=False

* fix comment; comment out prints

* make split merge optional for now

* addressing comments

* applying syntax changes to pneuma demo and supporting strings within retrieve

* bump version; fix lint; fix docs

* more docs tweaks; tweaking dependencies

* fix install issues

* one more version fix

* one more version fix

* one more version fix

* one more version fix

* last try

* change runner python version

* actually changing runner python version

* increase time limit for runners

* increase time limit for runners

* simplify pyproject dependencies

* remove unused import

---------

Co-authored-by: Jun <130543538+chjuncn@users.noreply.github.com>
Co-authored-by: Gerardo Vitagliano <vitaglianog@gmail.com>
Co-authored-by: Sivaprasad Sudhir <sivaprasad2626@gmail.com>
Co-authored-by: Yash Agarwal <yash94404@gmail.com>
Co-authored-by: Yash Agarwal <yashaga@Yashs-Air.attlocal.net>
Co-authored-by: Bari Bo LeBari <143016395+lilbarbar@users.noreply.github.com>
Co-authored-by: Bari LeBari <barilebari@dhcp-10-29-207-160.dyn.MIT.EDU>
Co-authored-by: muhamed <muhamed@mit.edu>
Co-authored-by: Tranway1 <tranway@qq.com>
Co-authored-by: Luthfi Balaka <luthfibalaka@gmail.com>

Apr 25, 2025
eff2f8f
zip
tar.gz
Downloads

PreviousNext

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

0.7.10

0.7.9

0.7.8

abacus-paper-experiments

0.7.7

0.7.6

0.7.5

0.7.4

0.7.3

0.7.2

Tags: mitdbg/palimpzest