There are two ways to install the package: install it from PyPI using pip, or clone the repository and install it locally.
pip install StataHelper
pip install git+
git clone
cd StataHelper
pip install .
- Python 3.4+
- Stata 16+ (Pystata ships with Stata licenses starting with Stata 16)
- Pandas
- Numpy
Stata is a powerful package that boasts an impressive array of statistical tools, data manipulation capabilities, and a user-friendly interface. Stata 16 extended its capabilities by introducing a Python interface, Pystata. Intended especially for those with minimal Python experience, StataHelper is a Python wrapper around Pystata that does the following:
- Simplifies the interface to interact with Stata through Python
- Provides a simple interface to parallelize Stata code
- Reads and writes data that cannot be imported directly to Stata, like Apache Parquet files
Note that parallelization here is not the same as the multithreading used in Stata's off-the-shelf parallelization, such as Stata MP. There, the calculations of a single process (a regression, summary statistics, etc.) are spread across multiple cores. In contrast, StataHelper runs multiple processes across multiple cores simultaneously, while each process can still take advantage of Stata's multithreading capabilities.
Suppose you have a set of regressions you want to run in which you change the dependent variable, independent variables, or control variables. In Stata this would require several nested foreach loops over the variables you want to vary.
local ys depvar1 depvar2 depvar3
local xs indepvar1 indepvar2 indepvar3
local controls controlvar1 controlvar2 controlvar3

foreach y of local ys {
    foreach x of local xs {
        foreach control of local controls {
            regress `y' `x' `control'
            eststo model_`y'_`x'_`control'
        }
    }
}
Regression groups like this are commonly used to identify the best model specification, especially to see how well a result holds across subsamples, fixed-effect levels, or time periods. Stata is a powerful tool for this type of analysis, but it is designed to run only one regression at a time.
For the sake of argument, let's say that Stata takes X seconds to run a single regression for any combination of parameters. If we have 3 dependent variables, 3 independent variables, and 3 control variables, we would need to run 27 regressions, which would take 27X seconds.
Now suppose we also want to see whether the result holds for two segments of the population (e.g. heterogeneous effects). We then have 3 dependent variables, 3 independent variables, 3 control variables, and 2 segments = 54 regressions, and an additional foreach loop, so the runtime doubles to 54X seconds. As the number of varying dimensions increases, the runtime grows quickly: each nested loop over n values contributes a factor of n, so four nested loops yield O(n^4) regressions.
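As a quick sanity check on these counts, the Cartesian product of the parameter lists can be enumerated in a few lines of plain Python (a standalone illustration, not part of StataHelper's API):

```python
from itertools import product

ys = ['depvar1', 'depvar2', 'depvar3']
xs = ['indepvar1', 'indepvar2', 'indepvar3']
controls = ['controlvar1', 'controlvar2', 'controlvar3']
segments = ['segment1', 'segment2']

# Every regression is one element of the Cartesian product of the lists.
print(len(list(product(ys, xs, controls))))            # 27
print(len(list(product(ys, xs, controls, segments))))  # 54
```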
This inefficiency is where StataHelper comes in.
from StataHelper import StataHelper
path = "C:/Program Files/Stata17/utilties"
s = StataHelper(stata_path=path, splash=False)
results = s.parallel("reg {y} {x} {control}", {'y': ['depvar1', 'depvar2', 'depvar3'],
'x': ['indepvar1', 'indepvar2', 'indepvar3'],
'control': ['controlvar1', 'controlvar2', 'controlvar3']})
The idea of parallelization is that we divide the regressions into smaller queues and run them simultaneously across multiple cores, which reduces the total runtime. If you take those 27 regressions and divide them evenly across 3 cores, you cut the runtime roughly to a third (from 27X to about 9X seconds, ignoring overhead).
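Under the hood this is ordinary process-level parallelism. Below is a minimal sketch of the idea using Python's multiprocessing module; it is illustrative only, and `run_in_stata` is a hypothetical stand-in for the worker that StataHelper itself manages (each worker initializes its own Stata instance and executes one command from the queue):

```python
from itertools import product
from multiprocessing import Pool

def run_in_stata(cmd):
    # Hypothetical stand-in: StataHelper's workers each run one queued command
    # in their own Stata instance.
    print(f"running: {cmd}")

ys = ['depvar1', 'depvar2', 'depvar3']
xs = ['indepvar1', 'indepvar2', 'indepvar3']
controls = ['controlvar1', 'controlvar2', 'controlvar3']

# Build the queue of 27 commands (the Cartesian product of the parameters).
queue = [f"reg {y} {x} {c}" for y, x, c in product(ys, xs, controls)]

if __name__ == "__main__":
    with Pool(processes=3) as pool:   # 3 workers -> roughly a 3x speedup
        pool.map(run_in_stata, queue)
```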
Additionally, StataHelper provides users a simplified interface to interact with pystata, can read and write data that cannot be imported directly into Stata, like Apache Parquet files, and can run Stata code from a string or a file.
You can interact with StataHelper in nearly the same way you would interact with pystata. In pystata you would configure the pystata instance as follows (assuming you have not added Stata to your PYTHONPATH):
import sys
stata_path = "C:/Program Files/Stata17/utilties"
sys.path.append(stata_path)
from pystata import config
config.init(edition='mp', splash=False)
config.set_graph_format('svg')
config.set_graph_size(800, 600)
config.set_graph_show(False)
config.set_command_show(False)
config.set_autocompletion(False)
config.set_streaming_output(False)
config.set_output_file('output.log')
from pystata import stata
stata.run('di "hello world"')
stata.run("use data.dta")
stata.run("reg y x")
config.close_output_file() # Close the Stata log
Notice how we have to configure the Stata instance before we can even import the stata module, and that this configuration requires a separate config object.
In StataHelper, you can configure the Stata instance directly in the constructor.
from StataHelper import StataHelper
s = StataHelper(splash=False,
edition='mp',
set_graph_format='svg',
set_graph_size=(800, 600),
set_graph_show=False,
set_command_show=False,
set_autocompletion=False,
set_streaming_output=False,
set_output_file='output.log')
s.run("di hello world")
s.run("use data.dta")
s.run("reg y x")
s.close_output_file()
StataHelper provides a simple interface to parallelize Stata code. Just as with pystata's `run` method, you may pass a string of Stata code to the `parallel` method. StataHelper is designed to read placeholders in the Stata code for the values you wish to iterate over. There are two ways to do this:
The previous snippet exemplifies brace notation, which is intended to be intuitive. All that is needed is the command and a dictionary whose keys are the placeholders. The values can be any iterable object.
parameters = {'control': ['controlvar1', 'controlvar2', 'controlvar3'],
'x':['indepvar1', 'indepvar2', 'indepvar3'],
'y': ['depvar1', 'depvar2', 'depvar3']}
Dictionaries are inherently order-agnostic, so the order of the keys does not matter as long as all keys are in the command and all placeholders in the command are keys in the dictionary. The order of the keys will only affect the unique identifier of the results in the output directory (see below).
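For example, the two dictionaries below should expand to the same set of commands; only the identifier suffixes of the saved results may differ. This is a small sketch using the `schedule` method documented later, which builds the queue without executing it (the exact return format may differ):

```python
from StataHelper import StataHelper

s = StataHelper(splash=False)
cmd = "reg {y} {x} {control}"

p1 = {'y': ['depvar1'], 'x': ['indepvar1'], 'control': ['controlvar1', 'controlvar2']}
p2 = {'control': ['controlvar1', 'controlvar2'], 'x': ['indepvar1'], 'y': ['depvar1']}

# Both orderings expand to the same Cartesian product of commands.
print(s.schedule(cmd, p1))
print(s.schedule(cmd, p2))
```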
Let's say you want to run a series of regressions but vary the level of fixed effects. You would approach this by simply introducing a sublist into the fixed-effects list. In the following example, we'll use the `reghdfe` command to run a series of regressions with varying high-dimensional fixed effects.
from StataHelper import StataHelper
values = {'y': ['depvar1', 'depvar2', 'depvar3'],
'x': ['indepvar1', 'indepvar2', 'indepvar3'],
'fixed_effects': [['fe1', 'fe2'], ['fe1'], ['fe1', 'fe5']]}
s = StataHelper(splash=False)
s.run("ssc install reghdfe")
results = s.parallel("reghdfe {y} {x} absorb({fixed_effects})", values)
You can pass multiline Stata code to the `parallel` method just as you would with `pystata.stata.run`.
import StataHelper
stata = StataHelper.StataHelper(splash=False)
values = {'y': ['depvar1', 'depvar2', 'depvar3'],
'x': ['indepvar1', 'indepvar2', 'indepvar3'],
'control': ['controlvar1', 'controlvar2', 'controlvar3']}
results = stata.parallel("""
reg {y} {x} {control}
predict yhat
gen residuals = {y} - yhat
""", values)
You can also pass conditional statements to the `parallel` method to analyze a subset of the data.
import StataHelper
stata = StataHelper.StataHelper(splash=False)
values = {'y': ['depvar1', 'depvar2', 'depvar3'],
'x': ['indepvar1', 'indepvar2', 'indepvar3'],
'control': ['controlvar1', 'controlvar2', 'controlvar3'],
'subsets': ['var4<=2023 & var5==1', 'var4>2023 | var5==0']}
results = stata.parallel("reg {y} {x} {control} if {subsets}", values)
Estimation commands can be saved to a file by specifying the `est save` command in `cmd`. The `parallel` method will save the results to a folder titled `name` inside `set_output_dir` if `name` is not None and an asterisk `*` is present in `cmd`. All files in this directory are also called `name`, but with a unique identifier appended to the end. For example,
from StataHelper import StataHelper
s = StataHelper(edition='mp', splash=False, set_output_dir='C:/Users/me/Documents/StataOutput')
...
s.parallel("eststo: reg y {x} if {subset}\nest save *", values, name='regressions')
produces the following files in `C:/Users/me/Documents/StataOutput/regressions`:
regressions_1.ster
regressions_2.ster
regressions_3.ster
regressions_4.ster
regressions_5.ster
regressions_6.ster
You can easily load these files back into Stata by looping over the files in the directory and loading each one with the `estimates use` command. You can then use the `esttab` command to create a table of the results just as if you had run the regressions in a loop in Stata, as sketched below.
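A minimal sketch of that workflow, assuming the .ster files were saved to the regressions directory shown above and that the estout package (which provides esttab) is installed; file paths and stored-estimate names are illustrative:

```python
import glob
from StataHelper import StataHelper

s = StataHelper(splash=False)
s.run("use data.dta")

# Load each saved estimation result from disk and store it in memory.
ster_files = sorted(glob.glob("C:/Users/me/Documents/StataOutput/regressions/*.ster"))
for i, path in enumerate(ster_files):
    s.run(f'estimates use "{path}"')
    s.run(f"estimates store model_{i}")

# Combine the restored estimates into a single table.
s.run("esttab model_* using regressions.csv, replace")
```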
In general, this method can be used to run many types of Stata commands in parallel, not just regressions. You might, for example, want to run a series of `tabstat` commands to summarize the data and save the results to a file.
from StataHelper import StataHelper
s = StataHelper(edition='mp', splash=False, set_output_dir='C:/Users/me/Documents/StataOutput')
values = {'var': ['var1', 'var2', 'var3', 'var4', 'var5']}
s.parallel("tabstat {var}, stat(mean sd) save *.xlsx", values, name='table1')
produces the following files in `C:/Users/me/Documents/StataOutput/table1`:
table1_1.xlsx
table1_2.xlsx
table1_3.xlsx
table1_4.xlsx
table1_5.xlsx
Note: Wrappers and arguments for StataNow functionalities have not been tested. They are included for completeness. See Pystata documentation for more information. See below for information about contributing to the project.
StataHelper(self, edition=None, splash=None, set_output_dir=None, set_graph_format=None, set_graph_size=None, set_graph_show=None, set_command_show=None, set_autocompletion=None, set_streaming_output=None, set_output_file=None)
- edition (str) : The edition of Stata to use.
- splash (bool) : Whether to show the splash screen when Stata is opened. It is recommended not to show it when running in parallel, as it will be repeated for every core that is opened.
- set_output_dir (str) : The directory to save the output files such as estimation files. A new folder housing these files is created in this directory.
- set_graph_format (str) : pystata.config.set_graph_format. The format of the graphs to be saved.
- set_graph_size (tup) : pystata.config.set_graph_size. The size of the graphs to be saved.
- set_graph_show (bool) : pystata.config.set_graph_show. Whether to show the graphs in the Stata window.
- set_command_show (bool) : pystata.config.set_command_show. Whether to show the commands in the Stata window.
- set_autocompletion (bool) : pystata.config.set_autocompletion. Whether to use autocompletion in the Stata window.
- set_streaming_output (bool) : pystata.config.set_streaming_output. Whether to stream the output to the console.
- set_output_file (str) : pystata.config.set_output_file. Where to save the Stata log file.
All values not specified as an argument default to the pystata defaults. See the pystata documentation.
- Wrapper for `pystata.stata.is_initialized()`. Returns True if Stata is initialized, False otherwise.
- Wrapper for `pystata.stata.status()`. Prints the status of the Stata instance to the console. Returns None.
- Wrapper for `pystata.config.close_output_file()`. Closes the Stata log file.
- Wrapper for `pystata.stata.get_return()`. Returns the r() values from the last Stata command as a dictionary.
- Wrapper for `pystata.stata.get_ereturn()`. Returns the e() values from the last Stata command as a dictionary.
- Wrapper for `pystata.stata.get_sreturn()`. Returns the s() values from the last Stata command as a dictionary.
- Wrapper for `pystata.stata.run()`. Runs cmd in the Stata window (see the example below).
- cmd str : Stata command to run.
- **kwargs dict : Additional arguments to pass to the Stata command.
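For example, a brief sketch of running a regression and pulling the e() results afterwards (the exact keys in the returned dictionary follow pystata's conventions):

```python
from StataHelper import StataHelper

s = StataHelper(splash=False)
s.run("sysuse auto, clear")
s.run("reg price mpg weight")

# Dictionary of e() results from the last estimation command
# (e.g. number of observations, R-squared).
ereturns = s.get_ereturn()
print(ereturns)
```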
- Pythonic method to load a dataset into Stata. Equivalent to the `use` command in Stata (see the sketch below).
- data str : The path to the data file to load into Stata.
- columns list or str : The columns to load into Stata. If None, all columns are loaded.
- obs int or str : The number of observations to load into Stata. If None, all observations are loaded.
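A short usage sketch, assuming a file data.dta that contains the listed columns (the file name and column names are illustrative):

```python
from StataHelper import StataHelper

s = StataHelper(splash=False)

# Load only two columns and the first 1,000 observations.
s.use("data.dta", columns=["depvar1", "indepvar1"], obs=1000)
s.run("summarize")
```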
- Read any pandas-supported file into Stata. Equivalent to `import delimited` in Stata for delimited files.
- This method allows some files that cannot be imported directly into Stata to be loaded (see the sketch below).
- path str : The path to the file to load into Stata.
- frame str : The name of the frame to load the data into. If None, the file name is used.
- force bool : Whether to overwrite the existing frame. If False, the frame is appended.
- Raises a `ValueError` if the extension is not in the list of supported file types.
- Valid file types include: CSV, Excel, Parquet, Stata, Feather, SAS, SPSS, SQL, HTML, JSON, pickle/compressed files, XML, clipboard.
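For instance, a sketch of loading a Parquet file, a format Stata cannot import directly (the file name and frame name are illustrative):

```python
from StataHelper import StataHelper

s = StataHelper(splash=False)

# The Parquet file is read with pandas and pushed into the named Stata frame.
s.use_file("panel_data.parquet", frame="panel", force=True)
s.run("frame panel: describe")
```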
StataHelper.use_as_pandas(self, frame=None, var=None, obs=None, selectvar=None, valuelabels=None, missinglabels=_DefaultMissing(), *args, **kwargs)
- Read a Stata frame into a pandas DataFrame. Equivalent to `export delimited` in Stata for delimited files.
- frame str : The name of the frame to read into a pandas DataFrame. If None, the active frame is used.
- var list or str : The variables to read into the DataFrame. If None, all variables are read.
- obs int or str : The number of observations to read into the DataFrame. If None, all observations are read.
- selectvar str : The variable to use as the index. If None, the index is not set.
- valuelabels bool : Whether to use value labels. If True, the value labels are used. If False, the raw values are used.
- missinglabels str : The missing value labels to use. If None, the default missing value labels are used.
This method allows some formats that Stata cannot export directly to be read into a pandas DataFrame. In the case of .dta files, this method is significantly faster than the `pandas.read_stata` method, as the dataset is first loaded into Stata and then read into a pandas DataFrame (see the sketch below).
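A brief sketch of pulling the active frame back into pandas after some in-Stata processing (variable names are illustrative):

```python
from StataHelper import StataHelper

s = StataHelper(splash=False)
s.run("use data.dta")
s.run("gen log_depvar1 = log(depvar1)")

# Read selected variables from the active frame into a pandas DataFrame.
df = s.use_as_pandas(var=["depvar1", "log_depvar1"], valuelabels=True)
print(df.head())
```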
StataHelper.save(path, frame=None, var=None, obs=None, selectvar=None, valuelabel=None, missinglabel=None, missval=_DefaultMissing(), *args, **kwargs)
- Save a Stata dataset to a file. Passes the frame to pandas and saves the file using the corresponding pandas method. Valid file types are the same as for `use_file` (see the sketch below).
- path str : The path to save the file to.
- frame str : The name of the frame to save. If None, the active frame is used.
- var list or str : The variables to save. If None, all variables are saved.
- obs int or str : The number of observations to save. If None, all observations are saved.
- selectvar str : The variable to use as the index. If None, the index is not set.
- valuelabels bool : Whether to use value labels. If True, the value labels are used. If False, the raw values are used.
- missinglabels str : The missing value labels to use. If None, the default missing value labels are used.
- missval str : The missing value labels to use. If None, the default missing value labels are used.
- Raises a `ValueError` if the extension is not in the list of supported file types.
- Valid file types include: CSV, Excel, Parquet, Stata, Feather, SAS, SPSS, SQL, HTML, JSON, pickle/compressed files, XML, clipboard.
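For example, a sketch of exporting a filtered frame to Parquet (the path is illustrative; the file extension determines which pandas writer is used):

```python
from StataHelper import StataHelper

s = StataHelper(splash=False)
s.run("use data.dta")
s.run("keep if var5 == 1")

# Save the filtered active frame to a Parquet file via pandas.
s.save("C:/Users/me/Documents/StataOutput/subset.parquet")
```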
- Returns the queue of commands to be run in parallel (the Cartesian product of the parameter values). Analogous to the `parallel` method, but does not execute the commands.
- cmd str : The Stata command to run in parallel.
- pmap dict : The parameters to iterate over in the Stata command. Values can be any iterable object of any dimension, but note that the deeper the nesting, the more (potentially redundant) combinations are created.
- All keys in pmap must appear in cmd, and all placeholders in cmd must be keys in pmap.
For example,
from StataHelper import StataHelper
s = StataHelper(splash=False)
values = {'x': [['indepvar1', 'indepvar2'], 'indepvar1', 'indepvar2', 'indepvar3']}
s.schedule("reg y {x}", values)
would place the following regressions in the queue:
reg y indepvar1 indepvar2
reg y indepvar1
reg y indepvar2
reg y indepvar3
Values can also be conditional statements.
from StataHelper import StataHelper
s = StataHelper(splash=False)
values = {'x': ['indepvar1', 'indepvar2', 'indepvar3'],
'subset': ['var1==1', 'var2==2', 'var3==3']}
s.schedule("reg y {x} if {subset}", values)
returns the following regressions in the queue:
reg y indepvar1 if var1==1
reg y indepvar1 if var2==2
reg y indepvar1 if var3==3
reg y indepvar2 if var1==1
reg y indepvar2 if var2==2
reg y indepvar2 if var3==3
reg y indepvar3 if var1==1
reg y indepvar3 if var2==2
reg y indepvar3 if var3==3
Logical operators can be specified in the conditional statement.
from StataHelper import StataHelper
s = StataHelper(splash=False)
values = {'x': ['indepvar1', 'indepvar2'],
'subset': ['var1==1 & var2==2', 'var2==2 | var3==3', 'var3==3']}
s.schedule("reg y {x} if {subset}", values)
returns:
reg y indepvar1 if var1==1 & var2==2
reg y indepvar1 if var2==2 | var3==3
reg y indepvar1 if var3==3
reg y indepvar2 if var1==1 & var2==2
reg y indepvar2 if var2==2 | var3==3
reg y indepvar2 if var3==3
- Runs a series of Stata commands in parallel. Analogous to the `schedule` method, but executes the commands (see the sketch below).
- cmd str : The Stata code to run in parallel, including placeholders for the values to iterate over. Placeholders use brace notation `{}`; `pmap` must be a dictionary with keys that match the placeholders in the Stata code.
- pmap list, dict, tuple : The values to iterate over in the Stata code. If a list or tuple, the values are matched to the placeholders in order. If a dict, the order only matters if you use wildcards, in which case the keys are ignored. Items in sublists are joined with a whitespace " " and allow multiple values for a single placeholder.
- name str : The name of the output directory, replacing `*` in `cmd`. If None, a unique identifier is created based on the order of the process in the queue.
- max_cores int : The maximum number of cores to use. If None, then `min(os.cpu_count() - safety_buffer, len(pmap))` is used. If max_cores is greater than the number of combinations and the number of combinations is greater than the number of cores, then `os.cpu_count() - safety_buffer` cores are used.
- safety_buffer int : The number of cores to leave open for other processes.
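Putting the arguments together, a brief sketch that caps the worker count and leaves headroom for other processes (paths and values are illustrative):

```python
from StataHelper import StataHelper

s = StataHelper(edition='mp', splash=False,
                set_output_dir='C:/Users/me/Documents/StataOutput')

values = {'y': ['depvar1', 'depvar2', 'depvar3'],
          'x': ['indepvar1', 'indepvar2', 'indepvar3']}

# Use at most 4 workers and keep 2 cores free for other processes.
results = s.parallel("eststo: reg {y} {x}\nest save *", values,
                     name='spec_search', max_cores=4, safety_buffer=2)
```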
Contributions are welcome! If you would like to contribute to the project, please fork the repository and submit a pull request. Specifically, we are looking for contributions in the following areas:
- Testing current functionalities on multiple platforms
- Testing and following StataNow functionalities
- Within-Stata multiprocessing (to migrate away from the `multiprocessing` module)
- Applications of NLP or LLMs in troubleshooting Stata errors and summarizing help files
Collin Zoeller, Tepper School of Business, Carnegie Mellon University
zoellercollin@gmail.com
Github: ColZoel
Website: colzoel.github.io
The author, Collin Zoeller, and StataHelper are not affiliated with StataCorp. Stata is a registered trademark of StataCorp LLC. While StataHelper is open source, Stata and its Python API Pystata are proprietary software and require a license. See stata.com for more information.