Introduction¶
This is a library for building Bayesian embedded autoregressive (BEAR) models, proposed in Amin et al. (link TODO). It provides (1) a script for extracting summary statistics from large sequence datasets (which relies on KMC), (2) a python package with tools for building BEAR models in general (which relies on TensorFlow), and (3) scripts for training and evaluating specific example BEAR models. To use the package, follow the installation instructions and clone the package repository. To get started with an example dataset, jump to the tutorial.
Preprocessing sequence data¶
summarize.py¶
Extract summary statistics (kmer count transitions) from large nucleotide datasets in order to train BEAR models. Usage:
python summarize.py file out_prefix [-l L] [-nf NF] [-r R] [-pr PR] [-mk MK] [-mf MF] [-p P] [-t T]
Input¶
- filestr
The input csv file with rows in the format FILE, GROUP, TYPE where FILE is a path, GROUP is an integer, and TYPE is either fa (denoting fasta) or fq (denoting fastq). Files with the same group will have their counts merged. All files must contain DNA sequences (A, C, G, and T alone).
- out_prefixstr
The file name prefix (including path) for the output files.
- -lint, default = 10
The maximum lag (the truncation level).
- -nfbool
Do not count kmers in the forward direction.
- -rbool
Also run KMC including the reverse compliment of sequences when counting.
- -prbool
Do all lags for pre and full KMCs.
- -mkfloat, default = 12
Maximum amount of memory available, in gigabytes (corresponding to the KMC -m flag).
- -mffloat, default = 0.1
Maximum size of output files, in gigabytes.
- -pstr, default = ‘’
Path to KMC binaries. If these binaries are in your PATH, there is no need to use this option.
- -tstr, default = ‘tmp/’
Folder in which to store KMC’s intermediate results. A valid path MUST be provided.
You can run python summarize.py -h
for help and more advanced options.
Output:
A collection of files:
out_prefix_lag_1_0.tsv, ..., out_prefix_lag_1_N.tsv, ...,
out_prefix_lag_L_0.tsv, ..., out_prefix_lag_L_N.tsv
where L is the maximum lag and there are N total files for each lag. Each file is a tsv with rows of the format:
kmer [[transition counts in group 0 files],[transition counts in group 1 files],...]
The symbol [ in the kmer is the start symbol. Each counts vector is in the order A, C, G, T, $ where $ is the stop symbol.
Caution
KMC discards kmers with more than 4 billion counts, which may lead to errors on ultra large scale datasets.
Caution
The script is not intended for use on sequence data with N symbols; it will run but will not handle such missing data carefully.
Caution
The output is not lexicographically sorted, nor uniformly randomized.
Building BEAR models¶
bear_net¶
The bear_net
submodule contains functions to train, evaluate, and deploy
BEAR (and AR) models.
Two example autoregressive functions are implemented in the submodule ar_funcs
:
a linear function and a simple convolutional neural network (CNN),
bear_model.ar_funcs.make_ar_func_linear()
and
bear_model.ar_funcs.make_ar_func_cnn()
respectively.
The standard workflow for training a BEAR model is to first load a
file of preprocessed kmer transition counts using the dataloader
submodule:
- bear_model.dataloader.dataloader(file, alphabet, batch_size, num_ds, cache=True, header=False, n_par=1, dtype=tf.float64)[source]¶
Load counts data into tensorflow data object.
- Parameters
- filestr
Location of counts data, which should be a tsv file with rows in the format: kmer_sequence counts_matrix, delimited by a tab.
- alphabetstr
One of ‘dna’, ‘rna’, ‘prot’.
- batch_sizeint
By minibatching early, the counts matrices for multiple kmers may be decoded at once.
- num_dsint
Number of columns in the count data. Ex: 3 for train, test and reference.
- cachebool, default = True
Whether or not to cache the loaded data. Increases loading speed at the cost of memory.
- headerbool, default = False
Whether or not there is a header in the counts data.
- n_parint, default = 1
Number of parallel calls to turn counts matrix strings to tensors.
- dtypedtype, default = tf.float64
- Returns
- datatensorflow data object
One element of the data is a list of batch_size kmers and a counts tensor of shape [batch_size, num_ds, alphabet_size+1].
If the counts are in a sparse format one may use
- bear_model.dataloader.sparse_dataloader(file, alphabet, batch_size, num_ds, cache=False, header=True, n_par=1, dtype=tf.float64)[source]¶
Loads counts that are in sparse format into tensorflow data object.
- Parameters
- filestr
Location of counts data, which should be a tsv file with rows in the format: kmer_sequence counts_matrix, delimited by a tab.
- alphabetstr
One of ‘dna’, ‘rna’, ‘prot’.
- batch_sizeint
By minibatching early, the counts matrices for multiple kmers may be decoded at once.
- num_dsint
Number of columns in the count data. Ex: 3 for train, test and reference.
- cachebool, default = True
Whether or not to cache the loaded data. Increases loading speed at the cost of memory.
- headerbool, default = False
Whether or not there is a header in the counts data.
- n_parint, default = 1
Number of parallel calls to turn counts matrix strings to tensors.
- dtypedtype, default = tf.float64
- Returns
- datatensorflow data object
One element of the data is a list of batch_size kmers and a counts tensor of shape [batch_size, num_ds, alphabet_size+1].
The loaded dataset can then be used to train an AR or BEAR model with the
bear_net
submodule:
- bear_model.bear_net.train(data, num_kmers, epochs, ds_loc, alphabet, lag, make_ar_func, af_kwargs, learning_rate, optimizer_name, train_ar, acc_steps=1, params_restart=None, writer=None, loss_save=None, dtype=tf.float64)[source]¶
Train a BEAR or AR model using all available GPUs in parallel.
- Parameters
- datatensorflow data object
Load sequence data using tools in dataloader.py. Minibatch before passing.
- num_kmersint
Total number of kmers seen in data. Used to normalize estimate of loss.
- epochsint
- ds_locint
Column in count data that corresponds with the training data.
- alphabetstr
One of ‘dna’, ‘rna’, ‘prot’.
- lagint
- make_ar_funcfunction
Takes lag, alphabet_size, af_kwargs, dtype and returns an ar_func. See ar_funcs submodule.
- af_kwargsdict
Keyword arguments for particular ar_func. For example, filter_width.
- learning_ratefloat
- optimizer_namestr
For example ‘Adam’.
- train_arbool
Whether to train an AR (True) or BEAR (False) model.
- writertensorboard writer object, default = None
- loss_savelist, default = None
Pass a list to have losses at each step appended to it.
- acc_stepsint, default = 1
Number of steps to accumulate gradients over.
- params_restartlist of tensorflow variables, default = None
Pass the parameter list from a previous run to restart training.
- dtypedtype, default = tf.float64
- Returns
- params: list
List of parameters as tensorflow variables.
- h_signeddtype
\(\log(h)\) where \(h\) is the BEAR concentration parameter.
- ar_funcfunction
The autoregressive function.
One then may evaluate the performance of the trained model:
- bear_model.bear_net.evaluation(data, ds_loc_train, ds_loc_test, alphabet, h, ar_func, van_reg, dtype=tf.float64)[source]¶
Evaluate a trained BEAR, AR or BMM model. Can use multiple GPUs in parallel.
- Parameters
- datatensorflow data object
Load sequence data using tools in dataloader.py. Minibatch before passing.
- ds_loc_trainint
Column in count data that corresponds with the training data. Set to -1 for conditioning on training data.
- ds_loc_testint
Column in count data that corresponds with the testing data.
- alphabetstr
One of ‘dna’, ‘rna’, ‘prot’.
- hdtype
The \(h\) parameter from the BEAR model.
- ar_funcfunction
A function that takes a tensor of shape [A1, …, An, lag, alphabet_size+1] of dtype and returns a tensor of shape [A1, …, An, alphabet_size+1] of dtype of transition probabilities for each kmer. The autoregressive function.
- van_reg1D numpy or tensorflow array
Prior on vanilla BEAR model (Dirichlet concentration parameter).
- dtypedtype, default = tf.float64
- Returns
- log_likelihood_ear, log_likelihood_arm, log_likelihood_vanfloat
Total log likelihood of the data with the model evaluated as a BEAR, AR or vanilla BEAR model.
- perplexity_ear, perplexity_arm, perplexity_vanfloat
Perplexity of the data with the model evaluated as a BEAR, AR or vanilla BEAR model.
- accuracy_ear, accuracy_arm, accuracy_vanfloat
Accuracy of the data with the model evaluated as a BEAR, AR or vanilla BEAR model. Ties in maximum model probability are resolved randomly.
One may also evaluate the performance of the trained BEAR model for mutiple values of the concentration parameter \(h\):
- bear_model.bear_net.h_scan(data, ds_loc_train, ds_loc_test, alphabet, h, ar_func, dtype=tf.float64)[source]¶
Evaluate a trained BEAR model at multiple h values. Can use multiple GPUs in parallel.
- Parameters
- datatensorflow data object
Load sequence data using tools in dataloader.py. Minibatch before passing.
- ds_loc_trainint
Column in count data that corresponds with the training data. Set to -1 for conditioning on training data.
- ds_loc_testint
Column in count data that corresponds with the testing data.
- alphabetstr
One of ‘dna’, ‘rna’, ‘prot’.
- htensor
The \(h\) parameter from the BEAR model.
- ar_funcfunction
A function that takes a tensor of shape [A1, …, An, lag, alphabet_size+1] of dtype and returns a tensor of shape [A1, …, An, alphabet_size+1] of dtype of transition probabilities for each kmer. The autoregressive function.
- dtypedtype, default = tf.float64
- Returns
- log_likelihood_ear, log_likelihood_arm, log_likelihood_vanfloat
Total log likelihood of the data with the model evaluated as a BEAR, AR or vanilla BEAR model.
- perplexity_ear, perplexity_arm, perplexity_vanfloat
Perplexity of the data with the model evaluated as a BEAR, AR or vanilla BEAR model.
- accuracy_ear, accuracy_arm, accuracy_vanfloat
Accuracy of the data with the model evaluated as a BEAR, AR or vanilla BEAR model. Ties in maximum model probability are resolved randomly.
To recover the concentration parameter/misspecification diagnostic \(h\) and
the learned autoregressive function
from the outputted list of parameters, use the change_scope_params
function:
- bear_model.bear_net.change_scope_params(lag, alphabet_size, make_ar_func, af_kwargs, params, dtype=tf.float64)[source]¶
Redefine and get parameters of BEAR or AR distribution in given scope.
Used to to get unmirrored variables after training on multiple GPUs in parallel or to unpack a list of params into h and the autoregressive function.
- Parameters
- lagint
- alphabet_sizeint
- make_ar_funcfunction
Takes lag, alphabet_size, af_kwargs, dtype and returns an ar_func.
- af_kwargsdict
Keyword arguments for particular ar_func. For example, filter_width.
- params: list
List of parameters as tensorflow variables.
- dtypedtype, default = tf.float64
dtype for the ar_func and h.
- Returns
- paramslist
List of parameters as tensorflow variables.
- h_signeddtype
\(\log(h)\) where \(h\) is the BEAR concentration parameter.
- ar_funcfunction
A function that takes a tensor of shape [A1, …, An, lag, alphabet_size+1] of dtype and returns a tensor of shape [A1, …, An, alphabet_size+1] of dtype of transition probabilities for each kmer. The autoregressive function.
To embed your own autoregressive function, create a new
make_ar_func_...
function using the examples
bear_model.ar_funcs.make_ar_func_linear()
and
bear_model.ar_funcs.make_ar_func_cnn()
as templates.
bear_ref¶
Besides standard autoregressive models like linear models and neural networks, one can also build an autoregressive model based on a reference genome. In particular, we can predict the next base by looking up the previous \(L\) bases \(k\) in the reference, normalizing the observed transition counts \(\#_{\text{ref}}(k,b)\) to form a probability, and accounting for noise using a Jukes-Cantor mutation model (with error rate \(\tau\)). This gives the autoregressive function
for \(b\neq\$\) and \(\tilde{f}_{k,\$}=0\), where \(\$\) is the stop symbol and \(|\mathcal{B}|\) is the alphabet size (excluding the stop symbol). When \(\sum_{b'\neq\$}\#_{\text{ref}}(k,b') = 0\), we default to \(\tilde{f}_{k,b} = 1/|\mathcal{B}|\). Reference genomes do not include stop symbols, and so do not provide predictions of read length; to account for this problem in a generalized way we introduce another AR function \(g\) and combine it with the reference-based prediction,
for some \(\nu\in(0, 1)\). The function \(g\) is typically chosen to
predict a stop symbol with probability 1; this particular function is implemented
in bear_model.ar_funcs.make_ar_func_stop()
.
To train the model efficiently, we preprocess the reference sequence
along with the training and testing data, such that
\(\#_{\text{ref}}\) forms a column of the summarized data.
The submodule bear_ref
trains this reference-based
BEAR model, optimizing the hyperparameters \(\tau\) and \(\nu\) as well as
any additional parameters in \(g\).
- bear_model.bear_ref.train(data, num_kmers, epochs, ds_loc, ds_loc_ref, alphabet, lag, make_ar_func, af_kwargs, learning_rate, optimizer_name, train_ar, acc_steps=1, params_restart=None, writer=None, loss_save=None, dtype=tf.float64)[source]¶
Train a BEAR or AR model based on reference transition counts using all available GPUs in parallel.
- Parameters
- datatensorflow data object
Load sequence data using tools in dataloader.py. Minibatch before passing.
- num_kmersint
Total number of kmers seen in the data. Used to normalize estimate of loss.
- epochsint
- ds_locint
Column in count data that corresponds with the training data.
- ds_loc_refint
Column in count data that corresponds with the reference data.
- alphabetstr
One of ‘dna’, ‘rna’, ‘prot’.
- lagint
- make_ar_funcfunction
Takes lag, alphabet_size, af_kwargs, dtype and returns an ar_func. See ar_funcs submodule.
- af_kwargsdict
Keyword arguments for particular ar_func. For example, filter_width.
- learning_ratefloat
- optimizer_namestr
For example ‘Adam’.
- train_arbool
Whether to train an AR (True) or BEAR (False) model.
- writertensorboard writer object, default = None
- loss_savelist, default = None
Pass a list to have losses at each step appended to it.
- acc_stepsint, default = 1
Number of steps to accumulate gradients over.
- params_restartlist of tensorflow variables, default = None
Pass the parameter list from a previous run to restart training.
- dtypedtype, default=tf.float64
- Returns
- params: list
List of parameters as tensorflow variables.
- h_signeddtype
\(\log(h)\) where \(h\) is the BEAR concentration parameter.
- ar_funcfunction
The autoregressive function.
The functions bear_model.bear_ref.evaluate()
and
bear_model.bear_net.change_scope_params()
are analogous to the functions
in bear_net
with the same names.
Getting probabilities of mutations¶
Having trained a BEAR or BMM model using bear_net
(bear_ref
not yet supported) one can calculate the probability of new sequences or the probability of mutations to a wild type sequence using the get_var_probs
submodule:.
One can do this by sampling AR models from BEAR and calculating probabilities ffor each of these AR models, or by simply calculating the probabilities under the MAP model under a BEAR or BMM model.
One may calculaate the probability fo given sequences:
- bear_model.get_var_probs.get_bear_probs_seqs(bear_path, seqs, train_col, mc_samples=41, vans=[0.1, 1, 10], get_map=False, get_marg=False, lag=None, alphabet_name=None, h=None, data=None, kmc_path=None, kmc_reverse=False, no_ends=False)[source]¶
Sample posterior predictive probabilities of variants under BEAR by looping through batches of kmers.
- Parameters
- bear_pathstr
Path to folder of trained BEAR model. None if using BMM.
- seqslist
List of sequences to get probabilities of. Must not be bytes!
- train_colint
Row of counts data that includes the training data
- mc_samplesint, default = 41
Number of samples to take from the posterior predictive of the mutation probabilities.
- vansnumpy array, default = [0.1, 1, 10]
vanilla regularization to use for BMM models.
- get_mapbool, default=False
Gets the probability of the mutations under the MAP model under BEAR instead of sampling models from BEAR.
- get_margbool, default= False
Gets posterior predictive probability exactly instead of sampling AR models.
- lagint, default = None
Specify if not using BEAR.
- alphabet_name :str, default = None
Specify if not using BEAR.
- hnumpy array
For h scans if bear_folder is specified. None if using fit BEAR h.
- datatf dataset
Generator of kmers and counts. Specify if not using BEAR.
- kmc_pathstr
Specify the path kmc files if one wishes to use kmc to count kmers instead of cycling through whole dataset.
- kmc_reversebool, default=False
Whether to include counts of the reverse complement of kmers when counting using kmc.
- no_ends: bool, default=False
Whether or not to include starts and stops in probability calculations.
- Returns
- scoresnumpy array of floats
[num sequences, num models (whather or not to use BEAR + len(vans)), mc_samples].
or a list of mutations of a wild type sequence:
- bear_model.get_var_probs.get_bear_probs(bear_path, wt_seq, vars_, train_col, mc_samples=41, vans=[0.1, 1, 10], get_map=False, lag=None, alphabet_name=None, h=None, data=None, kmc_path=None, kmc_reverse=False, kmc_no_end=False)[source]¶
Sample posterior predictive probabilities of variants under BEAR by looping through batches of kmers.
- Parameters
- bear_pathstr
Path to folder of trained BEAR model. None if using BMM.
- wt_seqstr
Wild type sequence. Must not be bytes!
- vars_numpy array of str
List of SNPs. Of the format (wt_aa) (position in wt_seq without start symbols) (var_aa) without spaces or barckets. For example [‘A0T’,’C45G’]. Must not be bytes!
- train_colint
Row of counts data that includes the training data
- mc_samplesint, default = 41
Number of samples to take from the posterior predictive of the mutation probabilities.
- vansnumpy array, default = [0.1, 1, 10]
vanilla regularization to use for BMM models.
- get_mapbool
Gets the probability of the mutations under the MAP model under BEAR instead of sampling models from BEAR.
- lagint, default = None
Specify if not using BEAR.
- alphabet_name :str, default = None
Specify if not using BEAR.
- hnumpy array
For h scans if bear_folder is specified. None if using fit BEAR h.
- datatf dataset
Generator of kmers and counts. Specify if not using BEAR.
- kmc_pathstr
Specify the path kmc files if one wishes to use kmc to count kmers instead of cycling through whole dataset.
- kmc_reversebool, default=False
Whether to include counts of the reverse complement of kmers when counting using kmc.
- kmc_no_endbool, default=False
Whether to not include ends in variant probability changes in KMC.
- Returns
- scoresnumpy array of floats
[num variants, num models ((1 for the AR model if using BEAR and get_MAP=True) + len(hs) + len(vans)), mc_samples].
Example BEAR models¶
We have also written scripts that will perform the above workflow
- loading data, training, evaluating, and saving the list of parameters using dill
-
with parameters specified in a config file.
Config files contain information about the training and testing data; parameters for training and testing; and parameters of the AR model.
Example config files may be found in bear_model/models/config_files
.
Descriptions of the outputs of these scripts in given in the tutorial below.
bear_model/models/train_bear_net.py¶
Train and evaluate AR or BEAR models using maximal liklihood or empirical Bayes respectively. Usage:
python train_bear_net.py config.cfg
Example config files, with descriptions of the input parameters, can be found
in the subfolder bear_model/models/config_files
(bear_lin_bear.cfg, bear_lin_ar.cfg,
bear_cnn_bear.cfg, and bear_cnn_ar.cfg).
bear_model/models/train_bear_ref.py¶
Train and evaluate AR or BEAR models that require a reference using maximal liklihood or empirical Bayes respectively. Train reference-based Bayesian embedded autoregressive models using empirical Bayes, and evaluate based on heldout likelihood, perplexity, and accuracy. Usage:
python train_bear_ref.py config.cfg
Example config files, with descriptions of the input parameters, can be found
in the subfolder bear_model/models/config_files
(bear_stop_bear.cfg and bear_stop_ar.cfg).
If the data includes a reference, the same config file may be used by either of the above scripts,
with the specified AR model interpreted as \(g\) when used with bear_model/models/train_bear_ref.py
.
These scripts are easiest to use through the command line, as shown in the tutorial below.
Tutorial¶
In this tutorial, we will apply BEAR models to whole genome sequencing data from the Salmonella bacteriophage YSD1. The data and the reference assembly are from Dunstan et al. (2019); the NCBI SRA record is here.
Part 1: preprocessing
First, navigate to the folder bear_model
and download the example dataset and extract its contents (we assume
throughout that you are in the bear_model
folder).
wget https://marks.hms.harvard.edu/bear/ysd1_example.tar.gz
tar -xvf ysd1_example.tar.gz -C data
There should now be five data files in the data
subfolder,
corresponding to
training sequence data (1_train.fasta and 2_train.fasta),
testing data (1_test.fasta and 2_test.fasta), and the genome reference
assembly (virus_reference.fna).
The file datalist.csv lists the dataset paths, data types (all fasta),
and the data groups (training, testing, and reference).
We provide datalist.csv as input to the summary statistic script,
python summarize.py data/datalist.csv data/ysd1 -l 5
This extracts kmer transition counts up to lag 5 (inclusive). The counts for each lag will be found in the files data/ysd1_lag_1_file_0.tsv, data/ysd1_lag_2_file_0.tsv, data/ysd1_lag_3_file_0.tsv, data/ysd1_lag_4_file_0.tsv, and data/ysd1_lag_5_file_0.tsv. Each line consists of an individual kmer and the transition counts in each data group. Finally, before training a non-vanilla BEAR model, datasets should be shuffled. In Linux,
shuf data/ysd1_lag_5_file_0.tsv -o data/ysd1_lag_5_file_0_shuf.tsv
On a Mac, replace shuf
with gshuf
(you may first need to install
GNU coreutils, via e.g. brew install coreutils
).
Note that shuffling is done in memory, so when using large lags on large datasets,
you can use the -mf flag
in summarize.py to ensure its output files are sufficiently small.
A preshuffled dataset is provided in the file
models/data/ysd1_lag_5_file_0_preshuf.tsv
to ensure that part 2 of this tutorial is reproducible and can be run
independently of KMC.
Part 2: training
Now we can train AR or BEAR model via maximal likelihood or empirical Bayes respectively.
We will be using
Config files for training three different AR models and their corresponding
BEAR model can be found in the folder bear_model/models/config_files
.
You can run these examples on your own shuffled dataset from part 1 by editing
the config files to set
start_token = ysd1_lag_5_file_0_shuf.tsv
; by default the config files use the
example preshuffled file.
Note that in these examples we fix the model lag at the small value of 5 so
that it can be trained quickly; normally we would choose the lag based on maximum
marginal likelihood.
All 6 config files use the same training protocol and differ only in
the AR functions they use (see config file section [model]
)
and whether they are BEAR or AR models (see config file entry train_ar
).
To run the examples, move to the bear_models
directory and run the following:
Linear AR python models/train_bear_net.py models/config_files/bear_lin_ar.cfg
CNN AR python models/train_bear_net.py models/config_files/bear_cnn_ar.cfg
Reference AR python models/train_bear_ref.py models/config_files/bear_stop_ar.cfg
Linear BEAR python models/train_bear_net.py models/config_files/bear_lin_bear.cfg
CNN BEAR python models/train_bear_net.py models/config_files/bear_cnn_bear.cfg
Reference BEAR python models/train_bear_ref.py models/config_files/bear_stop_bear.cfg
Each example should not take more than a few minutes to run.
The scripts each output a folder named with the time at which they were run in
out_data/logs
(two folders may already be present if you tested your installation).
This output folders contains:
A progress file that can be used to visualize the training curve using TensorBoard.
A config file that consists of the input config file appended with the training results, in the section
[results]
. (Note that even when the BEAR model is the model that is trained, the perplexity, log likelihood and accuracy of the embedded AR model and vanilla BEAR model (BMM) are also reported; when the AR model is trained, the performance of the corresponding BEAR model with \(h=1\) is reported.)A pickle file with the learned hyperparameters. These hyperparameters can be recovered using the
dill
package.
The performance results from these example models should match closely the following table:
Experiment |
Perplexity |
Accuracy |
h |
---|---|---|---|
Linear AR |
3.99 |
32.9% |
N/A |
CNN AR |
3.85 |
35.8% |
N/A |
Reference AR |
3.84 |
36.5% |
N/A |
BMM |
3.79 |
36.8% |
N/A |
Linear BEAR |
3.79 |
36.8% |
0.0433 |
CNN BEAR |
3.79 |
36.8% |
0.0119 |
Reference BEAR |
3.79 |
36.8% |
0.0142 |
A few things to note from looking at the results:
In the paper we used a lag of 13 for this dataset, chosen by maximum marginal likelihood. Here, using a lag of 5, the perplexities are much larger and the \(h\) values are much smaller. This is due to the fact that the closest (in KL) AR model of lag 5, unsurprisingly, has a much larger perplexity than the lag 13 model, and is also much closer to the linear and CNN models.
As demonstrated in the paper, with enough data, the relative benefit of the BEAR over the vanilla BEAR is minimal. Since the lag is only 5 in this example, the data is sufficiently large (relative to the model flexibility) to make this true. However, the BEAR still outperforms the vanilla BEAR in the fifth or sixth decimal space (not shown in table).
The value of \(h\) is in proportion to expected misspecification of the parametric AR model. Interestingly in this simple case, unlike most example datasets in the publication, the CNN is better specified than the reference (\(h=0.0116\) versus \(h=0.0142\)). Running a reference model with \(g\) set to a CNN, via
python train_bear_ref.py config_files/bear_cnn_bear.cfg
, yields an even lower \(h=0.00422\), suggesting the two models learn complementary information.The learned stop rate \(\nu\) in the reference AR model is near 151 (not shown in table). 150 is the read length, so 1/151 transitions are stops on average.
Embedded AR models trained in the context of a BEAR model produce better BEAR models than AR models trained using maximum likelihood (not shown in table). The same is true for AR models trained using maximal likelihood and evaluated as AR models rather than as BEAR models.