AllTheBacteria’s 0.2 Database Release Creation Report

Published

Friday, the 13th of September, 2024

Databases created for release 0.2

The list of database files:

  - Created allthebacteria-r0.2-k21.zip 

  - Created allthebacteria-r0.2-k31.zip 

  - Created allthebacteria-r0.2-k51.zip 

Summary of database

The summary below covers only the first of the k-values defined for this release (21, 31, 51).

Summary example for the new databases

The manifest for /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2/sigs-r0.2/allthebacteria-r0.2-k21.zip:


** loading from '/group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2/metadata-r0.2/allthebacteria-r0.2-mf.csv'
path filetype: StandaloneManifestIndex
location: /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2/metadata-r0.2/allthebacteria-r0.2-mf.csv
is database? yes
has manifest? yes
num signatures: 5798436

** examining manifest...
total hashes: 23340364269
summary of sketches:
   1932812 sketches with DNA, k=21, scaled=1000, abund 7793780367 total hashes
   1932812 sketches with DNA, k=51, scaled=1000, abund 7785665403 total hashes
   1932812 sketches with DNA, k=31, scaled=1000, abund 7760918499 total hashes

Replicate these results:

sourmash sig summarize database.zip
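The counts in the summary above can be cross-checked with simple arithmetic: three k-sizes at 1932812 sketches each should equal the reported number of signatures, and the three per-k hash counts should sum to the reported total. The sketch below only re-adds the numbers printed in the summary; the variable names are illustrative and are not part of the workflow.

```shell
# Values copied from the summary output above.
num_signatures=5798436
total_hashes=23340364269
sketches_per_k=1932812
hashes_k21=7793780367
hashes_k51=7785665403
hashes_k31=7760918499

# One sketch per sequence at each of the three k-sizes.
expected_signatures=$((sketches_per_k * 3))

# The per-k hash counts should add up to the reported total.
summed_hashes=$((hashes_k21 + hashes_k51 + hashes_k31))

[ "$expected_signatures" -eq "$num_signatures" ] && echo "signature count is consistent"
[ "$summed_hashes" -eq "$total_hashes" ] && echo "hash total is consistent"
```

Both checks pass for the numbers reported here, so the manifest holds exactly one sketch per k-size for each of the 1932812 sequences.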

Workflow details for database release 0.2

For the 1932812 sequences in the created databases, there are -2 missing signatures from the expected 1932812.

Failures: Sequences failed to download or sketch

The script find_missing_files.sh was used to identify the missing sequences among the tar.xz archives. This script found 0 missing sequences.

Reading csv file content...
Processed 1932813 lines from csv file
Found 0 missing files
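The contents of find_missing_files.sh are not shown in this report, so the following is only a rough sketch of the general kind of cross-check such a script might perform, not the script itself. It assumes a CSV with a single header row and one file name per line, and compares it against a list of files actually found; all file names and the temporary layout are hypothetical. The reported count of 1932813 processed lines is one more than the 1932812 expected sequences, which would be consistent with one header row, though the report does not state this.

```shell
# Hypothetical inputs: an expected-files CSV (with header) and a list of files found.
tmpdir=$(mktemp -d)
printf 'filename\nsample_a.fa\nsample_b.fa\nsample_c.fa\n' > "$tmpdir/expected.csv"
printf 'sample_a.fa\nsample_c.fa\n' > "$tmpdir/found.txt"

# Drop the header row, then sort both lists so comm can compare them.
tail -n +2 "$tmpdir/expected.csv" | sort > "$tmpdir/expected.sorted"
sort "$tmpdir/found.txt" > "$tmpdir/found.sorted"

# comm -23 keeps lines that appear only in the first (expected) list.
missing=$(comm -23 "$tmpdir/expected.sorted" "$tmpdir/found.sorted")

# Count non-empty lines; grep exits non-zero on a count of 0, hence "|| true".
missing_count=$(printf '%s\n' "$missing" | grep -c . || true)

echo "Found $missing_count missing files"
rm -rf "$tmpdir"
```

In this toy input one file, sample_b.fa, is absent from the found list, so the sketch reports one missing file.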

Within the workflow-cleanup directory, the file allthebacteria-r0.2-missing-files.csv may be used to manually check the archives for the missing sequences.

Consider running:

cd /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2

~/database-releases/allthebacteria-workflow/scripts/extract_missing_files.py workflow-cleanup/allthebacteria-r0.2-missing-files.csv -d allthebacteria-r0.2-data/ -o missed-files

sourmash scripts manysketch missed-files/*/manysketch.csv -p k=21,k=31,k=51,scaled=1000,abund -o allthebacteria-r0.2-sigs/missed-files.zip

sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip allthebacteria-r0.2-sigs/allthebacteria-r0.2-k21.zip -k 21 -o allthebacteria-r0.2-k21.zip
sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip allthebacteria-r0.2-sigs/allthebacteria-r0.2-k31.zip -k 31 -o allthebacteria-r0.2-k31.zip
sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip allthebacteria-r0.2-sigs/allthebacteria-r0.2-k51.zip -k 51 -o allthebacteria-r0.2-k51.zip
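The three sourmash sig cat commands above differ only in the k-size, so they can be generated from a loop. The sketch below only prints the commands (a dry run) so they can be reviewed before anything is executed; the function name print_sig_cat_commands is illustrative and is not part of the workflow.

```shell
# Print one "sourmash sig cat" command per k-size without running it.
print_sig_cat_commands() {
    for k in 21 31 51; do
        echo "sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip" \
             "allthebacteria-r0.2-sigs/allthebacteria-r0.2-k${k}.zip" \
             "-k ${k} -o allthebacteria-r0.2-k${k}.zip"
    done
}

print_sig_cat_commands
```

Once the printed commands look right, they could be run from the release directory by piping the output to a shell, for example `print_sig_cat_commands | sh`.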