AllTheBacteria’s 0.2 Database Release Creation Report
Databases created for release 0.2
The list of database files:
- Created allthebacteria-r0.2-k21.zip
- Created allthebacteria-r0.2-k31.zip
- Created allthebacteria-r0.2-k51.zip
Summary of database
Only the database for the first of the configured k-values (21, 31, 51) is summarized here.
Summary example for the new databases
The manifest for /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2/sigs-r0.2/allthebacteria-r0.2-k21.zip:
** loading from '/group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2/metadata-r0.2/allthebacteria-r0.2-mf.csv'
path filetype: StandaloneManifestIndex
location: /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2/metadata-r0.2/allthebacteria-r0.2-mf.csv
is database? yes
has manifest? yes
num signatures: 5798436
** examining manifest...
total hashes: 23340364269
summary of sketches:
1932812 sketches with DNA, k=21, scaled=1000, abund 7793780367 total hashes
1932812 sketches with DNA, k=51, scaled=1000, abund 7785665403 total hashes
1932812 sketches with DNA, k=31, scaled=1000, abund 7760918499 total hashes
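The figures in this summary can be cross-checked against each other: the per-k sketch counts should add up to the reported number of signatures, and the per-k hash counts to the reported total. The following is a minimal sketch of that check, assuming the three summary lines are pasted in as text; the regular expression and the `parse_summary` helper are illustrative and are not part of sourmash.

```python
import re

# Summary lines copied from the `sourmash sig summarize` output above.
SUMMARY = """\
1932812 sketches with DNA, k=21, scaled=1000, abund 7793780367 total hashes
1932812 sketches with DNA, k=51, scaled=1000, abund 7785665403 total hashes
1932812 sketches with DNA, k=31, scaled=1000, abund 7760918499 total hashes
"""

# Hypothetical pattern for one summary line; adjust it if the output format differs.
LINE_RE = re.compile(
    r"^(?P<n>\d+) sketches with (?P<moltype>\w+), k=(?P<k>\d+), "
    r"scaled=(?P<scaled>\d+),.*?(?P<hashes>\d+) total hashes$"
)


def parse_summary(text):
    """Return a dict mapping k-mer size to (sketch count, total hashes)."""
    rows = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows[int(match["k"])] = (int(match["n"]), int(match["hashes"]))
    return rows


rows = parse_summary(SUMMARY)
total_sketches = sum(n for n, _ in rows.values())
total_hashes = sum(h for _, h in rows.values())

# Compare against "num signatures" and "total hashes" reported above.
print(total_sketches == 5798436, total_hashes == 23340364269)
```

Both comparisons hold for the numbers reported here: three sets of 1932812 sketches give 5798436 signatures, and the three per-k hash counts sum to 23340364269.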
Replicate these results:
sourmash sig summarize database.zip
Workflow details for databases release 0.2
For the 1932812 sequences in the created databases, there are -2 missing signatures from the expected 1932812.
Failures: Sequences failed to download or sketch
The script find_missing_files.sh was used to identify sequences missing from the tar.xz archives. It found 0 missing sequences.
Reading csv file content...
Processed 1932813 lines from csv file
Found 0 missing files
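The contents of find_missing_files.sh are not shown in this report. As a rough illustration of the kind of comparison it reports on, the sketch below reads expected file names from a CSV file and checks them against the members of one or more tar.xz archives. The column name `filename`, the helper names, and the choice to match on base names are all assumptions made for this example, not a description of the actual script.

```python
import csv
import tarfile
from pathlib import Path


def expected_names(csv_path, column="filename"):
    """Read expected file names from one CSV column (the column name is an assumption)."""
    with open(csv_path, newline="") as fh:
        return [row[column] for row in csv.DictReader(fh)]


def archived_names(archive_paths):
    """Collect the base names of all regular files inside the given tar.xz archives."""
    names = set()
    for path in archive_paths:
        with tarfile.open(path, "r:xz") as tar:
            names.update(Path(m.name).name for m in tar.getmembers() if m.isfile())
    return names


def find_missing(expected, present):
    """Return the expected names that are absent from the archives, in input order."""
    return [name for name in expected if name not in present]
```

A call such as `find_missing(expected_names("expected.csv"), archived_names(["batch.tar.xz"]))` (both file names hypothetical) returns an empty list when every expected file is present, which would correspond to the "Found 0 missing files" line above.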
The workflow-cleanup directory contains the file allthebacteria-r0.2-missing-files.csv, which may be used to manually check the archives for the missing sequences.
Consider running:
cd /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2
~/database-releases/allthebacteria-workflow/scripts/extract_missing_files.py workflow-cleanup/allthebacteria-r0.2-missing-files.csv -d allthebacteria-r0.2-data/ -o missed-files
sourmash scripts manysketch missed-files/*/manysketch.csv -p k=21,k=31,k=51,scaled=1000,abund -o allthebacteria-r0.2-sigs/missed-files.zip
sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip allthebacteria-r0.2-sigs/allthebacteria-r0.2-k21.zip -k 21 -o allthebacteria-r0.2-k21.zip
sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip allthebacteria-r0.2-sigs/allthebacteria-r0.2-k31.zip -k 31 -o allthebacteria-r0.2-k31.zip
sourmash sig cat allthebacteria-r0.2-sigs/missed-files.zip allthebacteria-r0.2-sigs/allthebacteria-r0.2-k51.zip -k 51 -o allthebacteria-r0.2-k51.zip
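The three `sourmash sig cat` commands differ only in the k-mer size, so they can be generated in a loop. The sketch below builds the same argument lists as above and prints them as a dry run; the constants and the `sig_cat_command` helper are illustrative names introduced for this example, and the paths are taken from the commands above.

```python
import shlex

KSIZES = (21, 31, 51)
SIG_DIR = "allthebacteria-r0.2-sigs"  # directory used in the commands above
RELEASE = "allthebacteria-r0.2"


def sig_cat_command(ksize, sig_dir=SIG_DIR, release=RELEASE):
    """Build the `sourmash sig cat` argument list for one k-mer size."""
    return [
        "sourmash", "sig", "cat",
        f"{sig_dir}/missed-files.zip",
        f"{sig_dir}/{release}-k{ksize}.zip",
        "-k", str(ksize),
        "-o", f"{release}-k{ksize}.zip",
    ]


commands = [sig_cat_command(k) for k in KSIZES]

# Dry run: print each command instead of executing it.
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)` from the release directory, assuming sourmash is installed and on the PATH.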