Molecular fingerprints are vectors where each position encodes a substructural property of the molecule.

Similarly, bile acid molecules such as muricholic acid and taurocholic acid were more abundant in IHH-exposed versus control animals. Bile acids are crucial not only for facilitating transport of dietary fats and cholesterol in the host but also for regulating host energy expenditure, glucose homeostasis and anti-inflammatory immune responses. Many metabolic and cardiovascular conditions have been associated with aberrant bile acid profiles, suggesting that prolonged perturbations in these key molecules could contribute to downstream adverse cardiovascular consequences of OSA as well. In summary, our work provides reproducible candidate biomarkers of IHH exposure in animal models and will be most applicable to designing diagnostic and treatment modalities. Furthermore, by identifying consistent alterations across different model systems, we outline a general pipeline to select biomarkers and therapeutic targets that is applicable to other intervention studies as well. We have made these information-rich datasets publicly available to promote collaborative progress in this area of research.

Intermittent hypoxia and hypercapnia was maintained in a computer-controlled atmosphere chamber system as previously described. IHH exposure was introduced to the mice in short periods of synchronized reduction of O2 and increase of CO2, separated by alternating periods of normoxia and normocapnia, with 1–2 min ramp intervals, for 10 hours per day during the light cycle. This treatment protocol mimics the severe clinical condition observed in obstructive sleep apnea patients. Mice on the same HFD but in room air were used as controls. Fecal samples were collected at baseline and twice each week for 6 weeks or 10 weeks.

We performed 16S sequencing on fecal samples from Ldlr-/- and ApoE-/- mice for all the time points. DNA extraction and 16S rRNA amplicon sequencing were done using Earth Microbiome Project standard protocols.

In brief, DNA was extracted using the MO BIO PowerSoil DNA extraction kit. Amplicon PCR was performed on the V4 region of the 16S rRNA gene using the primer pair 515f to 806r with Golay error-correcting barcodes on the reverse primer. Amplicons were barcoded and pooled in equal concentrations for sequencing. The amplicon pool was purified with the MO BIO UltraClean PCR cleanup kit and sequenced on the Illumina HiSeq 2500 sequencing platform. Sequence data were demultiplexed and minimally quality filtered using the QIIME 1.9.1 script split_libraries_fastq.py with a Phred quality threshold of 3 and default parameters to generate per-study FASTA sequence files. The raw sequence data were processed using the Deblur workflow with default parameters in Qiita. This generated a sub-operational taxonomic unit (sOTU) abundance per sample. Taxonomies for sOTUs were assigned using the sklearn-based taxonomy classifier trained on the Greengenes 13_8 99% OTUs in QIIME 2. The sOTU table was rarefied to a depth of 2,000 sequences/sample to control for sequencing effort. A phylogeny was inferred using SATé-enabled phylogenetic placement, which was used to insert 16S Deblur sOTUs into Greengenes 13_8 at a 99% phylogeny.

We acquired LC-MS/MS data on fecal samples from Ldlr-/- and ApoE-/- mice using an identical protocol. Details of data acquisition parameters are specified in . Briefly, fecal pellets were extracted in 500 µl of 50:50 methanol-H2O solvent, followed by centrifugation to separate insoluble material. The extracts were dried completely by centrifugal evaporation and resuspended in 150 µl of methanol-H2O. After resuspension, the samples were analysed on a Vanquish ultrahigh-performance liquid chromatography system coupled to a Q Exactive orbital ion trap. A C18 core shell column with a flow rate of 0.5 ml/min was used for chromatographic separation. The raw data sets were converted to m/z extensible markup language (mzXML) in centroid mode using MSConvert.
All mzXML files were cropped with an m/z range of 75.00 to 1,000.00 Da. Feature extraction was performed in MZmine2 with a signal intensity threshold of 2.0e5 and a minimum peak width of 0.3 s. The maximum allowed mass and retention time tolerances were 10 ppm and 10 s, respectively.

The local minimum search algorithm with a minimum relative peak height of 1% was used for chromatographic deconvolution; the maximum peak width was set to 1 min. The detected peaks were aligned across all samples using the above-mentioned retention time and mass tolerances, producing the final feature table used in these analyses. We performed molecular networking in GNPS to putatively identify molecular features using MS/MS-based spectral library matches. We analyzed them using the same LC-MS/MS method described above to compare and verify the exact masses, fragmentation patterns, and retention times to ensure level 1 annotations, as defined by the 2007 metabolomics standards initiative.

We calculated the sharedness of microbial features as follows. To quality-control the 16S sequences obtained per animal model, we retained only reads that were prevalent within each model, i.e., above a sum relative abundance threshold of 10E-06 and present in at least 1% of the samples, thus avoiding sequencing noise. The number of such reads in Ldlr-/- and ApoE-/- animals was 635 and 582, respectively. Out of these, 248 sequences were shared between the two models. Therefore, the percentage of microbiome features shared between the animal models was 39% of the unique microbial features found in Ldlr-/- models. For metabolomic data, we quality-controlled the chemical features by retaining those above a sum relative abundance threshold of 10E-01 and present in at least 10% of all samples for each animal model individually. There were 267 and 374 such features in Ldlr-/- and ApoE-/- animals, respectively. Out of these, 137 metabolites were shared between the two models. Thus, the percentage shared between the animal models was 51% of the total features in Ldlr-/- knockout models.

Effect sizes were calculated for genotype, individual mouse, cage number, age, and exposure type.
For each of these covariates, we applied the mixed directional FDR methodology to test for the significance of each pairwise comparison among the groups. For each significant pairwise comparison via PERMANOVA, we computed the effect size using Cohen’s d, that is, the absolute difference between the means of the two groups divided by the pooled standard deviation. As diversity estimators we used unweighted UniFrac and Bray-Curtis distance matrices for the 16S rRNA sequencing and LC-MS/MS mass-spectrometry data, respectively. For the microbiome data layer, when taking both genotypes together, we see that the three largest effect sizes are mouse number, age and cage number, followed by genotype and exposure type.
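
As an illustration of the effect-size definition above, the following is a minimal sketch of Cohen's d for two groups of values. The function name `cohens_d` and the toy inputs are hypothetical, and the sketch assumes the usual pooled standard deviation with sample variances (ddof=1); the source does not specify which values were passed to the calculation or which pooling convention was used.

```python
import numpy as np


def cohens_d(group_a, group_b):
    """Absolute difference of group means divided by the pooled standard deviation.

    Assumption: the pooled SD uses sample variances (ddof=1) weighted by
    degrees of freedom. Both groups need at least two values.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = a.size, b.size
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return abs(a.mean() - b.mean()) / np.sqrt(pooled_var)


# Toy values standing in for, e.g., distances within two groups of one covariate.
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Because the absolute difference is used, the result does not depend on the order in which the two groups are passed.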

It is important to note that the maximum differences for the first three covariates are related to genotype differences. For example, the maximum difference in mouse number is between two mice [mouse numbers 105 vs. 32; Figure 4.S1] that belong to two different genotypes and exposure types. To untangle the effect of genotype, we stratified our dataset by genotype and calculated effect sizes of each of the covariates within each model. It is noteworthy that effects of covariates are ranked differently within each model, hinting at underlying differences in the characteristics of the microbial community. Nevertheless, the effect of exposure is ranked comparably across models. Similarly, we calculated effect sizes of the above-mentioned covariates for the metabolome data layer. When taking both genotypes together, consistent with the microbiome results, mouse number, age and cage number have the largest effect sizes, and the groups with the maximum effects belong to different genotypes [e.g. mouse number 114 vs. 17]. We then stratified the data by genotype and observed that different covariates had distinct effects within each genotype. Interestingly, our analysis shows that, unlike in Ldlr-/- mice, individual variability was not significant in ApoE-/- mice.

A Random Forest (RF) classifier was trained and evaluated with cross validation for each mouse model, using microbial or chemical features as predictors. During cross validation, all the samples from the same mouse appeared only in either training or validation data but not both, to avoid over-optimistic cross-validation accuracy scores resulting from the classifier learning idiosyncrasies of the individual rather than the treatment. The classifiers trained for each mouse model were then applied to the samples of the other mouse model for cross-genotype prediction. For the longitudinal prediction, we trained and evaluated an RF classifier on the samples collected at each time point for AUC computation.
To assess the capability of individual 16S sequences and metabolites to separate IHH-exposed from control animals, we used the abundance of each feature as the score to plot the ROC curve and compute the AUC, and highlighted on the ROC plots the features that can single-handedly distinguish IHH. These analyses were done using the scikit-learn Python package.

Molecular networking, introduced in 2012, was one of the first data organization approaches to visualize the relationships between fragmentation spectra for similar molecules from tandem mass spectrometry data in the context of metadata. It formed the basis for the web-based mass spectrometry infrastructure, Global Natural Products Social Molecular Networking (GNPS), which sees ~200,000 new accessions per month. Molecular networking is used for a range of applications in drug discovery, environmental monitoring, medicine, and agriculture. While molecular networking is useful for visualizing closely related molecular families, the inference of chemical relationships at a dataset-wide level and in the context of diverse metadata requires complementary representation strategies. To address this need, we developed an approach that uses fragmentation trees and supervised machine learning to calculate all pairwise chemical relationships and visualizes them in the context of sample metadata and molecular annotations. We show that a chemical tree enables the application of various tree-based tools, originally developed for analyzing DNA sequencing data, for exploring mass-spectrometry data. We introduce Qemistree, pronounced chemis-tree, a software tool that constructs a chemical tree from fragmentation spectra based on predicted molecular fingerprints. Recent methods allow us to predict molecular fingerprints from tandem mass spectra. In Qemistree, we use SIRIUS and CSI:FingerID to obtain predicted molecular fingerprints.
The users first perform feature detection to generate a list of observed ions, referred to as chemical features henceforth, to be analyzed by Qemistree.
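
The grouped cross-validation scheme described above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the authors' pipeline: the function name `grouped_cv_auc`, the number of folds, the number of trees, and the synthetic data are all assumptions. `GroupKFold` keeps every sample from a given mouse on one side of each split, and the AUC is computed once over the pooled out-of-fold predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold


def grouped_cv_auc(X, y, mouse_ids, n_splits=5, seed=0):
    """Out-of-fold AUC of a Random Forest, splitting by mouse rather than by sample."""
    X, y, mouse_ids = np.asarray(X), np.asarray(y), np.asarray(mouse_ids)
    oof = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=mouse_ids):
        # No mouse contributes samples to both the training and validation side.
        assert set(mouse_ids[train_idx]).isdisjoint(mouse_ids[test_idx])
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        oof[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    return roc_auc_score(y, oof)


# Synthetic stand-in data (hypothetical): 10 mice, 6 samples each, 20 features.
rng = np.random.default_rng(0)
mouse_ids = np.repeat(np.arange(10), 6)
y = mouse_ids % 2                      # exposure is assigned per mouse
X = rng.normal(size=(60, 20))
X[:, 0] += 3.0 * y                     # one feature carries the exposure signal
auc = grouped_cv_auc(X, y, mouse_ids)
```

Pooling the out-of-fold predictions before computing the AUC avoids an undefined score when a validation fold happens to contain mice from only one exposure group.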

SIRIUS then determines the molecular formula of each feature using the isotope and fragmentation patterns, and estimates the best fragmentation tree explaining the fragmentation spectrum. Subsequently, CSI:FingerID operates on the fragmentation trees using kernel support vector machines to predict molecular properties. We use these molecular fingerprints to calculate pairwise distances between chemical features, which are then hierarchically clustered to generate a tree representing their structural relationships. Although alternative approaches to hierarchically cluster features based on cosine similarity of fragmentation spectra exist, we use molecular fingerprints because they allow us to compare features based on a diverse range of structural properties predicted by CSI:FingerID. Additionally, as CSI:FingerID was shown to perform well for automatic in silico structural annotation, we leverage it to search molecular structural databases to provide complementary insights into structures when no match is obtained against spectral libraries. Subsequently, we use ClassyFire to assign a 5-level chemical taxonomy.

To verify that molecular fingerprint-based trees correctly capture the chemical relationships between molecules, we generated an evaluation dataset with two human fecal samples, a tomato seedling sample, and a human serum sample. Mixtures of these samples were prepared by combining them in gradually increasing proportions to generate a set of diverse but related metabolite profiles, and untargeted tandem mass spectrometry was used to profile the chemical composition of these samples. Mass spectrometry was performed twice using different chromatographic gradients, causing a non-uniform retention time shift between the two runs. The data processing of these two experiments leads to the same molecules being detected as different chemical features in downstream analysis.
In Figure 5.1a we highlight how these technical variations make the same samples appear chemically disjointed.

Using Qemistree, we map each of the spectra in the two chromatographic conditions to a molecular fingerprint, and organize these in a tree structure. Because molecular fingerprints are independent of retention time shifts, spectra are clustered based on their chemical similarity. This tree structure can be decorated using sample type descriptions, chromatographic conditions, and spectral library matches obtained from molecular networking in GNPS. Figure 5.1 shows that similar chemical features are detected exclusively in one of the two batches. However, based on the molecular fingerprints, these chemical features were arranged as neighboring tips in the tree regardless of the retention time shifts. This result shows how Qemistree can reconcile and facilitate the comparison of datasets acquired on different chromatographic gradients. We demonstrate the use of a chemical hierarchy in performing chemically informed comparisons of metabolomics profiles. In standard metabolomic statistical analyses, each molecule is assumed unrelated to the other molecules in the dataset.
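
The step of turning molecular fingerprints into a tree, described earlier in this section, can be illustrated with a small sketch. It is not the Qemistree implementation: the function name `fingerprint_tree` is hypothetical, the fingerprints are treated as binary vectors, and the Jaccard distance with average linkage is an assumed choice, since the text does not name the distance metric or the linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def fingerprint_tree(fingerprints, metric="jaccard", method="average"):
    """Hierarchically cluster chemical features from their fingerprint vectors.

    Returns a SciPy linkage matrix encoding the tree. The metric and linkage
    method are illustrative assumptions.
    """
    fp = np.asarray(fingerprints, dtype=bool)
    condensed = pdist(fp, metric=metric)   # all pairwise distances between features
    return linkage(condensed, method=method)


# Toy fingerprints (hypothetical): features 0/1 and features 2/3 share substructures.
toy = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
]
Z = fingerprint_tree(toy)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clades
```

In this toy case the two pairs of features with overlapping fingerprints end up as neighboring tips, which mirrors how features with the same predicted fingerprint are grouped together regardless of the retention time at which they were detected.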