Bioinformatics

Model
Digital Document
Publisher
Florida Atlantic University
Description
Bioinformatics tools applied to large-scale genomic datasets have helped develop our understanding of primate phylogenetics. However, it is becoming increasingly evident that biological data are accumulating faster than the bioanthropological community's current capacity to analyze, integrate, and mine them. Consequently, this affects how anthropologists create and distribute knowledge. There is a growing need for more bioinformatics training within anthropological spaces and for the development of user-friendly bioinformatic tools for the analysis, mining, and modeling of both local and global datasets. This thesis showcases the use of applied bioinformatics tools to construct seven new whole mitochondrial genomes for the study of primate variation. Furthermore, it investigates the guenon radiation, developing and documenting bioinformatics and statistical tools to perform a phylogenetic analysis of the genus Cercopithecus. Finally, the utility of the pipelines for other researchers in the Detwiler Lab Group and the potential for further phylogenetic studies are discussed.
Model
Digital Document
Publisher
Florida Atlantic University
Description
In response to the massive amounts of data that make up many bioinformatics datasets, it has become increasingly necessary for researchers to use computers to aid them in their work. With difficulties such as high dimensionality, class imbalance, noisy data, and difficult-to-learn class boundaries present within the data, bioinformatics datasets are a challenge to work with. One potential source of assistance is the domain of data mining and machine learning, a field which focuses on working with these large amounts of data, develops techniques to discover new trends and patterns hidden within the data, and increases the capability of researchers and practitioners to work with these data. Within this domain there are techniques designed to eliminate irrelevant or redundant features, balance class membership, handle errors found in the data, and build predictive models for future data.
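One of the remedies named above, balancing class membership, can be sketched in a few lines. The snippet below is a minimal illustration of random undersampling on a hypothetical toy dataset (the labels, sizes, and sampling strategy are assumptions for illustration, not the techniques used in the thesis):

```python
import random

# Hypothetical toy dataset: 90 "healthy" vs 10 "disease" samples,
# illustrating the class-imbalance problem described above.
random.seed(0)
data = [([random.random()], "healthy") for _ in range(90)] + \
       [([random.random()], "disease") for _ in range(10)]

def undersample(samples):
    """Randomly undersample each class down to the minority-class size."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append((x, y))
    minority = min(len(v) for v in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(random.sample(group, minority))
    return balanced

balanced = undersample(data)
counts = {}
for _, y in balanced:
    counts[y] = counts.get(y, 0) + 1
print(counts)  # both classes now have 10 members
```

Undersampling discards majority-class instances; in practice oversampling or synthetic-sample methods are alternatives when data are scarce.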
Model
Digital Document
Publisher
Florida Atlantic University
Description
One of the main applications of machine learning in bioinformatics is the construction of classification models which can accurately classify new instances using information gained from previous instances. With the help of machine learning algorithms (such as supervised classification and gene selection), new meaningful knowledge can be extracted from bioinformatics datasets that can help in disease diagnosis and prognosis as well as in prescribing the right treatment for a disease. One particular challenge encountered when analyzing bioinformatics datasets is data noise, which refers to incorrect or missing values in datasets. Noise can be introduced as a result of experimental errors (e.g., faulty microarray chips, insufficient resolution, image corruption, and incorrect laboratory procedures), as well as other errors (errors during data processing, transfer, and/or mining). A special type of data noise is class noise, which occurs when an instance/example is mislabeled. Previous research showed that class noise has a detrimental impact on machine learning algorithms (e.g., worsened classification performance and unstable feature selection). In addition to data noise, gene expression datasets can suffer from the problems of high dimensionality (a very large feature space) and class imbalance (unequal distribution of instances between classes). As a result of these inherent problems, constructing accurate classification models becomes more challenging.
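Class noise as described above is often studied by injecting it deliberately. The sketch below simulates mislabeling on hypothetical binary labels (the label names, noise rate, and data are illustrative assumptions, not the abstract's experimental setup):

```python
import random

# Sketch of "class noise": a fraction of instance labels is flipped,
# as might happen through annotation or processing errors (toy data).
random.seed(1)
labels = ["tumor" if random.random() < 0.5 else "normal" for _ in range(1000)]

def inject_class_noise(labels, rate):
    """Flip each binary label with probability `rate`, simulating mislabeling."""
    flip = {"tumor": "normal", "normal": "tumor"}
    return [flip[y] if random.random() < rate else y for y in labels]

noisy = inject_class_noise(labels, rate=0.10)
mislabeled = sum(a != b for a, b in zip(labels, noisy))
print(f"{mislabeled} of {len(labels)} labels flipped")  # roughly 10%
```

Training a classifier on `noisy` and evaluating against the clean labels is the usual way to quantify the performance degradation the passage mentions.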
Model
Digital Document
Publisher
Florida Atlantic University
Description
Mining the human genome for therapeutic target discovery promises novel outcomes. Over half of the proteins in the human genome, however, remain uncharacterized. These proteins offer a potential source of new targets for diverse diseases. Additional targets for cancer diagnosis and therapy are urgently needed to help move away from the cytotoxic era toward a targeted therapy approach. Bioinformatics and proteomics approaches can be used to characterize novel sequences in the genome database to infer putative function. The hypothesis that the amino acid motifs and protein domains of the uncharacterized proteins can be used as a starting point to predict their putative function provided the framework for the research discussed in this dissertation.
Model
Digital Document
Publisher
Florida Atlantic University
Description
This thesis describes research addressing the use of a binary representation of DNA for the purpose of developing useful algorithms for bioinformatics. Pertinent studies address the use of a binary form of the DNA base chemicals in an information-theoretic framework so as to identify symmetry between DNA and complementary DNA. This study also applies "fuzzy" (codon-noncodon) considerations in delineating codon and noncodon regimes in DNA sequences. The research further includes a comparative analysis of the test results on the aforesaid efforts using different statistical metrics, such as the Hamming distance and the Kullback-Leibler measure. The observed details support the symmetry between DNA and cDNA strands. They also demonstrate the capability of identifying noncodon regions in DNA even under diffused (overlapped) fuzzy states.
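The binary encoding and Hamming-distance comparison described above can be sketched briefly. The two-bit code below is one illustrative choice (not necessarily the thesis's scheme) picked so that complementary bases have bitwise-complementary codes:

```python
# Illustrative two-bit encoding of the four bases: complementary bases
# (A<->T, C<->G) map to bitwise-complementary codes, so every base
# differs from its complement in both bit positions.
ENCODE = {"A": "00", "T": "11", "C": "01", "G": "10"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def to_bits(seq):
    return "".join(ENCODE[b] for b in seq)

def hamming(x, y):
    """Number of differing positions between equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))

seq = "ATGCCGTA"
cdna = "".join(COMPLEMENT[b] for b in seq)
bits, cbits = to_bits(seq), to_bits(cdna)
# With this encoding the Hamming distance between a strand and its
# complement is exactly twice the sequence length.
print(hamming(bits, cbits))  # 16
```

This maximal, fixed distance between a strand and its complement is the kind of deterministic DNA/cDNA symmetry a binary representation makes easy to quantify.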
Model
Digital Document
Publisher
Florida Atlantic University
Description
This project used a bioinformatics approach to identify the differential gene expression of chronic lymphocytic leukemia (CLL) white blood cells as compared to normal white blood cells. Several public-access databases and data mining tools were used to collect these data. The data collected were validated against independent bioinformatics databases, and the methodology was supported by previously published "gene chip" differential expression data. This research identifies a pattern of differential gene expression specific to CLL white blood cells that can be used for the early diagnosis of CLL. The study also identifies the probable genetic origin for the low expression of tyrosine kinase and IgM immunoglobulin observed in CLL B-cells. Also presented are genes associated with chromosomal regions previously reported as deleted in a high percentage of CLL cases. These data can be used in further research and in the treatment of CLL.
Model
Digital Document
Publisher
Florida Atlantic University
Description
Recently, Dr. Narayanan's laboratory, utilizing bioinformatics approaches, identified a novel gene which may play a role in colon cancer. In view of its expression specificity, this gene was termed Colon Carcinoma Related Gene (CCRG). CCRG belongs to a novel class of secreted molecules with a unique cysteine-rich motif. The function of CCRG, however, remains unknown. This project revolved around establishing the putative function (functional genomics) of CCRG. The rationale was to test the hypothesis that CCRG may offer a growth advantage to cancer cells. The availability of diverse tumor-derived cell lines that were CCRG-negative offered the possibility of studying the consequences of enforced expression of CCRG. A breast carcinoma cell line was transfected with an exogenous CCRG expression vector, and the stable clones were characterized. The stable CCRG transfectants showed enhanced growth and a partial abrogation of the serum growth factor requirement. These results provide a framework for future experiments to further elucidate the function of CCRG.
Model
Digital Document
Publisher
Florida Atlantic University
Description
The purpose of sequence alignment is to detect mutual similarity, characterized by the so-called "alignment score", between the sequences compared. Quantitatively assessing the confidence level of an alignment result requires knowledge of the alignment score statistics under a certain null model, and this is the central issue in sequence alignment. In this thesis, the score statistics of the Markov null model were revisited and the score statistics of the non-Markov null model were investigated for two state-of-the-art algorithms, namely the gapless Smith-Waterman and Hybrid algorithms. These two algorithms were further used to find highly related signals in unrelated sequences and in weakly related sequences, corresponding to the Markov and non-Markov null models, respectively. The confidence levels of these models were also studied. Since the sequence similarity we are interested in comes from evolutionary history, we also investigated the relationship between sequence alignment, the tool used to find similarity, and evolution. The average evolutionary distance between the daughter sequences was found and compared with its expected value, for individual trees and as an average over many trees.
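The gapless Smith-Waterman algorithm named above finds the best-scoring ungapped local segment pair. Here is a minimal sketch of its dynamic-programming recursion; the match/mismatch scores are simple illustrative values, not the scoring scheme used in the thesis:

```python
# Minimal sketch of gapless Smith-Waterman local alignment scoring.
# Each cell holds the best ungapped segment score ending at (i, j),
# clipped at zero so a poor prefix never drags down a later segment.
def gapless_sw(x, y, match=1, mismatch=-3):
    """Best score of any ungapped local segment pair between x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s)
            best = max(best, curr[j])
        prev = curr
    return best

print(gapless_sw("GGGACGTGGG", "CCCACGTCCC"))  # 4: the shared "ACGT" segment
```

Collecting this score over many pairs of sequences drawn from a null model (Markov or otherwise) yields the empirical score distribution from which confidence levels are assigned.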
Model
Digital Document
Publisher
Florida Atlantic University
Description
After the sequencing of many complete genomes, we are in a post-genomic era in which the most important task has changed from gathering genetic information to organizing the mass of data as well as understanding how components interact with each other. The former is usually undertaken using bioinformatics methods, while the latter task is generally termed proteomics. Success in both parts demands correct statistical significance assignments for the results found. In my dissertation, I study two concrete examples: global sequence alignment statistics and peptide sequencing/identification using mass spectrometry. High-performance liquid chromatography coupled to a tandem mass spectrometer (HPLC/MS/MS), enabling peptide identifications and thus protein identifications, has become the tool of choice in large-scale proteomics experiments. Peptide identification is usually done by database search methods. The lack of robust statistical significance assignment among current methods motivated the development of a novel de novo algorithm, RAId, whose score statistics provide statistical significance for high-scoring peptides found in our custom, enzyme-digested peptide library. The ease of incorporating post-translational modifications is another important feature of RAId. To organize the massive protein/DNA data accumulated, biologists often cluster proteins according to their similarity via tools such as sequence alignment. Homologous proteins share similar domains. Assessing the similarity of two domains usually requires alignment from head to toe, i.e., a global alignment. Good alignment score statistics with an appropriate null model enable us to distinguish biologically meaningful similarity from chance similarity. There has been much progress in local alignment statistics, which characterize score statistics when alignments tend to appear as a short segment of the whole sequence.
For global alignment, which is useful in domain alignment, there is still much room for exploration and improvement. Here we present a variant of the directed polymer in random media (DPRM) problem to study the score distribution of global alignment. We demonstrate that upon proper transformation the score statistics can be characterized by Tracy-Widom distributions, which correspond to the distributions of the largest eigenvalue of various ensembles of random matrices.
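The global alignment score whose distribution is studied above is the end-to-end optimum of a Needleman-Wunsch-style recursion. The sketch below computes that score with simple illustrative parameters (the dissertation's actual scoring scheme and null model are not reproduced here):

```python
# Sketch of global (Needleman-Wunsch) alignment scoring: the quantity
# whose distribution over random sequence pairs the passage relates to
# Tracy-Widom statistics. Parameters are illustrative.
def global_score(x, y, match=1, mismatch=-1, gap=-1):
    n, m = len(x), len(y)
    prev = [j * gap for j in range(m + 1)]   # aligning a prefix of y to gaps
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m           # aligning a prefix of x to gaps
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # align x[i-1] with y[j-1]
                          prev[j] + gap,     # x[i-1] against a gap
                          curr[j - 1] + gap) # y[j-1] against a gap
        prev = curr
    return prev[m]

print(global_score("GATTACA", "GCATGCU"))  # 0
```

Unlike the local variant, no cell is clipped at zero: the whole of both sequences must be aligned, which is why the score statistics differ from the local (Gumbel-type) case.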
Model
Digital Document
Publisher
Florida Atlantic University
Description
This research concerns studies on information-theoretic (IT) aspects of data-sequence patterns and the development of discriminant algorithms that enable distinguishing the features of underlying sequence patterns having characteristic, inherent stochastic attributes. The application potentials of such algorithms include bioinformatic data mining efforts. Consistent with this scope, the research considers specific details of information-theoretics and entropy considerations vis-a-vis sequence patterns (having stochastic attributes) such as the DNA sequences of molecular biology. Applying information-theoretic concepts (essentially in Shannon's sense), the following distinct sets of metrics are developed and applied in the algorithms developed for data-sequence pattern-discrimination applications: (i) divergence or cross-entropy algorithms of the Kullback-Leibler type and of the general Csiszár class; (ii) statistical distance measures; (iii) ratio-metrics; (iv) a Fisher-type linear-discriminant measure; and (v) a complexity metric based on information redundancy. These measures are judiciously adopted in ascertaining codon-noncodon delineations in DNA sequences that consist of crisp and/or fuzzy nucleotide domains across their chains. The Fisher measure is also used in codon-noncodon delineation and in motif detection. Relevant algorithms are used to test DNA sequences of human and some bacterial organisms. The relative efficacy of the metrics and the algorithms is determined and discussed. The potentials of such algorithms in supplementing the prevailing methods are indicated. Scope for future studies is identified in terms of persisting open questions.
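As a small illustration of the first class of metrics above, the sketch below computes the Kullback-Leibler divergence between the nucleotide frequency distribution of a window and a reference background. The sequences, window choice, and epsilon smoothing are illustrative assumptions, not the study's actual algorithms or parameters:

```python
import math
from collections import Counter

# Illustrative Kullback-Leibler divergence between nucleotide frequency
# distributions, the kind of cross-entropy metric used for codon-noncodon
# discrimination (actual algorithms in the study differ).
BASES = "ACGT"

def freqs(seq):
    counts = Counter(seq)
    return {b: counts.get(b, 0) / len(seq) for b in BASES}

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) in bits; eps guards against zero probabilities."""
    return sum(p[b] * math.log2((p[b] + eps) / (q[b] + eps)) for b in BASES)

reference = freqs("ACGTACGTACGTACGT")   # uniform background composition
window = freqs("AAAAACAAAAAGAAAA")      # strongly A-biased window
print(round(kl_divergence(window, reference), 3))  # 1.331
```

Sliding such a window along a sequence and thresholding the divergence is one simple way a composition-based discriminant can flag regions whose statistics depart from the background.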