Bachelor Thesis from the year 2014 in the subject Computer Science -
Bioinformatics, grade: 8.26, Lovely Professional University, Punjab,
course: B.Tech Honors Biotechnology, language: English, abstract: As the
number of sequenced genomes is increasing at a high rate, there is a need
for a gene prediction method that is quick, reliable, and inexpensive. In such
conditions, a computational tool can serve as an alternative to wet-lab
methods. The confidence level of annotation by the tool can be
enhanced by preparing exhaustive training data sets. The aim is to
develop a tool that reads a DNA sequence file in
FASTA format and annotates it. For this purpose, the Saccharomyces Genome
Database (SGD) was used to retrieve the input data, and the tool was
developed in Perl. To increase the confidence level of
annotation, the data were validated against multiple sources. A Perl script was
written to find promoter regions, repeats, transcription factor
binding sites, base periodicity, and nucleotide frequency. The program
was also run to identify repeats, poly(A) signals, CpG
islands, and autonomously replicating sequences (ARS). The tool annotates the DNA by predicting the gene
structure based on the consensus sequences of important regulatory
elements. The confidence level of annotation for the predicted genes,
non-coding regions, ARS, repeats, etc. was checked by running a test
dataset, which consisted of annotated data as reported by the genome
database and by computational tools. Gene prediction on the non-coding
regions reported by the genome database (SGD) was performed with existing
tools; the regions identified as non-coding by these tools were then
analyzed for the presence of repeats. BLAST was used to annotate sequences on the
basis of similarity to already annotated genes.
GeneMark.hmm and FGENESH were used for gene prediction. To
validate the predicted results, annotations of the Saccharomyces
cerevisiae genome from SGD, and the output of different comput