Automatically Archiving Twitter Results
Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I've been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter's built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.
Enter t, the Twitter command-line interface. t is a power tool for running all sorts of Twitter queries from the command line. See t's documentation for examples.
I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here's the code as of May 14, 2013:
That script, and results for searching for "bioinformatics", "metagenomics", "#rstats", "rna-seq", and "#bog13" (the Biology of Genomes 2013 meeting) are all in the GitHub repository below. (Please note that these results update dynamically, and searching Twitter at any point could possibly result in returning some unsavory Tweets.)
https://github.com/stephenturner/twitterchive
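The script itself lives in the repository above. As a rough illustration of the search-and-append idea only (this is not the original script), here is a minimal Python sketch. It assumes the t CLI is installed and authorized, and that `t search all QUERY --long` prints one result per line; check `t help search` for the options your version actually supports. The function names `fetch_results` and `append_new` are made up for this example.

```python
import subprocess
from pathlib import Path


def fetch_results(query):
    """Run the t CLI for one query and return its non-empty output lines.

    Assumption: `t search all QUERY --long` exists in your version of t
    and prints one tweet per line.
    """
    out = subprocess.run(
        ["t", "search", "all", query, "--long"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def append_new(archive, lines):
    """Append only lines not already in the archive; return how many were added."""
    path = Path(archive)
    seen = set(path.read_text(encoding="utf-8").splitlines()) if path.exists() else set()
    fresh = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            fresh.append(line)
    if fresh:
        with path.open("a", encoding="utf-8") as handle:
            handle.write("\n".join(fresh) + "\n")
    return len(fresh)
```

Calling `append_new("bioinformatics.txt", fetch_results("bioinformatics"))` once per keyword, from a script scheduled with cron, would approximate the workflow described above. Deduplicating on whole lines is a simplification: it only works if the same tweet is printed identically on every run.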
Analyzing Tweets using R
You'll also find an analysis subdirectory, containing some R code to produce barplots showing the number of tweets per day over the last month, frequency of tweets by hour of the day, the most used hashtags within a search, the most prolific tweeters, and a ubiquitous word cloud. Much of this code is inspired by Neil Saunders's analysis of Tweets from ISMB 2012. Here's the code as of May 14, 2013:
Also in that analysis directory you'll see periodically updated plots for the results of the queries above.
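The analysis code in the repository is written in R. For readers who just want the gist of the counting step, here is a small Python sketch of the same summaries (tweets per day, tweets per hour, and hashtag frequency). It assumes the archive has already been parsed into `(datetime, text)` pairs; the `summarize` function and the sample tweets are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

HASHTAG = re.compile(r"#\w+")


def summarize(tweets, exclude=()):
    """Count tweets per day, per hour of day, and hashtag usage.

    `tweets` is an iterable of (datetime, text) pairs; `exclude` lists
    hashtags (with '#') to leave out, such as the search term itself.
    """
    per_day, per_hour, hashtags = Counter(), Counter(), Counter()
    skip = {tag.lower() for tag in exclude}
    for when, text in tweets:
        per_day[when.date().isoformat()] += 1
        per_hour[when.hour] += 1
        for tag in HASHTAG.findall(text.lower()):
            if tag not in skip:
                hashtags[tag] += 1
    return per_day, per_hour, hashtags


# Made-up sample tweets, not real archive data.
sample = [
    (datetime(2013, 5, 8, 9, 15), "Great talk #bog13 #genomics"),
    (datetime(2013, 5, 8, 14, 0), "#BOG13 poster session #rstats"),
    (datetime(2013, 5, 9, 9, 30), "More #genomics at #bog13"),
]
per_day, per_hour, hashtags = summarize(sample, exclude=["#bog13"])
```

The three counters map directly onto the barplots described above: `per_day` for tweets per day, `per_hour` for the hour-of-day histogram, and `hashtags.most_common()` for the top hashtags.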
Analyzing Tweets mentioning "bioinformatics"
Using the bioinformatics query, here is the number of tweets per day over the last month:
Here is the frequency of "bioinformatics" tweets by hour:
Here are the most used hashtags (other than #bioinformatics):
Here are the most prolific bioinformatics Tweeps:
Here's a wordcloud for all the bioinformatics Tweets since March:
Analyzing Tweets mentioning "#bog13"
The 2013 CSHL Biology of Genomes Meeting took place May 7-11, 2013. I searched and archived Tweets mentioning #bog13 from May 1 through May 14 using this script. You'll notice in the code above that I'm no longer archiving this hashtag. I probably need a better way to temporarily add keywords to the search, but I haven't gotten there yet.
Here is the number of Tweets per day during that period. Tweets clearly peaked a couple days into the meeting, with follow-up commentary trailing off quickly after the meeting ended.
Here is the frequency of Tweets by hour, clearly bimodal:
Top hashtags (other than #bog13). Interestingly #bog14 was the most highly used hashtag, so I'm guessing lots of folks are looking forward to next year's meeting. Also, #ashg12 got lots of mentions, presumably because someone presented updated work from last year's ASHG meeting.
Here were the most prolific Tweeps - many of the usual suspects here, as well as a few new ones (new to me at least):
And finally, the requisite wordcloud:
More analysis
If you look in the analysis directory of the repo you'll find plots like these for other keywords (#rstats, metagenomics, rna-seq, and others to come). I would also like to do some sentiment analysis as Neil did in the ISMB post referenced above, but the sentiment package has since been removed from CRAN. I hear there are other packages for polarity analysis, but I haven't yet figured out how to use them. I've given you the code to do the mundane stuff (parsing the fixed-width files from t, for starters). I'd love to see someone take a stab at some further text mining / polarity / sentiment analysis!
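To make the polarity idea concrete, here is a deliberately naive Python sketch of lexicon-based scoring. It is not taken from any CRAN package or from Neil's analysis; the word lists below are hypothetical toy examples, and a real analysis would use a published sentiment lexicon or a trained classifier.

```python
import re

# Hypothetical toy lexicon for illustration only.
POSITIVE = {"great", "love", "excellent", "useful", "awesome"}
NEGATIVE = {"bad", "hate", "broken", "slow", "awful"}


def polarity(text):
    """Return (positive - negative) / (positive + negative) word counts.

    The score ranges from -1.0 to 1.0; text with no lexicon words scores 0.0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    scored = pos + neg
    return (pos - neg) / scored if scored else 0.0
```

Averaging `polarity` over all the tweets for a keyword gives a crude per-keyword sentiment score. Word counting like this ignores negation and sarcasm, so treat it as a starting point at best.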
twitterchive - archive and analyze results from a Twitter search
Getting Genetics Done
Getting Things Done in Genetics & Bioinformatics Research
Wednesday, May 15, 2013
Monday, May 6, 2013
Three Metagenomics Papers for You
A handful of good metagenomics papers have come out over the last few months. Below I've linked to and copied my evaluation of each of these articles from F1000.
...
1. Willner, Dana, and Philip Hugenholtz. "From deep sequencing to viral tagging: Recent advances in viral metagenomics." BioEssays (2013).
My evaluation: This review lays out some of the challenges and recent advances in viral metagenomic sequencing. There is a good discussion of library preparation and how that affects downstream sequencing. Alarmingly, they reference another paper that showed that different amplification methods resulted in detection of a completely different set of viruses (dsDNA viruses with LASL, ssDNA with MDA). The review also discusses many of the data management, analysis, and bioinformatics challenges associated with viral metagenomics.
...
2. Loman, Nicholas J., et al. "A Culture-Independent Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic Escherichia coli O104:H4." JAMA 309.14 (2013): 1502-1510.
My evaluation: This paper is a groundbreaking exploration of the use of metagenomics to investigate and determine the causal organism of an infectious disease outbreak. The authors retrospectively collected fecal samples from symptomatic patients from the 2011 Escherichia coli O104:H4 outbreak in Germany and performed high-throughput shotgun sequencing, followed by a sophisticated analysis to determine the outbreak's causal organism. The analysis included comparing genetic markers from many symptomatic patients' metagenomes with those of healthy controls, followed by de novo assembly of the outbreak strain from the shotgun metagenomic data. This illustrates both the power and the real limitations of using metagenomic approaches for clinical diagnostics. Also see David Relman's synopsis of the study in the same JAMA issue.
...
3. Shakya, Migun, et al. "Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities." Environmental microbiology (2013).
My evaluation: This study set out to compare shotgun metagenomic sequencing to 16S rRNA amplicon sequencing to determine the taxonomic and abundance profiles of mixed community metagenomic samples. Thus far, benchmarking metagenomic methodology has been difficult due to the lack of datasets where the underlying ground truth is known. In this study, the researchers constructed synthetic metagenomic communities consisting of 64 laboratory mixed genome DNAs of known sequence and polymerase chain reaction (PCR)-validated abundance. The researchers then compared metagenomic and 16S amplicon sequencing, using both 454 and Illumina technology, and found that metagenomic sequencing outperformed 16S sequencing in quantifying community composition. The synthetic metagenomes constructed here are publicly available (Gene Expression Omnibus [GEO] accession numbers are given in the manuscript), which represent a great asset to other researchers developing methods for amplicon-based or metagenomic approaches to sequence classification, diversity analysis, and abundance estimation.
Thursday, April 4, 2013
List of Bioinformatics Workshops and Training Resources
I frequently get asked to recommend workshops or online learning resources for bioinformatics, genomics, statistics, and programming. I compiled a list of both online learning resources and in-person workshops (preferentially highlighting those where workshop materials are freely available online):
List of Bioinformatics Workshops and Training Resources
I hope to keep the page above as up-to-date as possible. Below is a snapshot of what I have listed as of today. Please leave a comment if you're aware of any egregious omissions, and I'll update the page above as appropriate.
From http://stephenturner.us/p/edu, April 4, 2013
In-Person Workshops:
Cold Spring Harbor Courses: meetings.cshl.edu/courses.html
Cold Spring Harbor has been offering advanced workshops and short courses in the life sciences for years. Relevant workshops include Advanced Sequencing Technologies & Applications, Computational & Comparative Genomics, Programming for Biology, Statistical Methods for Functional Genomics, the Genome Access Course, and others. Unlike most of the others below, you won't find material from past years' CSHL courses available online.
Canadian Bioinformatics Workshops: bioinformatics.ca/workshops
Bioinformatics.ca, through its Canadian Bioinformatics Workshops (CBW) series, began offering one- and two-week short courses in bioinformatics, genomics and proteomics in 1999. The more recent workshops focus on training researchers using advanced high-throughput technologies on the latest approaches being used in computational biology to deal with the new data. Course material from past workshops is freely available online, including both audio/video lectures and slideshows. Topics include microarray analysis, RNA-seq analysis, genome rearrangements, copy number alteration, network/pathway analysis, genome visualization, gene function prediction, functional annotation, data analysis using R, statistics for metabolomics, and much more.
UC Davis Bioinformatics Training Program: training.bioinformatics.ucdavis.edu
The UC Davis Bioinformatics Training program offers several intensive short bootcamp workshops on RNA-seq, data analysis and visualization, and cloud computing with a focus on Amazon's computing resources. They also offer a week-long Bioinformatics Short Course, covering in-depth the practical theory and application of cutting-edge next-generation sequencing techniques. Every course's documentation is freely available online, even if you didn't take the course.
MSU NGS Summer Course: bioinformatics.msu.edu/ngs-summer-course-2013
This intensive two week summer course will introduce attendees with a strong biology background to the practice of analyzing short-read sequencing data from Illumina and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq. Materials from previous courses are freely available online under a CC-by-SA license.
Genetic Analysis of Complex Human Diseases: hihg.med.miami.edu/edu...
The Genetic Analysis of Complex Human Diseases is a comprehensive four-day course directed toward physician-scientists and other medical researchers. The course will introduce state-of-the-art approaches for the mapping and characterization of human inherited disorders with an emphasis on the mapping of genes involved in common and genetically complex disease phenotypes. The primary goal of this course is to provide participants with an overview of approaches to identifying genes involved in complex human diseases. At the end of the course, participants should be able to identify the key components of a study team, and communicate effectively with specialists in various areas to design and execute a study. The course is in Miami Beach, FL. (Full Disclosure: I teach a section in this course.) Most of the course material from previous years is not available online, but my RNA-seq & methylation lectures are on Figshare.
UAB Short Course on Statistical Genetics and Genomics: soph.uab.edu/ssg/...
Focusing on state-of-the-art methodology to analyze complex traits, this five-day course will offer an interactive program to enhance researchers' ability to understand & use statistical genetic methods, as well as implement & interpret sophisticated genetic analyses. Topics include GWAS Design/Analysis/Imputation/Interpretation; Non-Mendelian Disorders Analysis; Pharmacogenetics/Pharmacogenomics; ELSI; Rare Variants & Exome Sequencing; Whole Genome Prediction; Analysis of DNA Methylation Microarray Data; Variant Calling from NGS Data; RNAseq: Experimental Design and Data Analysis; Analysis of ChIP-seq Data; Statistical Methods for NGS Data; Discovering new drugs & diagnostics from 300 billion points of data. Video recordings from the 2012 course are available online.
MBL Molecular Evolution Workshop: hermes.mbl.edu/education/...
One of the longest-running courses listed here (est. 1988), the Workshop on Molecular Evolution at Woods Hole presents a series of lectures, discussions, and bioinformatic exercises that span contemporary topics in molecular evolution. The course addresses phylogenetic analysis, population genetics, database and sequence matching, molecular evolution and development, and comparative genomics, using software packages including AWTY, BEAST, BEST, Clustal W/X, FASTA, FigTree, GARLI, MIGRATE, LAMARC, MAFFT, MP-EST, MrBayes, PAML, PAUP*, PHYLIP, STEM, STEM-hy, and SeaView. Some of the course materials can be found by digging around the course wiki.
Online Material:
Canadian Bioinformatics Workshops: bioinformatics.ca/workshops
(In person workshop described above). Course material from past workshops is freely available online, including both audio/video lectures and slideshows. Topics include microarray analysis, RNA-seq analysis, genome rearrangements, copy number alteration, network/pathway analysis, genome visualization, gene function prediction, functional annotation, data analysis using R, statistics for metabolomics, and much more.
UC Davis Bioinformatics Training Program: training.bioinformatics.ucdavis.edu
(In person workshop described above). Every course's documentation is freely available online, even if you didn't take the course. Past topics include Galaxy, Bioinformatics for NGS, cloud computing, and RNA-seq.
MSU NGS Summer Course: bioinformatics.msu.edu/ngs-summer-course-2013
(In person workshop described above). Materials from previous courses are freely available online under a CC-by-SA license, which cover mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.
EMBL-EBI Train Online: www.ebi.ac.uk/training/online
Train online provides free courses on Europe's most widely used data resources, created by experts at EMBL-EBI and collaborating institutes. Topics include Genes and Genomes, Gene Expression, Interactions, Pathways, and Networks, and others. Of particular interest may be the Practical Course on Analysis of High-Throughput Sequencing Data, which covers Bioconductor packages for short read analysis, ChIP-Seq, RNA-seq, and allele-specific expression & eQTLs.
UC Riverside Bioinformatics Manuals: manuals.bioinformatics.ucr.edu
This is an excellent collection of manuals and code snippets. Topics include Programming in R, R+Bioconductor, Sequence Analysis with R and Bioconductor, NGS analysis with Galaxy and IGV, basic Linux skills, and others.
Software Carpentry: software-carpentry.org
Software Carpentry helps researchers be more productive by teaching them basic computing skills. We recently ran a 2-day Software Carpentry Bootcamp here at UVA. Check out the online lectures for some introductory material on Unix, Python, Version Control, Databases, Automation, and many other topics.
Coursera: coursera.org/courses
Coursera partners with top universities to offer courses online for anyone to take, for free. Courses are usually 4-6 weeks, and consist of video lectures, quizzes, assignments, and exams. Joining a course gives you access to the course's forum where you can interact with the instructor and other participants. Relevant courses include Data Analysis, Computing for Data Analysis using R, and Bioinformatics Algorithms, among others. You can also view all of Jeff Leek's Data Analysis lectures on YouTube.
Rosalind: http://rosalind.info
Quite different from the others listed here, Rosalind is a platform for learning bioinformatics through gaming-like problem solving. Visit the Python Village to learn the basics of Python. Arm yourself at the Bioinformatics Armory, equipping yourself with existing ready-to-use bioinformatics software tools. Or storm the Bioinformatics Stronghold, implementing your own algorithms for computational mass spectrometry, alignment, dynamic programming, genome assembly, genome rearrangements, phylogeny, probability, string algorithms and others.
Other Resources:
- Titus Brown's list of bioinformatics courses: Includes a few others not listed here (also see the comments).
- GMOD Training and Outreach: GMOD is the Generic Model Organism Database project, a collection of open source software tools for creating and managing genome-scale biological databases. This page links out to tutorials on GMOD Components such as Apollo, BioMart, Galaxy, GBrowse, MAKER, and others.
- Seqanswers.com: A discussion forum for anything related to Bioinformatics, including Q&A, paper discussions, new software announcements, protocols, and more.
- Biostars.org: Similar to SEQanswers, but more strictly a Q&A site.
- BioConductor Mailing list: A very active mailing list for getting help with Bioconductor packages. Make sure you do some Google searching yourself first before posting to this list.
- Bioconductor Events: List of upcoming and prior Bioconductor training and events worldwide.
- Learn Galaxy: Screencasts and tutorials for learning to use Galaxy.
- Galaxy Event Horizon: Worldwide Galaxy-related events (workshops, training, user meetings) are listed here.
- Galaxy RNA-Seq Exercise: Run through a small RNA-seq study from start to finish using Galaxy.
- Rafael Irizarry's YouTube Channel: Several statistics and bioinformatics video lectures.
- PLoS Comp Bio Online Bioinformatics Curriculum: A perspective paper by David B Searls outlining a series of free online learning initiatives for beginning to advanced training in biology, biochemistry, genetics, computational biology, genomics, math, statistics, computer science, programming, web development, databases, parallel computing, image processing, AI, NLP, and more.
- Getting Genetics Done: Shameless plug – I write a blog highlighting literature of interest, new tools, and occasionally tutorials in genetics, statistics, and bioinformatics. I recently wrote this post about how to stay current in bioinformatics & genomics.
Tags: Announcements, Bioinformatics, R, Tutorials
Wednesday, March 27, 2013
Evolutionary Computation and Data Mining in Biology
For over 15 years, members of the computer science, machine
learning, and data mining communities have gathered in a beautiful European
location each spring to share ideas about biologically-inspired
computation. Stemming from the work of
John Holland who pioneered the field of genetic algorithms, multiple approaches
have been developed that exploit the dynamics of natural systems to solve
computational problems. These algorithms
have been applied in a wide variety of fields, and to celebrate and cross-pollinate
ideas from these various disciplines the EvoStar event co-locates five
conferences at the same venue, covering genetic programming (EuroGP),
combinatorial optimization (EvoCOP), music, art, and design (EvoMUSART),
multidisciplinary applications (EvoApplications), and computational biology
(EvoBIO). EvoStar 2013 will be held in
Vienna, Austria on April 3-5, and is always expertly coordinated by the
wonderful Jennifer
Willies from Napier University, UK. Multiple research groups from the US and
Europe will attend to present their exciting work in these areas.
Many problems in bioinformatics and statistical analysis use
what are considered “greedy” algorithms to fit parameters to data – that is,
they settle on a nearby collection of
parameters as the solution and potentially miss a global best solution. This
problem is well-known in the computer science community for toy problems like bin
packing or the knapsack
problem. In human genetics,
related problems are partitioning complex pedigrees or selecting maximally
unrelated individuals from a dataset, and can also appear when maximizing
likelihood equations.
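To see how a greedy strategy can miss the global best solution, consider a small 0/1 knapsack instance. The sketch below uses made-up item values and weights: the greedy rule takes items in order of value per unit weight, while the dynamic-programming version checks every capacity and finds the true optimum.

```python
def greedy_knapsack(items, capacity):
    """Pick items by value-to-weight ratio until nothing more fits."""
    total = 0
    remaining = capacity
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= remaining:
            total += value
            remaining -= weight
    return total


def exact_knapsack(items, capacity):
    """Solve 0/1 knapsack exactly by dynamic programming over capacities."""
    best = [0] * (capacity + 1)
    for value, weight in items:
        # Iterate capacities downward so each item is used at most once.
        for cap in range(capacity, weight - 1, -1):
            best[cap] = max(best[cap], best[cap - weight] + value)
    return best[capacity]


items = [(60, 10), (100, 20), (120, 30)]  # (value, weight), made-up numbers
print(greedy_knapsack(items, 50), exact_knapsack(items, 50))  # prints "160 220"
```

Greedy grabs the two items with the best ratios and then has no room for the third, ending at 160, while the best packing (the second and third items) is worth 220.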
EvoBIO focuses on using biologically-inspired algorithms (like genetic algorithms) to improve performance for many bioinformatics tasks. For example, Stephen and I have both applied these methods for analysis of genetic data using neural networks, and for forward-time genetic data simulation (additional details here).
EvoBIO is very pleased to be sponsored by BMC BioData Mining, a natural partner for this conference. I recently wrote a blog post for BioMed Central about EvoBIO as well. Thanks to their sponsorship, the winner of the EvoBIO best paper award will receive free publication in BioData Mining, and runners-up will receive a 25% discount off the article processing charge.
So, if you are in the mood for a new conference and would like to see and influence some of these creative approaches to data analysis, consider attending EvoStar. We'd love to see you there!
Tuesday, March 19, 2013
Software Carpentry Bootcamp at University of Virginia
A couple of weeks ago I, with the help of others here at UVA, organized a Software Carpentry bootcamp, instructed by Steve Crouch, Carlos Anderson, and Ben Morris. The day before the course started, Charlottesville was racked by nearly a foot of snow, widespread power outages, and many cancelled incoming flights. Luckily our instructors arrived just in time, and power was (mostly) restored shortly before the boot camp started. Despite the conditions, the course was very well-attended.
Software Carpentry's aim is to teach researchers (usually graduate students) basic computing concepts and skills so that they can get more done in less time, and with less pain. They're a volunteer organization funded by Mozilla and the Sloan Foundation, and they ran this two-day bootcamp completely free of charge to us.
The course started out with a head-first dive into Unix and Bash scripting, followed by a tutorial on automation with Make, concluding the first day with an introduction to Python. The second day covered version control with git, Python code testing, and wrapped up with an introduction to databases and SQL. At the conclusion of the course, participants offered near-universal positive feedback, with the git and Make tutorials being exceptionally popular.
Software Carpentry's approach to teaching these topics is unlike many others that I've seen. Rather than lecturing on for hours, the instructors inject very short (~5 minute) partnered exercises between every ~15 minutes of instruction in 1.5-hour sessions. With two full days of intensive instruction and your computer in front of you, it's all too easy to get distracted by an email, get lost in your everyday responsibilities, and zone out for the rest of the session. The exercises keep participants paying attention and accountable to their partner.
All of the bootcamp's materials are freely available:
Unix and Bash: https://github.com/redcurry/bash_tutorial
Python Introduction: https://github.com/redcurry/python_tutorial
Git tutorial: https://github.com/redcurry/git_tutorial
Databases & SQL: https://github.com/bendmorris/swc_databases
Everything else: http://users.ecs.soton.ac.uk/stc/SWC/tutorial-materials-virginia.zip
Perhaps more relevant to a broader audience are the online lectures and materials available on the Software Carpentry Website, which include all the above topics, as well as many others.
We capped the course at 50, and had 95 register within a day of opening registration, so we'll likely do this again in the future. I sit in countless meetings where faculty lament how nearly all basic science researchers enter grad school or their postdoc woefully unprepared for this brave new world of data-rich high-throughput science. Self-paced online learning works well for some, but if you're in a department or other organization that could benefit from a free, on-site, intensive introduction to the topics listed above, I highly recommend contacting Software Carpentry and organizing your own bootcamp.
Finally, when organizing an optional section of the course, we let participants vote whether they preferred learning number crunching with NumPy, or SQL/databases; SQL won by a small margin. However, Katherine Holcomb in UVACSE has graciously volunteered to teach a two-hour introduction to NumPy this week, regardless of whether you participated in the boot camp (although some basic Python knowledge is recommended). This (free) short course is this Thursday, March 21, 2-4pm, in the same place as the bootcamp (Brown Library Classroom in Clark Hall). Sign up here.
Tags:
Announcements,
Conferences,
Databases,
Recommended Reading,
Software,
SQL,
Tutorials
Monday, March 4, 2013
Comparing Sequence Classification Algorithms for Metagenomics
Metagenomics is the study of DNA collected from environmental samples (e.g., seawater, soil, acid mine drainage, the human gut, sputum, pus, etc.). While traditional microbial genomics typically means sequencing a pure cultured isolate, metagenomics involves taking a culture-free environmental sample and sequencing a single gene (e.g. the 16S rRNA gene), multiple marker genes, or shotgun sequencing everything in the sample in order to determine what's there.
A challenge in shotgun metagenomics analysis is the sequence classification problem: given a sequence, what is its origin? For example, did this sequence read come from E. coli or some other enteric bacterium? Note that sequence classification does not involve genome assembly - sequence classification is done on unassembled reads. If you could perfectly classify the origin of every sequence read in your sample, you would know exactly what organisms are in your environmental sample and how abundant each one is.
The solution to this problem isn't simply BLAST'ing every sequence read that comes off your HiSeq 2500 against NCBI nt/nr. The computational cost of this BLAST search would be many times more expensive than the sequencing itself. There are many algorithms for sequence classification. This paper examines a wide range of the available algorithms and software implementations for sequence classification as applied to metagenomic data:
Bazinet, Adam L., and Michael P. Cummings. "A comparative evaluation of sequence classification programs." BMC Bioinformatics 13.1 (2012): 92.
In this paper, the authors comprehensively evaluated the performance of over 25 programs that fall into three categories: alignment-based, composition-based, and phylogeny-based. For illustrative purposes, the authors constructed a "phylogenetic tree" that shows how each of the 25 methods they evaluated are related to each other:
Figure 1: Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes.
The performance evaluation was done on several different datasets where the composition was known, using a similar set of evaluation criteria (sensitivity = number of correct assignments / number of sequences in the data; precision = number of correct assignments / number of assignments made). They concluded that the performance of particular methods varied widely between datasets for reasons like highly variable taxonomic composition and diversity, level of sequence representation in underlying databases, read lengths, and read quality. The authors specifically point out that methods lacking sensitivity (as they've defined it) can still be useful because they have high precision. For example, marker-based approaches (like Metaphyler) might only classify a small number of reads, but they're highly precise, and may still be enough to accurately recapitulate organismal distribution and abundance.
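As a small illustration of those two definitions, the R sketch below scores two made-up sets of read assignments against a known composition. The function name classification_metrics, the toy reads, and the use of NA to mean "no assignment made" are assumptions for illustration, not anything taken from the paper.

```r
# Sensitivity and precision as defined above:
#   sensitivity = correct assignments / sequences in the data
#   precision   = correct assignments / assignments made
# An NA in `assigned` is treated as "no assignment made" (an assumption here).
classification_metrics <- function(truth, assigned) {
  made <- !is.na(assigned)
  correct <- made & assigned == truth
  c(sensitivity = sum(correct) / length(truth),
    precision   = sum(correct) / sum(made))
}

# Ten hypothetical reads with known origin.
truth <- rep(c("E. coli", "B. subtilis"), each = 5)

# A cautious, marker-style classifier: assigns only 4 reads, all correctly.
cautious <- c("E. coli", "E. coli", NA, NA, NA,
              "B. subtilis", "B. subtilis", NA, NA, NA)

# An eager classifier: assigns every read, but gets 3 of them wrong.
eager <- c("E. coli", "E. coli", "E. coli", "B. subtilis", "B. subtilis",
           "B. subtilis", "B. subtilis", "B. subtilis", "B. subtilis", "E. coli")

cautious_metrics <- classification_metrics(truth, cautious)
eager_metrics <- classification_metrics(truth, eager)
print(cautious_metrics)
print(eager_metrics)
```

In this toy case the cautious classifier has lower sensitivity (0.4 versus 0.7) but perfect precision (1.0 versus 0.7), which is the trade-off the authors describe for marker-based approaches.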
Importantly, the authors note that you can't ignore computational requirements, which varied by orders of magnitude between methods. Selection of the right method depends on the goals (is sensitivity or precision more important?) and the available resources (time and compute power are never infinite - these are tangible limitations that are imposed in the real world).
This paper was first received at BMC Bioinformatics a year ago, and since then many new methods for sequence classification have been published. Further, this paper only evaluates methods for classification of unassembled reads, and does not evaluate methods that rely on metagenome assembly (that's the subject of another much longer post, but check out Titus Brown's blog for lots more on this topic).
Overall, this paper was a great demonstration of how one might attempt to evaluate many different tools ostensibly aimed at solving the same problem but functioning in completely different ways.
Wednesday, February 20, 2013
NetGestalt for Data Visualization in the Context of Pathways
Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.
NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.
NetGestalt provides a very nice interface for interacting with data. Extensive documentation on how to use it can be found here. Bing and his colleagues also went the extra mile to create video tutorials on how to use their web tool, and walk you through an analysis of some tumor data.
http://www.netgestalt.org/
Tags:
Pathways,
Tutorials,
Visualization,
Web Apps
Tuesday, February 12, 2013
"Document Design and Purpose, Not Mechanics"
If you ever write code for scientific computing (chances are you do if you're here), stop what you're doing and spend 8 minutes reading this open-access paper:
Wilson et al. Best Practices for Scientific Computing. arXiv:1210.0530 (2012). (Direct link to PDF).
The paper makes a number of good points regarding software as a tool just like any other lab equipment: it should be built, validated, and used as carefully as any other physical instrumentation. Yet most scientists who write software are self-taught, and haven't been properly trained in fundamental software development skills.
The paper outlines ten practices every computational biologist should adopt when writing code for research computing. Most of these are the usual suspects that you'd probably guess - using version control, workflow management, writing good documentation, modularizing code into functions, unit testing, agile development, etc. One that particularly jumped out at me was the recommendation to document design and purpose, not mechanics.
We all know that good comments and documentation are critical for code reproducibility and maintenance, but inline documentation that recapitulates the code is hardly useful. Instead, we should aim to document the underlying ideas, interface, and reasons, not the implementation.
For example, the following commentary is hardly useful:
# Increment the variable "i" by one.
i = i+1
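By contrast, a comment that records why the line exists tells the reader something the code cannot. A minimal sketch in R, where the export format and the variable names are hypothetical:

```r
# Hypothetical instrument export: a header line followed by data records.
export_lines <- c("sample_id,value", "s1,0.5", "s2,0.7")

i <- 1
# The export always starts with a header line, so the first real record
# is on line 2. Skip ahead before we start parsing.
i <- i + 1

first_record <- export_lines[i]
print(first_record)
```

The increment itself is identical; only the comment changed, from restating the mechanics to stating the purpose.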
The real recommendation here is that if your code requires such substantial documentation of the actual implementation to be understandable, it's better to spend the time rewriting the code rather than writing a lengthy description of what it does. I'm very guilty of doing this with R code, nesting multiple levels of functions and vector operations:
# It would take a paragraph to explain what this is doing.
# Better to break up into multiple lines of code.
sapply(data.frame(n=sapply(x, function(d) sum(is.na(d)))), function(dd) mean(dd))
It would take much more time to properly document what this is doing than it would take to split the operation into manageable chunks over multiple lines such that the code no longer needs an explanation. We're not playing code golf here - using fewer lines doesn't make you a better programmer.
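As a sketch of what that split might look like, assuming x is a data frame (the toy data and variable names below are invented for illustration), the one-liner reduces to two named steps:

```r
# Toy data frame with a few missing values (hypothetical data).
x <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# The original one-liner, kept only for comparison.
one_liner <- sapply(data.frame(n = sapply(x, function(d) sum(is.na(d)))),
                    function(dd) mean(dd))

# Step 1: count the missing values in each column of x.
na_per_column <- colSums(is.na(x))

# Step 2: average those counts across columns.
mean_na_per_column <- mean(na_per_column)

print(na_per_column)
print(mean_na_per_column)
```

With intermediate names like na_per_column, the code reads as its own explanation, and each step can be inspected or tested separately.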
Tags:
Bioinformatics,
R,
Recommended Reading