Latest topics



Last update

1 hour 14 min ago

September 19, 2014


Brian Foley wrote:

A user in another phylogenetics discussion group today had a question about analyzing more than 100 sequences, each more than 80,000 bases long, all from one gene. This led me to assume the sequences were from closely related organisms, because otherwise the introns could be too diverse to align while the exons were still alignable. This made me wonder: if we have 100 very long sequences from a single species of mammal (for example, humans sampled around the world), what types of tests can be done to look for recombination, and how can we measure the phylogenetic signal-to-noise ratio in the data? The consistency index and retention index are two useful measurements, but I rarely see them reported for data sets, and most phylogenetic software packages do not compute them and display them with the results.
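For readers who want to report the two indices mentioned above, here is a minimal sketch of the ensemble versions, assuming you already have per-character step counts from a parsimony program. The function names and the three input lists are hypothetical: `min_steps` is the minimum conceivable number of changes for each character, `obs_steps` is the number of changes on the tree being evaluated, and `max_steps` is the maximum conceivable number of changes on any tree. CI is the ratio of minimum to observed steps, and RI is (max - observed) / (max - min).

```python
def consistency_index(min_steps, obs_steps):
    """Ensemble CI: total minimum steps divided by total observed steps."""
    observed = sum(obs_steps)
    if observed == 0:
        raise ValueError("no observed changes; CI is undefined")
    return sum(min_steps) / observed


def retention_index(min_steps, obs_steps, max_steps):
    """Ensemble RI: (G - S) / (G - M) summed over all characters."""
    m, s, g = sum(min_steps), sum(obs_steps), sum(max_steps)
    if g == m:
        raise ValueError("max equals min steps; RI is undefined")
    return (g - s) / (g - m)


# Toy step counts for three characters (illustrative values only).
min_steps = [1, 1, 2]
obs_steps = [1, 2, 3]
max_steps = [3, 4, 5]
ci = consistency_index(min_steps, obs_steps)
ri = retention_index(min_steps, obs_steps, max_steps)
```

Values near 1 indicate little homoplasy on the tree; the step counts themselves still have to come from a parsimony analysis.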

Posts: 1

Participants: 1


September 11, 2014


Alex Jeffries wrote:

I'm about to run my Final Year BSc (hons) molecular phylogenetics unit again and am looking for some inspiration. In the past I have used Trypanosome genes as a dataset for the coursework exercises, but it's starting to feel stale to me. Does anyone have any suggestions for an interesting phylogenetic question and dataset that would allow students to collect sequences, align, make inferences and thereby test some sort of hypothesis? Preferably (and this is the hard bit) not previously published so they can't just crib from the papers.

Many thanks in advance.

Posts: 5

Participants: 3


September 7, 2014


Krzysztof M. Kozak wrote:

Dear All,

I am building a pipeline to automatically generate gene trees for about 10,000 CDS alignments (all genes from an exome). The genes were sequenced for 150 individuals in multiple species. Some individuals are worse than others and occasionally have little data in some alignments, and end up on obviously artificially inflated branches. Is anyone aware of a tool to prune those automatically? (I will also use tools to get rid of poor sequence first, but that's a different topic.)

Many thanks, Krzysztof Kozak
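One simple way to approach the pruning question is to flag tips whose terminal branch is far longer than the typical terminal branch in the same gene tree. The sketch below assumes the terminal branch lengths have already been extracted from each tree into a dictionary; the function name `flag_long_tips` and the cutoff `factor` are hypothetical choices for illustration, not part of any existing tool.

```python
from statistics import median


def flag_long_tips(tip_lengths, factor=5.0):
    """Return tip names whose terminal branch exceeds factor * median length.

    tip_lengths maps tip name -> terminal branch length for one gene tree.
    """
    if not tip_lengths:
        return []
    cutoff = factor * median(tip_lengths.values())
    return sorted(name for name, length in tip_lengths.items() if length > cutoff)


# Toy example: one individual sits on a much longer branch than the others.
lengths = {"ind1": 0.010, "ind2": 0.012, "ind3": 0.011, "ind4": 0.500}
suspects = flag_long_tips(lengths)
```

The flagged names could then be passed to whatever tree library the pipeline already uses for the actual pruning. A median-based cutoff is only one possible criterion, and the factor would need tuning against trees where the artefacts are known.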

Posts: 3

Participants: 3


August 17, 2014


Db60 wrote:

Does anyone know a good way to convert the alignment within a BEAST XML file to PHYLIP (or Nexus, Fasta, etc)?

I could script it myself, but I assume the problem has already been addressed by others.

I realise the non-sequence data within the BEAST XML file will be lost, but for my purposes that's OK.

Thank you very much, in advance.

Daniel Barker
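A minimal sketch of such a conversion to FASTA, using only the Python standard library. It rests on two assumptions about the XML layout that should be checked against your own file: that BEAST 1 style files store each sequence as text inside a `<sequence>` element with a `<taxon idref="..."/>` child, and that BEAST 2 style files store it in `taxon` and `value` attributes on `<sequence>`. The function name `beast_xml_to_fasta` is hypothetical.

```python
import xml.etree.ElementTree as ET


def beast_xml_to_fasta(xml_text):
    """Extract (taxon, sequence) pairs from BEAST XML and return FASTA text."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("sequence"):
        if seq.get("taxon") is not None and seq.get("value") is not None:
            # Assumed BEAST 2 layout: <sequence taxon="t1" value="ACGT"/>
            name, residues = seq.get("taxon"), seq.get("value")
        else:
            # Assumed BEAST 1 layout: <sequence><taxon idref="t1"/>ACGT</sequence>
            taxon = seq.find("taxon")
            if taxon is None:
                continue
            name = taxon.get("idref") or taxon.get("id")
            residues = "".join(seq.itertext())
        # Drop line breaks and indentation inside the sequence text.
        residues = "".join(residues.split())
        records.append((name, residues))
    return "".join(f">{name}\n{residues}\n" for name, residues in records)


example = (
    '<beast><alignment id="alignment" dataType="nucleotide">'
    '<sequence><taxon idref="t1"/>\n  ACGT\n</sequence>'
    '<sequence><taxon idref="t2"/>AC-T</sequence>'
    "</alignment></beast>"
)
fasta = beast_xml_to_fasta(example)
```

To convert a file, read it with `open(path).read()` and write the returned string out. Files with several alignment blocks would have all their sequences merged into one output, so those need to be split by hand.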

Posts: 2

Participants: 2


August 14, 2014


Scott Handley wrote:

Hello Phylobabble community!

I am assisting in the organization of a Workshop on Molecular Evolution, which will be held in Český Krumlov, Czech Republic in January 2015. I have helped to organize this event before, but this year we are renewing the program and I am working with several new people to design something that we believe will be of interest to many in the phylogenetics/molecular evolution communities. More details below!

We also organize a Workshop on Genomics immediately prior to the Molecular Evolution Workshop for those interested in those sorts of topics:

2015 Workshop on Molecular Evolution, Český Krumlov, Czech Republic

Dates: 25 January - 7 February, 2015

Application Deadline: 15 October, 2014 is the preferred application deadline, after which time people will be admitted to the course following application review by the admissions committee. However, later applications will certainly be considered for admittance or for placement on a waiting list.

Registration Fee: $1500 USD. Fee includes opening reception and access to all course material, but does not include other meals or housing. Special discounted pricing has been arranged for hotels, pensions and hostels. Information regarding housing and travel will be made available to applicants following acceptance.


Useful Links: Direct Link to the Full Workshop Schedule: General Workshop information: Frequently Asked Questions (FAQ) about the Workshop and Český Krumlov can be found here:

Workshop Overview:

The 2015 Workshop on Molecular Evolution brings together an international collection of faculty members and Workshop participants to study and discuss current ideas and techniques for exploring molecular evolution. The Workshop on Molecular Evolution consists of a series of lectures, demonstrations and computer laboratories that cover theoretical and conceptual aspects of molecular evolution with a strong emphasis on data analysis.

The Workshop has a strong focus on molecular phylogenetics, and covers all aspects of phylogenetic workflows, including marker selection, phylogeny reconstruction, time-calibration, as well as detection of natural selection, phylogeography, diversification rates, and trait evolution patterns. A majority of the schedule is dedicated to hands-on learning activities designed by faculty and the workshop team. This interactive experience provides Workshop participants with the practical experience required to meet the challenges presented by modern evolutionary sciences.

Co-directors: Walter Salzburger, Michael Matschiner, Jan Stefka and Scott Handley

For more information and online application see the Workshop web site -

Posts: 1

Participants: 1


July 13, 2014


Guanyang Zhang wrote:

Are there phylogenetic comparative methods/software for testing correlation between characters that have multiple states? BayesTraits only deals with binary characters.

Posts: 3

Participants: 3


July 11, 2014


Jaime Huerta-Cepas wrote:

I have a phylogenetic hypothesis that I would like to test statistically. Although the best Bayesian and ML trees support that hypothesis, bootstrap values and posterior probabilities are far from great, so I followed the advice given to me in this forum about testing all possible alternatives to see if I could statistically rule them out.

For this I used CONSEL to evaluate over a thousand alternative constraint topologies. All but 10 of the accepted topologies are compatible with my hypothesis using the AU test (pvalue

July 8, 2014


Brian Foley wrote:

This paper is rather specific to HIV-1 with its very large population size within each infected individual, and rapid evolution rate. It would be interesting to see similar work with other organisms. Human Influenza A virus, for example, has an evolution rate very similar to HIV-1 but a very different transmission rate between infected individuals.

Posts: 1

Participants: 1


July 3, 2014


Trevor Bedford wrote:

Andreas Wagner has a new paper on analyzing influenza sequence data using a super simple Hamming-distance network-based approach. A genotype network reveals homoplastic cycles of convergent evolution in influenza A (H3N2) haemagglutinin. A Wagner, Proceedings. Biological sciences / The Royal Society, Jul 7 2014

Networks of evolving genotypes can be constructed from the worldwide time-resolved genotyping of pathogens like influenza viruses. Such genotype networks are graphs where neighbouring vertices (viral strains) differ in a single nucleotide or amino acid. A rich trove of network analysis methods can help understand the evolutionary dynamics reflected in the structure of these networks. Here, I analyse a genotype network comprising hundreds of influenza A (H3N2) haemagglutinin genes. The network is rife with cycles that reflect non-random parallel or convergent (homoplastic) evolution. These cycles also show patterns of sequence change characteristic for strong and local evolutionary constraints, positive selection and mutation-limited evolution. Such cycles would not be visible on a phylogenetic tree, illustrating that genotype network analysis can complement phylogenetic analyses. The network also shows a distinct modular or community structure that reflects temporal more than spatial proximity of viral strains, where lowly connected bridge strains connect different modules. These and other organizational patterns illustrate that genotype networks can help us study evolution in action at an unprecedented level of resolution.

He ends up with plots like:

[Image: network.png]

Fundamentally non-phylogenetic, this approach doesn't try to reconstruct evolutionary history, but instead shows a simple overview of genetic relationships. Andreas suggests that these graphs make it easy to detect convergent evolution that would not be apparent in the strictly branching tree.
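To make the construction concrete, here is a minimal sketch of a genotype network as described in the abstract: each distinct sequence is a vertex, and two vertices are joined when they differ at exactly one site. This is an illustration under the assumption of equal-length, already aligned sequences, not the code used in the paper; the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def genotype_network(sequences):
    """Map each distinct genotype to the set of genotypes one change away."""
    genotypes = sorted(set(sequences))
    neighbours = {g: set() for g in genotypes}
    for a, b in combinations(genotypes, 2):
        if hamming(a, b) == 1:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


# Four toy genotypes that form a cycle: AA-AC-CC-CA-AA.
network = genotype_network(["AA", "AC", "CA", "CC", "AA"])
```

In the toy example the four genotypes form a square, so CC can be reached from AA along two different paths. A strictly branching tree has to break one side of that square, which is the kind of cycle the paper reads as a signature of parallel or convergent change. The all-pairs comparison is quadratic in the number of genotypes, which is fine for hundreds of sequences but would need a smarter neighbour search for much larger sets.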

I don't have a good intuition for how these sorts of graphs translate to trees and vice versa. Does this seem like it's a useful addition to constructing a tree or more of a distraction?

Posts: 3

Participants: 2


July 2, 2014


Miao Sun wrote:

Hi geniuses,

A real headache has been haunting me recently, and I need your insightful suggestions to help me out:

I have a large matrix, about 12,000 taxa, and the data is formatted as below:


So, using information like "taxa name or gi No.", how can I get the corresponding accession number for each taxon in this large matrix via the NCBI website as a batch job?

Any ideas or experience to share?
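One possible route is the NCBI E-utilities `efetch` endpoint, which can return accession numbers as plain text when called with `rettype=acc`. The sketch below assumes the GI numbers have already been parsed out of the matrix into a list of strings (the matrix format is not shown here, so that step is left out), and that the records live in the `nuccore` database. The helper names and the batch size of 200 are hypothetical choices.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(ids, size=200):
    """Yield successive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_url(ids, db="nuccore"):
    """Build an efetch URL that asks for accession numbers as plain text."""
    query = {"db": db, "id": ",".join(ids), "rettype": "acc", "retmode": "text"}
    return EFETCH + "?" + urlencode(query)


def fetch_accessions(ids, db="nuccore", size=200):
    """Fetch accessions batch by batch (makes network requests when called)."""
    accessions = []
    for chunk in batches(ids, size):
        with urlopen(efetch_url(chunk, db)) as response:
            accessions.extend(response.read().decode().split())
    return accessions


# Building the URL needs no network access; the GI values here are made up.
url = efetch_url(["12345", "67890"])
```

For 12,000 IDs it would be sensible to pause between batches and follow NCBI's usage guidelines. Pairing each returned accession back to its input ID is not handled here and should be verified rather than assumed from the output order.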



Posts: 6

Participants: 3


June 30, 2014


Craig Nelson wrote:

This is as much a moral rumination as a call for opinions and guidance. How can we better practically resolve taxa as amplicon surveys grow out of control? Can placement algorithms replace identity binning? Should they?

Sequence reads derived from environmental surveys of phylogenetic marker gene amplicons, such as 16S rRNA, CO1, etc. are typically “clustered” (Uclust, CD-Hit, mothur) to form operational taxonomic units (OTUs) after alignment to a reference database and before subsequent phylogenetic analysis or classification. In the widespread application of these genes (molecular “clocks”), this creates problems of cohesion across studies and sequencing platforms (and even run-to-run) because OTUs are internally defined by local neighbors.

Because ecology is important (right?), identifying ecologically and evolutionarily meaningful OTUs has become important, in the microbial world now often described as finding “Ecotypes” of broad clades of microbes. This is impossible with databases, where huge diverse groups are often lumped with a code derived from a single clone picked decades ago. Nonetheless, many marker genes have robust, curated databases, and the problem becomes one of annotation. Comparing organisms across studies has become a real problem in my field of aquatic microbiology. We have lots of groups independently "naming" clades, and lots of reference libraries for marker genes (especially 16S), but read binning by identity is dataset dependent and it can be hard to maintain continuity through time or quickly determine if two groups are talking about the same organism.

I am particularly interested in stabilizing this trajectory in time-series work by using placement to assign reads to nodes of a reference tree. Frankly, in practice this is now the computationally practical option, because binning can be so slow as datasets grow. I'd guess a reference tree should be robustly calculated (ML) with backbone constraints from a curated MSA database, and possibly initially expanded using existing sequence libraries from previous work in the ecosystem in question (or analogues) to establish un-curated clades that are relevant locally. Subsequent amplicon surveys could then “classify” reads according to placements (using pplacer, for example @ematsen ) and nodes could serve as stable, reference-able, visualizable, (expandable?), classifiable taxonomic units.

I’ve been working for the last year to derive a robust reference alignment from databases and structure a classified, constrained tree and workflow for curation of marker gene survey outputs within the pplacer ecosystem (pplacer/guppy/rppr/ and especially taxtastic). Importantly, we wouldn't be progressively adding sequences to a tree. We would just be allowing for stable nodal annotation of reads as a short-term way of detecting ecologically meaningful differential placements. In essence our goal is to "classify" to node rather than database annotations.

One topic of discussion would be nodal “Assignment”. In the context of pplacer, it would be nice to get an alternate "unambiguous" placement for a given pquery, which is the most derived node for which likelihood weight is above some threshold (say 70%). This seems like a reasonable option to incorporate into pplacer: if a pquery has an unacceptably low pointmass likelihood weight, reassign it to a basal node monophyletic for the placement using an LCA-like algorithm (already in use for your classification algorithms). Giovannoni's group at OSU has taken an approach to this by modifying pplacer placements with the BioPerl script LCA, which basically re-attaches pqueries to common basal nodes when placements are unacceptably "Fuzzy" as a single pointmass (their group calls this pipeline "Phylotyper"; Vergin et al. 2013, ISME Journal).
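The assignment rule described above can be sketched as follows. This is not pplacer's API or the Phylotyper code; it is a toy illustration that assumes the reference tree is given as a child-to-parent dictionary and that each placement of a pquery has been reduced to a node name (the node below the placement edge) with its likelihood weight. The function names are hypothetical.

```python
def subtree_mass(parent, placements):
    """Sum likelihood weight over each node's subtree, including the node."""
    mass = {}
    for node, weight in placements.items():
        current = node
        while current is not None:
            mass[current] = mass.get(current, 0.0) + weight
            current = parent.get(current)
    return mass


def depth(parent, node):
    """Number of edges between a node and the root."""
    steps = 0
    while parent.get(node) is not None:
        node = parent[node]
        steps += 1
    return steps


def assign_node(parent, placements, threshold=0.7):
    """Most derived node whose subtree holds at least `threshold` of the weight."""
    mass = subtree_mass(parent, placements)
    candidates = [node for node, total in mass.items() if total >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda node: depth(parent, node))


# Toy tree: root -> (A -> (B, C), D). The pquery is split between B, C and D.
parent = {"root": None, "A": "root", "B": "A", "C": "A", "D": "root"}
fuzzy = assign_node(parent, {"B": 0.5, "C": 0.3, "D": 0.2})
sharp = assign_node(parent, {"B": 0.9, "C": 0.1})
```

In the toy tree the fuzzy pquery is assigned to the internal node A, whose subtree holds 0.8 of the weight, while the sharp one stays on B. With a threshold above 0.5 the qualifying nodes always lie on a single root-to-tip path, so the deepest one is unique; at lower thresholds ties between sister clades become possible and would need an explicit rule.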

Another discussion point that would be really useful is some criteria for determining if pqueries are likely to belong to a "new clade" that isn't well-represented in the tree. Said differently, it would be nice if post-hoc analyses on a placement mass could suggest whether the tree needs to be expanded with additional reference sequences to resolve/accommodate new subclades in specific regions. Are any of the existing metrics (adcl, edpl) useful for quantifying this likelihood? Would there be a way to make a "new node" in a refpkg during the placement process if some critical mass of sequences was attaching to a basal node in a clade with better likelihood than more derived nodes?


Posts: 7

Participants: 4


June 26, 2014


Erick Matsen wrote:

Efficient Continuous-Time Markov Chain Estimation

Monir Hajiaghayi, Bonnie Kirkpatrick, Liangliang Wang, Alexandre Bouchard-Côté

Many problems of practical interest rely on Continuous-time Markov chains (CTMCs) defined over combinatorial state spaces, rendering the computation of transition probabilities, and hence probabilistic inference, difficult or impossible with existing methods. For problems with countably infinite states, where classical methods such as matrix exponentiation are not applicable, the main alternative has been particle Markov chain Monte Carlo methods imputing both the holding times and sequences of visited states. We propose a particle-based Monte Carlo approach where the holding times are marginalized analytically. We demonstrate that in a range of realistic inferential setups, our scheme dramatically reduces the variance of the Monte Carlo approximation and yields more accurate parameter posterior approximations given a fixed computational budget. These experiments are performed on both synthetic and real datasets, drawing from two important examples of CTMCs having combinatorial state spaces: string-valued mutation models in phylogenetics and nucleic acid folding pathways.

The first important thing is to figure out how to calculate the transition probability of an x to a y given that some change occurs in the case when the state space is very big. String-valued processes fall in this category, for example. They bias things with a potential:

[Image: pasted image]
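For orientation, here is what "the probability of going from x to y given that some change occurs" looks like in the textbook finite-state case, where the rate matrix Q can be written down in full. This is only the standard embedded jump chain, not the potential-based construction from the paper, which targets state spaces too large to enumerate. The function name is hypothetical.

```python
def jump_probabilities(Q):
    """Embedded jump chain of a finite CTMC: P[x][y] = Q[x][y] / -Q[x][x].

    Q is a square rate matrix (list of lists) whose rows sum to zero.
    A state with zero exit rate is treated as absorbing.
    """
    P = []
    for x, row in enumerate(Q):
        exit_rate = -row[x]
        if exit_rate <= 0:
            # Absorbing state: no change ever occurs, so it stays put.
            P.append([1.0 if y == x else 0.0 for y in range(len(row))])
        else:
            P.append([0.0 if y == x else rate / exit_rate
                      for y, rate in enumerate(row)])
    return P


# Toy three-state rate matrix; the last state is absorbing.
Q = [[-3.0, 1.0, 2.0],
     [1.0, -1.0, 0.0],
     [0.0, 0.0, 0.0]]
P = jump_probabilities(Q)
```

Each row of `P` gives the distribution over the next state once a jump happens; how long the chain waits before jumping is a separate quantity, which is the part the paper marginalizes analytically.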

Second, one needs to marginalize out the event (i.e. jump) times. This is done by constructing a CTMC such that the difficult part of the marginalization is the transition probabilities of the CTMC:

Alexandre Bouchard-Côté does fantastic work. H/T @cmccoy.

Posts: 1

Participants: 1



Rob Lanfear wrote:

This post is about a recent Drosophila phylogeny published in MPE, a critique of that paper, and whether MPE has done enough by just publishing the critique. Opinions welcome.

The original paper presented new data and a new tree of Drosophilidae. Obviously lots of people care about this tree since it encompasses some of the best-studied model organisms we've got. It's been cited 23 times since 2012 according to Google Scholar. Here's the original:

Increasing the data size to accurately reconstruct the phylogenetic relationships between nine subgroups of the Drosophila melanogaster species group (Drosophilidae, Diptera). Yang Y, Hou ZC, Qian YH, Kang H, Zeng QT.

A critique has just been published, showing lots of issues with the original analysis (full disclosure - I have published with the first author of this critique, although I had nothing to do with this critique and hadn't read it until today). Here's the critique:

Problems with data quality in the reconstruction of evolutionary relationships in the Drosophila melanogaster species group: Comments on Yang et al 2012. Catullo RA, Oakeshott JG.

In short - they found many issues with the data in the ms (problems with ~150 of the ~800 sequences), and couldn't replicate the original results. Most worryingly, they show at least one example where this published tree may have already led to incorrect inferences in a published comparative study that relied on the tree.

What seems odd to me is that although the critique seems fairly damning, nothing has changed on the original paper. My understanding was that this is what the COPE guidelines were for:

While there is no evidence of fraud here, if you take the critique at face value then there were a lot of mistakes in the original article and the validity of the results is certainly in question.

MPE is a premier venue for publishing trees, and it would be nice to think they were committed to their publications being accurate. So I'd be interested to hear others' opinions on this paper and the critique. Specifically, has MPE done enough here by just publishing the critique? Should they issue a correction, an expression of concern, or worse for the original article? Or should the original article stand unchanged despite the critique?



Posts: 3

Participants: 2


June 25, 2014


Jaime Huerta-Cepas wrote:

We have just released the first beta version of ETE-NPR.

The software is intended for Nested Phylogenetic Reconstruction (NPR) and workflow design. It works as a wrapper to all the necessary steps and programs used in common phylogenetic and phylogenomic pipelines, from input parsing to final image generation.

This is still a work in progress and we will be happy to get any feedback.

Posts: 1

Participants: 1
