Latest topics



August 17, 2014


Db60 wrote:

Does anyone know a good way to convert the alignment within a BEAST XML file to PHYLIP (or Nexus, Fasta, etc)?

I could script it myself, but I assume the problem has already been addressed by others.

I realise the non-sequence data within the BEAST XML file will be lost, but for my purposes that's OK.

Thank you very much, in advance.

Daniel Barker
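
For anyone who does end up scripting it, here is a minimal sketch using only the Python standard library. It assumes the two layouts I believe are typical: BEAST 1 style, where each `<sequence>` holds a `<taxon idref="..."/>` child followed by the residues as text, and BEAST 2 style, where `<sequence>` carries `taxon` and `value` attributes. The function name `beast_xml_to_fasta` is made up for illustration. Files with several alignment blocks (partitions) will yield one record per `<sequence>` element, so taxon names may repeat.

```python
import xml.etree.ElementTree as ET


def beast_xml_to_fasta(xml_text):
    """Return FASTA text for every <sequence> element in a BEAST XML string."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("sequence"):
        # Assumed BEAST 2 layout: <sequence taxon="A" value="ACGT"/>
        name = seq.get("taxon")
        residues = seq.get("value")
        if name is None:
            # Assumed BEAST 1 layout: <sequence><taxon idref="A"/>ACGT</sequence>
            taxon = seq.find("taxon")
            if taxon is None:
                continue  # no usable label, skip this element
            name = taxon.get("idref") or taxon.get("id")
        if residues is None:
            # Collect the element text, including text after the <taxon/> child
            residues = "".join(seq.itertext())
        residues = "".join(residues.split())  # drop whitespace and line breaks
        records.append(">%s\n%s" % (name, residues))
    return "\n".join(records) + "\n"
```

To use it on a file, read the file into a string and write the returned text to a `.fasta` file; most alignment tools can then convert FASTA to PHYLIP or Nexus.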

Posts: 2

Participants: 2

Read full topic

August 14, 2014


Scott Handley wrote:

Hello Phylobabble community!

I am assisting in the organization of a Workshop on Molecular Evolution which will be held in Cesky Krumlov, Czech Republic in January 2015. I have helped to organize this event before, but this year we are renewing the program and I am working with several new people to design something that we believe will be of interest to many in the phylogenetics/molecular evolution communities. More details below!

We also organize a Workshop on Genomics immediately prior to the Molecular Evolution Workshop for those interested in those sorts of topics:

2015 Workshop on Molecular Evolution, Český Krumlov, Czech Republic

Dates: 25 January - 7 February, 2015

Application Deadline: 15 October, 2014 is the preferred application deadline, after which time people will be admitted to the course following application review by the admissions committee. However, later applications will certainly be considered for admittance or for placement on a waiting list.

Registration Fee: $1500 USD. Fee includes opening reception and access to all course material, but does not include other meals or housing. Special discounted pricing has been arranged for hotels, pensions and hostels. Information regarding housing and travel will be made available to applicants following acceptance.


Useful Links: Direct Link to the Full Workshop Schedule: General Workshop information: Frequently Asked Questions (FAQ) about the Workshop and Český Krumlov can be found here:

Workshop Overview:

The 2015 Workshop on Molecular Evolution brings together an international collection of faculty members and Workshop participants to study and discuss current ideas and techniques for exploring molecular evolution. The Workshop on Molecular Evolution consists of a series of lectures, demonstrations and computer laboratories that cover theoretical and conceptual aspects of molecular evolution with a strong emphasis on data analysis.

The Workshop has a strong focus on molecular phylogenetics, and covers all aspects of phylogenetic workflows, including marker selection, phylogeny reconstruction, time-calibration, as well as detection of natural selection, phylogeography, diversification rates, and trait evolution patterns. A majority of the schedule is dedicated to hands-on learning activities designed by faculty and the workshop team. This interactive experience provides Workshop participants with the practical experience required to meet the challenges presented by modern evolutionary sciences.

Co-directors: Walter Salzburger, Michael Matschiner, Jan Stefka and Scott Handley

For more information and online application see the Workshop web site -

Posts: 1

Participants: 1

Read full topic

July 13, 2014


Guanyang Zhang wrote:

Are there phylogenetic comparative methods/software for testing correlation between characters that have multiple states? BayesTraits only deals with binary characters.

Posts: 3

Participants: 3

Read full topic

July 11, 2014


Jaime Huerta-Cepas wrote:

I have a phylogenetic hypothesis that I would like to test statistically. Although the best Bayesian and ML trees support that hypothesis, bootstrap and posterior probabilities are far from great, so I followed the advice given to me in this forum to test all possible alternatives and see if I could statistically rule them out.

For this I used CONSEL to evaluate over a thousand alternative constraint topologies. All but 10 of the accepted topologies are compatible with my hypothesis using the AU test (pvalue

July 8, 2014


Brian Foley wrote:

This paper is rather specific to HIV-1 with its very large population size within each infected individual, and rapid evolution rate. It would be interesting to see similar work with other organisms. Human Influenza A virus, for example, has an evolution rate very similar to HIV-1 but a very different transmission rate between infected individuals.

Posts: 1

Participants: 1

Read full topic

July 3, 2014


Trevor Bedford wrote:

Andreas Wagner has a new paper on analyzing influenza sequence data using a super simple Hamming-distance network-based approach. A genotype network reveals homoplastic cycles of convergent evolution in influenza A (H3N2) haemagglutinin. A Wagner, Proceedings. Biological sciences / The Royal Society, Jul 7 2014

Networks of evolving genotypes can be constructed from the worldwide time-resolved genotyping of pathogens like influenza viruses. Such genotype networks are graphs where neighbouring vertices (viral strains) differ in a single nucleotide or amino acid. A rich trove of network analysis methods can help understand the evolutionary dynamics reflected in the structure of these networks. Here, I analyse a genotype network comprising hundreds of influenza A (H3N2) haemagglutinin genes. The network is rife with cycles that reflect non-random parallel or convergent (homoplastic) evolution. These cycles also show patterns of sequence change characteristic for strong and local evolutionary constraints, positive selection and mutation-limited evolution. Such cycles would not be visible on a phylogenetic tree, illustrating that genotype network analysis can complement phylogenetic analyses. The network also shows a distinct modular or community structure that reflects temporal more than spatial proximity of viral strains, where lowly connected bridge strains connect different modules. These and other organizational patterns illustrate that genotype networks can help us study evolution in action at an unprecedented level of resolution.

He ends up with plots like:

[Image: network.png]

Fundamentally non-phylogenetic, this approach doesn't try to reconstruct evolutionary history, but instead shows a simple overview of genetic relationships. Andreas suggests that these graphs make it easy to detect convergent evolution that would not be apparent in the strictly branching tree.

I don't have a good intuition for how these sorts of graphs translate to trees and vice versa. Does this seem like it's a useful addition to constructing a tree or more of a distraction?
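
To make the construction concrete, here is a small sketch of a genotype network as described in the abstract: vertices are distinct sequences, and two vertices are joined when they differ at exactly one site. This is only an illustration, not the code used in the paper; the function names are made up, and counting independent cycles with the cycle rank (edges minus vertices plus connected components) is my own choice of summary.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(seqs):
    """Adjacency dict linking distinct sequences that differ at exactly one site."""
    nodes = sorted(set(seqs))
    adj = {s: set() for s in nodes}
    for a, b in combinations(nodes, 2):
        if len(a) == len(b) and hamming(a, b) == 1:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def count_independent_cycles(adj):
    """Cycle rank of the graph: edges - vertices + connected components."""
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first search over one component
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return edges - len(adj) + components
```

The four genotypes AA, AC, CA and CC form a square in this graph. A tree would have to break one of its edges, which is the kind of homoplastic cycle the paper points to.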

Posts: 3

Participants: 2

Read full topic

July 2, 2014


Miao Sun wrote:

Hi geniuses,

A really big headache has been haunting me recently, and I need your insightful suggestions to help me out:

I have a large matrix, about 12,000 taxa, and the data is formatted as below:


So using information like "taxa name or gi No.", how can I get the corresponding accession number for each taxon in this large matrix via the NCBI website as a batch job?

Any ideas or experience to share?
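
One possible route is the NCBI E-utilities `efetch` endpoint, which, as far as I know, returns accession numbers when called with `rettype=acc`. The sketch below only builds the batched request URLs and does not contact NCBI. The batch size of 200, the `nuccore` database and the helper names are assumptions for illustration, and whether GI numbers are still accepted should be checked against the current NCBI documentation. Looking up by taxon name instead would need a separate search step.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def accession_urls(ids, db="nuccore", batch_size=200):
    """Build one efetch URL per batch of identifiers (no network access here)."""
    urls = []
    for batch in chunks(list(ids), batch_size):
        query = urlencode({
            "db": db,                # assumed database; adjust for proteins etc.
            "id": ",".join(batch),   # comma-separated identifiers
            "rettype": "acc",        # ask for accession numbers only
            "retmode": "text",
        })
        urls.append(EFETCH + "?" + query)
    return urls
```

Each URL can then be fetched with `urllib.request.urlopen`, pausing between requests to stay within NCBI's usage limits.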



Posts: 6

Participants: 3

Read full topic

June 30, 2014


Craig Nelson wrote:

This is as much a moral rumination as a call for opinions and guidance. How can we better practically resolve taxa as amplicon surveys grow out of control? Can placement algorithms replace identity binning? Should they?

Sequence reads derived from environmental surveys of phylogenetic marker gene amplicons, such as 16S rRNA, CO1, etc. are typically “clustered” (Uclust, CD-Hit, mothur) to form operational taxonomic units (OTUs) after alignment to a reference database and before subsequent phylogenetic analysis or classification. In the widespread application of these genes (molecular “clocks”), this creates problems of cohesion across studies and sequencing platforms (and even run-to-run) because OTUs are internally defined by local neighbors.

Because ecology is important (right?), identifying ecologically and evolutionarily meaningful OTUs has become important, in the microbial world now often described as finding “Ecotypes” of broad clades of microbes. This is impossible with databases, where huge diverse groups are often lumped with a code derived from a single clone picked decades ago. Nonetheless, many marker genes have robust, curated databases, and the problem becomes one of annotation. Comparing organisms across studies has become a real problem in my field of aquatic microbiology. We have lots of groups independently "naming" clades, and lots of reference libraries for marker genes (especially 16S), but read binning by identity is dataset dependent and it can be hard to maintain continuity through time or quickly determine if two groups are talking about the same organism.

I am particularly interested in stabilizing this trajectory in time-series work by using placement to assign reads to nodes of a reference tree. Frankly, this is now computationally practical, because binning can be so slow as datasets grow. I'd guess a reference tree should be robustly calculated (ML) with backbone constraints from a curated MSA database and possibly initially expanded using existing sequence libraries from previous work in the ecosystem in question (or analogues) to establish un-curated clades relevant locally. Subsequent amplicon surveys could then “classify” reads according to placements (using pplacer, for example @ematsen ) and nodes could serve as stable, reference-able, visualizable, (expandable?), classifiable taxonomic units.

I’ve been working for the last year to derive a robust reference alignment from databases and structure a classified, constrained tree and workflow for curation of marker gene survey outputs within the pplacer ecosystem (pplacer/guppy/rppr/ and especially taxtastic). Importantly, we wouldn't be progressively adding sequences to a tree. We would just be allowing for stable nodal annotation of reads as a short-term way of detecting ecologically meaningful differential placements. In essence our goal is to "classify" to node rather than database annotations.

One topic of discussion would be nodal “Assignment”. In the context of pplacer, it would be nice to get an alternate "unambiguous" placement for a given pquery which is the most derived node for which likelihood weight is above some threshold (say 70%). This seems like a reasonable option to incorporate into pplacer: if pquery has an unacceptably low pointmass likelihood weight, reassign to a basal node monophyletic for the placement using an LCA-like algorithm (already in use for your classification algorithms). Giovannoni's group at OSU has taken an approach to this by modifying pplacer placements with the BioPerl script LCA, which basically re-attaches pqueries to common basal nodes when placements are unacceptably "Fuzzy" as a single pointmass (their group calls this pipeline "Phylotyper"; Vergin et al. 2013, ISME Journal).
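
To illustrate the kind of rule I have in mind, here is a rough sketch. It is not pplacer's algorithm or the Phylotyper script. It assumes the reference tree is given as a child-to-parent map, and that each placement is keyed by the node at the lower end of its edge with its likelihood weight ratio as the value. The function name and the 0.7 default are hypothetical.

```python
def assign_node(parent, weights, threshold=0.7):
    """Most derived node whose subtree holds at least `threshold` of the weight."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    # The root is the only node that appears as a parent but never as a child
    root = next(p for p in parent.values() if p not in parent)

    mass = {}

    def subtree_mass(node):
        """Sum the placement weight on this node's edge and everything below it."""
        total = weights.get(node, 0.0)
        for c in children.get(node, []):
            total += subtree_mass(c)
        mass[node] = total
        return total

    subtree_mass(root)

    node = root
    while True:
        # Descend into the heaviest child while it still clears the threshold
        best = max(children.get(node, []), key=lambda c: mass[c], default=None)
        if best is None or mass[best] < threshold:
            return node
        node = best
```

With a pquery split 0.5/0.4 between sister tips A and B, this returns their common ancestor rather than either tip. With 0.8 on A it returns A.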

Another discussion point that would be really useful is some criteria for determining if pqueries are likely to belong to a "new clade" that isn't well-represented in the tree. Said differently, it would be nice if post-hoc analyses on a placement mass can suggest if the tree needs to be expanded with additional reference sequences to resolve/accommodate new subclades in specific regions. Are any of the existing metrics (adcl, edpl) useful for quantifying this likelihood? Would there be a way to make a "new node" in a refpkg during the placement process if some critical mass of sequences was attaching to a basal node in a clade with better likelihood than more derived nodes?


Posts: 7

Participants: 4

Read full topic

June 26, 2014


Erick Matsen wrote:

Efficient Continuous-Time Markov Chain Estimation

Monir Hajiaghayi, Bonnie Kirkpatrick, Liangliang Wang, Alexandre Bouchard-Côté

Many problems of practical interest rely on Continuous-time Markov chains (CTMCs) defined over combinatorial state spaces, rendering the computation of transition probabilities, and hence probabilistic inference, difficult or impossible with existing methods. For problems with countably infinite states, where classical methods such as matrix exponentiation are not applicable, the main alternative has been particle Markov chain Monte Carlo methods imputing both the holding times and sequences of visited states. We propose a particle-based Monte Carlo approach where the holding times are marginalized analytically. We demonstrate that in a range of realistic inferential setups, our scheme dramatically reduces the variance of the Monte Carlo approximation and yields more accurate parameter posterior approximations given a fixed computational budget. These experiments are performed on both synthetic and real datasets, drawing from two important examples of CTMCs having combinatorial state spaces: string-valued mutation models in phylogenetics and nucleic acid folding pathways.

The first important thing is to figure out how to calculate the transition probability of an x to a y given that some change occurs in the case when the state space is very big. String-valued processes fall in this category, for example. They bias things with a potential:

[Image: Pasted image]

Second, one needs to marginalize out the event (i.e. jump) times. This is done by constructing a CTMC such that the difficult part of the marginalization is the transition probabilities of the CTMC:

Alexandre Bouchard-Côté does fantastic work. H/T @cmccoy.
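
As a reminder of the basic object in the first step, here is a tiny sketch of the standard embedded jump chain: given that a change occurs from state x, the next state is y with probability q(x, y) divided by the total rate out of x. This is textbook CTMC material, not the paper's potential-weighted proposal. The toy string model (equal-rate single-site substitutions) and both function names are assumptions for illustration.

```python
def jump_probabilities(rates):
    """Map each target state y to P(next state = y | a jump occurs).

    `rates` maps y to the rate q(x, y) out of the current state x.
    """
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("state has no outgoing rate")
    return {y: q / total for y, q in rates.items()}


def substitution_rates(s, alphabet="ACGT", rate=1.0):
    """Toy string-valued model: every single-site substitution has the same rate."""
    rates = {}
    for i, old in enumerate(s):
        for new in alphabet:
            if new != old:
                rates[s[:i] + new + s[i + 1:]] = rate
    return rates
```

Only the neighbours of the current string are ever enumerated, which is what keeps this step feasible when the full state space is far too large to write down.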

Posts: 1

Participants: 1

Read full topic


Rob Lanfear wrote:

This post is about a recent Drosophila phylogeny published in MPE, a critique of that paper, and whether MPE has done enough by just publishing the critique. Opinions welcome.

The original paper presented new data and a new tree of Drosophilidae. Obviously lots of people care about this tree since it encompasses some of the best-studied model organisms we've got. It's been cited 23 times since 2012 according to google scholar. Here's the original:

Increasing the data size to accurately reconstruct the phylogenetic relationships between nine subgroups of the Drosophila melanogaster species group (Drosophilidae, Diptera). Yang Y, Hou ZC, Qian YH, Kang H, Zeng QT.

A critique has just been published, showing lots of issues with the original analysis (full disclosure - I have published with the first author of this critique, although I had nothing to do with this critique and hadn't read it until today). Here's the critique:

Problems with data quality in the reconstruction of evolutionary relationships in the Drosophila melanogaster species group: Comments on Yang et al 2012. Catullo RA, Oakeshott JG.

In short - they found many issues with the data in the ms (problems with ~150 of the ~800 sequences), and couldn't replicate their results. Most worryingly, they show at least one example where this published tree may have already led to incorrect inferences in a published comparative study that relied on the tree.

What seems odd to me is that although the critique seems fairly damning, nothing has changed on the original paper. My understanding was that this is what the COPE guidelines were for:

While there is no evidence of fraud here, if you take the critique at face value then there were a lot of mistakes in the original article and the validity of the results is certainly in question.

MPE is a premier venue for publishing trees, and it would be nice to think they were committed to their publications being accurate. So I'd be interested to hear others' opinions on this paper and the critique. Specifically, have MPE done enough here by just publishing the critique? Should they issue a correction / expression of concern / or worse of the original article? Or should the original article stand unchanged despite the critique?



Posts: 3

Participants: 2

Read full topic

June 25, 2014


Jaime Huerta-Cepas wrote:

We have just released the first beta version of ETE-NPR.

The software is intended for Nested Phylogenetic Reconstruction (NPR) and workflow design. It works as a wrapper to all the necessary steps and programs used in common phylogenetic and phylogenomic pipelines, from input parsing to final image generation.

This is still a work in progress and we will be happy to get any feedback.

Posts: 1

Participants: 1

Read full topic

June 19, 2014


josephwb wrote:

Anyone know the best way to visualize conflicting tree topologies with incomplete overlap in taxon sampling?

What we got: a complete tree (say, species tree), and a whack of gene trees which may or may not have complete taxon sampling. We want a figure with a single set of taxon labels that all trees map to. Ignoring edge lengths, as things get messy very quickly. If a gene tree does not contain taxa in the basal split of the species tree, don't want its root to start at the species tree root, but instead more tipward; otherwise, relationships get obscured.

DensiTree is something we have explored, but it doesn't seem to work well with uneven sampling across trees. We've also been playing with R code graciously provided by @liamjrevell, and we may be able to get this to do what we want, but I thought I would check with the phylo-timaliids to see if something already exists.

Thanks! JWB.
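
To pin down the "more tipward" root, here is a rough sketch of one way to find where a gene tree should attach: the smallest clade of the species tree that contains every taxon sampled in the gene tree. Trees are written as nested tuples of taxon names purely for illustration, and the function names are made up. This does not do any drawing.

```python
def leaves(tree):
    """Set of taxon names below a node in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def smallest_clade(species_tree, taxa):
    """Smallest species-tree clade containing all of the given taxa."""
    taxa = set(taxa)
    if not taxa <= leaves(species_tree):
        raise ValueError("some taxa are missing from the species tree")
    node = species_tree
    while not isinstance(node, str):
        # Move down while a single child still holds every sampled taxon
        inside = [child for child in node if taxa <= leaves(child)]
        if not inside:
            break
        node = inside[0]
    return node
```

A gene tree sampling only A, B and C would then be drawn from the node above ((A, B), C) rather than from the species tree root.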

Posts: 6

Participants: 4

Read full topic

June 17, 2014


Erick Matsen wrote:

New from @bredelings: Erasing Errors Due to Alignment Ambiguity When Estimating Positive Selection. B Redelings, Molecular biology and evolution, May 27 2014

Current estimates of diversifying positive selection rely on first having an accurate multiple sequence alignment. Simulation studies have shown that under biologically plausible conditions, relying on a single estimate of the alignment from commonly used alignment software can lead to unacceptably high false positive rates in detecting diversifying positive selection. We present a novel statistical method that eliminates excess false positives resulting from alignment error by jointly estimating the degree of positive selection and the alignment under an evolutionary model. Our model treats both substitutions and insertions/deletions as sequence changes on a tree, and allows site-heterogeneity in the substitution process. We conduct inference starting from unaligned sequence data by integrating over all alignments. This approach naturally accounts for ambiguous alignments without requiring ambiguously aligned sites to be identified and removed prior to analysis. We take a Bayesian approach and conduct inference using MCMC to integrate over all alignments on a fixed evolutionary tree topology. We introduce a Bayesian version of the branch-site test and assess the evidence for positive selection using Bayes factors. We compare two models of differing dimensionality using a simple alternative to reversible-jump methods. We also describe a more accurate method of estimating the Bayes factor using Rao-Blackwellization. We then show using simulated data that jointly estimating the alignment and the presence of positive selection solves the problem with excessive false positives from erroneous alignments, and has nearly the same power to detect positive selection as when the true alignment is known. We also show that samples taken from the posterior alignment distribution using the software BAli-Phy have substantially lower alignment error compared to MUSCLE, MAFFT, PRANK, and FSA alignments.

This figure definitely made me sit up and pay attention:

[Image: Pasted image]

The sequences were simulated with INDELible.

Posts: 3

Participants: 2

Read full topic

June 7, 2014


Bojian Zhong wrote:


I am currently doing some phylogenetic analyses using PhyloBayes, but I haven't figured out how to measure the compositional heterogeneity of each taxon using PhyloBayes. I would really appreciate it if anyone could provide the commands/details of how to do it.

Many thanks,
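
While waiting for the PhyloBayes commands, a quick descriptive check can be done outside PhyloBayes. The sketch below is not PhyloBayes and not a statistical test. It computes, for each taxon, the squared deviation of its character frequencies from the frequencies pooled over the whole alignment, so the most compositionally unusual taxa stand out. The function names and the default DNA alphabet are assumptions; pass the 20 amino acids for protein data.

```python
from collections import Counter


def composition(seq, alphabet):
    """Frequency of each alphabet character in seq, ignoring gaps and ambiguity codes."""
    counts = Counter(c for c in seq.upper() if c in alphabet)
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in alphabet}


def composition_deviation(alignment, alphabet="ACGT"):
    """Per-taxon sum of squared differences from the pooled composition.

    `alignment` maps taxon name to its aligned sequence.
    """
    pooled = composition("".join(alignment.values()), alphabet)
    result = {}
    for taxon, seq in alignment.items():
        freqs = composition(seq, alphabet)
        result[taxon] = sum((freqs[a] - pooled[a]) ** 2 for a in alphabet)
    return result
```

A proper test would compare these values against data simulated under the fitted model, which is what a posterior predictive approach does.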


Posts: 1

Participants: 1

Read full topic

June 3, 2014


Andrew Rambaut wrote:

Firstly - sorry, this is not an announcement but a suggestion/call for a collaborative project. I have been using IPython Notebook for manipulating data and plotting and think it would be a great environment for phylogenetics. For it to work, it would need a coherent library with standardised objects for storing trees, etc., and some visualisation tools for trees, alignments etc. And some embedded tree building/alignment software.

For the former, the obvious choice (I think) would be to use DendroPy, by @jeetsukumaran and @mtholder, but other options may be available. Then a good set of (possibly D3 based, JavaScript) visualisation tools could be built in at the Notebook end (the alternative would be to add plotting routines built on top of Matplotlib). Extensions could be added by using standard Python package management.

Any thoughts on this? My primary motivation is to replace the various software packages I use for teaching and produce a coherent framework.

Posts: 10

Participants: 8

Read full topic


Jaime Huerta-Cepas wrote:

Hi, I have been struggling for days with a phylogenetic tree of around 90 short sequences (domain based) whose support values for many branches are really low (