phylobabble.org

Latest topics

URL

XML feed
http://www.phylobabble.org/latest

Last update

13 min 47 sec ago

April 17, 2014

07:19

Erick Matsen wrote:

In the paper describing MrBayes 3.2, there is the following phrase:

MrBayes 3.2 further includes a completely new type of tree proposal that is guided using parsimony scores. The details of the parsimony-biased proposals will be presented elsewhere; however, tentative empirical results show that they can improve the speed of convergence by an order of magnitude on some problems (see also Höhna and Drummond 2012).

Can someone point me to this paper? I'd like to read about them! (@Alexis_RAxML, I'm assuming that you learned about them from just reading the MB source?)

Posts: 1

Participants: 1

Read full topic

April 12, 2014

15:54

Erick Matsen wrote:

This postdoctoral position is an opportunity to contribute to the design of the upcoming 701 and 702 HIV vaccine trials to maximize power in subsequent statistical analyses. Specifically, it will be to help design the trials so that infection time and founder sequences can be inferred with maximum fidelity. The scope of the study design includes sampling times and sequencing protocol, and also may afford some opportunities to design novel ways of combining sequencing methodologies. The project will last two years, with some possibility of extension. It may also offer an option to travel to South Africa to help teach a short course at the University of Cape Town, and possibly interact with the Fred Hutchinson Research Institute there.

This position will require significant statistics expertise, programming ability, and of course interest in collaboration. For a bit more detail, see the post on my website.

Posts: 1

Participants: 1

Read full topic

15:39

Erick Matsen wrote:

www.ncbi.nlm.nih.gov Automated Reconstruction of Whole-Genome Phylogenies from Short-Sequence Reads. F Bertels, OK Silander, M Pachkov, PB Rainey and E van Nimwegen, Molecular biology and evolution, Mar 23 2014

Studies of microbial evolutionary dynamics are being transformed by the availability of affordable high-throughput sequencing technologies, which allow whole-genome sequencing of hundreds of related taxa in a single study. Reconstructing a phylogenetic tree of these taxa is generally a crucial step in any evolutionary analysis. Instead of constructing genome assemblies for all taxa, annotating these assemblies, and aligning orthologous genes, many recent studies 1) directly map raw sequencing reads to a single reference sequence, 2) extract single nucleotide polymorphisms (SNPs), and 3) infer the phylogenetic tree using maximum likelihood methods from the aligned SNP positions. However, here we show that, when using such methods to reconstruct phylogenies from sets of simulated sequences, both the exclusion of nonpolymorphic positions and the alignment to a single reference genome, introduce systematic biases and errors in phylogeny reconstruction. To address these problems, we developed a new method that combines alignments from mappings to multiple reference sequences and show that this successfully removes biases from the reconstructed phylogenies. We implemented this method as a web server named REALPHY (Reference sequence Alignment-based Phylogeny builder), which fully automates phylogenetic reconstruction from raw sequencing reads.

[Image: F8.large.jpg]

Hello there Felsenstein and Farris zones!

[Image: F2.large.jpg]

Posts: 1

Participants: 1

Read full topic

April 10, 2014

11:37

Erick Matsen wrote:

Very interesting work from @arambaut, @alexei_drummond, @guy_baele, and @philippe_Lemey.

www.ncbi.nlm.nih.gov The Genealogical Population Dynamics of HIV-1 in a Large Transmission Chain: Bridging within and among Host Evolutionary Rates. B Vrancken, A Rambaut, MA Suchard, A Drummond, G Baele, I Derdelinckx, E Van Wijngaerden, AM Vandamme, K Van Laethem and P Lemey, PLoS computational biology, Apr 2014

Transmission lies at the interface of human immunodeficiency virus type 1 (HIV-1) evolution within and among hosts and separates distinct selective pressures that impose differences in both the mode of diversification and the tempo of evolution. In the absence of comprehensive direct comparative analyses of the evolutionary processes at different biological scales, our understanding of how fast within-host HIV-1 evolutionary rates translate to lower rates at the between host level remains incomplete. Here, we address this by analyzing pol and env data from a large HIV-1 subtype C transmission chain for which both the timing and the direction is known for most transmission events. To this purpose, we develop a new transmission model in a Bayesian genealogical inference framework and demonstrate how to constrain the viral evolutionary history to be compatible with the transmission history while simultaneously inferring the within-host evolutionary and population dynamics. We show that accommodating a transmission bottleneck affords the best fit to our data, but the sparse within-host HIV-1 sampling prevents accurate quantification of the concomitant loss in genetic diversity. We draw inference under the transmission model to estimate HIV-1 evolutionary rates among epidemiologically-related patients and demonstrate that they lie in between fast intra-host rates and lower rates among epidemiologically unrelated individuals infected with HIV subtype C. Using a new molecular clock approach, we quantify and find support for a lower evolutionary rate along branches that accommodate a transmission event or branches that represent the entire backbone of transmitted lineages in our transmission history. Finally, we recover the rate differences at the different biological scales for both synonymous and non-synonymous substitution rates, which is only compatible with the 'store and retrieve' hypothesis positing that viruses stored early in latently infected cells preferentially transmit or establish new infections upon reactivation.

Methodologically, this paper provides long-overdue inferential methods that distinguish (the rather different) within-host and between-host evolutionary processes, as well as integrate epidemiological information. I was wondering if the authors could say if this method would improve inference with data sets for which we have no epidemiological information, just stratification of sequences by host.

Biologically, it has been known for a while that viruses from chronic infections have certain mutations that are not present in primary infection. Here the authors provide further evidence that infectious HIV derives from "stored" virions rather than mutated virion lineages which have reverted. This builds on the work of:

www.ncbi.nlm.nih.gov Within-host and between-host evolutionary rates across the HIV-1 genome. S Alizon and C Fraser, Retrovirology, 2013

HIV evolves rapidly at the epidemiological level but also at the within-host level. The virus' within-host evolutionary rates have been argued to be much higher than its between-host evolutionary rates. However, this conclusion relies on analyses of a short portion of the virus envelope gene. Here, we study in detail these evolutionary rates across the HIV genome.

Posts: 1

Participants: 1

Read full topic

April 9, 2014

19:26

Miao Sun wrote:

Hi All,

I've run a 4-gene, 12,000-taxon supertree analysis. After I increased the number of bootstrap replicates (from 200 to 1000), the support for the most internal nodes improved a lot (to roughly 80-90% BS). However, the support for external nodes remains low (some of them are at 0-10% BS). I know that not all of the taxa are fully sampled for the 4 genes, so there is a large proportion of missing data. Besides this, is there any other way to compensate for this shortcoming?

Thanks!

Miao

Posts: 1

Participants: 1

Read full topic

April 8, 2014

13:18

Brian Foley wrote:

Over the weekend I was reading _Neanderthal Man_ by Svante Pääbo. This book discusses comparative genomics of humans and close relatives, including the complete genome of Pan paniscus published in Nature in 2012, and the complete genome of an Altai Neanderthal in 2014. It is clear from the phylogenetic trees and other comparisons in these papers that the complete genomes can be analyzed, for example to show diversity within the Major Histocompatibility Complex in comparison to diversity in other regions of the genomes.

However, when I attempt to download the MHC region from each of the genomes to do some of my own comparisons, I find that it is not easy to get the data. The genomes are not well assembled and annotated on the public sites, so I cannot just search for the MHC gene region in Pan troglodytes, Altai Neanderthal genome, or Pan paniscus genome.

Does anyone here know if the annotated genomes are available at some other site? Papers discussing the genomes seem to indicate that the authors have access to annotation to determine which genes and gene families are shared between modern humans and Neanderthals.

Posts: 3

Participants: 2

Read full topic

April 4, 2014

09:58

Boronian wrote:

Dear babblers, in the past I followed the "total evidence" approach without much doubt, especially when only DNA regions of one genome type (cp, n, mt) were involved. I merely checked whether the (well-supported) backbone nodes in all trees from individual markers agreed. They mostly did, so I didn't bother any further and created the supermatrix with the respective partitioning.

My question is: Is there any standard procedure or common practice I missed, to decide on a quantitative basis if one is allowed to combine the data sets?

I found hints to a couple of tools that should provide such tests [e.g. concaterpillar - couldn't make it run with recent RAxML and is only for amino acid data(?); arn - a package from Farris (yes, THE Farris), which I couldn't locate], but none of them seemed to be working for me.

I then calculated Robinson Foulds distances between the individual trees and the tree from the supermatrix, but I am unsure what distance value should be considered as a threshold (if this is at all a good way to go for my aim).

Please let me know if you have hints in this respect. Something that can be automated (in R?) would be especially nice.

Cheers
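For readers who want to script the Robinson-Foulds comparison described in this post, here is a minimal sketch in plain Python. It assumes simple Newick strings (no quoted labels) with identical leaf sets, ignores branch lengths and support values, and treats the trees as unrooted. The function names `parse_newick_splits` and `robinson_foulds` are made up for this illustration. The normalized value only rescales the distance by the maximum for binary trees, 2(n - 3); it does not by itself supply a threshold for deciding whether data sets may be combined.

```python
import re

def parse_newick_splits(newick):
    """Return (leaf set, set of non-trivial bipartitions) of a Newick tree."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack = []    # one list of child leaf sets per open parenthesis
    clades = []   # leaf set below every internal node
    leaves = set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            clade = frozenset().union(*stack.pop())
            clades.append(clade)
            if stack:
                stack[-1].append(clade)
        elif tok not in ",;":
            label = tok.split(":")[0].strip()
            # A token right after ")" is an internal label or branch length.
            if prev != ")" and label:
                leaves.add(label)
                stack[-1].append(frozenset([label]))
        prev = tok
    all_leaves = frozenset(leaves)
    splits = set()
    for clade in clades:
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset([clade, other]))
    return all_leaves, splits

def robinson_foulds(newick_a, newick_b):
    """Return (RF distance, RF distance normalized by 2 * (n - 3))."""
    leaves_a, splits_a = parse_newick_splits(newick_a)
    leaves_b, splits_b = parse_newick_splits(newick_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    distance = len(splits_a ^ splits_b)   # splits found in exactly one tree
    max_distance = 2 * (len(leaves_a) - 3)
    normalized = distance / max_distance if max_distance > 0 else 0.0
    return distance, normalized

gene_tree = "((A:0.1,B:0.2)90:0.3,(C:0.1,D:0.1):0.2,E:0.5);"
supermatrix_tree = "((A,C),(B,D),E);"
print(robinson_foulds(gene_tree, supermatrix_tree))
```

Looping this over every gene tree against the supermatrix tree gives one number per marker, which can at least be ranked or compared against distances between bootstrap replicates of the same marker.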

Posts: 10

Participants: 5

Read full topic

09:24

Erick Matsen wrote:

I was following a reference chain and came across the PhD thesis of Jonathan Laserson, who was a Daphne Koller student and now works at Google.

The thesis is interesting. It's quite ambitious: in one chapter he sets up a Bayesian inference algorithm for a tree of cells (cc @mathmomike) that explicitly models sequencing error. On the other hand, I can't see how he got any of this to work efficiently for the sizes of data sets discussed, and his software, called ImmuniTree, is not to be found on the web.

@jeetsukumaran and others may know this fellow from his Genovo de novo assembly software for mixed populations.

Posts: 2

Participants: 2

Read full topic

April 1, 2014

18:06

Brendan Larsen wrote:

Hello, I have phylogenetic trees of deep sequencing data. The problem is that many of the sequences are identical, so I wanted each node that has x identical sequences to be represented by a circle of size x. This is easy enough to do in R using the ape package. The problem is that I have two groups of sequences that I want to color differently. At node 1, for example, 30% of the identical sequences come from group 1 and 70% come from group 2. Ideally the circles at the nodes would be pie charts that show each group's share, but I cannot figure out how to calculate the vector to feed to ape. Any ideas?

Thanks, Brendan
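The per-node proportions asked about here can be tabulated before any plotting. The sketch below, in plain Python, assumes the input is a list of (node, group) pairs with one pair per sequence; that input format and the name `pie_matrix` are assumptions made for illustration. The result has one row per node and one column per group, which is the shape that the `pie` argument of ape's `nodelabels()` and `tiplabels()` is documented to accept, and the per-node sizes could be used to scale the symbols.

```python
from collections import Counter

def pie_matrix(assignments, groups):
    """Tabulate group proportions per node.

    assignments: iterable of (node_id, group) pairs, one per sequence.
    groups: ordered list of group names (the columns of the matrix).
    Returns (node_ids, sizes, proportions), with one row per node.
    """
    counts = {}
    for node, group in assignments:
        counts.setdefault(node, Counter())[group] += 1
    node_ids = sorted(counts)
    sizes = [sum(counts[node].values()) for node in node_ids]
    proportions = [
        [counts[node][group] / size for group in groups]
        for node, size in zip(node_ids, sizes)
    ]
    return node_ids, sizes, proportions

# Hypothetical data: node 1 holds 3 sequences from group1 and 7 from group2.
assignments = [(1, "group1")] * 3 + [(1, "group2")] * 7 + [(2, "group1")] * 4
node_ids, sizes, proportions = pie_matrix(assignments, ["group1", "group2"])
for node, size, row in zip(node_ids, sizes, proportions):
    print(node, size, row)
```

Writing the rows out as a table and reading them into R would then give a matrix whose rows line up with the node order used in the plot.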

Posts: 3

Participants: 2

Read full topic

10:13

Tracy Heath wrote:

https://www.nescent.org/sites/academy/Phylogenetic_analysis_using_RevBayes

We are teaching a workshop on phylogenetic inference in the program RevBayes at NESCent August 25-31, 2014. The instructors include all of the developers of RevBayes including phylobabblers @hoehna, @nicolas_lartill, @mlandis, and myself.

If you know of anyone who might be interested, please encourage them to apply!

Posts: 1

Participants: 1

Read full topic

March 30, 2014

18:28

Sergei Turanov wrote:

There is a problem (at least for me).

I analyse an incomplete sequence matrix when sequences for some genes have not been obtained for all the samples in analysis. For example, I've got Co-1 gene but failed to obtain CytB gene for a given sample. As a result, there are some sequences in my matrix that have no matches in nucleotides at all.

So my questions are:

Are there any algorithms to calculate the distance between two sequences without matches and to reconstruct a distance-based phylogeny using (for instance) the available information about distances between other sequences in the matrix?

If so, is it possible to evaluate the branch support for a given phylogeny?

Thank you!
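To make the problem concrete, here is a small sketch of a pairwise p-distance that skips missing sites. It is only an illustration of why the distance is undefined for the pairs described in this post, not an answer to the question about algorithms; the function name `p_distance` and the toy sequences are hypothetical. When two samples share no sequenced positions, the function returns `None`, and that entry of the distance matrix is missing.

```python
MISSING = set("-?N")

def p_distance(seq_a, seq_b):
    """Proportion of differing sites among positions sequenced in both.

    Returns None when the two aligned sequences share no sequenced site.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # site not sequenced in at least one sample
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else None

# Toy concatenated alignment: first four columns CO1, last four CytB.
sample_1 = "ACGT----"   # CO1 only
sample_2 = "----ACGT"   # CytB only
sample_3 = "ACGAACGT"   # both genes
print(p_distance(sample_1, sample_2))  # no shared sites
print(p_distance(sample_1, sample_3))
print(p_distance(sample_2, sample_3))
```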

Posts: 4

Participants: 3

Read full topic

March 29, 2014

March 28, 2014

12:52

Brian Foley wrote:

A 2010 PLOS publication by Linder et al provides test data sets for phylogenetics and metagenomics.

At the HIV Databases we have gathered and organized several data sets which are useful for comparing analysis methods or developing new methods. The Phylogenetic Handbook contains sample data sets for each chapter along with the tutorials on how to use the phylogenetic software. Neither of these sites was specifically set up for providing test data sets, but they could be useful.

I am sure there are other such sets available, and it would be nice to list some of the better ones here in the PhyloBabble site.

Posts: 7

Participants: 3

Read full topic

March 27, 2014

10:45

Erick Matsen wrote:

I hope I'm not bringing up bad memories, but can someone (perhaps @hlapp?) tell us a little about Google's response to the phyloinformatics GSoC application? Was there something special about this year, or was it shifting priorities on their part?

You all have a very strong track record with them in the past! It would be great to continue this program.

Posts: 3

Participants: 3

Read full topic

March 24, 2014

15:48

pterror wrote:

I'm new to all of this and I'm having trouble finding out how to combine continuous and discrete data into the same matrix, or if that's even possible. I'm using Mesquite. Is it a problem to insert discrete data into a continuous data matrix? I have the continuous data scaled from 0 to 1. Any help would be much appreciated. Thanks.

Posts: 6

Participants: 3

Read full topic

07:30

Erick Matsen wrote:

Correct me if I'm wrong but I don't think it's right to say that you are not computing a matrix exponential because you are doing it via diagonalisation. This is just one way (of many) of calculating a matrix exp. There is a semi-famous paper on this: http://www.cs.cornell.edu/cv/researchpdf/19ways+.pdf

It would be a different story if P was the same for all rate matrices (as it is for the K3ST model where P=H, the Hadamard matrix), but in general P changes as the rate matrix changes. I'm interested in generalisations of this situation to models more complicated than K3ST. That's essentially what I mean by "explicit formulas".
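The K3ST case mentioned above can be checked numerically. The sketch below is an illustration in plain Python, not code from the thread: it assumes a particular state ordering in which the rate between states i and j depends only on i XOR j, with rates named a, b and c. Under that assumption the fixed 4x4 Hadamard matrix H diagonalises every K3ST rate matrix, so exp(Qt) = H diag(exp(lambda t)) H / 4 with eigenvalues that are explicit in a, b and c. The result is compared against a truncated Taylor series, which is just one of the many other ways to compute a matrix exponential.

```python
import math

# Hadamard matrix; H is symmetric and its inverse is H / 4.
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]

def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def k3st_rate_matrix(a, b, c):
    """K3ST rate matrix in the assumed state ordering."""
    d = -(a + b + c)
    return [[d, a, b, c],
            [a, d, c, b],
            [b, c, d, a],
            [c, b, a, d]]

def expm_k3st(a, b, c, t):
    """exp(Qt) from the explicit eigenvalues; H does not depend on a, b, c."""
    eigenvalues = [0.0, -2 * (a + c), -2 * (b + c), -2 * (a + b)]
    D = [[math.exp(lam * t) if i == j else 0.0 for j in range(4)]
         for i, lam in enumerate(eigenvalues)]
    HDH = matmul(matmul(H, D), H)
    return [[x / 4 for x in row] for row in HDH]

def expm_taylor(Q, t, terms=60):
    """exp(Qt) by a truncated Taylor series (adequate for small |Qt|)."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = matmul(term, [[q * t / k for q in row] for row in Q])
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

P_closed = expm_k3st(0.3, 0.2, 0.1, 0.5)
P_series = expm_taylor(k3st_rate_matrix(0.3, 0.2, 0.1), 0.5)
print(max(abs(P_closed[i][j] - P_series[i][j])
          for i in range(4) for j in range(4)))
```

For a general rate matrix the eigenvectors have to be recomputed whenever the rates change, which is the distinction drawn in the post.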

I see what uniformization is now. Actually a bit of a coincidence because we were considering exactly this over our summer break (southern hemisphere) but of course have done nothing about it. It's a very interesting way of thinking about a CTMC.

Posts: 4

Participants: 4

Read full topic

March 21, 2014

07:06

Erick Matsen wrote:

Phylogenetic Stochastic Mapping without Matrix Exponentiation

Jan Irvahn, @vminin

Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces. The computations slow down considerably for larger state spaces (e.g. space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices -- an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, that does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
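As background for the abstract, the uniformization identity itself can be sketched in a few lines. This is not the authors' MCMC algorithm; it only illustrates the standard identity exp(Qt) = sum over k of Poisson(k; mu t) B^k, where B = I + Q/mu and mu is at least as large as the largest exit rate. The function name, the tolerance and the Jukes-Cantor test matrix are assumptions chosen for illustration.

```python
import math

def matmul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def uniformized_transition_matrix(Q, t, tol=1e-12, max_terms=10000):
    """Approximate exp(Qt) as a Poisson-weighted sum of powers of B."""
    n = len(Q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    mu = max(-Q[i][i] for i in range(n))   # dominating (uniform) jump rate
    if mu == 0:
        return identity
    # B is the transition matrix of the embedded discrete-time chain.
    B = [[identity[i][j] + Q[i][j] / mu for j in range(n)] for i in range(n)]
    weight = math.exp(-mu * t)             # Poisson probability of 0 jumps
    power = identity                       # B to the power k
    P = [[weight * x for x in row] for row in power]
    total = weight
    k = 0
    # Stop once the remaining Poisson mass is below the tolerance.
    while 1.0 - total > tol and k < max_terms:
        k += 1
        power = matmul(power, B)
        weight *= mu * t / k
        total += weight
        P = [[P[i][j] + weight * power[i][j] for j in range(n)]
             for i in range(n)]
    return P

# Jukes-Cantor rate matrix with off-diagonal rate 0.5 (illustrative values).
alpha = 0.5
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = uniformized_transition_matrix(Q, 1.0)
print(P[0][0], 0.25 + 0.75 * math.exp(-4 * alpha))
```

Each term needs only a product with B, so a sparse B keeps the cost of applying it to a vector proportional to the number of non-zero rates.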

Posts: 38

Participants: 6

Read full topic

06:51

Alexandros_Stam wrote:

Dear All,

I was wondering what would be the best way to regularly publish negative results, since such work is not likely to be accepted anywhere and this is a problem for the community.

Maybe setting up some web-based resource might be a good idea. Is anybody else worried about this and looking for a good way to publish and, more importantly, make available negative results?

I have recently been trying to implement a simple heterotachous model in RAxML to resolve a hard phylogeny, but it doesn't work at all. So where could I report this?

Alexis

Posts: 5

Participants: 5

Read full topic

March 18, 2014

11:11

Brian Foley wrote:

[Image: PIV and PTLV-TsTvPlotsJPG.jpg]

One quick and easy tool for visualizing the DNA distances in a data set is the DAMBE function, under graphics, that plots the transitions and transversions vs F84 phylogenetic distance for each pairwise comparison in the data set. I have attached a plot here, showing the plots for Primate T-Cell Leukemia Viruses and Primate Lentiviruses. We know that the HIV-1 M group, with F84 distances less than 0.15 here, represents roughly 100 years of evolution. But when comparing HIV-1 to SIV from African Green Monkey (distances > 0.5), the age estimate for the common ancestor is in the millions of years range.
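The counts behind such a plot are simple to reproduce outside DAMBE. The following plain-Python sketch tallies transitions and transversions for every pair of aligned sequences; it does not compute the F84 distance used on the x-axis, and the function names and toy sequences are made up for illustration.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    transitions = 0
    transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical sites, gaps and ambiguity codes.
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1     # A<->G or C<->T
        else:
            transversions += 1   # purine <-> pyrimidine
    return transitions, transversions

def pairwise_ts_tv(alignment):
    """Return {(name_a, name_b): (ts, tv)} for all pairs in a dict of sequences."""
    return {
        (a, b): count_ts_tv(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }

# Toy alignment with hypothetical names.
alignment = {"seq1": "AACGT", "seq2": "GACTA", "seq3": "AAC-T"}
for pair, (ts, tv) in pairwise_ts_tv(alignment).items():
    print(pair, ts, tv)
```

Plotting both counts against a corrected distance for each pair shows where the transition curve flattens, which is the saturation signal discussed in this post.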

Using DNA or protein distances, we don't have any methods, as far as I know, that would extrapolate from 100 years to get 15% (phylogenetically corrected) distance to more than 100,000 years for 50% (phylogenetically corrected) distance, let alone millions of years. Silent sites become more than saturated with mutations while at the same time many other sites remain absolutely invariant over time. Thus, calibrating the "molecular clock" has to be done in a time/distance range that is applicable to the data.

The same plots can be done for mammals, vertebrates, etc. The DNA distances between the most distant mammals are more than saturated with mutation in mitochondrial DNA, but not in nuclear genes. DNA distances in nuclear genes become saturated when comparing vertebrates (fish, amphibians, reptiles, birds, mammals etc.), and at those distances the mitochondrial genomes are easy to align but quite misleading for the "molecular clock" methods.

Posts: 1

Participants: 1

Read full topic

March 15, 2014

12:07

Erick Matsen wrote:

@cmccoy, @trvrb, @vminin and I have just put up a paper on our recent work on B cell molecular evolution. I have also put up a talk about this work.

I wanted to start a thread here so we could get some feedback from the community. Please let us know what you think!

Posts: 2

Participants: 2

Read full topic