phylobabble.org

Latest topics

XML feed: http://www.phylobabble.org/latest


September 4, 2015

06:57

@handley_scott wrote:

We are pleased to announce the first Workshop on Population and Speciation Genomics, a new concept within the workshop series that also includes the popular Workshop on Molecular Evolution and Workshop on Genomics, being held in the UNESCO World Heritage town of Český Krumlov, Czech Republic. This workshop will take place between 24 January and 5 February, 2016. More information is below or can be found on our website at http://evomics.org.

An on-line application form can be found at: http://evomics.org/registration-form/2016-workshop-on-population-and-speciation-genomics/

Dates: 24 January-5 February, 2016

Application Deadline: 15 October, 2015 is the preferred deadline; applicants will be admitted to the course following review by the admissions committee. Later applications will certainly be considered for admission or for placement on a waiting list.

Registration Fee: $1,500 USD. The fee includes the opening reception and access to all course material, but does not include other meals or housing. Special discounted pricing has been arranged for hotels, pensions, and hostels. Information regarding housing and travel will be made available to applicants following acceptance.

APPLY HERE: http://evomics.org/registration-form/2016-workshop-on-population-and-speciation-genomics/

Workshop Overview: The Workshop on Population and Speciation Genomics consists of a series of lectures, demonstrations and computer laboratories that cover various aspects of genomic analyses with a focus on the use of next generation sequencing data at the level of populations and closely related species. Faculty are chosen exclusively for their effectiveness in teaching theory and practice. The course is designed for established investigators, postdoctoral scholars, and advanced graduate students. Scientists with strong interests in the uses of genome-scale sequencing data and the application of modern analysis tools to study population dynamics and interactions are encouraged to apply. Lectures and computer laboratories total ~90 hours of scheduled instruction. No prior programming experience is required.

This independent workshop also works well as a complement to the Workshop on Genomics, which will take place just before the Workshop on Population and Speciation Genomics at the same location.

Topics to be covered include:

  • Introductions to UNIX, R, and Python
  • Analyzing genomic data in the “cloud” using Amazon Web Services (AWS)
  • Genomics data handling and file formats
  • RAD (Restriction site Associated DNA) data analysis
  • Analysis of low-coverage resequencing data
  • Variant detection
  • Likelihood and Bayesian inference
  • Coalescent analyses of population structure and demography
  • Analysis of adaptation and natural selection
  • Selective sweep analyses
  • Detection of introgression and admixture

Co-directors: Walter Salzburger, Michael Matschiner, Jan Stefka, and Scott Handley

For more information and online application see the Workshop web site - http://evomics.org

Posts: 1

Participants: 1


September 1, 2015

17:26

@mathmomike wrote:

Does anyone know of references or previous work on the following three questions?

  1. Consider the graph G_NNI of unrooted binary phylogenetic trees on leaf set {1,...,n}, where two trees form an edge if they are one NNI apart. Similarly G_SPR (where trees are joined if they are one SPR apart). Both these graphs are regular (each tree has the same number of neighbors), so if you start anywhere then after a while you will be at each tree with (asymptotically) uniform probability. The question is: what is the mixing time (time to be near uniform, as a function of n) for this random walk? Note I'm not talking about data (or Bayesian methods) here - just a pure mixing-time question regarding this discrete graph. Does anyone know of a paper that studies this? (I think I saw one once, but all my papers/notes were lost in the 2010/2011 earthquakes here!)

  2. There are various notions of the centre of a tree - e.g. the "centroid" (explained very nicely in the recent Tanglegrams paper by Matsen/Billey/Kas/Konvalinka on arXiv). However, does anyone know of any paper that discusses what we might call a 'leaf-centroid'? Given a tree T, call a vertex v of T a leaf-centroid if each of the components of T-v contains at most half of the leaves of T. If T has no vertices of degree 2 then (just like the centroid) a tree either has a unique leaf-centroid or two adjacent leaf-centroids. However, even for this class (trees without vertices of degree 2) the leaf-centroid can be different from the centroid! Just wondering if anyone has seen this notion mentioned or studied before?

  3. Does anyone know an early (pre-1960s) reference to the simple but fundamental result: a collection C of nonempty subsets of X (including X) forms a hierarchy (nested family) if and only if C is the set of clusters of a rooted X-tree? It may be implicit in Linnaeus (or even Aristotle!) but a more explicitly mathematical statement would be good.
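The leaf-centroid in question 2 is easy to find by brute force. Below is a sketch (the adjacency-dict representation and function name are mine, not from the post): for each vertex v, it checks that every component of T-v contains at most half of T's leaves.

```python
def leaf_centroid(adj):
    """Brute-force leaf-centroid(s) of a tree given as {vertex: set(neighbors)}:
    vertices v such that every component of T - v has at most half of T's leaves."""
    leaves = {v for v, nb in adj.items() if len(nb) == 1}
    half = len(leaves) / 2

    def comp_leaves(start, removed):
        # count leaves reachable from `start` without passing through `removed`
        seen, stack, count = {start}, [start], 0
        while stack:
            u = stack.pop()
            if u in leaves:
                count += 1
            for w in adj[u] - {removed}:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return count

    return [v for v in adj
            if all(comp_leaves(u, v) <= half for u in adj[v])]

# Two internal vertices c and d, each with two pendant leaves: both qualify,
# matching the "unique or two adjacent leaf-centroids" claim above.
tree = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"},
        "d": {"c", "e", "f"}, "e": {"d"}, "f": {"d"}}
print(sorted(leaf_centroid(tree)))  # ['c', 'd']
```

This is O(n^2) per tree, which is fine for checking small examples; a linear-time version along the lines of the standard centroid search should also be possible.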

Any advice will help for a book ('mathematical phylogeny') that I'm now half-way through writing. I might ask a few further questions over the coming month or two.

Posts: 5

Participants: 3


August 29, 2015

14:39

@ss107 wrote:

The rate matrix (Q) for Jukes-Cantor generally is expressed as described here: https://en.wikipedia.org/wiki/Models_of_DNA_evolution#Most_common_models_of_DNA_evolution

The factor of 1/4 arises because each base frequency is 0.25 under Jukes-Cantor, and μ is the rate of substitution.

The transition probability matrix P(t) for a branch length t is the matrix exponential of Q multiplied by t.

What I understand is: the matrix P will be entirely different if the substitution rate μ has two different values (say 0.25 and 1).
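For JC the matrix exponential has a simple closed form, which makes this easy to check numerically: with Q as above (off-diagonal entries μ/4), exp(Qt) has diagonal entries 1/4 + (3/4)e^(-μt) and off-diagonal entries 1/4 - (1/4)e^(-μt). A sketch (the function name is mine) - note that P(t) depends on μ and t only through the product μt:

```python
import math

def jc_p(mu, t):
    """JC69 transition probabilities P(t) = exp(Q t) in closed form,
    with Q_ij = mu/4 for i != j and Q_ii = -3*mu/4."""
    same = 0.25 + 0.75 * math.exp(-mu * t)   # P(same base at both ends)
    diff = 0.25 - 0.25 * math.exp(-mu * t)   # P(change to one specific base)
    return same, diff

# Only the product mu*t matters: these two settings give identical matrices.
print(jc_p(0.25, 4.0) == jc_p(1.0, 1.0))  # True
```

So at a fixed t, different values of μ do give different matrices; but any rescaling of μ can be absorbed into the branch lengths, which is why software is free to normalize the rates to a conventional scale.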

The question I have: Is there any restriction on the value of μ?

Related software packages like SeqGen and MrBayes express the rate matrix in terms of six rates {AC, AG, AT, CG, CT, GT}, all of which equal μ in this discussion.

From the examples in the related manuals, I think they normalize the rates so that their sum is 1. That means, for JC, 6μ = 1 always!

Generally, the substitution rates are taken as input either as percentages of the rate sum or scaled relative to the GT rate. I believe this information is not enough to determine the actual rate.

In simple terms, for JC, saying,

AC = AG = AT = CG = CT = GT

is not enough. You have to state explicitly what the value is. Depending on that value, the transition probability matrix can change drastically.

If I always normalize the values, then the rate matrix for JC reduces to a matrix with fixed values.

I know some of my understanding is wrong. Where am I wrong?

Posts: 3

Participants: 2


August 28, 2015

18:08

@ematsen wrote:

I'm going to keep on advertising this position with @trvrb until I'm hoarse, because I think that B cell receptor analysis is just about the coolest thing right now.

If you need a practical reason to check this out, this is a way for you to get in on the ground level of a new subfield of molecular evolution / phylogenetics which is rich in data and very weak in methods, and to get biomedical funding for phylogenetic work (which is a good thing in the current funding environment).

Here is the text of the ad:

A new postdoctoral position is available in the groups of Trevor Bedford and Erick Matsen at the Fred Hutchinson Cancer Research Center located in Seattle, WA. This position will focus on analyzing immune repertoire sequence data from an evolutionary perspective.

Deep sequencing of B cell repertoires has opened up the possibility of understanding the immune response at a molecular level. This project will analyze repertoire data to compare processes of clonal expansion and affinity maturation in primary and secondary immune responses. These studies will allow, for the first time, a detailed mechanistic understanding of original antigenic sin in viral infection; results from these studies may allow construction of vaccines that better deal with a diversity of pathogen exposures. This work represents a collaboration with experimental colleagues at the Fred Hutch, Emory University and Adaptive Biotechnologies. There will be no want of fresh data.

The ideal candidate will have experience working with sequence data and a strong interest in model-based statistical analysis of real data sets. Data sets will comprise millions of B cell repertoire sequences, so strong coding abilities are essential. Candidates should have experience in at least one programming language and a proven track record of peer-reviewed publications. Candidates with PhDs from diverse backgrounds are encouraged to apply, including biology, mathematics, statistics, physics, and computer science; experience in phylogenetics or immunological bioinformatics is especially welcome. The Fred Hutch is an equal opportunity employer, committed to workforce diversity. Women and minorities are particularly encouraged to apply.

The position is available immediately, with flexible starting dates, for a 2-year appointment with the possibility of extension. Informal inquiries are welcome. Applications will be accepted until the position is filled. The Fred Hutchinson Cancer Research Center offers competitive salaries commensurate with experience and skills, complete with benefits.

To apply, please send (1) a cover letter that includes the names and contact information for three references and a short statement of research interests, (2) a current CV, and (3) code samples or links to published/distributed code to trevor@bedford.io.

Posts: 2

Participants: 1


17:21

@ematsen wrote:

News to me: p4 phylogenetics toolkit, which does inference and manipulation.

P4 is a Python package for maximum likelihood and Bayesian analysis of molecular sequences. Its specialty is that it can use heterogeneous models, where the characteristics of the model can differ over the data or over the tree.

  • P4 can be used as a phylogenetic toolkit, the elements of which you can string together in different ways depending on the job at hand. It is useful for programmatic manipulation of phylogenetic data and trees. If you want to do something interesting with your trees or data, p4 might have at least some of what you want to do already in place.
  • P4 will read data in a few of the common phylogenetic formats (eg Nexus, Phylip, clustalw, fasta, pir/nbrf), but does not read other formats in bioinformatics (eg EMBL, genbank). P4 will read in trees in Nexus or Phylip format.
  • P4 will do some elementary data manipulation, eg extracting a Nexus-defined charset from an alignment, or converting data from one format to another. P4 will also do tree manipulation, and tree drawing. It has a big tree viewer, to be able to view big trees (eg up to 5000 taxa) on the screen.
  • P4 is meant to be easily extensible, so if you want to do something that it cannot do, it is often easy to add that functionality.

As far as I can tell, there isn't much overlap with @jeetsukumaran's DendroPy or Paul Lewis's Phycas.

Posts: 1

Participants: 1


August 19, 2015

08:44

@rdmpage wrote:

This may seem like a strange question, but here goes. I'm looking for a computationally efficient way to represent a set of sequences in a 2D space. For example, imagine that we have 10,000 DNA barcodes. I could compute a tree, but (a) that gets computationally hard (if it doesn't seem hard, make it 100,000, or 1M), and (b) a tree drawing isn't stable, in the sense that there's no global coordinate system that helps us compare trees for different subsets of data.

I thought about using something like DNA walks, where we start at 0,0 in an x-y graph, walk along a sequence, and make moves of -1 or 1 in the x or y direction depending on the next base in the sequence. For example, we could plot the final x,y coordinates at the end of the walk, and we'd have a simple-to-compute measure that depends solely on the sequence at hand, and which locates a sequence in a shared coordinate space. What I'd really like are broad-brush clusters that are recognisable enough to say "OK, over there are fish, these are insects, that cluster is molluscs". I'm guessing this might work if there are clade-specific sequence properties such as base composition, etc.; otherwise, not so much.
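The walk described above is only a few lines of code. A sketch (the axis convention here - A/T stepping along x, C/G along y - is one arbitrary choice, since the post leaves it open):

```python
def dna_walk_endpoint(seq):
    """Walk along a sequence, stepping +/-1 in x for A/T and +/-1 in y for C/G,
    and return the final (x, y) coordinate. Non-ACGT characters are skipped."""
    steps = {"A": (1, 0), "T": (-1, 0), "C": (0, 1), "G": (0, -1)}
    x = y = 0
    for base in seq.upper():
        dx, dy = steps.get(base, (0, 0))
        x, y = x + dx, y + dy
    return x, y

print(dna_walk_endpoint("ACGTACGT"))  # (0, 0): balanced composition
print(dna_walk_endpoint("AAAACC"))    # (4, 2): A-rich, slightly C-rich
```

One caveat worth noting: the endpoint depends only on the base counts, not their order, so this particular statistic can only separate groups with distinct compositional biases - consistent with the intuition in the last sentence above.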

Hope this doesn't sound too ridiculous. I'm curious as to whether there's a method for getting a quick sense of the taxonomic composition of a large set of N sequences that doesn't require the N^2 comparisons needed to compare the sequences in order to build a tree.

Posts: 4

Participants: 3


August 3, 2015

18:33

@ematsen wrote:

I thought it would be fun to have a discussion about interesting optimization heuristics for phylogenetic inference. Please post to contribute! Here's one from the parsimony literature:

The Parsimony Ratchet (Nixon, Cladistics 1999; PDF):
  1. Generate a starting tree (e.g., a “Wagner” tree followed by some level of branch swapping or not)
  2. Randomly select a subset of characters, each of which is given additional weight (e.g., add 1 to the weight of each selected character).
  3. Perform branch swapping (e.g., “branch-breaking” or TBR) on the current tree using the reweighted matrix, keeping only one (or a few) trees.
  4. Set all weights for the characters to the “original” weights (typically, equal weights).
  5. Perform branch swapping (e.g., branch-breaking or TBR) on the current tree (from step 3) keeping one (or a few) trees.
  6. Return to step 2. Steps 2–6 are considered to be one iteration, and typically, 50–200 or more iterations are performed. The number of characters to be sampled for reweighting in step 2 is determined by the user; I have found that between 5 and 25% of the characters provide good results in most cases.

In this context, a "weight" is a per-column multiplier of the parsimony score used in the grand parsimony total.
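The control flow of steps 1-6 can be sketched generically. In the toy below, `score` and `neighbors` are hypothetical stand-ins for weighted parsimony scoring and branch swapping (a real implementation would plug in TBR over tree space); the point is just the reweight/search/restore loop:

```python
import random

def ratchet(score, neighbors, start, n_chars, iters=30, frac=0.5, seed=1):
    """Generic parsimony-ratchet loop (toy sketch). `score(state, w)` is a
    weighted score to minimize; `neighbors(state)` enumerates candidate moves."""
    rng = random.Random(seed)
    flat = [1] * n_chars                     # the "original" equal weights

    def climb(state, w):
        # greedy stand-in for branch swapping: move while a neighbor improves
        while True:
            best = min(neighbors(state), key=lambda s: score(s, w))
            if score(best, w) >= score(state, w):
                return state
            state = best

    state = climb(start, flat)               # step 1: starting tree
    best = state
    for _ in range(iters):
        w = list(flat)
        for i in rng.sample(range(n_chars), max(1, int(frac * n_chars))):
            w[i] += 1                        # step 2: upweight a random subset
        state = climb(state, w)              # step 3: swap under reweighting
        state = climb(state, flat)           # steps 4-5: restore weights, swap
        if score(state, flat) < score(best, flat):
            best = state
    return best

# Toy landscape: three "trees" (0, 1, 2) and two characters with conflicting
# costs. Flat scores are [3, 4, 2]: plain hill climbing from 0 is stuck at the
# local optimum 0, but iterations that upweight character 1 cross the barrier.
costs = [[0, 4, 1], [3, 0, 1]]
score = lambda s, w: sum(wi * c[s] for wi, c in zip(w, costs))
nbrs = lambda s: [x for x in (s - 1, s + 1) if 0 <= x <= 2]
print(ratchet(score, nbrs, start=0, n_chars=2))
```

The toy makes the mechanism visible: reweighting deforms the landscape so that a move that looks bad under equal weights becomes downhill, which is exactly how the ratchet escapes islands of trees.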

I think this one is interesting, and I don't know of anything like it being used in the likelihood literature. Does anyone else? Seems to me that the closest would be a love child between heated chains and the bootstrap.

Posts: 4

Participants: 2


July 18, 2015

19:52

@mathmomike wrote:

Phylomania - November - Hobart, Tasmania, Australia http://www.maths.utas.edu.au/phylomania/phylomania2015.htm and then the 20th annual NZ phylo meeting - February, Tongariro National Park, NZ http://www.math.canterbury.ac.nz/bio/events/doom16/

Posts: 1

Participants: 1


June 25, 2015

09:01

@ematsen wrote:

Interoperability, reproducibility, sharing, and reuse of phylogenetic data and software.

Posts: 1

Participants: 1


June 18, 2015

07:38

@arlin wrote:

Hi everyone. One of the consequences of NESCent closing is that its mailing lists will be going away in the coming months. I was part of 2 working groups that maintained email lists. The 2 lists include pretty much everyone who has attended a hackathon at NESCent (wg-phyloinformatics, which began in 2006, and hip, which began in 2011).

We considered just migrating to a new non-NESCent email list, but an alternative would be to encourage current list members to sign up on phylobabble. If we could start a topic on phyloinformatics, then presumably users could set up a feed and it would have the same immediacy as an email list.

What do you think? I welcome your thoughts on that idea.

Posts: 6

Participants: 3


June 11, 2015

14:45

@BrianFoley wrote:

Are there any distance calculators, such as DNAdist in PHYLIP, that can be set to treat ambiguity codes as a partial match? For example, I want an R to be counted as half a match to A or G. I believe that PHYLIP DNAdist counts R as a full match to either A or G.

For diploid organisms, an "R" usually indicates that one allele had A and the other G. But for populations, such as a swarm of HIV-1 in a single patient, the R usually means that part of the population had A and the other part G.
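The fractional-match scoring itself is straightforward to compute. One natural choice (a sketch; the function names are mine, and I'm not claiming any existing package uses exactly this) is the probability that two codes match if each resolves uniformly at random over its base set, which gives R vs A = 0.5 as requested:

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they can represent
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_score(a, b):
    """Probability that codes a and b match if each resolves uniformly at
    random over its base set: R vs A = 0.5, R vs R = 0.5, N vs anything = 0.25."""
    sa, sb = set(IUPAC[a.upper()]), set(IUPAC[b.upper()])
    return len(sa & sb) / (len(sa) * len(sb))

def p_distance(seq1, seq2):
    """Uncorrected p-distance giving partial credit for ambiguity codes."""
    scores = [match_score(a, b) for a, b in zip(seq1, seq2)]
    return 1.0 - sum(scores) / len(scores)

print(match_score("R", "A"))    # 0.5
print(p_distance("AR", "AA"))   # 0.25
```

For a viral swarm the uniform-resolution assumption is itself a simplification (the within-site allele frequencies are rarely 50/50), but the same scoring table could be reweighted by observed frequencies.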

Posts: 4

Participants: 4


June 10, 2015

09:16

@erikvolz wrote:

We are seeking a software developer to assist with incorporating new models into BEAST:
https://goo.gl/Z1m9XC
This is a fixed-term 2-year position based at Imperial College. The focus is on software development, but it could potentially be a good fit for someone with a scientific background and extensive programming experience. Please circulate to anyone you feel would be interested and qualified.

Posts: 1

Participants: 1


June 9, 2015

06:54

@max wrote:

Dear phylobabblers,

I got invited to give a talk at the Brazilian Mathematics Colloquium (end of July) on the interface of Maths and Stats with phylogenetics/phylodynamics.

I'm now gathering ideas from fellow mathematicians and mathematical biologists on what exactly would catch the attention of an audience of mathematicians. I have spoken to statisticians before, but never to hardcore pure mathematicians, which is why I'm asking.

My initial ideas involve talking about the Kingman coalescent, and/or some of Susan Holmes's work on the geometry of tree space, and/or the connections with macroscopic ODE-based models such as the SIR model.

I'm looking especially at you, @cwhidden, @ematsen and @mathmomike.

Best,

Luiz

Posts: 6

Participants: 4


05:22

@Gadget wrote:

Hi folks, I am generating phylogenies for a large number of GPCR gene families from a variety of organisms. A lot of these gene families contain 2,000-3,000 genes, and I can run ProtTest on them. As they are transmembrane proteins, the JTT+G model has the best fit for some of the alignments. However, there are some families with more than 4,000 sequences, which ProtTest cannot handle. Does anyone know of an alternative multithreaded program to determine the best-fit amino acid model of sequence evolution for large datasets? The 4,000+ gene families are close orthologues of the smaller families; however, this is surely not enough to justify assuming the larger families must be JTT too. Can anyone share some advice on how to proceed? Thank you!

Posts: 2

Participants: 2


June 4, 2015

16:45

@jeetsukumaran wrote:

The fourth major version series of the DendroPy Phylogenetic Computing Library has been released!

http://dendropy.org

Get it now with:

$ sudo pip install -U dendropy
  • DendroPy 4 runs under Python 2.7 and Python 3.x
  • Re-architected and re-engineered from the ground up, yet preserving (as much as possible, though certainly not all of) the public API of DendroPy 3.x.
  • MAJOR, MAJOR, MAJOR performance improvements in data file reading and processing! Newick and Nexus tree file parsing crazily optimized, with per-node parsing cost now effectively O(1) rather than O(n), so whole-tree parsing scales linearly rather than quadratically (i.e., in practical terms, you will see bigger performance improvements with bigger trees when comparing DendroPy 4 vs. DendroPy 3). A thousand-tip tree can be parsed in 0.1 seconds with DendroPy 4 vs. 0.2 seconds with DendroPy 3, while a one million-tip tree can be parsed in under two minutes with DendroPy 4, vs. over 4 days with DendroPy 3. These performance improvements will percolate down to all applications based on DendroPy, including, for example, SumTrees.
  • Tests, tests, tests, tests, and more tests! The core library has a stupendous amount of new tests added, and with each one the ability to zero in and identify, isolate, and deal with bugs is improved.
  • Related to above: dozens of nasty bugs have been dealt with. No, not killed, because we are not that kind of organization. Rather, they have been taken to the big testing farm in the quarantine zone where they can lead healthy lives munching on mock constructs and helping us test the library to ensure that it works as advertised so that your code works as advertised.
  • Documentation, documentation, documentation! The goal is to have every public method, function, or class fully-documented.
  • Many, many, many, many new features: e.g., a high-performance TreeArray class, calculation of MCCT topologies, new simulation models, new tree statistics, new tree manipulation routines.
  • SumTrees works faster than ever before thanks to the above improvements, and also allows for many new operations, such as rerooting the target tree, using an MCCT tree as the target topology, extensive extra information summarized, auto-detection of the number of parallel processors, etc.: http://dendropy.org/programs/sumtrees.html
  • The newly rewritten DendroPy primer is just full of information to get you started: http://dendropy.org/primer/index.html .
  • The "work-in-progress" migration primer will help ease the transition from 3 to 4: http://dendropy.org/migration.html .
  • Comprehensive documentation of all the data formats supported, plus all the keyword arguments you can use to control and customize reading and writing in all these different formats: http://dendropy.org/schemas/index.html .
  • A glossary of terms, to clarify the simultaneously redundant and oversubscribed/conflicting terminological soup that characterizes a lot of phylogenetics: http://dendropy.org/glossary.html .

Posts: 3

Participants: 2
