Help   ClustalW



  • INTRODUCTION

    Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.

    Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions) while local alignments can avoid them, aligning regions between gaps. ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences. The alignment is progressive and considers the sequence redundancy. Trees can also be calculated from multiple alignments. The program has some adjustable parameters with reasonable defaults.


    ClustalW Nucleotide Tutorial
    ClustalW Protein Tutorial


  • YOUR SEQUENCES

    Please make sure that your sequences have different names, because only the first 30 characters of each name are significant. If ClustalW finds two or more sequences with the same name, it will fail!

    ClustalW currently supports 7 multiple sequence formats. These are:

  • NBRF/PIR
  • EMBL / UniProt/SwissProt
  • Pearson (Fasta)
  • GDE
  • ALN/ClustalW
  • GCG/MSF
  • RSF


    Please remove any white space or empty lines from the beginning of your input.
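    The two input rules above (unique names within the first 30 characters, no leading white space) can be checked before submitting. The following is a minimal sketch, not part of ClustalW; the function name check_fasta_names is hypothetical, and it assumes FASTA input in which the sequence name is the first word after ">".

```python
def check_fasta_names(text, significant=30):
    """Return the truncated names that occur more than once in FASTA text."""
    seen = set()
    duplicates = []
    # lstrip() drops white space and empty lines at the start of the input.
    for line in text.lstrip().splitlines():
        if not line.startswith(">"):
            continue
        words = line[1:].split()
        # Assumption: the name is the first word of the header line.
        name = words[0][:significant] if words else ""
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


if __name__ == "__main__":
    sample = "\n\n>FOSB_MOUSE Protein fosB\nMFQAF\n>FOSB_MOUSE copy\nMFQAF\n"
    print(check_fasta_names(sample))
```

    An empty list means that no two names collide within the significant prefix.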

    N.B. Bootstrapping is possible: you can paste the .aln files from your results back into the ClustalW submission form.

  • YOUR EMAIL

    You must type your email address in this text box if you are running a job via email. It is not necessary to fill in the box if you are running your search interactively.

  • ALIGNMENT TITLE

    You may type any text you want to help you identify your search results.

  • RESULTS

    This option lets you choose between email and interactive runs. For an email run, type your email address (for example, joe@somewhere.domain.country) in the email text box. Your results will be delivered to that address when they are ready, so you do not have to wait for them as you would with an interactive run.

    The default is interactive.

  • ALIGNMENT

    You may choose between a full alignment, which uses a stringent algorithm for generating the guide tree, and a fast algorithm.

  • CPU MODE

    The multiple CPU option runs a special version of ClustalW on several Linux PC nodes in parallel to increase the speed of the job without compromising the quality of the results. Choose this option when you have a large number of sequences to align (more than 50 but fewer than 500). However, take care not to over-interpret the results: a very large alignment is difficult to read and hard for other software to handle.

  • OUTORDER

    Choose the order in which the sequences are printed in the alignment.

  • COLOUR

    You can change this option in the output results: if you click the button, the alignment will be shown in colour.
    NOTE: This option only works when you have chosen ALN or GCG as the output format. Residues are coloured according to the following physicochemical criteria:

    Residues    Colour    Property
    AVFPMILW    RED       Small (small + hydrophobic, incl. aromatic - Y)
    DE          BLUE      Acidic
    RHK         MAGENTA   Basic
    STYHCNGQ    GREEN     Hydroxyl + Amine + Basic - Q
    Others      GRAY

    The default is not to colour the alignments.

    Courtesy of http://prowl.rockefeller.edu

    CONSENSUS SYMBOLS:

    An alignment will display by default the following symbols denoting the degree
    of conservation observed in each column:

    "*" means that the residues or nucleotides in that column are identical in all
    sequences in the alignment.

    ":" means that conserved substitutions have been observed, according to the
    COLOUR table above.

    "." means that semi-conserved substitutions are observed.

    Example:

    CLUSTAL W (1.82) multiple sequence alignment
    
    
    FOSB_MOUSE      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
    FOSB_HUMAN      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
                    ************************************************************
    
    FOSB_MOUSE      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS 120
    FOSB_HUMAN      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS 120
                    ********************************.***************:*.**:******
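    The "*" rule above is simple enough to sketch in code. The example below is only an illustration under stated assumptions: identity_markers is a hypothetical helper, it marks fully identical columns only, and it leaves every other column blank because the ":" and "." symbols depend on substitution groups that are not reproduced here.

```python
def identity_markers(rows):
    """Return a marker line with '*' under columns where all rows agree."""
    markers = []
    for column in zip(*rows):
        # A column counts as identical only if every row has the same
        # residue and that residue is not a gap.
        same = len(set(column)) == 1 and column[0] != "-"
        markers.append("*" if same else " ")
    return "".join(markers)


if __name__ == "__main__":
    mouse = "PPAVDPY"
    human = "PPVVDPY"
    print(mouse)
    print(human)
    print(identity_markers([mouse, human]))
```

    The rows passed in must already be aligned, i.e. have equal length with gaps written as "-".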


    FAST PAIRWISE ALIGNMENT OPTIONS:

    • KTUP
      This option allows you to choose which 'word-length' to use when calculating fast pairwise alignments. (Note: make sure you have chosen 'fast' in the ALIGNMENT option.)

    • WINDOW
      Use this option to set the window length when calculating fast pairwise alignments. (Note: make sure you have chosen 'fast' in the ALIGNMENT option.)

    • SCORE
      This option allows you to decide which score to take into account when calculating a fast pairwise alignment. (Note: make sure you have chosen 'fast' in the ALIGNMENT option.)

    • TOPDIAG
      Select how many top diagonals should be integrated when calculating a fast pairwise alignment. (Note: make sure you have chosen 'fast' in the ALIGNMENT option.)

    • PAIRGAP
      Use this option to set the gap penalty when generating fast pairwise alignments.
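    To give a feel for what a word length (KTUP) and a number of top diagonals (TOPDIAG) mean, here is a rough sketch of word matching between two sequences. It is not ClustalW's implementation: the function top_diagonals and its scoring (a plain count of matching words per diagonal) are assumptions made for illustration, and the WINDOW, SCORE and PAIRGAP options are not modelled.

```python
from collections import Counter, defaultdict


def top_diagonals(seq_a, seq_b, ktup=2, topdiag=3):
    """Return the (diagonal, word_matches) pairs with the most shared words."""
    # Index every word of length ktup in seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - ktup + 1):
        index[seq_a[i:i + ktup]].append(i)

    # A word match at positions i (seq_a) and j (seq_b) lies on diagonal i - j.
    diagonals = Counter()
    for j in range(len(seq_b) - ktup + 1):
        for i in index.get(seq_b[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals.most_common(topdiag)


if __name__ == "__main__":
    print(top_diagonals("AMFQAF", "MFQAF", ktup=2, topdiag=1))
```

    A longer word length gives fewer, more specific matches, which is faster but less sensitive; a shorter one does the opposite.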

    MULTIPLE SEQUENCE ALIGNMENT OPTIONS:

    • MATRIX
      This option allows you to choose which matrix series to use when generating the multiple sequence alignment. The program goes through the chosen matrix series, spanning the full range of amino acid distances.
      • BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity searches. The matrix used is Blosum30.
      • PAM (Dayhoff). These have been extremely widely used since the late '70s. We use the PAM 350 matrix.
      • GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 250 matrix.

        We also supply an identity matrix which gives a score of 10 to two identical amino acids and a score of zero otherwise.

      • Default values are:
        • DNA: DNA Identity matrix.
        • Protein: Gonnet 250.
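    As an illustration of how a substitution matrix is used, the identity matrix described above can be written as a one-line scoring rule. This is a sketch only: the helper names are hypothetical, and skipping gapped positions is an assumption made here, since gap penalties are set by separate options.

```python
def identity_score(a, b, match=10, mismatch=0):
    """Score two residues with the identity matrix: 10 if equal, else 0."""
    return match if a == b else mismatch


def score_aligned_pair(seq_a, seq_b):
    """Sum identity scores over the ungapped columns of two aligned rows."""
    return sum(
        identity_score(x, y)
        for x, y in zip(seq_a, seq_b)
        if x != "-" and y != "-"  # assumption: gaps are scored elsewhere
    )


if __name__ == "__main__":
    print(score_aligned_pair("MFQA", "MFQV"))
```

    Matrices such as Gonnet 250 follow the same pattern but look up a different value for every pair of amino acids rather than only distinguishing equal from unequal.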


    • GAPOPEN
      Sets the penalty for opening a gap. The default value is 10.

    • ENDGAP
      Sets the penalty for closing a gap.

    • GAPEXT
      Sets the penalty for extending a gap. The default value is 0.05.

    • GAPDIST
      Sets the gap separation penalty. The default value is 8.
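    The interplay of the opening and extension penalties can be sketched with a simple affine cost. This is an assumed textbook form (one opening charge plus one extension charge per gapped position), not necessarily the exact formula ClustalW applies, and it ignores ENDGAP and GAPDIST; the function name gap_cost is hypothetical. The defaults are the GAPOPEN and GAPEXT defaults quoted above.

```python
def gap_cost(length, gapopen=10.0, gapext=0.05):
    """Affine cost of one gap of the given length (assumed form)."""
    if length <= 0:
        return 0.0
    # One opening charge, plus an extension charge for every gapped position.
    return gapopen + gapext * length


if __name__ == "__main__":
    for length in (1, 5, 20):
        print(length, gap_cost(length))
```

    With these defaults, opening a gap is far more expensive than lengthening it, so the aligner tends to prefer a few long gaps over many short ones.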



  • EXAMPLE

    For example, a multiple sequence alignment was run with ClustalW on five sequences supplied in FASTA format:

>FOSB_MOUSE Protein fosB
     MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA 
     ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP GTSYSTPGLS AYSTGGASGS
     GGPSTSTTTS GPVSARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
     DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD
     LPGSTSAKED GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
     TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL  
                     
>FOSB_HUMAN Protein fosB
     MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
     ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPVVDPYDMP GTSYSTPGMS GYSSGGASGS 
     GGPSTSGTTS GPGPARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT 
     DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD
     LPGSAPAKED GFSWLLPPPP PPPLPFQTSQ DAPPNLTASL FTHSEVQVLG DPFPVVNPSY
     TSSFVLTCPE VSAFAGAQRT SGSDQPSDPL NSPSLLAL 
                       
>FOS_CHICK Proto-oncogene protein c-fos 
     MMYQGFAGEY EAPSSRCSSA SPAGDSLTYY PSPADSFSSM GSPVNSQDFC TDLAVSSANF 
     VPTVTAISTS PDLQWLVQPT LISSVAPSQN RGHPYGVPAP APPAAYSRPA VLKAPGGRGQ 
     SIGRRGKVEQ LSPEEEEKRR IRRERNKMAA AKCRNRRREL TDTLQAETDQ LEEEKSALQA  
     EIANLLKEKE KLEFILAAHR PACKMPEELR FSEELAAATA LDLGAPSPAA AEEAFALPLM 
     TEAPPAVPPK EPSGSGLELK AEPFDELLFS AGPREASRSV PDMDLPGASS FYASDWEPLG 
     AGSGGELEPL CTPVVTCTPC PSTYTSTFVF TYPEADAFPS CAAAHRKGSS SNEPSSDSLS 
     SPTLLAL  

>FOS_RAT Proto-oncogene protein c-fos 
     MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC ADLSVSSANF  
     IPTVTAISTS PDLQWLVQPT LVSSVAPSQT RAPHPYGLPT PSTGAYARAG VVKTMSGGRA  
     QSIGRRGKVE QLSPEEEEKR RIRRERNKMA AAKCRNRRRE LTDTLQAETD QLEDEKSALQ  
     TEIANLLKEK EKLEFILAAH RPACKIPNDL GFPEEMSVTS LDLTGGLPEA TTPESEEAFT 
     LPLLNDPEPK PSLEPVKNIS NMELKAEPFD DFLFPASSRP SGSETARSVP DVDLSGSFYA
     ADWEPLHSSS LGMGPMVTEL EPLCTPVVTC TPSCTTYTSS FVFTYPEADS FPSCAAAHRK
     GSSSNEPSSD SLSSPTLLAL  

>FOS_MOUSE Proto-oncogene protein c-fos 
     MMFSGFNADY EASSSRCSSA SPAGDSLSYY HSPADSFSSM GSPVNTQDFC ADLSVSSANF
     IPTVTAISTS PDLQWLVQPT LVSSVAPSQT RAPHPYGLPT QSAGAYARAG MVKTVSGGRA
     QSIGRRGKVE QLSPEEEEKR RIRRERNKMA AAKCRNRRRE LTDTLQAETD QLEDEKSALQ
     TEIANLLKEK EKLEFILAAH RPACKIPDDL GFPEEMSVAS LDLTGGLPEA STPESEEAFT
     LPLLNDPEPK PSLEPVKSIS NVELKAEPFD DFLFPASSRP SGSETSRSVP DVDLSGSFYA
     ADWEPLHSNS LGMGPMVTEL EPLCTPVVTC TPGCTTYTSS FVFTYPEADS FPSCAAAHRK
     GSSSNEPSSD SLSSPTLLAL
		 

Output was in the format:

The sequences are aligned with each other, with the query sequence at the top and subsequent sequences below. Gaps are represented by the "-" symbol. The running total of amino acids or nucleotides is shown on the right. The degree of similarity is illustrated underneath the alignments with a series of consensus symbols.

FOS_RAT         MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE       MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK       MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE      -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN      -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
                 *:..* .:*:: .***** **:.:*   * *..***.*  :.. :*: *:.*.  ...*

FOS_RAT         IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE       IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK       VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE      VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN      VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110
                :******:** **********:**:* **... ::.    .**.:*  :    *: ..: 

FOS_RAT         KTMSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_MOUSE       KTVSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_CHICK       KAP-GGRGQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 151
FOSB_MOUSE      AYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170
FOSB_HUMAN      GYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170
                   :** . * *.::: :::.. .: .: : .** : * *:********:******:***

FOS_RAT         KCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGF 212
FOS_MOUSE       KCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPDDLGF 212
FOS_CHICK       KCRNRRRELTDTLQAETDQLEEEKSALQAEIANLLKEKEKLEFILAAHRPACKMPEELRF 211
FOSB_MOUSE      KCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEG- 229
FOSB_HUMAN      KCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEG- 229
                *********** *********:**: *::***:* ****:***:*.**:*.**:* :   

FOS_RAT         PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD 270
FOS_MOUSE       PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD 270
FOS_CHICK       SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD 265
FOSB_MOUSE      PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ 267
FOSB_HUMAN      PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ 267
                .   . :   ** .     :..  *:.*   *   . *                   **:

FOS_RAT         DFLFPASSRPSGSETARSVPDVDLSG--SFYAADWEPLHSSSLGMGPMVTELEPLCTPVV 328
FOS_MOUSE       DFLFPASSRPSGSETSRSVPDVDLSG--SFYAADWEPLHSNSLGMGPMVTELEPLCTPVV 328
FOS_CHICK       ELLFSAGPR----EASRSVPDMDLPGASSFYASDWEPLGAGSGG------ELEPLCTPVV 315
FOSB_MOUSE      --------------SSRDAP-PNLTA--SLFTHS----------------EVQVLGDPFP 294
FOSB_HUMAN      --------------TSQDAP-PNLTA--SLFTHS----------------EVQVLGDPFP 294
                              :::..*  :*..  *::: .                *:: *  *. 

FOS_RAT         TCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL 380
FOS_MOUSE       TCTPGCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL 380
FOS_CHICK       TCTPCPSTYTSTFVFTYPEADAFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL 367
FOSB_MOUSE      VVSP---SYTSSFVLTCPEVSAF---AGAQR--TSGSEQPSDPLNSPSLLAL 338
FOSB_HUMAN      VVNP---SYTSSFVLTCPEVSAF---AGAQR--TSGSDQPSDPLNSPSLLAL 338
                . .*   :***:**:* **..:*   *.*:*  :*..: .**.*.**:****
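Output in this layout is easy to read back into a program. The sketch below is a simplified reader written for this page, not an official parser: parse_aln and ungapped_length are hypothetical names, and the code assumes that every sequence line starts in the first column with the name, followed by the aligned residues and an optional running total, while consensus lines start with white space.

```python
def parse_aln(text):
    """Join the blocks of a ClustalW-style alignment into one row per name."""
    rows = {}
    for line in text.splitlines():
        # Skip blank lines, consensus lines (leading space) and the header.
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        rows[name] = rows.get(name, "") + chunk
    return rows


def ungapped_length(row):
    """Number of residues in an aligned row, not counting '-' gaps."""
    return len(row.replace("-", ""))


if __name__ == "__main__":
    sample = (
        "FOS_RAT         TCTPSC 6\n"
        "FOSB_MOUSE      VVSP-- 4\n"
        "                . .*  \n"
    )
    for name, row in parse_aln(sample).items():
        print(name, row, ungapped_length(row))
```

For a complete alignment, the ungapped length of each row should equal the last running total printed for that sequence.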


 


    • EDITING AN ALIGNMENT

      You can edit the alignment using Jalview. Click on the button below to view the above alignment.

    • PHYLOGENETIC TREE

      A Phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny, in which branch lengths are proportional to the amount of inferred evolutionary change. A Cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny in which the branches are of equal length; cladograms therefore show common ancestry but do not indicate the amount of evolutionary "time" separating taxa. Tree distances can be shown: just click on the diagram to get a menu of options. The ".dnd" file describes the phylogenetic tree.

      These views are now controlled with new buttons in the output file as well as a pop-up menu that is available by right-clicking on the applet. The buttons on the page include "Show as Phylogram Tree", "Show as Cladogram Tree" and "Show Distances".

      IMPORTANT!

      Please note that applets are not printed out with HTML pages. You will need to:

      • Use the "Print Screen" button in the top right corner of your keyboard.
      • Open an imaging application such as Paint or Photoshop.
      • Go to "file>new" from the menu or press "control+N" on the keyboard to create a new image.
      • Go to "edit>paste" from the menu or press "control+V" on the keyboard to paste your screen capture.
      • Then use the crop function to trim the image (e.g. "image>crop").
      • Then save or print the image.



      IMPORTANT:
      To use this option you will need to input a sequence alignment. Please make sure this alignment is in PIR or PHYLIP format. ALN and GCG MSF files are not supported so you will have to convert your MSF files to PIR format with, for example, GCG's ToPir:
      topir pileup.msf{*} -outf=pileup.pir 
      
      Please refer to the GCG documentation to find out how to use this program correctly. You may then use this file as input (cut and paste or upload) to this service.

      The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First, distances (percent divergence) are calculated between all pairs of sequences from a multiple alignment; second, the NJ method is applied to the distance matrix.
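      The first step, computing a divergence value for every pair of aligned sequences, can be sketched as follows. This is an illustration rather than ClustalW's code: the function names are hypothetical, distances are returned as fractions (multiply by 100 for percent), and skipping positions where either sequence has a gap is an assumption. The neighbour-joining step itself is not shown.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of compared positions at which two aligned rows differ."""
    # Assumption: positions with a gap in either row are not compared.
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Map every pair of names to the divergence of their aligned rows."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


if __name__ == "__main__":
    rows = {"FOSB_MOUSE": "PPAVDPY", "FOSB_HUMAN": "PPVVDPY"}
    print(distance_matrix(rows))
```

      The resulting pairwise values are the input that a neighbour-joining implementation would turn into a tree.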

      This option allows you to choose the following output formats for the tree:

        • Neighbour
        • Phylip
        • Distance

      In order to view these trees you must have a program capable of displaying the data. Please refer to this page's section on OUTPUT for more information.

    • Kimura Correction of distances
      This option allows you to switch on distance correction (correction for multiple substitutions). The correction is needed because, as sequences diverge, more than one substitution will happen at many sites, yet you only see one difference when you look at the present-day sequences. The option therefore has the effect of stretching branch lengths in trees (especially long branches). The corrections used here (for DNA or proteins) are both due to Motoo Kimura.
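      The stretching effect can be seen with the empirical protein formula usually attributed to Kimura (1983), D = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing positions. Whether this service applies exactly this expression is an assumption; the function name is hypothetical, and the separate DNA correction is not shown.

```python
import math


def kimura_protein_distance(p):
    """Correct an observed difference fraction p for multiple substitutions."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The formula is undefined for very divergent pairs (p above ~0.85).
        raise ValueError("observed distance too large to correct")
    return -math.log(inner)


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.5):
        print(p, round(kimura_protein_distance(p), 4))
```

      The corrected value is always at least as large as the observed one, and the difference grows with p, which is why long branches are stretched the most.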

    • Ignore Gaps in alignment
      With this option, any alignment positions where ANY of the sequences have a gap will be ignored. This means that 'like' will be compared to 'like' in all distances. It also automatically throws away the most ambiguous parts of the alignment, which are usually concentrated around gaps. The disadvantage is that you may throw away much of the data if there are many gaps.
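      In code, ignoring gaps amounts to dropping every column that contains at least one "-" before distances are computed. The helper below is a small sketch with a hypothetical name, shown only to make the trade-off concrete.

```python
def drop_gapped_columns(rows):
    """Remove every alignment column in which any row has a gap."""
    kept = [column for column in zip(*rows) if "-" not in column]
    if not kept:
        # Every column had a gap somewhere, so nothing is left to compare.
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


if __name__ == "__main__":
    rows = ["MF-A", "MFQA", "M-QA"]
    print(drop_gapped_columns(rows))
```

      In this toy input, two of the four columns are lost, which shows how quickly data can disappear when gaps are frequent.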


    • UPLOAD A FILE

      You may use this option to upload a file from your computer containing a valid set of sequences in any format (GCG, FASTA, EMBL, GenBank, PIR, NBRF, Phylip or UniProt/Swiss-Prot). Please note that this option only works with Netscape browsers or Internet Explorer version 5 or later. Files from some word processors may yield unpredictable results, as hidden/control characters may be present in them. It is best to save files with the Unix format option to avoid hidden Windows characters.

    • SCORES TABLE

      The Scores Table is a new view of ClustalW output. Users can sort the scores by Alignment Score, Sequence Number, Sequence Name and Sequence Length.