My Way of Life: October 2010

Friday, October 29, 2010

running pace new method

Step 1. generate datafile with n sequences

head -n $((n * 2)) original.fsa > small_sample.fsa
(n = number of sequences wanted; this assumes two lines per sequence)
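If the FASTA file wraps sequences over several lines, counting lines no longer gives n sequences. Here is a minimal Python sketch that counts records instead. The names first_n_records and write_sample are my own, not part of PaCE.

```python
from typing import Iterable, Iterator


def first_n_records(lines: Iterable[str], n: int) -> Iterator[str]:
    """Yield the lines of the first n FASTA records, however they are wrapped."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                return
        if seen:  # ignore anything that comes before the first header
            yield line


def write_sample(src: str, dst: str, n: int) -> None:
    """Copy the first n records of src (e.g. original.fsa) into dst."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(first_n_records(fin, n))
```

For example, write_sample("original.fsa", "small_sample.fsa", 5000) would produce a 5k sample.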

Step 2. run preprocess

./preprocessPaCE 5k.fsa 1

Step 3. run GItitle.pl

./GItitle.pl 5k.fsa.PaCE > 5k.GI.fsa

Step 4. run PaCE

#!/bin/csh
#@ job_type = parallel
#@ class = LONG
#@ account_no = NONE
#@ node = 2
#@ tasks_per_node = 4
#@ checkpoint = no
#@ wall_clock_limit = 05:00:00
#@ error = $(Executable).$(Cluster).err
#@ output = $(Executable).$(Cluster).out
#@ environment = COPY_ALL
#@ queue

llmachinelist
mpirun -v -np 8 -machinefile /tmp/machinelist.$LOADL_STEP_ID ./PaCE_v9 \
    /N/gpfs/cap3/leesangm/HumanMRNA/split1mil/5k.GI.PaCE 5000 \
    ./Phase.cfg

llsubmit submitTest.sh
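The -np value given to mpirun has to agree with node x tasks_per_node in the job directives (2 x 4 = 8 in the script above). A small sketch to check that before submitting; parse_directives and total_tasks are hypothetical helper names, and only the "#@ key = value" lines are read.

```python
def parse_directives(script_text: str) -> dict:
    """Collect '#@ key = value' LoadLeveler directives into a dict."""
    directives = {}
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("#@") and "=" in line:
            key, value = line[2:].split("=", 1)
            directives[key.strip()] = value.strip()
    return directives


def total_tasks(directives: dict) -> int:
    """Number of MPI tasks requested: node * tasks_per_node."""
    return int(directives["node"]) * int(directives["tasks_per_node"])


# Shortened copy of the job script above, used only as sample input.
script = """#!/bin/csh
#@ job_type = parallel
#@ node = 2
#@ tasks_per_node = 4
#@ queue
"""

print(total_tasks(parse_directives(script)))
```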

Step 5. split PaCE output

./PaCEclusterFasta.pl test.fsa estClust.5000.3.PaCE

#####
Notes

1. Pace: How to run

mpirun -v -np 8 ./PaCE /N/gpfs/cap3/leesangm/HumanMRNA/example5K/5k.GI.fsa 5000 ./PaCE.cfg

Wednesday, October 27, 2010

installing Pace

Run the "make" command from the shell.

(For system specific MPICH and C compilers modify the MakeFile appropriately.)
After successful build, copy the executables into the folder:

cp PreprocessPaCE.pl PaCE-pipeline/PaCE-clustering
cp PaCE PaCE-pipeline/PaCE-clustering

Checklist for a successful build:
- Executables created: PaCE
- No fatal errors or "serious" warnings flagged by the compiler (minor wa$
- The mk uses O3 optimization level. If this high level is not
supported by the C compiler being used, change it to a lower level
like O2 or O1 or O0 as appropriate.

parameters in pace

Dynamic Programming scores and their default values:
--------------------------------------------------------

(1) match 2 (match)
(2) mismatch -5 (mismatch)
(3) gap continuation -1 (gap)
(4) gap opening -6 (hgap)
(5) Score for alignment
with base 'N' -5 (AlignmentWithN)

Load balancing and Work related parameters:
-----------------------------------------------

(6) Fixed window size for bucketing 11 (window)
If the data size <=10,000 ESTs then a window size of 10 is recommended.
Constraint: window <= MinLen and window<=11

Clustering parameters (Quality control):
----------------------------------------

(7)
(7.1)
MinLen (default 30)
Signifies the minimum length cutoff of a maximal match between any pair of
sequences to be considered for alignment computation. Necessary but not
sufficient condition for a pair of sequences to cause merging of their two clusters.

(7.2)
MaxStringsInABucket (default 100000)
Ignores exact matches of length "window" which occur in >= MaxStringsInABucket
number of distinct input sequences.

(8)
Flag for Gene Homology and Transcript Homology (TranscriptsTogether):
1 means Gene Homology
0 means Transcript Homology
(PS: No other values are valid.)

(9)
Clustering criteria for accepting dynamic programming alignment results:
-------------------------------------------------------------------------

(9.1) Parameters computed:

(a)
EndToEndScoreRatioThreshold (default 15%):
|Global alignment Obtained Score - Global alignment Ideal Score|
---------------------------------------------------------------- X 100
Global alignment Ideal Score

(9.2)
EndtoEndAlignLenThreshold (default 100 bp)

Global alignment length = length of aligning region
(w.r.t the minimum of the number of
bases participating from both sequences in the alignment)

(9.3)
MaxScoreRatioThreshold (default 5%)

Local alignment Score Ratio
|Local alignment Obtained Score - Local alignment Ideal Score|
= ---------------------------------------------------------------- X 100
Local alignment Ideal Score

(9.4)
TranscriptCoverageThreshold (default 40%)

Local alignment length Coverage
= (Local alignment length / minimum of the lengths of the two sequences) X 100
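The two ratio formulas above are plain arithmetic, so here is a small sketch of them. How PaCE derives the "ideal" score is not spelled out in these notes, so it is passed in as a number, and the function names are my own.

```python
def score_ratio(obtained: float, ideal: float) -> float:
    """|obtained - ideal| / ideal, expressed as a percentage."""
    if ideal <= 0:
        raise ValueError("ideal score must be positive")
    return abs(obtained - ideal) / ideal * 100.0


def local_coverage(align_len: int, len_a: int, len_b: int) -> float:
    """Local alignment length as a percentage of the shorter sequence."""
    return align_len / min(len_a, len_b) * 100.0


# Made-up numbers, only to show the calls.
print(score_ratio(obtained=190, ideal=200))
print(local_coverage(align_len=200, len_a=500, len_b=800))
```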

Condition for merging two clusters based on evidence from an aligned pair of sequences:

Condition#1:= ( (Global alignment score ratio <=25%) AND (Global alignment length>=100) )
Condition#2:= ( (Local alignment score ratio <=5%) AND (Local alignment length Coverage >= 40%) )

Gene Homology:
A pair of ESTs will be put in one cluster if:
either (Condition#1 OR Condition#2 OR both) is/are satisfied.

Transcript Homology:
A pair of ESTs will be put in one cluster if:
(Condition#1) is satisfied
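The merge rule above can be sketched as a small function. The thresholds are hard-coded exactly as written in Condition#1 and Condition#2; in a real run they come from PaCE.cfg, and should_merge is a hypothetical name, not a PaCE routine.

```python
def should_merge(global_ratio: float, global_len: int,
                 local_ratio: float, local_cov: float,
                 transcripts_together: int = 1) -> bool:
    """Decide whether an aligned pair puts two ESTs in one cluster."""
    cond1 = global_ratio <= 25.0 and global_len >= 100   # Condition#1
    cond2 = local_ratio <= 5.0 and local_cov >= 40.0     # Condition#2
    if transcripts_together == 1:    # Gene Homology
        return cond1 or cond2
    if transcripts_together == 0:    # Transcript Homology
        return cond1
    raise ValueError("TranscriptsTogether must be 0 or 1")
```

A pair that passes only the local test (say a 3% local ratio with 60% coverage) merges under Gene Homology but not under Transcript Homology.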

(10)
ClonePairsFile None
Clone Mates/Pairs Information:
------------------------------

Clone Mates or Clone Pairs information can be specified in a file and can be used to improve quality of clustering (esp., cases where ESTs do not show complete over their corresponding transcript). Give the name of the file containing Clone Mate/Pair information agains$

(11)
Reporting features:
------------------

(11.1)
ReportSplicedCandidates (default 0)
If 1:
Reports all pairs of sequences generated that pass the local alignment test (Condition#2)
but FAIL the global alignment test (Condition#1). This can be used as a
set of potential pairs of sequences that flag an alternative splicing or unspliced
intron event.

(11.2)
ReportMaximalPairs (default 0)
If 1:
Reports all pairs of sequences that were generated by PaCE. The pairs are the ones
which have at least one maximal common substring of length >= MinLen.
Warning: The output is quadratic (#pairs) and so use it only for analysis purposes.

(11.3)
ReportMaximalSubstrings (default 0)
If 1:
Reports all maximal common substrings (length >= MinLen) generated by PaCE.
Warning: The output is linear but for large input size can be quite high. So use it only for analysis purposes.

(11.4)
ReportAcceptedPairs (default 0)
If 1:
Reports all pairs of sequences that led to merging of clusters. The number of such pairs
is linear in the number of sequences.

(11.5)
OutputLargeMerges (default 0)

LargeClusterThreshold (default 500)

If 1:
Reports a pair of sequences leading to a cluster merge, if the individual
sizes of the two clusters of these two sequences (at the time of merge)
are both >= LargeClusterThreshold. The reports go into a file called:
large_merges.*

(11.6)
ReportGeneratedPairs (default 0)

If 1:
Report all promising pairs generated. This may take up a lot of disk space
because the number of such pairs in the worst case can be quadratic
in the input number of sequences.

(11.7)
ReportPairsCountUnit (default 1000)

The basic unit of display in the final report on the number of promising pairs
generated, aligned, and accepted.

(11.8)
DumpClustersMidway (default 0)

If 1:
Output intermediate sets of clusters during the course of execution.
Handy, if the run is expected to be a long one.

(12)
OutputFolder (default .)
All PaCE output files will be written into this folder.

(13)
Miscellaneous:
--------------

(13.1)
MPI_Block_Sends (default 1)

If 1:
Uses MPI_Ssend to communicate from slaves to master during the
alignment phase. This is expected to be about 3-4 times slower than
setting MPI_Block_Sends to 0 (i.e., just MPI_Isend and MPI_Wait).
Feature incorporated to ensure no message is lost when a
large number of processors is used. Recommended to turn the flag on
if >= 512 processors are used.

(13.2)
Keep_Mbuf_Full (default 0)

Deprecated.

Assembly of PaCE clusters using CAP3 :

This is a wrapper script for running CAP3 assembly on each PaCE cluster.
This package does NOT include the "cap3" executable.
If CAP3 is not available, it can be obtained by mailing
Dr.Xiaoqiu Huang (xqhuang@cs.iastate.edu).

For the scripts to work, the CAP3 executable ("cap3") should be
present in the directory PaCE-pipeline/CAP3-assembly/ or in the path of the system.

PS:
Although the scripts are for running CAP3 as the assembly tool,
they can be easily modified accordingly to enable usage of alternative assembly tools.
The only script to be modified for this purpose
is "PaCE-pipeline/CAP3-assembly/caploop".

Input:
-------
Must have generated the cluster file using PaCE.
For illustration,
- let the organism be denoted by "tEST",
- let the FASTA data file be "tEST.data"
(PS: this is the original data file - NOT the PaCE preprocessed version),
- let the PaCE cluster file be denoted by "estClust.500.3.PaCE".

Output:
--------
CAP3 assembly output for each PaCE cluster in estClust.500.3.PaCE.

Steps for Assembly:
-------------------

To start with, the following steps assume that the PaCE cluster file (estClust.500.3.PaCE) is
in the directory PaCE-clustering.

- From PaCE-pipeline directory:
cd CAP3-assembly

- mkdir tEST

- cp ../PaCE-clustering/estClust.500.3.PaCE tEST/tEST
ie., copy the PaCE cluster file (from its current location) to the file named "tEST" inside the newly created directory

- You can use the perl script:
perl extractCF.pl tEST/tEST tEST.data

This step will create one FASTA data file corresponding to each PaCE cluster for tEST.

- rm tEST/tEST
ie., Remove the cluster file from the tEST folder - so that now it has only
the data files corresponding to each PaCE cluster

- caploop tEST
This script runs CAP3 on each of the FASTA data files present in the tEST folder,
generating the CAP3 output for each in the same folder.
This is the only script that requires modification if the assembly
tool is NOT CAP3.

Friday, October 22, 2010

Running cap3

after installing cap3

just take a sequence file and run the following at the command prompt

> cap3 filename [options]

you can also supply the optional quality and constraint files described below

CAP3 takes as input a file of sequence reads in FASTA format. If the names of reads contain a dot ('.'), CAP3 requires that the names of reads sequenced from the same subclone contain the same substring up to the first dot. CAP3 takes two optional files: a file of quality values in FASTA format and a file of forward-reverse constraints.

The file of quality values must be named "xyz.qual", and the file of forward-reverse constraints must be named "xyz.con", where "xyz" is the name of the sequence file.
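The two naming rules above are easy to get wrong, so here is a small sketch that checks them. The function names are made up for illustration; nothing here calls CAP3 itself.

```python
import os


def cap3_optional_files(seq_path: str) -> dict:
    """Return the optional CAP3 inputs that actually exist next to seq_path."""
    candidates = {
        "quality": seq_path + ".qual",      # xyz.qual
        "constraints": seq_path + ".con",   # xyz.con
    }
    return {kind: path for kind, path in candidates.items() if os.path.exists(path)}


def subclone_key(read_name: str) -> str:
    """Substring up to the first dot (the whole name when there is no dot)."""
    return read_name.split(".", 1)[0]
```

Reads whose subclone_key values are equal would be treated as coming from the same subclone under the rule quoted above.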

my work involved

input
-----
>cap3 seq

seq is a FASTA file of sequence reads

output
------
6 files

seq.cap.ace
seq.cap.contigs.links
seq.cap.info
seq.cap.contigs
seq.cap.contigs.qual
seq.cap.singlets

seq.cap.ace
-----------

AS 1 5

CO Contig1 1422 5 11 U
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGTTTT
AGTTTTCCTCTGAAGCAAGCACACCTTCCCTTTCCCGTCTGTCTATCCATCCCTGACCCT
GTTGTCTGTCTATCCCTGACCCCGTAGTCTCCTAAGTCGCCCCAGATTTTGTGAACACCC
TCTGGAACTAGAATCTAGTGGGCGGATGGACCATTTACTAGACGGAGGTAGAGGTGGGTG
GATGCGAACGACAGGGTGCATAGTCAGCCCGGTTTTAAGGGCAGGTCACTTGGTAGGTCA
GCAGGCGGGTCAGTGGGCGGGTGCCTGCAGCATTTATGAACTTATTTGGCCCAGCAAACA
TTTTGAGTGTCAGGCCGTGCCTACCCAAGGTGAGGGTAAGGAGCAAAATCAGCCCAGCCC
AGAGCACTGGGTGGCTACACAGAGCCGACCTCTAATGTGCGCTCCGGGTCGGGATGGCAC
TCAGCTCGCCTTTAGGGAGTGATGATCTGGATGCCTGGCTTGGAGGTGACAGAGCCTGCC
CTTATGAGACAATTAAGAGACTGACTAAGCACCCGGCAGGAGGCCACGAGAATCCCCATG
TGAGAAAGAAGAGCATAAACAGGAAACACATTTAATAATTAAACAAAGATAACTCCCTCG
TGTGCGCGCACCGGGCCAGCCCCTATAGAAACATCTGAGGAGTCACTTCCTCCCATGACT
CTCGCCCGCCCGGCCGGCTGGAGTCGGCTCCTGGCAAGCTTCAGGCACCTCAGTTGTCCT
GAATACACACAGCACCCTTTCCTTACTGAAGCCCCTGAGAGCCTCCAGTTCTCCCTCCTT

BQ
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 1$
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2$
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2$
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2$
20 20 20 20 20 20 20 5 5 5 20 2

AF R3 U 1
AF R1 U 44
AF R2 U 1
AF R4 C 478
AF R6 C 571
BS 1 248 R2
BS 249 250 R3
BS 251 849 R2
BS 850 850 R3
BS 851 974 R2
BS 975 1155 R4
BS 1156 1157 R6
BS 1158 1159 R4
BS 1160 1168 R6
BS 1169 1242 R4
BS 1243 1422 R6
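To read the .ace excerpt above: AS gives the number of contigs and reads, CO gives contig name, length, number of reads, number of base segments and orientation (U or C), AF gives read name, orientation and start position, and BS gives start, end and the read supplying that stretch of consensus. A minimal parser sketch for just those four record types (parse_ace_headers is a hypothetical name; sequence and BQ lines are skipped):

```python
def parse_ace_headers(lines):
    """Collect AS/CO/AF/BS records from an ACE file."""
    result = {"n_contigs": None, "n_reads": None, "contigs": []}
    contig = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "AS" and len(fields) == 3:
            result["n_contigs"] = int(fields[1])
            result["n_reads"] = int(fields[2])
        elif tag == "CO" and len(fields) == 6:
            contig = {
                "name": fields[1],
                "length": int(fields[2]),
                "n_reads": int(fields[3]),
                "n_segments": int(fields[4]),
                "orientation": fields[5],
                "reads": {},      # read name -> (orientation, start)
                "segments": [],   # (start, end, read name)
            }
            result["contigs"].append(contig)
        elif tag == "AF" and contig is not None and len(fields) == 4:
            contig["reads"][fields[1]] = (fields[2], int(fields[3]))
        elif tag == "BS" and contig is not None and len(fields) == 4:
            contig["segments"].append((int(fields[1]), int(fields[2]), fields[3]))
    return result
```

With a real file this would be called as parse_ace_headers(open("seq.cap.ace")).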

-------------------------------------------------------------------
seq.cap.contigs.qual
---------------------

>Contig1
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
10 10 10 10 10 10 10 10 10 10 10 10 10 10 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

clear documentation for CAP3 is available at the link below

http://icb.med.cornell.edu/crt/CAP3/example_usage.xml

Wednesday, October 20, 2010

Running the pace

- cd PaCE-pipeline/PaCE-clustering

- Preprocessing:

This preprocessing step formats the input FASTA file and generates
the corresponding input datafile for PaCE.

Run the preprocessPaCE command to preprocess the EST input file (fasta format)
eg.,
$ PreprocessPaCE.pl {FASTA data file} {1: prune PolyA/T, 0 otherwise} > tEST.data.PaCE

If the second argument is :
0: Does not modify the input sequences. Just concatenate each sequence
such that each sequence appears in one line and also
converts all DNA characters to upper case.
1: In addition to the two effects produced by option '0' this option
trims off streaks of As and Ts occurring at either ends
of the sequence. This option is not required if the input FASTA
sequences have already been stripped of poly As and Ts or if
they are required to be input. If you use this option,
make sure you inspect manually some modifications to confirm
there will be no negative impacts on the clustering. This inspection
can be done by first running preprocessPaCE using flag 0 and then
using flag 1 (both on the original file) and taking a diff of both
the output files.

eg.,
$ PreprocessPaCE.pl ../datafiles/tEST.data 0 > ../datafiles/tEST.data.PaCE
This generates another file by name "tEST.data.PaCE" in ../datafiles/ .

- Find "n":
Find the number of sequences in the preprocessed output data file.
Let us call it "n".

This can be found by a simple unix command like:
eg., grep -c ">" ../datafiles/tEST.data.PaCE

ie., n=500 for ../datafiles/tEST.data.PaCE.

- Parameterization:
Parameters to PaCE are kept in the file PaCE.cfg.
Check if PaCE.cfg is there in the directory of the executable.

You typically do NOT need to change any parameters except "window".

Recommended window size:
<=10,000 ESTs: 7
>10,000 and <=30,000 ESTs: 8
otherwise: 9
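The window recommendation just above, written as a tiny helper so it can be dropped into a script that prepares PaCE.cfg. The name recommended_window is my own.

```python
def recommended_window(n_ests: int) -> int:
    """Window size suggested by the README text for a given number of ESTs."""
    if n_ests <= 10_000:
        return 7
    if n_ests <= 30_000:
        return 8
    return 9
```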

- Run PaCE:
PaCE takes two parameters:
Usage: {MPIRunCommand} PaCE {preprocessed FASTA datafile} {number of ESTs}

(P.S: Number of processors should be AT LEAST 2)

eg.,
$ mpirun -np 4 PaCE ../datafiles/tEST.data.PaCE 500

where 4 is the number of processors to run on.

"{MPIRunCommand}" depends on the available MPI implementation and job scheduler
of the parallel cluster being used.
For batch mode parallel platforms, use the specified batch submission routines like "qsub" or "llsubmit".

- PaCE output:

The results are of two categories: Run-time and Cluster results.
A summary of these results is printed on the standard output. If the parallel platform
uses batch processing which outputs the standard output to a file, then the summary
will be present in that file.

(i) Run-time results:

The total run-time and the run-time in different components of the system
is shown for each SLAVE processor. This indicates the total run-time for PaCE clustering.

eg.,

Time taken by slave 1 : Load<0> + Preprocessing Phase<3> + Clustering Phase<1>= 5 secs

Here processor rank 1 took a total of 5 seconds to complete with the run-times in phases
indicated separately. The numbers are truncated to integer values.
Almost all the slave processors take the same amount of total
run-time. The time to load the data file into memory, indicated
first by Load<>, may vary between systems. Since it is
initialization time, subtracting it from the total run-time
gives the actual time taken by the software.
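A sketch for pulling the numbers out of a timing line like the one above and subtracting the load time. The regular expression is written against the single example line in these notes, so treat the exact spacing as an assumption.

```python
import re

TIME_LINE = re.compile(
    r"Time taken by slave\s+(\d+)\s*:\s*Load<(\d+)>\s*\+\s*"
    r"Preprocessing Phase<(\d+)>\s*\+\s*Clustering Phase<(\d+)>\s*=\s*(\d+)\s*secs"
)


def parse_slave_time(line: str):
    """Return the per-phase seconds for one slave, or None if the line differs."""
    match = TIME_LINE.search(line)
    if match is None:
        return None
    rank, load, pre, clust, total = map(int, match.groups())
    return {
        "rank": rank,
        "load": load,
        "preprocessing": pre,
        "clustering": clust,
        "total": total,
        "compute": total - load,   # total minus the initialization time
    }
```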

(ii) Clustering results:

The standard output will also have something like this:

Master: #Clusters Output:= 357 #Singletons=258
Master: #Contained ESTs:= 63

This means: The total number of clusters generated is 357, out of which 258 are singletons.
Also out of the n(=500) ESTs supplied, 63 ESTs are completely contained
(with 100% identity) in other ESTs.
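The two Master lines can be read the same way. A short sketch, again matched only against the example output shown above; parse_cluster_summary is a hypothetical helper.

```python
import re


def parse_cluster_summary(text: str) -> dict:
    """Extract cluster, singleton and contained-EST counts from PaCE stdout."""
    patterns = {
        "clusters": r"#Clusters Output:=\s*(\d+)",
        "singletons": r"#Singletons=\s*(\d+)",
        "contained": r"#Contained ESTs:=\s*(\d+)",
    }
    counts = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            counts[name] = int(match.group(1))
    if "clusters" in counts and "singletons" in counts:
        # clusters with two or more members
        counts["multi_member"] = counts["clusters"] - counts["singletons"]
    return counts
```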

The clusters themselves are located in file estClust.n.p.PaCE
where n is the number of ESTs and p is the number of slave processors.
eg., estClust.500.3.PaCE
PS: The number of slave processors is always one less than the total
number of processors used to run PaCE.

The size distribution of these clusters (in the number of ESTs) is present
in estClustSize.n.p.PaCE.
eg., estClustSize.500.3.PaCE
Each line is of the format "{Cluster#} {Number of Members in Cluster#}".

The set of EST sequences which are contained in (an)other EST sequence(s) are
indicated in ContainedESTs.n.PaCE.
eg., ContainedESTs.500.PaCE
Each line is of the format "{Contained EST sequence header} IN {Container EST sequence header}".
PS: One EST can be contained in multiple EST sequences, and only one container
sequence is indicated here.
Also the set of contained ESTs reported need not comprise all the ESTs
that are actually contained. More specifically, the contained ESTs
reported are only based on the set for which alignments were performed by PaCE.

The set of EST sequences which are not contained in any other EST sequence in the
provided data set is present in NonContainedESTs.n.PaCE.
eg., NonContainedESTs.500.PaCE
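Since the output file names depend on n and on the number of slave processors (total processors minus one), a sketch that builds them and reads the size file may save some guessing. The function names are hypothetical; the file name patterns and the two-column size format are the ones described above.

```python
def pace_output_files(n_ests: int, n_procs: int) -> dict:
    """Expected PaCE output file names for n ESTs run on n_procs processors."""
    if n_procs < 2:
        raise ValueError("PaCE needs at least 2 processors")
    p = n_procs - 1   # number of slave processors
    return {
        "clusters": f"estClust.{n_ests}.{p}.PaCE",
        "cluster_sizes": f"estClustSize.{n_ests}.{p}.PaCE",
        "contained": f"ContainedESTs.{n_ests}.PaCE",
        "non_contained": f"NonContainedESTs.{n_ests}.PaCE",
    }


def read_cluster_sizes(lines) -> dict:
    """Parse '{Cluster#} {Number of Members}' lines into a dict."""
    sizes = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            sizes[parts[0]] = int(parts[1])
    return sizes
```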

Adding clone mates information :

--------------------------------

As additional input, PaCE can take clone mates information.
This information should be provided in a file with
the following format:

>clone mate id #
FASTA header for sequences belonging to this clone mate (each in separate lines)
....

eg., Let the clone mates be in a file myCloneMates that looks like this:

>CloneId1
gi|19863109|
gi|19800130|
>CloneId2
gi|19863111|
gi|19800132|
...
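To sanity-check a clone mates file before handing it to formatCloneMates.pl, something like the following sketch could be used. It only reads the ">id" plus header-lines layout shown above; parse_clone_mates is my own name and is not part of the PaCE package.

```python
def parse_clone_mates(lines) -> dict:
    """Map each clone mate id to the list of FASTA headers listed under it."""
    mates = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or set(line) == {"."}:   # skip blanks and "..." filler
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            mates[current] = []
        elif current is not None:
            mates[current].append(line)
    return mates
```

A quick check would be that every id has at least two headers listed under it.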

Step 1:

Run the following command (for the above example) :
$ perl formatCloneMates.pl myCloneMates tEST.data.PaCE > myCloneMates.PaCE

This will generate the PaCE formatted myCloneMates.PaCE.
(The numbers in each line indicates the sequence number that PaCE
provides for each input sequence.)

Step 2:

The PaCE formatted clone mate file should be in the folder where the
PaCE executable resides.
To specify the clone mate file as input, it has to be provided in the PaCE.cfg file as:

ClonePairsFile clonematefilename

eg.,
ClonePairsFile myCloneMates.PaCE

This has to be done before starting to execute PaCE. PaCE will put each
set of sequences linked by the same clone mate id together in one cluster
in its final output.

If you do not have any clone pairs information to provide then set:

ClonePairsFile None

PS: Both these steps should be performed for each input FASTA file (even
if you intend to run for subsets of the original FASTA data file).

Tuesday, October 19, 2010

whats the magic number

started work on pace got an error about some magic number...magic magic...!!!!!

tg-qdong@BigRed:/N/gpfs/tg-qdong/PaCE> /N/gpfs/tg-qdong/PaCE/PaCE_v9

Error: Need to obtain the job magic number in MXMPI_MAGIC !

my first job on TeraGrid (running RepeatMasker)...

RepeatMasker est.fasta

output

est.fasta.cat
est.fasta.log
est.fasta.masked > the input sequences with the detected repeats masked out
est.fasta.out > table listing where each repeat was found (shown below)
est.fasta.tbl

est.fasta.out
--------

   SW  perc perc perc  query     position in query      matching  repeat        position in repeat
score  div. del. ins.  sequence  begin  end  (left)     repeat    class/family  begin  end   (left)  ID

  180  25.0  2.1  0.0  seq1165     373  420   (263)  +  L2b       LINE/L2        3234 3282     (93)   1
  180  25.0  2.1  0.0  seq1545     549  596    (16)  +  L2b       LINE/L2        3234 3282     (93)   2
  180  25.0  2.1  0.0  seq3921     473  520    (48)  +  L2b       LINE/L2        3234 3282     (93)   3
  180  25.0  2.1  0.0  seq5025     181  228   (350)  +  L2b       LINE/L2        3234 3282     (93)   4
  180  25.0  2.1  0.0  seq5225     149  196   (531)  +  L2b       LINE/L2        3234 3282     (93)   5
  180  25.0  2.1  0.0  seq5514     431  478   (123)  +  L2b       LINE/L2        3234 3282     (93)   6
  180  25.0  2.1  0.0  seq5573     359  406   (177)  +  L2b       LINE/L2        3234 3282     (93)   7
  180  25.0  2.1  0.0  seq7881     289  336   (296)  +  L2b       LINE/L2        3234 3282     (93)   8
  180  25.0  2.1  0.0  seq9083     218  265   (302)  +  L2b       LINE/L2        3234 3282     (93)   9
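Each row of the .out table is whitespace separated, so it splits cleanly. A minimal sketch that reads one row into a dict; the key names are my own, and the three repeat-position columns are kept as raw strings because I have only checked the layout against the "+" rows shown above.

```python
def parse_rm_row(line: str):
    """Parse one data row of a RepeatMasker .out file; None for headers/blanks."""
    f = line.split()
    if len(f) < 15 or not f[0].isdigit():
        return None
    return {
        "score": int(f[0]),
        "perc_div": float(f[1]),
        "perc_del": float(f[2]),
        "perc_ins": float(f[3]),
        "query": f[4],
        "q_begin": int(f[5]),
        "q_end": int(f[6]),
        "q_left": int(f[7].strip("()")),
        "strand": f[8],
        "repeat": f[9],
        "class_family": f[10],
        "repeat_cols": f[11:14],   # left unparsed on purpose
        "id": f[14],
    }


row = "180 25.0 2.1 0.0 seq1165 373 420 (263) + L2b LINE/L2 3234 3282 (93) 1"
hit = parse_rm_row(row)
# Length of the masked stretch in the query sequence
print(hit["query"], hit["q_end"] - hit["q_begin"] + 1)
```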

est.fasta.masked
--------------
>seq1
CAAAGATTAGCTCAACCCCACCCGTGCAGCAGTGGCCAGGAATCCACCAG
TCGCTTTAAATATGCACTACACGCGTTATCGTTGGCGGGGATCCCGGGGG
ATGATCCATTGGCCTTGATACGGGCCACAACCACTGACACGCCCCACAAG
CGAGTTTGGTTCACTGGGGCGAGCTGAAGGTTAGGTCTATCTATTGCGTA
GCCGAAAAAGTCATCACCAGTAACTATGCCGCTCTCCTTGCCCTTTGCCA
TGTGGTAGCTACGTCCTCCTCTATTGCACGTAACCCGGTTCAAAACCACG
AATCGTATCACTTTGCTTCGAAAAACCACTTTGATCCAGATACAAATTAA
ACTGCGATCCGAGCTTACCACCAAATCCGGGAGCATAAGCTGAAAGTGCA
GGAGAGGCAGAGCTGGCGGTGGCCCTTCCACCTCCTTTTATCGTAACACG
ATCGGCACGTCGGGCACTTTCGGGAAGGGCTGCCACACGAAAAGTCTTTT
ATAACTACGAGC
>seq2
CCACGCATTATGTAAAAATTACCGACGTGGAGCTTCGTGCCGCCGGATTG
TAAGAATAGCTCGTTGGAGATTACCTGAGTTGGTTGTTCTTTGTGTGTCA
ATCCATCTCGCCCTGACGACGGGGGCAAACATAAAAGTCCGATAACCCCG
AATTCACCTGGAATCATGGACAACCCGGACATTCCTACTTGGAAATCAGA
TCAGAGTTCGCTCGGATCAGCGACTGTTTGCCCGATAACGCGGGACAGGC
ATGGTTTAAAAAGTTCCGTTTTTTGTGCAAATTTCAAGACAATGCGTCTA
ACCGTAGCCTCGAAGGGCTACTTATCGGCTCGCCGGTTATGGGCGGTGAT
TTAAGTTAACGCGTTAGTTTGTAGAGCAGACACGTCCCTGTACCCATCGA
AGAGTTATCTGTGACTATACCCAATATAGTTCACTTAACTACTTATACGT
CCCCTGACAACATATGATCGAAGATAGAACGTTCGGCTATAGGCTATTAG
AGGCGAGTTGTCTCGAGCACACTTATTAATAAATGTCC
>seq3
TGCGCACGCGAACGCCCCAGCCTAACGTCGGAAATAAGGGGCCTCTAAAT
CCAAGTGCATTGTTCTCCGCTGGTGCGACACCGGACAAGGGTATGATGGA
GATATGACACTTATACTAACGTATGCGTTCCTAACATTCCACCCCAGGCA
ACGCGTACATGGAGGGCCCCGTTGCCTAAGTTGTATCCCCATGGGTGTGC
CAACCAAAAGACCCAGCGCAAGAGGCACTGCACATTCGAGTTATTGAGGA
CGTAGTTAACCGTAGACCCTTTCCATAAATATGCACCTGAGAAGAATCCT
TATGGTCGGCCACGTTCAGTTCCTCATCGTAAACCGGCAAGACCCTATCG
GGCTACAATTCAGATAACTGCGTAAATTGAATTCTAAGTCTGTTAATCCT
CCTGATAACCGGACGTAGTCAGGTCGCTCAATCTATGCGGGTTGAAGGAC
GCTCTACGGGCCCACCCCACGGAGATTGGGTCCTCGGATCGATCTTGGAT
GTCGATACCAAGAAGTCGGATCAAACACGTCTACTACTGCCGGGGTTTTG
CATCCCTAGCGAGCGGGCGGCCGTCGAACGCACTTCGCCCACCTTGATGA
CCCCCCAACCCTTCAGAGGACCCGCCATGGTGAGCACAAGCAACAGAAGT
TAGGGCGTGATTTGAAGAT
>seq4
GGTGACAAGCGGTCTAACGGAGCCTCTACGGGATGGGGATTTTCTGATCA
AACTCTGCCAAGGAGTAGTATAGCACATGAGCTCAGTCGTAGGGAAGCAA
AACAATCACGAGATCCCTCTATATTCTCACCTCCCGAAAGATCTGACCCC
GCCAAGACAGAATATCAGATCCTGAAGATTTATCTCATCCTTAACAGCCA
GGCCGTGGTTCACGTACGGATGACTTTTCGTTAGTGGCGGTATATCGTCC
CTATGTAGGGGAGACGAGACACAAGTATGAGCAGACTTAGCACTTACATC
GGGGACGTTATGGCAAGTGTGCCTGCCCAATGTTCGACTGGCTCCCATGG
GAACTTATAGTTCTTCGGCTACAGACACGGTATATGCCACTCCCTAAAAA
GCTTACAATGTTCAAGCGTAGTCTGTGCAAGGAGAAATGTGATATCATTC
ATGCGAGGCCTCAATTTTTTTTAACAGAGCGGCGGGGCCGCTCGCCGGCC
>seq5
AGCTTTGATGGTAGCTGCGAGCCCATCGCCGGTCTAAGGGACTTGATGGA
GGATCAACTCATCCCATTCGGCCTGGCTTGAGTCCTTCAGTCACTCATTC
TCTTGGCCGCGGATAGGATGCACTTATAAACGACCCAAAAGGGCAGGAGT
TTCAGCCCTAGGAGGGTTTCTGGGTTCACTGGCATTAGCACCGACTCTGG
GCTTAATCCAGGTCCCACGGCTGTGCCGAGAGTCCCTTAGCAGGAGTTGG

Monday, October 18, 2010

installing Java

I know many of us get stuck on the little things... below are a few simple steps to install Java on your machine

Linux
-----
Add partner repository using the following command

> sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"

Update the source list

> sudo apt-get update

Now install sun java packages using the following commands

> sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts

test by using the following command

java -version

Just installing new Java flavours does not change the default Java pointed to by /usr/bin/java. You must explicitly set this:

Open a Terminal window
Run sudo update-java-alternatives -l to see the current configuration and possibilities.
Run sudo update-java-alternatives -s XXXX to set the XXXX java version as default. For Sun Java 6 this would be sudo update-java-alternatives -s java-6-sun
Run java -version to ensure that the correct version is being called.
You can also use the following command to make the change interactively:

Open a Terminal window
Run sudo update-alternatives --config java
Follow the onscreen prompt
You can also try IcedTea NPR Web Browser Plugin

Monday, October 11, 2010

Many things to learn

"ALL men of whatsoever quality they be, who have done anything of excellence, or which may properly resemble excellence, ought, if they are persons of truth and honesty, to describe their life with their own hand; but they ought not to attempt so fine an enterprise till they have passed the age of forty."

Marlon Pierce's blog

http://communitygrids.blogspot.com/

Friday, October 8, 2010

Infernal installation

Infernal

* Infernal ("INFERence of RNA ALignment") is for searching DNA sequence databases for RNA structure and sequence similarities.
* It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs).
* A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus, so in many cases, it is more capable of identifying RNA homologs that conserve their secondary structure more than their primary sequence.
* URL: http://infernal.janelia.org/

$ wget ftp://selab.janelia.org/pub/software/infernal/infernal.tar.gz
$ gunzip infernal.tar.gz
$ tar xvf infernal.tar
$ cd infernal-1.0.2/
$ ./configure [--enable-mpi]
$ make
$ sudo make install

Testing

* Just running the 7 commands that are available with this package

$ cmalign
$ cmbuild
$ cmcalibrate
$ cmemit
$ cmscore
$ cmsearch
$ cmstat

* if it gives the standard usage message, then it is installed.
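A quicker first check is simply whether the seven commands are on the PATH. A small sketch using Python's shutil.which; it does not run the programs, so the usage-message test above is still worth doing. missing_commands is a hypothetical helper.

```python
import shutil

INFERNAL_COMMANDS = [
    "cmalign", "cmbuild", "cmcalibrate", "cmemit",
    "cmscore", "cmsearch", "cmstat",
]


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on the PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands(INFERNAL_COMMANDS)
    print("missing:", ", ".join(missing) if missing else "none")
```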

installing TRNAscan-SE

TRNAscan-SE

* tRNAscan-SE is a program for improved detection of transfer RNA genes in genomic sequence.
* URL: http://lowelab.ucsc.edu/tRNAscan-SE/

Installation

* Get the link location

$ wget http://lowelab.ucsc.edu/software/tRNAscan-SE.tar.gz
$ tar zxvf tRNAscan-SE.tar.gz
$ cd tRNAscan-SE-1.23/

* Edit Makefile to provide the following details

## where you want things installed
BINDIR = /usr/local/bin
LIBDIR = /usr/local/lib/tRNAscan-SE
MANDIR = /usr/local/man

If you don't change the above paths it will install in your home directory $HOME

* Now make the package

$ make
..
..
sqio.c:238: error: conflicting types for ‘getline’
/usr/include/stdio.h:651: note: previous declaration of ‘getline’ was here
make: *** [sqio.o] Error 1
..

* The make did not complete because getline is declared twice: in sqio.c and in the system header /usr/include/stdio.h
* Solution:
o Checked whether getline is present in any of the *.c files in this directory
o Opened sqio.c and changed every getline to getLine
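I made the change by hand, but the same rename can be scripted. A minimal sketch that replaces getline only where it stands as a whole identifier; note that it also touches comments and strings, and rename_identifier is just an illustrative name.

```python
import re


def rename_identifier(source: str, old: str = "getline", new: str = "getLine") -> str:
    """Replace whole-word occurrences of old with new in C source text."""
    return re.sub(r"\b%s\b" % re.escape(old), new, source)


def rename_in_file(path: str, old: str = "getline", new: str = "getLine") -> None:
    """Rewrite one file in place (keep a backup copy of the original first)."""
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(rename_identifier(text, old, new))
```

For the case above this would be rename_in_file("sqio.c"), plus any other *.c file that calls the function.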

$ make

make ran with no error.

NOTE:

* there are some instructions at the end of make. It requires us to run source setup.tRNAscan-SE; rehash for the current session
* Or include a line source /home/krevanna/Desktop/TOOL_TEST/tRNAscan-SE-1.23/setup.tRNAscan-SE in ~/.cshrc
* This won't work because we are in the bash shell and it expects the C shell

* I did not follow the above instructions and I went ahead with make install

$ sudo make install
$ make testrun
$ make clean

installing blast

Steps to download and install BLAST

* Visit the URL: ftp://ftp.ncbi.nih.gov/blast/executables/release/.
* Click on 'LATEST' to get the latest version of BLAST.
* Right click on the version and 'Copy link address'.
* On the server, type wget and paste the url

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
$ tar zxvf blast-2.2.24-x64-linux.tar.gz
$ cd blast-2.2.24/bin
$ ls
bl2seq blastall blastclust blastpgp copymat fastacmd formatdb formatrpsdb impala makemat megablast rpsblast seedtop

installing glimmer

Installation

* On Ubuntu OS

$ sudo apt-cache search glimmer
tigr-glimmer - Gene detection in archea and bacteria
$ sudo apt-get install tigr-glimmer

Testing

* To check if the program has been installed

$ tigr-glimmer
Usage: /usr/bin/tigr-glimmer
Existing programs are:
anomaly build-icm entropy-score glimmer3 multi-extract start-codon-distrib uncovered
build-fixed entropy-profile extract long-orfs score-fixed test window-acgt

installing diya

* Download url: http://sourceforge.net/projects/diyg/

$ tar zxvf diya-1.0-rc4.tar.gz
$ cd diya-1.0-rc4/

Prerequisites

* Perl Modules (How to check Perl module)
o Bioperl
o Data::Merger
o Getopt::Long
o FileHandle
o XML::Simple
o File::Basename
* Software
o Perl (>= 5.8)
o MUMmer v3.20
o Glimmer v3.02
o BLAST
o tRNAscan-SE v1.23
o Infernal v0.81
o rfamscan.pl v0.1
* Database
o UniRef50 (refer to Others)
o Protein Clusters (refer to Others)

Installation

* Steps to install

$ perl Makefile.PL
$ make
$ sudo make install

installing mira

* download url: http://sourceforge.net/projects/mira-assembler/files/
* version: 3.2.0

$ wget -O mira_3.2.0_prod_linux-gnu_x86_64_static.tar.bz2 http://sourceforge.net/projects/mira-assembler/files/MIRA/V3.2.0/mira_3.2.0_prod_linux-gnu_x86_64_static.tar.bz2/download
$ bunzip2 mira_3.2.0_prod_linux-gnu_x86_64_static.tar.bz2
$ tar xvf mira_3.2.0_prod_linux-gnu_x86_64_static.tar
$ cd mira_3.2.0_prod_linux-gnu_x86_64_static/

* All the executables are in bin directory
* Add the bin directory to PATH in ~/.bashrc

$ vim ~/.bashrc
export PATH=$PATH:/path/to/mira/bin
$ source ~/.bashrc
$ mira
...

Instructions

* Open the index.html file in the docs folder with Firefox.

Usage

mira \
[-project=]
[--job=arguments]
[-fasta[=] | -fastq[=] | -caf[=] | -phd[=]] [-notraceinfo] [-noclipping[=...]] [-highlyrepetitive] [-lowqualitydata] [-highqualitydata] [-params=] [-GENERAL:arguments]
[-STRAIN/BACKBONE:arguments]
[-ASSEMBLY:arguments]
[-DATAPROCESSING:arguments]
[-CLIPPING:arguments]
[-SKIM:arguments]
[-ALIGN:arguments]
[-CONTIG:arguments]
[-EDIT:arguments]
[-MISC:arguments]
[-DIRECTORY:arguments]
[-FILENAME:arguments]
[-OUTPUT:arguments]
[COMMON_SETTINGS | SANGER_SETTINGS | 454_SETTINGS | SOLEXA_SETTINGS | SOLID_SETTINGS]

ESTs (simulated)

* Generated a genome of length 1,000,000
* Generated ESTs of length 500-800 bp from the above genome
* Around 10,000 ESTs were generated and stored in the file est.fasta

$ mira --project=EST --job=denovo,genome,normal,454 -fasta=est.fasta -SK:mnr=yes:nrr=10 454_SETTINGS -LR:wqf=no -LR:mxti=no -AS:epoq=no >&log_assembly.txt
$ cd EST_assembly/
$ ls -R
.:
EST_d_chkpt EST_d_info EST_d_log EST_d_results

./EST_d_chkpt:
passInfo.txt readpool.maf

./EST_d_info:
EST_info_assembly.txt EST_info_consensustaglist.txt EST_info_contigstats.txt EST_info_readrepeats.lst
EST_info_callparameters.txt EST_info_contigreadlist.txt EST_info_debrislist.txt EST_info_readtaglist.txt

./EST_d_log:
EST_error_reads_invalid EST_info_reads_tooshort EST_out_pass.2.caf
EST_info_consensustaglist.1.txt EST_info_readtaglist.1.txt EST_out_pass.3.caf
EST_info_consensustaglist.2.txt EST_info_readtaglist.2.txt EST_readpoolinfo.lst
EST_info_consensustaglist.3.txt EST_info_readtaglist.3.txt hashstat.bin
EST_info_contigreadlist_pass.1.txt EST_int_clippings.0.txt miralog.ads_pass.4.adsfacts
EST_info_contigreadlist_pass.2.txt EST_int_normalisedskims_pass.4.bin miralog.ads_pass.4.adsfacts.pclusters
EST_info_contigreadlist_pass.3.txt EST_int_posmatchc_pass.4.lst miralog.ads_pass.4.complement
EST_info_contigstats_pass.1.txt EST_int_posmatchc_pass.4.lst.reduced miralog.ads_pass.4.forward
EST_info_contigstats_pass.2.txt EST_int_posmatchf_pass.4.lst miralog.ads_pass.4.reject
EST_info_contigstats_pass.3.txt EST_int_posmatchf_pass.4.lst.reduced miralog.noqualities
EST_info_debrislist_pass.1.txt EST_int_posmatch_megahubs_pass.4.lst miralog.usedids
EST_info_debrislist_pass.2.txt EST_int_posmatch_multicopystat_preassembly.0.txt
EST_info_debrislist_pass.3.txt EST_out_pass.1.caf

./EST_d_results:
EST_out.ace EST_out.maf EST_out.padded.fasta.qual EST_out.unpadded.fasta EST_out.wig
EST_out.caf EST_out.padded.fasta EST_out.tcs EST_out.unpadded.fasta.qual

Inference

* Snippet of some of the information in EST_d_info/EST_info_assembly.txt

..
..
Length assessment:
------------------
Number of contigs: 106
Total consensus: 944829
Largest contig: 36580
N50 contig size: 13625
N90 contig size: 5366
N95 contig size: 2945
..
..

* Number of contigs in EST_out.unpadded.fasta and EST_out.padded.fasta

$ grep -c '>' ./EST_d_results/EST_out.unpadded.fasta
107
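To cross-check numbers like the contig count and N50 straight from the FASTA file, here is a small sketch. It uses the common definition of N50 (the contig length at which the running total of lengths, largest first, reaches half of the total); MIRA's own report may count things differently, so small disagreements are possible. The function names are my own.

```python
def fasta_lengths(lines):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)
        elif line and lengths:
            lengths[-1] += len(line)
    return lengths


def n_stat(lengths, fraction=0.5):
    """Nxx statistic: fraction=0.5 gives N50, 0.9 gives N90, and so on."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0
```

With the assembly above this would be called as n_stat(fasta_lengths(open("EST_d_results/EST_out.unpadded.fasta"))).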