How to determine 'Clusters of Orthologous Groups' (COGs) for our proteins?

Dave Lee Popular answer

You'll need to do a couple of things:

1) Convert the GI accessions to UniProt IDs

Try using the UniProt mapper (I've never used the website version, as I always write my own, but this looks fine).

http://www.uniprot.org/help/mapping

2) Map the UniProt IDs to OGs

You should probably stick with the version 3 release of eggNOG for this part, since version 4 doesn't seem to have the actual 'UniprotAC2eggNOG' file:

http://eggnog.embl.de/version_3.0/downloads.html

The first part doesn't require any coding but the second one does. It should be quite straightforward though.

For step 2), some pseudo-code would be:

1. Read in the uniprot2OG file

2. Create a matrix with 2 columns and n rows, where n = length(unique(c(uniprotOGs, yourUniprot))). These unique IDs are also the rownames.

3. Then in: TotalMatrix[uniprotOGs, FirstColumn]
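The steps above can be sketched in Python. This is only a minimal sketch under assumptions: it assumes the mapping file is tab-separated with the UniProt accession in the first column and the OG in the second (check the real 'UniprotAC2eggNOG' layout before use), and the function names and example accessions are made up for illustration.

```python
import csv
from collections import defaultdict
from io import StringIO


def load_og_mapping(handle, acc_col=0, og_col=1):
    """Read a tab-separated accession-to-OG table into a dict of sets.

    The column positions are assumptions; adjust them to the real file.
    """
    mapping = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines, comment lines and rows that are too short.
        if not row or row[0].startswith("#"):
            continue
        if len(row) <= max(acc_col, og_col):
            continue
        mapping[row[acc_col]].add(row[og_col])
    return mapping


def assign_ogs(accessions, mapping):
    """Return {accession: sorted list of OGs}; unmapped accessions get []."""
    return {acc: sorted(mapping.get(acc, ())) for acc in accessions}


# Tiny in-memory stand-in for the real mapping file (made-up values).
example_table = StringIO(
    "#accession\tog\n"
    "P00001\tCOG0001\n"
    "P00001\tNOG00002\n"
    "P00002\tCOG0003\n"
)
og_by_acc = load_og_mapping(example_table)
result = assign_ogs(["P00001", "P00002", "P99999"], og_by_acc)
for acc, ogs in result.items():
    print(acc, ",".join(ogs) if ogs else "unassigned")
```

A dictionary of sets is used here instead of a two-column matrix because one protein can belong to more than one OG. For the real data, replace the `StringIO` object with `open("your_mapping_file", newline="")`.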

Dave Lee

Hey Muhammad,

Instead of running everything yourself, an alternative is to use resources where COGs, NOGs (and the many other kinds of OGs out there) have already been computed for you and are available for download.

Examples include eggNOG from EMBL, which uses a BLAST-based approach for assigning OGs. This is the second approach you suggested, but I would not recommend running it yourself since it would take AGES.

http://eggnog.embl.de/version_4.0.beta/

Here, you could assign your ~6000 proteins (UniProt accessions?) to the OGs that they have already computed.

If you don't like the BLAST-type OGs, you could look for a resource where the OGs have been determined via trees instead. I have less experience with those, but there are bound to be some.

Dave

Muhammad Sufian

Thank you, Dave. My proteins have NCBI GI accessions. Can any of the OG resources accept my 6000 NCBI GIs? eggNOG can take a maximum of only 30 records.

Snehal Karpe

Hi Muhammad,

If no other database gives you information about already annotated COGs, you can try tools like Proteinortho (https://www.bioinf.uni-leipzig.de/Software/proteinortho/), OrthoMCL (http://orthomcl.org/orthomcl/), etc. (I am sure there are many more!). OrthoMCL has been used for many eukaryotic genomes, Proteinortho for a few. Proteinortho can be run with a single command after installation (an all-against-all BLAST is done in the back end), whereas OrthoMCL is very complex. You can give them a try.

Hope this helps.

Snehal

Stefano Levi Mortera

Hi Muhammad

I know your topic is a little out of date, but I came across it today and I have much the same problem to solve. I'm not a bioinformatician, so I'm not very familiar with such issues, but I must face it somehow. I'd like to know how you carried out your COG analysis and, if you are still working on such problems, how you manage it now. I know MEGAN5 can be a tool for this, and I used it once with BLASTp outputs, but I'm looking for less CPU-intensive paths.

Thanks in advance for your help

Regards

Stefano

Dear Stefano,

Actually, I did not include this experiment in my previous study, as it was taking too much time and would have delayed my publication. If I come across a way to do it, I will definitely share it in this thread.

Thanks.

Daniel Kurth

Using the NCBI Batch Web CD-Search Tool (http://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) you can search up to 40,000 sequences at once against several databases, including the COG database. It takes roughly a day, but it runs on their servers, and you can submit several batches. However, I think it still uses the old COG database from before the recent update (http://nar.oxfordjournals.org/content/43/D1/D261.abstract).
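If your FASTA file holds more sequences than a single submission allows, it has to be cut into batches first. Below is a minimal Python sketch of that splitting step only; `fasta_batches`, the batch size and the example records are illustrative assumptions and are not part of the NCBI tool.

```python
from io import StringIO


def fasta_batches(handle, batch_size):
    """Yield lists of FASTA records, each list holding at most batch_size records.

    Every record is returned as one string: header line plus sequence lines.
    """
    batch, record = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            batch.append("\n".join(record))
            record = []
            if len(batch) == batch_size:
                yield batch
                batch = []
        record.append(line)
    # Flush the last record and the last (possibly smaller) batch.
    if record:
        batch.append("\n".join(record))
    if batch:
        yield batch


# Made-up example: five short records split into batches of two.
example_fasta = StringIO(
    ">seq1\nMKT\n>seq2\nMLA\nGGS\n>seq3\nMPP\n>seq4\nMQR\n>seq5\nMST\n"
)
batches = list(fasta_batches(example_fasta, batch_size=2))
for i, chunk in enumerate(batches, start=1):
    print(f"batch {i}: {len(chunk)} records")
```

Each batch can then be written to its own file (for example with `"\n".join(chunk)`) and uploaded separately.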

Robert Rentzsch

Also, scanning the sequences yourself is now much simpler with the eggNOG HMM library (currently version 4.5). Even simpler: use the new eggNOG mapper at http://beta-eggnogdb.embl.de/#/app/emapper

Shakhinur Islam Mondal

You can use WebMGA (http://weizhong-lab.ucsd.edu/metagenomic-analysis/server/cog/). Just upload your sequences and get the results.