Most of the tools you'll need are designed for UNIX/Linux based systems. Probably the easiest thing for you to do is set up a virtual Linux machine on your Windows machine. When I was starting out in this area I used https://www.virtualbox.org/ , set up Ubuntu 14.04, and started testing out different programs.
Take a look at the upcoming Bioinformatics MOOC; they run "MOORs" along with the MOOC, where a MOOR is a "massive, open, online research project". Not only are they projects you can work on at home but they are genuine research problems with scope for new work.
You could of course do the MOOC as well, but there's no obligation; you can access the research projects without completing it. You also don't have to wait for the next MOOC to start (September 15th): you can access the archive for the previous course and look at the research projects there, but it might be more fun to take part in the new ones so you can work with others.
You might also want to check out Rosalind, some of the problems there are quite good fun and will allow you to build up a portfolio of code solutions.
"Bioinformatics" isn't nearly as homogeneous as people think, so a lot depends on what made you become interested in the topic. If you're an MD, why do you want a PhD, and what made you decide to pursue bioinformatics?
Unless a group is specifically looking to hire a programmer for a particular project, I find that most prefer a biology background to a computer science background. A logical mind is definitely necessary, but the computer skills you need to solve most of your problems can be picked up in a few months.
The one thing you definitely need to know is how to navigate a unix-like interface. This means, if you own a Mac, learning how to work the "terminal" and if you own a windows machine, installing an Ubuntu dual boot or Cygwin. Either way, not being a unix noob will help when you talk to a bioinformatics lab.
The other skill you'll want to pick up is a scripting language: Perl or Python. The book "Learning Perl" is a great place to start. The vast majority of your time will be spent parsing data files, and scripting languages are how you wanna do it.
I can't really recommend papers until I have an idea of what interests you. But you might want to have a look at Bioinformatics, PLoS comp biol, and the comp biol section of PLoS One for some examples of papers. Give me more details on your interests and I can probably help more.
Depends on you, mostly.
It's definitely one of the hot areas right now and is likely going to stay that way for at least a decent bit. A lot of the skills are also transferable into other areas if you want.
Because of that, there are a lot of people getting into the field that aren't very well suited for it. They've got minimal training, no experience and just don't have the good instincts for it. We get a ton of CVs like that for our open positions at my company and we just put them aside. We just don't have the time to try to train someone who might not work out.
So you need to show that you're a step ahead of the others. Finishing an online course doesn't mean that much. Show that you finished the full Coursera Data Science specialization (https://www.coursera.org/specialization/jhudatascience/1), including the capstone, and I'll definitely take a second look. Building a presence on the online forums of the tools you use (Biostars, the Bioconductor help site, etc.) is another way of building a name for yourself.
LeetCode is a website where you can attempt interview-style questions, see how you performed, and learn to think critically through the questions so you can answer them properly and efficiently.
I strongly suggest learning Python via Codecademy. By far the best interactive and hands-on learning I've seen.
In general, with programming, reading a book is not too helpful. The best way to learn is through practice. Good luck!
The only thing better than another file format is another programming language. ;-)
Actually, I always thought Julia had cool language features. I just haven't had the time.
First, I'd like to ask if you understand anything about Git or other version control systems. If not, read this.
It's hard to fix bugs on someone else's code if you don't really understand it. I'd rather recommend you to keep using an open-source app (specially one you use often) and try to think of small and easy-to-implement things you'd do to improve your experience with that particular application. Then, try to modify the code to reflect your ideas and submit it for approval. By the way, don't forget to use best coding practices.
Definitely check out Coursera's Genomic Data Science Specialization https://www.coursera.org/specializations/genomics. You can take the courses for free, or pay for credentials. The profs for the course are top-notch biostatisticians from JHU and are R experts. There's a ton of free online resources, just take the time to look!
If available, use rsync on Linux; if you are on Windows, use WSL (Ubuntu). Or you can use this peer-to-peer website for sharing data: https://toffeeshare.com/
If all of these options fail, quit bioinformatics, buy an RV, and travel the country. I have considered it before.
Cacher (previously gistbox) has a desktop app, gist/VScode/Atom integration, CLI tooling etc. It offers unlimited public snippets for free, or full features at $6/month (via annual subscription). It seems a better choice, tbh.
Backup solutions are more about matching technology to the scale you need to backup, rather than being bioinfo focused. For example, we have about 25 petabytes of data right now, so our automated nightly tape backups and RAID in striped parity mode are going to be overkill for you.
For just a few disks needing backup (and probably just some of the folders on those disks), a typical solution is an rsync cronjob.
The first example shows how to do 7 days of backups. (I'm not sure if you meant you want 7 days of history or just once a week). http://rsync.samba.org/examples.html
rsync is nice because it attempts to only transmit the differences since last time you backed up, so it should save lots of transfer time.
Here is one of the first google results on using the cron: http://www.unixgeeks.org/security/newbie/unix/cron-1.html
You can get an external harddrive and dump to there, or any other networked computer that you know the IP or DNS name of.
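For example, the weekday-rotation scheme from the rsync examples page can be driven by a single crontab entry; the paths below are placeholders for your own source folder and backup disk:

```shell
# crontab entry (edit with `crontab -e`): run nightly at 02:00.
# $(date +\%A) expands to the weekday name, giving 7 rotating daily snapshots.
# Note: % must be escaped as \% inside a crontab line.
0 2 * * * rsync -a --delete /home/you/projects/ /mnt/backup/projects-$(date +\%A)/
```

`-a` preserves permissions and timestamps, and `--delete` keeps each snapshot an exact mirror of the source, so make sure the destination folders are dedicated to the backup.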
If i understand correctly, this is the usual "Open source licence" deal.
The project was started as an open source piece of code, with a reasonable licence. You can still download the open source code, because open source code, once released, can't be made closed source. However, it looks like Schrodinger decided they could do better as a proprietary software company, so they've taken the project under their wing... and are now offering to sell you a licence.
The licence is probably not for the code - you can still get that "free" as in "free beer" - but if you want support from Schrodinger, you'll have to pay them for a licence, which is how they cover the cost of support.
I have two points of advice for people working with SOLiD data:
If you're still interested, read my colourspace rants:
https://groups.google.com/forum/#!topic/trinityrnaseq-users/HUOmE-3JgSc
And if you're still interested in working with SOLiD data, I admire your stubborn tenacity, and wish you well on the long journey into the anguish of SOLiD bioinformatics.
The online Coursera course(s) that go with this book are also very good:
I too have discovered differences between the Linux and Mac Bash environments. For example, sometimes macOS will require a flag where a Linux box won't. My 2 cents if using a Mac? https://brew.sh/ Homebrew is a package manager for the Mac; you can install GNU/Linux tools like awk, grep, and sed that act exactly like the implementations on Linux distros. Use those and you won't run into problems.
You can write computationally expensive parts of your code in C++ and interface it with Python using Boost.Python. It's still easier than writing the entire application in C++.
> Git
Sourcetree is by far the easiest way to work with Git.
Make a repository at either GitHub (for public / shared stuff) or BitBucket (for private stuff), and use SourceTree to connect to it. Fill it with code.
Then you have access to your code from anywhere, since it's online. And you can rewind to an earlier version any time. So useful!
Have you looked at http://www.cytoscape.org ? I think you can garner more interest if you target an area that needs attention, or you can contribute to one of the many existing open-source bioinformatics tools!
I found this from the RSEM google group from Colin Dewey (https://groups.google.com/forum/#!topic/rsem-users/GRyJfEOK1BQ):
If you want to compare relative abundances, then you should be using TPM, which is a simply a fraction. As we (and others) have noted in our papers, FPKM/RPKM are not good measures of relative abundance because the FPKM/RPKM of a transcript can change between two samples even if its relative abundance stays the same.
The trouble with looking at relative abundances (which is what RNA-Seq directly measures) is that the abundance of one gene affects the relative abundances of all other genes. For example, if a very highly expressed gene increases in its abundance, then the relative abundances of all other genes will go down, even though their absolute abundances may remain the same. Thus, a number of "normalization" schemes (e.g., TMM, third-quartile normalization) have been devised that effectively transform counts or FPKM/RPKM from RNA-Seq into absolute measures of abundance (or more accurately, they put measures from several samples onto a common absolute scale). Note that you cannot apply these normalization schemes to TPM values because they are relative values and, by definition, the TPM values of all transcripts must sum to 10^6.
So an even briefer summary is:
if you want to compare relative abundances: use TPM
if you want to compare absolute abundances: use normalized read counts or normalized FPKM values (where "normalized" = the results of TMM or a similar method)
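To make the distinction concrete, here's a toy TPM calculation; the counts and transcript lengths are made up, and a real pipeline would get these numbers from a quantifier like RSEM:

```python
# Toy TPM calculation from raw read counts and transcript lengths (in bp).
# The numbers are invented for illustration only.
counts  = {"geneA": 500, "geneB": 1000, "geneC": 250}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 500}

# 1) Normalize counts by transcript length (reads per kilobase).
rpk = {g: counts[g] / (lengths[g] / 1000) for g in counts}

# 2) Scale so all values sum to one million.
#    This is why TPMs are relative by construction: they always sum to 10^6.
scale = sum(rpk.values()) / 1_000_000
tpm = {g: rpk[g] / scale for g in rpk}

print(tpm)
assert abs(sum(tpm.values()) - 1_000_000) < 1e-6
```

The assertion at the end is the whole point of the quote above: because TPMs are forced to sum to 10^6, you cannot re-normalize them onto an absolute scale the way you can with counts or FPKM.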
Yes, it is very affordable; when you sign up you should get some credits for free. If you are still a student you can get additional free credits by signing up for the GitHub student package. Use Spot instances, which let you bid on computational time (pricing here: https://aws.amazon.com/ec2/spot/pricing/). For example, the r4.8xlarge is currently $0.23 per hour for 32 cores and 244 GB of RAM, though this will fluctuate.
It is pretty easy to get AWS running even if you aren't very computational. If you already have your pipeline in a workflow language, check whether it supports provisioning AWS resources, as that will make it even easier.
We need an aligner comparable to the antivirus sites VirusTotal and Jotti that uses all available aligners to align each sequence and form a consensus.
Also check out zotero.
I guess I don't quite understand the question. Are you trying to pull out the all the references in journal articles for later use?
Having taken an undergrad in Biochemistry, I have to say that transitioning from math/physics/computer science to life science is probably much easier than going the other way (personal opinion); which is good news for you :)
The amount of biology you need to know to do bioinformatics varies, and obviously the more you know the better, but at a minimum you need to know the logical connections between the various nodes in the specific area you're working on (i.e.: if you're working on apoptosis, you need to know the various pro- and anti- apoptotic proteins and how, logically, they interact with each other; I refer you to this Review for a much better example than I have provided).
So yeah, coming from computer science, you should have the more difficult part down already (how to program, statistics, modelling, math in general). Keep in mind you won't be working alone either, so your peers who are trained in the life sciences can help you understand, from a biological perspective, how the system works. You just need to be able to figure out how that translates into a logical system and/or what it is they want you to look for.
As for resources for learning by yourself, here is a link to a pdf of a book by Uri Alon, a prominent systems biologist, written as an introduction to systems biology. Yup, that's the full book:P
I think you've misunderstood what's going on. The PyMOL code was originally licensed under a permissive (non-copyleft) open source license. That code is still available and is still being developed. Here is the latest version. According to the pymol website:
> PyMOL is a commercial product, but we make most of its source code freely available under a permissive license.
The only thing odd about this is that they're using the same name for their product as for the open source project. This might constitute trademark violation, but it's also possible that they actually purchased all rights from the company that the original author of PyMOL founded (he died in 2009). Apart from the possible trademark issue, since the original PyMOL was licensed under a permissive license, anyone is free to include it in a commercial product that expands on the code as long as the original code is freely available. This new code will then contain both open source code and closed source code and there is nothing preventing someone from creating such a combined work and licenseing it under a commercial software license.
I suggest taking Pavel Pevzner's online Coursera course. The chapter on sequence alignment goes over using affine gap penalties. Instead of defining a single gap penalty, you define two gap penalties: one is the "gap open" penalty (denoted by the lower-case rho in the link /u/Corm posted), and the other is the "gap extend" penalty (denoted by the lower-case sigma in the link /u/Corm posted). The "gap open" penalty is large and the "gap extend penalty" is small. Basically, instead of blindly giving the same gap penalty to all indels, you make a quick check: If the previous box had an indel, just apply the "gap extend" penalty. If the previous box did NOT have an indel, apply both the "gap open" and "gap extend" penalty.
Again, as I mentioned, Pavel Pevzner's online Coursera course is really great, and you implement this algorithm in the class. I took the course and have been TAing for it for ~7 months now, and it's great. The one you would take for learning about sequence alignment is Comparing Genes, Proteins, and Genomes (Bioinformatics III)
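As a sketch of the recurrence described above (this is Gotoh's algorithm; the function name and scoring values are my own choices, not from the course):

```python
# Global alignment score with affine gap penalties (Gotoh's algorithm).
# A gap of length k costs open_ + k * extend, matching the "gap open" /
# "gap extend" scheme described above. Scoring values are illustrative.
NEG = float("-inf")

def affine_align_score(x, y, match=1, mismatch=-1, open_=2, extend=1):
    n, m = len(x), len(y)
    # M: alignment ends in a match/mismatch column;
    # X: ends with x[i-1] against a gap; Y: ends with y[j-1] against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(open_ + i * extend)
    for j in range(1, m + 1):
        Y[0][j] = -(open_ + j * extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            # The quick check from the text: if the previous box was already
            # in a gap, only pay "extend"; otherwise pay "open" + "extend".
            X[i][j] = max(M[i-1][j] - (open_ + extend), X[i-1][j] - extend)
            Y[i][j] = max(M[i][j-1] - (open_ + extend), Y[i][j-1] - extend)
    return max(M[n][m], X[n][m], Y[n][m])

print(affine_align_score("ACGT", "ACGT"))  # 4: four matches, no gaps
```

The three-matrix trick is what lets a long gap cost `open + k*extend` instead of `k * (open + extend)`: staying inside `X` or `Y` only ever pays the extend penalty.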
> you probably just want a cc-by-atribution
CC-BY is not an appropriate license for software.
Instead, use MIT, BSD style or Apache (or GPL if you think copyleft is a good idea).
For the geographic area you are in, look at Leidos or MedImmune if you don't have your heart set on working on an academic lab. You might also consider some of the jobs here.
FWIW, I would really bump up your R/stats knowledge in the interim, since there's a good chance someone trying to hire a bioinformaticist is looking for someone to run differential expression analyses...
If you want to use Unix without dual booting you can spin up a virtual machine. Use VMware Player or VirtualBox and download your favorite Unix OS iso.
Then run the virtualization program and install the OS. It will run on top of your Windows environment and share the resources.
AWS has a free tier for a year. check it out.
https://aws.amazon.com/s/dm/optimization/server-side-test/free-tier/free_np/
It's complex, but start simple. Don't be freaked out. Be careful with automation though; it's easy to spin up 50 machines in a zone with circular dependencies :-)
Do this for 6 months and understand AWS - someone else will be paying for your next laptop...
You know the biology, and you'll pick up the bioinformatics from the methods sections of the papers you read, but you need to work on understanding algorithms. CLR (Cormen, Leiserson & Rivest) is pretty much the standard - the "Alberts" of the field. Bioinformatics books on paper get outdated so quickly; the information half-life is just too short.
Sorry for the late response, I was traveling all day yesterday back home from the holidays.
Yes, that sort of thing is exactly what AAP Career Services is for. I've also personally helped several students get placed.
I'm a professor in the program, not working in the administration, so they'd have better collected data than I do. I know that in the last year students in my courses have gone on to get hired at places like the Mayo Clinic, Institute for Genome Sciences and within individual labs at universities all over.
I wouldn't stay teaching in the program if I didn't think it was a great one. It helps that the professors are given extreme leeway over their course content, and that most of us are active in the field rather than only teaching. A few years ago one of the major languages used throughout the courses was Perl, for example, and I made the case for the switch to Python. Within the year we had converted each of the courses in the progression to that. Similarly, I felt my Metagenomics students needed experience with cloud computing to handle data at that scale, so I transitioned the course to being done on Google Cloud Platform. When I want to add entirely new material I am able to just do it on the fly without approval, keeping the course content much more modern than I would be able to do if I were always under a committee.
I hope this helps, and am happy to answer more.
This Coursera course on Experimental Genome Science is pretty good for a summary of the biology side.
My advice is to learn the soft stuff first before you get into actually doing work. You see a lot of really cool papers come out of CS types which have limited practical application. Learning about the full pipeline from data generation to application of tools by biologists is very important to success in this field.
I LOVE using Rakefiles to define pipelines. They're similar to makefiles, just written in Ruby which is nicer for me.
Also, git.
This is my workflow:
$ git clone git://experiments/experiment.git
$ cp ~/data/datafiles* .
$ rake experiment
EDIT: Here's an example https://github.com/audy/taxcollector/blob/master/Rakefile
Overall, with literature managers I found myself reluctant to add new papers because of the mental transaction cost associated with "filing things away". It is much easier to both "read" and "write" (mentally) to my existing pool of literature with this tool - try it.
Also I don't find myself worrying about how to organize some autoencoder architecture with an immunotherapy review. Everything goes in the same place, which is how our brain works.
Of course if you're not comfortable with mouseless (tmux/vim/yabai), Zotero is probably a much better option...
yes!
So, VM will be the easiest option—I recommend VirtualBox. It will be really clunky at first, until you install “VirtualBox guest additions” on your Linux VirtualBox, then things will get a lot smoother.
I recommend you install Ubuntu. Not only is it one of the easiest distributions, but also one of most software-compatible ones.
The distribution doesn’t affect how the operating system looks, though. It’s just the internals. If you install Ubuntu with defaults, it comes with a Gnome desktop, which is pretty resource intensive.
You can install any desktop manager you'd like, but Ubuntu makes it really easy by offering some pre-configured flavours. The lightest one is Lubuntu, see here: https://lubuntu.me.
Install that one.
VM will be good to practice, and eventually you may choose to install Linux as your primary OS. I generally don’t recommend dual booting.
Head over to /r/linuxfornoobs if you need help, or PM me! (:
PS. Like the other poster above, I also recommend the windows subsystem! It works well for most learning situations.
never used it, but a very popular example of a container platform is Docker (that link says what a container is), and Singularity just looks like another container flavour (maybe specific to HPCs?)
Thanks for helping us continue the dialog. Scripting is definitely on the way. I should point out this is still in beta and not commercially released yet.
One thing I have noticed with a lot of bioinformatics software, is that cheminformatics is ignored. Our goal is to unite the two disciplines with our product bases to better serve the bioinformatics community. For instance, IUPAC naming, MS simulation, high quality graphics and interfaces and more. Our ChemDoodle product line is immensely popular and we get many requests for bioinformatics features. Any suggestions you have are very much appreciated.
Of course, no one is forcing anyone to buy anything. Competition is a very important part of a software market, as it helps increase quality and reduce prices, whether commercial, free or open source. Then users win because they get to make the choice about what they use!
Check out benchling! It integrates a lot of features from these products like primer design/annotation, but also allows you to manage/share your DNA sequence data more easily.
Shameless plug for my free online interactive text, Data Structures, which is currently being used at UC San Diego, the University of San Diego, and the University of Puerto Rico.
Try Coursera. Right now it has a plethora of courses, and after finishing one you can even receive a certificate of achievement. You'll find courses there in biology and computer science, some about programming in Python, analyzing data in R, etc. You're sure to find something that interests you.
Do you have a ticketing system? If not, either talk to your sysadmin to install one and/or give you access, or use Jira (https://www.atlassian.com/software/jira) as an immediate solution, so you can prioritise projects.
Whatever language you go with, stay persistent, and try not to get discouraged when you're frustrated. It takes a long time to get a good feel for this stuff.
I'd also recommend using PyCharm when you're starting with python. https://www.jetbrains.com/pycharm/ Their debugger will probably help you a lot when you get stuck on certain complicated processes.
In addition, if you get stuck and google something, try to understand "why" the answer is correct and not necessarily that it works.
Lastly, when I learn a language the first thing I do is try to implement a binary tree. http://en.wikipedia.org/wiki/Binary_tree It will give you a good introduction to a broad number of paradigms in a new language and also introduce a widely used low-ish level data structure.
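For what it's worth, a minimal version of that exercise in Python might look like this (a plain binary search tree, no balancing):

```python
# A minimal binary search tree: the warm-up exercise described above.
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Insert value into the BST rooted at root; return the (possibly new) root."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def in_order(root):
    """Yield the stored values in sorted order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.value
        yield from in_order(root.right)

root = None
for v in [5, 2, 8, 1, 3]:
    root = insert(root, v)
print(list(in_order(root)))  # [1, 2, 3, 5, 8]
```

Even this tiny version exercises classes, recursion, and generators, which is exactly why it's a useful first program in a new language.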
Good luck!
If you have an Excel file with 3 columns of data (chromosome, start position, end position) just copy the 3 columns into a text file (I recommend using Notepad++) and save it as "file.bed" or whatever. Or in Excel use "Save as" and save it as a "Text (tab delimited) .txt" file then manually change the extension from .txt to .bed after it's saved. Make sure you remove any headers from the file, I don't think bedtools likes header lines.
we don't use it for strictly science but we've got a 100% virtual/distributed/remote team and need good collaboration software for different needs. Some of what I like:
For lighter/easier project management and easy integration of both internal and external collaborators: https://basecamp.com/
For more hardcore project management, also with internal and external collaborators: https://asana.com/
For pure internal stuff we like the Atlassian stack and do a lot with both Confluence and Jira tying efforts and projects together
https://aws.amazon.com/s3/pricing/ You can store things long term in S3 which is fairly cheap. Something to consider is that workflow languages like Nextflow can abstract away accessing S3 in your workflow.
Pymol is free and open-source. As others mentioned, you can get the source from here: https://sourceforge.net/projects/pymol/
However, Schrodinger added an 'Incentive Version' that gets features before the open-source version, as well as dedicated support. They also added integration with third-party tools that you can also get on the open-source version, but there it's up to you to get it to work (e.g. APBS). See details here: http://pymol.org/pymol
Pymol 2.0 had some major changes but I guess it's a matter of time until we see the open-source code online somewhere. You can still get educational versions for free as well.
It is free if you take the courses individually. You just search for the course name on the Coursera search bar and then enroll by clicking the enroll button in the course page. Then you get the choice to participate for free.
I suggest you look into Coursera and other platforms like it. For example, Coursera has a Genomic Data Science specialization consisting of 7 courses given by John Hopkins University. You can take the courses for free, but if you want or need an official certificate you need to pay some fees.
In this Python course, a biology professor and data mining expert will introduce you to the core concepts. (2 hour YouTube course): https://www.freecodecamp.org/news/python-for-bioinformatics-use-machine-learning-and-data-analysis-for-drug-discovery/
In one of the lectures on yet another Coursera course Introduction to Genomic Technologies, I remember Steven Salzberg, one of the authors of the paper I linked to mentioning "Bowtie2 -> HISAT -> Ballgown -> StringTie" as an example of a more modern pipeline.
I'm somebody who has been trying to dabble in RNA-seq recently and this is my experience with it, so you should probably take what I say with a pinch of salt.
I thought that the paper "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks" by Trapnell et al. provided a nice introduction to using command line tools for RNA-seq analysis. If I recall correctly, newer tools have superseded some of the tools mentioned in the paper, but otherwise it's a good introduction to the field.
The online course "Genomic Data Science with Galaxy" seems like good course to start with Galaxy.
Cloud providers like Amazon AWS provide multipart upload, where files are automatically broken into small chunks for upload and reassembled in the cloud. https://aws.amazon.com/cli/ This will work for all file types, any size.
Which service are you using?
Someone already did it:
http://hackaday.com/2012/09/12/64-rasberry-pis-turned-into-a-supercomputer/
Also:
>If you’re wondering what it would take to get a Raspberry Pi supercomputer into the TOP500 list of supercomputers, a bit of back-of-the-envelope computation given the Raspi’s performance and the fact the 500th fastest computer can crank out about 60 TeraFLOPS/s, we’ll estimate about 1.4 Million Raspis would be needed. At least it’s a start.
https://www.researchgate.net/post/Can_I_merge_454_and_miseq_data_together
But there are surely more ways.
His remark on bias/errors is important, and I would add the different number of reads between the two technologies. Even if you normalize the abundance table, you might get a sort of batch effect. Thinking about that: if your Illumina data is from soil and your 454 data from seawater (assuming you use the exact same primers), then you will never know if your observed pattern is biological or not. If you have samples from different environments, the same 16S region, but different primers (or even the same), and two technologies, then I don't know.
Coursera has online courses that offer official certificates.
Here is my search for the term "Python" and I check-marked to only search for courses with "Verified Certificates":
https://www.coursera.org/courses?query=python&certificates=VerifiedCert
I wish I was majoring in Bioinformatics, and I really wish there was a Bioinformatics club at my uni, haha! Anyway, here are my suggestions:
+ Version control workshop: Codecademy is a good place to learn about version control systems like Git.
+ Fortnightly or monthly club meetings: discuss papers, current advances in bioinformatics, past lectures, or your research projects, or run revision sessions to prepare for exams, etc. I'm sure there's plenty to do/discuss.
+ Have a movie night at the end of each semester, and arrange to attend research seminars or public lectures at other universities.
I'm guessing your club is open for all students to join, but if not then make it open so that students from different courses (for example, computer science) can join too! Whatever you do, aim to create an open and fun environment :) I hope you find the suggestions useful.
Scripting in bash is still a useful skill and a solid alternative to Python/Perl. But why not hop over to Codecademy and spend a few hours getting to grips with the basics of Python?
You can install Windows Subsystem for Linux and, if you use the Ubuntu base image, just install it with sudo apt install clustalo
Coursera has some courses that might be helpful. I know of at least one group which is heavily involved in bioinformatics software development that has hired video game programmers, so there may be positions out there where you could learn on the job.
As said above, it's much easier to install from the central repo (CRAN), and let it take care of dependencies. And it will give you the most up-to-date version.
One small thing you may have to do beforehand is define what repo you are using. Try chooseCRANmirror()
or use the repos argument in install.packages: install.packages('seqinr', repos='http://cran.us.r-project.org')
See this for more discussion.
> What is the better solution in you mind?
Parallelism at a higher level than the thread primitive. Something like AWS Lambda, maybe, or Airbnb's Airflow. Or just the Celery library in Python. Multihost execution at the application level, instead of at the thread/process level.
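The same idea in miniature, on a single machine, is Python's executor interface: you submit tasks, not manage threads. (Celery and Airflow extend this pattern across hosts; the task function below is a placeholder.)

```python
# Task-level parallelism: hand units of work to an executor instead of
# managing threads or MPI ranks yourself. `align_chunk` is a stand-in
# for a real per-chunk job (e.g. aligning one batch of reads).
from concurrent.futures import ThreadPoolExecutor

def align_chunk(chunk_id):
    # placeholder for real work
    return chunk_id * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(align_chunk, range(8)))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Because the application only describes *what* tasks exist, the execution backend (threads, processes, a Celery broker, a cloud function) can be swapped without rewriting the science code.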
> Sometimes, it is the developers who should be blamed.
Usually, but in this case it's not our fault. MPI is the wrong execution model for parallelism in the modern application; DRMAA APIs only work from a host that's actually integrated into the cluster; and there's no remote API into the system at all (an astonishing omission in something that purports to act as a compute farm). It's worth noting that when developers get to choose how compute farming will work, they build something that is absolutely nothing like the modern HPC.
I don't want to give advice on the job-market side of the issue, but as far as learning Perl/Python, I definitely suggest taking time to at least learn one of them, no matter what path you end up choosing. Personally, if you're restricted on time, you could just learn Python via the online interactive instruction stuff at Codecademy. In this day and age, I think everybody needs at least a bit of computational skills, and the Codecademy classes aren't too ridiculously time-consuming.
> BioStars is a terrible Q&A platform (this really cannot be overstated), and in order to get a better one we need to pull users over.
I agree with you on this, but possibly not for the same reasons. I get a vibe of pretentiousness that puts me off posting on BioStars, don't like the benevolent dictatorial nature of the lead (?only) code developer, and notice that they have a very low answer rate for questions.
I've been thinking about other sites that serve a similar purpose of getting researchers to talk to each other:
I think it's good that different methods and communities exist for this. It may be that BioStars will find its own different identity after the Stack Exchange reshuffle; maybe there's a group of researchers out there who find it useful and would be willing to keep the embers alive after it has suffered a fiery death.
I would look at something like Digital Ocean - it's $5 a month to play around with configuration, different distributions, etc. If you screw up you can just wipe it or revert to a backup.
I manage about 140 different bioinformatics web apps and databases, some with millions of users a month, so if you need any more specific advice feel free to PM me!
> Did you do your undergraduate degree in CS?
No, I come from a pure biology background. Most of my programming knowledge comes from self-directed learning.
> How comfortable are you with an implementation intensive workload?
A big part of the job would be coding in Python. Python is the programming language I am most comfortable with and have the most experience in (5+ years). I have never worked professionally as a programmer before, but I am very much looking forward to this part of the job.
> Was the interview very technical?
Not really. They asked for some code samples of work I had done in Python. Since I haven't done any big projects in Python, I submitted my solutions to some 40-odd Project Euler problems. Apparently that was sufficient (their remarks on my code: I have a good grasp of the language and write Pythonic code, but it is clear that I have a biology background and not a CS background, since I mostly go for the brute-force approach whereas most Project Euler problems can be solved with some mathematical 'trick').
> From what I have seen in Biotech salaries tend to be tied to the amount of quantitative analysis and programming experience you have.
I have no (professional) programming experience. So what range are you thinking then?
If you need to run command line Linux programs on Windows and you have Windows 10 you can use the Windows Subsystem for Linux, installed through the Windows app store.
Just for fun I'll flog my own piece of Java garbage: Bootsie, and its associated paper as published in Pakistan Entomologist. My only primary-author paper (to date)!
Well, it has a command line, so if worst comes to worst you can run the commands from Python.
If you go down this route, generally what you would do is build an API that wraps each mothur command in a Python class or function. Each function handles calling the command line with the given parameters and transforms the results back into a Python structure for further manipulation within Python.
Mothur is written in C++. I mention that because it took me way too long to figure it out; the research paper they wrote on it had the info, but I couldn't find it in the wiki.
If you are willing to code in C++ and get really involved, you could use one of the methods here for exposing the C++ classes, but that is probably a ton of work, and honestly the above recommendation is probably better, since otherwise you would be stuck maintaining the Python interface. https://stackoverflow.com/questions/16067/prototyping-with-python-code-before-compiling
I assume they don't already have a Python interface. I have not used it, but a quick reading of their wiki doesn't mention Python anywhere.
Anyone who has actual experience feel free to chime in if I am saying something wrong. I have to be general because I have never used it and my knowledge is from a quick reading of the wiki/paper on it.
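To make the wrapper idea concrete, here is a minimal sketch. I haven't used mothur myself, so treat the `mothur` binary name and the `summary.seqs` command as assumptions based on its documented batch-mode syntax; the point is the shape of the wrapper, not the specific calls.

```python
# Hypothetical sketch of a thin Python wrapper around mothur commands.
import subprocess

def build_batch(command: str, **params) -> str:
    """Build a mothur batch-mode string like '#summary.seqs(fasta=x.fa)'."""
    args = ", ".join(f"{k}={v}" for k, v in params.items())
    return f"#{command}({args})"

def run_mothur(command: str, **params) -> str:
    """Call mothur on the command line and return its captured stdout."""
    result = subprocess.run(
        ["mothur", build_batch(command, **params)],  # assumes mothur is on PATH
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # parse this into Python structures as needed

# Hypothetical usage:
#   run_mothur("summary.seqs", fasta="reads.fasta")
```

From here you would add one small function per mothur command you actually need, each parsing the relevant output files back into Python.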
Turing completeness is not terribly useful to judge a language’s usefulness: Brainfuck is Turing complete and utterly useless, whereas SQL isn’t Turing complete and tremendously useful.
R isn’t — and isn’t trying to be — a Swiss army knife. It’s a special tool for a special purpose. And, although it’s admittedly far from perfect, it excels at that special purpose better than any Swiss army knife could.
However, I can’t resist pointing out that there’s a Flask-like library for R called plumber which makes writing web services in R a breeze. And for more special applications there’s also Shiny.
As mentioned there is Shiny for R, although you have to pay for a license if you want concurrent users.
I've built a little front end for scripts on the web called Wooey (for "Web UI"). It automatically generates a user interface from a Python command-line definition (argparse). It can support other command lines too, though you have to define the interface manually in that case.
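For context, this is the kind of argparse definition such a tool reads; the script and its arguments here are hypothetical, just to show the shape:

```python
# fictional_filter.py: a hypothetical script whose argparse definition
# a tool like Wooey can turn into a web form automatically.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Filter FASTA records by length.")
    parser.add_argument("fasta", help="input FASTA file")
    parser.add_argument("--min-length", type=int, default=100,
                        help="drop sequences shorter than this")
    return parser

# On the command line this would be invoked as, e.g.:
#   python fictional_filter.py reads.fa --min-length 200
args = build_parser().parse_args(["reads.fa", "--min-length", "200"])
print(args.fasta, args.min_length)  # reads.fa 200
```

Each positional argument and flag becomes a form field, so anything already written with argparse gets a web UI essentially for free.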
Look into Jupyter Notebook too; it's a popular choice for interactive Python programming. You can try it online at https://jupyter.org/try. I tend to use that for Python. RStudio is definitely the way to go for R.
There are definitely tools that do this sort of thing. Ignoring the biological specificity of the project, it is basically network analysis, and there are a bunch of things that can help! Check out networkx. If you can parse the metabolic data you need and create a graph structure, it can be visualized using this library (and matplotlib). Or if you want to get a little fancy, plotly has some examples using networkx.
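A minimal sketch of that workflow; the metabolite pairs here are made up, standing in for whatever you parse out of your data:

```python
# Sketch: build a small graph from (hypothetical) metabolite interaction
# pairs and draw it with networkx + matplotlib.
import networkx as nx
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical parsed data: pairs of interacting metabolites
edges = [("glucose", "g6p"), ("g6p", "f6p"), ("f6p", "fbp")]

G = nx.Graph()
G.add_edges_from(edges)

nx.draw(G, with_labels=True, node_color="lightblue")
plt.savefig("network.png")
```

Once the data is in a `Graph`, networkx also gives you the analysis side (degree, shortest paths, connected components) on top of the drawing.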
Do you follow along with their github organization?
They've been trying to add a lot more plugins. Now they have a cookiecutter template for creating spyder-ide plugins.
An r3.large has 2 cores. See instance types here.
So yes, you should be using -p 2. Cores/threads/CPUs are essentially the same for your purposes.
A couple other tips on EC2 (I've been using it extensively for the past year for similar work to what you seem to be doing.)
- Create your own image where you install everything from scratch. Set up keys to easily ssh/scp files to your other computers. Do all this setup using a free-tier server, and do as much testing as possible on the free tier.
- When you have real work to do, think about how many threads the program you are using can make use of, and how much memory and storage space you will need. Look through that link I provided and find several instance types that seem in the range you want.
- Then look into spot instances and check the prices for the different servers you think would be acceptable. Choose the cheapest, and set your bid price high enough that a short spike in price won't kill your run.
Many programs you will be running can take a command-line argument where you set the number of threads (like -p). But most programs can't make great use of extra threads beyond a handful, so if you have multiple tasks to run, it's usually better to run each on a small number of threads and run multiple tasks at the same time using something like xargs. Here is an example using xargs to run TransDecoder (an ORF caller):
ls *.fa | xargs -n 1 -P 20 -I % ~/bin/trinityrnaseq_r20140717/trinity-plugins/transdecoder/TransDecoder -t %
or blast (from a shell script I wrote):
ls $QUERY | xargs -n 1 -P $THREADS -I % blastp -query % -db ${i/.phr/} -num_threads 1 -outfmt 5 -out $OUT/'query_'%'db'$DBSPECIES'.blastp.out'
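If you'd rather drive this from Python than from the shell, the same fan-out pattern can be sketched with a thread pool. Here `echo` is a harmless stand-in for the real tool (TransDecoder, blastp, etc.), since each worker just shells out to one command per input file:

```python
# Sketch: run one command per input file, up to 20 at a time,
# mirroring the `xargs -n 1 -P 20` pattern above.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(fasta):
    # "echo" stands in for the real command line you want to run
    return subprocess.run(["echo", "processing", fasta],
                          stdout=subprocess.DEVNULL).returncode

files = glob.glob("*.fa")
with ThreadPoolExecutor(max_workers=20) as pool:
    codes = list(pool.map(run_one, files))
```

Threads are fine here because the work happens in the child processes; Python is only waiting on them.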
If you're going to be doing any data visualisation for a webpage, definitely check out D3. It's a data visualisation API for JS; see the link below.
Bit of a learning curve, but there are some good YouTube tutorials to start off with. I found it really great for a bioinformatics web application I helped develop a few months ago.
+1 for Anaconda. Download and install Anaconda, which installs Python and a number of other useful packages, and it's much easier to maintain environments and different versions of software down the road
In Excel (or LibreOffice Calc):
Save your spreadsheet as data.xls. I'm going to assume that your data of interest has a column name of myData.

In RStudio:
Set the working directory to the folder where data.xls is saved, then enter the following into the Console window (<Ctrl>+2 moves focus to the Console):

> setwd("<folder containing data.xls>")
> install.packages("gdata")
> library(gdata)
> data.df <- read.xls("data.xls")
> plot(density(data.df$myData))

A plot should appear in the Plots tab of the bottom-right pane of RStudio. From this plot window:
Hi!
If I understand your question correctly, this is indeed a graph you are describing. If you can reduce your problem to a dataset with the interacting metabolites it should be very easy to handle.
There are tools that can deal with biological data like this, the most commonly used is Cytoscape. It's pretty simple to use, here's the manual.
I'm doing my PhD on network/systems biology, so let me know if you have any questions about it; this is my main tool in my day-to-day work.
Disclaimer - I'm not part of a bioinformatics lab, but we generate and analyse large-scale data.
For note taking I highly recommend CherryTree (https://www.giuspen.com/cherrytree/), a kind of open-source Evernote based on local individual files with support for media and rich text. Very easy to share and backup and future-proof as well, by export to xml.
One of my big concerns switching from Windows to Linux was also the loss of Microsoft Word. But actually, everything I've needed Microsoft Word for, I was able to do the same way in LibreOffice. The two programs are so alike that there is no learning curve at all. You can even save your files as a Microsoft Word document if that would be an issue. My suggestion is to download LibreOffice for free on your Windows machine and see how it feels to you before making your decision.
I'm afraid I'm not familiar enough with HMMs to answer your question specifically, but if you want resources on HMMs for biology, the textbook that comes to mind is Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. While a bit dated nowadays, it covers HMMs extensively. You should be able to find versions in the usual places.
If you're looking for publications, I know one of the authors, Anders Krogh, has done a lot of work with HMMs, so maybe some of his papers can help you.
Sorry if I wasn't clear. I'm in medical school, and it's somewhat common to pursue an MD/PhD combination degree that is a bit more efficient time-wise than if you got them separately. The why is a long story, but in short, the field I applied to medical school for is not in good shape. I enjoyed working long hours in lab during undergrad, but my experience in the other medical fields so far has shown me that I would not be happy working long hours in them. So I want to get back to my first interest. Why bioinformatics? I really enjoy working with computers, and would have entered the field much sooner if not for my undergrad lacking opportunities.
I'm glad that my biology background will be an asset. I have toyed with Ubuntu in the past, but Windows is definitely my primary OS. Forgive my ignorance, but what is the terminal used for other than running scripts (relevant to bioinformatics, that is)?
I started learning Python via the MIT opencourseware over the summer but got distracted by another research project. Would that also be a good source, or do you recommend Learning Perl over it?
My interests are pretty broad--I admittedly don't know the scope of bioinformatics very well, so any pearls you can pass in that regard would be great. What I've seen so far that interested me are things like:
-associating biomarkers with disease traits eg treatment response, prognosis
-pharmacological modeling
-image analysis for path/radiology (a bit more in the computational biology realm, it seems? and perhaps an unreasonable goal without a heavy computing background?)
Thanks for your time
The default regex tools in linux are already optimized for fast searching capabilities over text files/input: https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux
It's pretty powerful if you take some time to learn it.
Lucky for you, Dr. Phillip Compeau, Dr. Pavel Pevzner, and I recently released the newest version of the course, called "Biology Meets Programming: Bioinformatics for Beginners"! You should check it out!
I liked this class. It doesn't have bioinformatics context but you probably shouldn't learn to code in a bioinformatics context anyways. To really start understanding programming, you have to look at the abstract motifs for what they are in a computer science sense. Then you can look at your particular context and understand how programs apply to them.
I'm not a bioinformatician myself and am primarily self taught with some extra help from friends who are. I'm just really interested in the field so honestly, my best advice is really just to Google a bunch (I know, shitty advice).
For example, I knew I was interested in genetics, so I took some quick refresher courses on the basics of genetics on Coursera and whatnot. There's a really good specialty course from Johns Hopkins specifically for bioinformatics which covers more data-sciency stuff. Apart from that, I started with the questions above and then backed into the types of algorithms that are used (Bayes, locality-sensitive hashing, etc.).
Bioinformatics is an incredibly diverse field so answering the question is hard without knowing what areas of the field you may be interested in.
This sounds similar to my situation! I'm working on forming a phylogenetic tree from specific proteins.
Anyway, before last week I had no experience at all, and was given this link (https://www.coursera.org/learn/bioinformatics-methods-1/home/welcome) which introduced BLAST and MEGA, the database and alignment software respectively, and more besides. I've found it really useful, so it might be good for you too.
Sorry for the poor formatting, I'm on my phone.
Well what did you think? This is a real head scratcher.
Section 2.4.3.4 here seems to indicate that desolvation is the stabilizing force.
Have you done any lit search?
I use Notion while reading papers to copy in quotes and figures, jot down questions, and dump links and resources. When you read something interesting, find the reference and add it to your reading list. It may be too holistic an approach, but doing this long enough crystallizes something. Also, the notebook idea is good, though I usually don't write code in the early stages.
I have used DataCamp for training at work. I was OK with R from my MSc but a complete novice in Python and SQL, and I found it really useful for getting a feel for the languages, the syntax, and the theory.
They also have some bioinformatics courses coming up in the next few months.
I largely agree with the answers here. R's APE package is a good beginning. https://www.researchgate.net/post/Is_there_a_software_that_can_compare_two_phylogenetic_trees_to_each_othe_face_to_face_comparison
Of course, the comparison is only as useful as the trees are correct. (A) You need the best MSA you can produce. (B) Distance-matrix methods are inferior to likelihood-based or Bayesian methods for all but very highly similar sequences. Regarding 16S rRNA, you are stuck with what you have; but regarding cellulase, you can tune the level of similarity by using nucleotide or protein data.
I've noticed that lots of MS and PhD programs realize that lots of their applicants are "polar" (i.e. strong in CS but weak in biology, or strong in biology and weak in CS), so many programs have relatively basic-level intro classes to bring everybody up-to-speed with the prerequisite information. Applying to an MS program in bioinformatics after even a BS in biology is totally feasible. Be sure to learn more than just programming, though (Data Structures, Algorithms)
As far as what you can do to prepare, you mentioned you're learning Python and introducing yourself to coding. Codecademy (https://www.codecademy.com/en/tracks/python) is an excellent resource to learn Python. Also, specifically for you, the Coursera class "Algorithms, Biology, and Programming for Beginners" (taught by Dr. Pavel Pevzner and Dr. Phillip Compeau) sounds perfect to practice your programming skills on biological problems. The next session begins on August 17th, so you should definitely sign up for it (100% free!).
As far as companies go, what companies? And what positions? In general, for bioinformatics positions, they want people with strong CS backgrounds (not just programming, but actual theory, like Data Structures, Design/Analysis of Algorithms, etc.). However, they also want people with strong biological backgrounds. Still, I think having a strong CS background will be preferred, so that's definitely something you wanna strengthen during the MS!
Thanks a lot for all the responses. Actually, I'm trying to use the script (PAL finder at https://sourceforge.net/projects/palfinder/), but it returned the error "Non-valid paired end read" and added /1 and /2 to the end of my fastq read headers! However, my fastq files are paired. I think the problem is related to the fastq header. The header of the example data (one of the PE reads) looks like this:
@ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/1
and the header of one of my fastq PE reads is here:
@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1
I got the below error:
Non-valid paired end read: SRR707811.1- FCD0CDRABXX:4:1101:1290:2174/1/1:TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG:eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee
SRR707811.1-
FCD0CDRABXX:4:1101:1290:2174/2/2:TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG:dae\ddddcd\ddddefdWffegffdefbdbZ\c
O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB
(Please note the end of header where /1 and /2 were added during running the script)
When I manually removed /1 and /2 in a very short test dataset, the script worked; but when I changed the end of my headers (/1 or /2) to (#0/1 or #0/2), similar to the example data, the same error appeared. So I have to remove /1 and /2 from my headers, right? Sorry, I got a bit confused: are perl -pe 'if(/^@SRR/){s#/[12]$##}' orig.fastq > changed.fastq or sed '/^@SRR/s/\/[12]$//' orig.fastq > changed.fastq the correct commands for this modification?
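In case it's useful, here is the same cleanup sketched in Python (not tested on real PAL finder input; it just applies the /1-/2 stripping to each header line, i.e. every fourth line of the file):

```python
# Sketch: strip a trailing /1 or /2 from FASTQ header lines
# (every 4th line, starting at line 0) and leave everything else alone.
import re

def strip_pair_suffix(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            if i % 4 == 0 and line.startswith("@SRR"):
                line = re.sub(r"/[12]$", "", line.rstrip("\n")) + "\n"
            fout.write(line)
```

Unlike a bare sed over every line, keying on the 4-line record structure avoids accidentally editing quality lines that happen to start with @.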
Thanks in advance
Hi!
First of all, thank you for answering and giving me some tips, it really helped me!
I wanted to add to the answers this Linux server course on freeCodeCamp. I haven't seen it yet, and it may be overkill with some IT stuff, but maybe it will be useful to somebody who stumbles on this post.
Here's the link!
https://www.freecodecamp.org/news/linux-server-course-system-configuration-and-operation/
Yeah, if you're already editing via the command line, I think it's worth it to hunker down and learn key-based navigation.
That being said, plenty of the more standard editors, like VSCode, are usually able to launch within an ssh connection, and then you have access to the normal mouse-based features that they supply. See eg https://code.visualstudio.com/docs/remote/ssh
+1 for all of this. Web development is fun and enjoyable, but it's rarely bioinformatics related (and front-end development is arguably not even programming, it's more like a form of digital art or graphic design).
Anyways, to answer OP's question: you can just use the "inspect" option on most web apps to see their HTML/CSS and get a general idea of their structure and design. It appears that NCBI is using Drupal. There is also a Drupal module for pubmed. However, most US government websites generally use their own web design tools. Drupal is a pretty old way to make a website imo, and there is no reason to use it for a personal site when plain CSS and HTML is easier and looks as good, if not better. The NCBI site pulls a little bit from jquery, but otherwise, it is using proprietary widgets and styling.
As for the server side stuff, Flask is a good choice, but "serious" sites like NCBI work differently; they host their database on one server and the site on another (and sometimes outsource the front-end entirely). NCBI is using C for most of its tools, but in the case of Flask, you can just run the scripts in the web server and have it serve static webpages with the results back to the user. It is probably not worth setting up a dedicated database to keep data at the scale that most personal projects operate on (although having your own web server is a good exercise in learning networking, even if it's not reliable).
There are a variety of janky ways to string together web applications, the way that large organizations do it is not optimal for personal websites. Flask is a good option, but there as also npm packages like Blast.js if you want to go with something more modern like React.
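A minimal sketch of that run-a-script-server-side pattern with Flask; the route and `my_tool.py` script are hypothetical placeholders:

```python
# Hypothetical sketch: a Flask route that runs a command-line script
# server-side and returns its output to the browser.
import subprocess
import sys
from flask import Flask, request

app = Flask(__name__)

@app.route("/analyze")
def analyze():
    seq = request.args.get("seq", "")
    # "my_tool.py" is a stand-in for whatever script you want to expose
    result = subprocess.run(
        [sys.executable, "my_tool.py", seq],
        capture_output=True, text=True,
    )
    return f"<pre>{result.stdout}</pre>"

# Start the dev server with: flask --app this_file run
```

For a real deployment you'd want to validate the input and queue long-running jobs rather than block the request, but this is the basic shape.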
For me, this slideshow is awesome: https://www.slideshare.net/MattHarrison4/learn-90
I took a 10-week course a few years ago and loved it. A couple of weeks ago I had to refresh my memory on a few things and found this slideshow, and I think it's the best instruction I've seen. It's lacking narration, but I think it presents the language really well.
I've also been using the dev version of SNAP but I was not able to get SNAP + freebayes running (always writes out non-Unicode characters in bam, maybe it has something to do with my reference).
I would recommend posting to their Google Groups page; they are usually pretty responsive. Though I would keep in mind that this is still very experimental software that can change a lot in the future, so it might not be ideal for pipelines...