Data Science is a rapidly growing field, with natural opportunities for astronomers moving to industry. Recently, we’ve heard from Jessica Kirkpatrick about becoming a data scientist at Microsoft: part 1 and part 2. We’ve also heard from Stephanie Gogarten on switching to biostatistics at UW.
Let’s pick up where Jess and Stephanie left off, and discuss the technical side: what data science is, and how one can learn it. To do this, I interviewed my wife, Dr. Andrea Leistra. After getting her Ph.D. in Astronomy from the University of Arizona in 2006, Andrea went into the tech sector. She’s worked for three companies: Yahoo, Concur, and now Audax Health, a tech start-up.
JR: So, tell me about data science.
AL: I was fortunate to already be in industry when data science became a big thing. I made the transition from astronomy to analytics, and then to data science. Data science is extremely hot right now, and astronomers are decently positioned to take advantage of that.
JR: Why don’t you back up and define data science?
AL: Data science is a set of statistical analysis techniques to deal with very large datasets. “Very large” is a moving target; we’re talking about many terabytes to petabytes of data. The exact nature of the data doesn’t matter to people who read your resume. What’s needed are the statistical chops and programming chops to handle large data volumes, for example techniques like MapReduce and NoSQL databases.
JR: How big is big, in big data?
AL: If your dataset fits on your computer, it’s not big. Unless you need to use Amazon’s cloud services, or use another remote server farm, with tens of terabytes to petabytes of data, it’s not big data. You can use big data techniques on smaller datasets, of course.
JR: Did you really put on your resume “I fear no petabyte”?
AL: Um, no.
JR: Too bad. Astronomers pride themselves on having big data. The archive for a great observatory is tens of terabytes. LSST will be 60 PB. Is astronomy data big data?
AL: LSST is legitimately big data. The WISE and SDSS catalogs are. Your observing run is not.
JR: What kinds of companies care about big data?
AL: Internet companies. Any company that wants to drive traffic, or sell something on the internet. Some government institutions handle big data.
JR: If I measure some data, and fit it with a Gaussian, that’s science with data, but I’m not a data scientist. What makes a real data scientist?
AL: For example, if someone comes up to you and says, “Here’s the data from the last week of Twitter: mine it and tell me about demographics and trends,” and you can do that, then you’re a data scientist.
JR: How would an interested astronomer get their feet wet?
AL: An easy way is to take a Coursera machine learning course. These are free online courses; they don’t give you any formal academic credit, but you can get your feet wet, see if you like it, and gain some useful skills. It also looks good on a resume.
JR: How hard is one of those courses, compared to graduate-level Interstellar Medium?
AL: Piece of cake. These are 5-10 hour a week courses.
JR: How did you learn the statistics you need to do your job?
AL: I took a statistics course in college, and brushed up on my own in grad school. Since then, I’ve taught myself what else I’ve needed.
JR: What kind of tools are useful?
AL: An advantage of diving into data science is that the state-of-the-art tools are open source. I would advise becoming intimately familiar with “R”, the statistics and data analysis package. R is your friend. The most widely used MapReduce platform is the Hadoop ecosystem, including Hive and Mahout. Since all this is open source, one can download them and start playing.
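For readers who have not met the MapReduce pattern mentioned above, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; it only imitates the three phases (map, shuffle, reduce) that such a platform distributes across many machines. The sample tweets and all function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    return [(word.lower(), 1) for word in tweet.split() if word.startswith("#")]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def count_hashtags(tweets):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(tweet) for tweet in tweets)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data, invented for this example.
SAMPLE_TWEETS = [
    "Clear skies tonight #astronomy #observing",
    "New survey data release #astronomy",
    "Learning #Hadoop this weekend",
]

if __name__ == "__main__":
    print(count_hashtags(SAMPLE_TWEETS))
```

On a real cluster, the map and reduce steps would run in parallel on different machines and the framework would handle the shuffle. The point of the pattern is that each step only ever looks at one record or one key at a time, so no single machine needs to hold the whole dataset.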
JR: If grad students reading this are interested in data science, are there ways they can incorporate these tools in their astrophysical dissertations?
AL: It depends on their thesis. If it’s archival, or using a large survey like SDSS, then it might make sense to use big data techniques. In some cases it would be contrived, but it could still work.
JR: The great book “Put your Science to Work for You” says that your real job skills are often not your real skills. What are the real job skills a person learns while getting an astronomy PhD?
AL: The obvious ones are coding skills, statistics, data analysis. The more subtle skills are the ability to quickly learn new skills, to present results to a variety of audiences, from a technical audience like the readers of a scientific paper, to non-technical audiences, like students.
JR: Let’s talk about applying for data science jobs with an astronomy resume.
AL: Outside of academia, a Ph.D. in astrophysics is taken as a marker of being extremely smart. As an astronomer transitioning to industry, you are likely to be extremely over-qualified on some axes, and under-qualified on others. You are not likely to have the programming experience that other applicants will.
JR: What programming languages should a grad student learn, to enable these types of careers?
AL: Python is a good balance of being widely used in the outside world, and in astronomy. Java is the most widely used language in software engineering. But you’ll never beat a Java hacker who’s been using it from day one of college.
JR: How does one turn their academic CV into a resume?
AL: First, know that your resume will be read by a machine before a person sees it. Don’t lie, but mention all of your skills. Second, don’t mention Fortran. It’ll make you look like a dinosaur.
JR: Even Fortran 90?
AL: Even Fortran 90. No Fortran!
JR: You realize we’ll be making Python jokes in 20 years.
AL: Yes, which is actually important. You need to be willing and able to keep up with current languages. I know Python, but I’m currently working in Scala, which is a lot like Java. I learned it 6 months ago.
JR: What was the job interview process like, for your most recent job?
AL: There were three stages, which is pretty standard. First, a 15-minute phone screening with human resources. Second, a 45-minute scheduled phone interview, which included technical questions about programming, with some example problems. Third, an in-person interview, which is usually about 3 hours long. It can be by Skype if you’re not local. Expect to talk to several different people, for 30 minutes each. Several people asked me to write code with pen and paper. Some asked me technical thinking questions. And there were also some typical job interview questions, for example, asking what difficult problems I had solved in the past.
JR: Tell me the timeline of that interview process.
AL: I found the job ad online. I applied. A week later, I was contacted by the company, and a week later, interviewed. The whole process, from finding the job, interviewing, getting an offer, negotiating, and accepting, took a month.
JR: What’s the culture like at places you’ve worked?
AL: I’ve had 3 industry jobs now. Each has had its own culture. All have a business-casual dress code, or less formal (jeans and T-shirts). My current employer is a startup. It’s a bunch of really smart people who talk about a bunch of stupid stuff, and who work really hard, but also goof off. There’s a foosball table and a pingpong table.
JR: What about the gender breakdown?
AL: In terms of gender breakdown, the tech sector makes the AAS look like yoga class. I am even more in the minority as a woman than I was in astronomy.
JR: Tell me about meetups.
AL: If you’re trying to make the transition, learn the buzzwords, and also network, meetup.com has interesting data science meetups in many cities. Washington DC’s Data Science meetup is excellent.
JR: What are keywords to find meetups?
AL: Data science, Hadoop, MapReduce, big data.
JR: Do you like what you do for a living?
AL: Yes. I love my current job.
JR: Are you having more fun than you were as an astronomer?
AL: Yes. The pace is much faster. A large-scale, so-called “Epic” project is something I work on for a month. Most of my tasks are a few days to weeks. That doesn’t mean they’re not all related, but it’s something different. I get much more immediate feedback than I did as an astronomer, and I like what I’m doing.