Over the last several years, understanding and analyzing data has become central to many companies’ business strategies. It’s estimated that, by 2026, the market for big data will be worth $273B. Product and service companies alike use data to understand their customers and provide a better experience, and they invest in strategies to improve data security and privacy.
A couple of years ago, the data scientist role was the hottest position in the job market. As organizations of all sizes rushed to bring data scientists into their teams, it became clear that many weren’t entirely sure what this role needed in order to succeed. While today we have a better understanding of the variations in this role and the skills and experience needed for different specializations, it’s still challenging to hire a good data scientist.
In this article, we’ll look at the most common data science roles, the core background and skills you should look for, and the main differences between each role. We’ll also share how to evaluate candidates and give you some sample technical questions to use during your interview process. If you’re looking to hire data scientists and would like a little help, get in touch with our Talent team! They’d be happy to set up a call with you.
Understanding the Different Data Roles: Data Scientist, Data Analyst, Data Engineer and Machine Learning Engineer
As technology has evolved, so too has the role of the data scientist. The 2010s were the dawn of the “Era of Big Data”, and companies suddenly had access to more data than they knew what to do with. At that point, the need to store, process, and structure data was the main focus. Today, key areas include integrating data analytics with internal business tools, using AI and machine learning to improve the quality of insights and inform decisions, and data governance. The challenge today is balancing the need to use data to understand customers with maintaining transparency and protecting their privacy and security.
The Data Engineer
The precursor of all data science roles was the data engineer, which had many names, including ETL (Extract, Transform, Load) developer and sometimes even BI (Business Intelligence) developer. The abilities of a data engineer have been needed since we started creating digital solutions. The main tasks of this role are:
- Clean and format raw data
- Store, move and transform data
- Create the product’s data model
The Data Analyst
The Data Analyst is the natural evolution of the engineer and emerged from our need to further explore data, digging into it and trying to find patterns. Data Analysts have business acumen and use data to understand how the business is developing and retrieve insights from it. This role is responsible for:
- Crunch data to uncover insights
- Create graphs and diagrams to better understand data
- Turn their findings into insights that help stakeholders make decisions
The Data Scientist
One step further is the data scientist, who, like the data analyst, tries to retrieve insights from the data. Additionally, data scientists use statistical models and machine learning to predict how the data will behave in the future. One main difference between the analyst and the scientist is business understanding, which tends to be stronger with analysts. The main tasks of a data scientist include:
- Create a model to represent the data
- Train the model with the data provided
- Make predictions using the model
The Machine Learning Engineer
The machine learning engineer is the final piece in this puzzle. This role is responsible for taking the models created by data scientists and deploying them in the cloud, making them available for use by services and applications inside the company. The main tasks of this role are:
- Create the infrastructure needed to run the application
- Deploy the models created by data scientists
- Monitor the algorithms
It is important to understand the different roles when hiring a data expert to solve problems in your company. Although responsibilities may vary between organizations, a data scientist is not usually expected to create pipelines to store the data, for example. On the other hand, finding a T-shaped professional who understands the whole process could be extremely useful.
Skills to Look for When You’re Hiring a Freelance Data Scientist
Regardless of which data science role you need to hire, all good candidates will have some traits in common. More than digging into data and creating models to fit it, a good candidate will have strong analytical skills and domain knowledge to fully address the problem. Sometimes Data Scientists are also responsible for presenting their findings, and in these cases, soft skills become extremely important. Below we have quickly summarized the essential competencies you may look for while searching for a Data Scientist.
Because the types of projects data science practitioners take on vary widely, there’s no prescriptive answer when it comes to what education and work experience to look for. However, there are some trends we see coming up in data science resumes again and again.
Along with those who have software engineering backgrounds, we are seeing more people from physics and mathematics. But candidates can have backgrounds outside of this too, in fields like biology or civil engineering for example, as long as they have an analytical mindset and the ability to think outside of the box. Who you hire will also depend on your business. For example, if you have a hardware company that’s trying to leverage data science in your products, hiring a candidate with a background in electrical engineering will likely be beneficial.
A good mathematical understanding is essential for every data scientist. Matrices play a huge role in both statistical models and machine learning, so linear algebra is an important tool. Regression and gradient descent are also common in many machine learning applications. Both depend on calculus, so it’s essential to understand how they work. Bayes’ Theorem and probability are also fundamental to the data science role, underpinning many classifiers. To summarize, look for a candidate who has knowledge of Calculus, Linear Algebra, Probability, and Statistics.
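To make the link between calculus and machine learning concrete, here is a minimal sketch of gradient descent fitting a straight line by minimizing the mean squared error. The function name `fit_line`, the learning rate, the step count, and the toy data are all our own illustrative assumptions, not a reference implementation.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    Hypothetical helper for illustration; the defaults suit this toy data only.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption made for this example)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

A candidate who understands the calculus behind this loop should be able to explain why the gradients have this form and what happens when the learning rate is too large.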
Data scientists need to have exceptionally strong analytical skills. Being able to look at the data and find patterns is essential to connect business problems with the huge amount of data points one may have. Sometimes the game changer is in the details, and this ability to spot and retain these features can help decision makers. This critical thinking will help connect the dots to make sense of facts and assist the company in making better, data-driven decisions. If someone checks all the other skills, but lacks analytical expertise, it may be a red flag in the selection process.
At the same time, no data scientist can develop a product without knowing how to code. As the need arises to take a deeper look into data and sometimes even create a Proof of Concept, knowledge of query languages and object-oriented languages becomes critical to thrive in data science. In particular, we recommend hiring data scientists with Python experience, as it’s become the industry standard in data science. In part, that’s because of the extensive Python libraries available: from data exploration to cleaning and modeling, there are several libraries that developers can leverage.
R has also become popular among companies and is no longer confined to academia. Python or R is a basic programming skill that data scientists should have, but exposure to SQL, Unix and software development is also a desirable bonus.
Increasingly, data scientists are required to present their findings in an approachable and useful way. After uncovering insights and behaviors with data, the most important thing is to make the information available so that decisions can be made upon it. Therefore, being able to communicate the main insights for both technical and non-technical audiences, managing stakeholders’ expectations, and making clear what is feasible is a huge plus. Remember that programming and math skills can be taught, while soft skills are much harder to coach – so interviewing for these skills is critical.
Essential Tools for Data Scientists to Know
Like the skills mentioned above, a good set of tools is essential for data scientists to explore and retrieve insights from data. We recommend testing for these skills in technical interviews and also taking some time to review a portfolio of the candidate’s work to assess their knowledge. Here are the main tools you should expect a candidate to have a solid grasp of:
- Versioning: Code versioning is a must for developers and is present in all serious projects. A tool like Git helps teams organize code development and prevents programmers from working on diverging versions of the code. Check the candidate’s repositories on GitHub and GitLab to get an idea of how they code.
- Notebooks: Notebooks are a type of IDE that allows developers to execute each block of code independently, making it easy to debug the code and explore the data. This is particularly useful when developing Proofs of Concept. Some popular options are Jupyter Notebooks and Databricks.
- Package/Environment Management System: When working with different libraries for different projects, having a management system in place is extremely important. Systems like Conda and Pyenv help programmers use different versions of the same library for different projects. A good understanding of how to deploy models and cloud computing is also desirable.
- Visualization tools: After retrieving insights, data scientists show their findings to stakeholders. For technical audiences, this can be done using Python libraries such as Seaborn and Matplotlib. For non-technical audiences, Tableau and Power BI are good tools to show discoveries and let users explore the data themselves.
How to Design a Data Science Interview
When it comes to evaluating candidates, it’s important to include a mix of questions that test for hard and soft skills, as well as challenges that allow you to test their technical abilities. Several details are only visible when looking at a candidate’s code. Well-structured functions, documentation, and readable code are expected from a good fit.
There are two main approaches for testing a data scientist’s technical abilities: whiteboard and take-home challenges.
- Whiteboard tests usually propose a problem that must be solved in a small timeframe with a given programming language. This is done in person using a whiteboard or by having the candidate share their screen if coding remotely.
- In take-home challenges, candidates are provided bigger tasks but allowed a wider time frame to solve them.
The best approach depends on which aspects you want to evaluate. For instance, whiteboard interviews and live coding allow for a peek into the candidate’s resilience to pressure, while take-home challenges give the developer more room to showcase their creativity.
One downside of take-home challenges is that candidates almost always spend more time on them than requested. On the other hand, the whiteboard approach puts a lot of pressure on candidates, with every line of their code being scrutinized, and this may affect their performance. With either approach, it’s important to have clear goals and give the candidate guidance in order to get the most out of the interview for both sides.
It’s worth investing time and money to create a good evaluation method. Don’t base your decision solely on it being a trend or because big companies are doing it. Instead, choose an approach that makes sense for the reviewer and also the candidate, having a clear objective on what should be evaluated.
Data Scientist Interview Questions
Here are five questions that will help you better understand a candidate’s knowledge and draw a clear line between junior and senior data scientists.
Question: Give an example of a non-Gaussian distribution.
Answer: Good candidates should have a solid understanding of statistics, and this short question is a quick way to set them apart. The exponential family contains many distributions, of which the normal distribution (also known as Gaussian) is the most common. Good candidates will bring up the Poisson or Bernoulli distribution, for example.
Question: Can you explain the steps to clean a dataset?
Answer: While it depends on the type of data, a general sequence would look like this:
- Remove duplicates
- Remove irrelevant data
- Convert data types
- Clear formatting
- Fix errors
- Handle missing values
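The sequence above can be sketched in a few lines of plain Python. The records, the column names, and the choices made at each step (dropping the `notes` column, treating negative ages as errors, filling gaps with the median) are hypothetical and only meant to illustrate the order of operations; a real project would more likely use a library such as pandas.

```python
from statistics import median

# Hypothetical raw records, invented for this example
raw_rows = [
    {"id": "1", "name": " Alice ", "age": "34", "notes": "vip"},
    {"id": "1", "name": " Alice ", "age": "34", "notes": "vip"},  # duplicate
    {"id": "2", "name": "BOB", "age": "-5", "notes": ""},         # impossible age
    {"id": "3", "name": "carol", "age": "", "notes": "new"},      # missing age
]


def clean(rows):
    # 1. Remove duplicates while preserving the original order
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    cleaned = []
    for row in unique:
        # 2. Remove irrelevant data (here we assume "notes" is not needed)
        row = {k: v for k, v in row.items() if k != "notes"}
        # 3. Convert data types (empty strings become None)
        row["id"] = int(row["id"])
        row["age"] = int(row["age"]) if row["age"] else None
        # 4. Clear formatting
        row["name"] = row["name"].strip().title()
        # 5. Fix errors: a negative age is treated as missing
        if row["age"] is not None and row["age"] < 0:
            row["age"] = None
        cleaned.append(row)

    # 6. Handle missing values by filling with the median of the known ages
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = median(known) if known else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned


cleaned_rows = clean(raw_rows)
for r in cleaned_rows:
    print(r)
```

A strong candidate will point out that each of these choices, especially how to fill missing values, depends on the dataset and the business problem.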
Question: Can you explain when to normalize and when to standardize data?
Answer: One should use normalization when the distribution of the dataset is unknown. It is useful when the data varies in scale and the algorithm, like k-nearest neighbors, does not make assumptions about the distribution. On the other hand, standardization assumes that the data has a Gaussian distribution and should be applied before algorithms like linear regression.
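To see the difference in practice, here is a small sketch using the common definitions: normalization as min-max rescaling to the range 0 to 1, and standardization as rescaling to zero mean and unit standard deviation. The function names and the sample values are our own assumptions for illustration.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1.

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation.

    Assumes the values are not all identical (otherwise sigma is zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values on an arbitrary scale
incomes = [20.0, 30.0, 40.0, 50.0, 60.0]
normalized = normalize(incomes)
standardized = standardize(incomes)
print(normalized)
print([round(z, 3) for z in standardized])
```

Normalized values are bounded, while standardized values are centered on zero and can be negative, which is worth asking a candidate to explain.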
Question: What is splitting and pruning in the Decision Tree algorithm?
Answer: There are several Machine Learning algorithms that help data scientists uncover insights. Answering this question requires the candidate to understand and explain how a Decision Tree works. Splitting adds nodes to a tree, while pruning removes them. Splitting increases the complexity of the model and pruning reduces it, which can be useful for removing nodes with little predictive power when classifying, for example.
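As a rough illustration, the sketch below picks the best split on a single numeric feature using Gini impurity, then applies a very simplified pruning rule that discards a split when it reduces impurity by less than a threshold. The helper names, the `min_gain` threshold, and the data are hypothetical, and real libraries use more elaborate pruning strategies.

```python
def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, gini(ys))  # no split at all is the baseline
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


def should_prune(parent_impurity, split_impurity, min_gain=0.05):
    """Simplified rule (our assumption): drop splits that barely help."""
    return parent_impurity - split_impurity < min_gain


# Hypothetical data: one numeric feature and a yes/no label
xs = [1, 2, 3, 10, 11, 12]
ys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_split(xs, ys)
print(threshold, score, should_prune(gini(ys), score))
```

Here the split separates the two classes perfectly, so the pruning rule keeps it. A split that only lowered the impurity from 0.5 to 0.48 would be pruned under the same threshold.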
Question: What is the difference between Supervised and Unsupervised Learning? Give examples.
Answer: Almost every Machine Learning algorithm falls into one of these two categories. An algorithm is supervised when it is given labeled data and uses it to infer the labels of the test data, e.g. Support Vector Machines, Decision Trees, and Regression. Unsupervised learning, in contrast, tries to group similar data without being given labels. Common examples are Clustering and Anomaly Detection models.
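The contrast can be shown with two tiny examples on the same data: a nearest-neighbor classifier that needs labels, and a two-cluster version of k-means that does not. These are not the algorithms named in the answer; we chose them because they fit in a few lines. The function names and the height data are assumptions made for illustration.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: use labeled examples to predict the label of a new point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups without any labels.

    A minimal k-means with k=2; assumes at least two distinct values.
    """
    c0, c1 = min(values), max(values)  # deterministic starting centers
    for _ in range(iterations):
        # Assign every value to its nearest center
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        # Move each center to the mean of its group
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted(g0), sorted(g1)


# Hypothetical heights in centimeters
heights = [150, 152, 155, 180, 183, 185]
labels = ["short", "short", "short", "tall", "tall", "tall"]

# Supervised: the labels guide the prediction
prediction = nearest_neighbor_predict(heights, labels, 178)

# Unsupervised: the groups emerge from the data alone
groups = two_means(heights)
print(prediction, groups)
```

Note that the clustering step recovers the same two groups but cannot name them, which is a useful point to probe in an interview.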
Extra non-technical question: Outside of your technical skills, what value will you bring to our company as a Data Scientist?
This question will help you understand how much your candidate has researched your company and how well they can communicate the main benefits of having a Data Scientist within the team. Use this question to assess their communication skills and ability to explain complex topics. Ask them to explain one of the models they mentioned in layman’s terms.
Final Thoughts on Hiring a Data Scientist
We hope this guide gives you a framework to select the best Data Scientist and leverage data in your organization. Finding great data scientists is certainly no easy task. Don’t follow the tips blindly, because it is going to be very difficult to find someone with the right background, solid math and coding skills, and a good set of soft skills to present the results. The key is to identify which skills you can teach and which candidate is a good fit for the company. As this is an important role for your project, spend some time preparing the selection process, evaluating what we shared above, and creating a fair test to assess candidates’ skills.