
Must-Know Data Science Interview Questions And Answers


What are feature vectors?

A feature vector is an n-dimensional vector of numerical features that represents some object. In machine learning, feature vectors are used to represent the numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.
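As a minimal sketch, a hypothetical fruit could be encoded as a feature vector of (weight in grams, color score, sweetness score); the features chosen here are illustrative assumptions, not a standard encoding:

```python
def make_feature_vector(weight_g, color_score, sweetness):
    """Represent an object as an n-dimensional list of numeric features.
    The three features here are hypothetical, chosen for illustration."""
    return [float(weight_g), float(color_score), float(sweetness)]

apple = make_feature_vector(150, 0.8, 0.6)
print(apple)       # [150.0, 0.8, 0.6] -- a 3-dimensional feature vector
print(len(apple))  # n = 3
```

Once objects are represented this way, standard numeric tools (distances, dot products, learning algorithms) apply directly.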


What is K-means? 

K-means clustering is a basic unsupervised learning algorithm. It partitions data into a chosen number of clusters, K, grouping points so that data within each cluster are similar to one another.
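The idea can be sketched in a few lines; this is a minimal 1-D version (real data is usually multi-dimensional, and libraries handle initialization more carefully):

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal 1-D K-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, k=2))  # two centroids, near 1.0 and 10.0
```

The two alternating steps (assignment and centroid update) are the core of the algorithm; everything else in production implementations is refinement of initialization and convergence checks.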


Which language R or Python is most suitable for text analytics?

Python has the rich Pandas library, which gives analysts high-level data analysis tools and data structures that R lacks a direct equivalent of, so Python is generally more suitable for text analytics.


Explain cross-validation.

It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside part of the data to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
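A common variant is k-fold cross-validation, where each fold serves once as the validation set. A minimal sketch of the splitting logic:

```python
def kfold_splits(data, k):
    """Minimal k-fold sketch: yield (training, validation) pairs where
    each fold is used exactly once as the validation set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        yield training, validation

data = list(range(10))
for train, val in kfold_splits(data, k=5):
    print(len(train), len(val))  # 8 2 on every fold
```

The model is trained k times, once per split, and the k validation scores are averaged to estimate generalization performance.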


What is the importance of data cleansing in data analysis?

As data comes from multiple sources, it is important to extract the useful and relevant parts, which makes data cleansing essential. Data cleansing is the process of detecting and correcting inaccurate or irrelevant data components and deleting those that cannot be fixed. The data may be processed concurrently or in batches.
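A minimal cleansing sketch over a hypothetical record format (the field names and rules here are illustrative assumptions):

```python
def clean_records(records):
    """Drop records missing an 'id', strip whitespace from names, and
    coerce 'age' to int where possible (hypothetical schema)."""
    cleaned = []
    for r in records:
        if r.get("id") is None:
            continue  # delete irrelevant/incomplete records
        name = (r.get("name") or "").strip()
        try:
            age = int(r.get("age"))
        except (TypeError, ValueError):
            age = None  # flag unparseable values instead of guessing
        cleaned.append({"id": r["id"], "name": name, "age": age})
    return cleaned

raw = [{"id": 1, "name": " Ada ", "age": "36"},
       {"id": None, "name": "??", "age": "x"},
       {"id": 2, "name": "Bob", "age": None}]
print(clean_records(raw))
```

Real pipelines apply the same pattern (validate, normalize, drop or flag) with domain-specific rules for each field.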


Explain the steps in making a decision tree.

1. Take the entire data set as input.
2. Look for a split that maximizes the separation of the classes. A split is any test that divides the data in two sets.
3. Apply the split to the input data (divide step).
4. Re-apply steps 1 to 2 to the divided data.
5. Stop when you meet some stopping criteria.
6. Clean up the tree if you went too far doing splits; this step is called pruning.
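Step 2 above, finding the split that best separates the classes, can be sketched for a single numeric feature using Gini impurity (one common separation measure; others, such as information gain, work similarly):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {c: labels.count(c) for c in set(labels)}
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Find the threshold on one numeric feature that minimizes the
    weighted Gini impurity of the two resulting subsets."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # (3, 0.0): threshold 3 separates the classes perfectly
```

Applying this search recursively to each resulting subset (steps 3–5) grows the full tree.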


What are the two main components of the Hadoop Framework?

HDFS and YARN are the two major components of the Hadoop framework.
• HDFS- Stands for Hadoop Distributed File System. It is the distributed file system underlying Hadoop, capable of storing and retrieving very large datasets efficiently.
• YARN- Stands for Yet Another Resource Negotiator. It allocates resources dynamically and schedules the workloads.


What is Machine Learning?

Machine learning represents the logical extension of simple data retrieval and storage. It is about developing building blocks that make computers learn and behave more intelligently. Machine learning makes it possible to mine historical data and make predictions about future trends. Search engine results, online recommendations, ad targeting, fraud detection, and spam filtering are all examples of what is possible with machine learning. Machine learning is about making data-driven decisions. While instinct might be important, it is difficult to beat empirical data.


What is root cause analysis?

Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.


What is logistic regression?

It is a statistical technique, or model, used to analyze a dataset and predict a binary outcome. The outcome must be binary: either zero or one, or a yes or no. The model estimates the probability of the outcome by passing a weighted combination of the input features through the logistic (sigmoid) function.
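A minimal prediction sketch with hypothetical, hand-set weights (a trained model would learn these from data, typically by maximum likelihood):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Predict a binary label and its probability from hand-set
    (hypothetical) weights -- training is omitted for brevity."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return (1 if p >= 0.5 else 0), p

label, prob = predict([1.5, -0.8], -0.2, [2.0, 1.0])
print(label)  # 1, since the predicted probability exceeds 0.5
```

The 0.5 threshold on the probability is what turns the continuous sigmoid output into the binary decision.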


What are Recommender Systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.
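One simple approach is user-based collaborative filtering; this sketch predicts a rating as a similarity-weighted average of other users' ratings (the rating matrix layout and values are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_rating(target, others, item):
    """Predict the target user's rating for `item` as a similarity-
    weighted average of the other users' ratings for that item."""
    num = den = 0.0
    for ratings in others:
        sim = cosine(target, ratings)
        num += sim * ratings[item]
        den += sim
    return num / den if den else 0.0

# rows are users, columns are items (ratings 0-5); 0 means unrated
user = [5, 4, 0]
neighbors = [[5, 5, 4], [4, 4, 5], [1, 1, 2]]
print(round(predict_rating(user, neighbors, item=2), 2))
```

Users whose past ratings resemble the target user's contribute more to the prediction; content-based filtering is the other major subclass, scoring items by their attributes instead.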


What is the goal of A/B Testing?

It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The objective of A/B testing is to determine whether a change to a web page improves a chosen outcome metric, such as the conversion rate of a strategy.
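A common way to analyze the result is a two-proportion z-test on the conversion rates; a minimal sketch (the counts used here are made up for illustration):

```python
import math

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test sketch for an A/B experiment: returns the
    z statistic and two-sided p-value for the difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = ab_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(round(z, 2), round(p, 4))  # a small p-value suggests B genuinely differs
```

If the p-value falls below the chosen significance level (commonly 0.05), the observed difference between A and B is unlikely to be due to chance alone.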


Explain star schema.

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and are connected to the central fact table via ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to retrieve information faster.