Friday, September 14, 2018

24 data science projects to increase your knowledge and skills (with free access)

Introduction

Data science projects offer a promising way to kick-start your career in this field. Not only do you learn data science by applying it, you also get projects to show on your resume. Recruiters today judge a candidate's potential by their work and place little emphasis on certifications. It does not matter how much you say you know if you have nothing to show for it. This is where most people struggle and fall short.
You may have worked on several problems before, but if you cannot present them clearly and explain them easily, how will anyone know what you are capable of? That is where these projects help. Think of the time you spend on them as training sessions: the more you practise, the better you will become.
We have made sure to give you a sample of problems from different domains. We believe that everyone must learn to work intelligently with large amounts of data, so large datasets are included. In addition, all the datasets are open and free to access.

Useful information

To help you decide where to start, we have divided the list into three levels:

1. Beginner level: This level comprises data sets that are relatively easy to work with and do not require complex data science techniques. You can solve them using basic regression or classification algorithms. These data sets also have enough open tutorials to get you going. In this list, we also provide tutorials to help you get started.

2. Intermediate level: This level comprises data sets that are more challenging in nature. It consists of medium and large data sets that require serious pattern recognition skills. Feature engineering will also make a difference here. There is no limit on the ML techniques you can use; everything under the sun is fair game.

3. Advanced level: This level is best suited to people who understand advanced topics such as neural networks, deep learning and recommendation systems. High-dimensional data sets are also featured here. This is also the time to get creative: look at the creativity the best data scientists bring to their work and code.


Beginner level


1. Iris data set

This is probably the most versatile, easy and resourceful data set in the pattern recognition literature. Nothing could be simpler than the Iris data set for learning classification techniques. If you are completely new to data science, this is your starting line. The data has only 150 rows and 4 columns.
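
A nearest-neighbour classifier is one simple technique to try on this data set. The sketch below is only an illustration: it uses six rows typed in by hand (two per species) instead of the full 150-row file, and the function name `predict_species` is made up for this example.

```python
import math

# Six rows of the Iris data set (two per species), typed in by hand for
# illustration. Columns: sepal length, sepal width, petal length,
# petal width, all in centimetres.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def predict_species(features, train=TRAIN):
    """Return the label of the training row closest to `features` (1-nearest neighbour)."""
    nearest = min(train, key=lambda row: math.dist(features, row[0]))
    return nearest[1]


# Two flowers that are not in the tiny training sample above.
print(predict_species((5.0, 3.6, 1.4, 0.2)))
print(predict_species((6.9, 3.1, 4.9, 1.5)))
```

With the full data set you would hold out part of the rows for testing before trusting any accuracy figure.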


2. Loan prediction data set

Among all industries, the insurance domain is one of the largest users of analytics and data science methods. This data set gives you a taste of working with insurance company data: what challenges are faced, what strategies are used, which variables influence the outcome, and so on. This is a classification problem. The data has 615 rows and 13 columns.

3. Bigmart sales data set

Retail is another industry that makes extensive use of analytics to optimize business processes. Tasks such as product placement, inventory management, personalized offers and product bundling are handled intelligently using data science techniques. As the name implies, this data is made up of transaction records from a sales store. This is a regression problem. The data has 8523 rows of 12 variables.

4. Boston housing data set

This is another popular data set used in the pattern recognition literature. The data set comes from the real estate industry in Boston (USA). This is a regression problem. The data has 506 rows and 14 columns. It is therefore a fairly small data set on which you can try any technique without worrying about running out of memory on your laptop.

5. Time series analysis data set

Time series analysis is one of the most commonly used techniques in data science. It has wide-ranging applications: weather forecasting, sales prediction, year-on-year trend analysis, and so on. This dataset is specific to time series, and the challenge here is to forecast traffic on a mode of transport. The data has ** rows and ** columns.
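
Before trying anything elaborate on a forecasting problem, it helps to have a baseline to beat. The sketch below shows one common baseline, a moving-average forecast. The traffic counts are made-up toy numbers, not values from this dataset, and `moving_average_forecast` is a hypothetical helper name.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(series) < window:
        raise ValueError("window must be positive and no longer than the series")
    recent = series[-window:]
    return sum(recent) / window


# Toy passenger counts for five consecutive periods (illustrative only).
traffic = [10, 12, 14, 16, 18]
next_value = moving_average_forecast(traffic, window=3)
print(next_value)
```

Any model you build later should forecast better than this baseline on held-out periods, otherwise the extra complexity is not paying off.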

6. Wine quality data set

This is one of the most popular datasets among beginners in data science. It is divided into 2 data sets. You can perform both regression and classification tasks on this data. It will test your understanding of several areas: outlier detection, feature selection and unbalanced data. There are 4898 rows and 12 columns in this data set.

7. Turkiye student evaluation dataset

This dataset is based on an evaluation form completed by students for different courses. Its attributes include attendance, difficulty and the score for each evaluation question, among others. This is an unsupervised learning problem. The dataset has 5820 rows and 33 columns.


8. Heights and weights dataset

This is a fairly straightforward problem and is ideal for people who are starting out in data science. It is a regression problem. The dataset has 25,000 rows and 3 columns (index, height and weight).
Problem: Predict the height or weight of a person.
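
With a single input and a single output, this problem can be approached with ordinary least squares on one variable. The sketch below fits a straight line by hand. The five height and weight pairs are made-up values that lie exactly on a line so the arithmetic is easy to check; they are not rows from the dataset, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalised; the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up, perfectly linear toy data: weight = 3 * height - 80.
heights = [60, 62, 64, 66, 68]
weights = [100, 106, 112, 118, 124]

slope, intercept = fit_line(heights, weights)
predicted_weight = slope * 65 + intercept
print(slope, intercept, predicted_weight)
```

Real measurements will scatter around the fitted line, so on the actual data you would also look at the residuals rather than only the coefficients.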

Intermediate level


1. Black Friday Dataset

This dataset consists of sales transactions captured at a retail store. It is a classic dataset for exploring and expanding your feature engineering skills and your day-to-day understanding of shopping behaviour. This is a regression problem. The dataset has 550,069 rows and 12 columns.

2. Human activity recognition dataset

This dataset was collected from recordings of 30 human subjects captured via smartphones with embedded inertial sensors. Many machine learning courses use this data for teaching purposes. Now it is your turn. This is a multi-class classification problem. The dataset has 10,299 rows and 561 columns.

3. Text mining data set

This data set originally comes from the SIAM text mining competition held in 2007. The data consists of aviation safety reports describing problems that occurred on certain flights. It is a multi-class, high-dimensional classification problem. It has 21,519 rows and 30,438 columns.

4. Travel History Dataset

This data set comes from a bicycle sharing service in the United States. It requires you to put advanced data wrangling skills to work. Data is provided quarterly from 2010 (Q4) onwards. Each file has 7 columns. It is a classification problem.

5. Million Song data set

Did you know that data science can also be used in the entertainment industry? Now try it yourself. This data set presents a regression task. It consists of 515,345 observations and 90 variables. However, this is only a small subset of the original database of about one million songs.

6. Census Income data set

This is an unbalanced classification problem and a machine learning classic. Machine learning is widely used to solve unbalanced problems such as cancer detection, fraud detection, and so on. It is time to get your hands dirty. The dataset has 48,842 rows and 14 columns. As a guide, you can check out this unbalanced data project.
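
On unbalanced data, accuracy alone can be misleading, which is why precision and recall are usually reported as well. The sketch below illustrates this with made-up labels (95 negatives and 5 positives, not taken from the census data) and a deliberately useless model that always predicts the majority class. The helper name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive
        # or when there are no positive examples at all.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels: 95 negatives, 5 positives (illustrative only).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
always_negative = [0] * 100

result = evaluate(y_true, always_negative)
print(result)
```

The always-negative model scores high on accuracy while finding none of the positive cases, so compare models on recall and precision for the minority class as well.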

7. MovieLens data set

Have you built a recommendation system yet? This is your opportunity! This dataset is one of the most popular and most cited datasets in the data science industry. It is available in several sizes. Here I have used a fairly small one. It has 1 million ratings from 6,000 users on 4,000 movies.

8. Twitter classification data set

Working with Twitter data has become an integral part of sentiment analysis problems. If you want to carve out a niche in this area, you will have fun working on the challenge this dataset poses. The dataset is 3MB in size and has 31,962 tweets.
Problem: Identify the tweets that are hate tweets and those that are not.


Advanced level


1. Identify the digits data set

This data set lets you study, analyze and recognize elements in images. That is exactly how your camera detects your face: by using image recognition! Now it is your turn to build and test the technique. It is a digit recognition problem. This data set has 7,000 images of 28 x 28 pixels, totalling 31MB.

2. Classification of urban sound

When you start your machine learning journey, you begin with simple machine learning problems, such as Titanic survival prediction. But that still does not give you enough practice with real-life problems. This practice problem is therefore meant to introduce you to audio processing in a typical classification scenario. The data set consists of 8,732 excerpts of urban sounds from 10 classes.
Problem: Classify the type of audio sound.

3. VoxCeleb Dataset

Audio processing is fast becoming an important field in deep learning, so here is another challenging problem. This data set is intended for large-scale speaker identification and contains words spoken by celebrities, extracted from YouTube videos. It is an intriguing use case for isolating and identifying speech. The data contains 100 thousand utterances spoken by 1,251 celebrities.

4. ImageNet Dataset

ImageNet offers a variety of problems, including object detection, localization, classification and scene parsing. All images are freely available. You can search for any type of image and build your project around it. As of now, this image engine has more than 15 million images of various shapes, adding up to 140 GB.

5. Chicago crime data set

The ability to handle large datasets is expected of every data scientist today. Companies no longer prefer to work on samples when they have the computational power to work on the full data set. This dataset gives you much-needed hands-on experience in handling large datasets on your local machine. The problem is easy, but data management is the key! This dataset has 6M observations. It is a multi-class classification problem.
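
One way to manage a file too large to load comfortably is to stream it row by row and keep only running totals in memory. The sketch below counts rows per category this way. The column name `Primary Type` and the three sample rows are assumptions made for illustration, and `count_by_column` is a hypothetical helper; check the header of the file you actually download.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count rows per value of `column`, reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# A tiny in-memory stand-in for a large CSV file (illustrative rows only).
sample = io.StringIO("ID,Primary Type\n1,THEFT\n2,BATTERY\n3,THEFT\n")
counts = count_by_column(sample, "Primary Type")
print(counts.most_common(1))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`, so that only one row is held in memory at a time.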

6. Age detection of Indian actors data set

This is a fascinating challenge for any deep learning enthusiast. The dataset contains thousands of images of Indian actors, and your task is to identify their age. All the images were manually selected and cropped from video frames, resulting in a high degree of variability in scale, pose, expression, lighting, age, resolution, occlusion and makeup. There are 19,906 images in the training set and 6,636 in the test set.

7. Recommendation engine data set

This is an advanced recommendation system challenge. In this practice problem, you are given data about programmers and the questions they have previously solved, along with the time they took to solve each question. As a data scientist, the model you build will help online judges decide which level of question to recommend to a user next.


8. VisualQA dataset

VisualQA is a data set containing open-ended questions about images. Answering these questions requires an understanding of both computer vision and language. There is an automatic evaluation metric for this problem. The dataset has 265,016 images, 3 questions per image and 10 ground-truth answers per question.




