This entry is part 1 of 2 in the series 1000ml Pillars

The 6 crucial skills Jr Data Scientists need

Yes, you need to model, and so much more!

By VICTOR ANJOS

I speak and write a lot about broken education systems and learning in general, and I give my readers a little insight into what to do about it. Well, if you're an aspiring Data Scientist (or Data Science Analyst), this post is all about you and the plight of getting your foot in the door.

Obviously, the easiest way to gain these skills is to come to one of our upcoming intro sessions and join an upcoming cohort. We are still the only Mentor program with a recruitment (or, more accurately, placement) function. It works much like executive search, except this part of our business is exclusively for entry-level data candidates.

"I honestly don't know why everyone thinks they need to know neural networks and all take Andrew Ng's Coursera course"

So you want to be a Data Scientist

So what is it that entry-level candidates really need? It's actually rather simple when you think about the entire life-cycle of a data science or machine learning project. Let's quickly walk through those steps and talk about the skills necessary to excel at each one and, ultimately, get hired.

"Hey you, how can I tell if this campaign is effective?"

Starting a Project

Someone finds you and asks you to do some “data science” or analysis because they know you're capable. You're in the Problem Discovery stage now, and it's time to hunker down; it's about to get real!

Problem Discovery

As a data scientist you will routinely discover, or be presented with, problems to solve. Your initial objective should be to determine whether your problem is in fact a Data Science problem, and if so, what kind. There is great value in being able to translate a business idea or question into a clearly formulated problem statement, and to communicate effectively whether that problem can be solved by applying appropriate Machine Learning algorithms.
 
A good data science problem aims to drive decisions, not just predictions. Keep this objective in mind as you contemplate each problem you are faced with. In the campaign example above, a prediction of effectiveness is only useful if it leads to an action, such as reallocating budget or pausing the campaign. Ultimately, the predictions from your model should empower your stakeholders to make informed decisions, and to take action!

Below is a sample project brief we present in the 1000ml Mentorship Program.

Ok, good – problem defined and we're ready to work on some data. Let's get that data into my Jupyter notebook ASAP.

But wait, data in school and in all these online courses and tutorials was always readily available, in a good state and easy to parse through.  What do you mean data is not perfect in the real world?

Data Engineering and Hygiene

Datasets usually contain large volumes of data that may be stored in formats that are not easy to use. That's why data scientists first need to make sure the data is correctly formatted and conforms to a consistent set of rules. Moreover, combining data from different sources can be tricky, and another job of the data scientist is making sure that the resulting combination of information makes sense.
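
To make this concrete, here is a minimal pandas sketch of combining two sources and sanity-checking the result; the file and column names (orders.csv, users.csv, user_id) are hypothetical placeholders.

```python
# Combining two hypothetical sources and checking the combination makes sense
import pandas as pd

orders = pd.read_csv("orders.csv")
users = pd.read_csv("users.csv")

# validate= makes pandas raise if the join cardinality is not what we
# expect, catching silent row duplication caused by a bad key
combined = orders.merge(
    users, on="user_id", how="left", validate="many_to_one", indicator=True
)

# A left join should preserve the row count of the left table
assert len(combined) == len(orders)

# Inspect orders whose user_id had no match in the users table
unmatched = combined[combined["_merge"] == "left_only"]
print(f"{len(unmatched)} orders have no matching user")
```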
 
Data sparseness and formatting inconsistencies are the biggest challenges, and that's what data cleansing is all about. Data cleaning identifies incorrect, incomplete, inaccurate, or irrelevant data, fixes the problems, and makes sure the same issues are caught automatically in the future.
 
This is where 1000ml shines most: shifting the paradigm in the following ways (a sketch follows the list):
 
    • Dealing with missing data

    • Standardizing the process

    • Validating data accuracy

    • Removing duplicate data

    • Handling structural errors

    • Getting rid of unwanted observations
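
Here is a minimal pandas sketch touching each of those points; the file name, column names and thresholds are made up for illustration, so adapt them to your own data.

```python
# A hypothetical customer table with the usual real-world problems
import pandas as pd

df = pd.read_csv("customers.csv")

# Handling structural errors: normalize labels and stray whitespace
df["email"] = df["email"].str.strip().str.lower()

# Dealing with missing data: impute numeric gaps, drop rows missing key fields
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["email"])

# Validating data accuracy: keep only values in a plausible range
df = df[df["age"].between(0, 120)]

# Removing duplicate data
df = df.drop_duplicates(subset=["email"], keep="first")

# Getting rid of unwanted observations, e.g. internal test accounts
df = df[~df["email"].str.endswith("@example.com")]
```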

"Data scientists spend 60% of the time organizing and cleansing data!"

As part of a good product organization, you will need to fit into its feature-development practices, but that's kinda weird and awkward for a typical Data Scientist, right? You sort of want to live in a notebook and have all this piping and infrastructure just there; shouldn't it be that easy?

Production Engineering

Here's the absolute minimum every single (aspiring and experienced) Data Scientist should know about production engineering (i.e. development practices):
 

Code is not written in isolation, it is written collaboratively

COMMENT YOUR CODE. Twice as much as you're doing now! Commenting before writing code is a good practice.
Write politely. Use explicit variable names and file names (I see you, Mr untitled09.ipynb). Delete or comment out code you're no longer using.
Document your code. Add a README; give your reader a helping hand.
Pin the versions of the modules you use (see the sketch below). In a year, when you have to re-run the project in a rush, you'll know exactly which versions you need.
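
For that last point, here is a minimal sketch of capturing your environment from inside a script or notebook; it is equivalent to running pip freeze from a shell.

```python
# Write the exact versions of every installed module to requirements.txt
import subprocess
import sys

with open("requirements.txt", "w") as f:
    subprocess.run([sys.executable, "-m", "pip", "freeze"],
                   stdout=f, check=True)
```

Commit the resulting requirements.txt alongside your code, and future-you can rebuild the environment with pip install -r requirements.txt.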
 

If it’s not in version control, it doesn’t exist

Treat code lying on your PC as if it didn't exist. It's not versioned, it's not reviewed, it's not CI'd, it's not backed up. It. Just. Doesn't. Exist.
 
Oh, and did you know Jupyter notebooks are actually versionable files?
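
They are just JSON files underneath. Here is a minimal sketch, using nbformat (which ships with Jupyter), of stripping outputs so diffs stay readable; the notebook name is a placeholder, and tools like nbstripout automate this step.

```python
# Strip cell outputs from a notebook before committing it to version control
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []            # drop bulky, ever-changing outputs
        cell.execution_count = None  # drop run counters that pollute diffs
nbformat.write(nb, "analysis.ipynb")
```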
 

Make the safe play as much as possible with architecture

You're a data scientist? Great. You'd like to use the latest framework that has half a star on a GitHub repository published 3 years ago and never touched since (sorry, no links). Don't.
You're going to have plenty of uncertainty to deal with already, so please avoid skating on thin ice with an obscure PyTorch fork from an abandoned university project.
 

Dangling code and poorly written code is the worst

Successful coding sessions often involve trimming unnecessary parts of your code. This is one simple way to manage complexity.
 

Model Training is for others, not for you, silly

Never train for yourself; train for someone else. That means documenting, commenting, and two major things:
1. Give others a way to train against your original data (version your data and share its URL, for example!)
2. Give others a way to evaluate against your test data
Seems obvious? Go and read some deep learning papers, and you'll understand it's not always that obvious 🙂
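
Here is a minimal scikit-learn sketch of what that looks like in practice; the data URL and the "label" column are hypothetical placeholders.

```python
# Train a model and save everything others need to retrain and re-evaluate
import json
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Point others at the exact (versioned) data you trained on
DATA_URL = "https://example.com/data/v1/train.csv"  # hypothetical
df = pd.read_csv(DATA_URL)
X, y = df.drop(columns=["label"]), df["label"]

# 2. Make the train/test split reproducible so others can evaluate fairly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Ship the model together with the data location, split seed, and metrics
metadata = {
    "data_url": DATA_URL,
    "split_seed": 42,
    "test_accuracy": float(accuracy_score(y_test, model.predict(X_test))),
}
joblib.dump(model, "model.joblib")
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```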
 

Don’t copy & paste, steal (but give credit)

Everyone does it: you find a solution online, you stick it in your code base, and later you can't figure out who wrote it… Once you've copy-pasted the solution, be kind enough to reference the original Stack Overflow post, for example as a comment. It's not much, but it helps a lot when you're trying to figure out why you did that.
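
For example, a hypothetical snippet crediting its source (the URL points at the well-known Stack Overflow question on chunking a list):

```python
# Adapted from: https://stackoverflow.com/questions/312443
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

print(list(chunks([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```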
 

No one uses Python2 (if they have a choice)

Seriously, it’s old and Python3 rocks – just use it.
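
If you need convincing, here are two small things Python 3 gets right out of the box:

```python
# True division and f-strings, two everyday wins over Python 2
print(7 / 2)   # 3.5 in Python 3; Python 2 silently truncated this to 3
print(7 // 2)  # 3, when you actually want floor division

name = "data scientist"
print(f"Hello, {name}!")  # f-strings require Python 3.6+
```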

Tune in next week for the remaining pillars!

So what do you do when all signs point to needing a university degree to gain any sort of advantage? Unfortunately, the current state of affairs is that most employers will not hire you without a degree, even for junior or entry-level jobs. Once you have that degree, coming to a Finishing and Mentor Program (1000ml being the only one worldwide) is the way forward to gaining the practical knowledge and experience that will jump-start your career.

Check out our next dates below for our upcoming seminars, labs and programs, we’d love to have you there.

Be a friend, spread the word!

Series Navigation: The 6 crucial skills Jr Data Scientists need – Part 2 >>