Richard Pelgrim

255 Followers

Published in

Towards Data Science

·Pinned

Arabic NLP: Unique Challenges and Their Solutions

Pre-processing Arabic text for machine learning using the camel-tools Python package

In this article, I provide a concise overview of the challenges of working with Arabic text in NLP projects and the tools available to overcome them. I rely heavily on the camel-tools Python package developed at the NYU Abu Dhabi CAMeL Lab and this excellent webinar by its director…

NLP

7 min read


Published in

Towards Data Science

·Pinned

The Beginner’s Guide to Distributed Computing

7 Fundamental Concepts to Succeed With Distributed Computing in Python

Enter the Distributed Universe

More and more data scientists are venturing into the world of distributed computing to scale up their computations and process larger datasets faster. Starting your distributed computing journey, however, can feel a bit like entering an alternate universe: overwhelming, intimidating and confusing. Here’s the good news: you don’t need…

Data Science

12 min read


Published in

Towards Data Science

·Mar 29

Sliding Windows in Pandas

Identify Patterns in Time-Series Data with Overlapping Window Techniques

Windowing techniques enable data analysts to identify valuable patterns in time-series data. Sliding windows are particularly powerful because they allow you to spot patterns earlier than other techniques. This matters in situations where making a key decision a few minutes (or seconds) earlier can save you money.

Data Science

10 min read
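
As a rough illustration of the overlapping-window idea described in this teaser, here is a minimal sketch using the pandas `Series.rolling()` API. The sample values and the window size of 3 are assumptions chosen for illustration; the article itself may use different data or methods.

```python
import pandas as pd

# Hypothetical time-series values, invented for this sketch.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A sliding (overlapping) window of 3 observations: each result is the
# mean of the current value and the two values before it.
rolling_mean = values.rolling(window=3).mean()

# The first two positions are NaN because a full window of 3 values
# is not yet available there.
print(rolling_mean)
```

Because consecutive windows share observations, each new data point produces an updated statistic right away, rather than only once a non-overlapping block has filled up.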


Jun 1, 2022

Not Everyone Can Become a Data Scientist

Why we need to talk more openly about privilege — This one’s going to be short and to the point, because I’m actually quite upset. I’m upset about privilege. Or, to be more precise, I’m upset about how little we talk about privilege in the data industry. A lot of us act like the tech industry is this golden land…

Data Science

3 min read


Published in

Towards Data Science

·May 17, 2022

Accessing the NYC Taxi Data in 2022

Everything you need to know about the recent changes

As of May 13, 2022, access to the NYC Taxi data has changed. Parquet is now the default file format, replacing CSV. In practice, this means you will need to change two things in your code:

  • Change the path to the S3 bucket
  • Use the dd.read_parquet() method instead…

Python

5 min read


Apr 1, 2022

Julia vs Python for Data Science in 2022

Comparing Programming Languages for Data Science

This article compares Julia to Python in terms of general performance, package availability and adoption, and gives guidance on whether you should consider learning Julia.

Know Your Programming Languages for Data Science

In 2021 Python achieved the #1 ranking in the TIOBE Index of programming languages for the second year in a row. …

Data

5 min read


Published in

Towards Data Science

·Feb 10, 2022

5 Rookie Mistakes to Avoid when Using Dask

Strategies for Successful Distributed Computing in Python

Using Dask for the first time can involve a steep learning curve. This post presents the 5 most common mistakes people make when using Dask, along with strategies for avoiding them. Let’s jump in.

1. “Dask is basically pandas, right?”

The single most important thing to do before starting to build things with…

Python

7 min read


Published in

Towards Data Science

·Jan 7, 2022

How to Build Powerful Airflow DAGs for Big Data Workflows in Python

Scale your Airflow pipelines to the cloud

Airflow DAGs for (Really!) Big Data

Apache Airflow is one of the most popular tools for orchestrating data engineering, machine learning, and DevOps workflows. But it has one important drawback. Out of the box, Airflow will run your computations locally, which means you can only process datasets that fit within the resources of your machine. To use Airflow for…

Data Science

5 min read


Published in

Towards Data Science

·Jan 5, 2022

Why You Should Save NumPy Arrays with Zarr

Read and Write Arrays Faster with Dask

tl;dr: This post tells you why and how to use the Zarr format to save your NumPy arrays. It walks you through the code to read and write large NumPy arrays in parallel using Zarr and Dask. Here’s the code if you want to jump right in. If you have questions…

Data Science

5 min read


Published in

Towards Data Science

·Dec 25, 2021

How to Write NumPy Arrays to CSV Files

And why you should consider other file formats

This post explains how to write NumPy arrays to CSV files. We will look at:

  • the syntax for writing different NumPy arrays to CSV
  • the limitations of writing NumPy arrays to CSV
  • alternative ways to save NumPy arrays

Let’s get to it.

Writing NumPy Arrays to CSV

You can use the np.savetxt() function to save…

Data Science

4 min read
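
The teaser above names `np.savetxt()` but is cut off before showing it. The following is a minimal sketch of one way to call it, not the article's own code. The array contents, the comma delimiter, the integer format string, and the in-memory buffer (used in place of a file path so the example leaves nothing on disk) are all assumptions made for illustration.

```python
import io

import numpy as np

# Hypothetical 2x3 integer array, invented for this sketch.
arr = np.arange(6).reshape(2, 3)

# Write to an in-memory text buffer; passing a filename such as
# "array.csv" instead would write to disk.
buffer = io.StringIO()
np.savetxt(buffer, arr, delimiter=",", fmt="%d")
csv_text = buffer.getvalue()

# Read the CSV text back to check the round trip.
loaded = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)
print(csv_text)
```

Note that `np.savetxt()` handles 1-D and 2-D arrays only, which is one reason other file formats can be a better fit for higher-dimensional data.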

Mindful techie crunching data at scale | Connect: https://www.linkedin.com/in/richard-pelgrim/ | Unlimited Reads: https://richardpelgrim.medium.com/membership
