Cleansing a Dump of 1.5 Million Email Addresses

, posted

Lately, I got to cleanize a massive dump of 1.5 Email Addresses for my employer who is planning to launch an ad campaign for his upcoming project. I am not sure about the source but the text file I was provided with had random comments, commas & breaks in it besides the emails.

I downloaded the .txt file which was 34.6 MB in size and imported in my code editor to have a look. It had millions of lines & I even got a small memory warning popup from the editor, which I simply ignored.

Contemplating the best approach, I resolved to develop a bash script for email extraction. Given my familiarity with regular expressions, I knew it would required for pattern recognition, whether through grep, awk, sed, or Python.

After few iterations and refinements, I arrived at the following solution:


# Accept the file as input

# Extract email addresses
emails=$(grep -Eo '^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b' $input_file)

# Count the number of emails extracted
count=$(echo "$emails" | wc -l)

# Save the emails to a new file
echo "$emails" > clean_emails.txt

# Display the count of extracted emails
echo "Successfully extracted $count email(s)."


Within milliseconds the script extracted all the emails found in the raw dump and compiled them into a new file namely clean_emails.txt.

alt text

I converted the final .txt file into .csv and handed over to the employer, done & dusted.

This is Day 12 of #100DaysToOffload

That is all for today. If you found this post useful consider sharing it with friends.

Your Signature