Cleansing a Dump of 1.5 Million Email Addresses
Recently, I had to clean a massive dump of 1.5 million email addresses for my employer, who is planning to launch an ad campaign for his upcoming project. I am not sure about the source, but the text file I was given had random comments, commas & line breaks scattered between the emails.
I downloaded the .txt file, which was 34.6 MB in size, and opened it in my code editor to have a look. It had millions of lines & I even got a small memory warning popup from the editor, which I simply ignored.
Contemplating the best approach, I resolved to develop a bash script for the extraction. Given my familiarity with regular expressions, I knew one would be required for the pattern matching, whether through grep, awk, sed, or Python.
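Before scripting anything, the candidate regex can be sanity-checked against the first few lines of the dump. The one-liner below is a sketch of that step, not the exact command I ran; dump.txt stands in for the actual file, and the pattern is the one I eventually settled on:

head -n 20 dump.txt | grep -Eo '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'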
After a few iterations and refinements, I arrived at the following solution:
#!/bin/bash
# Accept the file as input
input_file="$1"
# Extract email addresses anywhere on a line
emails=$(grep -Eo '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' "$input_file")
# Count the extracted emails; grep -c . returns 0 for an empty result,
# where wc -l would report 1
count=$(echo "$emails" | grep -c .)
# Save the emails to a new file
echo "$emails" > clean_emails.txt
# Display the count of extracted emails
echo "Successfully extracted $count email(s)."
Result
Within milliseconds the script extracted all the emails found in the raw dump and compiled them into a new file named clean_emails.txt.
I converted the final .txt file into .csv and handed it over to the employer, done & dusted.
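I won't reproduce the exact conversion I used, but for a single-column file it amounts to little more than prepending a header row. The sketch below also deduplicates with sort -u, on my assumption that a dump this size contains repeats:

# Prepend a header row and drop duplicate addresses
{ echo "email"; sort -u clean_emails.txt; } > clean_emails.csv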
This is Day 12 of #100DaysToOffload
That is all for today. If you found this post useful, consider sharing it with friends.