Data Processing with sed
source link: https://allsyed.com/posts/data-pipeline-with-sed/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Input and output
I have received a file which was roughly SQL. Here is the sample
CREATE TABLE DimState
StateSK int NOT NULL
StateBK int NOT NULL
StateCode varchar(10) NULL
StateDescription nvarchar(50) NULL
CREATE TABLE DimLeadType
LeadTypeSK int NOT NULL
LeadTypeBK int NOT NULL
LeadTypeCode nvarchar(255) NOT NULL
LeadTypeDescription nvarchar(255) NOT NULL
LeadTypeCRMBK int NULL
LeadTypeActiveStatus nvarchar(10) NULL
The Algorithm
- Deleting tabs
- Deleting Empty lines
- Opening brace
- Closing brace
- Comma,
- Clean up
Implementation
Deleting tabs
Tab character are usually not displayed in most non-developer focused text-editors. they fall under hidden characters category, this can be removed using s command.
sed 's/\t//g' source.txt > stage_1.txt
Deleting Empty lines
In, The next stage of our data processing pipeline, we will eliminate all empty line from the file generated in step 1. This is achieved with d command.
sed '/^$/d' stage_1.txt > stage_2.txt
Opening brace
Line starting with CREATE should have an opening brace at the end the line.
sed '/CREATE/s/$/(/' stage_2.txt > stage_3.txt
Closing brace
This was an interesting one, The goal was to mark end of the table definition ie. ); .This was solved by inserting line before line containing CREATE. this uses i command
sed '/^CREATE/i);' stage_3.txt > stage_4.txt
For the output file to be valid SQL, every column definition must be separated by a comma. The below command does a reverse match, So it matches all lines not starting with CREATE and appends a comma at the end of the line.
sed '/CREATE/!s/$/,/' stage_4.txt > stage_5.txt
Clean up
This command is handling the undesired byproduct created in above step. The reverse match in previous step matched all lines including the table definition endings. This removes that extra comma
sed 's/);,/);/g' stage_5.txt > output.txt
Complete program
sed 's/\t//g' source.txt > stage_1.txt
sed '/^$/d' stage_1.txt > stage_2.txt
sed '/CREATE/s/$/(/' stage_2.txt > stage_3.txt
sed '/^CREATE/i);' stage_3.txt > stage_4.txt
sed '/CREATE/!s/$/,/' stage_4.txt > stage_5.txt
sed 's/);,/);/g' stage_5.txt > output.txt
# rm stage_*
Conclusion
Did I get this working in one attempt? No. It was a series of trial and errors involving refering official manual and some quick searches on the web.
Here are the key takeaways.
- Breaking down problem into small sub-problem will always help
- Avoid using -i flag with sed, this performs in updates to the file, which may not be desirable depending on the end goal.
References
man sed
sed -h
- Manual
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK