Data Processing with sed

📅 Mar 16, 2021 · ☕ 3 min read · ✍️ Syed Dawood

Input and output

I have received a file which was roughly SQL. Here is the sample

CREATE TABLE DimState 				
				
StateSK 		int 	NOT NULL	
StateBK 		int 	NOT NULL	
StateCode 		varchar(10) 	NULL	
StateDescription 		nvarchar(50) 	NULL	
				
				
CREATE TABLE DimLeadType 				
				
LeadTypeSK 		int 	NOT NULL	
LeadTypeBK 		int 	NOT NULL	
LeadTypeCode 		nvarchar(255) 	NOT NULL	
LeadTypeDescription 		nvarchar(255) 	NOT NULL	
LeadTypeCRMBK 		int 	NULL	
LeadTypeActiveStatus 		nvarchar(10) 	NULL

The Algorithm

Deleting tabs
Deleting Empty lines
Opening brace
Closing brace
Comma,
Clean up

Implementation

Deleting tabs

Tab character are usually not displayed in most non-developer focused text-editors. they fall under hidden characters category, this can be removed using s command.

sed 's/\t//g' source.txt > stage_1.txt

Deleting Empty lines

In, The next stage of our data processing pipeline, we will eliminate all empty line from the file generated in step 1. This is achieved with d command.

sed '/^$/d' stage_1.txt > stage_2.txt

Opening brace

Line starting with CREATE should have an opening brace at the end the line.

sed '/CREATE/s/$/(/' stage_2.txt > stage_3.txt

Closing brace

This was an interesting one, The goal was to mark end of the table definition ie. ); .This was solved by inserting line before line containing CREATE. this uses i command

sed '/^CREATE/i);' stage_3.txt > stage_4.txt

Comma,

For the output file to be valid SQL, every column definition must be separated by a comma. The below command does a reverse match, So it matches all lines not starting with CREATE and appends a comma at the end of the line.

sed '/CREATE/!s/$/,/' stage_4.txt > stage_5.txt

Clean up

This command is handling the undesired byproduct created in above step. The reverse match in previous step matched all lines including the table definition endings. This removes that extra comma

sed 's/);,/);/g' stage_5.txt > output.txt

Complete program

sed 's/\t//g' source.txt > stage_1.txt
sed '/^$/d' stage_1.txt > stage_2.txt
sed '/CREATE/s/$/(/' stage_2.txt > stage_3.txt
sed '/^CREATE/i);' stage_3.txt > stage_4.txt
sed '/CREATE/!s/$/,/' stage_4.txt > stage_5.txt
sed 's/);,/);/g' stage_5.txt > output.txt
# rm stage_*

Conclusion

Did I get this working in one attempt? No. It was a series of trial and errors involving refering official manual and some quick searches on the web.

Here are the key takeaways.

Breaking down problem into small sub-problem will always help
Avoid using -i flag with sed, this performs in updates to the file, which may not be desirable depending on the end goal.

References

man sed
sed -h
Manual

Input and output

The Algorithm

Implementation

Deleting tabs

Deleting Empty lines

Opening brace

Closing brace

Clean up

Complete program

Conclusion

References

Recommend

Apple corporate staffers have returned to Apple Park, one day a week

8999元！补全短板的折叠屏，就是vivo X Fold？

[Video] Discussing Honeycomb code code with Michael Newman

关于PHP，关于Realsee

终于来了！特斯拉交付搭载4680电池的新车：续航大幅飙升

网店刮起“纯‘预’风” 消费风险不容忽视

How to reset AirPods, remove AirPods Apple ID - 9to5Mac

How Engineers Turned 10,000 Devices into an Analytics Goldmine

省掉80%配置时间，这款Mock神器免费又好用 - Liamhong

Linux驱动 | 充电芯片bq24735调试，你学会了吗？

About Joyk