In a previous post, Importing “Weird” Data Files into SQL Server, I described a couple of unusually structured data files and what I had to do to get them into an SQL Server database table. Each full row of data is imported into one database column. Now I am going to describe how I went about splitting the data from this column into multiple columns.
Creating a Consistent Number of Column Delimiters
One of the “weird” file types was tab-delimited and had an irregular number of tabs in each row of data. In fact, there were between 8 and 13 tabs in each row. I needed to make sure that each row had the same number of tabs, i.e. 13; so that I could easily split them.
UPDATE [Schema].[Table]
SET Data = CASE WHEN 13 - (DATALENGTH(Data) - DATALENGTH(REPLACE(Data, CHAR(9), ''''))) = 1 THEN Data + CHAR(9)
WHEN 13 - (DATALENGTH(Data) - DATALENGTH(REPLACE(Data, CHAR(9), ''''))) = 2 THEN Data + CHAR(9) + CHAR(9)
WHEN 13 - (DATALENGTH(Data) - DATALENGTH(REPLACE(Data, CHAR(9), ''''))) = 3 THEN Data + CHAR(9) + CHAR(9) + CHAR(9)
WHEN 13 - (DATALENGTH(Data) - DATALENGTH(REPLACE(Data, CHAR(9), ''''))) = 4 THEN Data + CHAR(9) + CHAR(9) + CHAR(9) + CHAR(9)
WHEN 13 - (DATALENGTH(Data) - DATALENGTH(REPLACE(Data, CHAR(9), ''''))) = 5 THEN Data + CHAR(9) + CHAR(9) + CHAR(9) + CHAR(9) + CHAR(9)
ELSE Data
END
This works by comparing the length of the data in the string with the length of the data in the string when the tab characters are replaced. The result is deducted from 13 to determine how many additional tab characters need to be appended so that each row has 13 tabs in it. The case statement causes the correct number of tabs to be added.
If anyone has a neater solution, I would be very interested to have your comments.
Splitting the Data in the SELECT
We could if we chose split the data using the SUBSTRING function combined with CHARINDEX and CHAR functions in the SELECT part of the query, e.g.
SELECT SUBSTRING(Data, 1, CHARINDEX(CHAR(9), Data) - 1) AS Column0
,SUBSTRING(Data, CHARINDEX(CHAR(9), Data) + 1, (CHARINDEX(CHAR(9), Data, CHARINDEX(CHAR(9), Data) + 1) - CHARINDEX(CHAR(9), Data) - 1)) AS Column1
FROM [Schema].[Table];
For the Column0, we know that the start index for the SUBSTRING function is 1, but we require the CHARINDEX function to determine the length of the substring (which is 1 less than the index of the first tab character, which we have to identify using the CHAR function). This is not too complicated. For the 2nd column, we need to determine the index of the first tab character and the substring starts at the next index value. The length is determined by finding the index of the 2nd tab by starting to look for it from the index of the first tab plus 1; then we have to take away the index of the first tab the take away another 1 to get the length of the substring. Getting more complicated. The last column required would be slightly easier since we can use the LEN (or DATALENGTH, if required) function and take away the index of last tab.
I had 14 columns. You can see how complex the code would become if I were to pursue this approach. Each column is determined by repeating calculations already done for the preceding column. We need a method of preserving the indexes of the tabs. That leads me to …
Using CROSS APPLY to Split Data
The APPLY operator specifies that the table source (table-valued function or table-valued expression) on the right of the operator is evaluated against every row of the record set on the left. Then they are joined together, so that all the columns from both sources appear in the result set. We can use multiple CROSS APPLYs to avoid repeatedly calculating the index of each tab.
SELECT Column1
,Tab1
,Column2
,Tab2
,Column3
,Tab3
,Column4
,Tab4
,Column5
,Tab5
,Column6
,Tab6
,Column7
,Tab7
,Column8
,Tab8
,Column9
,Tab9
,Column10
,Tab10
,Column11
,Tab11
,Column12
,Tab12
,Column13
,Tab13
,Column14
FROM [Schema].[Table]
CROSS APPLY (SELECT CHARINDEX(CHAR(9), Data) AS Tab1) AS GetTab1
CROSS APPLY (SELECT SUBSTRING(Data, 1, Tab1 - 1) AS Column1) AS GetColumn1
CROSS APPLY (SELECT SUBSTRING(Data, Tab1 + 1, (CHARINDEX(CHAR(9), Data, Tab1 + 1) - Tab1 - 1)) AS Column2
,CHARINDEX(CHAR(9), Data, Tab1 + 1) AS Tab2
) AS Column2AndTab2
CROSS APPLY (SELECT SUBSTRING(Data, Tab2 + 1, (CHARINDEX(CHAR(9), Data, Tab2 + 1) - Tab2 - 1)) AS Column3
,CHARINDEX(CHAR(9), Data, Tab2 + 1) AS Tab3
) AS Column3AndTab3
CROSS APPLY (SELECT SUBSTRING(Data, Tab3 + 1, (CHARINDEX(CHAR(9), Data, Tab3 + 1) - Tab3 - 1)) AS Column4
,CHARINDEX(CHAR(9), Data, Tab3 + 1) AS Tab4
) AS Column4AndTab4
CROSS APPLY (SELECT SUBSTRING(Data, Tab4 + 1, (CHARINDEX(CHAR(9), Data, Tab4 + 1) - Tab4 - 1)) AS Column5
,CHARINDEX(CHAR(9), Data, Tab4 + 1) AS Tab5
) AS Column5AndTab5
CROSS APPLY (SELECT SUBSTRING(Data, Tab5 + 1, (CHARINDEX(CHAR(9), Data, Tab5 + 1) - Tab5 - 1)) AS Column6
,CHARINDEX(CHAR(9), Data, Tab5 + 1) AS Tab6
) AS Column6AndTab6
CROSS APPLY (SELECT SUBSTRING(Data, Tab6 + 1, (CHARINDEX(CHAR(9), Data, Tab6 + 1) - Tab6 - 1)) AS Column7
,CHARINDEX(CHAR(9), Data, Tab6 + 1) AS Tab7
) AS Column7AndTab7
CROSS APPLY (SELECT SUBSTRING(Data, Tab7 + 1, (CHARINDEX(CHAR(9), Data, Tab7 + 1) - Tab7 - 1)) AS Column8
,CHARINDEX(CHAR(9), Data, Tab7 + 1) AS Tab8
) AS Column8AndTab8
CROSS APPLY (SELECT SUBSTRING(Data, Tab8 + 1, (CHARINDEX(CHAR(9), Data, Tab8 + 1) - Tab8 - 1)) AS Column9
,CHARINDEX(CHAR(9), Data, Tab8 + 1) AS Tab9
) AS Column9AndTab9
CROSS APPLY (SELECT SUBSTRING(Data, Tab9 + 1, (CHARINDEX(CHAR(9), Data, Tab9 + 1) - Tab9 - 1)) AS Column10
,CHARINDEX(CHAR(9), Data, Tab9 + 1) AS Tab10
) AS Column10AndTab10
CROSS APPLY (SELECT SUBSTRING(Data, Tab10 + 1, (CHARINDEX(CHAR(9), Data, Tab10 + 1) - Tab10 - 1)) AS Column11
,CHARINDEX(CHAR(9), Data, Tab10 + 1) AS Tab11
) AS Column11AndTab11
CROSS APPLY (SELECT SUBSTRING(Data, Tab11 + 1, (CHARINDEX(CHAR(9), Data, Tab11 + 1) - Tab11 - 1)) AS Column12
,CHARINDEX(CHAR(9), Data, Tab11 + 1) AS Tab12
) AS Column12AndTab12
CROSS APPLY (SELECT SUBSTRING(Data, Tab12 + 1, (CHARINDEX(CHAR(9), Data, Tab12 + 1) - Tab12 - 1)) AS Column13
,CHARINDEX(CHAR(9), Data, Tab12 + 1) AS Tab13
) AS Column13AndTab13
CROSS APPLY (SELECT REPLACE(SUBSTRING(Data, Tab13 + 1, LEN(Data) - Tab13), CHAR(13), '') AS Column14
) AS LastColumn;
This last method is the one that I chose.
You must be logged in to post a comment.