Parsing Strings in C#: Tokenizing Strings While Ignoring White Space

Information/Programming : 2007. 7. 5. 20:05

Today at work, I ran into a situation where I needed to split a string into several strings in C#. I was expecting something like what I would do in C++, where I could use a class derived from iostream and the extraction operator (>>). Although I believe it is possible to implement a class that lets you write similar code (operator overloading is allowed in C#, as far as I know), I was fairly sure that there is no direct way of doing this in the .NET platform.

All of a sudden, the StringTokenizer class in Java (http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html) came to mind. Not surprisingly, I found that the String class in C# has a Split() method that does something similar to Java's StringTokenizer.

According to the MSDN .NET Framework reference (http://msdn2.microsoft.com/en-us/library/b873y76a.aspx), Split() splits the string and returns an array of strings. There are several ways to call it, passing one or more parameters that act as the delimiters. When the parameter is null, the default delimiter is assumed to be white space. So...

As anyone could have guessed, "Darb\nSmarba".Split() returns {"Darb", "Smarba"}. However, it behaved very strangely on the string that I was working on, which had the form "Darb\r\nSmarba\r\n".

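The only differences are that it has additional white-space characters at the end of the string and that each newline character is preceded by a carriage return. The result of Split() on my string had additional elements in the returned array, each of them the empty string "". To make things clear, {"Darb", "", "Smarba", ""} is what I got as a result. Huh??? As it turns out, Split() treats every white-space character as a separate delimiter, so the "\r\n" pair leaves an empty string between the two characters, and white space at the end of the string leaves an empty string after it.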
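To make the behavior concrete, here is a small sketch that prints what the default Split() returns. The class name SplitDemo and the helper Show are made up for illustration, and the second input is a shortened variant of my string without the trailing "\r\n".

```csharp
using System;

static class SplitDemo
{
    // Hypothetical helper: formats an array so that empty strings are visible.
    public static string Show(string[] parts)
    {
        return "{" + string.Join(", ", Array.ConvertAll(parts, p => "\"" + p + "\"")) + "}";
    }

    static void Main()
    {
        // A single '\n' between the words: one delimiter, two tokens.
        Console.WriteLine(Show("Darb\nSmarba".Split()));    // {"Darb", "Smarba"}

        // "\r\n" is two delimiters in a row, so an empty string sits between them.
        Console.WriteLine(Show("Darb\r\nSmarba".Split()));  // {"Darb", "", "Smarba"}
    }
}
```

Every extra white-space character next to another one, or at either end of the string, adds one more empty element in the same way.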

I needed a way to work around this odd behavior of the Split() method, so I turned to Google as usual. After doing some research, I found that a regular expression would do the splitting. I suppose a regular expression makes it clear and precise what must be used as a delimiter. But here is another WEIRD thing: I had to use the Regex class from the System.Text.RegularExpressions namespace.

Anyway, here is how I got it working.
First of all, I need

using System.Text.RegularExpressions;

then afterwards, where I need to do the split...

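// Trim() removes leading and trailing white space (same effect as TrimStart().TrimEnd()),
// so Regex.Split does not return empty strings at either end of the array.
string[] tokens = Regex.Split(text_to_parse.Trim(), "\\s+");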

This splits on every run of white-space characters (\t, \n, \r, " ", and any others that \s matches), so none of them end up in the tokens.
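For anyone who wants to try it, here is a minimal runnable sketch built around the Regex.Split call above. The class name WhitespaceTokenizer and the method Tokenize are hypothetical, and the check for all-white-space input is my own addition: Regex.Split returns a single empty string when the input is empty, which is probably not what a caller wants.

```csharp
using System;
using System.Text.RegularExpressions;

static class WhitespaceTokenizer
{
    // Hypothetical helper that wraps the Regex.Split call from this post.
    public static string[] Tokenize(string text)
    {
        string trimmed = text.Trim();

        // Regex.Split("", ...) would return one empty string, so input that is
        // empty or all white space is handled explicitly (an added assumption).
        if (trimmed.Length == 0)
        {
            return new string[0];
        }

        // Split on every run of one or more white-space characters.
        return Regex.Split(trimmed, @"\s+");
    }

    static void Main()
    {
        foreach (string token in Tokenize("Darb\r\nSmarba\r\n"))
        {
            Console.WriteLine("[" + token + "]");
        }
        // prints [Darb] and then [Smarba]
    }
}
```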

Not something that I can call an accomplishment, but I think it is worth sharing what I learned today, considering that it took me a while to find the solution. I have kept my story short, but it took me a while to figure things out and refresh my memory of regular expressions.

By the way, I found a good reference for Regular Expressions.
http://opencompany.org/download/regex-cheatsheet.pdf

Introduction:
Today at work I had to split a string into several strings on white space in C#. It looked like the Split method of the String class would do the job. According to the MSDN .NET Framework Reference (http://msdn2.microsoft.com/en-us/library/b873y76a.aspx), splitting "Darb\nSmarba" returns {"Darb", "Smarba"}. But when I passed "Darb\r\nSmarba\r\n" to Split, the return value was {"Darb", "", "Smarba", ""}.
Today I found a way to turn "Darb\r\nSmarba\r\n" into {"Darb", "Smarba"} in C#.

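For people in Korea who have learned programming but have not yet gone out into the working world to program professionally, I would guess that C is still more familiar than C++, C++ more familiar than Java, and Java more familiar than other languages.

I do not know for sure, but I often hear that, unlike in the US, schools in Korea still start teaching programming with C. If that is no longer the case, it just means I have gotten older and lost touch with today's college students... ^^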

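Anyway, as someone who entered college in 2000, I learned programming at a US university starting with C++. C++ used to be the language I was most comfortable with, but lately I have preferred languages such as Java and C#, which are more strongly object-oriented than C++, so topics like polymorphism and many of the finer, more intricate details of C++ have grown hazy. I can no longer say that C++ is my most comfortable language, but for writing programs that perform fairly limited, simple tasks, no language is as comfortable as C++.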

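By fairly limited, simple tasks I mean programs that can easily be written and used with a text-based interface. The largest share of these are tasks that require parsing text, for example reading all the past Lotto winning numbers from text and analyzing them... <-- I am actually working on this

I have not done much parsing in C, so I do not know how easy it is there. And since how convenient a language is for parsing is highly subjective, I will not claim that C++ is the most convenient for parsing. But at least for me it is.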

Today at work I happened to need to parse a string in C#. It was not a hard task: I wanted to split the string on the white space it contains. In C++ this is easily done with a class derived from iostream and the extraction operator (>>). But I struggled for a while because I could not find a way to get an effect similar to the parsing approach I used in C++. I searched Google and eventually worked it out. So I want to share the method for people who are used to parsing in C++ but not in C#. It is nothing special, but since it was not easy to find on Google, I think just recording it here is worthwhile, as it may help many people.

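So, on to the main point. There is no way that is exactly the same as using a class derived from iostream and the ">>" operator as in C++ (as far as I know, operator overloading is possible in C#, so it looks like one could implement it), but I found a way to parse that is similar to what I used when I dabbled in Java.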

* The examples that follow use a string named text_to_parse.
In Java, you can parse a string with the StringTokenizer class as follows.

   import java.util.StringTokenizer;

   // Print each white-space-delimited token on its own line.
   StringTokenizer st = new StringTokenizer(text_to_parse);
   while (st.hasMoreTokens()) {
       System.out.println(st.nextToken());
   }

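A StringTokenizer object is constructed with a string object as its parameter, and it breaks that string into tokens (simply put, the units of text the programmer wants). By default it splits on white space, which is the most common case. To split some other way, pass a string containing the delimiter characters as an additional parameter when constructing the StringTokenizer. See the following link for details.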
http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

So the example above splits the string on white space and prints the pieces one by one until no tokens remain.

As mentioned in the introduction, C# has the Split method of the String class, which does a job similar to Java's StringTokenizer. Split can also take a parameter that specifies the delimiters, and as with StringTokenizer the default is white space. The MSDN .NET Framework Reference says that splitting "Darb\nSmarba" yields {"Darb", "Smarba"}, but when I passed in "Darb\r\nSmarba\r\n" I got {"Darb", "", "Smarba", ""}, quite different from what I expected.

I learned that a regular expression lets me split more precisely, and arrived at the following solution.
// Trim() removes leading and trailing white space (same effect as TrimStart().TrimEnd()).
string[] tokens = Regex.Split(text_to_parse.Trim(), "\\s+");
In the end, by using the regular expression "\s+" to define white space more precisely, I was able to split the string with the white space completely excluded.
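As an aside that is not part of what I did that day: .NET Framework 2.0 and later also offer a Split overload that takes StringSplitOptions, which should give the same tokens without a regular expression. The sketch below is only an alternative under that assumption; the class name SplitOptionsDemo and the method Tokenize are hypothetical.

```csharp
using System;

static class SplitOptionsDemo
{
    // Hypothetical helper. A null separator array means "split on white space",
    // and RemoveEmptyEntries drops the empty strings between adjacent delimiters.
    public static string[] Tokenize(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string[] tokens = Tokenize("Darb\r\nSmarba\r\n");
        Console.WriteLine(tokens.Length);             // 2
        Console.WriteLine(string.Join(",", tokens));  // Darb,Smarba
    }
}
```

The cast to char[] is there because a bare null would be ambiguous between the char[] and string[] overloads of Split.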

I also found a good reference today for whenever I need regular expressions.
http://opencompany.org/download/regex-cheatsheet.pdf
