Splitting a Large XML File with XmlTextReader

by shinichi_wtn 2008-11-21 18:52

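Not long ago I downloaded the main data dump of Chinese Wikipedia. Once decompressed it turned out to be 900 MB, and unless you have a huge amount of memory, an XML file that large simply cannot be loaded in one piece. So I decided to split it into a number of smaller XML files, each of which keeps the original XML structure.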

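The first step is to analyze the structure of the XML. Then XmlTextReader, the efficient forward-only XML reader in C#, walks through the file, and XmlTextWriter writes each new small file. After that, it is just a matter of writing the code.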
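The structure analysis mentioned above can be done without loading the whole file, by printing only the first few element names together with their nesting depth. The following is a minimal sketch under stated assumptions, not the code the author actually used: the class name `XmlOutline`, the method `Outline`, and the parameter `maxElements` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class XmlOutline
{
    // Returns the names of the first maxElements elements,
    // each indented by two spaces per level of nesting.
    public static List<string> Outline(TextReader input, int maxElements)
    {
        var lines = new List<string>();
        using (var reader = new XmlTextReader(input))
        {
            // Ignore whitespace-only nodes so that only real content is visited.
            reader.WhitespaceHandling = WhitespaceHandling.None;

            // Read() advances one node at a time, so memory use stays small
            // no matter how large the input is.
            while (lines.Count < maxElements && reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    lines.Add(new string(' ', reader.Depth * 2) + reader.Name);
                }
            }
        }
        return lines;
    }
}
```

For the dump itself, a call such as `XmlOutline.Outline(File.OpenText("pages-articles.xml"), 40)` followed by printing each returned line would show the nesting of the first 40 elements, which is enough to see how the elements inside each page are laid out.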

using System;
using System.Xml;
using System.IO;
using System.Text;

namespace data
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Splitting the XML file, please wait......");
            DateTime starttime = DateTime.Now;
            XmlTextReader xr = new XmlTextReader("pages-articles.xml");
            xr.WhitespaceHandling = WhitespaceHandling.None;
            xr.MoveToContent();
            //StreamWriter sw = File.CreateText("out.txt");
            xr.Read();
            xr.Skip(); // skip the node that is not needed
            int n=1;
            StreamWriter indexer = File.CreateText("indexer.txt"); // build a plain-text index
            while (!xr.EOF)
            //for (int j = 1; j <= 20; j++)
            {
                // Create the next split data file; the ../db/ directory must already exist.
                XmlTextWriter xw = new XmlTextWriter("../db/" + XmlConvert.ToString(n) + ".xml", xr.Encoding);
                xw.WriteStartDocument();
                xw.WriteStartElement("file");
                xw.WriteAttributeString("id", XmlConvert.ToString(n));

                for (int i = 0; i < 200; i++)
                {
                    if (!xr.EOF)
                    {
                        xr.Read();
                        xw.WriteStartElement("page");
                        string temp1 = xr.ReadInnerXml(); // record the title, used to build the index
                        xw.WriteElementString("title", temp1);
                        string temp2 = xr.ReadInnerXml(); // record the ID, used to build the index
                        xw.WriteElementString("id", temp2);
                        if (temp1 != "")
                        {
                            indexer.WriteLine("{0}|{1}|{2}", temp1, n, temp2);
                        }

                        if (xr.Name == "restrictions")
                        {
                            xw.WriteElementString("restrictions", xr.ReadInnerXml());
                        }

                        xr.Read();
                        xw.WriteStartElement("revision");
                        xw.WriteElementString("id", xr.ReadInnerXml());
                        xw.WriteElementString("timestamp", xr.ReadInnerXml());
                        xr.Read();
                        xw.WriteStartElement("contributor");
                        xw.WriteElementString("username", xr.ReadInnerXml());
                        xw.WriteElementString("id", xr.ReadInnerXml());
                        xr.Read();
                        xw.WriteEndElement();
                        xw.WriteElementString("minor", xr.ReadInnerXml());
                        xw.WriteElementString("comment", xr.ReadInnerXml());
                        xw.WriteElementString("text", xr.ReadInnerXml());
                        xw.WriteEndElement();
                        xr.Read();
                        xr.Skip();
                        xw.WriteEndElement();
                    }
                    else
                    {
                        break;
                    }
                }
                //xw.WriteStartElement("page");
                //xw.WriteString(xr.ReadInnerXml());
                //xw.WriteEndElement();
                xw.WriteEndDocument();
                xw.Flush();
                xw.Close();

                n++;
            }
            /*for (int i = 0; i < 200; i++)
            {
                Console.WriteLine(xr.Name);
                //if (xr.NodeType == XmlNodeType.Text)
                //{
                //   sw.WriteLine(xr.Value);
                //}
                xr.Read();
            }*/
            xr.Close();
            indexer.Close(); // flush and close the index file so that no buffered lines are lost
            DateTime endtime = DateTime.Now;
            //sw.Close();
            Console.Write("Split succeeded, elapsed time {0}", endtime - starttime);
            Console.Read();
        }
    }
}

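I did not expect that splitting such a large file into more than 1,500 files would take only a little over 40 seconds, which shows how efficient XmlTextReader is.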
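The program above depends on the child elements of every page appearing in a fixed order, and because ReadInnerXml returns raw markup, entities in a title are escaped a second time by WriteElementString. A variant that avoids both is to copy each page subtree unchanged with `XmlWriter.WriteNode`. The following is only a sketch under stated assumptions, not the author's method: the class name `PageSplitter`, the method `Split`, and its parameters are hypothetical, the output is assumed to be UTF-8, and no index file is written.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class PageSplitter
{
    // Copies every <page> element from input into files 1.xml, 2.xml, ...
    // under outputDir, pagesPerFile pages per file, each wrapped in a
    // <file id="n"> root. Returns the number of files written.
    public static int Split(TextReader input, string outputDir, int pagesPerFile)
    {
        Directory.CreateDirectory(outputDir);
        int fileCount = 0;
        int pagesInFile = 0;
        XmlTextWriter writer = null;
        try
        {
            using (var reader = new XmlTextReader(input))
            {
                reader.WhitespaceHandling = WhitespaceHandling.None;
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    // Step over everything that is not the start of a <page>.
                    if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "page")
                    {
                        reader.Read();
                        continue;
                    }
                    if (writer == null)
                    {
                        // Start the next output file.
                        fileCount++;
                        string path = Path.Combine(outputDir, fileCount + ".xml");
                        writer = new XmlTextWriter(path, Encoding.UTF8); // assumed encoding
                        writer.WriteStartDocument();
                        writer.WriteStartElement("file");
                        writer.WriteAttributeString("id", XmlConvert.ToString(fileCount));
                    }
                    // WriteNode copies the whole <page> subtree and leaves the
                    // reader on the node after </page>, so no Read() follows it.
                    writer.WriteNode(reader, false);
                    pagesInFile++;
                    if (pagesInFile == pagesPerFile)
                    {
                        writer.WriteEndDocument();
                        writer.Close();
                        writer = null;
                        pagesInFile = 0;
                    }
                }
            }
            // Close the root element of a final, partially filled file.
            if (writer != null)
            {
                writer.WriteEndDocument();
            }
        }
        finally
        {
            if (writer != null)
            {
                writer.Close();
            }
        }
        return fileCount;
    }
}
```

A call such as `PageSplitter.Split(File.OpenText("pages-articles.xml"), "../db", 200)` would mirror the 200 pages per file used above. The trade-off is that every child element of a page is copied, so the output files are larger than those of a program that keeps only selected fields.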
