XML是萬維網(wǎng)上使用標(biāo)準(zhǔn)ASCII文本,內(nèi)部網(wǎng)和其他地方共享文件格式和數(shù)據(jù)的文件格式。 它代表可擴展標(biāo)記語言(XML)。 與HTML類似,它包含標(biāo)記標(biāo)簽。但與標(biāo)記標(biāo)簽描述頁面結(jié)構(gòu)的HTML不同,標(biāo)記標(biāo)簽描述了文件中包含的數(shù)據(jù)的含義。
可以使用“XML”包讀取R中的xml文件,使用以下命令安裝此軟件包。
install.packages("XML")
通過將以下數(shù)據(jù)復(fù)制到文本編輯器(如記事本)中來創(chuàng)建XMl文件。 使用.xml
擴展名保存文件,并將文件類型選為所有文件(*.*
)。創(chuàng)建一個XML文件:input.xml,內(nèi)容如下 -
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
R使用xmlParse()
函數(shù)來讀取xml
文件,它作為列表存儲在R中。
# Load the package required to read XML files.
library("XML")
# Also load the other required package.
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Print the result.
print(result)
當(dāng)我們執(zhí)行上述代碼時,會產(chǎn)生以下結(jié)果 -
<?xml version="1.0"?>
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
獲取XML文件中存在的節(jié)點數(shù)
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Exract the root node form the xml file.
rootnode <- xmlRoot(result)
# Find number of nodes in the root.
rootsize <- xmlSize(rootnode)
# Print the result.
print(rootsize)
當(dāng)我們執(zhí)行上述代碼時,會產(chǎn)生以下結(jié)果 -
output
[1] 8
第一個節(jié)點的詳細(xì)信息
下面來看看如何解析文件的第一條記錄,它將給出對頂級節(jié)點中存在的各種元素的詳細(xì)信息。
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Exract the root node form the xml file.
rootnode <- xmlRoot(result)
# Print the result.
print(rootnode[1])
當(dāng)我們執(zhí)行上述代碼時,會產(chǎn)生以下結(jié)果 -
$EMPLOYEE
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
獲取節(jié)點的其它元素
# Load the packages required to read XML files.
library("XML")
library("methods")
# Give the input file name to the function.
result <- xmlParse(file = "input.xml")
# Exract the root node form the xml file.
rootnode <- xmlRoot(result)
# Get the first element of the first node.
print(rootnode[[1]][[1]])
# Get the fifth element of the first node.
print(rootnode[[1]][[5]])
# Get the second element of the third node.
print(rootnode[[3]][[2]])
當(dāng)我們執(zhí)行上述代碼時,會產(chǎn)生以下結(jié)果 -
<ID>1</ID>
<DEPT>IT</DEPT>
<NAME>Michelle</NAME>
為了在大文件中有效處理數(shù)據(jù),我們以xml文件的形式讀取數(shù)據(jù)作為數(shù)據(jù)幀。然后處理數(shù)據(jù)幀進(jìn)行數(shù)據(jù)分析。
# Load the packages required to read XML files.
library("XML")
library("methods")
# Convert the input xml file to a data frame.
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)
當(dāng)我們執(zhí)行上述代碼時,會產(chǎn)生以下結(jié)果 -
ID NAME SALARY STARTDATE DEPT
1 1 Rick 623.3 1/1/2012 IT
2 2 Dan 515.2 9/23/2013 Operations
3 3 Michelle 611 11/15/2014 IT
4 4 Ryan 729 5/11/2014 HR
5 5 Gary 843.25 3/27/2015 Finance
6 6 Nina 578 5/21/2013 IT
7 7 Simon 632.8 7/30/2013 Operations
8 8 Guru 722.5 6/17/2014 Finance
由于數(shù)據(jù)現(xiàn)在已經(jīng)轉(zhuǎn)為數(shù)據(jù)幀,所以我們可以使用數(shù)據(jù)幀相關(guān)函數(shù)來讀取和操作文件。