Spark on yarn

Spark provides three locations to configure the system:
∙ Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
∙ Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
∙ Logging can be configured through log4j.properties.
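As a concrete sketch of the second mechanism, a conf/spark-env.sh fragment might set per-machine values like these (the IP address and directory below are illustrative placeholders, not defaults):

```shell
# Hypothetical conf/spark-env.sh fragment; every value here is an example.
export SPARK_LOCAL_IP=192.168.1.10            # bind this node to a specific IP
export SPARK_LOCAL_DIRS=/mnt/fast-disk/spark  # scratch space on a fast local disk
```

Because this script runs on each node, it is the right place for settings that differ from machine to machine.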
Spark Properties

Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf passed to your SparkContext. SparkConf allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the set() method. For example, we could initialize an application as follows:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
Dynamically Loading Spark Properties

In some cases, you may want to avoid hard-coding certain configurations in a SparkConf. For instance, if you'd like to run the same application with different masters or different amounts of memory, Spark allows you to simply create an empty conf:
val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:
./bin/spark-submit --name "My app" --master local[4] \
  --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myApp.jar
The Spark shell and spark-submit tool support two ways to load configurations dynamically. The first are command line options, such as --master, as shown above. spark-submit can accept any Spark property using the --conf flag, but uses special flags for properties that play a part in launching the Spark application. Running ./bin/spark-submit --help will show the entire list of these options.
bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example:

spark.master            spark://5.6.7.8:7077
spark.executor.memory   512m
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
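To make the precedence order concrete, here is a hedged sketch; the property values are invented for illustration:

```scala
import org.apache.spark.SparkConf

// Suppose spark-defaults.conf contains:  spark.executor.memory 512m
// and the job is launched with:          spark-submit --conf spark.executor.memory=1g ...
val conf = new SparkConf().set("spark.executor.memory", "2g")
// The application sees 2g: the value set on SparkConf beats the --conf flag,
// which in turn beats the entry in spark-defaults.conf.
```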
Viewing Spark Properties

The application web UI at http://<driver>:4040 lists Spark properties in the "Environment" tab. This is a useful place to check to make sure that your properties have been set correctly. Note that only values explicitly specified through either spark-defaults.conf or SparkConf will appear. For all other configuration properties, you can assume the default value is used.
Available Properties

Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:
Application Properties

spark.app.name
Default: (none)
The name of your application. This will appear in the UI and in log data.

spark.master
Default: (none)
The cluster manager to connect to. See the list of allowed master URLs.

spark.executor.memory
Default: 512m
Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

spark.serializer
Default: org.apache.spark.serializer.JavaSerializer
Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. The default of Java serialization works with any Serializable Java object but is quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization when speed is necessary. Can be any subclass of org.apache.spark.Serializer.

spark.kryo.registrator
Default: (none)
If you use Kryo serialization, set this class to register your custom classes with Kryo. It should be set to a class that extends KryoRegistrator. See the tuning guide for more details.

spark.local.dir
Default: /tmp
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

spark.logConf
Default: false
Logs the effective SparkConf as INFO when a SparkContext is started.
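As a sketch of how spark.serializer and spark.kryo.registrator fit together, a registrator might look like the following; the Sheep class and the MyRegistrator name are hypothetical:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class that will be shuffled or cached.
case class Sheep(id: Int, name: String)

// Registers application classes with Kryo so they serialize compactly.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Sheep])
  }
}

// Wire both properties up on the SparkConf:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
```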
Apart from these, the following properties are also available, and may be useful in some situations:

Runtime Environment
spark.executor.memory
Default: 512m
Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

spark.executor.extraJavaOptions
Default: (none)
A string of extra JVM options to pass to executors. For instance, GC settings or other logging. Note that it is illegal to set Spark properties or heap size settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Heap size settings can be set with spark.executor.memory.

spark.executor.extraClassPath
Default: (none)
Extra classpath entries to append to the classpath of executors. This exists primarily for backwards-compatibility with older versions of Spark. Users typically should not need to set this option.

spark.executor.extraLibraryPath
Default: (none)
Set a special library path to use when launching executor JVMs.

spark.files.userClassPathFirst
Default: false
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in executors. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature.

spark.python.worker.memory
Default: 512m
Amount of memory to use per Python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.

spark.executorEnv.[EnvironmentVariableName]
Default: (none)
Add the environment variable specified by EnvironmentVariableName to the executor process. The user can specify multiple of these to set multiple environment variables.

spark.mesos.executor.home
Default: driver-side SPARK_HOME
Set the directory in which Spark is installed on the executors in Mesos. By default, the executors will simply use the driver's Spark home directory, which may not be visible to them. Note that this is only relevant if a Spark binary package is not specified through spark.executor.uri.
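For instance, a spark-submit invocation could combine several of the runtime-environment properties above; the variable name, path, and jar below are made-up placeholders:

```
./bin/spark-submit \
  --conf spark.executorEnv.DATA_DIR=/mnt/data \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails" \
  --conf spark.executor.memory=2g \
  myApp.jar
```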
Shuffle Behavior

spark.shuffle.consolidateFiles
Default: false
If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations.

spark.shuffle.spill
Default: true
If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling threshold is specified by spark.shuffle.memoryFraction.

spark.shuffle.spill.compress
Default: true
Whether to compress data spilled during shuffles. Compression will use spark.io.compression.codec.

spark.shuffle.memoryFraction
Default: 0.2
Fraction of Java heap to use for aggregation and cogroups during shuffles, if spark.shuffle.spill is true. At any given time, the collective size of all in-memory maps used for shuffles is bounded by this limit, beyond which the contents will begin to spill to disk. If spills happen often, consider increasing this value at the expense of spark.storage.memoryFraction.

spark.shuffle.compress
Default: true
Whether to compress map output files. Generally a good idea. Compression will use spark.io.compression.codec.

spark.shuffle.file.buffer.kb
Default: 32
Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.

spark.reducer.maxMbInFlight
Default: 48
Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory.

spark.shuffle.manager
Default: HASH
Implementation to use for shuffling data. A hash-based shuffle manager is the default, but starting in Spark 1.1 there is an experimental sort-based shuffle manager that is more memory-efficient in environments with small executors, such as YARN. To use that, change this value to SORT.

spark.shuffle.sort.bypassMergeThreshold
Default: 200
(Advanced) In the sort-based shuffle manager, avoid merge-sorting data if there is no map-side aggregation and there are at most this many reduce partitions.
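Putting a few of these together, a spark-defaults.conf fragment for trying the sort-based shuffle might look like the following; the values are illustrative, not recommendations:

```
spark.shuffle.manager           SORT
spark.shuffle.consolidateFiles  true
spark.shuffle.memoryFraction    0.3
```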
Spark UI

spark.ui.port
Default: 4040
Port for your application's dashboard, which shows memory and workload data.

spark.ui.retainedStages
Default: 1000
How many stages the Spark UI remembers before garbage collecting.

spark.ui.killEnabled
Default: true
Allows stages and corresponding jobs to be killed from the web UI.

spark.eventLog.enabled
Default: false
Whether to log Spark events, useful for reconstructing the web UI after the application has finished.

spark.eventLog.compress
Default: false
Whether to compress logged events, if spark.eventLog.enabled is true.

spark.eventLog.dir
Default: file:///tmp/spark-events
Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a sub-directory for each application, and logs the events specific to the application in this directory.
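For example, to keep event logs around for later inspection, the relevant spark-defaults.conf lines might look like this; the HDFS namenode address and path are hypothetical:

```
spark.eventLog.enabled   true
spark.eventLog.compress  true
spark.eventLog.dir       hdfs://namenode:8021/spark-events
```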