Project Logistics Arc

Discussion in 'Core Projects' started by Cervator, Feb 16, 2014.

  1. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Replies, without quotes, because I suck at dividing them up :D
    • Not sure what you mean by pulling out version numbers from module.txt? I hope to modify them in-place, as if module.txt is the sole source for versioning
    • gradle.properties is what the Artifactory release management seems hard coded to use. I'd like to have as little as possible that touches Gradle in module repos, as it might make them need more access control or other forms of protection. I'd rather leave them as open as possible :)
    • Yeah I'm looking at a Gradle script piece to do the release management. It is nice to get a release icon in Jenkins but that can be done custom too.
    • I put the template for gradle.properties in the code base for those wondering what in the world the two of us are talking about, since I bet we're the only ones who have one :D
    • Yep I ran into the JGit code, thanks! Good inspiration. I'm trying to keep auto-commits down, like one per stable release and that's it.
    • Sure, live chat is easier, when we happen to run into each other. In the meantime I'll do what I can and we can keep tweaking :)
  2. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Oh boy, dependency resolution with multiple source options with varying stats gets hairy. Main Artifactory repos we care about:
    • libs-release-local - used to hold an engine build, now moved and only holding our own libs
    • libs-snapshot-local
    • terasology-release-local - now contains the 0.1.42 engine build used by @msteiger's WorldViewer (or supposed to be). Moved there from libs release.
    • terasology-snapshot-local - latest engine available here is "0.1.0-SNAPSHOT+1206"
    • nanoware-release-local - testing new release publishing to here, has worked just fine for both engine and modules. Emptied it now for other testing
    • nanoware-snapshot-local - latest engine available here is "0.0.1-SNAPSHOT"
    • Virtual: repo - global virtual repo, contains everything
    • Virtual: everything-except-terasology-release - just what it sounds like
    Trouble with module builds right now is everything uses the "repo" global to resolve dependencies (config in Gradle), which gets them believing the 0.1.42 build from Terasology release is latest, which it isn't anymore (it is the latest "release" build, as per its version number)

    *** GrowingFlora remotely resolved org.terasology.engine - engine - version 0.1.42

    Idea is that by varying the resolution repo in Jenkins we can intelligently tell jobs to draw from appropriate virtual repos that are a collection of specific repos in Artifactory. So right now I can run with the "everything-except-terasology-release" repo and correctly exclude that 0.1.42 build for a snapshot build of a module. In Gradle.

    However, instead of getting the newest engine snapshot (0.1.0-SNAPSHOT under nanoware-snapshot-local) it ends up with the 0.1.0-SNAPSHOT+1206 as it seems the stuff after the + is included in sorting, which probably makes sense in Gradle/Artifactory, but isn't how the SemVer standard we're trying to follow works.
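To illustrate the sorting mismatch described above, here is a minimal sketch of SemVer 2.0.0 precedence in Python. It is not what Gradle or Artifactory actually run; the function names are hypothetical and exist only for this example. Under SemVer, everything after the + is build metadata and is ignored for precedence, so the two snapshot versions mentioned above compare as equal.

```python
from functools import cmp_to_key


def parse(version):
    """Split a version into (core numbers, pre-release identifiers).

    Build metadata (anything after '+') is dropped, because SemVer
    ignores it when determining precedence.
    """
    without_build = version.split("+", 1)[0]
    core, _, prerelease = without_build.partition("-")
    numbers = tuple(int(part) for part in core.split("."))
    identifiers = prerelease.split(".") if prerelease else []
    return numbers, identifiers


def compare_identifier(a, b):
    """Compare two pre-release identifiers following the SemVer rules."""
    a_num, b_num = a.isdigit(), b.isdigit()
    if a_num and b_num:
        return (int(a) > int(b)) - (int(a) < int(b))
    if a_num != b_num:
        # Numeric identifiers have lower precedence than alphanumeric ones.
        return -1 if a_num else 1
    return (a > b) - (a < b)


def compare(v1, v2):
    """Return -1, 0 or 1 depending on SemVer precedence of v1 versus v2."""
    n1, p1 = parse(v1)
    n2, p2 = parse(v2)
    if n1 != n2:
        return (n1 > n2) - (n1 < n2)
    if not p1 or not p2:
        # A version without a pre-release tag outranks one that has a tag.
        return (not p1) - (not p2)
    for a, b in zip(p1, p2):
        result = compare_identifier(a, b)
        if result:
            return result
    # All shared identifiers are equal: the longer list wins.
    return (len(p1) > len(p2)) - (len(p1) < len(p2))


versions = ["0.1.0-SNAPSHOT+1206", "0.1.42", "0.0.1-SNAPSHOT"]
newest = max(versions, key=cmp_to_key(compare))
```

With this ordering, a resolver could not prefer 0.1.0-SNAPSHOT+1206 over 0.1.0-SNAPSHOT on version precedence alone, which is the behavior the SemVer standard describes.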

    Still, that's fixable, we could probably just delete all snapshots using the old format. The new "anonymous" snapshots still get stored with unique identifiers in Artifactory, but you just get a generic copy of latest, which leaves me a little hesitant about not knowing the exact version I have of something.

    Anyway, the bigger problem is that we want to put a reasonable default into Gradle, then vary the resolution Artifactory repo in the Jenkins jobs. This I have yet to find a way to get working. No matter what I do the setting in Jenkins doesn't seem to override our repositories.maven.url "http://artifactory.terasology.org/artifactory/repo" - despite the publishing setting overriding fine. I can send build artifacts anywhere I want easily. I want to be able to entirely isolate the Nanoware test line of stuff, but no luck so far.

    I've got a snapshot vs. release design that looks solid, even though the Artifactory Release option or even the plain Jenkins Release plugin are kinda too inflexible to use in this case. Since the target publish repo works you can simply have a "-Release" job that sends artifacts to a release repo. Should even be able to do it with a single parameter and a single commit, never needing to edit a version tag, keeping complexity to a minimum. Have to sometimes remind myself that eventually most of the people that use this stuff won't have our level of familiarity with SemVer, artifact resolution, etc.

    Also found a bit of a bug - without "from components.java" in publishing.publications.mavenJava(MavenPublication) in Artifactory Gradle config we don't get any dependencies listed in the generated .pom. @Immortius I think this is also a problem in gestalt-module (view the .pom for the latest version in Artifactory), it just happens to work anyway as the only place using it (via Artifactory) happens to cover all the dependencies. That slowed me down a bit :)

    Still in good spirits here, got a ton of work done over this 4-day-weekend, and learned lots. Just need to figure out exactly how we want things to resolve in which situations then see how we can configure it best. Then test it out, set it live, and see how it works and what still needs to be modified (my draft changes in the develop branch for Nanoware remain just a draft)
  3. msteiger

    msteiger Active Member

    The WorldViewer app also runs fine with snapshot builds. You can just deactivate it. I thought it would be good to have a few of the latest stable builds available in Artifactory - so it's currently a write-only repo :)
  4. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Merged the changes :)

    Seems pretty stable and I've bumped up the version so the older artifacts won't get picked for resolution. Goes with our existing stable build numbering (in the minor position just because) until we do something different (SemVer isn't in full effect for major release 0)

    We should be able to start building things again, although Jenkins is giving me some trouble. While my Nanoware test build runs in less than 3 minutes (no analytics) the main engine build is taking forever (40+ minutes), getting stuck on otherwise simple things like "Archiving Test Results". I suspect this is a combination of that job's extensive history + the strength of the droplet running Jenkins. Rebooted already, no change. I'll need to focus more very soon on dynamic builder droplets and/or permanent archiving of older builds elsewhere.

    Need to also put the new Distro thing into place soon, before next stable. May need to tweak the launcher to get "fat" downloads with modules in a different way soon, maybe also for build archiving. Onwards to round 2!

    Please let me know if anything pops up out of the ordinary (other than Core worlds not working right now, that's a different PR and should be fixed soon). Also happy to adjust the process more if needed. Big thing to highlight is that by convention (ours) now all versioning is assumed to include "-SNAPSHOT" and all artifacts go into the main Terasology snapshot repo - I've tested the release piece but it isn't live yet. But I'm not tweaking the version files (module.txt) yet, nor adding the -SNAPSHOT for actual built jars (which Gestalt in turn can read)

    Main reason for not explicitly putting -SNAPSHOT in the module.txt files on GitHub is, well, it would be everywhere. Which kinda makes it fluff that might just confuse people less familiar with dependency management. That's personal opinion and I'm more than open to be talked out of it if needed :)
  5. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    I went ahead and cleaned out a pile of old engine builds in Jenkins. I've got a local backup anyway should we want to put up a historical archive some time for the fun of it. Engine builds went from 40 minutes (due to memory/cpu thrashing) back down to 9 minutes :)

    Almost upgraded to the next droplet up at $80/month but glad this fixed it so easily. I bet with some more routine cleaning and the dynamic build agents it could even be downsized a bit.

    Merged a few PRs and rebuilt all the stable modules as well. Deleted old module snapshots from Artifactory so the new builds would resolve (the new SNAPSHOT pattern again means old snapshots at the same base release get version precedence)

    Made a new virtual repository in Artifactory that excludes the Nanoware repos I test with, so now they're isolated from each other. Beginning to look pretty good.

    Still need to do distros and merge the last few open PRs, then make release jobs, and think more about snapshots. Good times!
  6. Immortius

    Immortius Lead Software Architect Staff Member

    Good catch - I've repaired this in gestalt-modules and will release a fixed version.
  7. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    (I think we sorted out the above on IRC, been a while) whoops page 2

    The quest for dynamic build agent droplets for Jenkins continues!

    I've been working on that on and off for, well, pretty much two months I guess! Got some extra time in while in NYC and then some more over the last few days. All day today spent on Chef, test-kitchen, vagrant, virtualbox, knife, berkshelf, supermarket, Ruby unit testing, serverspec integration testing .... yeah, good times ;)

    Been hitting a lot of dead-ends previously. Lots of plugins for Jenkins, including a swarm plugin that could hook up agents dynamically to Jenkins just fine - just not with GitHub OAuth enabled. Another plugin that could spin up droplets automatically - but not take them back down. Had Chef's "knife" command making droplets dynamically with its Digital Ocean plugin, but that was straight command line with lots of hard coded config. And so on. Got all the pieces as usual, just putting them together.

    With full-blown Chef in its test-kitchen setup it looks like I'm getting somewhere. I've got Vagrant spinning up VMs locally on my PC, and finally am spinning up Jenkins stuff - so far a Jenkins Master entirely set up through automation. Same "cookbook" also covers agents, creating jobs, and so on, which is what I really need. And test-kitchen appears to have a Digital Ocean "driver" as well, so probably I can spin up droplets there instead of local VMs with a simple config tweak.

    When this is out of the way I've got the release stuff and even dynamic Groovy tweaking of module jobs with dependencies about ready to plug in. It also helps some related stuff become more consistent, like the default Git our scripts in Jenkins might get (ref: @msteiger). And I can reprovision a new master from scratch in about two minutes, woo!

  8. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Eureka! After many moons (well, one and a half since last?) I've dug back down to this topic and got a working setup! :geek:

    I've got a two step process: creating a new droplet server from inside Jenkins using Chef, then establishing a new Jenkins agent on the new droplet with a quick Groovy script + the native SSH support. I got it working with Chef doing the install on the node as well, but realized that's a poor pattern for something dynamic like builder droplets. Chef cookbooks are more for how you ultimately want something static to look, with easy ability to then upgrade and expand it over time. The nice SSH setup also works with our GitHub OAuth out of the box.

    Incidentally, the Jenkins master I'm testing on was also built by Chef, but that's not important right now. Pushing that off for later so we can get past the obstacle of elastic build capacity, to use a fancy way of putting it.

    At this point I just need to write a third step, a Groovy queue scanner (CongaGooey?) to determine when we spin up an extra builder droplet. Then when needed (new engine commits or an engine PR) that kicks off a more polished version of what I got working tonight. When we have spare capacity the queue scanner notices and shuts the builder droplet off again.

    With that in place we can fully enable the PR builder (yay @msteiger!) and attach module snapshot builds to every engine build so we can see when stuff breaks. We could even combine it and see what modules (if any) an engine PR breaks :D

    First - sleep! :sleep:
  9. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    One more step! Got the remaining Groovy scripts mostly figured out in my local Jenkins. Got a QueueWatcher job that'll look for engine / module builds and if there are enough it'll trigger the provisioning and create a Jenkins agent on a build node. Conversely if there is nothing in the queue it'll trigger a RetireWorkers job that'll confirm still no activity a minute or so later (queue or active builds, since those may in turn trigger more builds) then destroy the build agent.
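The two-job flow described above can be sketched roughly as follows. This is plain Python rather than the actual Groovy scripts, and every name here (JenkinsState, queue_watcher, retire_workers, spawn_threshold) is hypothetical. The exact meaning of "enough" builds is not stated, so it is passed in as a parameter, and the check that a builder agent exists before retiring is an assumption.

```python
from dataclasses import dataclass


@dataclass
class JenkinsState:
    queued_builds: int    # engine / module builds waiting in the queue
    active_builds: int    # builds currently running
    builder_agents: int   # dynamic builder agents currently attached


def queue_watcher(state, spawn_threshold):
    """Return the name of the job the watcher would trigger, or None."""
    if state.queued_builds >= spawn_threshold:
        # Enough work is waiting: provision a droplet and attach an agent.
        return "provision"
    if state.queued_builds == 0 and state.builder_agents > 0:
        # Nothing queued: hand over to the retirement job for a second look.
        return "retire_workers"
    return None


def retire_workers(state_after_delay):
    """Decide whether to destroy a build agent.

    Called with a fresh snapshot taken a minute or so after the watcher
    fired. Active builds count as activity too, since they may in turn
    trigger more builds.
    """
    idle = (state_after_delay.queued_builds == 0
            and state_after_delay.active_builds == 0)
    return idle and state_after_delay.builder_agents > 0
```

The second check is what keeps a builder alive while a running build is still likely to queue follow-up jobs.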

    I also played with sketchboard.io (ref: recent tinkering with a sketching plugin for Confluence with @shartte) and made a sketch there really easily, even exported it to GitHub, but it only shows my personal repos, not org repos :(

    Anyway, sketch!

    [IMG]

    Next up: Run it through a test Jenkins and split the master's queue in two, putting a local agent on there. That way we can more easily separate engine + module builds from the utility jobs like the QueueWatcher.
  10. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Success! All the pieces are in place and confirmed on a test server! :cool:

    Works much like in the diagram above, tweaked a bit especially with the numbers, and with a throttle plugin added for Jenkins. The testing with the below screenshot was done with each droplet builder assigned to handle up to 2 engines + 5 modules in the queue. Any more and another builder would spawn to help out. The Jenkins master itself doesn't do any module or engine builds, but a special Jenkins agent on the master machine (testjenkins.terasology.net at the bottom) is allowed to run the module quota.

    This results in individual module pushes triggering builds that'll be able to run without spawning dedicated builders. Any single engine build will immediately result in one builder spawning, which then also adds to the module capacity. If enough stuff piles on (say I enable all modules to build after engine builds) then more builders will spawn. I expect to increase the module quota to something high like 40-60 - they build pretty quickly (usually 1-2 mins each) and builder droplets might build 3-5 in parallel (the master 1-3). We have about 90 total modules right now.

    The queue gets checked every 3 minutes. If it is empty and any builder droplets are hanging around idle a separate job will be triggered and go look at shutting one down.

    I'll commit scripts and apply to our live Jenkins tomorrow. Will look at getting some PRs out of the way first, then do a Jenkins backup, version upgrade, and new stuff. Then we can finally has PR building! I expect we can even make throwaway module builds on engine PRs so we can see what modules break :D

    Dashboard [Jenkins]_2015-03-14_22-25-47.png
  11. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Our live Jenkins has been updated to the latest version with all plugins updated and a few added.

    I've configured the Nanoware jobs for the new setup, but haven't yet set up the Groovy jobs - nearing 1 am and it is going to take a little longer :)

    Did stress test a little and found just 1 engine + 2 module jobs can break that server, whoops ;)

    Groovy jobs up next when I get a chance, then I have to edit every job ever to enable it :cautious: Unless I fit in a quick "Hmm, how do I do that with Groovy!" snippet somewhere first
  12. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    New setup is in place and working on the live server for the test Nanoware jobs. Hit a few assorted issues:
    • JDepend for some reason crashes the entire Jenkins server if it is run on an agent instead of on the master. Wat. I'm not sure we really need it and it just adds to build time anyway so I might just disable everywhere until we decide to care.
    • Git was causing trouble as we've taken extra steps to make it work in different cases on the master. The same tweaks affected agent machines, causing trouble. I've switched the default to JGit (a bundled Java lib) instead of relying on any actual Git install. Seems like it may cause failures the first time it is run on some jobs that used to work the old way - but then goes away.
      • @msteiger - I know we had done some extras to make some Gitty magic work for you. The actual Git install on the master is still there and I haven't taken out the PATH tweaks so it probably still works and/or can be made to work with JGit. Keep an eye out and let me know if you spot anything
      • Old Git install in Jenkins was named "Default" and pathed to /usr/local/git/bin/git - I removed that in favor of the JGit config, probably won't cause trouble?
    • Something is causing an occasional job to stall during Checkstyle. I wonder if it is a memory spike when more than one job is running, bad enough to make Checkstyle run out of memory and die quietly (no errors in the log, sadness). Need to test more and maybe split apart engine and module builds further (current builder droplets get to run one engine build and one module build in parallel, or two modules)
    • I'm not 100% sure if the builder droplet startup is perfect yet. If two builds are queued and start at the exact same time on a totally fresh server Java still needs to be installed. Any risk that both jobs attempt to initialize the tool? It is possible to initialize the server more during initial setup.
    I haven't activated the main jobs to use the new system yet. Probably more testing tomorrow.

    Crashing Jenkins seems to slowly let memory leak, so full reboots help. At one point the server was using 1.7 GB idle without Jenkins running, after reboot it used 1.1 GB with Jenkins running.
  13. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Ran into a fair bit more fun testing with the actual Java building jobs. Who would've thought that would be more complex than simple shell jobs that sleep for 20-30 seconds!

    As it turns out a 2 GB droplet cannot handle an engine build at the same time as a module build. Eventually the memory spikes would line up and kill one of the jobs, which would then just stall in Jenkins and sit there forever. Not a good obstacle to have in the way.

    A 1 GB droplet cannot handle an engine build on its own, but can manage a single module build.

    A 4 GB droplet would have plenty of memory for engine + module, could probably even handle 2-3 modules along with an engine, but it is still just a 2 CPU server like the 2 GB option so not a good upgrade. Heck even the two processes running on a 2 GB droplet seem to slow down substantially vs just a single process.

    So I split out the droplet options into two - one for engines (2 GB) and one for modules (1 GB) with modules allowed to also build (one at a time) on engine builder droplets. At first that caused the problem of ending up stuck running module builds when we want to get the engine build out of the way first, but a quick priority plugin for the Jenkins queue fixed that. Don't need the throttle one now that all the options are single executors, including the builder living on the master Jenkins server. That may also help with automated testing as jobs needing external resources like ports shouldn't conflict.

    Each engine/module builder is assigned a quota of up to 20 modules in the queue, if there are more waiting (and there will be, soon(tm)) a new module builder will be created. Same for engine builders with a quota of 2 engines each. The master gets half a module quota (so 10) for its sole executor builder, which won't be allowed to build engines.

    When activity drops again the builders will be retired, one every 3 minutes or so, favoring the module builders first since they can't be used for engine builds (the engine builders can build modules)
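As a rough worked example of the quota arithmetic above, here is a Python sketch. The quotas (20 modules and 2 engines per builder, 10 modules for the master) come from the post; the function names and the idea of computing a target count in a single pass are assumptions, since the real Groovy scripts may well add builders one check at a time.

```python
import math

MODULE_QUOTA = 20                         # queued modules per builder droplet
ENGINE_QUOTA = 2                          # queued engines per engine builder
MASTER_MODULE_QUOTA = MODULE_QUOTA // 2   # the master's single executor


def builders_needed(queued_engines, queued_modules):
    """Return (engine_builders, module_builders) for the given queue."""
    engine_builders = math.ceil(queued_engines / ENGINE_QUOTA)
    # Engine builders can also build modules, so they add module capacity
    # on top of what the master's local agent already covers.
    module_capacity = MASTER_MODULE_QUOTA + engine_builders * MODULE_QUOTA
    overflow = max(0, queued_modules - module_capacity)
    module_builders = math.ceil(overflow / MODULE_QUOTA)
    return engine_builders, module_builders


def next_to_retire(engine_builders, module_builders):
    """Pick which kind of builder to retire first when activity drops."""
    if module_builders > 0:
        return "module"   # module builders cannot help with engine builds
    if engine_builders > 0:
        return "engine"
    return None
```

Under these assumptions a handful of module pushes stays on the master, a single queued engine build asks for one engine builder, and three engines plus 90 modules would ask for two builders of each kind.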

    To really throw some load at this I merged an engine PR, triggered the two test engine builds for Nanoware (develop + master), and made a test PR for Nanoware to auto-trigger the PR tester. I also fired off nearly 20 module test jobs. The live engine jobs (develop + master) will be set at a higher priority than test jobs (and all modules) so plain "Terasology" ran first. When it was done it furthermore triggered the applet build and the Iota Distro. Had me a little worried there with the master's memory usage (one module + applet build + distro + the queue scanner) but it made it!

    Two engine builders popped up and ran through everything perfectly - everything was done in less than 30 minutes or so! Here's a screen grab showing the hectic action. I think I set the priority too low for the PR and Release jobs (it is a scale of 1-5, they were at 3-4ish, maybe unset == 3?)

    Nanoware [Jenkins]_2015-03-23_00-47-33.png

    Develop builds for the engine will now use the new dynamic droplet builds, I haven't tweaked the other jobs quite yet, past 1 am again. Will probably also classify the libraries and launcher as "modules" just as far as build load goes, then they can build on the master without triggering droplet creation yet take advantage of a builder droplet if one is hanging out.

    The only error this time around was a single upload failure to Artifactory, which is probably unrelated. System is fairly streamlined now, main view in Jenkins is http://jenkins.terasology.org/view/Utility/ and the Groovy magic lives here

    Oh: For bonus points Slack notifications get triggered when the droplets are created/destroyed :D In the #logistics channel only
    Last edited: Mar 23, 2015
  14. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    And we're live! I applied priorities and labels to almost all the jobs, just leaving a few old broken modules alone that need a bigger overhaul.

    I also hooked up a PR builder job for the engine and triggered the four outstanding PRs. Worked perfectly first try :)

    If anybody notices anything weird like a stalled job please let me know!

    Server is really low on space again, hoping to fix it up tomorrow
  15. Cervator

    Cervator Project Lead and Community Wizard Staff Member

    Boy, been a fun few days :)

    In short since I'm up late (again) and sleepy:
    • Artifactory broke once by chance on a random problem with a redirect temporarily not working. I copied in the right URL once and it started working again. Go figure.
    • GitHub has been getting DDoSed, leading to some occasional timeouts and failed builds / maybe quirky PR builder test triggering
    • A few times new builder droplets hit connection issues while pulling all needed dependencies (which can be a lot, and aren't reused like on the master), also leading to build failures.
    • Remember that Artifactory plugin for Jenkins bug where a config option had no effect, so we never set it? Suddenly it started working, and the old global (virtual) "repo" entry took effect when "virtual-repo-live" is the correct one. That's fixed now, but oddly the Nanoware test version is set on those jobs yet they still use "virtual-repo-live" for some things.
    • Jenkins upgrade and/or using agents knocked loose an old hack where module builds would start with a project name of "workspace" because that's the name of the directory they start in, due to a Jenkins build path thing. Resulted in artifacts being published to Artifactory named "workspace" so naturally they couldn't be found. Fixed
    • Snapshot builds as dependencies suck ;) Reset the cache on the Jenkins server a few times. And there are two of them now on that server (one for the master, one for the local agent)
    • All the stabilization after @Josharias' big Vector3i PR should be done now. Helped to also update the Nanoware repos, was building those a while with actual outdated code rather than quirky dependency resolution ...
    I think pretty much everything is stable again. Couple more PRs including javadoc fixing so we might be able to build with Java 8. Then time for module releases, I hope!
