DTU Course 02819: Data mining using Python
    
    
    
    Previously called DTU Course 02820: Python Programming
    Discontinued
    This course has unfortunately been discontinued and will not
    run in autumn 2015.
    You might want to consider DTU 02807 Computational Tools for Big Data.
    Other related courses are 02450 Introduction to Machine Learning
    and Data Mining and DTU 02805: Social graphs and interactions.
    Practical information
    
    
    Tentative schedule for autumn 2014
    (to be updated)
    
    
      - Introduction to the course, installation and introduction to Python
      - Introduction to Python
      - Numerical Python: "Python as Matlab". NumPy, SciPy, matplotlib, ...
      - Databasing, text and Web mining and Python
      - Web serving
      - Project work for the rest of the time.
      - Exam: Sometime in December 2014, Building 321, floor 100, from
        8:00-16:00(?) (Studiehåndbogen)
      - Report hand-in deadline: Sometime in December 2012.
    
The following topics may be taken up:
      - Python development
      - Data analysis and the Semantic Web
      - Misc.
      - Pythonic Python
	
    
Course material
    There is no fixed course material for the project, but we
    recommend a number of books and other material:
    
      - The Python Tutorial.
      - Kevin Sheppard (2014) Introduction to Python for Econometrics,
        Statistics and Data Analysis. Good in-depth introduction to Python
        with installation, datatypes, matrices, math, mathematical
        functions, pandas, control structures, date time, graphics,
        performance, time series.
      - Mark Pilgrim (2004) Dive into Python. A good and free introduction
        to Python. Basic Python, unit testing, XML, regular expressions, etc.
      - Allen Downey (2002-2008) Think Python: How to Think Like a Computer
        Scientist (previously (apparently!?) GFDL and CC-BY-SA license, now
        CC-BY-NC). Covers the basics of the Python language and Tkinter GUI.
        Even better: An interactive version with multiple choice tests and
        online programming is available.
      - Hans Petter Langtangen (2008) Python Scripting for Computational
        Science. Available as an e-book through the DTU library. Covers
        basic Python, numerical Python (NumPy and SciPy), combining Python
        and C, CGI and GUI programming.
      - Vidar Bronken Gundersen (2006) MATLAB commands in numerical Python
        (NumPy). Indispensable if you are trained in the Matlab or R
        languages as this short note provides a concise Rosetta stone
        between Python, Matlab and R. Probably good even if you do not have
        knowledge of Matlab and R.
      - Online Python Tutor is an interactive code visualizer.
      - Internet search engine, e.g., "Python tutorial", "Python
        introduction" and "numerical python filetype:pdf"
      - Finn Årup Nielsen (2014) Data Mining with Python (working draft).
    
Extra:
      - Toby Segaran (2007), Programming Collective Intelligence.
        A commercial book with Python examples in machine learning for
        Web 2.0 applications, e.g., naïve Bayes classifier and
        non-negative matrix factorization.
        Polyteknisk Boghandel: 279 kroner (2012)
      - Steven Bird et al. (2009), Natural Language Processing with Python.
        Describes the NLTK Python package and is written by the developers
        of this nice language processing toolkit. Also available under a
        Creative Commons licence from
        https://sites.google.com/site/naturallanguagetoolkit/book.
        Polyteknisk Boghandel.
      - Mitchell L. Model, Bioinformatics Programming Using Python:
        Practical Programming for Biological Data. Probably recommendable
        for people interested in bioinformatics. Covers basic Python
        presented with problems from bioinformatics. Also web and database
        programming. Polyteknisk Boghandel: 469 kroner (2012).
      - Toby Segaran, Colin Evans and Jamie Taylor (2009), Programming the
        Semantic Web. Specialized book about the Semantic Web and how
        Python can be used in that context.
        Polyteknisk Boghandel: 275 kroner (2012).
      - Alex Martelli (2005), Python Cookbook. Shows good code examples.
        Not so good as an introductory book. Polyteknisk Boghandel
      - Ivan Idris (2011), NumPy 1.5 Beginner's Guide. As the name
        implies: A beginner's guide to NumPy.
    
Evaluation
    
    You are evaluated on the written report, the printed poster and
    the oral poster presentation.
    The grade will be based on an overall combined assessment.
    So do not expect to get a grade or "points" after the poster presentation.
    You will only get the grade after we have evaluated the reports
    together with the poster presentation.
    
    
    The report should contain: 
    
      - Use a template: this LaTeX template (IEEE style with 10 point font
        size) or similar in style. Please do not squeeze the font size or
        layout beyond the style file.
      - Information on first page: name, study number, title.
      - Your programming code in an appendix.
        You may exclude code that contains long wordlists, etc.
      - You are allowed to use code that you found on the Internet
        (provided that the author distributes it under a suitable
        license such as BSD or GPL), but remember to give full and
        clear attribution to the author. Any text citations should
        also be fully referenced and in quotation marks. Paraphrased
        citations should also be fully referenced.
    
There are length requirements for the report:
      - One-person report: 2 pages
      - Two-person report: 3 pages
      - Three-person report: 4 pages
    
    These values are maximum limits. The limits do not apply to the
    appendix, which can be any page length. The appendix may contain
    code and automatically generated content, e.g., from Epydoc, pylint
    results or other.
    The text and the code should be included as text, rather than
    as images, so the plagiarism detection system can read it.
      The report could contain, e.g.:
    
      - Discussion of the problem to be solved, the data available and
        its features.
      - Discussion of the design of the program
      - Description of the implementation.
      - Table or graphical overviews of modules and/or classes
      - Database schema description
      - Description of central parts of the code
      - Screenshots of the program
      - Plots of results
      - Plots of code performance
      - Details of the development process, e.g., editor, IDE, revision
        control system, operating system, cloud service, ...
      - Coverage
      - ...
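As an illustration of how numbers for a plot of code performance might be collected, here is a minimal sketch using the standard-library timeit module. The two example functions and the helper time_function are hypothetical names chosen for this sketch, and Python 3 is assumed; they are not part of the course material.

```python
import timeit


def squares_loop(n):
    """Return the squares of 0..n-1 using an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Return the squares of 0..n-1 using a list comprehension."""
    return [i * i for i in range(n)]


def time_function(func, n, repeat=3, number=100):
    """Return the best total time in seconds for `number` calls of func(n)."""
    timer = timeit.Timer(lambda: func(n))
    # Taking the minimum over several repeats reduces noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number))


if __name__ == "__main__":
    # Hypothetical input sizes; the timings could be passed on to matplotlib.
    for size in (100, 1000, 10000):
        loop_time = time_function(squares_loop, size)
        comp_time = time_function(squares_comprehension, size)
        print("n={:>6}  loop={:.6f}s  comprehension={:.6f}s".format(
            size, loop_time, comp_time))
```

The measured times depend on the machine, so a report should state the hardware and Python version alongside such a plot.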
    
    When looking at the code in the report, apart from the actual
    functionality, we may examine the items below:
      - Is it well-structured (modules/classes/functions)?
      - Is it effective? Does it use the Python language
        constructions effectively?
      - Is it secure? In a web application you should sanitize
        input and escape during output. If your program receives
        strange input does it crash?
      - Is the code documented? A structured way would use the
        __doc__ variable and, e.g., pydoc. See the "documentation" part
        of the introduction slides.
      - Is the code tested? Are errors and exceptions handled
        well? You have likely run the program and seen that it works,
        but a structured approach would also utilize some of the testing
        functionality in Python. See the "Testing" part of the
        introduction slides and the testing slides.
      - Is the coding "nice looking" and consistent? For inspiration
        see Style Guide for Python Code.
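To make several of the items above concrete, here is a small sketch, assuming Python 3, of a function that is documented through its docstring (the __doc__ variable), rejects strange input with a clear exception, escapes text on output, and carries its own tests as doctest examples. The function name render_greeting is hypothetical and only serves as an illustration; it is not taken from the course slides.

```python
"""Small illustration of documented, tested and defensive code."""

import doctest
import html


def render_greeting(name):
    """Return an HTML paragraph greeting `name`.

    The input is validated, and special characters are escaped on output
    so that user-supplied text cannot inject markup.

    >>> render_greeting("Ada")
    '<p>Hello, Ada!</p>'
    >>> render_greeting("<script>")
    '<p>Hello, &lt;script&gt;!</p>'
    >>> render_greeting("")
    Traceback (most recent call last):
        ...
    ValueError: name must be a non-empty string

    """
    if not isinstance(name, str) or not name.strip():
        # Strange input raises a clear exception instead of failing later.
        raise ValueError("name must be a non-empty string")
    return "<p>Hello, {}!</p>".format(html.escape(name.strip()))


if __name__ == "__main__":
    # Run the examples embedded in the docstrings as tests.
    print(doctest.testmod())
```

Running pydoc on a module written this way shows the docstring as documentation, and running the file itself executes the embedded examples as tests.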
    
      The poster shows the design of the program and "results", e.g.,
      screenshots or plots, and it must be self-contained.
      On the top, the poster must contain title, names of students,
      course title and date.
      Examples of posters can be seen in building 321, 1st floor
      (Note that some of the posters are not from the Data Mining
      using Python course but from a machine learning course).
      The external examiner will attend the presentation of the posters
      and will receive the final reports. 
    
      We are able to print posters. They are printed in A1.
      If you opt for this possibility, please email a PDF file
      in sufficient resolution, in good time before the exam, to
      Finn Kuno Christensen (or Erik Lund Poulsen) at the email address
      print (a) compute. dtu. dk.
      The printer room is in building 322.
      You need to pick up the poster
      yourself from building 322 room 030, where the posters are put on a
      table, bring it to building 321 and find a place to hang it
      up.
      If you are scheduled late, do not hang your poster up early,
      because there is only limited space.
      Kuno and Erik tell me that unless you hear from them the posters
      will be available at least 3 hours after you sent them, within
      the opening hours 9:00-15:00. To be sure that your poster is
      printed in time it should be sent before 9:00 the day before
      your exam.
      If there are any special issues about printing, Erik is in
      building 322 room 033 while Kuno is in building 324 room 150.
    
      The DTU Library may also offer a printing service.
      I do not know how much it costs.
    
      As an alternative you can print the poster on multiple A4 sheets and glue them
      on a poster board or hang them next to each other on the pin board we have.
    
      There are no rules about the size of the poster, but A3 will
      probably be too small. A1 is what DTU Compute will print.
    
      A hint: 
      A standard issue with presenting material on a poster is
      that one tends to put too much text on the poster. It is better
      to have good drawings, diagrams or plots.
      
Poster presentation
    
      For the poster presentation the two- and three-person teams should each give a five
      minute presentation, while one-person "teams"
      should prepare a presentation between 5 and 10 minutes. After the
      presentation we (teachers and external examiner) will ask questions.
    
      In a group each participant should present a part of the
      project/poster, i.e., each participant should not present
      the entire project/poster.
    
      Note that each participant's presentation and answers to questions
      need to be individual. This is not a group exam and we may
      direct individual questions to one of you in the group,
      and the others should not answer.
    
      A hint:
      It will be a good idea to rehearse exactly what you are going to
      say at the poster presentation.  
    
      You can demonstrate the program on the computer. It might give
      us a nice impression.
      The risk is that you cannot make the computer work or that it
      will take too much time to set up and draw attention away.
      If you choose to also demonstrate the program on a computer, be
      sure to rehearse that along with the oral presentation.
    
      You should expect that other students follow around and listen
      while you are giving a presentation.
      There is also ordinary office noise in the hallway where we have
      the exam.
      You are welcome to follow around and listen to the other
      students while they present.