DTU Course 02819: Data mining using Python

Previously called DTU Course 02820: Python Programming


This course has unfortunately been discontinued and will not run in autumn 2015. You might want to consider DTU 02807 Computational Tools for Big Data. Other related courses are 02450 Introduction to Machine Learning and Data Mining and DTU 02805: Social graphs and interactions.

Practical information

Tentative schedule for autumn 2014

(to be updated)
  1. Introduction to the course, installation and introduction to Python
  2. Introduction to Python
  3. Numerical Python: "Python as Matlab". Numpy, Scipy, matplotlib, ...
  4. Databasing, text and Web mining and Python
  5. Web serving
  6. Project work for the rest of the time.
  7. Exam: Sometime in December 2014, Building 321, floor 100, from 8:00-16:00(?) (Studiehåndbogen)
  8. Report hand-in deadline: Sometime in December 2012.
The following topics may be taken up:
  1. Python development
  2. Data analysis and the Semantic Web
  3. Misc.
  4. Pythonic Python

Course material

There is no fixed course material for the project, but we recommend a number of books and other material:
  1. The Python Tutorial.
  2. Kevin Sheppard (2014) Introduction to Python for Econometrics, Statistics and Data Analysis. Good in-depth introduction to Python with installation, datatypes, matrices, math, mathematical functions, pandas, control structures, date time, graphics, performance, time series.
  3. Mark Pilgrim (2004) Dive into Python. A good and free introduction to Python. Basic Python, Unit testing, XML, regular expressions, etc.
  4. Allen Downey (2002-2008) Think Python: How to Think Like a Computer Scientist. (previously (apparently!?) GFDL and CC-BY-SA license, now CC-BY-NC). Covers the basics of the Python language and Tkinter GUI. Even better: An interactive version with multiple choice tests and online programming is available.
  5. Hans Petter Langtangen (2008) Python Scripting for Computational Science. Available as an e-book through the DTU library. Covers basic Python, numerical Python (NumPy and SciPy), combining Python and C, CGI and GUI programming.
  6. Vidar Bronken Gundersen (2006) MATLAB commands in numerical Python (NumPy). Indispensable if you are trained in the Matlab or R languages as this short note provides a concise Rosetta stone between Python, Matlab and R. Probably good even if you do not have knowledge of Matlab and R.
  7. Online Python Tutor is an interactive code visualizer.
  8. Internet search engine, e.g., "Python tutorial", "Python introduction" and "numerical python filetype:pdf"
  9. Finn Årup Nielsen (2014) Data Mining with Python (working draft).
  1. Toby Segaran (2007), Programming Collective Intelligence. A commercial book with Python examples in machine learning for Web 2.0 applications, e.g., naïve Bayes classifier and non-negative matrix factorization. Polyteknisk Boghandel: 279 kroner (2012)
  2. Steven Bird et al. (2009), Natural Language Processing with Python. Describes the NLTK Python package and written by the developers of this nice language processing toolkit. Also available under a Creative Commons licence from https://sites.google.com/site/naturallanguagetoolkit/book. Polyteknisk Boghandel.
  3. Mitchell L. Model, Bioinformatics Programming Using Python: Practical Programming for Biological Data. Probably recommendable for people interested in bioinformatics. Covers basic Python presented with problems from bioinformatics. Also web and database programming. Polyteknisk Boghandel: 469 kroner (2012).
  4. Toby Segaran, Colin Evans and Jamie Taylor (2009), Programming the Semantic Web. Specialized book about the Semantic Web and how Python can be used in that context. Polyteknisk Boghandel: 275 kroner (2012).
  5. Alex Martelli (2005), Python Cookbook. Shows good code examples. Not so good as an introductory book. Polyteknisk Boghandel
  6. Ivan Idris (2011), NumPy 1.5 Beginner's Guide. As the name implies: A beginner's guide to NumPy.


You are evaluated on the written report, the printed poster and the oral poster presentation. The grade will be based on an overall combined assessment, so do not expect to get a grade or "points" right after the poster presentation. You will only get the grade after we have evaluated the reports together with the poster presentation.


The report should contain: There are length requirements for the report: These values are maximum limits. The limits do not apply to the appendix, which can be of any length. The appendix may contain code and automatically generated content, e.g., Epydoc output or pylint results.

The text and the code should be included as text, rather than as images, so the plagiarism detection system can read it. The report could contain, e.g.:

When looking at the code in the report, apart from the actual functionality, we may examine the items below:


The poster shows the design of the program and "results", e.g., screenshots or plots, and it must be self-contained. At the top, the poster must contain the title, names of students, course title and date. Examples of posters can be seen in building 321, 1st floor (note that some of the posters are not from the Data Mining using Python course but from a machine learning course). The external examiner will attend the presentation of the posters and will receive the final reports.

We are able to print posters. They are printed in A1. If you opt for this possibility, please email a PDF file in sufficient resolution to Finn Kuno Christensen (or Erik Lund Poulsen) at print (a) compute. dtu. dk in good time before the exam. The printer room is in building 322. You need to pick up the poster yourself from building 322 room 030, where the posters are put on a table, bring it to building 321 and find a place to hang it up. If you are scheduled late, do not hang your poster up early, because there is only limited space. Kuno and Erik tell me that, unless you hear from them, the posters will be available at least 3 hours after you send them, within the opening hours 9:00-15:00. To be sure that your poster is printed in time, send it before 9:00 the day before your exam. If there are any special issues about printing, Erik is in building 322 room 033 while Kuno is in building 324 room 150.

The DTU Library may also offer a printing service. I do not know how much it costs.

As an alternative, you can print the poster on multiple A4 sheets and glue them onto a poster board, or hang them next to each other on the pin board we have.

There are no rules about the size of the poster, but A3 will probably be too small. A1 is what DTU Compute will print.

A hint: A common problem with presenting material on a poster is that one tends to put too much text on it. It is better to have good drawings, diagrams or plots.

Poster presentation

For the poster presentation, the two- and three-person teams should each give a five-minute presentation, while one-person "teams" should prepare a presentation of between 5 and 10 minutes. After the presentation we (teachers and external examiner) will ask questions.

In a group, each participant should present a part of the project/poster, i.e., no participant should present the entire project/poster.

Note that each participant's presentation and answers to questions need to be individual. This is not a group exam, and we may direct individual questions to one of you in the group, in which case the others should not answer.

A hint: It will be a good idea to rehearse exactly what you are going to say at the poster presentation.

You can demonstrate the program on a computer. It might give us a nice impression. The risk is that you cannot make the computer work, or that it takes too much time to set up and draws attention away from the presentation. If you choose to also demonstrate the program on a computer, be sure to rehearse that along with the oral presentation.

You should expect that other students follow around and listen while you are giving your presentation. There is also ordinary office noise in the hallway where we have the exam. You are welcome to follow around and listen to the other students while they present.